Chapter 2 Data sources

Since the data on the source website is pretty messy, we found a dataset which is already cleaned up and includes all information we need. The dataset we are using is from the website NYC OpenData. The dataset is about restaurant inspection results in NYC from three years ago until present time. The dataset contains the restaurant names, the inspection results, the location of restaurants, the violation dates, the boroughs of restaurants and some other information.

Here is the link to the original dataset:

https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j

Records are included for each restaurant that has applied for a permit but has not yet been inspected and for inspections resulting in no violations. Since thousands of restaurants start business and go out of business every year, only restaurants in an active status are included in the dataset.

Each restaurant was inspected more than once, including initial inspections, re-inspections and reopening inspections. Each restaurant can also be inspected more than once per day, with each violation as a seperate observation in the dataset. The grades are given as either “A” or “NA” for the initial inspections and “A”, “B”, “C”, “Z”, “P” for re-inspections or reopening inspections. The grades are given according to scores of violations, with lower scores resulting in better grades. Here “A”, “B”, and “C” represents score levels, “Z” means “grade pending” and “P” means “Grade Pending issued on re-opening following an initial inspection that resulted in a closure”. Higher scores for certain serious violation types can lead to closures of the restaurants.

There are some missing values in some columns of the dataset, but we will manage to handle this when we process the data.