Are you dining safe? Comparison of Yelp user ratings and health inspection data
Descriptive data mining project by: Karina Abreu, Ahmed Emam, Sarah Gorman, Jon Lubanski
Introduction & Problem Definition
Food safety standards matter to consumers when they decide where to eat out, and they are also a matter of
public safety. Public health programs, such as the DineSafe food safety inspection system, exist to protect and improve public health in Ontario;
however, when deciding where to eat, consumers are more likely to check Yelp for reviews than DineSafe for health inspection results.
In this project we wanted to answer the following question:
Is there a relationship between restaurants with high ratings on Yelp and restaurants that
are clean and hygienic?
We hypothesize that there might be a disconnect between public perceptions of cleanliness, as expressed through Yelp ratings,
and the results of DineSafe inspections.
The datasets we will be using both rely on human judgment to some degree. The DineSafe data presents more of an “expert” measure of an
establishment compared with consumers' arbitrary impressions of that same establishment on Yelp. Our goal in making this comparison is to see
whether there are associations between these two very different ways of making judgments.
Datasets
For this project we used a dataset from DineSafe and another from Yelp.
The DineSafe dataset consists of health inspection records sourced from the City of Toronto open data portal. The dataset consists of 90,521 records
for over 10,000 unique establishments and 55,589 inspections from 2016 to 2018. Each record represents the result of a single inspection for an
establishment. Some of the more meaningful attributes in this dataset are: Establishment status, infraction details, and severity.
The Yelp rating dataset was obtained from Yelp.ca, which makes a variety of datasets openly available. We used their “Business” dataset, which
consists of business records saved as JSON objects and is part of a much larger dataset. These are records of all kinds of establishments, not just eating
places, located across Canada and the US. The dataset consists of 188,593 uniquely registered businesses. Some of the more
meaningful attributes in this dataset are: rating, address, name, neighbourhood, and category.
Data pre-processing
The biggest challenges we faced were understanding the data and figuring out how we could join our two datasets.
The first thing we did in preprocessing the data was to reduce the Yelp dataset to data about food-related establishments within the City of Toronto.
We used a Python script to parse the JSON file from Yelp and limit the dataset to the desired records. We then deleted irrelevant columns and tuples with
missing values from both datasets.
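The filtering step can be sketched roughly as follows. This is a minimal illustration, not the script used in the project: it assumes the Yelp file holds one JSON object per line and that each record has "city" and "categories" fields, and the keyword list is a hypothetical stand-in for whatever category filter was actually applied.

```python
import json

# Hypothetical keyword list; the project's exact category filter is not documented.
FOOD_KEYWORDS = ("restaurants", "food", "cafes", "bakeries", "bars")


def is_toronto_food_business(record):
    """Return True for a food-related business located in Toronto."""
    city = (record.get("city") or "").strip().lower()
    categories = record.get("categories") or ""
    # Assumption: categories may be a list or a comma-separated string.
    if isinstance(categories, list):
        categories = ", ".join(categories)
    labels = {label.strip().lower() for label in categories.split(",")}
    return city == "toronto" and any(k in labels for k in FOOD_KEYWORDS)


def filter_businesses(lines):
    """Parse one JSON object per line and keep only the matching records."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if is_toronto_food_business(record):
            kept.append(record)
    return kept


# Made-up sample records standing in for lines of the Yelp business file.
sample_lines = [
    '{"name": "Cafe A", "city": "Toronto", "categories": "Cafes, Coffee & Tea", "stars": 4.0}',
    '{"name": "Garage B", "city": "Toronto", "categories": "Auto Repair", "stars": 3.5}',
    '{"name": "Diner C", "city": "Las Vegas", "categories": "Restaurants, Diners", "stars": 4.5}',
    '{"name": "Bistro D", "city": "Toronto", "categories": ["Restaurants", "French"], "stars": 3.0}',
]
toronto_food = filter_businesses(sample_lines)
print([b["name"] for b in toronto_food])
```

With a real file, the same function could be fed an open file handle instead of the sample list, since iterating over a file yields one line at a time.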
The DineSafe data was fairly clean and standardized and did not require much preprocessing, but it was difficult to join with the Yelp dataset.
The DineSafe data contained multiple inspection records per establishment, so the first thing we did was reduce the dataset to contain only one unique
record per establishment, just for the purpose of finding matchpoints between the datasets. We used Excel for this purpose.
After we found possible matchpoints between the two datasets, we brought the full set of DineSafe records back in.
To find the matchpoints, we first tried joining on GPS data (Latitude and Longitude), which was a common attribute of both datasets.
We worked with Azure Machine Learning Studio to join the datasets. However, we quickly learned that Latitude and Longitude coordinates are approximated, so
we got very few matches because of small variations in the digits between the datasets. Next we tried to match on addresses but soon realized that the addresses on
Yelp are not standardized in any way. In addition, Yelp addresses contained unit numbers within buildings while the DineSafe addresses did not.
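To illustrate why an exact join on coordinates finds so few matches, the sketch below compares two hypothetical coordinate pairs for the same storefront. Matching on a distance tolerance (here the haversine formula with an assumed 25 m cutoff) is one possible workaround; it is not the approach used in this project, and the coordinates and cutoff are made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Hypothetical coordinates for one storefront as recorded by each source.
dinesafe_point = (43.65320, -79.38320)
yelp_point = (43.65325, -79.38317)

exact_match = dinesafe_point == yelp_point        # what an equality join tests
distance = haversine_m(*dinesafe_point, *yelp_point)
near_match = distance <= 25                       # assumed tolerance in metres

print(exact_match, near_match)
```

Here the equality test fails even though the two points are only a few metres apart, which is the behaviour described above.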
To deal with these issues, we standardized the addresses in both datasets down to the street number and the first 1 or 2 words of the street name.
This improved the accuracy of our matching. We also removed any establishments with shared addresses from DineSafe: without unit numbers we couldn’t match these with the
Yelp data, and they would only have thrown off our results. We decided to restrict our analysis to the establishments that had data we
could work with: standalone establishments that are not part of food courts or plazas. We connected the datasets using a fuzzy match on the
establishment name and address, and kept matches that scored above 80%.
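The address standardization and fuzzy match described above could look something like the sketch below. It is an approximation under stated assumptions: the unit-stripping pattern, the field names "name" and "address", and the use of Python's difflib ratio as the similarity score are all illustrative choices, not the tooling actually used in the project.

```python
import re
from difflib import SequenceMatcher


def normalize_address(address, street_words=2):
    """Reduce an address to its street number plus the first street-name words."""
    text = address.lower()
    # Drop a leading unit designation such as "Unit 5, " (assumed format).
    text = re.sub(r"^\s*(?:unit|suite|ste)\b\.?\s*\w+\s*[,-]?\s*", "", text)
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.split()
    # Treat the first all-digit token as the street number.
    for i, token in enumerate(tokens):
        if token.isdigit():
            return " ".join(tokens[i:i + 1 + street_words])
    return " ".join(tokens[:street_words])


def match_key(record):
    """Combine the name and the standardized address into one comparable string."""
    return f'{record["name"].strip().lower()} | {normalize_address(record["address"])}'


def fuzzy_matches(dinesafe, yelp, threshold=0.8):
    """Return (dinesafe_name, yelp_name, score) for pairs scoring above threshold."""
    matches = []
    for d in dinesafe:
        for y in yelp:
            score = SequenceMatcher(None, match_key(d), match_key(y)).ratio()
            if score > threshold:
                matches.append((d["name"], y["name"], score))
    return matches


# Made-up records for illustration only.
dinesafe_rows = [{"name": "QUEEN STREET DINER", "address": "123 Queen St W"}]
yelp_rows = [
    {"name": "Queen Street Diner", "address": "Unit 5, 123 Queen Street West"},
    {"name": "Bay Sushi", "address": "400 Bay St"},
]
matches = fuzzy_matches(dinesafe_rows, yelp_rows)
print(matches)
```

Comparing every pair is quadratic in the number of records, so on the full datasets it would likely make sense to compare only records that share a street number.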
Once we found matchpoints between the two datasets, we did an inner join using Azure Machine Learning Studio, which gave us 22,546 records (about 25% of the DineSafe records we started
with and about 12% of the Yelp business dataset).
Data Analysis Methods
Using Azure Machine Learning Studio, we started with K-means clustering because it would let us broadly explore our joined dataset and identify
relationships between inspection results, establishment characteristics, and Yelp ratings. We wanted to group establishments into clusters with similar characteristics to help us
discover unexpected correlations that we might not find just by browsing the data.
To get the data ready for clustering, we needed to transform it further in Azure. We changed the data types of certain columns from numeric to
categorical, removed irrelevant columns and rows with missing data that had been missed during pre-processing, and converted the Inspection Result (Pass, Conditional Pass, Closed) and Neighbourhoods from categories into numeric features so the
algorithm could work with them.
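One common way to turn a categorical column into numeric features is one-hot (indicator) encoding, sketched below. Whether the Azure conversion used indicator columns or some other numeric coding is not stated here, so this is an assumed illustration with made-up values.

```python
def one_hot(values):
    """Map each category to a 0/1 indicator vector (columns in sorted order)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column for this row's category
        rows.append(row)
    return categories, rows


# Made-up inspection results for four establishments.
results = ["Pass", "Conditional Pass", "Pass", "Closed"]
columns, encoded = one_hot(results)
print(columns)
print(encoded)
```

Indicator columns avoid implying an order between categories, which matters for a distance-based method like K-means; for neighbourhoods this produces one column per neighbourhood.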
We used the attributes Ratings, Neighbourhood, and Establishment status to run 2 clustering experiments: one using the Parameter Range trainer
mode with the Sweep Clustering model so it would choose the optimal set of hyperparameters for us, and another using the Single Parameter
trainer mode with the Train Clustering model.
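For readers unfamiliar with what these modules do, the sketch below shows a plain K-means loop and a simple sweep over the number of clusters. It is a stand-in written from scratch, not the Azure implementation: the toy points, the numeric coding of the inspection result, and the use of within-cluster sum of squares to compare settings are all assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=50, seed=0):
    """Plain k-means: returns (labels, centroids) for a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances (lower means tighter clusters)."""
    return sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))


# Toy data: (Yelp rating, inspection code) with 0 = Pass, 1 = Conditional Pass.
points = [(4.5, 0), (4.0, 0), (5.0, 0), (1.5, 1), (2.0, 1), (1.0, 1)]

# Sweep over k, loosely mirroring a parameter-range experiment.
scores = {}
for k in (1, 2, 3):
    labels_k, centroids_k = k_means(points, k)
    scores[k] = inertia(points, labels_k, centroids_k)

labels, centroids = k_means(points, 2)
print(labels, scores)
```

On this deliberately well-separated toy data the two groups are recovered cleanly; real data with overlapping groups will not separate like this.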
Next we used association rule mining to find frequent patterns and inherent regularities in the data. To prepare our data, we binned the Yelp
ratings into three bins: “low” (0 to 2), “average” (2.1 to 3.5) and “high” (3.6 to 5) so we would end up with more focused rules. We standardized
the infraction details for inspection results, and cleaned up the establishment categories.
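The binning can be expressed as a small function. This sketch uses the three bins given above and assumes ratings fall on the 0 to 5 scale; the function name is illustrative.

```python
def bin_rating(stars):
    """Map a Yelp star rating to the 'low', 'average', or 'high' bin."""
    if not 0 <= stars <= 5:
        raise ValueError(f"rating out of range: {stars}")
    if stars <= 2.0:
        return "low"       # 0 to 2
    if stars <= 3.5:
        return "average"   # 2.1 to 3.5
    return "high"          # 3.6 to 5


binned = [bin_rating(s) for s in (1.0, 2.0, 2.5, 3.5, 4.0, 5.0)]
print(binned)
```

Because the bins are defined with gaps (for example between 2 and 2.1), a value such as 2.05 would land in “average” here; that only matters if ratings are not in half-star steps.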
We ran 4 association rule experiments to answer the following questions:
What restaurant categories are most likely to receive a passing inspection result?
What Yelp rating is most frequently associated with establishments that have passed an inspection?
Are there any associations between Yelp rating, inspection outcome, neighbourhood, and infraction details?
Experimental Results
We hypothesized that there would not be a relationship between Yelp restaurant ratings and DineSafe inspection results, and our
experiments supported this. The clustering models did not produce any interpretable results: the clusters overlapped one another,
and rerunning the experiment with different parameters gave only variations of the same outcome.
With association rule mining, we found that:
Restaurants with a “high” Yelp rating → Pass (support 38%, confidence 94%)
Restaurants with an “average” Yelp rating → Pass (support 49%, confidence 90%)
Restaurants with a “low” Yelp rating → Pass (support 4%, confidence 89%)
So we can say with a bit more confidence that a restaurant with a higher Yelp rating will pass an inspection, but the difference between the highest
and the lowest rating bins is only 5 percentage points. All restaurants therefore have a similar chance of passing an inspection, regardless of their Yelp rating.
Moreover, we could not find any rules with more than 2 items that had more than 3% support. So essentially, there are no associations between more
than two items in our datasets.
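For reference, the support and confidence of a rule such as “high rating → Pass” can be computed as in the sketch below. The transactions are made up purely to show the arithmetic and are not drawn from our joined dataset; the item labels are illustrative.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = share of all transactions containing both sides
    confidence = share of antecedent transactions that also contain the consequent
    """
    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Toy transactions: one set of items per inspection record (made-up data).
transactions = (
    [{"rating=high", "result=Pass"}] * 4
    + [{"rating=high", "result=Conditional Pass"}] * 1
    + [{"rating=low", "result=Pass"}] * 3
    + [{"rating=low", "result=Closed"}] * 2
)

high_rule = rule_metrics(transactions, {"rating=high"}, {"result=Pass"})
low_rule = rule_metrics(transactions, {"rating=low"}, {"result=Pass"})
print(high_rule, low_rule)
```

Support depends on how common a rating bin is, while confidence is the pass rate within that bin, which is why a rare bin can have low support but confidence close to the others.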
In conclusion, public perceptions of the cleanliness and food safety of a particular restaurant do not reflect the actual situation in the
kitchen. This could be because the general public might assess a restaurant’s food hygiene standards based solely on things like aesthetics,
quality of customer service, or price range.
Next Steps
Using both DineSafe and Yelp data, build a decision tree classifier model for restaurant inspections that would predict a Pass, Conditional
Pass, or Closed result.
Balance the DineSafe data to include more inspections with Conditional Pass and Closed results; inspections were heavily skewed toward a Pass result.