Food poisoning is a real public health threat. Every year, approximately 48 million people get sick from food poisoning, 128 thousand are hospitalized, and 3,000 die from foodborne illnesses. According to estimates published by the Center for Science in the Public Interest, about two-thirds of food poisonings occur through restaurants. To keep this threat in check, the City of New York’s Department of Health and Mental Hygiene operates a restaurant inspection system that monitors all food vendors – from street trucks to five-star restaurants – regularly, and their grades must be posted publicly. A vendor with too many violations, or violations that are too serious, will be shut down. The system tries to safeguard the public from restauranteurs who do not take food safety seriously, and allows consumers to weigh food safety when choosing restaurants.
This analysis looks for patterns in food safety by geographic area. It asks whether there are neighborhoods in which food safety seems generally more or less safe than others. In terms of geography, there is no immediately obvious pattern, except perhaps that inspection failures might be more infrequent in low-density areas. This may be partly due to lower rates of rodent infestation, which is a major source of problems in New York City. Further analysis is required before being able discern finer differences.
It should be noted: Readers should avoid the ecological fallacy of assuming that general patterns observed in neighorhoods apply to all restaurants in that area. Restaurants are rated individually, and should be judged as such. If anything, an “A” rating in a generally “C” neighborhood or cuisine is a gem that stands out from its peers.
Data
The analysis uses violations data collected by the New York City Department of Health and Mental Hygeine, and distributed by NYC Open Data. In this analysis, we will look at all restaurants that were inspected at least once in 2022. Our analysis will focus on every restaurant’s lowest overall grade for that year. Restaurants are evaluated on a demerit points system, described here.1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 |
# Load Data data.orig <- read.csv("DOHMH_New_York_City_Restaurant_Inspection_Results.csv") # Create a derivative set to manipulate data <- data.orig # Examine the variable names and top few rows in the set: names(data) head(data) # Noted missing inspection dates. prop.table(table(ifelse(data$INSPECTION.DATE == "01/01/1900", "Missing", "Reported"))) # Approx 2%. Might be inspections not yet conducted. Small %. Dropped data <- subset(data, INSPECTION.DATE != "01/01/1900") # Parse date data date.obj <- strptime(data$INSPECTION.DATE, format = "%m/%d/%Y") data$insp.day <- as.numeric(format(date.obj, "%d")) data$insp.month <- as.numeric(format(date.obj, "%m")) data$insp.year <- as.numeric(format(date.obj, "%Y")) rm(date.obj) # Restrict data to 2022 data <- subset(data, insp.year == 2022) # Checking geoloc data: long + lat summary(data$Longitude) #Sensible range sum(is.na(data$Longitude)) / length(data$Longitude) # <.001% missing Longitude data summary(data$Latitude) #Sensible range sum(is.na(data$Latitude)) / length(data$Latitude) # <.001% missing geo data # Dropping missing data$long <- data$Longitude data$lat <- data$Latitude # Zipcode variable: zip data$zip <- data$ZIPCODE data$zip <- as.numeric(data$zip) sum(is.na(data$zip)) / length(data$zip) # 1% missing zip data length(unique(data$zip)) # 219 unique zip codes. #Cross-reference zips with State list ziplist <- read.csv("New_York_State_ZIP_Codes-County_FIPS_Cross-Reference.csv") ziplist <- subset(ziplist, County.Name %in% c("Queens", "Kings", "Bronx", "New York", "Richmond")) subset(data, !(zip %in% unique(ziplist$ZIP.Code))) #All zips in NYC # ACTION ACTION ACTION ACTION: # Possible to impute missing zipcodes from long/lat # Will bracket here # TRY MAPQUEST rm(ziplist) # Data need to be collapsed, as there are multiple lines per inspection # Collapse by CAMIS + DATE table(data$CRITICAL.FLAG) data$critical.flag <- ifelse(data$CRITICAL.FLAG == "Critical", 1, 0) # Cleaning Up & Reordering Frame data <- data[c(1:3, 33,31,32,28:30,34,14:15, 21:24)] # Examine to develop duplicates strategy names(data)[1] <- paste("CAMIS") data <- data %>% arrange(CAMIS, insp.year, insp.month, insp.day) # Strategy: # (1) Go by restaurant for whole year, rather than by event. Trim inspection dates # (2) Define critical.flag as any critical # Note to students (and self in lecture): As soon as I wrote these comments here, I instantly moved up to the main text section to write up this feature of the data. I will go back and forth and work out the text as I develop this project. But coming up with that text was easy to do because it was front of mind. This is the great benefit of Markdown, in which you can write up your report and do your code more simultaneously. #Trim set and collapse identifying information temp.data <- unique(data[c(1:6, 13:16)]) # Calculate high/low scores temp.SCORES <- data %>% group_by(CAMIS) %>% summarize(best.score = min(SCORE), worst.score = max(SCORE), critical = max(critical.flag)) # Merge data temp.data <- merge(temp.data, temp.SCORES, by = "CAMIS") data <- temp.data rm(list=ls(pattern = "temp")) gc() # Check new variables summary(data$worst.score) # Calculate Low Grade data$low.grade <- ifelse(data$worst.score <= 13, "A", ifelse(data$worst.score <= 27, "B", ifelse(data$worst.score <= 999, "C", "Error"))) prop.table(table(data$low.grade)) # Found miscode in BORO data$BORO <- ifelse(data$BORO == "0", NA, data$BORO) # Electing not to impute missing zips. Too much missing SCORE data for it to lead to substantial quality improvement. # Wrangled: 1/30 |
Analysis
Below, we examine geographic patterns in both high- and low-scoring restaurants.
C-Grade Restaurants
Which zipcodes have the highest proportion of “C” rated restaurants, relative to their overall restaurant population? We only consider zip codes with at least 10 inspections. Note that we have top-coded the proportions at 30% to improve legibility. These outliers are described below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
data$c.grade <- ifelse(data$low.grade == "C", 1, 0) zip.data <- aggregate(c.grade~zip, data=data, mean) zip.count <- count(data, zip) zip.data <- merge(zip.data, zip.count, by = "zip") zip.data <- subset(zip.data, n >= 10) #Top-coding to improve legibility zip.data$c.grade0 <- zip.data$c.grade zip.data$c.grade <- ifelse(zip.data$c.grade > .3, .3, zip.data$c.grade) #Load Zip Code Boundary Shape File from NYC Open Data (https://data.cityofnewyork.us/Business/Zip-Code-Boundaries/i8iw-xf4u) nyc.zipshape <- st_read("ZIP_CODE_040114.shp") names(zip.data)[1] <- paste("ZIPCODE") zip.map <- merge(nyc.zipshape, zip.data, by = "ZIPCODE") map1 <- ggplot(zip.map) + geom_sf(aes(fill = c.grade)) + scale_fill_gradient( limits = c(0, 0.5), low = "green", high = "red", breaks = c(0.1, 0.2, .3, .4, .5) ) + labs( title = "Proportion of 'C' Grades", fill = "Percent" ) tab1 <- arrange(zip.data, -c.grade0)[1:10,] tab2 <- arrange(zip.data, c.grade0)[1:10,] map1 |
I could find no immediate patterns in the map, except that the incidence of C-grades is low in parts of the city that I know to be lower density, at least for the most part. First, consider Table 1 (below), which describes the zip codes in which C grades were most common:
Neighborhood | Borough | Zipcode | N | % C Grade |
---|---|---|---|---|
Harlem | Manhattan | 10030 | 23 | 0.421 |
Midtown | Manhattan | 10169 | 16 | 0.3125 |
Fresh Meadows | Queens | 11366 | 44 | 0.3056 |
South Richmond Hill | Queens | 11419 | 77 | 0.3056 |
East Harlem | Manhattan | 10035 | 66 | 0.3 |
Springfield Gardens | Queens | 11413 | 44 | 0.2973 |
Gravesend | Brooklyn | 11229 | 111 | 0.2941 |
Borough Park | Brooklyn | 11219 | 90 | 0.2857 |
Canarsie | Brooklyn | 11236 | 108 | 0.2796 |
Midtown | Manhattan | 10121 | 12 | 0.2727 |
I could not immediately discern a pattern here. The offending neighborhoods are spread across boroughs, and include zip codes that I know to be higher and lower income. Interestingly, the third-worst zip code in New York is just outside of Queens College campus. Make of that what you will.
Neighborhood | Borough | Zipcode | N | % C Grade |
---|---|---|---|---|
Midtown | Manhattan | 10118 | 10 | 0.0000 |
Tottenville | Staten Island | 10307 | 25 | 0.0000 |
Jamaica | Queens | 11433 | 20 | 0.0000 |
Jamaica | Queens | 11430 | 52 | 0.0196 |
Princes Bay | Staten Island | 10309 | 64 | 0.0317 |
Rockaway Park | Queens | 11694 | 30 | 0.0357 |
Arverne | Queens | 11693 | 23 | 0.0455 |
Cambria Heights | Queens | 11411 | 18 | 0.0588 |
East New York | Brooklyn | 11239 | 17 | 0.0625 |
Bayside | Queens | 11426 | 19 | 0.0625 |
From these figures, and across the least-offending neighborhoods, the only pattern that I could discern was that there was a plurality of neighborhoods that I know to be low-density, for example the neighborhoods in Staten Islands, around the Rockaways, and in the uppermost reaches of the Bronx. One reason that this pattern could exist is that these neighbhorhoods might be lower density, which may cause a lower incident of rodent infestations. Along with improper temperature storage, rodent infestation is a leading cause of restaurant violations in the city.
Most A-Grade Restaurants
Note that the proportions on this figure are top-coded at 80%, with outliers described in the table below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
data$a.grade <- ifelse(data$low.grade == "A", 1, 0) zip.data2 <- aggregate(a.grade~zip, data=data, mean) zip.count2 <- count(data, zip) zip.data2 <- merge(zip.data2, zip.count, by = "zip") zip.data2 <- subset(zip.data2, n >= 10) # Top coding percent grade to improve visualization: zip.data2$a.grade0 <- zip.data2$a.grade zip.data2$a.grade <- ifelse(zip.data2$a.grade > 0.8, 0.8, zip.data2$a.grade) names(zip.data2)[1] <- paste("ZIPCODE") zip.map2 <- merge(nyc.zipshape, zip.data2, by = "ZIPCODE") map2 <- ggplot(zip.map2) + geom_sf(aes(fill = a.grade)) + scale_fill_gradient( limits = c(0.4, 0.8), low = "red", high = "green", ) + labs( title = "Proportion of 'A' Grades", fill = "Percent Cases" ) tab3 <- arrange(zip.data2, -a.grade0)[1:10,] tab4 <- arrange(zip.data2, a.grade0)[1:10,] map2 |
Table 3 (below) describes the zip codes with the highest incidence of “A” grades.
Borough | Neighborhood | zipcode | N | % A Grade |
---|---|---|---|---|
Queens | Jamaica Estates | 11430 | 52 | 0.9019608 |
Manhattan | Garment District | 10118 | 10 | 0.8888889 |
Queens | Far Rockaway / Arverne | 11693 | 23 | 0.8636364 |
Bronx | City Island / Fordham | 10464 | 29 | 0.8518519 |
Queens | Beechhurst | 11360 | 12 | 0.8333333 |
Queens | East Elmhurst | 11370 | 19 | 0.8333333 |
Manhattan | Rockefeller Center / Diamond | 10020 | 29 | 0.8214286 |
Queens | Rockaway Peninsula | 11694 | 30 | 0.8214286 |
Brooklyn | Brownsville / Ocean Hill / E. | 11239 | 17 | 0.8125000 |
Queens | Jamaica | 11436 | 21 | 0.8125000 |
The main pattern observable to the naked eye is proximity to airports and low population density. Airports have many restaurants, and are surrounded by hotels with in-house or nearby restaurants. I speculate that these areas are likely to have many restaurants set in high-quality facilities.
Table 4 (below) describes the zipcodes with the lowest incidence of “A” grades.
Neighborhood | Borough | Zipcode | N | % A Grade |
---|---|---|---|---|
Harlem | Manhattan | 10039 | 23 | 0.4211 |
Upper East Side | Manhattan | 10128 | 91 | 0.4217 |
Queens Village | Queens | 11428 | 31 | 0.4286 |
Williamsbridge | Bronx | 10466 | 67 | 0.4340 |
Canarsie | Brooklyn | 11236 | 108 | 0.4516 |
Harlem | Manhattan | 10030 | 23 | 0.4737 |
Battery Park City | Manhattan | 10280 | 22 | 0.4737 |
Gravesend | Brooklyn | 11229 | 111 | 0.4804 |
Williamsbridge | Bronx | 10467 | 126 | 0.4825 |
South Richmond Hill | Queens | 11419 | 77 | 0.4861 |
Some of the same neighborhoods appear in Table 1, which depicted the highest incidence of failing grades: Harlem, South Richmond Hill, and Canarsie. Aside from these high-offending areas, there appears to be a mixutre of neighborhoods.
Findings
This analysis sough geographic patterns in the distribution of high and low grades in restaurant inspections. The data did not give us many clear and decisive guidelines about neighborhoods that are safe or unsafe for dining. It did suggest that dining might be safer around airports, which may be driven by high cleanliness scores in airports and hotels. In terms of unsafe areas, it seems that risk is spread across boroughs, and can occur in higher- and lower-income areas. It may be safer in lower density areas that are further from the city center, which may be due to fewer rodent problems. Further research is required for these speculations to be confirmed. The data do suggest that many restaurants fare poorly in Harlem, South Richmond Hill, or Canarsie, though this analysis does not discern why these problems exist.
- I am grateful for Seth Mandel’s help in finding these documents.↩︎