Restaurant Violations in NYC, 2022

Which zip codes have the cleanest and dirtiest restaurants?

Food poisoning is a real public health threat. Every year, approximately 48 million people get sick from food poisoning, 128 thousand are hospitalized, and 3,000 die from foodborne illnesses. According to estimates published by the Center for Science in the Public Interest, about two-thirds of food poisonings occur through restaurants. To keep this threat in check, the City of New York’s Department of Health and Mental Hygiene operates a restaurant inspection system that monitors all food vendors – from street trucks to five-star restaurants – regularly, and their grades must be posted publicly. A vendor with too many violations, or violations that are too serious, will be shut down. The system tries to safeguard the public from restauranteurs who do not take food safety seriously, and allows consumers to weigh food safety when choosing restaurants.

This analysis looks for patterns in food safety by geographic area. It asks whether there are neighborhoods in which food safety seems generally more or less safe than others. In terms of geography, there is no immediately obvious pattern, except perhaps that inspection failures might be more infrequent in low-density areas. This may be partly due to lower rates of rodent infestation, which is a major source of problems in New York City. Further analysis is required before being able discern finer differences.

It should be noted: Readers should avoid the ecological fallacy of assuming that general patterns observed in neighorhoods apply to all restaurants in that area. Restaurants are rated individually, and should be judged as such. If anything, an “A” rating in a generally “C” neighborhood or cuisine is a gem that stands out from its peers.

Data

The analysis uses violations data collected by the New York City Department of Health and Mental Hygeine, and distributed by NYC Open Data. In this analysis, we will look at all restaurants that were inspected at least once in 2022. Our analysis will focus on every restaurant’s lowest overall grade for that year. Restaurants are evaluated on a demerit points system, described here.¹

# Load Data
data.orig <- read.csv("DOHMH_New_York_City_Restaurant_Inspection_Results.csv")

# Create a derivative set to manipulate
data <- data.orig

# Examine the variable names and top few rows in the set:
names(data)
head(data)

# Noted missing inspection dates.  
prop.table(table(ifelse(data$INSPECTION.DATE == "01/01/1900", "Missing", "Reported")))
# Approx 2%.  Might be inspections not yet conducted.  Small %.  Dropped
data <- subset(data, INSPECTION.DATE != "01/01/1900")
# Parse date data 
date.obj <- strptime(data$INSPECTION.DATE, format = "%m/%d/%Y")
data$insp.day <- as.numeric(format(date.obj, "%d"))
data$insp.month <- as.numeric(format(date.obj, "%m"))
data$insp.year <- as.numeric(format(date.obj, "%Y"))
rm(date.obj)

# Restrict data to 2022
data <- subset(data, insp.year == 2022)

# Checking geoloc data: long + lat
summary(data$Longitude)  #Sensible range
sum(is.na(data$Longitude)) / length(data$Longitude)  # <.001% missing Longitude data
summary(data$Latitude) #Sensible range
sum(is.na(data$Latitude)) / length(data$Latitude)  # <.001% missing geo data
# Dropping missing
data$long <- data$Longitude
data$lat <- data$Latitude

# Zipcode variable: zip
data$zip <- data$ZIPCODE
data$zip <- as.numeric(data$zip)
sum(is.na(data$zip)) / length(data$zip)  # 1% missing zip data
length(unique(data$zip))    # 219 unique zip codes. 
#Cross-reference zips with State list
ziplist <- read.csv("New_York_State_ZIP_Codes-County_FIPS_Cross-Reference.csv")
ziplist <- subset(ziplist, County.Name %in% c("Queens", "Kings", "Bronx", "New York", "Richmond"))
subset(data, !(zip %in% unique(ziplist$ZIP.Code)))  #All zips in NYC
# ACTION ACTION ACTION ACTION:
# Possible to impute missing zipcodes from long/lat
# Will bracket here
# TRY MAPQUEST
rm(ziplist)

# Data need to be collapsed, as there are multiple lines per inspection
# Collapse by CAMIS + DATE
table(data$CRITICAL.FLAG)
data$critical.flag <- ifelse(data$CRITICAL.FLAG == "Critical", 1, 0)

# Cleaning Up & Reordering Frame
data <- data[c(1:3, 33,31,32,28:30,34,14:15, 21:24)]

# Examine to develop duplicates strategy
names(data)[1] <- paste("CAMIS")
data <- data %>% arrange(CAMIS, insp.year, insp.month, insp.day)
# Strategy:
# (1) Go by restaurant for whole year, rather than by event.  Trim inspection dates
# (2) Define critical.flag as any critical 
# Note to students (and self in lecture): As soon as I wrote these comments here, I instantly moved up to the main text section to write up this feature of the data.  I will go back and forth and work out the text as I develop this project.  But coming up with that text was easy to do because it was front of mind.  This is the great benefit of Markdown, in which you can write up your report and do your code more simultaneously.

#Trim set and collapse identifying information
temp.data <- unique(data[c(1:6, 13:16)])


# Calculate high/low scores
temp.SCORES <- 
  data %>%
  group_by(CAMIS) %>%
  summarize(best.score = min(SCORE),
            worst.score = max(SCORE),
            critical = max(critical.flag))

# Merge data
temp.data <- merge(temp.data, temp.SCORES, by = "CAMIS")

data <- temp.data
rm(list=ls(pattern = "temp"))
gc()

# Check new variables
summary(data$worst.score)

# Calculate Low Grade
data$low.grade <- ifelse(data$worst.score <= 13, "A",
                         ifelse(data$worst.score <= 27, "B",
                                ifelse(data$worst.score <= 999, "C", "Error")))
prop.table(table(data$low.grade))  

# Found miscode in BORO
data$BORO <- ifelse(data$BORO == "0", NA, data$BORO)

# Electing not to impute missing zips.  Too much missing SCORE data for it to lead to substantial quality improvement.
# Wrangled: 1/30

# Load Data

data.orig <- read.csv("DOHMH_New_York_City_Restaurant_Inspection_Results.csv")

# Create a derivative set to manipulate

data <- data.orig

# Examine the variable names and top few rows in the set:

names(data)

head(data)

# Noted missing inspection dates.

prop.table(table(ifelse(data$INSPECTION.DATE == "01/01/1900", "Missing", "Reported")))

# Approx 2%. Might be inspections not yet conducted. Small %. Dropped

data <- subset(data, INSPECTION.DATE != "01/01/1900")

# Parse date data

date.obj <- strptime(data$INSPECTION.DATE, format = "%m/%d/%Y")

data$insp.day <- as.numeric(format(date.obj, "%d"))

data$insp.month <- as.numeric(format(date.obj, "%m"))

data$insp.year <- as.numeric(format(date.obj, "%Y"))

rm(date.obj)

# Restrict data to 2022

data <- subset(data, insp.year == 2022)

# Checking geoloc data: long + lat

summary(data$Longitude) #Sensible range

sum(is.na(data$Longitude)) / length(data$Longitude) # <.001% missing Longitude data

summary(data$Latitude) #Sensible range

sum(is.na(data$Latitude)) / length(data$Latitude) # <.001% missing geo data

# Dropping missing

data$long <- data$Longitude

data$lat <- data$Latitude

# Zipcode variable: zip

data$zip <- data$ZIPCODE

data$zip <- as.numeric(data$zip)

sum(is.na(data$zip)) / length(data$zip) # 1% missing zip data

length(unique(data$zip)) # 219 unique zip codes.

#Cross-reference zips with State list

ziplist <- read.csv("New_York_State_ZIP_Codes-County_FIPS_Cross-Reference.csv")

ziplist <- subset(ziplist, County.Name %in% c("Queens", "Kings", "Bronx", "New York", "Richmond"))

subset(data, !(zip %in% unique(ziplist$ZIP.Code))) #All zips in NYC

# ACTION ACTION ACTION ACTION:

# Possible to impute missing zipcodes from long/lat

# Will bracket here

# TRY MAPQUEST

rm(ziplist)

# Data need to be collapsed, as there are multiple lines per inspection

# Collapse by CAMIS + DATE

table(data$CRITICAL.FLAG)

data$critical.flag <- ifelse(data$CRITICAL.FLAG == "Critical", 1, 0)

# Cleaning Up & Reordering Frame

data <- data[c(1:3, 33,31,32,28:30,34,14:15, 21:24)]

# Examine to develop duplicates strategy

names(data)[1] <- paste("CAMIS")

data <- data %>% arrange(CAMIS, insp.year, insp.month, insp.day)

# Strategy:

# (1) Go by restaurant for whole year, rather than by event. Trim inspection dates

# (2) Define critical.flag as any critical

# Note to students (and self in lecture): As soon as I wrote these comments here, I instantly moved up to the main text section to write up this feature of the data. I will go back and forth and work out the text as I develop this project. But coming up with that text was easy to do because it was front of mind. This is the great benefit of Markdown, in which you can write up your report and do your code more simultaneously.

#Trim set and collapse identifying information

temp.data <- unique(data[c(1:6, 13:16)])

# Calculate high/low scores

temp.SCORES <-

data %>%

group_by(CAMIS) %>%

summarize(best.score = min(SCORE),

worst.score = max(SCORE),

critical = max(critical.flag))

# Merge data

temp.data <- merge(temp.data, temp.SCORES, by = "CAMIS")

data <- temp.data

rm(list=ls(pattern = "temp"))

gc()

# Check new variables

summary(data$worst.score)

# Calculate Low Grade

data$low.grade <- ifelse(data$worst.score <= 13, "A",

ifelse(data$worst.score <= 27, "B",

ifelse(data$worst.score <= 999, "C", "Error")))

prop.table(table(data$low.grade))

# Found miscode in BORO

data$BORO <- ifelse(data$BORO == "0", NA, data$BORO)

# Electing not to impute missing zips. Too much missing SCORE data for it to lead to substantial quality improvement.

# Wrangled: 1/30

Analysis

Below, we examine geographic patterns in both high- and low-scoring restaurants.

C-Grade Restaurants

Which zipcodes have the highest proportion of “C” rated restaurants, relative to their overall restaurant population? We only consider zip codes with at least 10 inspections. Note that we have top-coded the proportions at 30% to improve legibility. These outliers are described below:

data$c.grade <- ifelse(data$low.grade == "C", 1, 0)

zip.data <- aggregate(c.grade~zip, data=data, mean)
zip.count <- count(data, zip)
zip.data <- merge(zip.data, zip.count, by = "zip")
zip.data <- subset(zip.data, n >= 10)

#Top-coding to improve legibility
zip.data$c.grade0 <- zip.data$c.grade
zip.data$c.grade <- ifelse(zip.data$c.grade > .3, .3, zip.data$c.grade)

#Load Zip Code Boundary Shape File from NYC Open Data (https://data.cityofnewyork.us/Business/Zip-Code-Boundaries/i8iw-xf4u)
nyc.zipshape <- st_read("ZIP_CODE_040114.shp")

names(zip.data)[1] <- paste("ZIPCODE")
zip.map <- merge(nyc.zipshape, zip.data, by = "ZIPCODE")

map1 <- ggplot(zip.map) +
  geom_sf(aes(fill = c.grade)) +
  scale_fill_gradient(
    limits = c(0, 0.5),
    low = "green", high = "red",
    breaks = c(0.1, 0.2, .3, .4, .5)
  ) +
  labs(
    title = "Proportion of 'C' Grades",
    fill = "Percent"
  )

tab1 <- arrange(zip.data, -c.grade0)[1:10,]
tab2 <- arrange(zip.data, c.grade0)[1:10,]

map1

data$c.grade <- ifelse(data$low.grade == "C", 1, 0)

zip.data <- aggregate(c.grade~zip, data=data, mean)

zip.count <- count(data, zip)

zip.data <- merge(zip.data, zip.count, by = "zip")

zip.data <- subset(zip.data, n >= 10)

#Top-coding to improve legibility

zip.data$c.grade0 <- zip.data$c.grade

zip.data$c.grade <- ifelse(zip.data$c.grade > .3, .3, zip.data$c.grade)

#Load Zip Code Boundary Shape File from NYC Open Data (https://data.cityofnewyork.us/Business/Zip-Code-Boundaries/i8iw-xf4u)

nyc.zipshape <- st_read("ZIP_CODE_040114.shp")

names(zip.data)[1] <- paste("ZIPCODE")

zip.map <- merge(nyc.zipshape, zip.data, by = "ZIPCODE")

map1 <- ggplot(zip.map) +

geom_sf(aes(fill = c.grade)) +

scale_fill_gradient(

limits = c(0, 0.5),

low = "green", high = "red",

breaks = c(0.1, 0.2, .3, .4, .5)

) +

labs(

title = "Proportion of 'C' Grades",

fill = "Percent"

)

tab1 <- arrange(zip.data, -c.grade0)[1:10,]

tab2 <- arrange(zip.data, c.grade0)[1:10,]

map1

I could find no immediate patterns in the map, except that the incidence of C-grades is low in parts of the city that I know to be lower density, at least for the most part. First, consider Table 1 (below), which describes the zip codes in which C grades were most common:

Neighborhood	Borough	Zipcode	N	% C Grade
Harlem	Manhattan	10030	23	0.421
Midtown	Manhattan	10169	16	0.3125
Fresh Meadows	Queens	11366	44	0.3056
South Richmond Hill	Queens	11419	77	0.3056
East Harlem	Manhattan	10035	66	0.3
Springfield Gardens	Queens	11413	44	0.2973
Gravesend	Brooklyn	11229	111	0.2941
Borough Park	Brooklyn	11219	90	0.2857
Canarsie	Brooklyn	11236	108	0.2796
Midtown	Manhattan	10121	12	0.2727

Table 1: Zip Codes with Most Common C Grades

I could not immediately discern a pattern here. The offending neighborhoods are spread across boroughs, and include zip codes that I know to be higher and lower income. Interestingly, the third-worst zip code in New York is just outside of Queens College campus. Make of that what you will.

Neighborhood	Borough	Zipcode	N	% C Grade
Midtown	Manhattan	10118	10	0.0000
Tottenville	Staten Island	10307	25	0.0000
Jamaica	Queens	11433	20	0.0000
Jamaica	Queens	11430	52	0.0196
Princes Bay	Staten Island	10309	64	0.0317
Rockaway Park	Queens	11694	30	0.0357
Arverne	Queens	11693	23	0.0455
Cambria Heights	Queens	11411	18	0.0588
East New York	Brooklyn	11239	17	0.0625
Bayside	Queens	11426	19	0.0625

Table 2: Zip Codes with Least C-Grades

From these figures, and across the least-offending neighborhoods, the only pattern that I could discern was that there was a plurality of neighborhoods that I know to be low-density, for example the neighborhoods in Staten Islands, around the Rockaways, and in the uppermost reaches of the Bronx. One reason that this pattern could exist is that these neighbhorhoods might be lower density, which may cause a lower incident of rodent infestations. Along with improper temperature storage, rodent infestation is a leading cause of restaurant violations in the city.

Most A-Grade Restaurants

Note that the proportions on this figure are top-coded at 80%, with outliers described in the table below.

data$a.grade <- ifelse(data$low.grade == "A", 1, 0)

zip.data2 <- aggregate(a.grade~zip, data=data, mean)
zip.count2 <- count(data, zip)
zip.data2 <- merge(zip.data2, zip.count, by = "zip")
zip.data2 <- subset(zip.data2, n >= 10)

# Top coding percent grade to improve visualization:
zip.data2$a.grade0 <- zip.data2$a.grade
zip.data2$a.grade <- ifelse(zip.data2$a.grade > 0.8, 0.8, zip.data2$a.grade)

names(zip.data2)[1] <- paste("ZIPCODE")
zip.map2 <- merge(nyc.zipshape, zip.data2, by = "ZIPCODE")

map2 <- ggplot(zip.map2) +
  geom_sf(aes(fill = a.grade)) +
  scale_fill_gradient(
    limits = c(0.4, 0.8),
    low = "red", high = "green",
    ) +
  labs(
    title = "Proportion of 'A' Grades",
    fill = "Percent Cases"
  )

tab3 <- arrange(zip.data2, -a.grade0)[1:10,]
tab4 <- arrange(zip.data2, a.grade0)[1:10,]

map2

data$a.grade <- ifelse(data$low.grade == "A", 1, 0)

zip.data2 <- aggregate(a.grade~zip, data=data, mean)

zip.count2 <- count(data, zip)

zip.data2 <- merge(zip.data2, zip.count, by = "zip")

zip.data2 <- subset(zip.data2, n >= 10)

# Top coding percent grade to improve visualization:

zip.data2$a.grade0 <- zip.data2$a.grade

zip.data2$a.grade <- ifelse(zip.data2$a.grade > 0.8, 0.8, zip.data2$a.grade)

names(zip.data2)[1] <- paste("ZIPCODE")

zip.map2 <- merge(nyc.zipshape, zip.data2, by = "ZIPCODE")

map2 <- ggplot(zip.map2) +

geom_sf(aes(fill = a.grade)) +

scale_fill_gradient(

limits = c(0.4, 0.8),

low = "red", high = "green",

) +

labs(

title = "Proportion of 'A' Grades",

fill = "Percent Cases"

)

tab3 <- arrange(zip.data2, -a.grade0)[1:10,]

tab4 <- arrange(zip.data2, a.grade0)[1:10,]

map2

Table 3 (below) describes the zip codes with the highest incidence of “A” grades.

Borough	Neighborhood	zipcode	N	% A Grade
Queens	Jamaica Estates	11430	52	0.9019608
Manhattan	Garment District	10118	10	0.8888889
Queens	Far Rockaway / Arverne	11693	23	0.8636364
Bronx	City Island / Fordham	10464	29	0.8518519
Queens	Beechhurst	11360	12	0.8333333
Queens	East Elmhurst	11370	19	0.8333333
Manhattan	Rockefeller Center / Diamond	10020	29	0.8214286
Queens	Rockaway Peninsula	11694	30	0.8214286
Brooklyn	Brownsville / Ocean Hill / E.	11239	17	0.8125000
Queens	Jamaica	11436	21	0.8125000

Table 3: Most “A” Grades

The main pattern observable to the naked eye is proximity to airports and low population density. Airports have many restaurants, and are surrounded by hotels with in-house or nearby restaurants. I speculate that these areas are likely to have many restaurants set in high-quality facilities.

Table 4 (below) describes the zipcodes with the lowest incidence of “A” grades.

Neighborhood	Borough	Zipcode	N	% A Grade
Harlem	Manhattan	10039	23	0.4211
Upper East Side	Manhattan	10128	91	0.4217
Queens Village	Queens	11428	31	0.4286
Williamsbridge	Bronx	10466	67	0.4340
Canarsie	Brooklyn	11236	108	0.4516
Harlem	Manhattan	10030	23	0.4737
Battery Park City	Manhattan	10280	22	0.4737
Gravesend	Brooklyn	11229	111	0.4804
Williamsbridge	Bronx	10467	126	0.4825
South Richmond Hill	Queens	11419	77	0.4861

Table 4: Fewest “A” Grades

Some of the same neighborhoods appear in Table 1, which depicted the highest incidence of failing grades: Harlem, South Richmond Hill, and Canarsie. Aside from these high-offending areas, there appears to be a mixutre of neighborhoods.

Findings

This analysis sough geographic patterns in the distribution of high and low grades in restaurant inspections. The data did not give us many clear and decisive guidelines about neighborhoods that are safe or unsafe for dining. It did suggest that dining might be safer around airports, which may be driven by high cleanliness scores in airports and hotels. In terms of unsafe areas, it seems that risk is spread across boroughs, and can occur in higher- and lower-income areas. It may be safer in lower density areas that are further from the city center, which may be due to fewer rodent problems. Further research is required for these speculations to be confirmed. The data do suggest that many restaurants fare poorly in Harlem, South Richmond Hill, or Canarsie, though this analysis does not discern why these problems exist.

I am grateful for Seth Mandel’s help in finding these documents.↩︎

Joseph Nathan Cohen

Department of Sociology, CUNY Queens College, New York, NY

Restaurant Violations in NYC, 2022

Data

Analysis

C-Grade Restaurants

Most A-Grade Restaurants

Findings

Related

Leave a Reply Cancel reply

Tags

Latest Stuff

Creating Podcast Episode Pages

Fast Transcription with Adobe Premiere

Notes: Analyzing Complex Survey Data

Assignment: Political Messaging Strategy

Notes: Basic Linear Regression

Contact

About this Site