What is “Longitudinal Data”?
Longitudinal data follows the same subjects across multiple time periods. It is often contrasted with “cross-sectional data,” which measures subjects at a single point in time.
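As a toy illustration (the attendance figures here are invented), longitudinal data in “long” format stacks one row per subject per period:

# Two units (teams) observed over two periods (years); values are made up
toy <- data.frame(team = c("NYM", "NYM", "TOR", "TOR"),
                  year = c(2018, 2019, 2018, 2019),
                  attendance = c(2200000, 2400000, 2300000, 1800000))
# A cross-section would keep only one of the two years:
subset(toy, year == 2019)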
The data object dat (below) offers an example of longitudinal attendance data for Major League Baseball teams.
# To install the baseballr package:
# library(devtools)
# install_github("BillPetti/baseballr")
library(baseballr)

team <- c("TOR", "NYY", "BAL", "BOS", "TBR",
          "KCR", "CLE", "DET", "MIN", "CHW",
          "LAA", "HOU", "SEA", "OAK", "TEX",
          "NYM", "PHI", "ATL", "MIA", "WSN",
          "MIL", "STL", "PIT", "CHC", "CIN",
          "SFG", "LAD", "ARI", "COL", "SDP")

# Download each team's 2019 results from Baseball Reference, keeping
# game number, date, team, home/away, opponent, and attendance
temp.list <- list()
for (i in team) {
  temp <- team_results_bref(i, 2019)
  temp.list[[i]] <- temp[c(1, 2, 3, 4, 5, 18)]
}

# Stack the thirty team-level tables into one longitudinal data set
dat <- do.call(rbind, temp.list)
rm(list = ls(pattern = "temp"))

# Baseball Reference dates omit the year; add it before parsing
dat$Date <- paste0(dat$Date, ", 2019")

# Drop doubleheader games, whose dates carry a parenthetical game number
dat <- subset(dat, !grepl("(", dat$Date, fixed = TRUE))

library(parsedate)
dat$Date <- as.Date(parse_date(dat$Date, default_tz = ""))
head(dat, 10)
## # A tibble: 10 x 6
##       Gm Date       Tm    H_A   Opp   Attendance
##    <dbl> <date>     <chr> <chr> <chr>      <dbl>
##  1     1 2019-03-28 TOR   H     DET        45048
##  2     2 2019-03-29 TOR   H     DET        18054
##  3     3 2019-03-30 TOR   H     DET        25429
##  4     4 2019-03-31 TOR   H     DET        16098
##  5     5 2019-04-01 TOR   H     BAL        10460
##  6     6 2019-04-02 TOR   H     BAL        12110
##  7     7 2019-04-03 TOR   H     BAL        11436
##  8     8 2019-04-04 TOR   A     CLE        10375
##  9     9 2019-04-05 TOR   A     CLE        12881
## 10    10 2019-04-06 TOR   A     CLE        18429
Our data set includes date, home/away status, opponent, and attendance figures for the 2019 season. The data come from Baseball Reference, downloaded using the excellent baseballr package. To see how I downloaded and prepared this data for analysis, download this page’s Markdown file here.
Basic Terminology
Some terminology:
- Units refer to the individual subjects that we are following across time. In the above data, our units are baseball teams.
- Periods refer to the points in time at which units are observed. Above, our periods are dates.
- Cross-sections refer to comparisons across units within the same time period. A cross-section of our data set would only include attendance data for a single day.
- Time series refer to data series pertaining to the same unit over time. If our data set comprised only one team’s attendance data, it would contain just one time series. (Both kinds of slices are illustrated below.)
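To make these terms concrete, here is one way to take each kind of slice from the dat object built above (the July 4 date is an arbitrary choice):

# A cross-section: every home game played on a single date
cross.section <- subset(dat, Date == as.Date("2019-07-04") & H_A == "H")

# A time series: one team's games across the whole season
time.series <- subset(dat, Tm == "TOR")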
How is Longitudinal Data Useful?
Without longitudinal data, we are left to work with cross-sectional snapshots. A snapshot might tell us that 44,424 people came to see the Mets’ 2019 home opener, or that fans made 2,412,887 visits to Citi Field that year.
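Both of those snapshots can be recovered by collapsing the longitudinal data. A quick sketch, assuming the dat object built above (nym.home is a throwaway name, and the season total reflects the doubleheader games dropped during preparation):

nym.home <- subset(dat, Tm == "NYM" & H_A == "H")  # Mets home games
nym.home$Attendance[1]    # the home opener (first home game in schedule order)
sum(nym.home$Attendance)  # total home attendance across the season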
Longitudinal data allows us to assess the effects of changes in our units of analysis and in the environments in which they operate. Unpacking phenomena over time gives us new vantage points and bases for comparison:
library(ggplot2)
library(scales)

# Subset to Mets home games, and convert Date to POSIXct,
# the class that scale_x_datetime() expects
dat.nym <- subset(dat, Tm == "NYM" & H_A == "H")
dat.nym$Date.P <- as.POSIXct(dat.nym$Date)

ggplot(dat.nym, aes(x = Date.P, y = Attendance)) +
  geom_col() +
  scale_x_datetime(breaks = date_breaks("7 days"),
                   labels = date_format(format = "%m/%d")) +
  xlab("Date") + ylab("Attendance") +
  scale_y_continuous(breaks = seq(0, 45000, 5000), labels = comma) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  ggtitle("New York Mets Home Game Attendance, 2019")
Longitudinal data also lets us isolate time effects specifically, as in this table of mean home attendance by day of the week:
# as.POSIXlt()$wday runs 0-6, starting with Sunday
dat.nym$day <- factor(as.POSIXlt(dat.nym$Date.P)$wday,
                      labels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                                 "Thursday", "Friday", "Saturday"))
temp <- aggregate(Attendance ~ day, data = dat.nym, mean)
temp$Attendance <- round(temp$Attendance, 0)
names(temp)[1] <- "Day"
temp
##         Day Attendance
## 1    Sunday      22184
## 2    Monday      27660
## 3   Tuesday      27406
## 4 Wednesday      30734
## 5  Thursday      30826
## 6    Friday      36087
## 7  Saturday      32914
Or mean attendance by month:
# as.POSIXlt()$mon runs 0-11, starting with January; Mets home
# games span April (3) through September (8)
dat.nym$month <- factor(as.POSIXlt(dat.nym$Date.P)$mon,
                        labels = c("April", "May", "June",
                                   "July", "August", "September"))
temp <- aggregate(Attendance ~ month, data = dat.nym, mean)
temp$Attendance <- round(temp$Attendance, 0)
names(temp)[1] <- "Month"
temp
##       Month Attendance
## 1     April      28613
## 2       May      28244
## 3      June      30943
## 4      July      35780
## 5    August      34135
## 6 September      26831
And, of course, finer-grained data allows us to test more relationships, such as mean attendance by opponent. Here are the Mets’ top five home draws:
temp <- aggregate(Attendance ~ Opp, data = dat.nym, mean)
temp$Attendance <- round(temp$Attendance, 0)
names(temp)[1] <- "Opponent"
# Sort from highest to lowest mean attendance and renumber the rows
temp <- temp[order(-temp$Attendance), ]
rownames(temp) <- seq_len(nrow(temp))
head(temp, 5)
##   Opponent Attendance
## 1      NYY      42736
## 2      LAD      35627
## 3      PIT      35565
## 4      CHC      35511
## 5      WSN      34885
Finally, this finer-grained data allows us to develop models that explicitly account for time, here using all thirty teams:
# The full data set spans March (2) through September (8)
dat$month <- factor(as.POSIXlt(dat$Date)$mon,
                    labels = c("March", "April", "May", "June",
                               "July", "August", "September"))
dat$day <- factor(as.POSIXlt(dat$Date)$wday,
                  labels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                             "Thursday", "Friday", "Saturday"))
summary(lm(Attendance ~ month + day, dat))
## 
## Call:
## lm(formula = Attendance ~ month + day, data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28160  -8806    -18   8168  30107 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     32092.0     1075.6  29.836  < 2e-16 ***
## monthApril      -3977.2     1108.0  -3.590 0.000334 ***
## monthMay        -3146.8     1102.0  -2.855 0.004316 ** 
## monthJune        -839.5     1100.5  -0.763 0.445598    
## monthJuly         136.7     1110.2   0.123 0.902040    
## monthAugust     -1428.9     1100.5  -1.298 0.194190    
## monthSeptember  -2943.5     1103.9  -2.666 0.007694 ** 
## dayMonday       -5247.4      634.8  -8.266  < 2e-16 ***
## dayTuesday      -5023.0      550.1  -9.131  < 2e-16 ***
## dayWednesday    -4592.4      554.6  -8.281  < 2e-16 ***
## dayThursday     -3461.8      593.3  -5.834 5.76e-09 ***
## dayFriday         167.4      541.2   0.309 0.757108    
## daySaturday      3257.9      536.2   6.076 1.33e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10660 on 4713 degrees of freedom
## Multiple R-squared:  0.09668, Adjusted R-squared:  0.09438 
## F-statistic: 42.03 on 12 and 4713 DF,  p-value: < 2.2e-16
Special Considerations with Longitudinal Data
If you are getting into analyzing longitudinal data, there are three things to know from the outset:
- Special wrangling operations, particularly how unit-time identifiers work and how to perform the data transformations commonly used in longitudinal analysis (previewed in the sketch below).
- Standard methods for describing this kind of data.
- Special modeling considerations that arise when using longitudinal data in regressions.
Each topic will be discussed in this module.
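As a preview of the wrangling piece, here is a minimal base-R sketch using the dat object built above (the id and Attendance.lag names are just for illustration):

# Sort within each unit (team) by time so that lags line up correctly
dat <- dat[order(dat$Tm, dat$Date), ]

# A unit-time identifier uniquely labels each team-date observation
dat$id <- paste(dat$Tm, dat$Date, sep = "-")

# Lagged attendance: each team's previous game, NA for its first game;
# ave() applies the shift separately within each team's series
dat$Attendance.lag <- ave(dat$Attendance, dat$Tm,
                          FUN = function(x) c(NA, head(x, -1)))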