What is “Longitudinal Data”?

Longitudinal data is data that covers subjects over multiple time periods. It is often contrasted with “cross-sectional data”, which measures subjects at a single point in time.
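To make the contrast concrete, here is a minimal sketch using a made-up toy data set. The team codes "AAA" and "BBB" and all attendance figures are hypothetical, chosen only for illustration.

```r
# Hypothetical toy data: two teams, each observed on three dates.
toy <- data.frame(
  Tm         = rep(c("AAA", "BBB"), each = 3),
  Date       = rep(as.Date("2019-04-01") + 0:2, times = 2),
  Attendance = c(10000, 12000, 11000, 20000, 21000, 19000)
)

# Longitudinal: each team appears once per date (6 rows).
toy

# Cross-sectional: every team at a single point in time (2 rows).
cross.section <- subset(toy, Date == as.Date("2019-04-02"))
cross.section
```

The longitudinal version keeps one row per team per date, while the cross-section keeps one row per team.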

The data object dat (below) offers an example of longitudinal attendance data for Major League Baseball teams.

# To install baseballr package:
# library(devtools)
# install_github("BillPetti/baseballr")

library(baseballr)

team <- c("TOR", "NYY", "BAL", "BOS", "TBR",
          "KCR", "CLE", "DET", "MIN", "CHW",
          "LAA", "HOU", "SEA", "OAK", "TEX",
          "NYM", "PHI", "ATL", "MIA", "WSN",
          "MIL", "STL", "PIT", "CHC", "CIN",
          "SFG", "LAD", "ARI", "COL", "SDP")
# Download each team's 2019 game log and keep game number, date, team,
# home/away, opponent, and attendance (columns 1-5 and 18).
team.logs <- lapply(team, function(i) {
  team_results_bref(i, 2019)[c(1, 2, 3, 4, 5, 18)]
})

# Stack the 30 team logs into one long data frame.
dat <- do.call(rbind, team.logs)
rm(team.logs)

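# Append the year so that each date string is complete.
dat$Date <- paste0(dat$Date, ", 2019")

# Drop rows whose date contains a parenthesis (such as numbered
# doubleheader games), since those strings do not parse as plain dates.
dat <- subset(dat, !grepl("(", Date, fixed = TRUE))

# Parse the date strings and store them as Date objects.
library(parsedate)
dat$Date <- as.Date(parse_date(dat$Date, default_tz = ""))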
head(dat, 10)
## # A tibble: 10 x 6
##       Gm Date       Tm    H_A   Opp   Attendance
##    <dbl> <date>     <chr> <chr> <chr>      <dbl>
##  1     1 2019-03-28 TOR   H     DET        45048
##  2     2 2019-03-29 TOR   H     DET        18054
##  3     3 2019-03-30 TOR   H     DET        25429
##  4     4 2019-03-31 TOR   H     DET        16098
##  5     5 2019-04-01 TOR   H     BAL        10460
##  6     6 2019-04-02 TOR   H     BAL        12110
##  7     7 2019-04-03 TOR   H     BAL        11436
##  8     8 2019-04-04 TOR   A     CLE        10375
##  9     9 2019-04-05 TOR   A     CLE        12881
## 10    10 2019-04-06 TOR   A     CLE        18429

Our data set includes date, home/away, opponent, and attendance data for the 2019 season. Data comes from Baseball Reference, downloaded using the excellent baseballr package. To see how I downloaded and prepared this data for analysis, download this page’s Markdown file here.

Basic Terminology

Some terminology:

How is Longitudinal Data Useful?

Without longitudinal data, we are left to work with cross-sectional snapshots. A snapshot might tell us that 44,424 people came to see the Mets’ 2019 home opener, or that fans made 2,412,887 visits to Citi Field that year.

Longitudinal data allows us to assess the effects of changes in our units of analysis and in the environments in which they operate. Unpacking phenomena over time gives you new vantage points and bases for comparison:

library(ggplot2)
library(scales)

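# Assumption: dat.nym is not defined in the code shown on this page.
# Given the plot title, it is taken here to be the Mets' home games.
dat.nym <- subset(dat, Tm == "NYM" & H_A == "H")

# Convert Date to POSIXct so it can be used with scale_x_datetime().
dat.nym$Date.P <- as.POSIXct(dat.nym$Date)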

ggplot(dat.nym, aes(x = Date.P, y = Attendance)) + 
  geom_col() +
  scale_x_datetime(breaks = date_breaks("7 days"), labels = date_format(format = "%m/%d")) +
  xlab("Date") + ylab("Attendance") + 
  scale_y_continuous(breaks = seq(0, 45000, 5000), labels = comma) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ggtitle("New York Mets Home Game Attendance, 2019")

[Figure: New York Mets Home Game Attendance, 2019]

Longitudinal data also lets us isolate time effects, as in this table of mean attendance by day of the week:

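# $wday runs from 0 (Sunday) to 6 (Saturday). Take it from the Date column,
# which carries no time zone, and order the levels Monday through Sunday.
dat.nym$day <- factor(as.POSIXlt(dat.nym$Date)$wday,
                      levels = c(1:6, 0),
                      labels = c("Monday", "Tuesday", "Wednesday",
                                 "Thursday", "Friday", "Saturday", "Sunday"))
temp <- aggregate(Attendance ~ day, data = dat.nym, mean)
temp$Attendance <- round(temp$Attendance, 0)
names(temp)[1] <- "Day"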
temp
##         Day Attendance
## 1    Monday      22184
## 2   Tuesday      27660
## 3 Wednesday      27406
## 4  Thursday      30734
## 5    Friday      30826
## 6  Saturday      36087
## 7    Sunday      32914
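Because the numeric codes returned by as.POSIXlt() start at zero, it is worth confirming how they map to labels before building a factor from them. A small check, using the 2019 Opening Day date that appears in the data above (March 28, a Thursday); the object names here are only illustrative:

```r
# Known reference date: 2019 Opening Day fell on Thursday, March 28.
d  <- as.Date("2019-03-28")
lt <- as.POSIXlt(d)

lt$wday   # 4: $wday counts from 0 = Sunday, so 4 = Thursday
lt$mon    # 2: $mon counts from 0 = January, so 2 = March

# Stating the levels explicitly keeps the code-to-label pairing visible.
day.labels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")
day <- factor(lt$wday, levels = 0:6, labels = day.labels)
as.character(day)   # "Thursday"
```

Supplying levels alongside labels also guards against a silent shift when some weekday is missing from the data.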

Or mean attendance by month:

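# $mon runs from 0 (January) to 11 (December); spell the component name out
# in full rather than relying on partial matching of "$mo".
dat.nym$month <- factor(as.POSIXlt(dat.nym$Date.P)$mon,
                        labels = c("April", "May", "June",
                                   "July", "August", "September"))
temp <- aggregate(Attendance ~ month, data = dat.nym, mean)
temp$Attendance <- round(temp$Attendance, 0)
names(temp)[1] <- "Month"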
temp
##       Month Attendance
## 1     April      28613
## 2       May      28244
## 3      June      30943
## 4      July      35780
## 5    August      34135
## 6 September      26831

And, of course, the finer-grained data allows us to examine more relationships, like mean attendance by opponent. Here are the top five opponents:

temp <- aggregate(Attendance ~ Opp, data = dat.nym, mean)
temp$Attendance <- round(temp$Attendance, 0)
names(temp)[1] <- "Opponent"
# Sort from highest to lowest mean attendance and renumber the rows.
temp <- temp[order(-temp$Attendance), ]
rownames(temp) <- NULL
head(temp, 5)
##   Opponent Attendance
## 1      NYY      42736
## 2      LAD      35627
## 3      PIT      35565
## 4      CHC      35511
## 5      WSN      34885

And this finer-grained data allows us to develop models that incorporate consideration of time:

# $mon runs from 0 (January); the 2019 data spans March through September.
dat$month <- factor(as.POSIXlt(dat$Date)$mon,
                    labels = c("March", "April", "May", "June",
                               "July", "August", "September"))
# $wday runs from 0 (Sunday) to 6 (Saturday), so the labels start at Sunday,
# which becomes the reference level in the regression.
dat$day <- factor(as.POSIXlt(dat$Date)$wday,
                  labels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                             "Thursday", "Friday", "Saturday"))
summary(lm(Attendance ~ month + day, dat))
## 
## Call:
## lm(formula = Attendance ~ month + day, data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28160  -8806    -18   8168  30107 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     32092.0     1075.6  29.836  < 2e-16 ***
## monthApril      -3977.2     1108.0  -3.590 0.000334 ***
## monthMay        -3146.8     1102.0  -2.855 0.004316 ** 
## monthJune        -839.5     1100.5  -0.763 0.445598    
## monthJuly         136.7     1110.2   0.123 0.902040    
## monthAugust     -1428.9     1100.5  -1.298 0.194190    
## monthSeptember  -2943.5     1103.9  -2.666 0.007694 ** 
## dayMonday       -5247.4      634.8  -8.266  < 2e-16 ***
## dayTuesday      -5023.0      550.1  -9.131  < 2e-16 ***
## dayWednesday    -4592.4      554.6  -8.281  < 2e-16 ***
## dayThursday     -3461.8      593.3  -5.834 5.76e-09 ***
## dayFriday         167.4      541.2   0.309 0.757108    
## daySaturday      3257.9      536.2   6.076 1.33e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10660 on 4713 degrees of freedom
## Multiple R-squared:  0.09668,    Adjusted R-squared:  0.09438 
## F-statistic: 42.03 on 12 and 4713 DF,  p-value: < 2.2e-16

Special Considerations with Longitudinal Data

If you are getting into analyzing longitudinal data, there are three things to know from the outset:

  1. Special wrangling operations: how unit-time identifiers work, and how to perform the data transformations commonly used in longitudinal analysis.
  2. Standard methods for describing this kind of data.
  3. Special modeling considerations that arise when using longitudinal data in regressions.

Each topic will be discussed in this module.
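As a preview of the wrangling topic, the sketch below shows three operations that come up often with unit-time data: checking that the identifiers are unique, computing a within-unit lag, and reshaping from long to wide. It uses base R only and a made-up toy data set; the team codes, game numbers, and attendance values are hypothetical.

```r
# Hypothetical toy panel: two teams (units) observed over three games (time).
toy <- data.frame(
  Tm         = rep(c("AAA", "BBB"), each = 3),
  Gm         = rep(1:3, times = 2),
  Attendance = c(10000, 12000, 11000, 20000, 21000, 19000)
)

# 1. Each unit-time pair (Tm, Gm) should identify exactly one row.
any(duplicated(toy[c("Tm", "Gm")]))

# 2. Sort by unit, then time, and take the previous game's attendance
#    within each team. The first game of each team has no lag (NA).
toy <- toy[order(toy$Tm, toy$Gm), ]
toy$Attendance.lag <- ave(toy$Attendance, toy$Tm,
                          FUN = function(x) c(NA, head(x, -1)))

# 3. Reshape from long (one row per team-game) to wide (one row per team).
wide <- reshape(toy[c("Tm", "Gm", "Attendance")],
                idvar = "Tm", timevar = "Gm", v.names = "Attendance",
                direction = "wide")
wide
```

Computing the lag inside ave() keeps one team's last game from leaking into the next team's first game, which is the usual pitfall when lagging a stacked data set.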
