Joseph Nathan Cohen

Department of Sociology, CUNY Queens College, New York, NY

Download Baseball Data with baseballr

Downloading baseball statistics in R using the package 'baseballr'

Introduction

This post shows you how to download MLB baseball data into R using the baseballr package. For a helpful overview of the package and its functionality, click here. To find packages to analyze other sports and games, check out the SportDataverse on GitHub.

The package offers functions that fetch baseball data via APIs. It works with the MLBFanGraphsBaseball ReferenceBaseball Savent’s StatcastRetrosheet, and the Chadwick Bureau.

The package allows users to download a wide range of baseball data. Consult the package’s official site for a list of metrics available. Click here for an index of functions available in the package.

Note that outputs have been truncated for legibility.

Install baseballr

RToools Required. You can download the package from GitHub using the pacman package. Try installing the package through R:

RTools for Windows. Users wtih Windows devices may need to install RTools. Visit CRAN for instructions on how to install on Windows. I am not proficient on Apple, but believe that you might need to install Xcode.

Install baseballr. Using the pacman function p_load_current_gh() download the package from Github:

Player-Level Data

Player Identifiers

The Chadwick Baseball Bureau maintains a database of MLB players. To download the the entire player identification table to an object, use chadwick_player_lu(). These identifiers are useful for locating player identifiers and merging them across sets that use different identifiers.

The resulting object will have a table of player identifiers used in other data sets, including those we will use with baseballr-drawn data. Here’s an example of some data returned in the player registry:

Player Lookup. You can also look up particular players’ identifiers:

Or look up players using identifiers:

Player-Level Data

These player identifiers allow users to extract player-specific information. Here are some examples.

Seasonal Leaderboards. FanGraphs offer a very wide range of seasonal data. It also allows you to import multiple years.

Statcast Leaderboards. The package also allows users to download Statcast leaderboards. For example, here is data on exit velocity barrels by pitcher in 2022:

Player-Level Batting Game Logs. Below, I extract a batting game log for Shohei Ohtani (playerid = 19755) for year 2023:

Game-Level Data

Game Identifiers

Many MLB game-level data series rely on game identifiers. You can look up the game identifier using mlb_game_pks(). Below, I show how to get the game identifiers for all the MLB games played on June 6, 2023:

Compiling Multiple Game Identifiers

To get data from multiple games, I would loop over multiple dates to create a vector of game identifiers. You can subset that larger set (e.g., to focus on a particular team). There can be inconsistencies in how the data are stored, so I recommend using bind_rows() from dplyr instead of the base rbind()

Isolating Game Series

Use subset() to pare down your date-based game ID table to focus on a particular team, time, stadium, or any other variable in the game ID table. Below, I isolate the Mets games from the game identifiers constructed above.

Fetching Game-Level Data

You can download many game-level indicators via the baseballr package. Click here for a list of data extraction functions. Often, they will require that you work with game-level identifiers (see above)

For example, to get Mets’ game on March 30, 2023, you can use the mlb_batting_orders() function (using the game identifier codes from ‘mets_game_pks’ above):

What were the probable pitchers that day? What was the line score?

Iterating Over Multiple Games

You can extract and combine information from game data tables by using loops or apply functions. For example, the chunk below will use the lineup data from above (retrieved using the function mlb_batting_orders()) using the array of game identifiers for the Mets’ April 2023 games:

Other Metrics

Pitch-by-Pitch Data

You can get a pitch log for an individual game using the game codes above. Here’s the top of the pitch log for game #71774:

Sportrac Player Contracts

The package allows users to download payroll and salary data from Sportrac. Note that the API may only offer recent seasons’ data. For example:

Park Factors

FanGraphs offers park factor estimates for each year. Here’s an example for 2018:

Leave a Reply

Your email address will not be published. Required fields are marked *