Introduction
This post shows you how to download MLB data into R using the baseballr package. For a helpful overview of the package and its functionality, see the package’s documentation site. To find packages for analyzing other sports and games, check out the SportsDataverse on GitHub.
The package offers functions that fetch baseball data via APIs. It works with data from MLB, FanGraphs, Baseball Reference, Baseball Savant’s Statcast, Retrosheet, and the Chadwick Bureau.
The package allows users to download a wide range of baseball data. Consult the package’s official site for a list of the metrics available and for an index of the functions in the package.
Note that outputs have been truncated for legibility.
Install baseballr
Required tools. You can install baseballr from GitHub using the pacman package. First, install pacman from within R:
# Install the pacman package, if not already installed
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
Rtools for Windows. Users with Windows devices may need to install Rtools. Visit CRAN for instructions on how to install it on Windows. I am not proficient with macOS, but I believe that Mac users might need to install Xcode.
Install baseballr. Use the pacman function p_load_current_gh() to download the package from GitHub:
# Download and load the 'baseballr' package from Bill Petti's GitHub site
pacman::p_load_current_gh("billpetti/baseballr")
Player-Level Data
Player Identifiers
The Chadwick Baseball Bureau maintains a database of MLB players. To download the entire player identification table to an object, use chadwick_player_lu(). The table is useful for looking up a player’s identifiers and for merging data sets that use different identifier systems.
# Store the player data in an object called 'players'
players <- chadwick_player_lu()
The resulting object will have a table of player identifiers used in other data sets, including those we will use with baseballr-drawn data. Here’s an example of some data returned in the player registry:
# Show the first 5 entries, selected columns (to keep the example compact)
head(players[, c(13, 14, 27, 7, 4, 5)], 5)
## # A tibble: 5 × 6
##   name_last name_first mlb_played_first key_fangraphs key_retro key_bbref
##   <chr>     <chr>                 <int>         <int> <chr>     <chr>
## 1 Schneider Davis                  2023         23565 schnd001  schneda03
## 2 Goodman   Hunter                 2023            NA goodh001  goodmhu01
## 3 Acton     Garrett                2023         27583 actog001  actonga01
## 4 Rodriguez José                   2023         24388 rodrj008  rodrijo09
## 5 Waldron   Matt                   2023         25550 waldm003  waldrma01
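To illustrate how such a registry helps with merging, here is a minimal sketch that uses a two-row crosswalk copied from the output above. The two stats tables and their `value_fg` and `value_bbref` columns are hypothetical placeholders, not real statistics; in practice you would substitute tables keyed by FanGraphs and Baseball Reference identifiers.

```r
# Crosswalk: two rows copied from the registry output shown above
crosswalk <- data.frame(
  name_last     = c("Schneider", "Waldron"),
  key_fangraphs = c(23565L, 25550L),
  key_bbref     = c("schneda03", "waldrma01"),
  stringsAsFactors = FALSE
)

# Hypothetical tables keyed by different identifiers (placeholder values)
fg_table <- data.frame(key_fangraphs = c(23565L, 25550L),
                       value_fg = c(1, 2))
bbref_table <- data.frame(key_bbref = c("waldrma01", "schneda03"),
                          value_bbref = c(20, 10),
                          stringsAsFactors = FALSE)

# Attach the Baseball Reference key to the FanGraphs table, then join on it
fg_keyed <- merge(fg_table, crosswalk, by = "key_fangraphs")
combined <- merge(fg_keyed, bbref_table, by = "key_bbref")

combined[, c("name_last", "value_fg", "value_bbref")]
```

The same two-step join should work with the full `players` table in place of `crosswalk`, provided the key columns have matching types.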
Player Lookup. You can also look up particular players’ identifiers:
playerid_lookup("Bichette")
## # A tibble: 3 × 11
##   first_name last_name given_name     name_suffix nick_name birth_year
##   <chr>      <chr>     <chr>          <chr>       <chr>          <int>
## 1 Dante      Bichette  Alphonse Dante ""          ""              1992
## 2 Dante      Bichette  Alphonse Dante ""          ""              1963
## 3 Bo         Bichette  Bo Joseph      ""          ""              1998
Or look up players using identifiers:
playername_lookup(666182)

## # A tibble: 1 × 11
##   name_first name_last name_given name_suffix name_nick birth_year
##   <chr>      <chr>     <chr>      <chr>       <chr>          <int>
## 1 Bo         Bichette  Bo Joseph  ""          ""              1998
Player-Level Data
These player identifiers allow users to extract player-specific information. Here are some examples.
Seasonal Leaderboards. FanGraphs offers a very wide range of seasonal data. The function fg_batter_leaders() also lets you import multiple seasons by setting different start and end seasons.
season_bat_board <- fg_batter_leaders(startseason = "2023", endseason = "2023", sortstat = "WAR")
season_bat_board[1:10, c(6, 7, 8, 12, 31, 19, 70)]

## # A tibble: 10 × 7
##    PlayerName       playerid   Age     G   AVG    HR   WAR
##    <chr>               <int> <int> <int> <dbl> <int> <dbl>
##  1 Ronald Acuña Jr.    18401    25   159 0.338    41  8.29
##  2 Mookie Betts        13611    30   152 0.306    39  8.26
##  3 Freddie Freeman      5361    33   161 0.331    29  7.90
##  4 Matt Olson          14344    29   162 0.283    54  6.70
##  5 Shohei Ohtani       19755    28   135 0.304    44  6.62
##  6 Marcus Semien       12533    32   162 0.276    29  6.31
##  7 Corey Seager        13624    29   119 0.327    33  6.10
##  8 Francisco Lindor    12916    29   160 0.254    31  6.02
##  9 Corbin Carroll      25878    22   155 0.285    25  5.99
## 10 Julio Rodríguez     23697    22   155 0.275    32  5.86
Statcast Leaderboards. The package also allows users to download Statcast leaderboards. For example, here is the exit velocity and barrels leaderboard for pitchers in 2022:
pitcher_statboard_22 <- statcast_leaderboards(leaderboard = "exit_velocity_barrels", year = 2022, abs = 50, player_type = "pitcher")
pitcher_statboard_22[1:10, c(2, 3, 4, 5, 8, 12)]

## # A tibble: 10 × 6
##    `last_name, first_name` player_id attempts avg_hit_angle avg_hit_speed
##    <chr>                       <int>    <int>         <dbl>         <dbl>
##  1 Gonzales, Marco            594835      623          14.5          86.7
##  2 Alcantara, Sandy           645261      620           5.5          87.8
##  3 Mikolas, Miles             571945      606          11            87.8
##  4 Wainwright, Adam           425794      599          11            87.8
##  5 Quantrill, Cal             615698      585          13.8          87.6
##  6 Pérez, Martín              527048      576           8.1          88.2
##  7 Irvin, Cole                608344      570          15.5          89.4
##  8 Lyles, Jordan              543475      570          15.3          88.6
##  9 Webb, Logan                657277      567           3.1          88.9
## 10 Freeland, Kyle             607536      566          12.7          89.8
Player-Level Batting Game Logs. Below, I extract a batting game log for Shohei Ohtani (playerid = 19755) for year 2023:
# Player Game Logs: For Shohei Ohtani in 2023
fg_batter_game_logs(playerid = 19755, year = 2023)

## # A tibble: 135 × 250
##    PlayerName playerid Date  Team  Opp   season   Age BatOrder Pos       G
##    <chr>         <int> <chr> <chr> <chr>  <int> <int> <chr>    <chr> <dbl>
##  1 Shohei Oh…    19755 2023… LAA   @OAK    2023    28 2        DH        1
##  2 Shohei Oh…    19755 2023… LAA   @OAK    2023    28 2        DH        1
##  3 Shohei Oh…    19755 2023… LAA   @OAK    2023    28 3        DH        1
##  4 Shohei Oh…    19755 2023… LAA   @PHI    2023    28 3        DH        1
##  5 Shohei Oh…    19755 2023… LAA   @PHI    2023    28 2        DH        1
##  6 Shohei Oh…    19755 2023… LAA   @PHI    2023    28 2        DH        1
##  7 Shohei Oh…    19755 2023… LAA   @NYM    2023    28 2        DH        1
##  8 Shohei Oh…    19755 2023… LAA   @NYM    2023    28 2        DH        1
##  9 Shohei Oh…    19755 2023… LAA   @NYM    2023    28 2        DH        1
## 10 Shohei Oh…    19755 2023… LAA   CIN     2023    28 2        DH        1
## # ℹ 125 more rows
Game-Level Data
Game Identifiers
Many MLB game-level data series rely on game identifiers. You can look up the game identifier using mlb_game_pks(). Below, I show how to get the game identifiers for all the MLB games played on June 6, 2023:
# Function to call all MLB games on date
gameids <- mlb_game_pks("2023-06-06", level_ids = 1)

# Showing selected columns to readers
gameids[, c(1, 50, 42, 40, 32)]

## # A tibble: 15 × 5
##    game_pk teams.home.team.name  teams.home.score teams.away.team.name
##      <int> <chr>                            <int> <chr>
##  1  717882 Tampa Bay Rays                       7 Minnesota Twins
##  2  717875 Philadelphia Phillies                1 Detroit Tigers
##  3  717876 Miami Marlins                        6 Kansas City Royals
##  4  717881 Washington Nationals                 5 Arizona Diamondbacks
##  5  717874 Pittsburgh Pirates                   2 Oakland Athletics
##  6  717877 New York Yankees                     2 Chicago White Sox
##  7  717871 Toronto Blue Jays                    5 Houston Astros
##  8  717873 Cleveland Guardians                  4 Boston Red Sox
##  9  717870 Cincinnati Reds                      9 Los Angeles Dodgers
## 10  717872 Atlanta Braves                       6 New York Mets
## 11  717867 Milwaukee Brewers                    4 Baltimore Orioles
## 12  717868 Texas Rangers                        6 St. Louis Cardinals
## 13  717869 Colorado Rockies                     4 San Francisco Giants
## 14  717861 Los Angeles Angels                   7 Chicago Cubs
## 15  717864 San Diego Padres                     1 Seattle Mariners
## # ℹ 1 more variable: teams.away.score <int>
Compiling Multiple Game Identifiers
To get data from multiple games, I would loop over multiple dates to build a table of game identifiers. You can then subset that larger table (e.g., to focus on a particular team). There can be inconsistencies in how the data are stored from one date to the next, so I recommend using bind_rows() from dplyr, which fills missing columns with NA, instead of the base rbind(), which requires matching columns.
library(dplyr)  # provides bind_rows()

# Set your parameters.
# The loop is designed to adapt if these values are changed:
start_year <- 2023
end_year <- 2023
start_mo <- 3   # March
end_mo <- 4     # April
start_day <- 30
end_day <- 30

# Format the dates using sprintf
start_date_str <- sprintf("%04d-%02d-%02d", start_year, start_mo, start_day)
end_date_str <- sprintf("%04d-%02d-%02d", end_year, end_mo, end_day)

# Convert string representations to Date objects
start_date <- as.Date(start_date_str)
end_date <- as.Date(end_date_str)

# Generate a sequence of dates from start to end
date_seq <- seq.Date(from = start_date, to = end_date, by = "day")

game_ids <- data.frame()

# Loop over each date to get the game IDs for that date
for (date in date_seq) {
  # A for loop drops the Date class and passes the underlying day count,
  # so convert each value back to a Date (the explicit origin is needed before R 4.3)
  datec <- as.Date(date, origin = "1970-01-01")

  # Fetch that day's games
  temp <- mlb_game_pks(datec, level_ids = 1)

  # Bind them to your game ID data table
  game_ids <- bind_rows(game_ids, temp)
}
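The loop above can also be written so that a single failed request does not stop the whole run. The sketch below is one way to do that, not the package's own method: `fetch_by_date()` is a hypothetical helper (it is not part of baseballr), and the stub `fake_fetch()` stands in for `mlb_game_pks()` so that the example runs without a network connection.

```r
# Hypothetical helper: call `fetch` once per date and keep the successful results.
# Indexing with dates[i] keeps the Date class, which `for (d in dates)` drops.
fetch_by_date <- function(dates, fetch) {
  results <- lapply(seq_along(dates), function(i) {
    tryCatch(fetch(dates[i]), error = function(e) NULL)
  })
  Filter(Negate(is.null), results)
}

# Stub standing in for mlb_game_pks(); it fails on one date on purpose
fake_fetch <- function(d) {
  if (format(d) == "2023-03-31") stop("simulated download failure")
  data.frame(date = format(d), is_date = inherits(d, "Date"),
             stringsAsFactors = FALSE)
}

dates  <- seq.Date(as.Date("2023-03-30"), as.Date("2023-04-01"), by = "day")
pieces <- fetch_by_date(dates, fake_fetch)

# Combine the per-date tables (base rbind works here because the stub's columns match)
games <- do.call(rbind, pieces)
games
```

With the real API, you could pass `function(d) mlb_game_pks(d, level_ids = 1)` as `fetch` and combine the pieces with `dplyr::bind_rows(pieces)`.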
Isolating Game Series
Use subset() to pare down your date-based game ID table to focus on a particular team, time, stadium, or any other variable in the game ID table. Below, I isolate the Mets games from the game identifiers constructed above.
# Isolating Mets games; Code found by looking up the team code on the 'game_ids' table:
mets_game_ids <- subset(game_ids, teams.away.team.id == 121 | teams.home.team.id == 121)

# The first ten rows of the resulting data frame:
mets_game_ids[1:10, c(6, 1, 41, 51)]

##                 gameDate game_pk teams.away.team.name teams.home.team.name
## 10  2023-03-30T20:10:00Z  718774        New York Mets        Miami Marlins
## 16  2023-03-31T22:40:00Z  718771        New York Mets        Miami Marlins
## 32  2023-04-01T20:10:00Z  718755        New York Mets        Miami Marlins
## 41  2023-04-02T17:40:00Z  718741        New York Mets        Miami Marlins
## 51  2023-04-03T18:10:00Z  718731        New York Mets    Milwaukee Brewers
## 73  2023-04-04T23:40:00Z  718712        New York Mets    Milwaukee Brewers
## 85  2023-04-05T17:40:00Z  718698        New York Mets    Milwaukee Brewers
## 99  2023-04-06T17:10:00Z  718693        Miami Marlins        New York Mets
## 103 2023-04-07T17:10:00Z  718693        Miami Marlins        New York Mets
## 123 2023-04-08T20:10:00Z  718661        Miami Marlins        New York Mets
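The `gameDate` values above end in `Z`, which marks them as UTC timestamps. If you want local start times, a small sketch like the following converts them. The choice of "America/New_York" is only an illustration, and the variable names are my own.

```r
# Two gameDate strings copied from the output above
game_date_utc <- c("2023-03-30T20:10:00Z", "2023-03-31T22:40:00Z")

# Parse the strings as UTC date-times
start_utc <- as.POSIXct(game_date_utc, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Display them in US Eastern time (an illustrative choice of time zone)
start_local <- format(start_utc, tz = "America/New_York", usetz = TRUE)
start_local
```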
# Create a vector of game_pk codes to be used in other loops
mets_game_pks <- mets_game_ids$game_pk
print(mets_game_pks)

##  [1] 718774 718771 718755 718741 718731 718712 718698 718693 718693 718661
## [11] 718647 718634 718621 718610 718578 718565 718546 718532 718516 718508
## [21] 718503 718480 718478 718455 718431 718412 718398 718394 718371 718361
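Note that 718693 appears twice in the vector above, so a loop over `mets_game_pks` would request that game twice. If that is not what you want, `unique()` drops repeated identifiers. The sketch below uses the first ten values printed above; it is worth checking the `game_ids` table to see why an identifier repeats before dropping it.

```r
# First ten identifiers copied from the printed vector above
pks <- c(718774, 718771, 718755, 718741, 718731,
         718712, 718698, 718693, 718693, 718661)

# Which identifiers are repeated?
repeated_pks <- pks[duplicated(pks)]
repeated_pks

# Keep each identifier once, preserving the original order
unique_pks <- unique(pks)
length(unique_pks)
```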
Fetching Game-Level Data
You can download many game-level indicators via the baseballr package. See the package’s function index for a list of data extraction functions. Often, they will require that you work with game-level identifiers (see above).
For example, to get the Mets’ batting order for their game on March 30, 2023, you can use the mlb_batting_orders() function (using the first game identifier from ‘mets_game_pks’ above):
# Fetch the batting order data
mets_order_game1 <- mlb_batting_orders(718774)  # Game ID from above

# Isolate only Mets
mets_order_game1 <- subset(mets_order_game1, teamID == 121)  # Mets team ID

# Print results to show readers
print(mets_order_game1)

## # A tibble: 9 × 8
##       id fullName abbreviation batting_order batting_position_num team ...
##    <int> <chr>    <chr>        <chr>         <chr>                <chr>
## 1 607043 Brandon… CF           1             0                    away
## 2 516782 Starlin… RF           2             0                    away
## 3 596019 Francis… SS           3             0                    away
## 4 624413 Pete Al… 1B           4             0                    away
## 5 643446 Jeff Mc… 2B           5             0                    away
## 6 592192 Mark Ca… LF           6             0                    away
## 7 596129 Daniel … DH           7             0                    away
## 8 500871 Eduardo… 3B           8             0                    away
## 9 553882 Omar Na… C            9             0                    away
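Other functions take a game identifier in the same way. Who were the probable pitchers? What was the line score? The next two examples use a different game, 717774 (Tampa Bay Rays at Oakland Athletics on June 13, 2023):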
# Probable pitchers
mlb_probables(717774)

## # A tibble: 2 × 8
##   game_pk game_date  fullName              id team  team_id home_plate_full_name
##     <int> <chr>      <chr>              <int> <chr>   <int> <chr>
## 1  717774 2023-06-13 Jalen Beeks       656222 Tamp…     139 Quinn Wolcott
## 2  717774 2023-06-13 Shintaro Fujinami 660261 Oakl…     133 Quinn Wolcott

# Line score
game_linescore <- mlb_game_linescore(717774)
game_linescore[1:10, c(1, 3, 5:10, 12:14)]

## # A tibble: 10 × 11
##    game_pk home_team_name   away_team_name   num ordinal_num home_runs...
##      <dbl> <chr>            <chr>          <int> <chr>           <int>
##  1  717774 Oakland Athleti… Tampa Bay Rays     1 1st                 0
##  2  717774 Oakland Athleti… Tampa Bay Rays     2 2nd                 0
##  3  717774 Oakland Athleti… Tampa Bay Rays     3 3rd                 0
##  4  717774 Oakland Athleti… Tampa Bay Rays     4 4th                 0
##  5  717774 Oakland Athleti… Tampa Bay Rays     5 5th                 0
##  6  717774 Oakland Athleti… Tampa Bay Rays     6 6th                 0
##  7  717774 Oakland Athleti… Tampa Bay Rays     7 7th                 1
##  8  717774 Oakland Athleti… Tampa Bay Rays     8 8th                 1
##  9  717774 Oakland Athleti… Tampa Bay Rays     9 9th                NA
## 10      NA <NA>             <NA>              NA <NA>               NA
Iterating Over Multiple Games
You can extract and combine information from multiple games by using loops or apply functions. For example, the chunk below calls mlb_batting_orders() for each game identifier in mets_game_pks (the Mets’ games from March 30 to April 30, 2023) and keeps the Mets’ leadoff hitter from each game:
# Creating empty data set to which we will add values in the loop
leadoff_hitters <- data.frame()

for (i in mets_game_pks) {
  # Get batting order data for this iteration of the loop
  game_order <- mlb_batting_orders(i)

  # Isolate Mets orders and batting_order == 1
  game_leadoff <- subset(game_order, batting_order == 1 & teamID == 121)

  leadoff_hitters <- bind_rows(leadoff_hitters, game_leadoff)
}

# Show readers sample of rows and columns from resulting data table
leadoff_hitters[1:10, 2:3]

##         fullName abbreviation
## 1  Brandon Nimmo           CF
## 2  Brandon Nimmo           CF
## 3  Brandon Nimmo           CF
## 4     Tommy Pham           LF
## 5  Brandon Nimmo           CF
## 6  Brandon Nimmo           CF
## 7  Brandon Nimmo           CF
## 8  Brandon Nimmo           CF
## 9  Brandon Nimmo           CF
## 10 Brandon Nimmo           CF
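Once the results are combined, they can be summarised with base R. As a small sketch, the vector below copies the ten `fullName` values shown above and counts how often each player led off; the variable names are my own.

```r
# The ten fullName values from the output above, in order
leadoff_names <- c(rep("Brandon Nimmo", 3), "Tommy Pham", rep("Brandon Nimmo", 6))

# Count appearances per player, most frequent first
leadoff_counts <- sort(table(leadoff_names), decreasing = TRUE)
leadoff_counts
```

With the full data frame, `table(leadoff_hitters$fullName)` would give the same kind of tally across all of the games in the loop.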
Other Metrics
Pitch-by-Pitch Data
You can get a pitch log for an individual game using the game identifiers above. Here’s the top of the pitch log for game 717774:
game_pbp <- mlb_pbp(717774)

game_pbp[1:10, c(7, 9, 10, 103, 108, 109)]

## # A tibble: 10 × 6
##    type    pitchNumber details.description         pitchData.endSpeed
##    <chr>         <int> <chr>                                    <dbl>
##  1 pitch             3 Swinging Strike                           86.8
##  2 pitch             2 Foul                                      77.8
##  3 pitch             1 Called Strike                             78.4
##  4 action           NA Offensive Substitution:...                NA
##  5 pickoff          NA Pickoff Attempt 1B                        NA
##  6 pitch             6 Ball                                      80.5
##  7 pitch             5 Swinging Strike                           79.7
##  8 pitch             4 Ball                                      79.4
##  9 pitch             3 Ball                                      75.5
## 10 pitch             2 Ball                                      86
Spotrac Player Contracts
The package allows users to download payroll and salary data from Spotrac. Note that the API may only offer recent seasons’ data. For example:
# Team Payroll for the 2021 season:
sptrc_league_payrolls(year = 2021)

## # A tibble: 30 × 12
##    year  team              team_abbr win_percent roster active_man...
##    <chr> <chr>             <chr>           <dbl>  <dbl>      <dbl>
##  1 2021  Los Angeles Dodg… LAD             0.654     28  174661542
##  2 2021  New York Yankees  NYY             0.568     30  141518753
##  3 2021  New York Mets     NYM             0.475     29  176565754
##  4 2021  Philadelphia Phi… PHI             0.506     27  148014046
##  5 2021  Houston Astros    HOU             0.586     28  146877726
##  6 2021  Boston Red Sox    BOS             0.568     28  141452731
##  7 2021  Los Angeles Ange… LAA             0.475     29  133967088
##  8 2021  San Diego Padres  SD              0.488     28  125977584
##  9 2021  San Francisco Gi… SF              0.66      28  134386796
## 10 2021  Atlanta Braves    ATL             0.547     30  115354620
## # ℹ 20 more rows

# Blue Jays Payroll for 2022
sptrc_team_active_payroll("TOR", year = 2022)

## # A tibble: 44 × 16
##    year  team  player_name            age pos   status base_salary
##  1 2022  TOR   George Springer         32 CF    Vet       28000000
##  2 2022  TOR   Kevin Gausman           31 SP    Vet       21000000
##  3 2022  TOR   Yusei Kikuchi           31 RP    Vet       16000000
##  4 2022  TOR   Matt Chapman            29 3B    Vet       12000000
##  5 2022  TOR   Jose Berrios            28 SP    Vet       10000000
##  6 2022  TOR   Teoscar Hernandez       29 DH    Vet       10650000
##  7 2022  TOR   Vladimir Guerrero J…    23 1B    Arb 1…     7900000
##  8 2022  TOR   Yimi Garcia             31 RP    Vet        4000000
##  9 2022  TOR   Raimel Tapia            28 RF    Arb 2      3950000
## 10 2022  TOR   Ross Stripling          32 SP    Vet        3790000
## # ℹ 34 more rows
Park Factors
FanGraphs offers park factor estimates for each year. Here’s an example for 2018:
fg_park(2018)

## # A tibble: 30 × 16
##    season home_team basic_5yr `3yr` `1yr` single double triple    hr
##     <int> <chr>         <int> <int> <int>  <int>  <int>  <int> <int>
##  1   2018 Angels           99    99    99     99     97     88   104
##  2   2018 Orioles         100   102    99    100     96     90   106
##  3   2018 Red Sox         104   103   104    102    115    108    97
##  4   2018 White Sox        99    99    97     98     93     88   105
##  5   2018 Indians         103   101   106    100    104     87   102
##  6   2018 Tigers          102   103    97    101    100    129   101
##  7   2018 Royals          102   101   103    103    106    113    92
##  8   2018 Twins           101   101   100    101    104    105    99
##  9   2018 Yankees         100    99   106    100     93     86   107
## 10   2018 Athletics        96    97    92     98    102    102    93
## # ℹ 20 more rows