In our first assignment in DATA 333, you will create a piece of original baseball information.
Click below to download:
- The data for this exercise
- A PDF version of this instruction sheet
- The Markdown file used to generate this document
This document provides a walk-through of a simple, cursory data analysis in an applied context. The walk-through is designed to convey the general tasks involved in creating simple analytics-based informational products in an R-based workflow. Students are invited to follow along with what I do, and manipulate my code to get their own results.
Pre-Analysis
Scenario
We have been approached by the New York Mets to assess who are the best batters in Major League Baseball. They feel like they can win the league championship with one or two more elite batters, but they do not know whom to pursue in a trade or free agent signing
Project Conception
Although the task seems straightforward, the problem is that we do not have clear, uncontroversial answers about which player would be the best for the New York Mets. Hitters are good at different things, and we do not know what kind of hitter would be best for the Mets specifically. I think that our best strategy is to inform the Mets baseball professionals about who did a good job in different aspects of batting in order to support their decision-making.
Research Design
I propose that, as a team, each of us selects an indicator from 2023 seasonal batting data to get ideas on who did well at particular facets of hitting. We will look at leaderboards to determine who did best and worst among qualified hitters.
Data Wrangling
Data Acquisition
The data is stored on the Excel sheet provided in class. To import data from the first worksheet of this Excel workbook:
1 2 3 4 5 |
<em># Import data from Excel</em> data <- read_xlsx("MLB 2023 Batting Statistics.xlsx", sheet = 1) <em># Look at first few rows and columns</em> head(data, 5) |
1 2 3 4 5 6 7 8 9 10 |
## # A tibble: 5 × 18 ## Name Team G PA AVG R HR RBI OBP SLG SB K_pct BB_pct ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Rona… ATL 159 735 0.337 149 41 106 0.416 0.596 73 0.114 0.109 ## 2 Mook… LAD 152 693 0.307 126 39 107 0.408 0.579 14 0.154 0.139 ## 3 Fred… LAD 161 730 0.331 131 29 102 0.410 0.567 23 0.166 0.0986 ## 4 Matt… ATL 162 720 0.283 127 54 139 0.389 0.604 1 0.232 0.144 ## 5 Shoh… LAA 135 599 0.304 102 44 95 0.412 0.654 20 0.239 0.152 ## # ℹ 5 more variables: WPA <dbl>, WAR <dbl>, Earned <dbl>, PlayerId <dbl>, ## # MLBAMID <dbl> |
There are not major data cleaning issues for you to perform, as I pre-cleaned the data.
The data we will consider include:
- Name (Name): Player’s name
- Team (Team): Player’s team
- Games (G): Number of games in 2023 in which player appeared
- Plate Appearances (PA): Number of time in 2023 that player attempted an at bat
- Batting Average (AVG): Percent of times in which a plate appearance results in a hit
- Runs (R): Number of times that player cross home plate to score a point for team
- Home Runs (HR): Number of times player it it out of the part to score themselves and all players on base instantly.
- Runs Batted In (RBI): Number of runs scored due to player’s at bats
- On-Base Percentage (OBP): Percent of time that plate appearances result in player reaching base safely
- Slugging Average (SLG): Average number of bases that a player covers by hit, walk or some other means of hitting the ball.
- Stolen Bases (SB): Number of times player advance base by “stealing” base
- Strikeout Percentage (K_pct): Percent of plate appearances that result in strikouts
- Walk Percentage (BB_pct): Percent of plate appearances that result in walks.
- Win Probability Added (WPA): Player’s name
- Wins Above Replacement (WAR): Estimate of how many additional wins a team will receive by playing this player versus a low-level MLB player.
- Earnings (Earned): Estimates of money delivered to club by virtue of playing performance. In millions of dollars.
Analysis
So who was good at what? Let’s focus on the variable strikeout percentage. Strikeouts are bad because there is no possibilty of reaching base or advancing due to a defensive player error or fielder’s choice.
What counts as a good or bad score? Let’s look at the distribution of the statistic:
1 |
summary(data$K_pct) |
1 2 |
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.05511 0.17061 0.20976 0.20788 0.23853 0.32700 |
Let’s make ranked lists. This is how we get the top 20 performances in terms of strikeout percentage:
1 2 3 4 5 |
<em># Sort the data by Strikeout Percentage in ascending order to get those with the lowest K%</em> data <- data[order(data$K_pct),] <em># Select the top 20 players with the lowest strikeout percentages</em> head(data[,c("Name", "Team", "K_pct")], 20) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
## # A tibble: 20 × 3 ## Name Team K_pct ## <chr> <chr> <dbl> ## 1 Luis Arraez MIA 0.0551 ## 2 Jeff McNeil NYM 0.100 ## 3 Keibert Ruiz WSN 0.103 ## 4 Steven Kwan CLE 0.104 ## 5 José Ramírez CLE 0.106 ## 6 Ronald Acuña Jr. ATL 0.114 ## 7 Alex Bregman HOU 0.120 ## 8 Nico Hoerner CHC 0.121 ## 9 Kyle Tucker HOU 0.136 ## 10 Masataka Yoshida BOS 0.140 ## 11 Andrew Benintendi CHW 0.143 ## 12 Gleyber Torres NYY 0.146 ## 13 Marcus Semien TEX 0.146 ## 14 Vladimir Guerrero Jr. TOR 0.147 ## 15 Adley Rutschman BAL 0.147 ## 16 Alec Bohm PHI 0.154 ## 17 Mookie Betts LAD 0.154 ## 18 Alex Verdugo BOS 0.154 ## 19 Dominic Smith WSN 0.155 ## 20 Mark Canha - - - 0.156 |
Here are the worst performers:
1 2 3 4 5 |
<em># Sort the data by Strikeout Percentage in ascending order to get those with the lowest K%</em> data <- data[order(-data$K_pct),] <em># Select the top 20 players with the lowest strikeout percentages</em> head(data[,c("Name", "Team", "K_pct")], 20) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
## # A tibble: 20 × 3 ## Name Team K_pct ## <chr> <chr> <dbl> ## 1 Brent Rooker OAK 0.327 ## 2 Jack Suwinski PIT 0.322 ## 3 James Outman LAD 0.319 ## 4 Ryan McMahon COL 0.316 ## 5 Teoscar Hernández SEA 0.311 ## 6 Eugenio Suárez SEA 0.308 ## 7 Kyle Schwarber PHI 0.299 ## 8 Josh Jung TEX 0.293 ## 9 Luis Robert Jr. CHW 0.289 ## 10 Matt Chapman TOR 0.284 ## 11 MJ Melendez KCR 0.282 ## 12 J.D. Davis SFG 0.278 ## 13 Anthony Volpe NYY 0.278 ## 14 Cal Raleigh SEA 0.278 ## 15 Trent Grisham SDP 0.277 ## 16 Adolis García TEX 0.277 ## 17 Jake Burger - - - 0.276 ## 18 Nick Castellanos PHI 0.276 ## 19 Ezequiel Tovar COL 0.270 ## 20 Max Muncy LAD 0.264 |
To get Max Muncy’s information
1 |
data[data$Name == "Max Muncy", ] |
1 2 3 4 5 6 |
## # A tibble: 1 × 18 ## Name Team G PA AVG R HR RBI OBP SLG SB K_pct BB_pct ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Max … LAD 135 579 0.212 95 36 105 0.333 0.475 1 0.264 0.147 ## # ℹ 5 more variables: WPA <dbl>, WAR <dbl>, Earned <dbl>, PlayerId <dbl>, ## # MLBAMID <dbl> |
To only look at some of Max Muncy’s data:
1 |
data[data$Name == "Max Muncy", c("Name", "Team", "AVG", "HR", "K_pct")] |
1 2 3 4 |
## # A tibble: 1 × 5 ## Name Team AVG HR K_pct ## <chr> <chr> <dbl> <dbl> <dbl> ## 1 Max Muncy LAD 0.212 36 0.264 |
To look at all of the Mets:
1 |
subset(data, Team == "NYM") |
1 2 3 4 5 6 7 8 9 |
## # A tibble: 4 × 18 ## Name Team G PA AVG R HR RBI OBP SLG SB K_pct BB_pct ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Pete… NYM 154 658 0.217 92 46 118 0.318 0.504 4 0.229 0.0988 ## 2 Bran… NYM 152 682 0.274 89 24 68 0.363 0.466 3 0.214 0.109 ## 3 Fran… NYM 160 687 0.254 108 31 98 0.336 0.470 31 0.199 0.0961 ## 4 Jeff… NYM 156 648 0.270 75 10 55 0.333 0.378 10 0.100 0.0602 ## # ℹ 5 more variables: WPA <dbl>, WAR <dbl>, Earned <dbl>, PlayerId <dbl>, ## # MLBAMID <dbl> |
To look at players with K_pct that are below 15%
1 |
subset(data, K_pct < 0.13) |
1 2 3 4 5 6 7 8 9 10 11 12 13 |
## # A tibble: 8 × 18 ## Name Team G PA AVG R HR RBI OBP SLG SB K_pct ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Nico Hoern… CHC 150 688 0.283 98 9 68 0.346 0.383 43 0.121 ## 2 Alex Bregm… HOU 161 724 0.262 103 25 98 0.363 0.441 3 0.120 ## 3 Ronald Acu… ATL 159 735 0.337 149 41 106 0.416 0.596 73 0.114 ## 4 José Ramír… CLE 156 691 0.282 87 24 80 0.356 0.475 28 0.106 ## 5 Steven Kwan CLE 158 718 0.268 93 5 54 0.340 0.370 21 0.104 ## 6 Keibert Ru… WSN 136 562 0.260 55 18 67 0.308 0.409 1 0.103 ## 7 Jeff McNeil NYM 156 648 0.270 75 10 55 0.333 0.378 10 0.100 ## 8 Luis Arraez MIA 147 617 0.354 71 10 69 0.393 0.469 3 0.0551 ## # ℹ 6 more variables: BB_pct <dbl>, WPA <dbl>, WAR <dbl>, Earned <dbl>, ## # PlayerId <dbl>, MLBAMID <dbl> |
Statistics is Only Part of the Job
Getting R to process and report results is only part of the job. Ultimately, our task is to inform someone else’s decision-making process, and the value of your work is partly tied to its success in strengthening decision-makers. This means that how you interpret and communicate your statistical findings is important. When reporting your findings, be sure to:
- Organize your report around key findings and their relevance to the decision that your project serves. Were you in the position of your client, which findings would strike you as relevant to the decision at hand, and why?
- Show your relevant statistical findings. You do not need to show everything. Only show what you think will be useful to the client.
- Explain how to read and interpret your statistical reports. Assume your reader is intelligent but not versed in statistics. Give plain-language instructions on how to read and interpret your tables, graphs, and other findings.
- Deliver polished, professional, and aesthetically attractive reports, but do not make them ornate. Aim for clean and professional, not decorated.
Functions: A Preview
A function is like an R command that you program. It is a routine of commands that perform a practical operation using a few points of user input, like the name of the data object or a variable.
Here is a function to get a specific player’s percentile score in strikeout percentage:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
get_strikeout_percentile <- <strong>function</strong>(data, playerName, teamName) { <em># Filter data for the specific player</em> player_data <- subset(data, Name == playerName & Team == teamName) <em># Check if player data exists</em> <strong>if</strong>(nrow(player_data) == 0) { <strong>return</strong>(paste("No data found for", playerName, "in", teamName)) } <em># Calculate the player's strikeout percentage rank among all players</em> player_rank <- sum(data$K_pct < player_data$K_pct) + 1 <em># Calculate the percentile</em> player_percentile <- (player_rank / nrow(data)) * 100 <em># Return the player's percentile score</em> <strong>return</strong>(paste(playerName, "from", teamName, "is in the", round(player_percentile, 2), "percentile for strikeout percentage.")) } get_strikeout_percentile(data, "Max Muncy", "LAD") |
1 |
## [1] "Max Muncy from LAD is in the 85.71 percentile for strikeout percentage." |
So let’s get started figuring out who is good or bad to give our clients names to consider.