Joseph Nathan Cohen

Department of Sociology, CUNY Queens College, New York, NY

Guidance: Creating Baseball Information

Create baseball information using data analysis.

In our first assignment in DATA 333, you will create a piece of original baseball information.


This document provides a walk-through of a simple, cursory data analysis in an applied context. The walk-through is designed to convey the general tasks involved in creating simple analytics-based informational products in an R-based workflow. Students are invited to follow along with what I do, and manipulate my code to get their own results.



We have been approached by the New York Mets to assess who the best batters in Major League Baseball are. They believe they can win the league championship with one or two more elite batters, but they do not know whom to pursue in a trade or free-agent signing.

Project Conception

Although the task seems straightforward, the problem is that we do not have clear, uncontroversial answers about which player would be the best for the New York Mets. Hitters are good at different things, and we do not know what kind of hitter would be best for the Mets specifically. I think that our best strategy is to inform the Mets baseball professionals about who did a good job in different aspects of batting in order to support their decision-making.

Research Design

I propose that, as a team, each of us selects an indicator from 2023 seasonal batting data to get ideas on who did well at particular facets of hitting. We will look at leaderboards to determine who did best and worst among qualified hitters.

Data Wrangling

Data Acquisition

The data is stored in the Excel workbook provided in class. To import data from the first worksheet of this workbook:
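A minimal sketch of the import step, assuming the `readxl` package is installed. The file name `batting_2023.xlsx` and the object name `batting` are placeholders, not the names used in class; substitute the workbook you were given.

```r
# Hypothetical file name: replace with the path to the class workbook.
path <- "batting_2023.xlsx"

if (requireNamespace("readxl", quietly = TRUE) && file.exists(path)) {
  # sheet = 1 reads the first worksheet of the workbook
  batting <- readxl::read_excel(path, sheet = 1)
  str(batting)  # confirm column names and types before analyzing
} else {
  message("Install readxl and point `path` at the class workbook to run this.")
}
```

The guard only keeps the snippet from failing when the file is missing. Once your path is correct, the `read_excel()` line is the part that matters.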

There are no major data cleaning issues for you to address, as I pre-cleaned the data.

The data we will consider include:

  • Name (Name): Player’s name
  • Team (Team): Player’s team
  • Games (G): Number of games in 2023 in which player appeared
  • Plate Appearances (PA): Number of times in 2023 that the player came up to bat
  • Batting Average (AVG): Share of at bats that result in a hit. At bats are plate appearances excluding outcomes such as walks, hit-by-pitches, and sacrifices
  • Runs (R): Number of times that the player crossed home plate to score a run for the team
  • Home Runs (HR): Number of times the player hit the ball out of the park, instantly scoring themselves and all players on base.
  • Runs Batted In (RBI): Number of runs scored due to player’s at bats
  • On-Base Percentage (OBP): Percent of time that plate appearances result in player reaching base safely
  • Slugging Average (SLG): Average number of bases gained on hits per at bat (total bases divided by at bats). Walks do not count.
  • Stolen Bases (SB): Number of times the player advanced a base by “stealing” it
  • Strikeout Percentage (K_pct): Percent of plate appearances that result in strikeouts
  • Walk Percentage (BB_pct): Percent of plate appearances that result in walks.
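  • Win Probability Added (WPA): Net change in the team’s win probability attributed to the player’s plate appearances, summed over the season
  • Wins Above Replacement (WAR): Estimate of how many additional wins a team gains by playing this player instead of a replacement-level (low-level MLB) player.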
  • Earnings (Earned): Estimates of money delivered to club by virtue of playing performance. In millions of dollars.


So who was good at what? Let’s focus on the variable strikeout percentage. Strikeouts are bad because the ball is not put in play, so there is virtually no possibility of reaching base or advancing runners through a defensive error or fielder’s choice.

What counts as a good or bad score? Let’s look at the distribution of the statistic:
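One way to inspect the distribution in base R is sketched below. The `batting` data frame here is a toy stand-in with invented values, not the class data, and it assumes `K_pct` is stored as a proportion (0.15 rather than 15).

```r
# Toy stand-in for the class data; every value is invented for illustration.
batting <- data.frame(
  Name  = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  K_pct = c(0.12, 0.27, 0.21, 0.09, 0.31)
)

# Minimum, quartiles, mean, and maximum
summary(batting$K_pct)

# Selected percentiles as benchmarks for "good" and "bad"
k_benchmarks <- quantile(batting$K_pct, probs = c(0.10, 0.25, 0.50, 0.75, 0.90))
k_benchmarks

# Histogram of the statistic
hist(batting$K_pct, main = "Strikeout percentage", xlab = "K_pct")
```

Because strikeouts are bad, low values are the good end of this distribution.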

Let’s make ranked lists. This is how we get the top 20 performances in terms of strikeout percentage:
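A base R sketch of the ranking, using a toy `batting` data frame with invented values in place of the class data. Since a low strikeout percentage is the good outcome, the best performers are the rows with the smallest `K_pct`.

```r
# Toy stand-in for the class data; every value is invented for illustration.
batting <- data.frame(
  Name  = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  Team  = c("NYM", "LAD", "NYM", "ATL", "SDP"),
  K_pct = c(0.12, 0.27, 0.21, 0.09, 0.31)
)

# Sort ascending on K_pct and keep only the columns we want to report
best_k <- batting[order(batting$K_pct), c("Name", "Team", "K_pct")]

# First 20 rows (fewer if the data has fewer than 20 players)
head(best_k, 20)
```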

Here are the worst performers:
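The worst performers can be listed by sorting in the opposite direction. As a sketch, with a toy `batting` data frame whose values are invented:

```r
# Toy stand-in for the class data; every value is invented for illustration.
batting <- data.frame(
  Name  = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  Team  = c("NYM", "LAD", "NYM", "ATL", "SDP"),
  K_pct = c(0.12, 0.27, 0.21, 0.09, 0.31)
)

# decreasing = TRUE puts the highest strikeout percentages first
worst_k <- batting[order(batting$K_pct, decreasing = TRUE),
                   c("Name", "Team", "K_pct")]
head(worst_k, 20)
```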

To get Max Muncy’s information:
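Pulling one player’s row is a matter of filtering on `Name`. In this sketch the data frame is a toy stand-in, and the numbers next to Max Muncy’s name are placeholders, not his real 2023 statistics.

```r
# Toy stand-in for the class data; all numbers are invented placeholders.
batting <- data.frame(
  Name  = c("Max Muncy", "Player B", "Player C"),
  Team  = c("LAD", "NYM", "ATL"),
  HR    = c(10, 20, 30),
  K_pct = c(0.20, 0.25, 0.15)
)

# Keep the rows where Name matches exactly; the empty slot after the
# comma means "all columns"
muncy <- batting[batting$Name == "Max Muncy", ]
muncy
```

The match is exact, so spelling and capitalization must agree with the data.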

To only look at some of Max Muncy’s data:
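To limit the output to a few variables, name the columns after the comma. The choice of `HR` and `K_pct` below is arbitrary, and the data frame is a toy stand-in with placeholder numbers rather than real statistics.

```r
# Toy stand-in for the class data; all numbers are invented placeholders.
batting <- data.frame(
  Name  = c("Max Muncy", "Player B", "Player C"),
  Team  = c("LAD", "NYM", "ATL"),
  HR    = c(10, 20, 30),
  K_pct = c(0.20, 0.25, 0.15)
)

# Rows: Max Muncy only. Columns: just the ones listed.
muncy_some <- batting[batting$Name == "Max Muncy", c("Name", "HR", "K_pct")]
muncy_some
```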

To look at all of the Mets:
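The same filtering pattern works on `Team`. This sketch assumes the Mets are coded as `"NYM"`, which may differ from the class data; check with `unique(batting$Team)`. The data frame is a toy stand-in with invented values.

```r
# Toy stand-in for the class data; every value is invented for illustration.
batting <- data.frame(
  Name  = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  Team  = c("NYM", "LAD", "NYM", "ATL", "SDP"),
  K_pct = c(0.12, 0.27, 0.21, 0.09, 0.31)
)

# "NYM" is an assumed team code; confirm it against unique(batting$Team)
mets <- batting[batting$Team == "NYM", ]
mets
```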

To look at players with K_pct below 15%:
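Filtering on a numeric threshold follows the same pattern. This sketch assumes `K_pct` is stored as a proportion, so 15% is written `0.15`; if your data stores it as a percentage, use `15` instead. The data frame is a toy stand-in with invented values.

```r
# Toy stand-in for the class data; every value is invented for illustration.
batting <- data.frame(
  Name  = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  Team  = c("NYM", "LAD", "NYM", "ATL", "SDP"),
  K_pct = c(0.12, 0.27, 0.21, 0.09, 0.31)
)

# Assumes proportions: 0.15 means 15%
low_k <- batting[batting$K_pct < 0.15, ]
low_k
```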

Statistics is Only Part of the Job

Getting R to process and report results is only part of the job. Ultimately, our task is to inform someone else’s decision-making process, and the value of your work is partly tied to its success in strengthening decision-makers. This means that how you interpret and communicate your statistical findings is important. When reporting your findings, be sure to:

  • Organize your report around key findings and their relevance to the decision that your project serves. Were you in the position of your client, which findings would strike you as relevant to the decision at hand, and why?
  • Show your relevant statistical findings. You do not need to show everything. Only show what you think will be useful to the client.
  • Explain how to read and interpret your statistical reports. Assume your reader is intelligent but not versed in statistics. Give plain-language instructions on how to read and interpret your tables, graphs, and other findings.
  • Deliver polished, professional, and aesthetically attractive reports, but do not make them ornate. Aim for clean and professional, not decorated.

Functions: A Preview

A function is like an R command that you program yourself. It is a routine of commands that performs a practical operation using a few points of user input, like the name of the data object or a variable.

Here is a function to get a specific player’s percentile score in strikeout percentage:
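One possible version is sketched below; it is not necessarily the function used in class. The name `k_percentile()` and its arguments are my own choices, and the data frame is a toy stand-in with invented values. Percentile is defined here as the share of hitters whose `K_pct` is at or below the player’s.

```r
# Toy stand-in for the class data; every value is invented for illustration.
batting <- data.frame(
  Name  = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  K_pct = c(0.12, 0.27, 0.21, 0.09, 0.31)
)

# Hypothetical helper: percentile rank of one player's strikeout percentage
k_percentile <- function(data, player) {
  player_k <- data$K_pct[data$Name == player]
  if (length(player_k) != 1) {
    stop("Expected exactly one row for ", player)
  }
  # Percent of hitters with K_pct at or below this player's value
  100 * mean(data$K_pct <= player_k)
}

k_percentile(batting, "Player A")
```

Because strikeouts are bad, a low percentile is the desirable result here. The function stops with an error if the name is missing or duplicated, rather than silently returning a misleading number.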

So let’s get started figuring out who is good or bad to give our clients names to consider.
