When you learn how to pick teams for Deadball in R, you'll have access to tools to learn about baseball using the language R. This article will use the Lahman database (package for R) to select teams randomly.
What the Heck Is Deadball?
Deadball is a baseball simulation game. The rules are described in a book of the same name created by W.M. Akers. He created the game to fill the dead time after the baseball season is over. Yes, you have to be a diehard baseball addict to understand this!
Deadball also refers to the period in baseball before Babe Ruth and other power hitters. During this period, players didn’t hit many home runs. They relied on strategy instead.
For the most part, we’ll stick with the first definition as it’s more appropriate.
You can see how the game is played from the following video. It shows the rules for basic play.
The game gives you a choice to play with any players you want, made up or real players. The book shows you how to generate fake players (there are online tools for this).
Real players will likely add a sense of authenticity because of the way the rules are defined: the players’ stats are part of the play. That’s one reason why I am writing this tutorial, to give you a way to generate real players programmatically. It will save you time.
You can certainly choose real players manually. This would require you to determine some way to choose these players. Then, you’d have to look up their stats.
NOTE: If you don't care about the coding and just want to generate the teams, you can use the following link:
https://datasciencereview.com/run-r-online/
It contains the code to generate teams for 2018. The Lahman database was not updated at the time of this writing for 2019.
The Motivation Behind this Tutorial
This tutorial shows you how to program a computer using the R language to pick the teams for you.
I could just give you the code and be done with it. But since this website is about data science, I felt that going through how I created the code, as well as any challenges that emerged, would help budding data scientists learn the issues that can arise. Besides, it’s not yet baseball season, so you can use the downtime to play.
This tutorial assumes preliminary knowledge of R programming. If you have solid experience, you could come up with the code on your own, but you can still take this code and use it as is or tweak it for your own purposes.
For an excellent (and free) tutorial on how to use R with baseball data, see the following:
https://www.udemy.com/course/baseball1/
(It was free at the time of this writing)
For those who just want the code without all the explanation, here you go:
# Step 1 - Install the Lahman package - leave this commented out once it is installed.
#install.packages("Lahman")
# Step 2 - load the library
library(Lahman)
year <- 2018
# Step 3 - ensure that the year is at least 2018
# max(Batting$yearID)
# Step 4 - View the Batting table
#View(Batting)
# Step 5 - create the filter for year (Batting and Pitching)
battingSubset <- Batting[Batting$yearID == year, ]
pitchingSubset <- Pitching[Pitching$yearID == year, ]
# Step 5a - View the new battingSubset table
#View(battingSubset) # feel free to do this with pitchingSubset too
# Step 6 - Add the batting average (BA) column
battingSubset$BA <- round(battingSubset$H / battingSubset$AB * 100, 0)
# Step 7 - Filter out the NaN (see blog text)
battingSubset <- battingSubset[battingSubset$AB > 0, ]
# Step 8 - filter out players who have not played enough games (see blog text)
battingSubset <- battingSubset[battingSubset$G > 75, ]
# Step 9 - select the players at random and split into two teams
playerPicks <- sample(1:nrow(battingSubset), 18, replace=FALSE)
homeTeam <- battingSubset[playerPicks[1:9], ]
awayTeam <- battingSubset[playerPicks[10:18], ]
# Step 10 - add player name to home and away teams
homeIndex <- match(homeTeam$playerID, People$playerID)
awayIndex <- match(awayTeam$playerID, People$playerID)
homeTeam$playerName <- paste(People[homeIndex, "nameFirst"], People[homeIndex, "nameLast"])
awayTeam$playerName <- paste(People[awayIndex, "nameFirst"], People[awayIndex, "nameLast"])
homeTeam[, c("playerName", "BA")]
awayTeam[, c("playerName", "BA")]
Cleaning Up Your Data
The tutorial can help you understand the challenges associated with data. Most data sources are not given in the format that can help with your analysis. The data either contains errors, or there are too many exceptions that would require you to account for them, or both.
Some people choose to do a minimum of cleaning on their data. That’s okay, but it usually adds to your challenges when analyzing your data or using it. In most cases, it pays to spend some time wrangling your data. It makes the end result, and future uses of the data, that much easier.
Why I Chose the R Language
The main reason I chose the R language for this task is because an R library exists for baseball. Sports reporter Sean Lahman updates his website with the latest baseball stats every year. The 2019 season was not available at the time of this writing, but once you have the functions set up (from this tutorial) you can apply them to the 2019 data when it’s ready.
The Lahman database sometimes changes structure. When that happens, you’ll need to adjust your code.
How This Works with Deadball
The basic play for Deadball has two requirements for data. The first is that you have batting averages for players, and the second is that you have ERA for pitchers. To keep the game simple there is no provision for fielding. Therefore, you won’t need to consider a player’s fielding stats.
I’ll show you how to use R to compose nine players for each team. You can choose players from individual teams in a particular year, or you can simply choose 9 players randomly. This can be by year or throughout most of baseball’s history. Your choice!
In this tutorial, I will show you how to choose random players for the 2018 season. I have 2019 data from another source, but decided against using it as it is not included as a library for R. It would take too much explanation on how to load up this data. When you learn how to pick teams for 2018, you can use the functions for 2019 and future (or even past) years.
For those readers who know about GitHub, you’ll find the data there. You will have to load each table, though.
Steps for Choosing Teams
The overall concept is to filter and clean baseball data so that you can randomly select 18 players. The first nine are the home players and the second nine are the away players. You can switch them if you’d like. Then, you replace the ninth player and the eighteenth with randomly selected pitchers and obtain their pitching and batting stats.
The basic game does not consider a designated hitter, although you could combine the DH and pitching stats for the purposes of the game. Or, you can simply choose pitchers from the National League.
The choice used in this tutorial is to select nine batters per team. Then, I’ll have a separate object for the pitcher. The pitcher object will exist solely for the calculations that deal with ERA. You can choose to add the pitchers as the ninth and eighteenth players if you choose, assuming they have batting stats.
I use RStudio for this tutorial. If you don’t, that’s okay. There aren’t many (if any) aspects of the solution that require RStudio functionality. But you will need to know how to add packages in the environment you’re using.
Let’s Get Started!
Step 1 – Install the Lahman library
install.packages("Lahman")
Step 2 – Load the library
library(Lahman)
Step 3 – Check to see that 2018 exists
max(Batting$yearID)
The result will be the most recent year. If it shows 2019, that’s even better. But it should be at least 2018. If it’s less than 2018, update your Lahman database. You’re welcome to use earlier years, too.
Step 4 – Use the View() command to see the Batting table
View(Batting)
If you haven’t seen the Lahman tables before, the data may look a bit cryptic. You can find a listing of each one at the following resource:
https://rdrr.io/cran/Lahman/man/Batting.html
For this tutorial, we’ll deal with the fields we need. You’ll need to know about the playerID, yearID, G, AB, H.
The playerID and yearID should be obvious, although the playerID may seem a bit cryptic. For most players, it is the first five letters of the last name, plus the first two letters of the first name, plus a number to distinguish between players who would otherwise share an ID.
For instance, there are several players in baseball’s history with the name Bill Smith (some are Billy Smith). Each of these players has the playerID starting with “smithbi”. The number system is used to separate these players. smithbi01, smithbi02, etc.
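As a rough sketch of that naming pattern, the snippet below builds the letter part of an ID from a name. The helper idPrefix is hypothetical (it is not part of the Lahman package), and it assumes the pattern is the first five letters of the last name followed by the first two letters of the first name. That matches the Bill Smith example, but it may not hold for every player.

```r
# Hypothetical helper: NOT part of the Lahman package.
# Assumption: the ID starts with the first 5 letters of the last name
# plus the first 2 letters of the first name, all in lower case.
idPrefix <- function(first, last) {
  tolower(paste0(substr(last, 1, 5), substr(first, 1, 2)))
}

prefix <- idPrefix("Bill", "Smith")
prefix  # "smithbi"

# The trailing two-digit number separates players who share a prefix.
ids <- paste0(prefix, sprintf("%02d", 1:2))
ids     # "smithbi01" "smithbi02"
```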
For players that don’t share their names, there is still a 01 appended to the playerID. Let’s see if you can guess the following player based on these rules:
ruthba01
Hint: it’s one of baseball’s all-time great players!
The yearID is the season (year) the player played. The other three fields should be obvious, too, but here are the definitions to remove any doubt:
G – The number of games played for the year
AB – The number of at bats for the year
H – The number of hits for the year.
What about the average?
Deadball requires the batting averages of players. Yet, it does not seem to appear in the Batting table. You are correct to wonder about this. It doesn’t appear anywhere in Lahman tables.
Have no fear! R makes it excruciatingly easy to add columns to an object.
However, before we do that, let’s apply our first filter, the year.
Step 5 – Filter the Batting and Pitching tables by the year 2018
Let’s create two objects derived from the Batting and Pitching tables.
batting2018 <- Batting[Batting$yearID == 2018, ]
pitching2018 <- Pitching[Pitching$yearID == 2018, ]
If this coding looks strange to you, then you aren’t as familiar with R coding as you’ll need for this tutorial. My suggestion is to put aside this tutorial and find a decent (and free) tutorial on R programming. See the Resources section for some possible courses to take.
Feel free to use the code as shown and run it in R. I have tested the code and know that it works. The two lines of code above subset the Batting and Pitching tables into records for the year 2018.
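If the bracket syntax is new to you, here is a minimal sketch of the same filtering idea on a tiny made-up table. The table toyBatting and its values are invented for illustration, so you can run this without the Lahman package.

```r
# Toy stand-in for the Batting table (values are invented for illustration).
toyBatting <- data.frame(
  playerID = c("aaa01", "bbb01", "ccc01", "ddd01"),
  yearID   = c(2017, 2018, 2018, 2016),
  stringsAsFactors = FALSE
)

# The comparison returns TRUE or FALSE for every row.
keep <- toyBatting$yearID == 2018
keep

# Rows marked TRUE are kept. The empty slot after the comma means "all columns".
toy2018 <- toyBatting[keep, ]
toy2018
```

The real code does the same thing in one line by putting the comparison directly inside the brackets.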
Step 5a (optional but recommended) – Run the View command on the new batting2018 table.
View(batting2018)
I like to refer to the table when deciding on how to cleanse the data. It shows a portion of the data in the view and you can scroll down to see more. It can help you define what aspects to cleanse.
Step 6 – Add the batting average to batting2018
I mentioned before that adding columns in R is wickedly easy. This is because R supports a concept known as elementwise (vectorized) processing. What that means is that a single instruction is applied to every row of a column at once. We’ll add the batting average (BA), which is the number of hits (H) divided by the number of at bats (AB).
Note: because we subdivided the data into our desired year, we don’t have to factor the year into the calculation.
batting2018$BA <- round(batting2018$H / batting2018$AB * 100, 0)
That is all you need to do to add the batting average (BA) to all the rows.
In programming languages that don’t support elementwise transformations, you would need to set up loops and arrays or some other container construct. R makes this very easy, indeed. One line of code!
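To make the comparison concrete, here is a small sketch that computes BA both ways on a made-up table. The data frame toy and its numbers are invented. The loop is shown only to illustrate what the one-line version saves you from writing.

```r
# Invented hits and at bats for three players.
toy <- data.frame(H = c(30, 50, 12), AB = c(100, 200, 60))

# Elementwise version: one instruction covers every row.
toy$BA <- round(toy$H / toy$AB * 100, 0)

# Loop version, shown only for comparison.
baLoop <- numeric(nrow(toy))
for (i in seq_len(nrow(toy))) {
  baLoop[i] <- round(toy$H[i] / toy$AB[i] * 100, 0)
}

toy$BA   # 30 25 20
baLoop   # same values
```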
Why did I multiply by 100 and round it to the nearest whole number? It will make it easier when you play the game of Deadball. It coincides with the batting average rules. You are more than welcome to leave it raw and then do the multiplying and rounding later.
For each step, reload the table being transformed by reapplying the View() command as follows:
View(batting2018)
Step 7 – Remove the NaN from BA field
Here is where the data scientist in you should emerge, although it isn’t necessary for picking teams. Try to figure out why there is strange information in your data. This insight can lead you to better decisions when cleansing.
We could just remove any row that contains NaN in the BA column. NaN means Not a Number. In R, it is the result of dividing 0 by 0. The components that make up BA are Hits (H) and At Bats (AB). NaN occurred because AB was 0 (and therefore H was 0, too) when we divided.
Extra analysis – it is possible to have H = 0 while AB > 0. This would mean the batter never got any hits all season but did get up to the plate. It’s hard to imagine though, that there would ever be H > 0 and AB = 0. This would be something you’d want to investigate as it suggests the player had hits but no at bats. This doesn’t seem like a plausible scenario. These are the kinds of questions you should ask yourself whenever you do data science analysis.
Since we know that this NaN occurs whenever AB = 0, we’ll filter accordingly:
batting2018 <- batting2018[batting2018$AB > 0, ]
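Here is a minimal sketch of where the NaN comes from and how the AB filter removes it. The table toy is invented for illustration.

```r
# Invented rows: the first player never came to bat.
toy <- data.frame(
  playerID = c("aaa01", "bbb01", "ccc01"),
  H  = c(0, 0, 25),
  AB = c(0, 10, 100),
  stringsAsFactors = FALSE
)

toy$BA <- round(toy$H / toy$AB * 100, 0)
toy$BA          # NaN 0 25 (0 / 0 is NaN in R)
is.nan(toy$BA)  # TRUE FALSE FALSE

# Keeping only rows with at least one at bat drops the NaN row.
toyClean <- toy[toy$AB > 0, ]
toyClean
```

Note that the second player keeps a legitimate BA of 0: at bats but no hits.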
Step 8 – Houston, we have a problem!
When you scan the data in the View() command, you’ll notice that there are several records where BA = 0. This is actually not the problem. The problem is that there are also records where BA = 100. What this means is that players are batting 1.000, which is nearly impossible over a full season.
The numbers are not wrong, though. Think about it this way. If you played one game, had one at bat, and got a hit, your batting average would be 1.000, too (which translates to BA = 100).
Beware the Survivorship Bias
We can easily clobber these records, and it will be the path I take in this tutorial. However, we have to be careful here. Do we want to choose the best players to play or do we want player selection to be truly random?
In statistics, there is something called survivorship bias. Essentially, it is the bias that arises when we look only at the records that made it through some selection process and ignore the ones that didn’t.
Investment managers like to use survivorship bias to show they are doing better than they really are. For instance, suppose a money manager had a company in his portfolio that went bankrupt ten years ago. It took a hit on the portfolio. The portfolio manager decides to report only on the last five years which won’t contain the bankrupt statistics. The manager’s stats will be elevated as a result.
The same can be said with choosing our teams. We can filter out until we are only left with the best players.
This may be a viable strategy for you, too. It’s just something you need to know exists.
With our situation, we’ll filter out any player who hasn’t played more than 75 games. Why 75? Why not? You can choose a cutoff point that feels right to you.
I chose to use Games (G) as a filter because it is less subject to survivorship bias than filtering on BA directly. The more games you play, the less likely it is that you end the season without a hit. Players who don’t get hits after many games will likely be benched. Of course, once again, we are survivorship-biasing away these players. Make sense?
batting2018 <- batting2018[batting2018$G > 75, ]
Since we analyzed the reasons why the data had problems, we came up with a solution that took care of the hits being a low number and the BA being too high a number. The one command did it all for us.
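The effect of the games filter can be sketched on a made-up table. The numbers below are invented. They only illustrate how players with very few games produce the extreme averages, and how filtering on G removes both extremes at once.

```r
# Invented rows: two players with a handful of games, two regulars.
toy <- data.frame(
  playerID = c("aaa01", "bbb01", "ccc01", "ddd01"),
  G  = c(1, 3, 120, 150),
  AB = c(1, 4, 400, 550),
  H  = c(1, 0, 100, 165),
  stringsAsFactors = FALSE
)
toy$BA <- round(toy$H / toy$AB * 100, 0)
toy$BA  # 100 0 25 30

# One filter on games played removes both the BA = 100 and the BA = 0 rows.
regulars <- toy[toy$G > 75, ]
range(regulars$BA)  # 25 30
```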
We could have filtered for BA = 0 and BA = 100. But what would happen if BA = 99 or BA = 98 or even BA = 65? BA = 45 would be an exceptional hitter in baseball!
We could also have simply filtered out BA > 50, and maybe that would take care of the problem. But what about the lower end? How low do we go there?
If you are wondering why the BA is mostly two digits, it has to do with the multiplying by 100 and rounding to the nearest integer. Feel free to create another column to see the number without these transformations applied. In our case, a BA of 35 equates to a batting average of roughly three hundred fifty: because of the rounding, it covers anything from about .345 to .354.
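A quick sketch of that extra column, using invented numbers. The column name BAraw is my own choice for illustration. Because of the rounding, slightly different raw averages can end up with the same BA.

```r
# Invented hits and at bats.
toy <- data.frame(H = c(174, 176, 125), AB = c(500, 500, 500))

toy$BAraw <- round(toy$H / toy$AB, 3)         # traditional three-decimal average
toy$BA    <- round(toy$H / toy$AB * 100, 0)   # the two-digit value used in this tutorial

toy
# BAraw: 0.348 0.352 0.250
# BA:    35    35    25
```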
A little analysis helped us to remove the outliers without having to create a compound conditional statement. You won’t always have this luxury, but when you find advantages due to analysis, take them!
This situation illustrates that knowing something about the data is important for data scientists. For instance, suppose you didn’t know much about the game of baseball. How would you know what makes a valid range for batting average?
Baseball fans know that anything over say, .270 is pretty good. Some of the top superstars can reach into the .300s and the mega stars will approach .400. Anything above that and people start to get suspicious (like players juicing, etc.)
A batting average of .250 is okay, but not wonderful. And, anything lower than, say, .220 probably is going to land the player on the bench, followed by the minor leagues.
You should run a summary() command for your data, which can tell you the valid ranges just discussed. The summary for our batting2018 data suggests pretty much what I mentioned before. The mean and the median are around 25 (.250). The 1st quartile is 23 (.230) and the 3rd quartile is 27 (.270). But someone who isn’t familiar with baseball may not even look at the valid ranges of batting average.
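Here is a small sketch of what summary() gives you, run on an invented vector of BA values rather than the real batting2018 data. On your own data you would pass batting2018$BA instead.

```r
# Invented BA values for illustration only.
ba <- c(21, 23, 24, 25, 25, 26, 27, 29, 31)

baSummary <- summary(ba)
baSummary                 # Min., 1st Qu., Median, Mean, 3rd Qu., Max.

baSummary[["Median"]]     # 25
quantile(ba, c(0.25, 0.75))  # the two quartiles on their own
```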
In data science, this is known as subject matter expertise.
NOTE: You could apply other transformation and filters. I chose to stop at this point. I’ve done as much filtering as we need for this tutorial.
Step 9 – selecting the players at random.
We need 18 players from the batting2018 table. These players will be randomly selected with no replacement.
R has a function that can generate these numbers, and it has an argument that lets us tell it not to use the same number twice. The function is called sample(). The sample() function draws a given number of items at random from a vector, here the integers in a range. We'll use this to generate the index of the players in our batting2018. The range will be the total number of players in our subset (more below).
Before we get into how to use sample() to select our players at random, we need to determine how many players we have to choose from. To find that out, we can use the dim() function.
dim(batting2018)
Your numbers may be different if you are using a different year or you applied another filter. The number of rows for this example is 322.
Now, we’ll create 18 buckets (integers) using 1 through 322. This will serve as an index to the batting2018 table.
playerPicks <- sample(1:322, 18, replace = FALSE)
You could also accomplish this using nrow() which returns the total number of rows.
playerPicks <- sample(1:nrow(batting2018), 18, replace=FALSE)
This will return 18 unique integers that represent the index of the players selected.
Now, let’s separate them into home and away teams.
homeTeam <- batting2018[playerPicks[1:9], ]
awayTeam <- batting2018[playerPicks[10:18], ]
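The same pick-and-split logic can be tried on a made-up table. Everything in toyPlayers is invented, and the set.seed() call is my own addition: it is optional and only makes the random picks repeatable while you experiment.

```r
set.seed(42)  # optional: makes the "random" picks repeatable

# Invented pool of 30 players.
toyPlayers <- data.frame(
  playerID = sprintf("player%02d", 1:30),
  BA = sample(20:32, 30, replace = TRUE),
  stringsAsFactors = FALSE
)

# Draw 18 distinct row numbers, then split them into two teams of nine.
picks <- sample(seq_len(nrow(toyPlayers)), 18, replace = FALSE)
home <- toyPlayers[picks[1:9], ]
away <- toyPlayers[picks[10:18], ]

anyDuplicated(picks)                      # 0 means no row was drawn twice
intersect(home$playerID, away$playerID)   # empty: no player is on both teams
```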
Step 10 – Get the player names
As mentioned, the Batting table (and by extension the batting2018 table) identifies players using an ID (playerID). It would be incredibly frustrating to try to figure out the players from their playerID. When I showed you playerID ruthba01, I am going to assume you figured out it is Babe Ruth. But, what about ones that don’t seem as obvious?
You can certainly play the game with these playerID fields. But you'll want to see how the real players do in Deadball and that requires knowing who they are. What’s the point of having to look up the player names when you can have the computer do that for you, right?
There are several ways to approach this. You could have joined the data in the beginning. However, that would require knowledge of joining in R. It’s not overly difficult, but does require a bit of explanation. You could just take it on a leap of faith that it works and just use what I give you for code. In this case, though, I think keeping it as a separate step makes sense. It’s more intuitive from a case study perspective, and this tutorial can be seen as a mini case study of sorts.
The People table is where the names of all people in Lahman exist. This includes managers and players. To keep things consistent, a manager also has a playerID, even though it refers to a manager. This lets you reference a manager as just another entity in the People table.
The main fields we need from the People table are the nameFirst and nameLast. Feel free to extract whatever other fields you want.
Since the data exists in another table (People), this step is slightly trickier than most of the other steps, but it’s not too bad. The technique I chose is easier than trying to explain joining tables. The technique is also easier than creating a function with a loop to return all the rows we need.
We’ll use the match() function to match the playerID keys of each team with the playerIDs in the People table. We'll store that in an index of matches. Then, we’ll create a new field in homeTeam and awayTeam called playerName that is the concatenation of the nameFirst and the nameLast fields in People. The paste() function can be used to concatenate the two fields.
homeIndex <- match(homeTeam$playerID, People$playerID)
awayIndex <- match(awayTeam$playerID, People$playerID)
homeTeam$playerName <- paste(People[homeIndex, "nameFirst"], People[homeIndex, "nameLast"])
awayTeam$playerName <- paste(People[awayIndex, "nameFirst"], People[awayIndex, "nameLast"])
You now have the playerName as part of your home and away teams tables.
As an aside, if you use the paste0() function, it will take out the space between the nameFirst and nameLast columns for you. In this case, we’ll want the space, which is why I used the paste() function.
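To see what match() and paste() are doing, here is a small sketch on two invented tables. The names and IDs are made up. toyPeople stands in for People and toyTeam stands in for a team table.

```r
# Invented lookup table standing in for People.
toyPeople <- data.frame(
  playerID  = c("aaa01", "bbb01", "ccc01"),
  nameFirst = c("Al", "Bo", "Cy"),
  nameLast  = c("Adams", "Brown", "Clark"),
  stringsAsFactors = FALSE
)

# Invented team table standing in for homeTeam.
toyTeam <- data.frame(
  playerID = c("ccc01", "aaa01"),
  BA = c(27, 24),
  stringsAsFactors = FALSE
)

# For each team row, match() returns the row number of that player in toyPeople.
idx <- match(toyTeam$playerID, toyPeople$playerID)
idx  # 3 1

# paste() joins with a space; paste0() joins with nothing in between.
toyTeam$playerName <- paste(toyPeople[idx, "nameFirst"], toyPeople[idx, "nameLast"])
toyTeam$playerName                                               # "Cy Clark" "Al Adams"
paste0(toyPeople[idx, "nameFirst"], toyPeople[idx, "nameLast"])  # "CyClark" "AlAdams"
```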
You could choose to add the playerName in the very beginning. This way, it would be available for both the homeTeam and the awayTeam when they got created. I'll leave that for you as an exercise.
Step 11 – Print out the fields to use in the game.
If you type the following:
homeTeam
awayTeam
Both of these will print all the columns from the tables. At the most basic level of gameplay in Deadball, you’ll need the player name and the batter’s average (BA). You can accomplish this with the following:
homeTeam[ , c("playerName", "BA")]
awayTeam[ , c("playerName", "BA")]
Results should be different each time you run the full code.
When you need other fields, you can simply add them as another item in the vector. For instance, if you wanted the yearID to be added, replace the
c("playerName", "BA")
with:
c("playerName", "BA", "yearID")
and so on.
Next Steps…
You’ll notice that I didn’t code out the pitching2018 information. This is by design. You have enough background from this tutorial to tackle it now.
I’ll give you the approach at a high level, which is quite similar to the approach for batting2018. You only need the ERA for the Deadball game. However, pitching is complicated and you could come up with intricate filters, if you so choose.
To keep it simple, filter by the number of games (G), for which I set the cutoff at 50. ERA is already in the pitching2018 table, so no calculated transformation (like BA in batting2018) is needed.
Start with just the number of games (G) and you can add in other measurements later. You don't want to be spending days on this. Keep it simple is the best advice.
Use the sample() function to choose two pitchers. The format would be:
pitcherPicks <- sample(1:nrow(pitching2018), 2, replace = FALSE)
Use the match() function to add the playerName to the rows selected by pitcherPicks. You’ll need to return the index first. You can combine it all into one command if you like, but it’s best to keep things simple.
Print out the playerName and the ERA.
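If you want to check your approach, here is one possible sketch of those steps on invented tables. toyPitching and toyPeople stand in for pitching2018 and People, the values are made up, and set.seed() is an optional addition for repeatable picks. Treat it as one way to do it, not the only one.

```r
set.seed(7)  # optional: repeatable picks while experimenting

# Invented stand-ins for pitching2018 and People.
toyPitching <- data.frame(
  playerID = c("ppp01", "qqq01", "rrr01", "sss01", "ttt01"),
  G   = c(62, 12, 70, 55, 33),
  ERA = c(3.10, 5.40, 2.85, 4.02, 3.75),
  stringsAsFactors = FALSE
)
toyPeople <- data.frame(
  playerID  = c("ppp01", "qqq01", "rrr01", "sss01", "ttt01"),
  nameFirst = c("Pat", "Quin", "Ray", "Sam", "Tom"),
  nameLast  = c("Price", "Quill", "Reed", "Stone", "Tate"),
  stringsAsFactors = FALSE
)

# Filter by games, then pick two pitchers at random.
toyPitching <- toyPitching[toyPitching$G > 50, ]
pitcherPicks <- sample(1:nrow(toyPitching), 2, replace = FALSE)
pitchers <- toyPitching[pitcherPicks, ]

# Look up the names and print only what the game needs.
pitcherIndex <- match(pitchers$playerID, toyPeople$playerID)
pitchers$playerName <- paste(toyPeople[pitcherIndex, "nameFirst"],
                             toyPeople[pitcherIndex, "nameLast"])
pitchers[, c("playerName", "ERA")]
```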
When you decide to add more components, like fielding positions and whether the batter is lefty or righty, you can add them easily. The positions for players are located in the Fielding table, which you can filter for the year 2018 the same way we created pitching2018. Batting lefty or righty is located in the People table.
Deadball is a dice-based game. You now have the ability to create rolls of the dice based on the instructions in the book. You can use the sample() command to get the right type of rolls for the situations in the book. Experiment with them to see what types of results they produce.
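As a starting point, here is a minimal sketch of a dice roller. The helper rollDice is hypothetical (my own name, not from the book or any package), and the die sizes below are only examples. Check the book for which dice each situation calls for. Unlike the player picks, dice rolls need replace = TRUE, because the same number can come up more than once.

```r
# Hypothetical helper: roll `n` dice with `sides` faces each.
rollDice <- function(sides, n = 1) {
  sample.int(sides, n, replace = TRUE)
}

set.seed(1)  # optional: repeatable rolls while experimenting

d6   <- rollDice(6, 2)   # two six-sided dice
d20  <- rollDice(20)     # one twenty-sided die
d100 <- rollDice(100)    # one hundred-sided (percentile) roll

d6
sum(d6)
d20
d100
```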
Conclusion
Hopefully, this tutorial/case study can help you level up with your R coding. If nothing else, it will save you time when creating teams for playing Deadball. The instructional aspects of the tutorial should give you the skills you need to answer questions about the Lahman database, too.
Resources
Free R Coding Courses
https://www.udemy.com/courses/search/?price=price-free&q=r%20programming&src=ukw
https://www.listendata.com/2014/06/getting-started-with-r.html