How to Pick Teams for Deadball in R

Learning how to pick teams for Deadball in R also gives you tools for exploring baseball data with the R language. This article uses the Lahman database, which is available as an R package, to select teams at random.

What the Heck Is Deadball?

Deadball is a baseball simulation game. The rules are described in a book of the same name by W.M. Akers, who created the game to fill the dead time after the baseball season is over. Yes, you have to be a diehard baseball addict to understand this!

Deadball also refers to the period in baseball before Babe Ruth and other power hitters, when players didn’t hit many home runs and relied on strategy instead.

For the most part, we’ll stick with the first definition as it’s more appropriate.

You can see how the game is played from the following video. It shows the rules for basic play.

The game lets you play with any players you want, made up or real. The book shows you how to generate fake players (there are online tools for this).

Real players will likely add a sense of authenticity because of the way the rules are defined: the players’ stats are part of the play. That’s one reason why I am writing this tutorial: to give you a way to generate real players programmatically. It will save you time.

You can certainly choose real players manually. This would require you to determine some way to choose these players. Then, you’d have to look up their stats.

NOTE: If you don't care about the coding and just want to generate the teams, you can use the following link:
https://datasciencereview.com/run-r-online/

It contains the code to generate teams for 2018. At the time of this writing, the Lahman database had not yet been updated for 2019.

The Motivation Behind this Tutorial

This tutorial shows you how to program a computer using the R language to pick the teams for you.

I could just give you the code and be done with it. But since this website is about data science, I felt that going through how I created the code, as well as any challenges that emerged, would help budding data scientists learn about the issues that can arise. Besides, it’s not yet baseball season, so you can take advantage of the downtime to play.

This tutorial assumes basic knowledge of R programming. If you have solid experience, you could come up with the code on your own, but you can still use this tutorial to pick your teams: simply take the code and use it as is, or tweak it for your own purposes.

For an excellent (and free) tutorial on how to use R with baseball data, see the following:

https://www.udemy.com/course/baseball1/
(It was free at the time of this writing)

For those who just want the code without all the explanation, here you go:


# Step 1 - Install the Lahman package (uncomment on the first run; keep it commented out once the package is installed).
#install.packages("Lahman")

# Step 2 - load the library
library(Lahman)

year <- 2018


# Step 3 - ensure that the year is at least 2018
# max(Batting$yearID)

# Step 4 - View the Batting table
#View(Batting)

# Step 5 - create the filter for year (Batting and Pitching)
battingSubset <- Batting[Batting$yearID == year, ]

pitchingSubset <- Pitching[Pitching$yearID == year, ]

# Step 5a - View the new battingSubset table
#View(battingSubset) # feel free to do this with pitchingSubset too

# Step 6 - Add the batting average (BA) column
battingSubset$BA <- round(battingSubset$H / battingSubset$AB * 100, 0)

# Step 7 - Filter out the NaN (see blog text)
battingSubset <- battingSubset[battingSubset$AB > 0, ]

# Step 8 - filter out players who have not played enough games (see blog text)
battingSubset <- battingSubset[battingSubset$G > 70, ]

# Step 9 - select the players at random and split into two teams
playerPicks <- sample(1:nrow(battingSubset), 18, replace=FALSE)

homeTeam <- battingSubset[playerPicks[1:9], ]

awayTeam <- battingSubset[playerPicks[10:18], ]

# Step 10 - add player name to home and away teams
homeIndex <- match(homeTeam$playerID, People$playerID)

awayIndex <- match(awayTeam$playerID, People$playerID)

homeTeam$playerName <- paste(People[homeIndex, "nameFirst"], People[homeIndex, "nameLast"])

awayTeam$playerName <- paste(People[awayIndex, "nameFirst"], People[awayIndex, "nameLast"])

homeTeam[, c("playerName", "BA")]

awayTeam[, c("playerName", "BA")]

Cleaning Up Your Data

The tutorial can help you understand the challenges associated with data. Most data sources do not arrive in a format that suits your analysis. The data either contains errors, or has too many exceptions that you need to account for, or both.

Some people choose to do a minimum of data cleaning. That’s okay, but it usually adds to your challenges when you analyze or use the data. In most cases, it pays to spend some time wrangling your data. It makes the end result, and any future use of the data, that much easier.
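To make the cleaning issue concrete, here is a minimal sketch of the kind of problem the code above runs into. It uses a small made-up data frame as a stand-in for the Batting table (the player IDs and numbers are invented for illustration), so it runs without the Lahman package.

```r
# Toy stand-in for a few rows of the Batting table (all values are made up).
batting <- data.frame(
  playerID = c("aaaaaa01", "bbbbbb01", "cccccc01"),
  G  = c(150, 3, 95),
  AB = c(520, 0, 310),
  H  = c(156, 0, 62)
)

# Batting average scaled to two digits, as in Step 6 of the full code.
batting$BA <- round(batting$H / batting$AB * 100, 0)

# The player with zero at bats gets NaN, because 0 / 0 is undefined.
print(batting$BA)

# Drop rows with no at bats, then rows with too few games (Steps 7 and 8).
cleaned <- batting[batting$AB > 0, ]
cleaned <- cleaned[cleaned$G > 70, ]
print(cleaned)
```

Without the cleaning step, a NaN player could be picked at random and would be unusable in the game.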

Why I Chose the R Language

The main reason I chose the R language for this task is that an R package for baseball data already exists. Sports reporter Sean Lahman updates his website with the latest baseball stats every year. The 2019 season was not available at the time of this writing, but once you have the functions set up (from this tutorial) you can apply them to the 2019 data when it’s ready.

The structure of the Lahman database sometimes changes. When it does, you’ll need to adjust your code.
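One way to notice a structure change early is to check for the columns you depend on before doing anything else. The sketch below is only an illustration: check_columns is a hypothetical helper name, not part of the Lahman package, and the data frame is a made-up stand-in for Batting.

```r
# Hypothetical helper: stop early if a table lacks columns the script relies on.
check_columns <- function(tbl, required) {
  missing_cols <- setdiff(required, names(tbl))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy stand-in for the Batting table (values are made up).
batting <- data.frame(playerID = "aaaaaa01", yearID = 2018,
                      G = 100, AB = 400, H = 100)

# Passes silently because every required column is present.
check_columns(batting, c("playerID", "yearID", "G", "AB", "H"))
```

With the package loaded, you could call the same helper on Batting, Pitching, and People before running the rest of the script.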

How This Works with Deadball

The basic play for Deadball has two requirements for data. The first is that you have batting averages for players, and the second is that you have ERA for pitchers. To keep the game simple there is no provision for fielding. Therefore, you won’t need to consider a player’s fielding stats.

I’ll show you how to use R to assemble nine players for each team. You can choose players from individual teams in a particular year, or you can simply choose nine players randomly. This can be by year or throughout most of baseball’s history. Your choice!
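The options in the previous paragraph could be wrapped in a single function. This is a sketch under assumptions: pick_batters is a hypothetical helper, and the data frame is a made-up stand-in that only borrows the yearID and teamID column names from the Batting table.

```r
# Hypothetical helper: draw n batters at random, optionally limited to one
# season and/or one team.
pick_batters <- function(batting, n = 9, year = NULL, team = NULL) {
  pool <- batting
  if (!is.null(year)) pool <- pool[pool$yearID == year, ]
  if (!is.null(team)) pool <- pool[pool$teamID == team, ]
  if (nrow(pool) < n) {
    stop("Only ", nrow(pool), " players match; need ", n)
  }
  pool[sample(seq_len(nrow(pool)), n), ]
}

# Toy data: 40 made-up players across two seasons and two teams.
set.seed(1)
batting <- data.frame(
  playerID = sprintf("player%02d", 1:40),
  yearID   = rep(c(2017, 2018), each = 20),
  teamID   = rep(c("AAA", "BBB"), times = 20),
  stringsAsFactors = FALSE
)

anySeason <- pick_batters(batting)                              # all of history
oneSeason <- pick_batters(batting, year = 2018)                 # one season
oneTeam   <- pick_batters(batting, year = 2018, team = "AAA")   # one team, one season
```

Leaving year and team as NULL means "no restriction", which keeps one function for all three use cases.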

In this tutorial, I will show you how to choose random players for the 2018 season. I have 2019 data from another source, but decided against using it as it is not included as a library for R. It would take too much explanation on how to load up this data. When you learn how to pick teams for 2018, you can use the functions for 2019 and future (or even past) years.

For those readers who know about GitHub, you’ll find the data there. You will have to load each table, though.

Steps for Choosing Teams

The overall concept is to filter and clean baseball data so that you can randomly select 18 players. The first nine are the home players and the second nine are the away players. You can switch them if you’d like. Then, you replace the ninth player and the eighteenth with randomly selected pitchers and obtain their pitching and batting stats.

The basic game does not consider a designated hitter, although you could combine the DH and pitching stats for the purposes of the game. Or, you can simply choose pitchers from the National League.

The choice used in this tutorial is to select nine batters per team. Then, I’ll have a separate object for the pitcher. The pitcher object will exist solely for the calculations that deal with ERA. You can add the pitchers as the ninth and eighteenth players if you like, assuming they have batting stats.
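A separate pitcher object might look like the sketch below. It uses a made-up data frame that borrows the playerID, yearID, GS (games started), and ERA column names from the Pitching table. The 20-start cutoff is an arbitrary assumption for illustration, not a rule from the game.

```r
# Toy stand-in for the Pitching table (all values are made up).
set.seed(42)
pitching <- data.frame(
  playerID = sprintf("pitch%02d", 1:6),
  yearID   = 2018,
  GS       = c(30, 28, 0, 25, 2, 31),
  ERA      = c(3.15, 4.02, 2.50, 3.88, 6.10, 2.95),
  stringsAsFactors = FALSE
)

# Keep regular starters only (assumed cutoff: at least 20 games started).
starters <- pitching[pitching$GS >= 20, ]

# Pick two different pitchers at random, one per team.
pitcherPicks <- sample(seq_len(nrow(starters)), 2, replace = FALSE)
homePitcher  <- starters[pitcherPicks[1], c("playerID", "ERA")]
awayPitcher  <- starters[pitcherPicks[2], c("playerID", "ERA")]

print(homePitcher)
print(awayPitcher)
```

With the real data, pitchingSubset from Step 5 of the full code would take the place of the toy pitching table.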

I use RStudio for this tutorial. If you don’t, that’s okay. There aren’t many (if any) aspects of the solution that require RStudio functionality. But you will need to know how to add packages in the environment you’re using.

Download the full code here (right click and save as).

Let’s Get Started!

Step 1 – Install the Lahman library

install.packages("Lahman")

Step 2 – Load the library

library(Lahman)

Step 3 – Check to see that 2018 exists

max(Batting$yearID)

Steps One Through Three

The result will be the most recent year. If it shows 2019, that’s even better. But it should be at least 2018. If it’s less than 2018, update your Lahman database. You’re welcome to use earlier years, too.
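If you would rather have the script handle a missing season than check by hand, one option is a small guard like the sketch below. resolve_year is a hypothetical helper, and falling back to the latest available season is an assumption about what you might want. The data frame is a made-up stand-in for Batting.

```r
# Hypothetical helper: use the requested season if the table has it,
# otherwise fall back to the latest season available.
resolve_year <- function(tbl, wanted) {
  latest <- max(tbl$yearID)
  if (wanted > latest) {
    message("Season ", wanted, " not found; using ", latest, " instead.")
    return(latest)
  }
  wanted
}

# Toy stand-in for the Batting table: seasons 2016 through 2018 only.
batting <- data.frame(yearID = c(2016, 2017, 2018))

year <- resolve_year(batting, 2018)      # season exists, returned unchanged
fallback <- resolve_year(batting, 2019)  # season missing, falls back to 2018
```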

Step 4 – Use the View() command to see the Batting table

View(Batting)

Step 4 View Batting

If you haven’t seen the Lahman tables before, the data may look a bit cryptic. You can find a description of each field at the following resource:

https://rdrr.io/cran/Lahman/man/Batting.html

For this tutorial, we’ll deal only with the fields we need: playerID, yearID, G (games played), AB (at bats), and H (hits).

The playerID and yearID should be obvious, although the playerID may seem a bit cryptic. For most players, it is the first five letters of the last name, the first two letters of the first name, and a two-digit number that distinguishes between players whose names would otherwise produce the same ID.

For instance, there are several players in baseball’s history with the name Bill Smith (some are Billy Smith). Each of these players has a playerID starting with “smithbi”. The numbers separate these players: smithbi01, smithbi02, etc.

For players who don’t share a name with anyone, a 01 is still appended to the playerID. Let’s see if you can guess the following player based on these rules:
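As a rough illustration of the naming pattern, the sketch below builds the letter part of an ID from a name and then appends the numbers. make_id_prefix is a hypothetical helper written for this example. It only mimics the common pattern, so real playerIDs with unusual names may not match it.

```r
# Hypothetical helper: up to five letters of the last name plus
# up to two letters of the first name, in lower case.
make_id_prefix <- function(first, last) {
  paste0(substr(tolower(last), 1, 5), substr(tolower(first), 1, 2))
}

prefix <- make_id_prefix("Bill", "Smith")

# Append two-digit numbers to tell same-named players apart.
ids <- sprintf("%s%02d", prefix, 1:2)
print(ids)
```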

ruthba01

Hint: it’s one of baseball’s all-time great players!

Babe Ruth

Source: Wikimedia Commons
