Why should you use baseball data to learn statistics? Because it's a fun way to grasp the concepts. Having said that, you should realize that the data available for baseball analysis may not be rich enough to tackle every concept in statistics or, by extension, data science. But you'd be surprised at how much you will learn from the data that is available. The key is to have fun while learning, which is the best way to learn.
Of course, if you don't like baseball, then you probably won't get as much out of learning statistics this way. But it is still worth going through the process because it's a relatively easy way to get your start. If you truly dislike the game, then feel free to skip this article.
Data Tells a Story
You may listen to your favorite baseball commentator describe certain aspects of a team or its players. While commentators usually provide a fair amount of information, no one can present every aspect of the game in a concise and timely manner. What this means for you is that there are plenty of opportunities to learn more of the story than is presented, and that's cool in my book. You'll get to learn what these commentators don't tell you for one reason or another.
You'll have plenty of data to work with, too. Did you know that several data providers give access to baseball data from as far back as 1871? That's incredible. That's almost 150 years at the time of this writing.
Asking the Right Questions
Statistics is about posing questions and finding information that may help answer those questions. Often, statistics will lead to more questions, but that is part of the fun of the so-called paper chase. Most of us won't chase paper anymore, but the process of finding data is part of the fun. You may also find yourself doing some actual paper chasing when information is not available online.
Which questions should you ask? That is mostly up to you. Unless you work for someone who requires certain answers from your data, you have a blank canvas for the types of questions you ask.
Want a great way to learn about baseball, its teams and players? Why not pick up your copy of Out of the Park Baseball? It is a baseball simulator that has correctly predicted the winners of the World Series for the past two seasons. You can have that same power within your grasp. Purchasing from this link results in the site owner receiving a commission. However, the price is the same as if you purchased it directly. The small commission helps fund this website!
Baseball Data Is Accessible
Data is readily available. Here, though, you need to be careful. If you think you can use the data that you download however you like, you're in for a nasty surprise. You may have lawyers pounding on your door demanding that you pay large sums of money for misusing the data that you download.
There is free data that you can access. But, you have to be careful how you use the data. This means reading the terms of service carefully.
If you think I am being overly dramatic about this issue, I am not. It's better to err on the side of caution. When in doubt, ask the data provider what you are allowed to do with the data. Call them if necessary. But, make sure you get it in writing. Also, if you need extensive use of data for some reason, speak to an attorney.
If you are only using the data to learn about statistics or to gain insight into your favorite player or team, you don't have much to worry about. It's people who decide to publish information about the data who need to be careful. Worse, if you try to pass off the data as your own, you'll hit a legal snare.
Going forward, I am going to assume you are using the data within the confines of the terms of service and that you have no nefarious motives for its use!
How to Become a Baseball Stat Head
If you are ready to get your geek on and become a stat head, you've come to the right place. I must warn you, though, once you get started learning about baseball statistics, you'll see the game in a whole new light. While most believe that is good, you may become a bit obsessed. It happens!
Baseball has a lot of data. It is overwhelming for most of us in the beginning. Therefore, I would suggest starting out with the basics. The following resource gives a good (quick) summary of statistics used in baseball.
https://www.pbs.org/kenburns/baseball/beginners/stats.html
These are the bare-bones basics of statistics, and modern-day baseball stats involve more advanced calculations. But you have to start somewhere, and this is a good place to begin. Here is a second resource:
https://www.wikihow.com/Read-Baseball-Statistics
One aspect you will notice when you search for baseball stats is that most providers use the same names for the fields included. That is wonderful since you only need to learn them once.
The wikiHow resource does an okay job of describing the stats. I found the last section a bit cryptic, and it doesn't seem to describe the WHIP statistic well. A simpler definition is given on Wikipedia:
https://en.wikipedia.org/wiki/Walks_plus_hits_per_inning_pitched
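As the name says, WHIP is walks plus hits divided by innings pitched. Here is a minimal sketch of that calculation in Python. The function name and the sample numbers are made up for illustration and do not come from any real pitcher.

```python
def whip(walks: int, hits: int, innings_pitched: float) -> float:
    """Walks plus hits per inning pitched: (BB + H) / IP."""
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return (walks + hits) / innings_pitched


# Made-up season line for illustration: 50 walks, 150 hits, 200 innings.
example = whip(walks=50, hits=150, innings_pitched=200.0)
print(round(example, 2))
```

One thing to watch: box scores often write partial innings as .1 and .2 to mean one third and two thirds of an inning. This sketch assumes you have already converted innings pitched to a true fraction before calling the function.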
Here is another resource (from Wikipedia) that gives definitions:
https://en.wikipedia.org/wiki/Baseball_statistics
Feel free to do an online search for other definitions. These will give you a good start.
Major League Baseball
Major League Baseball's website gives you access to a bunch of stats. It will fluster you when you first start out. To keep it manageable, start out with statistics that make sense to you. Revisit the website as the season progresses and keep tracking that same baseline of stats. For instance, suppose you only want to learn more about home runs for a team or player. That is what you will concentrate your efforts on in the beginning. As you learn, you can start to add more stats to your toolbox.
Stepping It Up with Tools
At a minimum, you should learn how to use Excel (or an equivalent like Google Sheets). Most data that you can download will require a program to load that data. The formats of the data will vary, but many use a Comma-Separated Values (CSV) format. As the name implies, each row in the file contains one record, and the fields within that record are separated by commas. If you have ever seen such a file in a text editor, you know it is not easy to decipher. However, Excel and other spreadsheet programs read CSV files naturally.
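If you are curious what a CSV file looks like underneath, here is a small sketch using Python's built-in csv module. The sample text and its column names (playerID, yearID, AB, H, HR) are made up for illustration, so treat them as assumptions rather than the layout of any particular download.

```python
import csv
import io

# Made-up sample in the shape of a season batting file. The first line is
# the header; every following line is one record with comma-separated fields.
RAW_CSV = """playerID,yearID,AB,H,HR
player01,2019,500,150,30
player02,2019,420,105,12
"""


def read_records(text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dictionary per row, keyed by the header names."""
    return list(csv.DictReader(io.StringIO(text)))


records = read_records(RAW_CSV)
for row in records:
    # Every value comes back as a string, so convert before doing math.
    print(row["playerID"], row["AB"], row["H"])
```

A spreadsheet does the same splitting for you and shows each field in its own column, which is why the file is so much easier to read there.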
I don't know about you, but the CSV file in Excel looks much easier to work with than it does in the text editor. Plus, when you want to add calculations (like batting average, which often isn't included in the data), you can easily do this in a program like Excel.
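Batting average is simply hits divided by at-bats, so it is an easy column to add yourself. The sketch below shows the same calculation in Python for readers who prefer code to a spreadsheet. The rows and the field names (AB for at-bats, H for hits) are made up for illustration.

```python
def batting_average(hits: int, at_bats: int) -> float:
    """Hits divided by at-bats; returns 0.0 when there are no at-bats."""
    if at_bats == 0:
        return 0.0
    return hits / at_bats


# Made-up rows for illustration. Values are strings, as they would be
# after reading them from a CSV file.
rows = [
    {"playerID": "player01", "AB": "500", "H": "150"},
    {"playerID": "player02", "AB": "420", "H": "105"},
]

# Batting averages are conventionally shown to three decimal places.
averages = {
    r["playerID"]: round(batting_average(int(r["H"]), int(r["AB"])), 3)
    for r in rows
}
print(averages)
```

In Excel, the equivalent is a formula that divides the hits cell by the at-bats cell and is then filled down the column.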
It's beyond the scope of this article to describe the various formats available for data and how to load them. There are plenty of tutorials on how to do this. I will explore those in future columns on this website.
Other Data Sources
Baseball data is available at varying levels of granularity, starting with season data. For instance, Sean Lahman maintains data at this high level on his website. It shows player and team stats for a given season (by player, by league, by team, by year). This data is refreshed yearly, usually a month before the start of the new season, and reflects the previous season.
http://www.seanlahman.com/baseball-archive/statistics
I would suggest starting with this data set since it's at a high level. You can also get data at the game level and even the play level. But these are far more detailed, which makes them more complicated. Get used to the high level from the Lahman database first, then venture into the more granular levels later.
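To give a feel for what you can do with season-level rows, here is a rough sketch that totals home runs by team and year. The rows, team codes, and field names (teamID, yearID, HR) are made up for illustration. Check the documentation that comes with whatever data set you download for its actual column names.

```python
from collections import defaultdict

# Made-up season rows for illustration: one row per player per season.
season_rows = [
    {"playerID": "player01", "teamID": "AAA", "yearID": 2019, "HR": 30},
    {"playerID": "player02", "teamID": "AAA", "yearID": 2019, "HR": 12},
    {"playerID": "player03", "teamID": "BBB", "yearID": 2019, "HR": 25},
]


def home_runs_by_team(rows):
    """Sum home runs for each (team, year) pair."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["teamID"], row["yearID"])] += row["HR"]
    return dict(totals)


totals = home_runs_by_team(season_rows)
print(totals)
```

The same idea works for any counting stat: pick the fields you want to group by, then add up the column you care about.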
Conclusion
This article should give you a good primer on baseball statistics. It won't make you an expert, but it will help you learn what is available and serve as a jumping-off point for your analysis. I have a tutorial that can help you go further with your analysis, and I intend to add several more. You can view it here:
https://datasciencereview.com/using-dplyr-in-r-to-subset-baseball-data