Joins are a powerful tool for data analysts to use. They permit analysts to bring together (hence the name join) two or more tables, having a common set of fields. The keys to the joins can be as simple as one field or a composite field, consisting of many fields. Joins can get complicated, though, which is why I decided to publish an alternative using Match to avoid joins in R.
For those who know Microsoft Excel spreadsheets well, this is the loose equivalent of doing a VLookup.
Joins have their benefits. For complicated queries, there isn't much that can match the speed of a well-optimized join. But, for many situations, joins can be overkill, and require lots of debugging juice spewing out of your brain. Let's apply the Occam's Razor Principle here, shall we?
I don't want to bog this article down with too much theory about anything. With that said, let's get started.
The tutorial assumes basic knowledge of R programming, specifically, familiarity with the indexing aspects of the language. Why R when Python is the leader for data science? First, R is still pretty popular.
Second, the Lahman baseball database is a great one to use to demonstrate this technique. Many people love baseball and the database has related tables, so why not?
Finally, I know R better than Python. I do know Python, too. But, R is my first DS language.
As mentioned, we'll use the Lahman database. The database has several tables, but the ones we'll use for this example are Batting and People. The Batting contains the batting stats for professional players all the way back to the year 1871.
The People database replaced the Master database. It contains the information about each player and manager, including their birthdates and date of death (blank if still alive).
Most tables have the common link (known as a key) called the playerID. There may be other joining factors such as yearID, teamID, and lgID. For this tutorial, we'll only need the common key of the playerID.
The People table contains the players full name. The goal is to bring in the full name from the People table and add it to the Batting table.
The functions in R we'll use for this are match() and paste().
Step 1 - Match the playerID in Batting with People and save to an index.
playerIndex <- match(Batting$playerID, People$playerID)
Step 2 - compose the full name of the player and add to the Batting table.
Batting$playerName <- paste(People[playerIndex, "nameFirst"], People[playerIndex, "nameLast")
Step 3 - View() the table to see the result and make sure it worked.
Feel free to skip this section if the above makes sense to you. It's intuitive enough to get without explanation. For those who want a bit more insight into how it works, read on.
If you are new to R, you may find the code a bit offsetting. How can all the records of Batting be matched up with People with one command? The reason is that R uses a concept known as elementwise processing. Essentially, this allows R to do much with little coding.
Programming languages that don't support elementwise processing would require using loops and arrays (or other container structure).
When we applied the match() function, it's as if it created a loop through all of the Batting records, then looked up the player in the People table, then stored the associated index in an array of only the ones that matched between the Batting and People tables. When it finished processing all the records in the loop, it would return the array of matched elements. R does this all with the one match() statement.
R indexing will process only the records that it finds in the index that we created (playerIndex). Suppose you had some routine that returned the first, third, and seventh items in a ten item list. The index would contain (1, 3, 7). Suppose we call this index ind. It would be the same as writing:
ind = c(1, 3, 7)
When we use ind in any structure, for instance, a vector of names,
namevect <- c('Jim', 'Tom', 'Nancy', 'Christina', 'Dan', 'Bill', 'Mary', 'Ariel', 'Mike', 'Hope')
If we used namevect[ind], this will return Jim, Nancy, Mary.
For our baseball example, the matches will likely be for all records. However, had we filtered the Batting data to a subset for the current year, the match() would return a subset. The People table would have the complete number of records.
Why is this important? The order of the match() table matters. The first column of match() drives the second column. You'll get a different result if you used People first with a filtered Batting table.
Should You Learn About Joins?
As you increase your coding chops, having a powerhouse like joins will come in handy. It takes a bit to get used to the different joins that are available. But it's also not impossible to learn. Using is learning and the more you use the easier they become.
The match() function probably won't cut it for more advanced use. For instance, I could not find an easy way to search on multiple fields. If you wanted to grab the team of a player, for instance, you'd need to use the yearID and the playerID.
There is a further complication that a player can be on more than one team in a given year.
The base R didn't have a great provision for joining. Then, along came the DPLYR package from Hadley Wickham. This library makes joining a breeze. For a great (and free) tutorial on how to use DPLYR including joins, you can take the following course:
Even better, it shows how to manipulate baseball data. I have taken the course and it is a great primer on DPLYR and SQL.
NOTE: the course is a good primer but don't expect to learn intricate joins with the tutorial. The course does what it advertised. I will gather resources to help you learn more about joins in a future update on this website.