If you have been following this blog, you have no doubt come across my post, Don't Fear the Kaggle. In it, I described how Kaggle is a great website for learning data science by applying its techniques to projects. Read the post if you haven't already.
Even after reading the post, you may still feel overwhelmed when you visit Kaggle because there is a lot to the website. For this reason, you'll want to look for the easier challenges when starting out. Many people are directed to start with the Kaggle Titanic challenge because it is one of the easier ones. I concur with this recommendation.
But even then, you'll likely wonder how to begin. I know I did. I found a tutorial that was somewhat helpful. However, the instructor didn't explain much about why he took the actions he took. Granted, I was able to submit based on his recommendations, and the submission scored high in the competition. But I couldn't replicate the results because I was just following instructions.
Data science is about more than simply following instructions. You need to have a goal in mind, which is usually derived from a question or questions. The goal of the Titanic exercise is to predict which passengers survived.
I Found a Better Tutorial
For a while, I put the Titanic exercise on the back burner as I was busy with work and learning about data science. I wasn't finding many tutorials that explained why. Then I happened to notice something on r-bloggers.com that caught my attention. It was a tutorial on how to submit predictions for the Titanic challenge to Kaggle.
I went through the tutorial and it did give me more of what I was looking for. It still didn't explain the motivations behind the instructor's actions as much as I would have liked. But it was still better than the previous tutorial. At least with this tutorial, you'll be given step-by-step instructions.
I Recommend the Kaggle Titanic Challenge as Given on r-bloggers.com
This challenge will help you understand the Kaggle process, and it will also give you a glimpse of solving problems using data science techniques.
The idea behind the challenge is to train a machine learning algorithm to predict who lived and who died based on the features given. The features include sex, age, whether the passenger was traveling with family, how many children were aboard with them, ticket class, and fare.
The tutorial takes you through iterations of the process.
How to Proceed with the Tutorial
While the tutorial is simple enough, I wanted to give you some pointers that helped me when I went through it. First, sign up for Kaggle. Trust me when I tell you not to fear the platform. It doesn't bite. Then, whenever the tutorial suggests you upload your results to Kaggle, please do so. The whole point is to get you to use your skills on Kaggle.
One other note. You may at first be confused as to why you would submit "survived" for all women and "did not survive" for all men. The point is that it's just a starting prediction. You can technically do anything you want. But since an overwhelming share of the women survived and an overwhelming share of the men didn't, it's as good a guess as any to start with all women surviving and all men meeting their demise.
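To make that starting prediction concrete, here is a minimal sketch in Python. The tutorial itself works in R, so this is my own illustration and not the instructor's code. It assumes the test file has PassengerId and Sex columns and that a submission needs PassengerId and Survived columns. The four sample rows are made up for the example and are not real passenger records.

```python
import csv
import io

# Hypothetical stand-in for a few rows of Kaggle's test.csv.
# These values are invented for illustration only.
SAMPLE_TEST_CSV = """PassengerId,Sex
1001,male
1002,female
1003,female
1004,male
"""


def gender_baseline(rows):
    """Predict Survived=1 for every woman and Survived=0 for every man."""
    return [
        {
            "PassengerId": row["PassengerId"],
            "Survived": 1 if row["Sex"] == "female" else 0,
        }
        for row in rows
    ]


def to_submission_csv(predictions):
    """Render predictions as CSV text with PassengerId and Survived columns."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["PassengerId", "Survived"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(predictions)
    return out.getvalue()


rows = list(csv.DictReader(io.StringIO(SAMPLE_TEST_CSV)))
predictions = gender_baseline(rows)
submission_text = to_submission_csv(predictions)
print(submission_text)
```

With the real test file, you would read it from disk instead of from the sample string and write the resulting text to a file to upload.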
After you get your score from Kaggle and see that it needs improvement, the tutorial refines the model with further attributes. If you tried to submit a bunch of attributes from the start, how would you know which ones affected the model and which would help with the score?
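The one-attribute-at-a-time idea can be sketched in Python as follows. This is only an illustration under my own assumptions: the six labelled rows are invented (not real Titanic data) and deliberately arranged so that adding ticket class helps, the two rule functions are hypothetical, and the score is simply the fraction of rows predicted correctly. On the real data, a new attribute may help, hurt, or make no difference.

```python
# Hypothetical labelled rows, invented for illustration (not real Titanic data).
TOY_TRAIN = [
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 0},
    {"Sex": "female", "Pclass": 2, "Survived": 1},
    {"Sex": "male", "Pclass": 1, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 2, "Survived": 0},
]


def accuracy(rule, rows):
    """Return the fraction of rows where the rule matches the Survived label."""
    correct = sum(1 for row in rows if rule(row) == row["Survived"])
    return correct / len(rows)


def sex_only(row):
    """Starting guess: every woman survives, every man does not."""
    return 1 if row["Sex"] == "female" else 0


def sex_and_class(row):
    """Refinement with one extra attribute: women outside third class survive."""
    return 1 if row["Sex"] == "female" and row["Pclass"] < 3 else 0


baseline_score = accuracy(sex_only, TOY_TRAIN)
refined_score = accuracy(sex_and_class, TOY_TRAIN)
print(f"sex only: {baseline_score:.3f}")
print(f"sex and class: {refined_score:.3f}")
```

Because only one attribute changed between the two rules, any change in the score can be traced to that attribute.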
What You Will Learn
Hopefully, if you go through the entire tutorial and do what is asked, you'll have learned how to submit predictions to the Titanic challenge. You will receive a score and a ranking for doing so. You will also learn a bit about decision trees and random forests. Both will serve to improve your score on Kaggle.
I would suggest going through the tutorial a few times. The first time you should simply familiarize yourself with the process. Don't try to figure out too much of what is being taught on this first iteration. Instead, use it to understand the process itself. Then, when you have a good grasp of the process, you can analyze the steps that are shown in the tutorial.
Overall Value of Tutorial
If you follow my advice on how to approach this tutorial, you will find that you have notched up your data science learning by a few points.
I realize that after the tutorial you'll have what you need to solve this problem yourself, and perhaps the instructor felt that would be a good exercise. However, all the work you have done up to that point was on the Kaggle interface. This means you would need to recreate the project on your local machine before you could submit the final results to Kaggle.
Get your start on this great tutorial that shows how to solve the Titanic challenge.