The growing popularity of data science means you will need to learn a programming language if you don't already know one. But which do you choose? There are several candidates.
The snarky answer is that you should know as many as you need to get the job you want. That's nice. However, anyone who has learned a programming language before will tell you that this isn't the best advice. That is, unless you don't want to learn a language well!
The good news is that the industry seems to be settling on a few select languages. If you have to learn multiple languages, at least it won't be more than two or three. The two biggest contenders are currently R and Python.
If you are currently a programmer (or were in the past), R should not be too hard for you to learn. In fact, it has several intuitive features that you may just fall in love with. The first is the element-wise operations that you can perform on its data structures.
For instance, suppose you defined a vector, which is like a one-dimensional array. The vector (v1) contains the numbers 2, 4, 6. Now suppose you want to add another vector (v2) to this one. In many other languages, you would have to loop through both vectors and work out which elements to add together. You would also have to store the results somewhere. Otherwise, when the loop finishes, the sums you computed will be gone.
In R, all you need to do is add them together, i.e., v1 + v2. It will take the first element in v1 and add it to the first element in v2. Then it will go right on down the line and add the second element of v1 to the second element of v2, and so on. One operation, no looping, and no explicit storing of the data. The result prints to the command line. Of course, if you need the result of this operation later, you can store it in another vector: v3 = v1 + v2. Still rather simple, wouldn't you say?
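To make the contrast concrete, here is a minimal sketch of the explicit-loop approach described above, written in plain Python with no third-party libraries. The values in v2 are an assumption for illustration, since the text only gives the contents of v1.

```python
# v1 comes from the text; v2 is a hypothetical second vector.
v1 = [2, 4, 6]
v2 = [1, 3, 5]

# Explicit loop: walk both lists by position and store each sum,
# because the individual sums are lost unless we keep them somewhere.
v3 = []
for i in range(len(v1)):
    v3.append(v1[i] + v2[i])

print(v3)  # [3, 7, 11]
```

In R, the whole loop collapses into the single expression v1 + v2. Python libraries such as NumPy offer similar element-wise arithmetic, but the core language's lists do not.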
Subsetting is another great feature of R. In many other languages, when you subset a data structure, you do it by index number. Some languages are sophisticated enough to allow indexing by a column name.
R will also allow you to subset a vector or a table with conditions. For instance, if you were working with a baseball database and you wanted to see all batters who hit more than 10 home runs in a season, you can do the following:
Batting[Batting$HR > 10, ]
The above returns every row where a player hit more than 10 home runs, one row for each season in which that happened. Most other languages would require you to loop through the Batting table and check each row individually, for every year available. Many baseball databases go back to the late 1800s.
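For comparison, the row-by-row check that the R one-liner avoids might look like the following in plain Python. This is only a sketch: the records and the playerID and yearID field names are made up for illustration and are not taken from any real database. Only the HR column appears in the R example.

```python
# Hypothetical stand-in for the Batting table: one record per player per season.
batting = [
    {"playerID": "player_a", "yearID": 1999, "HR": 12},
    {"playerID": "player_a", "yearID": 2000, "HR": 8},
    {"playerID": "player_b", "yearID": 1999, "HR": 25},
    {"playerID": "player_c", "yearID": 2000, "HR": 10},
]

# Check each row individually and keep those with more than 10 home runs.
over_ten = []
for row in batting:
    if row["HR"] > 10:
        over_ten.append(row)

for row in over_ten:
    print(row["playerID"], row["yearID"], row["HR"])
```

Only the two rows with 12 and 25 home runs survive the filter. The row with exactly 10 is dropped because the condition is strictly greater than 10, just as in the R expression.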
As an aside, R does have looping constructs defined in the language. The features above simply make looping unnecessary in many situations.
I admit to not being up to speed with Python. I am in the process of learning it. Its strengths are that it is very easy to learn and has great libraries for data processing. From what I have read, it is a great general-purpose language.
Both languages are growing and seem to be in competition with one another. This is great for us, the users. None of this answers the question of which to learn.
If you are applying for a Python data scientist job but you are strong in R, you may not get the job, and vice versa. Therefore, knowing both is in your best interest. As already mentioned, this is easier said than done. It is difficult to come up to speed on two languages simultaneously. You will be forever mixing up the constructs of the two languages if you choose to learn them this way.
The choice of language also depends on your discipline within the data science industry. If you are moving toward a full data science specialty, you will need to get up to speed with statistics, machine learning, and AI concepts. Since R was developed as a statistical language, it is probably a good choice for this specialty.
On the other hand, if you are gearing up for a data engineer role, you should focus more on the programming languages and technology concepts associated with data science.
Probably the best advice I have seen is this: if you have already gone down the rabbit hole with whatever language you are working in, stick with it.
There are other languages that companies use for their data science and machine learning needs. If you know SQL, you can still find a fair number of job listings. Also, both R and Python are interpreted languages, which means they tend to run slower than compiled languages. If a company requires fast processing, it will likely choose someone with a C/C++ background. If you have never worked with these languages, be warned that they are not for the timid. These are hardcore languages.
There are also languages geared towards statistics. They include SAS, SPSS, and MATLAB. The biggest problem with these is that licenses for the packages are expensive. They are not languages that most people can learn from the comfort of their home. People usually pick up these skills on the job.
There you have it. At the time of this writing, the two most popular languages for data science are Python and R. If you choose to learn one over the other, you should be okay, but knowing both can only benefit you. Depending on your discipline, you may also need to become proficient in math and statistics and have a solid understanding of concepts such as machine learning and AI, among others.
James is a data science writer who has several years' experience in writing and technology. He helps others who are trying to break into technology fields like data science.