What if you tossed a coin ten times and it came up heads seven out of the ten? Would you conclude the coin is unfair? To answer that question, you would likely start by assuming the coin is fair and then run experiments to see whether the results contradict that assumption. But why even bother with any of this? What does it have to do with programming in the R language, or most other languages, for that matter?
At some point, you'll be called upon to run simulations. A core part of data science is generating random numbers. These random numbers serve as the foundation of simulations. Tossing a coin serves as a good introduction to randomness, as it's one of the easiest experiments to implement.
The Coin Toss Experiment
When you toss a coin and it keeps repeating the outcome of the previous toss, you may suspect that something is not right. The only way to find out whether the coin is fair is to experiment, but that requires more flips than most people have the patience for.
Computers can do the heavy lifting for us, and the R language makes it quite easy to create the experiment. As you'll see, you can increase the number of tosses to explore what is happening.
Installing R or Running on the Web
Feel free to install a local instance of the R language on your computer. However, if you don't have the program installed and you want to run the examples in this article, you can use an online R interpreter. It contains all the libraries needed to work with the example code.
If you are planning on using R beyond this article, you may want to install a graphical user interface (GUI) that makes coding in R easier. It's not necessary (especially for this code), but it helps when coding. The most popular GUI for R is RStudio.
This article uses the online R program to run the examples. This makes it easier to follow along. The dynamics are the same for the local version and the online version, however.
Getting Familiar with the sample() Function
The sample() function in R is versatile, yet simple. The simplest possible call looks like this:
sample(1)
Calling this function will result in the number one each time it is run. The first argument can take either an integer or a vector. When you pass a single positive integer, the function treats it as the sequence from 1 to that integer. For our example, it's the same as if we used:
sample(1:1)
When you pass an integer larger than 1, it will return that many samples, drawn from the sequence 1 to the number specified. For instance, if you pass in the number 5:
sample(5)
the results will be the numbers one through five, each appearing exactly once, in random order. Each time you run the same code, it's likely to produce a different order, but always the same five distinct numbers.
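A quick check of this behavior, as a minimal sketch. The call to set.seed() is an addition that is not part of the original example; it only makes the shuffle repeatable, and the seed value 42 is arbitrary.

```r
# Fix the random seed so repeated runs give the same shuffle (optional).
set.seed(42)

# Passing a single integer n > 1 shuffles the sequence 1:n.
shuffled <- sample(5)
print(shuffled)

# Every value from 1 to 5 appears exactly once; only the order changes.
print(sort(shuffled))  # 1 2 3 4 5
```

Remove the set.seed() line to see a different order on each run.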
Passing Vectors
It's possible to pass a set of items (not necessarily numbers) as vectors, which provides for an interesting scenario. Starting with a vector of numbers:
sample(c(1, 2, 3))
will result in three distinct numbers in random order. You can also run the function with a vector of strings:
sample(c("dogs", "cats"))
sample(c("blue", "green"))
sample(c("heads", "tails"))
I think you may see where this is going with that last example.
Other Sample Parameters
If you run sample() with the vector of "heads" and "tails", you'll see that it is close to what we can use for our experiment, but it is not complete. While it will randomly order "heads" and "tails" each time it's run, it won't generate more than two trials. This is where the next parameter can help, which is the number of trials per run.
Try the following:
sample(c("heads", "tails"), 10)
This command will generate an error, because the replace argument is set to FALSE by default. Once sample() has drawn the two items from the vector, it has exhausted its pool and has nothing left to draw from. You'll solve this by setting replace = TRUE, as follows:
sample(c("heads", "tails"), 10, replace = TRUE)
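The failure and the fix can be seen side by side in the sketch below. It wraps the first call in tryCatch() so the script keeps running and the error message can be printed; the variable names are illustrative.

```r
coin <- c("heads", "tails")

# Without replacement, asking for 10 draws from 2 items fails.
# tryCatch() captures the error message instead of stopping the script.
msg <- tryCatch(
  {
    sample(coin, 10)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# With replacement, every draw is made from the full set of outcomes.
tosses <- sample(coin, 10, replace = TRUE)
print(tosses)
```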
The Law of Large Numbers
In statistics, there is a concept known as the Law of Large Numbers. Essentially, this "law" states that as the number of trials increases, the observed proportion of each outcome approaches its expected probability. That is at the heart of our experiment.
What this means is that if you increase the number of trials and the observed proportion of heads settles close to the expected 50%, you have reasonable evidence that the coin is fair.
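One way to watch this happen is to track the running proportion of heads as the tosses accumulate. The sketch below is one possible illustration rather than part of the original experiment; the variable names and the checkpoints are arbitrary choices.

```r
# Simulate 10,000 tosses of a fair coin.
n <- 10000
flips <- sample(c("heads", "tails"), n, replace = TRUE)

# Running proportion of heads after each toss:
# cumulative count of heads divided by the number of tosses so far.
running_prop <- cumsum(flips == "heads") / seq_along(flips)

# Inspect the proportion at a few checkpoints.
checkpoints <- c(10, 100, 1000, 10000)
print(data.frame(tosses = checkpoints,
                 prop_heads = running_prop[checkpoints]))
```

The early checkpoints tend to wander, while the later ones usually sit much closer to 0.5.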
While a computer simulation may not represent a true coin, you can get an intuitive sense when an actual coin is not fair. For instance, if you flip a coin thirty times and the results are all heads, you should start to suspect that something is not right with the coin. The probability of this happening is quite small.
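To put a number on "quite small", the probability can be computed directly. The sketch assumes a fair coin with independent tosses; dbinom() is base R's binomial probability function, used here as an addition to the functions this article covers.

```r
# Probability that a fair coin lands heads on all 30 independent tosses.
p_all_heads <- 0.5^30
print(p_all_heads)  # roughly 9.3e-10, about 1 in a billion

# The same value from the binomial distribution: 30 heads in 30 tosses.
print(dbinom(30, size = 30, prob = 0.5))

# For comparison, the chance of at least 7 heads in 10 tosses of a fair coin.
p_seven_plus <- sum(dbinom(7:10, size = 10, prob = 0.5))
print(p_seven_plus)  # 0.171875
```

By contrast, seven or more heads in ten tosses happens about 17% of the time with a fair coin, so that result on its own is weak evidence of unfairness.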
The final step in our experiment is to increase the number of trials and compare the observed proportion of each outcome to the expected probability (50%). To do this, we'll need one other command:
table(sample(c("heads","tails"), 10, replace=TRUE))
The table() function counts how many times each outcome appears in the sample. Run it several times to see how it changes.
To display the results as proportions instead of counts, wrap the table() command in prop.table():
prop.table(table(sample(c("heads","tails"), 10, replace=TRUE)))
If you run this command several times, you'll see it doesn't always generate a 50/50 split between heads and tails. You'll see a mix of 40/60, 60/40, 50/50, 30/70, 70/30, and maybe even a few 20/80 or 80/20 splits.
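Rather than rerunning the command by hand, the repetition can be scripted. The sketch below uses replicate(), which is not used elsewhere in this article, to repeat the ten-toss experiment 20 times; the helper name heads_share is hypothetical.

```r
# Share of heads in one experiment of n tosses.
heads_share <- function(n) {
  mean(sample(c("heads", "tails"), n, replace = TRUE) == "heads")
}

# Repeat the ten-toss experiment 20 times and collect the share of heads.
shares <- replicate(20, heads_share(10))
print(shares)

# Tabulate how often each split occurred.
print(table(shares))
```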
The next step is to increase the number of trials to 1000, instead of 10:
prop.table(table(sample(c("heads","tails"), 1000, replace=TRUE)))
As you run this command several times, you'll notice the precision change from one decimal place to three. The proportions are converging on the expected probability of 50%. You'll still see some variance because the sample size (1000) is relatively small. Next, try 10,000:
prop.table(table(sample(c("heads","tails"), 10000, replace=TRUE)))
Run this several times and you'll notice that the percentages are getting even closer to 50%. For the final step, run it with the number of trials at 1,000,000:
prop.table(table(sample(c("heads","tails"), 1000000, replace=TRUE)))
Once again, you'll see that the results land even closer to the expected probability of 50%.
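The whole progression can also be run in one pass. The following sketch loops over the trial counts used above and reports how far the share of heads lands from 0.5; the variable names are illustrative, and the exact numbers will differ on every run.

```r
sizes <- c(10, 1000, 10000, 1000000)

# For each trial count, simulate the tosses and compute the share of heads.
share_heads <- sapply(sizes, function(n) {
  mean(sample(c("heads", "tails"), n, replace = TRUE) == "heads")
})

# The gap from the expected 0.5 tends to shrink as the trial count grows.
print(data.frame(trials = sizes,
                 heads = share_heads,
                 gap = abs(share_heads - 0.5)))
```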
You could try to increase the number of trials even more, but the results will change very little when you do. Online interpreters are also likely to time out when the numbers get too big.
Conclusion
This tutorial touches on the basics of how to implement a coin toss experiment in R. The concepts will be similar in other languages. In any language, look for the functions that generate random samples and that tabulate the frequencies of the results.