Remember that statistics class that you thought was useless back in college? Bet you didn’t think you’d ever use it once you finished, right? You had to take it as part of the curriculum, but once it was done, you were done with statistics forever, right? It turns out, some of the techniques they taught in that class are quite useful and can be used to generate simulated sales data.
Why Simulate Sales Data?
You may be thinking, why would anyone want to simulate sales data? If you have built up sales data for your business in the past several years, you would use that instead of mock data, and you should.
But what if you were tasked with evaluating a new analytics package and you haven’t accumulated much in the way of data? The more data you have with these tools, the better your evaluation will be for the software. By simulating realistic data, you can put the tool to the test.
NOTE: the link referred to in the video brings you to this article that you are currently reading. Here is the link, anyway, just in case it seems confusing:
Another reason to generate sales data is to prove out a concept for your company. You could generate what-if scenarios using data to support your hypothesis. Realistic data will go a long way in that support.
If you are a data consultant, you could generate simulated data to show potential clients how their data could be used for analysis. The more realistic this data, the higher the chances that they will accept your proposal.
I am confident you could find other use cases for simulated sales data. It’s likely you’ll need to generate mock data at some point.
Related: How to Find Your Use Cases
Before we continue, I want to point out the methods used in this article were all written in the R language. As you’ll read later, R handles distributions in a much easier way than Excel does. If you don’t know R, my suggestion would be to check out training on the basics of the language.
You can downloadthe code for this project: Download the Code on My GitHub Page
Can We Talk Normal?
In a previous post, I wrote about how to generate mock sales data. In the post, I described a method that can generate simulated sales data. The method I described worked, but it wasn’t efficient. Statistics gurus would probably look at the method and ask why I didn’t just use normal distributions for my random number generation? They’d be right, too. In many cases, the normal distribution is a much better way to handle this capability.
Why a Normal Distribution Works Better
The technique I described in that last article had you guessing as to the probabilities of each of the components. For instance, to generate the number of products per line item, I set the probabilities at 70% for a one product purchase, 17% for quantity two of a product, 10% for quantity three, and 3% for quantity four. Are these quantities correct? Who knows? Although these probabilities can be adjusted, there is no provision for purchasing quantities greater than four of a product.
With a normal distribution, there is no need to worry about probabilities as they are implied by the mean/standard deviation. For instance, for the product quantities, you can simply use quantity of 1 as the mean. You can experiment with the standard deviation to disperse the numbers generated beyond the mean. The smaller the standard deviation, the more numbers will be generated towards the mean.
An alternative to using the normal distribution for something like product quantity is to use a Poisson distribution. This is usually a better fit for small numbers like quantity purchased (on a daily basis). The random generator is easy enough to swap. Just use rpois instead of rnorm.
Using a normal distribution for your random number generation can help answer the problem of purchasing product quantities larger than 4. But how does it help in generating the most popular products?
To answer this question, start by assigning sequential numbers to your products. This will serve as an index to the products table. You are welcome to assign an alternative field that is not numeric (for instance, SKU numbers). However, it is usually a good idea to index your data using a sequential number. Companies often change how they designate their keys for tables.
When you designate an arbitrary index such as sequence numbers, you can make all the changes you want to other fields that serve as secondary keys. The primary keys won’t change. If you depend on indicators (such as SKUs) as your primary keys, you would need to create a mapping table if changes to the schema were required. Trust me when I tell you this is a maintenance nightmare!
Assigning a numeric sequence to your data also has the advantage that the mean for your normal distribution can represent the numeric sequence of your top-selling product. For instance, suppose your company has five products, Widget-V, Widget-W, Widget-X, Widget-Y, and Widget-Z. Further, suppose Widget-W is the best seller, followed by Widget-Z, then followed by Widget-Y. While you may be tempted to assign the sequence number in order (Widget-V = 1, Widget-W = 2, etc.) you are better off assigning numbers in clusters for the top sellers. Here is what I mean by this:
You have five numeric numbers to assign and they should be sequential. Here is how you may choose to assign the numbers (followed by the reasons why for this schema):
Widget-V = 1
Widget-Z = 2 (second top seller)
Widget-W = 3 (top seller)
Widget-Y = 4 (third top seller)
Widget-X = 5
You would then assign the mean to be the sequence of the top-selling product (in our case Widget-W = 3). When you generate the normal distribution, it will naturally select Widget-W the most number of times (most of the time), due to the nature of normal distributions. The second top products, Widget-Z and Widget-Y, would appear more frequently than Widget-V and Widget X.
The only drawback to this approach is when Widget-Z and Widget-Y have large disparities in the number sold or the price. However, if there are large disparities, you may have to use a different method of generating random numbers than the normal distribution. In most cases, for the purposes of generating mock sales data, the normal distribution should serve you well. Be aware of this bit of a twist, though.
Once you have the mean, experiment with the standard deviation. If you want the numbers to stay close to the mean, run the normal random generator with lower standard deviations. If you want them more spread out, simply increase the number.
Just like the last article on mock number generation, I tried to make this work in Excel. You could use the NORM.INV() function, but you’d have more work to do than running this command one time. In R, you can run the rnorm() function and get it to do much of the work you need.
With the base R functionality, you would have to adjust rnorm() when the items go beyond the bounds of your sequence numbers. For instance, for our product number generation above, the sequence is from 1-5. However, the normal distribution with a larger standard deviation may (and probably will) generate numbers less than one and greater than 5. If you were to simply truncate the numbers, then you won’t have enough numbers for your sales data.
For instance, if you were to generate 1000 products sold to be added into your orders (or sales) table, by truncating the outliers, you’ll have less than 1000 numbers. Your goal of having 1000 numbers to add to your orders table will fall short.
One solution that I tried was to take the total number of outliers and increase the frequency of the mean by that number. Then, when I truncated the lower and upper outliers, the number of records generated matches the number of orders desired.