How to Generate Mock Sales Data

Search for mock data or mockup data, and you’ll have plenty of options for items like customers or regions. Mockaroo is one popular choice among many for generating such data and will generate up to 1000 records for free. When you try try to find sales data that is worth using, that is another story altogether. The choices are much more limited. This article can show you how to generate meaningful mock sales data.

Challenge

Generating mock sales data is challenging because there are many components to it. The data itself is quite simple and often includes the transaction date, the customer number, the products and quantities of those products. Perhaps it contains the store number to, but to keep it simple, we’ll use only the core components.

The three main tables of most sales schemas are the sales (sometimes called orders), customers, and products. 

Other tables, like stores, regions, and returns, advertising and marketing, can also be integrated, but these are handled in a similar fashion to how products and customers work.

For this tutorial, I will concentrate on just the sales generation, as it’s the most involved procedure to generate records. For products and customers, you can easily find generators online or data that you can download.

Be careful on how you use any data that you find and check to make sure you have rights to use it. If you have a list of common first names and you create an Excel macro to combine them with common last names (Smith, Jones, etc.) then you should be fine. However, this article should not be misconstrued as offering any legal advice. Please seek the advice of a legal professional for this.

The proper format for sales data is a set of pointers to other tables. It is a set of keys and dates. For instance, the sales data I will show you how to generate in this tutorial will contain the following:

Order Entry

Note: due to the tool I was using for my analysis after generating data, the headers in the above sample file are different than what will be generated from the algorithm. Feel free to name them whatever names work for you.

With the exception of the date field, the other fields are numeric pointers. The first column (Order ID, which is orderNum ) is generated as a sequence. The other numeric fields are keyed to other tables that contain descriptions of those keys.

Methodology for Generating Orders

I tried as hard as I could to come up with procedures to generate the orders table completely from Excel and with no macros. I tried to use random number generation followed by IF statements. But I had to do a lot of manually data entry as a result that is proved to be not useful as an order generation table.

I started implementing VBA code (code that results from recording macros). But even that proved too difficult. Instead, I created the routines in R language, which was much easier due to its extensive tools available in the language.

Generating the code in R had its own set of challenges, but they were manageable. For instance, R does not support pass-by-reference variable access. But you can get around this problem by using lists. That is the solution I adopted (more on this later).

My first intuition was to determine the number of orders ahead of time and then try to fill in each of the columns based on a set of rules. In trying to recreate how sales would be generated in a real setting, the following had to be considered:

  • Multiple customers can buy on any given day.
  • Any customer can buy one or many products. Customers who browse and don’t buy don’t need to be considered for sales data. That would be more like foot traffic data, which is difficult to model. For online stores, traffic is easier to model through analytics data. But this is not considered here (perhaps a future enhancement?)
  • To determine the quantity ordered or purchased, customers usually buy one quantity of most products. For this model, I choose to use probabilities for purchasing 1, 2, 3, or 4 quantities of a product. One (1) has the highest probability with four (4) having the smallest. I realize that customers could buy even more than 4, but I had to draw the line somewhere. Feel free to adjust this with more numbers or different probabilities.
  • Some customers are repeat buyers. There are a few ways to handle this scenario. One of the easiest is to have more orders with less available customers – those customers would be repeated by random selection. In other words, if there are 100 orders and 100 customers, the chances of repeats are much less than 100 orders and only five customers in the customer table.
  • When generating order numbers, a customer who buys multiple items should have the same order number for all the items purchased on a given transaction date. This scenario does not take into account time of day. Theoretically, a customer can return to the store on the same day and make multiple purchases. That’s rare enough and unimportant enough not to include in a model.

One feature that won’t be included in this tutorial is associating related products. For example, while a customer may not buy high-end bikes frequently, they may return to a bike store to purchase accessories. Also, it’s likely that salespeople will upsell related products. For instance, a new bike owner may not own a helmet. The salesperson will advise the newbie on the importance of having a helmet. The salesperson may also sell water bottles that fit easily on the new bike. These purchasing habits would alter the sales data significantly. It adds a level of complexity that I will tackle as an enhancement.

Getting Started

One caveat I should mention is that this tutorial will not cover R coding. The constructs used in the tutorial are standard and can be learned from several tutorials on the web. A good place to start is with Data Camp’s R tutorial. They let you try out the first chapter before you need to upgrade. The first chapter covers quite a bit of R coding and is a great overview.

Coding Script

We’re going to create a routine called GenerateOrders that returns a data frame. When you save a data frame using the write.csv() command, it will automatically generate the appropriate columns for you into a Comma-Separated File (CSV). A CSV is one of the most popular formats to use. The file will contain the following fields:

  • OrderNum
  • TransactionDate
  • CustomerID
  • ProductID
  • Quantity

As mentioned above, these column names are different than what appears in the image of the orders table spreadsheet.

I include the R constructs that I used to create this project. This way, you can find information online if you aren’t familiar with them. You can focus your efforts on these constructs, although it won’t hurt to learn as much as you can about the complete language. Here are the constructs used:

  • Data Frames
  • Random number generators (mostly sample() with probabilities associated with them)
  • Function creation
  • Lists (for workaround on lack of pass-by-reference support in R)
  • For loops
  • The rbind() command to add new data frame objects.
  • Reading (future update) and writing CSV files.
  • Working directories

All these constructs are standard within the core R language. In fact, there aren’t any libraries included in the code that I provide. If you already have a basic knowledge of R, you should find this tutorial relatively easy to follow along. If you are new to R, you should consider taking a few tutorials on the language.

You are welcome to bypass learning the language and simply use the code I provide to create sales data for your use. However, if you want to make adjustments to suit your needs, you’ll have a difficult time knowing what to do without a basic knowledge of the R language.

The GenerateOrders() function takes the following form:

GenerateOrders <- function(startingDate, customerMax, productMax, numDates)

Here is a description of the parameters:

startingDate - you can choose whatever date you like for the startingDate parameter. You’ll want to choose a date sometime in the past. With the data I generated for demonstration, I used 11/28/2017.

customerMax - is used to enter the largest Customer Key in your schema. Depending on your requirements, you may want to have this information predetermined. If you are going to use it in a BI reporting tool, each of the Customer IDs will need to be valid.

Note: the algorithm assumes that each number from 1 to the customerMax exists in the customer table. This may not be a valid assumption. For instance, if your customer table contains 1, 2, 3, 5, 6 as valid customer numbers, the number 4 is missing. This algorithm may include (randomly) the number 4. One workaround to this problem is to re-sequence a new field in Excel for customers and use that re-sequence as the key. This would also solve the problem if your customer key contains alphanumeric characters instead of sequential numbers.

As an enhancement, I plan on reading customer tables and extracting the sequence numbers directly. This would eliminate the need to enter a value for customerMax. The same is true for productMax (next).

productMax – this field is the highest number product id, assuming that product id’s are sequential and all numeric. See the above explanations for customerMax for possible enhancements and workarounds.

numDates – the algorithm will sequence a series of dates starting with the starting date. It will use this sequence of dates as the main iterator in the GenerateCustomer routine. Bear in mind that the total number of items generated is likely to be much greater than the number you enter for this parameter. For demonstration purposes, I chose 1000 dates to be initially generated. Due to the rules specified previously, the number of items that were generated from the algorthim was over 5,000.

Algorithm

GenerateOrders <- function(startingDate, customerMax, productMax, numDates) {

  df <- data.frame(orderNum=NULL, transactionDate=NULL, customerID=NULL, productID=NULL, quantity=NULL) 

  oNum <- 1

  dateArr <- seq(as.Date(startingDate), by = "day", length.out = numDates)

  dateArr <- as.character(dateArr)

  for(dt in dateArr) {

    dfTemp <- GenerateCustomers(dt, customerMax, productMax, oNum)

    df <- rbind(df, dfTemp[[2]])

    oNum <- dfTemp[[1]] # this is needed as R does not support by reference variables

  }

  return (df)
}

The first line initializes the main data frame that gets returned with the series of orders attached. It sets the column names and sets the values to NULL. Setting all the values to NULL sets up this data frame as a placeholder without creating a record.

oNum is short for order number. This counter will be needed to continually increase as orders are added to the data frame. As mentioned, R has no native ability to pass parameters by reference. That is why GenerateCustomers returns a list instead. The list contains the oNum and the data frame containing the records for each data. The oNum is retrieved and updated after the call so that the continuous sequence for order numbers is maintained. Without this, oNum would reset to 0 after each call to GenerateCustomers.

Next, we create a sequence of dates that will be used as the iterator of the loop. This I name dateArr. I chose to change the dates to strings as when I first created this, it was changing the values to the numeric equivalent of the dates. For some reason, it wasn’t correct, though. The day and month were right, but the year was for 1950. CSV files don’t understand date formats anyway. Making it a string (using as.character()) is a safer way to approach this.

The next step is the central focus of the algorithm, i.e., to pass each date to the GenerateCustomer() function. The GenerateCustomers() function creates a data frame using the rules described previously. More detail about the code itself will be described in the next section. The function returns a list containing the order number (oNum) and the data frame filled with orders based on the rules. This gets added to the master data frame that will be returned from GenerateOrders().

GetProductQuantity <- function() {

  x = c(1, 2, 3, 4)

  px = c(0.70, 0.17, 0.1, 0.03)

  #set.seed(123)

  sample(x, size = 1, replace = TRUE, prob = px)

}

As the name suggests, the GetProductQuantity() function is used to generate the quantity field of the orders table. The probabilities weight the purchase of one quantity per product as the highest. The purchase of four of a product occurs at the lowest probabilities. Feel free to adjust the probabiliti