Do you have a subscription service that you track in a relational database? Businesses that capture this membership information often want to know their members' starting dates and last dates. This can be useful to track duration trends and retention rates.
Memberships that provide strong benefits will keep subscribers longer than those offering weaker benefits. Although this quick tutorial won’t cover retention and duration, both will require finding the starting date for members. That’s what this tutorial will cover.
Using the Database
I have created a database with one table (both database and table named Membership). I chose SQLite because I can include the entire database on my GitHub account. To use it, download and open the database file (in either DBBrowser for SQLite or SQLiteonline).
Don't know SQL? No worries. Learn about a fantastic resource that teaches you data science techniques, including SQL. You are in good hands with your training when you join!
----> Get started now!
The table we’ll work with in this tutorial contains information about members in a subscription payment table. The table contains the member’s ID, monthly payment date, monthly payment amount, and the number of downloads for that month.
The following video will explain why it may be prudent to move your data collection to a SQL database.
NOTE: the link mentioned in the video leads you to this tutorial. There is no need to click on the link. It was included to bring viewers from other video websites to here.
Finding the MIN() Date
The membership table contains monthly payment dates, which represents the date members paid their subscription fee. That means the first monthly payment date is the date associated with the member subscribing. The MIN() function can be used to get the first payment date, which represents the start of the membership.
If you know SQL, you probably are wondering why I would even create a tutorial on this topic. After all, can’t you use the MIN() function on the date you are tracking and be done with it?
Yes. The MIN() is the right function to use to find the start date of a member. You can select the member ID and the MIN() function on the date. You’ll also need a GROUP BY clause on the member ID:
SELECT memberID, MIN(PaymentDate) as MinPaymentDate
FROM Membership
GROUP BY MemberID
However, if you are planning on tracking retention or duration, you’ll need the MIN() value to coexist with other columns, including each payment date. That is not possible with the GROUP BY clause. This clause rolls up the data into an aggregated value. If you include more non-aggregated columns, you’ll need to include these in the GROUP BY clause. This changes how the groupings work, which won’t satisfy the requirements.
What About Subqueries?
Subqueries will work. You may hear or have heard that subqueries are not efficient. However, the optimizers of most SQL engines will rework the query to be efficient. Therefore, if you are comfortable with subqueries, feel free to use them.
The following is a subquery that could be used to get the first payment date (while including other columns). The idea is to join on a subquery that includes the minimum aggregation:
SELECT memberID, downloads, PaymentAmount, PaymentDate
FROM Membership
JOIN (SELECT MIN(PaymentDate) MinPayDate
FROM membership GROUP BY memberID) as FirstDate
ON FirstDate.MinPayDate = Membership.PaymentDate
ORDER BY MemberID;
This works because we are using an INNER JOIN (the default JOIN). An INNER JOIN will return rows that are contained in both tables and will bypass rows that are either in one or the other but not both. The minimum date in the subquery constrains the date to only that one date, which is what we need to satisfy the requirements.
Adding Another Aggregation
Suppose you also wanted to include another aggregation, for instance, the last payment date for each customer. You could try to implement another join, but that would further constrain the records to only records where the minimum and maximum payment dates are the same. In other words, members who only joined for one month and quit. In the membership table, there are only five records that match that criteria.
Related: Why Data Scientists Need to Learn SQL
You could try different types of joins, but that starts to get a bit complicated. There is a much better method of handling these types of situations.
Do You Do Windows?
A window function in SQL allows the analyst to frame out a certain segment of the dataset and apply conditions before the dataset is returned. Window functions use functions like MIN(), MAX(), SUM(), etc. They also have dedicated functions for the window purpose, like ROW_NUMBER() and RANK(), etc.
The OVER() keyword is what specifies your intent to use a window function in SQL. You include an aggregation or internal function (ROW_NUMBER, etc.), followed by the OVER() command. If you don't include any parameters within OVER(), it will frame the entire dataset.
Related: How to Bucket Data in SQL
Window functions are not difficult but are a bit involved. Coverage of this topic could (and should) take up a full tutorial on its own. It’s better to cover the aspects of the method that we’ll use for this tutorial. Then, you can continue to apply that subset of functionality until you master it.
We’ll first try the MIN() and MAX() functions for the window, which will work. You can think of a window as a way of framing a section of the result set. You apply certain functions to the window (or frame). In this case, min and max will be the functions.
We’ll keep the same set of columns from the main table and two window functions for min and max as follows:
SELECT memberID, downloads, PaymentAmount, PaymentDate,
MIN(PaymentDate) OVER(PARTITION BY MemberID Order BY PaymentDate) AS
MinPayDate,
MAX(PaymentDate) OVER(PARTITION BY MemberID Order BY PaymentDate DESC) AS MaxPayDate
FROM Membership
ORDER BY memberID;
The OVER() statement can be used with or without parameters. If you used it without it, then the window or frame would be the entire result set. In our case, we are partitioning (PARTITION BY) the MemberID. We are ordering by PaymentDate for the minimum and PaymentDate DESC for the maximum.
Conceptually, this accomplishes that for each member ID, find the minimum date based on the PaymentDate sorted in ascending order. Also, find the maximum date for each member ID with the PaymentDate sorted in descending order. When you get to the next member, start this analysis process again. The min and max will be based on the dates per user. Had you left out the PARTITION BY and only left the ORDER BY, it would process the entire result set (constrained by conditions in the WHERE clause, etc.)
Related: How Window Functions Work
Our query does provide the minimum and maximum, but as you can see, it repeats. It will require more processing to filter for only one row per member, which is our requirement. There may be other requirements where you would need repeated rows. That is the power of window functions. You have the flexibility and control.
The window functions have a construct called Row_Number() that can be used in place of the MIN() function. We’ll replace both the MIN and MAX functions with this:
SELECT memberID, downloads, PaymentAmount, PaymentDate,
ROW_NUMBER() OVER(PARTITION BY MemberID Order BY PaymentDate) AS rnMin,
ROW_NUMBER() OVER(PARTITION BY MemberID Order BY PaymentDate DESC) AS rnMax
FROM Membership
This certainly gets closer to our goal. We could use specify in the WHERE clause to check for PaymentDate = MinPayDate. However, we have a problem in trying to make this work. At this point, trying to use MinPayDate won’t work because it does not yet exist in the query. The whole query would need to be evaluated first, and the processing would then have to return to this point, which is not possible.
We’ll need to create an intermediate step, either a temporary table or a CTE. A temporary table can exist after its creation for the session or connection, depending on whether it’s a local or global temporary table, respectively. Conversely, a CTE must be used immediately after it is defined. Its scope goes away when the main query is processed.
A CTE is also a good choice, and it’s the one I chose for this tutorial. After you define the CTE, the aliases become available to the main query as follows:
With MinMaxDates AS (
SELECT memberID, downloads, PaymentAmount, PaymentDate,
MIN(PaymentDate) OVER(PARTITION BY MemberID Order BY PaymentDate) AS MinPayDate,
MAX(PaymentDate) OVER(PARTITION BY MemberID Order BY PaymentDate DESC) AS MaxPayDate
FROM Membership
)
SELECT * from MinMaxDates WHERE PaymentDate = MinPayDate;
Are We Done Yet?
The previous solution will work, and you can choose to stop after executing it. The query solves the problem defined. However, it’s nice to know about alternatives, which is what I’ll present now.
Instead of the min and max, we can use the ROW_NUMBER() to order the query. In this case, we won’t even need to change the parameters in the OVER() clause.
With FirstLastDates AS (
SELECT memberID, downloads, PaymentAmount, PaymentDate,
ROW_NUMBER() OVER(PARTITION BY MemberID Order BY PaymentDate) AS rnMin,
ROW_NUMBER() OVER(PARTITION BY MemberID Order BY PaymentDate DESC) AS rnMax
FROM Membership
)
Select * from FirstLastDates
WHERE rnMin = 1;
After you run the query, you’ll notice two columns that are row numbers – one is for the PaymentDate ascending, and the other is for the PaymentDate descending. After wrapping this query inside a CTE, we search for the rnMin = 1 in the outer query.
But we would still have to provide further processing for the max date. The row numbers for the maximum dates are likely to be different for members. Even if they were the same, you wouldn’t know ahead of time what this number is like you do with the minimum row number (which is always 1). To get the maximum row number, you would likely need to use a subquery that searches for the date where the min row number = max row number.
Instead of doing this, why not simply bring back the MAX(PaymentDate) in the window? The minimum would remain as the row numbers, but the max would be the actual last date, which would give us what we needed:
With FirstLastDates AS (
SELECT memberID, downloads, PaymentAmount, PaymentDate,
ROW_NUMBER() OVER(PARTITION BY MemberID Order BY PaymentDate) AS rnMin,
MAX(paymentDate) OVER(PARTITION BY MemberID Order BY PaymentDate DESC) AS MaxPaymentDate
FROM Membership
)
Select * from FirstLastDates
WHERE rnMin = 1;
When you run this, you can use the PaymentDate as the minimum, and the MaxPaymentDate will be part of the query, too. It is simply a matter of performing date math on the two dates to find out the duration of memberships (you could average them, etc.)
Conclusion
When you start working with window functions, your brain will start thinking of use cases that you can apply them to. You’ll find they are quite powerful and can streamline the code. Learn this one concept well, and when you tackle other aspects of window functions, it will come easier from this learning effort.