by James 

How to Bucket Data in SQL

0 Comments

How to Bucket Data in SQL

December 8, 2020 in SQL

Are your employees wondering if they are being paid fairly in your organization? Suppose your HR department wants you to create a tool that will help employees determine how their salaries measure compared to other employees. This tutorial shows one method to help accomplish this task.

How to Bucket Data in SQL

Purchases made from the links in this article may result in commissions generated to the site owner!

The HR department has a strict requirement not to show individual salaries, for obvious reasons. How can we solve this problem? In other words, we need to show employees salary guidelines without showing actual salaries.

Don't Know SQL? Get the training you need from the premiere resource on the topic. Learn how to become a SQL master, which can elevate your success in the data science field!


One solution is to take the average salary for all employees and publish it to your employees. However, this has several problems associated with it. First, executives usually get paid hefty salaries in comparison to most other employees. They will skew the average to the high side. Your employees will likely fall below this inflated (skewed) average. They'll believe they are making much less than the average and they won't realize the average is skewed.

Related: Create Lookup Tables from Fact Tables in Power BI

Second, it doesn’t take into account the department or field the employees work in. For instance, accounting and technology workers often make higher salaries than other departments, all equal. In this case, the low-paying departments are likely to drag the overall average down (if you back out executive pay).

You may think this is good as it will make your employees appear as though they are getting paid higher than the average. However, if they happen to find out, they won’t appreciate the deception.

NOTE: The reference to clicking on the link in the video is for this tutorial. You don't have to click on any links. You are already here! The reference is for visitors viewing the link on YouTube!

How to Bucket Data in SQL

One way to handle this situation is to include a department category in the employees table. Then, it would be as simple as using a GROUP BY statement by department. You could take the average salary by department and publish the results. 

However, sometimes the data needed to use GROUP BY is not available in the tables that you have available. What if department is not included in the employees table? Or it could be in the table but you don't have access to that column due to security constraints.

It's still possible to create custom categories that you can use as part of a GROUP BY statement. That's what this tutorial is about.

For simplicity, we'll create a set of categories that creates buckets for the salaries. 

Let’s assume that any salary over $155,000 is considered an executive salary. Salaries > $110,000 and <= $155,000 are considered high paid. Salaries > $85,000 and <= $110,000 are considered above average, and all salaries $85,000 or below are low paid.

Buckets for Salary

Perhaps the category descriptions need improvement. You could change the categorizations to 'Executive', 'Upper Management', 'Middle Management', and 'Worker'. These categories can be anything that makes sense to your situation.

The goal is to break down the salaries into these buckets and then group by the buckets to get the average salary in that category. One way to do this is to use a CASE statement in SQL.

The CASE statement is a great tool and one that every SQL analyst should learn and use. Trust me, you will use this construct quite a bit, once you know how to implement it. It is powerful.

The syntax for case is:

CASE WHEN condition1 THEN output1

          WHEN condition 2 THEN output2

          WHEN condition 3 THEN output3

                              …

          WHEN condition N then output

ELSE default condition END as Alias Name

You can use the CASE statement in many areas of the SELECT statement. It could even be used in an INSERT statement. However, this tutorial will show it being used in the columns selection only. Be aware that other possibilities exist.

Here is how the case statement would be implemented with the conditions described previously:

SELECT name, salary,

CASE WHEN salary > 155000 THEN 'Executive'

           WHEN salary <= 155000 AND salary > 110000 THEN 'High Paid'

           WHEN salary <= 110000 AND salary > 85000 THEN 'Above Average'

           WHEN salary <= 85000 THEN 'Low Pay'

           ELSE 'Unknown' END

FROM employees

ORDER BY salary DESC

Here is the script and results:

Case Statement Bucketing

Feel free to adjust the ranges.

Related: More on Bucketing in SQL...

Rolling Up the Salaries by Buckets

The previous query still contains individual salaries, which is something the HR department stated is not allowed. Grouping by the buckets will remove the individual salaries from the query. However, as it stands, we cannot group by these newly created buckets in the same SELECT statement as they don’t yet exist at the point of the group by. What’s needed is to create this bucketing ahead of time and then find a way to use the bucketing later with the GROUP BY clause.

You could solve this in a few ways. You could create a temporary table or a Common Table Extension (CTE). Since we don’t need this to persist for the duration of the connection or the session, we’ll choose the CTE. You create a CTE with a WITH…AS clause:

WITH SalaryBuckets AS (

SELECT name, salary,

CASE WHEN salary > 155000 THEN 'Executive'

           WHEN salary <= 155000 AND salary > 110000 THEN 'High Paid'

           WHEN salary <= 110000 AND salary > 85000 THEN 'Above Average'

           WHEN salary <= 85000 THEN 'Low Pay'

           ELSE 'Unknown' END as SalaryCategory

FROM employees

ORDER BY salary DESC

)

SELECT SalaryCategory, AVG(salary) AS avgSalary

FROM SalaryBuckets

GROUP BY SalaryCategory

order by avgSalary DESC

Here are the results:

Rollup by New Buckets

As you can see, we simply wrapped the code that we created with the case statement as a CTE (which was contained within the parenthesis). Then, we used the newly created CTE as if it were a table. At this point in the process, we can use a GROUP BY statement on the buckets.

This method is certainly not perfect, but it does satisfy the requirements of the HR department. It is shown more to illustrate the power of the CASE statement in SQL. In case you’re wondering, companies have adopted methods to give employees an indication as to where their salaries fit. Each company is likely to come up with their own methods for determining this.

Common Table Expressions (CTE)

If you have never seen a CTE before, it is a shorthand method of using a pre-conditioned SQL statement to perform specific tasks in the main, or outer query. You can think of it (loosely) as a cross between a temporary table and a subquery.

A few rules when creating a CTE:

  • Start with the WITH statement, name the CTE, and then include AS
  • The SELECT statement must exist within parenthesis following the AS statement
  • A CTE cannot exist without a finishing (or resolving) SQL SELECT statement, which actually uses the CTE.
  • It’s possible to nest multiple CTEs together. To continue with subsequent nested CTEs, include the name and the AS() statement (do not include the WITH again)

Here is a template of a nested CTE:

WITH Salary1 AS (

)

Salary2 AS (

)

SELECT … (using Salary1 or Salary2 or both) etc.

Notice the subsequent CTEs do not have a WITH statement.

As mentioned, the CTE is no longer in scope when the final query finishes executing. It only exists to help provide its work to the final query.

How to Create Data for the Example

The data used for this article was generated randomly. I used the data creation utility on GenerateData.com. It can create the data in MySQL format, although it added single quotes (‘) which MySQL complained about.

The SQL in the tutorial was written to be platform-agnostic, although the CREATE TABLE does have some MySQL specific constructs that may or may not work in other platforms. When using another platform, you may need to adjust the parameters.

The main two fields are the name field and salary field, which are the ones required for this tutorial. Feel free to create the table with just these fields and populate them with data that you generate, if you are having too much trouble with getting the SQL script to work.

The SQL script is on GitHub.

Not-So-Great Use Cases for Using Case Clause

As powerful as the CASE clause is in SQL, it's not appropriate for every situation. If you have data that needs several categorizations or you have data that changes often, you may want to find alternatives to using CASE. 

For instance, suppose you had access to departments in the employee table. If you only had a few departments to work with, using the CASE clause may be appropriate. However, if you had hundreds of departments, you may find the CASE statement gets rather unwieldy.

It is possible to structure you CASE clause to use the LIKE %% construct. This could help you to clump together like departments, for example. But it starts to get rather fuzzy when the departments are split across divisions, regions, etc. Each of these subsets within the organization may categorize departments differently, even if they name the the same.

To illustrate what I mean by the CASE with LIKE %%, here is an example:

Suppose you had the following:

Employees with Departments

You want to group Accounting and Tax Accounting into one group. You also want to group the two Shipping departments. Here is code you could use the following:

SELECT *,
CASE WHEN department LIKE '%Accounting%' THEN 'Accounting'
WHEN department LIKE '%Shipping%' THEN 'Shipping'
ELSE department END AS modDepartments
FROM employeeswithdepartment;

This produces the following results:

Department Groupings

Again, for a few departments, this could work well. But for hundreds of departments across several regions or districts, it can get ugly and unmanageable. Also, what if your company has divisions that use departments local to their language?

Bucketing is not perfect and often requires a judgement call. For instance, what if you had several departments that had "International" in its title and you needed to group by International or Domestic? The Shipping department would not be correct for this scenario.

Become a SQL Guru!

The right resources can make all the difference in learning how to master the SQL language. Click on the link below this text and you'll have the proper resources to learn EVERYTHING you need to get started using SQL.


Data analysts who know SQL have an advantage over other their colleagues!

Conclusion

The CASE statement can help you get creative with your queries. This tutorial only scratches the surface on what is possible with the construct. You will find plenty of uses for it once you have mastered it. The good news is it doesn’t take too much effort to learn.

About the author 

James

James is a data science writer who has several years' experience in writing and technology. He helps others who are trying to break into the technology field like data science. If this is something you've been trying to do, you've come to the right place. You'll find resources to help you accomplish this.

Leave a Reply

Your email address will not be published. Required fields are marked

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}
Subscribe to get the latest updates