08 Mar

Grouping By An Unknown Range

Let's imagine that you wanted to break your order data into 10% increments from the smallest sale to the largest. And suppose your data was actually suited for this sort of thing, and you didn't have widgets that sold for $0.09 sitting right next to $90,000 Lamborghinis. And suppose this kind of analysis was actually useful for something besides a SQL GROUP BY example. OK, just suppose. Anyway, here is a process you might go through and a query that would satisfy the problem.

Step 1 – Generate a range for your data.

USE adventureworks2012;
GO
DECLARE @minval decimal(18,2);
DECLARE @maxval decimal(18,2);

SELECT @maxval = MAX(totaldue), @minval = MIN(totaldue)
FROM sales.salesorderheader;
SELECT @maxval AS 'Max', @minval AS 'Min';

Max	        Min
187487.83	1.52

Right off the bat we need to contact Houston, because this won't work. Well, it'll work when we imagine it working later, but this result tells me that we have no negative amounts in our order data. In other words, we've never issued a single credit. In other words, our data is probably useless.

However, this is AdventureWorks, dammit, we don’t issue credits, and since it’s true in the data, I’ll return you to your regular programming.
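If your data did contain credits, the fix is to keep both variables and normalize against the full span instead of assuming a floor of zero. A minimal sketch, assuming the same sales.salesorderheader table:

```sql
-- Sketch: a range that tolerates negative totals (credits).
DECLARE @minval decimal(18,2);
DECLARE @maxval decimal(18,2);
DECLARE @range  decimal(18,2);

SELECT @maxval = MAX(totaldue), @minval = MIN(totaldue)
FROM sales.salesorderheader;

-- The span from the lowest (possibly negative) total to the highest.
SET @range = @maxval - @minval;

-- Each total is then normalized into 0..1 before bucketing, so the
-- CASE ladder in step 2 still works when @minval is below zero:
--   (totaldue - @minval) / @range
```

Since AdventureWorks has no credits, @minval contributes nothing and the simpler single-variable version below works fine.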

Step 2 – Create a clause that segments our data range into 10% increments.

Here's how you would do this. You create a CASE expression based on a @range variable holding the maximum value and, because our minimum is effectively 0, use the ratio totaldue / @range to segment the GROUP BY, as follows.

USE adventureworks2012;
GO
DECLARE @range decimal(18,2);

SELECT @range = MAX(totaldue)
FROM sales.salesorderheader;

WITH cte AS
(
SELECT CASE
	WHEN totaldue / @range <= .10 THEN '.0000 <= .10'
	WHEN totaldue / @range <= .20 THEN '.1001 <= .20'
	WHEN totaldue / @range <= .30 THEN '.2001 <= .30'
	WHEN totaldue / @range <= .40 THEN '.3001 <= .40'
	WHEN totaldue / @range <= .50 THEN '.4001 <= .50'
	WHEN totaldue / @range <= .60 THEN '.5001 <= .60'
	WHEN totaldue / @range <= .70 THEN '.6001 <= .70'
	WHEN totaldue / @range <= .80 THEN '.7001 <= .80'
	WHEN totaldue / @range <= .90 THEN '.8001 <= .90'
	ELSE '.9001 <= 1.0' END AS Range,
CAST(totaldue AS DECIMAL(18,2)) AS totaldue
FROM sales.salesorderheader
)
SELECT Range, 
CAST(SUM(totaldue) AS DECIMAL(18,2)) AS [Total Due], 
COUNT(totaldue) AS [Transaction Count], 
CAST(AVG(totaldue) AS DECIMAL(18,2)) AS [Average Sale]
FROM cte
GROUP BY Range
ORDER BY Range;

Range	        Total Due	Transaction Count	Average Sale
.0000 <= .10	41521958.70	29858	                1390.65
.1001 <= .20	17921756.80	615	                29141.07
.2001 <= .30	22384650.62	490	                45682.96
.3001 <= .40	15449160.66	240	                64371.50
.4001 <= .50	10416627.45	125	                83333.02
.5001 <= .60	8452682.22	82	                103081.49
.6001 <= .70	4923353.28	41	                120081.79
.7001 <= .80	1116949.70	8	                139618.71
.8001 <= .90	489622.37	3	                163207.46
.9001 <= 1.0	540019.13	3	                180006.38

Does this tell us a lot? Actually, it does tell us a few things. First, most of our transactions are small, relatively speaking, with a few large outliers in the data. If we really wanted to analyze this in some meaningful fashion we'd have to further segment the range. But this post really isn't about analysis, it's more about CASE, so I'm going to add an additional designation to the data so the ranges are clearer.

USE adventureworks2012;
GO
DECLARE @range decimal(18,2);

SELECT @range = MAX(totaldue)
FROM sales.salesorderheader;

WITH cte AS
(
SELECT CASE
    WHEN totaldue / @range <= .10 THEN 0*@range
    WHEN totaldue / @range <= .20 THEN .1001*@range
    WHEN totaldue / @range <= .30 THEN .2001*@range
    WHEN totaldue / @range <= .40 THEN .3001*@range
    WHEN totaldue / @range <= .50 THEN .4001*@range
    WHEN totaldue / @range <= .60 THEN .5001*@range
    WHEN totaldue / @range <= .70 THEN .6001*@range
    WHEN totaldue / @range <= .80 THEN .7001*@range
    WHEN totaldue / @range <= .90 THEN .8001*@range
    ELSE .9001*@range
END AS beginRange,
totaldue
FROM sales.salesorderheader
),
cte1 AS
(
SELECT beginRange, 
CAST(SUM(totaldue) AS DECIMAL(18,2)) AS [Total Due], 
COUNT(totaldue) AS [Transaction Count], 
CAST(AVG(totaldue) AS DECIMAL(18,2)) AS [Average Sale]
FROM cte
GROUP BY beginRange
)
SELECT CAST(beginRange AS INT) AS beginRange,
CAST(LEAD(beginRange) OVER (ORDER BY beginRange) - 1 AS INT) AS endRange,
[Total Due],
[Transaction Count],
[Average Sale]
FROM cte1
ORDER BY beginRange;

beginRange	endRange	Total Due	Transaction Count	Average Sale
0	        18766	        41521963.87	29858	                1390.65
18767	        37515	        17921756.81	615	                29141.07
37516	        56264	        22384650.61	490	                45682.96
56265	        75012	        15449160.67	240	                64371.50
75013	        93761	        10416627.45	125	                83333.02
93762	        112510	        8452682.28	82	                103081.49
112511	        131259	        4923353.25	41	                120081.79
131260	        150008	        1116949.69	8	                139618.71
150009	        168756	        489622.37	3	                163207.46
168757	        NULL	        540019.12	3	                180006.37

There is a lot going on here, so let's go through it.

  1. The first thing I changed was the MIN and MAX pair to a single @range variable; since there were no negative values in the data I could start at 0 and work my way up, so all I needed was one variable set by the MAX aggregate function.
  2. There are two CTEs in the query.  The first uses a CASE expression with the @range value to calculate the beginning of each data range.
  3. The second CTE exists because I didn't want to write a second CASE expression for the ending value of each data range.  Instead it aggregates the data from the first CTE into a result set that lets me use the LEAD function in a final query that ties all of this together.
  4. I converted the beginRange and endRange values to INT because the output is easier to read.  In real life I'd probably do the same with the "Total Due" and "Average Sale" columns.
  5. As mentioned earlier, I used the LEAD function to get the endRange value instead of a second CASE expression.
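As an aside, the ten-branch CASE ladder can be collapsed into arithmetic: FLOOR(totaldue / @range * 10) yields a bucket number from 0 to 9 directly. This is a sketch, not the query above; note its buckets are half-open intervals, so rows sitting exactly on a boundary land one bucket higher than in the CASE version, and the single maximum row must be clamped into bucket 9.

```sql
-- Sketch: arithmetic bucketing in place of the CASE ladder.
DECLARE @range decimal(18,2);

SELECT @range = MAX(totaldue)
FROM sales.salesorderheader;

SELECT CAST(bucket * @range / 10 AS INT)           AS beginRange,
       CAST((bucket + 1) * @range / 10 - 1 AS INT) AS endRange,
       CAST(SUM(totaldue) AS DECIMAL(18,2))        AS [Total Due],
       COUNT(*)                                    AS [Transaction Count]
FROM (
    SELECT totaldue,
           -- Clamp the one row where totaldue = @range into bucket 9.
           CASE WHEN totaldue = @range THEN 9
                ELSE FLOOR(totaldue / @range * 10) END AS bucket
    FROM sales.salesorderheader
) AS b
GROUP BY bucket
ORDER BY bucket;
```

The trade-off is readability: the CASE version spells out every boundary, while the FLOOR version scales to any number of buckets by changing one constant.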

Once again, I don't think this specific query is all that useful.  It's too broad across the entire data range, when it really needs to focus on the lower totaldue transactions.  However, this kind of analysis can be useful.  Very large or very small orders can distort data analysis in unexpected ways.  For instance, a sales rep can have his sales margin destroyed because he writes a large sale at a low profit.  If your commission structure includes a gross profit calculation, he could even lose money on a sale that benefits the company.

Finally, this query is missing the most useful measure: profit margin.  I didn't include it because the adventureworks2012 data requires joins to get at the cost information, and this process was complex enough without it.  What I really wanted to show here is that it is possible to write a GROUP BY that dynamically groups data based on the data it's grouping.

Or should I say, discretize our data?
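For completeness, T-SQL ships a built-in discretizer: NTILE. It's a different animal from the queries above, since it builds equal-count buckets (the same number of rows in each) rather than equal-width value ranges, but it's worth knowing when you reach for this kind of analysis. A sketch against the same table:

```sql
-- Sketch: NTILE builds equal-COUNT deciles, not equal-WIDTH ranges.
WITH deciles AS
(
    SELECT totaldue,
           NTILE(10) OVER (ORDER BY totaldue) AS decile
    FROM sales.salesorderheader
)
SELECT decile,
       CAST(MIN(totaldue) AS DECIMAL(18,2)) AS beginRange,
       CAST(MAX(totaldue) AS DECIMAL(18,2)) AS endRange,
       CAST(SUM(totaldue) AS DECIMAL(18,2)) AS [Total Due],
       COUNT(*)                             AS [Transaction Count]
FROM deciles
GROUP BY decile
ORDER BY decile;
```

With skewed data like this, NTILE would put roughly 3,144 rows in every bucket, whereas the equal-width version above put 29,858 in the first.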
