Let's imagine that you wanted to break out your order data in 10% increments, from the smallest sale to the largest. And suppose that your data was actually suited to this sort of thing, and you didn't have $0.09 widgets sitting right next to $90,000 Lamborghinis. And suppose this kind of analysis was actually useful for something besides a SQL GROUP BY example. OK, just suppose. Anyway, here is a process that you might go through, and a query that would satisfy the problem.
Step 1 – Generate a range for your data.
```sql
USE AdventureWorks2012;
GO

DECLARE @minval decimal(18,2);
DECLARE @maxval decimal(18,2);

SELECT @maxval = MAX(totaldue),
       @minval = MIN(totaldue)
FROM sales.salesorderheader;

SELECT @maxval AS 'Max', @minval AS 'Min';
```

| Max | Min |
|---:|---:|
| 187487.83 | 1.52 |
Right off the bat we need to contact Houston, because this won't work. Well, it'll work when we imagine it working later, but this is telling me that we have no negative amounts in our order data. In other words, we've never issued a single credit. In other words, our data is probably useless.
However, this is AdventureWorks, dammit, we don’t issue credits, and since it’s true in the data, I’ll return you to your regular programming.
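If your data did contain credits, the approach would need a floor as well as a ceiling. Purely as a hedged sketch (this isn't part of the original problem, since AdventureWorks has no negatives), you could normalize each value against the min-to-max spread before bucketing:

```sql
-- Sketch: normalize totaldue into [0, 1] using the full spread, so
-- negative amounts (credits) land in the low buckets instead of breaking
-- the divide-by-max logic used below. NULLIF guards against a zero spread.
DECLARE @minval decimal(18,2), @maxval decimal(18,2);

SELECT @maxval = MAX(totaldue),
       @minval = MIN(totaldue)
FROM sales.salesorderheader;

SELECT (totaldue - @minval) / NULLIF(@maxval - @minval, 0) AS normalized,
       totaldue
FROM sales.salesorderheader;
```

The normalized column could then be fed to the same CASE logic shown in Step 2.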
Step 2 – Create a clause that segments our data range into 10% increments.
Here’s how you would do this. You create a CASE expression based on the @maxval value from Step 1, and, because the minimum is effectively zero in this case, use it alone to segment the GROUP BY, as follows.
```sql
USE AdventureWorks2012;
GO

DECLARE @range decimal(18,2);

SELECT @range = MAX(totaldue)
FROM sales.salesorderheader;

WITH cte AS
(
    SELECT CASE
               WHEN totaldue / @range <= .10 THEN '.0000 <= .10'
               WHEN totaldue / @range <= .20 THEN '.1001 <= .20'
               WHEN totaldue / @range <= .30 THEN '.2001 <= .30'
               WHEN totaldue / @range <= .40 THEN '.3001 <= .40'
               WHEN totaldue / @range <= .50 THEN '.4001 <= .50'
               WHEN totaldue / @range <= .60 THEN '.5001 <= .60'
               WHEN totaldue / @range <= .70 THEN '.6001 <= .70'
               WHEN totaldue / @range <= .80 THEN '.7001 <= .80'
               WHEN totaldue / @range <= .90 THEN '.8001 <= .90'
               ELSE '.9001 <= 1.0'
           END AS [Range],
           CAST(totaldue AS DECIMAL(18,2)) AS totaldue
    FROM sales.salesorderheader
)
SELECT [Range],
       CAST(SUM(totaldue) AS DECIMAL(18,2)) AS [Total Due],
       COUNT(totaldue)                      AS [Transaction Count],
       CAST(AVG(totaldue) AS DECIMAL(18,2)) AS [Average Sale]
FROM cte
GROUP BY [Range]
ORDER BY [Range];
```

| Range | Total Due | Transaction Count | Average Sale |
|---|---:|---:|---:|
| .0000 <= .10 | 41521958.70 | 29858 | 1390.65 |
| .1001 <= .20 | 17921756.80 | 615 | 29141.07 |
| .2001 <= .30 | 22384650.62 | 490 | 45682.96 |
| .3001 <= .40 | 15449160.66 | 240 | 64371.50 |
| .4001 <= .50 | 10416627.45 | 125 | 83333.02 |
| .5001 <= .60 | 8452682.22 | 82 | 103081.49 |
| .6001 <= .70 | 4923353.28 | 41 | 120081.79 |
| .7001 <= .80 | 1116949.70 | 8 | 139618.71 |
| .8001 <= .90 | 489622.37 | 3 | 163207.46 |
| .9001 <= 1.0 | 540019.13 | 3 | 180006.38 |
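As an aside, and purely a sketch rather than anything from the original query: since each WHEN branch is really just a decile index, the ten branches can be collapsed into arithmetic. The boundary handling differs slightly (an exact 10% multiple lands in the next bucket here), so treat it as an approximation of the CASE above, not a drop-in replacement.

```sql
-- Sketch: compute the decile index arithmetically instead of ten WHENs.
-- The row equal to @range itself is pinned to bucket 9 so FLOOR can't
-- produce an eleventh bucket.
DECLARE @range decimal(18,2);

SELECT @range = MAX(totaldue)
FROM sales.salesorderheader;

SELECT CASE WHEN totaldue >= @range THEN 9
            ELSE FLOOR(totaldue / @range * 10)
       END AS decile,
       COUNT(*)                             AS [Transaction Count],
       CAST(SUM(totaldue) AS DECIMAL(18,2)) AS [Total Due]
FROM sales.salesorderheader
GROUP BY CASE WHEN totaldue >= @range THEN 9
              ELSE FLOOR(totaldue / @range * 10)
         END
ORDER BY decile;
```

Back to the actual results.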
Does the result set tell us a lot? Actually, it does tell us a few things. First, most of our transactions are relatively small, with a few large outliers in the data. If we really wanted to analyze this in some meaningful fashion, we'd have to further segment the range; a quick sketch of what that could look like follows. But this post really isn't about analysis, it's more about CASE.
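Just to make "further segment the range" concrete (again, a sketch rather than part of the original post), the same pattern can be rerun over the bottom decile only, treating 10% of the overall maximum as the new ceiling, so those 29,858 small transactions get their own ten sub-buckets:

```sql
-- Sketch: reuse the bucketing technique on the bottom decile only.
DECLARE @range decimal(18,2);

SELECT @range = MAX(totaldue) * .10   -- new ceiling: 10% of the overall max
FROM sales.salesorderheader;

SELECT CASE WHEN totaldue >= @range THEN 9
            ELSE FLOOR(totaldue / @range * 10)
       END AS subBucket,
       COUNT(*)                             AS [Transaction Count],
       CAST(SUM(totaldue) AS DECIMAL(18,2)) AS [Total Due]
FROM sales.salesorderheader
WHERE totaldue <= @range              -- bottom decile of the full range
GROUP BY CASE WHEN totaldue >= @range THEN 9
              ELSE FLOOR(totaldue / @range * 10)
         END
ORDER BY subBucket;
```

With that aside out of the way, I'm going to add an additional designation to the data so the ranges are clearer.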
```sql
USE AdventureWorks2012;
GO

DECLARE @range decimal(18,2);

SELECT @range = MAX(totaldue)
FROM sales.salesorderheader;

WITH cte AS
(
    SELECT CASE
               WHEN totaldue / @range <= .10 THEN 0     * @range
               WHEN totaldue / @range <= .20 THEN .1001 * @range
               WHEN totaldue / @range <= .30 THEN .2001 * @range
               WHEN totaldue / @range <= .40 THEN .3001 * @range
               WHEN totaldue / @range <= .50 THEN .4001 * @range
               WHEN totaldue / @range <= .60 THEN .5001 * @range
               WHEN totaldue / @range <= .70 THEN .6001 * @range
               WHEN totaldue / @range <= .80 THEN .7001 * @range
               WHEN totaldue / @range <= .90 THEN .8001 * @range
               ELSE .9001 * @range
           END AS beginRange,
           totaldue
    FROM sales.salesorderheader
),
cte1 AS
(
    SELECT beginRange,
           CAST(SUM(totaldue) AS DECIMAL(18,2)) AS [Total Due],
           COUNT(totaldue)                      AS [Transaction Count],
           CAST(AVG(totaldue) AS DECIMAL(18,2)) AS [Average Sale]
    FROM cte
    GROUP BY beginRange
)
SELECT CAST(beginRange AS INT) AS beginRange,
       CAST(LEAD(beginRange) OVER (ORDER BY beginRange) - 1 AS INT) AS endRange,
       [Total Due],
       [Transaction Count],
       [Average Sale]
FROM cte1
ORDER BY beginRange;
```

| beginRange | endRange | Total Due | Transaction Count | Average Sale |
|---:|---:|---:|---:|---:|
| 0 | 18766 | 41521963.87 | 29858 | 1390.65 |
| 18767 | 37515 | 17921756.81 | 615 | 29141.07 |
| 37516 | 56264 | 22384650.61 | 490 | 45682.96 |
| 56265 | 75012 | 15449160.67 | 240 | 64371.50 |
| 75013 | 93761 | 10416627.45 | 125 | 83333.02 |
| 93762 | 112510 | 8452682.28 | 82 | 103081.49 |
| 112511 | 131259 | 4923353.25 | 41 | 120081.79 |
| 131260 | 150008 | 1116949.69 | 8 | 139618.71 |
| 150009 | 168756 | 489622.37 | 3 | 163207.46 |
| 168757 | NULL | 540019.12 | 3 | 180006.37 |
There is a lot going on here, so let's go through it.
- The first thing I changed was the MIN and MAX variables to a single @range variable. Since there were no negative values in the data, I could just start at 0 and work my way up; all I needed was one variable populated by the MAX aggregate function.
- There are two CTEs in the query. The first uses a CASE expression with the @range value to calculate the beginning of each data range.
- The second CTE exists because I didn't want to write a second CASE expression for the ending value of each data range. Instead, it aggregates the data from the first CTE into a result set that lets me use the LEAD function in a final query that unites all of this.
- I converted the beginRange and endRange values to INT types because it looks better. In real life I'd probably do the same with the "Total Due" and "Average Sale" columns.
- As mentioned earlier, I used the LEAD function, rather than another CASE expression, to get the endRange value (a refinement for the NULL endRange on the last row is sketched just below).
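That refinement, as a sketch rather than something from the original query: LEAD accepts a default as its optional third argument, which closes the NULL endRange on the last row. Reusing cte1 and @range from the query above, the final SELECT could read:

```sql
-- Sketch: LEAD's third argument supplies @range + 1 when there is no next
-- row, so the last bucket's endRange becomes the overall max, not NULL.
SELECT CAST(beginRange AS INT) AS beginRange,
       CAST(LEAD(beginRange, 1, @range + 1)
                OVER (ORDER BY beginRange) - 1 AS INT) AS endRange,
       [Total Due],
       [Transaction Count],
       [Average Sale]
FROM cte1
ORDER BY beginRange;
```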
Once again, I don't think this specific query is all that useful. It's too broad across the entire data range, when it would really need to focus on the lower totaldue transactions. However, this kind of analysis can be useful. Very large or very small orders can distort data analysis in unexpected ways. For instance, a sales rep can have his sales margin destroyed because he writes a large sale at a low profit. If your commission structure includes a gross profit calculation, it's even possible that he could lose money on a sale that benefits the company.
Finally, this query is missing the most useful measure: profit margin. I didn't include it because the AdventureWorks2012 data requires joins to get at the cost information, and this process was complex enough without it. What I really wanted to show here is that it's possible to dynamically write a GROUP BY that groups data based on the data it's grouping.
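One last hedged aside that isn't in the post's queries: if what you actually want is equal-count buckets rather than the equal-width ranges built with CASE, SQL Server's NTILE window function does the discretizing for you.

```sql
-- Sketch: NTILE(10) assigns each order to one of ten equal-count buckets,
-- an alternative to the equal-width ranges built with CASE above.
WITH deciled AS
(
    SELECT totaldue,
           NTILE(10) OVER (ORDER BY totaldue) AS decile
    FROM sales.salesorderheader
)
SELECT decile,
       COUNT(*)                             AS [Transaction Count],
       CAST(SUM(totaldue) AS DECIMAL(18,2)) AS [Total Due],
       CAST(AVG(totaldue) AS DECIMAL(18,2)) AS [Average Sale]
FROM deciled
GROUP BY decile
ORDER BY decile;
```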
Or should I say, discretize our data?