08 Mar

Grouping By An Unknown Range

Let's imagine you wanted to break out your order data in 10% increments, from the smallest sale to the largest. And suppose your data were actually suitable for this sort of thing, and you didn't have widgets that sold for $0.09 sitting right next to $90,000 Lamborghinis. And suppose this kind of analysis were actually useful for something besides a SQL GROUP BY example. OK, just suppose. Anyway, here is a process you might go through, and a query that would satisfy the problem.

Step 1 – Generate a range for your data.

USE adventureworks2012;
GO
DECLARE @minval decimal(18,2);
DECLARE @maxval decimal(18,2);

SELECT @maxval = MAX(totaldue), @minval = MIN(totaldue)
FROM sales.salesorderheader;
SELECT @maxval AS 'Max', @minval AS 'Min';

Max	        Min
187487.83	1.52

Right off the bat we need to contact Houston, because this won't fly in the real world. Well, it'll work fine for our example later, but it's telling me that we have no negative amounts in our order data. In other words, we've never issued a single credit. In other words, our data is probably useless.

However, this is AdventureWorks, dammit, we don’t issue credits, and since it’s true in the data, I’ll return you to your regular programming.

Step 2 – Create a clause that segments our data range into 10% increments.

Here's how you would do this. You create a CASE expression based on the @maxval value, and because the minimum is effectively zero in this case, use it to segment the GROUP BY, as follows.

USE adventureworks2012;
GO
DECLARE @range decimal(18,2);

SELECT @range = MAX(totaldue)
FROM sales.salesorderheader;

WITH cte AS
(
SELECT CASE
	WHEN totaldue / @range <= .10 THEN '.0000 <= .10'
	WHEN totaldue / @range <= .20 THEN '.1001 <= .20'
	WHEN totaldue / @range <= .30 THEN '.2001 <= .30'
	WHEN totaldue / @range <= .40 THEN '.3001 <= .40'
	WHEN totaldue / @range <= .50 THEN '.4001 <= .50'
	WHEN totaldue / @range <= .60 THEN '.5001 <= .60'
	WHEN totaldue / @range <= .70 THEN '.6001 <= .70'
	WHEN totaldue / @range <= .80 THEN '.7001 <= .80'
	WHEN totaldue / @range <= .90 THEN '.8001 <= .90'
	ELSE '.9001 <= 1.0' END AS Range,
CAST(totaldue AS DECIMAL(18,2)) AS totaldue
FROM sales.salesorderheader
)
SELECT Range, 
CAST(SUM(totaldue) AS DECIMAL(18,2)) AS [Total Due], 
COUNT(totaldue) AS [Transaction Count], 
CAST(AVG(totaldue) AS DECIMAL(18,2)) AS [Average Sale]
FROM cte
GROUP BY Range
ORDER BY Range;

Range	        Total Due	Transaction Count	Average Sale
.0000 <= .10	41521958.70	29858	                1390.65
.1001 <= .20	17921756.80	615	                29141.07
.2001 <= .30	22384650.62	490	                45682.96
.3001 <= .40	15449160.66	240	                64371.50
.4001 <= .50	10416627.45	125	                83333.02
.5001 <= .60	8452682.22	82	                103081.49
.6001 <= .70	4923353.28	41	                120081.79
.7001 <= .80	1116949.70	8	                139618.71
.8001 <= .90	489622.37	3	                163207.46
.9001 <= 1.0	540019.13	3	                180006.38
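
Outside the database, the arithmetic behind that CASE ladder is easy to sanity-check. Here's a small Python sketch (using made-up totals, not the AdventureWorks numbers) of the same divide-by-max bucketing:

```python
from collections import Counter

# Sketch of the CASE ladder: divide each total by the max, then walk the
# 10% boundaries. Sample totals are made up for illustration only.
totals = [1.52, 1500.00, 25000.00, 90000.00, 187487.83]
max_total = max(totals)

def bucket(total, max_total):
    # Mirrors "WHEN totaldue / @range <= .10 THEN ..." and so on;
    # anything above .90 falls into the last bucket, like the ELSE branch.
    ratio = total / max_total
    for i in range(1, 10):
        if ratio <= i / 10:
            return i - 1
    return 9

counts = Counter(bucket(t, max_total) for t in totals)
print(dict(counts))  # how many made-up totals land in each 10% bucket
```

Same shape as the SQL result: nearly everything piles into the bottom bucket, with stragglers up the range.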

Does this tell us a lot? Actually, it does tell us a few things. First, most of our transactions are relatively small, with a few large outliers in the data. If we really wanted to analyze this in some meaningful fashion we'd have to further segment the range. But this post isn't really about analysis, it's about CASE, so I'm going to add an additional designation to the data so the ranges are clearer.

USE adventureworks2012;
GO
DECLARE @range decimal(18,2);

SELECT @range = MAX(totaldue)
FROM sales.salesorderheader;

WITH cte AS
(
SELECT CASE
    WHEN totaldue / @range <= .10 THEN 0*@range
    WHEN totaldue / @range <= .20 THEN .1001*@range
    WHEN totaldue / @range <= .30 THEN .2001*@range
    WHEN totaldue / @range <= .40 THEN .3001*@range
    WHEN totaldue / @range <= .50 THEN .4001*@range
    WHEN totaldue / @range <= .60 THEN .5001*@range
    WHEN totaldue / @range <= .70 THEN .6001*@range
    WHEN totaldue / @range <= .80 THEN .7001*@range
    WHEN totaldue / @range <= .90 THEN .8001*@range
    ELSE CAST(.9001*@range AS INT) 
END AS beginRange,
totaldue
FROM sales.salesorderheader
),
cte1 AS
(
SELECT beginRange, 
CAST(SUM(totaldue) AS DECIMAL(18,2)) AS [Total Due], 
COUNT(totaldue) AS [Transaction Count], 
CAST(AVG(totaldue) AS DECIMAL(18,2)) AS [Average Sale]
FROM cte
GROUP BY beginRange
)
SELECT CAST(beginRange AS INT) AS beginRange,
CAST(LEAD(beginRange) OVER (ORDER BY beginRange) - 1 AS INT) AS endRange,
[Total Due],
[Transaction Count],
[Average Sale]
FROM cte1
ORDER BY beginRange;

beginRange	endRange	Total Due	Transaction Count	Average Sale
0	        18766	        41521963.87	29858	                1390.65
18767	        37515	        17921756.81	615	                29141.07
37516	        56264	        22384650.61	490	                45682.96
56265	        75012	        15449160.67	240	                64371.50
75013	        93761	        10416627.45	125	                83333.02
93762	        112510	        8452682.28	82	                103081.49
112511	        131259	        4923353.25	41	                120081.79
131260	        150008	        1116949.69	8	                139618.71
150009	        168756	        489622.37	3	                163207.46
168757	        NULL	        540019.12	3	                180006.37

There is a lot going on here, so let's go through it.

  1. The first thing I changed was the MIN and MAX to a single @range variable: since there were no negative values in the data, I could just start at 0 and work my way up. Hence, all I needed was one value, supplied by the MAX aggregate function.
  2. There are two CTEs in the query.  The first uses a CASE expression with the @range value to calculate the beginning of each data range.
  3. The second CTE exists because I didn't want to write a second CASE expression for the ending value of each data range.  Instead I use it to aggregate the data from the first CTE into a result set that lets me use the LEAD function in a final query that ties all of this together.
  4. I converted the beginRange and endRange values to INT types because it looks better.  In real life I'd probably do the same with the "Total Due" and "Average Sale" columns.
  5. As mentioned earlier, I used the LEAD function to get the endRange value instead of a second CASE expression.
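
If the LEAD step is the unfamiliar part, here's a plain-Python sketch of what `LEAD(beginRange) OVER (ORDER BY beginRange) - 1` computes, using the first few boundary values from the result set above:

```python
# Each row's endRange is the next row's beginRange minus one; the last row
# gets None, standing in for the NULL that LEAD returns by default.
begin_ranges = [0, 18767, 37516, 56265, 75013]  # first few boundaries from above

rows = []
for i, begin in enumerate(begin_ranges):
    nxt = begin_ranges[i + 1] if i + 1 < len(begin_ranges) else None
    rows.append((begin, nxt - 1 if nxt is not None else None))

print(rows)  # (beginRange, endRange) pairs
```

That trailing None is exactly why the last endRange in the result set shows as NULL.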

Once again, I don't think this specific query is all that useful.  It's too broad across the entire data range when it would really need to focus on the lower order-total transactions.  However, this kind of analysis can be useful.  Very large or very small orders can distort data analysis in unexpected ways.  For instance, a sales rep can have his sales margin destroyed because he writes a large sale at a low profit.  If your commission structure includes a gross profit calculation, it is even possible that he could lose money on a sale that benefits the company.

Finally, this query is missing the most useful measure: profit margin.  I didn't include it because the AdventureWorks2012 data requires joins to get at the cost information, and this process was complex enough without it.  What I really wanted to show here is that it is possible to write a GROUP BY that dynamically groups data based on the data it's grouping.

Or should I say, discretize our data?

27 Jan

COALESCE And NULL Ordering

NULLs are a strange concept in that they equal nothing, so how do you order them? In SQL Server's case they come first. Other databases have options to sort them first or last, and the ANSI standard says either is OK as long as you pick one. Here is a quick example,

SELECT null AS col1
UNION
SELECT 1
UNION
SELECT 2
ORDER BY col1; -- This could be written as ORDER BY 1

col1
NULL
1
2

And that's that. Actually, the most relevant parts of this are that the alias can only be applied in the first SELECT, and that the ORDER BY, just like in a regular query, must appear last. This is all pretty standard stuff.
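
If you want to try this outside SQL Server, the same UNION runs unchanged in SQLite (via Python's sqlite3 module), which also sorts NULLs first in ascending order by default:

```python
import sqlite3

# Same UNION + ORDER BY as above, run in SQLite. Like SQL Server,
# SQLite puts NULLs first when sorting ascending.
conn = sqlite3.connect(":memory:")
rows = conn.execute(
    "SELECT NULL AS col1 UNION SELECT 1 UNION SELECT 2 ORDER BY col1"
).fetchall()
print(rows)  # [(None,), (1,), (2,)]
conn.close()
```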

But, what if you wanted to put the NULL into the middle of this result set because, well, just because.

The easiest way to do that would be to ORDER BY a COALESCE statement as follows:

SELECT null AS col1
UNION 
SELECT 1
UNION
SELECT 2
ORDER BY COALESCE(col1,1.5);
go

Except for the small problem that it returns the following errors,

Msg 207, Level 16, State 1, Line 6
Invalid column name 'col1'.
Msg 207, Level 16, State 1, Line 6
Invalid column name 'col1'.
Msg 104, Level 16, State 1, Line 6
ORDER BY items must appear in the select list if the statement contains a UNION, INTERSECT or EXCEPT operator.

OK, that's kind of obvious. What's interesting is that outside of a UNION, INTERSECT or EXCEPT you can do exactly what I just tried, and it will work. And the ORDER BY item does, in fact, appear in the select list. Fine, be that way!!!!!

So, what do we do? We use a Common Table Expression and call it a post.

WITH cte AS
(
SELECT null AS col1
UNION
SELECT 1
UNION
SELECT 2
)
SELECT *
FROM cte
ORDER BY COALESCE(col1, 1.5);

col1
1
NULL
2

I can't really imagine using this, at least not with a NULL, but it does show how to use a set operator with an alias and an ORDER BY, and it also shows that you can order a result set by an expression that isn't in the select list, unless you are using a UNION, INTERSECT or EXCEPT.
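
For what it's worth, the CTE-plus-COALESCE version ports directly to SQLite, so you can verify the middle-of-the-pack NULL from Python's sqlite3 module:

```python
import sqlite3

# The CTE workaround with the COALESCE trick: NULL is treated as 1.5
# for ordering purposes, so it lands between 1 and 2.
conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH cte AS (
        SELECT NULL AS col1
        UNION SELECT 1
        UNION SELECT 2
    )
    SELECT col1 FROM cte
    ORDER BY COALESCE(col1, 1.5)
""").fetchall()
print(rows)  # [(1,), (None,), (2,)]
conn.close()
```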

26 Jan

Correlated Subquery For Top N

In my last post I used a CTE on some terrible data (I know that, and I know this example is even worse) to produce a TOP (2) sales by region. I'm going to do the same thing on the same data.

Note: This works because the names are unique. If they weren't, I'd need a unique identifier for the rep names. But this isn't about normalized or tight data, it's about a correlated subquery. So, here's the data,

SELECT * FROM mysales

Region	Rep	Sales
reg1	Bob	1000
reg1	Bill	2000
reg1	Joe	3000
reg1	Frank	4000
reg2	Sam	5000
reg2	Jill	6000
reg2	Greg	7000
reg2	Harry	8000

and, again, we want the TOP 2 reps by region. So, here’s how the same query would look using a correlated subquery,

SELECT *
FROM mysales o1
WHERE o1.rep IN (SELECT TOP (2) o2.rep
FROM mysales o2
WHERE o1.region = o2.region
ORDER BY o2.sales DESC)
ORDER BY region, sales DESC;

Like a CTE, this query essentially runs two SELECT statements: the main query and the correlated subquery.  The first item of note is the IN operator, which matches any value in the result set the subquery returns.  The second thing to note is the WHERE clause in the subquery; it's how we link the two queries together.  In this case it says to run the subquery against the region in the outer query's current row.  This is how the query knows which region to search against.

To think of it another way, the main query is saying: give me the top 2 reps in each region, and it's doing it with the IN operator and the correlated WHERE clause.

All things considered, a CTE is probably easier to write and cleaner to read, but not all databases have that functionality available.  It's very probable that some day you'll need a correlated subquery for a problem like this.
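
If you want to experiment without a SQL Server instance, here's the same correlated subquery against the sample data in SQLite (via Python's sqlite3 module), with TOP (2) swapped for SQLite's LIMIT 2:

```python
import sqlite3

# Build the mysales sample data from the post, then run the correlated
# subquery. SQLite has no TOP, so the subquery uses ORDER BY ... LIMIT 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mysales (region TEXT, rep TEXT, sales INT)")
conn.executemany(
    "INSERT INTO mysales VALUES (?, ?, ?)",
    [("reg1", "Bob", 1000), ("reg1", "Bill", 2000), ("reg1", "Joe", 3000),
     ("reg1", "Frank", 4000), ("reg2", "Sam", 5000), ("reg2", "Jill", 6000),
     ("reg2", "Greg", 7000), ("reg2", "Harry", 8000)],
)
rows = conn.execute("""
    SELECT region, rep, sales
    FROM mysales o1
    WHERE o1.rep IN (SELECT o2.rep
                     FROM mysales o2
                     WHERE o1.region = o2.region
                     ORDER BY o2.sales DESC LIMIT 2)
    ORDER BY region, sales DESC
""").fetchall()
print(rows)
conn.close()
```

Same result as the T-SQL version: the top two reps per region, highest sales first.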

25 Jan

Top N By Group With Ranking

This is a continuation of my earlier article on LIMIT and TOP.  It's a similar exercise, but a bit more complex.  Essentially, at some point, probably in an interview, you'll get asked to write a query that returns the TOP 2 of something ranked by something.  That's pretty vague on my part, so let's be a bit more specific with some data.  Suppose you have a simple data set like the following:

SELECT * FROM mysales

Region	Rep	Sales
reg1	Bob	1000
reg1	Bill	2000
reg1	Joe	3000
reg1	Frank	4000
reg2	Sam	5000
reg2	Jill	6000
reg2	Greg	7000
reg2	Harry	8000

Now your manager comes to you with a request to see the top 2 sales reps by region. If you use a plain TOP (n) you'll only get the top 2 reps overall, not per region. There are other ways to do this, but this is the most common, and easiest, on the SQL Server side of the house.

WITH CTE
AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC) AS rowNum, Region, Rep, Sales
FROM mysales
)
SELECT rowNum, Region, Rep, Sales
FROM CTE
WHERE rowNum IN (1, 2);

rowNum	Region	Rep	Sales
1	reg1	Frank	4000
2	reg1	Joe	3000
1	reg2	Harry	8000
2	reg2	Greg	7000

What we're doing here is creating a common table expression that assigns a row number to the query results, partitioned by region. In other words, the row numbers restart for each region. Then all we have to do is run a SELECT against the CTE and limit the results to row numbers 1 and 2, and we're done.
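
The same CTE runs nearly verbatim in SQLite, which also supports ROW_NUMBER() in builds 3.25 and later, so you can try it from Python's sqlite3 module:

```python
import sqlite3

# Build the mysales sample data from the post, then run the ROW_NUMBER()
# CTE. Window functions require SQLite 3.25+ (bundled with modern Python).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mysales (region TEXT, rep TEXT, sales INT)")
conn.executemany(
    "INSERT INTO mysales VALUES (?, ?, ?)",
    [("reg1", "Bob", 1000), ("reg1", "Bill", 2000), ("reg1", "Joe", 3000),
     ("reg1", "Frank", 4000), ("reg2", "Sam", 5000), ("reg2", "Jill", 6000),
     ("reg2", "Greg", 7000), ("reg2", "Harry", 8000)],
)
rows = conn.execute("""
    WITH cte AS (
        SELECT ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC) AS rowNum,
               region, rep, sales
        FROM mysales
    )
    SELECT rowNum, region, rep, sales
    FROM cte
    WHERE rowNum IN (1, 2)
    ORDER BY region, rowNum
""").fetchall()
print(rows)
conn.close()
```

An explicit ORDER BY is added at the end because, unlike the pasted result above, neither engine guarantees row order without one.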