18 May

GROUP BY, Again, This Time It’s Final, No Really It Is.

I think this will be the last of these.  I’ve gotten four posts out of GROUP BY and this time I’m going to double-dip, with the HAVING clause.  So, let’s start with SQL Server and blow things up.

SQL Server

SELECT actlevel AL, SUM(height) AS 'sumHeight'
FROM junk
GROUP BY AL
HAVING SUM(height) > 400;

Msg 207, Level 16, State 1, Line 3
Invalid column name 'AL'.

SELECT actlevel AL, SUM(height) AS 'sumHeight'
FROM junk
GROUP BY actlevel
HAVING sumHeight > 400;

Msg 207, Level 16, State 1, Line 4
Invalid column name 'sumHeight'.

Both of these errors are saying that you can’t use an alias in a GROUP BY or HAVING clause. Honestly, this is kind of the expected behavior if you understand how SQL theoretically works (another post): GROUP BY and HAVING are evaluated before the SELECT list, so the aliases simply don’t exist yet.
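
For completeness, here is a version that should keep SQL Server happy, repeating the expressions instead of the aliases:

SELECT actlevel AS AL, SUM(height) AS 'sumHeight'
FROM junk
GROUP BY actlevel
HAVING SUM(height) > 400;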

However, the other two guys I write about are just fine with it.

MySQL and SAS

mysql> SELECT actlevel AL, sum(height) as mysum
    -> FROM junk
    -> GROUP BY AL
    -> HAVING mysum > 400;
+------+-------+
| AL   | mysum |
+------+-------+
| HIGH |   491 |
| LOW  |   465 |
| MOD  |   477 |
+------+-------+
3 rows in set (0.00 sec)

proc sql;
SELECT actlevel AL, sum(height) AS mysum
FROM mysql.admit
GROUP BY AL
HAVING mysum > 400;
quit;

ERROR 22-322: Syntax error, expecting one of the following: !, !!, &, (, *, **, +, ',', -, '.', 
/, <, <=, <>, =, >, >=, ?, AND, AS, CONTAINS, EQ, EQT, FROM, GE, GET, GT, GTT, LE, LET, LIKE, LT, 
LTT, NE, NET, OR, ^=, |, ||, ~=.

Ah, SAS, the 4th grade comedian that drove the teacher crazy. SAS requires the AS keyword.

proc sql;
SELECT actlevel AS AL, sum(height) as mysum
FROM mysql.admit
GROUP BY AL
HAVING mysum > 400;
quit;

AL       mysum
--------------
HIGH       491
LOW        465
MOD        477

So, that’s all good. I can end now, right? I mean, I wouldn’t be tempted to do something exotic, just to see what SAS, or maybe even MySQL, does with it, would I? Nope, not me.

mysql> SELECT actlevel 'AL', sum(height) as mysum
    -> FROM junk
    -> GROUP BY 'AL'
    -> HAVING mysum > 400;
+------+-------+
| AL   | mysum |
+------+-------+
| HIGH |  1433 |
+------+-------+
1 row in set (0.00 sec)

Err, OK. Maybe this will be better.

mysql> SELECT actlevel AL, sum(height) as 'mysum'
    -> FROM junk
    -> GROUP BY AL
    -> HAVING 'mysum' > 400;
Empty set, 1 warning (0.00 sec)

Oops! MySQL, ever the professionals at making it clear what went wrong. But, I wonder how SAS will handle this code.

proc sql;
SELECT actlevel 'AL', sum(height) as mysum
FROM mysql.admit
GROUP BY 'AL'
HAVING mysum > 400;
quit;

AL       mysum
--------------
HIGH      1433
HIGH      1433
LOW       1433
MOD       1433
LOW       1433
HIGH      1433
MOD       1433
MOD       1433
LOW       1433
LOW       1433
HIGH      1433
LOW       1433
MOD       1433
HIGH      1433
MOD       1433
LOW       1433
MOD       1433
HIGH      1433
LOW       1433
HIGH      1433
MOD       1433

Okey, dokey.

And,
proc sql;
SELECT actlevel AL, sum(height) as 'mysum'
FROM junk
GROUP BY AL
HAVING 'mysum' > 400;
quit;

ERROR 22-322: Syntax error, expecting one of the following: !, !!, &, (, *, **, +, ',', -, 
'.', /, <, <=, <>, =, >, >=, ?, AND, AS, CONTAINS, EQ, EQT, FROM, GE, GET, GT, GTT, LE, LET, 
LIKE, LT, LTT, NE, NET, OR, ^=, |, ||, ~=.

Let’s try this, instead:

proc sql;
SELECT actlevel AL, sum(height) as 'mybum'
FROM junk
GROUP BY AL
HAVING 'mybum' > 400;
quit;

ERROR: Expression using greater than (>) has components that are of different data types.

Really? No, Really? SAS is just a giant box of endless posts.

Well, I guess it has to be this way.

proc sql;
SELECT actlevel AS AL, sum(height) AS mybum
FROM mysql.admit
GROUP BY AL
HAVING mybum > 400;
quit;

AL       mybum
--------------
HIGH       491
LOW        465
MOD        477

Or, I suppose, you could just write it like you are supposed to.

proc sql;
SELECT actlevel AS AL, sum(height) AS mybum
FROM mysql.admit
GROUP BY actlevel
HAVING sum(height) > 400;
quit;

AL       mybum
--------------
HIGH       491
LOW        465
MOD        477

I just wish I could trust SAS to write the Advanced certification exam that cleanly, but after the Base exam I simply don’t trust them not to use all of this stuff.

I’m out!

15 May

More GROUP BY, SQL Server vs SAS and MySQL

This is pretty clean, and I’ve touched on it before, so it shouldn’t be a surprise, but it’s something I’d have missed on a SAS test, so it gets a post.  We’ll start with SQL Server.

SQL Server

SELECT actlevel, SUM(height) AS sumHeight
FROM junk
GROUP BY actlevel;

actlevel	sumHeight
HIGH	        491
LOW 	        465
MOD 	        477

This is clean and all three of my favorites will behave with this SQL.

However, let’s change it just a bit.

SELECT actlevel, sum(height) AS sumHeight
FROM junk
GROUP BY 1;

Msg 164, Level 15, State 1, Line 3
Each GROUP BY expression must contain at least one column that is not an outer reference.

I suspect that this is the result most people would expect with that code; certainly the Microsoft guys would. However, the other two platforms seem to disagree.
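
For what it’s worth, SQL Server does accept positional ordinals in an ORDER BY, just not in a GROUP BY, so something like this should run without complaint:

SELECT actlevel, SUM(height) AS sumHeight
FROM junk
GROUP BY actlevel
ORDER BY 1;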

MySQL

mysql> SELECT actlevel, sum(height) AS sumHeight
    -> FROM junk
    -> GROUP BY 1;
+----------+-----------+
| actlevel | sumHeight |
+----------+-----------+
| HIGH     |       491 |
| LOW      |       465 |
| MOD      |       477 |
+----------+-----------+
3 rows in set (0.00 sec)

That’s pretty clean, and notice how it orders the actlevel column. And just for kicks, this happens to work.

mysql> SELECT actlevel, sum(height) AS sumHeight
    -> FROM junk
    -> GROUP BY 1 DESC;
+----------+-----------+
| actlevel | sumHeight |
+----------+-----------+
| MOD      |       477 |
| LOW      |       465 |
| HIGH     |       491 |
+----------+-----------+
3 rows in set (0.00 sec)

Now, that other guy.

SAS

proc sql;
SELECT actlevel, SUM(height) AS sumHeight
FROM mysql.admit
GROUP BY 1;
quit;

Act
Level  sumHeight
----------------
HIGH         491
LOW          465
MOD          477

And just for kicks, you can give the column a display name with the LABEL= option as follows:

proc sql;
SELECT actlevel, SUM(height) label='sumHeight'
FROM mysql.admit
GROUP BY 1;
quit;

However, you cannot use the DESC keyword here, just because.

13 May

GROUP BY, Without A Group By

A lot of what I do on this site falls into the “what the heck” category.  By that I mean, I just want to do things in a different and strange fashion.  Often, like the stuff in this article, there’s no reason to do it the ways I’m going to: they aren’t faster, they aren’t better, and the standard versions are often easier to understand.

So why do them? This probably stems from my martial arts days. In my particular art there were two somewhat extreme approaches to it. One group felt that you could flow, literally from the beginning, and discover the art. The other group, which I’ll call the 1.2.3.4.5 group, felt that you spend decades doing something, and then, and only then, can you explore, because you have the basics down.

For the record, I think both groups failed miserably in their approach. From what I saw, if you never practiced the fundamentals, you never learned the fundamentals, but, if all you ever did were fundamentals, you never really developed your creativity. Both groups had a philosophical purity that fell flat on its face. As such, I tend to try to incorporate both into my practice, whatever that practice might be. Today, it’s the creative side.

Let’s dive in.  By now you all know my good friend, the militaryspending table,

mysql> desc militaryspending;
+----------+-------------+------+-----+---------+----------------+
| Field    | Type        | Null | Key | Default | Extra          |
+----------+-------------+------+-----+---------+----------------+
| country  | varchar(50) | NO   |     | NULL    |                |
| spending | double      | NO   |     | NULL    |                |
| gdpperc  | double      | NO   |     | NULL    |                |
| region   | varchar(50) | YES  |     | NULL    |                |
| id       | int(11)     | NO   | PRI | NULL    | auto_increment |
+----------+-------------+------+-----+---------+----------------+
5 rows in set (0.00 sec)

mysql> SELECT *
    -> FROM militaryspending LIMIT 5;
+------------+----------+---------+---------------+----+
| country    | spending | gdpperc | region        | id |
+------------+----------+---------+---------------+----+
| Seychelles |      9.3 |     1.3 | Africa        |  1 |
| Cape Verde |      9.7 |     0.5 | Africa        |  2 |
| Mauritius  |     10.1 |     0.1 | Africa        |  3 |
| Belize     |     15.7 |     1.1 | North America |  4 |
| Moldova    |     20.8 |     0.3 | Europe        |  5 |
+------------+----------+---------+---------------+----+
5 rows in set (0.00 sec)

So, let’s do a simple GROUP BY that shows total spending by region. We would write this, and this is how we should do this, as follows:

mysql> SELECT region, SUM(spending) AS totSpending
    -> FROM militaryspending
    -> GROUP BY region;
+---------------+-------------+
| region        | totSpending |
+---------------+-------------+
| Africa        |      120608 |
| Asia          |    428477.7 |
| Australia     |     28559.1 |
| Europe        |    312412.2 |
| North America |    743120.4 |
| South America |     65976.9 |
+---------------+-------------+
6 rows in set (0.00 sec)

Clean, straightforward, exactly what you should do, and it works everywhere. However, there are other ways to write that.

For instance, you can use a correlated subquery as follows:

mysql> SELECT DISTINCT region,
    -> (SELECT SUM(spending) FROM militaryspending WHERE a.region = region) AS totSpending
    -> FROM militaryspending a;
+---------------+-------------+
| region        | totSpending |
+---------------+-------------+
| Africa        |      120608 |
| Asia          |    428477.7 |
| Australia     |     28559.1 |
| Europe        |    312412.2 |
| North America |    743120.4 |
| South America |     65976.9 |
+---------------+-------------+
6 rows in set (0.01 sec)

This will work in most platforms. It’s standard stuff, albeit not ideal, and as I said in the title, there is no GROUP BY.

It also requires the use of DISTINCT to make it work.

mysql> SELECT region,
    -> (SELECT SUM(spending) FROM militaryspending WHERE a.region = region) AS totSpending
    -> FROM militaryspending a LIMIT 10;
+---------------+-------------+
| region        | totSpending |
+---------------+-------------+
| Africa        |      120608 |
| Africa        |      120608 |
| Africa        |      120608 |
| North America |    743120.4 |
| Europe        |    312412.2 |
| Africa        |      120608 |
| Asia          |    428477.7 |
| South America |     65976.9 |
| North America |    743120.4 |
| Africa        |      120608 |
+---------------+-------------+
10 rows in set (0.00 sec)

As you can see by that query, we’re running the subquery for each record in the database, then using DISTINCT to eliminate all the duplicate records. Wasteful, in this example, but it works, and except for the LIMIT clause, this works on SQL Server too.
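
If you want to see it, a sketch of the SQL Server flavor of that last query, swapping the LIMIT for a TOP:

SELECT TOP 10 region,
(SELECT SUM(spending) FROM militaryspending WHERE a.region = region) AS totSpending
FROM militaryspending a;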

In SQL Server you can use the OVER() clause to do something similar.

SELECT DISTINCT a.region, 
SUM(spending) OVER(PARTITION BY region) AS totSpending
FROM militaryspending a;

region	        totSpending
Africa	        120606
Asia	        428477
Australia	28559
Europe	        312410
North America	743119
South America	65976

You can read more about the OVER() clause at the link below.

http://msdn.microsoft.com/en-us/library/ms189461.aspx

This will be the last query I do; you could, literally, do this dozens of ways. What this query does is use a UNION to select the distinct region values from the militaryspending table. Those values are then used for a self-join to generate the total spending by region.

mysql> SELECT a.region, SUM(b.spending) AS totSpending
    -> FROM
    -> (
    -> SELECT region FROM militaryspending
    -> UNION
    -> SELECT region FROM militaryspending
    -> ) AS a
    -> LEFT JOIN militaryspending b
    -> ON a.region = b.region
    -> GROUP BY region;
+---------------+-------------+
| region        | totSpending |
+---------------+-------------+
| Africa        |      120608 |
| Asia          |    428477.7 |
| Australia     |     28559.1 |
| Europe        |    312412.2 |
| North America |    743120.4 |
| South America |     65976.9 |
+---------------+-------------+
6 rows in set (0.00 sec)

I would never do any of this, assuming viable alternatives, on a job that I was being paid to do, or even not paid to do, but as a creative practice this sort of thing is very useful. And, like it or not, SQL, and programming, are creative disciplines.

11 May

Horizontal Lists

Honestly, I just thought this was a fun exercise.  I’m not going to go into heavy detail on it, and my solution is essentially identical to the SQL Cookbook’s, except that I use variables.

So, here is some code, and the result.

mysql> SELECT MAX(north) AS 'North America', MAX(south) AS 'South America'
    -> FROM
    -> (
    -> SELECT
    -> IF (region = 'North America', country, '') AS north,
    -> IF (region = 'North America', @ctr := @ctr + 1, NULL) AS northCtr,
    -> IF (region = 'South America', country, '') AS south,
    -> IF (region = 'South America', @ctr1 := @ctr1 + 1, NULL) AS southCtr
    -> FROM militaryspending
    -> JOIN (SELECT @ctr := 0, @ctr1 := 0) as A
    -> WHERE region IN ('North America','South America')
    -> ORDER BY country
    -> ) AS B
    -> GROUP BY COALESCE(northCtr, southCtr);
+--------------------+---------------+
| North America      | South America |
+--------------------+---------------+
| Belize             | Argentina     |
| Canada             | Bolivia       |
| Dominican Republic | Brazil        |
| El Salvador        | Chile         |
| Guatemala          | Colombia      |
| Honduras           | Ecuador       |
| Mexico             | Guyana        |
| Nicaragua          | Paraguay      |
| United States      | Peru          |
|                    | Uruguay       |
|                    | Venezuela     |
+--------------------+---------------+
11 rows in set (0.00 sec)

What this does is surprisingly simple. The subquery builds a column for each of the two regions, orders the rows, and numbers the rows in each region with its own counter. The outer query selects the max record for each counter value, and the counter is what gets used for the GROUP BY. Finally, the COALESCE in the GROUP BY makes sure that we get whichever counter value applies to a given row.

When I first did this it didn’t seem all that useful, but suppose we wanted a report that ranked the spenders in each region? We could do that with the following code.

mysql> SELECT MAX(north) AS 'North America', MAX(south) AS 'South America'
    -> FROM
    -> (
    -> SELECT
    -> IF (region = 'North America', country, '') AS north,
    -> IF (region = 'North America', @ctr := @ctr + 1, NULL) AS northCtr,
    -> IF (region = 'South America', country, '') AS south,
    -> IF (region = 'South America', @ctr1 := @ctr1 + 1, NULL) AS southCtr,
    -> spending
    -> FROM militaryspending
    -> JOIN (SELECT @ctr := 0, @ctr1 := 0) as A
    -> WHERE region IN ('North America','South America')
    -> ORDER BY spending DESC
    -> ) AS B
    -> GROUP BY COALESCE(northCtr, southCtr);
+--------------------+---------------+
| North America      | South America |
+--------------------+---------------+
| United States      | Brazil        |
| Canada             | Colombia      |
| Mexico             | Chile         |
| Dominican Republic | Argentina     |
| El Salvador        | Venezuela     |
| Guatemala          | Ecuador       |
| Honduras           | Peru          |
| Nicaragua          | Uruguay       |
| Belize             | Bolivia       |
|                    | Paraguay      |
|                    | Guyana        |
+--------------------+---------------+
11 rows in set (0.00 sec)
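
And if you really did want to cap it at, say, the top 5 spenders per region, the counters make that easy; a sketch, adding a HAVING on the same COALESCE expression:

SELECT MAX(north) AS 'North America', MAX(south) AS 'South America'
FROM
(
SELECT
IF (region = 'North America', country, '') AS north,
IF (region = 'North America', @ctr := @ctr + 1, NULL) AS northCtr,
IF (region = 'South America', country, '') AS south,
IF (region = 'South America', @ctr1 := @ctr1 + 1, NULL) AS southCtr
FROM militaryspending
JOIN (SELECT @ctr := 0, @ctr1 := 0) as A
WHERE region IN ('North America','South America')
ORDER BY spending DESC
) AS B
GROUP BY COALESCE(northCtr, southCtr)
HAVING COALESCE(northCtr, southCtr) <= 5;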

Of course if you want to actually see the spending amounts then you will need to add two more columns.

mysql> SELECT MAX(north) AS 'North America',
    -> SUM(northspending) AS NorthSpending,
    -> MAX(south) AS 'South America',
    -> MAX(southSpending) AS 'South America'
    -> FROM
    -> (
    -> SELECT
    -> IF (region = 'North America', country, '') AS north,
    -> IF (region = 'North America', @ctr := @ctr + 1, NULL) AS northCtr,
    -> IF (region = 'North America', spending, '') AS northSpending,
    -> IF (region = 'South America', country, '') AS south,
    -> IF (region = 'South America', @ctr1 := @ctr1 + 1, NULL) AS southCtr,
    -> IF (region = 'South America', spending, '') AS southSpending,
    -> spending
    -> FROM militaryspending
    -> JOIN (SELECT @ctr := 0, @ctr1 := 0) as A
    -> WHERE region IN ('North America','South America')
    -> ORDER BY spending DESC
    -> ) AS B
    -> GROUP BY COALESCE(northCtr, southCtr);
+--------------------+---------------+---------------+---------------+
| North America      | NorthSpending | South America | South America |
+--------------------+---------------+---------------+---------------+
| United States      |        711421 | Brazil        | 35360         |
| Canada             |         24659 | Colombia      | 10957         |
| Mexico             |          6022 | Chile         | 8040          |
| Dominican Republic |           341 | Argentina     | 3295          |
| El Salvador        |           222 | Venezuela     | 2385          |
| Guatemala          |           200 | Ecuador       | 2308          |
| Honduras           |           185 | Peru          | 2098          |
| Nicaragua          |          54.7 | Uruguay       | 902           |
| Belize             |          15.7 | Bolivia       | 352           |
|                    |             0 | Paraguay      | 250           |
|                    |             0 | Guyana        | 29.9          |
+--------------------+---------------+---------------+---------------+
11 rows in set (0.00 sec)

And there you go, the countries sorted by region, with their spending sorted in descending order. I also mixed MAX and SUM for the spending aggregates intentionally; each counter group only holds one non-blank value per region, so it didn’t matter which I used, and I wanted to show that it didn’t matter.

Enjoy!

09 May

Finding The Next, Non-like, Item In A List

What a lousy title.  I really have no idea what to call this one.  It’s even, kind of, hard to explain.  On the SQL Team forums a user had a question about tracking entries and exits from a room.  However, each room might have multiple entries, and multiple exits, so you would want the first entry, then the first exit that was greater than the entry time.

Here’s the thread.

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=184712

This will be a SQL Server solution, for the most part, but I’ll show the MySQL version later on.  It’s pretty much the same except that you use an inline view.

CREATE TABLE mydates (id INT, checktime DATETIME, type CHAR(1));
INSERT INTO mydates VALUES 
('11','2013-01-01 09:00:00','I'),
('11','2013-01-01 09:10:00','I'),
('11','2013-01-01 09:11:00','I'),
('11','2013-01-01 12:00:00','O'),
('11','2013-01-01 12:02:00','O'),
('11','2013-01-01 14:15:00','I'),
('11','2013-01-01 14:45:00','O'),
('11','2013-01-01 15:00:00','I'),
('11','2013-01-01 15:03:00','I'),
('11','2013-01-01 18:00:00','O'),
('11','2013-01-01 18:00:00','O'),
('11','2013-01-02 09:10:00','I'),
('11','2013-01-02 09:11:00','I'),
('11','2013-01-02 18:00:00','O'),
('11','2013-01-02 18:05:00','O');

And you want to produce the following result.

id	timein	                timeout
11	2013-01-01 09:00:00.000	2013-01-01 12:00:00.000
11	2013-01-01 14:15:00.000	2013-01-01 14:45:00.000
11	2013-01-01 15:00:00.000	2013-01-01 18:00:00.000
11	2013-01-02 09:10:00.000	2013-01-02 18:00:00.000

In essence, the first I type attached to the first O type, then the next I type greater than that O type, and so on and so forth. Yeah, it’s weird, and it would be fair to ask why they have multiple in/out times, and to expect them to fix that problem, but you know how that works in the real world.

So, here is how I got there, and for the record, it was pretty close to this. I kind of step-by-stepped it until I had an answer.

Step 1: Realize I only needed the I types at first.

SELECT * 
FROM mydates
WHERE type = 'I';

id	checktime	        type
11	2013-01-01 09:00:00.000	I
11	2013-01-01 09:10:00.000	I
11	2013-01-01 09:11:00.000	I
11	2013-01-01 14:15:00.000	I
11	2013-01-01 15:00:00.000	I
11	2013-01-01 15:03:00.000	I
11	2013-01-02 09:10:00.000	I
11	2013-01-02 09:11:00.000	I

This really doesn’t help much, but, it gives a base to work with.

Step 2: Create a self-join to get the O types.

Essentially, I just wrote a left join that joined on everything where the types did not match. It doesn’t have to be a left join; a regular inner join would be fine. I have an old habit, because I did a lot of order entry queries in a past life, of reflexively writing left joins.

SELECT * 
FROM mydates a
LEFT JOIN mydates b
ON a.type <> b.type
AND a.id = b.id
WHERE a.type = 'I';

id	checktime	        type	id	checktime	        type
11	2013-01-01 09:00:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-02 18:05:00.000	O

Yeah, that’s a bit long, but don’t worry, it will clean up soon.

Anyways, what this has done is combine every type I with every type O, which is why it’s so long. We need to shorten it a bit.

Step 3: Only select type O times that are greater than a type I time.

SELECT * 
FROM mydates a
LEFT JOIN mydates b
ON a.type <> b.type
AND a.id = b.id
AND a.checktime < b.checktime
WHERE a.type = 'I';

id	checktime	        type	id	checktime	        type
11	2013-01-01 09:00:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-02 18:05:00.000	O

That helps but it still doesn’t get it narrowed down. However, the next step will.

Step 4: Use the MIN function to get the minimum type O time for each type I time.

You do this by rewriting the query to use a MIN function and a GROUP BY as follows:

SELECT a.id, a.checktime, a.type, b.id, MIN(b.checktime) AS timeout, b.type 
FROM mydates a
LEFT JOIN mydates b
ON a.type <> b.type
AND a.id = b.id
AND a.checktime < b.checktime
WHERE a.type = 'I'
GROUP BY a.id, a.checktime, a.type, b.id, b.type

id	checktime	        type	id	timeout    	        type
11	2013-01-01 09:00:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-02 18:00:00.000	O

Ah, better, now for each type I we have the first type O value greater than it.

Step 5: Clean up the earlier query.

Normally, I would do this last, but the WITH statement doesn’t lend itself to the step-by-step approach I’ve been using, so I’ll clean it up now.

SELECT a.id, a.checktime AS timein, MIN(b.checktime) AS timeout
FROM mydates a
LEFT JOIN mydates b
ON a.type <> b.type
AND a.id = b.id
AND a.checktime < b.checktime
WHERE a.type = 'I'
GROUP BY a.id, a.checktime 

id	timein	                timeout
11	2013-01-01 09:00:00.000	2013-01-01 12:00:00.000
11	2013-01-01 09:10:00.000	2013-01-01 12:00:00.000
11	2013-01-01 09:11:00.000	2013-01-01 12:00:00.000
11	2013-01-01 14:15:00.000	2013-01-01 14:45:00.000
11	2013-01-01 15:00:00.000	2013-01-01 18:00:00.000
11	2013-01-01 15:03:00.000	2013-01-01 18:00:00.000
11	2013-01-02 09:10:00.000	2013-01-02 18:00:00.000
11	2013-01-02 09:11:00.000	2013-01-02 18:00:00.000

Step 6: Select the minimum timein value based on the timeout value.

So, what we now have is a list of timeout values attached to a bunch of timein values. We simply run the same query we did earlier, only this time for the timein value and bang/done.

WITH cte AS
(
SELECT a.id, a.checktime AS timein, MIN(b.checktime) AS timeout
FROM mydates a
LEFT JOIN mydates b
ON a.type <> b.type
AND a.id = b.id
AND a.checktime < b.checktime
WHERE a.type = 'I'
GROUP BY a.id, a.checktime 
)
SELECT id, MIN(timein) AS timein, timeout
FROM cte
GROUP BY id, timeout

id	timein          	timeout
11	2013-01-01 09:00:00.000	2013-01-01 12:00:00.000
11	2013-01-01 14:15:00.000	2013-01-01 14:45:00.000
11	2013-01-01 15:00:00.000	2013-01-01 18:00:00.000
11	2013-01-02 09:10:00.000	2013-01-02 18:00:00.000

This might make more sense if I add a summary of the steps.

  1. Select all type I values.
  2. Join the type O values to the type I values.
  3. Restrict the query to only values where the type O time is greater than the type I time.
  4. Select the minimum type O value for each type I value.
  5. Select the minimum type I value for each type O value via a subquery/with statement.

The MySQL version, which is really the MySQL / SQL Server version, because it works just fine on both sides of the house is,

mysql> SELECT a.id, MIN(timein) AS timein, timeout
    -> FROM
    -> (
    -> SELECT a.id, a.checktime AS timein, MIN(b.checktime) AS timeout
    -> FROM mydates a
    -> LEFT JOIN mydates b
    -> ON a.type <> b.type
    -> AND a.id = b.id
    -> AND a.checktime < b.checktime
    -> WHERE a.type = 'I'
    -> GROUP BY a.id, a.checktime ) AS A
    -> GROUP BY a.id, timeout;
+------+---------------------+---------------------+
| id   | timein              | timeout             |
+------+---------------------+---------------------+
|   11 | 2013-01-01 09:00:00 | 2013-01-01 12:00:00 |
|   11 | 2013-01-01 14:15:00 | 2013-01-01 14:45:00 |
|   11 | 2013-01-01 15:00:00 | 2013-01-01 18:00:00 |
|   11 | 2013-01-02 09:10:00 | 2013-01-02 18:00:00 |
+------+---------------------+---------------------+
4 rows in set (0.00 sec)

I used the WITH statement earlier, and in the original post, because the Microsoft guys seem more comfortable with that, especially on a “new guy” forum like the one where I posted this. I also have to admit that I think it’s easier to write. Nesting subqueries gets a bit messy to read.

Hope you enjoyed it.

01 May

Removing numbers / not letters from a string

I’m still reading the SQL Cookbook and something he’s been doing has been bugging me.  Basically, when he strips numeric values from a string he kind of just says “sorry, can’t do it in MySQL/SQL Server”.  So, I’ve thought on this and I’ve finally figured it out with MySQL.  It’s a bit long, and a bit convoluted, but it works.  I’ll go through my process and how I got here.

Keep in mind that the reverse, removing letters, would also work just fine.

1. Get some data.

This will need a second table, because we’re going to rely on it to parse the string. To build some sample data quick I used the code found at this link.

http://datacharmer.blogspot.co.uk/2006/06/filling-test-tables-quickly.html

Using his code, it took .35 seconds to build a numbers table holding the sequence 1 to 1024.
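
If you’d rather not follow the link, a minimal sketch of the same idea: double an auto_increment table on itself (run the second INSERT ten times to get 1024 rows).

CREATE TABLE numbers (id INT NOT NULL PRIMARY KEY AUTO_INCREMENT);
INSERT INTO numbers VALUES (NULL);
INSERT INTO numbers (id) SELECT NULL FROM numbers;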

I’ll keep this part simple and create a sample table with, for now, a single value.

mysql> CREATE TABLE names
    -> (
    -> name varchar(20)
    -> );
Query OK, 0 rows affected (0.14 sec)

mysql> INSERT INTO names VALUES ('Nicole123 Eggert456');
Query OK, 1 row affected (0.02 sec)

She was on the front page of Yahoo, so she’s the lucky winner of finding me a name to use. I’m sure she’s very proud.

Anyways, we want to get her name out of that into something useful.

2. Parse each character of the string.

This is easy. You just read the string, character by character, by joining to the numbers table, and you’ve done a lot of the work, the easy part.

mysql> SELECT name, id
    -> FROM names
    -> INNER JOIN numbers
    -> ON CHAR_LENGTH(name) >= id;
+---------------------+----+
| name                | id |
+---------------------+----+
| Nicole123 Eggert456 |  1 |
| Nicole123 Eggert456 |  2 |
| Nicole123 Eggert456 |  3 |
| Nicole123 Eggert456 |  4 |
| Nicole123 Eggert456 |  5 |
| Nicole123 Eggert456 |  6 |
| Nicole123 Eggert456 |  7 |
| Nicole123 Eggert456 |  8 |
| Nicole123 Eggert456 |  9 |
| Nicole123 Eggert456 | 10 |
| Nicole123 Eggert456 | 11 |
| Nicole123 Eggert456 | 12 |
| Nicole123 Eggert456 | 13 |
| Nicole123 Eggert456 | 14 |
| Nicole123 Eggert456 | 15 |
| Nicole123 Eggert456 | 16 |
| Nicole123 Eggert456 | 17 |
| Nicole123 Eggert456 | 18 |
| Nicole123 Eggert456 | 19 |
+---------------------+----+
19 rows in set (0.00 sec)

Then all we need to do is use SUBSTRING to pull out the character at each position of the string. It’s literally this simple,

mysql> SELECT name, id, SUBSTRING(name, id, 1) as eachChar
    -> FROM names
    -> INNER JOIN numbers
    -> ON CHAR_LENGTH(name) >= id;
+---------------------+----+----------+
| name                | id | eachChar |
+---------------------+----+----------+
| Nicole123 Eggert456 |  1 | N        |
| Nicole123 Eggert456 |  2 | i        |
| Nicole123 Eggert456 |  3 | c        |
| Nicole123 Eggert456 |  4 | o        |
| Nicole123 Eggert456 |  5 | l        |
| Nicole123 Eggert456 |  6 | e        |
| Nicole123 Eggert456 |  7 | 1        |
| Nicole123 Eggert456 |  8 | 2        |
| Nicole123 Eggert456 |  9 | 3        |
| Nicole123 Eggert456 | 10 |          |
| Nicole123 Eggert456 | 11 | E        |
| Nicole123 Eggert456 | 12 | g        |
| Nicole123 Eggert456 | 13 | g        |
| Nicole123 Eggert456 | 14 | e        |
| Nicole123 Eggert456 | 15 | r        |
| Nicole123 Eggert456 | 16 | t        |
| Nicole123 Eggert456 | 17 | 4        |
| Nicole123 Eggert456 | 18 | 5        |
| Nicole123 Eggert456 | 19 | 6        |
+---------------------+----+----------+
19 rows in set (0.00 sec)

3. Get rid of everything, except a space, that isn’t a letter.

In order to do this I’ll simply add a where clause, using SUBSTRING, that cleans the data up.

mysql> SELECT name, id, SUBSTRING(name, id, 1) as eachChar
    -> FROM names
    -> INNER JOIN numbers
    -> ON CHAR_LENGTH(name) >= id
    -> WHERE SUBSTRING(name, id, 1) >= 'a'
    -> AND SUBSTRING(name, id, 1) <= 'Z'     
    -> OR SUBSTRING(name, id, 1) = ' ';
+---------------------+----+----------+
| name                | id | eachChar |
+---------------------+----+----------+
| Nicole123 Eggert456 |  1 | N        |
| Nicole123 Eggert456 |  2 | i        |
| Nicole123 Eggert456 |  3 | c        |
| Nicole123 Eggert456 |  4 | o        |
| Nicole123 Eggert456 |  5 | l        |
| Nicole123 Eggert456 |  6 | e        |
| Nicole123 Eggert456 | 10 |          |
| Nicole123 Eggert456 | 11 | E        |
| Nicole123 Eggert456 | 12 | g        |
| Nicole123 Eggert456 | 13 | g        |
| Nicole123 Eggert456 | 14 | e        |
| Nicole123 Eggert456 | 15 | r        |
| Nicole123 Eggert456 | 16 | t        |
+---------------------+----+----------+
13 rows in set (0.00 sec)

Man, I swear, we’re almost all the way there.

4. Use GROUP_CONCAT to reassemble the name.

mysql> SELECT name, id, GROUP_CONCAT(SUBSTRING(name, id, 1)) as newName
    -> FROM names
    -> INNER JOIN numbers
    -> ON CHAR_LENGTH(name) >= id
    -> WHERE SUBSTRING(name, id, 1) >= 'a'
    -> AND SUBSTRING(name, id, 1) <= 'Z'     
    -> OR SUBSTRING(name, id, 1) = ' ';
+---------------------+----+---------------------------+
| name                | id | newName                   |
+---------------------+----+---------------------------+
| Nicole123 Eggert456 |  1 | N,i,c,o,l,e, ,E,g,g,e,r,t |
+---------------------+----+---------------------------+
1 row in set (0.00 sec)

5. Use REPLACE to clean it up.

Yep, we just need to get rid of some commas.

mysql> SELECT name, id, REPLACE(GROUP_CONCAT(SUBSTRING(name, id, 1)),',','') as newName
    -> FROM names
    -> INNER JOIN numbers
    -> ON CHAR_LENGTH(name) >= id
    -> WHERE SUBSTRING(name, id, 1) >= 'a'
    -> AND SUBSTRING(name, id, 1) <= 'Z'     
    -> OR SUBSTRING(name, id, 1) = ' ';
+---------------------+----+---------------+
| name                | id | newName       |
+---------------------+----+---------------+
| Nicole123 Eggert456 |  1 | Nicole Eggert |
+---------------------+----+---------------+
1 row in set (0.00 sec)

Bam! I’m a stud.

I was really happy when I got this to work. I was sure that I was done. I was wrong. This next part is actually the messiest.

6. Add multiple records.

mysql> INSERT INTO names VALUES ('D*B* Cooper'),('Usain400Bolt');
Query OK, 2 rows affected (0.02 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> SELECT name, id, REPLACE(GROUP_CONCAT(SUBSTRING(name, id, 1)),',','') as newName
    -> FROM names
    -> INNER JOIN numbers
    -> ON CHAR_LENGTH(name) >= id
    -> WHERE SUBSTRING(name, id, 1) >= 'a'
    -> AND SUBSTRING(name, id, 1) <= 'Z'
    -> OR SUBSTRING(name, id, 1) = ' ';
+---------------------+----+---------------------------------+
| name                | id | newName                         |
+---------------------+----+---------------------------------+
| Nicole123 Eggert456 |  1 | Nicole EggertDB CooperUsainBolt |
+---------------------+----+---------------------------------+
1 row in set (0.00 sec)

O.o, that isn’t what I wanted, although it did remove the special characters and numbers from the strings. What has to happen here is that we need to sort and group the query. To do that, we change the GROUP_CONCAT to include an ORDER BY and add a GROUP BY to the code. And just to make it clear, I’m going to add a second GROUP_CONCAT so you can see the data play out better.

mysql> SELECT name,
    -> REPLACE(GROUP_CONCAT(SUBSTRING(name, id, 1) ORDER BY id),',','') as newName,
    -> GROUP_CONCAT(id ORDER BY id) as charOrder
    -> FROM names
    -> INNER JOIN numbers
    -> ON CHAR_LENGTH(name) >= id
    -> WHERE SUBSTRING(name, id, 1) >= 'a'
    -> AND SUBSTRING(name, id, 1) <= 'Z'     
    -> OR SUBSTRING(name, id, 1) = ' '
    -> GROUP BY name;
+---------------------+---------------+----------------------------------+
| name                | newName       | charOrder                        |
+---------------------+---------------+----------------------------------+
| D*B* Cooper         | DB Cooper     | 1,3,5,6,7,8,9,10,11              |
| Nicole123 Eggert456 | Nicole Eggert | 1,2,3,4,5,6,10,11,12,13,14,15,16 |
| Usain400Bolt        | UsainBolt     | 1,2,3,4,5,9,10,11,12             |
+---------------------+---------------+----------------------------------+
3 rows in set (0.00 sec)

Bang, there you go, all numbers and special characters, except spaces, removed. The charOrder column is there to show what characters have been removed. For instance, the DB Cooper record is missing characters 2 and 4.

There are a couple of other problems with this code. One fixable and the other not so much. The one that isn’t really fixable is UsainBolt. The hard truth is that we’re removing non-standard characters so the code really has no idea that this should be two words. I thought about this, a bit, but decided to leave it. My code meets the requirements of what I was striving for.

7.  Fix repeating records.

The other thing is what happens if for some reason you have two identical records? Here’s what this code would do.

mysql> INSERT INTO names VALUES ('D*B* Cooper');
Query OK, 1 row affected (0.04 sec)
mysql> SELECT name,
    -> REPLACE(GROUP_CONCAT(SUBSTRING(name, id, 1) ORDER BY id),',','') as newName,
    -> GROUP_CONCAT(id ORDER BY id) as charOrder
    -> FROM names
    -> INNER JOIN numbers
    -> ON CHAR_LENGTH(name) >= id
    -> WHERE SUBSTRING(name, id, 1) >= 'a'
    -> AND SUBSTRING(name, id, 1) <= 'Z'     
    -> OR SUBSTRING(name, id, 1) = ' '
    -> GROUP BY name;
+---------------------+--------------------+-----------------------------------------+
| name                | newName            | charOrder                               |
+---------------------+--------------------+-----------------------------------------+
| D*B* Cooper         | DDBB  CCooooppeerr | 1,1,3,3,5,5,6,6,7,7,8,8,9,9,10,10,11,11 |
| Nicole123 Eggert456 | Nicole Eggert      | 1,2,3,4,5,6,10,11,12,13,14,15,16        |
| Usain400Bolt        | UsainBolt          | 1,2,3,4,5,9,10,11,12                    |
+---------------------+--------------------+-----------------------------------------+
3 rows in set (0.00 sec)

As you can see the DB Cooper record is duplicated, more or less. This is actually fixed really easily. Just add an auto_increment / identity column to the names table and use that for your GROUP BY. For example,

mysql> ALTER TABLE names ADD id INT NOT NULL PRIMARY KEY AUTO_INCREMENT;
Query OK, 4 rows affected (0.17 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM names;
+---------------------+----+
| name                | id |
+---------------------+----+
| Nicole123 Eggert456 |  1 |
| D*B* Cooper         |  2 |
| Usain400Bolt        |  3 |
| D*B* Cooper         |  4 |
+---------------------+----+
4 rows in set (0.00 sec)

Now use the same code, only qualify the id columns and change the GROUP BY to names.id, as follows,

mysql> SELECT name,
    -> REPLACE(GROUP_CONCAT(SUBSTRING(name, numbers.id, 1) ORDER BY numbers.id),',','') as newName,
    -> GROUP_CONCAT(numbers.id ORDER BY numbers.id) as charOrder
    -> FROM names
    -> INNER JOIN numbers
    -> ON CHAR_LENGTH(name) >= numbers.id
    -> WHERE SUBSTRING(name, numbers.id, 1) >= 'a'
    -> AND SUBSTRING(name, numbers.id, 1) <= 'Z'     
    -> OR SUBSTRING(name, numbers.id, 1) = ' '
    -> GROUP BY names.id;
+---------------------+---------------+----------------------------------+
| name                | newName       | charOrder                        |
+---------------------+---------------+----------------------------------+
| Nicole123 Eggert456 | Nicole Eggert | 1,2,3,4,5,6,10,11,12,13,14,15,16 |
| D*B* Cooper         | DB Cooper     | 1,3,5,6,7,8,9,10,11              |
| Usain400Bolt        | UsainBolt     | 1,2,3,4,5,9,10,11,12             |
| D*B* Cooper         | DB Cooper     | 1,3,5,6,7,8,9,10,11              |
+---------------------+---------------+----------------------------------+
4 rows in set (0.00 sec)

And there you go, all the numbers / special characters removed from a string.

This, obviously, doesn’t work as-is in SQL Server, as they don’t have GROUP_CONCAT on that side of the house, but it can, at least, be done on the MySQL side.
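
That said, SQL Server does have the old FOR XML PATH('') concatenation trick, so a sketch of the same idea over there, assuming the numbers table exists on that side too, might look like this:

SELECT a.name,
       (SELECT SUBSTRING(a.name, n.id, 1)
        FROM numbers n
        WHERE n.id <= LEN(a.name)
        AND (SUBSTRING(a.name, n.id, 1) LIKE '[a-zA-Z]'
        OR SUBSTRING(a.name, n.id, 1) = ' ')
        ORDER BY n.id
        FOR XML PATH('')) AS newName
FROM names a;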

Or, I guess, I could just nest 10 REPLACE statements.
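
And honestly, for digits that’s not even a joke; ten nested REPLACEs work on both sides of the house:

SELECT REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
name,'0',''),'1',''),'2',''),'3',''),'4',''),'5',''),'6',''),'7',''),'8',''),'9','') AS newName
FROM names;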

30 Apr

Running Totals

In the SQL Cookbook the author gives an example of a running total.  I thought I’d do it a couple of other ways.

I’ll be using summarized data from the military spending data that I have used in earlier articles. Here is a summary of that.

mysql> SELECT region, SUM(spending) AS totspending
    -> FROM militaryspending
    -> GROUP BY region;
+---------------+-------------+
| region        | totspending |
+---------------+-------------+
| Africa        |      120608 |
| Asia          |    428477.7 |
| Australia     |     28559.1 |
| Europe        |    312412.2 |
| North America |    743120.4 |
| South America |     65976.9 |
+---------------+-------------+
6 rows in set (0.00 sec)

The SQL Cookbook uses the following for a running total, and I might add, it’s perfectly fine as a solution. Anyways, here it is.

mysql> SELECT a.region, SUM(a.spending) as regSpending,
    -> (SELECT SUM(spending) FROM militaryspending b WHERE a.region >= b.region) as rollingTot
    -> FROM militaryspending a
    -> GROUP BY a.region
    -> ORDER BY a.region;
+---------------+-------------+------------+
| region        | regSpending | rollingTot |
+---------------+-------------+------------+
| Africa        |      120608 |     120608 |
| Asia          |    428477.7 |   549085.7 |
| Australia     |     28559.1 |   577644.8 |
| Europe        |    312412.2 |     890057 |
| North America |    743120.4 |  1633177.4 |
| South America |     65976.9 |  1699154.3 |
+---------------+-------------+------------+
6 rows in set (0.00 sec)

This also works for SQL Server. What it does is use a correlated subquery to calculate a SUM based on the region value. This is quite clean. Another way you can do this is with a non-equi join (that gets spelled all kinds of different ways). The code for the join looks like this,

mysql> SELECT a.region, SUM(b.spending) AS rollingTot
    -> FROM militaryspending a
    -> JOIN militaryspending b
    -> ON a.region >= b.region
    -> GROUP BY a.region
    -> ORDER BY a.region;
+---------------+------------+
| region        | rollingTot |
+---------------+------------+
| Africa        |    4703712 |
| Asia          | 15374399.6 |
| Australia     |  1732934.4 |
| Europe        |   32042052 |
| North America | 14698596.6 |
| South America | 18690697.3 |
+---------------+------------+
6 rows in set (0.00 sec)

This tries to do the same thing as the SQL Cookbook example, but notice that the totals are wrong: because the join happens at the country level, every b row gets counted once for every a row in the region, so the sums are inflated, and individual region totals aren’t an option either. Anyways, all this does is change the correlated subquery in the SELECT to a JOIN. All things considered, it’s definitely option D in the list of the three options.
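
For the record, a sketch of a join version that does produce the running total, by aggregating each side down to one row per region first:

SELECT a.region, SUM(b.totSpending) AS rollingTot
FROM (SELECT region, SUM(spending) AS totSpending
FROM militaryspending GROUP BY region) a
JOIN (SELECT region, SUM(spending) AS totSpending
FROM militaryspending GROUP BY region) b
ON a.region >= b.region
GROUP BY a.region
ORDER BY a.region;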

However, in MySQL, there is an even easier way to do this, use a variable.

mysql> SELECT region, SUM(spending) AS regSpending,
    -> @rollingTot := @rollingTot + regSpending as rollingTot
    -> FROM militaryspending
    -> JOIN (SELECT @rollingTot := 0) a
    -> GROUP BY region
    -> ORDER BY region;
ERROR 1054 (42S22): Unknown column 'regSpending' in 'field list'

OK, that “should” work, and it would work, if this didn’t include a GROUP BY. The GROUP BY blows it up, so instead, we simply use a subquery as follows:

mysql> SELECT region, regSpending, @rollingTot := @rollingTot + regSpending as rollingTot
    -> FROM (
    -> SELECT region, SUM(spending) AS regSpending
    -> FROM militaryspending
    -> GROUP BY region
    -> ORDER BY region) a
    -> JOIN (SELECT @rollingTot := 0) b;
+---------------+-------------+------------+
| region        | regSpending | rollingTot |
+---------------+-------------+------------+
| Africa        |      120608 |     120608 |
| Asia          |    428477.7 |   549085.7 |
| Australia     |     28559.1 |   577644.8 |
| Europe        |    312412.2 |     890057 |
| North America |    743120.4 |  1633177.4 |
| South America |     65976.9 |  1699154.3 |
+---------------+-------------+------------+
6 rows in set (0.00 sec)

This is simply a basic GROUP BY that the main query reads from, using the variable to maintain the rolling total. It’s really that simple, though purists won’t like it. Oh, and it can be written without the join as follows,

mysql> SET @rollingTot = 0;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT region, regSpending, @rollingTot := @rollingTot + regSpending as rollingTot
    -> FROM (
    -> SELECT region, SUM(spending) AS regSpending
    -> FROM militaryspending
    -> GROUP BY region
    -> ORDER BY region) a;
+---------------+-------------+------------+
| region        | regSpending | rollingTot |
+---------------+-------------+------------+
| Africa        |      120608 |     120608 |
| Asia          |    428477.7 |   549085.7 |
| Australia     |     28559.1 |   577644.8 |
| Europe        |    312412.2 |     890057 |
| North America |    743120.4 |  1633177.4 |
| South America |     65976.9 |  1699154.3 |
+---------------+-------------+------------+
6 rows in set (0.00 sec)

Finally, just for comparison’s sake, I ran each through an EXPLAIN.

mysql> EXPLAIN SELECT region, regSpending, @rollingTot := @rollingTot + regSpending as rollingTot
    -> FROM (
    -> SELECT region, SUM(spending) AS regSpending
    -> FROM militaryspending
    -> GROUP BY region
    -> ORDER BY region) a
    -> JOIN (SELECT @rollingTot := 0) b;
+----+-------------+------------------+--------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table            | type   | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+-------------+------------------+--------+---------------+------+---------+------+------+---------------------------------+
|  1 | PRIMARY     | <derived3>       | system | NULL          | NULL | NULL    | NULL |    1 |                                 |
|  1 | PRIMARY     | <derived2>       | ALL    | NULL          | NULL | NULL    | NULL |    6 |                                 |
|  3 | DERIVED     | NULL             | NULL   | NULL          | NULL | NULL    | NULL | NULL | No tables used                  |
|  2 | DERIVED     | militaryspending | ALL    | NULL          | NULL | NULL    | NULL |  126 | Using temporary; Using filesort |
+----+-------------+------------------+--------+---------------+------+---------+------+------+---------------------------------+
4 rows in set (0.00 sec)

mysql> EXPLAIN SELECT a.region, SUM(a.spending) as regSpending,
    -> (SELECT SUM(spending) FROM militaryspending b WHERE a.region >= b.region) as rollingTot
    -> FROM militaryspending a
    -> GROUP BY a.region
    -> ORDER BY a.region;
+----+--------------------+-------+------+---------------+------+---------+------+------+---------------------------------+
| id | select_type        | table | type | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+--------------------+-------+------+---------------+------+---------+------+------+---------------------------------+
|  1 | PRIMARY            | a     | ALL  | NULL          | NULL | NULL    | NULL |  126 | Using temporary; Using filesort |
|  2 | DEPENDENT SUBQUERY | b     | ALL  | NULL          | NULL | NULL    | NULL |  126 | Using where                     |
+----+--------------------+-------+------+---------------+------+---------+------+------+---------------------------------+
2 rows in set (0.00 sec)

mysql> EXPLAIN SELECT a.region, SUM(b.spending) AS rollingTot
    -> FROM militaryspending a
    -> JOIN militaryspending b
    -> ON a.region >= b.region
    -> GROUP BY a.region
    -> ORDER BY a.region;
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
|  1 | SIMPLE      | a     | ALL  | NULL          | NULL | NULL    | NULL |  126 | Using temporary; Using filesort |
|  1 | SIMPLE      | b     | ALL  | NULL          | NULL | NULL    | NULL |  126 | Using where; Using join buffer  |
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
2 rows in set (0.00 sec)

Anyways, both the Cookbook version and the join had basically the same performance, which makes sense because they’re both doing, mostly, the same kind of work. The variable option, however, is faster, because it’s simply aggregating the totals of a single GROUP BY; it doesn’t have to do the join.  It would, by far, be the best option if you needed a rolling total without an aggregate.

Unfortunately, it’s only good on the MySQL side of the house.
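
Although, if you happen to be on SQL Server 2012 or newer, the windowed aggregates should get you a running total directly; a sketch:

SELECT region, SUM(spending) AS regSpending,
SUM(SUM(spending)) OVER (ORDER BY region) AS rollingTot
FROM militaryspending
GROUP BY region;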

08 Apr

GROUP BY, SAS vs SQL Server vs MySQL

Anyone who isn’t confused really doesn’t understand the situation.
Edward R. Murrow

It looks like I’m going to talk myself into doing the SAS Advanced certification.  I was going through the sample questions for the first chapter and I ran into the following: what happens when you use a GROUP BY without an aggregate function?  Well, unfortunately, this depends on the database, but I’m pretty sure that SAS expects a specific answer.  So, I’m going to go through SAS, MySQL and SQL Server in regards to ordering with a GROUP BY.

SQL Server

SELECT * INTO myjunk 
FROM
(
  SELECT 'A' AS Name, 1 AS Number
  UNION
  SELECT 'C', 3
  UNION 
  SELECT 'B', 2
) newTable;

SELECT Name, SUM(number) AS Number
FROM myjunk
GROUP BY name;

Name	Number
A	1
B	2
C	3

SELECT Name, Number
FROM myjunk
GROUP BY Name;

Msg 8120, Level 16, State 1, Line 1
Column 'myjunk.Number' is invalid in the select list because it is not contained in
either an aggregate function or the GROUP BY clause.

As you can see, SQL Server doesn’t like it when you use a GROUP BY without an aggregate.

MySQL

mysql> CREATE TABLE myjunk AS
    -> SELECT *
    -> FROM
    -> (
    ->   SELECT 'A' AS Name, 1 AS Number
    ->   UNION
    ->   SELECT 'C', 3
    ->   UNION
    ->   SELECT 'B', 2
    -> );
ERROR 1248 (42000): Every derived table must have its own alias
mysql> CREATE TABLE myjunk AS
    -> SELECT *
    -> FROM
    -> (
    ->   SELECT 'A' AS Name, 1 AS Number
    ->   UNION
    ->   SELECT 'C', 3
    ->   UNION
    ->   SELECT 'B', 2
    -> ) as newTable;
Query OK, 3 rows affected (0.10 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> SELECT Name, SUM(number) AS Number
    -> FROM myjunk
    -> GROUP BY name;
+------+--------+
| Name | Number |
+------+--------+
| A    |      1 |
| B    |      2 |
| C    |      3 |
+------+--------+
3 rows in set (0.00 sec)

mysql> SELECT Name, Number
    -> FROM myjunk
    -> GROUP BY Name;
+------+--------+
| Name | Number |
+------+--------+
| A    |      1 |
| B    |      2 |
| C    |      3 |
+------+--------+
3 rows in set (0.00 sec)

mysql> SELECT Name, Number
    -> FROM myjunk
    -> GROUP BY Name DESC;
+------+--------+
| Name | Number |
+------+--------+
| C    |      3 |
| B    |      2 |
| A    |      1 |
+------+--------+
3 rows in set (0.00 sec)

Ah, MySQL is much more fun. Not only does it turn the GROUP BY into an ORDER BY, but you can use the DESC keyword, and it works. That’s good stuff.

SAS

The Study Guide says that SAS will treat the GROUP BY as an ORDER BY, so here goes.

LIBNAME mysql "....";

ods listing;

PROC SQL;
      SELECT age, height, weight
      FROM mysql.admit
      GROUP BY height;
QUIT;

     Age    Height    Weight
----------------------------
      31        61       123
      28        62       118
      43        63       137
      22        63       139
      49        64       172
      43        65       123
      44        66       140
      34        66       152
      32        67       151
      41        67       141
      40        69       163
      30        69       147
      35        70       173
      51        71       158
      54        71       183
      60        71       191
      27        72       168
      47        72       173
      34        73       154
      25        75       188
      29        76       193

So, it does work. I tried the DESC keyword but it didn’t work. So, once again, 3 completely different approaches to the language.

  • SQL Server hard stops a GROUP BY without an aggregate function.
  • MySQL lets you not only order with a GROUP BY, but, you can use the DESC keyword.
  • SAS lets you order with it but you can’t use the DESC keyword.

Interesting side note, and maybe it means I missed something: DESC comes up in the auto-complete drop-down for SAS in the GROUP BY.  So, maybe it works?  Or, maybe since I’m working with a .dat file in the sample data, and not, say, SQL Server, the behavior is different?

28 Mar

SAS: Column Totals (SUM and BY)

Because I tend to think SQL first, I try to find a comparison that fits in that world, and the closest I can come to how SAS generates column totals is a SUM with a GROUP BY and a WITH ROLLUP (a sketch of that follows).  But even that only works for a simple query with a single column in the GROUP BY.  After that, here is an example of the use of the SAS SUM statement.
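
A sketch of that WITH ROLLUP comparison, assuming the same data lived in a MySQL table called mydata (the ROLLUP row carries the column totals, with a NULL Date):

SELECT Date, SUM(score1) AS score1, SUM(score2) AS score2,
SUM(score3) AS score3, SUM(score4) AS score4
FROM mydata
GROUP BY Date WITH ROLLUP;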

proc print data=mydata;
	sum score1 score2 score3 score4;
run;

Obs          Name          Age    Type     Date    score1    score2    score3    score4    score5
1     Alicia Grossman      13     c      17833      7.8       6.5       7.2       8.0       7.9 
2     Matthew Lee           9     D      17835      6.5       5.9       6.8       6.0       8.1 
3     Elizabeth Garcia     10     C      17834      8.9       7.9       8.5       9.0       8.8 
4     Lori Newcombe         6     D      17835      6.7       5.6       4.9       5.2       6.1 
5     Jose Martinez         7     d      17836      8.9       9.5      10.0       9.7       9.0 
6     Brian Williams       11     C      17834      7.8       8.4       8.5       7.9       8.0 
                                                  ======    ======    ======    ======
                                                   46.6      43.8      45.9      45.8

The BY statement works very similarly to a GROUP BY in that it segments the results by group.  However, there is a requirement: the data must be sorted, in ascending order, on the columns that are used with the BY statement. The following all produce errors:

proc print data=mydata;
	sum score1 score2 score3 score4;
	by Date;
run;
ERROR: Data set USER.MYDATA is not sorted in ascending sequence. The current BY group has Date = 17835 and the next BY group has Date = 17834.

proc sort data=mydata;
by descending Date;
run;

proc print data=mydata;
sum score1 score2 score3 score4;
by Date;
run;
ERROR: Data set USER.MYDATA is not sorted in ascending sequence. The current BY group has Date = 17836 and the next BY group has Date = 17835.

The following will work without errors (both examples produce the same result).

proc sort data=mydata;
	by Date;
run;

proc print data=mydata;
	by Date;
run;

proc sort data=mydata;
	by Date age;
run;

proc print data=mydata;
	by Date;
run;

Date=17833
Obs         Name          Age    Type    score1    score2    score3    score4    score5
1     Alicia Grossman     13     c        7.8       6.5       7.2        8        7.9 

Date=17834 
Obs          Name          Age    Type    score1    score2    score3    score4    score5
2     Elizabeth Garcia     10     C        8.9       7.9       8.5       9.0       8.8 
3     Brian Williams       11     C        7.8       8.4       8.5       7.9       8.0 

Date=17835 
Obs        Name         Age    Type    score1    score2    score3    score4    score5
4     Matthew Lee       9      D        6.5       5.9       6.8       6.0       8.1 
5     Lori Newcombe     6      D        6.7       5.6       4.9       5.2       6.1 

Date=17836
Obs        Name         Age    Type    score1    score2    score3    score4    score5
6     Jose Martinez     7      d        8.9       9.5       10        9.7        9

As you can see, the SUM statement is not necessary, but if it were included you would get a total for each BY section, as follows:

proc print data=mydata;
    sum score1 score2 score3 score4 score5;
	by Date;
run;

Date=17833

Obs         Name         Age    Type    score1    score2    score3    score4    score5
1    Alicia Grossman     13     c        7.8       6.5       7.2        8        7.9 

Date=17834
Obs          Name         Age    Type    score1    score2    score3    score4    score5
2    Elizabeth Garcia     10     C        8.9       7.9       8.5       9.0       8.8 
3    Brian Williams       11     C        7.8       8.4       8.5       7.9       8.0 
----                                      ------    ------    ------    ------    ------
Date                                      16.7      16.3      17.0      16.9      16.8 

Date=17835
Obs        Name         Age    Type    score1    score2    score3    score4    score5
4    Matthew Lee        9      D        6.5       5.9       6.8       6.0       8.1 
5    Lori Newcombe      6      D        6.7       5.6       4.9       5.2       6.1 
----                                   ------    ------    ------    ------    ------
Date                                   13.2      11.5      11.7      11.2      14.2 

Date=17836
Obs        Name         Age    Type    score1    score2    score3    score4    score5
6    Jose Martinez      7      d       8.9       9.5      10.0       9.7       9.0 
                                       ======    ======    ======    ======    ======
                                       46.6      43.8      45.9      45.8      47.9

Note that it does not print a subtotal when there is a single row in a group, and that it does give a grand total, designated by “======”.

You can also use the PAGEBY statement along with the BY statement (it requires the BY statement) to generate page breaks.  On my system the data looks exactly the same as the prior example; it simply prints with a page break between each section.
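
A minimal sketch of that, reusing mydata (which must still be sorted by Date first):

proc print data=mydata;
	sum score1 score2 score3 score4;
	by Date;
	pageby Date;
run;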

08 Mar

Grouping By An Unknown Range

Let’s imagine that you wanted to break out your order data in 10% increments from the smallest sale to the largest sale. And suppose that your data was actually suited to this sort of thing, and you didn’t have $0.09 widgets sitting right next to $90,000 Lamborghinis. And suppose this kind of analysis was actually useful for something besides a SQL GROUP BY example. OK, just suppose. Anyway, here is a process you might go through and a query that would satisfy the problem.

Step 1 – Generate a range for your data.

USE adventureworks2012;
GO
DECLARE @minval decimal(18,2);
DECLARE @maxval decimal(18,2);

SELECT @maxval = MAX(totaldue), @minval = MIN(totaldue)
FROM sales.salesorderheader;
SELECT @maxval AS 'Max', @minval AS 'Min';

Max	        Min
187487.83	1.52

Right off the bat we need to contact Houston, because a positive minimum is telling me that we have no negative amounts in our order data. In other words, we’ve never issued a single credit. In other words, our data is probably useless as real sales data (it will still work for what we imagine later, though).
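
If the data did include credits (a negative MIN), the adjustment would be to widen the range and offset each value instead of anchoring at 0; a hedged sketch of just that change:

USE adventureworks2012;
GO
DECLARE @minval decimal(18,2);
DECLARE @maxval decimal(18,2);
DECLARE @range decimal(18,2);

SELECT @maxval = MAX(totaldue), @minval = MIN(totaldue)
FROM sales.salesorderheader;

-- span the full range, credits included
SET @range = @maxval - @minval;

-- each bucket test then becomes (totaldue - @minval) / @range <= .10, and so on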

However, this is AdventureWorks, dammit, we don’t issue credits, and since it’s true in the data, I’ll return you to your regular programming.

Step 2 – Create a clause that segments our data range into 10% increments.

Here’s how you would do this. You create a CASE expression based on the range value, and, because the minimum is positive, we can anchor the bottom of the range at 0 and use the expression to segment the GROUP BY, as follows.

USE adventureworks2012;
GO
DECLARE @range decimal(18,2);

SELECT @range = MAX(totaldue)
FROM sales.salesorderheader;

WITH cte AS
(
SELECT CASE
	WHEN totaldue / @range <= .10 THEN '.0000 <= .10'
	WHEN totaldue / @range <= .20 THEN '.1001 <= .20'
	WHEN totaldue / @range <= .30 THEN '.2001 <= .30'
	WHEN totaldue / @range <= .40 THEN '.3001 <= .40'
	WHEN totaldue / @range <= .50 THEN '.4001 <= .50'
	WHEN totaldue / @range <= .60 THEN '.5001 <= .60'
	WHEN totaldue / @range <= .70 THEN '.6001 <= .70'
	WHEN totaldue / @range <= .80 THEN '.7001 <= .80'
	WHEN totaldue / @range <= .90 THEN '.8001 <= .90'
	ELSE '.9001 <= 1.0' END AS Range,
CAST(totaldue AS DECIMAL(18,2)) AS totaldue
FROM sales.salesorderheader
)
SELECT Range, 
CAST(SUM(totaldue) AS DECIMAL(18,2)) AS [Total Due], 
COUNT(totaldue) AS [Transaction Count], 
CAST(AVG(totaldue) AS DECIMAL(18,2)) AS [Average Sale]
FROM cte
GROUP BY Range;

Range	        Total Due	Transaction Count	Average Sale
.0000 <= .10	41521958.70	29858	                1390.65
.1001 <= .20	17921756.80	615	                29141.07
.2001 <= .30	22384650.62	490	                45682.96
.3001 <= .40	15449160.66	240	                64371.50
.4001 <= .50	10416627.45	125	                83333.02
.5001 <= .60	8452682.22	82	                103081.49
.6001 <= .70	4923353.28	41	                120081.79
.7001 <= .80	1116949.70	8	                139618.71
.8001 <= .90	489622.37	3	                163207.46
.9001 <= 1.0	540019.13	3	                180006.38

Does this tell us a lot? Actually, it does tell us a few things. First, most of our transactions are, relatively speaking, small, with a few large outliers in the data. If we really wanted to analyze this in some meaningful fashion we’d have to further segment the range. But this post really isn’t about analysis, it’s more about CASE, so I’m going to add an additional designation to the data so the ranges are clearer.

USE adventureworks2012;
GO
DECLARE @range decimal(18,2);

SELECT @range = MAX(totaldue)
FROM sales.salesorderheader;

WITH cte AS
(
SELECT CASE
    WHEN totaldue / @range <= .10 THEN 0*@range
    WHEN totaldue / @range <= .20 THEN .1001*@range
    WHEN totaldue / @range <= .30 THEN .2001*@range
    WHEN totaldue / @range <= .40 THEN .3001*@range
    WHEN totaldue / @range <= .50 THEN .4001*@range
    WHEN totaldue / @range <= .60 THEN .5001*@range
    WHEN totaldue / @range <= .70 THEN .6001*@range
    WHEN totaldue / @range <= .80 THEN .7001*@range
    WHEN totaldue / @range <= .90 THEN .8001*@range
    ELSE CAST(.9001*@range AS INT) 
END AS beginRange,
totaldue
FROM sales.salesorderheader
),
cte1 AS
(
SELECT beginRange, 
CAST(SUM(totaldue) AS DECIMAL(18,2)) AS [Total Due], 
COUNT(totaldue) AS [Transaction Count], 
CAST(AVG(totaldue) AS DECIMAL(18,2)) AS [Average Sale]
FROM cte
GROUP BY beginRange
)
SELECT CAST(beginRange AS INT) AS beginRange,
CAST(LEAD(beginRange) OVER (ORDER BY beginRange) - 1 AS INT) AS endRange,
[Total Due],
[Transaction Count],
[Average Sale]
FROM cte1
ORDER BY beginRange;

beginRange	endRange	Total Due	Transaction Count	Average Sale
0	        18766	        41521963.87	29858	                1390.65
18767	        37515	        17921756.81	615	                29141.07
37516	        56264	        22384650.61	490	                45682.96
56265	        75012	        15449160.67	240	                64371.50
75013	        93761	        10416627.45	125	                83333.02
93762	        112510	        8452682.28	82	                103081.49
112511	        131259	        4923353.25	41	                120081.79
131260	        150008	        1116949.69	8	                139618.71
150009	        168756	        489622.37	3	                163207.46
168757	        NULL	        540019.12	3	                180006.37

There is a lot going on here, so let’s go through it.

  1. The first thing I changed was the MIN and MAX to a single @range variable; since there were no negative values in the data I could just start at 0 and work my way up.  Hence, all I needed was one variable populated by the MAX aggregate function.
  2. There are two CTEs in the query.  The first uses a CASE expression, with the @range value, to calculate the beginning of each data range.
  3. The second CTE exists because I didn’t want to write a second CASE expression for the ending value in each data range.  Instead I use it to aggregate the data from the first CTE into a result set that lets me use the LEAD function in a final query that unites all of this.
  4. I converted the beginRange and endRange values to INT types because they look better.  In real life I’d probably do the same with the “Total Due” and “Average Sale” columns.
  5. As mentioned earlier, I used the LEAD function to get the endRange value instead of a second CASE expression (see the sketch after this list).

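If LEAD is new to you, here is a tiny standalone sketch of what it does; the VALUES rows are illustrative, lifted from the beginRange output above.

SELECT beginRange,
LEAD(beginRange) OVER (ORDER BY beginRange) - 1 AS endRange
FROM (VALUES (0), (18767), (37516)) AS t(beginRange);

Each row looks one row ahead in the ORDER BY sequence; the last row has nothing ahead of it, which is where the NULL endRange in the final result comes from.
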
Once again, I don’t think this specific query is all that useful.  It’s too broad across the entire data range when it would really need to focus on the lower totaldue transactions.  However, this kind of analysis can be useful.  Very large, or very small, orders can distort data analysis in unexpected ways.  For instance, a sales rep can have his sales margin destroyed because he writes a large sale at a low profit.  If your commission structure has a gross profit calculation, it is even possible that he could lose money on a sale that benefits the company.

Finally, this query is missing the most useful thing: profit margin.  I didn’t include it because the adventureworks2012 data requires joins to get at the cost information, and this process was complex enough without it.  What I really wanted to show here is that it is possible to dynamically write a GROUP BY to group data based on the data it’s grouping.

Or should I say, discretize our data?
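
And since we are discretizing: if equal-count buckets would serve as well as equal-width ranges, SQL Server has NTILE for exactly this. A minimal sketch, offered as an alternative rather than a replacement for the dynamic GROUP BY above:

WITH buckets AS
(
SELECT NTILE(10) OVER (ORDER BY totaldue) AS bucket, totaldue
FROM sales.salesorderheader
)
SELECT bucket,
CAST(MIN(totaldue) AS INT) AS beginRange,
CAST(MAX(totaldue) AS INT) AS endRange,
CAST(SUM(totaldue) AS DECIMAL(18,2)) AS [Total Due],
COUNT(*) AS [Transaction Count]
FROM buckets
GROUP BY bucket
ORDER BY bucket;

NTILE(10) deals each row into one of ten equal-count buckets, so the bucket boundaries come from the data itself.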