18 May

GROUP BY, Again, This Time It’s Final, No Really It Is.

I think this will be the last of these.  I’ve gotten four posts out of GROUP BY and this time I’m going to double-dip, with the HAVING clause.  So, lets start with SQL Server and blow things up.

SQL Server

SELECT actlevel AL, SUM(height) AS 'sumHeight'
FROM junk
GROUP BY AL
HAVING SUM(height) > 400;

Msg 207, Level 16, State 1, Line 3
Invalid column name 'AL'.

SELECT actlevel AL, SUM(height) AS 'sumHeight'
FROM junk
GROUP BY actlevel
HAVING sumHeight > 400;

Msg 207, Level 16, State 1, Line 4
Invalid column name 'sumHeight'.

Both of these errors are saying that you can’t use an alias in a GROUP BY or HAVING clause. Honestly, this is kind of the expected behavior if you understand how SQL theoretically works (another post).

However, the other two guys I write about are just fine with it.

MySQL and SAS

mysql> SELECT actlevel AL, sum(height) as mysum
    -> FROM junk
    -> GROUP BY AL
    -> HAVING mysum > 400;
+------+-------+
| AL   | mysum |
+------+-------+
| HIGH |   491 |
| LOW  |   465 |
| MOD  |   477 |
+------+-------+
3 rows in set (0.00 sec)

proc sql;
SELECT actlevel AL, sum(height) AS mysum
FROM mysql.admit
GROUP BY AL
HAVING mysum > 400;
quit;

ERROR 22-322: Syntax error, expecting one of the following: !, !!, &, (, *, **, +, ',', -, '.', 
/, <, <=, <>, =, >, >=, ?, AND, AS, CONTAINS, EQ, EQT, FROM, GE, GET, GT, GTT, LE, LET, LIKE, LT, 
LTT, NE, NET, OR, ^=, |, ||, ~=.

Ah, SAS, the 4th grade comedian that drove the teacher crazy. SAS requires the AS keyword.

proc sql;
SELECT actlevel AS AL, sum(height) as mysum
FROM mysql.admit
GROUP BY AL
HAVING mysum > 400;
quit;

AL       mysum
--------------
HIGH       491
LOW        465
MOD        477

So, that’s all good. I can end now, right. I mean, I wouldn’t be tempted to do something exotic, just to see what SAS, or maybe even MySQL, does with it, would I? Nope, not me.

mysql> SELECT actlevel 'AL', sum(height) as mysum
    -> FROM junk
    -> GROUP BY 'AL'
    -> HAVING mysum > 400;
+------+-------+
| AL   | mysum |
+------+-------+
| HIGH |  1433 |
+------+-------+
1 row in set (0.00 sec)

Err, OK. Maybe this will be better.

mysql> SELECT actlevel AL, sum(height) as 'mysum'
    -> FROM junk
    -> GROUP BY AL
    -> HAVING 'mysum' > 400;
Empty set, 1 warning (0.00 sec)

Oops! MySQL, ever the professionals at making it clear what went wrong. But, I wonder how SAS will handle this code.

proc sql;
SELECT actlevel 'AL', sum(height) as mysum
FROM mysql.admit
GROUP BY 'AL'
HAVING mysum > 400;
quit;

AL       mysum
--------------
HIGH      1433
HIGH      1433
LOW       1433
MOD       1433
LOW       1433
HIGH      1433
MOD       1433
MOD       1433
LOW       1433
LOW       1433
HIGH      1433
LOW       1433
MOD       1433
HIGH      1433
MOD       1433
LOW       1433
MOD       1433
HIGH      1433
LOW       1433
HIGH      1433
MOD       1433
Okey, dokey.

And,
proc sql;
SELECT actlevel AL, sum(height) as 'mysum'
FROM junk
GROUP BY AL
HAVING 'mysum' > 400;
quit;

ERROR 22-322: Syntax error, expecting one of the following: !, !!, &, (, *, **, +, ',', -, 
'.', /, <, <=, <>, =, >, >=, ?, AND, AS, CONTAINS, EQ, EQT, FROM, GE, GET, GT, GTT, LE, LET, 
LIKE, LT, LTT, NE, NET, OR, ^=, |, ||, ~=.

Lets try this, instead:

proc sql;
SELECT actlevel AL, sum(height) as 'mybum'
FROM junk
GROUP BY AL
HAVING 'mybum' > 400;
quit;

ERROR: Expression using greater than (>) has components that are of different data types.

Really? No, Really? SAS is just a giant box of endless posts.

Well, I guess it has to be this way.

proc sql;
SELECT actlevel AS AL, sum(height) AS mybum
FROM mysql.admit
GROUP BY AL
HAVING mybum > 400;
quit;

AL       mysum
--------------
HIGH       491
LOW        465
MOD        477

Or, I suppose, you could just write it like you are supposed to.

proc sql;
SELECT actlevel AS AL, sum(height) AS mybum
FROM mysql.admit
GROUP BY actlevel
HAVING sum(height) > 400;
quit;

AL       mysum
--------------
HIGH       491
LOW        465
MOD        477

I just wish I could trust SAS to write the SAS Advanced certification that cleanly but after the Base I simply don’t trust them not to use all of this stuff.

I’m out!

17 May

EXISTS vs NOT EXISTS

I’m going to break a string of posts, and not beat up SAS, for once.  In fact, this post is inspired by the SAS Advanced study guide which mentioned the topic and since I don’t remember writing about it, but probably have, here you go.

EXISTS and it’s best friend, NOT EXISTS, are used with a subquery in the WHERE clause (this is a lie, it can also be used in a HAVING clause, but I’m not doing that, today).  Essentially, they test to see if any results are returned by the subquery.  In a way, it’s a binary test, that says, “yes, something is there” or “no, nothing is there”.

A few examples, all in MySQL.

mysql> SELECT name
    -> FROM junk
    -> WHERE EXISTS (SELECT name FROM junk WHERE actlevel = 'HIGH');
+----------------+
| name           |
+----------------+
| Murray, W      |
| Almers, C      |
| Bonaventure, T |
| Johnson, R     |
| LaMance, K     |
| Jones, M       |
| Reberson, P    |
| King, E        |
| Pitts, D       |
| Eberhardt, S   |
| Nunnelly, A    |
| Oberon, M      |
| Peterson, V    |
| Quigley, M     |
| Cameron, L     |
| Underwood, K   |
| Takahashi, Y   |
| Derber, B      |
| Ivan, H        |
| Wilcox, E      |
| Warren, C      |
+----------------+
21 rows in set (0.00 sec)

Since, we know that at least one record has an actlevel of HIGH all the results are returned by the query.

If we change the query to NOT EXISTS, we won’t get any results.

mysql> SELECT *
    -> FROM junk
    -> WHERE NOT EXISTS (SELECT name FROM junk WHERE actlevel = 'HIGH');
Empty set (0.00 sec)

This is really clean at this level. It either does, or doesn’t. But we can make it a bit more interesting.

mysql> SELECT name
    -> FROM junk
    -> WHERE EXISTS (SELECT 1);
+----------------+
| name           |
+----------------+
| Murray, W      |
| Almers, C      |
| Bonaventure, T |
| Johnson, R     |
| LaMance, K     |
| Jones, M       |
| Reberson, P    |
| King, E        |
| Pitts, D       |
| Eberhardt, S   |
| Nunnelly, A    |
| Oberon, M      |
| Peterson, V    |
| Quigley, M     |
| Cameron, L     |
| Underwood, K   |
| Takahashi, Y   |
| Derber, B      |
| Ivan, H        |
| Wilcox, E      |
| Warren, C      |
+----------------+
21 rows in set (0.00 sec)

The condition returns a result, EXISTS is true, so you get the full result set. So, what happens when you test a condition that returns false?

mysql> SELECT name
    -> FROM junk
    -> WHERE EXISTS (SELECT 1 = 2);
+----------------+
| name           |
+----------------+
| Murray, W      |
| Almers, C      |
| Bonaventure, T |
| Johnson, R     |
| LaMance, K     |
| Jones, M       |
| Reberson, P    |
| King, E        |
| Pitts, D       |
| Eberhardt, S   |
| Nunnelly, A    |
| Oberon, M      |
| Peterson, V    |
| Quigley, M     |
| Cameron, L     |
| Underwood, K   |
| Takahashi, Y   |
| Derber, B      |
| Ivan, H        |
| Wilcox, E      |
| Warren, C      |
+----------------+
21 rows in set (0.00 sec)

We’re still good. The query came back with at least one row so we get all of them. What about NULL values?

mysql> SELECT name
    -> FROM junk
    -> WHERE EXISTS (SELECT NULL);
+----------------+
| name           |
+----------------+
| Murray, W      |
| Almers, C      |
| Bonaventure, T |
| Johnson, R     |
| LaMance, K     |
| Jones, M       |
| Reberson, P    |
| King, E        |
| Pitts, D       |
| Eberhardt, S   |
| Nunnelly, A    |
| Oberon, M      |
| Peterson, V    |
| Quigley, M     |
| Cameron, L     |
| Underwood, K   |
| Takahashi, Y   |
| Derber, B      |
| Ivan, H        |
| Wilcox, E      |
| Warren, C      |
+----------------+
21 rows in set (0.00 sec)

mysql> SELECT name
    -> FROM junk
    -> WHERE NOT EXISTS (SELECT NULL);
Empty set (0.00 sec)

They all work the same. If the result set has a row, even a single-column row with a NULL value you get the entire result set.

You can also use EXISTS with a correlated subquery. This is a bit more interesting because it can act like a WHERE clause in a query. This isn’t the most productive thing you could do but I find it, weirdly, fun.

mysql> SELECT name, actlevel
    -> FROM junk j
    -> WHERE EXISTS (SELECT name FROM junk WHERE j.actlevel = 'HIGH');
+-------------+----------+
| name        | actlevel |
+-------------+----------+
| Murray, W   | HIGH     |
| Almers, C   | HIGH     |
| Jones, M    | HIGH     |
| Nunnelly, A | HIGH     |
| Quigley, M  | HIGH     |
| Derber, B   | HIGH     |
| Wilcox, E   | HIGH     |
+-------------+----------+
7 rows in set (0.00 sec)

mysql> SELECT name, actlevel
    -> FROM junk
    -> WHERE actlevel = 'HIGH';
+-------------+----------+
| name        | actlevel |
+-------------+----------+
| Murray, W   | HIGH     |
| Almers, C   | HIGH     |
| Jones, M    | HIGH     |
| Nunnelly, A | HIGH     |
| Quigley, M  | HIGH     |
| Derber, B   | HIGH     |
| Wilcox, E   | HIGH     |
+-------------+----------+
7 rows in set (0.00 sec)

Because the query is correlated to the subquery in the WHERE clause it actually fires for each row. It’s definitely not an optimal solution, especially, in this case. Just for kicks, because it makes it clear, here are the EXPLAIN statements for each of those.  This is for the Microsoft people with question marks floating over their heads.

mysql> EXPLAIN SELECT name, actlevel
    -> FROM junk j
    -> WHERE EXISTS (SELECT name FROM junk WHERE j.actlevel = 'HIGH');
+----+--------------------+-------+------+---------------+------+---------+----+----+------------
| id | select_type        | table | type | possible_keys | key  | key_len |ref |rows|Extra      |
+----+--------------------+-------+------+---------------+------+---------+----+----+------------
|  1 | PRIMARY            | j     | ALL  | NULL          | NULL | NULL    |NULL| 21 |Using where|
|  2 | DEPENDENT SUBQUERY | junk  | ALL  | NULL          | NULL | NULL    |NULL| 21 |Using where|
+----+--------------------+-------+------+---------------+------+---------+----+----+------------
2 rows in set (0.00 sec)

mysql> EXPLAIN SELECT name, actlevel
    -> FROM junk
    -> WHERE actlevel = 'HIGH';
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | junk  | ALL  | NULL          | NULL | NULL    | NULL |   21 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)

You don’t need to be a MySQL expert to see the problem. You can be one of the kids in the ATT ad, except they all think moar is better. It isn’t in this case.  Less is better.  I’m not going into detail on an EXPLAIN, and this EXPLAIN doesn’t have enough detail to discuss anyways.  If you like to read more about explain you can do it at the link below..

You can read more about EXPLAIN here.

16 May

SAS: NMISS and another installment of SAS vs. The World

First, lets look at a specific function, NMISS.  NMISS is used to count the number of missing records.  Think of it as a NULL counter. First, some data:

proc sql;
SELECT *
FROM mysql.demog
quit;

ID         Age  Sex      Date
-----------------------------
A001        21  m    05/22/07
A002        32  m    06/15/06
A003        24  f    08/17/07
A004         .       01/27/06
A005        44  f    02/24/05
A007        39  m    11/11/05

Here is NMISS in action.

proc sql;
SELECT COUNT(age), NMISS(age), COUNT(*) - COUNT(age)
FROM mysql.demog
quit;

----------------------------
5         1         1

Don’t get mad, I, intentionally, didn’t alias the columns. So, each part of the SELECT did the following:

COUNT(age)

Counted all columns that did not contain a NULL.

NMISS(age)

Counted all NULL values.

COUNT(*) – COUNT(age)

This part used COUNT(*), which counts all the rows in the table, including NULLs and then subtracts all the records in the age column that are not NULL. It is the equivalent of the NMISS function.

So, why didn’t I alias the columns? Because my first try didn’t work. Here’s a sample that I ran in MySQL (SQL Server does the same thing.

mysql> select count(age) AS 'count(age)' FROM junk;
+------------+
| count(age) |
+------------+
|         21 |
+------------+
1 row in set (0.00 sec)

All is good in the world. Now, let’s try that with SAS, because, as we all know by now, something weird is about to happen.

proc sql;
SELECT count(*) AS 'count(*)'
FROM mysql.demog
quit;

                              __________
                              22
ERROR 22-322: Expecting a name.

Yup, it errors. I guess it doesn’t like aliasing a function. So, what’s the workaround? You use the label command, which, I guess, explains why they have it.

proc sql;
SELECT count(*) label='count(*)'
FROM mysql.demog
quit;

count(*)
--------
       6

Man, SAS, is just a special little snowflake.

15 May

More GROUP BY, SQL Server vs SAS and MySQL

This is pretty clean, and I’ve touched on it before, so it shouldn’t be a surprise, but, it’s something I’d have missed on a SAS test so it gets a post.  We’ll start with SQL Server.

SQL Server

SELECT actlevel, sum(height)
FROM junk
GROUP BY actlevel;

actlevel	sumHeight
HIGH	        491
LOW 	        465
MOD 	        477

This is clean and all three of my favorites will behave with this SQL.

However, lets change it just a bit.

SELECT actlevel, sum(height) AS sumHeight
FROM junk
GROUP BY 1;

Msg 164, Level 15, State 1, Line 3
Each GROUP BY expression must contain at least one column that is not an outer reference.

I suspect, that this is the result most people would expect with that code, certainly the Microsoft guys would. However, the other two platforms seem to disagree.

MySQL

mysql> SELECT actlevel, sum(height) AS sumHeight
    -> FROM junk
    -> GROUP BY 1;
+----------+-----------+
| actlevel | sumHeight |
+----------+-----------+
| HIGH     |       491 |
| LOW      |       465 |
| MOD      |       477 |
+----------+-----------+
3 rows in set (0.00 sec)

That’s pretty clean, and notice how it orders the actlevel column. And just for kicks, this happens to work.

mysql> SELECT actlevel, sum(height) AS sumHeight
    -> FROM junk
    -> GROUP BY 1 DESC;
+----------+-----------+
| actlevel | sumHeight |
+----------+-----------+
| MOD      |       477 |
| LOW      |       465 |
| HIGH     |       491 |
+----------+-----------+
3 rows in set (0.00 sec)

Now, that other guy.

SAS

proc sql;
SELECT actlevel, SUM(height) AS sumHeight
FROM mysql.admit
GROUP BY 1;
quit;

Act
Level  sumHeight
----------------
HIGH         491
LOW          465
MOD          477

And just for kicks, you can alias a column with the label command as follows:

proc sql;
SELECT actlevel, SUM(height) label='sumHeight'
FROM mysql.admit
GROUP BY 1;
quit;

However, you cannot use the DESC keyword here, just, because.

14 May

More GROUP BY, SAS vs MySQL vs SQL Server

This is a post about a few more GROUP BY quirks on the three platforms.

First, some data.

CREATE TABLE junk
(
	id int,
	name varchar(40),
	sex char(1),
	age int,
	mydate int,
	height int,
	weight int,
	actlevel char(4),
	fee float
);

INSERT INTO junk VALUES ('2458','Murray, W','M','27','1','72','168','HIGH','85.20');
INSERT INTO junk VALUES ('2462','Almers, C','F','34','3','66','152','HIGH','124.80');
INSERT INTO junk VALUES ('2501','Bonaventure, T','F','31','17','61','123','LOW','149.75');
INSERT INTO junk VALUES ('2523','Johnson, R','F','43','31','63','137','MOD','149.75');
INSERT INTO junk VALUES ('2539','LaMance, K','M','51','4','71','158','LOW','124.80');
INSERT INTO junk VALUES ('2544','Jones, M','M','29','6','76','193','HIGH','124.80');
INSERT INTO junk VALUES ('2552','Reberson, P','F','32','9','67','151','MOD','149.75');
INSERT INTO junk VALUES ('2555','King, E','M','35','13','70','173','MOD','149.75');
INSERT INTO junk VALUES ('2563','Pitts, D','M','34','22','73','154','LOW','124.80');
INSERT INTO junk VALUES ('2568','Eberhardt, S','F','49','27','64','172','LOW','124.80');
INSERT INTO junk VALUES ('2571','Nunnelly, A','F','44','19','66','140','HIGH','149.75');
INSERT INTO junk VALUES ('2572','Oberon, M','F','28','17','62','118','LOW','85.20');
INSERT INTO junk VALUES ('2574','Peterson, V','M','30','6','69','147','MOD','149.75');
INSERT INTO junk VALUES ('2575','Quigley, M','F','40','8','69','163','HIGH','124.80');
INSERT INTO junk VALUES ('2578','Cameron, L','M','47','5','72','173','MOD','124.80');
INSERT INTO junk VALUES ('2579','Underwood, K','M','60','22','71','191','LOW','149.75');
INSERT INTO junk VALUES ('2584','Takahashi, Y','F','43','29','65','123','MOD','124.80');
INSERT INTO junk VALUES ('2586','Derber, B','M','25','23','75','188','HIGH','85.20');
INSERT INTO junk VALUES ('2588','Ivan, H','F','22','20','63','139','LOW','85.20');
INSERT INTO junk VALUES ('2589','Wilcox, E','F','41','16','67','141','HIGH','149.75');
INSERT INTO junk VALUES ('2595','Warren, C','M','54','7','71','183','MOD','149.75');

So, let’s get started by looking at two queries that don’t work.

SQL Server

SELECT actlevel, SUM(age,height)
FROM junk
GROUP BY actlevel

Msg 174, Level 15, State 1, Line 1
The SUM function requires 1 argument(s).

SQL Server does a nice job here. It delineates the error, clearly, and this is expected behavior.

MySQL

mysql> SELECT actlevel, SUM(age,height)
    -> FROM junk
    -> GROUP BY actlevel;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that
corresponds to your MySQL server version for the right syntax to use near 'heigh
t)
FROM junk
GROUP BY actlevel' at line 1

That isn’t the most descriptive error you’ll run across, however, it is expected behavior. Now, here comes SAS.

SAS

proc sql;
    select actlevel, sum(age, height)
	from mysql.admit
	group by actlevel;
quit;

Act
Level          
---------------
HIGH         99
HIGH        108
HIGH        100
HIGH        109
HIGH        110
HIGH        105
HIGH        100
LOW          92
LOW          90
LOW         113
LOW          85
LOW         107
LOW         122
LOW         131
MOD          99
MOD         119
MOD          99
MOD         125
MOD         106
MOD         105
MOD         108

WTF? OK, just to confirm that there is, some, normal behavior in SAS.

proc sql;
    select actlevel, sum(height)
	from mysql.admit
	group by actlevel;
quit;

Act
Level          
---------------
HIGH        491
LOW         465
MOD         477

I’m not sure what SAS thinks it’s doing, I’ll have to dig into it, but, man, that is very nonstandard stuff.

All three variants do what they should with the following code:

SELECT actlevel, SUM(age+height)
FROM junk
GROUP BY actlevel;

or the SAS variant

proc sql;
SELECT actlevel, SUM(age+height)
FROM mysql.admit
GROUP BY actlevel;
quit;

Act
Level          
---------------
HIGH        731
LOW         740
MOD         761

This is really more than enough for a post but I’m going one better tonight and providing bonus time.

SQL Server

SELECT actlevel, age, SUM(height)
FROM junk
GROUP BY actlevel;

Msg 8120, Level 16, State 1, Line 1
Column 'junk.age' is invalid in the select list because it is not contained in either an
aggregate function or the GROUP BY clause.

This is the behavior I expected, maybe not from SAS, but certainly from SQL Server and MySQL, except, well, it’s not what happens on those two platforms.

MySQL

mysql> SELECT actlevel, age, SUM(height)
    -> FROM junk
    -> GROUP BY actlevel;
+----------+------+-------------+
| actlevel | age  | SUM(height) |
+----------+------+-------------+
| HIGH     |   27 |         491 |
| LOW      |   31 |         465 |
| MOD      |   43 |         477 |
+----------+------+-------------+
3 rows in set (0.00 sec)
I have to admit that I did not expect that.  Unfortunately, it's pretty much meaningless.

And now, SAS, because we save the weirdest for last.
proc sql;
SELECT actlevel, age, SUM(height)
FROM mysql.admit
GROUP BY actlevel;
quit;

Act
Level       Age          
-------------------------
HIGH         27       491
HIGH         41       491
HIGH         34       491
HIGH         40       491
HIGH         44       491
HIGH         29       491
HIGH         25       491
LOW          31       465
LOW          28       465
LOW          49       465
LOW          22       465
LOW          34       465
LOW          51       465
LOW          60       465
MOD          32       477
MOD          47       477
MOD          30       477
MOD          54       477
MOD          43       477
MOD          35       477
MOD          43       477

POP! My head sploded!

If you are wondering how I keep all of this straight in my head, I don’t. Regrettably, despite looking up stuff like this, I can go totally brain-dead in an interview setting. I would have said that the SQL Server approach was correct and blown it on MySQL and SAS. The SAS behavior is totally unexpected. I probably need to install Oracle Express to act as a tie-breaker.

13 May

GROUP BY, Without A Group By

A lot of what I do on this site falls into the “what the heck” category.  By that I mean, I just want to do things in a different and strange fashion.  Often, like the stuff in this article, there’s no reason to do it the ways I’m going to, they aren’t faster, they aren’t better and often they are easier to understand.

So why do them? This probably stems from my martial arts days. In my particular art there were two, somewhat, extreme approaches to it. One group felt that you could flow, literally from the beginning and discover the art. The other group, which I’ll call the 1.2.3.4.5 group felt that you spend decades doing something, then, and only then, can you explore because you had the basics down.

For the record, I think both groups failed miserably in their approach. From what I saw, if you never practiced the fundamentals, you never learned the fundamentals, but, if all you ever did were fundamentals, you never really developed your creativity. Both approaches had this philosophical approach that fell flat on its face. As such, I tend to try to incorporate both into my practice, whatever that practice might be. Today, it’s the creative side.

Lets dive in.  By now you all know my good friend, the militaryspending table,

mysql> desc militaryspending;
+----------+-------------+------+-----+---------+----------------+
| Field    | Type        | Null | Key | Default | Extra          |
+----------+-------------+------+-----+---------+----------------+
| country  | varchar(50) | NO   |     | NULL    |                |
| spending | double      | NO   |     | NULL    |                |
| gdpperc  | double      | NO   |     | NULL    |                |
| region   | varchar(50) | YES  |     | NULL    |                |
| id       | int(11)     | NO   | PRI | NULL    | auto_increment |
+----------+-------------+------+-----+---------+----------------+
5 rows in set (0.00 sec)

mysql> SELECT *
    -> FROM militaryspending LIMIT 5;
+------------+----------+---------+---------------+----+
| country    | spending | gdpperc | region        | id |
+------------+----------+---------+---------------+----+
| Seychelles |      9.3 |     1.3 | Africa        |  1 |
| Cape Verde |      9.7 |     0.5 | Africa        |  2 |
| Mauritius  |     10.1 |     0.1 | Africa        |  3 |
| Belize     |     15.7 |     1.1 | North America |  4 |
| Moldova    |     20.8 |     0.3 | Europe        |  5 |
+------------+----------+---------+---------------+----+
5 rows in set (0.00 sec)

So, let’s do a simple GROUP BY that shows total spending by region. We would write this, and this is how we should do this, as follows:

mysql> SELECT region, SUM(spending) AS totSpending
    -> FROM militaryspending
    -> GROUP BY region;
+---------------+-------------+
| region        | totSpending |
+---------------+-------------+
| Africa        |      120608 |
| Asia          |    428477.7 |
| Australia     |     28559.1 |
| Europe        |    312412.2 |
| North America |    743120.4 |
| South America |     65976.9 |
+---------------+-------------+
6 rows in set (0.00 sec)

Clean, straight-forward, exactly what you should do, and it works everywhere. However, there are other ways to write that.

For instance, you can use a correlated subquery as follows:

mysql> SELECT DISTINCT region,
    -> (SELECT SUM(spending) FROM militaryspending WHERE a.region = region) AS totSpending
    -> FROM militaryspending a;
+---------------+-------------+
| region        | totSpending |
+---------------+-------------+
| Africa        |      120608 |
| Asia          |    428477.7 |
| Australia     |     28559.1 |
| Europe        |    312412.2 |
| North America |    743120.4 |
| South America |     65976.9 |
+---------------+-------------+
6 rows in set (0.01 sec)

This will work in most platforms. It’s standard stuff, albeit, not ideal, and as we I said in the title, there is no GROUP BY.

It also requires the use of DISTINCT to make it work.

mysql> SELECT region,
    -> (SELECT SUM(spending) FROM militaryspending WHERE a.region = region) AS totSpending
    -> FROM militaryspending a LIMIT 10;
+---------------+-------------+
| region        | totSpending |
+---------------+-------------+
| Africa        |      120608 |
| Africa        |      120608 |
| Africa        |      120608 |
| North America |    743120.4 |
| Europe        |    312412.2 |
| Africa        |      120608 |
| Asia          |    428477.7 |
| South America |     65976.9 |
| North America |    743120.4 |
| Africa        |      120608 |
+---------------+-------------+
10 rows in set (0.00 sec)

As you can see by that query, we’re running the subquery for each record in the database, the using DISTINCT to eliminate all the duplicate records. Useless, in this example, but it works, and except for the LIMIT clause, this works on SQL Server too.

In SQL Server you can use the OVER() clause to do something similar.

SELECT DISTINCT a.region, 
SUM(spending) OVER(PARTITION BY region) AS totSpending
FROM militaryspending a;

region	        totSpending
Africa	        120606
Asia	        428477
Australia	28559
Europe	        312410
North America	743119
South America	65976

You can read more about the OVER() clause at the link below.

http://msdn.microsoft.com/en-us/library/ms189461.aspx

This will be the last query I do, you could, literally, do this dozens of ways. What this query does is use a UNION to select the distinct region values from the militaryspending table. Those values are then used for a self-join to generate the total spending by region.

mysql> SELECT a.region, SUM(b.spending) AS totSpending
    -> FROM
    -> (
    -> SELECT region FROM militaryspending
    -> UNION
    -> SELECT region FROM militaryspending
    -> ) AS a
    -> LEFT JOIN militaryspending b
    -> ON a.region = b.region
    -> GROUP BY region;
+---------------+-------------+
| region        | totSpending |
+---------------+-------------+
| Africa        |      120608 |
| Asia          |    428477.7 |
| Australia     |     28559.1 |
| Europe        |    312412.2 |
| North America |    743120.4 |
| South America |     65976.9 |
+---------------+-------------+
6 rows in set (0.00 sec)

I would never do any of this, assuming viable alternatives, on a job that I was being paid to do, or even not paid to do, but as a creative practice this sort of thing is very useful. And, like it or not, SQL, and programming, are creative disciplines.

13 May

SAS: PROC SQL CALCULATED

The CALCULATED keyword is something that’s unique to SAS, and even more unique, I kind of like it.

For article purposes, I’m going to use the admit data set and the CALCULATED keyword (SAS not ANSI) to generate a BMI number. BMI stands for Body Mass Index and it’s commonly used as a measure of how healthy you are in terms of your weight. The word “healthy” isn’t really correct in this case, instead, it’s more of a guideline as to what you should weigh in a healthy statistical sense. The formula is:

BMI = ( Weight in Pounds / ( Height in inches x Height in inches ) ) x 703

http://en.wikipedia.org/wiki/Body_mass_index

So, to use it we would write a PROC SQL statement as follows:

ods listing;

proc sql;
	SELECT sex, age, height, weight, (weight / (height * height)) * 703
	FROM mysql.admit;
quit;

Sex       Age    Height    Weight       BMI      
-------------------------------------------
M          27        72       168  22.78241
F          34        66       152  24.53076
F          31        61       123  23.23811
F          43        63       137  24.26581
M          51        71       158  22.03412
......

Now, suppose we want to use BMI to eliminate some of our data? The following doesn’t work.

proc sql;
	SELECT sex, age, height, weight, (weight / (height * height)) * 703 AS BMI
	FROM mysql.admit
	WHERE BMI > 23;
quit;

ERROR: The following columns were not found in the contributing tables: BMI.

So, instead, we prefix the word BMI in the WHERE clause with the word CALCULATED, like the following:

proc sql;
	SELECT sex, age, height, weight, (weight / (height * height)) * 703 AS BMI
	FROM mysql.admit
	WHERE CALCULATED BMI > 23;
quit;

Sex       Age    Height    Weight       BMI
 -------------------------------------------
 F          34        66       152  24.53076
 F          31        61       123  23.23811
 F          43        63       137  24.26581
 M          29        76       193  23.49013
 F          32        67       151  23.64736
 M          35        70       173   24.8202
 F          49        64       172  29.52051
 F          40        69       163  24.06826
 M          47        72       173  23.46046
 M          60        71       191  26.63618
 M          25        75       188  23.49582
 F          22        63       139  24.62006
 M          54        71       183  25.52053

You can also write this by doing the following but SAS makes a big deal about using additional system resources. My guess is that while it works, answering a test question this way could have unpredictable results.  But, since this is SAS, it’s conceivable there would be a question where it is the correct answer.

proc sql;
	SELECT sex, age, height, weight, (weight / (height * height)) * 703 AS BMI
	FROM mysql.admit
	WHERE (weight / (height * height)) * 703  > 23;
quit;

You basically use CALCULATED any time you want to use a calculated result later in the query. For example,

proc sql;
	SELECT sex, age, height, weight, (weight / (height * height)) * 703 AS BMI, 
        CALCULATED BMI * 2
	FROM mysql.admit;
quit;

Also, you need to use the CALCULATED keyword anytime you use the calculation in a query. For example, the following doesn’t work:

proc sql;
	SELECT sex, age, height, weight, (weight / (height * height)) * 703 AS BMI,
        CALCULATED BMI * 2
	FROM mysql.admit
	WHERE BMI > 23;
quit;

ERROR: The following columns were not found in the contributing tables: BMI.

But, this does work.

proc sql;
	SELECT sex, age, height, weight, (weight / (height * height)) * 703 AS BMI,
        CALCULATED BMI * 2
	FROM mysql.admit
	WHERE CALCULATED BMI > 23;
quit;

This is another SAS quirk but I kind of like this one. I could see myself using it.

Oh, and just remember, this is not ANSI standard SQL. It’s SAS SQL.

11 May

Horizontal Lists

Honestly, I just thought this was a fun exercise.  I’m not going to go into heavy detail on it and my solution is essentially identical to the SQL Cookbook except that I use variables.

So, here is some code, and the result.

mysql> SELECT MAX(north) AS 'North America', MAX(south) AS 'South America'
    -> FROM
    -> (
    -> SELECT
    -> IF (region = 'North America', country, '') AS north,
    -> IF (region = 'North America', @ctr := @ctr + 1, NULL) AS northCtr,
    -> IF (region = 'South America', country, '') AS south,
    -> IF (region = 'South America', @ctr1 := @ctr1 + 1, NULL) AS southCtr
    -> FROM militaryspending
    -> JOIN (SELECT @ctr := 0, @ctr1 := 0) as A
    -> WHERE region IN ('North America','South America')
    -> ORDER BY country
    -> ) AS B
    -> GROUP BY COALESCE(northCtr, southCtr);
+--------------------+---------------+
| North America      | South America |
+--------------------+---------------+
| Belize             | Argentina     |
| Canada             | Bolivia       |
| Dominican Republic | Brazil        |
| El Salvador        | Chile         |
| Guatemala          | Colombia      |
| Honduras           | Ecuador       |
| Mexico             | Guyana        |
| Nicaragua          | Paraguay      |
| United States      | Peru          |
|                    | Uruguay       |
|                    | Venezuela     |
+--------------------+---------------+
11 rows in set (0.00 sec)

What this does is surprisingly simple. It uses two counters to build a list for two regions of data, orders them, adds a counter for each group, which is all put in a subquery. The outer query selects the max record for each counter and the counter is used for the GROUP BY. Finally, the COALESCE in the GROUP BY makes sure that we get whichever counter value applies to a given column.

When I first did this it didn’t seem all that useful but suppose we wanted a report that showed the top 5 spenders by each region? We could do that with the following code.

mysql> SELECT MAX(north) AS 'North America', MAX(south) AS 'South America'
    -> FROM
    -> (
    -> SELECT
    -> IF (region = 'North America', country, '') AS north,
    -> IF (region = 'North America', @ctr := @ctr + 1, NULL) AS northCtr,
    -> IF (region = 'South America', country, '') AS south,
    -> IF (region = 'South America', @ctr1 := @ctr1 + 1, NULL) AS southCtr,
    -> spending
    -> FROM militaryspending
    -> JOIN (SELECT @ctr := 0, @ctr1 := 0) as A
    -> WHERE region IN ('North America','South America')
    -> ORDER BY spending DESC
    -> ) AS B
    -> GROUP BY COALESCE(northCtr, southCtr);
+--------------------+---------------+
| North America      | South America |
+--------------------+---------------+
| United States      | Brazil        |
| Canada             | Colombia      |
| Mexico             | Chile         |
| Dominican Republic | Argentina     |
| El Salvador        | Venezuela     |
| Guatemala          | Ecuador       |
| Honduras           | Peru          |
| Nicaragua          | Uruguay       |
| Belize             | Bolivia       |
|                    | Paraguay      |
|                    | Guyana        |
+--------------------+---------------+
11 rows in set (0.00 sec)

Of course if you want to actually see the spending amounts then you will need to add two more columns.

mysql> SELECT MAX(north) AS 'North America',
    -> SUM(northspending) AS NorthSpending,
    -> MAX(south) AS 'South America',
    -> MAX(southSpending) AS 'South America'
    -> FROM
    -> (
    -> SELECT
    -> IF (region = 'North America', country, '') AS north,
    -> IF (region = 'North America', @ctr := @ctr + 1, NULL) AS northCtr,
    -> IF (region = 'North America', spending, '') AS northSpending,
    -> IF (region = 'South America', country, '') AS south,
    -> IF (region = 'South America', @ctr1 := @ctr1 + 1, NULL) AS southCtr,
    -> IF (region = 'South America', spending, '') AS southSpending,
    -> spending
    -> FROM militaryspending
    -> JOIN (SELECT @ctr := 0, @ctr1 := 0) as A
    -> WHERE region IN ('North America','South America')
    -> ORDER BY spending DESC
    -> ) AS B
    -> GROUP BY COALESCE(northCtr, southCtr);
+--------------------+---------------+---------------+---------------+
| North America      | NorthSpending | South America | South America |
+--------------------+---------------+---------------+---------------+
| United States      |        711421 | Brazil        | 35360         |
| Canada             |         24659 | Colombia      | 10957         |
| Mexico             |          6022 | Chile         | 8040          |
| Dominican Republic |           341 | Argentina     | 3295          |
| El Salvador        |           222 | Venezuela     | 2385          |
| Guatemala          |           200 | Ecuador       | 2308          |
| Honduras           |           185 | Peru          | 2098          |
| Nicaragua          |          54.7 | Uruguay       | 902           |
| Belize             |          15.7 | Bolivia       | 352           |
|                    |             0 | Paraguay      | 250           |
|                    |             0 | Guyana        | 29.9          |
+--------------------+---------------+---------------+---------------+
11 rows in set (0.00 sec)

And there you go, the countries sorted by region, with their spending sorted in descending order. I also set the MAX and SUM for the aggregates intentionally because it didn’t matter in this example and I wanted to show that it didn’t matter.

Enjoy!

09 May

Finding The Next, Non-like, Item In A list

What a lousy title.  I really have no idea what to call this one.  It’s even, kind of, hard to explain.  On the SQL Team forums a user had a question about tracking entry’s and exits from a room.  However, each room might have multiple entries, and multiple exits, so you would want the first entry then the first exit that was greater than the entry time.

Here’s the thread.

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=184712

This will be a SQL Server solution, for the most part, but I’ll show the MySQL version later on.  It’s pretty much the same except that you use an inline view.

CREATE TABLE mydates (id INT, checktime DATETIME, type CHAR(1));
INSERT INTO mydates VALUES 
('11','2013-01-01 09:00:00','I'),
('11','2013-01-01 09:10:00','I'),
('11','2013-01-01 09:11:00','I'),
('11','2013-01-01 12:00:00','O'),
('11','2013-01-01 12:02:00','O'),
('11','2013-01-01 14:15:00','I'),
('11','2013-01-01 14:45:00','O'),
('11','2013-01-01 15:00:00','I'),
('11','2013-01-01 15:03:00','I'),
('11','2013-01-01 18:00:00','O'),
('11','2013-01-01 18:00:00','O'),
('11','2013-01-02 09:10:00','I'),
('11','2013-01-02 09:11:00','I'),
('11','2013-01-02 18:00:00','O'),
('11','2013-01-02 18:05:00','O');

And you want to product the following result.

id	timein	                timeout
11	2013-01-01 09:00:00.000	2013-01-01 12:00:00.000
11	2013-01-01 14:15:00.000	2013-01-01 14:45:00.000
11	2013-01-01 15:00:00.000	2013-01-01 18:00:00.000
11	2013-01-02 09:10:00.000	2013-01-02 18:00:00.000

In essence, the first I type attached to the first O type, then the next I type greater than the O type, and so on and so forth. Yeah, it’s weird, and it would be fair to ask why they have multiple in/out times and to expect them to fix that problem but you know how that works in the real world.

So, here is how I got there, and for the record, it was pretty close to this. I kind of step-by-stepped it until I had an answer.

Step 1: Realize I only needed the I types at first.

SELECT * 
FROM mydates
WHERE type = 'I';

id	checktime	        type
11	2013-01-01 09:00:00.000	I
11	2013-01-01 09:10:00.000	I
11	2013-01-01 09:11:00.000	I
11	2013-01-01 14:15:00.000	I
11	2013-01-01 15:00:00.000	I
11	2013-01-01 15:03:00.000	I
11	2013-01-02 09:10:00.000	I
11	2013-01-02 09:11:00.000	I

This really doesn’t help much, but, it gives a base to work with.

Step 2: Create a self-join to get the O types.

Essentially, I just wrote a left-join that joined on everything where the types did not match. It doesn’t have to be a left join. A regular inner join would be fine. I have an old habit, because I did a lot of order entry queries in a past life, to reflexively write left joins.

SELECT * 
FROM mydates a
LEFT JOIN mydates b
ON a.type <> b.type
AND a.id = b.id
WHERE a.type = 'I';

id	checktime	        type	id	checktime	        type
11	2013-01-01 09:00:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-02 18:05:00.000	O

Yeah, that’s a bit long, but don’t worry, it will clean up soon.

Anyways, what this has done is combine every type I with every type 0, which is why it’s so long. We need to shorten it a bit.

Step 3: Only select type O times that are greater than a type I time.

SELECT * 
FROM mydates a
LEFT JOIN mydates b
ON a.type <> b.type
AND a.id = b.id
AND a.checktime < b.checktime
WHERE a.type = 'I';

id	checktime	        type	id	checktime	        type
11	2013-01-01 09:00:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-02 18:05:00.000	O

That helps but it still doesn’t get it narrowed down. However, the next step will.

Step 4: Use the MIN function to get the minimum type O time for each type I time.

You do this by rewriting the query to use a MIN function and a GROUP BY as follows:

SELECT a.id, a.checktime, a.type, b.id, MIN(b.checktime) AS timeout, b.type 
FROM mydates a
LEFT JOIN mydates b
ON a.type <> b.type
AND a.id = b.id
AND a.checktime < b.checktime
WHERE a.type = 'I'
GROUP BY a.id, a.checktime, a.type, b.id, b.type

id	checktime	        type	id	timeout    	        type
11	2013-01-01 09:00:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-02 18:00:00.000	O

Ah, better, now for each type I we have the first type O value greater than it.

Step 5: Clean up the earlier query.

Normally, I would do this last, but the WITH statement is disagreeable about creating an example the way I’ve gone so I’ll clean it up now.

SELECT a.id, a.checktime AS timein, MIN(b.checktime) AS timeout
FROM mydates a
LEFT JOIN mydates b
ON a.type <> b.type
AND a.id = b.id
AND a.checktime < b.checktime
WHERE a.type = 'I'
GROUP BY a.id, a.checktime 

uid	timein	                timeout
11	2013-01-01 09:00:00.000	2013-01-01 12:00:00.000
11	2013-01-01 09:10:00.000	2013-01-01 12:00:00.000
11	2013-01-01 09:11:00.000	2013-01-01 12:00:00.000
11	2013-01-01 14:15:00.000	2013-01-01 14:45:00.000
11	2013-01-01 15:00:00.000	2013-01-01 18:00:00.000
11	2013-01-01 15:03:00.000	2013-01-01 18:00:00.000
11	2013-01-02 09:10:00.000	2013-01-02 18:00:00.000
11	2013-01-02 09:11:00.000	2013-01-02 18:00:00.000

Step 6: Select the minimum timein value based on the timeout value.

So, what we now have is a list of timeout values attached to a bunch of timein values. We simply run the same query we did earlier, only this time for the timein value and bang/done.

WITH cte AS
(
SELECT a.id, a.checktime AS timein, MIN(b.checktime) AS timeout
FROM mydates a
LEFT JOIN mydates b
ON a.type <> b.type
AND a.id = b.id
AND a.checktime < b.checktime
WHERE a.type = 'I'
GROUP BY a.id, a.checktime 
)
SELECT id, MIN(timein) AS timein, timeout
FROM cte
GROUP BY id, timeout

id	timein          	timeout
11	2013-01-01 09:00:00.000	2013-01-01 12:00:00.000
11	2013-01-01 14:15:00.000	2013-01-01 14:45:00.000
11	2013-01-01 15:00:00.000	2013-01-01 18:00:00.000
11	2013-01-02 09:10:00.000	2013-01-02 18:00:00.000

This might make more sense if I add a summary of the steps.

  1. Select all type I values.
  2. Join the type O values to the type I values.
  3. Restrict the query to only values where the type O time is greater than the type I time.
  4. Select the minimum type O value for each type I value.
  5. Select the minimum type I value for each type O value via a subquery/with statement.

The MySQL version, which is really the MySQL / SQL Server version, because it works just fine on both sides of the house is,

mysql> SELECT a.id, MIN(timein) AS timein, timeout
    -> FROM
    -> (
    -> SELECT a.id, a.checktime AS timein, MIN(b.checktime) AS timeout
    -> FROM mydates a
    -> LEFT JOIN mydates b
    -> ON a.type <> b.type
    -> AND a.id = b.id
    -> AND a.checktime < b.checktime     -> WHERE a.type = 'I'
    -> GROUP BY a.id, a.checktime ) AS A
    -> GROUP BY a.id, timeout;
+------+---------------------+---------------------+
| id   | timein              | timeout             |
+------+---------------------+---------------------+
|   11 | 2013-01-01 09:00:00 | 2013-01-01 12:00:00 |
|   11 | 2013-01-01 14:15:00 | 2013-01-01 14:45:00 |
|   11 | 2013-01-01 15:00:00 | 2013-01-01 18:00:00 |
|   11 | 2013-01-02 09:10:00 | 2013-01-02 18:00:00 |
+------+---------------------+---------------------+
4 rows in set (0.00 sec)

I used the WITH statement earlier, and in the original post, because the Microsoft guys seem more comfortable with that, especially on a “new guy” forum like where I posted this. I, also, have to admit that I think it’s easier to write too. Nesting subqueries gets a bit messy to read.

Hope you enjoyed it.

07 May

Returning The Bottom Two Items (No One Will Like This)

In an interview, guys like to do slightly exotic things.  Often, they’re hitting you with a problem that is solved pretty easily on one platform, but not so easily on another.  For instance, finding the top, or bottom 2, of an ordered result is pretty easy with the ROW_NUMBER function, but it’s less easy in MySQL.  I guess I’m feeling persnickety (it’s good that I have a spell check) because I’m going to come up with an exotic answer to the problem.

By the way, this is, really, a very unfair interview question on the MySQL side of the house, except, maybe in rare cases.  The kind of guy who knows this sort of thing, especially this approach, in an interview, is a freak.  Period.  I had to process this in the background for awhile until I sat down and came up with it.  In fact, really, until the SQL Cookbook entered my life I had never really even used GROUP_CONCAT.

Note: This is a MySQL thing. In SQL Server you’d just throw out a ROW_NUMBER and move on.

So, we’ll be using my old military spending data.  Here it is, all normalized, clean, and showing incredible attention to detail (I copied it off the SIRI site, tossed it into an editor and created a table with it because I’m pro).

mysql> SELECT * FROM militaryspending;
+------------------------+----------+---------+---------------+-----+
| country                | spending | gdpperc | region        | id  |
+------------------------+----------+---------+---------------+-----+
| Seychelles             |      9.3 |     1.3 | Africa        |   1 |
| Cape Verde             |      9.7 |     0.5 | Africa        |   2 |
| Mauritius              |     10.1 |     0.1 | Africa        |   3 |
| Belize                 |     15.7 |     1.1 | North America |   4 |
| Moldova                |     20.8 |     0.3 | Europe        |   5 |
Snipped...
| France                 |    62535 |     2.3 | Europe        | 122 |
| United Kingdom         |    62685 |     2.6 | Europe        | 123 |
| Russia                 |    71853 |     3.9 | Asia          | 124 |
| China                  |   142859 |     2.1 | Asia          | 125 |
| United States          |   711421 |     4.8 | North America | 126 |
+------------------------+----------+---------+---------------+-----+
126 rows in set (0.01 sec)

Actually, I did clean it up.  It didn’t have an ID/primary key field before.

Anyways, we have 6 regions and we want the result set to only show the lowest two for each region.

To do this we’re first going to do something completely weird, use the GROUP_CONCAT function combined with the SUBSTRING_INDEX function. The code and the result will look like this.

mysql> SELECT region,
    -> SUBSTRING_INDEX(GROUP_CONCAT(country ORDER BY spending DESC),',',-2) AS b
ottomTwo
    -> FROM militaryspending
    -> GROUP BY region;
+---------------+------------------------------+
| region        | bottomTwo                    |
+---------------+------------------------------+
| Africa        | Cape Verde,Seychelles        |
| Asia          | Mongolia,Timor Leste         |
| Australia     | New Zealand,Papua New Guinea |
| Europe        | Malta,Moldova                |
| North America | Nicaragua,Belize             |
| South America | Paraguay,Guyana              |
+---------------+------------------------------+
6 rows in set (0.00 sec)

So, technically, you have provided an answer to a sloppy interview question. These are, the two lowest military spenders by region.

What you still need to do is provide the results, in an ordered list, with spending data included.  They will ask for it, immediately.  Here’s how I did this although it’s not really an optimal solution by any means.

mysql> SELECT a.region, a.country, a.spending
    -> FROM militaryspending a
    -> INNER JOIN
    -> (
    -> SELECT region,
    -> SUBSTRING_INDEX(GROUP_CONCAT(id ORDER BY spending DESC),',',-2) AS bottomTwo
    -> FROM militaryspending
    -> GROUP BY region
    -> ) b
    -> ON a.region = b.region
    -> WHERE FIND_IN_SET(a.id, b.bottomtwo) > 0
    -> ORDER BY a.region, a.spending;
+---------------+------------------+----------+
| region        | country          | spending |
+---------------+------------------+----------+
| Africa        | Seychelles       |      9.3 |
| Africa        | Cape Verde       |      9.7 |
| Asia          | Timor Leste      |     27.3 |
| Asia          | Mongolia         |     81.4 |
| Australia     | Papua New Guinea |     61.1 |
| Australia     | New Zealand      |     1792 |
| Europe        | Moldova          |     20.8 |
| Europe        | Malta            |     63.5 |
| North America | Belize           |     15.7 |
| North America | Nicaragua        |     54.7 |
| South America | Guyana           |     29.9 |
| South America | Paraguay         |      250 |
+---------------+------------------+----------+
12 rows in set (0.00 sec)

I should probably explain this.

First, I join the earlier query to the militaryspending table in a straight old inner join. This give me 126 results with the country and my GROUP_CONCAT tied together. Then all I do is write the WHERE clause by using FIND_IN_SET to go hunting for a country in the GROUP_CONCAT results. If it finds something it will return a result greater than 1, and there you go.

If you do this with variables, and some IF magic, that will produce a better result because you won’t need the join. On the other hand, if you toss this at an interviewer, unless they are very good, they will have no idea what you just hit them with.  You’ll either get big points for teaching them something, or, they’ll bounce you, for teaching them something.

Have fun!