17 May

EXISTS vs NOT EXISTS

I’m going to break a string of posts and, for once, not beat up on SAS. In fact, this post was inspired by the SAS Advanced study guide, which mentions the topic, and since I don’t remember writing about it (though I probably have), here you go.

EXISTS and its best friend, NOT EXISTS, are used with a subquery in the WHERE clause (this is a lie, it can also be used in a HAVING clause, but I’m not doing that today). Essentially, they test whether the subquery returns any results at all. In a way, it’s a binary test that says either “yes, something is there” or “no, nothing is there”.
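
Since all the examples below are MySQL sessions, here’s the same binary behavior sketched in a throwaway SQLite session from Python; the two-row table and its names are made up for the illustration.

```python
import sqlite3

# Two-row toy table standing in for junk (made-up data for the example).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE junk (name TEXT, actlevel TEXT)")
con.executemany("INSERT INTO junk VALUES (?, ?)",
                [("Murray, W", "HIGH"), ("Oberon, M", "LOW")])

# EXISTS is true as soon as the subquery returns any row at all,
# so every row of the outer query comes back.
rows = con.execute(
    "SELECT name FROM junk "
    "WHERE EXISTS (SELECT name FROM junk WHERE actlevel = 'HIGH')").fetchall()

# NOT EXISTS flips the test: the subquery found a row, so nothing comes back.
none = con.execute(
    "SELECT name FROM junk "
    "WHERE NOT EXISTS (SELECT name FROM junk WHERE actlevel = 'HIGH')").fetchall()
print(len(rows), len(none))
```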

A few examples, all in MySQL.

mysql> SELECT name
    -> FROM junk
    -> WHERE EXISTS (SELECT name FROM junk WHERE actlevel = 'HIGH');
+----------------+
| name           |
+----------------+
| Murray, W      |
| Almers, C      |
| Bonaventure, T |
| Johnson, R     |
| LaMance, K     |
| Jones, M       |
| Reberson, P    |
| King, E        |
| Pitts, D       |
| Eberhardt, S   |
| Nunnelly, A    |
| Oberon, M      |
| Peterson, V    |
| Quigley, M     |
| Cameron, L     |
| Underwood, K   |
| Takahashi, Y   |
| Derber, B      |
| Ivan, H        |
| Wilcox, E      |
| Warren, C      |
+----------------+
21 rows in set (0.00 sec)

Since at least one record has an actlevel of HIGH, the subquery returns rows, and all the results are returned by the outer query.

If we change the query to NOT EXISTS, we won’t get any results.

mysql> SELECT *
    -> FROM junk
    -> WHERE NOT EXISTS (SELECT name FROM junk WHERE actlevel = 'HIGH');
Empty set (0.00 sec)

This is really clean at this level. It either does, or doesn’t. But we can make it a bit more interesting.

mysql> SELECT name
    -> FROM junk
    -> WHERE EXISTS (SELECT 1);
+----------------+
| name           |
+----------------+
| Murray, W      |
| Almers, C      |
| Bonaventure, T |
| Johnson, R     |
| LaMance, K     |
| Jones, M       |
| Reberson, P    |
| King, E        |
| Pitts, D       |
| Eberhardt, S   |
| Nunnelly, A    |
| Oberon, M      |
| Peterson, V    |
| Quigley, M     |
| Cameron, L     |
| Underwood, K   |
| Takahashi, Y   |
| Derber, B      |
| Ivan, H        |
| Wilcox, E      |
| Warren, C      |
+----------------+
21 rows in set (0.00 sec)

The subquery returns a result, so EXISTS is true and you get the full result set. So, what happens when you test a condition that returns false?

mysql> SELECT name
    -> FROM junk
    -> WHERE EXISTS (SELECT 1 = 2);
+----------------+
| name           |
+----------------+
| Murray, W      |
| Almers, C      |
| Bonaventure, T |
| Johnson, R     |
| LaMance, K     |
| Jones, M       |
| Reberson, P    |
| King, E        |
| Pitts, D       |
| Eberhardt, S   |
| Nunnelly, A    |
| Oberon, M      |
| Peterson, V    |
| Quigley, M     |
| Cameron, L     |
| Underwood, K   |
| Takahashi, Y   |
| Derber, B      |
| Ivan, H        |
| Wilcox, E      |
| Warren, C      |
+----------------+
21 rows in set (0.00 sec)

We’re still good. The comparison is false, but the subquery still returns one row (holding the value 0), so EXISTS is true and we get all of them. What about NULL values?

mysql> SELECT name
    -> FROM junk
    -> WHERE EXISTS (SELECT NULL);
+----------------+
| name           |
+----------------+
| Murray, W      |
| Almers, C      |
| Bonaventure, T |
| Johnson, R     |
| LaMance, K     |
| Jones, M       |
| Reberson, P    |
| King, E        |
| Pitts, D       |
| Eberhardt, S   |
| Nunnelly, A    |
| Oberon, M      |
| Peterson, V    |
| Quigley, M     |
| Cameron, L     |
| Underwood, K   |
| Takahashi, Y   |
| Derber, B      |
| Ivan, H        |
| Wilcox, E      |
| Warren, C      |
+----------------+
21 rows in set (0.00 sec)

mysql> SELECT name
    -> FROM junk
    -> WHERE NOT EXISTS (SELECT NULL);
Empty set (0.00 sec)

They all work the same. If the result set has a row, even a single-column row holding a NULL, you get the entire result set.
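
A throwaway SQLite session makes the distinction concrete: a subquery that returns a row holding NULL still satisfies EXISTS, while a subquery that returns no rows at all does not. The one-row table is made up for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE junk (name TEXT)")
con.execute("INSERT INTO junk VALUES ('Ivan, H')")

# SELECT NULL still produces one row (a row holding NULL), so EXISTS is true.
with_null = con.execute(
    "SELECT name FROM junk WHERE EXISTS (SELECT NULL)").fetchall()

# A subquery that returns zero rows is the only way EXISTS comes up false.
no_rows = con.execute(
    "SELECT name FROM junk WHERE EXISTS (SELECT 1 WHERE 1 = 2)").fetchall()
print(with_null, no_rows)
```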

You can also use EXISTS with a correlated subquery. This is a bit more interesting because the subquery can act like a WHERE clause on the outer query. It isn’t the most productive thing you could do, but I find it weirdly fun.

mysql> SELECT name, actlevel
    -> FROM junk j
    -> WHERE EXISTS (SELECT name FROM junk WHERE j.actlevel = 'HIGH');
+-------------+----------+
| name        | actlevel |
+-------------+----------+
| Murray, W   | HIGH     |
| Almers, C   | HIGH     |
| Jones, M    | HIGH     |
| Nunnelly, A | HIGH     |
| Quigley, M  | HIGH     |
| Derber, B   | HIGH     |
| Wilcox, E   | HIGH     |
+-------------+----------+
7 rows in set (0.00 sec)

mysql> SELECT name, actlevel
    -> FROM junk
    -> WHERE actlevel = 'HIGH';
+-------------+----------+
| name        | actlevel |
+-------------+----------+
| Murray, W   | HIGH     |
| Almers, C   | HIGH     |
| Jones, M    | HIGH     |
| Nunnelly, A | HIGH     |
| Quigley, M  | HIGH     |
| Derber, B   | HIGH     |
| Wilcox, E   | HIGH     |
+-------------+----------+
7 rows in set (0.00 sec)

Because the subquery is correlated to the outer query through the WHERE clause, it actually fires for each row. It’s definitely not an optimal solution, especially in this case. Just for kicks, because it makes the cost clear, here are the EXPLAIN statements for each of those. This is for the Microsoft people with question marks floating over their heads.

mysql> EXPLAIN SELECT name, actlevel
    -> FROM junk j
    -> WHERE EXISTS (SELECT name FROM junk WHERE j.actlevel = 'HIGH');
+----+--------------------+-------+------+---------------+------+---------+----+----+------------
| id | select_type        | table | type | possible_keys | key  | key_len |ref |rows|Extra      |
+----+--------------------+-------+------+---------------+------+---------+----+----+------------
|  1 | PRIMARY            | j     | ALL  | NULL          | NULL | NULL    |NULL| 21 |Using where|
|  2 | DEPENDENT SUBQUERY | junk  | ALL  | NULL          | NULL | NULL    |NULL| 21 |Using where|
+----+--------------------+-------+------+---------------+------+---------+----+----+------------
2 rows in set (0.00 sec)

mysql> EXPLAIN SELECT name, actlevel
    -> FROM junk
    -> WHERE actlevel = 'HIGH';
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | junk  | ALL  | NULL          | NULL | NULL    | NULL |   21 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)

You don’t need to be a MySQL expert to see the problem. You can be one of the kids in the ATT ad, except they all think moar is better. It isn’t in this case. Less is better. I’m not going into detail on EXPLAIN, and this EXPLAIN doesn’t have enough detail to discuss anyway.

You can read more about EXPLAIN here.

15 May

More GROUP BY, SQL Server vs SAS and MySQL

This is pretty clean, and I’ve touched on it before, so it shouldn’t be a surprise, but it’s something I’d have missed on a SAS test, so it gets a post. We’ll start with SQL Server.

SQL Server

SELECT actlevel, sum(height) AS sumHeight
FROM junk
GROUP BY actlevel;

actlevel	sumHeight
HIGH	        491
LOW 	        465
MOD 	        477

This is clean and all three of my favorites will behave with this SQL.

However, let’s change it just a bit.

SELECT actlevel, sum(height) AS sumHeight
FROM junk
GROUP BY 1;

Msg 164, Level 15, State 1, Line 3
Each GROUP BY expression must contain at least one column that is not an outer reference.

I suspect this is the result most people would expect with that code; certainly the Microsoft guys would. However, the other two platforms seem to disagree.
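
SQLite happens to side with MySQL and SAS on ordinal GROUP BY, which makes it easy to poke at from Python. A minimal sketch, with made-up data:

```python
import sqlite3

# Made-up mini version of the junk table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE junk (actlevel TEXT, height INTEGER)")
con.executemany("INSERT INTO junk VALUES (?, ?)",
                [("HIGH", 72), ("HIGH", 66), ("LOW", 61), ("MOD", 63)])

# GROUP BY 1 refers to the first column in the select list (actlevel here).
rows = con.execute(
    "SELECT actlevel, SUM(height) AS sumHeight "
    "FROM junk GROUP BY 1 ORDER BY 1").fetchall()
print(rows)
```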

MySQL

mysql> SELECT actlevel, sum(height) AS sumHeight
    -> FROM junk
    -> GROUP BY 1;
+----------+-----------+
| actlevel | sumHeight |
+----------+-----------+
| HIGH     |       491 |
| LOW      |       465 |
| MOD      |       477 |
+----------+-----------+
3 rows in set (0.00 sec)

That’s pretty clean, and notice how it orders the actlevel column. And just for kicks, this happens to work.

mysql> SELECT actlevel, sum(height) AS sumHeight
    -> FROM junk
    -> GROUP BY 1 DESC;
+----------+-----------+
| actlevel | sumHeight |
+----------+-----------+
| MOD      |       477 |
| LOW      |       465 |
| HIGH     |       491 |
+----------+-----------+
3 rows in set (0.00 sec)

Now, that other guy.

SAS

proc sql;
SELECT actlevel, SUM(height) AS sumHeight
FROM mysql.admit
GROUP BY 1;
quit;

Act
Level  sumHeight
----------------
HIGH         491
LOW          465
MOD          477

And just for kicks, in SAS you can alias a column with the label= option as follows:

proc sql;
SELECT actlevel, SUM(height) label='sumHeight'
FROM mysql.admit
GROUP BY 1;
quit;

However, you cannot use the DESC keyword here, just, because.

14 May

More GROUP BY, SAS vs MySQL vs SQL Server

This is a post about a few more GROUP BY quirks on the three platforms.

First, some data.

CREATE TABLE junk
(
	id int,
	name varchar(40),
	sex char(1),
	age int,
	mydate int,
	height int,
	weight int,
	actlevel char(4),
	fee float
);

INSERT INTO junk VALUES ('2458','Murray, W','M','27','1','72','168','HIGH','85.20');
INSERT INTO junk VALUES ('2462','Almers, C','F','34','3','66','152','HIGH','124.80');
INSERT INTO junk VALUES ('2501','Bonaventure, T','F','31','17','61','123','LOW','149.75');
INSERT INTO junk VALUES ('2523','Johnson, R','F','43','31','63','137','MOD','149.75');
INSERT INTO junk VALUES ('2539','LaMance, K','M','51','4','71','158','LOW','124.80');
INSERT INTO junk VALUES ('2544','Jones, M','M','29','6','76','193','HIGH','124.80');
INSERT INTO junk VALUES ('2552','Reberson, P','F','32','9','67','151','MOD','149.75');
INSERT INTO junk VALUES ('2555','King, E','M','35','13','70','173','MOD','149.75');
INSERT INTO junk VALUES ('2563','Pitts, D','M','34','22','73','154','LOW','124.80');
INSERT INTO junk VALUES ('2568','Eberhardt, S','F','49','27','64','172','LOW','124.80');
INSERT INTO junk VALUES ('2571','Nunnelly, A','F','44','19','66','140','HIGH','149.75');
INSERT INTO junk VALUES ('2572','Oberon, M','F','28','17','62','118','LOW','85.20');
INSERT INTO junk VALUES ('2574','Peterson, V','M','30','6','69','147','MOD','149.75');
INSERT INTO junk VALUES ('2575','Quigley, M','F','40','8','69','163','HIGH','124.80');
INSERT INTO junk VALUES ('2578','Cameron, L','M','47','5','72','173','MOD','124.80');
INSERT INTO junk VALUES ('2579','Underwood, K','M','60','22','71','191','LOW','149.75');
INSERT INTO junk VALUES ('2584','Takahashi, Y','F','43','29','65','123','MOD','124.80');
INSERT INTO junk VALUES ('2586','Derber, B','M','25','23','75','188','HIGH','85.20');
INSERT INTO junk VALUES ('2588','Ivan, H','F','22','20','63','139','LOW','85.20');
INSERT INTO junk VALUES ('2589','Wilcox, E','F','41','16','67','141','HIGH','149.75');
INSERT INTO junk VALUES ('2595','Warren, C','M','54','7','71','183','MOD','149.75');

So, let’s get started by looking at two queries that don’t work.

SQL Server

SELECT actlevel, SUM(age,height)
FROM junk
GROUP BY actlevel

Msg 174, Level 15, State 1, Line 1
The SUM function requires 1 argument(s).

SQL Server does a nice job here. It delineates the error clearly, and this is expected behavior.

MySQL

mysql> SELECT actlevel, SUM(age,height)
    -> FROM junk
    -> GROUP BY actlevel;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that
corresponds to your MySQL server version for the right syntax to use near 'height)
FROM junk
GROUP BY actlevel' at line 1

That isn’t the most descriptive error you’ll run across; however, it is expected behavior. Now, here comes SAS.

SAS

proc sql;
    select actlevel, sum(age, height)
	from mysql.admit
	group by actlevel;
quit;

Act
Level          
---------------
HIGH         99
HIGH        108
HIGH        100
HIGH        109
HIGH        110
HIGH        105
HIGH        100
LOW          92
LOW          90
LOW         113
LOW          85
LOW         107
LOW         122
LOW         131
MOD          99
MOD         119
MOD          99
MOD         125
MOD         106
MOD         105
MOD         108

WTF? OK, just to confirm that there is some normal behavior in SAS.

proc sql;
    select actlevel, sum(height)
	from mysql.admit
	group by actlevel;
quit;

Act
Level          
---------------
HIGH        491
LOW         465
MOD         477

I’m not exactly sure what SAS thinks it’s doing, but it looks like it’s applying its row-level SUM function, adding age and height within each row (27 + 72 = 99 for the first HIGH line), rather than aggregating. I’ll have to dig into it, but, man, that is very nonstandard stuff.
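
To check that the numbers line up with that guess, a couple of lines of Python reproduce the per-row arithmetic (triples abbreviated from the table data above):

```python
# (actlevel, age, height) triples abbreviated from the junk data.
rows = [("HIGH", 27, 72), ("HIGH", 34, 66), ("LOW", 31, 61)]

# Per-row sum, the way a row-level SUM function (not an aggregate) behaves.
rowwise = [(actlevel, age + height) for actlevel, age, height in rows]
print(rowwise)  # first HIGH row: 27 + 72 = 99, matching the SAS output
```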

All three variants do what they should with the following code:

SELECT actlevel, SUM(age+height)
FROM junk
GROUP BY actlevel;

or the SAS variant

proc sql;
SELECT actlevel, SUM(age+height)
FROM mysql.admit
GROUP BY actlevel;
quit;

Act
Level          
---------------
HIGH        731
LOW         740
MOD         761

This is really more than enough for a post but I’m going one better tonight and providing bonus time.

SQL Server

SELECT actlevel, age, SUM(height)
FROM junk
GROUP BY actlevel;

Msg 8120, Level 16, State 1, Line 1
Column 'junk.age' is invalid in the select list because it is not contained in either an
aggregate function or the GROUP BY clause.

This is the behavior I expected, maybe not from SAS, but certainly from SQL Server and MySQL. Except, well, it’s not what happens on the other two platforms.

MySQL

mysql> SELECT actlevel, age, SUM(height)
    -> FROM junk
    -> GROUP BY actlevel;
+----------+------+-------------+
| actlevel | age  | SUM(height) |
+----------+------+-------------+
| HIGH     |   27 |         491 |
| LOW      |   31 |         465 |
| MOD      |   43 |         477 |
+----------+------+-------------+
3 rows in set (0.00 sec)

I have to admit that I did not expect that. Unfortunately, it’s pretty much meaningless; the age shown is just taken from an arbitrary row in each group.
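
SQLite permits the same bare column, so here’s a minimal sketch of the behavior from Python (made-up mini data, with two HIGH rows so the bare column is genuinely ambiguous):

```python
import sqlite3

# Made-up mini table; two HIGH rows make the bare age column ambiguous.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE junk (actlevel TEXT, age INTEGER, height INTEGER)")
con.executemany("INSERT INTO junk VALUES (?, ?, ?)",
                [("HIGH", 27, 72), ("HIGH", 34, 66), ("LOW", 31, 61)])

# The SUM is real; the bare age is picked from an arbitrary row of each group.
rows = con.execute(
    "SELECT actlevel, age, SUM(height) FROM junk "
    "GROUP BY actlevel ORDER BY actlevel").fetchall()
print(rows)
```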

And now, SAS, because we save the weirdest for last.

proc sql;
SELECT actlevel, age, SUM(height)
FROM mysql.admit
GROUP BY actlevel;
quit;

Act
Level       Age          
-------------------------
HIGH         27       491
HIGH         41       491
HIGH         34       491
HIGH         40       491
HIGH         44       491
HIGH         29       491
HIGH         25       491
LOW          31       465
LOW          28       465
LOW          49       465
LOW          22       465
LOW          34       465
LOW          51       465
LOW          60       465
MOD          32       477
MOD          47       477
MOD          30       477
MOD          54       477
MOD          43       477
MOD          35       477
MOD          43       477

POP! My head sploded!

If you are wondering how I keep all of this straight in my head, I don’t. Regrettably, despite looking up stuff like this, I can go totally brain-dead in an interview setting. I would have said that the SQL Server approach was correct and blown it on MySQL and SAS. The SAS behavior is totally unexpected. I probably need to install Oracle Express to act as a tie-breaker.

13 May

GROUP BY, Without A Group By

A lot of what I do on this site falls into the “what the heck” category. By that I mean I just want to do things in a different and strange fashion. Often, like the stuff in this article, there’s no reason to do it the ways I’m going to: they aren’t faster, they aren’t better, and often they are harder to understand.

So why do them? This probably stems from my martial arts days. In my particular art there were two, somewhat, extreme approaches to it. One group felt that you could flow, literally from the beginning and discover the art. The other group, which I’ll call the 1.2.3.4.5 group felt that you spend decades doing something, then, and only then, can you explore because you had the basics down.

For the record, I think both groups failed miserably in their approach. From what I saw, if you never practiced the fundamentals, you never learned the fundamentals, but, if all you ever did were fundamentals, you never really developed your creativity. Both approaches had this philosophical approach that fell flat on its face. As such, I tend to try to incorporate both into my practice, whatever that practice might be. Today, it’s the creative side.

Let’s dive in. By now you all know my good friend, the militaryspending table.

mysql> desc militaryspending;
+----------+-------------+------+-----+---------+----------------+
| Field    | Type        | Null | Key | Default | Extra          |
+----------+-------------+------+-----+---------+----------------+
| country  | varchar(50) | NO   |     | NULL    |                |
| spending | double      | NO   |     | NULL    |                |
| gdpperc  | double      | NO   |     | NULL    |                |
| region   | varchar(50) | YES  |     | NULL    |                |
| id       | int(11)     | NO   | PRI | NULL    | auto_increment |
+----------+-------------+------+-----+---------+----------------+
5 rows in set (0.00 sec)

mysql> SELECT *
    -> FROM militaryspending LIMIT 5;
+------------+----------+---------+---------------+----+
| country    | spending | gdpperc | region        | id |
+------------+----------+---------+---------------+----+
| Seychelles |      9.3 |     1.3 | Africa        |  1 |
| Cape Verde |      9.7 |     0.5 | Africa        |  2 |
| Mauritius  |     10.1 |     0.1 | Africa        |  3 |
| Belize     |     15.7 |     1.1 | North America |  4 |
| Moldova    |     20.8 |     0.3 | Europe        |  5 |
+------------+----------+---------+---------------+----+
5 rows in set (0.00 sec)

So, let’s do a simple GROUP BY that shows total spending by region. This is how we should write it:

mysql> SELECT region, SUM(spending) AS totSpending
    -> FROM militaryspending
    -> GROUP BY region;
+---------------+-------------+
| region        | totSpending |
+---------------+-------------+
| Africa        |      120608 |
| Asia          |    428477.7 |
| Australia     |     28559.1 |
| Europe        |    312412.2 |
| North America |    743120.4 |
| South America |     65976.9 |
+---------------+-------------+
6 rows in set (0.00 sec)

Clean, straightforward, exactly what you should do, and it works everywhere. However, there are other ways to write it.

For instance, you can use a correlated subquery as follows:

mysql> SELECT DISTINCT region,
    -> (SELECT SUM(spending) FROM militaryspending WHERE a.region = region) AS totSpending
    -> FROM militaryspending a;
+---------------+-------------+
| region        | totSpending |
+---------------+-------------+
| Africa        |      120608 |
| Asia          |    428477.7 |
| Australia     |     28559.1 |
| Europe        |    312412.2 |
| North America |    743120.4 |
| South America |     65976.9 |
+---------------+-------------+
6 rows in set (0.01 sec)

This will work on most platforms. It’s standard stuff, albeit not ideal, and as I said in the title, there is no GROUP BY.

It also requires the use of DISTINCT to make it work.

mysql> SELECT region,
    -> (SELECT SUM(spending) FROM militaryspending WHERE a.region = region) AS totSpending
    -> FROM militaryspending a LIMIT 10;
+---------------+-------------+
| region        | totSpending |
+---------------+-------------+
| Africa        |      120608 |
| Africa        |      120608 |
| Africa        |      120608 |
| North America |    743120.4 |
| Europe        |    312412.2 |
| Africa        |      120608 |
| Asia          |    428477.7 |
| South America |     65976.9 |
| North America |    743120.4 |
| Africa        |      120608 |
+---------------+-------------+
10 rows in set (0.00 sec)

As you can see from that query, we’re running the subquery for each record in the table, then using DISTINCT to eliminate all the duplicate records. Wasteful, in this example, but it works, and except for the LIMIT clause, this works on SQL Server too.
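
If you want to check the equivalence without loading the full table, here’s a tiny SQLite version from Python (three made-up rows): the DISTINCT-plus-correlated-subquery form and the GROUP BY form produce identical results.

```python
import sqlite3

# Three made-up rows from militaryspending.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE militaryspending (country TEXT, spending REAL, region TEXT)")
con.executemany("INSERT INTO militaryspending VALUES (?, ?, ?)",
                [("Seychelles", 9.3, "Africa"), ("Cape Verde", 9.7, "Africa"),
                 ("Belize", 15.7, "North America")])

grouped = con.execute(
    "SELECT region, SUM(spending) FROM militaryspending "
    "GROUP BY region ORDER BY region").fetchall()

# No GROUP BY: the correlated subquery re-totals the region for every row,
# and DISTINCT squeezes out the duplicates.
correlated = con.execute(
    "SELECT DISTINCT region, "
    "(SELECT SUM(spending) FROM militaryspending WHERE a.region = region) "
    "FROM militaryspending a ORDER BY region").fetchall()
print(grouped == correlated)
```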

In SQL Server you can use the OVER() clause to do something similar.

SELECT DISTINCT a.region, 
SUM(spending) OVER(PARTITION BY region) AS totSpending
FROM militaryspending a;

region	        totSpending
Africa	        120606
Asia	        428477
Australia	28559
Europe	        312410
North America	743119
South America	65976

You can read more about the OVER() clause at the link below.

http://msdn.microsoft.com/en-us/library/ms189461.aspx
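
SQLite (3.25 and later, which ships with Python 3.8+) supports the same window-function form, so you can sketch it without a SQL Server instance; the three rows are made up for the example.

```python
import sqlite3  # window functions need SQLite 3.25+; Python 3.8+ bundles a newer one

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE militaryspending (country TEXT, spending REAL, region TEXT)")
con.executemany("INSERT INTO militaryspending VALUES (?, ?, ?)",
                [("Seychelles", 9, "Africa"), ("Cape Verde", 10, "Africa"),
                 ("Belize", 15, "North America")])

# SUM() OVER(PARTITION BY ...) totals each region without collapsing rows;
# DISTINCT then collapses the duplicates, just like the SQL Server version.
rows = con.execute(
    "SELECT DISTINCT region, SUM(spending) OVER (PARTITION BY region) "
    "FROM militaryspending ORDER BY region").fetchall()
print(rows)
```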

This will be the last query I do; you could, literally, do this dozens of ways. This one uses a UNION to select the distinct region values from the militaryspending table. Those values are then used in a self-join to generate the total spending by region.

mysql> SELECT a.region, SUM(b.spending) AS totSpending
    -> FROM
    -> (
    -> SELECT region FROM militaryspending
    -> UNION
    -> SELECT region FROM militaryspending
    -> ) AS a
    -> LEFT JOIN militaryspending b
    -> ON a.region = b.region
    -> GROUP BY region;
+---------------+-------------+
| region        | totSpending |
+---------------+-------------+
| Africa        |      120608 |
| Asia          |    428477.7 |
| Australia     |     28559.1 |
| Europe        |    312412.2 |
| North America |    743120.4 |
| South America |     65976.9 |
+---------------+-------------+
6 rows in set (0.00 sec)

I would never do any of this, assuming viable alternatives, on a job that I was being paid to do, or even not paid to do, but as a creative practice this sort of thing is very useful. And, like it or not, SQL, and programming, are creative disciplines.

11 May

Horizontal Lists

Honestly, I just thought this was a fun exercise. I’m not going to go into heavy detail on it, and my solution is essentially identical to the one in the SQL Cookbook, except that I use variables.

So, here is some code, and the result.

mysql> SELECT MAX(north) AS 'North America', MAX(south) AS 'South America'
    -> FROM
    -> (
    -> SELECT
    -> IF (region = 'North America', country, '') AS north,
    -> IF (region = 'North America', @ctr := @ctr + 1, NULL) AS northCtr,
    -> IF (region = 'South America', country, '') AS south,
    -> IF (region = 'South America', @ctr1 := @ctr1 + 1, NULL) AS southCtr
    -> FROM militaryspending
    -> JOIN (SELECT @ctr := 0, @ctr1 := 0) as A
    -> WHERE region IN ('North America','South America')
    -> ORDER BY country
    -> ) AS B
    -> GROUP BY COALESCE(northCtr, southCtr);
+--------------------+---------------+
| North America      | South America |
+--------------------+---------------+
| Belize             | Argentina     |
| Canada             | Bolivia       |
| Dominican Republic | Brazil        |
| El Salvador        | Chile         |
| Guatemala          | Colombia      |
| Honduras           | Ecuador       |
| Mexico             | Guyana        |
| Nicaragua          | Paraguay      |
| United States      | Peru          |
|                    | Uruguay       |
|                    | Venezuela     |
+--------------------+---------------+
11 rows in set (0.00 sec)

What this does is surprisingly simple. The inner query builds a column for each of the two regions, orders the rows, and attaches a running counter to each region’s values. The outer query selects the MAX record for each counter value, and the counter drives the GROUP BY. Finally, the COALESCE in the GROUP BY makes sure we use whichever counter applies to a given row.
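
Outside of SQL, the same pivot is nearly a one-liner in application code. Here’s the idea in Python with zip_longest, using a few hard-coded country names (a subset of the real data) for illustration:

```python
from itertools import zip_longest

# Hard-coded country names for illustration (subset of the real data).
north = ["Belize", "Canada", "Mexico"]
south = ["Argentina", "Bolivia", "Brazil", "Chile"]

# Pad the shorter column with empty strings, exactly like the SQL result.
table = list(zip_longest(sorted(north), sorted(south), fillvalue=""))
for n, s in table:
    print(f"{n:<10} {s}")
```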

When I first did this it didn’t seem all that useful, but suppose we wanted a report that showed the spenders in each region, ranked from biggest to smallest? We could do that with the following code.

mysql> SELECT MAX(north) AS 'North America', MAX(south) AS 'South America'
    -> FROM
    -> (
    -> SELECT
    -> IF (region = 'North America', country, '') AS north,
    -> IF (region = 'North America', @ctr := @ctr + 1, NULL) AS northCtr,
    -> IF (region = 'South America', country, '') AS south,
    -> IF (region = 'South America', @ctr1 := @ctr1 + 1, NULL) AS southCtr,
    -> spending
    -> FROM militaryspending
    -> JOIN (SELECT @ctr := 0, @ctr1 := 0) as A
    -> WHERE region IN ('North America','South America')
    -> ORDER BY spending DESC
    -> ) AS B
    -> GROUP BY COALESCE(northCtr, southCtr);
+--------------------+---------------+
| North America      | South America |
+--------------------+---------------+
| United States      | Brazil        |
| Canada             | Colombia      |
| Mexico             | Chile         |
| Dominican Republic | Argentina     |
| El Salvador        | Venezuela     |
| Guatemala          | Ecuador       |
| Honduras           | Peru          |
| Nicaragua          | Uruguay       |
| Belize             | Bolivia       |
|                    | Paraguay      |
|                    | Guyana        |
+--------------------+---------------+
11 rows in set (0.00 sec)

Of course, if you want to actually see the spending amounts, you will need to add two more columns.

mysql> SELECT MAX(north) AS 'North America',
    -> SUM(northspending) AS NorthSpending,
    -> MAX(south) AS 'South America',
    -> MAX(southSpending) AS southSpending
    -> FROM
    -> (
    -> SELECT
    -> IF (region = 'North America', country, '') AS north,
    -> IF (region = 'North America', @ctr := @ctr + 1, NULL) AS northCtr,
    -> IF (region = 'North America', spending, '') AS northSpending,
    -> IF (region = 'South America', country, '') AS south,
    -> IF (region = 'South America', @ctr1 := @ctr1 + 1, NULL) AS southCtr,
    -> IF (region = 'South America', spending, '') AS southSpending,
    -> spending
    -> FROM militaryspending
    -> JOIN (SELECT @ctr := 0, @ctr1 := 0) as A
    -> WHERE region IN ('North America','South America')
    -> ORDER BY spending DESC
    -> ) AS B
    -> GROUP BY COALESCE(northCtr, southCtr);
+--------------------+---------------+---------------+---------------+
| North America      | NorthSpending | South America | southSpending |
+--------------------+---------------+---------------+---------------+
| United States      |        711421 | Brazil        | 35360         |
| Canada             |         24659 | Colombia      | 10957         |
| Mexico             |          6022 | Chile         | 8040          |
| Dominican Republic |           341 | Argentina     | 3295          |
| El Salvador        |           222 | Venezuela     | 2385          |
| Guatemala          |           200 | Ecuador       | 2308          |
| Honduras           |           185 | Peru          | 2098          |
| Nicaragua          |          54.7 | Uruguay       | 902           |
| Belize             |          15.7 | Bolivia       | 352           |
|                    |             0 | Paraguay      | 250           |
|                    |             0 | Guyana        | 29.9          |
+--------------------+---------------+---------------+---------------+
11 rows in set (0.00 sec)

And there you go: the countries sorted by region, with their spending in descending order. I mixed MAX and SUM for the aggregates intentionally, because it doesn’t matter in this example and I wanted to show that it doesn’t matter.

Enjoy!

09 May

Finding The Next, Non-like, Item In A list

What a lousy title. I really have no idea what to call this one. It’s even, kind of, hard to explain. On the SQL Team forums a user had a question about tracking entries and exits from a room. However, each room might have multiple entries and multiple exits, so you want the first entry, then the first exit that is greater than the entry time.

Here’s the thread.

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=184712

This will be a SQL Server solution, for the most part, but I’ll show the MySQL version later on.  It’s pretty much the same except that you use an inline view.

CREATE TABLE mydates (id INT, checktime DATETIME, type CHAR(1));
INSERT INTO mydates VALUES 
('11','2013-01-01 09:00:00','I'),
('11','2013-01-01 09:10:00','I'),
('11','2013-01-01 09:11:00','I'),
('11','2013-01-01 12:00:00','O'),
('11','2013-01-01 12:02:00','O'),
('11','2013-01-01 14:15:00','I'),
('11','2013-01-01 14:45:00','O'),
('11','2013-01-01 15:00:00','I'),
('11','2013-01-01 15:03:00','I'),
('11','2013-01-01 18:00:00','O'),
('11','2013-01-01 18:00:00','O'),
('11','2013-01-02 09:10:00','I'),
('11','2013-01-02 09:11:00','I'),
('11','2013-01-02 18:00:00','O'),
('11','2013-01-02 18:05:00','O');

And you want to produce the following result.

id	timein	                timeout
11	2013-01-01 09:00:00.000	2013-01-01 12:00:00.000
11	2013-01-01 14:15:00.000	2013-01-01 14:45:00.000
11	2013-01-01 15:00:00.000	2013-01-01 18:00:00.000
11	2013-01-02 09:10:00.000	2013-01-02 18:00:00.000

In essence, the first I type is attached to the first O type after it, then the next I type greater than that O type, and so on and so forth. Yeah, it’s weird, and it would be fair to ask why they have multiple in/out times and to expect them to fix that problem, but you know how that works in the real world.
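
For reference, the target result can also be reached without a self-join, using the correlated-subquery and NOT EXISTS machinery from the EXISTS post: keep each I row that has no other I between it and the most recent O, then pair it with the first O after it. Here’s a sketch in SQLite from Python, on a trimmed copy of the data (string timestamps in this format compare correctly). The walkthrough below builds the self-join route instead.

```python
import sqlite3

# Trimmed copy of the mydates data: one multi-I visit and one clean visit.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mydates (id INT, checktime TEXT, type CHAR(1))")
con.executemany("INSERT INTO mydates VALUES (?, ?, ?)", [
    (11, "2013-01-01 09:00:00", "I"), (11, "2013-01-01 09:10:00", "I"),
    (11, "2013-01-01 12:00:00", "O"), (11, "2013-01-01 14:15:00", "I"),
    (11, "2013-01-01 14:45:00", "O")])

# Keep each I row with no other I between it and the most recent O,
# then pair it with the first O that follows it.
rows = con.execute("""
    SELECT a.id, a.checktime AS timein,
           (SELECT MIN(b.checktime) FROM mydates b
            WHERE b.type = 'O' AND b.checktime > a.checktime) AS timeout
    FROM mydates a
    WHERE a.type = 'I'
      AND NOT EXISTS (
          SELECT 1 FROM mydates c
          WHERE c.type = 'I' AND c.checktime < a.checktime
            AND c.checktime > COALESCE(
                (SELECT MAX(d.checktime) FROM mydates d
                 WHERE d.type = 'O' AND d.checktime < a.checktime), ''))
    ORDER BY timein""").fetchall()
print(rows)
```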

So, here is how I got there, and for the record, it was pretty close to this. I kind of step-by-stepped it until I had an answer.

Step 1: Realize I only needed the I types at first.

SELECT * 
FROM mydates
WHERE type = 'I';

id	checktime	        type
11	2013-01-01 09:00:00.000	I
11	2013-01-01 09:10:00.000	I
11	2013-01-01 09:11:00.000	I
11	2013-01-01 14:15:00.000	I
11	2013-01-01 15:00:00.000	I
11	2013-01-01 15:03:00.000	I
11	2013-01-02 09:10:00.000	I
11	2013-01-02 09:11:00.000	I

This really doesn’t help much, but, it gives a base to work with.

Step 2: Create a self-join to get the O types.

Essentially, I just wrote a left join that joins on everything where the types don’t match. It doesn’t have to be a left join; a regular inner join would be fine. I have an old habit, from doing a lot of order-entry queries in a past life, of reflexively writing left joins.

SELECT * 
FROM mydates a
LEFT JOIN mydates b
ON a.type <> b.type
AND a.id = b.id
WHERE a.type = 'I';

id	checktime	        type	id	checktime	        type
11	2013-01-01 09:00:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-02 18:05:00.000	O

Yeah, that’s a bit long, but don’t worry, it will clean up soon.

Anyways, what this has done is combine every type I with every type O, which is why it’s so long. We need to shorten it a bit.

Step 3: Only select type O times that are greater than a type I time.

SELECT * 
FROM mydates a
LEFT JOIN mydates b
ON a.type <> b.type
AND a.id = b.id
AND a.checktime < b.checktime
WHERE a.type = 'I';

id	checktime	        type	id	checktime	        type
11	2013-01-01 09:00:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 09:00:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 12:02:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-02 18:05:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-02 18:05:00.000	O

That helps, but it still isn’t narrowed down enough. However, the next step will get us there.

Step 4: Use the MIN function to get the minimum type O time for each type I time.

You do this by rewriting the query to use a MIN function and a GROUP BY as follows:

SELECT a.id, a.checktime, a.type, b.id, MIN(b.checktime) AS timeout, b.type 
FROM mydates a
LEFT JOIN mydates b
ON a.type <> b.type
AND a.id = b.id
AND a.checktime < b.checktime
WHERE a.type = 'I'
GROUP BY a.id, a.checktime, a.type, b.id, b.type;

id	checktime	        type	id	timeout    	        type
11	2013-01-01 09:00:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:10:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 09:11:00.000	I	11	2013-01-01 12:00:00.000	O
11	2013-01-01 14:15:00.000	I	11	2013-01-01 14:45:00.000	O
11	2013-01-01 15:00:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-01 15:03:00.000	I	11	2013-01-01 18:00:00.000	O
11	2013-01-02 09:10:00.000	I	11	2013-01-02 18:00:00.000	O
11	2013-01-02 09:11:00.000	I	11	2013-01-02 18:00:00.000	O

Ah, better: now for each type I time we have the first type O time greater than it.

Step 5: Clean up the earlier query.

Normally, I would do this last, but the WITH statement doesn’t lend itself to building an example the way I’ve been going, so I’ll clean it up now.

SELECT a.id, a.checktime AS timein, MIN(b.checktime) AS timeout
FROM mydates a
LEFT JOIN mydates b
ON a.type <> b.type
AND a.id = b.id
AND a.checktime < b.checktime
WHERE a.type = 'I'
GROUP BY a.id, a.checktime;

id	timein	                timeout
11	2013-01-01 09:00:00.000	2013-01-01 12:00:00.000
11	2013-01-01 09:10:00.000	2013-01-01 12:00:00.000
11	2013-01-01 09:11:00.000	2013-01-01 12:00:00.000
11	2013-01-01 14:15:00.000	2013-01-01 14:45:00.000
11	2013-01-01 15:00:00.000	2013-01-01 18:00:00.000
11	2013-01-01 15:03:00.000	2013-01-01 18:00:00.000
11	2013-01-02 09:10:00.000	2013-01-02 18:00:00.000
11	2013-01-02 09:11:00.000	2013-01-02 18:00:00.000

Step 6: Select the minimum timein value based on the timeout value.

So, what we now have is a list of timeout values attached to a bunch of timein values. We simply run the same query we did earlier, only this time taking the minimum timein value for each timeout, and bang/done.

WITH cte AS
(
SELECT a.id, a.checktime AS timein, MIN(b.checktime) AS timeout
FROM mydates a
LEFT JOIN mydates b
ON a.type <> b.type
AND a.id = b.id
AND a.checktime < b.checktime
WHERE a.type = 'I'
GROUP BY a.id, a.checktime 
)
SELECT id, MIN(timein) AS timein, timeout
FROM cte
GROUP BY id, timeout;

id	timein          	timeout
11	2013-01-01 09:00:00.000	2013-01-01 12:00:00.000
11	2013-01-01 14:15:00.000	2013-01-01 14:45:00.000
11	2013-01-01 15:00:00.000	2013-01-01 18:00:00.000
11	2013-01-02 09:10:00.000	2013-01-02 18:00:00.000

This might make more sense if I add a summary of the steps.

  1. Select all type I values.
  2. Join the type O values to the type I values.
  3. Restrict the query to only values where the type O time is greater than the type I time.
  4. Select the minimum type O value for each type I value.
  5. Select the minimum type I value for each type O value via a subquery/with statement.

The MySQL version, which is really a MySQL / SQL Server version because it works just fine on both sides of the house, is:

mysql> SELECT a.id, MIN(timein) AS timein, timeout
    -> FROM
    -> (
    -> SELECT a.id, a.checktime AS timein, MIN(b.checktime) AS timeout
    -> FROM mydates a
    -> LEFT JOIN mydates b
    -> ON a.type <> b.type
    -> AND a.id = b.id
    -> AND a.checktime < b.checktime
    -> WHERE a.type = 'I'
    -> GROUP BY a.id, a.checktime ) AS A
    -> GROUP BY a.id, timeout;
+------+---------------------+---------------------+
| id   | timein              | timeout             |
+------+---------------------+---------------------+
|   11 | 2013-01-01 09:00:00 | 2013-01-01 12:00:00 |
|   11 | 2013-01-01 14:15:00 | 2013-01-01 14:45:00 |
|   11 | 2013-01-01 15:00:00 | 2013-01-01 18:00:00 |
|   11 | 2013-01-02 09:10:00 | 2013-01-02 18:00:00 |
+------+---------------------+---------------------+
4 rows in set (0.00 sec)

I used the WITH statement earlier, and in the original post, because the Microsoft guys seem more comfortable with that, especially on a “new guy” forum like the one where I posted this. I also have to admit that I think it’s easier to write. Nesting subqueries gets a bit messy to read.
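If you want to convince yourself the two MIN/GROUP BY passes really do pair each check-in with the right check-out, the whole pipeline is easy to sanity-check outside SQL. Here’s a rough Python sketch of the same logic (the `pair_times` function and its tuple format are my own invention, not anything from the book or the forum post):

```python
from collections import defaultdict

def pair_times(rows):
    """rows: list of (id, checktime, type) tuples, type 'I' or 'O'.

    Mirrors the two SQL steps: first find the minimum type O time
    greater than each type I time, then keep the minimum type I time
    for each resulting (id, timeout) pair.
    """
    ins = [(i, t) for i, t, ty in rows if ty == 'I']
    outs = [(i, t) for i, t, ty in rows if ty == 'O']

    # Step 1: earliest 'O' after each 'I' (the first MIN/GROUP BY).
    first_out = {}
    for i, tin in ins:
        later = [tout for j, tout in outs if j == i and tout > tin]
        if later:
            first_out[(i, tin)] = min(later)

    # Step 2: earliest 'I' for each (id, timeout) (the second MIN/GROUP BY).
    candidates = defaultdict(list)
    for (i, tin), tout in first_out.items():
        candidates[(i, tout)].append(tin)
    return sorted((i, min(tins), tout)
                  for (i, tout), tins in candidates.items())
```

Fed the example data, it produces the same four (timein, timeout) pairs the final query returned.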

Hope you enjoyed it.

07 May

Returning The Bottom Two Items (No One Will Like This)

In an interview, guys like to do slightly exotic things.  Often, they’re hitting you with a problem that is solved pretty easily on one platform, but not so easily on another.  For instance, finding the top or bottom two of an ordered result is pretty easy with the ROW_NUMBER function, but it’s less easy in MySQL.  I guess I’m feeling persnickety (it’s good that I have a spell check) because I’m going to come up with an exotic answer to the problem.

By the way, this is, really, a very unfair interview question on the MySQL side of the house, except maybe in rare cases.  The kind of guy who knows this sort of thing, especially this approach, in an interview, is a freak.  Period.  I had to process this in the background for a while until I sat down and came up with it.  In fact, until the SQL Cookbook entered my life I had never even used GROUP_CONCAT.

Note: This is a MySQL thing. In SQL Server you’d just throw out a ROW_NUMBER and move on.

So, we’ll be using my old military spending data.  Here it is, all normalized, clean, and showing incredible attention to detail (I copied it off the SIPRI site, tossed it into an editor, and created a table with it because I’m pro).

mysql> SELECT * FROM militaryspending;
+------------------------+----------+---------+---------------+-----+
| country                | spending | gdpperc | region        | id  |
+------------------------+----------+---------+---------------+-----+
| Seychelles             |      9.3 |     1.3 | Africa        |   1 |
| Cape Verde             |      9.7 |     0.5 | Africa        |   2 |
| Mauritius              |     10.1 |     0.1 | Africa        |   3 |
| Belize                 |     15.7 |     1.1 | North America |   4 |
| Moldova                |     20.8 |     0.3 | Europe        |   5 |
Snipped...
| France                 |    62535 |     2.3 | Europe        | 122 |
| United Kingdom         |    62685 |     2.6 | Europe        | 123 |
| Russia                 |    71853 |     3.9 | Asia          | 124 |
| China                  |   142859 |     2.1 | Asia          | 125 |
| United States          |   711421 |     4.8 | North America | 126 |
+------------------------+----------+---------+---------------+-----+
126 rows in set (0.01 sec)

Actually, I did clean it up.  It didn’t have an ID/primary key field before.

Anyways, we have 6 regions and we want the result set to only show the lowest two for each region.

To do this we’re first going to do something completely weird, use the GROUP_CONCAT function combined with the SUBSTRING_INDEX function. The code and the result will look like this.

mysql> SELECT region,
    -> SUBSTRING_INDEX(GROUP_CONCAT(country ORDER BY spending DESC),',',-2) AS bottomTwo
    -> FROM militaryspending
    -> GROUP BY region;
+---------------+------------------------------+
| region        | bottomTwo                    |
+---------------+------------------------------+
| Africa        | Cape Verde,Seychelles        |
| Asia          | Mongolia,Timor Leste         |
| Australia     | New Zealand,Papua New Guinea |
| Europe        | Malta,Moldova                |
| North America | Nicaragua,Belize             |
| South America | Paraguay,Guyana              |
+---------------+------------------------------+
6 rows in set (0.00 sec)

So, technically, you have provided an answer to a sloppy interview question. These are the two lowest military spenders by region.

What you still need to do is provide the results, in an ordered list, with spending data included.  They will ask for it, immediately.  Here’s how I did it, although it’s not an optimal solution by any means.

mysql> SELECT a.region, a.country, a.spending
    -> FROM militaryspending a
    -> INNER JOIN
    -> (
    -> SELECT region,
    -> SUBSTRING_INDEX(GROUP_CONCAT(id ORDER BY spending DESC),',',-2) AS bottomTwo
    -> FROM militaryspending
    -> GROUP BY region
    -> ) b
    -> ON a.region = b.region
    -> WHERE FIND_IN_SET(a.id, b.bottomtwo) > 0
    -> ORDER BY a.region, a.spending;
+---------------+------------------+----------+
| region        | country          | spending |
+---------------+------------------+----------+
| Africa        | Seychelles       |      9.3 |
| Africa        | Cape Verde       |      9.7 |
| Asia          | Timor Leste      |     27.3 |
| Asia          | Mongolia         |     81.4 |
| Australia     | Papua New Guinea |     61.1 |
| Australia     | New Zealand      |     1792 |
| Europe        | Moldova          |     20.8 |
| Europe        | Malta            |     63.5 |
| North America | Belize           |     15.7 |
| North America | Nicaragua        |     54.7 |
| South America | Guyana           |     29.9 |
| South America | Paraguay         |      250 |
+---------------+------------------+----------+
12 rows in set (0.00 sec)

I should probably explain this.

First, I join the earlier query to the militaryspending table in a straight old inner join. This gives me 126 results with each country tied to its region’s GROUP_CONCAT list. Then all I do is write the WHERE clause using FIND_IN_SET to go hunting for each row’s id in the GROUP_CONCAT results. If it finds something it returns a result greater than 0 (the position in the list), and there you go.
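If the string-slicing still feels too magical, the underlying idea is just “sort each region’s rows by spending and keep the lowest two.” Here’s that logic in plain Python (the `bottom_n` function and the tuple layout are made up for illustration, not from any library):

```python
def bottom_n(rows, n=2):
    """rows: list of (country, spending, region) tuples.

    Returns the n lowest spenders per region, ordered by region then
    spending -- the result the GROUP_CONCAT / SUBSTRING_INDEX /
    FIND_IN_SET combination fakes in MySQL.
    """
    by_region = {}
    for country, spending, region in rows:
        by_region.setdefault(region, []).append((spending, country))

    result = []
    for region in sorted(by_region):
        # Sorting by spending and slicing [:n] replaces the
        # GROUP_CONCAT(... ORDER BY spending DESC) + last-two trick.
        for spending, country in sorted(by_region[region])[:n]:
            result.append((region, country, spending))
    return result
```

With a ROW_NUMBER-style window function you’d get the same effect in one pass, which is exactly why this question is easier on SQL Server.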

If you do this with variables, and some IF magic, that will produce a better result because you won’t need the join. On the other hand, if you toss this at an interviewer, unless they are very good, they will have no idea what you just hit them with.  You’ll either get big points for teaching them something, or, they’ll bounce you, for teaching them something.

Have fun!

06 May

SQL Cookbook

The SQL Cookbook, Available At Amazon.com.

I’m overdue on this book as I’ve been reading, and posting about it, for several days now.  In fact, it’s pretty much replaced my desire to write about the SAS Advanced certification.  Frankly, it’s exactly the kind of thing I’ve been looking for in a SQL book: slightly exotic queries and approaches, from a variety of platforms.  Outside of a couple of bugs, and the fact that it’s slightly dated (published in 2006), it’s an excellent book.

In fact, I’ll start the review process by giving it 5 stars.  It’s rare that I read a tech book, outside of certification books, from start to finish.  Usually, I get part way in, get bored, and stop.  This book has kept me plugging through it.

What this book does is present a series of problems / questions that it solves with SQL.  It shows variants for the following platforms: SQL Server, MySQL, Oracle, PostgreSQL and DB2.  Frankly, I think this is awesome.  One of the problems I’ve run into over the years, and it’s something I’ve tried to discuss on this site, is that people don’t understand that different platforms do things differently.  That it covers multiple platforms, is by far, the best part of this book.

The problems / questions, for the most part, range from really basic to more complex variations on the problem.  This is useful in a couple of ways.  First, a novice won’t know the basic approach, so showing it makes the book useful even for relative beginners.  The more complex problems, while not often things you would do on a job, are the kinds of things some wanker would ask in an interview.  And unless you are exceptionally good, or lucky, you won’t manage to scramble together a SQL solution in an interview to a problem that might take a while to even get clear in your head.  This book can help you be prepared for that sort of thing.  It’s a terrible approach to interviewing but I can attest that it will happen.

The book also moves at a really fluid pace.  If you’ve written SQL in the past you don’t need to plow through each example, so you can skip around to pick the areas you don’t know that well or are curious about.  It doesn’t need a thorough reading.

While I grade the book at 5 stars, there are a few problems with it.

First, the book is written as a kind of reference.  But, truthfully, with the internet, and places like Stackoverflow, you can almost always get a workable answer somewhere.  However, despite that, I would still trust what I find here more than what I might find online.  Books, even as bad as O’Reilly has been with a few recent books, should still have an editing process.  Online, who knows what you might stumble on, and who knows if it’s the best way.

Second, because the book was published in 2006 it doesn’t include some things that have been added in later versions.  However, I believe, and would encourage people, to rely on basic SQL as a general rule.  The catch with the tricks is that they are almost always version specific and if you ever switch, or wind up working with some other platform, then you could be in trouble.  I think, mostly, the author tends towards this approach and I appreciate it.

Finally, the book does have a couple of errors or less than ideal approaches.  I’ve written posts about a couple of those.  Unfortunately, O’Reilly has gotten sloppy with their QA, although it’s a lot worse today than it was when this book was written.  I’ve had some terrible experiences with them, see my reviews on the Microsoft Training Kits, for example.  Overall though, I’m not that bothered by what I’ve noticed, or not noticed, and the Errata, especially on a book 6 years old, is decent.

Overall, I’ve really enjoyed this book.  I’m not quite finished (2/3rds of the way) but it’s the rare technical book that I will take all the way home, and I can’t help but recommend it.   I wish I’d found it a few years ago.  It would have helped to save me some grief in a few cases.  So, if you are learning SQL, or want to dabble in some more exotic approaches to problems, get this book, and get it now.

You can get this book at Amazon.com.

04 May

Creating A Calendar

Ah, another post on The SQL Cookbook.  He decided to make a calendar using SQL as an exercise.  It sounded like fun, and I’m weird, so I did it.  His solution is different from mine, and his result looks a bit cleaner, but conceptually we’re both doing exactly the same thing.

Normally I would go step by step on one of these and detail the thought process but frankly I’m under a bit of time pressure and really just want to get this one done.  So, I’m going to toss out the code, explain it, and let it roll.  If you have experience with SQL you should get what I’ve done.

mysql> SET @var = '2013-05-01';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT
    -> SUM(IF(DAYOFWEEK(ADDDATE(@var,id-1)) = 1, DAY(ADDDATE(@var,id-1)), '')) AS Su,
    -> SUM(IF(DAYOFWEEK(ADDDATE(@var,id-1)) = 2, DAY(ADDDATE(@var,id-1)), '')) AS Mo,
    -> SUM(IF(DAYOFWEEK(ADDDATE(@var,id-1)) = 3, DAY(ADDDATE(@var,id-1)), '')) AS Tu,
    -> SUM(IF(DAYOFWEEK(ADDDATE(@var,id-1)) = 4, DAY(ADDDATE(@var,id-1)), '')) AS We,
    -> SUM(IF(DAYOFWEEK(ADDDATE(@var,id-1)) = 5, DAY(ADDDATE(@var,id-1)), '')) AS Th,
    -> SUM(IF(DAYOFWEEK(ADDDATE(@var,id-1)) = 6, DAY(ADDDATE(@var,id-1)), '')) AS Fr,
    -> SUM(IF(DAYOFWEEK(ADDDATE(@var,id-1)) = 7, DAY(ADDDATE(@var,id-1)), '')) AS Sa
    -> FROM numbers
    -> WHERE id <= DAY(LAST_DAY(@var))
    -> GROUP BY WEEK(ADDDATE(@var,id-1));
+------+------+------+------+------+------+------+
| Su   | Mo   | Tu   | We   | Th   | Fr   | Sa   |
+------+------+------+------+------+------+------+
|    0 |    0 |    0 |    1 |    2 |    3 |    4 |
|    5 |    6 |    7 |    8 |    9 |   10 |   11 |
|   12 |   13 |   14 |   15 |   16 |   17 |   18 |
|   19 |   20 |   21 |   22 |   23 |   24 |   25 |
|   26 |   27 |   28 |   29 |   30 |   31 |    0 |
+------+------+------+------+------+------+------+
5 rows in set (0.00 sec)

So, here’s how it works. First we set @var to the first day of a month. It’s easier that way.

The query then reads from the numbers table, which starts at one for this example, and selects all records that are less than, or equal to, the number of days in the target month, which in this case, is 31.

The data is then grouped by the week of the year for each day of the month. The ADDDATE function cycles through each day in the month.

Finally, the SELECT creates a SUM for the 7 days in a week. So, our data set has 31 records, split over 7 columns and however many weeks are in the month. One other note: I’m using “id - 1” because of how the ADDDATE function works. I want it to add 0 for the first record.
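Incidentally, a quick way to check the SQL output is Python’s standard calendar module, which builds the exact same grid (the `month_grid` wrapper is mine; `Calendar.monthdayscalendar` is the real standard-library call):

```python
import calendar

# Match the MySQL output: weeks run Sunday through Saturday.
cal = calendar.Calendar(firstweekday=calendar.SUNDAY)

def month_grid(year, month):
    """Return the month as rows of 7 day numbers, with 0 as padding
    outside the month -- the same shape the SUM/IF pivot produces."""
    return cal.monthdayscalendar(year, month)

for week in month_grid(2013, 5):
    print(week)
```

For May 2013 the first row comes out [0, 0, 0, 1, 2, 3, 4], matching the query above.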

This should also work for SQL Server if you cleaned up the code a bit for the language differences. I don’t know if I’ll ever do it but I’m comfortable that you could do this with a PIVOT on the SQL Server side of the house.

In closing, there is no earthly reason, at least none that I can think of, to do this. It’s just a fun exercise.

Hope you enjoy it.

02 May

Days Between Two Dates (Excluding Weekends)

I’m still having fun with the SQL Cookbook (I need to review it and I think I’ve already said that) and he talks about calculating the difference between two dates, which is easy, but then excluding weekends, which is less easy, but not that hard.

Because I’m a Monday Morning SQL writer, does that analogy really work here, I’m going to offer a different solution to what the author proposed.  Technically, all I’m doing is changing the SQL, as this is basically what he proposes.  I’m not going to show his code in this case as what he did is a bit long and he’s doing a couple of other things in it so I’ll just show you how to do it.

This will go through a number of steps and follow a similar pattern to my last SQL example in that it will join to a sequence of numbers.

1. Calculate the number of days between two dates.

mysql> SELECT DATEDIFF('2007-08-01','2007-01-01') as diff;
+------+
| diff |
+------+
|  212 |
+------+

So, there are 212 days between the two dates. Simple enough but we still have to get rid of weekends, which will take a few more steps.

One other thing that is important is to consider whether you want to include the first date in the range. The answer is yes in this case, so our numbers table will need to include a zero. I’ve added it to the table; however, yesterday’s example needed to start at 1, so it did not have a zero.  In other words, if you include both endpoint dates there are actually 213 dates in the list.

2. Select data from the numbers table.

In a normal example we would have to do a join but because we’re calculating the difference between two dates all we need to do is a simple select.

mysql> SELECT DATEDIFF('2007-08-01','2007-01-01') as totDays,
    -> id as diff
    -> FROM numbers
    -> WHERE id <= DATEDIFF('2007-08-01','2007-01-01');
+---------+------+
| totDays | diff |
+---------+------+
|     212 |    0 |
|     212 |    1 |
|     212 |    2 |
|     212 |    3 |
|     212 |    4 |

Snipped...

|     212 |  208 |
|     212 |  209 |
|     212 |  210 |
|     212 |  211 |
|     212 |  212 |
+---------+------+
213 rows in set (0.00 sec)

So now we have a list of 213 numbers. All we need to do is convert those to dates as follows.

3. Build a list of dates.

mysql> SELECT DATEDIFF('2007-08-01','2007-01-01') as totDays,
    -> id as diff,
    -> ADDDATE('2007-01-01',id) as newDate
    -> FROM numbers
    -> WHERE id <= DATEDIFF('2007-08-01','2007-01-01');
+---------+------+------------+
| totDays | diff | newDate    |
+---------+------+------------+
|     212 |    0 | 2007-01-01 |
|     212 |    1 | 2007-01-02 |
|     212 |    2 | 2007-01-03 |
|     212 |    3 | 2007-01-04 |
|     212 |    4 | 2007-01-05 |

Snipped...

|     212 |  208 | 2007-07-28 |
|     212 |  209 | 2007-07-29 |
|     212 |  210 | 2007-07-30 |
|     212 |  211 | 2007-07-31 |
|     212 |  212 | 2007-08-01 |
+---------+------+------------+
213 rows in set (0.00 sec)

4. Evaluate each date using the DAYOFWEEK function.

We’re going to clean this up in a second, but for now we’ll simply add the DAYOFWEEK function to the code. Note that 1 and 7 are Sunday and Saturday.

mysql> SELECT DATEDIFF('2007-08-01','2007-01-01') as totDays,
    -> id as diff,
    -> ADDDATE('2007-01-01',id) as newDate,
    -> DAYOFWEEK(ADDDATE('2007-01-01',id)) as dayOfWeek
    -> FROM numbers
    -> WHERE id <= DATEDIFF('2007-08-01','2007-01-01');
+---------+------+------------+-----------+
| totDays | diff | newDate    | dayOfWeek |
+---------+------+------------+-----------+
|     212 |    0 | 2007-01-01 |         2 |
|     212 |    1 | 2007-01-02 |         3 |
|     212 |    2 | 2007-01-03 |         4 |
|     212 |    3 | 2007-01-04 |         5 |
|     212 |    4 | 2007-01-05 |         6 |

Snipped...

|     212 |  208 | 2007-07-28 |         7 |
|     212 |  209 | 2007-07-29 |         1 |
|     212 |  210 | 2007-07-30 |         2 |
|     212 |  211 | 2007-07-31 |         3 |
|     212 |  212 | 2007-08-01 |         4 |
+---------+------+------------+-----------+
213 rows in set (0.00 sec)

5. Remove Saturdays and Sundays.

In order to do this all we need to do is add a WHERE clause that checks for the values of 1 and 7, which are Sunday and Saturday with the DAYOFWEEK function.

mysql> SELECT DATEDIFF('2007-08-01','2007-01-01') as totDays,
    -> id as diff,
    -> ADDDATE('2007-01-01',id) as newDate,
    -> DAYOFWEEK(ADDDATE('2007-01-01',id)) as dayOfWeek
    -> FROM numbers
    -> WHERE id <= DATEDIFF('2007-08-01','2007-01-01')     
    -> AND DAYOFWEEK(ADDDATE('2007-01-01',id)) NOT IN (1,7);
+---------+------+------------+-----------+
| totDays | diff | newDate    | dayOfWeek |
+---------+------+------------+-----------+
|     212 |    0 | 2007-01-01 |         2 |
|     212 |    1 | 2007-01-02 |         3 |
|     212 |    2 | 2007-01-03 |         4 |
|     212 |    3 | 2007-01-04 |         5 |
|     212 |    4 | 2007-01-05 |         6 |

Snipped...

|     212 |  206 | 2007-07-26 |         5 |
|     212 |  207 | 2007-07-27 |         6 |
|     212 |  210 | 2007-07-30 |         2 |
|     212 |  211 | 2007-07-31 |         3 |
|     212 |  212 | 2007-08-01 |         4 |
+---------+------+------------+-----------+
153 rows in set (0.01 sec)

Notice that we’ve returned 153 rows in the set, which is how many working days there are between the two dates. Now we just need to clean up the code so we can get the number of days between the two.

6. Clean up the code.

This is super easy. Just remove the entire SELECT and replace it with a COUNT(*).

mysql> SELECT COUNT(*) as numWorkingDays
    -> FROM numbers
    -> WHERE id <= DATEDIFF('2007-08-01','2007-01-01')     
    -> AND DAYOFWEEK(ADDDATE('2007-01-01',id)) NOT IN (1,7);
+----------------+
| numWorkingDays |
+----------------+
|            153 |
+----------------+
1 row in set (0.00 sec)

You can do the same thing in SQL Server, however, the code, is a bit different. Also, the DATEFIRST value can be set so that Sunday is not the first day in a week. I won’t cover that but just be aware that it can happen. The SQL Server default is the same as it is in MySQL.

Here is the SQL Server code.

SELECT COUNT(*) as numWorkingDays
FROM numbers
WHERE id <= DATEDIFF(day,'2007-01-01','2007-08-01')
AND DATEPART(weekday,DATEADD(day,id,'2007-01-01')) NOT IN (1,7);

numWorkingDays
153

As you can see the DATEDIFF is different, it requires you to specify day (days), and you have to use DATEPART to get the weekday from a date, rather than just using DAYOFWEEK as in MySQL. Otherwise, it’s the same general concept.
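As a sanity check on that 153, the same inclusive weekday count takes only a few lines of Python (the `working_days` function is just my throwaway check, not anything from the book):

```python
from datetime import date, timedelta

def working_days(start, end):
    """Count days from start to end inclusive, skipping Saturdays and
    Sundays -- the same walk the numbers-table query performs."""
    total = (end - start).days
    # weekday(): Monday=0 .. Sunday=6, so 5 and 6 are the weekend.
    return sum(1 for i in range(total + 1)
               if (start + timedelta(days=i)).weekday() < 5)
```

Running it for 2007-01-01 through 2007-08-01 gives 153, agreeing with both queries above.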

I really didn’t add anything beyond the book except that my example is a bit cleaner and isn’t trying to do two things.

Hope it helps as this does come up periodically in business calculations.