19 May

SET Operators: UNION

The easiest set operator to deal with is a UNION.  It’s basically a data set appended to a data set.  It has a couple of minor quirks, column naming, ordering, and then, of course, SAS picked the path least chosen, but for the most part, anytime you see a UNION, or a UNION ALL, it’s going to work the same way.

The code for this example, except where noted, is in MySQL.

Here is a very simple example of a UNION.

mysql> SELECT 1,2,3
    -> UNION
    -> SELECT 4,5,6;
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| 1 | 2 | 3 |
| 4 | 5 | 6 |
+---+---+---+
2 rows in set (0.00 sec)

As you can see the second SELECT is added to the first set. Typically, the data types do not have to match, as automatic type conversion will take place. Except, of course, if you are working with SAS, where the following will happen:

proc sql;
SELECT name, age
FROM mysql.admit
UNION
SELECT age, name
FROM mysql.admit;
quit;

ERROR: Column 1 from the first contributor of UNION is not the same type as its counterpart from the second.
ERROR: Column 2 from the first contributor of UNION is not the same type as its counterpart from the second.

A straight UNION will remove duplicates similar to what a DISTINCT does in a SELECT statement.

mysql> SELECT 1,2,3
    -> UNION
    -> SELECT 1,2,3
    -> UNION
    -> SELECT 3,4,5;
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| 1 | 2 | 3 |
| 3 | 4 | 5 |
+---+---+---+
2 rows in set (0.00 sec)

If you, instead, write the query as a UNION ALL it will include all rows, including any duplicates. This applies in SAS, although, they approach the ALL keyword slightly differently but I’ll deal with that in a later article.

mysql> SELECT 1,2,3
    -> UNION ALL
    -> SELECT 1,2,3
    -> UNION ALL
    -> SELECT 1,2,3;
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| 1 | 2 | 3 |
| 1 | 2 | 3 |
| 1 | 2 | 3 |
+---+---+---+
3 rows in set (0.00 sec)

You can also mix a UNION and a UNION ALL.

mysql> SELECT 1,2,3
    -> UNION ALL
    -> SELECT 1,2,3
    -> UNION
    -> SELECT 1,2,3;
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
1 row in set (0.00 sec)

Each UNION only applies to the SELECT statements that it separates.

You can also alias a UNION, however, the alias must be applied to the first SELECT statement’s columns. You cannot apply it to later columns, well, you can, but as you can see it won’t work. For example:

mysql> SELECT 1 as one, 2 as two, 3 as three
    -> UNION
    -> Select 4,5,6;
+-----+-----+-------+
| one | two | three |
+-----+-----+-------+
|   1 |   2 |     3 |
|   4 |   5 |     6 |
+-----+-----+-------+
2 rows in set (0.00 sec)

mysql> SELECT 1,2,3
    -> UNION ALL
    -> SELECT 1 as one, 2 as two, 3 as three;
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| 1 | 2 | 3 |
| 1 | 2 | 3 |
+---+---+---+
2 rows in set (0.00 sec)

ORDER BY statements must be applied at the end of the UNION. For example:

mysql> SELECT name FROM junk ORDER BY name
    -> UNION
    -> SELECT name FROM junk;
ERROR 1221 (HY000): Incorrect usage of UNION and ORDER BY
mysql>

mysql> SELECT name FROM junk
    -> UNION
    -> SELECT name FROM junk ORDER BY name;
+----------------+
| name           |
+----------------+
| Almers, C      |
| Bonaventure, T |
| Cameron, L     |
| Derber, B      |
| Eberhardt, S   |
| Ivan, H        |
| Johnson, R     |
| Jones, M       |
| King, E        |
| LaMance, K     |
| Murray, W      |
| Nunnelly, A    |
| Oberon, M      |
| Peterson, V    |
| Pitts, D       |
| Quigley, M     |
| Reberson, P    |
| Takahashi, Y   |
| Underwood, K   |
| Warren, C      |
| Wilcox, E      |
+----------------+
21 rows in set (0.04 sec)

And now, it gets a bit funky, because SAS joined the party.

This is viable in SAS:

proc sql;
SELECT name, age
FROM mysql.admit
UNION
SELECT name
FROM mysql.admit;
quit;

Name                 Age
------------------------
Almers, C              .
Almers, C             34
Bonaventure, T         .
Bonaventure, T        31
Cameron, L             .
Cameron, L            47
Derber, B              .
Derber, B             25
Eberhardt, S           .
Eberhardt, S          49
Ivan, H                .
Ivan, H               22
Johnson, R             .
Johnson, R            43

MySQL and SQL Server will both error.

mysql> SELECT 1,2
-> UNION
-> SELECT 1,2,3;
ERROR 1222 (21000): The used SELECT statements have a different number of column

In other words, SQL Server, and MySQL, expect that a UNION operator will have the same number of columns.  SAS, doesn’t care, and simply NULLs, or in SAS terminology, sets those fields to missing.

SET operators, all things considered, are really easy to use, assuming SAS doesn’t get involved and even there, as long as you stay with a UNION or UNION ALL, it’s sort of, maybe, kind of, close.

As far as the SAS Advanced test goes, I think I’ll keep in mind that SAS allows a different number of columns, and, it doesn’t allow different data types in a column. It also has other quirks but I’ll deal with those in later posts.

16 Feb

Separating Strings By Delimiter, Again

This is kind of a weird one that I saw on a forum.  This isn’t an elegant solution but then it isn’t an elegant question either.  I assume that the original poster was doing something with hierarchical data and things got a bit weird.  Anyways, here is the scenario.  You have have two tables that look like this.

SELECT * 
FROM tableA;

id	col1
1	2,3
2	2,6
3	1,7
4	4,10
5	10

SELECT *
FROM tableB;
id	col1
1	2
2	2,3 
3	1,4
4	10,11

The logic that the poster was trying to achieve is if you select id 1 from tableA to check if either 2 or 6 appear in tableB. In this case, we have to separate both tables on the comma delimiter. Something like this.

WHERE tableA.id = 2

Return from tableB

id	col1
1	2
2	2,3

This is actually surprisingly complex. Here are the basics of my solution to it. Note that I assume there will never be more than 2 values in a field.

  • Build 2 common table expressions that separate the delimited values.
  • Separate the second index value by using RIGHT function and subtracting the position of the delimiter from the length of the string.
  • Reverse the string, apply the prior RIGHT function, then REVERSE the string back to its original value.
  • Create a join from the two CTE’s.
  • Test if either of the two values from the first table appear in the second.

The strange piece in all of this is the REVERSE function.  The reason I’m doing this is that you can’t, without being very, very tricky, break out a string based on a delimiter.  The math simply doesn’t work, so I perform something that works, in reverse, get the result, the reverse the result.  The query I came up with looks like this.

WITH cteA AS
(
SELECT col1, 
REVERSE(RIGHT(REVERSE(col1),LEN(col1) - CHARINDEX(',',REVERSE(col1)))) as index1,
RIGHT(col1,LEN(col1) - CHARINDEX(',',col1)) as index2
FROM tableA
),
cteB AS
(
SELECT col1, 
REVERSE(RIGHT(REVERSE(col1),LEN(col1) - CHARINDEX(',',REVERSE(col1)))) as index1,
RIGHT(col1,LEN(col1) - CHARINDEX(',',col1)) AS index2
FROM tableB
)
SELECT cteA.col1, cteB.col1
FROM cteA,cteB
WHERE (cteA.index1 IN (cteB.index1,cteB.index2)
OR cteA.index2 IN (cteB.index1,cteB.index2))
ORDER BY cteA.col1

Results

col1	col1
1,7	1,4
10	10,11
2,3	2
2,3	2,3
2,6	2,3
2,6	2
4,10	1,4
4,10	10,11

It’s not the prettiest thing I’ve ever come up with but in the context of what the guy was asking for, and not using a cursor, it works.

11 Feb

PROC MEANS In T-SQL

SAS has a basic procedure called PROC MEANS.  It essentially takes a column of data and performs some basic statistical analysis on it.  Specifically it looks at the following data,

  • N – number of occurrences
  • Mean
  • Maximum
  • Minimum
  • Range (Maximum – Minimum)
  • Standard Deviation.

You can read more about PROC MEANS at this page: http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473539.htm.

Because this site is for my entertainment I figured I’d use the cake data as found in the above example and create that result set. It’s actually fairly easy to do as the above examples map to T-SQL functions quite easily as follows,

  • COUNT(presentscore) as N
  • AVG(presentscore) as Mean
  • MAX(presentscore) as Maximum
  • MIN(presentscore) as Minimum
  • MAX(presentscore) – MIN(presentscore) as Range
  • ROUND(STDEV(presentscore),4) as ‘Std Dev’

So to reproduce the output generated by the PROC MEANS example on that page all I need to do is the following,

SELECT 'PresentScore' as Variable, 
COUNT(presentscore) as N, 
AVG(presentscore) as Mean,
MAX(presentscore) as Maximum,
MIN(presentscore) as Minimum,
MAX(presentscore) - MIN(presentscore) as Range,
ROUND(STDEV(presentscore),4) as 'Std Dev'
FROM cake
UNION ALL
SELECT 'TasteScore', 
COUNT(tastescore), 
AVG(tastescore),
MAX(tastescore),
MIN(tastescore),
MAX(tastescore) - MIN(tastescore),
ROUND(STDEV(tastescore),4)
FROM cake

Variable	N	Mean	Maximum	Minimum	Range	Std Dev
PresentScore	20	76	93	56	37	9.3768
TasteScore	20	81	94	72	22	6.6116

There isn’t much to note here.  I build two queries, alias them, add a UNION ALL and call it a day.  Perhaps the most interesting thing is that it lets you build aggregates against a character value.  Otherwise, it’s all very straight-forward.

However, there is an even easier way to do this.

WITH cte AS
(
SELECT 'PresentScore' as Variable, PresentScore as value
FROM cake
UNION ALL
SELECT 'TasteScore', TasteScore
FROM cake
)
SELECT Variable as Variable,
COUNT(*) as N,
AVG(value) as Mean,
MAX(value) as Maximum,
MIN(value) as Minimum,
MAX(value) - MIN(value) as Range,
ROUND(STDEV(value),4) as 'Std Dev'
FROM cte
GROUP BY Variable;

The change is that we build the CTE with a union then select against it.  It’s much easier to work with.  This can also be written as an inline-view.

SELECT Variable as Variable,
COUNT(*) as N,
AVG(value) as Mean,
MAX(value) as Maximum,
MIN(value) as Minimum,
MAX(value) - MIN(value) as Range,
ROUND(STDEV(value),4) as 'Std Dev'
FROM (
SELECT 'PresentScore' as Variable, PresentScore as value
FROM cake
UNION ALL
SELECT 'TasteScore', TasteScore
FROM cake ) as A
GROUP BY Variable

I’ll be revisiting this in a later post but for right now you have a working simulation of PROC MEANS.