29 Apr

SAS: Free Format Data

Free format data is really delimited data with a couple of tweaks.  Here is a sample using datalines.

data mydata;
input name $ age weight;
datalines;
Fred 43 165
Frank 55 196
Bill 26 144
William 35 205
;
run;

proc print mydata;
run;

Obs    name       age    weight

 1     Fred        43      165 
 2     Frank       55      196 
 3     Bill        26      144 
 4     William     35      205

There are a couple of rules that apply to free form / delimited data.

  • It is read from left to right.
  • All fields are separated by a delimiter.
  • You can’t skip or reread fields.

You can set the delimiter (DLM) by using the DELIMITER option at the end of the INFILE statement as follows:

data mydata;
infile '....' delimiter=' ';
input name $ jump1-jump4;
run;

Obs    name      jump1    jump2    jump3    jump4
1     Lucky      2.3      1.9       .       3.0 
2     Spot       4.6      2.5      3.1      0.5 
3     Tubs       7.1       .        .       3.8 
4     Hop        4.5      3.2      1.9      2.6 
5     Noisy      3.8      1.3      1.8      1.5 
6     Winner     5.7       .        .        .

This command is actually useless, as it defaults to spaces, and the file is space separated, but if your delimiter was say a tilde you would use that in the DLM / Delimiter command.

In my previous example I used numeric data but I could also use the same statement to import the jump data as character data. I would do that as follows:

data mydata;
infile '....' delimiter=' ';
input name $ (jump1-jump4) ($);
run;

Obs    name      jump1    jump2    jump3    jump4

 1     Lucky      2.3      1.9               3.0 
 2     Spot       4.6      2.5      3.1      .5  
 3     Tubs       7.1                        3.8 
 4     Hop        4.5      3.2      1.9      2.6 
 5     Noisy      3.8      1.3      1.8      1.5 
 6     Winner     5.7

The parenthesis are required.

You can also use informats.

In some cases you will have data that is incomplete. For instance, one row will have 5 records and the next 3. This can confuse SAS into reading the next record. To prevent that you use the MISSOVER option. This only works for records missing at the end of the file.

data mydata;
infile '....';
input id $ a1 a2 a3 a4 a5;
run;
Obs      id      a1    a2    a3    a4    a5
 1     Nguyen    89    76    91    82     .

data mydata;
infile '....' missover;
input id $ a1 a2 a3 a4 a5;
run;

 1     Nguyen     89    76    91    82     .
 2     Ramos      67    72    80    76    86
 3     Robbins    76    65    79     .     .
 4                 .     .     .     .     .
 5                 .     .     .     .     .
Apparently there are some returns in the file.

You can also use the LENGTH statement to set the length of a value prior to the INPUT statement.

Note: The next couple of sections are almost identical to the Study Guide.
data mydata;
infile '....';
length id $10;
input id $;
run;

Sometimes you will run into data that has blanks in a field that you want to combine. Usually you will see this is a name field of sort. Here is an example of how that can go bad.

data mydata;
infile '....';
input id name $;
run;

id    name

 1    NEW     
 2    LOS     
 3    CHICAGO 
 4    HOUSTON 
 5    PHILADEL
 6    DETROIT 
 7    SAN     
 8    DALLAS  
 9    SAN     
10    PHOENIX

That actually went bad a couple of ways but the most obvious is that it separated the names with spaces.

data mydata;
infile '....';
length name $12;
input id name &;
run;

name            id
NEW YORK         1
LOS ANGELES      2
CHICAGO          3
HOUSTON          4
PHILADELPHIA     5
DETROIT          6
SAN DIEGO        7
DALLAS           8
SAN ANTONIO      9
PHOENIX         10

The Ampersand tells SAS to ignore blanks until it encounters consecutive blanks and to delimit the variables based on that. This something that honestly, I’ve never used. Everything is tab-delimited or CSV these days.

One last option here is the colon. The colon applies an informat to the data and is used for nonstandard data. Here’s how it didn’t work, absent the colon, and how it did with a colon.

data mydata (drop=id);
infile '....';
length name $12;
input id name & population comma.;
run;

name            population
NEW YORK             .    
LOS ANGELES          3    
CHICAGO              .    
HOUSTON              .    
PHILADELPHIA         1    
DETROIT              .    
SAN DIEGO            .    
DALLAS               .    
SAN ANTONIO          9    
PHOENIX              .

As you can see that didn’t work very well. But this, so much the better.

data mydata (drop=id);
infile '....';
length name $12;
input id name & population : comma.;
run;

name            population
NEW YORK          7262700 
LOS ANGELES       3259340 
CHICAGO           3009530 
HOUSTON           1728910 
PHILADELPHIA      1642900 
DETROIT           1086220 
SAN DIEGO         1015190 
DALLAS            1003520 
SAN ANTONIO        914350 
PHOENIX            894070

This would probably be super easy to screw up on a test. They would roll in with a fixed format toss a colon in, and,

data mydata (drop=id);
infile '....';
input region $ 1-12 quarter 13 sales : comma9.;
run;

region      quarter     sales
NorthEast       1       158510
SouthWest       4       292801
NorthEast       3       172301
SouthWest       1       384220
NorthEast       2       506510
SouthWest       3       392202
NorthEast       4       148510
SouthEast       1       284720
NorthWest       2       348803
SouthEast       3       122220
NorthWest       4       384840
SouthEast       1       339880

Wait, cool, that actually worked. I guess it still needs to be memorized as it would be easy to stop, look at that, ponder it a bit, and completely forget that it works in fixed format data too.

17 Apr

SQL Cookbook, Errors

SQL Cookbook

SQL Cookbook

I happened to scan this book online, it looked like the kind of thing I would like, lots of queries written for different platforms, so I picked it up. Unfortunately, it’s not off to the best start.

First, I downloaded the sample code and there are no semicolons in the data.  So, I couldn’t load it into SQL Server or MySQL.  It wasn’t hard to fix, I simply had to Find/Replace the end of line and ) characters, but, I shouldn’t have had to do that. Whomever does the O’Reilly sample code blows.

The second problem I ran into was an ORDER BY example.  The author says to use the following code to order a result set by the last two characters. Yeah, I know, it’s the kind of useful thing you’ve come to expect from these parts.  Here is his code for MySQL.

mysql> SELECT ename, job
    -> FROM emp
    -> ORDER BY SUBSTR(job, LENGTH(job)-2);
+--------+-----------+
| ename  | job       |
+--------+-----------+
| KING   | PRESIDENT |
| SMITH  | CLERK     |
| JAMES  | CLERK     |
| ADAMS  | CLERK     |
| MILLER | CLERK     |
| BLAKE  | MANAGER   |
| CLARK  | MANAGER   |
| JONES  | MANAGER   |
| MARTIN | SALESMAN  |
| TURNER | SALESMAN  |
| WARD   | SALESMAN  |
| ALLEN  | SALESMAN  |
| FORD   | ANALYST   |
| SCOTT  | ANALYST   |
+--------+-----------+
14 rows in set (0.00 sec)

Honestly, I probably would have blown by it, I was scanning pretty fast, except that I would have done it a different way, so I checked it. Here is my code:

mysql> SELECT ename, job
    -> FROM emp
    -> ORDER BY SUBSTR(job,-2);
+--------+-----------+
| ename  | job       |
+--------+-----------+
| TURNER | SALESMAN  |
| ALLEN  | SALESMAN  |
| WARD   | SALESMAN  |
| MARTIN | SALESMAN  |
| CLARK  | MANAGER   |
| BLAKE  | MANAGER   |
| JONES  | MANAGER   |
| KING   | PRESIDENT |
| JAMES  | CLERK     |
| SMITH  | CLERK     |
| ADAMS  | CLERK     |
| MILLER | CLERK     |
| FORD   | ANALYST   |
| SCOTT  | ANALYST   |
+--------+-----------+
14 rows in set (0.01 sec)

Uh, guys, those aren’t the same results. Here’s the problem,

mysql> SELECT ename, job, SUBSTR(job, LENGTH(job)-2), SUBSTR(job,-2)
    -> FROM emp;
+--------+-----------+----------------------------+----------------+
| ename  | job       | SUBSTR(job, LENGTH(job)-2) | SUBSTR(job,-2) |
+--------+-----------+----------------------------+----------------+
| SMITH  | CLERK     | ERK                        | RK             |
| ALLEN  | SALESMAN  | MAN                        | AN             |
| WARD   | SALESMAN  | MAN                        | AN             |
| JONES  | MANAGER   | GER                        | ER             |
| MARTIN | SALESMAN  | MAN                        | AN             |
| BLAKE  | MANAGER   | GER                        | ER             |
| CLARK  | MANAGER   | GER                        | ER             |
| SCOTT  | ANALYST   | YST                        | ST             |
| KING   | PRESIDENT | ENT                        | NT             |
| TURNER | SALESMAN  | MAN                        | AN             |
| ADAMS  | CLERK     | ERK                        | RK             |
| JAMES  | CLERK     | ERK                        | RK             |
| FORD   | ANALYST   | YST                        | ST             |
| MILLER | CLERK     | ERK                        | RK             |
+--------+-----------+----------------------------+----------------+
14 rows in set (0.00 sec)

As you can see that results in 3 characters. Using LENGTH is perfectly fine in this case, except for the problem in the next paragraph, but it should be LENGTH() – 1, instead of 2.

However, technically, there is still one more problem. Because LENGTH doesn’t actually do what we all use it for, return the length of a string. Technically, LENGTH returns the number of bytes in a string. So, in some situations, particularly with foreign languages you could get an answer you aren’t expecting.

Instead, what you should use is CHAR_LENGTH, which returns the length of a string based on its character length. So, the correct code for this example is actually,

mysql> SELECT ename, job
    -> FROM emp
    -> ORDER BY SUBSTR(job, CHAR_LENGTH(job)-1);
+--------+-----------+
| ename  | job       |
+--------+-----------+
| TURNER | SALESMAN  |
| ALLEN  | SALESMAN  |
| WARD   | SALESMAN  |
| MARTIN | SALESMAN  |
| CLARK  | MANAGER   |
| BLAKE  | MANAGER   |
| JONES  | MANAGER   |
| KING   | PRESIDENT |
| JAMES  | CLERK     |
| SMITH  | CLERK     |
| ADAMS  | CLERK     |
| MILLER | CLERK     |
| FORD   | ANALYST   |
| SCOTT  | ANALYST   |
+--------+-----------+
14 rows in set (0.00 sec)

Also, the author makes the same mistake with SQL Server. Here is his code,

SELECT ename, job
FROM emp
ORDER BY SUBSTRING(job,LEN(job)-2,2);

ename	job
KING	PRESIDENT
SMITH	CLERK
ADAMS	CLERK
JAMES	CLERK
MILLER	CLERK
JONES	MANAGER
BLAKE	MANAGER
CLARK	MANAGER
TURNER	SALESMAN
MARTIN	SALESMAN
ALLEN	SALESMAN
WARD	SALESMAN
SCOTT	ANALYST
FORD	ANALYST
Here is the code that he should have used.
SELECT ename, job
FROM emp
ORDER BY SUBSTRING(job,LEN(job)-1,2);

ename	job
MARTIN	SALESMAN
ALLEN	SALESMAN
WARD	SALESMAN
TURNER	SALESMAN
JONES	MANAGER
BLAKE	MANAGER
CLARK	MANAGER
KING	PRESIDENT
SMITH	CLERK
ADAMS	CLERK
JAMES	CLERK
MILLER	CLERK
FORD	ANALYST
SCOTT	ANALYST

LEN is the correct function to use here because it returns the character length. If you want the length in bytes no the SQL Server side of the house you should use DATALENGTH.

So, not a great start, but, to be fair, I’ve probably blown the LENGTH statement too.  It’s easy to just fire it, especially, when switching between different platforms.

09 Mar

Word Count

Absent any weird collations, or other unexpected oddities, getting a word count is very easy.  Here’s the code.

DECLARE @string varchar(50) = 'Now   is the time''s for all good men';

SET @string = REPLACE(@string, '''','');

SELECT LEN(REPLACE(LTRIM(RTRIM(@string)),'  ','')) -
LEN(REPLACE(@string,' ','')) + 1 AS wordCount;

wordCount
8

What it does.

  1. The SET removes any apostrophes from the data.
  2. Trims leading and trailing spaces from the string.
  3. Removes all multiple space instances from the first operation but leaves single spaces.
  4. Removes all spaces from the second operation.
  5. Subtracts the length of the first operation from the second..
  6. Adds one to the result.

On the MySQL side of the house,

mysql> SET @string = 'Now   is\' the time for all good men';
Query OK, 0 rows affected (0.00 sec)

mysql>
mysql> SET @string = REPLACE(@string, '\'', '');
Query OK, 0 rows affected (0.00 sec)

mysql>
mysql> SELECT LENGTH(REPLACE(LTRIM(RTRIM(@string)),'  ','')) -
    -> LENGTH(REPLACE(@string,' ','')) + 1 AS wordCount;
+-----------+
| wordCount |
+-----------+
|         8 |
+-----------+
1 row in set (0.00 sec)

Pretty much the same thing except that LEN, is LENGTH, you use SET instead of DECLARE and the apostrophes are escaped with \’ rather than ”.