30 Apr

Running Totals

In the SQL Cookbook the author gives an example of a running total.  I thought I’d do it a couple of other ways.

I’ll be using summarized data from the military spending data that I have used in earlier articles. Here is a summary of that.

mysql> SELECT region, SUM(spending) AS totspending
    -> FROM militaryspending
    -> GROUP BY region;
+---------------+-------------+
| region        | totspending |
+---------------+-------------+
| Africa        |      120608 |
| Asia          |    428477.7 |
| Australia     |     28559.1 |
| Europe        |    312412.2 |
| North America |    743120.4 |
| South America |     65976.9 |
+---------------+-------------+
6 rows in set (0.00 sec)

The SQL Cookbook uses the following for a running total, and I might add, it’s perfectly fine as a solution. Anyways, here it is.

mysql> SELECT a.region, SUM(a.spending) as regSpending,
    -> (SELECT SUM(spending) FROM militaryspending b WHERE a.region >= b.region) as rollingTot
    -> FROM militaryspending a
    -> GROUP BY a.region
    -> ORDER BY a.region;
+---------------+-------------+------------+
| region        | regSpending | rollingTot |
+---------------+-------------+------------+
| Africa        |      120608 |     120608 |
| Asia          |    428477.7 |   549085.7 |
| Australia     |     28559.1 |   577644.8 |
| Europe        |    312412.2 |     890057 |
| North America |    743120.4 |  1633177.4 |
| South America |     65976.9 |  1699154.3 |
+---------------+-------------+------------+
6 rows in set (0.00 sec)

This also works for SQL Server. What it does is use a correlated subquery to calculate a SUM based on the region value. This is quite clean. Another way you can do this is with a non-equi join (that gets spelled all kinds of different ways). The code for the join looks like this,

mysql> SELECT a.region, SUM(b.spending) AS rollingTot
    -> FROM militaryspending a
    -> JOIN militaryspending b
    -> ON a.region >= b.region
    -> GROUP BY a.region
    -> ORDER BY a.region;
+---------------+------------+
| region        | rollingTot |
+---------------+------------+
| Africa        |    4703712 |
| Asia          | 15374399.6 |
| Australia     |  1732934.4 |
| Europe        |   32042052 |
| North America | 14698596.6 |
| South America | 18690697.3 |
+---------------+------------+
6 rows in set (0.00 sec)

This looks like it should do the same thing as the SQL Cookbook example, just less elegantly, but check the numbers: because the join runs against the detail rows, every country row in a matches the full set of qualifying rows in b, so each running total gets multiplied by the number of rows in that region. Individual region totals aren’t an option here either. All things considered, as written, it’s definitely option D in a list of three options.
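For the record, the join approach can reproduce the Cookbook numbers if it runs against pre-aggregated rows instead of the raw table. A sketch (I haven’t run this against the data, so no output shown):

SELECT a.region, SUM(b.totspending) AS rollingTot
FROM (SELECT region, SUM(spending) AS totspending
      FROM militaryspending
      GROUP BY region) a
JOIN (SELECT region, SUM(spending) AS totspending
      FROM militaryspending
      GROUP BY region) b
ON a.region >= b.region
GROUP BY a.region
ORDER BY a.region;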

However, in MySQL, there is an even easier way to do this: use a variable.

mysql> SELECT region, SUM(spending) AS regSpending,
    -> @rollingTot := @rollingTot + regSpending as rollingTot
    -> FROM militaryspending
    -> JOIN (SELECT @rollingTot := 0) a
    -> GROUP BY region
    -> ORDER BY region;
ERROR 1054 (42S22): Unknown column 'regSpending' in 'field list'

OK, that “should” work, but MySQL won’t let you reference the regSpending alias from another expression in the same SELECT list, so it blows up. Instead, we simply use a subquery as follows:

mysql> SELECT region, regSpending, @rollingTot := @rollingTot + regSpending as rollingTot
    -> FROM (
    -> SELECT region, SUM(spending) AS regSpending
    -> FROM militaryspending
    -> GROUP BY region
    -> ORDER BY region) a
    -> JOIN (SELECT @rollingTot := 0) b;
+---------------+-------------+------------+
| region        | regSpending | rollingTot |
+---------------+-------------+------------+
| Africa        |      120608 |     120608 |
| Asia          |    428477.7 |   549085.7 |
| Australia     |     28559.1 |   577644.8 |
| Europe        |    312412.2 |     890057 |
| North America |    743120.4 |  1633177.4 |
| South America |     65976.9 |  1699154.3 |
+---------------+-------------+------------+
6 rows in set (0.00 sec)

This is simply a basic GROUP BY that the main query reads from, using the variable to maintain the rolling total. It’s really that simple, though purists won’t like it. Oh, and it can be written without the join as follows,

mysql> SET @rollingTot = 0;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT region, regSpending, @rollingTot := @rollingTot + regSpending as rollingTot
    -> FROM (
    -> SELECT region, SUM(spending) AS regSpending
    -> FROM militaryspending
    -> GROUP BY region
    -> ORDER BY region) a;
+---------------+-------------+------------+
| region        | regSpending | rollingTot |
+---------------+-------------+------------+
| Africa        |      120608 |     120608 |
| Asia          |    428477.7 |   549085.7 |
| Australia     |     28559.1 |   577644.8 |
| Europe        |    312412.2 |     890057 |
| North America |    743120.4 |  1633177.4 |
| South America |     65976.9 |  1699154.3 |
+---------------+-------------+------------+
6 rows in set (0.00 sec)

Finally, just for comparison’s sake, I ran each through an EXPLAIN.

mysql> EXPLAIN SELECT region, regSpending, @rollingTot := @rollingTot + regSpending as rollingTot
    -> FROM (
    -> SELECT region, SUM(spending) AS regSpending
    -> FROM militaryspending
    -> GROUP BY region
    -> ORDER BY region) a
    -> JOIN (SELECT @rollingTot := 0) b;
+----+-------------+------------------+--------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table            | type   | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+-------------+------------------+--------+---------------+------+---------+------+------+---------------------------------+
|  1 | PRIMARY     | <derived3>       | system | NULL          | NULL | NULL    | NULL |    1 |                                 |
|  1 | PRIMARY     | <derived2>       | ALL    | NULL          | NULL | NULL    | NULL |    6 |                                 |
|  3 | DERIVED     | NULL             | NULL   | NULL          | NULL | NULL    | NULL | NULL | No tables used                  |
|  2 | DERIVED     | militaryspending | ALL    | NULL          | NULL | NULL    | NULL |  126 | Using temporary; Using filesort |
+----+-------------+------------------+--------+---------------+------+---------+------+------+---------------------------------+
4 rows in set (0.00 sec)

mysql> EXPLAIN SELECT a.region, SUM(a.spending) as regSpending,
    -> (SELECT SUM(spending) FROM militaryspending b WHERE a.region >= b.region) as rollingTot
    -> FROM militaryspending a
    -> GROUP BY a.region
    -> ORDER BY a.region;
+----+--------------------+-------+------+---------------+------+---------+------+------+---------------------------------+
| id | select_type        | table | type | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+--------------------+-------+------+---------------+------+---------+------+------+---------------------------------+
|  1 | PRIMARY            | a     | ALL  | NULL          | NULL | NULL    | NULL |  126 | Using temporary; Using filesort |
|  2 | DEPENDENT SUBQUERY | b     | ALL  | NULL          | NULL | NULL    | NULL |  126 | Using where                     |
+----+--------------------+-------+------+---------------+------+---------+------+------+---------------------------------+
2 rows in set (0.00 sec)

mysql> EXPLAIN SELECT a.region, SUM(b.spending) AS rollingTot
    -> FROM militaryspending a
    -> JOIN militaryspending b
    -> ON a.region >= b.region
    -> GROUP BY a.region
    -> ORDER BY a.region;
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
|  1 | SIMPLE      | a     | ALL  | NULL          | NULL | NULL    | NULL |  126 | Using temporary; Using filesort |
|  1 | SIMPLE      | b     | ALL  | NULL          | NULL | NULL    | NULL |  126 | Using where; Using join buffer  |
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
2 rows in set (0.00 sec)

Anyways, both the Cookbook version and the join had basically the same performance, which makes sense because they’re doing roughly the same amount of work. The variable option, however, is faster, because it’s simply running down the totals of a single GROUP BY.  It doesn’t have to do the join.  It would, by far, be the best option if you needed a rolling total without an aggregate.

Unfortunately, it’s only good on the MySQL side of the house.
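One last aside: on MySQL 8.0+ (and SQL Server 2012+), window functions make this a non-issue on either side of the house. A sketch, untested here:

SELECT region, SUM(spending) AS regSpending,
       SUM(SUM(spending)) OVER (ORDER BY region) AS rollingTot
FROM militaryspending
GROUP BY region
ORDER BY region;

The inner SUM is the regular GROUP BY aggregate; the outer SUM ... OVER accumulates those group totals in region order.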

30 Apr

SAS: Creating a File Revisited

This is a continuation of: http://www.erikhaselhofer.com/?p=3361.

In an earlier post I used something like the following to create a data file.

data _null_;
set mydata;
file 'C:\testData\myData.dat' ;
put name $ 1-10 weight 11-13 year 14-17;
run;

Which was all well and good, but we can improve this a bit.

We can specify a comma-delimiter by doing the following:

data _null_;
set mydata;
file 'C:\testData\myData.dat' dlm=',';
put name weight year;
run;

What happens if your data has commas? Then you use the DSD option, which writes the fields comma-delimited and encloses any field that contains a comma in quotation marks. For example:

data _null_;
set mydata;
file 'C:\testData\myData.dat' dsd;
put name weight year;
run;

DSD also works when you read a file. In that case it is used on the INFILE statement.
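A sketch of the reading side, reusing the file written above. On INFILE, DSD also treats two consecutive commas as a missing value:

data mydata;
infile 'C:\testData\myData.dat' dsd;
input name $ weight year;
run;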

Finally, you can use the PUT statement to add raw data to a file. For example:

data _null_;
set mydata;
file 'C:\testData\myData.dat' dsd;
put 'this goes first ' name weight year;
run;

That might not be the most useful thing to do, at least in this case, but it could have its uses. And, I’ll add, this is not my longest or most productive post but it should help me remember these items.

29 Apr

SAS: Free Format Data

Free format data is really delimited data with a couple of tweaks.  Here is a sample using datalines.

data mydata;
input name $ age weight;
datalines;
Fred 43 165
Frank 55 196
Bill 26 144
William 35 205
;
run;

proc print data=mydata;
run;

Obs    name       age    weight

 1     Fred        43      165 
 2     Frank       55      196 
 3     Bill        26      144 
 4     William     35      205

There are a couple of rules that apply to free form / delimited data.

  • It is read from left to right.
  • All fields are separated by a delimiter.
  • You can’t skip or reread fields.

You can set the delimiter (DLM) by using the DELIMITER option at the end of the INFILE statement as follows:

data mydata;
infile '....' delimiter=' ';
input name $ jump1-jump4;
run;

Obs    name      jump1    jump2    jump3    jump4
1     Lucky      2.3      1.9       .       3.0 
2     Spot       4.6      2.5      3.1      0.5 
3     Tubs       7.1       .        .       3.8 
4     Hop        4.5      3.2      1.9      2.6 
5     Noisy      3.8      1.3      1.8      1.5 
6     Winner     5.7       .        .        .

This option is actually redundant here, as the delimiter defaults to a space and the file is space-separated, but if your delimiter were, say, a tilde, you would specify that in the DLM / DELIMITER option.

In my previous example I used numeric data but I could also use the same statement to import the jump data as character data. I would do that as follows:

data mydata;
infile '....' delimiter=' ';
input name $ (jump1-jump4) ($);
run;

Obs    name      jump1    jump2    jump3    jump4

 1     Lucky      2.3      1.9               3.0 
 2     Spot       4.6      2.5      3.1      .5  
 3     Tubs       7.1                        3.8 
 4     Hop        4.5      3.2      1.9      2.6 
 5     Noisy      3.8      1.3      1.8      1.5 
 6     Winner     5.7

The parentheses are required.

You can also use informats.

In some cases you will have data that is incomplete. For instance, one row will have 5 fields and the next only 3. This can confuse SAS into reading the next record to fill in the missing values. To prevent that you use the MISSOVER option, which assigns missing values instead. This only works for fields missing at the end of a record.

data mydata;
infile '....';
input id $ a1 a2 a3 a4 a5;
run;
Obs      id      a1    a2    a3    a4    a5
 1     Nguyen    89    76    91    82     .

data mydata;
infile '....' missover;
input id $ a1 a2 a3 a4 a5;
run;

 1     Nguyen     89    76    91    82     .
 2     Ramos      67    72    80    76    86
 3     Robbins    76    65    79     .     .
 4                 .     .     .     .     .
 5                 .     .     .     .     .
Apparently there are some returns in the file.

You can also use the LENGTH statement to set the length of a value prior to the INPUT statement.

Note: The next couple of sections are almost identical to the Study Guide.

data mydata;
infile '....';
length id $10;
input id $;
run;

Sometimes you will run into data that has blanks in a field that you want to keep together. Usually you will see this in a name field of some sort. Here is an example of how that can go bad.

data mydata;
infile '....';
input id name $;
run;

id    name

 1    NEW     
 2    LOS     
 3    CHICAGO 
 4    HOUSTON 
 5    PHILADEL
 6    DETROIT 
 7    SAN     
 8    DALLAS  
 9    SAN     
10    PHOENIX

That actually went bad a couple of ways but the most obvious is that it separated the names with spaces.

data mydata;
infile '....';
length name $12;
input id name &;
run;

name            id
NEW YORK         1
LOS ANGELES      2
CHICAGO          3
HOUSTON          4
PHILADELPHIA     5
DETROIT          6
SAN DIEGO        7
DALLAS           8
SAN ANTONIO      9
PHOENIX         10

The ampersand tells SAS to keep reading through single blanks until it encounters two consecutive blanks, and to delimit the values based on that. This is something that, honestly, I’ve never used. Everything is tab-delimited or CSV these days.

One last option here is the colon. The colon applies an informat to the data and is used for nonstandard data. Here’s how it didn’t work, absent the colon, and how it did with a colon.

data mydata (drop=id);
infile '....';
length name $12;
input id name & population comma.;
run;

name            population
NEW YORK             .    
LOS ANGELES          3    
CHICAGO              .    
HOUSTON              .    
PHILADELPHIA         1    
DETROIT              .    
SAN DIEGO            .    
DALLAS               .    
SAN ANTONIO          9    
PHOENIX              .

As you can see, that didn’t work very well. But with the colon, it works much better.

data mydata (drop=id);
infile '....';
length name $12;
input id name & population : comma.;
run;

name            population
NEW YORK          7262700 
LOS ANGELES       3259340 
CHICAGO           3009530 
HOUSTON           1728910 
PHILADELPHIA      1642900 
DETROIT           1086220 
SAN DIEGO         1015190 
DALLAS            1003520 
SAN ANTONIO        914350 
PHOENIX            894070

This would probably be super easy to screw up on a test. They would roll in with a fixed format, toss a colon in, and,

data mydata (drop=id);
infile '....';
input region $ 1-12 quarter 13 sales : comma9.;
run;

region      quarter     sales
NorthEast       1       158510
SouthWest       4       292801
NorthEast       3       172301
SouthWest       1       384220
NorthEast       2       506510
SouthWest       3       392202
NorthEast       4       148510
SouthEast       1       284720
NorthWest       2       348803
SouthEast       3       122220
NorthWest       4       384840
SouthEast       1       339880

Wait, cool, that actually worked. I guess it still needs to be memorized as it would be easy to stop, look at that, ponder it a bit, and completely forget that it works in fixed format data too.

28 Apr

SAS: Reading Raw Data

This is going to have a serious legacy feel to it.  There are plenty of times you would do this stuff, like processing log files, and I’ve even used flat files to process data.  Flat files are actually really fast.  Cyberquery relied on them and it was crazy fast in some places.  Still, today, I bet most people just load their data into Excel and roll from there.  But, that said, this is for the SAS Base test, so it’s included.

I discussed this in an earlier post but there are two kinds of numeric data, standard and nonstandard numeric data.  Basically, standard data consists of numbers, decimals, exponential notation and +/- signs.  Nonstandard is pretty much everything else.  This won’t matter, just yet, but keep it in the back of your mind, because it does matter when loading your data.

Reading data is done with the INPUT statement, which does this and so much more.  Using the INPUT statement is surprisingly straight-forward for most data.  Here is a sample:

INPUT id 1-5 name $25.;

Here is what those items mean.

  • id 1-5: A numeric variable named id read from columns 1 through 5.
  • name $25.: A character variable named name that is 25 characters long.

You do not need to list all of the fields in the data file.

Data is loaded from the left to the right, usually.

You may also skip spaces in the data by using the + symbol followed by a number.

INPUT +5 name $25.;

You can also position the pointer with the @ symbol, thereby skipping fields and, therefore, not always loading data from left to right.

INPUT @6 name $25.;

The @ symbol also allows you to go backwards and read the data in any order. It repositions the pointer, so it’s possible to read the same data into separate variables.

data mydata;
input @5 v1 $2. @3 v2 $2. v3 $2.;
datalines;
1 3 5 7
2 4 6 8
;
run;

 v1    v2    v3
 5      3     5
 6      4     6

SAS uses informats for reading data types. There are all kinds of these and you’ve seen me use some in earlier examples. They include, but are not limited to:

PERCENTw.d
DATEw.
MMDDYYw.
$w.

And so on, and so forth. The w stands for the width of the field and d is the number of decimal places for the value.

The major reason for informats is to read nonstandard data. Nonstandard data includes blanks, commas, dollar signs, dashes, and more.
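Here’s a quick sketch of an informat earning its keep on nonstandard data (the values are just for illustration): the COMMAw. informat strips the commas while reading.

data mydata;
input amount comma11.;
datalines;
1,728,910
894,070
;
run;

amount comes out as the plain numbers 1728910 and 894070.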

The previous example used a fixed-length record. In other words, every value is going to be in the same spot and be the same length. However, you can also have variable-length records, where a field is not always the same length. This usually happens with a name field, for instance. Generally, this won’t matter, and frankly shouldn’t matter at all, except that it does.

If you are using the INFILE statement and you have variable-length records, you can use the PAD option to pad each record with blanks out to the length expected by your informats. So 1.34, if it were read with an 8-character informat, would be padded with 4 spaces at the end of it.
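Here’s a sketch of what that looks like (field positions made up, same '....' placeholder as the other examples). PAD blank-fills each record to the logical record length, so the column input doesn’t trip over short records:

data mydata;
infile '....' pad;
input name $ 1-10 amount 11-18;
run;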

Note, I tried to do this with a datalines statement and it did not throw an error.

And that’s all I’ve got on this part.  It’s pretty easy actually.

27 Apr

SAS: Arrays

When I saw this section in the Study Guide I was happy. I thought “cool, I’ll know these cold”. Then I remembered that when it comes to SAS, thinking you know how it works is like playing with gasoline while juggling flares. It’s something you can’t do.

I don’t know how complete this will end up. It will be a really long post if I do everything, and I’m not here to educate, exactly, so I’ll probably skip parts of them. On the other hand, from what I can tell, arrays really are all over the map in terms of how you can define and use them.

An array, in SAS, is a temporary grouping of a set of variables under a single name. They are defined in several ways. Put on your cowboy hat and let’s cowboy up.

All of the following are valid one-dimensional array statements and assignments (piled into one DATA step for display; in practice you couldn’t define var three times in the same step).

data mydata;
array var[1];
array var[1] qtr1;
array q(1:4) qtr1 qtr2 qtr3 qtr4;
array q1{4} qtr1-qtr4;
array q2[*] qtr1-qtr4;
var[1] = 100;
var[1] = 'String';
array var[1:4] (2,4,6,8);
run;

You must define an array with an ARRAY statement before you use it (there is another option, but that comes later).

Both character and numeric values can be assigned to an array.

These, however, are not valid array assignments.

array var[1] 100;
array var[1] "My name is";
array var[1:4] ("2","4","6","Erik");

And I’m sure many more. I would urge you to be creative and simply try as many variations as you can think of, and see what happens.

Parenthesis, brackets, and the squiggly thing (brace) are all valid for arrays.

[],() and {}

You can specify an array using a numbered range of variables.

mo1-mo12

The rules for this are:

  • The variables must have the same name, except for a numeric suffix at the end.
  • They are numbered sequentially.

There are some funky keywords that apply to arrays.

Option         Action                                                            Usage
_NUMERIC_      Assigns all previously defined numeric variables to the array.    array mystuff[*] _NUMERIC_;
_CHARACTER_    Assigns all previously defined character variables to the array.  array mystuff(*) _CHARACTER_;
_ALL_          Assigns all previously defined variables to the array.            array mystuff{*} _ALL_;

The * is not required; it’s just how I wrote the array samples, and it can only be used for a single-dimension array.  Also, when it comes to SAS, there are specific places to use an equal sign and places not to.  I have a tendency to use it, which leads to an error.  I’ll probably get burned by that at some point.
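Here’s a sketch of _NUMERIC_ in action (variable names made up). The array picks up only the numeric variables defined before the ARRAY statement, so ctr, defined after it, stays out:

data mydata;
a1 = 10; a2 = 20; a3 = 30;
array nums[*] _numeric_;
do ctr = 1 to dim(nums);
	nums[ctr] = nums[ctr] * 2;
end;
run;

After the loop, a1 through a3 hold 20, 40, and 60.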

You reference an array, by, uh, referencing it.

For instance:

data mydata;
array var[1:4] (2,4,6,8);

do ctr = 1 to 4;
	myvar = var(ctr);
	myvar1 = var[1];
    output;
end;
run;

var1    var2    var3    var4    ctr    myvar    myvar1
  2       4       6       8      1       2         2  
  2       4       6       8      2       4         2  
  2       4       6       8      3       6         2  
  2       4       6       8      4       8         2

The DIM function returns the number of elements in an array. You can use it to loop through an array. The earlier example could be rewritten as:

data mydata;
array var[1:4] (2,4,6,8);

do ctr = 1 to dim(var);
	myvar = var(ctr);
	myvar1 = var[1];
    output;
end;
run;

var1    var2    var3    var4    ctr    myvar    myvar1
  2       4       6       8      1       2         2  
  2       4       6       8      2       4         2  
  2       4       6       8      3       6         2  
  2       4       6       8      4       8         2

Character Variable Arrays

Character arrays require that you add a $ after the array dimensions. I guess you have to tell it that you want to load characters.

data mydata;
array var[1:4] $ ("Erik","Joe","Frank","Bill");
array myvar1[4] $;
run;

Also, by default, the length of each character array element is 8. You can change the length by doing the following:

data mydata;
array name[1] $;
name[1] = "Hello. My name is Inigo Montoya.";
array myname[1] $ 50;
myname[1] = "Hello. My name is Inigo Montoya.";
run;

name1                  myname1
Hello. M    Hello. My name is Inigo Montoya.

Also, SAS is apparently liberal about some things. I would not expect the following to work, but, it does.

data mydata;
array var[1] $;
var[1] = "Hello. My name is Inigo Montoya.";
myvar = var1;
run;

var1           myvar
 Hello. M    Hello. M

So, apparently, once an array is defined, var[1] and var1 become the same thing. This is probably something else to watch out for. Particularly because the following is valid array creation syntax.

data mydata;
array var[4] a1 a2 a3 a4 (100,200,300,400);
do ctr = 1 to 4;
	myvar = var[ctr];
	output;
end;
run;

a1     a2     a3     a4    ctr    myvar
100    200    300    400     1      100 
100    200    300    400     2      200 
100    200    300    400     3      300 
100    200    300    400     4      400

Arrays can be temporary. You just add _TEMPORARY_ to the ARRAY statement. The elements are then local to the DATA step and are not written to the output data set.

data mydata (drop=ctr);
array var[4] _temporary_ (100,200,300,400);
do ctr = 1 to 4;
	myvar = var[ctr];
	output;
end;
run;

myvar
100 
200 
300 
400

Finally, you can create multidimensional arrays. I’m running out of steam on this topic so this is going to be short. The book has about 10 pages on the topic but it’s mostly built around processing a multidimensional array and I’m good with that (anyone reading this, uh, good luck).

data mydata;
array var[3,4] (100,200,300,400,500,600,700,800,900,1000,1100,1200);
run;

var1    var2    var3    var4    var5    var6    var7    var8    var9    var10    var11    var12
 100     200     300     400     500     600     700     800     900     1000     1100     1200

Because of the way SAS names the array, this is probably more than a bit confusing. Here is how the array actually looks:

1,1    1,2    1,3    1,4        
100    200    300    400           
2,1    2,2    2,3    2,4
500    600    700    800
3,1    3,2    3,3    3,4
900    1000   1100   1200

I hope that makes sense. I’m not going to write any more on it because I write these for me, as a study process, and this part I get.

Note: I used commas to separate array values.  It’s how I’m used to doing it.  However, the following are both valid approaches to initializing an array.

array var[1:4] (2 4 6 8);
array var[1:4] (2,4,6,8);

27 Apr

SSIS: Package Deployment

These are the steps for deploying a package from development to a production server.

Step 1: Create a configuration file.

These are actually created at run time and as such are optional.  It’s here because Microsoft has it in here.

Step 2: Create a deployment utility.

A deployment utility is a folder that contains the files that you will need to run the package.

The following steps create the utility:

  1. Make sure that the project has been converted to the package deployment model.
  2. Right-click the project and select Properties.
  3. Select the Deployment Utility page.
  4. Set AllowConfigurationChanges to True.
  5. Set CreateDeploymentUtility to True.
  6. Click OK to close.
  7. Right-click the project and select Build.

http://msdn.microsoft.com/en-us/library/ms137952.aspx

Step 3: Copy packages to the target computer.

Exactly what it sounds like.

Step 4: Run the “Package Installation Wizard” on the target computer.

Perform the following steps:

  1. Open the folder.
  2. Double-click the manifest file (.SSISDeploymentManifest) to start the process.
  3. Follow the steps.

Yeah, this is kind of light on detail.

http://msdn.microsoft.com/en-us/library/ms141693.aspx

The main MSDN for package deployment:

http://msdn.microsoft.com/en-us/library/ms137592.aspx

26 Apr

SAS: DO Loop (Revisited)

In the Study Guide sample test I missed something that should be pretty basic, but, I still missed it.  So I wanted to add a post to clarify it in my mind.  I did mention this in the prior post but I guess it didn’t take.  Here is some code.

ods listing;

data mydata;
do myvar = 1 to 5;
	output;
end;
run;

proc print data=mydata noobs;
run;

data mydata;
do myvar = 1 to 5;
end;
run;

proc print data=mydata noobs;
run;

myvar
  1  
  2  
  3  
  4  
  5  

myvar
  6 

The point of this example is to make clear what is happening inside of a DO loop. In both programs the DO loop increments myvar to 6 before it exits. The first DATA step has an OUTPUT statement inside the loop, so an observation is written on each of the five passes through the loop, and the final value of 6 never reaches the data set. The second DATA step has no OUTPUT statement, so SAS writes the variable’s value once, at the end of the DATA step, after the loop has finished, which is why it shows 6. In short, an explicit OUTPUT statement overrides the implicit output at the end of the DATA step.

I guess another way to look at it is, OUTPUT writes all the variables to the data set when it’s called, while the DATA step writes the variables at the end of the process.
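A sketch that drives the point home: OUTPUT decides when a row is written, so calling it twice per pass writes two observations per iteration.

data mydata;
do myvar = 1 to 2;
	output;
	output;
end;
run;

That produces four observations: 1, 1, 2, 2.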

Man, I’m still not sure I explained that very well. But, I’m now clear on what’s going on so that will have to be enough.

26 Apr

SSIS: View Code

I’d completely forgotten you could right-click on a package, click “View Code” and change the package’s XML configuration.  But, you can, and we will.

I’m going to create a very simple package.

Step 1:  Create a new project and/or a new package.

You know how to do this so do it and name them whatever you would like.

Step 2: Create the Data Flow task.

  1. Create a Data Flow task (drag and drop or click the Data Flow tab).
  2. Drag a OLE DB Source to the Data Flow.
  3. Configure the connection manager to point to AdventureWorks2012, set “Data access mode” to “Table or view” and pick a table.  I used [Production].[Location].
  4. Drag a Flat File destination to the Data Flow and connect the OLE DB source to it.
  5. Configure it to point to a file of your choosing.  I used C:\testData\junk.txt.  Make sure this is a new file, not an existing one.  An existing file may work but I didn’t test it.
View Code.

Step 3: Use “View Code” to change the package.

In the Solution explorer, right-click the package, select “View Code” and do the following:

  1. Hit CTRL-F to bring up the “Find and Replace” dialogue window.
  2. I searched for junk.txt but if you named your file something else search for that.
  3. Rename the file.  In my case, I changed junk.txt to notMyJunk.txt.

If you run the package you will see that the process ran and wrote the output to notMyJunk.txt instead.

Anyways, one more way to configure a SSIS package and I sincerely hope it’s the last one I write about.

25 Apr

SAS: DO Loops

Earlier I wrote about the DO/END statement.  This is the same function, except that in this article we’ll be using it to loop.  I would imagine that every language has a method of processing loops.  When I started programming it was a FOR loop.  This is pretty much the same thing, except, like always with SAS, it has some exotic tricks.

The basic syntax for a DO loop is

DO variable = start TO end BY increment;
END;

data mydata;
do ctr = 1 to 12;
	myvar = ctr;
end;
run;

Obs    ctr    myvar
1      13      12

As you can see the loop executed 12 times, hence the value of myvar. The more interesting value is ctr. Notice that it is 13. What happens is that a DO loop, in this format, tests the condition at the top of the loop, after it increments the ctr variable. More on increments in a second. Essentially it follows these steps:

  1. Set the ctr variable to 1.
  2. Test if ctr is within the DO loop’s range.
  3. Execute the code.
  4. At the END statement return execution to the top of the loop.
  5. Increment the ctr variable.
  6. Test if ctr is within the DO loop’s range.
  7. Repeat until ctr is outside of the DO loop’s range.

In this example, when the ctr variable hit 13, the program exited the loop. Be aware that the counter increments, or decrements, before the condition is tested.

If you do not specify a BY condition the counter increments by 1. However, you can increment or decrement by different values.

Incrementing by 2.

data mydata;
do ctr = 1 to 12 by 2;
	myvar = ctr;
end;
run;

Obs    ctr    myvar
1      13      11

You can also set a counter to decrement. You do this by setting BY to a negative value.

data mydata;
do ctr = 12 to 1 by -2;
	myvar = ctr;
end;
run;

Obs    ctr    myvar
 1      0       2

You can also use variables.

data mydata;
start = 1;
end = 12;
increment = 1;

do ctr = start to end by increment;
	myvar = ctr;
end;
run;

Obs    start    end    increment    ctr    myvar
1       1       12        1         13      12

Notice that I’m only getting a single row of data. That is because the DATA step only writes data at the end of each iteration, and the loop is inside of the DATA step, which is only processed once. To produce output on each pass of the loop, we need to use the OUTPUT statement. This also helps loops make more sense, so I’ll be using it going forward.

do ctr = 1 to 5 by 1;
	output;
end;
run;

ctr
1 
2 
3 
4 
5

You can also specify a series of items to loop through.

do ctr = 1,2,3,4,5;
    output;
end;
run;

ctr
1 
2 
3 
4 
5

The list can consist of strings and they can also be variables.

I tried a list with a BY statement but it went into an endless loop.

do ctr = 1,2,3,4,5 by 2;
    output;
end;
run;

This, however, worked, just not in the way you might expect.

var1 = 1;
var2 = 2;
var3 = 3;

do ctr = var1-var3;
    output;
end;
run;

var1    var2    var3    ctr
 1       2       3      -2

Essentially, it executed a single time, with the counter equal to -2, and the output did its thing. Here’s another example.

do ctr = 1;
    myvar = 'One Time';
end;
run;

ctr     myvar
 1     One Time

You can also nest DO loops.

data mydata;

do ctr = 1 to 3 by 1;
	do myctr = 1 to 2;
		output;
	end; 
end;
run;

ctr    myctr

 1       1  
 1       2  
 2       1  
 2       2  
 3       1  
 3       2

There are two other options for DO loops: UNTIL and WHILE.

The DO/UNTIL loop executes until a condition is met. For example:

do until (ctr > 3);
	ctr + 1;
	output;
end;
run;

ctr
1 
2 
3
4

Unlike basic DO loop processing, the condition is not tested at the top of the DO statement; it is tested at the bottom. As such, a DO/UNTIL will always fire at least once, and it will always complete the iteration where the condition is met.

For example:

ctr = 4;
do until (ctr > 3);
	ctr + 1;
	output;
end;
run;

ctr
5

The DO/WHILE loop performs its test at the top of the loop. As such, it’s possible that it will never execute.

do while (ctr < 3);
	ctr + 1;
	output;
end;
run;

ctr
1 
2 
3

As you can see, DO/UNTIL returns 4 records and DO/WHILE returns 3, because DO/WHILE never executes a 4th time. To make it never execute at all, we simply set ctr = 4.

data mydata;
ctr = 4;
do while (ctr < 3);
	ctr + 1;
	output;
end;
run;

You can also create a hybrid of the various DO types. For instance,

do ctr = 1 to 5 by 1 until (myvar > 30);
	myvar = ctr * 10;
	output;
end;
run;

ctr    myvar
 1       10 
 2       20 
 3       30 
 4       40

and,

do ctr = 1 to 5 by 1 while (myvar < 30);
	myvar = ctr * 10;
	output;
end;
run;

ctr    myvar
 1       10 
 2       20 
 3       30

and this won’t produce any output.

data mydata;

myvar = 100;
do ctr = 1 to 5 by 1 while (myvar < 30);
	myvar = ctr * 10;
	output;
end;
run;

Not using parentheses gave me an error, so keep that in mind.

I would expect that anyone reading this would be well-served by knowing this cold.

25 Apr

Parameterize

Parameterized!

The Parameterize option, found by right-clicking a Control Flow task or a connection manager, allows you to create a parameter associated with a property.  For example, you can tie a connectionString property to a parameter.  When an object has a property that has been parameterized, it announces it with an “fx” icon either on it or near it.

To parameterize an object simply right-click it and select Parameterize.  The Parameterize window opens and you will have the following options:

  • Property: The property you want to associate with the parameter.
  • Do not use parameter: Removes the reference to a parameter.  The parameter is not deleted.
  • Use existing parameter: Associates the property with an existing parameter.
  • Create new parameter: Creates a new parameter.
  • Scope: A package or project.
  • Sensitive: Sensitive parameters are encrypted in the catalog and appear as a NULL value in SSMS or T-SQL.
  • Required: Specifies that the parameter must have a value.

Here are some parameterized properties for a connection manager set at the project level.

Parameterized Connection Manager.