31 Mar

SAS: Anatomy of the DATA step

This is another one of those areas that most of us who have written code would probably skip. It just seems so basic but it’s the type of thing that can yield a question or two on a test. So, lets take a look at the DATA step.  I’m probably skipping steps here.

The SAS DATA step consists of two phases: the compilation phase and the execution phase. Within each of these phases there are multiple steps. Lets go through both.

Compilation Phase

  • Input Buffer: Created to hold a record from an external file.  It is created only when raw data is read, not when a SAS data set is read.
  • Program Data Vector: Holds a single observation in memory and creates 2 automatic variables.
    • _N_: Counts the number of times the data step begins to execute.
    • _ERROR_: If there is an error it is set to 1.  The default is 0.
  • Syntax Checking
  • Data Set Variables: As the INPUT statement is generated, a slot is added to the program data vector for each variable.
  • Descriptor: The descriptor part of the data set is created.

Execution Phase

  • Initializing Variables: _N_ set to 1 and _ERROR_ set to 0.
  • INFILE and INPUT statements processed.
  • Lines executed in sequence.

End Of Data Step

  • Data is written to the output data set.
  • Control returns to the top of the data set and _N_ increments.
  • _ERROR_ is reset to zero if necessary.

My guess is that the most important parts of all this are the following:

  • _N_ defaulting to 1 and incrementing each time the data step is processed.
  • _ERROR_ defaulting to 0 and moving to 1 when there is an error.
  • Variables are set to missing at the start of each data step.


31 Mar

SAS: Testing Your Programs

The Study Guide has a section on testing your programs.  I’m going to go through the items listed in the book in what will be a very abbreviated fashion.

Limiting Observations

Use the OBS= statement to limit the results in the data set.

PUT Statement

You can use the PUT statement, in conjunction with an IF statement to write a variable to the log.  This is a lot like using a script component with the msgbox function in SSIS.  PUT examples include:

PUT "Note: Some variable = " variable;

PUT "Note: Some Stuff " someVar=;

When you use someVar=, both the variable and its value are written to the log.

PUT "Note: Some variable = " _N_= _ERROR_=;

Use the _N_ and _ERROR_ automatic variables.

if _N_ = 4 then PUT "NOTE";

If the automatic variable _N_ = 4 write a note to the log.

This pretty much covers the testing your programs section of the Study Guide.

31 Mar

Data Warehousing: Surrogate Keys Vs. Natural Keys

Surrogate Keys

These are keys that have no natural meaning.  An identity property is an example of a surrogate key.

Natural Keys

Natural keys are keys that have a business meaning.  For instance, a SSN, order number or even a warehouse number.

The rule of thumb for data warehousing, actually it’s pretty much law, is to always use surrogate keys. Some of the advantages of surrogate keys include:

  • They separate the operational environment from the data warehouse.
  • There are performance advantages.
  • Slowly Changing Dimensions: Surrogate keys allow for type 2 SCD’s.

I would be willing to argue for natural keys, I don’t think it’s as black and white as many think, but for testing purposes, I think it would be very unlikely that a correct answer would be “use natural keys”.  Of course this probably means I’ll get one.

30 Mar

SAS: Creating a data file

Unfortunately, I can’t test this one.  I don’t have a running SAS installation that will let me do this, but, I can, at least roll through it, and hope for the best.

Creating a raw data file consists of the following:

Use the _null_ keyword in the data statement

This is simple enough, just create a data statement that looks like this:

data _null_;

Specify the data file.

You do this by using the FILE statement like this:

data _null_;
set mydata;
file 'C:\testData\myData.dat';

The file statement can also be a FILENAME that you defined earlier in the program.

Use the PUT statement to write the data.

PUT is essentially the reverse of an INPUT, at least for this part of the discussion, and you use it like this:

data _null_;
set mydata;
file 'C:\testData\myData.dat';
put name $ 1-10 weight 11-13 year 14-17;

Unfortunately, this is all I can do here. Here’s to it being my best guess.

30 Mar

SAS: Instream Data

Instream data is data that you enter directly into your program. It looks something like this:

data mydata;
	input name $ 1-10 weight 11-13 year 14-17;

Frank     2501985
Joe       1651975
Bill      1951965

ods listing;

proc print data=mydata;

Obs   name      weight   year
1     Frank      250     1985
2     Joe        165     1975
3     Bill       195     1965

This statement has some restrictions and rules that apply to it.

  • You can only use one datalines statement per data step.
  • LINES and CARDS are aliases for datalines.
  • If your data includes semicolons use datalines4 and end the statement with four semicolons (;;;;).

The last restriction looks like this:


Frank     25019;5
Joe       1651975
Bill      1951965

And, naturally, that will also blow up the year, and put a period there, but that’s not the point of this post.

30 Mar


These two are very similar and I’ll probably screw something up here. But, here goes.

The IF statement is used in the DATA step. A WHERE can be used in a DATA step (although it’s syntax is different) and in a PROC statement such as PROC PRINT. These are all roughly equivalent:

proc print data=mydata;
where concessionSales > 100;

data mydata1;
	set mydata (where concessionSales > 100);

data mydata;
        infile '....';
	input visitingTeam $ 1-20 concessionSales 21-24 bleacherSales 25-28 
	ourHits 29-31 theirHits 32-34 ourRuns 35-37 theirRuns 38-40;
        if concessionSales > 100;

The textbook doesn’t go much past this but I found this which does.  The major thing I took from it is that while a WHERE can be used in a data set, IF has a great deal more utility.


30 Mar

Data Warehousing: Dimensions

Apparently, explaining a dimension is hard.  Here is an example that I found online:

In a data warehouse, Dimensions provide structured labeling information to otherwise unordered numeric measures. The dimension is a data set composed of individual, non-overlapping data elements. The primary functions of dimensions are threefold: to provide filtering, grouping and labeling.


Microsoft did it a bit better

Dimension tables encapsulate the attributes associated with facts and separate these attributes into logically distinct groupings, such as time, geography, products, customers, and so forth.


Of course you have to know what a Fact table is first.

Now, I’ll try, and probably do it badly.

A dimension is an attribute of a transaction that you use to measure the transaction.  For instance, a customer is a dimensional attribute.  When you record a sales transaction you record the customer with the transaction.  However, recording the customers name, address, city and phone number with the transaction is problematic for several reasons but mostly because it’s inefficient.  It’s easier, and cleaner, to store that information in a separate table.

Other types of dimensional attributes include:

  • Time / Date
  • Warehouse
  • Customer
  • Sales Representative
  • SKU

There are several different types of dimensions.  These are:

  • Conformed Dimension: This is a dimension that applies to more than one Fact table.  For instance, a product might be sold at one location and this same location information might be used to analyze purchasing patterns at the same location.
  • Junk Dimension: Easier to show than explain.  You might run a program that includes 10% off with a coupon.  If the user uses the coupon you want to flag the transaction as Yes, a coupon was used, or, No, it was not.  Rather than cluttering up the fact table with this information you create a separate dimension table.
  • Degenerate dimension: This is a key, that would normally join to a dimension table, but does not.  For instance, the claim number for an insurance claim.  It’s not directly relevant to the claim transaction, which probably has its own key, but it is relevant to the claim.
  • Role-playing dimension: A dimension that is used for multiple roles in the same fact table.  This is most commonly a date field.  For instance, you might have a fact table with the “order date” and the “invoice date”.

Hopefully, that will more than cover anything that shows up on the 70-463 test.




30 Mar

Data Warehousing: Measures

Measure in a data warehouse are, to put it simply, because I’m all about simple these days, measures that you can add.  For instance, sales, is an additive measure.  But, of course, because we’re in data warehouse land it’s not that easy.

There are three types of measures.

Additive Measures

A measure that can be aggregated across all of its dimensions.  For instance, sales is additive by date, by customer, by location and so on, and so forth.

Semi-Additive Measures

This is a measure that can be aggregated by some dimensions but not by all dimensions.  The most common example is inventory.  You can add, or subtract, inventory over a time-period, so it’s additive in the sense that you can know your quantity sold for a time-period.  However, using that same time-period, would probably not yield an exact total of your inventory on-hand.

Non-Additive Measures

This is a measure that cannot be aggregated.  For instance, sales and cost data are additive in that they can be aggregated, separately,  to produce a gross profit percentage.  However, that gross profit percentage can not be aggregated further.  For instance, if you calculate gross profit by invoice, aggregating those gross profit numbers will not yield the gross profit percentage for the business.

29 Mar

SAS: Titles and Footnotes

I was surprised to see Titles and Footnotes in the study guide until I realize this is BASE programming and you are formatting reports at this level.  I felt like I was studying for a second.  Footnotes and Titles are easy to add.  To add a title(s) to your report do the following:

title1 'MyTitle1';
title3 'MyTitle 3';
proc print data=mydata;

                                                             MyTitle 3

Obs          Name          Age    Type     Date    score1    score2    score3    score4    score5
1     Alicia Grossman      13     c      17833      7.8       6.5       7.2       8.0       7.9 
2     Matthew Lee           9     D      17835      6.5       5.9       6.8       6.0       8.1 
3     Elizabeth Garcia     10     C      17834      8.9       7.9       8.5       9.0       8.8 
4     Lori Newcombe         6     D      17835      6.7       5.6       4.9       5.2       6.1 
5     Jose Martinez         7     d      17836      8.9       9.5      10.0       9.7       9.0 
6     Brian Williams       11     C      17834      7.8       8.4       8.5       7.9       8.0

The number after the TITLE statement represents the line number. If you do not specify a line number it will default to 1. If you specify a TITLE and TITLE1 the TITLE1 will win.

Footnotes work exactly the same way except that they appear at the bottom of a page.

If you skip a number then a blank line, or lines, are inserted. For instance, TITLE1 and TITLE4 would have two blank lines.

Footnotes are global so they remain for the life of the session or until you modify or cancel them. You can cancel them by issuing the following statements:

To clear all titles.

To clear all footnotes.

SAS refers to this as specifying a null title or null footnote. You can also cancel an individual TITLE or FOOTNOTE by doing the same thing with the specific line number (footnote3; for instance).

Honestly, I hate to include this sort of thing. In the real world you would just go and never think twice about it. But, in a test environment, it’s exactly the kind of thing test makers like to test on.

29 Mar

SAS: Variables And Expressions

This will be really basic, because I’m still at a basic level in the process but I need to hit variables and expressions.  First, a data source:

data mydata;
	infile '....';
	input visitingTeam $ 1-20 concessionSales 21-24 bleacherSales 25-28
	ourHits 29-31 theirHits 32-34 ourRuns 35-37 theirRuns 38-40;
ods listing;

proc print data=mydata;
                      concession bleacher  our    their  our   their
Obs       visitingTeam   Sales   Sales    Hits    Hits   Runs   Runs
1     Columbia Peaches     35      67       1      10     2      1
2     Plains Peanuts      210       .       2       5     0      2 
3     Gilroy Garlics       15    1035      12      11     7      6  
4     Sacramento Tomatoes 124      85      15       4     9      1

Now, lets add two variables, totalHits and totalRuns. You add them by modifying the data statement as follows:

data mydata;
	infile '....';
	input visitingTeam $ 1-20 concessionSales 21-24 bleacherSales 25-28
	ourHits 29-31 theirHits 32-34 ourRuns 35-37 theirRuns 38-40;
	totalSales = concessionSales + bleacherSales;
	totalHits = ourHits + theirHits;
	totalRuns = ourRuns + theirRuns;
                      concession bleacher  our    their  our   their     total     total    total
Obs       visitingTeam   Sales   Sales    Hits    Hits   Runs   Runs     Sales      Hits    Runs
1     Columbia Peaches     35      67       1      10     2      1        102        11      3 
2     Plains Peanuts      210       .       2       5     0      2          .         7      2 
3     Gilroy Garlics       15    1035      12      11     7      6       1050        23     13 
4     Sacramento Tomatoes 124      85      15       4     9      1        209        19     10

The variables that we created have been added to the report. They should all make sense except for the totalSales value where it is a period. The reason for that is that in the raw data there is a . which is the equivalent of null in the SAS world and since you can’t add a null to anything else and produce anything but a null, well, if you add a period to anytihng else you still wind up with a period.

You can also assign a variable as a constant. For instance,

myVar = 'Something went thataway';

You can also perform calculations using variables. For instance,

myVar = 60;
hits60 = ourHits * myVar;

Expressions, absent parenthesis, are processed in the following order:

negative prefix and exponentiation
multiplication and division
addition and subtraction

This is the kind of thing that is always asked on tests.  You’ll need to know it.

Finally, date constants. This is another thing that I think SAS pulled out of their behinds, but, if you want to assign a date constant you do it as follows:

  • “ddmmmyy”d
  • ‘ddmmmyyyy’d

Single or double quotes are fine.

dd = 1 or 2 digit day number.
mmm = 3 character month.  Such as Jun, Jul, Aug, etc.
yy or yyyy = 2 or 4 digit year.

For example:

myVar = "14Jul1982"d

This is really basic. Keep that in mind if you are thinking of putting the match to the flamethrower if you are so inclined.