31 May

SAS: Indexes

This site needed a photo..

The site needed a photo..

This is a very broad overview of indexes in SAS.  Honestly, for the most part, indexes seem to work like you would expect indexes to work and they aren’t complicated at least in the sense that I’m looking at them.

And, just a reminder, I write these as Study Guides for me, that means that I won’t, usually, go step-by-step in these things.  That will, especially, be the case here.

SAS has a handful of guidelines for creating indexes.  They are:

  • Use fewer indexes to save on storage and update costs.
  • Don’t create indexes for small tables.  In those cases, sequential access, otherwise known as a table scan, is faster.
  • Create indexes on columns that have a large number of distinct values.
  • Use indexes when a relatively small number of rows will be returned.  They specifically state “less than 15%”.

Creating indexes is, almost, straight-forward.  The syntax is

CREATE  INDEX  ON table (column(s))

The only hiccup with indexes is that an index must be named the same as the column, if, you only index a single column. In other words, you can name composite indexes but not a simple index. For example:

CREATE INDEX idx_myjunk ON work.myjunk (number);

ERROR: For a simple index, the index name must be the same as the column name.

On the other hand, here are 3 successfully created indexes.  It’s very, very, very, unlikely that you would create all 3 of these indexes on a table.  However, I haven’t been able to determine if SAS will use a partial composite index, say, if you queried the number1 column.  My guess, based on what I’m reading, and the way I think things work, is that it would not.  But, don’t take that as gospel.  I guess all of that means that I could see you doing it in a specific scenario of some sort.

proc sql;
CREATE TABLE work.myjunk
(
	number int,
	number1 int
);
CREATE INDEX number ON work.myjunk (number);
CREATE UNIQUE INDEX number1 ON work.myjunk(number1);
CREATE INDEX idx_composite ON work.myjunk(number, number1);
DESCRIBE TABLE work.myjunk;
quit;

From the log.

28         DESCRIBE TABLE work.myjunk;
NOTE: SQL table WORK.MYJUNK was created like:

create table WORK.MYJUNK( bufsize=131072 )
  (
   number num,
   number1 num
  );
create index idx_composite on WORK.MYJUNK(number,number1);
create unique index number1 on WORK.MYJUNK(number1);
create index number on WORK.MYJUNK(number);

That is a big values-add on my part.  It’s also OK to create a composite index with the same name as a simple index, however, you cannot name a composite index with the same name as a column in the index.

You can tell if SAS is using an index by using the MSGLEVEL= option. You use it as follows:

options msglevel= i or n;

Yes, it’s a global option and it will stay in effect for the session. There are two setting for message level:

  • N: Displays notes, errors and warnings.  This is the default.
  • I:  Displays additional information regarding index usage, sorts and merges.

You can force SAS to use a specific index or to process indexes in a specific fashion.  To tell SAS to use, or not use an index by using IDXWHERE.  It has two options:

  • IDXWHERE=No: Tells SAS to not use an index and instead to do a sequential scan.
  • IDXWHERE=Yes: Tells SAS to use an index and to not consider a sequential scan.

You use IDXWHERE as follows:

proc sql;
SELECT *
FROM work.myjunk (idxwhere=no)
WHERE number < 5;
quit;

You can force SAS to use a specific index with the IDXNAME= option. You simply name the index and use it the same place you use IDXWHERE as follows:

proc sql;
SELECT *
FROM work.myjunk (idxname=number)
WHERE number < 5;
quit;

And, once again, I’d like to point out that I bring incredible value to the process because the WHERE clause would pretty much guarantee that SAS used that index.

One other note on IDXNAME is that the index you specify must exist, and, it must have its first, or only, column match the condition in the WHERE clause.

SAS indexes are like a clustered index, in a very loose sense, in that they are sorted, in ascending order. However, you cannot create a SAS index in descending order, and unlike a true clustered index, because you can have multiple indexes on a table, the underlying data is not sorted as you would have in a clustered index. Frankly, they aren’t a clustered index, I just wanted to mention them in this article because I think the cool kids would do that.

Finally, you can delete an index as follows:

28         DROP INDEX number, number1 ON myjunk;
NOTE: Index number has been dropped.
NOTE: Index number1 has been dropped.

I figured it would be easier to post the log on this one. As you can see it’s acceptable to drop multiple indexes on a table. You just separate each index name with a comma.

I probably skipped a few things but it looks like, at least at a basic level, that indexes are pretty straight-forward in SAS.

04 Apr

Data Warehousing: Columnstore Index

I’ve put this off for some reason.  And by putting it off, I mean for several days of studying.  It’s not like I don’t know what they are “columnstore make query go fast” but honestly, it also feels like one of those things that could get big, or that I should “prove” it, or something, I dunno.  But, it’s gonna be on 70-463, I’m almost certain, so here goes.  And, besides, it was this or SSIS security.

Because I like my explanations simple, here goes.  A columnstore has the following properties:

  • It is non-clustered.
  • There can only be one on a table.
  • They can not be updated.
  • They are very fast for data warehouse queries.
  • Columnstore indexes are compressed.
  • They are stored as columns and not rows.

There must be 9 million articles on that last point and  I thought this guy did a good job so I’ll let him explain how row and column storage works.

http://rusanu.com/2012/05/29/inside-the-sql-server-2012-columnstore-indexes/

Columnstore indexes have a ton of limitations.  This post will list the one’s I think are most important relative to BI so the fact that they can’t be used in replication won’t be mentioned, again.

Limitations (the section of cannots)

  • Cannot be unique.
  • Cannot be created on a view or indexed view.
  • Cannot act as a primary key or a foreign key.
  • Cannot be changed using the ALTER INDEX statement. Drop and re-create the columnstore index instead.
  • Cannot include the ASC or DESC keywords for sorting the index.
  • Cannot contain a column with a FILESTREAM attribute.
  • Cannot be updated.

And, there’s still more.

  • Cannot be used with change data capture.
  • Cannot be used with change tracking.
  • Cannot be used with replication (wai….).
  • Cannot be used with a bunch of data types (the main, common ones, are text, ntext, varchar(max), nvarchar(max)), however, the one’s you probably want to use are on the OK list.

At least one of those, probably more, but I’m only going to do one, deserves special treatment: Updating.

To update a columnstore index, the official documentation recommends 3 options:

  • Drop the index, INSERT, UPDATE and DELETE to your heart’s contentment, then recreate the columnstore index.
  • Partition the table and switch partitions.  If it’s a bulk insert, build a staging table, do your thing, build the columnstore index, and insert the table into an empty partition.  Otherwise, remove a partition, use it as a staging table, drop the index, do the deed, recreate the columnstore index, and add it back to the main table.
  • Leave the main table as it is.  Create a second table, without a columnstore index, use it to update, delete and insert data, then use a UNION ALL to combine the two tables.  As long as the second table is relatively small it should have a minor impact on query performance.  Note: This idea was worth this article alone.  I would not have considered it.

Source: http://msdn.microsoft.com/en-us/library/gg492088.aspx#Update

Lastly, what kind of column is best for a columnstore index?  The kind with a lot of duplicated values like a zip code, state, sales region, etc.  Seriously, that’s the best kind.

I have to say that I’m glad that I wrote this.  There is more to columnstore indexes than, “they go fast”.

Here is the MSDN for columnstore indexes.  It goes into everything I did and more.

http://msdn.microsoft.com/en-us/library/gg492088.aspx