19 Mar

SSIS: Hash Functions

Hash functions are, for purposes of SSIS, a fast way to check for a change in a dimension.  Effectively what you do is hash the dimension and compare the hash of the incoming values and if they don’t find a match, something has changed.

You implement a hash in SSIS in a few ways.

  • Use T-SQL to write a SELECT with the HASHBYTES function.
  • Use the script component transformation.
  • Use a custom component.

For testing purposes my guess would be that knowing there is such a thing would be most of the battle.

I did a quick demo hash using MD5 for both SQL Server and MySQL below.

SQL Server

SELECT HASHBYTES('MD5', 'Erik Haselhofer');

0x05D2C805985ACEFD34CBA674B47C5FDD

MySQL

mysql> SELECT MD5('Erik Haselhofer');
+----------------------------------+
| MD5('Erik Haselhofer')           |
+----------------------------------+
| 05d2c805985acefd34cba674b47c5fdd |
+----------------------------------+
1 row in set (0.03 sec)

Interestingly, they are different. Both the 2nd and 3rd letters as well as it’s case. I checked a couple of other places online and it looks like MySQL wins the “everybody does it like me” award. I don’t know the reasoning behind SQL Server doing what they did but for purposes of SCD’s it won’t matter.

02 Mar

Email Validation, T-SQL, SSIS And DQS.

Everyone who has worked with data, or written code, knows 2 things. First, email addresses should be validated prior to insertion into the database. Second, the first thing often doesn’t happen. So, at some point you’ll get handed a list of email addresses, told to clean them up, and do something with them, and it’s not the something with them that will be your first thought.

This means you have to validate your email addresses, somehow. The first thing you’ll try is a bunch of combinations in a WHERE clause, which will look great, the first time, but fail miserably upon further testing. Then you’ll do what I just did, search Google, find a regular expression, and roll with it, because you write regular expression about every time Mars completes it’s third orbit around the sun, and hope for the best.

Here is a function that I liked and modified slightly because he had a nested custom function in it that I don’t have.

Source: http://stackoverflow.com/questions/229824/tsql-email-validation-without-regex

IF OBJECT_ID('dbo.fnAppEmailCheck', 'FN') IS NOT NULL
	DROP FUNCTION dbo.fnAppEmailCheck;
GO
CREATE FUNCTION dbo.fnAppEmailCheck(@email VARCHAR(255))   
RETURNS bit  
AS  
BEGIN  
     DECLARE @valid bit  
     IF @email IS NOT NULL   
          SET @email = LOWER(@email)  
          SET @valid = 0  
          IF @email like '[a-z,0-9,_,-]%@[a-z,0-9,_,-]%.[a-z][a-z]%'  
             AND @email NOT like '%@%@%'  
             AND CHARINDEX('.@',@email) = 0  
             AND CHARINDEX('..',@email) = 0  
             AND CHARINDEX(',',@email) = 0  
             AND RIGHT(@email,1) between 'a' AND 'z'  
               SET @valid=1  
     RETURN @valid  
END

SELECT dbo.fnAppEmailCheck('mickey@mouse.com') as 'test 1',
dbo.fnAppEmailCheck('mickey@@mouse.com') as 'test 2',
dbo.fnAppEmailCheck('mic.key@mouse.com') as 'test 3',
dbo.fnAppEmailCheck('mickey@m.ouse.com') as 'test 4',
dbo.fnAppEmailCheck('mickey@mouse..com') as 'test 5';

test 1	test 2	test 3	test 4	test 5
1	0	1	1	0

Cool, my crack, in-depth QA effort confirms that it is 100% effective.

However, I got to thinking, could we do this another way, maybe with SSIS and data quality services, and, in fact, you pretty much can. However, as with many things in SSIS, you just have to take 13 steps to take 3. So, you might not want to do this, unless you have to validate a lot of email on a regular basis.

step1So, lets fire up DQS. You say you haven’t installed DQS? Well, too bad for you, this is only for people who opened up their SQL Server folder, hit the Data Quality Services folder, clicked the install icon and fought their way through that. If you have, then good for you, if not, go find some kittens to watch on youtube.

Start DQS and do the following:

  1. Name the project remove “Remove Demons From Email”.
  2. Select None in the “Create Knowledge Base From” box.
  3. Select “Knowledge Discover” in the bottom right.

Technically, this part isn’t necessary.  You don’t have to do the Knowledge discovery to make all of this work.  You can, instead, just go straight to the Domain Management process and work from there.  This is useful if you’ve never been in DQS before and it will help it all make sense.

——————  Skip If You’ve Been Here Before —————————

Click Next.

  1. Select “SQL Server” in the “Data Source” box.
  2. Select “AdventureWorks2012” or some variant of that in the “Database” box.  If you don’t have it, get it.
  3. Select “EmailAddress” in the Table/View box.
  4. Click the “Add A Domain Icon” and create a domain as follows:
    1. Name: Email
    2. Uncheck the Speller option.
    3. Uncheck the “Syntax Error Algorithms.
    4. Click OK.
  5. In the Source Column select EmailAddress and in the Domain column select Email.

step2Now, if you are like me, you probably thought they would alphabetize the Table / View section.  Or, they’d leave the schema names in the names, but, nope, you get none of that.  You are actually looking for the person.emailaddress table and it is in the order you would find it in SSMS, it’s just not called that.  It’s perfectly logical in Microsoft land, I guess.

Click Next.

Click Start and wait.

Click Next.

Don’t change anything here but feel free to review the items.  This is a manual step that allows you to review each value individually and approve or delete as necessary.  We’re going to accept everything at this point.

Click Finish, but DO NOT PUBLISH.  This will save the project in an incomplete state.

——————  End Skip —————————

You should be looking at the start screen, click “Open Knowledge Base”, then click “Purge Demons From Email” and hit Next.

This will open the Domain Management screen.

step4Click on Domain Rules

  1. Click the add a new domain icon on the right.
  2. In the Name field give it a name.  I called mine “HopeItWorks” because I wasn’t positive this would work the first time.
  3. Ignore the error that will probably show up at this point.
  4. In the “Build A Rule” section do the following.
    1. Select “Value matches a regular expression” and paste “[a-z,0-9,_,-]@[a-z,0-9,_,-].[a-z][a-z]”.
    2. Right-click and select “Add Clause”.
      1. Select “Value does not contain and @@ for the option.
    3. Repeat the prior steps for “.@”, “..” and “,”
  5. Finally add two more clauses for “Value begins with” and “Value ends with” as [a-zA-Z].  Note, this is the part where HopeItWorks took on real meaning.
  6. Click “Apply Changes” and it should say that your rules will have no effect, on anything.

step5

Please keep in mind that these are not all the rules that could be applied to email.  See the link at the bottom of this article for a more complete discussion.

You can test the rules by going to Domain Values, click “Add New Domain Value” and testing it out.  The following will all work for testing.

mickey@mouse.com
mi..ckey@kmouse.com
mickey@@mouse.com
1mickey@mouse.com

Check “Show Only New” to see just  the items you added.

You should see something that looks like this with only 1 as valid.

Delete them all by clicking on the “Delete selected domain values” icon.

Click Finish and Publish the knowledge base.

And guess what, you are only half way there.  Now you actually have to use SSIS.

Open Microsoft Visual studio and create a new Integration package.  Name it what you will, put it where you will and name the package in it what you like.

And, here, we, go.

Step 1

Create a file with some random email addresses in them.  These are the one’s that I used.

joe@lumberjack.com
bill.bill@bill.com
bill..bill@joe.com
whompa.com
whompa@whompa.com
1pancakesaregood@whoops.com
pancakesaregood@whoops.com
pancakesaregood@@oops.com

Name the file and put it somewhere that you’ll remember where it is.

Step 2

Create a flat file connection manager to it.  A basic old generic one will be fine.

Step 3

Drag a “data flow source” to your control flow, or just click data flow, and hit the link, like everyone but the writers of books do.

Step 4

Drag a flat file source to the data flow and point it to the connection manager you just created.

Note: I’m assuming you know this stuff.  If you don’t, you are in the wrong place.

Step 5

Drag a “DQS Cleansing” transform to the data flow, connect the flat file source to it and do the following in the Connection Manager tab.

  1. Connect to the DQS connection on localhost, or wherever your DQS project is stored.
  2. For the “Data Quality Knowledge Base”, select “Purge Demons From Email” or whatever you called it.

In the Mapping tab

  1. Check the box next to Email.
  2. Select Email as the domain.

Ignore the Advanced tab and click OK to exit.

Step 6

ssisDrag a “Flat File Destination”, uh, destination and connect the cleansing task to it.  Create a place to store the file and hit the Mapping tab where you should see 4 fields.  Click OK.

Step 7

Technically, we’re done, but because I’m lazy, I just want to see the results in SSIS.  So, do the following.

  • Click the connection between the “DQS Cleansing” transform, and the flat file destination and hit edit, or just double-click on it.
  • Click the “Data Viewer” option, in this completely different UI, from what you were just working with, and check the “Enable Data Viewer” box.  Click OK to exit.

This will allow you to run the package and see the results in SSIS.

If you run the package, using my email list earlier you should see a popup in SSIS that contains the following:

Email_Source			Email_Output			Email_Status	Record Status
joe@lumberjack.com		joe@lumberjack.com		New		New
bill.bill@bill.com		bill.bill@bill.com		New		New
bill..bill@joe.com		bill..bill@joe.com		Invalid		Invalid
whompa.com			whompa.com			Invalid		Invalid
whompa@whompa.com		whompa@whompa.com		New		New
1pancakesaregood@whoops.com	1pancakesaregood@whoops.com	Invalid		Invalid
pancakesaregood@whoops.com	pancakesaregood@whoops.com	New		New
pancakesaregood@@oops.com	pancakesaregood@@oops.com	Invalid		Invalid

All things considered, for my in-depth and thorough test, I’d say that this exercise worked. It tagged the invalid fields and we could put this data almost anywhere, including handing it back to marketing, and we’d be good to go.

Now, honestly, would you do this? I’d have to think about it. One you establish the Knowledge Base this is actually a pretty easy process and you could just split the results and fire the bad one’s off to someone else. On the other hand, you could toss the records into a database, run a query against the data, like we started with, and just process it off that way. If this was a regular event, say in a ETL process, then setting up what I’ve done here makes great sense. If it’s a one-off, I’d just do it with SQL and be done with it.

One last note. Here are the rules for a valid email address courtesy of wikipedia. Have fun, it’s very exciting.

http://en.wikipedia.org/wiki/Email_address

NOTE: Email address also have length rules which this post doesn’t deal with.  It was one thing too many and this was really about setting up a DQS knowledge base and applying it to SSIS.