17 Mar

SSIS Toolbox: Data Profiling Task

How do I describe this?  Well, it profiles data.  How is that for a great start?  There are 8 options, each with their own properties that you can apply to a column, or columns in some cases,.  The results are written to a XML file that you read using the Data Profile Viewer.  You can do things like count the number of NULL values in a column, or calculate the min or max value of a column.

Did you just say that I can do that in SQL?  No?  Sorry, that was me.

Anyways, let’s do this.

Step 1: Create the Data Profiling task.

  1. Create a new project, or package, name it something you like and put it someplace you like.
  2. Drag a Data Profiling task to the control flow and name it, or don’t.

Step 2: Configure the Data Profiling task.

You can also shortcut this process by clicking the “Quick Profile” button on the lower left part of the screen.  It includes the two options I used and it’s easier to run through.  If you want to do that you’ll still need to do step 1 in this section but you can skip the rest.

  1. Double-click the task to open it.  Under the General tab configure the following properties.
    1. Timeout: Sets a time out value for the task.  0 means that there is no timeout.  Leave it set at 0.
    2. Destination Type: The can be variable or a file.  Set it as FileConnection.
    3. Destination: This will create a connection (it will appear in the connection manager) and you can use either an existing file or a new connection (file).  I used “New File Connection” and pointed it to C:\testData\dqTest.xml.
    4. OverwriteDestination: I set it to True because I want to create a new file every time I run this task.
  2. Click the “Profile Requests” tab.  Note: When you are configuring options they won’t immediately show up when you select a type.  You need to click around until the option takes.  It’s kind of annoying.
  3. Click in the “Profile Type” field and select “Column Statistics Profile Request”.  This will calculate various statistics very similar to a PROC MEAN statement in SAS.  Configure its properties as follows:
    1. ConnectionManager: You’ll need a ADO.NET connection for this so click “New Connection” and point it to AdventureWorks2012.
    2. TableOrView: [Sales].[SalesOrderHeader]
    3. Column: TotalDue.
  4. One too many selected.

    One too many selected.

    Click in the “Profile Type” field and select “Candidate Key Profile”.  This evaluates whether a given column is a key or a good choice for one.

    1. ConnectionManager: Same as the earlier one.  You don’t need to create a new one.
    2. TableOrView: [Sales].[SalesOrderHeader].
    3. CandidateKey: *, lets look at all of them.
    4. ThresholdSetting: Determines whether to use the next two options or to only report when there is an exact match.
    5. KeyStrengthThreshold: Strength that the test must meet to report a candidate key.
    6. MaxNumberOfViolations: I let it rip at 100.
  5. It’s probably worth mentioning that you can’t remove an option once you select it.  You’ll need to rebuild the task or let it rip.
I'm very lonely.

I’m very lonely.

You can run this task by itself so fire it up.

Step 3: Read the results with the Data Profile viewer.

  1. The viewer can be found in the SQL Server Folder.  Navigate to it using the following path: Start > All Programs > Microsoft SQL Server 2012 > Integration Services > Data Profile Viewer.
  2. You should now have an incredibly magnificent tool in front of you.
  3. Click the Open button and go to the location that you stored the file you created earlier.  In my case, C:\testData\dqTest.xml and open it.
  4. Navigate through the tree, there should only be one path if you followed my instructions.
  5. You should see two profiles.  Feel free to study them to your heart’s content.  I gave them 5 seconds each.

Here’s a complete list of the options.

  • Column Length Distribution Profile: Measures lengths of columns and aggregates the results.  This is analogous to a GROUP BY LENGTH().
  • Column Null Ratio Profile: Number of NULL value relative to the table.
  • Column Pattern Profile: Uses regular expressions to evaluate data.  For instance, validating zip codes.
  • Column Statistics Profile: Various statistical data on a column, such as, min, max, average and standard deviation.
  • Column Value Distribution Profile: Number of times distinct values appear in a table.
  • Candidate Key Profile: Determines if a column is a key or a candidate for a key.  Can be used for multiple columns.
  • Functional Dependency Profile: Compares columns to see if there is consistency.  For instance, states matching up against a zip code.
  • Value Overlap Profile: Determines the overlap between columns.  It can be used to determine if a column can be used as a foreign key for another column.
Quick Profile Results.

Quick Profile Results.

This is another one of those things that I’m not sure we needed.  I can see why it’s there, I get that, but I guess I’m still old school in that I tend to think “Couldn’t I just write a query?”

Anyways, that is the Data Profile task, signed, sealed and delivered.