01 Apr

Data Quality Services: What It Does

For some reason I have had a hard time getting my head wrapped around what this product does and how it does it.  I just did some research on the topic and it amazes me how many articles out there that say “the first step is creating a knowledge base, and, here’s how you do it”.  Far too many articles jump into the “how” but completely neglect the “what it does” piece.

The Big 3.

The Big 3.

What I’m going to do is break down the DQS sections and explain them in general terms.  I’m certain that I’ll see at least one generic type of question where DQS is one of a list of potential solutions to a problem and I want to get at least that much right about the product.

As such, this is a 10,000 foot version post.  I’m not going step-by-step through a process or into any real detail on the product.  All of those other websites have already done that.  I just want to nail down what each section does in a general sense because I simply couldn’t find this information in a single place.

DQS, consists of three things: Knowledge Base Management, Data Quality Projects, and Administration.  In more detail they do the following things.

Knowledge Base Management

A knowledge base is a collection of metadata that you use for cleansing, matching or profiling.  It is, roughly, equivalent to a table schema, except that it’s more involved than that.   But, as far as I’m concerned, conceptually, that works.

Domain Management.

Domain Management.

To create a knowledge base you must first create a domains.  Domains are attributes in a very broad and very loose sense.  Specifically, there are 5 parts to the demain management section of DQS.

  • Domain properties: Attributes such as name, definition, format, etc.
  • Reference data: This is where you would add external data sources.  For instance, a list of valid zip codes could probably be obtained from a 3rd party data resource here.
  • Domain Rules: Conditions that you can apply to a domain.  A valid zip code must have at least 5 letters is a condition that you could apply to a domain.
  • Domain Values: If you have a data source that you are working with, you can view it here and confirm or clean it up.  So, if the potential data in your domain consists of 94520 and 94522, this is where you would adjust 94522 to 94523.
  • Term-Based Relations: You can create a list of values that map to another value.  Biz, Bus. and Bus could all be mapped to the word Business which could help to clean up your data.

So, a knowledge base consists, more or less, of rules used to restrict and clean up your data.

Data Quality Projects

Managing the results of a cleanse.

Managing the results of a cleanse.

In this section you take a data source and apply the knowledge base you created to it.  A data project consists of the following:

  • Map: You map your target data to an existing knowledge base.  It can be in a Excel file or SQL Server database.
  • Cleanse: Where you process the previous mapping step.
  • Manage and view results: This is where you review the results of the cleansing process.  You can correct and make adjustments to the results.
  • Export: You can export your results to SQL Server, Excel or a CSV file.

Data quality projects are how you process your data through a knowledge base.


This part has two sections: Activity Monitoring and Configuration.

Activity Monitoring is a list of all activities performed in DQS.  This includes things like domain management, creating a knowledge base and cleansing data.  Right-clicking gives you additional options, including the ability to terminate a running process.  The Profiler tab at the bottom provides more information on each specific activity.

Activity Monitoring.

Activity Monitoring.

Configuration consists of 3 sections:

  • Reference Data: Manage access to third-party data providers.
  • General Settings: Some, uh, general settings for the program with help buttons that didn’t work when I clicked on them.  I’m guessing my default browser isn’t supposed to be Firefox, or, maybe they just don’t work.
  • Log Settings: Set logging options, specifically when to trigger logging.  You can drill it down to specific modules.

I hope this helps someone.  Personally, I really struggled with getting this clear in my head and writing this has definitely helped.

28 Mar

Data Quality Services Security

This will be a short article.  It simply looks at the security roles in DQS.  There are only three, technically four, and there’s really nothing to do but list them.  So here goes.

DBA / sysadmin

  • Installs DQS.
  • Maintains the database.
  • Adds and creates users.
  • Is a member of the sysadmin group.

The following 3 roles manage data within DQS.  They are, in effect, data stewards.

DQS Administrator (dqs_administrator)

Things this role can do:

  • This role can do almost anything in the product.
  • Create, edit or execute a project.
  • Create and edit a knowledge base.
  • Terminate an activity.
  • Change configuration options.

Things this role cannot do:

  • Add users.
  • Install the DQS Server.

DQS KB Editor (dqs_kb_editor)

This is a more limited role in that it can view and edit activities but it cannot do administrative functions.

Things this role can do:

  • Create, edit or execute a project.
  • Create or edit a knowledge base.
  • View activity data.

Things this role cannot do:

  • Cannot end or stop an activity.
  • Change configuration options.

DQS KB Operator (dqs_operator)

The most limited of the three roles.

Things this role can do:

  • Create, edit or execute a project.
  • View activity data.

Things this role cannot do:

  • Create or edit a knowledge base.
  • Cannot end or stop an activity.
  • No administrative functions at all.

There’s an even easier way to think about it and that’s to look at it from the bottom up.


  • Create, edit or execute a project.
  • View activity data.


  • Do everything a dqs_operator can do.
  • Create and edit knowledge bases.


  • Everything a dqs_kb_editor can do.
  • Create, edit or execute a project.
  • Terminate an activity.
  • Change configuration options.

And that’s pretty much it.  For me, the disconnect was between the DBA and the administrator.  I kept thinking that the dq_administrator role would have some control of users when they do not.

The following article was my primary source.


27 Mar

Data Quality Services Installation

On the MeasureUp practice tests I keep getting asked about DQS installation, and, I keep failing the questions. I’m deliberately not memorizing answers, and I’m not overly fond of working with this product, so that’s to be expected. I guess it’s time to stop screwing those up.

Data Quality Services is a three-step installation process.

  1. Install the Data Quality server during SQL Server installation.
  2. Install the Data Quality client during SQL Server installation.
  3. Run the DQSInstaller.

You would think I could handle that.  Anyways, the installer does some additional things.

It creates the following databases.

  • DQS_MAIN: The guts of DQS.  It includes the engine, knowledge bases and stored procedures.
  • DQS_PROJECTS: The data for DQS projects
  • DQS_STAGING_DATA: Where you store staging data for projects.

It creates the following roles in DQS_MAIN.

  • dqs_administrator
  • dqs_kb_editor
  • dqs_kb_operator role

I’ll detail those in a different post.

It also creates additional logins, the install .log file and the DQInitDQS_MAIN stored procedure.

Finally, if Master Data Services is present, a user is mapped to Master Data Services with the dq_administrator role in the DQS_MAIN database.

You can read more here,


and this guy put some crazy time in on the product.


02 Mar

Email Validation, T-SQL, SSIS And DQS.

Everyone who has worked with data, or written code, knows 2 things. First, email addresses should be validated prior to insertion into the database. Second, the first thing often doesn’t happen. So, at some point you’ll get handed a list of email addresses, told to clean them up, and do something with them, and it’s not the something with them that will be your first thought.

This means you have to validate your email addresses, somehow. The first thing you’ll try is a bunch of combinations in a WHERE clause, which will look great, the first time, but fail miserably upon further testing. Then you’ll do what I just did, search Google, find a regular expression, and roll with it, because you write regular expression about every time Mars completes it’s third orbit around the sun, and hope for the best.

Here is a function that I liked and modified slightly because he had a nested custom function in it that I don’t have.

Source: http://stackoverflow.com/questions/229824/tsql-email-validation-without-regex

IF OBJECT_ID('dbo.fnAppEmailCheck', 'FN') IS NOT NULL
	DROP FUNCTION dbo.fnAppEmailCheck;
CREATE FUNCTION dbo.fnAppEmailCheck(@email VARCHAR(255))   
     DECLARE @valid bit  
     IF @email IS NOT NULL   
          SET @email = LOWER(@email)  
          SET @valid = 0  
          IF @email like '[a-z,0-9,_,-]%@[a-z,0-9,_,-]%.[a-z][a-z]%'  
             AND @email NOT like '%@%@%'  
             AND CHARINDEX('.@',@email) = 0  
             AND CHARINDEX('..',@email) = 0  
             AND CHARINDEX(',',@email) = 0  
             AND RIGHT(@email,1) between 'a' AND 'z'  
               SET @valid=1  
     RETURN @valid  

SELECT dbo.fnAppEmailCheck('mickey@mouse.com') as 'test 1',
dbo.fnAppEmailCheck('mickey@@mouse.com') as 'test 2',
dbo.fnAppEmailCheck('mic.key@mouse.com') as 'test 3',
dbo.fnAppEmailCheck('mickey@m.ouse.com') as 'test 4',
dbo.fnAppEmailCheck('mickey@mouse..com') as 'test 5';

test 1	test 2	test 3	test 4	test 5
1	0	1	1	0

Cool, my crack, in-depth QA effort confirms that it is 100% effective.

However, I got to thinking, could we do this another way, maybe with SSIS and data quality services, and, in fact, you pretty much can. However, as with many things in SSIS, you just have to take 13 steps to take 3. So, you might not want to do this, unless you have to validate a lot of email on a regular basis.

step1So, lets fire up DQS. You say you haven’t installed DQS? Well, too bad for you, this is only for people who opened up their SQL Server folder, hit the Data Quality Services folder, clicked the install icon and fought their way through that. If you have, then good for you, if not, go find some kittens to watch on youtube.

Start DQS and do the following:

  1. Name the project remove “Remove Demons From Email”.
  2. Select None in the “Create Knowledge Base From” box.
  3. Select “Knowledge Discover” in the bottom right.

Technically, this part isn’t necessary.  You don’t have to do the Knowledge discovery to make all of this work.  You can, instead, just go straight to the Domain Management process and work from there.  This is useful if you’ve never been in DQS before and it will help it all make sense.

——————  Skip If You’ve Been Here Before —————————

Click Next.

  1. Select “SQL Server” in the “Data Source” box.
  2. Select “AdventureWorks2012” or some variant of that in the “Database” box.  If you don’t have it, get it.
  3. Select “EmailAddress” in the Table/View box.
  4. Click the “Add A Domain Icon” and create a domain as follows:
    1. Name: Email
    2. Uncheck the Speller option.
    3. Uncheck the “Syntax Error Algorithms.
    4. Click OK.
  5. In the Source Column select EmailAddress and in the Domain column select Email.

step2Now, if you are like me, you probably thought they would alphabetize the Table / View section.  Or, they’d leave the schema names in the names, but, nope, you get none of that.  You are actually looking for the person.emailaddress table and it is in the order you would find it in SSMS, it’s just not called that.  It’s perfectly logical in Microsoft land, I guess.

Click Next.

Click Start and wait.

Click Next.

Don’t change anything here but feel free to review the items.  This is a manual step that allows you to review each value individually and approve or delete as necessary.  We’re going to accept everything at this point.

Click Finish, but DO NOT PUBLISH.  This will save the project in an incomplete state.

——————  End Skip —————————

You should be looking at the start screen, click “Open Knowledge Base”, then click “Purge Demons From Email” and hit Next.

This will open the Domain Management screen.

step4Click on Domain Rules

  1. Click the add a new domain icon on the right.
  2. In the Name field give it a name.  I called mine “HopeItWorks” because I wasn’t positive this would work the first time.
  3. Ignore the error that will probably show up at this point.
  4. In the “Build A Rule” section do the following.
    1. Select “Value matches a regular expression” and paste “[a-z,0-9,_,-]@[a-z,0-9,_,-].[a-z][a-z]”.
    2. Right-click and select “Add Clause”.
      1. Select “Value does not contain and @@ for the option.
    3. Repeat the prior steps for “.@”, “..” and “,”
  5. Finally add two more clauses for “Value begins with” and “Value ends with” as [a-zA-Z].  Note, this is the part where HopeItWorks took on real meaning.
  6. Click “Apply Changes” and it should say that your rules will have no effect, on anything.


Please keep in mind that these are not all the rules that could be applied to email.  See the link at the bottom of this article for a more complete discussion.

You can test the rules by going to Domain Values, click “Add New Domain Value” and testing it out.  The following will all work for testing.


Check “Show Only New” to see just  the items you added.

You should see something that looks like this with only 1 as valid.

Delete them all by clicking on the “Delete selected domain values” icon.

Click Finish and Publish the knowledge base.

And guess what, you are only half way there.  Now you actually have to use SSIS.

Open Microsoft Visual studio and create a new Integration package.  Name it what you will, put it where you will and name the package in it what you like.

And, here, we, go.

Step 1

Create a file with some random email addresses in them.  These are the one’s that I used.


Name the file and put it somewhere that you’ll remember where it is.

Step 2

Create a flat file connection manager to it.  A basic old generic one will be fine.

Step 3

Drag a “data flow source” to your control flow, or just click data flow, and hit the link, like everyone but the writers of books do.

Step 4

Drag a flat file source to the data flow and point it to the connection manager you just created.

Note: I’m assuming you know this stuff.  If you don’t, you are in the wrong place.

Step 5

Drag a “DQS Cleansing” transform to the data flow, connect the flat file source to it and do the following in the Connection Manager tab.

  1. Connect to the DQS connection on localhost, or wherever your DQS project is stored.
  2. For the “Data Quality Knowledge Base”, select “Purge Demons From Email” or whatever you called it.

In the Mapping tab

  1. Check the box next to Email.
  2. Select Email as the domain.

Ignore the Advanced tab and click OK to exit.

Step 6

ssisDrag a “Flat File Destination”, uh, destination and connect the cleansing task to it.  Create a place to store the file and hit the Mapping tab where you should see 4 fields.  Click OK.

Step 7

Technically, we’re done, but because I’m lazy, I just want to see the results in SSIS.  So, do the following.

  • Click the connection between the “DQS Cleansing” transform, and the flat file destination and hit edit, or just double-click on it.
  • Click the “Data Viewer” option, in this completely different UI, from what you were just working with, and check the “Enable Data Viewer” box.  Click OK to exit.

This will allow you to run the package and see the results in SSIS.

If you run the package, using my email list earlier you should see a popup in SSIS that contains the following:

Email_Source			Email_Output			Email_Status	Record Status
joe@lumberjack.com		joe@lumberjack.com		New		New
bill.bill@bill.com		bill.bill@bill.com		New		New
bill..bill@joe.com		bill..bill@joe.com		Invalid		Invalid
whompa.com			whompa.com			Invalid		Invalid
whompa@whompa.com		whompa@whompa.com		New		New
1pancakesaregood@whoops.com	1pancakesaregood@whoops.com	Invalid		Invalid
pancakesaregood@whoops.com	pancakesaregood@whoops.com	New		New
pancakesaregood@@oops.com	pancakesaregood@@oops.com	Invalid		Invalid

All things considered, for my in-depth and thorough test, I’d say that this exercise worked. It tagged the invalid fields and we could put this data almost anywhere, including handing it back to marketing, and we’d be good to go.

Now, honestly, would you do this? I’d have to think about it. One you establish the Knowledge Base this is actually a pretty easy process and you could just split the results and fire the bad one’s off to someone else. On the other hand, you could toss the records into a database, run a query against the data, like we started with, and just process it off that way. If this was a regular event, say in a ETL process, then setting up what I’ve done here makes great sense. If it’s a one-off, I’d just do it with SQL and be done with it.

One last note. Here are the rules for a valid email address courtesy of wikipedia. Have fun, it’s very exciting.


NOTE: Email address also have length rules which this post doesn’t deal with.  It was one thing too many and this was really about setting up a DQS knowledge base and applying it to SSIS.