For some reason I have had a hard time getting my head wrapped around what this product does and how it does it. I just did some research on the topic and it amazes me how many articles out there say “the first step is creating a knowledge base, and, here’s how you do it”. Far too many articles jump into the “how” but completely neglect the “what it does” piece.
What I’m going to do is break down the DQS sections and explain them in general terms. I’m certain that I’ll see at least one generic type of question where DQS is one of a list of potential solutions to a problem and I want to get at least that much right about the product.
As such, this is a 10,000 foot version post. I’m not going step-by-step through a process or into any real detail on the product. All of those other websites have already done that. I just want to nail down what each section does in a general sense because I simply couldn’t find this information in a single place.
DQS consists of three things: Knowledge Base Management, Data Quality Projects, and Administration. In more detail, they do the following.
Knowledge Base Management
A knowledge base is a collection of metadata that you use for cleansing, matching or profiling. It is, roughly, equivalent to a table schema, except that it’s more involved than that. But, as far as I’m concerned, conceptually, that works.
To create a knowledge base you must first create domains. Domains are attributes in a very broad and very loose sense. Specifically, there are 5 parts to the domain management section of DQS.
- Domain properties: Attributes such as name, definition, format, etc.
- Reference data: This is where you would add external data sources. For instance, a list of valid zip codes could probably be obtained from a 3rd party data resource here.
- Domain Rules: Conditions that you can apply to a domain. For example, “a valid zip code must have at least 5 digits” is a condition that you could apply to a domain.
- Domain Values: If you have a data source that you are working with, you can view it here and confirm or clean it up. So, if the potential data in your domain consists of 94520 and 94522, this is where you would adjust 94522 to 94523.
- Term-Based Relations: You can create a list of values that map to another value. Biz, Bus. and Bus could all be mapped to the word Business which could help to clean up your data.
So, a knowledge base consists, more or less, of rules used to restrict and clean up your data.
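To make the idea concrete for myself, here is a rough sketch of a domain in plain Python. This is not the DQS API (DQS is driven through its client application); the `Domain` class, its field names and the `cleanse` method are all made up for illustration, and the zip code rule is assumed to mean "at least five digits".

```python
import re
from dataclasses import dataclass, field


@dataclass
class Domain:
    """A toy stand-in for a DQS domain; hypothetical, not the DQS API."""
    name: str
    rules: list = field(default_factory=list)              # callables: value -> bool
    value_corrections: dict = field(default_factory=dict)  # known value -> corrected value
    term_relations: dict = field(default_factory=dict)     # term -> replacement term

    def cleanse(self, value):
        # Domain values: swap a known value for its correction.
        value = self.value_corrections.get(value, value)
        # Term-based relations: replace individual terms inside the value.
        value = " ".join(self.term_relations.get(t, t) for t in value.split())
        # Domain rules: the value is valid only if every rule passes.
        is_valid = all(rule(value) for rule in self.rules)
        return value, is_valid


zip_code = Domain(
    name="ZipCode",
    rules=[lambda v: re.fullmatch(r"\d{5,}", v) is not None],
    value_corrections={"94522": "94523"},
)

company = Domain(
    name="CompanyName",
    term_relations={"Biz": "Business", "Bus.": "Business", "Bus": "Business"},
)

print(zip_code.cleanse("94522"))
print(zip_code.cleanse("9452"))
print(company.cleanse("Acme Bus. Supplies"))
```

In this sketch a "knowledge base" would just be a collection of such domains, which is about as far as the analogy needs to go.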
Data Quality Projects
In this section you take a data source and apply the knowledge base you created to it. A data project consists of the following:
- Map: You map your target data to an existing knowledge base. The data can be in an Excel file or a SQL Server database.
- Cleanse: Where you process the previous mapping step.
- Manage and view results: This is where you review the results of the cleansing process. You can correct and make adjustments to the results.
- Export: You can export your results to SQL Server, Excel or a CSV file.
Data quality projects are how you process your data through a knowledge base.
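The four steps can be pictured as a small pipeline. Again, this is only a conceptual sketch in Python and has nothing to do with how DQS is actually implemented; `knowledge_base`, `mapping`, `run_project` and `export_csv` are hypothetical names, and the cleansing functions are deliberately trivial.

```python
import csv
import io

# Hypothetical knowledge base: domain name -> cleansing function.
knowledge_base = {
    "ZipCode": lambda v: v.strip(),
    "CompanyName": lambda v: v.replace("Bus.", "Business"),
}

# Map: which source column is checked against which domain.
mapping = {"zip": "ZipCode", "company": "CompanyName"}

source_rows = [
    {"zip": " 94520 ", "company": "Acme Bus. Supplies"},
    {"zip": "94523", "company": "Widget Business Group"},
]


def run_project(rows, mapping, kb):
    """Cleanse: run every mapped column through its domain."""
    results = []
    for row in rows:
        cleaned = {col: kb[domain](row[col]) for col, domain in mapping.items()}
        # Manage and view results: flag rows that were changed so they can be reviewed.
        cleaned["changed"] = any(cleaned[col] != row[col] for col in mapping)
        results.append(cleaned)
    return results


def export_csv(results):
    """Export: write the cleansed rows out as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(results[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()


results = run_project(source_rows, mapping, knowledge_base)
csv_text = export_csv(results)
print(csv_text)
```

The point is only the shape of the process: the knowledge base holds the rules, and the project is what pushes a particular data source through them.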
Administration
This part has two sections: Activity Monitoring and Configuration.
Activity Monitoring is a list of all activities performed in DQS. This includes things like domain management, creating a knowledge base and cleansing data. Right-clicking gives you additional options, including the ability to terminate a running process. The Profiler tab at the bottom provides more information on each specific activity.
Configuration consists of 3 sections:
- Reference Data: Manage access to third-party data providers.
- General Settings: Some, uh, general settings for the program with help buttons that didn’t work when I clicked on them. I’m guessing my default browser isn’t supposed to be Firefox, or, maybe they just don’t work.
- Log Settings: Set logging options, specifically when to trigger logging. You can drill it down to specific modules.
I hope this helps someone. Personally, I really struggled with getting this clear in my head and writing this has definitely helped.