05 Mar

SSIS Fuzzy Grouping Transform

I wanted to run through this because I like the concept and I think while it’s tough to dial in, it can be useful. So, lets look at the Fuzzy Grouping transform.

First, create a text file, enter the following data, and save it somewhere that’s easy to access.

Business Intelligence
BI
Bus. Int.
Business Intel.
Bus. Intel.
B. Intelligence
B.I.
Business Int.
Bus. Intelligence
Bus Intel

Open Microsoft Visual Studio, create a new project, call it what you will and save it. Or, just create a new package in an existing package. I’m assuming you can get this far without explicit directions.

  1. Drag a data flow object to the control flow, or just click the data flow tab, and hit the link to create a data flow.
  2. Drag a flat file source to the data flow, click new, and point it to the file you created earlier. Make sure “Columns in the first data row” is not selected.
  3. Drag a fuzzy grouping transform to the data flow and connect the flat file source to it.
  4. Double-click the fuzzy grouping transform.
    1. Select, or create, an OLE DB Connection.  Which one doesn’t matter for this example.
    2. In the columns tab, click the column, it should be called “Column 0” unless you renamed it earlier.  You don’t need to change anything else, however, here are the more interesting columns
      • Match Type: Fuzzy or exact.  You want to use fuzzy but you can set it to be an exact match.
      • Minimum Similarity.  Determines how accurate the match should be.  The higher the number the more accurate.  Leave it alone as well set it the next tab.
      • Numerals:  Looks at leading or trailing numbers for comparison purposes.
      • Comparison Flags: A handful of options to set specific rules for the matching.
    3. In the advanced tab, for now, leave everything alone but this is where you can set the similarity for matching.
    4. Click OK
  5. Drag a flat file destination to the data flow, connect the fuzzy grouping to it, and configure it however you like.
  6. Click on the connector you just created, hit edit, or just double-click it, click “Data Viewer” and check the “Enable data viewer” option.

This will enable a fuzzy grouping transform.  What I’m going to do now is run through a few different similarity settings just so you get an idea of how it works.  When I glanced through the books the first time I basically missed it and this kind of explanation would have helped.  Essentially, a similarity setting of 100 means nothing will change, while a setting of 0 rolls everything up into what the algorithm thinks is the best match.

I’m using the Advanced tab in the fuzzy grouping transform to set the similarity value for these examples.

Similarity 80

_key_in	_key_out _score	Corrected      	       Column_clean             _Similarity_Column
1	1	 1	Business Intelligence  Business Intelligence	1
2	2	 1	BI	               BI	                1
3	3	 1	Bus. Int.	       Bus. Int.	        1
10	3	 0.9875	Bus Int	               Bus. Int.	        0.9875
4	4	 1	Business Intel.	       Business Intel.	        1
5	5	 1	Bus. Intel.	       Bus. Intel.	        1
11	5	 0.9875	Bus Intel	       Bus. Intel.	        0.9875
6	6	 1	B. Intelligence	       B. Intelligence	        1
7	7	 1	B.I.	               B.I.	                1
8	8	 1	Business Int.	       Business Int.	        1
9	9	 1	Bus. Intelligence      Bus. Intelligence	1

As you can see the transform combined two rows into two groups based on the relatively high similarity score. Lets lower it, a bit.

Similarity 50

_key_in	_key_out _score	   Corrected	 	 Column_clean	 _Similarity_Column
2	2	 1	   BI	                 BI	         1
5	5	 1	   Bus. Intel.	         Bus. Intel.	 1
11	5	 0.9875	   Bus Intel	         Bus. Intel.	 0.9875
3	5	 0.7435962 Bus. Int.	         Bus. Intel.	 0.7435962
10	5	 0.7435962 Bus Int	         Bus. Intel.	 0.7435962
4	5	 0.6795985 Business Intel.	 Bus. Intel.	 0.6795985
9	5	 0.6315233 Bus. Intelligence	 Bus. Intel.	 0.6315233
6	6	 1	   B. Intelligence	 B. Intelligence 1
7	7	 1	   B.I.	                 B.I.	         1
8	8	 1	   Business Int.	 Business Int.	 1
1	8	 0.609375  Business Intelligence Business Int. 	 0.609375

Still two sets of matches but a lot more rows.

Similarity 25

_key_in	_key_out _score	   Corrected	 	 Column_clean	_Similarity_Column
2	2	 1	   BI	                 BI	        1
5	5	 1	   Bus. Intel.	         Bus. Intel.	1
11	5	 0.9875	   Bus Intel	         Bus. Intel.	0.9875
10	5	 0.7435962 Bus Int	         Bus. Intel.	0.7435962
3	5	 0.7435962 Bus. Int.	         Bus. Intel.	0.7435962
4	5	 0.6795985 Business Intel.	 Bus. Intel.	0.6795985
9	5	 0.6315233 Bus. Intelligence	 Bus. Intel.	0.6315233
8	5	 0.4735985 Business Int.	 Bus. Intel.	0.4735985
1	5	 0.3835568 Business Intelligence Bus. Intel.	0.3835568
6	5	 0.299975  B. Intelligence	 Bus. Intel.	0.299975
7	7	 1	   B.I.	                 B.I.	        1

Here we have one big grouping of matches as the earlier potential matches merged.

Similarity 0

_key_in	_key_out _score	   Corrected	 	 Column_clean	_Similarity_Column
2	2	 1	   BI	                 BI	        1
5	5	 1	   Bus. Intel.	         Bus. Intel.	1
11	5	 0.9875	   Bus Intel	         Bus. Intel.	0.9875
10	5	 0.7435962 Bus Int	         Bus. Intel.	0.7435962
3	5	 0.7435962 Bus. Int.	         Bus. Intel.	0.7435962
4	5	 0.6795985 Business Intel.	 Bus. Intel.	0.6795985
9	5	 0.6315233 Bus. Intelligence	 Bus. Intel.	0.6315233
8	5	 0.4735985 Business Int.	 Bus. Intel.	0.4735985
1	5	 0.3835568 Business Intelligence Bus. Intel.	0.3835568
6	5	 0.299975  B. Intelligence	 Bus. Intel.	0.299975
7	7	 1	   B.I.	                 B.I.	        1

Ah, the nice thing about that one, is I can just copy the prior results.  If we’d been paying more attention we would have seen that the similarity columns have no gain below 29.  So there was no reason to go to 0.

The other thing I didn’t mention is that when you have a match between key_in, and key_out the lookup is effectively designating that value as unique, or dominant.  It’s the alpha male of the group, so to speak.  Just in case you happened not to notice that.

There’s more going on here but this is, at least it was for me, a decent start with it.