18 Mar

SSIS Toolbox: Merge, Merge Join and Lookup

Merge, Merge Join and Lookup are 3 transformations that related in the sense that they combine two sets of data in some fashion.  They are very straight-forward so let’s get right to it.

First we need some data.  Run the following code in a databases of your choice.

CREATE TABLE table1
(
id INT IDENTITY(1,1),
someCol1 VARCHAR(20)
);
INSERT INTO table1 (someCol1) VALUES
('1Joe'),('1Frank'),('1Bill'),('1Tom');
CREATE TABLE table2
(
id INT IDENTITY(1,1),
someCol2 VARCHAR(20)
);
INSERT INTO table2 (someCol2) VALUES
('2Jill'),('2Francine'),('2Brenda');

Create two OLE DB data sources and add a sort transformation to each of them.

I’m only going to do this section once but it applies, mostly, to the other transformations.  Although the Lookup transformation only uses a very small piece of it.

  1. Create a project and a package, name them as you will, then create a Data Flow task.
  2. Drag two OLE DB sources to the Data Flow.
  3. Name the them Table1 and Table2.
  4. Configure Table1 as follows:
    1. OLE DB connection manager: click New and point it to the table the database where you ran the CREATE TABLE statements in earlier.
    2. Data access mode: Table or view.
    3. Name of the table or view: [dbo].[table1].
  5. Configure the Table2 as follows:
    1. OLE DB connection manager: click New and point it to the table the database where you ran the CREATE TABLE statements in earlier.
    2. Data access mode: Table or view.
    3. Name of the table or view: [dbo].[table2].
  6. Sort Configuration.

    Sort Configuration.

    Drag two Sort transformations to the Data Flow and connect them to the OLE DB sources.

  7. Configure them as follows:
    1. Place a check in the box next to id and make sure that the someCol field is checked in the “Pass Through” column.
    2. Accept the defaults, they should be id, id, ascending and 1 in that order.

Merge Join Transformation

  1. Drag a Merge Join transformation and connect the two Sort transformations to it.  Configure the first Sort connection as “Merge Join Left Input” and the second as “Merge Join Right Input”.
  2. Double click the Merge Join transformation and edit it as follows.
    1. Join type: Left outer join.  Ignore the swap inputs.
    2. Ignore the swap inputs button.  It only switches the position of the Sorts.
    3. Make sure that the two Sorts are connected on the id column and that all 4 boxes are checked.
  3. Click OK to close the box.
Finished Merge Join.

Finished Merge Join.

This part is the same for all 3 transformations.   It’s only done so you can see a result.

  1. Drag a UNION ALL to the Data Flow and connect it the Merge Join.
  2. Click on the path, select the “Data Viewer” tab, and check the “Enable data viewer” box.

Run the package and you should see a result that looks like this:

id	someCol1	id (1)	someCol2
1	1Joe	        1	2Jill
2	1Frank	        2	2Francine
3	1Bill	        3	2Brenda
4	1Tom	        NULL	NULL

In other words, it’s a classic LEFT JOIN. Exactly as if I’d written this,

SELECT *
FROM Table1 t1
LEFT JOIN Table2 t2
ON t1.id = t2.id

Merge Transformation

A merge transformation is different in that it combines columns.  As such it functions much more like a UNION SQL statement.

Merge Configuration.

Merge Configuration.

In order to configure this you have 2 options.  You can repeat everything we just did or you can delete the Merge Join transformation in your Data Flow and replace it with it a Merge transformation.  I vote the latter.

  1. Delete the Merge Join transformation.
  2. Drag a Merge transformation to it and connect the first sort to it as “Merge Input 1”.  Connect the second sort to it, which will default to the “Merge Input 2” option.
  3. Double-click the Merge transformation and accept the defaults.

Connect the UNION ALL per the instructions in the Merge Join transformation.

If you run the package you should get the following result:

id	someCol1
1	NULL
1	1Joe
2	NULL
2	1Frank
3	NULL
3	1Bill
4	1Tom
Finished Merge Transformation.

Finished Merge Transformation.

That’s almost what you want but not quite. Double-click the Merge transformation and in the second row (someCol1), set the Merge Input column to somecol2 and run the package, again.

id	someCol1
1	2Jill
1	1Joe
2	2Francine
2	1Frank
3	2Brenda
3	1Bill
4	1Tom

That is what we want and it’s equivalent to the following SQL.

SELECT id, someCol1
FROM table1
UNION
SELECT id, someCol2
FROM table2

You can save, delete or whatever you’d like with the first two examples, although the easiest thing to do would be to delete everything but the Table1 OLE DB source.

Lookup Transformation

The Lookup transformation is similar to a join, in that it works the same way, but it’s actually two queries in place of one.  In other words, you can have multiple outputs based on whether you match the Lookup or not.  You also build it in a much different fashion than we did in the earlier transformations.  Let’s finish this off.

  1. Either leave the Table 1 OLE DB source, and delete everything else, or create a new package and a OLE DB source configured like we did earlier.
  2. Drag a Lookup transformation to the Data Flow and connect the Table1 OLE DB source to it.
  3. Double-click the Lookup transformation and configure it as follows:
    1. In the General tab set the “Specify how to handle rows with no matching entries” to “Redirect rows to no match output”.
    2. Click the Connection tab, use the connection you created earlier and set “Use a table or a view” to “[dbo].[table2].
    3. In the Columns tab, connect the id fields to each other and check both boxes in the “Available Lookup Columns”.
    4. Click OK to finish.
  4. Drag a UNION ALL to the Data Flow and connect it to the Lookup transformation.  When the prompt appears choose “Lookup Match Output”.  Drag a second UNION ALL to the Data Flow and connect the Lookup to it as well.  It will default to “Lookup No Match Output”.
  5. Enable the Data Viewer for both of these.
Finished Lookup.

Finished Lookup.

Right now you should have a red-X in the first UNION ALL.  You can practice a philosophy of “don’t sweat the small stuff” because it will run, or you can double-click the UNION ALL and rename the “Output Column Name” for the second id value to id1.

Running the package should return the following results:

id	someCol1	id	someCol2
1	1Joe	        1	2Jill
2	1Frank	        2	2Francine
3	1Bill	        3	2Brenda

and

id	someCol1
4	1Tom

In other words, one group for the matched records and one group for the unmatched records.

The SQL For this would look like the following:

SELECT *
FROM table1 t1
INNER JOIN table2 t2
ON t1.id = t2.id;

and 

SELECT t1.*
FROM table1 t1
LEFT JOIN table2 t2
ON t1.id = t2.id
WHERE t2.id IS NULL

The Lookup transformation also has a number of options which I didn’t touch on here, but it wouldn’t surprise me at all if the 70-463 test. Specifically, there are the modes you can run the transformation in. They are,

  • full cache – Reads the source and writes it to the cache, otherwise known as memory.
  • Partial cache – Reads the source row-by-row, checks the cache, and if the record is not there it is written to cache, otherwise it will use the record in the cache.
  • No cache – It doesn’t cache.  When you select the option a OLE DB source is your only option.

This is a very good article on the different types of caching.

Finally, you can use both Oracle and DB2 databases for the source. Funny that you never see MySQL as an option.

So there you have it. In summary, a Merge Join is a plain old regular JOIN, a merge resembles a UNION ALL statement and the Lookup transformation most closely resembles a split transformation or two separate joins.

One more note on sorting.  I used a Sort transformation, because I wanted to use one, but it’s probably better to sort the query at the source by using a ORDER BY in the OLE DB Source.  If you do that, however, SSDT will not know that the source data is ordered.  You have to manually tell it that you data source is sorted by doing the following:

  1. Right-click the OLE DB data source and select “Show Advanced Editor”.
  2. Click the “Input and Output Properties” tab.
  3. Click the “OLE DB Source Output” icon in the “Inputs and outputs” column.
  4. In the “Common Properties” column set the IsSorted property to true.
  5. Click the cross-hair next to “OLE DB Source Output”.
  6. Click the cross-hair next to “Output Columns”.
  7. Select the field that your data source is sorted on and in the “Common Properties” column set the SortKeyPosition property to 1.  If you sort more than a single column you would need to set those values as well.

You can read more at this link.

http://msdn.microsoft.com/en-us/library/ms137653.aspx