20 Mar

SSIS: Change Data Capture

This is part of my random ongoing saga with SSIS.  Like previous disclaimers, a million guys have written this and it’s all on the MSDN.  I perhaps take a creative swing at it but more than anything this is a learning process with the pressure of getting it right for the internet.  It’s also really long.

Note: Make sure SQL Agent is running. You’ll regret it, if it isn’t.

Change Data Capture captures changes to tables in the form of inserts, deletes and updates.  It’s primarily used in data warehouses to capture changes made to data for tracking purposes. It will also track DDL changes to tables.

51F295JJ2HL._SY300_

The time stream must be at risk, no one would have made a sequel to Timecop.

So, lets load some data.  Of course, me being me, I immediately pick the kind of data that changes rapidly, a list of the first 10 presidents of the United States.  Look, with the Timecop program wrecking havoc with our history you’ve got to stay on top of this stuff.  Actually, I grabbed some data, and it happened to mess up in a perfect way for this post.

CREATE DATABASE cdctest;
GO
USE cdctest;
CREATE TABLE presidents
(
	id INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
	president varchar(50),
	vicepresident varchar(50),
	party varchar(50),
	startTerm int,
	endTerm int
);

SELECT *               -- Creates the destination table.
INTO presidents_dw
FROM presidents;
INSERT INTO presidents
(president, vicepresident, party, startTerm, endTerm)
VALUES
('George Washington','John Adams','No Party Designation',1789,1797),
('John Adams','Thomas Jefferson','Federalist',1797,1801),
('Thomas Jefferson','Aaron Burr','',0,0),
('George Clinton','Democratic-Republican','',1801,1809),
('James Madison','George Clinton','',0,0),
('Elbridge Gerry','Democratic-Republican','',1809,1817),
('James Monroe','Daniel D Tompkins','Democratic-Republican',1817,1825),
('John Quincy Adams','John C Calhoun','Democratic-Republican',1825,1829),
('Andrew Jackson','John C Calhoun','',0,0),
('Martin Van Buren','Democratic','',1829,1837),
('Martin Van Buren','Richard M. Johnson','Democratic',1837,1841),
('William Henry Harrison','John Tyler','Whig',1841,1841),
('John Tyler','None','Whig',1841,1845);

What happened here is that I highlighted a list I found on a website, copied it, replaced the delimiters so I could insert, and blew up when a president had multiple vice-presidents. It’s perfect for a demo although in real-life you would use CDC for a customer table or some other purpose where you needed to track changes to your tables. A list of presidents is about the worst possible use for it you could think of.

And because a journey of a thousand miles begins with a single step, lets start.

Step 1: Enable CDC, and make a second table, and a backup table, and more.

Make a backup. Trust me, I needed it.

SELECT *
INTO presidentsBack
FROM presidents;

(13 row(s) affected)

(13 row(s) affected)

Now, we enable CDC. We have to do this for the database and for the table we want to track.

USE cdctest;
EXEC sp_cdc_enable_db;
GO

SELECT name, is_cdc_enabled
FROM sys.databases
WHERE name LIKE 'cdc%';

exec sys.sp_cdc_enable_table 
    @source_schema = 'dbo', 
    @source_name = 'presidents',
    @role_name = 'CDCRole',  -- Role to gate access to change data.
    @supports_net_changes = 1;  -- Allows changes to capture net changes.
GO
SELECT name, is_tracked_by_cdc
FROM sys.tables
WHERE name LIKE 'president%';

name	is_cdc_enabled
cdctest	1

name	        is_tracked_by_cdc
presidents	1
presidents_dw	0
presidentsBack	0

The 1’s are a good thing. It means we’ve enabled CDC. The most complex piece is enabling a table. Because this is a really long process I’m going to direct you to the MSDN for more detail on what it’s doing.

http://msdn.microsoft.com/en-us/library/bb522475.aspx

At this point we’re ready to start working on the change capture process. We have a source table, a destination table and CDC is running.

Step 2: Build a Load Package in SSIS for tracking.

This is one of those things where it’s easier to do than explain.

We keep selecting paper, and no one wins!

We keep selecting paper, and no one wins!

Create a new project, call it what you want, and put it somewhere that makes you happy and “lets go”.

  1. Rename package.dtsx to load.dtsx.
  2. Create 3 connection managers (you don’t need all 3, and they don’t have to be the same database, but it does help to name them this way):
    1. OLE DB, name it Source and point it to the cdctest database.
    2. OLE DB, name it Destination and point it to the cdctest database.
    3. ADO.Net, name it CDCState and point it to the cdctest database.
    4. Convert them to project connections by right-clicking and selecting “Convert To Project Connection”.
  3. CDC Control Task

    CDC Control Task

    Drag a CDC Control Task to the control flow and configure it as follows.

    1. CDC Server Database ADO.Net: CDCState.
    2. CDC control operations: Mark initial load start.
    3. Variable containing  the CDC state: click New, accept the default, and click OK.
    4. Automatically store state in a database table: Check it.
    5. Connection manager for the database where the state is stored: It should say CDCState, accept it.
    6. Table to use for storing state: Click New, click Run, click OK.
    7. State Name: It should be CDC_State.
  4. Click OK.
  5. Drag a Data Flow Task and connect the CDC Control Task to it.
  6. Double click on the Data Flow Task to open it.
  7. Drag a OLE DB Source to the Data Flow and configure it as follows:
    1. OLE DB Connection Manager: Source.
    2. Data Access Mode: Table or view.
    3. Name of the table: presidents.
    4. Click the Columns tab to make everything happy.
    5. Click OK to close the connection.
  8. Drag a OLE DB Destination to the Data Flow, connect the OLE DB source to it, and configure it as follows:
    1. OLE DB Connection Manager: Destination.
    2. Data Access Mode: Table or view.
    3. Name of the table: presidents_dw.
    4. Click the Mappings tab to make sure everything mapped correctly.
    5. Click OK to close the connection.
  9. Ignore the Red X, we’ll deal with that in a second.
  10. Go back to the Control Flow tab, add a CDC Control Task, connect the Data Flow Task to it, and configure it as follows:
    1. CDC Server Database ADO.Net: CDCState.
    2. CDC control operations: Mark initial load end.
    3. Variable containing  the CDC state: User::CDC_State.
    4. Automatically store state in a database table: Check it.
    5. Connection manager for the database where the state is stored: It should say CDCState, accept it.
    6. Table to use for storing state: [dbo].[cdc_states];
    7. State Name: It should be CDC_State.
Format > Align > Left to clean it up.

Format > Align > Left to clean it up.

Now, we still have the problem with the red-X.  Because I did a copy of the presidents table it created the presidents_dw table id field with an identity property, so the mapping won’t work.  For my example, and because it’s easier, we’ll drop the identity property on the destination id column.

ALTER TABLE presidents_dw DROP COLUMN id;
ALTER TABLE presidents_dw ADD id INT;

Now go back to the Data Flow task, you’ll need to click on one of the OLE DB tasks (I hit the Destination) and the red-X will be gone.

At this point the package should run fine, go ahead and execute it.

We can verify it with the following query,

SELECT *
FROM presidents_dw

president	       vicepresident	  party	                start	end     id
George Washington      John Adams	  No Party Designation	1789	1797	1
John Adams	       Thomas Jefferson	  Federalist	        1797	1801	2
Thomas Jefferson       Aaron Burr		                0	0	3
George Clinton	                          Democratic-Republican	1801	1809	4
James Madison	       George Clinton		                0	0	5
Elbridge Gerry	                          Democratic-Republican	1809	1817	6
James Monroe	       Daniel D Tompkins  Democratic-Republican	1817	1825	7
John Quincy Adams      John C Calhoun	  Democratic-Republican	1825	1829	8
Andrew Jackson	       John C Calhoun		                0	0	9
Martin Van Buren	                  Democratic		1829	1837	10
Martin Van Buren       Richard M. Johnson Democratic	        1837	1841	11
William Henry Harrison John Tyler	  Whig	                1841	1841	12
John Tyler	       None	          Whig	                1841	1845	13

As you can see we have a fair amount of clean up to do. The main problem is when a president had two vice-presidents.

SELECT *  -- In case you didn't do it earlier.
INTO presidentsBack
FROM presidents;
UPDATE presidents SET vicepresident = 'Aaron Burr / George Clinton', 
party = 'Democratic-Republican', startTerm = 1801, endTerm = 1809 WHERE id = 3;
UPDATE presidents SET vicepresident = 'George Clinton / Elbridge Gerry', 
party = 'Democratic-Republican', startTerm = 1809, endTerm = 1817 WHERE id = 5;
UPDATE presidents SET vicepresident = 'John C Calhoun / Martin Van Buren', 
party = 'Democratic', startTerm = 1829, endTerm = 1837 WHERE id = 9;
DELETE FROM presidents WHERE id IN (4,6,10);
SELECT * FROM presidents;

id  president	           vicepresident	             party	            end  start
1   George Washington	   John Adams	                     No Party Designation   1789 1797
2   John Adams	           Thomas Jefferson	             Federalist	            1797 1801
3   Thomas Jefferson	   Aaron Burr / George Clinton	     Democratic-Republican  1801 1809
5   James Madison	   George Clinton / Elbridge Gerry   Democratic-Republican  1809 1817
7   James Monroe	   Daniel D Tompkins	             Democratic-Republican  1817 1825
8   John Quincy Adams	   John C Calhoun	             Democratic-Republican  1825 1829
9   Andrew Jackson	   John C Calhoun / Martin Van Buren Democratic	            1829 1837
11  Martin Van Buren	   Richard M. Johnson	             Democratic	            1837 1841
12  William Henry Harrison John Tyler	                     Whig	            1841 1841
13  John Tyler	           None	                             Whig	            1841 1845

I abbreviated start and end so they’d fit in the post.

So, finally, some clean data. Lets set up and run a CDC process against it to capture the changes.

  1. Create a new package and name it captureChanges.dtsx.
  2. Drag a CDC Control flow task to the Control Flow and configure it as follows,
    1. CDC Server Database ADO.Net: CDCState.
    2. CDC control operations: Get Processing Range.
    3. Variable containing  the CDC state: click New, accept the default, and click OK.
    4. Automatically store state in a database table: Check it.
    5. Connection manager for the database where the state is stored: It should say CDCState, accept it.
    6. Table to use for storing state: [dbo].[cdc_states].
    7. State Name: It should be CDC_State
    8. Click OK.
  3. Drag a Data Flow Task to the Control Flow, connect the Control Flow task to it, and double-click the Data Flow task to enter it.
  4. dataflowDrag a CDC Source to the Data Flow and configure as follows,
    1. ADO.NET connection manager: CDCState.
    2. CDC enable table: [dbo].[presidents].
    3. Capture instance: dbo_presidents
    4. CDC Processing Mode: ALL.
    5. Variable containing the CDC state; User::CDC_State
    6. Click the Columns tab, there should be stuff there.
    7. Click OK.
  5. Drag a CDC Splitter transform to the Data flow and connect the CDC Source to it.  You can accept the defaults here but it’s useful to look around in it.
  6. Drag 3 Union All transforms to the Data Flow and name them INSERT, UPDATE and DELETE.
  7. Connect the splitter to them and choose the output that matches the name for each of them.
  8. Click on the output path for each of the three paths, click “Data Viewer” and check the “enable data viewer” checkbox.  This is the lazy way to see what happens.
  9. Return to the Control Flow tab, drag a CDC Control Task to it, connect the Data Flow task to it, and edit it as follows:
    1. CDC Server Database ADO.Net: CDCState.
    2. CDC control operations: Mark processed range.
    3. Variable containing  the CDC state: User:: CDC_State.
    4. Automatically store state in a database table: Check it.
    5. Connection manager for the database where the state is stored: It should say CDCState, accept it.
    6. Table to use for storing state: [dbo].[cdc_states].
    7. State Name: It should be CDC_State
    8. Click OK.
Hopefully?

Hopefully?

If you did all of that right, and I did too, you should have a working CDC process. Now, obviously you would use this data for data warehousing type things, but this post is clocking in at over 1700 words, and I’m done, so maybe I’ll deal with that another day.

One other note.  CDC can track DDL changes.  DDL changes will not make you Mr. Happy Pants, when they happen.  But, this stored procedure can help.

sys.sp_cdc_get_ddl_history [ @capture_instance = ] 'capture_instance'

You can read more at the MSDN.