14 Mar

SSIS Toolbox: Pivot, Aggregate And Sort.

Initially I wanted to do just a simple Pivot transform, but it turned into more than that, so I added an Aggregate transformation to the process.  Actually, two.  Let’s get to it.

A Pivot transform works similarly to a PIVOT statement in T-SQL.  In fact, this is the statement we’re going to duplicate using SSIS.

 

WITH cte AS
(
SELECT YEAR(orderdate) AS yrDate, 
MONTH(orderdate) AS moDate,
totalDue
FROM sales.salesorderheader
)
SELECT *
FROM cte
PIVOT (SUM(totalDue) FOR yrDate IN ([2005],[2006],[2007],[2008])) p
ORDER BY moDate

moDate	2005	        2006	        2007	        2008
1	NULL	        1462448.8986	1968647.184	3359927.2196
2	NULL	        2749104.6546	3226056.1486	4662655.6183
3	NULL	        2350568.1264	2297692.9898	4722357.5175
4	NULL            1727689.5793	2660723.7481	4269365.0103
5	NULL	        3299799.233	3866365.1263	5813557.453
6	NULL	        1920506.6177	2852209.8283	6004155.7672
7	1074117.4188	3253418.7629	3998942.7051	56178.9223
8	2292182.8828	4663508.0154	5712201.2697	NULL
9	1836827.5791	3638980.3689	5702087.8686	NULL
10	1518540.2014	2488758.6715	3767722.1252	NULL
11	3218764.4696	3809633.4035	5250314.3052	NULL
12	2752818.0747	3099432.1035	5868526.2471	NULL

Really, there’s no reason not to just put this query in the OLE DB source and go from there, but since this is an SSIS article, we’re going to build it with transformations.

Step 1: Create an OLE DB source

  1. Create a new project and package in SSIS.  Name them whatever you like.
  2. Drag a Data Flow task to the Control Flow and double-click it to edit it.
  3. Drag an OLE DB source to the Data Flow and configure it as follows:
    1. OLE DB connection manager: AdventureWorks2012.  If necessary, click New and create it.
    2. Data access mode: SQL command
    3. SQL command text: Use “SELECT YEAR(orderdate) AS yrDate, MONTH(orderdate) AS moDate, totalDue
      FROM sales.salesorderheader”
    4. Click the Columns tab to make sure everything mapped correctly.
    5. Click OK to close the OLE DB source.
Aggregate transform.

Step 2: Create an Aggregate transformation

  1. Drag an Aggregate transformation to the Data Flow and connect the OLE DB source to it.
  2. Double-click the Aggregate transformation.  Check yrDate, moDate and totalDue.
  3. In the operation column, set totalDue = SUM, yrDate = GROUP BY and moDate = GROUP BY.
  4. Click OK to close the transformation.
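
For reference, the OLE DB source plus this Aggregate produce roughly the same result set as the T-SQL below.  This is only a sketch of what the transformation is doing; the package never runs this query:

SELECT YEAR(orderdate) AS yrDate,   -- yrDate and moDate come from the OLE DB source
	MONTH(orderdate) AS moDate,
	SUM(totalDue) AS totalDue   -- the Aggregate sums totalDue per (yrDate, moDate)
FROM sales.salesorderheader
GROUP BY YEAR(orderdate), MONTH(orderdate);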

Step 3: Create a Pivot transformation

  1. Drag a Pivot transformation to the Data Flow and connect the Aggregate transformation to it.
  2. Pivot Transformation.

    Double-click the Pivot transformation to open it and configure it as follows:

    1. Set Key: This is the vertical axis of the query.  Set it to moDate.
    2. Pivot Key: This is the horizontal axis of the query.  Set it to yrDate.
    3. Pivot Value: This is the data part of the query.  Set it to totalDue.
    4. Generate pivot output columns from values: this is, for lack of a better term, the X-axis label.  Use [2005],[2006],[2007],[2008] for this section.
    5. Click “Generate Columns Now” which will alias the columns for you.
    6. Click OK to close the transformation.
  3. Show Advanced Editor.

    Clean up the column headers in the results.

    1. Right-click the Pivot transformation and select “Show Advanced Editor”.
    2. Click on the “Input and Output Properties” tab.
    3. Drill down through “Pivot Default Output” > “Output Columns”.
    4. Click on C_2005_totalDue and navigate to the Name field in the right section.  Change the name to 2005.
    5. Repeat step 4 for the 2006 through 2008 values.

The Pivot transformation is different from the T-SQL version in that it doesn’t aggregate the values.  It just pivots the data, so you need a second Aggregate transformation to make it all the way to the finish line.
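
If it helps to see that hand-off in T-SQL terms, here is a rough sketch (the CTE names agg and pvt are just for illustration; the package never runs this query).  The rows coming out of the Pivot can repeat each moDate, with only one year column populated per row, and the second Aggregate collapses them to one row per month:

WITH agg AS
(
-- Output of the first Aggregate: one row per (yrDate, moDate).
SELECT YEAR(orderdate) AS yrDate,
	MONTH(orderdate) AS moDate,
	SUM(totalDue) AS totalDue
FROM sales.salesorderheader
GROUP BY YEAR(orderdate), MONTH(orderdate)
),
pvt AS
(
-- Roughly what the Pivot transformation emits: each row still represents a single
-- (yrDate, moDate) group, with only that year's column populated and the rest NULL.
SELECT moDate,
	CASE WHEN yrDate = 2005 THEN totalDue END AS [2005],
	CASE WHEN yrDate = 2006 THEN totalDue END AS [2006],
	CASE WHEN yrDate = 2007 THEN totalDue END AS [2007],
	CASE WHEN yrDate = 2008 THEN totalDue END AS [2008]
FROM agg
)
-- The second Aggregate: GROUP BY moDate and SUM each year column, one row per month.
SELECT moDate,
	SUM([2005]) AS [2005],
	SUM([2006]) AS [2006],
	SUM([2007]) AS [2007],
	SUM([2008]) AS [2008]
FROM pvt
GROUP BY moDate
ORDER BY moDate;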

Aggregate 2.

Step 4: Create a final Aggregate transformation

  1. Drag an Aggregate transformation to the Data Flow and connect the Pivot transformation to it.
  2. Double-click the Aggregate transformation.  Check moDate and the four year columns (2005 through 2008).
  3. In the operation column, set each year column = SUM and moDate = GROUP BY.
  4. Click OK to close the transformation.

Step 4 mirrors step 2, except that after the pivot you group on moDate and sum the four year columns instead of totalDue.

Step 5: Create a Sort transformation to guarantee sort order.

This likely isn’t necessary, but it lets us cover one more transform, so here it is.

Sort transformation.

  1. Drag a Sort transformation to the Data Flow and connect the second Aggregate to it.
  2. Double-click the Sort transformation to open it.
  3. Check the moDate box in the “Available Input Columns” list.
  4. You can accept the defaults in the grid below, but this is where you can set the Sort Type (ascending or descending) and the Sort Order, a numerical value that controls the sequence when more than one column is sorted.  Lower numbers are sorted first.
  5. Click OK to close the Sort transformation.

Finally, add a Union All transformation and connect the Sort transformation to it.  Double-click the path between them, select the “Data Viewer” tab and check the “Enable data viewer” box.  If we’ve both done everything right, you should have a working Pivot and Aggregate Data Flow series.

Finished package.

This is one of the easier transformations, but it had, for me at least, a non-intuitive piece in the way it transforms, and I struggled with it a bit.  I also had something get out of whack earlier in the Data Flow when I started working with the package.  I was overwriting an existing package and somehow, behind the scenes, something was hung up.  It would be wise to always start fresh with a new package.

 

08 Mar

Grouping By An Unknown Range

Let’s imagine that you wanted to break out your order data in 10% increments from the lowest sale to the largest sale.  And suppose that your data was actually functional for this sort of thing, and you didn’t have $0.09 widgets right next to $90,000 Lamborghinis.  And suppose this kind of analysis was actually useful for something besides a SQL GROUP BY example.  OK, just suppose.  Anyway, here is a process that you might go through and a query that would satisfy the problem.

Step 1 – Generate a range for your data.

USE adventureworks2012;
GO
DECLARE @minval decimal(18,2);
DECLARE @maxval decimal(18,2);

SELECT @maxval = MAX(totaldue), @minval = MIN(totaldue)
FROM sales.salesorderheader;
SELECT @maxval AS 'Max', @minval AS 'Min';

Max	        Min
187487.83	1.52

Right off the bat we need to contact Houston, because this won’t work.  Well, it’ll work when we imagine it working later, but it’s telling me that we have no negative amounts in our order data.  In other words, we’ve never issued a single credit.  In other words, our data is probably useless.

However, this is AdventureWorks, dammit, we don’t issue credits, and since it’s true in the data, I’ll return you to your regular programming.
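
If you want to confirm that directly, a quick throwaway check like this (not part of the process itself) should come back with a count of zero:

SELECT COUNT(*) AS creditCount
FROM sales.salesorderheader
WHERE totaldue < 0;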

Step 2 – Create a clause that segments our data range into 10% increments.

Here’s how you would do this.  You create a CASE expression based on the maximum value (stored in @range below), treat 0 as the bottom of the range since there are no negative amounts, and use it to segment the GROUP BY, as follows.

USE adventureworks2012;
GO
DECLARE @range decimal(18,2);

SELECT @range = MAX(totaldue)
FROM sales.salesorderheader;

WITH cte AS
(
SELECT CASE
	WHEN totaldue / @range <= .10 THEN '.0000 <= .10'
	WHEN totaldue / @range <= .20 THEN '.1001 <= .20'
	WHEN totaldue / @range <= .30 THEN '.2001 <= .30'
	WHEN totaldue / @range <= .40 THEN '.3001 <= .40'
	WHEN totaldue / @range <= .50 THEN '.4001 <= .50'
	WHEN totaldue / @range <= .60 THEN '.5001 <= .60'
	WHEN totaldue / @range <= .70 THEN '.6001 <= .70'
	WHEN totaldue / @range <= .80 THEN '.7001 <= .80'
	WHEN totaldue / @range <= .90 THEN '.8001 <= .90'
	ELSE '.9001 <= 1.0' END AS Range,
CAST(totaldue AS DECIMAL(18,2)) AS totaldue
FROM sales.salesorderheader
)
SELECT Range, 
CAST(SUM(totaldue) AS DECIMAL(18,2)) AS [Total Due], 
COUNT(totaldue) AS [Transaction Count], 
CAST(AVG(totaldue) AS DECIMAL(18,2)) AS [Average Sale]
FROM cte
GROUP BY Range

Range	        Total Due	Transaction Count	Average Sale
.0000 <= .10	41521958.70	29858	                1390.65
.1001 <= .20	17921756.80	615	                29141.07
.2001 <= .30	22384650.62	490	                45682.96
.3001 <= .40	15449160.66	240	                64371.50
.4001 <= .50	10416627.45	125	                83333.02
.5001 <= .60	8452682.22	82	                103081.49
.6001 <= .70	4923353.28	41	                120081.79
.7001 <= .80	1116949.70	8	                139618.71
.8001 <= .90	489622.37	3	                163207.46
.9001 <= 1.0	540019.13	3	                180006.38

Does this tell us a lot?  Actually, it does tell us a few things.  First, most of our transactions are relatively small, with a few large outliers in the data.  If we really wanted to analyze this in some meaningful fashion we’d have to further segment the range.  But this post really isn’t about analysis, it’s more about CASE, so I’m going to add an additional designation to the data so the ranges are clearer.

USE adventureworks2012;
GO
DECLARE @range decimal(18,2);

SELECT @range = MAX(totaldue)
FROM sales.salesorderheader;

WITH cte AS
(
SELECT CASE
    WHEN totaldue / @range <= .10 THEN 0*@range
    WHEN totaldue / @range <= .20 THEN .1001*@range
    WHEN totaldue / @range <= .30 THEN .2001*@range
    WHEN totaldue / @range <= .40 THEN .3001*@range
    WHEN totaldue / @range <= .50 THEN .4001*@range
    WHEN totaldue / @range <= .60 THEN .5001*@range
    WHEN totaldue / @range <= .70 THEN .6001*@range
    WHEN totaldue / @range <= .80 THEN .7001*@range
    WHEN totaldue / @range <= .90 THEN .8001*@range
    ELSE CAST(.9001*@range AS INT) 
END AS beginRange,
totaldue
FROM sales.salesorderheader
),
cte1 AS
(
SELECT beginRange, 
CAST(SUM(totaldue) AS DECIMAL(18,2)) AS [Total Due], 
COUNT(totaldue) AS [Transaction Count], 
CAST(AVG(totaldue) AS DECIMAL(18,2)) AS [Average Sale]
FROM cte
GROUP BY beginRange
)
SELECT CAST(beginRange AS INT) AS beginRange,
CAST(LEAD(beginRange) OVER (ORDER BY beginRange) - 1 AS INT) AS endRange,
[Total Due],
[Transaction Count],
[Average Sale]
FROM cte1
ORDER BY beginRange;

beginRange	endRange	Total Due	Transaction Count	Average Sale
0	        18766	        41521963.87	29858	                1390.65
18767	        37515	        17921756.81	615	                29141.07
37516	        56264	        22384650.61	490	                45682.96
56265	        75012	        15449160.67	240	                64371.50
75013	        93761	        10416627.45	125	                83333.02
93762	        112510	        8452682.28	82	                103081.49
112511	        131259	        4923353.25	41	                120081.79
131260	        150008	        1116949.69	8	                139618.71
150009	        168756	        489622.37	3	                163207.46
168757	        NULL	        540019.12	3	                180006.37

There is a lot going on here, so let’s go through it.

  1. The first thing I changed was to replace the MIN and MAX variables with a single @range variable: since there were no negative values in the data, I could just start at 0 and work my way up.  Hence, all I needed was a single variable populated by the MAX aggregate function.
  2. There are two CTEs in the query.  The first uses a CASE expression with the @range value to calculate the beginning of each data range.
  3. The second CTE exists because I didn’t want to write a second CASE expression for the ending value in each data range.  Instead, I use it to aggregate the data from the first CTE into a result set that lets me use the LEAD function in a final query that unites all of this.
  4. I converted the beginRange and endRange values to INT types because it looks better.  In real life I’d probably do the same with the “Total Due” and the “Average Sale” columns.
  5. As mentioned earlier, I used the LEAD function to get the endRange value instead of a CASE expression.

Once again, I don’t think this specific query is all that useful.  It’s too broad across the entire data range when it would really need to focus on the lower totaldue transactions.  However, this kind of analysis can be useful.  Very large or very small orders can distort data analysis in unexpected ways.  For instance, a sales rep can have his sales margin destroyed because he writes a large sale at a low profit.  If your commission structure has a gross profit calculation, it is even possible that he could lose money on a sale that benefits the company.

Finally, this query is missing the most useful thing, profit margin.  I didn’t include it because the adventureworks2012 data requires joins to get at the cost information and this process was complex enough without it.  What I really wanted to show here is that it is possible to dynamically write a GROUP BY to group data based on the data it’s grouping.

Or should I say, discretize our data?
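
As a closing thought, the same grouping can be written without the CASE ladder at all.  The sketch below is my own variation on the query above, not something it requires: it derives the bucket arithmetically with FLOOR, so the number of buckets can change without rewriting anything.

USE adventureworks2012;
GO
WITH mx AS
(
SELECT CAST(MAX(totaldue) AS DECIMAL(18,2)) AS maxdue
FROM sales.salesorderheader
),
bucketed AS
(
-- FLOOR drops each order into one of ten equal-width buckets (0 through 9);
-- the single largest order would land in bucket 10, so cap it back to 9.
SELECT totaldue,
	CASE WHEN FLOOR(totaldue / maxdue * 10) > 9 THEN 9
		ELSE FLOOR(totaldue / maxdue * 10) END AS bucket
FROM sales.salesorderheader
CROSS JOIN mx
)
SELECT bucket,
CAST(SUM(totaldue) AS DECIMAL(18,2)) AS [Total Due],
COUNT(totaldue) AS [Transaction Count],
CAST(AVG(totaldue) AS DECIMAL(18,2)) AS [Average Sale]
FROM bucketed
GROUP BY bucket
ORDER BY bucket;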