SQL Server 2005 Integration Services - Part 38 - Pivot Transformation
December 8, 2006
Extensive ETL (Extraction, Transformation, and Loading) capabilities that SQL Server 2005 Integration Services are based on, deliver such essential functionality as the combination and cleanup of data originating from heterogeneous sources or scheduling and coordination of activities that frequently take place beyond the boundaries of database management systems. While a substantial number of these features help with traditional database administration tasks, there are also a few intended primarily for assisting with data analysis. In this article, we will cover Pivot transformation, which is one of the more popular choices in this category.
Pivot operation (just like its T-SQL equivalent) modifies the way in which a recordset is presented; typically by rotating row data into columns, (SSIS also offers Unpivot transformation, which reverses this process). Even though these changes do not introduce any new data, they tend to enhance the ability to analyze existing content by simplifying comparisons and uncovering less apparent trends. In order to help you understand this concept, we will present an example illustrating pivot operation. As our data source, we will use sample spreadsheets available on the Microsoft Web site in the form of Excel 2002 Sample: PivotTable Reports which you need to extract to an arbitrary folder by running the downloadable Report.exe. The target location will host SampleSalespersonReports.xls (which we will manipulate throughout the course of this article), SampleProductReports.xls, SampleOrderReports.xls, and SampleCustomerReports.xls Excel workbooks. Even though they were designed with Excel Pivot Table functionality in mind, we will be able to leverage them for the purpose of our demonstration.
The spreadsheet serving as our data source ('Source Data' in the SampleSalespersonReports.xls) contains inventory of orders handled between July 2003 and May 2005 by nine salespeople located in the USA and the UK. Our intention is to convert it into a recordset that would allow us to easily determine the total value of orders for each salesperson during each year. More specifically, we want the outcome to consist of five columns - Salesperson, Country, 2003 Orders Amount, 2004 Orders Amount, and 2005 Orders Amount. Since the values stored in the last three need to be calculated by adding individual order amounts on per salesperson and per year basis, we will use the Derived Column and Aggregate transformations for this purpose. Once the summarized data is available (still in the original format), we will reorganize it by applying Pivot. The final result will be saved in a spreadsheet by using the Excel Destination Data Flow component.
To accomplish this, start by initiating a new project of Integration Services type in the Business Intelligence Development Studio. Add to the newly created project a Data Flow task (by dragging its icon from the Toolbox onto Designer interface) and double-click on it to switch to its tabbed area. Create Excel Source (listed under Data Flow sources in the Toolbox) and display its Editor (by selecting the Edit... entry from its context sensitive menu). Within the Editor window, click on the New... command button to provide parameters for Connection Manager, pointing to SampleSalespersonReports.xls. Ensure that the "Table or view" option appears in the "Data access mode" listbox and pick 'Source Data$' as Name of the Excel sheet. Switch to the Column section within the Editor window and mark Country, Salesperson, Order Date, and Order Amount in the Available External Columns listing. Once you complete these steps, click on the OK button to close the Editor window.
Next, drag Derived Column transform from the Toolbox and connect the output
of our Excel Source with its input. Launch its Editor window and define a new
derived column named Order Year, calculated using the
Next transformation that needs to be included in our package is Pivot. Once you have dropped it onto the Data Flow area from the Toolbox, connect the output of Aggregate to its input and select Edit or Show Advanced Editor - interestingly both present you with the same Advanced Editor for Pivot window. Once there, review the Component Properties tab and switch to the Input Columns tab. Ensure that all available input columns (Salesperson, Order Amount, Country, and Order Year) are selected and switch to the Input and Output Properties. This is where the majority of configuration takes place.
As mentioned before, our goal is to display the outcome in the specific format, with three extra columns (2003 Orders Amount, 2004 Orders Amount, and 2005 Orders Amount), in addition to the two original ones - Salesperson and Country. With the assistance of Derived Column and Aggregate, we have so far managed to create a recordset with Salesperson, Country, Order Year, and Order Amount fields, which contains the total amount of orders for a specific salesperson in a given year, giving us 27 rows (9 salespeople times 3 years) - down from 799 rows in the SampleSalespersonReports.xls spreadsheet. At this point, we want to rearrange records in such way that instead of Order Year and Order Amount columns, we will have three columns, one per each year covered by our sales inventory (giving us a table with 9 rows and 5 columns - with a single row for each salesperson) listing the amount of sales for an individual salesperson in that year. According to pivot nomenclature, Salesperson and Country function as SetKeys (values in these input columns identify records that need to be grouped together in the same output row), Order Year serves the role of the PivotKey (column which values are used to determine additional columns in the resulting recordset), and Order Amount contains PivotedValues (which are copied to the new columns created by pivot). Keep in mind that entries in SetKey and PivotKey columns have to be unique on the per-row basis (which is the case, since the data has been aggregated prior to applying the pivot).
Continue our configuration by expanding the Pivot Default Input node, which lists all input columns. For each, you need to define its role in the pivot process, by setting the PivotUsage custom property, which can take on one of the following values:
For all input columns, take a note of the values of their LineageID property, since you will need to know them to proceed with the next step. Once completed, switch to the Pivot Default Output node and create the following output columns:
Confirm your choices by clicking on the OK button and return to the Data Flow tab area. To capture the results, create an Excel Destination, connect its input with the output of the Pivot transformation and specify the target spreadsheet by assigning appropriate values in its Excel Connection Manager. The outcome should contain five columns and nine rows, listing aggregate order values for each salesperson in each of the three years covered by the SampleSalespersonReports.xls.
It is important to remember that correct output requires that SetKeys entries containing identical values appear in adjacent input rows. In our example, this was handled by the Aggregate component (grouping all records by salesperson), however in cases where this operation is not needed, make sure you introduce Sort transformation prior to performing pivot. Otherwise, you will end up with a separate row for each non-adjacent value in the SortKey column (and NULLs entries in some of pivoted columns for this row). For example, if three rows for a given salesperson were not grouped together in our Pivot input data, we would end up with three output rows sharing the same SetKey value (i.e. for the total of 11 rows in the output recordset). One of them would contain total sales in the 2003 Orders column as well as two NULLs under 2004 and 2005 Orders, while the remaining two would have a single value in the 2004 Orders and 2005 Orders columns, respectively (and NULLs in the other two columns).