In the recent tutorials published on this forum, we have been discussing the primary characteristics of Slowly Changing Dimension (SCD) Data Flow Transformation in SQL Server 2008 R2 Integration Services. So far, we have covered its Changing Attribute and Historical Attribute options (corresponding to Type 1 and Type 2 changes) as well as support for inferred member records. While these features facilitate a variety of different data analysis scenarios, its wizard-driven configuration is relatively inflexible and generates a Data Flow structure whose performance tends to be suboptimal, especially when dealing with larger data sets. In this article, we will present a few methods of remediating these shortcomings.
Among the most apparent drawbacks of Slowly Changing Dimension Wizard is its uni-directional and somewhat destructive nature. Any attempt to modify the existing transformation via Edit... option in its context sensitive menu triggers the wizard again, and while your previously selected choices are remembered, any custom changes that you applied after its completion are automatically overwritten. Fortunately, it is possible to modify some of the transformation properties directly using Advanced Editor (accessible via the Show Advanced Editor... menu entry), which takes the form of a dialog box divided into the following four tabs:
- Connection Managers tab - allows you to review and modify connection managers to be used by the data flow component. The latter can be accomplished by selecting a desired one from the listbox in the Connection Manager column (note that there is no option to create a new one from this interface).
- Component Properties tab - provides the ability to modify a number of Custom Properties (most of which are also accessible when using the wizard interface) including:
- CurrentRowWhere - represents the WHERE clause of the SELECT statement that identifies the current row among all rows with the matching business key in the dimension table (for example, in the case of historical changes applied to dbo.DimProduct table in the AdventureWorksDW2008R2 Data Warehouse database, this could take the form of [StartDate] IS NOT NULL AND [EndDate] IS NULL or [Status]='Current').
- DefaultCodePage - a numeric value, designating the column code page to use when code page information is not available from the data source.
- EnableInferredMember - a Boolean value indicating whether inferred member updates should be detected and processed (which provides the same functionality as the Enable inferred member support checkbox on the Inferred Dimension member's page of the Slowly Changing Dimension Wizard).
- FailOnFixedAttributeChange - a Boolean value that determines whether the transformation fails when input columns designated as fixed attributes contain values different from their equivalents in the dimension table. This is equivalent to the Fail the transformation if changes are detected in a fixed attribute checkbox on the Fixed and Changing Attribute Options page of the Slowly Changing Dimension Wizard.
- FailOnLookupFailure - a Boolean value (set by default to False), which triggers transformation failure when a lookup of an existing record does not yield a successful outcome.
- IncomingRowChangeType - dictates whether all input rows should be treated as new (and simply redirected to the New Output) or whether the transformation should apply the appropriate change type. This is accomplished by assigning the value of All New or Detect (default) to it. This property (not exposed when running the wizard) comes handy during initial load of the dimension table (which considerably improves performance of this operation).
- InferredMemberIndicator - identifies the name of a Boolean column that indicates whether the current record represents an inferred member. This is one of two possible approaches to detecting inferred members, equivalent to the second option appearing on the Inferred Dimension Members page of the Slowly Changing Dimension Wizard.
- SqlCommand - contains the SELECT statement, which is used to create a schema rowset. This includes all columns identified as either fixed, changing, or historical attributes, as well as those that provide their chronological context (such as status, start date or end date).
- UpdateChangingAttributeHistory - a Boolean value that indicates whether changing attribute updates are supposed to be applied to expired rows in the dimension table (resulting in modifications of historical records, in addition to the current ones). This is accomplished by populating the Changing Attribute Updates Output with entries from Historical Attributes Inserts Output (and matches the functionality controlled by the Change all the matching records, including outdated records, when changes are detected in a changing attribute checkbox on the Fixed and Changing Attribute Options of the Slowly Changing Dimension Wizard).
- Column Mappings tab - gives you access to mappings between Available Input Columns and Available Destination Columns, representing fixed, changing, and historical attributes, as well as fields intended to provide chronological context during updates to out-of-date records.
- Input and Output Properties tab - contains the listing of all inputs and outputs of the component:
- Slowly Changing Dimension Input - provides an alternative view of the content displayed on the Column Mappings tab, with some additional, data type related information (such as DataType or Length).
- Unchanged Output, New Output, Fixed Attribute Output, Changing Attribute Updates Output, Historical Attribute Inserts Output, and Inferred member Updates Output - offer limited practical use, since their properties are read-only and adding new or removing existing outputs or columns is not possible.
Interface aside; probably the most common complaints regarding Slowly Changing Dimension transformation refer to its performance. Since this behavior results from a sub-optimal design generated by the wizard (whose inferiority surfaces when dealing with larger volumes of data or number of dimensions), you have to either modify its properties or resort to developing an alternative approach providing the same functionality. We will focus on the first of these two options.
One of the primary culprits affecting processing speed is repetitive lookups, triggered automatically for every incoming row. To mitigate this issue, eliminate the need for a new connection during subsequent lookups by setting the RetainSameConnection property of the Connection Manager referenced on the Connection Managers tab of Advanced Editor for Slowly Changing Dimensions to True (its default value is False). A more drastic approach involves offloading this functionality to an appropriately optimized (e.g. with appropriately sized cache or a temporary lookup table containing a subset of referenced records) Lookup component. This gives you the ability to identify all rows without matching business key, which can be inserted directly into the dimension table without relying on Slowly Changing Dimension Transformation.
Another weakness of the default pipeline created by the wizard results from an extensive use of the OLE DB Commmand component, which is also invoked individually for each of its input rows. While you are likely to see the performance benefit of changing the AccessMode property of the Insert Destination Data Flow component from its default OpenRowset to OpenRowset Using FastLoad (note, however, that this might lead to locking problems), it is more advantageous to stage corresponding data (respectively for updates and inserts) and use OLE DB Destination instead. In addition, as mentioned earlier, you can take advantage of the IncomingRowChangeType property to the AllNew value when populating the dimension table with new data.
See all articles by Marcin Policht