SQL Server 2005 Integration Services - Lookup Transformation - Part 33
September 25, 2006
Continuing the overview of Data Flow components in SQL Server 2005 Integration Services, which we initiated in the previous installment of this series, we will now focus on the Lookup transformation. Its basic purpose is to combine data from two sources by matching one or more of their fields (an operation commonly used when populating fact tables in data warehousing), although there are numerous other scenarios in which such functionality comes in handy, such as producing output containing the rows for which no match exists. Admittedly, the majority of these actions can be accomplished fairly easily with standard T-SQL statements containing a JOIN clause; however, there are situations where this might not be preferred or even possible (e.g. incorporating a lookup into a data flow involving disparate sources, or referencing values that are available only at runtime).
The Lookup transformation populates its output by equi-joining the rows fed through its single input with a reference (lookup) dataset, formed from a table, view, or (potentially parameterized) T-SQL statement (leveraging an OLE DB connection manager). The join is based on comparing the values of designated fields in each input row against the lookup keys in the reference dataset. Depending on whether a matching row is found, the operation either generates standard output by combining both data sources or triggers an error condition (the details of this behavior are dictated by a number of configuration settings that we will discuss shortly).
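Conceptually, this works like a hash-based equi-join: the reference dataset is indexed by its key column, each input row is probed against that index, and matches and mismatches are routed to separate outputs. The following Python sketch is purely illustrative (it is not SSIS code); the column names mirror the AdventureWorks example used throughout this article.

```python
# Illustrative sketch of Lookup transformation semantics (not actual SSIS code).
# Input rows carry a "Login" column; reference rows carry "LoginID" plus the
# lookup columns ("FirstName", "LastName") to be appended on a match.

def lookup(input_rows, reference_rows):
    # Index the reference dataset by its key, analogous to SSIS caching
    # the reference recordset in memory.
    index = {row["LoginID"]: row for row in reference_rows}
    matched, errors = [], []
    for row in input_rows:
        ref = index.get(row["Login"])
        if ref is not None:
            # Match found: combine input columns with the selected lookup columns.
            matched.append({**row,
                            "FirstName": ref["FirstName"],
                            "LastName": ref["LastName"]})
        else:
            # No match: route the row elsewhere (the "Redirect row" behavior).
            errors.append(row)
    return matched, errors
```

The same probe-per-row structure underlies all of the caching modes discussed later; only where the reference rows live (memory versus the source database) changes.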
Following this short introduction, let's analyze an example demonstrating the functionality of this transformation and examine its characteristics more closely. We will emulate a situation where an inventory of Windows user accounts (including names of computers that belong to their owners and their logon names for some third-party applications) needs to be correlated with Human Resources records stored in the AdventureWorks database, in order to determine the first and last name for each. The sample will have the following format (with individual fields separated by semicolons):
adventure-works\kevin0;usny-wks001;KBrown
adventure-works\roberto0;usny-wks002;RTamburello
adventure-works\marcin1;usny-wks666;MPolicht
adventure-works\ruth0;usny-wks004;REllerbrock
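The Flat File Source will simply split each of these lines on semicolons into three columns. A minimal Python sketch of that parsing step (illustrative only; the column names Login, Computer, and AppID match the ones we will assign in the connection manager's Advanced section):

```python
# Illustrative parser for the semicolon-delimited AWLogins.txt sample.
# Each line yields one row with the three columns assigned in the
# Flat File Connection Manager (Login, Computer, AppID).

def parse_aw_logins(text):
    rows = []
    for line in text.strip().splitlines():
        login, computer, app_id = line.split(";")
        rows.append({"Login": login, "Computer": computer, "AppID": app_id})
    return rows
```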
The Windows accounts data is stored in the LoginID column of the HumanResources.Employee table, which, once joined with the Person.Contact table (via the ContactID column), can be used to determine each person's first and last names (by examining the content of the FirstName and LastName columns). While the first two and the last rows in our sample contain valid references to AdventureWorks data, the third one will produce a mismatch.
Start by storing the sample data in a text file (we will call it AWLogins.txt). Next, launch Business Intelligence Development Studio and initiate a new project of the Integration Services type. In the Designer interface of its package, create a Data Flow task. Using the context-sensitive menu within the Connection Managers area, bring up the Flat File Connection Manager Editor window. Name the connection manager according to its purpose (AWLogins) and point it to the newly created text file using the Browse... button next to the File name text box. Confirm that the file's content is displayed properly in the Columns section (with the semicolon listed as the delimiter). Switch to the Advanced section and provide meaningful names for each column (such as Login, Computer, and AppID). In addition, we need to ensure that the Integration Services data type of the first column maps properly to the SQL Server data type of the LoginID column, by changing its default string [DT_STR] to Unicode string [DT_WSTR] (for more information on this topic, refer to SQL Server 2005 Books Online). Verify the outcome by examining the content of the Preview section and close the Editor window. Back in the Designer interface, drag a Flat File Source icon from the Toolbox onto the Data Flow area, configure it by setting its Flat File Connection Manager entry to the newly created AWLogins, and click on the OK button to make the change effective.
Now it is time to add the Lookup transformation icon from the Toolbox, connect its input to the output of the Flat File Source, and select Edit from its context-sensitive menu to bring up its Editor window. On the Reference Table tab, you need to provide a connection manager (create a new one that references the SQL Server 2005 AdventureWorks database) and specify a table, view, or SQL query that will define the content of the lookup recordset. (As mentioned in one of our earlier articles dealing with the topic of SSIS performance, it is recommended to construct SQL queries that narrow down the range of target columns.) In our case, enter the following statement in the "Use results of an SQL query" text box (to verify its correctness, click on the Parse Query command button or check the outcome of its execution with Preview...):
SELECT E.ContactID, E.LoginID, C.FirstName, C.LastName
FROM HumanResources.Employee AS E
INNER JOIN Person.Contact AS C
ON E.ContactID = C.ContactID
Switch to the Columns tab where you will see a graphical representation of two data sets in the form of two rectangles labeled Available Input Columns and Available Lookup Columns. Drag the entry for Login from the first one and drop it on the LoginID entry of the second (without the earlier adjustment of data type to Unicode string [DT_WSTR], this operation would generate an error). Underneath, you can define which lookup columns you want to include in the output. Select FirstName and LastName (as indicated earlier) with <add as new column> as the Lookup Operation for both.
Click on the "Configure Error Output..." command button to display the identically labeled dialog box. From here, you can define the behavior of the transformation for a variety of error conditions. In our case, you should see three entries. The first one (with Lookup Output in the Input or Output column) defines the outcome of a mismatch in the lookup operation and can be configured to Ignore failure (copying mismatched rows to standard output with NULLs in place of lookup values), Redirect row (copying mismatched rows to the error output with an error code in place of lookup values), or Fail component. The remaining two entries, for the LastName and FirstName columns, exist due to the potential for truncation errors (which can be addressed with the same three actions: Ignore failure, Redirect row, or Fail component). We will test the Redirect row behavior for the lookup operation, so ensure that this action is selected in the Error column for the first row. Close the Configure Error Output dialog box and switch to the Advanced tab, which contains the settings affecting memory utilization.
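The three dispositions can be summarized in a short sketch (again purely illustrative; the error code here is a placeholder parameter, not the actual value SSIS writes to the error output):

```python
# Illustrative sketch of the three mismatch dispositions available in the
# Configure Error Output dialog. "error_code" is a hypothetical placeholder,
# not a real SSIS error code.

def dispose_mismatch(row, action, error_code=0):
    if action == "Ignore failure":
        # Row continues to standard output with NULL lookup columns.
        return "output", {**row, "FirstName": None, "LastName": None}
    if action == "Redirect row":
        # Row is diverted to the error output, tagged with an error code.
        return "error", {**row, "ErrorCode": error_code}
    if action == "Fail component":
        # The whole transformation (and package execution) fails.
        raise RuntimeError("Lookup failed: no match for %r" % row.get("Login"))
    raise ValueError("unknown action: %s" % action)
```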
By default, the Lookup transformation stores the reference recordset in memory in order to optimize lookup operations. This is reflected by the value of its CacheType property, which is set to Full (you can verify this in its Properties window or on the Component Properties tab of its Advanced Editor window), as well as by the cleared "Enable memory restrictions" check box on the Advanced tab of its Editor window. This way, the entire lookup dataset is preloaded into memory, which, in most cases, speeds up processing but might also negatively impact other applications (especially with large datasets). If this happens to be the case, you can instead set the CacheType property to Partial (causing each row from the lookup recordset to be loaded into memory the first time it is referenced and to remain there until the cache size reaches the limit specified on the Advanced tab of the Editor window) or None (which eliminates caching altogether and is equivalent to turning on the "Enable memory restrictions" check box on the Advanced tab of the Editor window while keeping "Enable caching" cleared). None should be used with small reference recordsets, while Partial is geared towards scenarios where a large number of duplicate lookup key values in the input data is expected.
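The Partial mode's behavior can be sketched as a size-limited cache sitting in front of the reference source. This is an illustration of the idea, not the actual SSIS implementation; `query_fn` stands in for the round trip to SQL Server that occurs on a cache miss.

```python
# Illustrative sketch of Partial caching: reference rows are fetched on first
# use and retained while the cache stays under its configured size limit.
# Full caching would preload everything; None would cache nothing.

class PartialCacheLookup:
    def __init__(self, query_fn, max_entries):
        self.query_fn = query_fn        # stand-in for querying the reference source
        self.max_entries = max_entries  # analogous to the Advanced-tab cache limit
        self.cache = {}

    def lookup(self, key):
        if key in self.cache:
            return self.cache[key]      # cache hit: no round trip to the source
        value = self.query_fn(key)      # cache miss: ask the reference source
        if len(self.cache) < self.max_entries:
            self.cache[key] = value     # retain it while the limit allows
        return value
```

This is why Partial pays off when the input contains many duplicate lookup keys: repeated keys are served from memory, while rarely seen keys do not force the entire reference recordset to be loaded.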
It is important to realize that, depending on your choice of caching method, you might experience different outcomes when comparing the same sets of data. This surprising behavior results from the fact that with Full caching, all of the reference data is loaded into memory and compared within the SSIS environment, which takes case and accent sensitivity into account. On the other hand, with the None or Partial settings, the comparison is handled by SQL Server, with collation and sensitivity settings typically defined at the database level (and commonly less strict). You can easily test this by altering one of the Login entries in our example (e.g. changing "adventure-works\ruth0" to "adventure-works\Ruth0") and switching between the Full and None/Partial caching methods. To address this issue, convert the input data into the desired case with the Character Map transformation and apply the T-SQL UPPER or LOWER functions when generating the reference recordset.
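The discrepancy can be demonstrated with a small sketch (an illustration of the principle, not of SSIS internals): an in-memory comparison is exact and case-sensitive, while a case-insensitive collation folds case before comparing. Normalizing both sides makes the two approaches agree.

```python
# Illustration of why caching mode can change match results.

def full_cache_match(input_key, reference_keys):
    # In-memory (Full cache) comparison: exact, case-sensitive.
    return input_key in set(reference_keys)

def sql_like_match(input_key, reference_keys):
    # Mimics a case-insensitive database collation: case is folded
    # on both sides before comparing.
    return input_key.lower() in {k.lower() for k in reference_keys}
```

Applying the same case conversion to both the input and the reference keys (the Character Map / UPPER / LOWER remedy described above) removes the disagreement.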
In order to capture both standard and error outputs, set up two Flat File Destinations. After you connect each to the appropriate output of the Lookup transformation, configure their Flat File Connection Managers to point to two distinct files (e.g. AWLoginsOut.txt and AWLoginsErr.txt). Once you execute the package, the first one should contain the following entries:
adventure-works\kevin0,usny-wks001,KBrown,Kevin,Brown
adventure-works\roberto0,usny-wks002,RTamburello,Roberto,Tamburello
adventure-works\ruth0,usny-wks004,REllerbrock,Ruth,Ellerbrock
while the second one will consist of a single line containing the input row values followed by an error code indicating the cause of the failure. If you instead apply the Ignore failure setting (in the Configure Error Output dialog box of the Lookup Transformation Editor), the standard output (and the AWLoginsOut.txt file) would contain four rows, with the mismatched one ending with two NULLs (in place of the FirstName and LastName values).