SQL Server 2005 Integration Services - Fuzzy Lookup - Part 34
October 13, 2006
In the recent articles of our series dedicated to SQL Server 2005 Integration Services, we have been covering the most popular Data Flow components. The latest article focused on Lookup Transformation, which combines records from two sources based on exact match between their corresponding fields. This functionality, while very useful, depends to large extent on accuracy of data and does not take into consideration any of its potential inconsistencies (resulting from the use of abbreviations and diminutives, or from accidental misspellings). Fortunately, it is fairly straightforward to remediate this shortcoming by employing Fuzzy Lookup Data Flow transformation, which provides the ability to perform general data cleanup.
Earlier versions of SQL Server included methods helpful in identifying approximate matches, such as LIKE and CONTAINS T-SQL constructs or more elaborate SOUNDEX and DIFFERENCE functions. Even though they were applicable in a variety of scenarios, they suffered from several limitations that impacted their usefulness (for example, SOUNDEX and DIFFERENCE could not properly handle delimiters, including spaces). SQL Server 2005 still supports all of them, but delivers much more flexibility and power by introducing fuzzy lookups. Instead of searching for exact matches between columns in input data rows and reference dataset (as in standard lookup), this transformation employs an algorithm based on approximation. Without getting deeper into its ugly details, the basic idea involves breaking down compared content (which must be expressed in compatible string data types) into smaller pieces, known as tokens (based on a set of default, but modifiable delimiters), which subsequently are used to generate a match index, permitting more granular and versatile analysis.
In addition to editing the list of delimiters (represented by the Delimiters property of the Fuzzy Lookup Transformation), output of the matching process can be altered by changing the value of MaxOutputMatchesPerInput property (which designates the upper limit on the number of results returned from a comparison for each input row) from its default (1). Increasing this number will likely result in more hits, which you can review afterwards to verify their accuracy. Each of them is rated according to similarity and confidence grading (with values ranging between 0 and 1), which are included in the component's output columns (labeled _Similarity and _Confidence, respectively).
Similarity indicates how close a particular match between respective strings in input and lookup rows is, according to the token-based comparison (with 0 representing no similarities and 1 designating identical strings). The MinSimilarity property determines the lower limit of the similarity values for the result set. This property exists on both the component and individual join (between the respective input and lookup table columns) levels. Information conveyed by the confidence parameter comes in handy when you have several results with comparable similarity values, since its role is to distinguish the ones that are more likely candidates for a match (this might, for example, be determined based on the degree of their uniqueness among other results). In cases where matches cannot be found, values of 0's are returned for both similarity and confidence (and null values appear in place of lookup entries otherwise copied to the output).
Keep in mind that the complexities of all calculations associated with fuzzy lookups have significant impact on the utilization of memory, disk space and processing resources. There are several ways of controlling the extent of this utilization. By manipulating the MaxMemoryUsage property of the Fuzzy Lookup component, you can adjust the upper limit of memory (in MB) dedicated to its activities (with the default of 0, the allocation is dynamic, operating on an as needed basis). For larger data sets, consider assigning a False value to Exhaustive property, which, while making lookup algorithm less thorough, enhances its efficiency (with its value set to True, input rows are compared against every row in the reference table, regardless of index-based statistics). Another improvement can be achieved by saving the match index after the initial execution (in the same database that hosts the reference table) and maintaining it (by placing triggers on the reference table to detect and automatically replicate changes to it), thus reducing the initial delay when re-launching the same package (these options are configurable from the General tab of the Fuzzy Lookup Transformation Editor window). Note, however, that you might encounter contention issues in cases where the reference table is frequently modified by multiple applications. Also be aware of significant space usage, due to the high number of temporary objects created when lookups are performed (both of these issues can be corrected by launching your package from a server with sufficient amount of disk space and a local, non-production copy of the reference table).
We will explore characteristics of the Fuzzy Lookup, by attempting to match the following records against the HumanResources.Employee table (of the AdventureWorks database), based on its LoginID, Title and Gender columns:
adventure-works\alex0;ANayberg;production;M adventure-works\alex0;ANayberg;production;F adventure-works\kevin6;KBrown;marketing;M adventure-dorks\roberto0;RTamburello;production;M adventure-works\brownie0;MBrown;emergency management;M adventure-works\ruth0;REllerbrock;production;F
Start by storing our sample data in an arbitrarily named text file. Once this is completed, launch Business Intelligence Development Studio and initiate a new Integration Services project. Switch to the second tab of the Designer interface and create a Data Flow task. Drag the Flat File Source icon from the Toolbox onto the task area. Bring up its Editor window by selecting the appropriate option from its context-sensitive menu. In the Connection Manager section, click on the New... command button. This will display the Flat File Connection Manager Editor window. Assign a descriptive name to it, point to the text file that you saved earlier, and switch to the Advanced section. Label the columns LoginName, UserName, Department and Gender, respectively (you also need to change the OutputWidthColumn of the first to 256, the third to 50, and the fourth to 1, in order to match the length of LoginID, Title and Gender fields of the HumanResources.Employee table). Verify the output in the Preview section and click on OK twice to close both Editor windows.
Drag the Fuzzy Lookup transformation icon from the Toolbox, drop it onto the
Data Flow Task surface and connect the output of the Flat File Source with its
input. Invoke the Editor interface from its context sensitive menu. Create a
new OLE DB Connection Manager to the AdventureWorks on your target SQL Server
(by specifying the server and database names as well as an appropriate
authentication method). Make sure that the "Generate new index"
option is chosen and pick the [HumanResources].[Employee] entry from the
"Reference table name" list box. Note that (as explained earlier) you
have an option of storing and maintaining the index (or use an existing one, if
you have generated it earlier). On the second tab of the Editor window labeled
Columns, designate mappings (represented by dashed lines) between the
corresponding input and lookup fields that need to be matched (they must have
equivalent data types in order to qualify), specify which input columns are
supposed to pass through and which lookup columns are to be added to the
output, along with their output alias (choose LoginID, Title, and Gender, with LoginIDOutput,
TitleOutput, and GenderOutput aliases). Each of mappings can be further
configured in the Create Relationships dialog box (displayed after right-clicking
on any of dashed lines and selecting the Edit Mappings option in their context
sensitive menu). From here, for each Input Column/Lookup Column pair, you can
select the Mapping Type (Exact or Fuzzy), Comparison Flag (Ignore case, Ignore
kana type, Ignore nonspacing character, Ignore character width, Ignore symbols,
or Sort punctuation as symbols), Minimum Similarity (described earlier), and
Similarity Output Alias (the name of the output column that contains the similarity
score for the selected column). The third tab, named Advanced, contains record
wide (rather than applicable to individual column) settings, such as
"Maximum number of matches to output per lookup" (change it to 5 from
the default of 1), Similarity threshold (keep it positioned at the default of 0
at the sliding scale), as well as Token delimiters (including by default space,
carriage return, tab, Line feed, as well as
In order to capture results, drag the Flat File Destination icon from the Toolbox to the Designer interface. Connect the output of the Fuzzy Lookup transformation to its input and bring up its Editor interface (using the Edit option in its context sensitive menu). Create a new Flat File Connection Manager with the target file in delimited format, assign to it a descriptive name, preview the output columns, confirm mappings, and click on OK to finalize the changes.
Initial execution might take considerably longer since it requires creation of the fuzzy match index. The output should contain LoginName, UserName, Gender, and Department columns (copied from the input), LoginIDOutput, GenderOutput, and TitleOutput (coming from the lookup table), as well as columns indicating similarity and confidence on the join and individual column levels (named _Similarity, _Confidence, _Similarity_Gender, _Similarity_LoginName, _Similarity_Department). This initial run should give you a general understanding of the outcome returned by fuzzy lookups configured with a fairly wide margin of error. In order to narrow down the result set, you can specify that certain input rows be required to have higher minimum similarity values (or even use exact match as mapping type). For example, in our case, we can specify that Gender value, in both input and lookup tables, is expected to be identical. To accomplish this, bring up the Fuzzy Lookup Transformation Editor window, switch to the Columns tab and display the Create Relationship dialog box. From here, you can adjust the minimum similarity value for individual pairs of columns (e.g. set the one comparing Gender from input and lookup tables to 1 and increase similarity for the remaining two pairs bringing it closer to 1). Before launching the package again, make sure you alter options on the Reference Table tab of the Editor window (assuming that you decided to turn on the "Store new index" and "Maintain stored index" options) by choosing the "Use existing index" option and pointing to the one created in the original run. Unfortunately, there are no hard rules determining how configuration parameters should be adjusted, since this depends largely on data characteristics, which potential variations need to be reviewed on case-by-case basis. Your final decision should take into consideration both similarity and confidence statistics (within each group of matches) as well as business rules dictating possible data values.