In the recent articles of our series dedicated to SQL Server 2005
Integration Services, we have been covering the most popular Data Flow
components. The
latest article focused on Lookup Transformation, which combines records
from two sources based on exact match between their corresponding fields. This
functionality, while very useful, depends to large extent on accuracy of data
and does not take into consideration any of its potential inconsistencies
(resulting from the use of abbreviations and diminutives, or from accidental
misspellings). Fortunately, it is fairly straightforward to remediate this
shortcoming by employing Fuzzy Lookup Data Flow transformation, which provides
the ability to perform general data cleanup.
Earlier versions of SQL Server included methods helpful in identifying
approximate matches, such as LIKE and CONTAINS T-SQL constructs or more
elaborate SOUNDEX and DIFFERENCE functions. Even though they were applicable in
a variety of scenarios, they suffered from several limitations that impacted
their usefulness (for example, SOUNDEX and DIFFERENCE could not properly handle
delimiters, including spaces). SQL Server 2005 still supports all of them, but
delivers much more flexibility and power by introducing fuzzy lookups. Instead
of searching for exact matches between columns in input data rows and reference
dataset (as in standard lookup), this transformation employs an algorithm based
on approximation. Without getting deeper into its ugly details, the basic idea
involves breaking down compared content (which must be expressed in compatible
string data types) into smaller pieces, known as tokens (based on a set of
default, but modifiable delimiters), which subsequently are used to generate a
match index, permitting more granular and versatile analysis.
In addition to editing the list of delimiters (represented by the Delimiters
property of the Fuzzy Lookup Transformation), output of the matching process
can be altered by changing the value of MaxOutputMatchesPerInput property
(which designates the upper limit on the number of results returned from a comparison
for each input row) from its default (1). Increasing this number will likely
result in more hits, which you can review afterwards to verify their accuracy.
Each of them is rated according to similarity and confidence grading (with
values ranging between 0 and 1), which are included in the component’s output
columns (labeled _Similarity and _Confidence, respectively).
Similarity indicates how close a particular match between respective strings
in input and lookup rows is, according to the token-based comparison (with 0
representing no similarities and 1 designating identical strings). The MinSimilarity
property determines the lower limit of the similarity values for the result
set. This property exists on both the component and individual join (between
the respective input and lookup table columns) levels. Information conveyed by
the confidence parameter comes in handy when you have several results with
comparable similarity values, since its role is to distinguish the ones that
are more likely candidates for a match (this might, for example, be determined
based on the degree of their uniqueness among other results). In cases where matches
cannot be found, values of 0’s are returned for both similarity and confidence
(and null values appear in place of lookup entries otherwise copied to the
output).
Keep in mind that the complexities of all calculations associated with fuzzy
lookups have significant impact on the utilization of memory, disk space and
processing resources. There are several ways of controlling the extent of this
utilization. By manipulating the MaxMemoryUsage property of the Fuzzy Lookup
component, you can adjust the upper limit of memory (in MB) dedicated to its
activities (with the default of 0, the allocation is dynamic, operating on an as
needed basis). For larger data sets, consider assigning a False value to
Exhaustive property, which, while making lookup algorithm less thorough, enhances
its efficiency (with its value set to True, input rows are compared against
every row in the reference table, regardless of index-based statistics). Another
improvement can be achieved by saving the match index after the initial
execution (in the same database that hosts the reference table) and maintaining
it (by placing triggers on the reference table to detect and automatically
replicate changes to it), thus reducing the initial delay when re-launching the
same package (these options are configurable from the General tab of the Fuzzy
Lookup Transformation Editor window). Note, however, that you might encounter
contention issues in cases where the reference table is frequently modified by
multiple applications. Also be aware of significant space usage, due to the
high number of temporary objects created when lookups are performed (both of
these issues can be corrected by launching your package from a server with sufficient
amount of disk space and a local, non-production copy of the reference table).
We will explore characteristics of the Fuzzy Lookup, by attempting to match
the following records against the HumanResources.Employee table (of the
AdventureWorks database), based on its LoginID, Title and Gender columns:
adventure-worksalex0;ANayberg;production;M adventure-worksalex0;ANayberg;production;F adventure-workskevin6;KBrown;marketing;M adventure-dorksroberto0;RTamburello;production;M adventure-worksbrownie0;MBrown;emergency management;M adventure-worksruth0;REllerbrock;production;F
Start by storing our sample data in an arbitrarily named
text file. Once this is completed, launch Business Intelligence Development
Studio and initiate a new Integration Services project. Switch to the second
tab of the Designer interface and create a Data Flow task. Drag the Flat File
Source icon from the Toolbox onto the task area. Bring up its Editor window by
selecting the appropriate option from its context-sensitive menu. In the
Connection Manager section, click on the New… command button. This will
display the Flat File Connection Manager Editor window. Assign a descriptive
name to it, point to the text file that you saved earlier, and switch to the
Advanced section. Label the columns LoginName, UserName, Department and Gender,
respectively (you also need to change the OutputWidthColumn of the first to
256, the third to 50, and the fourth to 1, in order to match the length of LoginID,
Title and Gender fields of the HumanResources.Employee table). Verify the
output in the Preview section and click on OK twice to close both Editor
windows.
Drag the Fuzzy Lookup transformation icon from the Toolbox, drop it onto the
Data Flow Task surface and connect the output of the Flat File Source with its
input. Invoke the Editor interface from its context sensitive menu. Create a
new OLE DB Connection Manager to the AdventureWorks on your target SQL Server
(by specifying the server and database names as well as an appropriate
authentication method). Make sure that the "Generate new index"
option is chosen and pick the [HumanResources].[Employee] entry from the
"Reference table name" list box. Note that (as explained earlier) you
have an option of storing and maintaining the index (or use an existing one, if
you have generated it earlier). On the second tab of the Editor window labeled
Columns, designate mappings (represented by dashed lines) between the
corresponding input and lookup fields that need to be matched (they must have
equivalent data types in order to qualify), specify which input columns are
supposed to pass through and which lookup columns are to be added to the
output, along with their output alias (choose LoginID, Title, and Gender, with LoginIDOutput,
TitleOutput, and GenderOutput aliases). Each of mappings can be further
configured in the Create Relationships dialog box (displayed after right-clicking
on any of dashed lines and selecting the Edit Mappings option in their context
sensitive menu). From here, for each Input Column/Lookup Column pair, you can
select the Mapping Type (Exact or Fuzzy), Comparison Flag (Ignore case, Ignore
kana type, Ignore nonspacing character, Ignore character width, Ignore symbols,
or Sort punctuation as symbols), Minimum Similarity (described earlier), and
Similarity Output Alias (the name of the output column that contains the similarity
score for the selected column). The third tab, named Advanced, contains record
wide (rather than applicable to individual column) settings, such as
"Maximum number of matches to output per lookup" (change it to 5 from
the default of 1), Similarity threshold (keep it positioned at the default of 0
at the sliding scale), as well as Token delimiters (including by default space,
carriage return, tab, Line feed, as well as ,.;:-"'&/@!?()<>[]{}|#*^%
).
Once completed, click on OK to close the Fuzzy Lookup Transformation Editor
window.
In order to capture results, drag the Flat File Destination icon from the
Toolbox to the Designer interface. Connect the output of the Fuzzy Lookup
transformation to its input and bring up its Editor interface (using the Edit
option in its context sensitive menu). Create a new Flat File Connection
Manager with the target file in delimited format, assign to it a descriptive
name, preview the output columns, confirm mappings, and click on OK to finalize
the changes.
Initial execution might take considerably longer since it requires creation
of the fuzzy match index. The output should contain LoginName, UserName,
Gender, and Department columns (copied from the input), LoginIDOutput, GenderOutput,
and TitleOutput (coming from the lookup table), as well as columns indicating
similarity and confidence on the join and individual column levels (named
_Similarity, _Confidence, _Similarity_Gender, _Similarity_LoginName, _Similarity_Department).
This initial run should give you a general understanding of the outcome
returned by fuzzy lookups configured with a fairly wide margin of error. In
order to narrow down the result set, you can specify that certain input rows be
required to have higher minimum similarity values (or even use exact match as
mapping type). For example, in our case, we can specify that Gender value, in
both input and lookup tables, is expected to be identical. To accomplish this,
bring up the Fuzzy Lookup Transformation Editor window, switch to the Columns
tab and display the Create Relationship dialog box. From here, you can adjust the
minimum similarity value for individual pairs of columns (e.g. set the one
comparing Gender from input and lookup tables to 1 and increase similarity for
the remaining two pairs bringing it closer to 1). Before launching the package
again, make sure you alter options on the Reference Table tab of the Editor
window (assuming that you decided to turn on the "Store new index"
and "Maintain stored index" options) by choosing the "Use
existing index" option and pointing to the one created in the original
run. Unfortunately, there are no hard rules determining how configuration
parameters should be adjusted, since this depends largely on data
characteristics, which potential variations need to be reviewed on case-by-case
basis. Your final decision should take into consideration both similarity and
confidence statistics (within each group of matches) as well as business rules
dictating possible data values.