Continuing the topic of data cleansing through applying approximate match
algorithms, which we started in the most recent
article of our series dedicated to SQL Server 2005 Integration Services, we
will now present another Data Flow task in this category called Fuzzy Grouping.
Unlike previously discussed fuzzy lookups, grouping does not rely on a reference
table for comparison, but instead matches input rows against each other.
However, before we focus on the new transformation, we need first to clean
up after our latest exercise. As you might recall, in order to expedite
subsequent executions of a sample package we created, we suggested storing and
maintaining a match index that serves as a basis for fuzzy lookups. This is
accomplished by setting up triggers on a reference table (which keeps the table
and the index based on that table consistent, replicating every change from the
first one to the second). While getting rid of the match index is
straightforward and involves simply deleting the appropriate table (hosting the
index) in the same database where the reference table resides, it should be
preceded by the removal of match-specific triggers. The recommended way of
handling this is by running the sp_FuzzyLookupTableMaintenanceUnInstall stored
procedure, with the match index name as its sole parameter.
With the HumanResources.Employee table back to its original state, let’s
turn our attention to fuzzy grouping. As the name indicates, its primary
purpose is grouping of records according to their common characteristics,
which, in turn, can be used in order to eliminate duplicates. This process
starts by breaking down string values of arbitrarily chosen columns for each
input row into individual substrings (by applying the same, delimiter-based
parsing that is used for fuzzy lookups). These substrings are then indexed and
used for comparisons across entire data sets, allowing for identification of
rows that likely (according to match types and similarity thresholds) represent
the same data (and therefore are prime candidates for de-duplication). Such
rows are grouped (by having assigned a common numeric identifier) and, for each
group, the algorithm designates one of its rows (known as canonical row) as the
representative, containing data that is considered as 100% accurate (selection
and grouping leverage fuzzy lookup methodology using temporary tables to
generate a match index and perform necessary comparisons). To analyze this
process in more detail and become familiar with its implementation, we will
review its impact on the following sample data set:
adventure-worksalex0;ANayberg;production;M adventure-dorksalex0;ANayberg;production;M adventure-worksflex0;ANayberg;production;M adventure-workskevin6;KBrown;marketing;M adventure-corkskevin6;KBrown;marketing;M adventuwe-wowkswobewto0;RTamburello;production;M adventure-worksroberto0;RPamburello;production;M adventure-worksroberto0;RTamburello;production;M adventure-worksruth0;REllerbrock;production;F adventure-workstooth0;REllerbrock;production;F adventure-worksrush0;REllerbrock;production;F
Save this listing into a text file, which we will use as the data source for
our package. Launch Business Intelligence Development Studio and initiate a new
Integration Services project. From its designer interface, create a Data Flow
task. Switch to the Data Flow tab and drag onto it a Flat File Source Toolbox
icon. Using its context sensitive menu, launch Flat File Source Editor. In the
Connection Manager section, click on the New… command button to display the
Flat File Connection Manager window. Type in a descriptive name in the top text
box and point to the file you saved earlier after clicking on the Browse…
command button. Switch to the Columns section to verify that the Preview window
accurately reflects the file content. Using entries in the Advanced section,
assign descriptive names to each column and narrow down their width to reflect
the size of data they will contain (set OutputColumnWidth of Gender to 1). Once
you have completed the configuration, click on OK to close the Editor window
and return to the Designer interface.
Next, locate the Fuzzy Grouping icon in the toolbox, drop it
into Data Flow task area, and connect the output of the Flat File Source with
its input. Select Edit… option from its context sensitive menu to display the
Editor window. Using New… command button on the first tab, create OLE DB
Connection Manager pointing to a SQL Server 2005 database where the temporary
object created by the fuzzy grouping procedure will be located (you will need
to provide server and database names as well as appropriate authentication
method in the Configure OLE DB Connection Manager dialog box). Note that with
larger input data sets, size occupied by these objects will be significant, so
it is recommended that you use a non-production system for this purpose. Once
this is completed, switch to the Columns tab. From here, designate which input
columns will be taken into account when performing comparisons (determined by
the state of the first column of checkboxes in the Available Input Columns
listing) and generating output of the transformation (based on the state of
checkboxes in the column labeled Pass Through of the same listing). In our
case, we will exclude Description from the fuzzy grouping algorithm (by
clearing the checkbox in the first column) but will keep it in our result set
(by keeping the Pass Through column checkbox turned on). In the lower portion
of the Columns tab (underneath the Available Input Columns listing), you can
specify settings for each input column participating in the fuzzy grouping
operation that are more granular, including the following:
Output Alias – name assigned to an output column corresponding to
the given input column,
Group Output Alias – name of the output column containing the
canonical value for each row (identified by fuzzy grouping as the correct value
for a group that this row is part of),
Match Type – method used to compare values in the current column
across all rows. This can be either Exact (which ensures that only identical
values are viewed as duplicates and which you should apply if you are certain
about their accuracy) or Fuzzy (which uses described earlier approximation
algorithm to deliver grouping results).
Minimum Similarity – number between 0 and 1 (applicable when
Fuzzy Match Type is used) that determines how closely (in terms of fuzzy
matching calculations) values within the given column for all rows need to be
to each other in order to be considered part of the same group (typically, the
higher this value is, the larger number of small groups you will end up with).
Similarity Output Alias – name of the output column containing the
similarity rating calculated by comparing value in the given column for each
row against value of the same column in canonical row (within the same group).
There is one such Alias for each column that is used for fuzzy grouping.
Numerals – setting that designates relevance of leading or
trailing digits in the string of characters contained within the given column
(it can be assigned to Neither, Leading, Trailing, or LeadingAndTrailing,
depending on which ones are regarded as significant).
Comparison Flags – collection of options that affect details of
the string matching process for a given column, including such criteria as case
sensitivity (Ignore case), differences between Japanese character sets (Ignore
kana type), use of diacritics, such as tilde (Ignore nonspacing characters) or
symbols, such as punctuation (Ignore symbols), presence of single- and
double-byte characters (Ignore character width), and sorting order for
punctuation characters (Sort punctuation as symbols).
In our case, we will leave the defaults for all of the existing entries (UserName,
LoginName, and Gender input columns) including their preassigned labels, Fuzzy
match type, as well as Numerals and Similarity settings. The initial run will
give you an indication how well the fuzzy grouping process performs with
existing data. Depending on the results (and your ability to identify incorrect
data values), you can adjust relevant parameters affecting number and size of
groups as well as canonical value for each.
Switch to the Advanced tab of the Editor window, where you can set the component
level similarity threshold (with a slider), designate token delimiters
(including space, carriage return, tab, line feed, and
as well as customize names of three additional output columns:
_key_in – a consecutive integer assigned to each row processed by
the Fuzzy Grouping transformation,
_key_out – an integer identifying _key_in value of canonical row
for the group to which the current row has been assigned. For example, in our
sample data, the first three rows, once processed and directed to the
transformation output, will have _key_in values of 1, 2, and 3, respectively.
As expected, the first row would be designated as canonical; hence, its _key_in
value (i.e. 1) would appear in _key_out column for the second and third (as
well as the first) output rows.
_score – a number between 0 and 1 representing similarity between
the current and canonical rows (with 1 designating exact match).
As in our previous example, we will capture results into a text file. To accomplish
this, drag the Flat File Destination from the Toolbox onto Data Flow task area
and connect output of the Fuzzy Grouping transformation to its input. Next,
select Edit from its context sensitive menu to display the Editor window.
Create a new Flat File connection manager pointing to a delimited text file and
accept its settings preassigned based on the format of Fuzzy Grouping output
columns (including _key_in, _key_out, and _score).
During package execution, you should expect an initial delay introduced when
input data is buffered and temporary objects necessary for fuzzy matching are
created. For larger data sets this will likely have significant implications on
memory and space utilization so, as mentioned before, you should consider
offloading processing (along with the input data) to a non-production SQL
Server or at least setting upper limit on the memory used by Fuzzy Grouping
with its MaxMemoryUsage custom property (on Component Properties tab of its
Advanced Editor window).