SQL Server 2005 Integration Services - Fuzzy Grouping - Part 35
October 23, 2006
Continuing the topic of data cleansing through applying approximate match algorithms, which we started in the most recent article of our series dedicated to SQL Server 2005 Integration Services, we will now present another Data Flow task in this category called Fuzzy Grouping. Unlike previously discussed fuzzy lookups, grouping does not rely on a reference table for comparison, but instead matches input rows against each other.
However, before we focus on the new transformation, we need first to clean up after our latest exercise. As you might recall, in order to expedite subsequent executions of a sample package we created, we suggested storing and maintaining a match index that serves as a basis for fuzzy lookups. This is accomplished by setting up triggers on a reference table (which keeps the table and the index based on that table consistent, replicating every change from the first one to the second). While getting rid of the match index is straightforward and involves simply deleting the appropriate table (hosting the index) in the same database where the reference table resides, it should be preceded by the removal of match-specific triggers. The recommended way of handling this is by running the sp_FuzzyLookupTableMaintenanceUnInstall stored procedure, with the match index name as its sole parameter.
With the HumanResources.Employee table back to its original state, let's turn our attention to fuzzy grouping. As the name indicates, its primary purpose is grouping of records according to their common characteristics, which, in turn, can be used in order to eliminate duplicates. This process starts by breaking down string values of arbitrarily chosen columns for each input row into individual substrings (by applying the same, delimiter-based parsing that is used for fuzzy lookups). These substrings are then indexed and used for comparisons across entire data sets, allowing for identification of rows that likely (according to match types and similarity thresholds) represent the same data (and therefore are prime candidates for de-duplication). Such rows are grouped (by having assigned a common numeric identifier) and, for each group, the algorithm designates one of its rows (known as canonical row) as the representative, containing data that is considered as 100% accurate (selection and grouping leverage fuzzy lookup methodology using temporary tables to generate a match index and perform necessary comparisons). To analyze this process in more detail and become familiar with its implementation, we will review its impact on the following sample data set:
adventure-works\alex0;ANayberg;production;M adventure-dorks\alex0;ANayberg;production;M adventure-works\flex0;ANayberg;production;M adventure-works\kevin6;KBrown;marketing;M adventure-corks\kevin6;KBrown;marketing;M adventuwe-wowks\wobewto0;RTamburello;production;M adventure-works\roberto0;RPamburello;production;M adventure-works\roberto0;RTamburello;production;M adventure-works\ruth0;REllerbrock;production;F adventure-works\tooth0;REllerbrock;production;F adventure-works\rush0;REllerbrock;production;F
Save this listing into a text file, which we will use as the data source for our package. Launch Business Intelligence Development Studio and initiate a new Integration Services project. From its designer interface, create a Data Flow task. Switch to the Data Flow tab and drag onto it a Flat File Source Toolbox icon. Using its context sensitive menu, launch Flat File Source Editor. In the Connection Manager section, click on the New... command button to display the Flat File Connection Manager window. Type in a descriptive name in the top text box and point to the file you saved earlier after clicking on the Browse... command button. Switch to the Columns section to verify that the Preview window accurately reflects the file content. Using entries in the Advanced section, assign descriptive names to each column and narrow down their width to reflect the size of data they will contain (set OutputColumnWidth of Gender to 1). Once you have completed the configuration, click on OK to close the Editor window and return to the Designer interface.
Next, locate the Fuzzy Grouping icon in the toolbox, drop it into Data Flow task area, and connect the output of the Flat File Source with its input. Select Edit... option from its context sensitive menu to display the Editor window. Using New... command button on the first tab, create OLE DB Connection Manager pointing to a SQL Server 2005 database where the temporary object created by the fuzzy grouping procedure will be located (you will need to provide server and database names as well as appropriate authentication method in the Configure OLE DB Connection Manager dialog box). Note that with larger input data sets, size occupied by these objects will be significant, so it is recommended that you use a non-production system for this purpose. Once this is completed, switch to the Columns tab. From here, designate which input columns will be taken into account when performing comparisons (determined by the state of the first column of checkboxes in the Available Input Columns listing) and generating output of the transformation (based on the state of checkboxes in the column labeled Pass Through of the same listing). In our case, we will exclude Description from the fuzzy grouping algorithm (by clearing the checkbox in the first column) but will keep it in our result set (by keeping the Pass Through column checkbox turned on). In the lower portion of the Columns tab (underneath the Available Input Columns listing), you can specify settings for each input column participating in the fuzzy grouping operation that are more granular, including the following:
In our case, we will leave the defaults for all of the existing entries (UserName, LoginName, and Gender input columns) including their preassigned labels, Fuzzy match type, as well as Numerals and Similarity settings. The initial run will give you an indication how well the fuzzy grouping process performs with existing data. Depending on the results (and your ability to identify incorrect data values), you can adjust relevant parameters affecting number and size of groups as well as canonical value for each.
Switch to the Advanced tab of the Editor window, where you can set the component
level similarity threshold (with a slider), designate token delimiters
(including space, carriage return, tab, line feed, and
As in our previous example, we will capture results into a text file. To accomplish this, drag the Flat File Destination from the Toolbox onto Data Flow task area and connect output of the Fuzzy Grouping transformation to its input. Next, select Edit from its context sensitive menu to display the Editor window. Create a new Flat File connection manager pointing to a delimited text file and accept its settings preassigned based on the format of Fuzzy Grouping output columns (including _key_in, _key_out, and _score).
During package execution, you should expect an initial delay introduced when input data is buffered and temporary objects necessary for fuzzy matching are created. For larger data sets this will likely have significant implications on memory and space utilization so, as mentioned before, you should consider offloading processing (along with the input data) to a non-production SQL Server or at least setting upper limit on the memory used by Fuzzy Grouping with its MaxMemoryUsage custom property (on Component Properties tab of its Advanced Editor window).