As we have demonstrated in the recent installments of our series, in
addition to handling standard tasks associated with data processing (available
also in legacy Data Transformation Services), SQL Server 2005 Integration
Services offers some new, unconventional capabilities. Fuzzy
Lookup and Grouping,
covered in my two latest articles, help dealing with potentially erroneous data
(by identifying duplicates and performing approximate matches). Equally
innovative functionality is delivered by two other SSIS Data Flow
transformations known as Term Extraction and Term Lookup, which we will be the
topic of today’s discussion.
Development of computing technologies (accelerated by the advent of the Internet,
which ushered in an era of worldwide communication) provided access to a variety
of data sources on an unprecedented scale. This, however, also led to
information overload, making its management increasingly complex and time
consuming. One way to address this issue involves the use of a discipline of
computer science called data mining, whose primary goal is to formulate
methodologies for extracting relevant data from massive data sets based on
identifying its patterns (in essence, turning unorganized data into meaningful
information). While the majority of techniques necessary for implementing this
approach in SQL Server environment are available within SQL Server 2005
Analysis Services, SSIS offers text mining capabilities in the form of Term
Extraction and Term Lookup Data Flow transformations, which can be used to
handle such diverse data sources as RSS feeds or Web Services.
The purpose of Term Extraction is to generate a dictionary of terms (that
will be subsequently used to categorize multiple data sets), based on samples
that are considered to be representatives of each category. This is
accomplished by identifying repeating nouns or noun phrases within text read
via input column of the transformation. Entire input content is broken into
individual words (through the process called tokenization that takes into
consideration relevant punctuation marks and special characters), which are
then normalized (which results in articles and pronouns being discarded),
tagged, and stemmed (yielding individual, singular, non-capitalized generic
nouns, properly capitalized proper nouns, numbers, and adjectives that are part
of noun phrases, but ignoring other parts of speech). Note that the analyzed
text must be written in English and has to be delivered to transformation as
DT_WSTR or DT_NTEXT data type (depending on circumstances, this might require
introducing the Data Conversion component prior to passing text for term
extraction). The process of extraction can be customized (using entries on the Advanced
tab of the Term Extraction Transformation Editor) by specifying one of three types
of terms to search for (nouns, noun phrases – which includes in the scope of
analysis adjectives and proper nouns, as well as numbers – or both nouns and
noun phrases) and whether to consider or disregard their case (with the default
of context-insensitive term extraction). It is also possible to set the number
of times each term needs to be repeated in order to be added to the dictionary
(the default is 2) and the maximum length of each term (with the default of 12
tokens), which is applicable only when noun phrases are part of the search.
There are two rating systems that can be used within the transformation
(controlled by options in the Score type section on the Advanced tab of Term
Extraction Transformation Editor). The first one (Frequency) is simply the
count of occurrences of a term in input text. The second one is a value known
as TFIDF (an acronym for "term frequency–inverse document
frequency"), calculated according to the formula (Frequency of the term) * log((Number of input
, which helps
rows)/(Number of rows containing the term))
determine uniqueness of a specific term across all input rows. A higher TFIDF
value indicates that the term appears only in few of them and therefore, it is
more relevant for proper categorization (term that is part of every input row
does not help with selection process). Another customizable option is the
ability to specify a database table containing generic words that are supposed
to be ignored (effectively excluding them from the dictionary). Note that you
are responsible for populating this table based on expected content of your
data sources.
To better understand factors involved in implementing Term Extraction, let’s
consider a simple example that will use one of our earlier articles on SQL
Server 2005 Part 1 – Security (Authentication) as the basis for forming a
term dictionary. Start by copying the content of the article to a text file and
store it in an arbitrary location. When saving it in Notepad, set the Encoding
option to Unicode (this will eliminate the need for adding Data Conversion
transformation to alter its data type to DT_NTEXT, which, as we have mentioned
earlier, is required by the Term Extraction component). Next, launch the
Business Intelligence Development Studio and initiate a new Integration
Services project. Within its Designer interface, create a Data Flow task.
Switch to the Data Flow tab and drag Flat File Source from the Toolbox onto it.
Select Edit… from the context sensitive menu of Flat File Source and click on
the New… command button within its Editor window to create a new Connection
Manager. Assign a descriptive name to it in the Flat File Connection Manager
Editor dialog box, click on the Browse… command button, and point to the
newly created text file. Ensure that English is selected in the Locale entry
and mark on the Unicode checkbox next to it. Switch to the Advanced section and
ensure that the DataType property of our input column is set to "Unicode
text stream", with the default column delimiter {CR}{LF} (which will
generate a separate row for each paragraph from the source article). Confirm
all choices and return to the Data Flow area. Now it is time to add the Term
Extraction transformation by dragging its icon from the Toolbox. After you
connected its input with the output of Flat File Source, right-click on it and
select the Edit… option from the context sensitive menu. This will display the
Term Extraction Transformation Editor. On the first tab, labeled Term
Extraction, simply confirm that our input column is selected and take a note of
output columns names (Term and Score). For the time being, do not modify
entries on the Exclusion tab, since this requires having a table with terms
that are considered generic (which we will determine in the next step). Review
the Advanced tab and verify that it contains the customization options
presented in the previous paragraph (leave them unchanged with their defaults).
To capture the outcome, create a Terms table in an arbitrary database with
two columns – Term of nvarchar(128) type and Score of type float. Add an OLE DB
Destination and Flat File Destination (connected, respectively, to standard and
error outputs of our Term Extraction transformation) to your Data Flow, with
the first one based on OLE DB Connection Manager pointing to the database
hosting Terms table and the second based on Flat File Connection Manager (note
that since the error output includes a copy of an entry in Input column that
caused a failure, its Connection Manager needs to have the Unicode option
turned on). Once you execute your package, the Terms table will be populated
with rows containing each extracted term and frequency of its occurrence in our
source file. As you can imagine, some of the extracted words are fairly generic
and not related in any particular way to the subject of the article, such as,
for example, value, default, number, name, character, combination, need, part,
note, etc. (note that choice of these terms depends largely on your specific
business scenario). To eliminate them from the scope of analysis, copy them to
another database table and point to it on the Exclusion tab of the Term
Extraction Transformation Editor. Just as with Fuzzy Lookup and Fuzzy Grouping
transformations, you might want to consider testing Term Extraction with
different data sources that contain text relevant to your data mining criteria,
reviewing results after each run, in order to come up with an optimum
dictionary of terms and set of exclusions (those might need to be periodically
updated so emerging trends are taken into account).
Once you have established your final list of terms, you are ready to test
the Term Lookup process. As its input, you can use another
one of our articles, dealing with SQL Server 2005 Security authorization.
Start by creating a new Integration Services project with a Data Flow task (or
simply add to the existing project the second Data Flow task with a precedence
constraint enforcing dependency on successful completion of the first one). Add
a Flat File Source and its Connection Manager (with the Unicode option turned
on) pointing to a Unicode-encoded text file containing the second article. Next,
drag the Term Lookup transformation icon from the Toolbox onto the Data Flow
area, connect output of the Flat File Source to its input, and select the
Edit… option from its context sensitive menu. On the Reference Table tab,
point to the Terms table populated by the Term Extraction transformation.
Review the content of the Term Lookup tab and verify that the input column is
compared against the Term column of our Reference Table (Terms). Finally, note
that the Advanced tab gives you an option to perform case-sensitive term lookup
(leave it in its default, turned off state). If you decide to store the results
in text files, ensure that both standard and error Flat File Destination
Connection Managers are configured for Unicode and English locale. Execution of
this Data Flow task will generate output divided into three columns, with the
first one listing a dictionary term, the second its frequency for an individual
input row, and the third one the input row itself (note that each input row
might appear in the output multiple times, depending on how many dictionary
terms appear in it). At this point, any additional actions depend primarily on the
specific goals of your data mining procedure and might involve the use of other
SSIS transformations and components or applying procedures available in SQL
Server 2005 Analysis Services.