Full Text Index Management in SQL Server 2005 Express Edition

In the recent installment of our series dedicated to the most important
features of SQL Server 2005 Express Edition, we have been focusing on its Full
Text Indexing
capabilities. So far, we have covered its basic
characteristics as well as described the steps involved in implementing file-based
catalog and the indexes hosted by it (the latter was accomplished by leveraging
CREATE FULLTEXT INDEX and ALTER FULLTEXT INDEX statements). In
this article, we will cover other full text index management methods, leaving
discussion about a unique syntax of index-based searches until our next
encounter.

SQL Server 2005 Express Edition provides an alternative approach to creating
or modifying full text indexes from the one we have presented earlier, which
involves the use of sp_fulltext_table
stored procedure. (You should, however, keep in mind that this stored procedure
is being deprecated in favor of CREATE,
ALTER, and DROP FULLTEXT
INDEX
constructs), whose behavior is controlled by its @action parameter (the other three identify,
respectively, names of the indexed table, full-text catalog hosting it, and a
unique, non-null column on which the full-text index is based). Depending on
its value, you can use it to:

  • create metadata for a new full-text index (with @action='Create'). Note that in this
    case, all parameters are mandatory. It is important to note that no index-based
    searches can be conducted until you add a column or columns to be indexed and
    populate the index. The first of these tasks is typically accomplished by
    invoking sp_fulltext_column
    stored procedure, which takes as its arguments the names of a target table (@tabname), a column to be indexed (@colname), the desired action to be
    carried out ( @action set to
    'add'), language code (@language), and a column identifying the
    document type when referencing image
    or varbinary(max) data types.
    (For example, in our case the resulting syntax would take the form sp_fulltext_column @tabname='Production.Document', @colname='Document',
    @action='add', @language='1033', @type_colname='FileExtension'
    ).
    In order to determine the list of data formats (and their version) supported by
    your SQL Server installation, examine the output of SELECT * FROM sys.fulltext_document_types
    query. Carrying out the second task can be done by taking advantage of other
    action types of the sp_fulltext_table
    stored procedure.
  • drop an existing full-text index (with @action='Drop'). The only required
    parameter is the table name (since there is a limit of a single full-text index
    per table).
  • activate or deactivate an existing full text index (respectively,
    with @action='Activate' and @action='Deactivate'). The current state
    can be viewed by examining the is_enabled
    column in the sys.fulltext_indexes
    system view. Activation is typically not necessary, since it takes place
    automatically when the first column is added to an index, however, it might be
    needed if that index has been previously deactivated. Deactivation allows you
    to prevent an index from being used without the need for dropping it.
  • start or stop change tracking (respectively, with @action='start_change_tracking' and @action='stop_change_tracking'). The
    former sets the change tracking state to MANUAL
    and keeps tab of all future changes to both indexed and key columns. The
    latter, deletes the record of these changes and resets the change tracking
    state to OFF. Change
    tracking is beneficial when indexed data is moderately dynamic.
  • update an index (with @action='update_index').
    The update is triggered immediately following invocation of the stored
    procedure and takes advantage of the change tracking mechanism (which must be
    enabled to successfully execute this statement).
  • start or stop background update (respectively, with @action='start_background_updateindex'
    and @action='stop_background_updateindex').
    This initiates or terminates change tracking and background index population.
    The first of these options sets the change tracking to AUTO, while the latter, reverts it to MANUAL.
  • start or stop full or incremental population of an index
    (respectively with @action='start_full',
    @action='start_incremental',
    and @action='stopl'),
    without regard for change tracking mechanism. Note that the ability to initiate
    incremental population is dependent on the presence of a timestamp column in the indexed table.
    Full population (with change tracking disabled) is typically a preferred
    mechanism if changes to indexed columns are frequent and widespread. On the
    other hand, an incremental approach (without change tracking) is recommended
    when the volume of modifications is low.

As we have mentioned earlier, the algorithm that is used to populate
full-text indexes takes into consideration additional unique factors, not
included in the standard SQL Server indexing methodology. Among them, the one
that most profoundly impacts the efficiency of searches involves the use of
noise words. This term designates common, primarily supportive words (such as pronouns,
prepositions, conjunctions, or interjections), which tend to occur so
frequently and without particular distinguishing significance that their
indexing would be essentially meaningless, contributing only to bloated catalog
size and decreased performance. Noise words are language specific and are stored
in a file named noiseXXX.txt,
where XXX designates
a three letter language code (for example, noiseENU.txt
is intended for US English entries), residing in Program FilesMicrosoft SQL ServerMSSQL.xMSSQLFTData
folder, where x
designates an integer assigned to a local SQL Server 2005 Express Edition
instance. (This location can be identified by checking the value of the NoiseFile registry entry of the REG_SZ datatype under HKLMSOFTWAREMicrosoftMicrosoft SQL ServerMSSQL.xMSSearchLanguageeng
key).

Sometimes it might make sense to alter the default content of the noise
words file (for example, in order to avoid indexing terms customary to your
business, which tend to appear more frequently across your data). In order to
accomplish this, simply add new (or, if appropriate, remove any present) words
by editing the text file. Keep in mind, however, that such changes have to be
followed by full repopulation of existing indexes. This can be accomplished on a
per index-basis (by leveraging the methods described above) or en masse for all
indexes contained within a given catalog by invoking the ALTER FULLTEXT CATALOG statement with REBUILD parameter. In addition, note
that the same set of noise words is shared across all catalogs associated with
the same instance of SQL Server 2005 Express Edition (this restriction has been
eliminated in SQL Server 2008 through introduction of stop lists).

Another factor that affects full-text search behavior is the use of
thesaurus. This feature, relying on an XML formatted document named tsXXX.xml (where XXX designates a three letter
language code) along with its tsGLOBAL.xml
counterpart (both residing in the same Program
FilesMicrosoft SQL ServerMSSQL.xMSSQLFTData
folder as
the noise word files) gives you the ability to define synonyms, which in turn
facilitate more intelligent queries, giving you the ability to locate phrases
with an equivalent meaning but different wording. In order to customize its
content, you need to first familiarize yourself with its syntax. By default,
the majority of it is commented out. Before removing the comments, decide
whether thesaurus-based searches should be accent sensitive or not (this
involves setting the value of <diacritics_sensitive>
element to either 1 or 0). Next, consider that synonyms are
specified by employing either <expansion>
or <replacement> sets.
The first of them associates similar terms (enclosed within <sub> elements) with each other,
such that searches yield content that includes any of them. For example, by
implementing the sample tsENU.xml
file, which contains <sub>
elements within <expansion>
for Internet Explorer, IE, and IE5, a search for any one of them would return every row
that contains at least one of them (as long as the query syntax indicates that
thesaurus should be used). <replacement>
on the other hand, uses a value within the <sub>
element to substitute every pattern specified within the <pat> elements (effectively, only
rows containing the former are included in the result set). For example, if you
refer to the same sample file tsENU.xml,
you will find there a single <sub>
element with Windows 2000
assigned to it, along with two <pat>
elements, set to NT5 and W2K. This implies that any query for
either of these two values would actually return rows containing their
substitute. Note that, as with the noise word modifications, any changes to
thesaurus files do not take effect until you fully rebuild all existing
indexes. In addition, keep in mind that such changes should be saved in Unicode
when using a text editor.

In our next article of this series, we will conclude our discussion of
Full-Text Index technology implemented in SQL Server 2005 Express Edition by
presenting syntax of search queries and providing a few examples demonstrating
their power and flexibility.

»


See All Articles by Columnist
Marcin Policht

Marcin Policht
Marcin Policht
Being a long time reader, I'm proud to join the crowd of technology gurus gathered here. I have a fair share of Microsoft exams behind me, a couple of acronyms after my name - MCSE, MCSD, and MCT, decent familiarity with majority of MS BackOffice products (SMS, SQL, Exchange, IIS), programming and scripting languages (VB, C++, VBScript with wsh) and several years of practical experience with Windows environment administration and engineering. My focus these days is on SQL Server and Windows 2000, and I'll attempt to share the most interesting experiences with these products.

Latest Articles