Full Text Index Queries using CONTAINS and CONTAINSTABLE

In the recent installments of our series dedicated to the most important
features of SQL Server 2005 Express Edition, we have been covering its
implementation of Full Text Indexing. After we described its basic
characteristics and presented examples illustrating the implementation of file
system-resident catalogs and indexes, which they provide storage for, we turned
our attention to queries that facilitate full-text searches. In the previous
article, we focused on the FREETEXT
predicate and its functional equivalent FREETEXTTABLE.
Now it is time for the remaining two constructs CONTAINS and CONTAINSTABLE.

As we pointed out earlier, FREETEXT
and FREETEXTTABLE are
relatively easy to use, but they also lack flexibility, which might be needed
when performing searches that are more granular. This need is addressed by the CONTAINS predicate and the corresponding
CONTAINSTABLE function.
Unlike their previously discussed counterparts, the CONTAINS predicate and CONTAINSTABLE function locate exact,
rather than approximate matches, by default. At the same rate, they accept a
wide range of parameters that can be used to alter and customize this behavior.
In particular, CONTAINS (in
addition to accepting the name of a column or columns on which the match will
be evaluated, as well as a language identifier that will provide context for
inflections and thesaurus rules) is capable of handing any of the following
search conditions:

  • simple terms – a sequence of characters (typically a word or
    group of words) enclosed in single quotes (and an additional set of double
    quotes when using multiple word terms) that is to be used to locate exact
    matches in target columns (punctuation marks are ignored). A target can be
    further refined by employing logical operators, such as AND, OR,
    and AND NOT. For example, to
    locate all rows that contain bicycle
    component
    or mountain,
    you would include CONTAINS (DocumentSummary,
    '"bicycle component" OR mountain')
    in the WHERE clause.
  • prefix terms – a sequence of characters (one or more words)
    followed by a wildcard (enclosed in a pair of single and double quotes) that
    are matched against target columns. (Note that if double quotes were missing,
    this would become a simple terms condition, resulting in a search for the exact
    match, which is bound to fail since asterisks are excluded from indexing). The
    search is conducted without taking into account inflections or thesaurus (which
    makes it equivalent to a typical LIKE
    query). For example, to identify all rows with words starting with appl, you would specify CONTAINS (DocumentSummary, '"appl*"')'
    in the WHERE clause.
  • generation terms – a sequence of characters forming the base for
    a search that implements inflectional rules or thesaurus entries (indicated by
    the presence of FORMSOF, INFLECTIONAL, and THESAURUS entries). This facilitates
    approximate – rather than exact – matches, equivalent to those performed when
    leveraging FREETEXT and FREETEXTTABLE constructs. For example,
    to locate all rows that contain any inflection of the word include, you would append CONTAINS (DocumentSummary, N'FORMSOF (INFLECTIONAL,
    include)')
    in the WHERE
    clause. Similarly, with our already customized tsENU.xml thesaurus file, you could use CONTAINS (DocumentSummary, N'FORMSOF (THESAURUS,
    bike)')
    in the WHERE
    clause to find every row containing either bike
    or bicycle.
  • proximity terms – two or more sequences of characters designating
    words or phrases (joined by the NEAR
    operator) that are supposed to appear close to each other within the same
    column entry. This approach allows you to evaluate degree of proximity, rather
    than mere presence (search will not return a match if targeted words are too
    far apart). For example, to find rows where repair
    and instructions are located
    near each other, you would include CONTAINS
    (DocumentSummary, 'repair NEAR instructions')
    in the WHERE clause.
  • weighed terms – assigns an arbitrary preference value (a weighed
    number between 0 and 1) that affects ranking calculated when executing CONTAINSTABLE function. While this
    search condition is allowed in CONTAINS
    predicate, it does not impact its outcome. Its syntax utilizes the ISABOUT operator along with WEIGHT keywords associating individual
    terms with their respective weighted values. For example, when searching for
    information on different bicycle parts, you would be able to indicate your
    preferences by specifying 'ISABOUT (reflectors
    WEIGHT (.6), pedals WEIGHT (.5), seats WEIGHT (.4))'
    as part of
    the CONTAINS predicate or CONTAINSTABLE function.

CONTAINSTABLE
is a function, which utilizes the same set of search conditions implemented by the
CONTAINS predicate. At the
same rate, it shares its basic syntax and format of its result set (i.e. a
table consisting of KEY and RANK columns for each row where a match
to a search string has been found) with FREETEXTTABLE
function. For example, a query equivalent to the one we posted in our previous
article that returns all rows in the DocumentSummary
column that contain the word bike
(including its inflections and synonyms defined in the thesaurus file), along
with their respective rank and key values would take the form below. (Note that
in this case, you also have the ability to limit the result set to a specific
number of top matches only, by specifying that number as the last parameter of CONTAINSTABLE function).


SELECT f.*, d.DocumentSummary
FROM CONTAINSTABLE(Production.Document, DocumentSummary, N’FORMSOF (THESAURUS, bike)’, LANGUAGE 1033) f
INNER JOIN Production.Document d
ON f.[KEY] = d.DocumentID

To further explore the capabilities of CONTAINSTABLE,
let’s take a look at an example demonstrating the use of weighed terms. Our
first query lists DocumentSummary
entries, which contain reflectors,
pedals, or seats words by leveraging the OR logical operator. The outcome
includes the content of the DocumentSummary
column, along with corresponding KEY
and RANK values.


SELECT f.*, d.DocumentSummary
FROM CONTAINSTABLE(Production.Document, DocumentSummary, N’reflectors OR pedals OR seats’) f
INNER JOIN Production.Document d
ON f.[KEY] = d.DocumentID
ORDER BY f.RANK

The ranking values reflect the fact that the first of them appears most
frequently and the last one is least common. (Note that even though the order
of ranking is the same as the one we obtained by running an equivalent query
using FREETEXTTABLE function
in the previous article of our series, their respective values are different,
since ranking algorithms in both cases are different). This time, we have the ability
to assign arbitrary weight values that reflect our preferences. This is
accomplished by employing the ISABOUT
and WEIGHT constructs we
described earlier:


SELECT f.*, d.DocumentSummary
FROM CONTAINSTABLE(Production.Document, DocumentSummary,
N’ISABOUT (seats WEIGHT(.8), pedals WEIGHT(.1), reflectors WEIGHT(.3))’) f
INNER JOIN Production.Document d
ON f.[KEY] = d.DocumentID
ORDER BY f.RANK

This concludes our discussion regarding Full Text indexing functionality
available in SQL Server 2005 Express Edition.

»


See All Articles by Columnist
Marcin Policht

Marcin Policht
Marcin Policht
Being a long time reader, I'm proud to join the crowd of technology gurus gathered here. I have a fair share of Microsoft exams behind me, a couple of acronyms after my name - MCSE, MCSD, and MCT, decent familiarity with majority of MS BackOffice products (SMS, SQL, Exchange, IIS), programming and scripting languages (VB, C++, VBScript with wsh) and several years of practical experience with Windows environment administration and engineering. My focus these days is on SQL Server and Windows 2000, and I'll attempt to share the most interesting experiences with these products.

Latest Articles