SHARE

Data searches that leverage existing indexes

Written By

Dec 28, 2009

5 minute read

In the recent installments of our series dedicated to the most prominent features
of SQL Server 2005 Express Edition, we have been discussing its implementation
of Full Text Indexing. In the course of our discussion, we have covered its
basic characteristics and stepped through its implementation, describing tasks
required to create a file-system based catalog, as well as to define and populate
full-text indexes. In this article, we will focus on data searches, which
leverage existing indexes, taking into account such features as noise words and
thesaurus files.

As a reminder, our sample index was created on Production.Document table in the AdventureWorks database. It covers DocumentSummary and Document columns (respectively of nvarchar(max) and varbinary(max) data types), with the FileExtension column defining the data
format of the latter and PK_Document_DocumentID
serving as the underlying unique, single column, non-null key. We also
explicitly set the LANGUAGE
parameter to 1033 (US
English) and enabled AUTO
updates with CHANGE_TRACKING.
As the result, the full index population should be automatically taking place
(which you can verify by examining a value of its crawl_end_date in sys.fulltext_indexes
system view). With these constructs in place, let’s examine the types of full
text index-based queries we can execute against both columns.

In order to facilitate functionality not available in standard
character-based queries, Full-Text Indexing technology introduces a set of new
syntax structures, which consist of FREETEXT,
CONTAINS, FREETEXTTABLE and CONTAINSTABLE. The first two are
predicates returning a single boolean value (TRUE
or FALSE), indicating
whether a row includes a term being searched for and are typically used as part
of a SELECT statement to
provide a listing of such rows. The second pair represents functions that rely
on the same search criteria, but yield an outcome consisting of an identifier
of each returned row along with its rating (which designates a degree to which
criteria have been met).To illustrate these concepts, let’s run a series of
equivalent searches against the same dataset using each of these constructs. To
make our results easier to analyze, we will target the DocumentSummary column, whose content
you can display by executing the SELECT DocumentSummary FROM Product.Document statement.

FREETEXT is intended for
queries of full-text indexed character-based, varbinary(max), image,
and xml based columns (which
makes it suitable for searches of both Document
and DocumentSummary). Its
syntax is relatively straightforward and consists of three parameters. The
first one designates a target (one or more columns, identified either via their
names or through a wildcard notation). The second specifies a search string (of
nvarchar data type), and the
third, optional one allows you to identify a specific language, whose rules
should be applied when carrying out the query (if this parameter is not
explicitly stated, the language assigned to the full-text indexed column is
used instead). This approach automatically takes advantage of logic
incorporated into full-text indexing, such as the use of inflections (yielding
approximate, rather than exact, matches – unless a search string is enclosed in
double quotes) as well as thesaurus-based substitutions and replacements. To
demonstrate this behavior, we could add an <expansion>
element to the tsENU.xml
thesaurus file that includes <sub>
elements for bike and bicycle, effectively making it possible
to carry out a query for each DocumentDescription
entry that contains either one of them by executing the following statement:

SELECT DocumentSummary
	FROM Production.Document
	WHERE FREETEXT (DocumentSummary, N’bike’, LANGUAGE 1033)

Another common query involves searching for entries, which contain any of
the words in the phrase you specify (for example, with 'repair components' as the search
string, you should be able to locate all rows with 'repair' or 'component' (including their
inflections).

The FREETEXTTABLE
function is syntactically and functionally similar to FREETEXT (it also is subject to the same
search criteria), however instead of evaluating to a boolean TRUE or FALSE, it returns a table consisting of two columns. The
first one identifies the KEY
(a unique, non-null, single column that a full-text index references) for each
row where an approximate match to a search string has been found, and the other
stores an integer RANK
indicating degree of similarity. The function expects several parameters, in
particular, the name of a table where indexed columns reside, column names you
want to query, and a search string. It also gives you an option to designate a
desired language (affecting inflections and thesaurus-based expansions and
replacements) and set the maximum number of returned rows. For example, the
following query will yield all rows in the DocumentSummary
column, which contain the word bike
(including its inflections and synonyms defined in the thesaurus file), along
with their respective rank and key values. (If you want to limit the result set
to an arbitrary number of matches only and automatically order them according
to their ranking, specify that number as the last parameter of the FREETEXTTABLE function):

SELECT f.*, d.DocumentSummary
	FROM FREETEXTTABLE(Production.Document, DocumentSummary, N’bike’, LANGUAGE 1033) f
	INNER JOIN Production.Document d	
		ON f.[KEY] = d.DocumentID

The process used to calculate ranking for a query result set is relatively
involved. This results, at least to some extent, from a hierarchical index
structure, where statistics are evaluated on each level when forming higher-level
indexes though merging of its subordinates. Such process is not deterministic,
which implies that search results might vary as indexes change, but in general,
rank calculations take into consideration (besides the hit count) such criteria
as language used for tokenization, total number of rows indexed, column sizes,
or the largest offset of the indexed term in a row. (For more details regarding
this subject, refer to Understanding
Ranking article in the SQL Server 2005 Books Online). In its simplistic
form, ranking can be demonstrated by invoking the FREETEXTTABLE function targeting words that appear in
indexed rows with different frequency. For example, we will get distinctly
different results when querying the DocumentSummary
column for the presence of seats,
reflectors, and pedals terms (since their respective hit
counts are 1, 2, and 3) by running the following:

SELECT f.*, d.DocumentSummary
	FROM FREETEXTTABLE(Production.Document, DocumentSummary,
		N’seats pedals reflectors’, LANGUAGE 1033) f
	INNER JOIN Production.Document d	
		ON f.[KEY] = d.DocumentID

In a similar manner, we could target the Document
column of varbinary(max)
data type, since it has been included in our full-text index. This would
involve simply providing its name along with desired search criteria as part of
the FREETEXTTABLE function.
For example, we could apply the equivalent query by specifying FREETEXTTABLE(Production.Document, Document, N'seats pedals reflectors', LANGUAGE 1033) f as part of the FROM clause (of course, the resulting
ranking would be different, reflecting relevance-based ordering of seats, pedals, and reflectors
terms in Word documents stored in the Document
column).

FREETEXT and FREETEXTTABLE are relatively easy to
use, but they also lack flexibility, which might be needed in more demanding
scenarios. This gap is filled by the CONTAINS
predicate and the corresponding CONTAINSTABLE
function. In the next article of this series, we will examine their
characteristics and provide examples demonstrating their use.

»

See All Articles by Columnist Marcin Policht

Marcin Policht

Being a long time reader, I'm proud to join the crowd of technology gurus gathered here. I have a fair share of Microsoft exams behind me, a couple of acronyms after my name - MCSE, MCSD, and MCT, decent familiarity with majority of MS BackOffice products (SMS, SQL, Exchange, IIS), programming and scripting languages (VB, C++, VBScript with wsh) and several years of practical experience with Windows environment administration and engineering. My focus these days is on SQL Server and Windows 2000, and I'll attempt to share the most interesting experiences with these products.

Data searches that leverage existing indexes

Marcin Policht

Company

Categories