Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Links Database Forum

» Database Journal Home
» Database Articles
» Database Tutorials
MS SQL
Oracle
DB2
MS Access
MySQL
» RESOURCES
Database Tools
SQL Scripts & Samples
Links
» Database Forum
» Sitemap
Free Newsletters:
DatabaseDaily  
News Via RSS Feed


follow us on Twitter
Database Journal |DBA Support |SQLCourse |SQLCourse2
 

Featured Database Articles

MS SQL

Posted Jan 11, 2010

Full Text Index Queries using CONTAINS and CONTAINSTABLE

By Marcin Policht

In the recent installments of our series dedicated to the most important features of SQL Server 2005 Express Edition, we have been covering its implementation of Full Text Indexing. After we described its basic characteristics and presented examples illustrating the implementation of file system-resident catalogs and indexes, which they provide storage for, we turned our attention to queries that facilitate full-text searches. In the previous article, we focused on the FREETEXT predicate and its functional equivalent FREETEXTTABLE. Now it is time for the remaining two constructs CONTAINS and CONTAINSTABLE.

As we pointed out earlier, FREETEXT and FREETEXTTABLE are relatively easy to use, but they also lack flexibility, which might be needed when performing searches that are more granular. This need is addressed by the CONTAINS predicate and the corresponding CONTAINSTABLE function. Unlike their previously discussed counterparts, the CONTAINS predicate and CONTAINSTABLE function locate exact, rather than approximate matches, by default. At the same rate, they accept a wide range of parameters that can be used to alter and customize this behavior. In particular, CONTAINS (in addition to accepting the name of a column or columns on which the match will be evaluated, as well as a language identifier that will provide context for inflections and thesaurus rules) is capable of handing any of the following search conditions:

  • simple terms - a sequence of characters (typically a word or group of words) enclosed in single quotes (and an additional set of double quotes when using multiple word terms) that is to be used to locate exact matches in target columns (punctuation marks are ignored). A target can be further refined by employing logical operators, such as AND, OR, and AND NOT. For example, to locate all rows that contain bicycle component or mountain, you would include CONTAINS (DocumentSummary, '"bicycle component" OR mountain') in the WHERE clause.
  • prefix terms - a sequence of characters (one or more words) followed by a wildcard (enclosed in a pair of single and double quotes) that are matched against target columns. (Note that if double quotes were missing, this would become a simple terms condition, resulting in a search for the exact match, which is bound to fail since asterisks are excluded from indexing). The search is conducted without taking into account inflections or thesaurus (which makes it equivalent to a typical LIKE query). For example, to identify all rows with words starting with appl, you would specify CONTAINS (DocumentSummary, '"appl*"')' in the WHERE clause.
  • generation terms - a sequence of characters forming the base for a search that implements inflectional rules or thesaurus entries (indicated by the presence of FORMSOF, INFLECTIONAL, and THESAURUS entries). This facilitates approximate - rather than exact - matches, equivalent to those performed when leveraging FREETEXT and FREETEXTTABLE constructs. For example, to locate all rows that contain any inflection of the word include, you would append CONTAINS (DocumentSummary, N'FORMSOF (INFLECTIONAL, include)') in the WHERE clause. Similarly, with our already customized tsENU.xml thesaurus file, you could use CONTAINS (DocumentSummary, N'FORMSOF (THESAURUS, bike)') in the WHERE clause to find every row containing either bike or bicycle.
  • proximity terms - two or more sequences of characters designating words or phrases (joined by the NEAR operator) that are supposed to appear close to each other within the same column entry. This approach allows you to evaluate degree of proximity, rather than mere presence (search will not return a match if targeted words are too far apart). For example, to find rows where repair and instructions are located near each other, you would include CONTAINS (DocumentSummary, 'repair NEAR instructions') in the WHERE clause.
  • weighed terms - assigns an arbitrary preference value (a weighed number between 0 and 1) that affects ranking calculated when executing CONTAINSTABLE function. While this search condition is allowed in CONTAINS predicate, it does not impact its outcome. Its syntax utilizes the ISABOUT operator along with WEIGHT keywords associating individual terms with their respective weighted values. For example, when searching for information on different bicycle parts, you would be able to indicate your preferences by specifying 'ISABOUT (reflectors WEIGHT (.6), pedals WEIGHT (.5), seats WEIGHT (.4))' as part of the CONTAINS predicate or CONTAINSTABLE function.

CONTAINSTABLE is a function, which utilizes the same set of search conditions implemented by the CONTAINS predicate. At the same rate, it shares its basic syntax and format of its result set (i.e. a table consisting of KEY and RANK columns for each row where a match to a search string has been found) with FREETEXTTABLE function. For example, a query equivalent to the one we posted in our previous article that returns all rows in the DocumentSummary column that contain the word bike (including its inflections and synonyms defined in the thesaurus file), along with their respective rank and key values would take the form below. (Note that in this case, you also have the ability to limit the result set to a specific number of top matches only, by specifying that number as the last parameter of CONTAINSTABLE function).

SELECT f.*, d.DocumentSummary
	FROM CONTAINSTABLE(Production.Document, DocumentSummary, N'FORMSOF (THESAURUS, bike)', LANGUAGE 1033) f
	INNER JOIN Production.Document d	
		ON f.[KEY] = d.DocumentID

To further explore the capabilities of CONTAINSTABLE, let's take a look at an example demonstrating the use of weighed terms. Our first query lists DocumentSummary entries, which contain reflectors, pedals, or seats words by leveraging the OR logical operator. The outcome includes the content of the DocumentSummary column, along with corresponding KEY and RANK values.

SELECT f.*, d.DocumentSummary
	FROM CONTAINSTABLE(Production.Document, DocumentSummary, N'reflectors OR pedals OR seats') f
	INNER JOIN Production.Document d	
		ON f.[KEY] = d.DocumentID
	ORDER BY f.RANK

The ranking values reflect the fact that the first of them appears most frequently and the last one is least common. (Note that even though the order of ranking is the same as the one we obtained by running an equivalent query using FREETEXTTABLE function in the previous article of our series, their respective values are different, since ranking algorithms in both cases are different). This time, we have the ability to assign arbitrary weight values that reflect our preferences. This is accomplished by employing the ISABOUT and WEIGHT constructs we described earlier:

SELECT f.*, d.DocumentSummary
	FROM CONTAINSTABLE(Production.Document, DocumentSummary, 
		N'ISABOUT (seats WEIGHT(.8), pedals WEIGHT(.1), reflectors WEIGHT(.3))') f
	INNER JOIN Production.Document d	
		ON f.[KEY] = d.DocumentID
	ORDER BY f.RANK

This concludes our discussion regarding Full Text indexing functionality available in SQL Server 2005 Express Edition.

» See All Articles by Columnist Marcin Policht



MS SQL Archives

Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 




Latest Forum Threads
MS SQL Forum
Topic By Replies Updated
SQL 2005: SSIS: Error using SQL Server credentials poverty 3 August 17th, 07:43 AM
Need help changing table contents nkawtg 1 August 17th, 03:02 AM
SQL Server Memory confifuration bhosalenarayan 2 August 14th, 05:33 AM
SQL Server Primary Key and a Unique Key katty.jonh 2 July 25th, 10:36 AM