Full Text Search on SQL 2000 Part 2
January 7, 2005
In last month's article, Full Text Searching was introduced as a way to query strings with more refinement than the usual TSQL "like" or equal operator statements. With Full Text Searching, a new file system structure is created, storing key words from selected fields into Catalogs. In addition to storing typical character fields from databases, Microsoft Office documents that have been saved as binaries can also be entered into the Catalogs. Once created, these catalogs can be interrogated for key words that are near each other's proximity, the singular and plural versions of a word, or the noun and verb variations of a word can all be searched for. In addition, search result rankings, or weights, that usually accompany internet search engine returns can also be requested.
In Part 1, a simple Catalog of the pubs database was created. The Catalog, called demo, contained the title and note fields from the titles table. The Catalog was populated, and a query returned successfully. This article continues with the sample data from Part 1.
At its simplest level, the key word CONTAINS behaves similar to LIKE. To begin, we will use CONTAINS for precise string matching. The differences between CONTAINS and LIKE will not be very evident here. Using the pubs Catalog created in Part 1, issue the query:
SELECT title_id, title, notes FROM titles WHERE CONTAINS (notes, ' "computer" ')
The statement returns three rows, each containing the word "computer" in the notes field.
The Catalog created on the pubs database contained two fields: notes, and title. These are the only two fields allowed for CONTAINS searching. If a query is issued on a field not included in the Catalog, SQL will return an error. Evidenced by searching on the "type" field below:
SELECT title_id, title, notes FROM titles WHERE CONTAINS (type, ' "computer" ')
This statement returns:
Server: Msg 7601, Level 16, State 3, Line 1 Cannot use a CONTAINS or FREETEXT predicate on column 'type' because it is not full-text indexed.
The syntax of the CONTAINS is the Cataloged field name, followed by the word or phrase to search for. Although only fields Cataloged can be searched on, any field in the table can be returned. Our first statement returned the title_id field. The same is true of joining a CONTAINS statement to another table. It follows the usual TSQL JOIN statement rules. This next query joins the CONTAINS statement to the titleauthor table, useful if author names were needed.
SELECT titles.title_id, titles.title, titles.notes, titleauthor.au_id FROM titles INNER JOIN titleauthor ON titles.title_id = titleauthor.title_ID WHERE CONTAINS (titles.notes, ' "computer" ')
The statement functions like a standard join returning the following.
The first query seemed very similar to a regular statement that could have been rewritten using the LIKE key word. However, if we run a LIKE statement, some differences will appear:
SELECT title_id, title, notes FROM titles WHERE notes LIKE '%computer%'
This LIKE statement returns five records rather than the three returned by the CONTAINS:
The difference is the plural "computers," returned by LIKE. The CONTAINS statement was only looking for an exact text match of "computer." Plurals were ignored. As a side note, CONTAINS is case insensitive. It sees no difference between "COmputERS" and "computers." To include the plurals, the CONTAINS is changed from what is known as a simple term search, to a prefix term search. This is accomplished by using the * sign. Change the original search statement to:
SELECT title_id, title, notes FROM titles WHERE CONTAINS (notes, '"computer*" ')
The query will now return the same five records as the LIKE statement. This leads to a discussion about the use of single and double quotes. The double quotes indicate the asterisk should look for zero, one, or more characters after the root word. Without the double quotes, such as:
SELECT title_id, title, notes FROM titles WHERE CONTAINS (notes, 'computer*')
The asterisk will now be treated as literal character to be found, rather than a wild card. Only the three records with the singular use of "computer" will be returned with this statement.
Another way to find variants of a word is change our prefix term search into a generation term search by the use of a new key word: FORMSOF. FORMSOF is used in conjunction with CONTAINS to look for variants of the search word. To use FORMSOF, the original statement is changed to:
SELECT title, notes FROM titles WHERE CONTAINS(notes, 'FORMSOF (INFLECTIONAL, computer)')
This will return all five records, those with the word "computer," and the plural "computers." Again, we could obtain the same results by using a LIKE statement, but the real power of FORMSOF becomes evident when the word "good" is searched for. This next LIKE statement will not return any matching records:
SELECT title, notes FROM titles WHERE notes LIKE '%good%'
The word "good" does not appear in the notes field. However, this next FORMSOF statement will return one record:
SELECT title, notes FROM titles WHERE CONTAINS(notes, 'FORMSOF (INFLECTIONAL, good)')
The difference is that in addition to returning singular and plural forms of a word, FORMSOF will also include gender and neutral forms of nouns, verbs, and adjectives. In the above example, "best" is returned as a variant of "good."
If we change the above statement around, and look for variants of the word "and" by using the following statement:
SELECT title, notes FROM titles WHERE CONTAINS(notes, 'FORMSOF (INFLECTIONAL, and)')
An error will be returned rather than the previously returned record:
Server: Msg 7619, Level 16, State 1, Line 1 Execution of a full-text operation failed. A clause of the query contained only ignored words.
The word "and" is considered a NOISE WORD. These are common words that MS Search feels should be ignored, and not be included in any Catalog. The full list of NOISE WORDS is located at \Program Files\Microsoft SQL Server\MSSQL\FTDATA\SQLServer\Config . The file representing your language can be opened and modified with a text editor. Any changes made will be reflected on the next Catalog population.
Full Text Searching goes beyond simple LIKE string searching to provide new query possibilities. Next month, we will continue with advanced searching, weights, searching Microsoft Office documents, and maintenance issues.