MySQL's Full-Text Formulas | Database Journal

by Peter Gulutzan

A MySQL full-text query returns rows according to relevance. But what is relevance? It is a floating-point number based on formulas. Researchers have shown that these formulas produce results that real users want. I will disclose how MySQL 4.1 uses them, with a bit of arithmetic and a helpful program.

I made a 4-row database with MySQL 4.1:

CREATE TABLE quotes (quote CHAR(100),FULLTEXT (quote));
INSERT INTO quotes VALUES
  ('Special times require special socks'),
  ('Knock three times on the ceiling'),
  ('Boliauns are weeds'),
  ('The leprechaun''s gold');

Some words occur multiple times. Some words will not go in the FULLTEXT index because they are too short or too common. This database is complex enough to use for examples of all the formulas’ tricky parts.

myisam_ftdump
The formulas
The formulas in plain English
The stopword list
Some Tips
References

myisam_ftdump

To see what is in a full-text index, use the myisam_ftdump program. It comes with the standard distribution, but it does not come with instructions. Here is how I use it.

I find out what my data directory is with a SHOW statement:

mysql> SHOW VARIABLES LIKE 'datadir%';
+---------------+-----------------------+
| Variable_name | Value                 |
+---------------+-----------------------+
| datadir       | /usr/local/mysql/var/ |
+---------------+-----------------------+
1 row in set (0.00 sec)

I find out what my database is with a SELECT statement:

mysql> SELECT DATABASE();
+------------+
| DATABASE() |
+------------+
| db1        |
+------------+
1 row in set (0.01 sec)

I run myisam_ftdump. Usually it is in the same directory as the mysql client program (…/bin). (If not, I use the OS “find” utility to search for myisam_ftdump*.)

#> myisam_ftdump --help
Use: myisam_ftdump <table_name> <index_num>
  -d, --dump          Dump index (incl. data offsets and word weights).
  -s, --stats         Report global stats.
  -v, --verbose       Be verbose.
  -c, --count         Calculate per-word stats (counts and global
weights).
  -l, --length        Report length distribution.
  -h, --help          Display help and exit.
  -?, --help          Synonym for -h.

Now, knowing that my “datadir” is /usr/local/mysql/var, and that my “database” is db1, and that my table name is ‘quotes’, I say I want to “Dump index”:

# myisam_ftdump /usr/local/mysql/var/db1/quotes 0 -d
       ca            0.9775171 boliauns
       65            0.9666505 ceiling
      12f            0.9775171 gold
       65            0.9666505 knock
      12f            0.9775171 leprechaun's
        0            0.8148246 require
        0            0.8148246 socks
        0            1.3796179 special
        0            0.8148246 times
       65            0.9666505 times
       ca            0.9775171 weeds

You should be able to see the same thing. If you get “Error xx” that probably means you lack permission to see the index file, or you entered the name wrong.

Finally I say I want to “Calculate per-word stats”:

# myisam_ftdump /usr/local/mysql/var/db1/quotes 0 -c
        1            1.0986123 boliauns
        1            1.0986123 ceiling
        1            1.0986123 gold
        1            1.0986123 knock
        1            1.0986123 leprechaun's
        1            1.0986123 require
        1            1.0986123 socks
        1            1.0986123 special
        2            0.0000000 times
        1            1.0986123 weeds

Clearly MySQL associates terms with numbers. All that remains is to find out what the numbers mean.

The formulas

There are three formulas.

local weight = (log(dtf)+1)/sumdtf * U/(1+0.0115*U)
global weight = log((N-nf)/nf)
query weight = local weight * global weight * qf

Operand Meaning:

dtf      How many times the term appears in the row
sumdtf   The sum of “(log(dtf)+1)” for all terms in the same row
U        How many unique terms are in the row
N        How many rows are in the table
nf       How many rows contain the term
qf       How many times the term appears in the query

The only rough part of the formulas is the dread word “log”. Think of “log(n)” as a number that goes up whenever n goes up but more slowly, like a law of diminishing returns. For example log(1) is 0, log(2) is 0.6931472, log(3) is 1.0986123.

Let’s take the word “special” and plug it into the formulas for the first row in the table, which myisam_ftdump identifies as row 0 (zero).

For the first formula: (log(dtf)+1)/sumdtf * U/(1+0.0115*U);

dtf       "special" appears 2 times in row 0
          so log(dtf()+1) = 0.6931472 + 1 = 1.6931472
sumdtf    "special" appears 2 times in row 0, so add log(2)+1
          "times" appears 1 times in row 0, so add log(1)+1
          "require" appears 1 times in row 0, so add log(1)+1
          "socks" appears 1 times in row 0, so add log(1)+1
          so sumdtf = log(2)+1 + (log(1)+1)*3 = 4.6931472
U         there are 4 unique terms in row 0
          so U/(1+0.115*U) = 4/(1+0.0115*4) = 3.824092

Therefore, local weight = 1.6931472 / 4.6931472 * 3.824092 = 1.3796179, with creative rounding. That’s what myisam_ftdump said for “dump index”.

For the second formula: log((N-nf)/nf);

     N         there are 4 rows in the ´quotes´ table
     nf        the term "special" occurs in 1 row

Therefore, global weight = log((N-nf)/nf) = log(3) = 1.0986123. That’s what myisam_ftdump said for “Calculate per-word stats”.

For the third formula: query weight = local weight * global weight * qf;

     local weight     1.3796179
     global weight    1.0986123
     qf               "special" appears 1 times in the query

Therefore, query weight = 1.3796179 * 1.0986123 * 1 = 1.5156652.

At last we can go back into mysql and enter a MATCH … AGAINST:

mysql>  
+-------------------------------------------+
| ROUND(MATCH(quote) AGAINST ('special'),7) |
+-------------------------------------------+
|                                 1.5156652 |
|                                 0.0000000 |
|                                 0.0000000 |
|                                 0.0000000 |
+-------------------------------------------+
4 rows in set (0.00 sec)

So MATCH … AGAINST returns 1.5156652 for the first row, and that’s exactly what the query-weight formula says it should do.

The formulas in plain English

Notice that local weight depends on a straight multiplier, the term-within-row frequency times the unique frequency. Put most simply: If a term appears many times in a row, the weight goes up.

Why does local weight depend on how many times the term is in the row? Think of the document you are reading now. I inevitably mention “MySQL” and “full-text” several times. That’s typical: if words appear several times, they are likely to be relevant.

Notice that global weight depends on an inverse multiplier, the count of rows MINUS the count of rows that the term appears in. Put most simply: If a term appears in many rows, the weight goes down.

Why is global weight an inverse? That is, why is it bad if the term appears in many rows? Well, think of the farmer who marked the location of the leprechaun’s gold by tying a sock around a nearby bush. When he returned with his shovel to dig up the gold, the leprechaun had tied a sock around every bush in the field! Moral: if ‘sock’ appears everywhere, it is a useless marker.

Notice that local weight, global weight, and query weight are the only things that matter. MySQL will not increase weight if two terms are close to each other. MySQL will not do stemming automatically, that is, a search for “weeds” will not return “weed.” MySQL has many options, including IN BOOLEAN MODE where the above formulas are irrelevant, but for a no-frills information-retrieval query the frequencies are all that matter.

The stopword list

Some terms — ‘and’ and ‘the’ and ‘is’ are good examples — will inevitably have global weight equal zero, in English text. MySQL will not even bother to index such words, and will not use them in queries. They are “stopwords.” Keeping a stopword list is desirable, and normal. However, I can think of some situations where this list is trouble.

– Think of these famous Americans’ statements:

Every word she writes is a lie, including “and” and “the”.’
‘It depends on what the meaning of the word “is” is.’

Here there is no use searching for ‘and and the’ or ‘is is’. You must use a different operator, for example SELECT * FROM quotes WHERE quote LIKE ‘%is” is%’; Or you must use a phrase and include a non-stopword, for example SELECT * FROM quotes WHERE MATCH(quote) AGAINST(‘including and and the’);

– Suppose your text is not English. Then “Is” could be a Russian river name, and “the” could be the French word for tea (yes, the French word has an accent mark, but the accent is irrelevant in MySQL’s default latin1_swedish_ci collation). Therefore, you have to change the stopword list. Links for stopword lists are on the University of Neuchatel site (<a href=”http://www.unine.ch/info/clef/”>http://www.unine.ch/info/clef/), for Finnish, French, German, Italian, Russian, Spanish, and Swedish.

So make your own list, put it in stopword.txt, and always start up the MySQL server with the stopword file:

#> mysqld --ft_stopword_file=stopword.txt

To check that the server was started with a stopword file, use SHOW:

mysql> SHOW VARIABLES LIKE 'ft_stopword_file';
+------------------+--------------+
| Variable_name    | Value        |
+------------------+--------------+
| ft_stopword_file | stopword.txt |
+------------------+--------------+
1 row in set (0.00 sec)

Some Tips

Since the final formula contains qf (the frequency of a term in the query), you can give a term extra weight by mentioning it twice in the query. Try asking for ‘special special’ instead of ‘special’ and notice how the result changes.

It might help to add a column with terms that you think are particularly relevant, and include that column in the FULLTEXT index. This is similar to the idea behind the
keyword tag on an HTML page. The words that you add to this column will give a greater weight for the row.

If the results of some searches surprise you, run myisam_ftdump. It will not save you — no information-retrieval gives perfect results every time — but it will help you think of more effective index and query strategies next time.

Keep an eye on the “Full-text search TODO” in the MySQL Reference Manual, http://dev.mysql.com/doc/mysql/en/fulltext-todo.html. I know for certain that someone is working on one of the TODO items as I speak. And if you are wondering about InnoDB support, well, I have no insider knowledge, but the InnoDB makers want to hire a programmer whose first task might be full-text (http://www.innodb.com/jobs.php).