Free Newsletters:
DatabaseJournal  
DBANews
Database Journal
Search Database Journal:
 
HOME News MS SQL Oracle DB2 Access MySQL PostgreSQL PHP SQL Etc Scripts Links Discussion
internet.com

» HOME
» NEWS
» FEATURES
» SERIES
MS SQL
Oracle
MS Access
MySQL
DB2
» RESOURCES
Products
Scripts
Links
» DISCUSSION
» TECH JOBS

Marketplace Partners
Be a Marketplace Partner




internet.commerce
Be a Commerce Partner
Home Improvement
GPS Devices
Prepaid Phone Card
Compare Prices
Memory Upgrades
Laptops
Career Education
Corporate Awards
Remote Online Backup
Desktop Computers
KVM over IP
Web Design
Promotional Products
Auto Insurance Quote




MySpace Joins eBay, Yahoo in Open Profile Push

News Corp. Unit Under Fire for Ties to Hacker

Are Non-PC Devices Hurting 'Net Innovation?

internet.com
IT
Developer
Internet News
Small Business
Personal Technology
International

Search internet.com
Advertise
Corporate Info
Newsletters
Tech Jobs
E-mail Offers


Linked Data Planet Conference & Expo

CA ERwin® Data Modeler Proven database design and modeling. Efficiently analyze, design and deploy effective database solutions. Whitepaper: Manage SQL Server Deployments
Try it free: CA ERwin® Data Modeler


Solaris 8 Migration Assistant
Rapidly move your Solaris 8 application environments to new systems running Solaris 10 with the Solaris 8 Migration Assistant. Reduce migration risk while taking advantage of increased performance, reliability and security of the latest SPARC hardware platforms and Solaris 10 OS. »

 
Sun Eco Innovation: Good for Business, Good for the Environment
A complete solution to help you optimize and refresh your datacenter while properly recycling equipment and eliminating eWaste, including money-saving promotions to lower hardware acquisition costs. »

 
Sun Eco Innovation: Power Calculators
Power consumption has increasingly become a priority in customer's minds when purchasing new systems or storage. Sun's Power Calculators provide data on power consumption of Sun products allowing IT managers to better plan the power requirements in the datacenter to achieve better energy and cost savings. »

 
Optimize the Web Tier: Consolidate to Get More Performance in Less Space and Lower Power Consumption
Expansion in the Web tier is generally accomplished by adding more servers whenever extra capacity is needed. As the pool of servers grows larger, however, the complexity of the environment can grow exponentially. »

Production Manager (hands on)
Aquent
US-MA-Cambridge

Justtechjobs.com Post A Job | Post A Resume
MySQL
June 15, 2005
MySQL's Full-Text Formulas

by Peter Gulutzan

A MySQL full-text query returns rows according to relevance. But what is relevance? It is a floating-point number based on formulas. Researchers have shown that these formulas produce results that real users want. I will disclose how MySQL 4.1 uses them, with a bit of arithmetic and a helpful program.

I made a 4-row database with MySQL 4.1:




CREATE TABLE quotes (quote CHAR(100),FULLTEXT (quote));
INSERT INTO quotes VALUES
  ('Special times require special socks'),
  ('Knock three times on the ceiling'),
  ('Boliauns are weeds'),
  ('The leprechaun''s gold');

Some words occur multiple times. Some words will not go in the FULLTEXT index because they are too short or too common. This database is complex enough to use for examples of all the formulas' tricky parts.

myisam_ftdump

To see what is in a full-text index, use the myisam_ftdump program. It comes with the standard distribution, but it does not come with instructions. Here is how I use it.

I find out what my data directory is with a SHOW statement:


mysql> SHOW VARIABLES LIKE 'datadir%';
+---------------+-----------------------+
| Variable_name | Value                 |
+---------------+-----------------------+
| datadir       | /usr/local/mysql/var/ |
+---------------+-----------------------+
1 row in set (0.00 sec)

I find out what my database is with a SELECT statement:


mysql> SELECT DATABASE();
+------------+
| DATABASE() |
+------------+
| db1        |
+------------+
1 row in set (0.01 sec)

I run myisam_ftdump. Usually it is in the same directory as the mysql client program (.../bin). (If not, I use the OS "find" utility to search for myisam_ftdump*.)


#> myisam_ftdump --help
Use: myisam_ftdump <table_name> <index_num>
  -d, --dump          Dump index (incl. data offsets and word weights).
  -s, --stats         Report global stats.
  -v, --verbose       Be verbose.
  -c, --count         Calculate per-word stats (counts and global
weights).
  -l, --length        Report length distribution.
  -h, --help          Display help and exit.
  -?, --help          Synonym for -h.

Now, knowing that my "datadir" is /usr/local/mysql/var, and that my "database" is db1, and that my table name is 'quotes', I say I want to "Dump index":


# myisam_ftdump /usr/local/mysql/var/db1/quotes 0 -d
       ca            0.9775171 boliauns
       65            0.9666505 ceiling
      12f            0.9775171 gold
       65            0.9666505 knock
      12f            0.9775171 leprechaun's
        0            0.8148246 require
        0            0.8148246 socks
        0            1.3796179 special
        0            0.8148246 times
       65            0.9666505 times
       ca            0.9775171 weeds

You should be able to see the same thing. If you get "Error xx" that probably means you lack permission to see the index file, or you entered the name wrong.

Finally I say I want to "Calculate per-word stats":


# myisam_ftdump /usr/local/mysql/var/db1/quotes 0 -c
        1            1.0986123 boliauns
        1            1.0986123 ceiling
        1            1.0986123 gold
        1            1.0986123 knock
        1            1.0986123 leprechaun's
        1            1.0986123 require
        1            1.0986123 socks
        1            1.0986123 special
        2            0.0000000 times
        1            1.0986123 weeds

Clearly MySQL associates terms with numbers. All that remains is to find out what the numbers mean.

The formulas

There are three formulas.


local weight = (log(dtf)+1)/sumdtf * U/(1+0.0115*U)
global weight = log((N-nf)/nf)
query weight = local weight * global weight * qf

Operand  Meaning:

dtf      How many times the term appears in the row
sumdtf   The sum of "(log(dtf)+1)" for all terms in the same row
U        How many unique terms are in the row
N        How many rows are in the table
nf       How many rows contain the term
qf       How many times the term appears in the query

The only rough part of the formulas is the dread word "log". Think of "log(n)" as a number that goes up whenever n goes up but more slowly, like a law of diminishing returns. For example log(1) is 0, log(2) is 0.6931472, log(3) is 1.0986123.

Let's take the word "special" and plug it into the formulas for the first row in the table, which myisam_ftdump identifies as row 0 (zero).

For the first formula: (log(dtf)+1)/sumdtf * U/(1+0.0115*U);

dtf       "special" appears 2 times in row 0
          so log(dtf()+1) = 0.6931472 + 1 = 1.6931472
sumdtf    "special" appears 2 times in row 0, so add log(2)+1
          "times" appears 1 times in row 0, so add log(1)+1
          "require" appears 1 times in row 0, so add log(1)+1
          "socks" appears 1 times in row 0, so add log(1)+1
          so sumdtf = log(2)+1 + (log(1)+1)*3 = 4.6931472
U         there are 4 unique terms in row 0
          so U/(1+0.115*U) = 4/(1+0.0115*4) = 3.824092

Therefore, local weight = 1.6931472 / 4.6931472 * 3.824092 = 1.3796179, with creative rounding. That's what myisam_ftdump said for "dump index".

For the second formula: log((N-nf)/nf);


     N         there are 4 rows in the ´quotes´ table
     nf        the term "special" occurs in 1 row

Therefore, global weight = log((N-nf)/nf) = log(3) = 1.0986123. That's what myisam_ftdump said for "Calculate per-word stats".

For the third formula: query weight = local weight * global weight * qf;


     local weight     1.3796179
     global weight    1.0986123
     qf               "special" appears 1 times in the query

Therefore, query weight = 1.3796179 * 1.0986123 * 1 = 1.5156652.

At last we can go back into mysql and enter a MATCH ... AGAINST:


mysql>  
+-------------------------------------------+
| ROUND(MATCH(quote) AGAINST ('special'),7) |
+-------------------------------------------+
|                                 1.5156652 |
|                                 0.0000000 |
|                                 0.0000000 |
|                                 0.0000000 |
+-------------------------------------------+
4 rows in set (0.00 sec)

So MATCH ... AGAINST returns 1.5156652 for the first row, and that's exactly what the query-weight formula says it should do.

The formulas in plain English

Notice that local weight depends on a straight multiplier, the term-within-row frequency times the unique frequency. Put most simply: If a term appears many times in a row, the weight goes up.

Why does local weight depend on how many times the term is in the row? Think of the document you are reading now. I inevitably mention "MySQL" and "full-text" several times. That's typical: if words appear several times, they are likely to be relevant.

Notice that global weight depends on an inverse multiplier, the count of rows MINUS the count of rows that the term appears in. Put most simply: If a term appears in many rows, the weight goes down.

Why is global weight an inverse? That is, why is it bad if the term appears in many rows? Well, think of the farmer who marked the location of the leprechaun's gold by tying a sock around a nearby bush. When he returned with his shovel to dig up the gold, the leprechaun had tied a sock around every bush in the field! Moral: if 'sock' appears everywhere, it is a useless marker.

Notice that local weight, global weight, and query weight are the only things that matter. MySQL will not increase weight if two terms are close to each other. MySQL will not do stemming automatically, that is, a search for "weeds" will not return "weed." MySQL has many options, including IN BOOLEAN MODE where the above formulas are irrelevant, but for a no-frills information-retrieval query the frequencies are all that matter.

The stopword list

Some terms -- 'and' and 'the' and 'is' are good examples -- will inevitably have global weight equal zero, in English text. MySQL will not even bother to index such words, and will not use them in queries. They are "stopwords." Keeping a stopword list is desirable, and normal. However, I can think of some situations where this list is trouble.

- Think of these famous Americans' statements:

Every word she writes is a lie, including "and" and "the".'
'It depends on what the meaning of the word "is" is.'

Here there is no use searching for 'and and the' or 'is is'. You must use a different operator, for example SELECT * FROM quotes WHERE quote LIKE '%is" is%'; Or you must use a phrase and include a non-stopword, for example SELECT * FROM quotes WHERE MATCH(quote) AGAINST('including and and the');

- Suppose your text is not English. Then "Is" could be a Russian river name, and "the" could be the French word for tea (yes, the French word has an accent mark, but the accent is irrelevant in MySQL's default latin1_swedish_ci collation). Therefore, you have to change the stopword list. Links for stopword lists are on the University of Neuchatel site (http://www.unine.ch/info/clef/), for Finnish, French, German, Italian, Russian, Spanish, and Swedish.

So make your own list, put it in stopword.txt, and always start up the MySQL server with the stopword file:


#> mysqld --ft_stopword_file=stopword.txt

To check that the server was started with a stopword file, use SHOW:


mysql> SHOW VARIABLES LIKE 'ft_stopword_file';
+------------------+--------------+
| Variable_name    | Value        |
+------------------+--------------+
| ft_stopword_file | stopword.txt |
+------------------+--------------+
1 row in set (0.00 sec)

Some Tips

Since the final formula contains qf (the frequency of a term in the query), you can give a term extra weight by mentioning it twice in the query. Try asking for 'special special' instead of 'special' and notice how the result changes.

It might help to add a column with terms that you think are particularly relevant, and include that column in the FULLTEXT index. This is similar to the idea behind the
keyword tag on an HTML page. The words that you add to this column will give a greater weight for the row.

If the results of some searches surprise you, run myisam_ftdump. It will not save you -- no information-retrieval gives perfect results every time -- but it will help you think of more effective index and query strategies next time.

Keep an eye on the "Full-text search TODO" in the MySQL Reference Manual, http://dev.mysql.com/doc/mysql/en/fulltext-todo.html. I know for certain that someone is working on one of the TODO items as I speak. And if you are wondering about InnoDB support, well, I have no insider knowledge, but the InnoDB makers want to hire a programmer whose first task might be full-text (http://www.innodb.com/jobs.php).

References

"The Full-Text Stuff That We Didn't Put In The Manual"
http://dev.mysql.com/tech-resources/articles/full-text-revealed.html

"The Field of boliauns"
http://www.sacred-texts.com/neu/celt/cft/cft06.htm

About the author

Peter Gulutzan works for MySQL AB as senior software architect. He lives in Edmonton, Alberta.

Tools:
Add databasejournal.com to your favorites
Add databasejournal.com to your browser search box
IE 7 | Firefox 2.0 | Firefox 1.5.x
Receive news via our XML/RSS feed

MySQL Archives

Download: SQL Backup & DBA Best Practices eBook.
Data Sheet: IBM Information Server Blade
Flash Demo: Learn how IBM Information Server Blade is easy to manage, highly scalable and efficient.
Download: SQL Backup & DBA Best Practices eBook
What's The Future Of IT? Find Out By Reading "IT in 2018" Now. Free Registration Required.


Latest Forum Threads
MySQL Forum
Topic By Replies Updated
database sculptor44 0 May 5th, 09:10 PM
arrangeable Database sculptor44 0 May 5th, 08:53 PM
Grant's order in sql anti 0 May 1st, 03:23 AM
Transfer information from an access database to MySQl database sculptor44 1 April 21st, 12:31 AM







JupiterOnlineMedia

internet.comearthweb.comDevx.commediabistro.comGraphics.com

Search:

Jupitermedia Corporation has two divisions: Jupiterimages and JupiterOnlineMedia

Jupitermedia Corporate Info


Legal Notices, Licensing, Reprints, & Permissions, Privacy Policy.

Advertise | Newsletters | Tech Jobs | Shopping | E-mail Offers

Solutions
Whitepapers and eBooks
Microsoft Article: HyperV-The Killer Feature in WinServer ‘08
Avaya Article: How to Feed Data into the Avaya Event Processor
Microsoft Article: Install What You Need with Win Server ‘08
HP eBook: Putting the Green into IT
Whitepaper: HP Integrated Citrix XenServer for HP ProLiant Servers
Intel Go Parallel Portal: Interview with C++ Guru Herb Sutter, Part 1
Intel Go Parallel Portal: Interview with C++ Guru Herb Sutter, Part 2--The Future of Concurrency
Avaya Article: Setting Up a SIP A/S Development Environment
IBM Article: How Cool Is Your Data Center?
Microsoft Article: Managing Virtual Machines with Microsoft System Center
HP eBook: Storage Networking , Part 1
Microsoft Article: Solving Data Center Complexity with Microsoft System Center Configuration Manager 2007
MORE WHITEPAPERS, EBOOKS, AND ARTICLES
Webcasts
Intel Video: Are Multi-core Processors Here to Stay?
On-Demand Webcast: Five Virtualization Trends to Watch
HP Video: Page Cost Calculator
Intel Video: APIs for Parallel Programming
HP Webcast: Storage Is Changing Fast - Be Ready or Be Left Behind
Microsoft Silverlight Video: Creating Fading Controls with Expression Design and Expression Blend 2
MORE WEBCASTS, PODCASTS, AND VIDEOS
Downloads and eKits
Sun Download: Solaris 8 Migration Assistant
Sybase Download: SQL Anywhere Developer Edition
Red Gate Download: SQL Backup Pro and free DBA Best Practices eBook
Red Gate Download: SQL Compare Pro 6
Iron Speed Designer Application Generator
MORE DOWNLOADS, EKITS, AND FREE TRIALS
Tutorials and Demos
How-to-Article: Preparing for Hyper-Threading Technology and Dual Core Technology
eTouch PDF: Conquering the Tyranny of E-Mail and Word Processors
IBM Article: Collaborating in the High-Performance Workplace
HP Demo: StorageWorks EVA4400
Intel Featured Algorhythm: Intel Threading Building Blocks--The Pipeline Class
Microsoft How-to Article: Get Going with Silverlight and Windows Live
MORE TUTORIALS, DEMOS AND STEP-BY-STEP GUIDES