Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Links Database Forum

» Database Journal Home
» Database Articles
» Database Tutorials
MS SQL
Oracle
DB2
MS Access
MySQL
» RESOURCES
Database Tools
SQL Scripts & Samples
Links
» Database Forum
» Slideshows
» Sitemap
Free Newsletters:
DatabaseDaily  
News Via RSS Feed


follow us on Twitter
Database Journal |DBA Support |SQLCourse |SQLCourse2
 

Featured Database Articles

MySQL

Posted March 9, 2015

MySQL Fuzzy Text Searching Using the SOUNDEX Function

By Rob Gravelle

Fuzzy searching has become a very prominent feature of Web search engines like Google.  One of the key factors in Google’s explosive growth was its ability to locate Web pages that are likely to be relevant to your search terms even if they don’t map exactly to the content that you are looking for.  The idea is not new; approximate string matching, as it’s known in computer science, is performed by algorithms that find strings that match a pattern approximately rather than exactly.

It would seem that the best place for such functionality is right in the database itself, where all the data is stored.  Unfortunately, MySQL only provides some modicum of fuzzy search capability.  There has always been far more choices in the realm of programming languages like PHP, Java, and what have you.

Nonetheless, I have found that as long as you are clear on what you are trying to accomplish, you can implement fuzzy text searching within your MySQL database by using a combination of built-in and user functions.  In today's article, we'll start with the native SOUNDEX MySQL function.

How it Works

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English.  SOUNDEX codes from different strings can be compared to see how similar the strings sound when spoken. The first character of the code is the first character of the expression, converted to upper case. The second through fourth characters of the code are numbers that represent the letters in the expression. The letters A, E, I, O, U, H, W, and Y are ignored unless they are the first letter of the string.  All international alphabetic characters outside the A-Z range are treated as vowels.  Hence, two strings that sound almost the same should have identical soundex strings.  For instance, the words "Assistance" and "Assistants" both produce a soundex of “A223”.

Soundex provides a very simple way to search for misspellings because the Soundex code for strings that are misspelled are often the same. For example, “Williams” and “Wlliams” both produce a soundex value of “W452”.  Here is a query that matches authors with last names that sound like “Williams”:

SELECT * FROM `author_bios` where SOUNDEX(`auth_last_name`) = SOUNDEX('Williams')

Matching a Single Search Term against Multiple Words

Soundex() works best when comparing a single word or short phrase against a string of the same length.  For that reason it’s not immediately suited for full-text searches against blocks of text.  However, with a little coding (or sleuthing as the case may be), we can split the search text into individual words for comparison.  This code can be placed in a user function that we can then call like so:

mysql>SELECT soundex_match('minesota', 'Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana', ',' ) as 'Soundex Results';

Before we get to the query result, let’s take a look at the soundex_match() function.

As alluded to above, I am not the author of this excellent function.  It is the work of Imranul Hoque.  I cannot easily say what the function does on a line-by-line basis, but I can say with certainty that:

1 .the function takes 3 arguments:

  • needle: The word you are looking for
  • haystack: The block of text that you are searching
  • splitChar: The whitespace character that’ll split the string into single words. Generally it is the space(‘ ‘)

2. If any word in the haystack sounds similar to the needle, the function will return 1. Otherwise it returns 0.

CREATE DEFINER=`root`@`localhost` FUNCTION `soundex_match`(
   needle varchar(128), haystack text, splitChar varchar(1)) 
 RETURNS tinyint(4)
 DETERMINISTIC
 BEGIN
     declare spacePos int;
     declare searchLen int default 0;
     declare curWord varchar(128) default '';
     declare tempStr text default haystack;
     declare tmp text default '';
     declare soundx1 varchar(64) default '';
     declare soundx2 varchar(64) default '';    

    set searchLen = length(haystack);
     set spacePos  = locate(splitChar, tempStr);
     set soundx1   = soundex(needle);

    while searchLen > 0 do
       if spacePos = 0 then
         set tmp = tempStr;
         select soundex(tmp) into soundx2;
         if soundx1 = soundx2 then
           return 1;
         else
           return 0;
         end if;
       else
         set tmp = substr(tempStr, 1, spacePos-1);
         set soundx2 = soundex(tmp);
         if soundx1 = soundx2 then
           return 1;
         end if;

        set tempStr = substr(tempStr, spacePos+1);
         set searchLen = length(tempStr);
       end if;      

      set spacePos = locate(splitChar, tempStr);
     end while;  

    return 0;
 END

Hence, our query above would return a value of 1 because “minesota” still sounds like “Minnesota”.

Matching Multiple Search Terms against Multiple Words!

The above function works great for needle-in-a-haystack type searching where you have a single search term.  Search engines that I’ve used are not limited to one search term.  If I was looking for a restaurant called “Ed’s Easy Diner” and I spelled it as “Eds Eezy Dinner”, Google would still find it, and probably with high relevance, unless of course there was a restaurant called “Ed’s Easy Dinner”!

As mentioned earlier, there is no need for a custom function at all if you’re trying to match against the same string but minus the spelling error.  In that case a simple Soundex comparison will suffice:

select soundex('eds easy diner'), soundex('Ed\'s Eesy Dinner');

In both cases, soundex() returns a value of “E32356”.

Soundex_match() works great with single word search strings against a larger block of text, but not so much for queries that contain a phrase.  This statement would return zero (no match):    

select soundex_match('eds eezy dinner', 'Welcome to Ed\'s Easy Diner!', ' ') as 'matches';

The solution that I devised was to split the search string into individual words and the call soundex_match() with each one.  I specifically designed it that way so that missing words won’t derail your search.  For instance, searching for “Eds Dinner” against “Welcome to Ed\'s Easy Diner!” still works because the function only cares whether or not words that sound like “Eds” and “Dinner” are contained within the target string.

I named my function soundex_match_all:

CREATE DEFINER=`root`@`localhost` FUNCTION `soundex_match_all`(needle varchar(128), haystack text, splitChar varchar(1)) RETURNS tinyint(4)

BEGIN
     /* find the first instance of the splitting character */ 
     DECLARE comma  INT  DEFAULT 0;  
     DECLARE word   TEXT; 

    SET comma = LOCATE(splitChar, needle);
     SET word = TRIM(needle);

    if LENGTH(haystack) = 0 then
         return 0;
     elseif comma = 0 then 
         /* one word search term */ 
         return soundex_match(word, haystack, splitChar);
     end if;

    SET word = trim(substr(needle, 1, comma));

    /* Insert each split variable into the word variable */ 
     REPEAT
         if soundex_match(word, haystack, splitChar) = 0 then 
             return 0;
         end if;

        /* get the next word */
         SET needle = trim(substr(needle, comma));
         SET comma  = LOCATE(splitChar, needle); 
         if comma = 0 then 
             /* last word */
             return soundex_match(needle, haystack, splitChar);
         end if;

        SET word = trim(substr(needle, 1, comma));
     UNTIL length(word) = 0
     END REPEAT; 

    return 0; 
 END

Conclusion

The Soundex() function is only one of several ways to perform fuzzy searches on your MySQL data.  We’ll explore some other options soon!

See all articles by Rob Gravelle



MySQL Archives

Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 




Latest Forum Threads
MySQL Forum
Topic By Replies Updated
MySQL in high availability and transction secure banking application klamor 1 August 28th, 10:24 AM
MySQL rollback UAL225 1 August 28th, 10:15 AM
Browsing a DB file that uses MySql finleytech 1 July 26th, 10:51 AM
php cookie won't pass variable to next page ITdevGirl 0 June 22nd, 12:13 PM