Exploring MySQL 5.7’s Geo-Spatial Functions

Fuzzy searching has become a very prominent feature of Web search engines like Google.  One of the key factors in Google’s explosive growth was its ability to locate Web pages that are likely to be relevant to your search terms even if they don’t map exactly to the content that you are looking for.  The idea is not new; approximate string matching, as it’s known in computer science, is performed by algorithms that find strings that match a pattern approximately rather than exactly.

It would seem that the best place for such functionality is right in the database itself, where all the data is stored.  Unfortunately, MySQL only provides some modicum of fuzzy search capability.  There has always been far more choices in the realm of programming languages like PHP, Java, and what have you.

Nonetheless, I have found that as long as you are clear on what you are trying to accomplish, you can implement fuzzy text searching within your MySQL database by using a combination of built-in and user functions.  In today’s article, we’ll start with the native SOUNDEX MySQL function.

How it Works

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English.  SOUNDEX codes from different strings can be compared to see how similar the strings sound when spoken. The first character of the code is the first character of the expression, converted to upper case. The second through fourth characters of the code are numbers that represent the letters in the expression. The letters A, E, I, O, U, H, W, and Y are ignored unless they are the first letter of the string.  All international alphabetic characters outside the A-Z range are treated as vowels.  Hence, two strings that sound almost the same should have identical soundex strings.  For instance, the words “Assistance” and “Assistants” both produce a soundex of “A223”.

Soundex provides a very simple way to search for misspellings because the Soundex code for strings that are misspelled are often the same. For example, “Williams” and “Wlliams” both produce a soundex value of “W452”.  Here is a query that matches authors with last names that sound like “Williams”:

SELECT * FROM `author_bios` where SOUNDEX(`auth_last_name`) = SOUNDEX('Williams')

Matching a Single Search Term against Multiple Words

Soundex() works best when comparing a single word or short phrase against a string of the same length.  For that reason it’s not immediately suited for full-text searches against blocks of text.  However, with a little coding (or sleuthing as the case may be), we can split the search text into individual words for comparison.  This code can be placed in a user function that we can then call like so:

mysql>SELECT soundex_match('minesota', 'Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana', ',' ) as 'Soundex Results';

Before we get to the query result, let’s take a look at the soundex_match() function.

As alluded to above, I am not the author of this excellent function.  It is the work of Imranul Hoque.  I cannot easily say what the function does on a line-by-line basis, but I can say with certainty that:

1 .the function takes 3 arguments:

  • needle: The word you are looking for
  • haystack: The block of text that you are searching
  • splitChar: The whitespace character that’ll split the string into single words. Generally it is the space(‘ ‘)

2. If any word in the haystack sounds similar to the needle, the function will return 1. Otherwise it returns 0.

CREATE PROCEDURE `closest_restaurants_gis`
        (IN units varchar(5), IN lat Decimal(9,6), IN lon Decimal(9,6), 
         IN max_distance SMALLINT, IN limit_rows MEDIUMINT)
 BEGIN 
   DECLARE avg_deg smallint DEFAULT 111; 
   DECLARE kms_to_miles_conversion Decimal(6,5) DEFAULT 0.62137;
   IF units = 'miles' THEN
     SET avg_deg      = avg_deg      * kms_to_miles_conversion;
     SET max_distance = max_distance * kms_to_miles_conversion;
   END IF;  

  SELECT pm1.post_id, 
          p.post_title as restaurant_name_and_address,
          ST_AsText(Point(pm2.meta_value, pm1.meta_value)) as lon_lat,
              CONCAT(
               ROUND(
                   ST_Distance_Sphere(Point(lon, lat), Point(pm2.meta_value, pm1.meta_value)) / 1000 
                     * (CASE units WHEN 'miles' THEN kms_to_miles_conversion ELSE 1 END)
                      , 3)
                   , (CASE units WHEN ' miles' THEN ' miles' ELSE ' kms' END)
              ) as distance
   FROM goodfood_wp_md20m_postmeta AS pm1, 
        goodfood_wp_md20m_postmeta AS pm2,
        goodfood_wp_md20m_posts    AS p 
   WHERE pm1.meta_key = 'latitude' AND pm2.meta_key = 'longitude'
     AND pm1.post_id = pm2.post_id
     AND pm1.post_id = p.id 
     AND p.post_status = 'publish'
     AND ST_Contains( ST_MakeEnvelope(
                         Point((lon+(max_distance/avg_deg)), (lat+(max_distance/avg_deg))),
                         Point((lon-(max_distance/avg_deg)), (lat-(max_distance/avg_deg)))
                    ), Point(pm2.meta_value, pm1.meta_value) )
   ORDER BY distance ASC
   LIMIT limit_rows;
 END

Creating an Envelope using the Average Distance between Longitude and Latitude Degrees

The very first variable declaration sets avg_deg to 111.   That number is the average distance between longitude and latitude degrees, which is 111km.  It’s fairly accurate for latitude, but much more variant for longitude, due to the bulging of the Earth at the equator. 

We use that distance to create the envelope (or bounding box) around our location. 

Take a look at the call to ST_MakeEnvelope() in the WHERE clause and you’ll see that the max_distance is divided by the avg_deg before being added to or subtracted from our location.  Adding to and subtracting from both the latitude and longitude has the effect of creating a diagonal, which results in ST_MakeEnvelope() constructing a polygon using pt1 and pt2 as diagonal points.

While serviceable for small bounding boxes, larger areas will lose precision due to the variations in longitude.  For such cases, you can employ the following formulas to calculate the bounding box longitude coordinates:

lon1 = lon +
(distance/abs(cos(radians(lat))*111))
lon2 = lon - (distance/abs(cos(radians(lat))*111))

The resulting envelope is then passed to the ST_Contains() function to determine whether or not the coordinates of each restaurant falls within the bounding box.

Using ST_Distance_Sphere() to Calculate the Distance between Two Points

In addition to finding the closest restaurants within a given range, the above procedure also returns the distance of each restaurant from our location.  We can now replace the Haversine formula with ST_Distance_Sphere(). It returns the minimum distance in meters between two lon/lat coordinates using a spherical earth and radius of 6,370,986 meters.  

I included some additional calculations for converting from meters to kilometers (and to miles if necessary), rounding, and appending the unit type to each value.

Calling the Procedure

Here is the same search for the ten closest restaurants to the center of London, UK, measured in kilometers, within a hundred kilometer radius:

CALL `restaurants`.`closest_restaurants_gis`('kms', 51.5112139, -0.119824, 100, 10);

post_id

restaurant_name_and_address

lon_lat

distance

2103

Pret A Manger – 135 Strand

POINT(-0.12143076204578 51.510322681011)

0.066 kms

2288

Pret A Manger – 87-88 Strand

POINT(-0.123385 51.509794)

0.149 kms

2093

Pret A Manger – 421/422 Strand

POINT(-0.118399 51.514158)

0.293 kms

2095

Pret A Manger – 29-33 Kingsway

POINT(-0.122915 51.513926)

0.342 kms

2139

Pret A Manger – 65 Long Acre

POINT(-0.1251 51.5103)

0.370 kms

7075

Nando’s – Covent Garden

POINT(-0.114694 51.512557)

0.379 kms

2146

Pret A Manger – 182 Strand

POINT(-0.125737 51.511575)

0.385 kms

1324

Carluccio’s – Covent Garden

POINT(-0.12373537807585 51.5082070105091)

0.411 kms

2088

Pret A Manger – 25 Villiers Street

POINT(-0.124849 51.514033)

0.430 kms

2305

Pret A Manger – 37 Shelton Street

POINT(-0.118865231606 51.5112310396)

0.468 kms

Rounded to three decimal places, the distances obtained using the MySQL 5.7 Geo-Spatial functions are exactly the same as those obtained in the last tutorial.  Just to get a second opinion, I ran the coordinates through an online calculator that does use the Haversine formula.  Again, the numbers match.  For instance, here are the results for Pret A Manger – 87-88 Strand:

Screenshot

Conclusion

Just because these functions will soon be available in an official MySQL release doesn’t implicitly mean that you need to use them.  Just think of them as another tool at your disposal.  The important thing is that you are able to generate the correct results for your particular application.

See all articles by Rob Gravelle

Robert Gravelle
Robert Gravelle
Rob Gravelle resides in Ottawa, Canada, and has been an IT guru for over 20 years. In that time, Rob has built systems for intelligence-related organizations such as Canada Border Services and various commercial businesses. In his spare time, Rob has become an accomplished music artist with several CDs and digital releases to his credit.

Latest Articles