SHARE

In Search of a Regex Replace Function for MySQL

Written By

Apr 11, 2016

5 minute read

One of my favorite TV shows as a boy was In Search of…, hosted by the late great Leonard Nimoy. Every week he would explore some unusual natural phenomena or paranormal occurrence. It never failed to stoke the fires of my imagination! Fast forward to 2016, and I now found myself on a quest of my own. While MySQL does feature Regular Expression matching via REGEXP operator it does not offer an equivalent to Oracle’s REGEXP_REPLACE function, which is all the more mysterious when you consider that Oracle owns MySQL! Join me as we try to account for this incongruence between the two DBMSes and search for user-defined replacements.

Origins and Uses

Regex replace functions can be traced all the way back to the sed UNIX utility. Developed in 1973/74, it parsed and transformed text, using a search formatted as “s/regexp/replacement/”. These extend the functionality of regular REPLACE functions by allowing you to search by Regex pattern. Beyond that, different implementations feature all sorts of specialized features, such as the ability to replace one or all matches, where to start matching, etc…

Let’s use Oracle’s REGEXP_REPLACE function as a guide. It accepts the following parameters:

REGEXP_REPLACE(
string, pattern [, replacement_string [, start_position [, th_appearance [,
match_parameter ] ] ] ] )

string: The string to search. It can be CHAR, VARCHAR2, NCHAR, NVARCHAR2, CLOB, or NCLOB.
pattern: The regular expression matching information.
replacement_string: Optional. Matched patterns will be replaced with replacement_string in string. If the replacement_string parameter is omitted, the function simply removes all matched patterns, and returns the resulting string.
start_position: Optional. It is the position in string where the search will start. If omitted, it defaults to 1 which is the first position in the string.
nth_appearance: Optional. It is the nth appearance of pattern in string. If omitted, it defaults to 1 which is the first appearance of pattern in string. If you specify 0 for this parameter, all appearances of pattern will be replaced in string.
match_parameter: Optional. It allows you to modify the matching behavior for the REGEXP_REPLACE function. It can be a combination of the following:
- ‘c’: Perform case-sensitive matching.
- ‘i’: Perform case-insensitive matching.
- ‘n’: Allows the period character (.) to match the newline character. By default, the period is a wildcard.
- ‘m’: expression is assumed to have multiple lines. By default, expression is assumed to be a single line.
- ‘x’: Whitespace characters are ignored. By default, whitespace characters are matched like any other character.

As for its imminent inclusion in MySQL, there has been a feature request since 2007. No dice so far…

Testing Candidate Functions

I applied candidate user functions against the open source Sakila sample database. It was developed by Mike Hillyer, a former member of the MySQL AB documentation team, and was created specifically for the purpose of providing a standard schema for use in books, tutorials, articles, and the like. I described the steps for downloading and installing it in my Generating Reports on MySQL Data article. Take a look at it if you’d like to install it yourself.

Rasika Godawatte’s regex_replace() Function

Software Engineer Rasika Godawatte needed a regex_replace function so he wrote one himself. He shares the result on his blog.

In MySQL 5.0.1 or higher, you have to set the NO_BACKSLASH_ESCAPES mode ON, before you can use the above function to replace any characters which are escaped with a back slash “”, i.e.: A,B,etc…

SET sql_mode = 'NO_BACKSLASH_ESCAPES';

Here is the function source code:

DELIMITER $$
 CREATE FUNCTION `regex_replace`(pattern VARCHAR(1000),replacement VARCHAR(1000),original VARCHAR(1000))
RETURNS VARCHAR(1000)
 DETERMINISTIC
 BEGIN 
  DECLARE temp VARCHAR(1000); 
  DECLARE ch VARCHAR(1); 
  DECLARE i INT;
  SET i = 1;
  SET temp = '';
  IF original REGEXP pattern THEN 
   loop_label: LOOP 
    IF i>CHAR_LENGTH(original) THEN
     LEAVE loop_label;  
    END IF;
    SET ch = SUBSTRING(original,i,1);
    IF NOT ch REGEXP pattern THEN
     SET temp = CONCAT(temp,ch);

   ELSE
     SET temp = CONCAT(temp,replacement);
    END IF;
    SET i=i+1;
   END LOOP;
  ELSE
   SET temp = original;
  END IF;
  RETURN temp;
 END$$
 DELIMITER ;

We can apply Rasika’s function to the Sakila database to replace all non-alphanumeric characters (excluding spaces) using the following query:

SELECT description as 'before',
        regex_replace([^A-Za-z0-9 ]', '[[MATCH]]', description) as 'after'
 FROM sakila.film
 where description REGEXP '[^A-Za-z0-9 ]';

From the first three returned rows we can see that the hyphens (-) were matched:

Before                                                 After
 ---------------------------------------------------------------------------------------------------------------
 A Fast-Paced Documentary of a Pastry Chef And a...  | A Fast[[MATCH]]Paced Documentary of a Pastry Chef And a...
 A Action-Packed Tale of a Man And a Lumberjack...   | A Action[[MATCH]]Packed Tale of a Man And a Lumberjack...
 A Action-Packed Drama of a Dentist And a Croc...    | A Action[[MATCH]]Packed Drama of a Dentist And a Croc...
 ...

Other Regex operators such as the start of string (^), end of string ($), and quantifiers, including zero or one (?), one or more (+) and zero or more (*), are also permitted.

The main limitation – and it’s a big one – is that, because the REGEX function does not return the matched string, it is impossible to know exactly what substring was matched. Hence,

any DIY regex_replace() function must limit itself to single character patterns.

I’ve tried a few other variations of the above function, but they all do pretty much the same thing.

Another Option: UDF Functions

If you’re really bound and determined to approximate Oracle’s REGEXP_REPLACE function, then your best option is a User Defined Function (UDF). A UDF is a compiled function that behaves just like a native MySQL function. UDFs are usually written in C or C++ and added to MySQL using the CREATE FUNCTION command.

As with non-compiled user regex_replace() functions, there’s little imperative for writing one yourself, as there are plenty of excellent choices to be found online. Here are a couple:

Arnold Daniels’ lib_mysqludf_preg is a library of MySQL UDFs that provide access to the PCRE (perl compatible-regular-expressions) library for pattern matching. The PCRE library is a set of functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5. This syntax can often handle more complex expressions and capturing than standard regular expression implementations.
Hartmut Holzgraefe’s mysql-udf-regexp library is mentioned as a workaround on the MySQL feature page. It looks suspiciously like Arnold Daniels’ library. Not sure what the difference is.

Conclusion

At the beginning of this article we embarked on a search for user defined regex replace functions. What we found was that you’ll have to limit your scope to single character patterns unless you are willing to expend the extra effort to compile and install a UDF.

See all articles by Rob Gravelle

Robert Gravelle

Rob Gravelle resides in Ottawa, Canada, and has been an IT guru for over 20 years. In that time, Rob has built systems for intelligence-related organizations such as Canada Border Services and various commercial businesses. In his spare time, Rob has become an accomplished music artist with several CDs and digital releases to his credit.