Searching and Sorting Strings in Oracle
November 20, 2008
When it comes to searching, how often do you find yourself using the LIKE (including NOT LIKE) operator in SQL statements? Other well-known tests include equality and comparison (greater or less than), and this includes not only strings, but also numbers. Conceptually, the equality/comparison tests are pretty simple, and indexes can add a tremendous boost to performance when these tests are used. However, what takes place when were not looking for an exact match, but something, well, LIKE the value of interest? Mathematically, or at least algorithm-wise, constructing a test for equality is pretty simple. How do you suppose Oracle approaches the problem of determining if the string ABCD appears in the string ABCABDABCDAB and if it does, how many times (as in the use of INSTR)? Further, once we have a result set, what determines the sorted order?
The focus of Part 1 is on searching strings. In Part 2, Ill discuss sorting strings.
One thing that stands out in string searching is relevance. When testing equality or for a range, the values which are returned can be ranked or ordered. For X>500, a returned element of 1000 probably has much more significance than an element valued at 501. For strings, it is difficult to assign any meaning to Johns versus Johnson when searching for names like John. I may be interested in the first and you may be interested in the second name. There will only be relevance to the user at that time, but for someone else, the opposite may be true. For a scalar, 501 will always be greater than 500, so the interpretation is always the same.
From an algorithm design standpoint, we want the process to be performant. The complexity of the algorithm should be first order, as in O(n) (the big O notation) as opposed to O(n2), nLogO(n) and so on. This implies that the algorithm is efficient, and efficiency can be qualified with goals of not repeating work, skipping over intervals, and be deterministic (we can compute the best case and worst case).
In addition to string searching within SQL, virtually every text editor incorporates a search function. Without knowing the implementation details or underlying code, it is hard to determine exactly how a search editor works. Like most algorithms, there will be cases where some general or even trivial condition will make the algorithm scream along or slow to a crawl. Searching for a word within a string, where the word is longer than the searched string, is an example of a case where adding a simple length check up front will eliminate at least one condition. On the other hand, searching for something not very distinguishable (but distinguishable enough because the pattern does match) can evoke a worst-case performance (search for A in the string BBBBBB BBBA, no hit until the very end). Overall, the search should be done in no worse than what is called linear time, or O(n).
So, from which end of the target string should the search start? One widely used algorithm, because of its efficiency, is the Boyer-Moore string search algorithm. This algorithm actually starts at the end and works backwards. Once a miss is encountered (a bad character, that is, one that is not in the search pattern/string), the algorithm knows to shift the search over by the length of the search. If the bad character falls anywhere within the positions covered by the length, then it is impossible to find a match within that interval as the search string would span the bad character.
Now that we know the general approach of this algorithm, how do you think it applies to Oracle? If the test operator is LIKE, perhaps Oracle uses Boyer-Moore because we, and subsequently Oracle, dont care where the pattern is met or how many times it is met, just that it is met at least once. On the other hand, if using INSTR, we have a choice as to where to start the search (beginning or end), in addition to which occurrence. The fact that this function defaults to the beginning of the target should not imply anything about how Oracle conducts the search. For all we know, Oracle may internally reverse the string for search purposes (which can be quickly done) and leave the default start position as the beginning because this is more intuitively obvious to humans (not implying Oracle is alien, just that in English, we tend to start from left to right).
The example for the INSTR function in the SQL Reference guide is shown below, followed by a similar query where I let the default start position and occurrence both go to 1.
SQL> SELECT INSTR('CORPORATE FLOOR','OR', 3, 2) 2 "Instring" FROM DUAL; Instring ---------- 14 SQL> ed Wrote file afiedt.buf 1 SELECT INSTR('CORPORATE FLOOR','OR') 2* "Instring" FROM DUAL SQL> / Instring ---------- 2
This would suggest that defaulting to 1,1 for start and occurrence AND not caring about the position returned by the function is equivalent to using the pattern matching operation of LIKE.
What may not be obvious at this point is this fact: the longer the search string/pattern, the quicker the search is likely to be performed. If you divide a searched word (or string) of length N by the length M of the search pattern, you have at most N/M chunks to search. The more information, that is, the longer the search pattern is, you can provide up front, the less work the search algorithm has to perform. This is akin to searching for a needle in a haystack. It should be obvious that the longer the needle, the easier it will be to find. The same holds true for string searches.
However, does this hold true in Oracle? A simple test suggests that Boyer-Moore may not be the algorithm used for string searches.
set serveroutput on declare l_start number; v_count number; cursor c is select last_name from hr.employees; begin execute immediate 'alter system flush shared_pool'; execute immediate 'alter system flush buffer_cache'; l_start := dbms_utility.get_time; for r in c loop v_count := instr(r.last_name,'x'); end loop; dbms_output.put_line('One char time = '|| to_char(dbms_utility.get_time - l_start)); execute immediate 'alter system flush shared_pool'; execute immediate 'alter system flush buffer_cache'; l_start := dbms_utility.get_time; for r in c loop v_count := instr(r.last_name,'xxxxxxxxxx'); end loop; dbms_output.put_line('Multi char time = '|| to_char(dbms_utility.get_time - l_start)); end; / One char time = 12 Multi char time = 62 PL/SQL procedure successfully completed.
Repeated numerous times, and all else being equal, the time ratio stays at least 1 to 4 for the single character versus longer string search, regardless if the search starts at the beginning or end. I would suspect two things about Oracles string searching algorithm in support of ANSI SQL: there may not be a length check performed beforehand (its extra work, so reasonable to assume this), and the start end does matter.
The string searches mentioned so far are based on pattern matching. Another way to classify this type of search is hit or miss. What about close, as in close only counts in hand grenades and horseshoes? Oracle Text, in fact, considers close or fuzzy to be a type of search. Queries are based on CONTAINS and documents that match are returned in a hitlist. Our searching via SQL can emulate fuzzy searching to some degree by use of wildcard expansion, but thats for another article.