Sorting Strings in Oracle
December 10, 2008
In the last article, we covered some of the possible implementations Oracle uses to perform searches. Closely related to searching is sorting, and that raises the question of how Oracle performs this key piece of functionality. If youve done it once, youve done it a million times, and that is to include an ORDER BY clause in a SELECT statement, and one of the main goals of this article is to help foster an appreciation of what the RDBMS does or has to go through when you toss in the ubiquitous ORDER BY clause. Fundamentally, how do you sort a list? Do you start at one end and work your through to the end, finding the first ranked element (could be highest or lowest), return to the start of the list (minus the position you just found as being the first) and start over? And how does MIN and MAX fit into this sorting or ordering of elements?
Lets start with MIN and MAX, as these can be boiled down to trivial cases. Thats not trivial in the sense of we have nothing better to do, but trivial in the sense of mind-numbing exotic proofs your Real Analysis professor used to show by filling up every chalkboard in the classroom. In this case, the search for MIN and MAX means we only care about the current min or max value. Once a candidate is compared to the current leader, it either replaces it or is discarded, never to be bothered with again. The search for MIN/MAX should then have one benefit over a complete sort: only one pass through the list is needed.
So what is of more interest to us is how Oracle sorts a list. And list can mean more than one source of data. For example, if joining and sorting two tables, would it be more efficient to first sort each table and then perform a merge-sort, or would it be better to perform the full join, and then sort the results? If you are familiar with execution plans, you may already suspect the answer given the fact Oracle often uses a merge join in such cases (a sort merge join). The idea is simple: sort one list, then the other (concurrently if possible), start at the top of both (sub) lists and merge them together to produce the final list. And to extend this idea, the source can be just one list that is recursively divided or divided into buckets (think partitions). Taking a large piece of work and breaking it up into smaller units of work, commonly known as divide and conquer, has widespread use in Oracle.
Speaking of lists, how is the data arrayed within a list? The ideal (or trivial) case would be sorting an already sorted (in the same order) list. A relatively (depending on the algorithm used) worst case would be sorting an already sorted list, but this time you are reversing the sort order. Why would you do that? Probably because you didnt know the data was already sorted or you simply need the list presented in the opposite order. Where does the data live? If it came from an index, there may not be as much work to do since the index could have stored the data in sorted order to begin with. But, if the data comes from a table, who knows how it is stored? Four possible cases (or five, if the last one is extended a bit) are shown below. Is completely (as much as it can be) random data easier or harder to sort than something which is nearly sorted, reversed, or relatively unique (the extended case being 100% unique)?
Figure 1. Various Arrays of Data (from http://www.sorting-algorithms.com/)
The answer to that question depends on the sorting algorithm used. Since practically all of what Oracle does is designed to be performed in the most efficient manner possible (including human efforts to introduce the worst possible SQL), it would be reasonable to assume that the RDBMS uses the best sorting algorithm known to man. If there were one and only one best sorting algorithm, great, but that is not the case. With respect to sorting algorithms, there is no one best method. So, which method does Oracle use? Like the answer to many other questions related to optimization, it depends.
What are some of the choices? These are typically presented in algorithm design classes, and the list includes insertion, selection, bubble (and variations thereof), quicksort, merge and heap. For small lists (small being relative, but generally less than 1000 elements), some algorithms are faster than others, but even within the same algorithm, how the data is arrayed can make a difference. As an interesting exercise, one could code some of the algorithms using PL/SQL, populate arrays/collections with data, and time the sorting results. When using the ORDER BY clause in SQL, which method does Oracle choose to use when and where? If you were designing your own RDBMS, you would have to re-invent this wheel on your own. We know sorting is a key element of using and viewing data, and countless forum questions have been registered about sort_area_size.
For a visual representation of how sorting algorithms look with respect to performance (look at the relative performance, not the absolute times), go to http://www.sorting-algorithms.com/, a site developed by David R. Martin of Boston College. Other Web sites show visual animations much like this one, but sorting-algorithms.com makes choosing, viewing, and comparing methods easy to do.
Another factor to consider when it comes to sorting concerns the rules used to sort data. If we compare Apples to Apples, does it matter which element comes first, that is, do we preserve the order as is? Does it matter if Apples comes before or after APPLES or even apples? And to make things more interesting, what if your language spells apples as applés? Sorting numbers is pretty much a no-brainer, but sorting strings introduces much more complexity, and that leads to the next topic in this series: linguistic sorting. There is an amazing amount of rules and considerations to take into account when sorting strings. If all youve ever worked with is your own language set, then seeing how Oracle deals with other languages will be somewhat of an eye-opener.
Again, the emphasis here is to gain an appreciation of what Oracle has to offer with respect to searching and sorting, and especially so when it comes to linguistic sorting. You may not know this, but you do have some control over how sorts are performed within your databases character set and you can compare Apples to apples, as it turns out.