Multilingual linguistic searching and sorting in Oracle
December 24, 2008
Todays final Jeopardy category is language features. The answer is: glyphs, diacritics, canonical equivalence, base letters, contracting characters and expanding ones. Times up. If you wagered everything on, What are factors Oracle considers when sorting strings? then you win. And these are just some of the factors. When performing a sort using one language (i.e., your own) within a character set for your database, the rules (and expected results) are pretty well known and predictable. However, when dealing with a foreign language, you may have no clue as to the rules and therefore would likely have a hard time explaining how the results came about.
The Linguistic Sorting and String Searching chapter within Oracle Database Globalization Support Guide 10g Release 2 (10.2) goes into great detail about what Oracle does or considers but not how - when string sorts and searches are requested. Of the three types of sorts Oracle performs (binary, monolingual and multilingual linguistic), binary is by far and away to the most familiar to you if English is your native language. Sorting is based on what youd expect: something to do with ASCII values because these values translate to numbers, and as previously mentioned, sorting numbers is trivial. Just like what you see when doing a folder listing in Windows, uppercase letters and words appear before lowercase ones.
A problem arises when languages are mixed, because in this case, how reliable are the results? Right now, were only talking about simple A-z, shorthand for A-Z including a-z. English does not use expanded or contracted characters, and more importantly with respect to keeping the complexity low, does not include ideographic characters, such as those commonly found in Asian languages. For example, in the diagram below from http://www.iam.uni-bonn.de/~alt/html/unicode_196.html#id1), a binary sort has no idea how to order the (very simple) T looking symbols.
But not to worry. Oracle can employ a multilingual linguistic sort, which is based on ISO and Unicode standards. Its really a kind of science figuring out how to translate Mandarin Chinese symbols into what is known as PinYin, or the Romanization of Mandarin. The idea behind Romanization is simple: convert a writing system into the Roman (or Latin) alphabet. The Roman alphabet is what you are reading at this very moment, and its not just words, but also carries over into speech.
In English, there are situations where the case does not matter, and that holds true in multilingual sorts. Oracle employs three levels of sorts: primary, secondary and tertiary. The primary level is based on base letters, and a locale can be used to establish the order of those. Any variation of MAP will come before any variation of MOP (maP before MoP, etc.). Many languages use diacritics, or special characters, which change pronunciation or meaning when applied to base letters. Accent marks, umlauts, circumflex and tildes are familiar to you if youve had Spanish, French or German. When diacritics matter for sorting, then youve encountered the secondary level. The third level is when case is applied. By and large, sorting of English strings involves the first and third levels.
How well do you know how strings (anything from a single character to complete words, including non-Roman letters) are sorted using your default NLS settings (what NLS_SORT is set to)? What is the default order, and what are some of the options (and their meaning) you have?
Shown below are three orderings of the same four characters.
More than likely, the value of NLS_SORT (query this from v$nls_parameters) is BINARY. Two other options are BINARY_CI and BINARY_AI. For English, BINARY is what we see and use every day, sort by ASCII values. Under setting 1 (which is BINARY), upper case comes before lower case, and odd or accented/diacritics come last.
The _CI and _AI options for BINARY have suggestive names - fairly easy to guess that CI stands for case insensitive and AI is accent insensitive. BINARY_CI ignores case but considers diacritics (which come last). In BINARY_AI, the third column, case and diacritics (also referred to as accents) are both ignored, so the sorting is based on the base letter.
But what about other sorts, recalling that not only is there binary, but linguistic as well? The value of NLS_COMP determines how NLS_SORT is implemented. The NLS_COMP parameter can have values of BINARY, LINGUISTIC and ANSI. Many SQL operators and functions have an impact on how data is presented, so their behavior is directly affected by NLS_COMP.
Given that data resides in tables, and assuming you have a requirement to not only perform sorts (and searches) involving linguistic strings, but also to have well-performing queries, you wouldnt be wrong in guessing that Oracle supports the use of linguistic indexes. For all practical purposes, a linguistic index is nothing more than a function-based index, and the CREATE INDEX syntax is quite similar. With a mixture of words from different languages, a linguistic index can help improve performance and offers the flexibility to provide sorted output based on whatever language setting you need. It should also be noted that there are some minor restrictions in SQL when using NLS_COMP=LINGUISTIC. The min and max functions and the like operator cannot use a linguistic index.
NLS Settings in SQL Server
One operation frequently performed in a heterogeneous environment is to establish links to other database management systems. Two very common directions are Oracle to SQL Server and vice versa. We already know, or least have a better appreciation thereof, what Oracle considers in determining the alphabetization for a sort. NLS settings in Oracle can be viewed via v$nls_parameters. How does SQL Server, which uses a linked server (database link in Microsoft-speak) implement NLS-like settings and where would you find that setting?
In SQL Server Management Studio, right-click on a database and view the properties. Under Maintenance on the General page, the collation setting determines the Oracle-like NLS behavior. Oracle allows you to change its (and yours within a session) NLS settings, and so does MSSQL. Remaining within the Database Properties window, select the Options page and at the very top is a (very long) drop down list for collation settings.
A common setting is SQL_Latin1_General_CP1_CI_AS, which is essentially nothing more than dictionary order, case-insensitive and will give you the same results as what you use out of the box for English. While creating a linked server, you will have an option to select a collation setting, and that setting will take care of the translation, so to speak, between one NLS environment and another.
There is much we take for granted in what Oracle does to provide us the results of queries. The seemingly innocuous use of SORT BY has the potential to trigger an enormous amount of work behind the scenes. Even in your own environment, how many times have you been momentarily perplexed trying to straighten out the difference between US7ASCII and WE8ISO8859P1 in an export session? For those of you (myself included) who work exclusively in one language set or setting, you should feel quite fortunate for not having to deal with difficult translations, and even more fortunate UNICODE multibyte is something youve heard of but have not had to implement. Searching and sorting strings within Oracle, as it turns out, is a very sophisticated operation. For better overall performance, sort only when needed and sort accordingly.