Multilingual linguistic searching and sorting in Oracle

Today’s final Jeopardy category is language features. The
answer is: glyphs, diacritics, canonical equivalence, base letters, contracting
characters and expanding ones. Time’s up. If you wagered everything on, “What
are factors Oracle considers when sorting strings?” then you win. And these are
just some of the factors. When performing a sort using one language (i.e., your
own) within a character set for your database, the rules (and expected results)
are pretty well known and predictable. However, when dealing with a foreign
language, you may have no clue as to the rules and therefore would likely have
a hard time explaining how the results came about.

The “Linguistic Sorting and String Searching” chapter within
Oracle Database Globalization Support Guide 10g Release 2 (10.2)
goes into great detail about what Oracle does or considers – but not how – when
string sorts and searches are requested. Of the three types of sorts Oracle
performs (binary, monolingual and multilingual linguistic), binary is by far
and away to the most familiar to you if English is your native language.
Sorting is based on what you’d expect: something to do with ASCII values
because these values translate to numbers, and as previously mentioned, sorting
numbers is trivial. Just like what you see when doing a folder listing in
Windows, uppercase letters and words appear before lowercase ones.

A problem arises when languages are mixed, because in this
case, how reliable are the results? Right now, we’re only talking about simple
A-z, shorthand for A-Z including a-z. English does not use expanded or
contracted characters, and more importantly with respect to keeping the
complexity low, does not include ideographic characters, such as those commonly
found in Asian languages. For example, in the diagram below from http://www.iam.uni-bonn.de/~alt/html/unicode_196.html#id1),
a binary sort has no idea how to order the (very simple) “T” looking symbols.

But not to worry. Oracle can employ a multilingual
linguistic sort, which is based on ISO and Unicode standards. It’s really a
kind of science figuring out how to translate Mandarin Chinese symbols into
what is known as PinYin, or the Romanization of Mandarin. The idea behind
Romanization is simple: convert a writing system into the Roman (or Latin)
alphabet. The Roman alphabet is what you are reading at this very moment, and it’s
not just words, but also carries over into speech.

In English, there are situations where the case does not
matter, and that holds true in multilingual sorts. Oracle employs three levels
of sorts: primary, secondary and tertiary. The primary level is based on base
letters, and a locale can be used to establish the order of those. Any
variation of MAP will come before any variation of MOP (maP before MoP, etc.).
Many languages use diacritics, or special characters, which change
pronunciation or meaning when applied to base letters. Accent marks, umlauts,
circumflex and tildes are familiar to you if you’ve had Spanish, French or
German. When diacritics matter for sorting, then you’ve encountered the
secondary level. The third level is when case is applied. By and large, sorting
of English strings involves the first and third levels.

How well do you know how strings (anything from a single
character to complete words, including non-Roman letters) are sorted using your
default NLS settings (what NLS_SORT is set to)? What is the default order, and
what are some of the options (and their meaning) you have?

Shown below are three orderings of the same four characters.

NLS setting 1

NLS setting 2

NLS setting 3

A

a

ä

Z

A

a

a

Z

A

ä

ä

Z

More than likely, the value of NLS_SORT (query this from v$nls_parameters)
is BINARY. Two other options are BINARY_CI and BINARY_AI. For English, BINARY
is what we see and use every day, sort by ASCII values. Under setting 1 (which
is BINARY), upper case comes before lower case, and “odd” or
accented/diacritics come last.

The _CI and _AI options for BINARY have suggestive names – fairly
easy to guess that CI stands for case insensitive and AI is accent insensitive.
BINARY_CI ignores case but considers diacritics (which come last). In
BINARY_AI, the third column, case and diacritics (also referred to as accents)
are both ignored, so the sorting is based on the base letter.

But what about other sorts, recalling that not only is there
binary, but linguistic as well? The value of NLS_COMP determines how NLS_SORT
is implemented. The NLS_COMP parameter can have values of BINARY, LINGUISTIC
and ANSI. Many SQL operators and functions have an impact on how data is
presented, so their behavior is directly affected by NLS_COMP.

Given that data resides in tables, and assuming you have a
requirement to not only perform sorts (and searches) involving linguistic
strings, but also to have well-performing queries, you wouldn’t be wrong in
guessing that Oracle supports the use of linguistic indexes. For all practical
purposes, a linguistic index is nothing more than a function-based index, and
the CREATE INDEX syntax is quite similar. With a mixture of words from
different languages, a linguistic index can help improve performance and offers
the flexibility to provide sorted output based on whatever language setting you
need. It should also be noted that there are some minor restrictions in SQL
when using NLS_COMP=LINGUISTIC. The min and max functions and the like operator
cannot use a linguistic index.

NLS Settings in SQL Server

One operation frequently performed in a heterogeneous
environment is to establish links to other database management systems. Two
very common directions are Oracle to SQL Server and vice versa. We already
know, or least have a better appreciation thereof, what Oracle considers in
determining the “alphabetization” for a sort. NLS settings in Oracle can be
viewed via v$nls_parameters. How does SQL Server, which uses a linked server
(database link in Microsoft-speak) implement NLS-like settings and where would
you find that setting?

In SQL Server Management Studio, right-click on a database
and view the properties. Under “Maintenance” on the General page, the collation
setting determines the Oracle-like NLS behavior. Oracle allows you to change its
(and yours within a session) NLS settings, and so does MSSQL. Remaining within
the Database Properties window, select the Options page and at the very top is
a (very long) drop down list for collation settings.

A common setting is SQL_Latin1_General_CP1_CI_AS, which is
essentially nothing more than dictionary order, case-insensitive and will give
you the same results as what you use out of the box for English. While creating
a linked server, you will have an option to select a collation setting, and
that setting will take care of the translation, so to speak, between one NLS environment
and another.

In Closing

There is much we take for granted in what Oracle does to
provide us the results of queries. The seemingly innocuous use of SORT BY has
the potential to trigger an enormous amount of work behind the scenes. Even in
your own environment, how many times have you been momentarily perplexed trying
to straighten out the difference between US7ASCII and WE8ISO8859P1 in an export
session? For those of you (myself included) who work exclusively in one
language set or setting, you should feel quite fortunate for not having to deal
with difficult translations, and even more fortunate UNICODE multibyte is
something you’ve heard of but have not had to implement. Searching and sorting
strings within Oracle, as it turns out, is a very sophisticated operation. For
better overall performance, sort only when needed and sort accordingly.

»


See All Articles by Columnist
Steve Callan

Steve Callan
Steve Callan
Steve is an Oracle DBA (OCP 8i and 9i)/developer working in Denver. His Oracle experience also includes Forms and Reports, Oracle9iAS and Oracle9iDS.

Latest Articles