How does Oracle compare and sort data. There is more to it than
meets the eye.
Ever since I was a newbie to databases I never questioned how sorting
occurred in a database. I just knew that there was a predefined order to the
character set (English) and I never thought much more of it. We all get too
complacent about equating predicates in a WHERE clause or adding an ORDER BY
clause to sort our output to put priority to the data we view. If you care to
delve into how Oracle determines how to sort, there is a good write up in the
Oracle documentation set that discusses primary (first) level, secondary, and
tertiary level sorting mechanisms. I would encourage you to search on these
sorting levels as the reading gives great insight into the complexities that
need to be overcome. We are, in this article, more concerned with the setting
of variables to get the desired outcome for the locale we need to present data
to.
Well these past few months, yes months, I have been playing
around with this globalization stuff and have continually been stumped by what
is in Oracle documentation and what I can actually get to work on my local
databases. After getting over the hang-up of operating system variables over
riding my Oracle client’s session variables it is now time to get a bit deeper
into what these variables do. In my opinion the two most important variables
that can get set for globalization are the NLS_COMP and NLS_SORT variables.
These two variables themselves define how a client will experience the data
that gets presented to them. They control how characters get compared to each
other and how they sort. Table 1 gives a definition of these two
variables.
Table 1 NLS_COMP & NLS_SORT Defined
PARAMETER |
Descripton |
NLS_COMP |
The sole purpose of this variable is to determine how the There are only two settings. BINARY : All comparisons will be done in accordance ANSI : All comparisons will be done linguistically How to set
|
NLS_SORT |
If the value is If the value is a named linguistic sort, sorting is based There are many values. BINARY : All sorting is done in accordance to the binary V$NLS_VALID_VALUES : All sorting is done in How to set
|
Newsflash Setting NLS_LANG environment
variable
It has come to my
attention that the setting of your NLS_LANG environment variable is of the
utmost importance. This little variable, if not set properly, will create havoc
within your database. It should be set to a valid setting that takes the form
of <nls_language>_<nls_territory>.<nls_characterset> . The
concern comes when you have a registry entry on your client machine that is not
compatible with the database you are connecting to. In my case it was a
difference in character set which caused incompatibilities between clients at
different locales. You can find these values from the view V$NLS_VALID_VALUES.
For example my NLS_LANG
variable was set to AMERICAN_AMERICA.WE8MSWIN1252
In the first
article on globalization in Oracle I presented you with a table on some
common characters in the English and German language. Table 2 re-visits
this data. You should revisit the first article if you want to create this
table and query it. Table 2 gives you a few examples so that you can see
how setting the NLS_SORT variable will alter the output of a SELECT … ORDER BY
statement.
Table 2 Example data sorted on different NLS_COMP
& NLS_SORT values
select xgraphic from char_table order by xgraphic;
NLS_COMP=ANSI NLS_SORT=BINARY |
NLS_COMP=ANSI NLS_SORT=BINARY_CI |
NLS_COMP=ANSI NLS_SORT=BINARY_AI |
XDECIMAL XBINARY XGRAPHIC ——– ———- ——— 65 1000001 A 66 1000010 B 67 1000011 C 68 1000100 D 79 1001111 O 85 1010101 U 97 1100001 a 98 1100010 b 99 1100011 c 100 1100100 d 111 1101111 o 117 1110101 u 196 11000100 Ä 214 11010110 Ö 220 11011100 Ü 228 11100100 ä 246 11110110 ö 252 11111100 ü |
XDECIMAL XBINARY XGRAPHIC ——– ———- ——— 65 1000001 A 97 1100001 a 66 1000010 B 98 1100010 b 67 1000011 C 99 1100011 c 68 1000100 D 100 1100100 d 79 1001111 O 111 1101111 o 85 1010101 U 117 1110101 u 196 11000100 Ä 228 11100100 ä 214 11010110 Ö 246 11110110 ö 220 11011100 Ü 252 11111100 ü |
XDECIMAL XBINARY XGRAPHIC ——– ———- ——– 65 1000001 A 97 1100001 a 228 11100100 ä 196 11000100 Ä 66 1000010 B 98 1100010 b 67 1000011 C 99 1100011 c 68 1000100 D 100 1100100 d 79 1001111 O 111 1101111 o 214 11010110 Ö 246 11110110 ö 85 1010101 U 117 1110101 u 252 11111100 ü 220 11011100 Ü |
To take a look at what is really happing behind the scene
here we need to extract the numerical representation of a character to see how
it sorts. This can be done by imposing the NLSSORT function around the
character string in question. Table 3 presents the output of just such a
query. The second column now uses the _CI (case insensitivity) extension. The
item to glean from this output is that the upper case letter A has the same
NLSSORT value as the lower case letter a and the ä character now has the same nlssort
value as Ä. Now when we get to _AI (accent insensitivity & case
insensitivity) in column three we have a condition where the nlssort value is
the same for a set of characters such that a=A=ä=Ä. WOW! Now I encourage you to
run under these three scenarios and issue the following SQL.
SELECT xgraphic from char_table where xgraphic = `a’;
Table 3 Example data sorted on different NLS_COMP
& NLS_SORT values with NLSSORT
select xdecimal, nlssort(xgraphic) nlssort,xgraphic from char_table
order by xgraphic;
NLS_COMP=ANSI NLS_SORT=BINARY |
NLS_COMP=ANSI NLS_SORT=BINARY_CI |
NLS_COMP=ANSI NLS_SORT=BINARY_AI |
XDECIMAL NLSSORT XGRAPHIC ——– ———- ——— 65 4100 A 66 4200 B 67 4300 C 68 4400 D 79 4F00 O 85 5500 U 97 6100 a 98 6200 b 99 6300 c 100 6400 d 111 6F00 o 117 7500 u 196 C38400 Ä 214 C39600 Ö 220 C39C00 Ü 228 C3A400 ä 246 C3B600 ö 252 C3BC00 ü |
XDECIMAL NLSSORT XGRAPHIC ——– ———- ——– 65 6100 A 97 6100 a 66 6200 B 98 6200 b 67 6300 C 99 6300 c 68 6400 D 100 6400 d 79 6F00 O 111 6F00 o 85 7500 U 117 7500 u 196 C3A400 Ä 228 C3A400 ä 214 C3B600 Ö 246 C3B600 ö 220 C3BC00 Ü 252 C3BC00 ü |
XDECIMAL NLSSORT XGRAPHIC ——– ———- ——– 65 6100 A 97 6100 a 228 6100 ä 196 6100 Ä 66 6200 B 98 6200 b 67 6300 C 99 6300 c 68 6400 D 100 6400 d 79 6F00 O 111 6F00 o 214 6F00 Ö 246 6F00 ö 85 7500 U 117 7500 u 252 7500 ü 220 7500 Ü |
Dealing with a multi lingual database is confusing at best. But
when we can get a glimpse at the internal numbers associated with a sort, it
starts to give us some insights into how things work and why we are sorting in
a particular order. Unfortunately this article has brought up just as many
questions as I think I have answered for myself. Hopefully for you also. If
there is just one item I would like everyone to get out of this article, it is
that you MUST set the NLS_LANG variable properly. DO NOT rely upon past
settings and PLEASE verify the setting in your registry. Globalization is a
wild ride, please hold on.