The Globalization of Language in Oracle – Terminology

With today’s exploding world economy, multi-national communication is essential. Databases must not only store different character sets but also present information in a comfortable format and order for individuals from every locale. This series explores how to globalize your databases and communicate effectively across the globe.

Let’s begin learning about how sorting
and character comparisons happen in a global market place by first
understanding some of the terminology.

You can see
in Listing 1 that I have created a table called CHAR_TABLE and inserted some
characters along with their decimal and binary equivalents. I then decided to
order these rows on the decimal, binary, and graphical representation. Since
the binary number is generated from the binary number, it would be assumed that
the sort order would be the same–and it was. However, when we sort on the graphical
character we get a somewhat different order for the non-American characters. We
must then ask ourselves why this happens. What is different about the graphical
character representation that makes the sort work differently than that on the
decimal value?

While we might all wish everyone to speak the same language,
it surely isn’t going to happen in most of our life times. While we are moving
closer and closer to a real one-world economy, we will most surely still need
to converse in multiple languages. This requires us to be able to store, retrieve,
and manipulate these languages within our databases. The issue doesn’t much lie
in storage and retrieval, although there are a few minor issues, but instead it
lies in how to provide an environment that gives linguistic meaning for the
part of the globe that is viewing the information. This is most apparent when
trying to perform equality and sorting on characters to retrieve the
information that the user needs.

Listing 1

Example on sorting different characters:
Create table to hold some characters


CREATE TABLE char_table (xDecimal NUMBER, xBinary NUMBER, xGraphic CHAR(4));
INSERT INTO char_table VALUES (65 ,01000001 ,’A’);
INSERT INTO char_table VALUES (66 ,01000010 ,’B’);
INSERT INTO char_table VALUES (67 ,01000011 ,’C’);
INSERT INTO char_table VALUES (68 ,01000100 ,’D’);
INSERT INTO char_table VALUES (79 ,01001111 ,’O’);
INSERT INTO char_table VALUES (85 ,01010101 ,’U’);
INSERT INTO char_table VALUES (97 ,01100001 ,’a’);
INSERT INTO char_table VALUES (98 ,01100010 ,’b’);
INSERT INTO char_table VALUES (99 ,01100011 ,’c’);
INSERT INTO char_table VALUES (100 ,01100100 ,’d’);
INSERT INTO char_table VALUES (111 ,01101111 ,’o’);
INSERT INTO char_table VALUES (117 ,01110101 ,’u’);
INSERT INTO char_table VALUES (196 ,11000100 ,’Ä’);
INSERT INTO char_table VALUES (214 ,11010110 ,’Ö’);
INSERT INTO char_table VALUES (220 ,11011100 ,’Ü’);
INSERT INTO char_table VALUES (228 ,11100100 ,’ä’);
INSERT INTO char_table VALUES (246 ,11110110 ,’ö’);
INSERT INTO char_table VALUES (252 ,11111100 ,’ü’);

Select the Characters and order
them on the Decimal, Binary, and Character value:


SELECT * FROM char_table ORDER BY xDecimal;
SELECT * FROM char_table ORDER BY xBinary;
SELECT * FROM char_table ORDER BY xGraphic;

ORDER BY xDecimal

ORDER BY xBinary

ORDER BY xGraphic

XDECIMAL XBINARY XGRA

——— ———- —-

65 1000001 A

66 1000010 B

67 1000011 C

68 1000100 D

79 1001111 O

85 1010101 U

97 1100001 a

98 1100010 b

99 1100011 c

100 1100100 d

111 1101111 o

117 1110101 u

196 11000100 Ä

214 11010110 Ö

220 11011100 Ü

228 11100100 ä

246 11110110 ö

252 11111100 ü

XDECIMAL XBINARY XGRA

——– ———- —-

65 1000001 A

66 1000010 B

67 1000011 C

68 1000100 D

79 1001111 O

85 1010101 U

97 1100001 a

98 1100010 b

99 1100011 c

100 1100100 d

111 1101111 o

117 1110101 u

196 11000100 Ä

214 11010110 Ö

220 11011100 Ü

228 11100100 ä

246 11110110 ö

252 11111100 ü

XDECIMAL XBINARY XGRA

——– ———- —-

65 1000001 A

66 1000010 B

67 1000011 C

68 1000100 D

79 1001111 O

85 1010101 U

97 1100001 a

98 1100010 b

99 1100011 c

100 1100100 d

111 1101111 o

117 1110101 u

252 11111100 ü

220 11011100 Ü

196 11000100 Ä

246 11110110 ö

228 11100100 ä

214 11010110 Ö

QUICK TIP: Creating special characters from the keyboard
Hold the <ALT>-key down while punching in the ASCII code in the number pad.
You will need to prefix the decimal number with a zero.
So if you wanted to type in a Ü then you would hold down the <ALT>-key and key in 0220 with the number pad.


It is not usually my style to begin with an example but I
think the example in Listing 1 brings a few items to the forefront. You can see
in Listing 1 that I have created a table called CHAR_TABLE and inserted some
characters along with their decimal and binary equivalents. I then decided to
order these rows on the decimal, binary, and graphical representation. Since
the binary number is generated from the binary number, it would be assumed that
the sort order would be the same–and it was. However, when we sort on the graphical
character we get a somewhat different order for the non-American characters. We
must then ask ourselves why this happens. What is different about the graphical
character representation that makes the sort work differently than that on the
decimal value?

While we might all wish everyone to speak the same language,
it surely isn’t going to happen in most of our life times. While we are moving
closer and closer to a real one-world economy, we will most surely still need
to converse in multiple languages. This requires us to be able to store, retrieve,
and manipulate these languages within our databases. The issue doesn’t much lie
in storage and retrieval, although there are a few minor issues, but instead it
lies in how to provide an environment that gives linguistic meaning for the
part of the globe that is viewing the information. This is most apparent when
trying to perform equality and sorting on characters to retrieve the
information that the user needs.

In subsequent articles, I will take us through a journey of
handling and configuring our databases so that we may be able to have it
converse within a global economy and provide a local flavor of the data to who
ever is viewing it. As there are many definitions revolving around language and
data comparison, I thought it only proper to explore the terminology around the
globalization of language before getting much deeper into this topic. I am
positive that after reviewing these definitions you will soon gain some insight
into why this is such a difficult task for a database vendor to overcome.

Terminology

locale

The environment that your database system is being
accessed from and has a desire to have information displayed and handled in
its native format.

Language

Signifies a part of the world and dictates specific
conventions on how character data is displayed, sorted, and compared.

Territory

Signifies a part of the world and the specific conventions
on defaults and formats for displaying the character set.

Character Set

Specifies the character set used by the client application
(normally the Oracle character set that corresponds to the user’s terminal
character set or the OS character set). Each supported character set has a
unique acronym, for example, US7ASCII, WE8ISO8859P1, WE8DEC, WE8MSWIN1252, or
JA16EUC. Each language has a default character set associated with it.

Case

Refers to the condition of being uppercase or lowercase

Sort order

Phonetic or character appearance in a list

Comparison

The method of looking at a character or character set and
determining their relationship in equality or sorting order.

Diacritic

A mark near or through a character or character
combination

Binary Value

Every character evaluates to a binary number (a series of
1’s and 0’s)

Binary Sort

The order of a set or group of characters that are sorted
by their binary equivalent. This type of sort order works well for English
but once foreign languages are introduced, it cannot be relied upon.

ASCII

For us Americans, this is the sort we are very familiar
with as the letters A thru Z evaluate since it was our American Standard
(ASCII) codes we developed.

Linguistic Sort

A sort sequence that matches the alphabetic sequence of
characters independently of their numeric values. In reality, a new /
different number is assigned to each character with a number that represents
the order they should take in a sort.

Unicode

Monolingual Linguistic Sort

Method of sorting character strings in a two-step process
that is defined by assigning and comparing first a major value and then a
minor value to the string to get the sort order. This allows for characters
that are similar, except for maybe case or a diacritic to be sorted together.
In order to use monolingual linguistic sorting, the database or column must
support a Unicode multi-byte character set such as UTF8 for the database or a
NCHAR on a column.

Supplementary characters

User defined characters that have been added to the
character set

Multilingual Linguistic Sort

Allows multiple languages to be sorted together as one. This
is of importance if a table is meant to have more than one language or for
languages that have complex sorting rules. In addition, multilingual sorts
can handle canonical equivalence and supplementary characters.

Oracle will evaluate a multilingual sort at three levels
of precision.

Base Letter

Is a letter each character evaluates to, for example, u
and U evaluate to the base letter u.

Ignorable Characters

A character that is not involved in a linguistic sort such
as diacritics and punctuation marks. This allows for characters and words
that may be hyphenated or punctuated in different ways but can be evaluated
to the same character or word.

Contracting Characters

Multiple characters that must treated as a single
character for sorting in a multilingual setting.

Expanding Characters

A single character that must be treated as multiple
characters when sorting in a multilingual setting.

Context-Sensitive Characters

Some characters or symbols represent how to pronounce
certain sounds in certain languages. These characters are translated into
those sounds and become part of the sort order.

Canonical Equivalence

A character or group of characters when reduced and
compared to each other are evaluated to being the same.

When talking about handling different character sets and
needing to represent those characters from languages different from our own and
presenting data to someone halfway around the world, it becomes very important
that we handle data properly so that the viewer of the data feels comfortable
with what they see.

»


See All Articles by Columnist
James Koopmann

James Koopmann
James Koopmann
James Koopmann has fourteen years of database design, development and performance tuning experience. In addition, he has extensive database administration experience in Oracle and other relational databases in production environments, specializing in performance tuning of database engines and SQL based applications. Koopmann is an accomplished author with several technical papers in various Oracle related publications such as Oracle Magazine, Oracle Professional and SQL>UPDATE_RMOUG. He is a featured author and database expert for DatabaseJournal, a member of the editorial review committee for Select Journal (The Magazine for the International Oracle Users Group), an Oracle Certified Professional DBA and noted speaker at local Oracle User Groups around the country.

Get the Free Newsletter!

Subscribe to Cloud Insider for top news, trends & analysis

Latest Articles