SHARE

The Globalization of Language in Oracle – Terminology

Written By

Feb 18, 2005

8 minute read

With today’s exploding world economy, multi-national communication is essential. Databases must not only store different character sets but also present information in a comfortable format and order for individuals from every locale. This series explores how to globalize your databases and communicate effectively across the globe.

Let’s begin learning about how sorting
and character comparisons happen in a global market place by first
understanding some of the terminology.

You can see
in Listing 1 that I have created a table called CHAR_TABLE and inserted some
characters along with their decimal and binary equivalents. I then decided to
order these rows on the decimal, binary, and graphical representation. Since
the binary number is generated from the binary number, it would be assumed that
the sort order would be the same–and it was. However, when we sort on the graphical
character we get a somewhat different order for the non-American characters. We
must then ask ourselves why this happens. What is different about the graphical
character representation that makes the sort work differently than that on the
decimal value?

While we might all wish everyone to speak the same language,
it surely isn’t going to happen in most of our life times. While we are moving
closer and closer to a real one-world economy, we will most surely still need
to converse in multiple languages. This requires us to be able to store, retrieve,
and manipulate these languages within our databases. The issue doesn’t much lie
in storage and retrieval, although there are a few minor issues, but instead it
lies in how to provide an environment that gives linguistic meaning for the
part of the globe that is viewing the information. This is most apparent when
trying to perform equality and sorting on characters to retrieve the
information that the user needs.

Listing 1

Example on sorting different characters:
Create table to hold some characters


CREATE TABLE char_table (xDecimal NUMBER, xBinary NUMBER, xGraphic CHAR(4));
INSERT INTO char_table VALUES (65   ,01000001   ,’A’);
INSERT INTO char_table VALUES (66   ,01000010   ,’B’);
INSERT INTO char_table VALUES (67   ,01000011   ,’C’);
INSERT INTO char_table VALUES (68   ,01000100   ,’D’);
INSERT INTO char_table VALUES (79   ,01001111   ,’O’);
INSERT INTO char_table VALUES (85   ,01010101   ,’U’);
INSERT INTO char_table VALUES (97   ,01100001   ,’a’);
INSERT INTO char_table VALUES (98   ,01100010   ,’b’);
INSERT INTO char_table VALUES (99   ,01100011   ,’c’);
INSERT INTO char_table VALUES (100  ,01100100   ,’d’);
INSERT INTO char_table VALUES (111  ,01101111   ,’o’);
INSERT INTO char_table VALUES (117  ,01110101   ,’u’);
INSERT INTO char_table VALUES (196  ,11000100   ,’Ä’);
INSERT INTO char_table VALUES (214  ,11010110   ,’Ö’);
INSERT INTO char_table VALUES (220  ,11011100   ,’Ü’);
INSERT INTO char_table VALUES (228  ,11100100   ,’ä’);
INSERT INTO char_table VALUES (246  ,11110110   ,’ö’);
INSERT INTO char_table VALUES (252  ,11111100   ,’ü’);

Select the Characters and order
them on the Decimal, Binary, and Character value:


SELECT * FROM char_table ORDER BY xDecimal;
SELECT * FROM char_table ORDER BY xBinary;
SELECT * FROM char_table ORDER BY xGraphic;

ORDER BY xDecimal	ORDER BY xBinary	ORDER BY xGraphic
XDECIMAL XBINARY XGRA ——— ———- —- 65 1000001 A 66 1000010 B 67 1000011 C 68 1000100 D 79 1001111 O 85 1010101 U 97 1100001 a 98 1100010 b 99 1100011 c 100 1100100 d 111 1101111 o 117 1110101 u 196 11000100 Ä 214 11010110 Ö 220 11011100 Ü 228 11100100 ä 246 11110110 ö 252 11111100 ü	XDECIMAL XBINARY XGRA ——– ———- —- 65 1000001 A 66 1000010 B 67 1000011 C 68 1000100 D 79 1001111 O 85 1010101 U 97 1100001 a 98 1100010 b 99 1100011 c 100 1100100 d 111 1101111 o 117 1110101 u 196 11000100 Ä 214 11010110 Ö 220 11011100 Ü 228 11100100 ä 246 11110110 ö 252 11111100 ü	XDECIMAL XBINARY XGRA ——– ———- —- 65 1000001 A 66 1000010 B 67 1000011 C 68 1000100 D 79 1001111 O 85 1010101 U 97 1100001 a 98 1100010 b 99 1100011 c 100 1100100 d 111 1101111 o 117 1110101 u 252 11111100 ü 220 11011100 Ü 196 11000100 Ä 246 11110110 ö 228 11100100 ä 214 11010110 Ö

ORDER BY xDecimal

ORDER BY xBinary

ORDER BY xGraphic

XDECIMAL XBINARY XGRA

——— ———- —-

65 1000001 A

66 1000010 B

67 1000011 C

68 1000100 D

79 1001111 O

85 1010101 U

97 1100001 a

98 1100010 b

99 1100011 c

100 1100100 d

111 1101111 o

117 1110101 u

196 11000100 Ä

214 11010110 Ö

220 11011100 Ü

228 11100100 ä

246 11110110 ö

252 11111100 ü

XDECIMAL XBINARY XGRA

——– ———- —-

65 1000001 A

66 1000010 B

67 1000011 C

68 1000100 D

79 1001111 O

85 1010101 U

97 1100001 a

98 1100010 b

99 1100011 c

100 1100100 d

111 1101111 o

117 1110101 u

196 11000100 Ä

214 11010110 Ö

220 11011100 Ü

228 11100100 ä

246 11110110 ö

252 11111100 ü

XDECIMAL XBINARY XGRA

——– ———- —-

65 1000001 A

66 1000010 B

67 1000011 C

68 1000100 D

79 1001111 O

85 1010101 U

97 1100001 a

98 1100010 b

99 1100011 c

100 1100100 d

111 1101111 o

117 1110101 u

252 11111100 ü

220 11011100 Ü

196 11000100 Ä

246 11110110 ö

228 11100100 ä

214 11010110 Ö

QUICK TIP

:

Creating special characters from the keyboard

Hold the

<

ALT

>

-key down while punching in the ASCII code in the number pad.

You will need to prefix the decimal number with a zero.

So if you wanted to type in a

Ü

then you would hold down the

<

ALT

>

-key and key in 0220 with the number pad.

It is not usually my style to begin with an example but I
think the example in Listing 1 brings a few items to the forefront. You can see
in Listing 1 that I have created a table called CHAR_TABLE and inserted some
characters along with their decimal and binary equivalents. I then decided to
order these rows on the decimal, binary, and graphical representation. Since
the binary number is generated from the binary number, it would be assumed that
the sort order would be the same–and it was. However, when we sort on the graphical
character we get a somewhat different order for the non-American characters. We
must then ask ourselves why this happens. What is different about the graphical
character representation that makes the sort work differently than that on the
decimal value?

While we might all wish everyone to speak the same language,
it surely isn’t going to happen in most of our life times. While we are moving
closer and closer to a real one-world economy, we will most surely still need
to converse in multiple languages. This requires us to be able to store, retrieve,
and manipulate these languages within our databases. The issue doesn’t much lie
in storage and retrieval, although there are a few minor issues, but instead it
lies in how to provide an environment that gives linguistic meaning for the
part of the globe that is viewing the information. This is most apparent when
trying to perform equality and sorting on characters to retrieve the
information that the user needs.

In subsequent articles, I will take us through a journey of
handling and configuring our databases so that we may be able to have it
converse within a global economy and provide a local flavor of the data to who
ever is viewing it. As there are many definitions revolving around language and
data comparison, I thought it only proper to explore the terminology around the
globalization of language before getting much deeper into this topic. I am
positive that after reviewing these definitions you will soon gain some insight
into why this is such a difficult task for a database vendor to overcome.

Terminology

locale	The environment that your database system is being accessed from and has a desire to have information displayed and handled in its native format.
Language	Signifies a part of the world and dictates specific conventions on how character data is displayed, sorted, and compared.
Territory	Signifies a part of the world and the specific conventions on defaults and formats for displaying the character set.
Character Set	Specifies the character set used by the client application (normally the Oracle character set that corresponds to the user’s terminal character set or the OS character set). Each supported character set has a unique acronym, for example, US7ASCII, WE8ISO8859P1, WE8DEC, WE8MSWIN1252, or JA16EUC. Each language has a default character set associated with it.
Case	Refers to the condition of being uppercase or lowercase
Sort order	Phonetic or character appearance in a list
Comparison	The method of looking at a character or character set and determining their relationship in equality or sorting order.
Diacritic	A mark near or through a character or character combination
Binary Value	Every character evaluates to a binary number (a series of 1’s and 0’s)
Binary Sort	The order of a set or group of characters that are sorted by their binary equivalent. This type of sort order works well for English but once foreign languages are introduced, it cannot be relied upon.
ASCII	For us Americans, this is the sort we are very familiar with as the letters A thru Z evaluate since it was our American Standard (ASCII) codes we developed.
Linguistic Sort	A sort sequence that matches the alphabetic sequence of characters independently of their numeric values. In reality, a new / different number is assigned to each character with a number that represents the order they should take in a sort.
Unicode
Monolingual Linguistic Sort	Method of sorting character strings in a two-step process that is defined by assigning and comparing first a major value and then a minor value to the string to get the sort order. This allows for characters that are similar, except for maybe case or a diacritic to be sorted together. In order to use monolingual linguistic sorting, the database or column must support a Unicode multi-byte character set such as UTF8 for the database or a NCHAR on a column.
Supplementary characters	User defined characters that have been added to the character set
Multilingual Linguistic Sort	Allows multiple languages to be sorted together as one. This is of importance if a table is meant to have more than one language or for languages that have complex sorting rules. In addition, multilingual sorts can handle canonical equivalence and supplementary characters. Oracle will evaluate a multilingual sort at three levels of precision.
Base Letter	Is a letter each character evaluates to, for example, u and U evaluate to the base letter u.
Ignorable Characters	A character that is not involved in a linguistic sort such as diacritics and punctuation marks. This allows for characters and words that may be hyphenated or punctuated in different ways but can be evaluated to the same character or word.
Contracting Characters	Multiple characters that must treated as a single character for sorting in a multilingual setting.
Expanding Characters	A single character that must be treated as multiple characters when sorting in a multilingual setting.
Context-Sensitive Characters	Some characters or symbols represent how to pronounce certain sounds in certain languages. These characters are translated into those sounds and become part of the sort order.
Canonical Equivalence	A character or group of characters when reduced and compared to each other are evaluated to being the same.

When talking about handling different character sets and
needing to represent those characters from languages different from our own and
presenting data to someone halfway around the world, it becomes very important
that we handle data properly so that the viewer of the data feels comfortable
with what they see.

»

See All Articles by ColumnistJames Koopmann

James Koopmann

James Koopmann has fourteen years of database design, development and performance tuning experience. In addition, he has extensive database administration experience in Oracle and other relational databases in production environments, specializing in performance tuning of database engines and SQL based applications. Koopmann is an accomplished author with several technical papers in various Oracle related publications such as Oracle Magazine, Oracle Professional and SQL>UPDATE_RMOUG. He is a featured author and database expert for DatabaseJournal, a member of the editorial review committee for Select Journal (The Magazine for the International Oracle Users Group), an Oracle Certified Professional DBA and noted speaker at local Oracle User Groups around the country.

The Globalization of Language in Oracle – Terminology

James Koopmann

Company

Categories