Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Links Database Forum

» Database Journal Home
» Database Articles
» Database Tutorials
MS SQL
Oracle
DB2
MS Access
MySQL
» RESOURCES
Database Tools
SQL Scripts & Samples
Links
» Database Forum
» Sitemap
Free Newsletters:
DatabaseDaily  
News Via RSS Feed


follow us on Twitter
Database Journal |DBA Support |SQLCourse |SQLCourse2
 

Featured Database Articles

Oracle

Posted Feb 18, 2005

The Globalization of Language in Oracle - Terminology

By James Koopmann

With today's exploding world economy, multi-national communication is essential. Databases must not only store different character sets but also present information in a comfortable format and order for individuals from every locale. This series explores how to globalize your databases and communicate effectively across the globe.

Let's begin learning about how sorting and character comparisons happen in a global market place by first understanding some of the terminology.

You can see in Listing 1 that I have created a table called CHAR_TABLE and inserted some characters along with their decimal and binary equivalents. I then decided to order these rows on the decimal, binary, and graphical representation. Since the binary number is generated from the binary number, it would be assumed that the sort order would be the same--and it was. However, when we sort on the graphical character we get a somewhat different order for the non-American characters. We must then ask ourselves why this happens. What is different about the graphical character representation that makes the sort work differently than that on the decimal value?

While we might all wish everyone to speak the same language, it surely isn't going to happen in most of our life times. While we are moving closer and closer to a real one-world economy, we will most surely still need to converse in multiple languages. This requires us to be able to store, retrieve, and manipulate these languages within our databases. The issue doesn't much lie in storage and retrieval, although there are a few minor issues, but instead it lies in how to provide an environment that gives linguistic meaning for the part of the globe that is viewing the information. This is most apparent when trying to perform equality and sorting on characters to retrieve the information that the user needs.

Listing 1

Example on sorting different characters: Create table to hold some characters

CREATE TABLE char_table (xDecimal NUMBER, xBinary NUMBER, xGraphic CHAR(4));
INSERT INTO char_table VALUES (65   ,01000001   ,'A');
INSERT INTO char_table VALUES (66   ,01000010   ,'B');
INSERT INTO char_table VALUES (67   ,01000011   ,'C');
INSERT INTO char_table VALUES (68   ,01000100   ,'D');
INSERT INTO char_table VALUES (79   ,01001111   ,'O');
INSERT INTO char_table VALUES (85   ,01010101   ,'U');
INSERT INTO char_table VALUES (97   ,01100001   ,'a');
INSERT INTO char_table VALUES (98   ,01100010   ,'b');
INSERT INTO char_table VALUES (99   ,01100011   ,'c');
INSERT INTO char_table VALUES (100  ,01100100   ,'d');
INSERT INTO char_table VALUES (111  ,01101111   ,'o');
INSERT INTO char_table VALUES (117  ,01110101   ,'u');
INSERT INTO char_table VALUES (196  ,11000100   ,'Ä'); 
INSERT INTO char_table VALUES (214  ,11010110   ,'Ö');
INSERT INTO char_table VALUES (220  ,11011100   ,'Ü');
INSERT INTO char_table VALUES (228  ,11100100   ,'ä');
INSERT INTO char_table VALUES (246  ,11110110   ,'ö');
INSERT INTO char_table VALUES (252  ,11111100   ,'ü');

Select the Characters and order them on the Decimal, Binary, and Character value:

SELECT * FROM char_table ORDER BY xDecimal;
SELECT * FROM char_table ORDER BY xBinary;
SELECT * FROM char_table ORDER BY xGraphic;

ORDER BY xDecimal

ORDER BY xBinary

ORDER BY xGraphic

XDECIMAL XBINARY XGRA

--------- ---------- ----

65 1000001 A

66 1000010 B

67 1000011 C

68 1000100 D

79 1001111 O

85 1010101 U

97 1100001 a

98 1100010 b

99 1100011 c

100 1100100 d

111 1101111 o

117 1110101 u

196 11000100 Ä

214 11010110 Ö

220 11011100 Ü

228 11100100 ä

246 11110110 ö

252 11111100 ü

XDECIMAL XBINARY XGRA

-------- ---------- ----

65 1000001 A

66 1000010 B

67 1000011 C

68 1000100 D

79 1001111 O

85 1010101 U

97 1100001 a

98 1100010 b

99 1100011 c

100 1100100 d

111 1101111 o

117 1110101 u

196 11000100 Ä

214 11010110 Ö

220 11011100 Ü

228 11100100 ä

246 11110110 ö

252 11111100 ü

XDECIMAL XBINARY XGRA

-------- ---------- ----

65 1000001 A

66 1000010 B

67 1000011 C

68 1000100 D

79 1001111 O

85 1010101 U

97 1100001 a

98 1100010 b

99 1100011 c

100 1100100 d

111 1101111 o

117 1110101 u

252 11111100 ü

220 11011100 Ü

196 11000100 Ä

246 11110110 ö

228 11100100 ä

214 11010110 Ö



QUICK TIP: Creating special characters from the keyboard
Hold the <ALT>-key down while punching in the ASCII code in the number pad.
You will need to prefix the decimal number with a zero.
So if you wanted to type in a Ü then you would hold down the <ALT>-key and key in 0220 with the number pad.

It is not usually my style to begin with an example but I think the example in Listing 1 brings a few items to the forefront. You can see in Listing 1 that I have created a table called CHAR_TABLE and inserted some characters along with their decimal and binary equivalents. I then decided to order these rows on the decimal, binary, and graphical representation. Since the binary number is generated from the binary number, it would be assumed that the sort order would be the same--and it was. However, when we sort on the graphical character we get a somewhat different order for the non-American characters. We must then ask ourselves why this happens. What is different about the graphical character representation that makes the sort work differently than that on the decimal value?

While we might all wish everyone to speak the same language, it surely isn't going to happen in most of our life times. While we are moving closer and closer to a real one-world economy, we will most surely still need to converse in multiple languages. This requires us to be able to store, retrieve, and manipulate these languages within our databases. The issue doesn't much lie in storage and retrieval, although there are a few minor issues, but instead it lies in how to provide an environment that gives linguistic meaning for the part of the globe that is viewing the information. This is most apparent when trying to perform equality and sorting on characters to retrieve the information that the user needs.

In subsequent articles, I will take us through a journey of handling and configuring our databases so that we may be able to have it converse within a global economy and provide a local flavor of the data to who ever is viewing it. As there are many definitions revolving around language and data comparison, I thought it only proper to explore the terminology around the globalization of language before getting much deeper into this topic. I am positive that after reviewing these definitions you will soon gain some insight into why this is such a difficult task for a database vendor to overcome.

Terminology

locale

The environment that your database system is being accessed from and has a desire to have information displayed and handled in its native format.

Language

Signifies a part of the world and dictates specific conventions on how character data is displayed, sorted, and compared.

Territory

Signifies a part of the world and the specific conventions on defaults and formats for displaying the character set.

Character Set

Specifies the character set used by the client application (normally the Oracle character set that corresponds to the user's terminal character set or the OS character set). Each supported character set has a unique acronym, for example, US7ASCII, WE8ISO8859P1, WE8DEC, WE8MSWIN1252, or JA16EUC. Each language has a default character set associated with it.

Case

Refers to the condition of being uppercase or lowercase

Sort order

Phonetic or character appearance in a list

Comparison

The method of looking at a character or character set and determining their relationship in equality or sorting order.

Diacritic

A mark near or through a character or character combination

Binary Value

Every character evaluates to a binary number (a series of 1's and 0's)

Binary Sort

The order of a set or group of characters that are sorted by their binary equivalent. This type of sort order works well for English but once foreign languages are introduced, it cannot be relied upon.

ASCII

For us Americans, this is the sort we are very familiar with as the letters A thru Z evaluate since it was our American Standard (ASCII) codes we developed.

Linguistic Sort

A sort sequence that matches the alphabetic sequence of characters independently of their numeric values. In reality, a new / different number is assigned to each character with a number that represents the order they should take in a sort.

Unicode

Monolingual Linguistic Sort

Method of sorting character strings in a two-step process that is defined by assigning and comparing first a major value and then a minor value to the string to get the sort order. This allows for characters that are similar, except for maybe case or a diacritic to be sorted together. In order to use monolingual linguistic sorting, the database or column must support a Unicode multi-byte character set such as UTF8 for the database or a NCHAR on a column.

Supplementary characters

User defined characters that have been added to the character set

Multilingual Linguistic Sort

Allows multiple languages to be sorted together as one. This is of importance if a table is meant to have more than one language or for languages that have complex sorting rules. In addition, multilingual sorts can handle canonical equivalence and supplementary characters.

Oracle will evaluate a multilingual sort at three levels of precision.

Base Letter

Is a letter each character evaluates to, for example, u and U evaluate to the base letter u.

Ignorable Characters

A character that is not involved in a linguistic sort such as diacritics and punctuation marks. This allows for characters and words that may be hyphenated or punctuated in different ways but can be evaluated to the same character or word.

Contracting Characters

Multiple characters that must treated as a single character for sorting in a multilingual setting.

Expanding Characters

A single character that must be treated as multiple characters when sorting in a multilingual setting.

Context-Sensitive Characters

Some characters or symbols represent how to pronounce certain sounds in certain languages. These characters are translated into those sounds and become part of the sort order.

Canonical Equivalence

A character or group of characters when reduced and compared to each other are evaluated to being the same.

When talking about handling different character sets and needing to represent those characters from languages different from our own and presenting data to someone halfway around the world, it becomes very important that we handle data properly so that the viewer of the data feels comfortable with what they see.

» See All Articles by Columnist James Koopmann



Oracle Archives

Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 




Latest Forum Threads
Oracle Forum
Topic By Replies Updated
Oracle Data Mining: Classification jan.hasller 0 July 5th, 07:19 AM
Find duplicates - Unique IDs Lava 5 July 2nd, 08:30 AM
no matching unique or primary key rcanter 1 April 25th, 12:32 PM
Update values of one table based on condition of values in other table using Trigger Gladiator 3 February 29th, 06:01 PM


















Thanks for your registration, follow us on our social networks to keep up-to-date