With today’s exploding world economy, multi-national communication is essential. Databases must not only store different character sets but also present information in a comfortable format and order for individuals from every locale. This series explores how to globalize your databases and communicate effectively across the globe.
Let’s begin learning about how sorting
and character comparisons happen in a global market place by first
understanding some of the terminology.
You can see
in Listing 1 that I have created a table called CHAR_TABLE and inserted some
characters along with their decimal and binary equivalents. I then decided to
order these rows on the decimal, binary, and graphical representation. Since
the binary number is generated from the binary number, it would be assumed that
the sort order would be the same–and it was. However, when we sort on the graphical
character we get a somewhat different order for the non-American characters. We
must then ask ourselves why this happens. What is different about the graphical
character representation that makes the sort work differently than that on the
decimal value?
While we might all wish everyone to speak the same language,
it surely isn’t going to happen in most of our life times. While we are moving
closer and closer to a real one-world economy, we will most surely still need
to converse in multiple languages. This requires us to be able to store, retrieve,
and manipulate these languages within our databases. The issue doesn’t much lie
in storage and retrieval, although there are a few minor issues, but instead it
lies in how to provide an environment that gives linguistic meaning for the
part of the globe that is viewing the information. This is most apparent when
trying to perform equality and sorting on characters to retrieve the
information that the user needs.
Listing 1
Example on sorting different characters:
Create table to hold some characters
CREATE TABLE char_table (xDecimal NUMBER, xBinary NUMBER, xGraphic CHAR(4));
INSERT INTO char_table VALUES (65 ,01000001 ,’A’);
INSERT INTO char_table VALUES (66 ,01000010 ,’B’);
INSERT INTO char_table VALUES (67 ,01000011 ,’C’);
INSERT INTO char_table VALUES (68 ,01000100 ,’D’);
INSERT INTO char_table VALUES (79 ,01001111 ,’O’);
INSERT INTO char_table VALUES (85 ,01010101 ,’U’);
INSERT INTO char_table VALUES (97 ,01100001 ,’a’);
INSERT INTO char_table VALUES (98 ,01100010 ,’b’);
INSERT INTO char_table VALUES (99 ,01100011 ,’c’);
INSERT INTO char_table VALUES (100 ,01100100 ,’d’);
INSERT INTO char_table VALUES (111 ,01101111 ,’o’);
INSERT INTO char_table VALUES (117 ,01110101 ,’u’);
INSERT INTO char_table VALUES (196 ,11000100 ,’Ä’);
INSERT INTO char_table VALUES (214 ,11010110 ,’Ö’);
INSERT INTO char_table VALUES (220 ,11011100 ,’Ü’);
INSERT INTO char_table VALUES (228 ,11100100 ,’ä’);
INSERT INTO char_table VALUES (246 ,11110110 ,’ö’);
INSERT INTO char_table VALUES (252 ,11111100 ,’ü’);
Select the Characters and order
them on the Decimal, Binary, and Character value:
SELECT * FROM char_table ORDER BY xDecimal;
SELECT * FROM char_table ORDER BY xBinary;
SELECT * FROM char_table ORDER BY xGraphic;
ORDER BY xDecimal |
ORDER BY xBinary |
ORDER BY xGraphic |
XDECIMAL XBINARY XGRA ——— ———- —- 65 1000001 A 66 1000010 B 67 1000011 C 68 1000100 D 79 1001111 O 85 1010101 U 97 1100001 a 98 1100010 b 99 1100011 c 100 1100100 d 111 1101111 o 117 1110101 u 196 11000100 Ä 214 11010110 Ö 220 11011100 Ü 228 11100100 ä 246 11110110 ö 252 11111100 ü |
XDECIMAL XBINARY XGRA ——– ———- —- 65 1000001 A 66 1000010 B 67 1000011 C 68 1000100 D 79 1001111 O 85 1010101 U 97 1100001 a 98 1100010 b 99 1100011 c 100 1100100 d 111 1101111 o 117 1110101 u 196 11000100 Ä 214 11010110 Ö 220 11011100 Ü 228 11100100 ä 246 11110110 ö 252 11111100 ü |
XDECIMAL XBINARY XGRA ——– ———- —- 65 1000001 A 66 1000010 B 67 1000011 C 68 1000100 D 79 1001111 O 85 1010101 U 97 1100001 a 98 1100010 b 99 1100011 c 100 1100100 d 111 1101111 o 117 1110101 u 252 11111100 ü 220 11011100 Ü 196 11000100 Ä 246 11110110 ö 228 11100100 ä 214 11010110 Ö |
Hold the <ALT>-key down while punching in the ASCII code in the number pad.
You will need to prefix the decimal number with a zero.
So if you wanted to type in a Ü then you would hold down the <ALT>-key and key in 0220 with the number pad.
It is not usually my style to begin with an example but I
think the example in Listing 1 brings a few items to the forefront. You can see
in Listing 1 that I have created a table called CHAR_TABLE and inserted some
characters along with their decimal and binary equivalents. I then decided to
order these rows on the decimal, binary, and graphical representation. Since
the binary number is generated from the binary number, it would be assumed that
the sort order would be the same–and it was. However, when we sort on the graphical
character we get a somewhat different order for the non-American characters. We
must then ask ourselves why this happens. What is different about the graphical
character representation that makes the sort work differently than that on the
decimal value?
While we might all wish everyone to speak the same language,
it surely isn’t going to happen in most of our life times. While we are moving
closer and closer to a real one-world economy, we will most surely still need
to converse in multiple languages. This requires us to be able to store, retrieve,
and manipulate these languages within our databases. The issue doesn’t much lie
in storage and retrieval, although there are a few minor issues, but instead it
lies in how to provide an environment that gives linguistic meaning for the
part of the globe that is viewing the information. This is most apparent when
trying to perform equality and sorting on characters to retrieve the
information that the user needs.
In subsequent articles, I will take us through a journey of
handling and configuring our databases so that we may be able to have it
converse within a global economy and provide a local flavor of the data to who
ever is viewing it. As there are many definitions revolving around language and
data comparison, I thought it only proper to explore the terminology around the
globalization of language before getting much deeper into this topic. I am
positive that after reviewing these definitions you will soon gain some insight
into why this is such a difficult task for a database vendor to overcome.
Terminology
locale |
The environment that your database system is being |
Language |
Signifies a part of the world and dictates specific |
Territory |
Signifies a part of the world and the specific conventions |
Character Set |
Specifies the character set used by the client application |
Case |
Refers to the condition of being uppercase or lowercase |
Sort order |
Phonetic or character appearance in a list |
Comparison |
The method of looking at a character or character set and |
Diacritic |
A mark near or through a character or character |
Binary Value |
Every character evaluates to a binary number (a series of |
Binary Sort |
The order of a set or group of characters that are sorted |
ASCII |
For us Americans, this is the sort we are very familiar |
Linguistic Sort |
A sort sequence that matches the alphabetic sequence of |
Unicode |
|
Monolingual Linguistic Sort |
Method of sorting character strings in a two-step process |
Supplementary characters |
User defined characters that have been added to the |
Multilingual Linguistic Sort |
Allows multiple languages to be sorted together as one. This Oracle will evaluate a multilingual sort at three levels |
Base Letter |
Is a letter each character evaluates to, for example, u |
Ignorable Characters |
A character that is not involved in a linguistic sort such |
Contracting Characters |
Multiple characters that must treated as a single |
Expanding Characters |
A single character that must be treated as multiple |
Context-Sensitive Characters |
Some characters or symbols represent how to pronounce |
Canonical Equivalence |
A character or group of characters when reduced and |
When talking about handling different character sets and
needing to represent those characters from languages different from our own and
presenting data to someone halfway around the world, it becomes very important
that we handle data properly so that the viewer of the data feels comfortable
with what they see.