My last
DatabaseJournal article, All
About the Crosstab Query, described how to formulate an SQL statement for
generating a cross tabulation query. My original intention for today’s follow-up was to explore the use of stored procedures to make crosstab generation more dynamic. That was until a bug sent me on a search for answers. What I found was intriguing enough to make me set aside my original topic so that I could relate what I discovered about the handling of character encoding in imported data. I think that you’ll agree that it’s a journey well worth taking!
Character Encoding and Collation Described
A character encoding
is a way of mapping a character (the letter ‘A’) to an integer in a character
set (the number 65 in the US-ASCII character set). With a limited character set such as US-ASCII, which includes the twenty-six letters of the English alphabet in both lowercase and uppercase, the digits 0 through 9, and some punctuation, fitting everything into a single byte is not a problem. But when dealing with other languages, such as German, Swedish, Hungarian, or Japanese, you start to hit the limits of the 8-bit byte. This can happen when you try to create a character set that represents two languages, or even a single language like Japanese.
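You can see this character-to-integer mapping for yourself with MySQL’s ASCII and CHAR string functions:
mysql> SELECT ASCII('A');   -- 65: the US-ASCII code for 'A'
mysql> SELECT CHAR(65);     -- 'A': the integer mapped back to a character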
In an effort to
account for the profusion of languages and scripts in the modern world, a
number of different character encodings have been devised for mapping
different characters to integers. For character sets that wouldn’t fit in a
single byte, double-byte character sets were created, along with multi-byte
character sets that use a special character to signal a shift between
single-byte and double-byte encoding.
The Unicode Consortium
came together to create a specification for a character encoding that would be
able to encompass the characters in all written languages. The result was the
Unicode character set. Its two most common encodings are UCS-2, which encodes every character as two bytes, and UTF-8, which uses a multi-byte scheme that extends US-ASCII.
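A quick way to see the difference between these encodings is to compare byte counts against character counts, since MySQL supports both as the ucs2 and utf8 character sets:
mysql> SELECT LENGTH(CONVERT('abc' USING ucs2));       -- 6: UCS-2 uses two bytes per character
mysql> SELECT CHAR_LENGTH(CONVERT('abc' USING ucs2));  -- 3: but it is still three characters
mysql> SELECT LENGTH(CONVERT('abc' USING utf8));       -- 3: ASCII letters remain one byte in UTF-8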
ISO-8859-1 is the most common character set used for Western languages, and the Windows-1252 character set extends it with some additional characters, such as the Euro (€) and trademark (™) symbols. Because Windows-1252 is a superset of ISO-8859-1, MySQL knows this character set as latin1 and does not treat ISO-8859-1 as a distinct character set.
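You can verify that MySQL’s latin1 really behaves as Windows-1252 by round-tripping the Euro sign, whose three-byte UTF-8 encoding is 0xE282AC:
mysql> SELECT CONVERT(UNHEX('E282AC') USING utf8);  -- '€': the bytes interpreted as UTF-8
mysql> SELECT HEX(CONVERT(CONVERT(UNHEX('E282AC') USING utf8) USING latin1));  -- '80': the Euro's single Windows-1252 byte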
So Why Can’t We Just Use UCS-2 or UTF-8 for Everything?
The main reason that it isn’t practical to always use Unicode is that it wastes bandwidth when only a single language is in use. In TIS-620, the single-byte code page for Thai, every character takes up one byte, whereas in UTF-8, each Thai character takes up three bytes. Many people think UTF-8 is efficient because ASCII characters take up only one byte, but in reality, UTF-8 can be highly inefficient when most of your file consists of characters outside the ASCII set. For example, if half your file consists of ASCII and the other half is Thai, then saving the file in UTF-8 makes it take up twice as much space as TIS-620 would.
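The three-to-one ratio is easy to demonstrate in MySQL, which ships with a tis620 character set; 0xE0B881 is the UTF-8 encoding of the Thai letter ko kai:
mysql> SELECT LENGTH(CONVERT(UNHEX('E0B881') USING utf8));                        -- 3 bytes in UTF-8
mysql> SELECT LENGTH(CONVERT(CONVERT(UNHEX('E0B881') USING utf8) USING tis620));  -- 1 byte in TIS-620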
In situations where storage space is at a premium and only ASCII plus one script is in use, an older one-byte character set can make a lot of sense. Unicode becomes necessary when a single file mixes scripts that use different code pages (for example, you can mix Thai and English, because TIS-620 also includes the ASCII characters, but you cannot mix Thai and Greek without Unicode, because Thai and Greek require different code pages).
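MySQL makes the code-page limitation easy to see: converting a character into a character set that lacks it yields a '?' replacement. Here it is with the Greek letter alpha (UTF-8 bytes 0xCEB1):
mysql> SELECT CONVERT(CONVERT(UNHEX('CEB1') USING utf8) USING tis620);  -- '?': TIS-620 has no code for Greek alpha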
A collation comprises the rules
governing the proper use of characters for either a language, such as Greek or
Polish, or an alphabet, such as Latin1_General. The collation attribute is used
by MySQL for the sorting of characters in relation to one another, and not for
encoding specifically.
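You can watch collation, rather than encoding, decide the outcome of a comparison by overriding it with the COLLATE clause:
mysql> SELECT 'A' = 'a' COLLATE latin1_swedish_ci;  -- 1: the default collation is case-insensitive
mysql> SELECT 'A' = 'a' COLLATE latin1_bin;         -- 0: the binary collation compares raw byte values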
Each SQL Server collation specifies
two properties:
- The sort order to use for Unicode data types (nchar, nvarchar, and ntext) as well as for non-Unicode character data types (char, varchar, and text). A sort order defines the sequence in which characters are evaluated in comparison operations.
- The code page used to store non-Unicode character data.
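If you have a SQL Server instance handy, you can inspect both properties yourself; the collation name below is just a common example, so substitute your own:
SELECT SERVERPROPERTY('Collation');  -- the server's default collation, e.g. SQL_Latin1_General_CP1_CI_AS
SELECT COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'CodePage');  -- 1252: the code page that collation uses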
If you’d like to read
up more on collation, there’s an informative DatabaseJournal
article on collation by Muthusamy Anantha Kumar (aka The MAK).
A Tale of Two Character Sets
MySQL ships with Latin-1 as the default encoding (actually latin1_swedish_ci), presumably because it is used by the majority of MySQL customers. In MySQL 4.1.12 or greater, data is imported in UTF-8 encoding. While this does allow the maximum number of character codes, it can still present problems when the incoming data is in a different character encoding. What’s worse, you may not know about the problem until a lot of work is required to undo the damage!
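Before importing anything, you can check which defaults your own server and session are using:
mysql> SHOW VARIABLES LIKE 'character_set%';  -- character_set_server, character_set_client, and friends
mysql> SHOW VARIABLES LIKE 'collation%';      -- collation_server, collation_connection, and so on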
I discovered this
first-hand when I imported data from an Access database to test my crosstab SQL
code in MySQL. The transfer ran into only one snag: the date formats were inconsistent. I fixed that problem by formatting the dates in the universal “yyyy-mm-dd” format. Here is the Access query used to extract the data:
SELECT TA_CASES.FEE_NUMBER,
TA_CASES.CASE_TYPE,
Format([CREATION_DATE],"yyyy-mm-dd") AS [CREATION DATE],
TA_CASES.REGION_CODE
FROM TA_CASES;
To my surprise, running the query on the imported data produced the following bizarre results, as seen in this screenshot of my HeidiSQL Windows client:
[Screenshot: HeidiSQL query results showing hex strings where month names should be]
It seemed that the MONTHNAME function was misbehaving, returning a hex string instead of a month name. Upon further experimentation, I concluded that any function applied to the dates was returning hex values. A few Internet searches later, I came to understand that this sort of thing is quite common, as I read account after account of people having to jump through hoops to map character codes back to the correct characters.
In my case, the
discrepancy between the two encodings was caused by Access when I saved the
data to a .csv (comma-separated values) file. MS Access uses the “Windows-1252”
character encoding when exporting to text. You can test this by exporting
data as an HTML page. In it, there will be a META tag that declares the
character encoding for the page:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=Windows-1252">
The simplest solution
that I found was to use the MySQL CONVERT function in the query. It
accepts a value and translates it to the encoding format that you specify
following the USING keyword. Here is how I used the CONVERT function in
my SQL code to fix the encoding problem:
mysql> SELECT CONVERT(MONTHNAME(CREATION_DATE) USING latin1) AS 'Month', ...
The solution that I employed affects only the output of the query. Other solutions can be applied at import time, or at the server, database, table, or column level. We’ll be looking at these after we get to our previously scheduled crosstab stored proc, in the next article.
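As a small preview of the import-time approach: newer MySQL versions let LOAD DATA INFILE declare the encoding of the incoming file, which would have headed off my problem entirely. Here is a minimal sketch, assuming a CSV export named ta_cases.csv:
LOAD DATA LOCAL INFILE 'ta_cases.csv'
INTO TABLE TA_CASES
CHARACTER SET latin1              -- declare the file's Windows-1252/latin1 encoding up front
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n';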