Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Tips Database Forum Rss Feed

» Database Journal Home
» Database Articles
» Database Tutorials
MS Access
SQL Scripts & Samples
» Database Forum
» Slideshows
Free Newsletters:

News Via RSS Feed

Rss Feed

Database Journal |DBA Support |SQLCourse |SQLCourse2

Featured Database Articles


Posted Oct 18, 2006

MySQL and Character Sets

By Ian Gilfillan

In spite of the dominance of English for business purposes, we still live in a multilingual world. At some point, most serious MySQL developers and DBA's will come across the need to store data from other languages and these may require other character sets or collations.

What are character sets and collations?

A character set is a set of characters, and the specific encoding for each character. A collation is a set of rules applied to a particular character set for comparison purposes. A collation can only apply to one character set (and its name will always begin with the character sets name), and each character set will always have a default collation.

For example, the letter Z may be encoded as the hex value A23 in one character set, and 15 in another. As an example of differing collation rule, just as in English, a falls before b, in Swedish, the character z falls before the character ö, while in German the reverse is true. More obscurely, in a German dictionary, the character öf falls before the character of, while in German telephone directories the reverse is true.

There can even be differences about what constitutes a character. For example, in German, ä is usually treated as an accented form of a, and therefore it follows after a. In Swedish however, it's seen as an individual letter, and falls after z.

MySQL support for character sets and collations

Before MySQL 4.1 (it seems so long ago now), MySQL could support only one character set at a time. So, MySQL could support Hebrew, or Greek, or any other character set, but if you wanted to store Greek and Hebrew data at the same time, this was not possible.

Now, you can specify character sets at the server, database, table, column, or string constant level. The MyISAM, MEMORY, NDBCluster, and InnoDB storage engines all support character sets. Here is a list of all character sets supported by my version of MySQL. You can run the SHOW CHARACTER SET statement to see what is available in your environment.

*************************** 1. row ***************************
          Charset: big5
      Description: Big5 Traditional Chinese
Default collation: big5_chinese_ci
           Maxlen: 2
*************************** 2. row ***************************
          Charset: dec8
      Description: DEC West European
Default collation: dec8_swedish_ci
           Maxlen: 1
*************************** 3. row ***************************
          Charset: cp850
      Description: DOS West European
Default collation: cp850_general_ci
           Maxlen: 1
*************************** 4. row ***************************
          Charset: hp8
      Description: HP West European
Default collation: hp8_english_ci
           Maxlen: 1
*************************** 5. row ***************************
          Charset: koi8r
      Description: KOI8-R Relcom Russian
Default collation: koi8r_general_ci
           Maxlen: 1
*************************** 6. row ***************************
          Charset: latin1
      Description: cp1252 West European
Default collation: latin1_swedish_ci
           Maxlen: 1
*************************** 7. row ***************************
          Charset: latin2
      Description: ISO 8859-2 Central European
Default collation: latin2_general_ci
           Maxlen: 1
*************************** 8. row ***************************
          Charset: swe7
      Description: 7bit Swedish
Default collation: swe7_swedish_ci
           Maxlen: 1
*************************** 9. row ***************************
          Charset: ascii
      Description: US ASCII
Default collation: ascii_general_ci
           Maxlen: 1
*************************** 10. row ***************************
          Charset: ujis
      Description: EUC-JP Japanese
Default collation: ujis_japanese_ci
           Maxlen: 3
*************************** 11. row ***************************
          Charset: sjis
      Description: Shift-JIS Japanese
Default collation: sjis_japanese_ci
           Maxlen: 2
*************************** 12. row ***************************
          Charset: hebrew
      Description: ISO 8859-8 Hebrew
Default collation: hebrew_general_ci
           Maxlen: 1
*************************** 13. row ***************************
          Charset: tis620
      Description: TIS620 Thai
Default collation: tis620_thai_ci
           Maxlen: 1
*************************** 14. row ***************************
          Charset: euckr
      Description: EUC-KR Korean
Default collation: euckr_korean_ci
           Maxlen: 2
*************************** 15. row ***************************
          Charset: koi8u
      Description: KOI8-U Ukrainian
Default collation: koi8u_general_ci
           Maxlen: 1
*************************** 16. row ***************************
          Charset: gb2312
      Description: GB2312 Simplified Chinese
Default collation: gb2312_chinese_ci
           Maxlen: 2
*************************** 17. row ***************************
          Charset: greek
      Description: ISO 8859-7 Greek
Default collation: greek_general_ci
           Maxlen: 1
*************************** 18. row ***************************
          Charset: cp1250
      Description: Windows Central European
Default collation: cp1250_general_ci
           Maxlen: 1
*************************** 19. row ***************************
          Charset: gbk
      Description: GBK Simplified Chinese
Default collation: gbk_chinese_ci
           Maxlen: 2
*************************** 20. row ***************************
          Charset: latin5
      Description: ISO 8859-9 Turkish
Default collation: latin5_turkish_ci
           Maxlen: 1
*************************** 21. row ***************************
          Charset: armscii8
      Description: ARMSCII-8 Armenian
Default collation: armscii8_general_ci
           Maxlen: 1
*************************** 22. row ***************************
          Charset: utf8
      Description: UTF-8 Unicode
Default collation: utf8_general_ci
           Maxlen: 3
*************************** 23. row ***************************
          Charset: ucs2
      Description: UCS-2 Unicode
Default collation: ucs2_general_ci
           Maxlen: 2
*************************** 24. row ***************************
          Charset: cp866
      Description: DOS Russian
Default collation: cp866_general_ci
           Maxlen: 1
*************************** 25. row ***************************
          Charset: keybcs2
      Description: DOS Kamenicky Czech-Slovak
Default collation: keybcs2_general_ci
           Maxlen: 1
*************************** 26. row ***************************
          Charset: macce
      Description: Mac Central European
Default collation: macce_general_ci
           Maxlen: 1
*************************** 27. row ***************************
          Charset: macroman
      Description: Mac West European
Default collation: macroman_general_ci
           Maxlen: 1
*************************** 28. row ***************************
          Charset: cp852
      Description: DOS Central European
Default collation: cp852_general_ci
           Maxlen: 1
*************************** 29. row ***************************
          Charset: latin7
      Description: ISO 8859-13 Baltic
Default collation: latin7_general_ci
           Maxlen: 1
*************************** 30. row ***************************
          Charset: cp1251
      Description: Windows Cyrillic
Default collation: cp1251_general_ci
           Maxlen: 1
*************************** 31. row ***************************
          Charset: cp1256
      Description: Windows Arabic
Default collation: cp1256_general_ci
           Maxlen: 1
*************************** 32. row ***************************
          Charset: cp1257
      Description: Windows Baltic
Default collation: cp1257_general_ci
           Maxlen: 1
*************************** 33. row ***************************
          Charset: binary
      Description: Binary pseudo charset
Default collation: binary
           Maxlen: 1
*************************** 34. row ***************************
          Charset: geostd8
      Description: GEOSTD8 Georgian
Default collation: geostd8_general_ci
           Maxlen: 1
*************************** 35. row ***************************
          Charset: cp932
      Description: SJIS for Windows Japanese
Default collation: cp932_japanese_ci
           Maxlen: 2
*************************** 36. row ***************************
          Charset: eucjpms
      Description: UJIS for Windows Japanese
Default collation: eucjpms_japanese_ci
           Maxlen: 3
36 rows in set (0.00 sec)

MySQL Archives

Latest Forum Threads
MySQL Forum
Topic By Replies Updated
MySQL in high availability and transction secure banking application klamor 1 August 28th, 10:24 AM
MySQL rollback UAL225 1 August 28th, 10:15 AM
Browsing a DB file that uses MySql finleytech 1 July 26th, 10:51 AM
php cookie won't pass variable to next page ITdevGirl 0 June 22nd, 12:13 PM