MySQL and Character Sets

In spite of the dominance of English for business purposes, we still live in a multilingual world. At some point, most serious MySQL developers and DBA’s will come across the need to store data from other languages and these may require other character sets or collations.


What are character sets and collations?


A character set is a set of characters, and the specific encoding for each character. A collation is a set of rules applied to a particular character set for comparison purposes. A collation can only apply to one character set (and its name will always begin with the character sets name), and each character set will always have a default collation.


For example, the letter Z may be encoded as the hex value A23 in one character set, and 15 in another. As an example of differing collation rule, just as in English, a falls before b, in Swedish, the character z falls before the character ö, while in German the reverse is true. More obscurely, in a German dictionary, the character öf falls before the character of, while in German telephone directories the reverse is true.


There can even be differences about what constitutes a character. For example, in German, ä is usually treated as an accented form of a, and therefore it follows after a. In Swedish however, it’s seen as an individual letter, and falls after z.


MySQL support for character sets and collations


Before MySQL 4.1 (it seems so long ago now), MySQL could support only one character set at a time. So, MySQL could support Hebrew, or Greek, or any other character set, but if you wanted to store Greek and Hebrew data at the same time, this was not possible.


Now, you can specify character sets at the server, database, table, column, or string constant level. The MyISAM, MEMORY, NDBCluster, and InnoDB storage engines all support character sets. Here is a list of all character sets supported by my version of MySQL. You can run the SHOW CHARACTER SET statement to see what is available in your environment.

mysql> SHOW CHARACTER SETG
*************************** 1. row ***************************
Charset: big5
Description: Big5 Traditional Chinese
Default collation: big5_chinese_ci
Maxlen: 2
*************************** 2. row ***************************
Charset: dec8
Description: DEC West European
Default collation: dec8_swedish_ci
Maxlen: 1
*************************** 3. row ***************************
Charset: cp850
Description: DOS West European
Default collation: cp850_general_ci
Maxlen: 1
*************************** 4. row ***************************
Charset: hp8
Description: HP West European
Default collation: hp8_english_ci
Maxlen: 1
*************************** 5. row ***************************
Charset: koi8r
Description: KOI8-R Relcom Russian
Default collation: koi8r_general_ci
Maxlen: 1
*************************** 6. row ***************************
Charset: latin1
Description: cp1252 West European
Default collation: latin1_swedish_ci
Maxlen: 1
*************************** 7. row ***************************
Charset: latin2
Description: ISO 8859-2 Central European
Default collation: latin2_general_ci
Maxlen: 1
*************************** 8. row ***************************
Charset: swe7
Description: 7bit Swedish
Default collation: swe7_swedish_ci
Maxlen: 1
*************************** 9. row ***************************
Charset: ascii
Description: US ASCII
Default collation: ascii_general_ci
Maxlen: 1
*************************** 10. row ***************************
Charset: ujis
Description: EUC-JP Japanese
Default collation: ujis_japanese_ci
Maxlen: 3
*************************** 11. row ***************************
Charset: sjis
Description: Shift-JIS Japanese
Default collation: sjis_japanese_ci
Maxlen: 2
*************************** 12. row ***************************
Charset: hebrew
Description: ISO 8859-8 Hebrew
Default collation: hebrew_general_ci
Maxlen: 1
*************************** 13. row ***************************
Charset: tis620
Description: TIS620 Thai
Default collation: tis620_thai_ci
Maxlen: 1
*************************** 14. row ***************************
Charset: euckr
Description: EUC-KR Korean
Default collation: euckr_korean_ci
Maxlen: 2
*************************** 15. row ***************************
Charset: koi8u
Description: KOI8-U Ukrainian
Default collation: koi8u_general_ci
Maxlen: 1
*************************** 16. row ***************************
Charset: gb2312
Description: GB2312 Simplified Chinese
Default collation: gb2312_chinese_ci
Maxlen: 2
*************************** 17. row ***************************
Charset: greek
Description: ISO 8859-7 Greek
Default collation: greek_general_ci
Maxlen: 1
*************************** 18. row ***************************
Charset: cp1250
Description: Windows Central European
Default collation: cp1250_general_ci
Maxlen: 1
*************************** 19. row ***************************
Charset: gbk
Description: GBK Simplified Chinese
Default collation: gbk_chinese_ci
Maxlen: 2
*************************** 20. row ***************************
Charset: latin5
Description: ISO 8859-9 Turkish
Default collation: latin5_turkish_ci
Maxlen: 1
*************************** 21. row ***************************
Charset: armscii8
Description: ARMSCII-8 Armenian
Default collation: armscii8_general_ci
Maxlen: 1
*************************** 22. row ***************************
Charset: utf8
Description: UTF-8 Unicode
Default collation: utf8_general_ci
Maxlen: 3
*************************** 23. row ***************************
Charset: ucs2
Description: UCS-2 Unicode
Default collation: ucs2_general_ci
Maxlen: 2
*************************** 24. row ***************************
Charset: cp866
Description: DOS Russian
Default collation: cp866_general_ci
Maxlen: 1
*************************** 25. row ***************************
Charset: keybcs2
Description: DOS Kamenicky Czech-Slovak
Default collation: keybcs2_general_ci
Maxlen: 1
*************************** 26. row ***************************
Charset: macce
Description: Mac Central European
Default collation: macce_general_ci
Maxlen: 1
*************************** 27. row ***************************
Charset: macroman
Description: Mac West European
Default collation: macroman_general_ci
Maxlen: 1
*************************** 28. row ***************************
Charset: cp852
Description: DOS Central European
Default collation: cp852_general_ci
Maxlen: 1
*************************** 29. row ***************************
Charset: latin7
Description: ISO 8859-13 Baltic
Default collation: latin7_general_ci
Maxlen: 1
*************************** 30. row ***************************
Charset: cp1251
Description: Windows Cyrillic
Default collation: cp1251_general_ci
Maxlen: 1
*************************** 31. row ***************************
Charset: cp1256
Description: Windows Arabic
Default collation: cp1256_general_ci
Maxlen: 1
*************************** 32. row ***************************
Charset: cp1257
Description: Windows Baltic
Default collation: cp1257_general_ci
Maxlen: 1
*************************** 33. row ***************************
Charset: binary
Description: Binary pseudo charset
Default collation: binary
Maxlen: 1
*************************** 34. row ***************************
Charset: geostd8
Description: GEOSTD8 Georgian
Default collation: geostd8_general_ci
Maxlen: 1
*************************** 35. row ***************************
Charset: cp932
Description: SJIS for Windows Japanese
Default collation: cp932_japanese_ci
Maxlen: 2
*************************** 36. row ***************************
Charset: eucjpms
Description: UJIS for Windows Japanese
Default collation: eucjpms_japanese_ci
Maxlen: 3
36 rows in set (0.00 sec)

Ian Gilfillan
Ian Gilfillan
Ian Gilfillan lives in Cape Town, South Africa. He is the author of the book 'Mastering MySQL 4', published by Sybex, and has been working with MySQL since 1997. These days he develops mainly in PHP and MySQL, although confesses to starting out with BASIC and COBOL, way back when, and still has a soft spot for Perl. He developed South Africa's first online grocery store, and has developed and taught internet development and other technical courses for various institutions. He has majors in Programming and Information Systems, as well as English and Philosophy. For 5 years he was Lead Developer and IT Manager for Independent Online, South Africa's premier news portal. However, he has now 'retired' from fulltime work, and is hoping that his next book will be more in the style of William Blake and Allen Ginsberg.

Latest Articles