Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Links Database Forum

» Database Journal Home
» Database Articles
» Database Tutorials
MS SQL
Oracle
DB2
MS Access
MySQL
» RESOURCES
Database Tools
SQL Scripts & Samples
Links
» Database Forum
» Sitemap
Free Newsletters:
DatabaseDaily  
News Via RSS Feed


follow us on Twitter
Database Journal |DBA Support |SQLCourse |SQLCourse2
 

Featured Database Articles

MySQL

Posted Oct 18, 2006

MySQL and Character Sets - Page 3

By Ian Gilfillan

Default character sets and collations

There are five levels to which character sets and collations can be applied - server, database, table, column and string constant, each of which can have differing defaults. They also apply according to a certain hierarchy. The character set and collation specific to a particular string constant and column apply first. If those don't exist, the table default is used, and if that doesn't exist the database default, and finally the server default. By default, MySQL uses the latin1 character set, and the latin1_swedish_ci collation for each. You can see which defaults apply to your installation by running the SHOW VARIABLES LIKE 'character_set%' and SHOW VARIABLES LIKE 'collation%'statements:

mysql> SHOW VARIABLES LIKE 'character_set%'\G
*************************** 1. row ***************************
Variable_name: character_set_client
        Value: latin1
*************************** 2. row ***************************
Variable_name: character_set_connection
        Value: latin1
*************************** 3. row ***************************
Variable_name: character_set_database
        Value: latin1
*************************** 4. row ***************************
Variable_name: character_set_results
        Value: latin1
*************************** 5. row ***************************
Variable_name: character_set_server
        Value: latin1
*************************** 6. row ***************************
Variable_name: character_set_system
        Value: utf8
*************************** 7. row ***************************
Variable_name: character_sets_dir
        Value: /usr/local/mysql/share/mysql/charsets/
7 rows in set (0.00 sec)
mysql> SHOW VARIABLES LIKE 'collation%'\G
*************************** 1. row ***************************
Variable_name: collation_connection
        Value: latin1_swedish_ci
*************************** 2. row ***************************
Variable_name: collation_database
        Value: latin1_swedish_ci
*************************** 3. row ***************************
Variable_name: collation_server
        Value: latin1_swedish_ci
3 rows in set (0.00 sec)

The server defaults are set up in your configuration file (my.cnf). For example:

[mysqld]
...
default-character-set=latin2
default-collation=latin2_general_ci
...

If no other character sets or collations are specified at any of the other three levels, all operations will be performed using this character set and collation. For example, create a new database, and see which character set is used:

mysql> CREATE DATABASE latin1db;
Query OK, 1 row affected (0.00 sec)
mysql> SHOW CREATE DATABASE latin1db\G
*************************** 1. row ***************************
       Database: latin1db
Create Database: CREATE DATABASE `latin1db` /*!40100 DEFAULT CHARACTER SET latin1 */
1 row in set (0.00 sec)

To create a database that makes use of a different character set and collation, simply use the CHARACTER SET (CHARSET is a synonym) and COLLATION clauses:

mysql> CREATE DATABASE latin2db CHARACTER SET latin2 COLLATE latin2_general_ci;
Query OK, 1 row affected (0.05 sec)
mysql> SHOW CREATE DATABASE latin2db\G
*************************** 1. row ***************************
       Database: latin2db
Create Database: CREATE DATABASE `latin2db` /*!40100 DEFAULT CHARACTER SET latin2 */
1 row in set (0.00 sec)


Let's look at changing the character set and collation at the table and column levels as well.

mysql> USE latin1db;
Database changed
mysql> CREATE TABLE latin2table (v VARCHAR(50), t TEXT) CHARACTER SET latin2 COLLATE latin2_general_ci;
Query OK, 0 rows affected (0.12 sec)
mysql> CREATE TABLE latin1table (c CHAR(20), v VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_general_ci, t TEXT CHARACTER SET latin1 COLLATE latin1_german1_ci);
Query OK, 0 rows affected (0.07 sec)

Now, let's insert the 'ý' character into each column, and see the different ways it's stored, using the HEX() function.

mysql> INSERT INTO latin1table(c,v,t) VALUES('ý','ý','ý');
Query OK, 1 row affected (0.00 sec)
mysql> SELECT c,v,t FROM latin1table;
+------+------+------+
| c    | v    | t    |
+------+------+------+
| ý    | ý    | ý    |
+------+------+------+
1 row in set (0.00 sec)
mysql> SELECT HEX(c), HEX(v), HEX(t) FROM latin1table;
+--------+--------+--------+
| HEX(c) | HEX(v) | HEX(t) |
+--------+--------+--------+
| FD     | C3BD   | FD     |
+--------+--------+--------+
1 row in set (0.00 sec)

Since utf8 is an encoding for many more characters than latin1, it therefore requires more bytes of storage, as is indicated in the hex value used to store the character.

Client and connection CHARACTER SET and COLLATION issues

As anyone developing for a wide variety of languages and character sets will know, it's not just how MySQL handles character sets that's important. All the other tools used for interacting are as important. The client being used to interact with MySQL is a key element as well. Take for example the character. This character can be encoded in utf8, but not in latin1. Working with it is further complicated by the other tools I'm using at the same time. I'm using a KDE desktop environment to write this, and can easily cut and paste the character. In order to save the character in my editor, KWrite, I had to manually set the encoding. Some editors won't be able to make sense of the character at all. However, if I paste the character into my command line MySQL client, the character is replaced with ?, as the client cannot understand it. MySQL too will replace characters it cannot understand with a question mark. For example, let's convert our utf8 varchar column (containing the ý character) into ascii, which cannot store it:

mysql> ALTER TABLE latin1table CHANGE v v VARCHAR(50) CHARACTER SET ascii;
Query OK, 1 row affected, 1 warning (0.20 sec)
Records: 1  Duplicates: 0  Warnings: 1
mysql> SELECT c,v,t FROM latin1table;
+------+------+------+
| c    | v    | t    |
+------+------+------+
| ý    | ?    | ý    |
+------+------+------+
1 row in set (0.00 sec)
mysql> SELECT HEX(c), HEX(v), HEX(t) FROM latin1table;
+--------+--------+--------+
| HEX(c) | HEX(v) | HEX(t) |
+--------+--------+--------+
| FD     | 3F     | FD     |
+--------+--------+--------+
1 row in set (0.00 sec)

You can see one major change between ascii and utf8, in the length of the returned hex result. utf-8 stores characters with between one and four bytes (although the subset used by MySQL only uses up to three bytes a character), while ascii only uses one. Because of the performance drop when using more bytes to store strings (and indexes), it doesn't make sense to use a broad character set requiring more than one byte per character (utf16 for example) just in case you ever need it. The default latin1 is sufficient for most western European languages. If you do need to use a multi-byte character set, rather use VARCHARs than CHARs, as the benefits are even greater than normal.

Earlier, we saw the character_set_client, character_set_results and character_set_connection variables. You can modify each of these individually (for example, to latin2) by running:

SET character_set_client = latin2;
SET character_set_results = latin2;
SET character_set_connection = latin2;

but it's quite likely you'll want to set all three at the same time. You can do this in one statement, as follows:

SET NAMES = latin2;

This will set the collation to whatever the default is for that character set, but to apply a collation specifically, use the COLLATE clause, for example:

SET NAMES latin2 COLLATE latin2_czech_cs

Conclusion

MySQL's mechanism for handling multiple languages is simple and powerful, especially compared to many of its DBMS rivals. In practice, you are likely to come across more issues with the other elements in a multilingual project, such as your desktop environment, programming language or editor. Good luck!

» See All Articles by Columnist Ian Gilfillan



MySQL Archives

Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 




Latest Forum Threads
MySQL Forum
Topic By Replies Updated
MySQL in high availability and transction secure banking application klamor 1 August 28th, 10:24 AM
MySQL rollback UAL225 1 August 28th, 10:15 AM
Browsing a DB file that uses MySql finleytech 1 July 26th, 10:51 AM
php cookie won't pass variable to next page ITdevGirl 0 June 22nd, 12:13 PM


















Thanks for your registration, follow us on our social networks to keep up-to-date