Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Links Database Forum

» Database Journal Home
» Database Articles
» Database Tutorials
MS SQL
Oracle
DB2
MS Access
MySQL
» RESOURCES
Database Tools
SQL Scripts & Samples
Links
» Database Forum
» Sitemap
Free Newsletters:
DatabaseDaily  
News Via RSS Feed


follow us on Twitter
Database Journal |DBA Support |SQLCourse |SQLCourse2
 

Featured Database Articles

MySQL

Posted Oct 15, 2010

Inside MySQL Character Sets & Settings

By Sean Hull

MySQL has sophisticated character set support, and can store, retrieve and collate single byte, and multi-byte character sets alike. Sean Hull explores using non-default character sets and settings, which use the hierarchy to control these settings.

What Are Character Sets Generally?

Character sets are ways of storing string or text data in a database. Since the world's languages use different character sets for their writing systems, a database must support many different types of character sets to store information in those languages. For Western European languages, for example, there are alphabets with many overlapping characters, but in addition some require accents, different currency characters, and so on. For Asian languages with many more characters, a multi-byte character set is required since one byte is not large enough to store all the characters that can be represented in that language.

How Are Character Sets Controlled in MySQL?

MySQL controls character sets through a hierarchy of settings. At the most global level there is the server setting. Below that is the database or schema setting. Below that is the connection setting, and client setting. Then at a further granular level, character sets can be set on a table-by-table level or even at the column level.

(root@localhost) [(none)]> show variables like 'character%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 | 
| character_set_connection | latin1 | 
| character_set_database | latin1 | 
| character_set_filesystem | binary | 
| character_set_results | latin1 | 
| character_set_server | latin1 | 
| character_set_system | utf8 | 
| character_sets_dir | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

What Is A Character Set Collation?

Collation in character sets defines how you compare different strings. For instance, you can have case-sensitive or case-insensitive collations. It gets more complicated when you consider some languages have characters with and without accents, which may have equivalence implications. Then there are multi-byte and Asian character sets, which may have still more complex collation considerations.

MySQL's Default Collation

Like with character sets themselves, the database has different levels of collations. You have the server global setting, the schema setting, and then the connection setting. You may also be surprised to learn that MySQL's default collation is "latin1_swedish_ci" meaning case insensitive.

Take a look at this example:

(root@localhost) [test]> select * from seant where c1 = 'hi there';
+----------+
| c1 |
+----------+
| hi there | 
| HI THERE | 
+----------+
2 rows in set (0.00 sec)

How Can I Change The Database Collation?

Suppose you want to change the collation of tables in the database so that you can do the previously mentioned case sensitive search. How would you go about doing that?

Well with collation, like with character sets, there are server level settings, schema or "database" level settings, and then there are connection level settings. Let's take a look:

(root@localhost) [test]> show variables like 'collation%';
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci | 
| collation_database | latin1_swedish_ci | 
| collation_server | latin1_general_cs | 
+----------------------+-------------------+
3 rows in set (0.00 sec)

Let's go ahead and change them:

(root@localhost) [test]> set collation_server = 'latin1_general_cs';
(root@localhost) [test]> set collation_connection = 'latin1_general_cs';
(root@localhost) [test]> set collation_database = 'latin1_general_cs';
(root@localhost) [test]> show variables like 'collation%';
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | latin1_general_cs | 
| collation_database | latin1_general_cs | 
| collation_server | latin1_general_cs | 
+----------------------+-------------------+
3 rows in set (0.00 sec)
Let's try our search again:
(root@localhost) [test]> select * from seant where c1 = 'hi there';
+----------+
| c1 |
+----------+
| hi there | 
| HI THERE | 
+----------+
2 rows in set (0.00 sec)

What's happening here? Why can I still not do a case insensitive search? It turns out, as you may have already guessed, that it is set at the table level in this case.

You must now go and convert the existing table collation if you want it to take affect for existing objects:

(root@localhost) [test]> alter table seant convert to character set 'latin1' collate 'latin1_general_cs';
Query OK, 2 rows affected (0.14 sec)
Records: 2 Duplicates: 0 Warnings: 0
(root@localhost) [test]> select * from seant where c1 = 'hi there';
+----------+
| c1 |
+----------+
| hi there | 
+----------+
1 row in set (0.00 sec)

What Is My Table Character Set and Collation?

You can determine the character set and collation of an existing table with the following command:

(root@localhost) [test]> show create table seantG;
*************************** 1. row ***************************
Table: seant
Create Table: CREATE TABLE `seant` (
`c1` varchar(64) collate latin1_general_cs default NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_cs
1 row in set (0.00 sec)

How Can I Convert A Table's Character Set?

As we explained above, the ALTER command can be used to convert a table's character set as follows:

(root@localhost) [test]> alter table seant convert to character set 'latin1' collate 'latin1_general_cs';
Query OK, 2 rows affected (0.14 sec)
Records: 2 Duplicates: 0 Warnings: 0

Should I Always Use UTF8 Character Set?

In some applications, you may be tempted to use UTF8 universally, as it handles Unicode and potentially characters from any language. UTF8 is variable length in MySQL, and is stored efficiently on disk if you are using single-byte character sets. However MySQL will use three bytes per character when storing UTF8 strings in memory. This will affect session memory usage, as well as data sets stored in other memory caches in the database. So overall memory consumption will be larger. How much of course depends on how much string data you have in your database, but it is certainly something to consider.

Conclusion

MySQL has sophisticated character set support, and can store, retrieve and collate single byte, and multi-byte character sets alike. When using non-default character sets and settings, you must consider the hierarchy, which controls these settings. They can be set at the server level, at the schema level, at the connection level, at the table level and the column level.

Related Articles

MySQL Data Methods

» See All Articles by Columnist Sean Hull



MySQL Archives

Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 




Latest Forum Threads
MySQL Forum
Topic By Replies Updated
MySQL in high availability and transction secure banking application klamor 1 August 28th, 10:24 AM
MySQL rollback UAL225 1 August 28th, 10:15 AM
Browsing a DB file that uses MySql finleytech 1 July 26th, 10:51 AM
php cookie won't pass variable to next page ITdevGirl 0 June 22nd, 12:13 PM


















Thanks for your registration, follow us on our social networks to keep up-to-date