Inside MySQL Character Sets & Settings


MySQL has sophisticated character set support, and can store, retrieve and collate single byte, and multi-byte character sets alike. Sean Hull explores using non-default character sets and settings, which use the hierarchy to control these settings.

What Are Character Sets Generally?

Character sets are ways of storing string or text data in a
database. Since the world’s languages use different character sets for their
writing systems, a database must support many different types of character sets
to store information in those languages. For Western European languages, for
example, there are alphabets with many overlapping characters, but in addition
some require accents, different currency characters, and so on. For Asian
languages with many more characters, a multi-byte character set is required
since one byte is not large enough to store all the characters that can be
represented in that language.

How Are Character Sets Controlled in MySQL?

MySQL controls character sets through a hierarchy of settings. At
the most global level there is the server setting. Below that is the database
or schema setting. Below that is the connection setting, and client setting. Then
at a further granular level, character sets can be set on a table-by-table
level or even at the column level.

([email protected]) [(none)]> show variables like 'character%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 | 
| character_set_connection | latin1 | 
| character_set_database | latin1 | 
| character_set_filesystem | binary | 
| character_set_results | latin1 | 
| character_set_server | latin1 | 
| character_set_system | utf8 | 
| character_sets_dir | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

What Is A Character Set Collation?

Collation in character sets defines how you compare different
strings. For instance, you can have case-sensitive or case-insensitive
collations. It gets more complicated when you consider some languages have
characters with and without accents, which may have equivalence implications. Then
there are multi-byte and Asian character sets, which may have still more
complex collation considerations.

MySQL’s Default Collation

Like with character sets themselves, the database has different
levels of collations. You have the server global setting, the schema setting,
and then the connection setting. You may also be surprised to learn that
MySQL’s default collation is "latin1_swedish_ci" meaning case
insensitive.

Take a look at this example:

([email protected]) [test]> select * from seant where c1 = 'hi there';
+----------+
| c1 |
+----------+
| hi there | 
| HI THERE | 
+----------+
2 rows in set (0.00 sec)

How Can I Change The Database Collation?

Suppose you want to change the collation of tables in the database
so that you can do the previously mentioned case sensitive search. How would
you go about doing that?

Well with collation, like with character sets, there are server
level settings, schema or "database" level settings, and then there
are connection level settings. Let’s take a look:

([email protected]) [test]> show variables like 'collation%';
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci | 
| collation_database | latin1_swedish_ci | 
| collation_server | latin1_general_cs | 
+----------------------+-------------------+
3 rows in set (0.00 sec)

Let’s go ahead and change them:

([email protected]) [test]> set collation_server = 'latin1_general_cs';
([email protected]) [test]> set collation_connection = 'latin1_general_cs';
([email protected]) [test]> set collation_database = 'latin1_general_cs';
([email protected]) [test]> show variables like 'collation%';
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | latin1_general_cs | 
| collation_database | latin1_general_cs | 
| collation_server | latin1_general_cs | 
+----------------------+-------------------+
3 rows in set (0.00 sec)
Let's try our search again:
([email protected]) [test]> select * from seant where c1 = 'hi there';
+----------+
| c1 |
+----------+
| hi there | 
| HI THERE | 
+----------+
2 rows in set (0.00 sec)

What’s happening here? Why can I still not do a case insensitive
search? It turns out, as you may have already guessed, that it is set at the
table level in this case.

You must now go and convert the existing table collation if you
want it to take affect for existing objects:

([email protected]) [test]> alter table seant convert to character set 'latin1' collate 'latin1_general_cs';
Query OK, 2 rows affected (0.14 sec)
Records: 2 Duplicates: 0 Warnings: 0
([email protected]) [test]> select * from seant where c1 = 'hi there';
+----------+
| c1 |
+----------+
| hi there | 
+----------+
1 row in set (0.00 sec)

What Is My Table Character Set and Collation?

You can determine the character set and collation of an existing
table with the following command:

([email protected]) [test]> show create table seantG;
*************************** 1. row ***************************
Table: seant
Create Table: CREATE TABLE `seant` (
`c1` varchar(64) collate latin1_general_cs default NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_cs
1 row in set (0.00 sec)

How Can I Convert A Table’s Character Set?

As we explained above, the ALTER command can be used to convert a
table’s character set as follows:

([email protected]) [test]> alter table seant convert to character set 'latin1' collate 'latin1_general_cs';
Query OK, 2 rows affected (0.14 sec)
Records: 2 Duplicates: 0 Warnings: 0

Should I Always Use UTF8 Character Set?

In some applications, you may be tempted to use UTF8 universally,
as it handles Unicode and potentially characters from any language. UTF8 is
variable length in MySQL, and is stored efficiently on disk if you are using
single-byte character sets. However MySQL will use three bytes per character when storing UTF8 strings in memory. This will affect
session memory usage, as well as data sets stored in other memory caches in the
database. So overall memory consumption will be larger. How much of course
depends on how much string data you have in your database, but it is certainly
something to consider.

Conclusion

MySQL has sophisticated character set support, and can store,
retrieve and collate single byte, and multi-byte character sets alike. When
using non-default character sets and settings, you must consider the hierarchy,
which controls these settings. They can be set at the server level, at the
schema level, at the connection level, at the table level and the column level.

Related Articles

MySQL Data Methods

»


See All Articles by Columnist

Sean Hull

Sean Hull
Sean Hull
Sean Hull is a business and technology consultant, author and speaker. He is the founder and senior consultant at Heavyweight Internet Group in New York and has worked as a database consultant and expert for ten years. He is the author of "Oracle and Open Source" on O'Reilly & Associates and his articles have appeared in Oracle's own OTN website, DBA Zine, ChangeThis.com, and many others. Using techniques, technologies, and perspectives discussed in the book, articles, and seminars, he has taken his career into exciting areas. He has served as Technology Consultant, Project Manager, CEO, and mentor with a background in programming, systems integration & administration, project development and management. He can be reached at [email protected] or visit http://www.iheavy.com for more info about consulting services.

Latest Articles