Unicode Support

As of MySQL version 4.1, there are two new character sets for storing Unicode data:

In UCS-2 (binary Unicode representation), every character is represented by a two-byte Unicode code with the most significant byte first. For example: "LATIN CAPITAL LETTER A" has the code 0x0041 and it's stored as a two-byte sequence: 0x00 0x41. "CYRILLIC SMALL LETTER YERU" (Unicode 0x044B) is stored as a two-byte sequence: 0x04 0x4B. For Unicode characters and their codes, please refer to the Unicode Home Page.

A temporary restriction is that UCS-2 cannot yet be used as a client character set. That means that SET NAMES 'ucs2' will not work.

The UTF8 character set (transform Unicode representation) is an alternative way to store Unicode data. It is implemented according to RFC2279. The idea of the UTF8 character set is that various Unicode characters fit into byte sequences of different lengths:

Currently, MySQL UTF8 support does not include four-byte sequences.

Tip: To save space with UTF8, use VARCHAR instead of CHAR. Otherwise, MySQL has to reserve 30 bytes for a CHAR(10) CHARACTER SET utf8 column, because that's the maximum possible length.