Chapter 10. Character Set Support

Table of Contents

10.1. Character Sets and Collations in General
10.2. Character Sets and Collations in MySQL
10.3. Determining the Default Character Set and Collation
10.3.1. Server Character Set and Collation
10.3.2. Database Character Set and Collation
10.3.3. Table Character Set and Collation
10.3.4. Column Character Set and Collation
10.3.5. Examples of Character Set and Collation Assignment
10.3.6. Connection Character Sets and Collations
10.3.7. Character String Literal Character Set and Collation
10.3.8. Using COLLATE in SQL Statements
10.3.9. COLLATE Clause Precedence
10.3.10. BINARY Operator
10.3.11. Some Special Cases Where the Collation Determination Is Tricky
10.3.12. Collations Must Be for the Right Character Set
10.3.13. An Example of the Effect of Collation
10.4. Operations Affected by Character Set Support
10.4.1. Result Strings
10.4.2. CONVERT()
10.4.3. CAST()
10.4.4. SHOW Statements
10.5. Unicode Support
10.6. UTF8 for Metadata
10.7. Compatibility with Other DBMSs
10.8. New Character Set Configuration File Format
10.9. National Character Set
10.10. Character Sets and Collations That MySQL Supports
10.10.1. Unicode Character Sets
10.10.2. West European Character Sets
10.10.3. Central European Character Sets
10.10.4. South European and Middle East Character Sets
10.10.5. Baltic Character Sets
10.10.6. Cyrillic Character Sets
10.10.7. Asian Character Sets

This chapter discusses character set and collation support in MySQL, covering the topics listed in the table of contents above.

Character set support in MySQL 5.0 is included in the MyISAM, MEMORY, and InnoDB storage engines.

10.1. Character Sets and Collations in General

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.

Suppose that we have an alphabet with four letters: ‘A’, ‘B’, ‘a’, ‘b’. We give each letter a number: ‘A’ = 0, ‘B’ = 1, ‘a’ = 2, ‘b’ = 3. The letter ‘A’ is a symbol, the number 0 is the encoding for ‘A’, and the combination of all four letters and their encodings is a character set.

Suppose that we want to compare two string values, ‘A’ and ‘B’. The simplest way to do this is to look at the encodings: 0 for ‘A’ and 1 for ‘B’. Because 0 is less than 1, we say ‘A’ is less than ‘B’. What we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): “compare the encodings.” We call this simplest of all possible collations a binary collation.

But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters ‘a’ and ‘b’ as equivalent to ‘A’ and ‘B’; (2) then compare the encodings. We call this a case-insensitive collation. It's a little more complex than a binary collation.
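
The imaginary character set above is small enough to model directly. The following Python sketch only illustrates the idea of a collation as a comparison rule; it is not how MySQL implements collations, and the names ENCODING, binary_key, and ci_key are invented for this illustration.

```python
# Toy character set from the text: four symbols and their encodings.
ENCODING = {"A": 0, "B": 1, "a": 2, "b": 3}

def binary_key(s):
    """Binary collation: compare strings by their encodings only."""
    return [ENCODING[ch] for ch in s]

def ci_key(s):
    """Case-insensitive collation: (1) treat lowercase letters as their
    uppercase equivalents, (2) then compare the encodings."""
    return [ENCODING[ch.upper()] for ch in s]

words = ["b", "A", "a", "B"]
print(sorted(words, key=binary_key))        # ['A', 'B', 'a', 'b']
print(binary_key("a") == binary_key("A"))   # False: different encodings
print(ci_key("a") == ci_key("A"))           # True: equivalent under this collation
```

The same strings compare differently depending only on which key function (collation) is applied; the stored encodings never change.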

In real life, most character sets have many characters: not just ‘A’ and ‘B’ but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules: not just case insensitivity but also accent insensitivity (an “accent” is a mark attached to a character as in German ‘Ö’) and multiple-character mappings (such as the rule that ‘Ö’ = ‘OE’ in one of the two German collations).

MySQL 5.0 can do these things for you:

  • Store strings using a variety of character sets

  • Compare strings using a variety of collations

  • Mix strings with different character sets or collations in the same server, the same database, or even the same table

  • Allow specification of character set and collation at any level

In these respects, not only is MySQL 5.0 far more flexible than earlier versions of MySQL, it also is far ahead of other DBMSs. However, to use these features effectively, you need to know what character sets and collations are available, how to change the defaults, and how they affect the behavior of string operators and functions.

10.2. Character Sets and Collations in MySQL

The MySQL server can support multiple character sets. To list the available character sets, use the SHOW CHARACTER SET statement:

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp850    | DOS West European           | cp850_general_ci    |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 |
| latin1   | cp1252 West European        | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                    | ascii_general_ci    |      1 |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |      2 |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |      1 |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |      1 |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |      2 |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |      1 |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |      2 |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |      1 |
| cp1250   | Windows Central European    | cp1250_general_ci   |      1 |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |      2 |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |      1 |
...

(For a complete listing, see Section 10.10, “Character Sets and Collations That MySQL Supports”.)

Any given character set always has at least one collation. It may have several collations.

To list the collations for a character set, use the SHOW COLLATION statement. For example, to see the collations for the latin1 (“cp1252 West European”) character set, use this statement to find those collation names that begin with latin1:

mysql> SHOW COLLATION LIKE 'latin1%';
+---------------------+---------+----+---------+----------+---------+
| Collation           | Charset | Id | Default | Compiled | Sortlen |
+---------------------+---------+----+---------+----------+---------+
| latin1_german1_ci   | latin1  |  5 |         |          |       0 |
| latin1_swedish_ci   | latin1  |  8 | Yes     | Yes      |       1 |
| latin1_danish_ci    | latin1  | 15 |         |          |       0 |
| latin1_german2_ci   | latin1  | 31 |         | Yes      |       2 |
| latin1_bin          | latin1  | 47 |         | Yes      |       1 |
| latin1_general_ci   | latin1  | 48 |         |          |       0 |
| latin1_general_cs   | latin1  | 49 |         |          |       0 |
| latin1_spanish_ci   | latin1  | 94 |         |          |       0 |
+---------------------+---------+----+---------+----------+---------+

The latin1 collations have the following meanings:

Collation           Meaning
latin1_german1_ci   German DIN-1
latin1_swedish_ci   Swedish/Finnish
latin1_danish_ci    Danish/Norwegian
latin1_german2_ci   German DIN-2
latin1_bin          Binary according to latin1 encoding
latin1_general_ci   Multilingual (Western European)
latin1_general_cs   Multilingual (ISO Western European), case sensitive
latin1_spanish_ci   Modern Spanish

Collations have these general characteristics:

  • Two different character sets cannot have the same collation.

  • Each character set has one collation that is the default collation. For example, the default collation for latin1 is latin1_swedish_ci.

  • There is a convention for collation names: They start with the name of the character set with which they are associated, they usually include a language name, and they end with _ci (case insensitive), _cs (case sensitive), or _bin (binary).
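
As a rough illustration of the naming convention, the following Python sketch splits a collation name into its parts. The function name split_collation_name is hypothetical, and the sketch assumes that the character set name itself contains no underscore, which holds for the names shown in this section but is not stated as a general rule.

```python
SUFFIX_MEANING = {"ci": "case insensitive", "cs": "case sensitive", "bin": "binary"}

def split_collation_name(name):
    """Split a collation name into (charset, language, suffix) using only
    the naming convention. language is None for names such as latin1_bin."""
    parts = name.split("_")
    charset, suffix = parts[0], parts[-1]
    if len(parts) < 2 or suffix not in SUFFIX_MEANING:
        raise ValueError("name does not follow the convention: %r" % name)
    language = "_".join(parts[1:-1]) or None
    return charset, language, suffix

for name in ("latin1_swedish_ci", "latin1_general_cs", "latin1_bin"):
    charset, language, suffix = split_collation_name(name)
    print(name, "->", charset, language, SUFFIX_MEANING[suffix])
```

Note that the middle part is not always a language name (for example, general), which is why the convention says collation names "usually" include one.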

10.3. Determining the Default Character Set and Collation

There are default settings for character sets and collations at four levels: server, database, table, and column. The following description may appear complex, but it has been found in practice that multiple-level defaulting leads to natural and obvious results.

10.3.1. Server Character Set and Collation

The MySQL Server has a server character set and a server collation, neither of which may be null.

MySQL determines the server character set and server collation as follows:

  • According to the option settings in effect when the server starts

  • According to the values set at runtime

At the server level, the decision is simple. The server character set and collation depend initially on the options that you use when you start mysqld. You can use --default-character-set for the character set, and along with it you can add --default-collation for the collation. If you don't specify a character set, that is the same as saying --default-character-set=latin1. If you specify only a character set (for example, latin1) but not a collation, that is the same as saying --default-character-set=latin1 --default-collation=latin1_swedish_ci because latin1_swedish_ci is the default collation for latin1. Therefore, the following three commands all have the same effect:

shell> mysqld
shell> mysqld --default-character-set=latin1
shell> mysqld --default-character-set=latin1 \
           --default-collation=latin1_swedish_ci

One way to change the settings is by recompiling. To change the default server character set and collation when building from source, pass --with-charset and --with-collation as arguments to configure. For example:

shell> ./configure --with-charset=latin1

Or:

shell> ./configure --with-charset=latin1 \
           --with-collation=latin1_german1_ci

Both mysqld and configure verify that the character set/collation combination is valid. If not, each program displays an error message and terminates.

The current server character set and collation are available as the values of the character_set_server and collation_server system variables. These variables can be changed at runtime.

10.3.2. Database Character Set and Collation

Every database has a database character set and a database collation, which may not be null. The CREATE DATABASE and ALTER DATABASE statements have optional clauses for specifying the database character set and collation:

CREATE DATABASE db_name
    [[DEFAULT] CHARACTER SET charset_name]
    [[DEFAULT] COLLATE collation_name]

ALTER DATABASE db_name
    [[DEFAULT] CHARACTER SET charset_name]
    [[DEFAULT] COLLATE collation_name]

Example:

CREATE DATABASE db_name
    DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci;

MySQL chooses the database character set and database collation thus:

  • If both CHARACTER SET X and COLLATE Y were specified, then character set X and collation Y.

  • If CHARACTER SET X was specified without COLLATE, then character set X and its default collation.

  • Otherwise, the server character set and server collation.

MySQL's CREATE DATABASE ... DEFAULT CHARACTER SET ... syntax is analogous to the standard SQL CREATE SCHEMA ... CHARACTER SET ... syntax. Because of this, it is possible to create databases with different character sets and collations on the same MySQL server.

The database character set and collation are used as default values if the table character set and collation are not specified in CREATE TABLE statements. They have no other purpose.

The character set and collation for the default database are available as the values of the character_set_database and collation_database system variables. The server sets these variables whenever the default database changes. If there is no default database, the variables have the same value as the corresponding server-level variables, character_set_server and collation_server.

10.3.3. Table Character Set and Collation

Every table has a table character set and a table collation, which may not be null. The CREATE TABLE and ALTER TABLE statements have optional clauses for specifying the table character set and collation:

CREATE TABLE tbl_name (column_list)
    [DEFAULT CHARACTER SET charset_name [COLLATE collation_name]]

ALTER TABLE tbl_name
    [DEFAULT CHARACTER SET charset_name] [COLLATE collation_name]

Example:

CREATE TABLE t1 ( ... )
    DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;

MySQL chooses the table character set and collation in the following manner:

  • If both CHARACTER SET X and COLLATE Y were specified, then character set X and collation Y.

  • If CHARACTER SET X was specified without COLLATE, then character set X and its default collation.

  • Otherwise, the database character set and collation.

The table character set and collation are used as default values if the column character set and collation are not specified in individual column definitions. The table character set and collation are MySQL extensions; there are no such things in standard SQL.

10.3.4. Column Character Set and Collation

Every “character” column (that is, a column of type CHAR, VARCHAR, or TEXT) has a column character set and a column collation, which may not be null. Column definition syntax has optional clauses for specifying the column character set and collation:

col_name {CHAR | VARCHAR | TEXT} (col_length)
    [CHARACTER SET charset_name [COLLATE collation_name]]

Example:

CREATE TABLE Table1
(
    column1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_german1_ci
);

MySQL chooses the column character set and collation in the following manner:

  • If both CHARACTER SET X and COLLATE Y were specified, then character set X and collation Y are used.

  • If CHARACTER SET X was specified without COLLATE, then character set X and its default collation are used.

  • Otherwise, the table character set and collation are used.

The CHARACTER SET and COLLATE clauses are standard SQL.

10.3.5. Examples of Character Set and Collation Assignment

The following examples show how MySQL determines default character set and collation values.

Example 1: Table and Column Definition

CREATE TABLE t1
(
    c1 CHAR(10) CHARACTER SET latin1 COLLATE latin1_german1_ci
) DEFAULT CHARACTER SET latin2 COLLATE latin2_bin;

Here we have a column with a latin1 character set and a latin1_german1_ci collation. The definition is explicit, so that's straightforward. Notice that there is no problem with storing a latin1 column in a latin2 table.

Example 2: Table and Column Definition

CREATE TABLE t1
(
    c1 CHAR(10) CHARACTER SET latin1
) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;

This time we have a column with a latin1 character set and a default collation. Although it might seem natural, the default collation is not taken from the table level. Instead, because the default collation for latin1 is always latin1_swedish_ci, column c1 has a collation of latin1_swedish_ci (not latin1_danish_ci).

Example 3: Table and Column Definition

CREATE TABLE t1
(
    c1 CHAR(10)
) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;

We have a column with a default character set and a default collation. In this circumstance, MySQL looks up to the table level for inspiration in determining the column character set and collation. So, the character set for column c1 is latin1 and its collation is latin1_danish_ci.

Example 4: Database, Table, and Column Definition

CREATE DATABASE d1
    DEFAULT CHARACTER SET latin2 COLLATE latin2_czech_ci;
USE d1;
CREATE TABLE t1
(
    c1 CHAR(10)
);

We create a column without specifying its character set and collation. We're also not specifying a character set and a collation at the table level. In this circumstance, MySQL looks up to the database level for inspiration. (The database's settings become the table's settings, and thereafter become the column's setting.) So, the character set for column c1 is latin2 and its collation is latin2_czech_ci.
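
The defaulting rules from the preceding sections can be summarized in a short Python model that reproduces Examples 2 through 4. This is a sketch of the three listed rules only, not of the server's actual code; the function name resolve is hypothetical, and the DEFAULT_COLLATION values are taken from the SHOW CHARACTER SET output shown earlier in this chapter.

```python
# Default collations for the character sets used in the examples.
DEFAULT_COLLATION = {"latin1": "latin1_swedish_ci", "latin2": "latin2_general_ci"}

def resolve(parent, charset=None, collation=None):
    """Return the (charset, collation) pair for one level.

    parent is the pair of the enclosing level: the server for a database,
    the database for a table, the table for a column."""
    if charset and collation:
        return charset, collation                   # both given
    if charset:
        return charset, DEFAULT_COLLATION[charset]  # charset default collation
    if collation:
        raise ValueError("COLLATE without CHARACTER SET is not modeled here")
    return parent                                   # inherit from the level above

server = resolve(None, "latin1")   # like starting mysqld with no options

# Example 2: the column names only a character set.
table = resolve(server, "latin1", "latin1_danish_ci")
print(resolve(table, "latin1"))    # ('latin1', 'latin1_swedish_ci')

# Example 3: the column names nothing, so the table settings apply.
print(resolve(table))              # ('latin1', 'latin1_danish_ci')

# Example 4: nothing is specified below the database level.
database = resolve(server, "latin2", "latin2_czech_ci")
print(resolve(resolve(database)))  # ('latin2', 'latin2_czech_ci')
```

Each level consults only its immediate parent, which is why the database settings reach the column in Example 4 by way of the table.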

10.3.6. Connection Character Sets and Collations

Several character set and collation system variables relate to a client's interaction with the server. Some of these have been mentioned in earlier sections:

  • The server character set and collation are available as the values of the character_set_server and collation_server variables.

  • The character set and collation of the default database are available as the values of the character_set_database and collation_database variables.

Additional character set and collation variables are involved in handling traffic for the connection between a client and the server. Every client has connection-related character set and collation variables.

Consider what a “connection” is: It's what you make when you connect to the server. The client sends SQL statements, such as queries, over the connection to the server. The server sends responses, such as result sets, over the connection back to the client. This leads to several questions about character set and collation handling for client connections, each of which can be answered in terms of system variables:

  • What character set is the query in when it leaves the client?

    The server takes the character_set_client variable to be the character set in which queries are sent by the client.

  • What character set should the server translate a query to after receiving it?

    For this, character_set_connection and collation_connection are used by the server. It converts queries sent by the client from character_set_client to character_set_connection (except for string literals that have an introducer such as _latin1 or _utf8). collation_connection is important for comparisons of literal strings. For comparisons of strings with column values, it does not matter because columns have a higher collation precedence.

  • What character set should the server translate to before shipping result sets or error messages back to the client?

    The character_set_results variable indicates the character set in which the server returns query results to the client. This includes result data such as column values, and result metadata such as column names.

You can fine-tune the settings for these variables, or you can depend on the defaults (in which case, you can skip this section).

There are two statements that affect the connection character sets:

SET NAMES 'charset_name'
SET CHARACTER SET charset_name

SET NAMES indicates what is in the SQL statements that the client sends. Thus, SET NAMES 'cp1251' tells the server “future incoming messages from this client are in character set cp1251.” It also specifies the character set for results that the server sends back to the client. (For example, it indicates what character set to use for column values if you use a SELECT statement.)

A SET NAMES 'x' statement is equivalent to these three statements:

mysql> SET character_set_client = x;
mysql> SET character_set_results = x;
mysql> SET character_set_connection = x;

Setting character_set_connection to x also sets collation_connection to the default collation for x.

SET CHARACTER SET is similar but sets the connection character set and collation to be those of the default database. A SET CHARACTER SET x statement is equivalent to these three statements:

mysql> SET character_set_client = x;
mysql> SET character_set_results = x;
mysql> SET collation_connection = @@collation_database;
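
To make the difference between the two statements concrete, the following Python sketch models the session variables as a dictionary. It is only an illustration of the equivalences listed above; the function names and the sample session (a default database that uses latin2) are hypothetical.

```python
# Default collations, as listed by SHOW CHARACTER SET earlier in the chapter.
DEFAULT_COLLATION = {"latin1": "latin1_swedish_ci", "koi8r": "koi8r_general_ci"}

def set_names(session, charset):
    """Model of SET NAMES 'charset'."""
    session["character_set_client"] = charset
    session["character_set_results"] = charset
    session["character_set_connection"] = charset
    # Setting character_set_connection also resets collation_connection.
    session["collation_connection"] = DEFAULT_COLLATION[charset]

def set_character_set(session, charset):
    """Model of SET CHARACTER SET charset."""
    session["character_set_client"] = charset
    session["character_set_results"] = charset
    # The connection settings follow the default database, not charset.
    session["character_set_connection"] = session["character_set_database"]
    session["collation_connection"] = session["collation_database"]

# Hypothetical session whose default database uses latin2.
session = {"character_set_database": "latin2",
           "collation_database": "latin2_general_ci"}

set_names(session, "koi8r")
print(session["collation_connection"])   # koi8r_general_ci

set_character_set(session, "koi8r")
print(session["collation_connection"])   # latin2_general_ci
```

Both statements set the client and results character sets to the named character set; they differ only in what happens to the connection character set and collation.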

When a client connects, it sends to the server the name of the character set that it wants to use. The server sets the character_set_client, character_set_results, and character_set_connection variables to that character set. (In effect, the server performs a SET NAMES operation using the character set.)

With the mysql client, it is not necessary to execute SET NAMES every time you start up if you want to use a character set different from the default. You can add the --default-character-set option to your mysql command line or to your option file. For example, the following option file setting sets the three character set variables to koi8r each time you run mysql:

[mysql]
default-character-set=koi8r

Example: Suppose that column1 is defined as CHAR(5) CHARACTER SET latin2. If you do not say SET NAMES or SET CHARACTER SET, then for SELECT column1 FROM t, the server sends back all the values for column1 using the character set that the client specified when it connected. On the other hand, if you say SET NAMES 'latin1' or SET CHARACTER SET latin1, then just before sending results back, the server converts the latin2 values to latin1. Conversion may be lossy if there are characters that are not in both character sets.
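
The lossy case can be reproduced outside the server. The following Python sketch uses Python's own codecs as stand-ins (iso8859_2 for latin2 and cp1252 for latin1) and a sample value chosen for this illustration. Python's errors="replace" handler substitutes a question mark for characters that cannot be converted; treat the exact substitute character as an assumption of the sketch rather than a statement about the server.

```python
# 'ř' exists in ISO 8859-2 (latin2) but not in cp1252 (latin1).
stored = "Dvo\u0159\u00e1k".encode("iso8859_2")   # bytes as held in a latin2 column

text = stored.decode("iso8859_2")
sent = text.encode("cp1252", errors="replace")     # what a latin1 client would receive

print(sent)                   # b'Dvo?\xe1k'
print(sent.decode("cp1252"))  # Dvo?ák
```

The character that exists in both character sets (á) survives the conversion; the one that exists only in latin2 (ř) does not.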

If you do not want the server to perform any conversion, set character_set_results to NULL:

mysql> SET character_set_results = NULL;

10.3.7. Character String Literal Character Set and Collation

Every character string literal has a character set and a collation, which may not be null.

A character string literal may have an optional character set introducer and COLLATE clause:

[_charset_name]'string' [COLLATE collation_name]

Examples:

SELECT 'string';
SELECT _latin1'string';
SELECT _latin1'string' COLLATE latin1_danish_ci;

For the simple statement SELECT 'string', the string has the character set and collation defined by the character_set_connection and collation_connection system variables.

The _charset_name expression is formally called an introducer. It tells the parser, “the string that is about to follow uses character set X.” Because this has confused people in the past, we emphasize that an introducer does not cause any conversion; it is strictly a signal that does not change the string's value. An introducer is also legal before standard hex literal and numeric hex literal notation (x'literal' and 0xnnnn), and before ? (parameter substitution when using prepared statements within a programming language interface).

Examples:

SELECT _latin1 x'AABBCC';
SELECT _latin1 0xAABBCC;
SELECT _latin1 ?;
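
The point that an introducer labels a string without converting it can be illustrated with a small Python sketch. Here an introducer is modeled as a (charset, bytes) pair; this pairing is an invention for the illustration, and Python's cp1252 and koi8_r codecs are used as stand-ins for the latin1 and koi8r character sets.

```python
raw = bytes.fromhex("AABBCC")    # the same bytes as x'AABBCC'

# An introducer modeled as a label attached to untouched bytes.
as_latin1 = ("cp1252", raw)      # like _latin1 x'AABBCC'
as_koi8r = ("koi8_r", raw)       # like _koi8r x'AABBCC'

# No conversion happened: the value is byte-for-byte identical.
print(as_latin1[1] == as_koi8r[1] == raw)   # True

# Only the interpretation of those bytes differs.
text_latin1 = as_latin1[1].decode(as_latin1[0])
text_koi8r = as_koi8r[1].decode(as_koi8r[0])
print(text_latin1 == text_koi8r)            # False
```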

MySQL determines a literal's character set and collation thus:

  • If both _X and COLLATE Y were specified, then character set X and collation Y are used.

  • If _X is specified but COLLATE is not specified, then character set X and its default collation are used.

  • Otherwise, the character set and collation given by the character_set_connection and collation_connection system variables are used.

Examples:

  • A string with latin1 character set and latin1_german1_ci collation:

    SELECT _latin1'Müller' COLLATE latin1_german1_ci;
    
  • A string with latin1 character set and its default collation (that is, latin1_swedish_ci):

    SELECT _latin1'Müller';
    
  • A string with the connection default character set and collation:

    SELECT 'Müller';
    

Character set introducers and the COLLATE clause are implemented according to standard SQL specifications.

10.3.8. Using COLLATE in SQL Statements

With the COLLATE clause, you can override whatever the default collation is for a comparison. COLLATE may be used in various parts of SQL statements. Here are some examples:

  • With ORDER BY:

    SELECT k
    FROM t1
    ORDER BY k COLLATE latin1_german2_ci;
    
  • With AS:

    SELECT k COLLATE latin1_german2_ci AS k1
    FROM t1
    ORDER BY k1;
    
  • With GROUP BY:

    SELECT k
    FROM t1
    GROUP BY k COLLATE latin1_german2_ci;
    
  • With aggregate functions:

    SELECT MAX(k COLLATE latin1_german2_ci)
    FROM t1;
    
  • With DISTINCT:

    SELECT DISTINCT k COLLATE latin1_german2_ci
    FROM t1;
    
  • With WHERE:

    SELECT *
    FROM t1
    WHERE _latin1 'Müller' COLLATE latin1_german2_ci = k;
    
    SELECT *
    FROM t1
    WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci;
    
  • With HAVING:

    SELECT k
    FROM t1
    GROUP BY k
    HAVING k = _latin1 'Müller' COLLATE latin1_german2_ci;
    

10.3.9. COLLATE Clause Precedence

The COLLATE clause has high precedence (higher than ||), so the following two expressions are equivalent:

x || y COLLATE z
x || (y COLLATE z)

10.3.10. BINARY Operator

The BINARY operator is a shorthand for a COLLATE clause. BINARY 'x' is equivalent to 'x' COLLATE y, where y is the name of the binary collation for the character set of 'x'. Every character set has a binary collation. For example, the binary collation for the latin1 character set is latin1_bin, so if the column a is of character set latin1, the following two statements have the same effect:

SELECT * FROM t1 ORDER BY BINARY a;
SELECT * FROM t1 ORDER BY a COLLATE latin1_bin;

10.3.11. Some Special Cases Where the Collation Determination Is Tricky

In the great majority of queries, it is obvious what collation MySQL uses to resolve a comparison operation. For example, in the following cases, it should be clear that the collation is “the column collation of column x”:

SELECT x FROM T ORDER BY x;
SELECT x FROM T WHERE x = x;
SELECT DISTINCT x FROM T;

However, when multiple operands are involved, there can be ambiguity. For example:

SELECT x FROM T WHERE x = 'Y';

Should this query use the collation of the column x, or of the string literal 'Y'?

Standard SQL resolves such questions using what used to be called “coercibility” rules. Basically, this means: Since both x and 'Y' have collations, whose collation takes precedence? This can be difficult to resolve, but the following rules take care of most situations:

  • An explicit COLLATE clause has a coercibility of 0. (Not coercible at all.)

  • The concatenation of two strings with different collations has a coercibility of 1.

  • A column's collation has a coercibility of 2.

  • A “system constant” (the string returned by functions such as USER() or VERSION()) has a coercibility of 3.

  • A literal's collation has a coercibility of 4.

  • NULL or an expression that is derived from NULL has a coercibility of 5.

The preceding coercibility values are current as of MySQL 5.0.3. See the note later in this section for additional version-related information.

Those rules resolve ambiguities thus:

  • Use the collation with the lowest coercibility value.

  • If both sides have the same coercibility, then it is an error if the collations aren't the same.

Examples:

column1 = 'A'                        Use collation of column1
column1 = 'A' COLLATE x              Use collation of 'A'
column1 COLLATE x = 'A' COLLATE y    Error
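
The two resolution rules can be written as a small Python function. This is a sketch of the rules as stated in this section, not of the server's implementation; the function name comparison_collation and the sample collations are chosen for illustration.

```python
# Coercibility values from the list above (as of MySQL 5.0.3).
EXPLICIT, COLUMN, LITERAL = 0, 2, 4

def comparison_collation(left, right):
    """Each operand is a (collation, coercibility) pair.
    Return the collation that the comparison uses."""
    (lcoll, lcoer), (rcoll, rcoer) = left, right
    if lcoer < rcoer:
        return lcoll              # lowest coercibility value wins
    if rcoer < lcoer:
        return rcoll
    if lcoll != rcoll:
        raise ValueError("same coercibility but different collations")
    return lcoll

column1 = ("latin1_swedish_ci", COLUMN)

# column1 = 'A': the column wins over the literal.
print(comparison_collation(column1, ("latin1_general_ci", LITERAL)))

# column1 = 'A' COLLATE x: the explicit COLLATE wins.
print(comparison_collation(column1, ("latin1_german1_ci", EXPLICIT)))
```

Calling the function with two explicit but different collations raises an error, matching the third row of the examples.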

The COERCIBILITY() function can be used to determine the coercibility of a string expression:

mysql> SELECT COERCIBILITY('A' COLLATE latin1_swedish_ci);
        -> 0
mysql> SELECT COERCIBILITY(VERSION());
        -> 3
mysql> SELECT COERCIBILITY('A');
        -> 4

See Section 12.9.3, “Information Functions”.

In MySQL 5.0 prior to release 5.0.3, there is no separate coercibility level for system constants or for NULL-derived (ignorable) values. Functions such as USER() have a coercibility of 2 rather than 3, and literals have a coercibility of 3 rather than 4.

10.3.12. Collations Must Be for the Right Character Set

Recall that each character set has one or more collations, and each collation is associated with one and only one character set. Therefore, the following statement causes an error message because the latin2_bin collation is not legal with the latin1 character set:

mysql> SELECT _latin1 'x' COLLATE latin2_bin;
ERROR 1251: COLLATION 'latin2_bin' is not valid
for CHARACTER SET 'latin1'
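
A client-side check for this kind of mismatch can be sketched in Python by relying on the naming convention that a collation name starts with the name of its character set. The function check_collation is hypothetical, and the prefix test is a simplification based on that convention rather than on the server's list of collations.

```python
def check_collation(charset, collation):
    """Return collation if its name marks it as belonging to charset;
    otherwise raise ValueError. Judged by the name prefix only."""
    if not collation.startswith(charset + "_"):
        raise ValueError(
            "COLLATION '%s' is not valid for CHARACTER SET '%s'"
            % (collation, charset))
    return collation

print(check_collation("latin1", "latin1_bin"))   # latin1_bin

try:
    check_collation("latin1", "latin2_bin")
except ValueError as err:
    print(err)
```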

10.3.13. An Example of the Effect of Collation

Suppose that column X in table T has these latin1 column values:

Muffler
Müller
MX Systems
MySQL

And suppose that the column values are retrieved using the following statement:

SELECT X FROM T ORDER BY X COLLATE collation_name;

The resulting order of the values for different collations is shown in this table:

latin1_swedish_ci   latin1_german1_ci   latin1_german2_ci
Muffler             Muffler             Müller
MX Systems          Müller              Muffler
Müller              MX Systems          MX Systems
MySQL               MySQL               MySQL

The table is an example that shows what the effect would be if we used different collations in an ORDER BY clause. The character that causes the different sort orders in this example is the U with two dots over it (ü), which the Germans call "U-umlaut".

  • The first column shows the result of the SELECT using the Swedish/Finnish collating rule, which says that U-umlaut sorts with Y.

  • The second column shows the result of the SELECT using the German DIN-1 rule, which says that U-umlaut sorts with U.

  • The third column shows the result of the SELECT using the German DIN-2 rule, which says that U-umlaut sorts with UE.
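
The three orderings can be reproduced with a short Python sketch that models each collating rule as a sort key. The key functions are deliberate simplifications that handle only lowercase folding and the U-umlaut rule described above; they are not the real collation tables, and the names make_key and KEYS are invented for this illustration.

```python
VALUES = ["Muffler", "M\u00fcller", "MX Systems", "MySQL"]

def make_key(u_umlaut_as):
    """Build a case-insensitive sort key that treats U-umlaut as the given string."""
    return lambda s: s.lower().replace("\u00fc", u_umlaut_as)

KEYS = {
    "latin1_swedish_ci": make_key("y"),    # U-umlaut sorts with Y
    "latin1_german1_ci": make_key("u"),    # U-umlaut sorts with U
    "latin1_german2_ci": make_key("ue"),   # U-umlaut sorts with UE
}

for name, key in KEYS.items():
    print(name, sorted(VALUES, key=key))
```

Only the position of Müller changes between the three results, because it is the only value that contains the character whose treatment differs.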

10.4. Operations Affected by Character Set Support

This section describes operations that take character set information into account in MySQL 5.0.

10.4.1. Result Strings

MySQL has many operators and functions that return a string. This section answers the question: What is the character set and collation of such a string?

For simple functions that take string input and return a string result as output, the output's character set and collation are the same as those of the principal input value. For example, UPPER(X) returns a string whose character set and collation are the same as those of X. The same applies for INSTR(), LCASE(), LOWER(), LTRIM(), MID(), REPEAT(), REPLACE(), REVERSE(), RIGHT(), RPAD(), RTRIM(), SOUNDEX(), SUBSTRING(), TRIM(), and UCASE(). (Also note: The REPLACE() function, unlike all other functions, always ignores the collation of the string input and performs a case-sensitive comparison.)

For operations that combine multiple string inputs and return a single string output, the “aggregation rules” of standard SQL apply:

  • If an explicit COLLATE X occurs, then use X.

  • If explicit COLLATE X and COLLATE Y occur, then raise an error.

  • Otherwise, if all collations are X, then use X.

  • Otherwise, the result has no collation.

For example, with CASE ... WHEN a THEN b WHEN b THEN c COLLATE X END, the resulting collation is X. The same applies for CASE, UNION, ||, CONCAT(), ELT(), GREATEST(), IF(), and LEAST().
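
The aggregation rules can be expressed as a small Python function. This is a sketch of the four rules as listed above, with a hypothetical function name and operand representation; it does not model any additional behavior of the server.

```python
def result_collation(operands):
    """operands is a list of (collation, explicit) pairs, where explicit is
    True if the operand carries its own COLLATE clause.
    Returns the result collation, or None for "no collation"."""
    explicit = {coll for coll, is_explicit in operands if is_explicit}
    if len(explicit) > 1:
        raise ValueError("conflicting explicit COLLATE clauses")
    if explicit:
        return explicit.pop()          # an explicit COLLATE X wins
    collations = {coll for coll, _ in operands}
    if len(collations) == 1:
        return collations.pop()        # all operands agree on X
    return None                        # otherwise, no collation

# CASE ... THEN b ... THEN c COLLATE X END: the explicit clause decides.
print(result_collation([("latin1_swedish_ci", False),
                        ("latin1_german1_ci", True)]))   # latin1_german1_ci
```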

For operations that convert to character data, the character set and collation of the strings that result from the operations are defined by the character_set_connection and collation_connection system variables. This applies to CAST(), CHAR(), CONV(), FORMAT(), HEX(), and SPACE().

10.4.2. CONVERT()

CONVERT() provides a way to convert data between different character sets. The syntax is:

CONVERT(expr USING transcoding_name)

In MySQL, transcoding names are the same as the corresponding character set names.

Examples:

SELECT CONVERT(_latin1'Müller' USING utf8);
INSERT INTO utf8table (utf8column)
    SELECT CONVERT(latin1field USING utf8) FROM latin1table;

CONVERT(... USING ...) is implemented according to the standard SQL specification.
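
Unlike an introducer, a conversion changes the stored bytes while preserving the characters. The following Python sketch shows the byte-level effect of transcoding the sample string from latin1 to utf8, using Python's cp1252 and utf-8 codecs as stand-ins for the MySQL character sets; it illustrates the concept and does not call the server.

```python
latin1_bytes = "M\u00fcller".encode("cp1252")   # the string as latin1 bytes

# Transcoding: decode from the source character set, encode in the target.
utf8_bytes = latin1_bytes.decode("cp1252").encode("utf-8")

print(latin1_bytes, len(latin1_bytes))   # b'M\xfcller' 6
print(utf8_bytes, len(utf8_bytes))       # b'M\xc3\xbcller' 7
```

The text is the same in both cases, but the U-umlaut occupies one byte in latin1 and two bytes in utf8.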

In TRADITIONAL SQL mode, if you convert a “zero” date string to a date, CONVERT() returns NULL. MySQL 5.0.4 and above also produce a warning.

10.4.3. CAST()

You may also use CAST() to convert a string to a different character set. The syntax is:

CAST(character_string AS character_data_type CHARACTER SET charset_name)

Example:

SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8);

If you use CAST() without specifying CHARACTER SET, the resulting character set and collation are defined by the character_set_connection and collation_connection system variables. If you use CAST() with CHARACTER SET X, then the resulting character set and collation are X and the default collation of X.

You may not use a COLLATE clause inside a CAST(), but you may use it outside. That is, CAST(... COLLATE ...) is illegal, but CAST(...) COLLATE ... is legal.

Example:

SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8) COLLATE utf8_bin;

In TRADITIONAL SQL mode, if you convert a “zero” date string to a date, CAST() returns NULL. MySQL 5.0.4 and above also produce a warning.

10.4.4. SHOW Statements

Several SHOW statements provide additional character set information. These include SHOW CHARACTER SET, SHOW COLLATION, SHOW CREATE DATABASE, SHOW CREATE TABLE, and SHOW COLUMNS.

The SHOW CHARACTER SET command shows all available character sets. It takes an optional LIKE clause that indicates which character set names to match. For example:

mysql> SHOW CHARACTER SET LIKE 'latin%';
+---------+-----------------------------+-------------------+--------+
| Charset | Description                 | Default collation | Maxlen |
+---------+-----------------------------+-------------------+--------+
| latin1  | cp1252 West European        | latin1_swedish_ci |      1 |
| latin2  | ISO 8859-2 Central European | latin2_general_ci |      1 |
| latin5  | ISO 8859-9 Turkish          | latin5_turkish_ci |      1 |
| latin7  | ISO 8859-13 Baltic          | latin7_general_ci |      1 |
+---------+-----------------------------+-------------------+--------+

See Section 13.5.4.1, “SHOW CHARACTER SET Syntax”.

The output from SHOW COLLATION includes the collations for all available character sets. The statement takes an optional LIKE clause that indicates which collation names to match. For example:

mysql> SHOW COLLATION LIKE 'latin1%';
+-------------------+---------+----+---------+----------+---------+
| Collation         | Charset | Id | Default | Compiled | Sortlen |
+-------------------+---------+----+---------+----------+---------+
| latin1_german1_ci | latin1  |  5 |         |          |       0 |
| latin1_swedish_ci | latin1  |  8 | Yes     | Yes      |       0 |
| latin1_danish_ci  | latin1  | 15 |         |          |       0 |
| latin1_german2_ci | latin1  | 31 |         | Yes      |       2 |
| latin1_bin        | latin1  | 47 |         | Yes      |       0 |
| latin1_general_ci | latin1  | 48 |         |          |       0 |
| latin1_general_cs | latin1  | 49 |         |          |       0 |
| latin1_spanish_ci | latin1  | 94 |         |          |       0 |
+-------------------+---------+----+---------+----------+---------+

See Section 13.5.4.2, “SHOW COLLATION Syntax”.

SHOW CREATE DATABASE displays the CREATE DATABASE statement that creates a given database. The result includes all database options. DEFAULT CHARACTER SET and COLLATE are supported. All database options are stored in a text file named db.opt that can be found in the database directory.

mysql> SHOW CREATE DATABASE test;
+----------+-----------------------------------------------------------------+
| Database | Create Database                                                 |
+----------+-----------------------------------------------------------------+
| test     | CREATE DATABASE `test` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+-----------------------------------------------------------------+

See Section 13.5.4.4, “SHOW CREATE DATABASE Syntax”.

SHOW CREATE TABLE is similar, but displays the CREATE TABLE statement to create a given table. The column definitions indicate any character set specifications, and the table options include character set information.

See Section 13.5.4.5, “SHOW CREATE TABLE Syntax”.

The SHOW COLUMNS statement displays the collations of a table's columns when invoked as SHOW FULL COLUMNS. Columns with CHAR, VARCHAR, or TEXT data types have non-NULL collations. Numeric and other non-character types have NULL collations. For example:

mysql> SHOW FULL COLUMNS FROM person\G
*************************** 1. row ***************************
     Field: id
      Type: smallint(5) unsigned
 Collation: NULL
      Null: NO
       Key: PRI
   Default: NULL
     Extra: auto_increment
Privileges: select,insert,update,references
   Comment:
*************************** 2. row ***************************
     Field: name
      Type: char(60)
 Collation: latin1_swedish_ci
      Null: NO
       Key:
   Default:
     Extra:
Privileges: select,insert,update,references
   Comment:

The character set is not part of the display. (The character set name is implied by the collation name.)
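
A client that needs the character set name can derive it from the collation name. The following Python sketch assumes the naming convention visible in the SHOW COLLATION output shown earlier, in which each collation name begins with the name of its character set. The function name charset_of is hypothetical.

```python
def charset_of(collation: str) -> str:
    """Return the character set name that prefixes a collation name."""
    return collation.split("_", 1)[0]

for name in ("latin1_swedish_ci", "utf8_bin", "ucs2_general_ci", "binary"):
    print(name, "->", charset_of(name))
```

A name without an underscore, such as binary, is returned unchanged.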

See Section 13.5.4.3, “SHOW COLUMNS Syntax”.

10.5. Unicode Support

In MySQL 5.0, there are two character sets for storing Unicode data:

  • ucs2, the UCS-2 Unicode character set.

  • utf8, the UTF8 encoding of the Unicode character set.

In UCS-2 (binary Unicode representation), every character is represented by a two-byte Unicode code with the most significant byte first. For example, "LATIN CAPITAL LETTER A" has the code 0x0041 and is stored as the two-byte sequence 0x00 0x41. "CYRILLIC SMALL LETTER YERU" (Unicode 0x044B) is stored as the two-byte sequence 0x04 0x4B. For Unicode characters and their codes, refer to the Unicode Home Page.
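
The byte layout can be checked with a short Python sketch. It relies on the fact that, for characters in the Basic Multilingual Plane, Python's utf-16-be codec produces the same two bytes as UCS-2 with the most significant byte first. The helper name ucs2_bytes is hypothetical.

```python
def ucs2_bytes(ch: str) -> bytes:
    """Return the two-byte, most-significant-byte-first encoding of one BMP character."""
    if len(ch) != 1 or ord(ch) > 0xFFFF:
        raise ValueError("UCS-2 can represent only single BMP characters")
    return ch.encode("utf-16-be")

print(ucs2_bytes("A").hex())       # LATIN CAPITAL LETTER A: 0041
print(ucs2_bytes("\u044b").hex())  # CYRILLIC SMALL LETTER YERU: 044b
```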

Currently, UCS-2 cannot be used as a client character set, which means that SET NAMES 'ucs2' does not work.

The UTF8 character set (transform Unicode representation) is an alternative way to store Unicode data. It is implemented according to RFC 3629. The idea of the UTF8 character set is that various Unicode characters are encoded using byte sequences of different lengths:

  • Basic Latin letters, digits, and punctuation signs use one byte.

  • Most European and Middle East script letters fit into a two-byte sequence: extended Latin letters (with tilde, macron, acute, grave and other accents), Cyrillic, Greek, Armenian, Hebrew, Arabic, Syriac, and others.

  • Korean, Chinese, and Japanese ideographs use three-byte sequences.

RFC 3629 describes encoding sequences that take from one to four bytes. Currently, MySQL UTF8 support does not include four-byte sequences. (An older standard for UTF8 encoding is given by RFC 2279, which describes UTF8 sequences that take from one to six bytes. RFC 3629 renders RFC 2279 obsolete; for this reason, sequences with five and six bytes are no longer used.)

Tip: To save space with UTF8, use VARCHAR instead of CHAR. Otherwise, MySQL has to reserve 30 bytes for a CHAR(10) CHARACTER SET utf8 column, because this is the maximum possible length.
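
The sequence lengths and the 30-byte figure can be reproduced with a few lines of Python. This is a sketch that uses Python's own utf-8 codec, not MySQL code. The sample characters and the constant name MAX_BYTES_PER_CHAR are chosen for illustration only.

```python
samples = {
    "A": "Basic Latin letter",
    "\u00e9": "Latin small letter e with acute",
    "\u044b": "Cyrillic small letter yeru",
    "\u4e2d": "CJK ideograph",
}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {len(ch.encode('utf-8'))} byte(s)")

# A character outside the Basic Multilingual Plane needs the four-byte form
# that RFC 3629 allows but that MySQL 5.0 utf8 does not support.
print(len("\U00010000".encode("utf-8")))  # 4

# Worst case for CHAR(10) CHARACTER SET utf8: ten characters, each at the
# three-byte maximum.
MAX_BYTES_PER_CHAR = 3
print(10 * MAX_BYTES_PER_CHAR)  # 30
```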

10.6. UTF8 for Metadata

Metadata is “the data about the data”. Anything that describes the database — as opposed to being the contents of the database — is metadata. Thus column names, database names, usernames, version names, and most of the string results from SHOW are metadata. This is also true of the contents of tables in the INFORMATION_SCHEMA database, because those tables by definition contain information about database objects.

Representation of metadata must satisfy these requirements:

  • All metadata must be in the same character set. Otherwise, neither the SHOW commands nor SELECT queries against tables in the INFORMATION_SCHEMA database would work properly because different rows in the same column of the results of these operations would be in different character sets.

  • Metadata must include all characters in all languages. Otherwise, users wouldn't be able to name columns and tables in their own languages.

In order to satisfy both requirements, MySQL stores metadata in a Unicode character set, namely UTF8. This does not cause any disruption if you never use accented characters. But if you do, you should be aware that metadata is in UTF8.

This means that the return values of the USER(), CURRENT_USER(), DATABASE(), and VERSION() functions have the UTF8 character set by default, as do synonyms such as SESSION_USER() and SYSTEM_USER().

The server sets the character_set_system system variable to the name of the metadata character set:

mysql> SHOW VARIABLES LIKE 'character_set_system';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| character_set_system | utf8  |
+----------------------+-------+

Storage of metadata using Unicode does not mean that the headers of columns and the results of DESCRIBE functions are in the character_set_system character set by default. When you use SELECT column1 FROM t, the name column1 itself is returned from the server to the client in the character set as determined by the SET NAMES statement. More specifically, the character set used is determined by the value of the character_set_results system variable. If this variable is set to NULL, no conversion is performed and the server returns metadata using its original character set (the set indicated by character_set_system).

If you want the server to pass metadata results back in a non-UTF8 character set, then use SET NAMES to force the server to perform character set conversion (see Section 10.3.6, “Connection Character Sets and Collations”), or else cause the client to perform the conversion. It is more efficient to have the client perform the conversion, but this option is not always available for all clients.

If you are using (for example) the USER() function for comparison or assignment within a single statement, don't worry. MySQL performs some automatic conversion for you.

SELECT * FROM Table1 WHERE USER() = latin1_column;

This works because the contents of latin1_column are automatically converted to UTF8 before the comparison.

INSERT INTO Table1 (latin1_column) SELECT USER();

This works because the contents of USER() are automatically converted to latin1 before the assignment. Automatic conversion is not fully implemented yet, but should work correctly in a later version.
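
The reason a conversion is needed at all can be seen by comparing the raw bytes. The following Python sketch is an illustration, not a description of the server's internals. The user name is an invented value, and Python's cp1252 codec is assumed to be an acceptable stand-in for latin1.

```python
# The same user name as it might be stored in a latin1 column and as
# USER() would return it in utf8 (illustrative value).
latin1_column = "jos\u00e9@localhost".encode("cp1252")
user_utf8 = "jos\u00e9@localhost".encode("utf-8")

# A raw byte comparison fails because the encodings differ for the accented letter.
print(latin1_column == user_utf8)  # False

# Comparison: convert the latin1 value to utf8 first, then compare.
print(latin1_column.decode("cp1252").encode("utf-8") == user_utf8)  # True

# Assignment: convert the utf8 value to latin1 before storing it.
print(user_utf8.decode("utf-8").encode("cp1252") == latin1_column)  # True
```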

Although automatic conversion is not in the SQL standard, the SQL standard document does say that every character set is (in terms of supported characters) a “subset” of Unicode. Since it is a well-known principle that “what applies to a superset can apply to a subset”, we believe that a collation for Unicode can apply for comparisons with non-Unicode strings.

Note: In MySQL 5.0, the errmsg.txt files all use UTF8. Conversion to the client character set is automatic, as with metadata.

10.7. Compatibility with Other DBMSs

For MaxDB compatibility, these two statements are the same:

CREATE TABLE t1 (f1 CHAR(n) UNICODE);
CREATE TABLE t1 (f1 CHAR(n) CHARACTER SET ucs2);

10.8. New Character Set Configuration File Format

In MySQL 5.0, character set configuration is stored in XML files, one file per character set.

10.9. National Character Set

ANSI SQL defines NCHAR or NATIONAL CHAR as a way to indicate that a CHAR column should use some predefined character set. MySQL 5.0 uses utf8 as this predefined character set. For example, these column type declarations are equivalent:

CHAR(10) CHARACTER SET utf8
NATIONAL CHARACTER(10)
NCHAR(10)

As are these:

VARCHAR(10) CHARACTER SET utf8
NATIONAL VARCHAR(10)
NCHAR VARCHAR(10)
NATIONAL CHARACTER VARYING(10)
NATIONAL CHAR VARYING(10)

You can use N'literal' to create a string in the national character set. These two statements are equivalent:

SELECT N'some text';
SELECT _utf8'some text';

10.10. Character Sets and Collations That MySQL Supports

MySQL supports 70+ collations for 30+ character sets. The character sets and their default collations are displayed by the SHOW CHARACTER SET statement:

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+
| Charset  | Description                 | Default collation   |
+----------+-----------------------------+---------------------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |
| dec8     | DEC West European           | dec8_swedish_ci     |
| cp850    | DOS West European           | cp850_general_ci    |
| hp8      | HP West European            | hp8_english_ci      |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |
| latin1   | cp1252 West European        | latin1_swedish_ci   |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |
| ascii    | US ASCII                    | ascii_general_ci    |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |
| cp1250   | Windows Central European    | cp1250_general_ci   |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |
| armscii8 | ARMSCII-8 Armenian          | armscii8_general_ci |
| utf8     | UTF-8 Unicode               | utf8_general_ci     |
| ucs2     | UCS-2 Unicode               | ucs2_general_ci     |
| cp866    | DOS Russian                 | cp866_general_ci    |
| keybcs2  | DOS Kamenicky Czech-Slovak  | keybcs2_general_ci  |
| macce    | Mac Central European        | macce_general_ci    |
| macroman | Mac West European           | macroman_general_ci |
| cp852    | DOS Central European        | cp852_general_ci    |
| latin7   | ISO 8859-13 Baltic          | latin7_general_ci   |
| cp1251   | Windows Cyrillic            | cp1251_general_ci   |
| cp1256   | Windows Arabic              | cp1256_general_ci   |
| cp1257   | Windows Baltic              | cp1257_general_ci   |
| binary   | Binary pseudo charset       | binary              |
| geostd8  | GEOSTD8 Georgian            | geostd8_general_ci  |
| cp932    | SJIS for Windows Japanese   | cp932_japanese_ci   |
| eucjpms  | UJIS for Windows Japanese   | eucjpms_japanese_ci |
+----------+-----------------------------+---------------------+

10.10.1. Unicode Character Sets

MySQL has two Unicode character sets. You can store text in about 650 languages using these character sets.

  • ucs2 (UCS-2 Unicode) collations:

    mysql> SHOW COLLATION LIKE 'ucs2%';
    +--------------------+---------+-----+---------+----------+---------+
    | Collation          | Charset | Id  | Default | Compiled | Sortlen |
    +--------------------+---------+-----+---------+----------+---------+
    | ucs2_general_ci    | ucs2    |  35 | Yes     | Yes      |       1 |
    | ucs2_bin           | ucs2    |  90 |         | Yes      |       1 |
    | ucs2_unicode_ci    | ucs2    | 128 |         | Yes      |       8 |
    | ucs2_icelandic_ci  | ucs2    | 129 |         | Yes      |       8 |
    | ucs2_latvian_ci    | ucs2    | 130 |         | Yes      |       8 |
    | ucs2_romanian_ci   | ucs2    | 131 |         | Yes      |       8 |
    | ucs2_slovenian_ci  | ucs2    | 132 |         | Yes      |       8 |
    | ucs2_polish_ci     | ucs2    | 133 |         | Yes      |       8 |
    | ucs2_estonian_ci   | ucs2    | 134 |         | Yes      |       8 |
    | ucs2_spanish_ci    | ucs2    | 135 |         | Yes      |       8 |
    | ucs2_swedish_ci    | ucs2    | 136 |         | Yes      |       8 |
    | ucs2_turkish_ci    | ucs2    | 137 |         | Yes      |       8 |
    | ucs2_czech_ci      | ucs2    | 138 |         | Yes      |       8 |
    | ucs2_danish_ci     | ucs2    | 139 |         | Yes      |       8 |
    | ucs2_lithuanian_ci | ucs2    | 140 |         | Yes      |       8 |
    | ucs2_slovak_ci     | ucs2    | 141 |         | Yes      |       8 |
    | ucs2_spanish2_ci   | ucs2    | 142 |         | Yes      |       8 |
    | ucs2_roman_ci      | ucs2    | 143 |         | Yes      |       8 |
    | ucs2_persian_ci    | ucs2    | 144 |         | Yes      |       8 |
    +--------------------+---------+-----+---------+----------+---------+
    
  • utf8 (UTF-8 Unicode) collations:

    mysql> SHOW COLLATION LIKE 'utf8%';
    +--------------------+---------+-----+---------+----------+---------+
    | Collation          | Charset | Id  | Default | Compiled | Sortlen |
    +--------------------+---------+-----+---------+----------+---------+
    | utf8_general_ci    | utf8    |  33 | Yes     | Yes      |       1 |
    | utf8_bin           | utf8    |  83 |         | Yes      |       1 |
    | utf8_unicode_ci    | utf8    | 192 |         | Yes      |       8 |
    | utf8_icelandic_ci  | utf8    | 193 |         | Yes      |       8 |
    | utf8_latvian_ci    | utf8    | 194 |         | Yes      |       8 |
    | utf8_romanian_ci   | utf8    | 195 |         | Yes      |       8 |
    | utf8_slovenian_ci  | utf8    | 196 |         | Yes      |       8 |
    | utf8_polish_ci     | utf8    | 197 |         | Yes      |       8 |
    | utf8_estonian_ci   | utf8    | 198 |         | Yes      |       8 |
    | utf8_spanish_ci    | utf8    | 199 |         | Yes      |       8 |
    | utf8_swedish_ci    | utf8    | 200 |         | Yes      |       8 |
    | utf8_turkish_ci    | utf8    | 201 |         | Yes      |       8 |
    | utf8_czech_ci      | utf8    | 202 |         | Yes      |       8 |
    | utf8_danish_ci     | utf8    | 203 |         | Yes      |       8 |
    | utf8_lithuanian_ci | utf8    | 204 |         | Yes      |       8 |
    | utf8_slovak_ci     | utf8    | 205 |         | Yes      |       8 |
    | utf8_spanish2_ci   | utf8    | 206 |         | Yes      |       8 |
    | utf8_roman_ci      | utf8    | 207 |         | Yes      |       8 |
    | utf8_persian_ci    | utf8    | 208 |         | Yes      |       8 |
    +--------------------+---------+-----+---------+----------+---------+
    

Note: Support for the ucs2_esperanto_ci and utf8_esperanto_ci collations was added in MySQL 5.0.13.

The utf8_unicode_ci collation is implemented according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. (The following discussion uses utf8_unicode_ci, but it is also true for ucs2_unicode_ci.)

Currently, the utf8_unicode_ci collation has only partial support for the Unicode Collation Algorithm. Some characters are not supported yet. Also, combining marks are not fully supported. This affects primarily Vietnamese and some minority languages in Russia such as Udmurt, Tatar, Bashkir, and Mari.

The most significant feature in utf8_unicode_ci is that it supports expansions, that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages ‘ß’ is equal to ‘ss’.

utf8_general_ci is a legacy collation that does not support expansions. It can make only one-to-one comparisons between characters. This means that comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci.

For example, the following equalities hold in both utf8_general_ci and utf8_unicode_ci:

Ä = A
Ö = O
Ü = U

A difference between the collations is that this is true for utf8_general_ci:

ß = s

Whereas this is true for utf8_unicode_ci:

ß = ss

Language-specific collations for the utf8 character set are implemented only if the ordering with utf8_unicode_ci does not work well for a language. For example, utf8_unicode_ci works fine for German and French, so there is no need to create special utf8 collations for these two languages.

utf8_general_ci is also satisfactory for both German and French, except that ‘ß’ is equal to ‘s’, and not to ‘ss’. If this is acceptable for your application, then you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate.
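
The difference between one-to-one comparison and expansion can be modeled with a toy example in Python. The two folding tables below are invented for illustration and are not the weights MySQL uses. The names ONE_TO_ONE, WITH_EXPANSION, comparison_key, and equal_under are hypothetical.

```python
# Toy folding tables, NOT MySQL's real collation weights.
ONE_TO_ONE = {"ä": "a", "ö": "o", "ü": "u", "ß": "s"}       # one character maps to one
WITH_EXPANSION = {"ä": "a", "ö": "o", "ü": "u", "ß": "ss"}  # 'ß' expands to two

def comparison_key(text: str, table: dict) -> str:
    """Lowercase the text, then replace each character using the table."""
    return "".join(table.get(ch, ch) for ch in text.lower())

def equal_under(table: dict, left: str, right: str) -> bool:
    """Report whether two strings compare as equal under a folding table."""
    return comparison_key(left, table) == comparison_key(right, table)

print(equal_under(ONE_TO_ONE, "Ä", "A"))        # True
print(equal_under(WITH_EXPANSION, "Ä", "A"))    # True
print(equal_under(ONE_TO_ONE, "ß", "s"))        # True
print(equal_under(ONE_TO_ONE, "ß", "ss"))       # False
print(equal_under(WITH_EXPANSION, "ß", "ss"))   # True
print(equal_under(WITH_EXPANSION, "ß", "s"))    # False
```

The one-to-one table can be applied character by character, which is the kind of simplification that makes such a comparison cheaper.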

utf8_swedish_ci, like other utf8 language-specific collations, is derived from utf8_unicode_ci with additional language rules. For example, in Swedish, the following relationship holds, which is not something expected by a German or French speaker:

Ü = Y < Ö

The utf8_spanish_ci and utf8_spanish2_ci collations correspond to modern Spanish and traditional Spanish, respectively. In both collations, ‘ñ’ (n-tilde) is a separate letter between ‘n’ and ‘o’. In addition, for traditional Spanish, ‘ch’ is a separate letter between ‘c’ and ‘d’, and ‘ll’ is a separate letter between ‘l’ and ‘m’.

10.10.2. West European Character Sets

Western European character sets cover most West European languages, such as French, Spanish, Catalan, Basque, Portuguese, Italian, Albanian, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English.

  • ascii (US ASCII) collations:

    • ascii_bin

    • ascii_general_ci (default)

  • cp850 (DOS West European) collations:

    • cp850_bin

    • cp850_general_ci (default)

  • dec8 (DEC Western European) collations:

    • dec8_bin

    • dec8_swedish_ci (default)

  • hp8 (HP Western European) collations:

    • hp8_bin

    • hp8_english_ci (default)

  • latin1 (cp1252 West European) collations:

    • latin1_bin

    • latin1_danish_ci

    • latin1_general_ci

    • latin1_general_cs

    • latin1_german1_ci

    • latin1_german2_ci

    • latin1_spanish_ci

    • latin1_swedish_ci (default)

    latin1 is the default character set. The latin1_swedish_ci collation is the default that probably is used by the majority of MySQL customers. While it is frequently said that it is based on the Swedish/Finnish collation rules, there are Swedes and Finns who disagree with this statement.

    The latin1_german1_ci and latin1_german2_ci collations are based on the DIN-1 and DIN-2 standards, where DIN stands for Deutsches Institut für Normung (the German equivalent of ANSI). DIN-1 is called the “dictionary collation” and DIN-2 is called the “phone book collation”.

    • latin1_german1_ci (dictionary) rules:

      Ä = A
      Ö = O
      Ü = U
      ß = s
      
    • latin1_german2_ci (phone-book) rules:

      Ä = AE
      Ö = OE
      Ü = UE
      ß = ss
      

    In the latin1_spanish_ci collation, ‘ñ’ (n-tilde) is a separate letter between ‘n’ and ‘o’.

  • macroman (Mac West European) collations:

    • macroman_bin

    • macroman_general_ci (default)

  • swe7 (7bit Swedish) collations:

    • swe7_bin

    • swe7_swedish_ci (default)

10.10.3. Central European Character Sets

We also provide some support for character sets used in the Czech Republic, Slovakia, Hungary, Romania, Slovenia, Croatia, and Poland.

  • cp1250 (Windows Central European) collations:

    • cp1250_bin

    • cp1250_croatian_ci

    • cp1250_czech_cs

    • cp1250_general_ci (default)

  • cp852 (DOS Central European) collations:

    • cp852_bin

    • cp852_general_ci (default)

  • keybcs2 (DOS Kamenicky Czech-Slovak) collations:

    • keybcs2_bin

    • keybcs2_general_ci (default)

  • latin2 (ISO 8859-2 Central European) collations:

    • latin2_bin

    • latin2_croatian_ci

    • latin2_czech_cs

    • latin2_general_ci (default)

    • latin2_hungarian_ci

  • macce (Mac Central European) collations:

    • macce_bin

    • macce_general_ci (default)

10.10.4. South European and Middle East Character Sets

South European and Middle Eastern character sets supported by MySQL include Armenian, Arabic, Georgian, Greek, Hebrew, and Turkish:

  • armscii8 (ARMSCII-8 Armenian) collations:

    • armscii8_bin

    • armscii8_general_ci (default)

  • cp1256 (Windows Arabic) collations:

    • cp1256_bin

    • cp1256_general_ci (default)

  • geostd8 (GEOSTD8 Georgian) collations:

    • geostd8_bin

    • geostd8_general_ci (default)

  • greek (ISO 8859-7 Greek) collations:

    • greek_bin

    • greek_general_ci (default)

  • hebrew (ISO 8859-8 Hebrew) collations:

    • hebrew_bin

    • hebrew_general_ci (default)

  • latin5 (ISO 8859-9 Turkish) collations:

    • latin5_bin

    • latin5_turkish_ci (default)

10.10.5. Baltic Character Sets

The Baltic character sets cover the Estonian, Latvian, and Lithuanian languages. Two Baltic character sets are currently supported:

  • cp1257 (Windows Baltic) collations:

    • cp1257_bin

    • cp1257_general_ci (default)

    • cp1257_lithuanian_ci

  • latin7 (ISO 8859-13 Baltic) collations:

    • latin7_bin

    • latin7_estonian_cs

    • latin7_general_ci (default)

    • latin7_general_cs

10.10.6. Cyrillic Character Sets

Here are the Cyrillic character sets and collations for use with the Belarusian, Bulgarian, Russian, and Ukrainian languages.

  • cp1251 (Windows Cyrillic) collations:

    • cp1251_bin

    • cp1251_bulgarian_ci

    • cp1251_general_ci (default)

    • cp1251_general_cs

    • cp1251_ukrainian_ci

  • cp866 (DOS Russian) collations:

    • cp866_bin

    • cp866_general_ci (default)

  • koi8r (KOI8-R Relcom Russian) collations:

    • koi8r_bin

    • koi8r_general_ci (default)

  • koi8u (KOI8-U Ukrainian) collations:

    • koi8u_bin

    • koi8u_general_ci (default)

10.10.7. Asian Character Sets

The Asian character sets that we support include Chinese, Japanese, Korean, and Thai. These can be complicated. For example, the Chinese sets must allow for thousands of different characters.

  • big5 (Big5 Traditional Chinese) collations:

    • big5_bin

    • big5_chinese_ci (default)

  • cp932 (SJIS for Windows Japanese) collations:

    • cp932_bin

    • cp932_japanese_ci (default)

  • eucjpms (UJIS for Windows Japanese) collations:

    • eucjpms_bin

    • eucjpms_japanese_ci (default)

  • euckr (EUC-KR Korean) collations:

    • euckr_bin

    • euckr_korean_ci (default)

  • gb2312 (GB2312 Simplified Chinese) collations:

    • gb2312_bin

    • gb2312_chinese_ci (default)

  • gbk (GBK Simplified Chinese) collations:

    • gbk_bin

    • gbk_chinese_ci (default)

  • sjis (Shift-JIS Japanese) collations:

    • sjis_bin

    • sjis_japanese_ci (default)

  • tis620 (TIS620 Thai) collations:

    • tis620_bin

    • tis620_thai_ci (default)

  • ujis (EUC-JP Japanese) collations:

    • ujis_bin

    • ujis_japanese_ci (default)

10.10.7.1. The cp932 Character Set

Why is cp932 needed?

In MySQL, the sjis character set corresponds to the Shift_JIS character set defined by IANA, which supports JIS X0201 and JIS X0208 characters. (See http://www.iana.org/assignments/character-sets.)

However, the meaning of “SHIFT JIS” as a descriptive term has become very vague and it often includes the extensions to Shift_JIS that are defined by various vendors.

For example, “SHIFT JIS” used in Japanese Windows environments is a Microsoft extension of Shift_JIS and its exact name is Microsoft Windows Codepage : 932 or cp932. In addition to the characters supported by Shift_JIS, cp932 supports extension characters such as NEC special characters, NEC selected — IBM extended characters, and IBM extended characters.

Many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:

  • MySQL automatically converts character sets.

  • Character sets are converted via Unicode (ucs2).

  • The sjis character set does not support the conversion of these extension characters.

  • There are several conversion rules from so-called “SHIFT JIS” to Unicode, and some characters are converted to Unicode differently depending on the conversion rule. MySQL supports only one of these rules (described later).

The MySQL cp932 character set is designed to solve these problems. It is available as of MySQL 5.0.3.

Because MySQL 5.0 supports character set conversion, IANA Shift_JIS and cp932 must be kept as two separate character sets: they provide different conversion rules.

How does cp932 differ from sjis?

The cp932 character set differs from sjis in the following ways:

  • cp932 supports NEC special characters, NEC selected — IBM extended characters, and IBM selected characters.

  • Some cp932 characters have two different code points, both of which convert to the same Unicode code point. So, when converting from Unicode back to cp932, one of the code points must be selected. For this “round trip conversion,” the rule recommended by Microsoft is used. (See http://support.microsoft.com/kb/170559/EN-US/.)

    The conversion rule works like this:

    • If the character is in both JIS X 0208 and NEC special characters, use the code point of JIS X 0208.

    • If the character is in both NEC special characters and IBM selected characters, use the code point of NEC special characters.

    • If the character is in both IBM selected characters and NEC selected — IBM extended characters, use the code point of IBM extended characters.

    Information about the Unicode values of cp932 characters is given in the table shown at http://www.microsoft.com/globaldev/reference/dbcs/932.htm. For cp932 table entries with characters under which a four-digit number appears, the number represents the corresponding Unicode (ucs2) encoding. For table entries in which an underlined two-digit value appears, there is a range of cp932 character values that begin with those two digits. Clicking such a table entry takes you to a page that displays the Unicode value for each of the cp932 characters that begin with those digits.

    The following links are of special interest. They correspond to the encodings for the following sets of characters:

  • Starting from version 5.0.3, cp932 supports conversion of user-defined characters in combination with eucjpms, and solves the problems with sjis/ujis conversion. For details, please refer to http://www.opengroup.or.jp/jvc/cde/sjis-euc-e.html.

  • For some characters, conversion to and from ucs2 is different for sjis and cp932. The following tables illustrate these differences.

    Conversion to ucs2:

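    +------------------+-------------------------+--------------------------+
    | sjis/cp932 Value | sjis -> ucs2 Conversion | cp932 -> ucs2 Conversion |
    +------------------+-------------------------+--------------------------+
    | 5C               | 005C                    | 005C                     |
    | 7E               | 007E                    | 007E                     |
    | 815C             | 2015                    | 2015                     |
    | 815F             | 005C                    | FF3C                     |
    | 8160             | 301C                    | FF5E                     |
    | 8161             | 2016                    | 2225                     |
    | 817C             | 2212                    | FF0D                     |
    | 8191             | 00A2                    | FFE0                     |
    | 8192             | 00A3                    | FFE1                     |
    | 81CA             | 00AC                    | FFE2                     |
    +------------------+-------------------------+--------------------------+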

    Conversion from ucs2:

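    +------------+-------------------------+--------------------------+
    | ucs2 value | ucs2 -> sjis Conversion | ucs2 -> cp932 Conversion |
    +------------+-------------------------+--------------------------+
    | 005C       | 815F                    | 5C                       |
    | 007E       | 7E                      | 7E                       |
    | 00A2       | 8191                    | 3F                       |
    | 00A3       | 8192                    | 3F                       |
    | 00AC       | 81CA                    | 3F                       |
    | 2015       | 815C                    | 815C                     |
    | 2016       | 8161                    | 3F                       |
    | 2212       | 817C                    | 3F                       |
    | 2225       | 3F                      | 8161                     |
    | 301C       | 8160                    | 3F                       |
    | FF0D       | 3F                      | 817C                     |
    | FF3C       | 3F                      | 815F                     |
    | FF5E       | 3F                      | 8160                     |
    | FFE0       | 3F                      | 8191                     |
    | FFE1       | 3F                      | 8192                     |
    | FFE2       | 3F                      | 81CA                     |
    +------------+-------------------------+--------------------------+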