Table of Contents
COLLATE
in SQL StatementsCOLLATE
Clause PrecedenceBINARY
OperatorThis chapter discusses the following topics:
What are character sets and collations?
The multiple-level default system
Character set syntax
Affected functions and operations
Unicode support
The meaning of each individual character set and collation
Character set support in MySQL 5.0 is included in the
MyISAM
, MEMORY
, and
InnoDB
storage engines.
A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.
Suppose that we have an alphabet with four letters:
‘A
’,
‘B
’,
‘a
’,
‘b
’. We give each letter a number:
‘A
’ = 0,
‘B
’ = 1,
‘a
’ = 2,
‘b
’ = 3. The letter
‘A
’ is a symbol, the number 0 is
the encoding for
‘A
’, and the combination of all
four letters and their encodings is a
character set.
Suppose that we want to compare two string values,
‘A
’ and
‘B
’. The simplest way to do this is
to look at the encodings: 0 for ‘A
’
and 1 for ‘B
’. Because 0 is less
than 1, we say ‘A
’ is less than
‘B
’. What we've just done is apply
a collation to our character set. The collation is a set of rules
(only one rule in this case): “compare the
encodings.” We call this simplest of all possible
collations a binary collation.
But what if we want to say that the lowercase and uppercase
letters are equivalent? Then we would have at least two rules: (1)
treat the lowercase letters ‘a
’ and
‘b
’ as equivalent to
‘A
’ and
‘B
’; (2) then compare the
encodings. We call this a case-insensitive
collation. It's a little more complex than a binary collation.
In real life, most character sets have many characters: not just
‘A
’ and
‘B
’ but whole alphabets, sometimes
multiple alphabets or eastern writing systems with thousands of
characters, along with many special symbols and punctuation marks.
Also in real life, most collations have many rules: not just case
insensitivity but also accent insensitivity (an
“accent” is a mark attached to a character as in
German ‘Ö
’) and multiple-character
mappings (such as the rule that
‘Ö
’ =
‘OE
’ in one of the two German
collations).
MySQL 5.0 can do these things for you:
Store strings using a variety of character sets
Compare strings using a variety of collations
Mix strings with different character sets or collations in the same server, the same database, or even the same table
Allow specification of character set and collation at any level
In these respects, not only is MySQL 5.0 far more flexible than earlier versions of MySQL, it also is far ahead of other DBMSs. However, to use these features effectively, you need to know what character sets and collations are available, how to change the defaults, and how they affect the behavior of string operators and functions.
The MySQL server can support multiple character sets. To list the
available character sets, use the SHOW CHARACTER
SET
statement:
mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset | Description | Default collation | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |
| dec8 | DEC West European | dec8_swedish_ci | 1 |
| cp850 | DOS West European | cp850_general_ci | 1 |
| hp8 | HP West European | hp8_english_ci | 1 |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 |
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
| swe7 | 7bit Swedish | swe7_swedish_ci | 1 |
| ascii | US ASCII | ascii_general_ci | 1 |
| ujis | EUC-JP Japanese | ujis_japanese_ci | 3 |
| sjis | Shift-JIS Japanese | sjis_japanese_ci | 2 |
| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci | 1 |
| tis620 | TIS620 Thai | tis620_thai_ci | 1 |
| euckr | EUC-KR Korean | euckr_korean_ci | 2 |
| koi8u | KOI8-U Ukrainian | koi8u_general_ci | 1 |
| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | 2 |
| greek | ISO 8859-7 Greek | greek_general_ci | 1 |
| cp1250 | Windows Central European | cp1250_general_ci | 1 |
| gbk | GBK Simplified Chinese | gbk_chinese_ci | 2 |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 |
...
(For a complete listing, see Section 10.10, “Character Sets and Collations That MySQL Supports”.)
Any given character set always has at least one collation. It may have several collations.
To list the collations for a character set, use the SHOW
COLLATION
statement. For example, to see the collations
for the latin1
(“ISO-8859-1 West
European”) character set, use this statement to find those
collation names that begin with latin1
:
mysql> SHOW COLLATION LIKE 'latin1%';
+---------------------+---------+----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+---------------------+---------+----+---------+----------+---------+
| latin1_german1_ci | latin1 | 5 | | | 0 |
| latin1_swedish_ci | latin1 | 8 | Yes | Yes | 1 |
| latin1_danish_ci | latin1 | 15 | | | 0 |
| latin1_german2_ci | latin1 | 31 | | Yes | 2 |
| latin1_bin | latin1 | 47 | | Yes | 1 |
| latin1_general_ci | latin1 | 48 | | | 0 |
| latin1_general_cs | latin1 | 49 | | | 0 |
| latin1_spanish_ci | latin1 | 94 | | | 0 |
+---------------------+---------+----+---------+----------+---------+
The latin1
collations have the following
meanings:
Collation | Meaning |
latin1_german1_ci | German DIN-1 |
latin1_swedish_ci | Swedish/Finnish |
latin1_danish_ci | Danish/Norwegian |
latin1_german2_ci | German DIN-2 |
latin1_bin | Binary according to latin1 encoding |
latin1_general_ci | Multilingual (Western European) |
latin1_general_cs | Multilingual (ISO Western European), case sensitive |
latin1_spanish_ci | Modern Spanish |
Collations have these general characteristics:
Two different character sets cannot have the same collation.
Each character set has one collation that is the
default collation. For example, the
default collation for latin1
is
latin1_swedish_ci
.
There is a convention for collation names: They start with the
name of the character set with which they are associated, they
usually include a language name, and they end with
_ci
(case insensitive),
_cs
(case sensitive), or
_bin
(binary).
COLLATE
in SQL StatementsCOLLATE
Clause PrecedenceBINARY
OperatorThere are default settings for character sets and collations at four levels: server, database, table, and connection. The following description may appear complex, but it has been found in practice that multiple-level defaulting leads to natural and obvious results.
The MySQL Server has a server character set and a server collation, neither of which which may be null.
MySQL determines the server character set and server collation as follows:
According to the option settings in effect when the server starts
According to the values set at runtime
At the server level, the decision is simple. The server
character set and collation depend initially on the options that
you use when you start mysqld. You can use
--default-character-set
for the character set,
and along with it you can add
--default-collation
for the collation. If you
don't specify a character set, that is the same as saying
--default-character-set=latin1
. If you specify
only a character set (for example, latin1
)
but not a collation, that is the same as saying
--default-charset=latin1
--default-collation=latin1_swedish_ci
because
latin1_swedish_ci
is the default collation
for latin1
. Therefore, the following three
commands all have the same effect:
shell>mysqld
shell>mysqld --default-character-set=latin1
shell>mysqld --default-character-set=latin1 \
--default-collation=latin1_swedish_ci
One way to change the settings is by recompiling. If you want to
change the default server character set and collation when
building from sources, use: --with-charset
and
--with-collation
as arguments for
configure. For example:
shell> ./configure --with-charset=latin1
Or:
shell>./configure --with-charset=latin1 \
--with-collation=latin1_german1_ci
Both mysqld and configure verify that the character set/collation combination is valid. If not, each program displays an error message and terminates.
The current server character set and collation are available as
the values of the character_set_server
and
collation_server
system variables. These
variables can be changed at runtime.
Every database has a database character set and a database
collation, which may not be null. The CREATE
DATABASE
and ALTER DATABASE
statements have optional clauses for specifying the database
character set and collation:
CREATE DATABASEdb_name
[[DEFAULT] CHARACTER SETcharset_name
] [[DEFAULT] COLLATEcollation_name
] ALTER DATABASEdb_name
[[DEFAULT] CHARACTER SETcharset_name
] [[DEFAULT] COLLATEcollation_name
]
Example:
CREATE DATABASE db_name
DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci;
MySQL chooses the database character set and database collation thus:
If both CHARACTER SET
and X
COLLATE
were specified, then
character set Y
X
and collation
Y
.
If CHARACTER SET
was specified without
X
COLLATE
, then character set
X
and its default collation.
Otherwise, the server character set and server collation.
MySQL's CREATE DATABASE ... DEFAULT CHARACTER SET
...
syntax is analogous to the standard SQL
CREATE SCHEMA ... CHARACTER SET ...
syntax.
Because of this, it is possible to create databases with
different character sets and collations on the same MySQL
server.
The database character set and collation are used as default
values if the table character set and collation are not
specified in CREATE TABLE
statements. They
have no other purpose.
The character set and collation for the default database are
available as the values of the
character_set_database
and
collation_database
system variables. The
server sets these variables whenever the default database
changes. If there is no default database, the variables have the
same value as the corresponding server-level variables,
character_set_server
and
collation_server
.
Every table has a table character set and a table collation,
which may not be null. The CREATE TABLE
and
ALTER TABLE
statements have optional clauses
for specifying the table character set and collation:
CREATE TABLEtbl_name
(column_list
) [DEFAULT CHARACTER SETcharset_name
[COLLATEcollation_name
]] ALTER TABLEtbl_name
[DEFAULT CHARACTER SETcharset_name
] [COLLATEcollation_name
]
Example:
CREATE TABLE t1 ( ... ) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
MySQL chooses the table character set and collation in the following manner:
If both CHARACTER SET
and X
COLLATE
were specified, then
character set Y
X
and collation
Y
.
If CHARACTER SET
was specified without
X
COLLATE
, then character set
X
and its default collation.
Otherwise, the database character set and collation.
The table character set and collation are used as default values if the column character set and collation are not specified in individual column definitions. The table character set and collation are MySQL extensions; there are no such things in standard SQL.
Every “character” column (that is, a column of type
CHAR
, VARCHAR
, or
TEXT
) has a column character set and a column
collation, which may not be null. Column definition syntax has
optional clauses for specifying the column character set and
collation:
col_name
{CHAR | VARCHAR | TEXT} (col_length
) [CHARACTER SETcharset_name
[COLLATEcollation_name
]]
Example:
CREATE TABLE Table1 ( column1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_german1_ci );
MySQL chooses the column character set and collation in the following manner:
If both CHARACTER SET
and X
COLLATE
were specified, then
character set Y
X
and collation
Y
are used.
If CHARACTER SET
was specified without
X
COLLATE
, then character set
X
and its default collation are
used.
Otherwise, the table character set and collation are used.
The CHARACTER SET
and
COLLATE
clauses are standard SQL.
The following examples show how MySQL determines default character set and collation values.
Example 1: Table and Column Definition
CREATE TABLE t1 ( c1 CHAR(10) CHARACTER SET latin1 COLLATE latin1_german1_ci ) DEFAULT CHARACTER SET latin2 COLLATE latin2_bin;
Here we have a column with a latin1
character
set and a latin1_german1_ci
collation. The
definition is explicit, so that's straightforward. Notice that
there is no problem with storing a latin1
column in a latin2
table.
Example 2: Table and Column Definition
CREATE TABLE t1 ( c1 CHAR(10) CHARACTER SET latin1 ) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
This time we have a column with a latin1
character set and a default collation. Although it might seem
natural, the default collation is not taken from the table
level. Instead, because the default collation for
latin1
is always
latin1_swedish_ci
, column
c1
has a collation of
latin1_swedish_ci
(not
latin1_danish_ci
).
Example 3: Table and Column Definition
CREATE TABLE t1 ( c1 CHAR(10) ) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
We have a column with a default character set and a default
collation. In this circumstance, MySQL looks up to the table
level for inspiration in determining the column character set
and collation. So, the character set for column
c1
is latin1
and its
collation is latin1_danish_ci
.
Example 4: Database, Table, and Column Definition
CREATE DATABASE d1 DEFAULT CHARACTER SET latin2 COLLATE latin2_czech_ci; USE d1; CREATE TABLE t1 ( c1 CHAR(10) );
We create a column without specifying its character set and
collation. We're also not specifying a character set and a
collation at the table level. In this circumstance, MySQL looks
up to the database level for inspiration. (The database's
settings become the table's settings, and thereafter become the
column's setting.) So, the character set for column
c1
is latin2
and its
collation is latin2_czech_ci
.
Several character set and collation system variables relate to a client's interaction with the server. Some of these have been mentioned in earlier sections:
The server character set and collation are available as the
values of the character_set_server
and
collation_server
variables.
The character set and collation of the default database are
available as the values of the
character_set_database
and
collation_database
variables.
Additional character set and collation variables are involved in handling traffic for the connection between a client and the server. Every client has connection-related character set and collation variables.
Consider what a “connection” is: It's what you make when you connect to the server. The client sends SQL statements, such as queries, over the connection to the server. The server sends responses, such as result sets, over the connection back to the client. This leads to several questions about character set and collation handling for client connections, each of which can be answered in terms of system variables:
What character set is the query in when it leaves the client?
The server takes the character_set_client
variable to be the character set in which queries are sent
by the client.
What character set should the server translate a query to after receiving it?
For this, character_set_connection
and
collation_connection
are used by the
server. It converts queries sent by the client from
character_set_client
to
character_set_connection
(except for
string literals that have an introducer such as
_latin1
or _utf8
).
collation_connection
is important for
comparisons of literal strings. For comparisons of strings
with column values, it does not matter because columns have
a higher collation precedence.
What character set should the server translate to before shipping result sets or error messages back to the client?
The character_set_results
variable
indicates the character set in which the server returns
query results to the client. This includes result data such
as column values, and result metadata such as column names.
You can fine-tune the settings for these variables, or you can depend on the defaults (in which case, you can skip this section).
There are two statements that affect the connection character sets:
SET NAMES 'charset_name
' SET CHARACTER SETcharset_name
SET NAMES
indicates what is in the SQL
statements that the client sends. Thus, SET NAMES
'cp1251'
tells the server “future incoming
messages from this client are in character set
cp1251
.” It also specifies the
character set for results that the server sends back to the
client. (For example, it indicates what character set column
values are if you use a SELECT
statement.)
A SET NAMES '
statement is equivalent to these three statements:
x
'
mysql>SET character_set_client =
mysql>x
;SET character_set_results =
mysql>x
;SET character_set_connection =
x
;
Setting character_set_connection
to
x
also sets
collation_connection
to the default collation
for x
.
SET CHARACTER SET
is similar but sets the
connection character set and collation to be those of the
default database. A SET CHARACTER SET x
statement is equivalent to these three statements:
mysql>SET character_set_client =
mysql>x
;SET character_set_results =
mysql>x
;SET collation_connection = @@collation_database;
When a client connects, it sends to the server the name of the
character set that it wants to use. The server sets the
character_set_client
,
character_set_results
, and
character_set_connection
variables to that
character set. (In effect, the server performs a SET
NAMES
operation using the character set.)
With the mysql client, it is not necessary to
execute SET NAMES
every time you start up if
you want to use a character set different from the default. You
can add the --default-character-set
option
setting to your mysql statement line, or in
your option file. For example, the following option file setting
changes the three character set variables set to
koi8r
each time you run
mysql:
[mysql] default-character-set=koi8r
Example: Suppose that column1
is defined as
CHAR(5) CHARACTER SET latin2
. If you do not
say SET NAMES
or SET CHARACTER
SET
, then for SELECT column1 FROM
t
, the server sends back all the values for
column1
using the character set that the
client specified when it connected. On the other hand, if you
say SET NAMES 'latin1'
or SET
CHARACTER SET latin1
, then just before sending results
back, the server converts the latin2
values
to latin1
. Conversion may be lossy if there
are characters that are not in both character sets.
If you do not want the server to perform any conversion, set
character_set_results
to
NULL
:
mysql> SET character_set_results = NULL;
Every character string literal has a character set and a collation, which may not be null.
A character string literal may have an optional character set
introducer and COLLATE
clause:
[_charset_name
]'string
' [COLLATEcollation_name
]
Examples:
SELECT 'string
'; SELECT _latin1'string
'; SELECT _latin1'string
' COLLATE latin1_danish_ci;
For the simple statement SELECT
'
, the string has
the character set and collation defined by the
string
'character_set_connection
and
collation_connection
system variables.
The _
expression is formally called an
introducer. It tells the parser, “the
string that is about to follow uses character set
charset_name
X
.” Because this has confused
people in the past, we emphasize that an introducer does not
cause any conversion; it is strictly a signal that does not
change the string's value. An introducer is also legal before
standard hex literal and numeric hex literal notation
(x'
and
literal
'0x
), and
before nnnn
?
(parameter substitution when using
prepared statements within a programming language interface).
Examples:
SELECT _latin1 x'AABBCC'; SELECT _latin1 0xAABBCC; SELECT _latin1 ?;
MySQL determines a literal's character set and collation thus:
If both _X
and COLLATE
were specified, then
character set Y
X
and collation
Y
are used.
If _X
is specified but
COLLATE
is not specified, then character
set X
and its default collation
are used.
Otherwise, the character set and collation given by the
character_set_connection
and
collation_connection
system variables are
used.
Examples:
A string with latin1
character set and
latin1_german1_ci
collation:
SELECT _latin1'Müller' COLLATE latin1_german1_ci;
A string with latin1
character set and
its default collation (that is,
latin1_swedish_ci
):
SELECT _latin1'Müller';
A string with the connection default character set and collation:
SELECT 'Müller';
Character set introducers and the COLLATE
clause are implemented according to standard SQL specifications.
With the COLLATE
clause, you can override
whatever the default collation is for a comparison.
COLLATE
may be used in various parts of SQL
statements. Here are some examples:
With ORDER BY
:
SELECT k FROM t1 ORDER BY k COLLATE latin1_german2_ci;
With AS
:
SELECT k COLLATE latin1_german2_ci AS k1 FROM t1 ORDER BY k1;
With GROUP BY
:
SELECT k FROM t1 GROUP BY k COLLATE latin1_german2_ci;
With aggregate functions:
SELECT MAX(k COLLATE latin1_german2_ci) FROM t1;
With DISTINCT
:
SELECT DISTINCT k COLLATE latin1_german2_ci FROM t1;
With WHERE
:
SELECT * FROM t1 WHERE _latin1 'Müller' COLLATE latin1_german2_ci = k;
SELECT * FROM t1 WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci;
With HAVING
:
SELECT k FROM t1 GROUP BY k HAVING k = _latin1 'Müller' COLLATE latin1_german2_ci;
The COLLATE
clause has high precedence
(higher than ||
), so the following two
expressions are equivalent:
x || y COLLATE z x || (y COLLATE z)
The BINARY
operator is a shorthand for a
COLLATE
clause. BINARY
'
is equivalent to
x
''
, where
x
' COLLATE
y
y
is the name of the binary collation
for the character set of 'x
'. Every
character set has a binary collation. For example, the binary
collation for the latin1
character set is
latin1_bin
, so if the column
a
is of character set
latin1
, the following two statements have the
same effect:
SELECT * FROM t1 ORDER BY BINARY a; SELECT * FROM t1 ORDER BY a COLLATE latin1_bin;
In the great majority of queries, it is obvious what collation
MySQL uses to resolve a comparison operation. For example, in
the following cases, it should be clear that the collation is
“the column collation of column
x
”:
SELECT x FROM T ORDER BY x; SELECT x FROM T WHERE x = x; SELECT DISTINCT x FROM T;
However, when multiple operands are involved, there can be ambiguity. For example:
SELECT x FROM T WHERE x = 'Y';
Should this query use the collation of the column
x
, or of the string literal
'Y'
?
Standard SQL resolves such questions using what used to be
called “coercibility” rules. Basically, this means:
Since both x
and 'Y'
have
collations, whose collation takes precedence? This can be
difficult to resolve, but the following rules take care of most
situations:
An explicit COLLATE
clause has a
coercibility of 0. (Not coercible at all.)
The concatenation of two strings with different collations has a coercibility of 1.
A column's collation has a coercibility of 2.
A “system constant” (the string returned by
functions such as USER()
or
VERSION()
) has a coercibility of 3.
A literal's collation has a coercibility of 4.
NULL
or an expression that is derived
from NULL
has a coercibility of 5.
The preceding coercibility values are current as of MySQL 5.0.3. See the note later in this section for additional version-related information.
Those rules resolve ambiguities thus:
Use the collation with the lowest coercibility value.
If both sides have the same coercibility, then it is an error if the collations aren't the same.
Examples:
column1 = 'A' | Use collation of column1 |
column1 = 'A' COLLATE x | Use collation of 'A' |
column1 COLLATE x = 'A' COLLATE y | Error |
The COERCIBILITY()
function can be used to
determine the coercibility of a string expression:
mysql>SELECT COERCIBILITY('A' COLLATE latin1_swedish_ci);
-> 0 mysql>SELECT COERCIBILITY(VERSION());
-> 3 mysql>SELECT COERCIBILITY('A');
-> 4
See Section 12.9.3, “Information Functions”.
In MySQL 5.0 prior to release 5.0.3, there is no
system constant or ignorable coercibility. Functions such as
USER()
have a coercibility of 2 rather than
3, and literals have a coercibility of 3 rather than 4.
Recall that each character set has one or more collations, and
each collation is associated with one and only one character
set. Therefore, the following statement causes an error message
because the latin2_bin
collation is not legal
with the latin1
character set:
mysql> SELECT _latin1 'x' COLLATE latin2_bin;
ERROR 1251: COLLATION 'latin2_bin' is not valid
for CHARACTER SET 'latin1'
Suppose that column X
in table
T
has these latin1
column
values:
Muffler Müller MX Systems MySQL
And suppose that the column values are retrieved using the following statement:
SELECT X FROM T ORDER BY X COLLATE collation_name
;
The resulting order of the values for different collations is shown in this table:
latin1_swedish_ci | latin1_german1_ci | latin1_german2_ci |
Muffler | Muffler | Müller |
MX Systems | Müller | Muffler |
Müller | MX Systems | MX Systems |
MySQL | MySQL | MySQL |
The table is an example that shows what the effect would be if
we used different collations in an ORDER BY
clause. The character that causes the different sort orders in
this example is the U with two dots over it
(ü
), which the Germans call "U-umlaut".
The first column shows the result of the
SELECT
using the Swedish/Finnish
collating rule, which says that U-umlaut sorts with Y.
The second column shows the result of the
SELECT
using the German DIN-1 rule, which
says that U-umlaut sorts with U.
The third column shows the result of the
SELECT
using the German DIN-2 rule, which
says that U-umlaut sorts with UE.
This section describes operations that take character set information into account in MySQL 5.0.
MySQL has many operators and functions that return a string. This section answers the question: What is the character set and collation of such a string?
For simple functions that take string input and return a string
result as output, the output's character set and collation are
the same as those of the principal input value. For example,
UPPER(
returns a
string whose character string and collation are the same as that
of X
)X
. The same applies for
INSTR()
, LCASE()
,
LOWER()
, LTRIM()
,
MID()
, REPEAT()
,
REPLACE()
, REVERSE()
,
RIGHT()
, RPAD()
,
RTRIM()
, SOUNDEX()
,
SUBSTRING()
, TRIM()
,
UCASE()
, and UPPER()
.
(Also note: The REPLACE()
function, unlike
all other functions, always ignores the collation of the string
input and performs a case-insensitive comparison.)
For operations that combine multiple string inputs and return a single string output, the “aggregation rules” of standard SQL apply:
If an explicit COLLATE
occurs, then use
X
X
.
If explicit COLLATE
and X
COLLATE
occur, then raise an
error.
Y
Otherwise, if all collations are
X
, then use
X
.
Otherwise, the result has no collation.
For example, with CASE ... WHEN a THEN b WHEN b THEN c
COLLATE
, the
resulting collation is X
ENDX
. The same
applies for CASE
, UNION
,
||
, CONCAT()
,
ELT()
, GREATEST()
,
IF()
, and LEAST()
.
For operations that convert to character data, the character set
and collation of the strings that result from the operations are
defined by the character_set_connection
and
collation_connection
system variables. This
applies to CAST()
, CHAR()
,
CONV()
, FORMAT()
,
HEX()
, and SPACE()
.
CONVERT()
provides a way to convert data
between different character sets. The syntax is:
CONVERT(expr
USINGtranscoding_name
)
In MySQL, transcoding names are the same as the corresponding character set names.
Examples:
SELECT CONVERT(_latin1'Müller' USING utf8); INSERT INTO utf8table (utf8column) SELECT CONVERT(latin1field USING utf8) FROM latin1table;
CONVERT(... USING ...)
is implemented
according to the standard SQL specification.
In TRADITIONAL
SQL mode, if you convert a
“zero” date string to a date,
CONVERT()
returns NULL
.
MySQL 5.0.4 and above also produce a warning.
You may also use CAST()
to convert a string
to a different character set. The syntax is:
CAST(character_string
AScharacter_data_type
CHARACTER SETcharset_name
)
Example:
SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8);
If you use CAST()
without specifying
CHARACTER SET
, the resulting character set
and collation are defined by the
character_set_connection
and
collation_connection
system variables. If you
use CAST()
with CHARACTER SET
X
, then the resulting character set and collation are
X
and the default collation of
X
.
You may not use a COLLATE
clause inside a
CAST()
, but you may use it outside. That is,
CAST(... COLLATE ...)
is illegal, but
CAST(...) COLLATE ...
is legal.
Example:
SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8) COLLATE utf8_bin;
In TRADITIONAL
SQL mode, if you convert a
“zero” date string to a date,
CAST()
returns NULL
. MySQL
5.0.4 and above also produce a warning.
Several SHOW
statements provide additional
character set information. These include SHOW CHARACTER
SET
, SHOW COLLATION
, SHOW
CREATE DATABASE
, SHOW CREATE TABLE
and SHOW COLUMNS
.
The SHOW CHARACTER SET
command shows all
available character sets. It takes an optional
LIKE
clause that indicates which character
set names to match. For example:
mysql> SHOW CHARACTER SET LIKE 'latin%';
+---------+-----------------------------+-------------------+--------+
| Charset | Description | Default collation | Maxlen |
+---------+-----------------------------+-------------------+--------+
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 |
| latin7 | ISO 8859-13 Baltic | latin7_general_ci | 1 |
+---------+-----------------------------+-------------------+--------+
See Section 13.5.4.1, “SHOW CHARACTER SET
Syntax”.
The output from SHOW COLLATION
includes all
available character sets. It takes an optional
LIKE
clause that indicates which collation
names to match. For example:
mysql> SHOW COLLATION LIKE 'latin1%';
+-------------------+---------+----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+-------------------+---------+----+---------+----------+---------+
| latin1_german1_ci | latin1 | 5 | | | 0 |
| latin1_swedish_ci | latin1 | 8 | Yes | Yes | 0 |
| latin1_danish_ci | latin1 | 15 | | | 0 |
| latin1_german2_ci | latin1 | 31 | | Yes | 2 |
| latin1_bin | latin1 | 47 | | Yes | 0 |
| latin1_general_ci | latin1 | 48 | | | 0 |
| latin1_general_cs | latin1 | 49 | | | 0 |
| latin1_spanish_ci | latin1 | 94 | | | 0 |
+-------------------+---------+----+---------+----------+---------+
See Section 13.5.4.2, “SHOW COLLATION
Syntax”.
SHOW CREATE DATABASE
displays the
CREATE DATABASE
statement that creates a
given database. The result includes all database options.
DEFAULT CHARACTER SET
and
COLLATE
are supported. All database options
are stored in a text file named db.opt
that
can be found in the database directory.
mysql> SHOW CREATE DATABASE test;
+----------+-----------------------------------------------------------------+
| Database | Create Database |
+----------+-----------------------------------------------------------------+
| test | CREATE DATABASE `test` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+-----------------------------------------------------------------+
See Section 13.5.4.4, “SHOW CREATE DATABASE
Syntax”.
SHOW CREATE TABLE
is similar, but displays
the CREATE TABLE
statement to create a given
table. The column definitions indicate any character set
specifications, and the table options include character set
information.
See Section 13.5.4.5, “SHOW CREATE TABLE
Syntax”.
The SHOW COLUMNS
statement displays the
collations of a table's columns when invoked as SHOW
FULL COLUMNS
. Columns with CHAR
,
VARCHAR
, or TEXT
data
types have non-NULL
collations. Numeric and
other non-character types have NULL
collations. For example:
mysql> SHOW FULL COLUMNS FROM person\G
*************************** 1. row ***************************
Field: id
Type: smallint(5) unsigned
Collation: NULL
Null: NO
Key: PRI
Default: NULL
Extra: auto_increment
Privileges: select,insert,update,references
Comment:
*************************** 2. row ***************************
Field: name
Type: char(60)
Collation: latin1_swedish_ci
Null: NO
Key:
Default:
Extra:
Privileges: select,insert,update,references
Comment:
The character set is not part of the display. (The character set name is implied by the collation name.)
In MySQL 5.0, there are two character sets for storing Unicode data:
ucs2
, the UCS-2 Unicode character set.
utf8
, the UTF8 encoding of the Unicode
character set.
In UCS-2 (binary Unicode representation), every character is represented by a two-byte Unicode code with the most significant byte first. For example: "LATIN CAPITAL LETTER A" has the code 0x0041 and it's stored as a two-byte sequence: 0x00 0x41. "CYRILLIC SMALL LETTER YERU" (Unicode 0x044B) is stored as a two-byte sequence: 0x04 0x4B. For Unicode characters and their codes, please refer to the Unicode Home Page.
Currently, UCS-2 cannot yet be used as a client character set,
which means that SET NAMES 'ucs2'
does not
work.
The UTF8 character set (transform Unicode representation) is an alternative way to store Unicode data. It is implemented according to RFC 3629. The idea of the UTF8 character set is that various Unicode characters are encoded using byte sequences of different lengths:
Basic Latin letters, digits, and punctuation signs use one byte.
Most European and Middle East script letters fit into a two-byte sequence: extended Latin letters (with tilde, macron, acute, grave and other accents), Cyrillic, Greek, Armenian, Hebrew, Arabic, Syriac, and others.
Korean, Chinese, and Japanese ideographs use three-byte sequences.
RFC 3629 describes encoding sequences that take from one to four bytes. Currently, MySQL UTF8 support does not include four-byte sequences. (An older standard for UTF8 encoding is given by RFC 2279, which describes UTF8 sequences that take from one to six bytes. RFC 3629 renders RFC 2279 obsolete; for this reason, sequences with five and six bytes are no longer used.)
Tip: To save space with UTF8, use
VARCHAR
instead of CHAR
.
Otherwise, MySQL has to reserve 30 bytes for a CHAR(10)
CHARACTER SET utf8
column, because this is the maximum
possible length.
Metadata is “the data about the
data”. Anything that describes the
database — as opposed to being the
contents of the database — is metadata.
Thus column names, database names, usernames, version names, and
most of the string results from SHOW
are
metadata. This is also true of the contents of tables in the
INFORMATION_SCHEMA
database, because those
tables by definition contain information about database objects.
Representation of metadata must satisfy these requirements:
All metadata must be in the same character set. Otherwise,
neither the SHOW
commands nor
SELECT
queries against tables in the
INFORMATION_SCHEMA
database would work
properly because different rows in the same column of the
results of these operations would be in different character
sets.
Metadata must include all characters in all languages. Otherwise, users wouldn't be able to name columns and tables in their own languages.
In order to satisfy both requirements, MySQL stores metadata in a Unicode character set, namely UTF8. This does not cause any disruption if you never use accented characters. But if you do, you should be aware that metadata is in UTF8.
This means that the return values of the
USER()
, CURRENT_USER()
,
DATABASE()
, and VERSION()
functions have the UTF8 character set by default, as do synonyms
such as SESSION_USER()
and
SYSTEM_USER()
.
The server sets the character_set_system
system
variable to the name of the metadata character set:
mysql> SHOW VARIABLES LIKE 'character_set_system';
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| character_set_system | utf8 |
+----------------------+-------+
Storage of metadata using Unicode does not
mean that the headers of columns and the results of
DESCRIBE
functions are in the
character_set_system
character set by default.
When you use SELECT column1 FROM t
, the name
column1
itself is returned from the server to
the client in the character set as determined by the SET
NAMES
statement. More specifically, the character set
used is determined by the value of the
character_set_results
system variable. If this
variable is set to NULL
, no conversion is
performed and the server returns metadata using its original
character set (the set indicated by
character_set_system
).
If you want the server to pass metadata results back in a non-UTF8
character set, then use SET NAMES
to force the
server to perform character set conversion (see
Section 10.3.6, “Connection Character Sets and Collations”), or else cause the client to
perform the conversion. It is more efficient to have the client
perform the conversion, but this option is not always available
for all clients.
If you are using (for example) the USER()
function for comparison or assignment within a single statement,
don't worry. MySQL performs some automatic conversion for you.
SELECT * FROM Table1 WHERE USER() = latin1_column;
This works because the contents of
latin1_column
are automatically converted to
UTF8 before the comparison.
INSERT INTO Table1 (latin1_column) SELECT USER();
This works because the contents of USER()
are
automatically converted to latin1
before the
assignment. Automatic conversion is not fully implemented yet, but
should work correctly in a later version.
Although automatic conversion is not in the SQL standard, the SQL standard document does say that every character set is (in terms of supported characters) a “subset” of Unicode. Since it is a well-known principle that “what applies to a superset can apply to a subset”, we believe that a collation for Unicode can apply for comparisons with non-Unicode strings.
Note: In MySQL 5.0,
the errmsg.txt
files all use UTF8. Conversion
to the client character set is automatic, as with metadata.
For MaxDB compatibility these two statements are the same:
CREATE TABLE t1 (f1 CHAR(n
) UNICODE); CREATE TABLE t1 (f1 CHAR(n
) CHARACTER SET ucs2);
In MySQL 5.0, character set configuration is stored in XML files, one file per character set.
ANSI SQL defines NCHAR
or NATIONAL
CHAR
as a way to indicate that a CHAR
column should use some predefined character set. MySQL
5.0 uses utf8
as this predefined
character set. For example, these column type declarations are
equivalent:
CHAR(10) CHARACTER SET utf8 NATIONAL CHARACTER(10) NCHAR(10)
As are these:
VARCHAR(10) CHARACTER SET utf8 NATIONAL VARCHAR(10) NCHAR VARCHAR(10) NATIONAL CHARACTER VARYING(10) NATIONAL CHAR VARYING(10)
You can use
N'
to create
a string in the national character set. These two statements are
equivalent:
literal
'
SELECT N'some text'; SELECT _utf8'some text';
MySQL supports 70+ collations for 30+ character sets. The
character sets and their default collations are displayed by the
SHOW CHARACTER SET
statement:
mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+
| Charset | Description | Default collation |
+----------+-----------------------------+---------------------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci |
| dec8 | DEC West European | dec8_swedish_ci |
| cp850 | DOS West European | cp850_general_ci |
| hp8 | HP West European | hp8_english_ci |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci |
| latin1 | cp1252 West European | latin1_swedish_ci |
| latin2 | ISO 8859-2 Central European | latin2_general_ci |
| swe7 | 7bit Swedish | swe7_swedish_ci |
| ascii | US ASCII | ascii_general_ci |
| ujis | EUC-JP Japanese | ujis_japanese_ci |
| sjis | Shift-JIS Japanese | sjis_japanese_ci |
| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci |
| tis620 | TIS620 Thai | tis620_thai_ci |
| euckr | EUC-KR Korean | euckr_korean_ci |
| koi8u | KOI8-U Ukrainian | koi8u_general_ci |
| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci |
| greek | ISO 8859-7 Greek | greek_general_ci |
| cp1250 | Windows Central European | cp1250_general_ci |
| gbk | GBK Simplified Chinese | gbk_chinese_ci |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci |
| armscii8 | ARMSCII-8 Armenian | armscii8_general_ci |
| utf8 | UTF-8 Unicode | utf8_general_ci |
| ucs2 | UCS-2 Unicode | ucs2_general_ci |
| cp866 | DOS Russian | cp866_general_ci |
| keybcs2 | DOS Kamenicky Czech-Slovak | keybcs2_general_ci |
| macce | Mac Central European | macce_general_ci |
| macroman | Mac West European | macroman_general_ci |
| cp852 | DOS Central European | cp852_general_ci |
| latin7 | ISO 8859-13 Baltic | latin7_general_ci |
| cp1251 | Windows Cyrillic | cp1251_general_ci |
| cp1256 | Windows Arabic | cp1256_general_ci |
| cp1257 | Windows Baltic | cp1257_general_ci |
| binary | Binary pseudo charset | binary |
| geostd8 | GEOSTD8 Georgian | geostd8_general_ci |
| cp932 | SJIS for Windows Japanese | cp932_japanese_ci |
| eucjpms | UJIS for Windows Japanese | eucjpms_japanese_ci |
+----------+-----------------------------+---------------------+
MySQL has two Unicode character sets. You can store text in about 650 languages using these character sets.
ucs2
(UCS-2 Unicode) collations:
mysql> SHOW COLLATION LIKE 'ucs2%';
+--------------------+---------+-----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+--------------------+---------+-----+---------+----------+---------+
| ucs2_general_ci | ucs2 | 35 | Yes | Yes | 1 |
| ucs2_bin | ucs2 | 90 | | Yes | 1 |
| ucs2_unicode_ci | ucs2 | 128 | | Yes | 8 |
| ucs2_icelandic_ci | ucs2 | 129 | | Yes | 8 |
| ucs2_latvian_ci | ucs2 | 130 | | Yes | 8 |
| ucs2_romanian_ci | ucs2 | 131 | | Yes | 8 |
| ucs2_slovenian_ci | ucs2 | 132 | | Yes | 8 |
| ucs2_polish_ci | ucs2 | 133 | | Yes | 8 |
| ucs2_estonian_ci | ucs2 | 134 | | Yes | 8 |
| ucs2_spanish_ci | ucs2 | 135 | | Yes | 8 |
| ucs2_swedish_ci | ucs2 | 136 | | Yes | 8 |
| ucs2_turkish_ci | ucs2 | 137 | | Yes | 8 |
| ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
| ucs2_danish_ci | ucs2 | 139 | | Yes | 8 |
| ucs2_lithuanian_ci | ucs2 | 140 | | Yes | 8 |
| ucs2_slovak_ci | ucs2 | 141 | | Yes | 8 |
| ucs2_spanish2_ci | ucs2 | 142 | | Yes | 8 |
| ucs2_roman_ci | ucs2 | 143 | | Yes | 8 |
| ucs2_persian_ci | ucs2 | 144 | | Yes | 8 |
+--------------------+---------+-----+---------+----------+---------+
utf8
(UTF-8 Unicode) collations:
mysql> SHOW COLLATION LIKE 'utf8%';
+--------------------+---------+-----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+--------------------+---------+-----+---------+----------+---------+
| utf8_general_ci | utf8 | 33 | Yes | Yes | 1 |
| utf8_bin | utf8 | 83 | | Yes | 1 |
| utf8_unicode_ci | utf8 | 192 | | Yes | 8 |
| utf8_icelandic_ci | utf8 | 193 | | Yes | 8 |
| utf8_latvian_ci | utf8 | 194 | | Yes | 8 |
| utf8_romanian_ci | utf8 | 195 | | Yes | 8 |
| utf8_slovenian_ci | utf8 | 196 | | Yes | 8 |
| utf8_polish_ci | utf8 | 197 | | Yes | 8 |
| utf8_estonian_ci | utf8 | 198 | | Yes | 8 |
| utf8_spanish_ci | utf8 | 199 | | Yes | 8 |
| utf8_swedish_ci | utf8 | 200 | | Yes | 8 |
| utf8_turkish_ci | utf8 | 201 | | Yes | 8 |
| utf8_czech_ci | utf8 | 202 | | Yes | 8 |
| utf8_danish_ci | utf8 | 203 | | Yes | 8 |
| utf8_lithuanian_ci | utf8 | 204 | | Yes | 8 |
| utf8_slovak_ci | utf8 | 205 | | Yes | 8 |
| utf8_spanish2_ci | utf8 | 206 | | Yes | 8 |
| utf8_roman_ci | utf8 | 207 | | Yes | 8 |
| utf8_persian_ci | utf8 | 208 | | Yes | 8 |
+--------------------+---------+-----+---------+----------+---------+
Note: Support for the
ucs2_esperanto_ci
and
utf8_esperanto_ci
collations was added in
MySQL 5.0.13.
The utf8_unicode_ci
collation is implemented
according to the Unicode Collation Algorithm (UCA) described at
http://www.unicode.org/reports/tr10/. The
collation uses the version-4.0.0 UCA weight keys:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
(The following discussion uses
utf8_unicode_ci
, but it is also true for
ucs2_unicode_ci
.)
Currently, the utf8_unicode_ci
collation has
only partial support for the Unicode Collation Algorithm. Some
characters are not supported yet. Also, combining marks are not
fully supported. This affects primarily Vietnamese and some
minority languages in Russia such as Udmurt, Tatar, Bashkir, and
Mari.
The most significant feature in
utf8_unicode_ci
is that it supports
expansions, that is, when one character compares as equal to
combinations of other characters. For example, in German and
some other languages ‘ß
’ is
equal to ‘ss
’.
utf8_general_ci
is a legacy collation that
does not support expansions. It can make only one-to-one
comparisons between characters. This means that comparisons for
the utf8_general_ci
collation are faster, but
slightly less correct, than comparisons for
utf8_unicode_ci
).
For example, the following equalities hold in both
utf8_general_ci
and
utf8_unicode_ci
:
Ä = A Ö = O Ü = U
A difference between the collations is that this is true for
utf8_general_ci
:
ß = s
Whereas this is true for utf8_unicode_ci
:
ß = ss
Language-specific collations for the utf8
character set are implemented only if the ordering with
utf8_unicode_ci
does not work well for a
language. For example, utf8_unicode_ci
works
fine for German and French, so there is no need to create
special utf8
collations for these two
languages.
utf8_general_ci
is also satisfactory for both
German and French, except that
‘ß
’ is equal to
‘s
’, and not to
‘ss
’. If this is acceptable for
your application, then you should use
utf8_general_ci
because it is faster.
Otherwise, use utf8_unicode_ci
because it is
more accurate.
utf8_swedish_ci
, like other
utf8
language-specific collations, is derived
from utf8_unicode_ci
with additional language
rules. For example, in Swedish, the following relationship
holds, which is not something expected by a German or French
speaker:
Ü = Y < Ö
The utf8_spanish_ci
and
utf8_spanish2_ci
collations correspond to
modern Spanish and traditional Spanish, respectively. In both
collations, ‘ñ
’ (n-tilde) is a
separate letter between ‘n
’ and
‘o
’. In addition, for traditional
Spanish, ‘ch
’ is a separate
letter between ‘c
’ and
d
, and ‘ll
’ is
a separate letter between ‘l
’ and
‘m
’
Western European character sets cover most West European languages, such as French, Spanish, Catalan, Basque, Portuguese, Italian, Albanian, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English.
ascii
(US ASCII) collations:
ascii_bin
ascii_general_ci
(default)
cp850
(DOS West European) collations:
cp850_bin
cp850_general_ci
(default)
dec8
(DEC Western European) collations:
dec8_bin
dec8_swedish_ci
(default)
hp8
(HP Western European) collations:
hp8_bin
hp8_english_ci
(default)
latin1
(cp1252 West European) collations:
latin1_bin
latin1_danish_ci
latin1_general_ci
latin1_general_cs
latin1_german1_ci
latin1_german2_ci
latin1_spanish_ci
latin1_swedish_ci
(default)
latin1
is the default character set. The
latin1_swedish_ci
collation is the
default that probably is used by the majority of MySQL
customers. While it is frequently said that it is based on
the Swedish/Finnish collation rules, there are Swedes and
Finns who disagree with this statement.
The latin1_german1_ci
and
latin1_german2_ci
collations are based on
the DIN-1 and DIN-2 standards, where DIN stands for
Deutsches Institut für
Normung (the German equivalent of ANSI).
DIN-1 is called the “dictionary collation” and
DIN-2 is called the “phone book collation”.
latin1_german1_ci
(dictionary) rules:
Ä = A Ö = O Ü = U ß = s
latin1_german2_ci
(phone-book) rules:
Ä = AE Ö = OE Ü = UE ß = ss
In the latin1_spanish_ci
collation,
‘ñ
’ (n-tilde) is a separate
letter between ‘n
’ and
‘o
’.
macroman
(Mac West European) collations:
macroman_bin
macroman_general_ci
(default)
swe7
(7bit Swedish) collations:
swe7_bin
swe7_swedish_ci
(default)
We also provide some support for character sets used in the Czech Republic, Slovakia, Hungary, Romania, Slovenia, Croatia, and Poland.
cp1250
(Windows Central European)
collations:
cp1250_bin
cp1250_croatian_ci
cp1250_czech_cs
cp1250_general_ci
(default)
cp852
(DOS Central European) collations:
cp852_bin
cp852_general_ci
(default)
keybcs2
(DOS Kamenicky Czech-Slovak)
collations:
keybcs2_bin
keybcs2_general_ci
(default)
latin2
(ISO 8859-2 Central European)
collations:
latin2_bin
latin2_croatian_ci
latin2_czech_cs
latin2_general_ci
(default)
latin2_hungarian_ci
macce
(Mac Central European) collations:
macce_bin
macce_general_ci
(default)
South European and Middle Eastern character sets supported by MySQL include Armenian, Arabic, Georgian, Greek, Hebrew, and Turkish:
armscii8
(ARMSCII-8 Armenian) collations:
armscii8_bin
armscii8_general_ci
(default)
cp1256
(Windows Arabic) collations:
cp1256_bin
cp1256_general_ci
(default)
geostd8
(GEOSTD8 Georgian) collations:
geostd8_bin
geostd8_general_ci
(default)
greek
(ISO 8859-7 Greek) collations:
greek_bin
greek_general_ci
(default)
hebrew
(ISO 8859-8 Hebrew) collations:
hebrew_bin
hebrew_general_ci
(default)
latin5
(ISO 8859-9 Turkish) collations:
latin5_bin
latin5_turkish_ci
(default)
The Baltic character sets cover Estonian, Latvian, and Lithuanian languages. There are two Baltic character sets currently supported:
cp1257
(Windows Baltic) collations:
cp1257_bin
cp1257_general_ci
(default)
cp1257_lithuanian_ci
latin7
(ISO 8859-13 Baltic) collations:
latin7_bin
latin7_estonian_cs
latin7_general_ci
(default)
latin7_general_cs
Here are the Cyrillic character sets and collations for use with Belarusian, Bulgarian, Russian, and Ukrainian languages.
cp1251
(Windows Cyrillic) collations:
cp1251_bin
cp1251_bulgarian_ci
cp1251_general_ci
(default)
cp1251_general_cs
cp1251_ukrainian_ci
cp866
(DOS Russian) collations:
cp866_bin
cp866_general_ci
(default)
koi8r
(KOI8-R Relcom Russian) collations:
koi8r_bin
koi8r_general_ci
(default)
koi8u
(KOI8-U Ukrainian) collations:
koi8u_bin
koi8u_general_ci
(default)
The Asian character sets that we support include Chinese, Japanese, Korean, and Thai. These can be complicated. For example, the Chinese sets must allow for thousands of different characters.
big5
(Big5 Traditional Chinese)
collations:
big5_bin
big5_chinese_ci
(default)
cp932
(SJIS for Windows Japanese)
collations:
cp932_bin
cp932_japanese_ci
(default)
eucjpms
(UJIS for Windows Japanese)
collations:
eucjpms_bin
eucjpms_japanese_ci
(default)
euckr
(EUC-KR Korean) collations:
euckr_bin
euckr_korean_ci
(default)
gb2312
(GB2312 Simplified Chinese)
collations:
gb2312_bin
gb2312_chinese_ci
(default)
gbk
(GBK Simplified Chinese) collations:
gbk_bin
gbk_chinese_ci
(default)
sjis
(Shift-JIS Japanese) collations:
sjis_bin
sjis_japanese_ci
(default)
tis620
(TIS620 Thai) collations:
tis620_bin
tis620_thai_ci
(default)
ujis
(EUC-JP Japanese) collations:
ujis_bin
ujis_japanese_ci
(default)
Why is cp932
needed?
In MySQL, the sjis
character set
corresponds to the Shift_JIS
character set
defined by IANA, which supports JIS X0201 and JIS X0208
characters. (See
http://www.iana.org/assignments/character-sets.)
However, the meaning of “SHIFT JIS” as a
descriptive term has become very vague and it often includes
the extensions to Shift_JIS
that are
defined by various venders.
For example, “SHIFT JIS” used in Japanese Windows
environments is a Microsoft extension of
Shift_JIS
and its exact name is
Microsoft Windows Codepage : 932
or
cp932
. In addition to the characters
supported by Shift_JIS
,
cp932
supports extension characters such as
NEC special characters, NEC selected — IBM extended
characters, and IBM extended characters.
Many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:
MySQL automatically converts character sets.
Character sets are converted via Unicode
(ucs2
).
The sjis
character set does not support
the conversion of these extension characters.
There are several conversion rules from so-called “SHIFT JIS” to Unicode, and some characters are converted to Unicode differently depending on the conversion rule. MySQL supports only one of these rules (described later).
The MySQL cp932
character set is designed
to solve these problems. It is available as of MySQL 5.0.3.
Because MySQL 5.0 supports character set
conversion, it is important to separate IANA
Shift_JIS
and cp932
into
two different character sets because they provide different
conversion rules.
How does cp932
differ
from sjis
?
The cp932
character set differs from
sjis
in the following ways:
cp932
supports NEC special characters,
NEC selected — IBM extended characters, and IBM
selected characters.
Some cp932
characters have two
different code points, both of which convert to the same
Unicode code point. So, when converting from Unicode back
to cp932
, one of the code points must
be selected. For this “round trip
conversion,” the rule recommended by Microsoft is
used. (See
http://support.microsoft.com/kb/170559/EN-US/.)
The conversion rule works like this:
If the character is in both JIS X 0208 and NEC special characters, use the code point of JIS X 0208.
If the character is in both NEC special characters and IBM selected characters, use the code point of NEC special characters.
If the character is in both IBM selected characters and NEC selected — IBM extended characters, use the code point of IBM extended characters.
Information about the Unicode values of
cp932
characters is given in the table
shown at
http://www.microsoft.com/globaldev/reference/dbcs/932.htm.
For cp932
table entries with characters
under which a four-digit number appears, the number
represents the corresponding Unicode
(ucs2
) encoding. For table entries with
an underlined two-digit value appears, there is a range of
cp932
character values that begin with
those two digits. Clicking such a table entry takes you to
a page that displays the Unicode value for each of the
cp932
characters that begin with those
digits.
The following links are of special interest. They correspond to the encodings for the following sets of characters:
NEC special characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_87.htm
NEC selected — IBM extended characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_ED.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_EE.htm
IBM selected characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_FA.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_FB.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_FC.htm
Starting from version 5.0.3, cp932
supports conversion of user-defined characters in
combination with eucjpms
, and solves
the problems with
sjis
/ujis
conversion. For details, please refer to
http://www.opengroup.or.jp/jvc/cde/sjis-euc-e.html.
For some characters, conversion to and from
ucs2
is different for
sjis
and cp932
. The
following tables illustrate these differences.
Conversion to ucs2
:
sjis /cp932
Value | sjis →
ucs2 Conversion | cp932 →
ucs2 Conversion |
5C | 005C | 005C |
7E | 007E | 007E |
815C | 2015 | 2015 |
815F | 005C | FF3C |
8160 | 301C | FF5E |
8161 | 2016 | 2225 |
817C | 2212 | FF0D |
8191 | 00A2 | FFE0 |
8192 | 00A3 | FFE1 |
81CA | 00AC | FFE2 |
Conversion from ucs2
:
ucs2 value | ucs2 →
sjis Conversion | ucs2 →
cp932 Conversion |
005C | 815F | 5C |
007E | 7E | 7E |
00A2 | 8191 | 3F |
00A3 | 8192 | 3F |
00AC | 81CA | 3F |
2015 | 815C | 815C |
2016 | 8161 | 3F |
2212 | 817C | 3F |
2225 | 3F | 8161 |
301C | 8160 | 3F |
FF0D | 3F | 817C |
FF3C | 3F | 815F |
FF5E | 3F | 8160 |
FFE0 | 3F | 8191 |
FFE1 | 3F | 8192 |
FFE2 | 3F | 81CA |