MySQL Localization and International Usage

MySQL Localization and International Usage
Prev	Chapter 5. Database Administration	Next

This section describes how to configure the server to use different character sets. It also discusses how to set the server's time zone and enable per-connection time zone support.

The Character Set Used for Data and Sorting

By default, MySQL uses the ISO-8859-1 (Latin1) character set with sorting according to Swedish/Finnish rules. These defaults are suitable for the United States and most of western Europe.

All MySQL binary distributions are compiled with --with-extra-charsets=complex. This adds code to all standard programs that enables them to handle latin1 and all multi-byte character sets within the binary. Other character sets will be loaded from a character-set definition file when needed.

The character set determines what characters are allowed in names. It also determines how strings are sorted by the ORDER BY and GROUP BY clauses of the SELECT statement.

You can change the character set with the --default-character-set option when you start the server. The character sets available depend on the --with-charset=charset and --with-extra-charsets= list-of-charsets | complex | all | none options to configure, and the character set configuration files listed in SHAREDIR/charsets/Index. See configure options.

As of MySQL 4.1.1, you can also change the character set collation with the --default-collation option when you start the server. The collation must be a legal collation for the default character set. (Use the SHOW COLLATION statement to determine which collations are available for each character set.) See configure options.

If you change the character set when running MySQL, that may also change the sort order. Consequently, you must run myisamchk -r -q --set-character-set=charset on all tables, or your indexes may not be ordered correctly.

When a client connects to a MySQL server, the server indicates to the client what the server's default character set is. The client will switch to use this character set for this connection.

You should use mysql_real_escape_string() when escaping strings for an SQL query. mysql_real_escape_string() is identical to the old mysql_escape_string() function, except that it takes the MYSQL connection handle as the first parameter so that the appropriate character set can be taken into account when escaping characters.

If the client is compiled with different paths than where the server is installed and the user who configured MySQL didn't include all character sets in the MySQL binary, you must tell the client where it can find the additional character sets it will need if the server runs with a different character set than the client.

You can do this by specifying a --character-sets-dir option to indicate the path to the directory in which the dynamic MySQL character sets are stored. For example, you can put the following in an option file:

[client]
character-sets-dir=/usr/local/mysql/share/mysql/charsets

You can force the client to use specific character set as follows:

[client]
default-character-set=charset

This is normally unnecessary, however.

Using the German Character Set

In MySQL 4.0, to get German sorting order, you should start mysqld with a --default-character-set=latin1_de option. This affects server behavior in several ways:

When sorting and comparing strings, the following mapping is performed on the strings before doing the comparison:
```
ä  ->  ae
ö  ->  oe
ü  ->  ue
ß  ->  ss
```
All accented characters are converted to their unaccented uppercase counterpart. All letters are converted to uppercase.
When comparing strings with LIKE, the one-character to two-character mapping is not done. All letters are converted to uppercase. Accents are removed from all letters except Ü, ü, Ö, ö, Ä, and ä.

In MySQL 4.1 and up, character set and collation are specified separately. You should select the latin1 character set and either the latin1_german1_ci or latin1_german2_ci collation. For example, to start the server with the latin1_german1_ci collation, use the --character-set-server=latin1 and --collation-server=latin1_german1_ci options.

For information on the differences between these two collations, see the section called “West European Character Sets”.

Setting the Error Message Language

By default, mysqld produces error messages in English, but they can also be displayed in any of these other languages: Czech, Danish, Dutch, Estonian, French, German, Greek, Hungarian, Italian, Japanese, Korean, Norwegian, Norwegian-ny, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, or Swedish.

To start mysqld with a particular language for error messages, use the --language or -L option. The option value can be a language name or the full path to the error message file. For example:

shell> mysqld --language=swedish

Or:

shell> mysqld --language=/usr/local/share/swedish

The language name should be specified in lowercase.

The language files are located (by default) in the share/LANGUAGE directory under the MySQL base directory.

To change the error message file, you should edit the errmsg.txt file, and then execute the following command to generate the errmsg.sys file:

shell> comp_err errmsg.txt errmsg.sys

If you upgrade to a newer version of MySQL, remember to repeat your changes with the new errmsg.txt file.

Adding a New Character Set

This section discusses the procedure for adding add another character set to MySQL. You must have a MySQL source distribution to use these instructions.

To choose the proper procedure, decide whether the character set is simple or complex:

If the character set does not need to use special string collating routines for sorting and does not need multi-byte character support, it is simple.
If it needs either of those features, it is complex.

For example, latin1 and danish are simple character sets, whereas big5 and czech are complex character sets.

In the following procedures, the name of your character set is represented by MYSET.

For a simple character set, do the following:

Add MYSET to the end of the sql/share/charsets/Index file. Assign a unique number to it.
Create the file sql/share/charsets/MYSET.conf. (You can use a copy of sql/share/charsets/latin1.conf as the basis for this file.)
The syntax for the file is very simple:
- Comments start with a ‘#’ character and proceed to the end of the line.
- Words are separated by arbitrary amounts of whitespace.
- When defining the character set, every word must be a number in hexadecimal format.
- The ctype array takes up the first 257 words. The to_lower[], to_upper[] and sort_order[] arrays take up 256 words each after that.
See the section called “The Character Definition Arrays”.
Add the character set name to the CHARSETS_AVAILABLE and COMPILED_CHARSETS lists in configure.in.
Reconfigure, recompile, and test.

For a complex character set, do the following:

Create the file strings/ctype-MYSET.c in the MySQL source distribution.
Add MYSET to the end of the sql/share/charsets/Index file. Assign a unique number to it.
Look at one of the existing ctype-*.c files (such as strings/ctype-big5.c) to see what needs to be defined. Note that the arrays in your file must have names like ctype_MYSET, to_lower_MYSET, and so on. These correspond to the arrays for a simple character set. See the section called “The Character Definition Arrays”.
Near the top of the file, place a special comment like this:
```
/*
 * This comment is parsed by configure to create ctype.c,
 * so don't change it unless you know what you are doing.
 *
 * .configure. number_MYSET=MYNUMBER
 * .configure. strxfrm_multiply_MYSET=N
 * .configure. mbmaxlen_MYSET=N
 */
```
The configure program uses this comment to include the character set into the MySQL library automatically.
The strxfrm_multiply and mbmaxlen lines are explained in the following sections. You need include them only if you need the string collating functions or the multi-byte character set functions, respectively.
You should then create some of the following functions:
- my_strncoll_MYSET()
- my_strcoll_MYSET()
- my_strxfrm_MYSET()
- my_like_range_MYSET()
See the section called “String Collating Support”.
Add the character set name to the CHARSETS_AVAILABLE and COMPILED_CHARSETS lists in configure.in.
Reconfigure, recompile, and test.

The sql/share/charsets/README file includes additional instructions.

If you want to have the character set included in the MySQL distribution, mail a patch to the MySQL internals mailing list. See the section called “The MySQL Mailing Lists”.

The Character Definition Arrays

to_lower[] and to_upper[] are simple arrays that hold the lowercase and uppercase characters corresponding to each member of the character set. For example:

to_lower['A'] should contain 'a'
to_upper['a'] should contain 'A'

sort_order[] is a map indicating how characters should be ordered for comparison and sorting purposes. Quite often (but not for all character sets) this is the same as to_upper[], which means that sorting will be case-insensitive. MySQL will sort characters based on the values of sort_order[] elements. For more complicated sorting rules, see the discussion of string collating in the section called “String Collating Support”.

ctype[] is an array of bit values, with one element for one character. (Note that to_lower[], to_upper[], and sort_order[] are indexed by character value, but ctype[] is indexed by character value + 1. This is an old legacy convention to be able to handle EOF.)

You can find the following bitmask definitions in m_ctype.h:

#define _U      01      /* Uppercase */
#define _L      02      /* Lowercase */
#define _N      04      /* Numeral (digit) */
#define _S      010     /* Spacing character */
#define _P      020     /* Punctuation */
#define _C      040     /* Control character */
#define _B      0100    /* Blank */
#define _X      0200    /* heXadecimal digit */

The ctype[] entry for each character should be the union of the applicable bitmask values that describe the character. For example, 'A' is an uppercase character (_U) as well as a hexadecimal digit (_X), so ctype['A'+1] should contain the value:

_U + _X = 01 + 0200 = 0201

String Collating Support

If the sorting rules for your language are too complex to be handled with the simple sort_order[] table, you need to use the string collating functions.

Right now the best documentation for this is the character sets that are already implemented. Look at the big5, czech, gbk, sjis, and tis160 character sets for examples.

You must specify the strxfrm_multiply_MYSET=N value in the special comment at the top of the file. N should be set to the maximum ratio the strings may grow during my_strxfrm_MYSET (it must be a positive integer).

Multi-Byte Character Support

If you want to add support for a new character set that includes multi-byte characters, you need to use the multi-byte character functions.

Right now the best documentation on this consists of the character sets that are already implemented. Look at the euc_kr, gb2312, gbk, sjis, and ujis character sets for examples. These are implemented in the ctype-charset.c files in the strings directory.

You must specify the mbmaxlen_MYSET=N value in the special comment at the top of the source file. N should be set to the size in bytes of the largest character in the set.

Problems With Character Sets

If you try to use a character set that is not compiled into your binary, you might run into the following problems:

Your program has an incorrect path to where the character sets are stored. (Default /usr/local/mysql/share/mysql/charsets). This can be fixed by using the --character-sets-dir option when you run the program in question.
The character set is a multi-byte character set that can't be loaded dynamically. In this case, you must recompile the program with support for the character set.
The character set is a dynamic character set, but you don't have a configure file for it. In this case, you should install the configure file for the character set from a new MySQL distribution.
If your Index file doesn't contain the name for the character set, your program will display the following error message:
```
ERROR 1105: File '/usr/local/share/mysql/charsets/?.conf'
not found (Errcode: 2)
```
In this case, you should either get a new Index file or manually add the name of any missing character sets to the current file.

For MyISAM tables, you can check the character set name and number for a table with myisamchk -dvv tbl_name.

MySQL Server Time Zone Support

Before MySQL 4.1.3, you can set the time zone for the server with the --timezone=timezone_name option to mysqld_safe. You can also set it by setting the TZ environment variable before you start mysqld.

The allowable values for --timezone or TZ are system-dependent. Consult your operating system documentation to see what values are acceptable.

Beginning with MySQL 4.1.3, the server maintains several time zone settings:

The system time zone. When the server starts, it attempts to determine the time zone of the host machine and uses it to set the system_time_zone system variable.
The server's current time zone. The global time_zone system variable indicates the time zone the server currently is operating in. The initial value is 'SYSTEM', which indicates that the server time zone is the same as the system time zone. The initial value can be specified explicitly with the --default-time-zone=timezone option. If you have the SUPER privilege, you can set the global value at runtime with this statement:
```
mysql> SET GLOBAL time_zone = timezone;
```
Per-connection time zones. Each client that connects has its own time zone setting, given by the session time_zone variable. Initially this is the same as the global time_zone variable, but can be reset with this statement:
```
mysql> SET time_zone = timezone;
```

The current values of the global and per-connection time zones can be retrieved like this:

mysql> SELECT @@global.time_zone, @@session.time_zone;

timezone values can be given as strings indicating an offset from UTC, such as '+10:00' or '-6:00'. If the time zone-related tables in the mysql database have been created and populated, you can also used named time zones, such as 'Europe/Helsinki', 'US/Eastern', or 'MET'. The value 'SYSTEM' indicates that the time zone should be the same as the system time zone. Time zone names are not case sensitive.

The MySQL installation procedure creates the time zone tables in the mysql database, but does not load them. You must do so manually. (If you are upgrading to MySQL 4.1.3 or later from an earlier version, you should create the tables by upgrading your mysql database. Use the instructions in the section called “Upgrading the Grant Tables”.)

Note: Currently, time time zone tables can be populated only on Unix. This problem will be addressed soon for Windows.

The mysql_tzinfo_to_sql program is used to load the time zone tables. You will need to know the directory under which your system's time zone files are stored. A likely location is /usr/share/zoneinfo. Pass the directory name on the command line to mysql_tzinfo_to_sql, and send the output into the mysql program. For example:

shell> mysql_tzinfo_to_sql /usr/share/zoneinfo | mysql -u root mysql

mysql_tzinfo_to_sql reads your system's time zone files and generates SQL statements from them. mysql processes those statements to load the time zone tables.

mysql_tzinfo_to_sql also can be used to load a single time zone file, and to generate leap second information.

To load a single time zone file tz_file that corresponds to a time zone name tz_name, invoke mysql_tzinfo_to_sql like this:

shell> mysql_tzinfo_to_sql tz_file tz_name | mysql -u root mysql

If your time zone needs to account for leap seconds, initialize the leap second information like this, where tz_file is the name of your time zone file:

shell> mysql_tzinfo_to_sql --leap tz_file | mysql -u root mysql

Prev	Up	Next
Disaster Prevention and Recovery	Home	The MySQL Log Files