Character Sets and Encodings
Character Sets
A character set is a set of textual and graphic symbols, each of which is mapped to a nonnegative integer.
US-ASCII was one of the first character sets used in computing. It is limited in that it can represent only American English. US-ASCII contains upper- and lower-case Latin alphabets, numerals, punctuation, a set of control codes, and a few miscellaneous symbols.
Unicode defines a standardized, universal character set that can be extended to accommodate additions. When the Java program source file encoding doesn't support Unicode, you can represent Unicode characters as escape sequences by using the notation \uXXXX, where XXXX is the character's 16-bit representation in hexadecimal. For example, the Spanish version of the Duke's Bookstore message file uses Unicode for non-ASCII characters:
    {"TitleCashier", "Cajero"},
    {"TitleBookDescription", "Descripci" + "\u00f3" + "n del Libro"},
    {"Visitor", "Es visitanten" + "\u00fa" + "mero "},
    {"What", "Qu" + "\u00e9" + " libros leemos"},
    {"Talk", " describe como componentes de software de web pueden transformar la manera en que desrrollamos aplicaciones para el web. Este libro es obligatorio para cualquier programador de respeto!"},
    {"Start", "Empezar a Comprar"},
Character Encoding
A character encoding maps a character set to units of a specific width and defines byte serialization and ordering rules. Many character sets have more than one encoding. For example, Java programs can represent Japanese character sets using the EUC-JP or Shift-JIS encodings, among others. Each encoding has rules for representing and serializing a character set.
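To make the distinction between a character set and its encodings concrete, the following minimal sketch serializes one Japanese string with several encodings and prints the resulting bytes. The class name EncodingComparison and the helper hex are illustrative names, not part of any tutorial example. EUC-JP and Shift_JIS are included only if the running Java installation reports them as supported, because only a small set of charsets is guaranteed on every platform.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class EncodingComparison {

    /** Formats bytes as space-separated uppercase hex, for example "E6 97 A5". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as Unicode escapes so the
        // source file itself can stay pure ASCII.
        String text = "\u65e5\u672c\u8a9e";

        List<Charset> charsets = new ArrayList<>();
        charsets.add(StandardCharsets.UTF_8);
        charsets.add(StandardCharsets.UTF_16BE);
        // Optional charsets: present in most full JDKs, but not guaranteed.
        for (String name : new String[] {"EUC-JP", "Shift_JIS"}) {
            if (Charset.isSupported(name)) {
                charsets.add(Charset.forName(name));
            }
        }

        // Same characters, different byte sequences per encoding.
        for (Charset cs : charsets) {
            System.out.println(cs.name() + ": " + hex(text.getBytes(cs)));
        }

        // Decoding with a different encoding than the one used to encode
        // does not restore the original text.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        String right = new String(utf8, StandardCharsets.UTF_8);
        System.out.println("decoded as ISO-8859-1 matches: " + text.equals(wrong));
        System.out.println("decoded as UTF-8 matches: " + text.equals(right));
    }
}
```

The last two lines hint at why the sender and receiver of text must agree on the encoding: the bytes alone do not say which rules produced them.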
The ISO 8859 series defines 15 character encodings that can represent texts in dozens of languages. Each ISO 8859 character encoding can have up to 256 characters. ISO 8859-1 (Latin-1) comprises the ASCII character set, characters with diacritics (accents, diaereses, cedillas, circumflexes, and so on), and additional symbols.
UTF-8 (Unicode Transformation Format, 8-bit form) is a variable-width character encoding that encodes each Unicode character (code point) as one to four bytes. A byte in UTF-8 is equivalent to 7-bit ASCII if its high-order bit is zero; otherwise, the byte is part of a sequence of two to four bytes that together encode a single character.
UTF-8 is compatible with the majority of existing web content and provides access to the Unicode character set. Current versions of browsers and email clients support UTF-8. In addition, many new web standards specify UTF-8 as their character encoding. For example, UTF-8 is one of the two required encodings for XML documents (the other is UTF-16).
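The variable width of UTF-8 can be observed directly. The sketch below, which uses the illustrative class name Utf8Widths and helper utf8Length (neither appears in the tutorial examples), encodes four sample code points and reports how many bytes each one needs. It also shows that a string built from a Unicode escape is an ordinary string once compiled.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Widths {

    /** Returns the number of bytes UTF-8 uses to encode one code point. */
    static int utf8Length(int codePoint) {
        // Character.toChars yields one char for most code points and a
        // surrogate pair (two chars) for code points above U+FFFF.
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Latin capital A, o with acute accent, euro sign, and an emoji.
        int[] samples = {0x41, 0xF3, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }

        // A message assembled with an escape sequence, in the style of the
        // Duke's Bookstore message file.
        String word = "Descripci" + "\u00f3" + "n";
        System.out.println(word.length() + " chars, "
            + word.getBytes(StandardCharsets.UTF_8).length + " UTF-8 bytes");
    }
}
```

Only the ASCII sample fits in a single byte; the accented letter costs two bytes, which is why the word in the last line occupies one more byte than it has characters.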
See Appendix A for more information on character encodings in the Java 2 platform.
Web components usually use PrintWriter to produce responses; PrintWriter automatically encodes using ISO 8859-1. Servlets can also output binary data using OutputStream classes, which perform no encoding. An application that uses a character set that cannot use the default encoding must explicitly set a different encoding.
For web components, three encodings must be considered: request, page, and response.
Request Encoding
The request encoding is the character encoding in which parameters in an incoming request are interpreted. Currently, many browsers do not send a request encoding qualifier with the Content-Type header. In such cases, a web container will use the default encoding--ISO-8859-1--to parse request data.
If the client hasn't set character encoding and the request data is encoded with a different encoding from the default, the data won't be interpreted correctly. To remedy this situation, you can use the ServletRequest.setCharacterEncoding(String enc) method to override the character encoding supplied by the container. To control the request encoding from JSP pages, you can use the JSTL fmt:requestEncoding tag. You must call the method or tag before parsing any request parameters or reading any input from the request. Calling the method or tag once data has been read will not affect the encoding.
Page Encoding
For JSP pages, the page encoding is the character encoding in which the file is encoded.
For JSP pages in standard syntax, the page encoding is determined from the following sources:
- The page encoding value of a JSP property group (see Setting Properties for Groups of JSP Pages, page 144) whose URL pattern matches the page.
- The pageEncoding attribute of the page directive of the page. It is a translation-time error to name different encodings in the pageEncoding attribute of the page directive of a JSP page and in a JSP property group.
- The CHARSET value of the contentType attribute of the page directive.
If none of these is provided, ISO-8859-1 is used as the default page encoding.
For JSP pages in XML syntax (JSP documents), the page encoding is determined as described in section 4.3.3 and appendix F.1 of the XML specification.
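The XML rules referred to above boil down to inspecting the first bytes of the file and, where possible, reading the encoding declaration. The following is a deliberately simplified sketch of that idea, not what a web container actually runs: the class XmlEncodingSniffer and its detect method are hypothetical names, and only UTF-8 and UTF-16 byte order marks plus ASCII-compatible declarations are handled. The 32-bit encodings and EBCDIC cases covered by the XML specification are omitted, and a declaration is searched for only in the first 200 bytes, which is an arbitrary limit chosen for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlEncodingSniffer {

    // Matches an XML declaration at the very start of the document and
    // captures the value of its encoding pseudo-attribute.
    private static final Pattern DECLARATION = Pattern.compile(
        "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Guesses the encoding name of an XML document from its leading bytes. */
    static String detect(byte[] data) {
        // Step 1: a byte order mark identifies the encoding family directly.
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        // Step 2: without a byte order mark, assume an ASCII-compatible
        // encoding and look for a declaration. ISO-8859-1 maps every byte to
        // one char, so it is safe for scanning the ASCII-only declaration.
        int length = Math.min(data.length, 200);
        String head = new String(data, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARATION.matcher(head);
        // Step 3: with neither mark nor declaration, XML defaults to UTF-8.
        return m.find() ? m.group(1) : "UTF-8";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"Shift_JIS\"?><root/>"
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(doc));
    }
}
```

The key point is the order of the checks: the byte-level signature is consulted first, because the declaration itself cannot be read until the encoding family is at least roughly known.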
The pageEncoding and contentType attributes determine the page character encoding of only the file that physically contains the page directive. A web container raises a translation-time error if an unsupported page encoding is specified.
Response Encoding
The response encoding is the character encoding of the textual response generated by a web component. The response encoding must be set appropriately so that the characters are rendered correctly for a given locale. A web container sets an initial response encoding for a JSP page from the following sources:
If none of these is provided, ISO-8859-1 is used as the default response encoding.
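To see why the ISO-8859-1 default matters, the following sketch writes the same text through a PrintWriter layered over ISO-8859-1 and over UTF-8, and then decodes the bytes as a client that knows the encoding would. It uses only the standard library as a stand-in for the writer a web container hands to a component; the class ResponseEncodingDemo and the method render are hypothetical names, and no servlet API is involved.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ResponseEncodingDemo {

    /** Writes text through a PrintWriter using the given charset and returns the raw bytes. */
    static byte[] render(String text, Charset charset) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        // Closing the writer flushes any buffered characters into the stream.
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(body, charset))) {
            out.print(text);
        }
        return body.toByteArray();
    }

    public static void main(String[] args) {
        // A Spanish word followed by one Japanese character.
        String text = "N\u00famero \u65e5";

        byte[] latin1Bytes = render(text, StandardCharsets.ISO_8859_1);
        byte[] utf8Bytes = render(text, StandardCharsets.UTF_8);

        // Decode each body with the same encoding that produced it.
        String viaLatin1 = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        String viaUtf8 = new String(utf8Bytes, StandardCharsets.UTF_8);

        // The accented letter exists in ISO-8859-1 and survives; the Japanese
        // character does not and is replaced with a question mark.
        System.out.println("ISO-8859-1 round trip intact: " + text.equals(viaLatin1));
        System.out.println("UTF-8 round trip intact: " + text.equals(viaUtf8));
    }
}
```

Text that stays within Latin-1 is unaffected by the default, which is why the problem often goes unnoticed until content in another script is served.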
The setCharacterEncoding, setContentType, and setLocale methods can be called repeatedly to change the character encoding. Calls made after the servlet response's getWriter method has been called or after the response is committed have no effect on the character encoding. Data is sent to the response stream on buffer flushes (for buffered pages) or on encountering the first content on unbuffered pages.
Calls to setContentType set the character encoding only if the given content type string provides a value for the charset attribute. Calls to setLocale set the character encoding only if neither setCharacterEncoding nor setContentType has set the character encoding before. To control the response encoding from JSP pages, you can use the JSTL fmt:setLocale tag.
To obtain the character encoding for a locale, the setLocale method checks the locale encoding mapping for the web application. For example, to map Japanese to the Japanese-specific encoding Shift_JIS, follow these steps:
If a mapping is not set for the web application, setLocale uses an Application Server mapping.
The first application in Chapter 4 allows a user to choose an English string representation of a locale from all the locales available to the Java 2 platform and then outputs a date localized for that locale. To ensure that the characters in the date can be rendered correctly for a wide variety of character sets, the JSP page that generates the date sets the response encoding to UTF-8 by using the following directive: