Chapter 19. Languages, characters and encoding

Table of Contents

Document encoding
Output encoding
Saxon output character representation
Special characters
Special characters in output
Missing characters
Language support
Using the lang attribute
Using language parameters
Language codes
Extending the set of languages

Characters in a computer are just numbers. An XML document examined directly in a computer's memory is a long string of numbers. A character set encoding is a mapping of those numbers to particular characters. For example, in the iso-8859-1 encoding, the number 225 is mapped to á (a acute). Whenever the computer displays the XML document, it uses an encoding to convert the numbers to character glyphs for display. There are many ways to do such mappings, and many character sets the numbers can map to, so there are many possible encodings. XML programs use Unicode internally to represent all characters in the computer's memory. However, your DocBook documents don't have to be written in a Unicode encoding, and your output doesn't have to be in one either.
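The mapping idea can be tried out directly. The following is a minimal Python sketch (Python is used here only for illustration; it is not part of the DocBook toolchain) that interprets the same number, 225, under two different encodings and then shows how the character á is stored in UTF-8.

```python
# The same stored number means different characters under different encodings.
raw = bytes([225])

as_latin1 = raw.decode("iso-8859-1")   # Western European mapping
as_greek = raw.decode("iso-8859-7")    # Modern Greek mapping

# Going the other way: one character, different numbers per encoding.
utf8_bytes = "á".encode("utf-8")       # two bytes in UTF-8
latin1_bytes = "á".encode("iso-8859-1")  # one byte in iso-8859-1

print(as_latin1, as_greek, list(utf8_bytes), list(latin1_bytes))
```

Under iso-8859-1 the number 225 is á, while under iso-8859-7 the same number is the Greek letter α. That difference is why a program has to know which encoding a document uses before it can interpret the numbers.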

If you want more details on encoding in XML, the website http://skew.org/xml/tutorial/ has an in-depth tutorial.

Document encoding

The creators of the XML specification were well aware that different documents may need different character encodings. So they let you specify the encoding right at the top of each document in the XML declaration:

<?xml version="1.0" encoding="iso-8859-1"?>

In this example, the encoding is specified as iso-8859-1, which is also known as ISO Latin 1. If the encoding is not specified, then UTF-8 is assumed (or UTF-16, if the file begins with a byte order mark). With the encoding established, an XML program that opens the document knows how to convert the numbers it sees to logical characters, and then convert those characters into the Unicode numbers it uses internally.

Of course, the content of the document must actually be encoded with the declared encoding. You can't just change the label at the top and think you have a new encoding. The document itself would have to be converted to the new mapping of characters. If the encoding declaration of the document does not match the actual encoding, then the parser will either report an error or silently produce gibberish.
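As an illustration of a mismatched declaration, here is a small Python sketch using the standard library's xml.etree.ElementTree parser (an assumption for the example; any XML parser behaves similarly). Both documents carry the same iso-8859-1 label, but only the first one was actually saved in that encoding.

```python
import xml.etree.ElementTree as ET

declaration = b'<?xml version="1.0" encoding="iso-8859-1"?>'
text = "\u00e1"  # the character á

# Correctly labeled: the bytes really are iso-8859-1.
good = declaration + b"<para>" + text.encode("iso-8859-1") + b"</para>"

# Mislabeled: the bytes are UTF-8, but the declaration still says iso-8859-1.
bad = declaration + b"<para>" + text.encode("utf-8") + b"</para>"

good_text = ET.fromstring(good).text
bad_text = ET.fromstring(bad).text

print(repr(good_text), repr(bad_text))
```

The first document yields á as expected. The second parses without any error but yields the two characters Ã¡, because each of the two UTF-8 bytes is interpreted as a separate iso-8859-1 character.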

Here are several common encoding names. Most XML programs recognize them in either uppercase or lowercase letters, but don't forget the hyphens.

Table 19.1. Character encodings

Encoding      Description
UTF-8         The default Unicode encoding.
UTF-16        Another Unicode encoding.
US-ASCII      Basic 128 characters.
ISO-8859-1    Western European languages.
ISO-8859-2    Central European languages.
ISO-8859-4    Baltic languages.
ISO-8859-5    Cyrillic.
ISO-8859-6    Arabic.
ISO-8859-7    Modern Greek.
ISO-8859-8    Hebrew.
ISO-8859-9    Turkish.
ISO-8859-15   ISO-8859-1 plus the Euro symbol and other small changes.
Shift_JIS     Japanese on Windows.
EUC-JP        Japanese on Unix.
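If you are unsure whether a tool will accept a given spelling of an encoding name, you can often ask it. The sketch below uses Python's codecs registry purely as an example; it says nothing definitive about what a particular XML parser or XSLT processor accepts, since each maintains its own list of names.

```python
import codecs

# Encoding names from the table above.
table_names = [
    "UTF-8", "UTF-16", "US-ASCII", "ISO-8859-1", "ISO-8859-2",
    "ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7",
    "ISO-8859-8", "ISO-8859-9", "ISO-8859-15", "Shift_JIS", "EUC-JP",
]

# codecs.lookup() raises LookupError for a name it does not know.
known = {name: codecs.lookup(name).name for name in table_names}

# Uppercase and lowercase spellings resolve to the same codec.
same_codec = codecs.lookup("ISO-8859-1").name == codecs.lookup("iso-8859-1").name

print(len(known), same_codec)
```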

What if you need to enter a character that a document's encoding does not include? For example, iso-8859-1 does not include a character for the trademark symbol ™. The solution is to use a numeric character reference for any character not in your encoding. The trademark symbol can be entered as &#x2122; in hexadecimal notation, or the equivalent &#8482; in decimal notation. Of course, having to remember that &#x2122; means trademark is an author's nightmare. Fortunately, the DocBook DTD provides more easily recognized named entities for hundreds of characters you might need. The following is one example from the DTD, this one declared in the iso-num.ent entities file.

<!ENTITY trade      "&#x2122;"> <!-- TRADE MARK SIGN -->

So you just need to enter &trade; in your document, and the XML parser uses the declaration in the DTD to convert it to the numeric Unicode character reference that all XML applications recognize. You can examine the complete set of available character entities by looking in the directory that contains the DocBook DTD. The ent subdirectory contains a number of iso-something.ent files, where something identifies the set of entity declarations in that file.
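To see the entity mechanism at work without installing the DocBook DTD, the following Python sketch copies the trade declaration into an internal DTD subset of a tiny document. This is a simplification for illustration only: a real DocBook document gets the declaration from the external DTD, and the para content here is made up.

```python
import xml.etree.ElementTree as ET

# The entity declaration is placed in an internal subset so the example
# is self-contained; DocBook normally supplies it from iso-num.ent.
doc = """<!DOCTYPE para [
  <!ENTITY trade "&#x2122;">
]>
<para>Acme&trade; widgets</para>"""

para = ET.fromstring(doc)

# The parser has replaced &trade; with the Unicode character U+2122.
print(repr(para.text))
```

After parsing, the application sees only the Unicode character; the entity name is gone. That is why downstream tools such as XSLT processors never need to know about &trade; itself.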

It is entirely possible to write a document in any language that is supported by Unicode using only ASCII characters. All characters beyond the basic ASCII character set are written using numeric character references, such as &#225; for á, and so on. If you want to see examples of such XML files, look at the files containing generated text strings, such as fr.xml in the common directory of the DocBook XSL distribution. Those files for all languages are encoded as ASCII XML files using numeric character references. The raw XML isn't very readable, however, unless it is displayed in a program that converts such numeric references to displayable glyphs.
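One way to produce such ASCII-only content is to let a program substitute the references for you. The sketch below uses Python's built-in xmlcharrefreplace error handler as an example; the title element and the French string are made up for the illustration and are not taken from fr.xml.

```python
import xml.etree.ElementTree as ET

original = "Table des mati\u00e8res"  # contains è, which is not ASCII

# Characters outside ASCII become decimal numeric character references.
ascii_bytes = original.encode("ascii", errors="xmlcharrefreplace")

# Parsing the ASCII-only XML restores the original characters.
doc = b"<title>" + ascii_bytes + b"</title>"
round_trip = ET.fromstring(doc).text

# Hexadecimal and decimal references name the same character.
hex_char = ET.fromstring("<c>&#x2122;</c>").text
dec_char = ET.fromstring("<c>&#8482;</c>").text

print(ascii_bytes, repr(round_trip), hex_char == dec_char)
```

Note that this error handler only replaces characters the target encoding cannot represent. It does not escape markup characters such as & or <, so text containing them still needs ordinary XML escaping first.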