Internationalized indexes

The default indexing templates in DocBook XSL handle only the 26 letters of the basic Latin alphabet. But many languages use accented Latin characters, many add characters of their own, and many use entirely different alphabets. Fortunately, DocBook contributor Jirka Kosek has created a customization, shipped with the DocBook XSL stylesheets, that can handle a wide range of languages.

Two processes govern how an index is generated: sorting and grouping. Sorting means arranging entries according to the order of the characters in an alphabet. Grouping means treating a set of characters as the same character when assigning entries to index sections. For example, words starting with a, A, á, and à should all appear in the index section labelled A. Likewise, all the letters in a group should sort as if they were the same letter. For example, áb, ac, and ád should sort in that order, rather than with the two words starting with á placed together.

The basic steps for generating an index are to put each entry in a group based on its first letter, sort the groups, and then sort the words within each group. Jirka's customization adds the ability to define indexing groups and their members, and then process the entries by groups.

The definitions of the groups are in the gentext locale files, such as common/fr.xml for French.

<l:letters>
   <l:l i="1">A</l:l>
   <l:l i="1">a</l:l>
   <l:l i="1">&#224;</l:l>
   <l:l i="1">&#192;</l:l>
   <l:l i="1">&#226;</l:l>
   <l:l i="1">&#194;</l:l>
   <l:l i="1">&#198;</l:l>
   <l:l i="1">&#230;</l:l>
   <l:l i="2">B</l:l>
   <l:l i="2">b</l:l>
   <l:l i="3">C</l:l>
   <l:l i="3">c</l:l>
   <!-- remaining groups omitted -->
</l:letters>

The groups are identified by the values of the i attribute on each l:l letter element: all letters that share a value belong to the same group. Group 1 has all the “A” letters, group 2 has all the B's, and so on. Included in the “A” group are accented versions of upper- and lowercase A, entered as numeric character references (e.g., &#224;, which is à).

All of the gentext files in the common directory of the stylesheet distribution have a set of groups defined. But many have not yet been prepared for their specific language, and simply contain a copy of the groups from the English file (which has many accented characters in its groups anyway). You can identify such copied groups by the lang="en" attribute on the l:letters element in the gentext file. If your language has not been properly prepared, you can define the groups yourself in a customization of the gentext elements. That process is described in the section “Customizing generated text”. If you are confident that your group definition is correct, you could submit it to the DocBook development team for inclusion in future releases.
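
As a rough sketch of such a customization, the fragment below defines the start of the French letter groups inside a customization layer, adding Ç and ç to the C group. It assumes that the local.l10n.xml mechanism from “Customizing generated text” is also consulted for l:letters in your version of the stylesheets, which is worth verifying before relying on it. The group contents are illustrative and deliberately incomplete.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:l="http://docbook.sourceforge.net/xmlns/l10n/1.0"
    exclude-result-prefixes="l">

  <!-- Replace path/to with the location of your stylesheet distribution -->
  <xsl:import href="path/to/fo/docbook.xsl"/>

  <!-- Point the gentext lookup at the l:i18n element in this same file -->
  <xsl:param name="local.l10n.xml" select="document('')"/>

  <l:i18n>
    <l:l10n language="fr">
      <l:letters>
        <l:l i="1">A</l:l>
        <l:l i="1">a</l:l>
        <!-- ... other members of group 1 ... -->
        <l:l i="3">C</l:l>
        <l:l i="3">c</l:l>
        <l:l i="3">&#199;</l:l>  <!-- Ç, illustrative addition -->
        <l:l i="3">&#231;</l:l>  <!-- ç, illustrative addition -->
        <!-- ... remaining groups ... -->
      </l:letters>
    </l:l10n>
  </l:i18n>

</xsl:stylesheet>
```

A local definition like this most likely replaces the shipped list rather than extending it, so a real one should start from a full copy of the l:letters element in the locale file and then be adjusted.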

The sorting of entries is triggered in the stylesheet by an xsl:sort element, with a lang attribute whose value is taken from the document being processed. So you must have an appropriate lang attribute on the root element of your document.
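
For illustration, a French document using DocBook 4 might declare its language as follows. The title is only a placeholder. In DocBook 5 the attribute is xml:lang rather than lang.

```xml
<!-- The lang value on the root element is what the index sorting picks up -->
<book lang="fr">
  <title>Guide de l'utilisateur</title>
  <!-- chapters and index go here -->
</book>
```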

XSLT processors generally do not implement language-specific collation themselves. They hand off the actual sorting to the underlying platform, such as the operating system or, for a Java-based processor, the Java runtime. So the results will depend on how well that platform can sort the language specified. If it does not have the proper collation rules for your language, then the results will likely be unsatisfactory.

To use Jirka's internationalized index customization for print output, you only have to xsl:include the file fo/autoidx-ng.xsl in your customization layer:

<xsl:include href="path/to/fo/autoidx-ng.xsl"/>

Replace path/to with the actual path to the stylesheet file in the XSL distribution. You can also use an XML catalog entry to locate the file. That file in turn uses xsl:include to pull in common/autoidx-ng.xsl, which defines the extension functions used to process the groups. For HTML output, include the file html/autoidx-ng.xsl in your HTML customization layer instead.
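
A complete print customization layer could be as small as the sketch below. It assumes you refer to the stylesheets by the canonical DocBook XSL URI and let a catalog map that URI to a local copy; a plain file path in the href attributes works equally well.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Stock FO stylesheet; xsl:import must precede xsl:include -->
  <xsl:import href="http://docbook.sourceforge.net/release/xsl/current/fo/docbook.xsl"/>

  <!-- Internationalized index templates -->
  <xsl:include href="http://docbook.sourceforge.net/release/xsl/current/fo/autoidx-ng.xsl"/>

</xsl:stylesheet>
```

A matching catalog entry might look like this, where the local directory in rewritePrefix is a hypothetical install location that you should adjust to your system:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Map the canonical stylesheet URI to a local copy (hypothetical path) -->
  <rewriteURI
      uriStartString="http://docbook.sourceforge.net/release/xsl/current/"
      rewritePrefix="file:///usr/share/xml/docbook-xsl/"/>
</catalog>
```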

For the customization to work, the XSLT processor must be able to use EXSLT extension functions, and it must be able to use them in xsl:key elements. Saxon is known to work with the customization. But xsltproc does not support using the EXSLT extensions in xsl:key, and so won't work.