General Performance
 

Don't use XML where it doesn't make sense. XML is not a panacea. You will not get good performance by transferring and parsing a lot of XML files.

Using XML is memory, CPU, and network intensive.


Parser Performance
 

Avoid creating a new parser each time you parse; reuse parser instances. A pool of reusable parser instances might be a good idea if you have multiple threads parsing at the same time.
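The reuse advice above can be sketched as a small fixed-size pool. This is a minimal illustration built on the standard JAXP API, not code from Xerces itself: the `SaxParserPool` class and its methods are hypothetical names, and it assumes the underlying parser implementation supports `SAXParser.reset()`.

```java
import java.io.ByteArrayInputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical fixed-size pool of reusable SAX parsers (not a Xerces class). */
final class SaxParserPool {
    private final BlockingQueue<SAXParser> idle;

    SaxParserPool(int size) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.newSAXParser()); // pay the construction cost once
        }
    }

    /** Borrows a parser, parses the bytes, and hands the parser back. */
    void parse(byte[] xml, DefaultHandler handler) throws Exception {
        SAXParser parser = idle.take(); // blocks while every parser is busy
        try {
            parser.parse(new ByteArrayInputStream(xml), handler);
        } finally {
            // Assumption: the implementation supports reset(); it restores
            // the parser to its original configuration before reuse.
            parser.reset();
            idle.add(parser);
        }
    }

    /** Number of parsers currently available for borrowing. */
    int idleCount() {
        return idle.size();
    }
}
```

Each thread calls `parse` on a shared pool, so no more than `size` parsers are ever created. A `ThreadLocal` holding one parser per thread would be a reasonable alternative when the number of threads is fixed.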

The parser configuration may affect the performance of the parser. For example, if you are interested in evaluating parser performance with DTDs, you may want to use the DTDConfiguration. (Note: you can build Xerces with DTD-only support using the dtdjars build target.)


Parsing Documents Performance
 

There are several things you can do to improve performance when parsing documents:

  • Convert the document to US ASCII ("US-ASCII") or Unicode ("UTF-8" or "UTF-16") before parsing. Documents written in ASCII are the fastest to parse because each character is guaranteed to be a single byte and maps directly to its equivalent Unicode value. For documents that contain Unicode characters beyond the ASCII range, multi-byte sequences must be read and converted for each character, and this conversion carries a performance penalty. The UTF-16 encoding alleviates some of this penalty because each character is specified using two bytes, assuming no surrogate characters. However, using UTF-16 can roughly double the size of the original document, and the larger document takes longer to parse.
  • Explicitly specify the "US-ASCII" encoding if your document is in ASCII format. If no encoding is specified, the XML specification requires the parser to assume UTF-8, which is slower to process.
  • Avoid external entities and external DTDs. The extra file opens and transcoding setup are expensive.
  • Reduce character count; smaller documents are parsed more quickly. Replace elements with attributes where it makes sense. Avoid gratuitous use of whitespace because the parser must scan past it.
  • Avoid using too many default attributes. Defaulting attribute values slows down processing.
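
The first two tips in the list above (convert to US-ASCII and declare the encoding explicitly) can be sketched with the standard JAXP identity transform. This is an illustration only: the `AsciiTranscoder` class and its `toUsAscii` method are hypothetical names, and the sketch assumes the serializer writes characters outside the target encoding as numeric character references, which is what the XSLT output rules call for.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper that re-serializes a document as US-ASCII. */
final class AsciiTranscoder {
    static byte[] toUsAscii(String xml) throws Exception {
        // A Transformer created without a stylesheet copies input to output.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        // Request US-ASCII output; the XML declaration then names the
        // encoding explicitly, so the parser need not assume UTF-8.
        identity.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
        identity.setOutputProperty(OutputKeys.INDENT, "no"); // no extra whitespace
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        identity.transform(new StreamSource(new StringReader(xml)),
                           new StreamResult(out));
        return out.toByteArray();
    }
}
```

Note that a document with many non-ASCII characters grows when converted this way, because each such character becomes a multi-character reference. The conversion is most likely to pay off for documents that are mostly ASCII already.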

XML Application Performance
 

When writing an XML application there are a number of choices you can make that affect performance. Some of the choices that affect the performance of your application are described below.

  • Grammar Caching -- if you do need validation, consider using grammar caching to reduce the cost of validation by allowing the parser to skip grammar loading and assessment. See this FAQ on how to perform grammar caching with Xerces.
  • Validation -- if you don't need validation (and infoset augmentation) of XML documents, don't include validators (DTD or XML Schema) in the pipeline. Including the validator components in the pipeline will result in a performance hit for your application: if a validator component is present in the pipeline, Xerces will try to augment the infoset even if the validation feature is set to false. If you are only interested in validating against DTDs, don't include the XML Schema validator in the pipeline.
  • DOCTYPE -- if you don't need validation, avoid using a DOCTYPE line in your XML document. The current version of the parser will always read the DTD if the DOCTYPE line is specified, even when the validation feature is set to false.
  • Deferred DOM -- by default, the DOM feature defer-node-expansion is true, causing DOM nodes to be expanded as the tree is traversed. The performance tests produced by Denis Sosnoski showed that Xerces DOM with deferred node expansion offers poor performance and large memory size for small documents (0K-10K). Thus, for best performance when using Xerces DOM with smaller documents, you should disable the deferred node expansion feature. For larger documents (~100K and higher) the deferred DOM offers better performance than the non-deferred DOM but uses a large amount of memory.
  • SAX -- if memory usage using DOM is a concern, SAX should be considered; the SAX parser uses very little memory and notifies the application as parts of the document are parsed.
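
The Deferred DOM tip above can be sketched through the standard JAXP factory. This is a minimal illustration: the `SmallDocumentDom` class is a hypothetical name, and the sketch assumes the feature URI `http://apache.org/xml/features/dom/defer-node-expansion` is recognized, which holds for Xerces-based parsers. Other parsers are expected to reject the feature, so the sketch falls back to their defaults.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical helper for parsing small documents into a non-deferred DOM. */
final class SmallDocumentDom {
    static final String DEFER_NODE_EXPANSION =
        "http://apache.org/xml/features/dom/defer-node-expansion";

    static DocumentBuilder newBuilder() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false); // validation is not needed here
        try {
            // Build the full tree up front instead of expanding nodes lazily.
            factory.setFeature(DEFER_NODE_EXPANSION, false);
        } catch (ParserConfigurationException e) {
            // Assumption: a parser that rejects this feature is not
            // Xerces-based, so its default DOM behaviour is kept.
        }
        return factory.newDocumentBuilder();
    }

    static Document parse(String xml) throws Exception {
        return newBuilder().parse(new InputSource(new StringReader(xml)));
    }
}
```

For documents around 100K and larger, leaving the feature at its default of true is the better starting point, as described in the list above.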

For more detailed information on best practices for writing XML applications you may want to read the following series of articles:

  1. Write XML documents and develop applications using the SAX and DOM APIs
  2. Reuse parser instances with the Xerces2 SAX and DOM implementations
  3. XNI, Xerces2 features and properties, and caching schemas

These three articles discuss general performance tips in addition to ones specifically pertaining to Xerces2.
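
As a rough illustration of the grammar caching tip, the sketch below compiles an XML Schema once and reuses it for every document. It uses the standard JAXP validation API rather than the Xerces grammar pool described in the grammar caching FAQ, so treat it as an analogous technique and not as the Xerces-specific mechanism. The `CachedSchemaValidator` class is a hypothetical name.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

/** Hypothetical wrapper that loads a schema once and validates many documents. */
final class CachedSchemaValidator {
    // A compiled Schema is immutable and may be shared between threads.
    private final Schema schema;

    CachedSchemaValidator(String xsd) throws SAXException {
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        // The expensive step: parse and compile the grammar exactly once.
        schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
    }

    /** Returns true if the document is valid against the cached schema. */
    boolean isValid(String xml) throws IOException {
        // Validators are not thread-safe, so create one per call.
        Validator validator = schema.newValidator();
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the default error handler throws on errors
        }
    }
}
```

An application would typically create one `CachedSchemaValidator` at startup and call `isValid` for each incoming document.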




Copyright © 1999-2005 The Apache Software Foundation. All Rights Reserved.