Common Problems FAQs


	Questions

Parsing HTML Generated an Error
UTF-8 Character Error
Error Accessing EBCDIC XML Files
EOF Character Error


	Answers


	I tried to use Xerces-J to parse an HTML file and it generated an error. What did I do wrong?

Unfortunately, HTML does not, in general, follow the XML grammar rules, and most HTML files are not well-formed XML. The XML parser therefore reports well-formedness errors.

Typical errors include:

Missing end tags, e.g. <P> with no </P> (end tags are not required in HTML)
Missing closing slash on empty elements, e.g. <IMG SRC="foo"> instead of <IMG SRC="foo" /> (not required in HTML)
Missing quotes on attribute values, e.g. <IMG width=600> (not generally required in HTML)

HTML must match the XHTML standard for well-formedness before it can be parsed by Xerces-J or any other XML parser. You can find the XHTML standard on the W3C web site.
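
The difference between HTML-style and XHTML-style markup can be checked with a small test. The following is a minimal sketch, assuming the standard JAXP API (javax.xml.parsers) rather than any Xerces-specific class; the class name WellFormedCheck and the sample strings are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of Xerces-J.
class WellFormedCheck {
    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default handler so fatal errors are not echoed to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) {}
                public void error(SAXParseException e) {}
                public void fatalError(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations surface as SAXException.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not well-formedness errors.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML style: no end tags, unquoted attribute value.
        System.out.println(isWellFormed("<P>Hello<IMG SRC=foo>"));
        // XHTML style: closed elements, quoted attribute value.
        System.out.println(isWellFormed("<p>Hello<img src=\"foo\" /></p>"));
    }
}
```

The first call should print false and the second true. The sketch only checks well-formedness; it does not validate against the XHTML DTD.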


	I get an "invalid UTF-8 character" error.

According to the XML spec, many Unicode characters are not allowed in an XML document. Typical disallowed characters are control characters (other than tab, line feed, and carriage return), which remain illegal even if you escape them using the character reference form: &#xxxx; . See sections 2.2 and 4.1 of the XML spec for details. If the parser is generating this error, it is very likely that there is a character in the file that you cannot see. You can generally use a UNIX command like "od -hc" to find it.
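
If od is not available, the offending character can also be located programmatically. The following is a minimal sketch, assuming the document text has already been decoded into a Java String; it tests each code point against the Char production in section 2.2 of the XML 1.0 spec. The class and method names are hypothetical.

```java
// Hypothetical helper, not part of Xerces-J.
class XmlCharScan {
    /** True if the code point matches the Char production of XML 1.0, section 2.2. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the char index of the first disallowed character, or -1 if there is none. */
    static int firstIllegal(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isXmlChar(cp)) {
                return i;
            }
            // Step over both halves of a surrogate pair when needed.
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // The backspace character (0x08) is invisible in most editors.
        String doc = "<a>ok\b</a>";
        int index = firstIllegal(doc);
        if (index >= 0) {
            System.out.printf("illegal character 0x%X at index %d%n",
                              doc.codePointAt(index), index);
        }
    }
}
```

Note that this only finds characters that XML forbids. A byte sequence that is not valid UTF-8 in the first place has to be detected while decoding the file, which this sketch does not attempt.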


	I get an error when I access EBCDIC XML files. What is happening?

If an XML document is not encoded in UTF-8 (or UTF-16), then you MUST specify the encoding in the XML declaration. When transcoding a UTF-8 document to EBCDIC, remember to change this:

<?xml version="1.0" encoding="UTF-8"?>
to something like this:
<?xml version="1.0" encoding="ebcdic-cp-us"?>
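
The transcoding step and the declaration change can be done together. The following is a rough sketch, assuming the Java runtime includes the IBM037 charset (it ships in the extended charsets of a full JDK, but may be missing from a trimmed runtime) and that ebcdic-cp-us corresponds to IBM037, as listed in the IANA character set registry. The class name EbcdicTranscode is hypothetical, and the textual replacement is deliberately naive.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Xerces-J.
class EbcdicTranscode {
    /**
     * Decodes a UTF-8 document, rewrites encoding="UTF-8" in the XML
     * declaration to the given name, and encodes the result in the target charset.
     */
    static byte[] transcode(byte[] utf8Doc, String xmlEncodingName, Charset target) {
        String text = new String(utf8Doc, StandardCharsets.UTF_8);
        // Assumes the declaration uses exactly this spelling and double quotes.
        String updated = text.replaceFirst(
            Pattern.quote("encoding=\"UTF-8\""),
            Matcher.quoteReplacement("encoding=\"" + xmlEncodingName + "\""));
        return updated.getBytes(target);
    }

    public static void main(String[] args) {
        if (!Charset.isSupported("IBM037")) {
            System.out.println("IBM037 is not available in this runtime");
            return;
        }
        Charset ebcdic = Charset.forName("IBM037");
        byte[] in = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><doc/>"
            .getBytes(StandardCharsets.UTF_8);
        byte[] out = transcode(in, "ebcdic-cp-us", ebcdic);
        // Decoding with the same charset shows the updated declaration.
        System.out.println(new String(out, ebcdic));
    }
}
```

A document whose declaration omits the encoding, or uses single quotes, would need a more careful rewrite than this replacement.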


	I get an error on the EOF character (0x1A) -- what is happening?

You are probably using the LPEX editor, which automatically inserts an End-of-file character (0x1A) at the end of your XML document (other editors might do this as well). Unfortunately, the EOF character (0x1A) is an illegal character according to the XML specification, and Xerces-J correctly generates an error.
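
One possible workaround is to remove the character before the document reaches the parser. The following is a minimal sketch, assuming the file uses a single-byte or ASCII-compatible encoding such as UTF-8, so that the EOF character is the single byte 0x1A. The class name StripEof is hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Xerces-J.
class StripEof {
    /** Returns a copy of the data with any trailing 0x1A bytes removed. */
    static byte[] stripTrailingEof(byte[] data) {
        int end = data.length;
        while (end > 0 && data[end - 1] == 0x1A) {
            end--;
        }
        return Arrays.copyOf(data, end);
    }

    public static void main(String[] args) {
        byte[] raw = {'<', 'a', '/', '>', 0x1A};
        byte[] clean = stripTrailingEof(raw);
        System.out.println(raw.length + " bytes before, " + clean.length + " bytes after");
    }
}
```

The cleaned bytes can then be wrapped in a java.io.ByteArrayInputStream and handed to the parser. Only trailing occurrences are removed; a 0x1A byte in the middle of the document is left in place and will still be reported as an error.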