org.w3c.tidy
Class Tidy

java.lang.Object
  |
  +--org.w3c.tidy.Tidy
All Implemented Interfaces:
Serializable

public class Tidy
extends Object
implements Serializable

HTML parser and pretty printer

(c) 1998-2000 (W3C) MIT, INRIA, Keio University. See Tidy.java for the copyright notice. Derived from HTML Tidy Release 4 Aug 2000.

Copyright (c) 1998-2000 World Wide Web Consortium (Massachusetts Institute of Technology, Institut National de Recherche en Informatique et en Automatique, Keio University). All Rights Reserved.

Contributing Author(s):
Dave Raggett
Andy Quick (translation to Java)

The contributing author(s) would like to thank all those who helped with testing, bug fixes, and patience. This wouldn't have been possible without all of you.

COPYRIGHT NOTICE:
This software and documentation is provided "as is," and the copyright holders and contributing author(s) make no representations or warranties, express or implied, including but not limited to, warranties of merchantability or fitness for any particular purpose or that the use of the software or documentation will not infringe any third party patents, copyrights, trademarks or other rights.

The copyright holders and contributing author(s) will not be liable for any direct, indirect, special or consequential damages arising out of any use of the software or documentation, even if advised of the possibility of such damage.

Permission is hereby granted to use, copy, modify, and distribute this source code, or portions hereof, documentation and executables, for any purpose, without fee, subject to the following restrictions:

  1. The origin of this source code must not be misrepresented.
  2. Altered versions must be plainly marked as such and must not be misrepresented as being the original source.
  3. This Copyright notice may not be removed or altered from any source or altered source distribution.

The copyright holders and contributing author(s) specifically permit, without fee, and encourage the use of this source code as a component for supporting the Hypertext Markup Language in commercial products. If you use this source code in a product, acknowledgment is not required but would be appreciated.

Version:
1.0, 1999/05/22
1.0.1, 1999/05/29
1.1, 1999/06/18 Java Bean
1.2, 1999/07/10 Tidy Release 7 Jul 1999
1.3, 1999/07/30 Tidy Release 26 Jul 1999
1.4, 1999/09/04 DOM support
1.5, 1999/10/23 Tidy Release 27 Sep 1999
1.6, 1999/11/01 Tidy Release 22 Oct 1999
1.7, 1999/12/06 Tidy Release 30 Nov 1999
1.8, 2000/01/22 Tidy Release 13 Jan 2000
1.9, 2000/06/03 Tidy Release 30 Apr 2000
1.10, 2000/07/22 Tidy Release 8 Jul 2000
1.11, 2000/08/16 Tidy Release 4 Aug 2000
Author:
Dave Raggett
Andy Quick (translation to Java)
See Also:
Serialized Form

Constructor Summary
Tidy()
           
Tidy(boolean configLogger)
           
 
Method Summary
static Document createEmptyDocument()
          Creates an empty DOM Document.
 String getAltText()
           
 boolean getBreakBeforeBR()
           
 boolean getBurstSlides()
           
 int getCharEncoding()
           
 org.w3c.tidy.Configuration getConfiguration()
           
 String getDocType()
           
 boolean getDropEmptyParas()
           
 boolean getDropFontTags()
           
 boolean getEncloseBlockText()
           
 boolean getEncloseText()
           
 boolean getFixBackslash()
           
 boolean getFixComments()
           
 boolean getHideEndTags()
           
 boolean getIndentAttributes()
           
 boolean getIndentContent()
           
 String getInputStreamName()
           
 boolean getKeepFileTimes()
           
 boolean getLiteralAttribs()
           
 boolean getLogicalEmphasis()
           
 boolean getMakeClean()
           
 boolean getNumEntities()
           
 int getParseErrors()
          ParseErrors - the number of errors that occurred in the most recent parse operation
 int getParseWarnings()
          ParseWarnings - the number of warnings that occurred in the most recent parse operation
 boolean getQuoteAmpersand()
           
 boolean getQuoteMarks()
           
 boolean getQuoteNbsp()
           
 boolean getRawOut()
           
 String getSlidestyle()
           
 boolean getSmartIndent()
           
 int getSpaces()
           
 int getTabsize()
           
 boolean getTidyMark()
           
 boolean getUpperCaseAttrs()
           
 boolean getUpperCaseTags()
           
 boolean getWord2000()
           
 boolean getWrapAsp()
           
 boolean getWrapAttVals()
           
 boolean getWrapJste()
           
 int getWraplen()
           
 boolean getWrapPhp()
           
 boolean getWrapScriptlets()
           
 boolean getWrapSection()
           
 boolean getWriteback()
           
 boolean getXHTML()
           
 boolean getXmlOut()
           
 boolean getXmlPi()
           
 boolean getXmlPIs()
           
 boolean getXmlSpace()
           
 boolean getXmlTags()
           
static void main(String[] argv)
          Command line interface to parser and pretty printer.
 org.w3c.tidy.Node parse(InputStream in, OutputStream out)
          Parses InputStream in and returns the root Node.
 Document parseDOM(InputStream in, OutputStream out)
          Parses InputStream in and returns a DOM Document node.
 void pprint(Document doc, OutputStream out)
          Pretty-prints a DOM Document.
 void setAltText(String altText)
          AltText - default text for alt attribute
 void setBreakBeforeBR(boolean BreakBeforeBR)
          BreakBeforeBR - whether to output a newline before <br>
 void setBurstSlides(boolean BurstSlides)
          BurstSlides - create slides on each h2 element
 void setCharEncoding(int charencoding)
          CharEncoding
 void setConfigurationFromFile(String filename)
          Sets the configuration from a configuration file.
 void setConfigurationFromProps(Properties props)
          Sets the configuration from a properties object.
 void setDocType(String doctype)
          DocType - user-specified doctype: omit | auto | strict | loose | fpi, where the fpi is a string similar to "-//ACME//DTD HTML 3.14159//EN". Note: for an fpi, include the double quotes in the string.
 void setDropEmptyParas(boolean DropEmptyParas)
          DropEmptyParas - discard empty p elements
 void setDropFontTags(boolean DropFontTags)
          DropFontTags - discard presentation tags
 void setEncloseBlockText(boolean EncloseBlockText)
          EncloseBlockText - if true text in blocks is wrapped in <p>'s
 void setEncloseText(boolean EncloseText)
          EncloseText - if true text at body is wrapped in <p>'s
 void setFixBackslash(boolean FixBackslash)
          FixBackslash - fix URLs by replacing \ with /
 void setFixComments(boolean FixComments)
          FixComments - fix comments with adjacent hyphens
 void setHideEndTags(boolean HideEndTags)
          HideEndTags - suppress optional end tags
 void setIndentAttributes(boolean IndentAttributes)
          IndentAttributes - newline+indent before each attribute
 void setIndentContent(boolean IndentContent)
          IndentContent - indent content of appropriate tags
 void setInputStreamName(String name)
          InputStreamName - the name of the input stream (printed in the header information).
 void setKeepFileTimes(boolean KeepFileTimes)
          KeepFileTimes - if true, the last modified time is preserved. This is NOT supported at this time.
 void setLiteralAttribs(boolean LiteralAttribs)
          LiteralAttribs - if true attributes may use newlines
 void setLogicalEmphasis(boolean LogicalEmphasis)
          LogicalEmphasis - replace i by em and b by strong
 void setMakeClean(boolean MakeClean)
          MakeClean - remove presentational clutter
 void setNumEntities(boolean NumEntities)
          NumEntities - use numeric entities
 void setQuoteAmpersand(boolean QuoteAmpersand)
          QuoteAmpersand - output naked ampersand as &amp;
 void setQuoteMarks(boolean QuoteMarks)
          QuoteMarks - output " marks as &quot;
 void setQuoteNbsp(boolean QuoteNbsp)
          QuoteNbsp - output non-breaking space as entity
 void setRawOut(boolean RawOut)
          RawOut - avoid mapping values > 127 to entities
 void setSlidestyle(String slidestyle)
          Slidestyle - style sheet for slides
 void setSmartIndent(boolean SmartIndent)
          SmartIndent - whether text/block-level content affects indentation
 void setSpaces(int spaces)
          Spaces - default indentation
 void setTabsize(int tabsize)
          Tabsize
 void setTidyMark(boolean TidyMark)
          TidyMark - add meta element indicating tidied doc
 void setUpperCaseAttrs(boolean UpperCaseAttrs)
          UpperCaseAttrs - output attributes in upper not lower case
 void setUpperCaseTags(boolean UpperCaseTags)
          UpperCaseTags - output tags in upper not lower case
 void setWord2000(boolean Word2000)
          Word2000 - draconian cleaning for Word2000
 void setWrapAsp(boolean WrapAsp)
          WrapAsp - wrap within ASP pseudo elements
 void setWrapAttVals(boolean WrapAttVals)
          WrapAttVals - wrap within attribute values
 void setWrapJste(boolean WrapJste)
          WrapJste - wrap within JSTE pseudo elements
 void setWraplen(int wraplen)
          Wraplen - default wrap margin
 void setWrapPhp(boolean WrapPhp)
          WrapPhp - wrap within PHP pseudo elements
 void setWrapScriptlets(boolean WrapScriptlets)
          WrapScriptlets - wrap within JavaScript string literals
 void setWrapSection(boolean WrapSection)
          WrapSection - wrap within <![ ... ]> section tags
 void setWriteback(boolean writeback)
          Writeback - if true, output tidied markup. NOTE: this property is ignored when parsing from an InputStream.
 void setXHTML(boolean xHTML)
          XHTML - output extensible HTML
 void setXmlOut(boolean XmlOut)
          XmlOut - create output as XML
 void setXmlPi(boolean XmlPi)
          XmlPi - add <?xml?> for XML docs
 void setXmlPIs(boolean XmlPIs)
          XmlPIs - if set to true, PIs must end with ?>
 void setXmlSpace(boolean XmlSpace)
          XmlSpace - if set to true, adds the xml:space attribute as needed
 void setXmlTags(boolean XmlTags)
          XmlTags - treat input as XML
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Tidy

public Tidy()

Tidy

public Tidy(boolean configLogger)

Method Detail

getConfiguration

public org.w3c.tidy.Configuration getConfiguration()

getParseErrors

public int getParseErrors()
ParseErrors - the number of errors that occurred in the most recent parse operation


getParseWarnings

public int getParseWarnings()
ParseWarnings - the number of warnings that occurred in the most recent parse operation


setSpaces

public void setSpaces(int spaces)
Spaces - default indentation

See Also:
Configuration.spaces

getSpaces

public int getSpaces()

setWraplen

public void setWraplen(int wraplen)
Wraplen - default wrap margin

See Also:
Configuration.wraplen

getWraplen

public int getWraplen()

setCharEncoding

public void setCharEncoding(int charencoding)
CharEncoding

See Also:
Configuration.CharEncoding

getCharEncoding

public int getCharEncoding()

setTabsize

public void setTabsize(int tabsize)
Tabsize

See Also:
Configuration.tabsize

getTabsize

public int getTabsize()

setWriteback

public void setWriteback(boolean writeback)
Writeback - if true, output tidied markup. NOTE: this property is ignored when parsing from an InputStream.

See Also:
Configuration.writeback

getWriteback

public boolean getWriteback()

setIndentContent

public void setIndentContent(boolean IndentContent)
IndentContent - indent content of appropriate tags

See Also:
Configuration.IndentContent

getIndentContent

public boolean getIndentContent()

setSmartIndent

public void setSmartIndent(boolean SmartIndent)
SmartIndent - whether text/block-level content affects indentation

See Also:
Configuration.SmartIndent

getSmartIndent

public boolean getSmartIndent()

setHideEndTags

public void setHideEndTags(boolean HideEndTags)
HideEndTags - suppress optional end tags

See Also:
Configuration.HideEndTags

getHideEndTags

public boolean getHideEndTags()

setXmlTags

public void setXmlTags(boolean XmlTags)
XmlTags - treat input as XML

See Also:
Configuration.XmlTags

getXmlTags

public boolean getXmlTags()

setXmlOut

public void setXmlOut(boolean XmlOut)
XmlOut - create output as XML

See Also:
Configuration.XmlOut

getXmlOut

public boolean getXmlOut()

setXHTML

public void setXHTML(boolean xHTML)
XHTML - output extensible HTML

See Also:
Configuration.xHTML

getXHTML

public boolean getXHTML()

setRawOut

public void setRawOut(boolean RawOut)
RawOut - avoid mapping values > 127 to entities

See Also:
Configuration.RawOut

getRawOut

public boolean getRawOut()

setUpperCaseTags

public void setUpperCaseTags(boolean UpperCaseTags)
UpperCaseTags - output tags in upper not lower case

See Also:
Configuration.UpperCaseTags

getUpperCaseTags

public boolean getUpperCaseTags()

setUpperCaseAttrs

public void setUpperCaseAttrs(boolean UpperCaseAttrs)
UpperCaseAttrs - output attributes in upper not lower case

See Also:
Configuration.UpperCaseAttrs

getUpperCaseAttrs

public boolean getUpperCaseAttrs()

setMakeClean

public void setMakeClean(boolean MakeClean)
MakeClean - remove presentational clutter

See Also:
Configuration.MakeClean

getMakeClean

public boolean getMakeClean()

setBreakBeforeBR

public void setBreakBeforeBR(boolean BreakBeforeBR)
BreakBeforeBR - whether to output a newline before <br>

See Also:
Configuration.BreakBeforeBR

getBreakBeforeBR

public boolean getBreakBeforeBR()

setBurstSlides

public void setBurstSlides(boolean BurstSlides)
BurstSlides - create slides on each h2 element

See Also:
Configuration.BurstSlides

getBurstSlides

public boolean getBurstSlides()

setNumEntities

public void setNumEntities(boolean NumEntities)
NumEntities - use numeric entities

See Also:
Configuration.NumEntities

getNumEntities

public boolean getNumEntities()

setQuoteMarks

public void setQuoteMarks(boolean QuoteMarks)
QuoteMarks - output " marks as &quot;

See Also:
Configuration.QuoteMarks

getQuoteMarks

public boolean getQuoteMarks()

setQuoteNbsp

public void setQuoteNbsp(boolean QuoteNbsp)
QuoteNbsp - output non-breaking space as entity

See Also:
Configuration.QuoteNbsp

getQuoteNbsp

public boolean getQuoteNbsp()

setQuoteAmpersand

public void setQuoteAmpersand(boolean QuoteAmpersand)
QuoteAmpersand - output naked ampersand as &amp;

See Also:
Configuration.QuoteAmpersand

getQuoteAmpersand

public boolean getQuoteAmpersand()

setWrapAttVals

public void setWrapAttVals(boolean WrapAttVals)
WrapAttVals - wrap within attribute values

See Also:
Configuration.WrapAttVals

getWrapAttVals

public boolean getWrapAttVals()

setWrapScriptlets

public void setWrapScriptlets(boolean WrapScriptlets)
WrapScriptlets - wrap within JavaScript string literals

See Also:
Configuration.WrapScriptlets

getWrapScriptlets

public boolean getWrapScriptlets()

setWrapSection

public void setWrapSection(boolean WrapSection)
WrapSection - wrap within <![ ... ]> section tags

See Also:
Configuration.WrapSection

getWrapSection

public boolean getWrapSection()

setAltText

public void setAltText(String altText)
AltText - default text for alt attribute

See Also:
Configuration.altText

getAltText

public String getAltText()

setSlidestyle

public void setSlidestyle(String slidestyle)
Slidestyle - style sheet for slides

See Also:
Configuration.slidestyle

getSlidestyle

public String getSlidestyle()

setXmlPi

public void setXmlPi(boolean XmlPi)
XmlPi - add <?xml?> for XML docs

See Also:
Configuration.XmlPi

getXmlPi

public boolean getXmlPi()

setDropFontTags

public void setDropFontTags(boolean DropFontTags)
DropFontTags - discard presentation tags

See Also:
Configuration.DropFontTags

getDropFontTags

public boolean getDropFontTags()

setDropEmptyParas

public void setDropEmptyParas(boolean DropEmptyParas)
DropEmptyParas - discard empty p elements

See Also:
Configuration.DropEmptyParas

getDropEmptyParas

public boolean getDropEmptyParas()

setFixComments

public void setFixComments(boolean FixComments)
FixComments - fix comments with adjacent hyphens

See Also:
Configuration.FixComments

getFixComments

public boolean getFixComments()

setWrapAsp

public void setWrapAsp(boolean WrapAsp)
WrapAsp - wrap within ASP pseudo elements

See Also:
Configuration.WrapAsp

getWrapAsp

public boolean getWrapAsp()

setWrapJste

public void setWrapJste(boolean WrapJste)
WrapJste - wrap within JSTE pseudo elements

See Also:
Configuration.WrapJste

getWrapJste

public boolean getWrapJste()

setWrapPhp

public void setWrapPhp(boolean WrapPhp)
WrapPhp - wrap within PHP pseudo elements

See Also:
Configuration.WrapPhp

getWrapPhp

public boolean getWrapPhp()

setFixBackslash

public void setFixBackslash(boolean FixBackslash)
FixBackslash - fix URLs by replacing \ with /

See Also:
Configuration.FixBackslash

getFixBackslash

public boolean getFixBackslash()

setIndentAttributes

public void setIndentAttributes(boolean IndentAttributes)
IndentAttributes - newline+indent before each attribute

See Also:
Configuration.IndentAttributes

getIndentAttributes

public boolean getIndentAttributes()

setDocType

public void setDocType(String doctype)
DocType - user-specified doctype: omit | auto | strict | loose | fpi, where the fpi is a string similar to "-//ACME//DTD HTML 3.14159//EN". Note: for an fpi, include the double quotes in the string.

See Also:
Configuration.docTypeStr, Configuration.docTypeMode
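
The note about double quotes is easy to miss in Java, where the quotes must be escaped inside the string literal. The following is a minimal sketch of building such a value; the class name DocTypeFpiSketch and the helper quoteFpi are hypothetical and not part of this API. The call to setDocType appears only in a comment so that the sketch runs without the library on the classpath.

```java
public class DocTypeFpiSketch {
    /** Wraps a formal public identifier (fpi) in literal double quotes. */
    public static String quoteFpi(String fpi) {
        return "\"" + fpi + "\"";
    }

    public static void main(String[] args) {
        // The fpi value is the example string from the description above.
        String docType = quoteFpi("-//ACME//DTD HTML 3.14159//EN");
        System.out.println(docType);

        // With the library on the classpath, the value would be passed as:
        //   tidy.setDocType(docType);
        // The keyword forms need no extra quoting, for example:
        //   tidy.setDocType("strict");
    }
}
```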

getDocType

public String getDocType()

setLogicalEmphasis

public void setLogicalEmphasis(boolean LogicalEmphasis)
LogicalEmphasis - replace i by em and b by strong

See Also:
Configuration.LogicalEmphasis

getLogicalEmphasis

public boolean getLogicalEmphasis()

setXmlPIs

public void setXmlPIs(boolean XmlPIs)
XmlPIs - if set to true PIs must end with ?>

See Also:
Configuration.XmlPIs

getXmlPIs

public boolean getXmlPIs()

setEncloseText

public void setEncloseText(boolean EncloseText)
EncloseText - if true text at body is wrapped in <p>'s

See Also:
Configuration.EncloseBodyText

getEncloseText

public boolean getEncloseText()

setEncloseBlockText

public void setEncloseBlockText(boolean EncloseBlockText)
EncloseBlockText - if true text in blocks is wrapped in <p>'s

See Also:
Configuration.EncloseBlockText

getEncloseBlockText

public boolean getEncloseBlockText()

setKeepFileTimes

public void setKeepFileTimes(boolean KeepFileTimes)
KeepFileTimes - if true, the last modified time is preserved. This is NOT supported at this time.

See Also:
Configuration.KeepFileTimes

getKeepFileTimes

public boolean getKeepFileTimes()

setWord2000

public void setWord2000(boolean Word2000)
Word2000 - draconian cleaning for Word2000

See Also:
Configuration.Word2000

getWord2000

public boolean getWord2000()

setTidyMark

public void setTidyMark(boolean TidyMark)
TidyMark - add meta element indicating tidied doc

See Also:
Configuration.TidyMark

getTidyMark

public boolean getTidyMark()

setXmlSpace

public void setXmlSpace(boolean XmlSpace)
XmlSpace - if set to true, adds the xml:space attribute as needed

See Also:
Configuration.XmlSpace

getXmlSpace

public boolean getXmlSpace()

setLiteralAttribs

public void setLiteralAttribs(boolean LiteralAttribs)
LiteralAttribs - if true attributes may use newlines

See Also:
Configuration.LiteralAttribs

getLiteralAttribs

public boolean getLiteralAttribs()

setInputStreamName

public void setInputStreamName(String name)
InputStreamName - the name of the input stream (printed in the header information).


getInputStreamName

public String getInputStreamName()

setConfigurationFromFile

public void setConfigurationFromFile(String filename)
Sets the configuration from a configuration file.


setConfigurationFromProps

public void setConfigurationFromProps(Properties props)
Sets the configuration from a properties object.
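
As a rough illustration, a properties object for this method might be assembled as below. The key names ("indent-spaces", "wrap", "output-xhtml") and the "yes" value are assumptions modeled on HTML Tidy option names, not taken from this page; check org.w3c.tidy.Configuration for the keys it actually reads. The class name TidyPropsSketch is hypothetical, and the call into Tidy is left as a comment so the sketch runs with the JDK alone.

```java
import java.util.Properties;

public class TidyPropsSketch {
    /** Builds a Properties object holding a few Tidy-style options. */
    public static Properties buildTidyProps() {
        Properties props = new Properties();
        // Assumed key names; each is meant to mirror one of the setters above.
        props.setProperty("indent-spaces", "2");  // compare setSpaces(int)
        props.setProperty("wrap", "72");          // compare setWraplen(int)
        props.setProperty("output-xhtml", "yes"); // compare setXHTML(boolean)
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildTidyProps();
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + " = " + props.getProperty(key));
        }

        // With the library on the classpath:
        //   Tidy tidy = new Tidy();
        //   tidy.setConfigurationFromProps(props);
    }
}
```

Keeping the options in a Properties object rather than calling individual setters can be convenient when the same settings are shared across several Tidy instances.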


parse

public org.w3c.tidy.Node parse(InputStream in,
                               OutputStream out)
Parses InputStream in and returns the root Node. If out is non-null, pretty prints to OutputStream out.


parseDOM

public Document parseDOM(InputStream in,
                         OutputStream out)
Parses InputStream in and returns a DOM Document node. If out is non-null, pretty prints to OutputStream out.
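
Because the return type is a DOM Document, the result can in principle be walked with the ordinary org.w3c.dom API. The sketch below counts elements by tag name using only basic traversal calls (getFirstChild, getNextSibling), on the assumption that the nodes returned by parseDOM support them. The class DomWalkSketch and the method countElements are hypothetical, and a stand-in document built with the JDK takes the place of real parser output so the sketch runs without the library.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class DomWalkSketch {
    /** Counts elements named tagName (case-insensitive) at or below node. */
    public static int countElements(Node node, String tagName) {
        int count = 0;
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equalsIgnoreCase(tagName)) {
            count++;
        }
        // Recurse into children using sibling links only.
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            count += countElements(child, tagName);
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in document: html > body > two p elements.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Node html = doc.appendChild(doc.createElement("html"));
        Node body = html.appendChild(doc.createElement("body"));
        body.appendChild(doc.createElement("p"));
        body.appendChild(doc.createElement("p"));

        System.out.println(countElements(doc, "p")); // prints 2

        // With the library on the classpath, the document would instead come from:
        //   Document doc = new Tidy().parseDOM(inputStream, null);
        // Passing null for out skips the pretty-printing step.
    }
}
```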


createEmptyDocument

public static Document createEmptyDocument()
Creates an empty DOM Document.


pprint

public void pprint(Document doc,
                   OutputStream out)
Pretty-prints a DOM Document.


main

public static void main(String[] argv)
Command line interface to parser and pretty printer.