gate.corpora
Class HtmlDocumentFormat
java.lang.Object
gate.util.AbstractFeatureBearer
gate.creole.AbstractResource
gate.creole.AbstractLanguageResource
gate.DocumentFormat
gate.corpora.TextualDocumentFormat
gate.corpora.HtmlDocumentFormat
- All Implemented Interfaces:
- LanguageResource, Resource, FeatureBearer, NameBearer, Serializable
public class HtmlDocumentFormat
- extends TextualDocumentFormat
The format of Documents. Subclasses of DocumentFormat know about
particular MIME types and how to unpack the information in any
markup or formatting they contain into GATE annotations. Each MIME
type has its own subclass of DocumentFormat, e.g. XmlDocumentFormat,
RtfDocumentFormat, MpegDocumentFormat. These classes register themselves
with a static index held by DocumentFormat when they are constructed. The
static getDocumentFormat methods can then be used to get the appropriate
format class for a particular document.
- See Also:
- Serialized Form
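The self-registration scheme described above can be sketched in plain Java. This is a toy model, not GATE code: ToyFormat, ToyHtmlFormat, ToyXmlFormat and forMimeType are hypothetical names invented for illustration, and the real DocumentFormat index may be keyed and populated differently.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy model of the self-registration pattern; none of these names are GATE classes.
class FormatRegistrySketch {

    /** Hypothetical stand-in for a document format (not gate.DocumentFormat). */
    static abstract class ToyFormat {
        // Static index shared by all formats, keyed by lower-cased MIME type.
        private static final Map<String, ToyFormat> INDEX = new HashMap<>();

        private final String mimeType;

        protected ToyFormat(String mimeType) {
            this.mimeType = mimeType.toLowerCase(Locale.ROOT);
            // Each format registers itself with the static index when constructed.
            INDEX.put(this.mimeType, this);
        }

        String getMimeType() {
            return mimeType;
        }

        /** Static lookup: returns the registered format, or null if none is known. */
        static ToyFormat forMimeType(String mimeType) {
            return INDEX.get(mimeType.toLowerCase(Locale.ROOT));
        }
    }

    static final class ToyHtmlFormat extends ToyFormat {
        ToyHtmlFormat() {
            super("text/html");
        }
    }

    static final class ToyXmlFormat extends ToyFormat {
        ToyXmlFormat() {
            super("text/xml");
        }
    }

    public static void main(String[] args) {
        // Constructing the formats is what registers them.
        new ToyHtmlFormat();
        new ToyXmlFormat();

        ToyFormat html = ToyFormat.forMimeType("TEXT/HTML");
        System.out.println(html.getClass().getSimpleName());

        // No format was registered for this type, so the lookup yields null.
        System.out.println(ToyFormat.forMimeType("video/mpeg"));
    }
}
```

Keeping the index static means callers need only a MIME type, not a reference to any particular format instance, to find the right handler.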
Methods inherited from class gate.DocumentFormat
addStatusListener, areEqual, decideBetweenThreeMimeTypes, decideBetweenTwoMimeTypes, fireStatusChanged, getDocumentFormat, getDocumentFormat, getDocumentFormat, getElement2StringMap, getFeatures, getMarkupElementsMap, getMimeType, getShouldCollectRepositioning, guessTypeUsingMagicNumbers, removeStatusListener, runMagicNumbers, setElement2StringMap, setFeatures, setMarkupElementsMap, setMimeType, setShouldCollectRepositioning, unpackMarkup
Methods inherited from class gate.creole.AbstractResource
checkParameterValues, getBeanInfo, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
HtmlDocumentFormat
public HtmlDocumentFormat()
- Default constructor.
supportsRepositioning
public Boolean supportsRepositioning()
- Indicates that repositioning information can be collected while the markup is parsed.
- Overrides:
supportsRepositioning
in class DocumentFormat
unpackMarkup
public void unpackMarkup(Document doc)
throws DocumentFormatException
- Old-style unpackMarkup, which does not collect RepositioningInfo.
- Overrides:
unpackMarkup
in class TextualDocumentFormat
- Throws:
DocumentFormatException
unpackMarkup
public void unpackMarkup(Document doc,
RepositioningInfo repInfo,
RepositioningInfo ampCodingInfo)
throws DocumentFormatException
- Unpack the markup in the document. This converts markup from the
native format (e.g. HTML) into annotations in GATE format.
Uses the markupElementsMap to determine which elements to convert, and
what annotation type names to use.
It always tries to parse the document's content, whether or not the
sourceUrl is null.
- Overrides:
unpackMarkup
in class TextualDocumentFormat
- Parameters:
doc
- The GATE document you want to parse.
- Throws:
DocumentFormatException
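As a rough illustration of what unpacking markup into annotations involves, the following self-contained sketch strips tags from a string and records one span per mapped element, with offsets into the stripped text. It is not the GATE implementation: Span, Result and unpack are hypothetical names, the map is assumed to go from element name to annotation type name, and the scanner assumes well-formed, properly nested markup with no void or self-closing elements.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy markup unpacker; none of these names are part of the GATE API.
class UnpackMarkupSketch {

    /** Hypothetical annotation: a type name plus offsets into the stripped text. */
    static final class Span {
        final String type;
        final int start;
        final int end;

        Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    /** Stripped text together with the spans found in it. */
    static final class Result {
        final String text;
        final List<Span> spans;

        Result(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    // Matches opening and closing tags; attributes are skipped, not parsed.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    /** Assumption: only elements present in the map produce a span. */
    static Result unpack(String markup, Map<String, String> markupElementsMap) {
        StringBuilder text = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        Deque<String> openNames = new ArrayDeque<>();
        Deque<Integer> openStarts = new ArrayDeque<>();

        Matcher m = TAG.matcher(markup);
        int last = 0;
        while (m.find()) {
            // Copy the character data that precedes this tag.
            text.append(markup, last, m.start());
            last = m.end();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {
                // Opening tag: remember where its content starts.
                openNames.push(name);
                openStarts.push(text.length());
            } else if (name.equals(openNames.peek())) {
                // Matching closing tag: emit a span if the element is mapped.
                openNames.pop();
                int start = openStarts.pop();
                String type = markupElementsMap.get(name);
                if (type != null) {
                    spans.add(new Span(type, start, text.length()));
                }
            }
        }
        text.append(markup, last, markup.length());
        return new Result(text.toString(), spans);
    }

    public static void main(String[] args) {
        Map<String, String> map = Map.of("p", "paragraph", "b", "bold");
        Result r = unpack("<p>Hello <b>GATE</b> users</p>", map);
        System.out.println(r.text);
        System.out.println(r.spans);
    }
}
```

Because the offsets refer to the stripped text rather than the original markup, a real implementation that needs to map back to the source would have to record the correspondence separately, which is the role the RepositioningInfo parameters appear to play here.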
init
public Resource init()
throws ResourceInstantiationException
- Initialise this resource, and return it.
- Specified by:
init
in interface Resource
- Overrides:
init
in class TextualDocumentFormat
- Throws:
ResourceInstantiationException