gate.corpora
Class TextualDocumentFormat
java.lang.Object
gate.util.AbstractFeatureBearer
gate.creole.AbstractResource
gate.creole.AbstractLanguageResource
gate.DocumentFormat
gate.corpora.TextualDocumentFormat
- All Implemented Interfaces:
- LanguageResource, Resource, FeatureBearer, NameBearer, Serializable
- Direct Known Subclasses:
- EmailDocumentFormat, HtmlDocumentFormat, RtfDocumentFormat, SgmlDocumentFormat, XmlDocumentFormat
public class TextualDocumentFormat
- extends DocumentFormat
The format of Documents. Subclasses of DocumentFormat know about
particular MIME types and how to unpack the information in any
markup or formatting they contain into GATE annotations. Each MIME
type has its own subclass of DocumentFormat, e.g. XmlDocumentFormat,
RtfDocumentFormat, MpegDocumentFormat. These classes register themselves
with a static index residing here when they are constructed. Static
getDocumentFormat methods can then be used to get the appropriate
format class for a particular document.
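The "static index" idea described above can be illustrated with a small, self-contained sketch. This is not GATE's implementation: the names SketchFormat and SketchFormatRegistry are hypothetical, and the case-insensitive lookup by MIME type string is an assumption made for the example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a format class; not part of GATE. */
class SketchFormat {
    private final String mimeType;

    SketchFormat(String mimeType) {
        this.mimeType = mimeType;
    }

    String mimeType() {
        return mimeType;
    }
}

/** Hypothetical static index mapping MIME type strings to format objects. */
final class SketchFormatRegistry {
    // Shared, thread-safe index keyed by lower-cased MIME type.
    private static final Map<String, SketchFormat> INDEX = new ConcurrentHashMap<>();

    private SketchFormatRegistry() {
    }

    /** Adds a format to the index, replacing any earlier entry for the same MIME type. */
    static void register(SketchFormat format) {
        INDEX.put(format.mimeType().toLowerCase(Locale.ROOT), format);
    }

    /** Looks up the format registered for a MIME type, if any. */
    static Optional<SketchFormat> lookup(String mimeType) {
        return Optional.ofNullable(INDEX.get(mimeType.toLowerCase(Locale.ROOT)));
    }
}
```

In this sketch registration is an explicit call rather than something done inside the constructor, which avoids publishing a partially constructed object. The description above says the GATE classes register themselves when constructed, so treat this only as an illustration of the lookup pattern.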
- See Also:
- Serialized Form
Methods inherited from class gate.DocumentFormat
addStatusListener, areEqual, decideBetweenThreeMimeTypes, decideBetweenTwoMimeTypes, fireStatusChanged, getDocumentFormat, getDocumentFormat, getDocumentFormat, getElement2StringMap, getFeatures, getMarkupElementsMap, getMimeType, getShouldCollectRepositioning, guessTypeUsingMagicNumbers, removeStatusListener, runMagicNumbers, setElement2StringMap, setFeatures, setMarkupElementsMap, setMimeType, setShouldCollectRepositioning, supportsRepositioning, unpackMarkup
Methods inherited from class gate.creole.AbstractResource
checkParameterValues, getBeanInfo, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
TextualDocumentFormat
public TextualDocumentFormat()
- Default constructor.
init
public Resource init()
throws ResourceInstantiationException
- Initialise this resource, and return it.
- Specified by:
init
in interface Resource
- Overrides:
init
in class AbstractResource
- Throws:
ResourceInstantiationException
unpackMarkup
public void unpackMarkup(Document doc)
throws DocumentFormatException
- Unpack the markup in the document. This converts markup from the
native format (e.g. XML, RTF) into annotations in GATE format.
Uses the markupElementsMap to determine which elements to convert, and
what annotation type names to use.
- Specified by:
unpackMarkup
in class DocumentFormat
- Throws:
DocumentFormatException
unpackMarkup
public void unpackMarkup(Document doc,
RepositioningInfo repInfo,
RepositioningInfo ampCodingInfo)
throws DocumentFormatException
- Specified by:
unpackMarkup
in class DocumentFormat
- Throws:
DocumentFormatException
setNewLineProperty
protected void setNewLineProperty(Document doc)
- Checks which newline sequence the document uses and sets the corresponding document property.
Possible values are CRLF, LFCR, CR and LF.
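As a rough illustration of how the four values can be told apart, the sketch below inspects the first line break in a string. It is a minimal example and not the GATE source: the class NewLineSketch, the method detect, and the choice to return an empty Optional when no line break is found are all assumptions.

```java
import java.util.Optional;

/** Hypothetical helper; not part of GATE. */
final class NewLineSketch {
    private NewLineSketch() {
    }

    /**
     * Classifies the first line break in the content as "CRLF", "LFCR",
     * "CR" or "LF". Returns empty when the content is null or has no line break.
     */
    static Optional<String> detect(String content) {
        if (content == null) {
            return Optional.empty();
        }
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (c != '\r' && c != '\n') {
                continue; // not a line-break character, keep scanning
            }
            // Peek at the following character to recognise two-character sequences.
            char next = (i + 1 < content.length()) ? content.charAt(i + 1) : '\0';
            if (c == '\r') {
                return Optional.of(next == '\n' ? "CRLF" : "CR");
            }
            return Optional.of(next == '\r' ? "LFCR" : "LF");
        }
        return Optional.empty();
    }
}
```

Only the first break is examined, so a document with mixed line endings is classified by whichever sequence appears first.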
annotateParagraphs
public void annotateParagraphs(Document aDoc,
int startOffset,
int endOffset,
String annotSetName)
throws DocumentFormatException
- This method annotates paragraphs in a GATE document. The investigated text
spans between the start and end offsets, and the paragraph annotations are
created in the annotation set named annotSetName. If annotSetName is null,
they are created in the default annotation set.
- Parameters:
aDoc
- the GATE document on which paragraph detection is performed. If it is
null, or its content is null, the method simply returns without doing anything.
startOffset
- the index in the document content at which paragraph detection starts.
endOffset
- the offset at which detection ends.
annotSetName
- the name of the annotation set in which the paragraph annotations are
created. The annotation type created is "paragraph".
- Throws:
DocumentFormatException
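The page does not say how paragraph boundaries are found. The sketch below shows one common heuristic, under stated assumptions: paragraphs are separated by at least one blank line, line endings are LF or CRLF, and each paragraph is reported as a start/end offset pair within the requested range. ParagraphSketch and findParagraphs are hypothetical names, and the code works on a plain String rather than on a GATE Document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of GATE. */
final class ParagraphSketch {
    private ParagraphSketch() {
    }

    /**
     * Returns {start, end} offset pairs (end exclusive) for each paragraph found
     * between startOffset and endOffset. Null text yields an empty list.
     */
    static List<int[]> findParagraphs(String text, int startOffset, int endOffset) {
        List<int[]> spans = new ArrayList<>();
        if (text == null) {
            return spans; // nothing to do, mirroring the "return doing nothing" rule
        }
        int start = Math.max(0, startOffset);
        int end = Math.min(text.length(), endOffset);

        int paraStart = -1;   // offset of the first visible character of the open paragraph
        int lastVisible = -1; // offset of the most recent non-whitespace character
        int lineBreaks = 0;   // '\n' characters seen since lastVisible

        for (int i = start; i < end; i++) {
            char c = text.charAt(i);
            if (c == '\n') {
                lineBreaks++;
            } else if (!Character.isWhitespace(c)) {
                if (paraStart != -1 && lineBreaks >= 2) {
                    // A blank line separated us from the previous text: close that paragraph.
                    spans.add(new int[] {paraStart, lastVisible + 1});
                    paraStart = -1;
                }
                if (paraStart == -1) {
                    paraStart = i;
                }
                lastVisible = i;
                lineBreaks = 0;
            }
        }
        if (paraStart != -1) {
            spans.add(new int[] {paraStart, lastVisible + 1});
        }
        return spans;
    }
}
```

In a real GATE document, each resulting span would presumably become a "paragraph" annotation in the chosen annotation set; that step is left out here because it depends on the GATE API.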
getDataStore
public DataStore getDataStore()
- Description copied from class:
AbstractLanguageResource
- Get the data store that this LR lives in. Null for transient LRs.
- Specified by:
getDataStore
in interface LanguageResource
- Overrides:
getDataStore
in class AbstractLanguageResource