GATE
Version 3.1-2270

gate
Class DocumentFormat

java.lang.Object
  extended by gate.util.AbstractFeatureBearer
      extended by gate.creole.AbstractResource
          extended by gate.creole.AbstractLanguageResource
              extended by gate.DocumentFormat
All Implemented Interfaces:
LanguageResource, Resource, FeatureBearer, NameBearer, Serializable
Direct Known Subclasses:
MSWordDocumentFormat, PdfDocumentFormat, TextualDocumentFormat

public abstract class DocumentFormat
extends AbstractLanguageResource
implements LanguageResource

The format of Documents. Subclasses of DocumentFormat know about particular MIME types and how to unpack the information in any markup or formatting they contain into GATE annotations. Each MIME type has its own subclass of DocumentFormat, e.g. XmlDocumentFormat, RtfDocumentFormat, MpegDocumentFormat. These classes register themselves with a static index residing here when they are constructed. Static getDocumentFormat methods can then be used to get the appropriate format class for a particular document.
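
The self-registration described above can be pictured with a minimal, GATE-free sketch. The class names FormatRegistrySketch, Format, XmlFormat and PlainFormat are hypothetical stand-ins, not GATE classes, and the real static index may be keyed differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a static index that subclasses fill in on construction.
final class FormatRegistrySketch {
    // Static index: MIME type string -> format instance.
    private static final Map<String, Format> INDEX = new HashMap<>();

    abstract static class Format {
        // Registering in the constructor mirrors the description above;
        // note that "this" escapes before the subclass is fully built.
        protected Format(String mimeType) {
            INDEX.put(mimeType, this);
        }

        abstract String name();
    }

    static final class XmlFormat extends Format {
        XmlFormat() { super("text/xml"); }
        @Override String name() { return "xml"; }
    }

    static final class PlainFormat extends Format {
        PlainFormat() { super("text/plain"); }
        @Override String name() { return "plain"; }
    }

    // Static lookup, similar in spirit to the getDocumentFormat methods.
    static Format getFormat(String mimeType) {
        return INDEX.get(mimeType);
    }
}
```

A lookup for a type that no constructed format has registered returns null in this sketch.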

See Also:
Serialized Form

Field Summary
protected  Map element2StringMap
          This map is used inside the unpackMarkup() method...
protected static boolean isGateXmlDocument
          This field indicates whether the document being processed is in the GATE custom XML format.
protected static Map magic2mimeTypeMap
          Map of Set of magic numbers to MimeType.
protected  Map markupElementsMap
          Map of markup elements to annotation types.
protected static Map mimeString2ClassHandlerMap
          Map of MimeTypeString to ClassHandler class.
protected static Map mimeString2mimeTypeMap
          Map of MimeType to DocumentFormat Class.
protected static Map suffixes2mimeTypeMap
          Map of Set of file suffixes to MimeType.
 
Fields inherited from class gate.creole.AbstractLanguageResource
dataStore, lrPersistentId
 
Fields inherited from class gate.creole.AbstractResource
name
 
Constructor Summary
DocumentFormat()
          Default construction
 
Method Summary
 void addStatusListener(StatusListener l)
           
protected static boolean areEqual(MimeType aMimeType, MimeType anotherMimeType)
          Tests if two MimeType objects are equal.
protected static MimeType decideBetweenThreeMimeTypes(MimeType aMimeTypeFromWebServer, MimeType aMimeTypeFromFileSuffix, MimeType aMimeTypeFromMagicNumbers)
          Decides which MIME type is in the majority.
protected static MimeType decideBetweenTwoMimeTypes(MimeType aMimeType, MimeType anotherMimeType)
          Decide between two mimeTypes.
protected  void fireStatusChanged(String e)
           
static DocumentFormat getDocumentFormat(Document aGateDocument, MimeType mimeType)
          Find a DocumentFormat implementation that deals with a particular MIME type, given that type.
static DocumentFormat getDocumentFormat(Document aGateDocument, String fileSuffix)
          Find a DocumentFormat implementation that deals with a particular MIME type, given the file suffix (e.g. ".txt") that the document came from.
static DocumentFormat getDocumentFormat(Document aGateDocument, URL url)
          Find a DocumentFormat implementation that deals with a particular MIME type, given the URL of the Document.
 Map getElement2StringMap()
          Get the element 2 string map
 FeatureMap getFeatures()
          Get the feature set
 Map getMarkupElementsMap()
          Get the markup elements map
 MimeType getMimeType()
          Gets the mime Type
 Boolean getShouldCollectRepositioning()
           
protected static MimeType guessTypeUsingMagicNumbers(InputStream aInputStream, String anEncoding)
          This method tries to guess the mime Type using some magic numbers.
 void removeStatusListener(StatusListener l)
           
protected static MimeType runMagicNumbers(InputStreamReader aReader)
          Runs the magic-number checks over the content of a GATE document.
 void setElement2StringMap(Map anElement2StringMap)
          Set the element 2 string map
 void setFeatures(FeatureMap features)
          Set the features map
 void setMarkupElementsMap(Map markupElementsMap)
          Set the markup elements map
 void setMimeType(MimeType aMimeType)
          Set the mime type
 void setShouldCollectRepositioning(Boolean b)
           
 Boolean supportsRepositioning()
          If the document format could collect repositioning information during the unpack phase this method will return true.
abstract  void unpackMarkup(Document doc)
          Unpack the markup in the document.
abstract  void unpackMarkup(Document doc, RepositioningInfo repInfo, RepositioningInfo ampCodingInfo)
           
 void unpackMarkup(Document doc, String originalContentFeatureType)
          Unpack the markup in the document.
 
Methods inherited from class gate.creole.AbstractLanguageResource
cleanup, getDataStore, getLRPersistenceId, getParent, isModified, setDataStore, setLRPersistenceId, setParent, sync
 
Methods inherited from class gate.creole.AbstractResource
checkParameterValues, getBeanInfo, getName, getParameterValue, getParameterValue, init, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gate.LanguageResource
getDataStore, getLRPersistenceId, getParent, isModified, setDataStore, setLRPersistenceId, setParent, sync
 
Methods inherited from interface gate.Resource
cleanup, getParameterValue, init, setParameterValue, setParameterValues
 
Methods inherited from interface gate.util.NameBearer
getName, setName
 

Field Detail

isGateXmlDocument

protected static boolean isGateXmlDocument
This field indicates whether the document being processed is in the GATE custom XML format. Detection is done in runMagicNumbers().


mimeString2ClassHandlerMap

protected static Map mimeString2ClassHandlerMap
Map of MimeTypeString to ClassHandler class. This is used to find the language resource that deals with the specific Document format


mimeString2mimeTypeMap

protected static Map mimeString2mimeTypeMap
Map of MimeType to DocumentFormat Class. This is used to find the DocumentFormat subclass that deals with a particular MIME type.


suffixes2mimeTypeMap

protected static Map suffixes2mimeTypeMap
Map of Set of file suffixes to MimeType. This is used to figure out what MIME type a document is from its file name.


magic2mimeTypeMap

protected static Map magic2mimeTypeMap
Map of Set of magic numbers to MimeType. This is used to guess the MIME type of a document, when we don't have any other clues.
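
As an illustration of guessing a type from magic numbers, here is a minimal sketch. The marker strings and the MagicNumberSketch class are assumptions made for the example; this page does not list the actual contents of magic2mimeTypeMap.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: guess a MIME type from marker strings near the start of the content.
final class MagicNumberSketch {
    // Assumed markers, for illustration only.
    private static final Map<String, String> MAGIC_TO_MIME = new LinkedHashMap<>();
    static {
        MAGIC_TO_MIME.put("<?xml", "text/xml");
        MAGIC_TO_MIME.put("<html", "text/html");
        MAGIC_TO_MIME.put("{\\rtf1", "text/rtf");
    }

    // Returns the MIME type of the first marker found in the content head, or null if none match.
    static String guess(String contentHead) {
        String lower = contentHead.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : MAGIC_TO_MIME.entrySet()) {
            if (lower.contains(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }
}
```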


markupElementsMap

protected Map markupElementsMap
Map of markup elements to annotation types. If it is null, the unpackMarkup() method will convert all markup, using the element names for annotation types. If it is non-null, only those elements specified here will be converted.
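
The null versus non-null behaviour described above can be sketched as follows. MarkupElementsMapSketch and annotationTypeFor are hypothetical names; the sketch only models the selection rule, not the actual unpacking code.

```java
import java.util.Map;

// Hypothetical sketch of how a markup elements map selects and renames elements.
final class MarkupElementsMapSketch {
    // Returns the annotation type to use for an element, or null if the element is skipped.
    static String annotationTypeFor(String elementName, Map<String, String> markupElementsMap) {
        if (markupElementsMap == null) {
            // No map: every element is converted and its own name is the annotation type.
            return elementName;
        }
        // Map present: only listed elements are converted; unlisted ones yield null.
        return markupElementsMap.get(elementName);
    }
}
```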


element2StringMap

protected Map element2StringMap
This map is used inside the unpackMarkup() method. When an element from the map is encountered, the corresponding string is added to the document content.
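
A minimal sketch of that behaviour, assuming the map goes from element name to the string to insert (for example a line break element mapped to a newline). Element2StringSketch and onElement are hypothetical names.

```java
import java.util.Map;

// Hypothetical sketch: append a replacement string when a mapped element is seen.
final class Element2StringSketch {
    // Called once per element encountered while building the document content.
    static void onElement(String elementName,
                          Map<String, String> element2StringMap,
                          StringBuilder content) {
        if (element2StringMap == null) {
            return; // nothing configured, content is left untouched
        }
        String replacement = element2StringMap.get(elementName);
        if (replacement != null) {
            content.append(replacement);
        }
    }
}
```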

Constructor Detail

DocumentFormat

public DocumentFormat()
Default construction

Method Detail

supportsRepositioning

public Boolean supportsRepositioning()
Returns true if the document format can collect repositioning information during the unpack phase.
Override this method in a subclass whose format can collect repositioning information.


setShouldCollectRepositioning

public void setShouldCollectRepositioning(Boolean b)

getShouldCollectRepositioning

public Boolean getShouldCollectRepositioning()

unpackMarkup

public abstract void unpackMarkup(Document doc)
                           throws DocumentFormatException
Unpack the markup in the document. This converts markup from the native format (e.g. XML, RTF) into annotations in GATE format. Uses the markupElementsMap to determine which elements to convert, and what annotation type names to use.

Throws:
DocumentFormatException

unpackMarkup

public abstract void unpackMarkup(Document doc,
                                  RepositioningInfo repInfo,
                                  RepositioningInfo ampCodingInfo)
                           throws DocumentFormatException
Throws:
DocumentFormatException

unpackMarkup

public void unpackMarkup(Document doc,
                         String originalContentFeatureType)
                  throws DocumentFormatException
Unpack the markup in the document. This method first saves the document's content as a feature attached to the document and then calls unpackMarkup on the GATE document. It is useful when the original content of the document being unpacked needs to be kept. After the markup has been unpacked, the content of the document is replaced with new content containing only the text between the markup.

Parameters:
doc - the document that will be unpacked
originalContentFeatureType - the name of the feature that will hold the document's content.
Throws:
DocumentFormatException
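
The save-then-unpack order described for this overload can be sketched without GATE. The Doc class, the tag-stripping regular expression and the feature name used in the usage note are all assumptions for illustration; real unpacking also creates annotations, which this sketch omits.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: keep the original content as a feature, then unpack.
final class KeepOriginalContentSketch {
    // Tiny stand-in for a document with content and a feature map.
    static final class Doc {
        String content;
        final Map<String, Object> features = new HashMap<>();

        Doc(String content) {
            this.content = content;
        }
    }

    // Stand-in for unpackMarkup(doc): drops anything that looks like a tag.
    static void unpack(Doc doc) {
        doc.content = doc.content.replaceAll("<[^>]*>", "");
    }

    // Saves the original content under the given feature name, then unpacks.
    static void unpack(Doc doc, String originalContentFeatureType) {
        doc.features.put(originalContentFeatureType, doc.content);
        unpack(doc);
    }
}
```

Calling unpack(doc, "originalContent") on a Doc leaves the marked-up text in doc.features under "originalContent" and the stripped text in doc.content.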

decideBetweenThreeMimeTypes

protected static MimeType decideBetweenThreeMimeTypes(MimeType aMimeTypeFromWebServer,
                                                      MimeType aMimeTypeFromFileSuffix,
                                                      MimeType aMimeTypeFromMagicNumbers)
Decides which MIME type is in the majority.

Parameters:
aMimeTypeFromWebServer - a MimeType
aMimeTypeFromFileSuffix - a MimeType
aMimeTypeFromMagicNumbers - a MimeType
Returns:
the MimeType that occurs most often, or null if all three are null
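
A majority vote over three candidates can be sketched as below, using plain strings in place of MimeType objects. The tie-break when all three candidates differ (take the first non-null one) is an assumption of this sketch and is not stated on this page.

```java
// Hypothetical sketch of a two-out-of-three vote between MIME type candidates.
final class MajorityMimeTypeSketch {
    static String decide(String fromWebServer, String fromFileSuffix, String fromMagicNumbers) {
        // Any two agreeing, non-null candidates form a majority.
        if (fromWebServer != null
                && (fromWebServer.equals(fromFileSuffix) || fromWebServer.equals(fromMagicNumbers))) {
            return fromWebServer;
        }
        if (fromFileSuffix != null && fromFileSuffix.equals(fromMagicNumbers)) {
            return fromFileSuffix;
        }
        // ASSUMPTION: with no majority, fall back to the first non-null candidate.
        if (fromWebServer != null) {
            return fromWebServer;
        }
        if (fromFileSuffix != null) {
            return fromFileSuffix;
        }
        return fromMagicNumbers; // null when all three are null
    }
}
```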

decideBetweenTwoMimeTypes

protected static MimeType decideBetweenTwoMimeTypes(MimeType aMimeType,
                                                    MimeType anotherMimeType)
Decide between two MIME types. The decision is based on the "Priority" parameter set in the decideBetweenThreeMimeTypes method. If neither MIME type has the "Priority" parameter set, one of them is returned.

Parameters:
aMimeType - a MimeType object with the "Priority" parameter set
anotherMimeType - a MimeType object with the "Priority" parameter set
Returns:
One of the two mime types.

areEqual

protected static boolean areEqual(MimeType aMimeType,
                                  MimeType anotherMimeType)
Tests if two MimeType objects are equal.

Returns:
true only if both MimeType objects are non-null and their types and subtypes are equal. The comparison is case sensitive.
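
The stated contract can be sketched as follows. The Mime class is a hypothetical stand-in for MimeType that holds only a type and a subtype.

```java
// Hypothetical sketch of a null-aware, case-sensitive MIME type comparison.
final class MimeTypeEqualitySketch {
    // Minimal stand-in for a MIME type such as text/xml.
    static final class Mime {
        final String type;
        final String subtype;

        Mime(String type, String subtype) {
            this.type = type;
            this.subtype = subtype;
        }
    }

    static boolean areEqual(Mime a, Mime b) {
        // Two nulls are NOT considered equal under the stated contract.
        if (a == null || b == null) {
            return false;
        }
        // String.equals is case sensitive, so "Text" and "text" differ.
        return a.type.equals(b.type) && a.subtype.equals(b.subtype);
    }
}
```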

guessTypeUsingMagicNumbers

protected static MimeType guessTypeUsingMagicNumbers(InputStream aInputStream,
                                                     String anEncoding)
This method tries to guess the mime Type using some magic numbers.

Parameters:
aInputStream - an InputStream that will be wrapped in an InputStreamReader
anEncoding - the encoding. If it is null or unknown, an InputStreamReader with the default encoding is created.
Returns:
the mime type associated with magic numbers

runMagicNumbers

protected static MimeType runMagicNumbers(InputStreamReader aReader)
Runs the magic-number checks over the content of a GATE document.


getDocumentFormat

public static DocumentFormat getDocumentFormat(Document aGateDocument,
                                               MimeType mimeType)
Find a DocumentFormat implementation that deals with a particular MIME type, given that type.

Parameters:
aGateDocument - this document will receive as a feature the associated Mime Type. The name of the feature is MimeType and its value is in the format type/subtype
mimeType - the mime type that is given as input

getDocumentFormat

public static DocumentFormat getDocumentFormat(Document aGateDocument,
                                               String fileSuffix)
Find a DocumentFormat implementation that deals with a particular MIME type, given the file suffix (e.g. ".txt") that the document came from.

Parameters:
aGateDocument - this document will receive as a feature the associated Mime Type. The name of the feature is MimeType and its value is in the format type/subtype
fileSuffix - the file suffix that is given as input

getDocumentFormat

public static DocumentFormat getDocumentFormat(Document aGateDocument,
                                               URL url)
Find a DocumentFormat implementation that deals with a particular MIME type, given the URL of the Document. If it is an HTTP URL, we can ask the web server. If it has a recognised file extension, we can use that. Otherwise we need to use a map of magic numbers to MIME types to guess the type, and then look up the format using the type.

Parameters:
aGateDocument - this document will receive as a feature the associated Mime Type. The name of the feature is MimeType and its value is in the format type/subtype
url - the URL that is given as input
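
The order of clues described for the URL variant (web server, then file extension, then magic numbers) can be sketched as a simple fallback chain. Everything named here (UrlTypeResolutionSketch, resolve, the suffix table and the single XML marker) is an assumption for illustration; the sketch takes the already-fetched server content type and content head as plain strings rather than opening a connection.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of resolving a MIME type from three kinds of clue, in order.
final class UrlTypeResolutionSketch {
    // Assumed suffix table, for illustration only.
    private static final Map<String, String> SUFFIX_TO_MIME = new HashMap<>();
    static {
        SUFFIX_TO_MIME.put("xml", "text/xml");
        SUFFIX_TO_MIME.put("html", "text/html");
        SUFFIX_TO_MIME.put("txt", "text/plain");
    }

    static String resolve(String serverContentType, String path, String contentHead) {
        // 1. Trust the web server if it reported a type; drop parameters such as charset.
        if (serverContentType != null) {
            int semi = serverContentType.indexOf(';');
            return (semi >= 0 ? serverContentType.substring(0, semi) : serverContentType).trim();
        }
        // 2. Otherwise try a recognised file extension.
        int dot = path.lastIndexOf('.');
        if (dot >= 0) {
            String mime = SUFFIX_TO_MIME.get(path.substring(dot + 1).toLowerCase(Locale.ROOT));
            if (mime != null) {
                return mime;
            }
        }
        // 3. Last resort: a magic-number style check on the start of the content.
        if (contentHead.startsWith("<?xml")) {
            return "text/xml";
        }
        return null; // no clue matched
    }
}
```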

getFeatures

public FeatureMap getFeatures()
Get the feature set

Specified by:
getFeatures in interface FeatureBearer
Overrides:
getFeatures in class AbstractFeatureBearer

getMarkupElementsMap

public Map getMarkupElementsMap()
Get the markup elements map


getElement2StringMap

public Map getElement2StringMap()
Get the element 2 string map


setMarkupElementsMap

public void setMarkupElementsMap(Map markupElementsMap)
Set the markup elements map


setElement2StringMap

public void setElement2StringMap(Map anElement2StringMap)
Set the element 2 string map


setFeatures

public void setFeatures(FeatureMap features)
Set the features map

Specified by:
setFeatures in interface FeatureBearer
Overrides:
setFeatures in class AbstractFeatureBearer

setMimeType

public void setMimeType(MimeType aMimeType)
Set the mime type


getMimeType

public MimeType getMimeType()
Gets the mime Type


removeStatusListener

public void removeStatusListener(StatusListener l)

addStatusListener

public void addStatusListener(StatusListener l)

fireStatusChanged

protected void fireStatusChanged(String e)
