GATE
Version 3.1-2270

gate.creole.gazetteer
Class DefaultGazetteer

java.lang.Object
  extended by gate.util.AbstractFeatureBearer
      extended by gate.creole.AbstractResource
          extended by gate.creole.AbstractProcessingResource
              extended by gate.creole.AbstractLanguageAnalyser
                  extended by gate.creole.gazetteer.AbstractGazetteer
                      extended by gate.creole.gazetteer.DefaultGazetteer
All Implemented Interfaces:
ANNIEConstants, Gazetteer, Executable, LanguageAnalyser, ProcessingResource, Resource, FeatureBearer, NameBearer, Serializable

public class DefaultGazetteer
extends AbstractGazetteer

This component is responsible for performing list lookup. The implementation is based on finite state machines. The phrases to be recognised should be listed in a set of files, one for each type of occurrence. The gazetteer is built with the information from a file that contains the set of lists (which are files as well) and the associated type for each list. The file defining the set of lists should have the following syntax: each list definition should be written on its own line and should contain:

  1. the file name (required)
  2. the major type (required)
  3. the minor type (optional)
  4. the language(s) (optional)
The elements of each definition are separated by ":". The following is an example of a valid definition:
  personmale.lst:person:male:english
Each list file named in the lists definition file is just a list containing one entry per line. When this gazetteer is run over some input text (a GATE document) it will generate annotations of type Lookup having the attributes specified in the definition file.
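The definition syntax described above can be illustrated with a small parser. This is a minimal sketch, not GATE's own loading code: the class ListDefinitionLine is hypothetical, and the comma used to split multiple languages is an assumption, since the description only says "language(s)".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the GATE API: holds one parsed line
// of a lists definition file (fileName:majorType[:minorType[:languages]]).
final class ListDefinitionLine {
    final String fileName;        // required
    final String majorType;       // required
    final String minorType;       // optional, null when absent
    final List<String> languages; // optional, empty when absent

    private ListDefinitionLine(String fileName, String majorType,
                               String minorType, List<String> languages) {
        this.fileName = fileName;
        this.majorType = majorType;
        this.minorType = minorType;
        this.languages = languages;
    }

    static ListDefinitionLine parse(String line) {
        // A limit of -1 keeps trailing empty fields, e.g. "a.lst:person:"
        String[] parts = line.trim().split(":", -1);
        if (parts.length < 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
            throw new IllegalArgumentException(
                "Expected at least fileName:majorType, got: " + line);
        }
        String minor = (parts.length > 2 && !parts[2].isEmpty()) ? parts[2] : null;
        // Assumption: several languages would be separated by commas.
        List<String> langs = (parts.length > 3 && !parts[3].isEmpty())
            ? Arrays.asList(parts[3].split(","))
            : Collections.<String>emptyList();
        return new ListDefinitionLine(parts[0], parts[1], minor, langs);
    }

    public static void main(String[] args) {
        ListDefinitionLine d = parse("personmale.lst:person:male:english");
        System.out.println(d.fileName + " " + d.majorType + " "
            + d.minorType + " " + d.languages);
    }
}
```

Only the first two fields are treated as mandatory, matching the required/optional markers in the list above.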

See Also:
Serialized Form

Nested Class Summary
static class DefaultGazetteer.CharMap
          Class implementing a map that uses binary search, with a char as the key, to retrieve the corresponding object.
static interface DefaultGazetteer.Iter
           
 
Nested classes/interfaces inherited from class gate.creole.AbstractProcessingResource
AbstractProcessingResource.InternalStatusListener, AbstractProcessingResource.IntervalProgressListener
 
Field Summary
static String DEF_GAZ_ANNOT_SET_PARAMETER_NAME
           
static String DEF_GAZ_CASE_SENSITIVE_PARAMETER_NAME
           
static String DEF_GAZ_DOCUMENT_PARAMETER_NAME
           
static String DEF_GAZ_ENCODING_PARAMETER_NAME
           
static String DEF_GAZ_LISTS_URL_PARAMETER_NAME
           
protected  Set fsmStates
          A set containing all the states of the FSM backing the gazetteer
protected  FSMState initialState
          The initial state of the FSM that backs this gazetteer
protected  Map listsByNode
          A map of gazetteer lists keyed by node
 
Fields inherited from class gate.creole.gazetteer.AbstractGazetteer
annotationSetName, caseSensitive, definition, encoding, features, listeners, listsURL, mappingDefinition, wholeWordsOnly
 
Fields inherited from class gate.creole.AbstractLanguageAnalyser
corpus, document
 
Fields inherited from class gate.creole.AbstractProcessingResource
interrupted
 
Fields inherited from class gate.creole.AbstractResource
name
 
Fields inherited from interface gate.creole.ANNIEConstants
ANNOTATION_COREF_FEATURE_NAME, DATE_ANNOTATION_TYPE, DATE_POSTED_ANNOTATION_TYPE, DOCUMENT_COREF_FEATURE_NAME, JOB_ID_ANNOTATION_TYPE, LOCATION_ANNOTATION_TYPE, LOOKUP_ANNOTATION_TYPE, LOOKUP_CLASS_FEATURE_NAME, LOOKUP_MAJOR_TYPE_FEATURE_NAME, LOOKUP_MINOR_TYPE_FEATURE_NAME, LOOKUP_ONTOLOGY_FEATURE_NAME, MONEY_ANNOTATION_TYPE, ORGANIZATION_ANNOTATION_TYPE, PERSON_ANNOTATION_TYPE, PERSON_GENDER_FEATURE_NAME, PR_NAMES, SENTENCE_ANNOTATION_TYPE, SPACE_TOKEN_ANNOTATION_TYPE, TOKEN_ANNOTATION_TYPE, TOKEN_CATEGORY_FEATURE_NAME, TOKEN_KIND_FEATURE_NAME, TOKEN_LENGTH_FEATURE_NAME, TOKEN_ORTH_FEATURE_NAME, TOKEN_STRING_FEATURE_NAME
 
Constructor Summary
DefaultGazetteer()
          Builds a gazetteer using the default lists from the GATE resources
 
Method Summary
 boolean add(String singleItem, Lookup lookup)
          Adds a new string to the gazetteer
 void addLookup(String text, Lookup lookup)
          Adds one phrase to the list of phrases recognised by this gazetteer
 void execute()
          This method runs the gazetteer.
 String getFSMgml()
          Returns a string representation of the deterministic FSM graph using GML.
 Resource init()
          Does the actual loading and parsing of the lists.
static boolean isWordInternal(char ch)
          Tests whether a character is internal to a word (i.e. if it's a letter or a combining mark (spacing or not)).
 Set lookup(String singleItem)
          Looks up a single string in the gazetteer
protected  void readList(LinearNode node, boolean add)
          Reads one list (one file) of phrases
 boolean remove(String singleItem)
          Removes a string from the gazetteer
 void removeLookup(String text, Lookup lookup)
          Removes one phrase from the list of phrases recognised by this gazetteer
 
Methods inherited from class gate.creole.gazetteer.AbstractGazetteer
addGazetteerListener, fireGazetteerEvent, getAnnotationSetName, getCaseSensitive, getEncoding, getFeatures, getLinearDefinition, getListsURL, getMappingDefinition, getWholeWordsOnly, reInit, setAnnotationSetName, setCaseSensitive, setEncoding, setFeatures, setListsURL, setMappingDefinition, setWholeWordsOnly
 
Methods inherited from class gate.creole.AbstractLanguageAnalyser
getCorpus, getDocument, setCorpus, setDocument
 
Methods inherited from class gate.creole.AbstractProcessingResource
addProgressListener, addStatusListener, cleanup, fireProcessFinished, fireProgressChanged, fireStatusChanged, interrupt, isInterrupted, removeProgressListener, removeStatusListener
 
Methods inherited from class gate.creole.AbstractResource
checkParameterValues, getBeanInfo, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gate.LanguageAnalyser
getCorpus, getDocument, setCorpus, setDocument
 
Methods inherited from interface gate.Resource
cleanup, getParameterValue, setParameterValue, setParameterValues
 
Methods inherited from interface gate.util.NameBearer
getName, setName
 
Methods inherited from interface gate.Executable
interrupt, isInterrupted
 

Field Detail

DEF_GAZ_DOCUMENT_PARAMETER_NAME

public static final String DEF_GAZ_DOCUMENT_PARAMETER_NAME
See Also:
Constant Field Values

DEF_GAZ_ANNOT_SET_PARAMETER_NAME

public static final String DEF_GAZ_ANNOT_SET_PARAMETER_NAME
See Also:
Constant Field Values

DEF_GAZ_LISTS_URL_PARAMETER_NAME

public static final String DEF_GAZ_LISTS_URL_PARAMETER_NAME
See Also:
Constant Field Values

DEF_GAZ_ENCODING_PARAMETER_NAME

public static final String DEF_GAZ_ENCODING_PARAMETER_NAME
See Also:
Constant Field Values

DEF_GAZ_CASE_SENSITIVE_PARAMETER_NAME

public static final String DEF_GAZ_CASE_SENSITIVE_PARAMETER_NAME
See Also:
Constant Field Values

listsByNode

protected Map listsByNode
A map of gazetteer lists keyed by node


initialState

protected FSMState initialState
The initial state of the FSM that backs this gazetteer


fsmStates

protected Set fsmStates
A set containing all the states of the FSM backing the gazetteer

Constructor Detail

DefaultGazetteer

public DefaultGazetteer()
Builds a gazetteer using the default lists from the GATE resources

Method Detail

init

public Resource init()
              throws ResourceInstantiationException
Does the actual loading and parsing of the lists. This method must be called before the gazetteer can be used.

Specified by:
init in interface Resource
Overrides:
init in class AbstractProcessingResource
Throws:
ResourceInstantiationException

readList

protected void readList(LinearNode node,
                        boolean add)
                 throws ResourceInstantiationException
Reads one list (one file) of phrases

Parameters:
node - the node
add - if true, the phrases found in the list are added to the ones recognised by this gazetteer; if false, the phrases found in the list are removed from the list of phrases recognised by this gazetteer.
Throws:
ResourceInstantiationException

addLookup

public void addLookup(String text,
                      Lookup lookup)
Adds one phrase to the list of phrases recognised by this gazetteer

Parameters:
text - the phrase to be added
lookup - the description of the annotation to be added when this phrase is recognised

removeLookup

public void removeLookup(String text,
                         Lookup lookup)
Removes one phrase from the list of phrases recognised by this gazetteer

Parameters:
text - the phrase to be removed
lookup - the description of the annotation associated to this phrase

getFSMgml

public String getFSMgml()
Returns a string representation of the deterministic FSM graph using GML.


isWordInternal

public static boolean isWordInternal(char ch)
Tests whether a character is internal to a word (i.e. if it's a letter or a combining mark (spacing or not)).

Parameters:
ch - the character to be tested
Returns:
a boolean value
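
The test described for isWordInternal can be approximated with the standard java.lang.Character categories. This is a sketch under the assumption that "combining mark (spacing or not)" corresponds to the Unicode general categories Mn and Mc; the class name WordInternalSketch is hypothetical and the code is not taken from the GATE sources.

```java
// Hypothetical stand-alone sketch, not the GATE implementation.
final class WordInternalSketch {
    // True for letters and for combining marks, whether non-spacing (Mn)
    // or spacing (Mc), following the description above.
    static boolean isWordInternal(char ch) {
        int type = Character.getType(ch);
        return Character.isLetter(ch)
            || type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    public static void main(String[] args) {
        System.out.println(isWordInternal('a'));      // a letter
        System.out.println(isWordInternal('\u0301')); // combining acute accent
        System.out.println(isWordInternal(' '));      // a space
    }
}
```

A check of this kind is what lets a matcher decide whether a candidate phrase starts or ends in the middle of a word.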

execute

public void execute()
             throws ExecutionException
This method runs the gazetteer. It assumes that all the needed parameters are set. If they are not, an exception will be thrown.

Specified by:
execute in interface Executable
Overrides:
execute in class AbstractProcessingResource
Throws:
ExecutionException

lookup

public Set lookup(String singleItem)
Looks up a single string in the gazetteer

Parameters:
singleItem - a single string to be looked up by the gazetteer
Returns:
set of the Lookups associated with the parameter

remove

public boolean remove(String singleItem)
Description copied from interface: Gazetteer
Removes a string from the gazetteer

Returns:
true if the operation was successful

add

public boolean add(String singleItem,
                   Lookup lookup)
Description copied from interface: Gazetteer
Adds a new string to the gazetteer

Parameters:
lookup - the lookup to be associated with the new string
Returns:
true if the operation was successful
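
The add, lookup and remove operations documented above can be pictured with a character-level state machine. The following is a minimal sketch only: the class PhraseFsmSketch and its inner State are hypothetical, plain strings stand in for Lookup objects, and a simple trie is used as the finite state machine. It is not the FSMState-based implementation of this class.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a trie-shaped FSM for phrase lookup.
final class PhraseFsmSketch {
    private static final class State {
        final Map<Character, State> next = new HashMap<Character, State>();
        final Set<String> lookups = new HashSet<String>(); // types attached to a final state
    }

    private final State initial = new State();

    // Follows the phrase character by character; returns null if no path exists.
    private State walk(String phrase) {
        State s = initial;
        for (int i = 0; i < phrase.length() && s != null; i++) {
            s = s.next.get(phrase.charAt(i));
        }
        return s;
    }

    // Adds a phrase, creating states as needed; returns true if the type was new.
    boolean add(String phrase, String type) {
        State s = initial;
        for (int i = 0; i < phrase.length(); i++) {
            char c = phrase.charAt(i);
            State t = s.next.get(c);
            if (t == null) {
                t = new State();
                s.next.put(c, t);
            }
            s = t;
        }
        return s.lookups.add(type);
    }

    // Returns the types attached to the exact phrase, or an empty set.
    Set<String> lookup(String phrase) {
        State s = walk(phrase);
        return s == null ? Collections.<String>emptySet()
                         : Collections.unmodifiableSet(s.lookups);
    }

    // Detaches all types from the phrase; returns true if anything was removed.
    boolean remove(String phrase) {
        State s = walk(phrase);
        if (s == null || s.lookups.isEmpty()) {
            return false;
        }
        s.lookups.clear();
        return true;
    }

    public static void main(String[] args) {
        PhraseFsmSketch fsm = new PhraseFsmSketch();
        fsm.add("New York", "location");
        System.out.println(fsm.lookup("New York"));
        System.out.println(fsm.lookup("New"));
    }
}
```

In this sketch a prefix such as "New" reaches a state with no attached types, so it yields an empty set rather than a match.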
