GATE Version 3.1-2270 |
java.lang.Object
gate.util.AbstractFeatureBearer
gate.creole.AbstractResource
gate.creole.AbstractProcessingResource
gate.creole.AbstractLanguageAnalyser
gate.creole.tokeniser.SimpleTokeniser
public class SimpleTokeniser
Implementation of a Unicode rule-based tokeniser.
The tokeniser gets its rules from a file, an InputStream or a Reader,
which should be passed to one of the constructors.
The implementation is based on a finite state machine that is built
from the set of rules.
A rule has two sides, the left hand side (LHS) and the right hand side (RHS),
which are separated by the ">" character. The LHS represents a
regular expression that will be matched against the input, while the RHS
describes a Gate2 annotation in terms of annotation type and attribute-value
pairs.
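The two-sided rule structure described above can be illustrated with a small sketch. It is not the GATE parser: `RuleSketch` is a hypothetical helper, and the assumptions that the first ">" separates the two sides and that the RHS is a ";"-separated list starting with the annotation type are inferred only from the example rule quoted in this description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of GATE: splits one rule into LHS, type and attributes. */
final class RuleSketch {
    final String lhs;                                   // pattern side, left of ">"
    final String annotationType;                        // first item of the RHS
    final Map<String, String> attributes = new LinkedHashMap<>();

    RuleSketch(String rule) {
        // Assumption: the first ">" is the separator between LHS and RHS.
        int sep = rule.indexOf('>');
        if (sep < 0) {
            throw new IllegalArgumentException("No \">\" separator in rule: " + rule);
        }
        lhs = rule.substring(0, sep).trim();

        // Assumption: RHS has the form Type;name=value;name=value;
        String[] items = rule.substring(sep + 1).trim().split(";");
        annotationType = items[0].trim();
        for (int i = 1; i < items.length; i++) {
            String item = items[i].trim();
            if (item.isEmpty()) {
                continue;
            }
            int eq = item.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Expected name=value but found: " + item);
            }
            attributes.put(item.substring(0, eq).trim(), item.substring(eq + 1).trim());
        }
    }

    public static void main(String[] args) {
        RuleSketch r = new RuleSketch(
            "\"UPPERCASE_LETTER\" \"LOWERCASE_LETTER\"+ > Token;kind=upperInitial;");
        System.out.println("LHS:        " + r.lhs);
        System.out.println("Type:       " + r.annotationType);
        System.out.println("Attributes: " + r.attributes);
    }
}
```

The sketch only separates the parts of a rule; compiling the LHS into the finite state machine is a separate step that is not shown here.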
The matching is done using Unicode enumerated types as defined by the Character
class. At the time of writing this class, the
supported Unicode categories were:
"UPPERCASE_LETTER" "LOWERCASE_LETTER"+ > Token;kind=upperInitial;
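To show what matching on Unicode categories means for the rule above, here is a hand-coded sketch that uses `Character.getType`. It is only an illustration under stated assumptions: `UpperInitialSketch` is a hypothetical class, it checks a whole string rather than scanning a document with a state machine, and it works on `char` values, so characters outside the Basic Multilingual Plane are not handled.

```java
/** Hypothetical illustration, not GATE code: checks "UPPERCASE_LETTER" "LOWERCASE_LETTER"+ by hand. */
final class UpperInitialSketch {

    /** Returns true if the word is one uppercase letter followed by one or more lowercase letters. */
    static boolean isUpperInitial(String word) {
        if (word.length() < 2) {
            return false; // the "+" operator needs at least one lowercase letter
        }
        if (Character.getType(word.charAt(0)) != Character.UPPERCASE_LETTER) {
            return false;
        }
        for (int i = 1; i < word.length(); i++) {
            if (Character.getType(word.charAt(i)) != Character.LOWERCASE_LETTER) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"Hello", "hello", "HELLO", "H", "\u00c9mile"}) {
            System.out.println(w + " -> " + isUpperInitial(w));
        }
    }
}
```

Because the test is on Unicode categories rather than on the ranges A-Z and a-z, accented and non-Latin letters are treated the same way as ASCII letters.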
Nested Class Summary |
---|
Nested classes/interfaces inherited from class gate.creole.AbstractProcessingResource |
---|
AbstractProcessingResource.InternalStatusListener, AbstractProcessingResource.IntervalProgressListener |
Field Summary | |
---|---|
protected String |
annotationSetName
The annotation set where the new annotations will be added |
protected static String |
defaultResourceName
|
protected Set |
dfsmStates
A set containing all the states of the deterministic machine |
protected gate.creole.tokeniser.DFSMState |
dInitialState
The initial state of the deterministic machine |
protected FeatureMap |
features
|
protected Set |
fsmStates
A set containing all the states of the non-deterministic machine |
protected gate.creole.tokeniser.FSMState |
initialState
The initial state of the non-deterministic machine |
static int |
maxTypeId
The maximum int value used internally as a type id |
protected Map |
newStates
|
static String |
SIMP_TOK_ANNOT_SET_PARAMETER_NAME
|
static String |
SIMP_TOK_DOCUMENT_PARAMETER_NAME
|
static String |
SIMP_TOK_ENCODING_PARAMETER_NAME
|
static String |
SIMP_TOK_RULES_URL_PARAMETER_NAME
|
static Map |
stringTypeIds
Maps from type names to internal type ids |
static Map |
typeIds
Maps from int (the static value on Character) to int
(the internal value used by the tokeniser). |
static String[] |
typeMnemonics
Maps the internal type ids to the type names |
Fields inherited from class gate.creole.AbstractLanguageAnalyser |
---|
corpus, document |
Fields inherited from class gate.creole.AbstractProcessingResource |
---|
interrupted |
Fields inherited from class gate.creole.AbstractResource |
---|
name |
Constructor Summary | |
---|---|
SimpleTokeniser()
Creates a tokeniser |
Method Summary | |
---|---|
void |
execute()
The method that does the actual tokenisation. |
String |
getAnnotationSetName()
|
String |
getDFSMgml()
Returns a string representation of the deterministic FSM graph using GML. |
String |
getEncoding()
|
FeatureMap |
getFeatures()
Get the feature set |
String |
getFSMgml()
Returns a string representation of the non-deterministic FSM graph using GML (Graph modelling language). |
String |
getRulesResourceName()
|
URL |
getRulesURL()
Gets the value of the rulesURL property, which holds a
URL to the file containing the rules for this tokeniser. |
Resource |
init()
Initialises this tokeniser by reading the rules from an external source (provided through a URL) and building the finite state machine at the core of the tokeniser. |
void |
reset()
Prepares this Processing resource for a new run. |
void |
setAnnotationSetName(String newAnnotationSetName)
|
void |
setEncoding(String newEncoding)
|
void |
setFeatures(FeatureMap features)
Set the feature set |
void |
setRulesResourceName(String newRulesResourceName)
|
void |
setRulesURL(URL newRulesURL)
Sets the value of the rulesURL property, which holds a URL
to the file containing the rules for this tokeniser. |
protected static String |
skipIgnoreTokens(StringTokenizer st)
Skips the ignorable tokens from the input returning the first significant token. |
Methods inherited from class gate.creole.AbstractLanguageAnalyser |
---|
getCorpus, getDocument, setCorpus, setDocument |
Methods inherited from class gate.creole.AbstractProcessingResource |
---|
addProgressListener, addStatusListener, cleanup, fireProcessFinished, fireProgressChanged, fireStatusChanged, interrupt, isInterrupted, reInit, removeProgressListener, removeStatusListener |
Methods inherited from class gate.creole.AbstractResource |
---|
checkParameterValues, getBeanInfo, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Methods inherited from interface gate.ProcessingResource |
---|
reInit |
Methods inherited from interface gate.Resource |
---|
cleanup, getParameterValue, setParameterValue, setParameterValues |
Methods inherited from interface gate.util.NameBearer |
---|
getName, setName |
Methods inherited from interface gate.Executable |
---|
interrupt, isInterrupted |
Field Detail |
---|
public static final String SIMP_TOK_DOCUMENT_PARAMETER_NAME
public static final String SIMP_TOK_ANNOT_SET_PARAMETER_NAME
public static final String SIMP_TOK_RULES_URL_PARAMETER_NAME
public static final String SIMP_TOK_ENCODING_PARAMETER_NAME
protected FeatureMap features
protected String annotationSetName
protected gate.creole.tokeniser.FSMState initialState
protected Set fsmStates
protected gate.creole.tokeniser.DFSMState dInitialState
protected Set dfsmStates
public static Map typeIds
Maps from int (the static value on Character) to int (the internal value
used by the tokeniser). The int values used by the
tokeniser are consecutive values, starting from 0 and going as high as
necessary.
They map all the public static int members on Character.
public static int maxTypeId
public static String[] typeMnemonics
public static Map stringTypeIds
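The relationship between typeIds, maxTypeId, typeMnemonics and stringTypeIds can be sketched as follows. This is an assumption-laden illustration, not the GATE source: `TypeIdSketch` is a hypothetical class, the choice to collect the category constants by reflection and to skip the `DIRECTIONALITY_*` constants is a guess, and in current JDKs the category constants on `Character` are declared as `byte` rather than `int`, so the sketch filters on that type. The order in which ids are assigned depends on the order returned by reflection and should not be relied on.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not GATE code: consecutive internal ids for Character categories. */
final class TypeIdSketch {
    static final Map<Integer, Integer> typeIds = new HashMap<>();       // Character value -> internal id
    static final Map<String, Integer> stringTypeIds = new HashMap<>();  // type name -> internal id
    static final String[] typeMnemonics;                                // internal id -> type name
    static final int maxTypeId;                                         // highest internal id in use

    static {
        List<Field> categories = new ArrayList<>();
        for (Field f : Character.class.getFields()) {
            // Assumption: keep the static byte constants, skip the DIRECTIONALITY_* ones.
            if (Modifier.isStatic(f.getModifiers())
                    && f.getType() == byte.class
                    && !f.getName().startsWith("DIRECTIONALITY_")) {
                categories.add(f);
            }
        }
        maxTypeId = categories.size() - 1;
        typeMnemonics = new String[categories.size()];
        try {
            // Internal ids are consecutive, starting from 0.
            for (int id = 0; id < categories.size(); id++) {
                Field f = categories.get(id);
                typeIds.put((int) f.getByte(null), id);
                stringTypeIds.put(f.getName(), id);
                typeMnemonics[id] = f.getName();
            }
        } catch (IllegalAccessException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) {
        int id = typeIds.get(Character.getType('A'));
        System.out.println("'A' has internal id " + id + " (" + typeMnemonics[id] + ")");
        System.out.println("maxTypeId = " + maxTypeId);
    }
}
```

Mapping the sparse `Character` values onto a dense range from 0 to maxTypeId lets the ids be used directly as array indices, for example in a state transition table.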
protected static String defaultResourceName
protected transient Map newStates
Constructor Detail |
---|
public SimpleTokeniser()
Method Detail |
---|
public Resource init() throws ResourceInstantiationException
Specified by: init in interface Resource
Overrides: init in class AbstractProcessingResource
Throws: ResourceInstantiationException
public void reset()
protected static String skipIgnoreTokens(StringTokenizer st)
a set
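A minimal sketch of what skipping ignorable tokens could look like is given below. It is not the GATE implementation: `SkipIgnoreSketch` is a hypothetical class, the set of ignorable tokens (whitespace delimiters) is an assumption, and so is the choice to return null when the input is exhausted.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

/** Hypothetical sketch, not GATE code: returns the first token that is not ignorable. */
final class SkipIgnoreSketch {
    // Assumption: whitespace delimiters are the ignorable tokens.
    static final Set<String> IGNORE_TOKENS =
        new HashSet<>(Arrays.asList(" ", "\t", "\r", "\n", "\f"));

    /** Consumes tokens until a significant one is found; null if none is left (assumed). */
    static String skipIgnoreTokens(StringTokenizer st) {
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (!IGNORE_TOKENS.contains(token)) {
                return token;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // returnDelims = true, so delimiters come back as one-character tokens.
        StringTokenizer st = new StringTokenizer("  \tToken ; kind", " \t;", true);
        String token;
        while ((token = skipIgnoreTokens(st)) != null) {
            System.out.println("[" + token + "]");
        }
    }
}
```

The tokenizer is created with delimiters returned as tokens so that significant delimiters such as ";" are still seen by the caller while whitespace is dropped.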
public String getFSMgml()
public String getDFSMgml()
public FeatureMap getFeatures()
Description copied from class: AbstractFeatureBearer
Specified by: getFeatures in interface FeatureBearer
Overrides: getFeatures in class AbstractFeatureBearer
public void setFeatures(FeatureMap features)
Description copied from class: AbstractFeatureBearer
Specified by: setFeatures in interface FeatureBearer
Overrides: setFeatures in class AbstractFeatureBearer
public void execute() throws ExecutionException
Specified by: execute in interface Executable
Overrides: execute in class AbstractProcessingResource
Throws: ExecutionException
public void setRulesURL(URL newRulesURL)
Sets the value of the rulesURL property, which holds a URL
to the file containing the rules for this tokeniser.
Parameters: newRulesURL
public URL getRulesURL()
Gets the value of the rulesURL property, which holds a
URL to the file containing the rules for this tokeniser.
public void setAnnotationSetName(String newAnnotationSetName)
public String getAnnotationSetName()
public void setRulesResourceName(String newRulesResourceName)
public String getRulesResourceName()
public void setEncoding(String newEncoding)
public String getEncoding()