Name Matcher

Overview

The Name Matcher module aims to recognise relations between previously recognised entities. It may also assign an entity type to a previously unclassified name, based on a relation with an existing entity. It may not, however, reclassify an entity that has already been classified.

Resources

The Name Matcher uses an external lookup list of aliases to record non-matching strings which represent the same entity, e.g. "New York" and "The Big Apple"; "Coca Cola" and "Coke". This table is used only where no rule would apply because the two strings are too dissimilar. It is not used to record acronyms or shorter forms, e.g. "ICI" and "Imperial Chemical Industries". The Name Matcher also uses several other external lists, e.g. a list of prepositions, which are used in the rules.

Options

The Name Matcher also makes use of lists such as company designators. These can be loaded in two ways: either from a set of specific external lists, or from the results of the earlier gazetteer lookup module. The switch is set at runtime, via the UsedComponents menu which appears when the NameMatcher is added to the application. The option "extLists" should be set to true to use the external lists and false to use the gazetteer results. By default it is false, so the gazetteer results are used.

Alternative AnnotationSet and AnnotationType names can also be specified at this point. By default, the annotation set is the default annotation set of the document. Some rules are applied only to annotations that are organisations or persons, so the annotation types denoting organisations and persons must be specified; by default, these are Organization and Person.

The user can also choose a document from a list of all the documents of the application; otherwise the last document in the list is used as the current document.

Figure 1. Name Matcher parameters in the GATE GUI

Rules

The Name Matcher recognises relations between entities based on a number of handcoded rules, detailed below. The rules may be applied to annotations produced by the JAPE transducer for people (P), for organisations (O), or to all types of annotations (A). The rules may be case-dependent or case-independent, and may consist of full or partial matches between entities. They are both transitive and bi-directional. This means:

IF name1 matches name2 by ruleA AND name2 matches name3 by ruleB THEN name1 also matches name3
IF name1 matches name2 by ruleA, THEN name2 also matches name1 by ruleA
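
One way to picture the two properties above is as a grouping of names, where recording a match merges two groups. The following is a minimal sketch only, not the Name Matcher's actual code; the class MatchGroups and its methods are hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: keeps matched names in groups (a union-find forest). */
class MatchGroups {
    // Maps each name to its parent; a name that is its own parent is a group root.
    private final Map<String, String> parent = new HashMap<>();

    /** Returns the representative (root) of the group containing the name. */
    String find(String name) {
        parent.putIfAbsent(name, name);
        String p = parent.get(name);
        if (!p.equals(name)) {
            p = find(p);
            parent.put(name, p); // path compression
        }
        return p;
    }

    /** Records that a matches b, whichever rule produced the match. */
    void recordMatch(String a, String b) {
        parent.put(find(a), find(b));
    }

    /** True if a and b are in the same group; symmetric and transitive by construction. */
    boolean matches(String a, String b) {
        return find(a).equals(find(b));
    }

    public static void main(String[] args) {
        MatchGroups groups = new MatchGroups();
        groups.recordMatch("name1", "name2"); // e.g. by ruleA
        groups.recordMatch("name2", "name3"); // e.g. by ruleB
        System.out.println(groups.matches("name1", "name3")); // true
        System.out.println(groups.matches("name3", "name1")); // true
    }
}
```

Because group membership does not depend on which rule fired or in which direction, both IF statements hold without any extra bookkeeping.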

Types of match:

Full:
- A and B are identical (with or without case matching)
  e.g. WHSmith and WHSMITH
Partial:
- A contains B
  e.g. WHSmith Ltd. and WHSmith
- A can be transformed into B (e.g. reversal around a preposition)
  e.g. Defense Department and U.S. Department of Defense
- A is a concatenated contraction of B
  e.g. Pan American and Pan Am
Semantic:
- A is an alias of B but bears no orthographical resemblance
  e.g. New York and The Big Apple
No match:
- A is a spurious match of B
  e.g. Eastern Airways and Eastern Air (which are different companies)
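
The full and partial match types listed above might be approximated as follows. This is a simplified sketch under stated assumptions, not the rules as implemented: the class MatchTypes and all method names are hypothetical, names are assumed to be tokenised on single spaces, and the reversal helper only swaps the parts around a preposition (it does not drop a prefix such as "U.S.").

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;

/** Hypothetical helpers illustrating the match types; not the real rule set. */
class MatchTypes {
    /** Full match: identical strings, with or without case matching. */
    static boolean fullMatch(String a, String b, boolean caseSensitive) {
        return caseSensitive ? a.equals(b) : a.equalsIgnoreCase(b);
    }

    /** Partial match: a contains b as a run of whole space-separated tokens. */
    static boolean containsMatch(String a, String b) {
        return (" " + a + " ").contains(" " + b + " ");
    }

    /** Partial match helper: swaps the parts on either side of the first preposition. */
    static String reverseAroundPreposition(String name, Set<String> prepositions) {
        String[] tokens = name.split(" ");
        for (int i = 1; i < tokens.length - 1; i++) {
            if (prepositions.contains(tokens[i].toLowerCase())) {
                String head = String.join(" ", Arrays.copyOfRange(tokens, 0, i));
                String tail = String.join(" ", Arrays.copyOfRange(tokens, i + 1, tokens.length));
                return tail + " " + head;
            }
        }
        return name; // no preposition found: unchanged
    }

    /** Partial match: each token of shortForm is a prefix of the corresponding token of full. */
    static boolean contractionMatch(String full, String shortForm) {
        String[] f = full.split(" ");
        String[] s = shortForm.split(" ");
        if (f.length != s.length || full.equals(shortForm)) {
            return false;
        }
        for (int i = 0; i < f.length; i++) {
            if (!f[i].startsWith(s[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> prepositions = Collections.singleton("of");
        System.out.println(fullMatch("WHSmith", "WHSMITH", false));
        System.out.println(containsMatch("WHSmith Ltd.", "WHSmith"));
        System.out.println(reverseAroundPreposition("Department of Defense", prepositions));
        System.out.println(contractionMatch("Pan American", "Pan Am"));
    }
}
```

Note that this naive contraction check would also accept Eastern Airways and Eastern Air, which is exactly the kind of pair that has to be listed explicitly as a spurious match.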

The matching rules are hard-coded in the Name Matcher, as is the priority handling of rules. Essentially, the first rule to be applied is the spurious rule from the lookup list (where two names are listed specifically as not matching), followed by the rules for all annotations, then those for organizations, then those for persons. In general, rules matching more similar strings precede those matching less similar ones.
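
The ordering described above could be sketched as a spurious-pair check followed by an ordered pass over the rules. The sketch below is illustrative only and assumes a much simpler structure than the real module: RuleOrder, Scope, addSpurious and addRule are hypothetical names, and each rule is modelled as a plain predicate over two strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;

/** Hypothetical sketch of rule priority: spurious list first, then A, O, P rules. */
class RuleOrder {
    // Declaration order doubles as application order.
    enum Scope { ALL, ORGANIZATION, PERSON }

    private final Set<String> spurious = new HashSet<>();
    private final Map<Scope, List<BiPredicate<String, String>>> rules = new EnumMap<>(Scope.class);

    private static String key(String a, String b) {
        return a + "\n" + b;
    }

    /** Lists two names as explicitly not matching, in both directions. */
    void addSpurious(String a, String b) {
        spurious.add(key(a, b));
        spurious.add(key(b, a));
    }

    /** Appends a rule to the given scope; earlier rules are tried first. */
    void addRule(Scope scope, BiPredicate<String, String> rule) {
        rules.computeIfAbsent(scope, s -> new ArrayList<>()).add(rule);
    }

    /** entityType is ALL for entities that are neither organisations nor persons. */
    boolean matches(String a, String b, Scope entityType) {
        if (spurious.contains(key(a, b))) {
            return false; // the spurious rule takes priority over everything else
        }
        for (Scope scope : Scope.values()) {
            if (scope != Scope.ALL && scope != entityType) {
                continue; // skip rules for other entity types
            }
            for (BiPredicate<String, String> rule : rules.getOrDefault(scope, Collections.emptyList())) {
                if (rule.test(a, b)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        RuleOrder order = new RuleOrder();
        order.addSpurious("Eastern Airways", "Eastern Air");
        order.addRule(Scope.ALL, String::equalsIgnoreCase);
        order.addRule(Scope.ORGANIZATION, (a, b) -> a.startsWith(b));
        System.out.println(order.matches("Eastern Airways", "Eastern Air", Scope.ORGANIZATION)); // false
        System.out.println(order.matches("WHSmith Ltd.", "WHSmith", Scope.ORGANIZATION)); // true
    }
}
```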

Classification of Unknown Entities

The Name Matcher also checks entities marked as "Unknown" to see if they match any entities of type "Person", "Location" or "Organization", according to the rules described above. If a match is found, the "Unknown" annotation is deleted, and the entity is reannotated with the same type as its matching entity.
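
As a rough illustration of this step, the sketch below retypes "Unknown" entities from the first classified entity they match. It is an assumption-laden simplification: UnknownResolver, Entity and resolve are hypothetical names, entities are reduced to a text and a type string, and the matching rules are abstracted into a single predicate supplied by the caller.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiPredicate;

/** Hypothetical sketch: gives "Unknown" entities the type of a matching known entity. */
class UnknownResolver {
    static final Set<String> KNOWN_TYPES =
            new HashSet<>(Arrays.asList("Person", "Location", "Organization"));

    /** Minimal stand-in for an annotation: the matched text plus its entity type. */
    static final class Entity {
        final String text;
        String type;

        Entity(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    /** Only "Unknown" entities are changed; classified entities are never retyped. */
    static void resolve(List<Entity> entities, BiPredicate<String, String> matcher) {
        for (Entity candidate : entities) {
            if (!"Unknown".equals(candidate.type)) {
                continue;
            }
            for (Entity known : entities) {
                if (KNOWN_TYPES.contains(known.type) && matcher.test(known.text, candidate.text)) {
                    candidate.type = known.type; // replaces the "Unknown" type
                    break;
                }
            }
        }
    }

    public static void main(String[] args) {
        List<Entity> entities = Arrays.asList(
                new Entity("WHSmith Ltd.", "Organization"),
                new Entity("WHSmith", "Unknown"));
        // A containment check stands in for the real matching rules.
        resolve(entities, (known, unknown) -> known.contains(unknown));
        System.out.println(entities.get(1).type); // Organization
    }
}
```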

Maintenance

The external list switch enables specific changes to be made to the lists of company designators etc. without having to either rebuild the NameMatcher or modify the gazetteer lists (which may have implications for the transduction phase). Using the gazetteer lists, on the other hand, removes the need to update the external lists independently every time the gazetteer lists are modified.

The structure of the external lists is as follows. The lists are stored as separate files, with an index file NMlists.def in which the list names are defined (as for the gazetteer lists). Each entry in a list consists of a name, a separator "£" and an identifier (generally a number, though it can be any token or string). The identifier is used to match entries, i.e. all entries with the same identifier are considered to be related. For lists such as the preposition list, all entries are considered equivalent (since there is no special relation between them other than the fact that they are all prepositions), so they are all given the same identifier.

The set of "all annotations" to which the rules may be applied is specified in the AnnotationTypes vector in the Used Components menu; by default this is Organization, Person and Location. The rules themselves, and their ordering, are hard-coded and can only be modified via the Namematch.java code.
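
To make the list entry format concrete (a name, a separator and an identifier, with entries related when their identifiers are equal), here is a small sketch of how such entries might be read and grouped. It is not the module's own loader: ListEntries and groupByIdentifier are hypothetical names, the separator is passed in as a parameter, and the sample data uses "|" purely as a stand-in separator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reader for list entries of the form name + separator + identifier. */
class ListEntries {
    /** Groups entry names by identifier; names sharing an identifier are related. */
    static Map<String, List<String>> groupByIdentifier(List<String> lines, String separator) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String line : lines) {
            int cut = line.lastIndexOf(separator);
            if (cut < 0) {
                continue; // no separator: skip the malformed entry
            }
            String name = line.substring(0, cut).trim();
            String identifier = line.substring(cut + separator.length()).trim();
            groups.computeIfAbsent(identifier, k -> new ArrayList<>()).add(name);
        }
        return groups;
    }

    /** True if both names appear under the same identifier. */
    static boolean related(Map<String, List<String>> groups, String a, String b) {
        for (List<String> names : groups.values()) {
            if (names.contains(a) && names.contains(b)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Sample alias entries; "|" is an illustrative separator, not the real one.
        List<String> lines = Arrays.asList(
                "New York|1", "The Big Apple|1", "Coca Cola|2", "Coke|2");
        Map<String, List<String>> groups = groupByIdentifier(lines, "|");
        System.out.println(related(groups, "New York", "The Big Apple")); // true
        System.out.println(related(groups, "New York", "Coke")); // false
    }
}
```

A list in which every entry is equivalent, such as the preposition list, would simply place all names under one identifier and so produce a single group.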