A gazetteer is a list of exact matches for a predefined, closed class of entities. For example, you can use a gazetteer to match all of the countries in the world, because there is a precise and unambiguous list of countries. Gazetteers extract entities very quickly. If you are searching for specific words or phrases in your data, you can use a custom gazetteer to find them quickly.
When creating custom gazetteers, put the new file in the appropriate location in the data/gazetteer tree. If the file is for finding French entities, put it in data/gazetteer/fra/accept. If it extracts entities in multiple languages, put it in
A gazetteer file is a .txt file that is encoded in UTF-8, and each comment line is prefixed with #. The first non-comment line is TYPE[:SUBTYPE], where the type is required and subtype is optional. They are applied to the entire gazetteer and define the entity type name for output. Type and subtype may be predefined or user-defined.
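The file layout described above can be sketched as a small reader. This is only an illustration of the format, not REX's own loader: the `Gazetteer` class and `parse_gazetteer` function are hypothetical names, skipping blank lines is an assumption, and the single entry in the sample is invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Gazetteer:
    """Hypothetical in-memory form of one gazetteer file."""
    entity_type: str
    subtype: Optional[str]
    entries: List[str] = field(default_factory=list)


def parse_gazetteer(text: str) -> Gazetteer:
    """Parse gazetteer text: '#' comments, a TYPE[:SUBTYPE] line, then entries."""
    gazetteer: Optional[Gazetteer] = None
    for raw in text.splitlines():
        line = raw.strip()
        # Comment lines start with '#'. Skipping blank lines is an assumption.
        if not line or line.startswith("#"):
            continue
        if gazetteer is None:
            # First non-comment line: TYPE is required, SUBTYPE is optional.
            entity_type, _, subtype = line.partition(":")
            if not entity_type:
                raise ValueError("the header line must start with a TYPE")
            gazetteer = Gazetteer(entity_type, subtype or None)
        else:
            gazetteer.entries.append(line)
    if gazetteer is None:
        raise ValueError("no TYPE[:SUBTYPE] line found")
    return gazetteer


# Sample text; the entry below is made up for illustration.
sample = """# File: military-unit-gazetteer.txt
MILITARY:UNIT
1st Infantry Division
"""
```

To read a file from disk, pass `Path(name).read_text(encoding="utf-8")` to `parse_gazetteer`, since gazetteer files are UTF-8 encoded.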
To match entities that differ only in whitespace, gazetteer entries and potential matches are whitespace-normalized: any run of whitespace between words is treated as a single space.
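As a rough illustration of this normalization, the sketch below collapses whitespace before comparing an entry with a candidate. The function names are hypothetical, and REX's internal matching is not shown here; the comparison is deliberately reduced to a plain equality check.

```python
def normalize_spaces(text: str) -> str:
    """Collapse each run of whitespace to one space and trim both ends."""
    # str.split() with no argument splits on any whitespace run.
    return " ".join(text.split())


def same_entry(entry: str, candidate: str) -> bool:
    """Compare a gazetteer entry and a candidate after normalization."""
    return normalize_spaces(entry) == normalize_spaces(candidate)
```

With this sketch, an entry such as "private first class" compares equal to a candidate in which the words are separated by tabs, line breaks, or several spaces.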
For example, to track common infectious diseases you can create a gazetteer like this:
# File: infectious-diseases-gazetteer.txt
A single gazetteer may not be enough; you can create as many gazetteers as you need. To search for the scientific names of infectious diseases, you can create a file like this:
# File: latin-infectious-gazetteer.txt
To track certain diseases by their causes:
# File: infectious-bacterial-gazetteer.txt
Or to track the drugs used to treat them:
# File: antimicrobial-drugs-gazetteer.txt
For enhanced performance, REX internally compiles gazetteer text files into a binary format before performing entity extraction.
Add New Values to an Existing Type
In this example, we add a list of military ranks to the existing entity type TITLE by creating a gazetteer file such as:
Example 2. Military ranks gazetteer
# File: military-rank-gazetteer.txt
private first class
When creating custom gazetteers, the new file is placed in the appropriate location in the data/gazetteer tree, based on language. If the file contains French entities, put it in
data/gazetteer/fra/accept. If it contains entities in multiple languages, put it in
Since this file is all English entities, place it in
Restart Rosette Enterprise to use the new definitions.
This example will cause the word "private" to be identified as a TITLE wherever it occurs, which may not be the desired behavior. Addressing this problem requires retraining the statistical models, which is described in the REX Application Developer's Guide.
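The placement step can be scripted. The helper below is a hedged sketch: `install_gazetteer` and its parameters are hypothetical, and it assumes the data/gazetteer tree sits directly under the directory passed as `rex_root`. It copies the file only into a language directory that already exists, so that a mistyped language code fails instead of creating a new directory.

```python
import shutil
from pathlib import Path


def install_gazetteer(source: Path, rex_root: Path, lang: str,
                      kind: str = "accept") -> Path:
    """Copy a gazetteer file into data/gazetteer/<lang>/<kind> under rex_root."""
    if kind not in ("accept", "reject"):
        raise ValueError("kind must be 'accept' or 'reject'")
    # Assumed layout: <rex_root>/data/gazetteer/<lang>/<kind>
    target_dir = rex_root / "data" / "gazetteer" / lang / kind
    if not target_dir.is_dir():
        raise FileNotFoundError(f"no such gazetteer directory: {target_dir}")
    # copy2 also preserves the file's timestamps; it returns the new path.
    return Path(shutil.copy2(source, target_dir / source.name))
```

For a file of French entities, a call such as `install_gazetteer(Path("my-gazetteer.txt"), rex_root, "fra")` would place it in data/gazetteer/fra/accept. A restart is still needed afterwards for the new definitions to take effect.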
In addition to supplementing Rosette's default entity types, gazetteers can also be used to create new entity types, using the same method as above. For example, to add a list of military units, entity type MILITARY:UNIT, create a gazetteer file such as:
Example 3. Military units gazetteer
# File: military-unit-gazetteer.txt
Place this file under the REX directory, e.g.:
Restart Rosette to include the new gazetteer.
Define a New Entity Type in Multiple Languages
To add entity names in multiple languages, create a separate gazetteer for each language. For instance, to extract names of military units in both English and Russian, create the English file as in the previous example, then create a Russian file such as:
Example 4. Russian military units gazetteer
# File: military-unit-gazetteer-rus.txt
Place this file in the appropriate language directory:
Instead of adding entity types, you can define a list of entities to reject if they are matched. These lists are called reject gazetteers.
The format of a reject gazetteer is identical to the format of an accept gazetteer, except that the wildcard (*) is allowed in the entity type. As with accept gazetteers, they are arranged by language. If, for example, it is for rejecting German entities, put it in
data/gazetteer/deu/reject. If it is for rejecting entities in multiple languages, put it in
For example, the following .txt file in data/gazetteer/eng/reject rejects the PERSON entity named "George Watson" when processing English documents.
A wildcard entity type (*) matches any type. With a wildcard, the value "George Watson" would be rejected from all entity types, not just PERSON.
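The difference between a typed and a wildcard reject entry can be sketched as a filter over extracted entities. This models only the behavior described above, not REX's implementation: `Entity`, `is_rejected`, and `apply_reject` are hypothetical names, and the ORGANIZATION type in the usage note is used purely for illustration.

```python
from typing import Iterable, List, NamedTuple, Set


class Entity(NamedTuple):
    """Hypothetical extracted entity: its text and its entity type."""
    text: str
    entity_type: str


def is_rejected(entity: Entity, reject_type: str, reject_values: Set[str]) -> bool:
    """True if the entity matches the reject gazetteer's type and one of its values."""
    # "*" is the wildcard: it matches every entity type.
    type_matches = reject_type == "*" or reject_type == entity.entity_type
    return type_matches and entity.text in reject_values


def apply_reject(entities: Iterable[Entity], reject_type: str,
                 reject_values: Set[str]) -> List[Entity]:
    """Return the entities that survive the reject gazetteer."""
    return [e for e in entities if not is_rejected(e, reject_type, reject_values)]
```

Given "George Watson" extracted once as PERSON and once as ORGANIZATION, a reject type of PERSON removes only the first, while a reject type of * removes both.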