REX provides multiple processors for extracting entities. You can optimize REX for your entity extraction tasks by configuring the processors. Examples of the modifications you can make include:
- Removing one or more processors
- Adding gazetteers or gazetteer entries for selecting or rejecting entities
- Adding regex files or individual regex entries
- Adding custom processors
- Customizing the statistical model with Rosette Model Training Suite
Each processor has its own set of parameters to customize its behavior.
By default, REX uses all the processors. You can choose to use only a subset of them. For example, you can decide to return only entities extracted by statistical analysis.
REX includes the following processors:
- statistical: Entity extractor processor using a statistically-trained model
- deepNeuralNetwork: Entity extractor processor using a model trained using a deep neural network
- acceptGazetteer: Rule-based entity extractor based on gazetteers
- acceptRegex: Rule-based entity extractor based on regular expressions
- kbLinker: Entity extractor based on a knowledge base of known entities
- redactor: Chooses an entity when multiple processors extract the same or overlapping entities
- joiner: Joins adjacent entities into a single entity
- rejectGazetteer: Rule-based entity rejector based on gazetteers
- rejectRegex: Rule-based entity rejector based on regular expressions
- indocCoref: Chains together mentions that refer to the same entity (in-document coreference)
- pronominalResolver: Pronominal resolver
The order of execution of the processors is determined internally and cannot be changed. Some processors are prerequisites for other processors. REX will throw an exception if the processor list is missing a required processor.
Edit the rex-factory-config.yaml file, modifying the list of active processors for an entity extraction run.
Example 3. Return Statistical Entities Only
#List the set of active processors for an entity extraction run.
#All processors are active by default. This method provides a way
#to turn off selected processors. The order of the processors cannot be changed.
#Note that turning off redactor can cause overlapping and unsorted
#entities to be returned.
#Default processors:
#acceptGazetteer,
#acceptRegex,
#rejectGazetteer,
#rejectRegex,
#statistical,
#indocCoref,
#redactor,
#joiner
#
processors:
  - statistical
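Before editing the processors list, it can help to sanity-check the names you plan to keep. The following Python sketch is only an illustration: the helper name check_processors is hypothetical, the set of names is copied from the processor list in this section, and it does not model the internal prerequisite rules, which are not documented here.

```python
# Processor names as listed in this section of the documentation.
KNOWN_PROCESSORS = {
    "statistical", "deepNeuralNetwork", "acceptGazetteer", "acceptRegex",
    "kbLinker", "redactor", "joiner", "rejectGazetteer", "rejectRegex",
    "indocCoref", "pronominalResolver",
}


def check_processors(active):
    """Return a list of warnings for a proposed `processors:` list (hypothetical helper)."""
    warnings = []
    for name in active:
        if name not in KNOWN_PROCESSORS:
            warnings.append(f"unknown processor: {name}")
    if "redactor" not in active:
        # The documentation notes this side effect of turning off the redactor.
        warnings.append("redactor is off: overlapping and unsorted entities may be returned")
    return warnings


for warning in check_processors(["statistical"]):
    print(warning)
```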
Note
The redactor chooses among the entities when processors extract the same or overlapping entities. Turning off the redactor will return all entities found by all processors. This can cause overlapping and unsorted entities to be returned.
The statistical processor uses models based on computational linguistics and human-annotated training documents. You can add other statistical models to improve extraction for your use case.
You can train a new statistical model to extract different entity types or to improve the results of the statistical model using the Rosette Model Training Suite. Contact your sales representative or support@rosette.com for more information on model training.
Statistical model-based extractions can return confidence scores for each entity. Confidence scores correlate well with precision and may be used for thresholding and removal of false positives. Confidence is calculated by default if linking is enabled. Otherwise, use the calculateConfidence parameter to enable confidence scores. To set a threshold value, use the confidenceThreshold parameter.
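The thresholding idea can also be applied on the client side after extraction. The sketch below assumes each returned entity is a dictionary with an optional "confidence" key; that response shape and the helper name filter_by_confidence are assumptions made for illustration, not a description of how REX applies confidenceThreshold internally.

```python
def filter_by_confidence(entities, threshold):
    """Keep entities whose confidence is at least `threshold`.

    Entities without a confidence value are kept, on the assumption that
    not every processor reports one.
    """
    kept = []
    for entity in entities:
        confidence = entity.get("confidence")
        if confidence is None or confidence >= threshold:
            kept.append(entity)
    return kept


sample = [
    {"mention": "Isaac Newton", "type": "PERSON", "confidence": 0.91},
    {"mention": "Apple", "type": "ORGANIZATION", "confidence": 0.12},
    {"mention": "example.com", "type": "IDENTIFIER:URL"},
]
high_confidence = filter_by_confidence(sample, 0.5)
```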
Table 3. Statistical Processor Parameters
| Parameter | Description | Default |
| --- | --- | --- |
| calculateConfidence | If true, entity confidence values are calculated. Can be overridden by specifying calculateConfidence in the API call. | false |
| confidenceThreshold | The confidence value threshold below which entities extracted by the statistical processor are ignored. | -1.0 |
| statisticalModels | Additional files used to produce statistical entities for the given language. You may pass multiple statistical models. The parameter should be formatted in trios of values specifying language, case-sensitivity, and the model file, separated by commas. Case-sensitivity can be automatic, caseInsensitive, or caseSensitive. For example, setting two models for case-sensitive English and Japanese might look like: eng,caseSensitive,english-model.bin,jpn,automatic,japanese-model.bin | null |
| caseSensitivity | The capitalization (also known as "case") used in the input texts. Processing standard documents requires caseSensitive, which is the default. Documents with all-caps, no-caps, or headline capitalization may yield higher accuracy if processed with the caseInsensitive value. Can be automatic, caseSensitive, or caseInsensitive. | caseSensitive |
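The trio format of the statisticalModels value is easy to get wrong by hand. As a minimal sketch, the hypothetical helper below builds the comma-separated string from (language, case-sensitivity, model file) tuples and rejects case-sensitivity values other than the three named in the table.

```python
VALID_CASE_SENSITIVITY = {"automatic", "caseInsensitive", "caseSensitive"}


def format_statistical_models(models):
    """Build a statisticalModels value from (language, caseSensitivity, file) trios."""
    parts = []
    for language, case_sensitivity, model_file in models:
        if case_sensitivity not in VALID_CASE_SENSITIVITY:
            raise ValueError(f"invalid case-sensitivity: {case_sensitivity}")
        parts.extend([language, case_sensitivity, model_file])
    return ",".join(parts)


value = format_statistical_models([
    ("eng", "caseSensitive", "english-model.bin"),
    ("jpn", "automatic", "japanese-model.bin"),
])
```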
Adding a custom statistical model
Custom trained entity extraction models can be added to REX, replacing or supplementing the standard model shipped with the product. Use Rosette Model Training Suite (MTS) to train the new models.
You must choose whether to extract entities using both the new and the default statistical models together, which we call model mixing, or to use the new statistical model exclusively.
With model mixing, REX runs both the new and the default models in parallel and uses the redactor module to adjudicate the overlapping results.
The trained models are moved from MTS to the production instance of REX through the following steps:
- Export the entity extraction model from MTS.
- Rename the model.
  Tip
  Model Naming Convention
  The prefix must be model. and the suffix must be -LE.bin. Any alphanumeric ASCII characters are allowed in between.
  Example valid model names:
  - model.fruit-LE.bin
  - model.customer4-LE.bin
- Copy the model into the default data directory in the REX root folder.
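The naming convention in the steps above can be checked before copying a model into place. This is a small sketch under the stated rule (prefix model., suffix -LE.bin, alphanumeric ASCII characters in between); the function name is_valid_model_name is hypothetical.

```python
import re

# "model." + one or more ASCII letters or digits + "-LE.bin"
MODEL_NAME = re.compile(r"model\.[A-Za-z0-9]+-LE\.bin")


def is_valid_model_name(name):
    """Return True if `name` follows the documented model naming convention."""
    return MODEL_NAME.fullmatch(name) is not None


for candidate in ["model.fruit-LE.bin", "model.customer4-LE.bin", "fruit-LE.bin"]:
    print(candidate, is_valid_model_name(candidate))
```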
Deep Neural Network Processor
REX has a deep neural network (DNN) model that can be used in place of the statistical model for selected languages. By default, REX uses the statistical models rather than the DNN model. You can customize which model Rosette uses.
The deep neural network processor uses TensorFlow 2.3.1 (Java version 0.2.0). Ubuntu Linux 14.04+, Windows 7+, and MacOS 10.11+ are fully supported, but you should be able to run the processor successfully on other modern Linux flavors as well. To use the processor on platforms which are not otherwise supported, or to improve the speed on supported platforms, you can replace the TensorFlow library shipped with the product with one that’s built from source.
To make use of GPUs, you should download tensorflow-core-platform-gpu and add it to the top of your classpath.
To select which model will be used, set the modelType option in your calls. The default value for modelType is statistical. To enable the deep neural network model, provide DNN for the modelType. Example:
{"content": "your_text_here", "options": {"modelType": "DNN"}}
Currently, REX has DNN models for the following languages:
- Arabic (ara)
- English (eng)
- Hebrew (heb)
- Korean (kor)
Important
The deep neural network model and the statistical model cannot be used together. When selected, the DNN replaces the statistical model.
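A request body that selects the DNN model can be assembled programmatically. The sketch below only builds the JSON string and does not send it. The helper name build_entities_request and the optional "language" field are assumptions for illustration; the language check uses the DNN language list given above.

```python
import json

# Languages with DNN models, per the list in this section.
DNN_LANGUAGES = {"ara", "eng", "heb", "kor"}


def build_entities_request(content, language=None, use_dnn=False):
    """Return a JSON request body, optionally selecting the DNN model."""
    if use_dnn and language is not None and language not in DNN_LANGUAGES:
        raise ValueError(f"no DNN model listed for language: {language}")
    body = {"content": content}
    if language is not None:
        body["language"] = language  # assumed request field
    if use_dnn:
        body["options"] = {"modelType": "DNN"}
    return json.dumps(body)


print(build_entities_request("your_text_here", use_dnn=True))
```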
REX has a name classifier which can be used in place of the statistical model for structured regions. The name classifier is a machine learning model that tries to predict an entity type for an input string. It processes the entire structured region (the input string) as a single entity, predicting a label (PERSON, LOCATION, ORGANIZATION, or NONE) for the string. It works best on table cells or list items where the entire entry is a single entity. If a structured region contains more text than the entity mention itself, the name classifier will usually label it as NONE.
To enable the name classifier for structured regions, set structuredRegionsProcessingType to nameClassifier in the rex-factory-config.yaml file.
Currently, REX supports the name classifier processor for the following languages:
- Arabic
- English
- French
- German
- Hebrew
- Japanese
Each language has its own configuration file, data/name_classifier/<lang>/<lang>_config.yaml, where <lang> is the three-letter language code. The labelScoreThresholds field sets, for each entity type, the score threshold that controls how readily the classifier labels a phrase with that type. Lowering the threshold will label more phrases, which will find more true positives, but may also identify more false positives.
To disable an entity type completely, remove or comment out the corresponding entry from the <lang>_config.yaml file. Example:
# labelScoreThresholds
# Set the model score thresholds for each entity type.
# To turn off an entity from the model, comment it out.
# The accuracy of the current ORG model is too low and so it is better to turn it off for now.
labelScoreThresholds:
  PER: 1.2
  LOC: 3.2
  # ORG: 5.2
Note
Currently, the ORG entity type is excluded for all languages. LOC is enabled for English and Japanese only.
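To make the effect of the thresholds concrete, here is a rough sketch of threshold-based labeling. It is an assumption about the general mechanism, not a description of the classifier's actual scoring: it supposes the model produces one score per label, that a label is eligible only if its score reaches its threshold, and that a label with no threshold entry is disabled. The function name pick_label is hypothetical.

```python
def pick_label(scores, thresholds):
    """Return the highest-scoring label that clears its threshold, else "NONE"."""
    best_label, best_score = "NONE", None
    for label, score in scores.items():
        threshold = thresholds.get(label)
        if threshold is None or score < threshold:
            continue  # label disabled (no entry) or score too low
        if best_score is None or score > best_score:
            best_label, best_score = label, score
    return best_label


# Mirrors the example configuration: ORG is commented out, so it has no entry.
thresholds = {"PER": 1.2, "LOC": 3.2}
label = pick_label({"PER": 2.0, "LOC": 1.0, "ORG": 9.0}, thresholds)
```

Under these assumptions, lowering a threshold lets more phrases receive that label, which matches the trade-off described above.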
A gazetteer is a list of exact matches in a predefined closed class. For example, you can use a gazetteer to match all the countries in the world, as there is a precise and unambiguous list of countries. An entry would count as ambiguous if it has multiple possible meanings, such as "Apple", which could be either an ORGANIZATION or a fruit. The gazetteers are very fast at extracting entities. If you are searching for specific words or phrases in your data, a custom gazetteer is a good way to find them quickly.
REX is shipped with default gazetteer files, which you can modify. Gazetteer files are located in a subdirectory of the data directory, defined by language using the three-letter ISO 639-3 language code. A directory that applies to all languages uses xxx for the language code. For example:
<install-directory>/roots/rex-<version>/data/gazetteer/eng/reject/
<install-directory>/roots/rex-<version>/data/gazetteer/xxx/accept/
By default, the data files are located in the <install-directory>/roots/rex-<version> directory. If you want your custom files to be in a separate location, use an Overlay Data Directory.
Table 4. Accept Gazetteer Parameters
| Parameter | Description | Default |
| --- | --- | --- |
| allowPartialGazetteerMatches | The option to allow partial gazetteer matches. For the purposes of this setting, a partial match is one that does not line up with token boundaries as determined by the internal tokenizer. This only applies to accept gazetteers. | false |
| acceptGazetteers | Additional gazetteer files used to produce entities for the given language. | null |
Creating a Custom Gazetteer
You can create your own custom gazetteers. To create a custom gazetteer, put the new file in the appropriate location in the data/gazetteer tree.
A gazetteer file has the following format:
- Each comment line is prefixed with #.
- The first non-comment line is TYPE[:SUBTYPE], where TYPE is required and SUBTYPE is optional. The type is applied to the entire gazetteer and defines the entity type name for output. TYPE and SUBTYPE may be predefined or user-defined.
Gazetteer entries and potential matches are space normalized to treat any whitespace between words as a single space. This enables the gazetteer to match entities with differences in whitespace.
Tip
To improve performance, text gazetteers can be compiled to a binary gazetteer using build-binary-gazetteer with the REX Field Training Kit. The binary gazetteer file name must end with -LE.bin.
Example 4. Gazetteers to Track Infectious Diseases
To track common infectious diseases, create a gazetteer like this:
# File: infectious-diseases-gazetteer.txt
#
DISEASE:INFECTIOUS
tuberculosis
e. coli
malaria
influenza
A single gazetteer may not be enough; you can create as many gazetteers as you need. To search for the scientific names of the infectious disease, you can create a file like this:
# File: latin-infectious-gazetteer.txt
#
DISEASE:INFECTIOUS
Mycobacterium tuberculosis
Escherichia coli
Plasmodium malariae
Orthomyxoviridae
To track certain diseases by their causes:
# File: infectious-bacterial-gazetteer.txt
#
DISEASE:BACTERIAL
Escherichia coli
E. coli
Staphylococcus aureus
Streptococcus pneumoniae
Salmonella
Or to track the drugs used to treat them:
# File: antimicrobial-drugs-gazetteer.txt
#
DRUG:ANTIMICROBIAL
methicillin
vancomycin
macrolide
fluoroquinolone
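Text gazetteers like the ones above can be checked with a few lines of Python before they are installed. The sketch below follows the documented format (comment lines start with #, the first non-comment line is TYPE[:SUBTYPE], entries are space normalized). Skipping blank lines is an assumption, and parse_gazetteer is a hypothetical helper, not part of REX.

```python
def parse_gazetteer(text):
    """Parse a text gazetteer into (type, subtype, entries)."""
    header = None
    entries = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # comment line, or blank line (assumed to be ignorable)
        if header is None:
            header = stripped  # first non-comment line: TYPE[:SUBTYPE]
            continue
        # Space normalization: any run of whitespace becomes one space.
        entries.append(" ".join(stripped.split()))
    if header is None:
        raise ValueError("gazetteer has no TYPE[:SUBTYPE] line")
    entity_type, _, subtype = header.partition(":")
    return entity_type, subtype or None, entries


sample = """# File: infectious-diseases-gazetteer.txt
#
DISEASE:INFECTIOUS
tuberculosis
e.   coli
malaria
"""
parsed = parse_gazetteer(sample)
```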
Tip
By default, the data files are located in the <install-directory>/roots/rex-<version> directory. To install custom gazetteer files in a separate directory, use an Overlay Data Directory.
Partial Gazetteer Matches
By default, gazetteer matches must match token boundaries in the input text. You can enable partial matches that do not start and/or do not end on token boundaries. You can also set individual regexes to return partial matches by including allow-partial-matches="yes" in a regex.
Partial matches require in-document coreference to be disabled. As a result, the mentions will not be grouped into entities.
#An option for document entity resolution (also known as entity chaining).
indocType: NULL
Tip
We do not recommend that you enable partial matches. It adds processing time and may match more than you expect. An entry such as "red" in a COLOR gazetteer will match "Frederick" in the input text.
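The "red" versus "Frederick" problem can be reproduced in a few lines. This sketch approximates token boundaries with regex word boundaries, which is only a stand-in for the internal tokenizer; find_matches is a hypothetical helper.

```python
import re


def find_matches(entry, text, allow_partial=False):
    """Return (start, end) spans where `entry` occurs in `text`."""
    pattern = re.escape(entry)
    if not allow_partial:
        # Approximate token boundaries with word boundaries.
        pattern = r"\b" + pattern + r"\b"
    return [match.span() for match in re.finditer(pattern, text)]


text = "Frederick wore a red hat"
whole_token = find_matches("red", text)
partial = find_matches("red", text, allow_partial=True)
```

With partial matching enabled, the extra span inside "Frederick" is returned as well, which is the kind of unexpected match the tip warns about.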
REX can analyze both simplified and traditional Chinese language documents. Three language codes are used for Chinese: zho, zhs (simplified Chinese), and zht (traditional Chinese).
zho is the Chinese language code; it applies to both simplified and traditional Chinese. Gazetteers using zho as the language code apply to documents with a language code of zhs or zht. Users should include both simplified and traditional Chinese words in the zho gazetteer, so that it will work for all Chinese language codes.
Example 5. Adding a Simplified and Traditional Word for "lion"
{"language": "zho",
"configuration": {
"entities": { "ANIMAL": [ "狮子", "獅子" ] }
}
}
Adding Dynamic Gazetteers
You can use the API to dynamically add gazetteer entries to the /entities endpoint. The REST endpoint is:
http://localhost:8181/rest/v1/entities/configuration/gazetteer/add
Parameters:
- language: The three-letter language code of the new values. For example, to add an English value, the language would be eng. To add the value to all languages, the language code is xxx. The language must be supported by the /entities endpoint.
- entity type: The type of the entity. For example, PERSON, LOCATION, ORGANIZATION, or TITLE. The entity type must already exist in the system.
- values: One or more values to be added to the gazetteer.
- profileId (Optional): Custom profile id
Example 6. Dynamically adding a gazetteer entry as a string
In this example, we're adding the companies New Corp and Best Business to the entities gazetteer for all languages (xxx).
curl --request POST \
--url http://localhost:8181/rest/v1/entities/configuration/gazetteer/add \
--header 'accept: application/json' \
--header 'content-type: application/json' \
--data '{"language": "xxx", "configuration":{"entities":{ "COMPANY": ["New Corp", "Best Business"]}}}'
Example 7. Dynamically adding a gazetteer entry to a custom profile
In this example, we're adding the same data as above to the profile named group1.
curl --request POST \
--url http://localhost:8181/rest/v1/entities/configuration/gazetteer/add \
--header 'accept: application/json' \
--header 'content-type: application/json' \
--data '{"language": "xxx", "configuration":{"entities":{ "COMPANY": ["New Corp", "Best Business"]}}, "profileId": "group1"}'
Example 8. Dynamically adding a gazetteer entry as a file
In this example, the new values are in a file called new_companies.json:
{"language": "xxx", "configuration": {"entities":{ "COMPANY": ["New Corp", "Best Business"] } } }
The cURL command to add the file values:
curl --request POST \
--url http://localhost:8181/rest/v1/entities/configuration/gazetteer/add \
--header 'accept: application/json' \
--header 'content-type: application/json' \
--data '@new_companies.json'
Caution
Dynamic gazetteer entries are held completely in memory and state is not saved on disk. When Rosette enterprise is brought down, the contents are lost. To save the new entries, add the new values to the related gazetteer file before restarting Rosette enterprise.
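The same request can be prepared from Python with the standard library. This is a minimal sketch that mirrors the cURL examples above: it builds the POST request but leaves sending it as a commented-out step, and the helper name build_gazetteer_request is hypothetical.

```python
import json
import urllib.request

GAZETTEER_ADD_URL = "http://localhost:8181/rest/v1/entities/configuration/gazetteer/add"


def build_gazetteer_request(language, entity_type, values, profile_id=None):
    """Build a POST request that adds `values` to the dynamic gazetteer."""
    payload = {
        "language": language,
        "configuration": {"entities": {entity_type: list(values)}},
    }
    if profile_id is not None:
        payload["profileId"] = profile_id
    return urllib.request.Request(
        GAZETTEER_ADD_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "content-type": "application/json"},
        method="POST",
    )


request = build_gazetteer_request("xxx", "COMPANY", ["New Corp", "Best Business"], profile_id="group1")
# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Because dynamic entries are lost on shutdown, a script like this could also be re-run after a restart, provided the values have not already been written to the gazetteer file.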
Regular expressions (regexes) are used for finding entities which follow a strict pattern with a rigid form and infinite combinations, such as URLs and credit card numbers. In the default REX installation the regex files are:
You can modify these files to add new patterns to extract the same entity type.
Table 5. Accept Regex Parameters
| Parameter | Description | Default |
| --- | --- | --- |
| acceptRegularExpressionSets | Additional files used to produce regex entities. | null |
| supplementalRegularExpressionPaths | The option to add supplemental regex files, usually for entity types that are excluded by default. The supplemental regex files are located at data/regex/<lang>/accept/supplemental and are not used unless specified. | null |
| regexCurrencySplit | When set to true, REX will attempt to split entities extracted with the regex engine of type IDENTIFIER:MONEY into two entities: IDENTIFIER:CURRENCY_AMT and IDENTIFIER:CURRENCY_TYPE. These types represent the amount of the currency (50,000) and the currency type ($), respectively. | false |
To extract new entity types that have predictable patterns, add a new XML regex file in either the language-specific (<lang>) or generic (xxx) location. REX uses the Tcl regex format for defining the regex patterns.
REX modifies the regex matcher so that \n in a regex expression matches straight new lines (\n), carriage returns (\r), or a combination of both (\r\n). Regardless of what it matches, offsets and lengths in the result will match the input document.
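The newline behavior can be imitated with Python's re module to see what it means for offsets. This is an illustration only: Python's regex syntax is not the Tcl format REX uses, the rewrite below is a simplification that does not handle an escaped backslash before n, and expand_newlines is a hypothetical helper.

```python
import re

# One alternation that accepts \r\n, a lone \r, or a lone \n.
NEWLINE = r"(?:\r\n|\r|\n)"


def expand_newlines(pattern):
    """Rewrite each literal \\n in a pattern so it matches any line ending."""
    return pattern.replace(r"\n", NEWLINE)


pattern = expand_newlines(r"Total\n\d+")
windows_match = re.search(pattern, "Total\r\n42")
unix_match = re.search(pattern, "Total\n42")
```

The match spans are computed against the original input, so a \r\n line ending makes the match one character longer than a \n line ending, in line with the statement that offsets and lengths match the input document.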
By default, the data files are located in the <install-directory>/roots/rex-<version> directory. If you want your custom files to be in a separate location, use an Overlay Data Directory.
Each regex is defined in a regexp, which may contain a lang attribute and may refer to define elements.
The lang attribute designates the language for the regex. If the regex applies to text in any language, there is no lang attribute. For example, all the regexes in data/regex/eng should include lang="eng". The regexes in data/regex/xxx do not include the lang attribute, since they apply to text in any language.
A define element contains a regex and a name attribute. By naming the regex, you can include the regex in multiple regexp elements.
Example 9. Defining a Regular Expression: time_ampm
- Define the regular expression in a define statement:
  <define lang="eng" name="time_ampm">(?:[pa]\.?\s?m\.?)</define>
- Use the regular expression in a regexp statement:
  <regexp lang="eng" type="TEMPORAL:TIME">...${time_ampm}...</regexp>
When REX evaluates the regexp statement, it follows these steps:
- When ${time_ampm} appears in a regexp lang="eng" element, REX looks for a define name="time_ampm" lang="eng" statement.
- If it does not find the element, REX looks for a define name="time_ampm" element without the lang attribute.
- If it does not find such an element, an error occurs.
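The lookup order in the steps above can be sketched as follows. The data structure (a dictionary keyed by name and language, with None standing for a define without a lang attribute) and the function names are assumptions made for illustration, not the REX implementation.

```python
import re


def resolve_define(defines, name, lang):
    """Look up a define: language-specific first, then language-neutral."""
    if (name, lang) in defines:
        return defines[(name, lang)]
    if (name, None) in defines:
        return defines[(name, None)]
    raise KeyError(f"no define named {name!r} for lang {lang!r}")


def expand(pattern, defines, lang):
    """Replace every ${name} reference in `pattern` with its definition."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda match: resolve_define(defines, match.group(1), lang),
        pattern,
    )


defines = {
    ("time_ampm", "eng"): r"(?:[pa]\.?\s?m\.?)",
    ("digits", None): r"\d+",  # hypothetical language-neutral define
}
expanded = expand(r"${digits}\s?${time_ampm}", defines, "eng")
```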
If you include an id attribute setting, that value is returned as the "subsource" of an entity returned by this regexp.
REX is shipped with supplemental regexes which are not activated by default. The supplemental regexes are located in the data/regex/<lang>/accept/supplemental directory.
#The option to add supplemental regex files, usually for entity types that are excluded by
#default. The supplemental regex files are located at data/regex/<lang>/accept/supplemental and
#are not used unless specified.
supplementalRegularExpressionPaths:
- data/regex/eng/accept/supplemental/geo-regexes.xml
Table 6. Supplemental Regexes by Language
| Language (ISO Code) | Currency | Date | Distance | Geo | License-Plate | Numbers | Org | Personal-ID | Phone | Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Arabic ara |  | X | X | X | X | X |  |  |  | X |
| German deu |  | X | X | X |  | X |  |  |  | X |
| English eng |  | X | X | X |  | X | X |  |  | X |
| Farsi fas |  | X | X | X |  | X |  |  |  | X |
| French fra |  | X | X | X |  | X |  |  |  | X |
| Hebrew heb |  | X | X | X |  | X |  |  |  | X |
| Hungarian hun |  | X | X | X |  | X |  |  |  | X |
| Indonesian ind |  | X | X |  |  |  |  |  |  | X |
| Italian ita |  | X | X | X |  | X |  |  |  | X |
| Japanese jpn |  | X | X | X |  | X |  |  |  | X |
| Korean kor |  | X | X | X |  | X |  |  |  | X |
| Dutch nld |  | X | X | X |  | X |  |  |  | X |
| Portuguese por |  | X | X | X |  | X |  |  |  | X |
| Pashto pus |  | X | X | X |  | X |  |  |  | X |
| Russian rus |  | X | X | X |  | X |  |  |  | X |
| Spanish spa |  | X | X | X |  | X |  |  |  | X |
| Swedish swe |  | X | X | X |  | X |  |  |  | X |
| Upper-case English uen |  | X | X | X |  | X |  |  |  | X |
| Vietnamese vie | X | X | X |  |  |  |  |  |  | X |
| Simplified Chinese zhs |  | X | X | X |  | X |  |  |  | X |
| Traditional Chinese zht |  | X | X | X |  | X |  |  |  | X |
| Malay, Standard zsm |  | X | X |  |  |  |  |  |  | X |
The joiner combines adjacent entities into a single entity, based on the joiner rules. REX then returns the single entity.
The configuration file for joining adjacent entities is in data/etc.
Table 7. Joiner Parameters
| Parameter | Description | Default |
| --- | --- | --- |
| joinerRuleFiles | File containing additional joiner rules. | null |
| runJoinerPostRedactor | Run the joiner after the redactor, instead of before. | false |
The file neredact-config.xml specifies the rules for joining adjacent entities. Adjacent TITLE entities are joined into a single TITLE entity. The joiner elements for joining TITLE and PERSON entities into a PERSON entity are commented out by default.
<neredactconfig>
  <joiners>
    <joiner left='TITLE' right='TITLE' joined='TITLE'/>
    <!-- Not joined by default
    <joiner language='eng' left='TITLE' right='PERSON' joined='PERSON'/>
    <joiner language='jpn' left='PERSON' right='TITLE' joined='PERSON'/>
    -->
  </joiners>
</neredactconfig>
Rules can optionally specify a language, in which case they will apply only to entities of that specific language. If a language is not specified, the rule will apply for any language.
Entities are considered adjacent if they are separated by no more than 5 whitespace characters.
For example, to join "Barack Obama" and "President" in "Barack Obama, President", the joiner rule is:
<joiner left='PERSON' adjacency-regex=',\s+' right='TITLE' joined='PERSON'/>
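To show how a joiner rule and the adjacency test fit together, here is a rough Python sketch. It is an assumed model, not the REX implementation: entities are dictionaries with type, start, and end offsets; a gap counts as adjacent if it is at most 5 whitespace characters, or, when the rule has an adjacency regex, if the regex matches the whole gap. The key adjacency_regex and the function join_adjacent are hypothetical names.

```python
import re


def join_adjacent(text, entities, rule):
    """Apply one joiner rule, left to right, over entities sorted by offset."""
    result = []
    for entity in sorted(entities, key=lambda e: e["start"]):
        previous = result[-1] if result else None
        if previous and previous["type"] == rule["left"] and entity["type"] == rule["right"]:
            gap = text[previous["end"]:entity["start"]]
            adjacency = rule.get("adjacency_regex")
            if adjacency is not None:
                adjacent = re.fullmatch(adjacency, gap) is not None
            else:
                adjacent = len(gap) <= 5 and gap.strip() == ""
            if adjacent:
                # Replace the left entity with the joined span.
                result[-1] = {"type": rule["joined"], "start": previous["start"], "end": entity["end"]}
                continue
        result.append(entity)
    return result


text = "Barack Obama, President"
entities = [
    {"type": "PERSON", "start": 0, "end": 12},
    {"type": "TITLE", "start": 14, "end": 23},
]
rule = {"left": "PERSON", "right": "TITLE", "joined": "PERSON", "adjacency_regex": r",\s+"}
joined = join_adjacent(text, entities, rule)
```

Without the adjacency regex, the comma in the gap would fail the whitespace-only test in this sketch and the two entities would stay separate.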
The joiner runs before the redactor, as of release 7.46.2. To run the joiner after the redactor, set the parameter runJoinerPostRedactor to true in the rex-factory-config.yaml file.
The redactor determines which entity to choose when multiple mentions for the same entity are extracted. The redactor first chooses longer entity mentions over shorter ones. If the lengths of the mentions are the same, the redactor uses weightings to select an entity mention.
Different processors can extract overlapping entities. For example, a gazetteer extracts "Newton", Massachusetts as a LOCATION, and the statistical processor extracts "Isaac Newton" as a PERSON. When two processors return the same or overlapping entities, the redactor chooses an entity based on the length of the competing entity strings. By default, a conflict between overlapping entities is resolved in favor of the longer candidate, "Isaac Newton".
Tip
The correct entity mention is almost always the longer mention. There can be examples, such as the example of "Newton" above, where the shorter mention is the correct mention. While it might seem that turning off the option to prefer length is the easiest fix, it usually just fixes a specific instance while reducing overall accuracy. We strongly recommend keeping the default redactorPreferLength as true.
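The selection rule described above (longer mention first, then weight) can be sketched in a few lines. This is a simplified model under assumptions: candidates are dictionaries with start, end, and source fields, weights are keyed by processor name, and a source without a weight falls back to 10. The function name choose_mention and the tie-breaking behavior are not taken from REX.

```python
def choose_mention(candidates, weights, prefer_length=True):
    """Pick one of several overlapping candidate mentions."""
    def rank(candidate):
        length = candidate["end"] - candidate["start"] if prefer_length else 0
        return (length, weights.get(candidate["source"], 10))
    return max(candidates, key=rank)


candidates = [
    {"text": "Newton", "type": "LOCATION", "start": 6, "end": 12, "source": "gazetteer"},
    {"text": "Isaac Newton", "type": "PERSON", "start": 0, "end": 12, "source": "statistical"},
]
weights = {"statistical": 9, "gazetteer": 10}
chosen = choose_mention(candidates, weights)
```

In this model, turning off the length preference hands the decision to the weights alone, so the gazetteer mention "Newton" would win here, which illustrates why the tip advises against it.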
The redactor can be configured to set weights by entity type and by processor.
The configuration file for setting redactor weights is in data/etc.
Set weight by entity type
Each of the ne_type elements in ne-types.xml defines weightings for a specified entity type. For example, to assign weights for IDENTIFIER entities:
<ne_type>
  <name>IDENTIFIER</name>
  <subtypes>
    <name>EMAIL</name>
    <name>URL</name>
    <name>DOMAIN_NAME</name>
    <name>IP_ADDRESS</name>
    <name>PHONE_NUMBER</name>
    <name>FAX_NUMBER</name>
    <name>PERSONAL_ID_NUM</name>
    <name>CREDIT_CARD_NUM</name>
    <name>MONEY</name>
    <name>PERCENT</name>
    <name>NUMBER</name>
  </subtypes>
  <weight name="statistical" value="9" />
  <weight name="gazetteer" value="10" />
  <weight name="regex" value="10" />
</ne_type>
This assigns weights for the IDENTIFIER entities. They are also weighted by processor.
Set weights by processor
The processor weights are relative values; they do not have to add up to any specific value. For example, to favor gazetteer entries over regexes, and favor both over values returned by statistical analysis, you could set the weights as follows:
<weight name="statistical" value="1" />
<weight name="gazetteer" value="10" />
<weight name="regex" value="5" />
Some processors offer subsources to identify specific instances. The kb-linker processor returns a subsource indicating the knowledge base the extraction originated in. To set a weight for a specific subsource, set the name property to PROCESSOR:SUBSOURCE. For example, to favor your custom knowledge base (MyKB) over other extractions but keep other linker extractions low, you could set the weights as follows:
<weight name="kb-linker:MyKB" value="20" />
<weight name="kb-linker" value="1" />
When you define new entity types for gazetteers and regexes, you should add those entity types to ne-types.xml if you want to control how the redactor resolves conflicts. Types that do not appear in this file receive weights of 10 for all three processors.
For an entity type with subtypes, the settings apply to all the subtypes.
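A weight lookup that honors PROCESSOR:SUBSOURCE names could look like the sketch below. It assumes the most specific name wins and that anything without an entry falls back to 10, which extends the documented default for unlisted types to unlisted processors; both points are assumptions, and lookup_weight is a hypothetical helper.

```python
DEFAULT_WEIGHT = 10


def lookup_weight(weights, processor, subsource=None):
    """Return the weight for a processor, preferring a PROCESSOR:SUBSOURCE entry."""
    if subsource is not None:
        specific = f"{processor}:{subsource}"
        if specific in weights:
            return weights[specific]
    return weights.get(processor, DEFAULT_WEIGHT)


weights = {"kb-linker:MyKB": 20, "kb-linker": 1}
custom = lookup_weight(weights, "kb-linker", "MyKB")
other = lookup_weight(weights, "kb-linker", "otherKB")
```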
Instead of adding entities to extract when matched, you can define a list of entities to reject when matched. These are reject gazetteers.
The format of a reject gazetteer is identical to the format of an accept gazetteer, except that the wildcard (*) is allowed in the entity type. As with accept gazetteers, they are arranged by language.
For example, a gazetteer for rejecting German entities goes in data/gazetteer/deu/reject. A gazetteer for rejecting entities in multiple languages goes in data/gazetteer/xxx/reject.
Table 8. Reject Gazetteer Parameters
| Parameter | Description | Default |
| --- | --- | --- |
| rejectGazetteers | Additional gazetteer files used to reject entities for the given language. | null |
Example 10. Reject Gazetteer
The following .txt file in data/gazetteer/eng/reject rejects the PERSON entity named "George Watson" when processing English documents.
PERSON
George Watson
A wildcard entity type matches any type. With the following file, the value "George Watson" is rejected from all entity types, not just PERSON.
*
George Watson
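The effect of the wildcard can be sketched as a simple membership test. This is an illustrative model only: reject gazetteers are represented as (type pattern, set of mentions) pairs, subtypes are ignored, and is_rejected is a hypothetical helper.

```python
def is_rejected(entity, reject_gazetteers):
    """Return True if a reject gazetteer covers this entity's type and mention."""
    mention = " ".join(entity["mention"].split())  # space normalization
    for type_pattern, mentions in reject_gazetteers:
        if type_pattern in ("*", entity["type"]) and mention in mentions:
            return True
    return False


person_only = [("PERSON", {"George Watson"})]
any_type = [("*", {"George Watson"})]
college = {"mention": "George Watson", "type": "ORGANIZATION"}
person = {"mention": "George  Watson", "type": "PERSON"}
```

In this model the PERSON gazetteer rejects only the PERSON mention, while the wildcard gazetteer rejects both.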
A typical regex is used to identify entities of a specified entity type. You can also define a regex to reject entities; that is, whenever the pattern matches, the entity is rejected as the defined type. Reject regexes follow the same format as accept regexes, with the addition that the wildcard (*) is allowed for the entity type.
Place your reject regex files in the following directories:
Table 9. Reject Regex Parameters
| Parameter | Description | Default |
| --- | --- | --- |
| rejectRegularExpressionSets | Additional regex files used to reject entities. | null |
For example, a file to reject German entities goes in data/regex/deu/reject. Files rejecting entities in multiple languages go in data/regex/xxx/reject.
Example 11. Regex to Reject a Location
The following .xml file in data/regex/eng/reject rejects Baltimore as a LOCATION entity when processing English documents.
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE regexps PUBLIC "-//basistech.com//DTD RLP Regular Expression Config 7.1//EN"
"urn:basistech.com:7.1:rlpregexp.dtd">
<regexps>
<regexp lang="eng" type="LOCATION">Baltimore</regexp>
</regexps>
Note
Lookbehind assertions are not supported.
Documents often contain multiple references to a single entity. In-document coreference (indoc coref) chains together all mentions of the same entity.
Table 10. Configuration Parameters
| Parameter | Description | Default |
| --- | --- | --- |
| indocType | An option for document entity resolution (also known as entity chaining). Valid values are: HIGH, STANDARD, STANDARD_MINUS, or NULL. | STANDARD |
| maxResolvedEntities | The maximum number of entities for in-document coreference resolution (also known as chaining). | 2000 |
If resolvePronouns is enabled (it is disabled by default), REX will try to resolve pronouns with the corresponding antecedent entities.
Table 11. Pronominal Resolution Parameters
| Parameter | Description | Default |
| --- | --- | --- |
| resolvePronouns | When true, resolve pronouns to person entities. | false |