Rosette Enterprise is a Java-based service that offers Rosette as a locally-deployable package, providing access to Rosette's functions either as RESTful web service endpoints or as an embedded Java library. Each endpoint is supported by a root directory; your installation includes only roots for the endpoints you have licensed. Rosette Entity Extraction is contained in the rex root. When installing entity extraction, you only receive files for your licensed languages.
To customize entity extraction in Rosette Enterprise, you add files to the rex root directory.
Entity Extraction and Linking Default Configuration
The default configuration of the Entity Extraction and Linking endpoint has been optimized to be more performance-oriented, with fewer options enabled. This change became effective in the 1.14.0 (August 2019) version of Rosette Enterprise. The endpoint is now aligned with the configuration of the REX-JE SDK. Rosette Cloud has a different configuration, providing a fully-functional demonstration environment. The current parameter defaults are shown in the table below. Prior to 1.14.0, the default values for Rosette Enterprise were the same as those listed below for Rosette Cloud.
Table 7. Entity Extraction and Linking Endpoint Defaults
Feature | Rosette Enterprise | Parameter Setting in Rosette Enterprise | Rosette Cloud
Entity Linking | false (disabled) | linkEntities: false | true (enabled)
Regular Expression Files | no files are loaded | supplementalRegularExpressionPaths: null | all files are loaded
Pronominal Resolution | false (disabled) | resolvePronouns: false | true (enabled)
Case Sensitivity | case sensitive | caseSensitivity: caseSensitive | automatic
Text that refers to an entity is called an entity mention, such as “Bill Clinton” and “William Jefferson Clinton”. Rosette connects these two entity mentions with entity linking, since they refer to the same real-world PERSON entity. Linking helps establish the identity of the entity by disambiguating common names and matching a variety of names, such as nicknames and formal titles, with an entity ID.
Rosette uses the Wikidata knowledge base to link Person, Location, and Organization entities. If the entity exists in Wikidata, then Rosette returns the Wikidata QID, such as Q1 for the Universe. If Rosette cannot link the entity, then it creates a placeholder temporary (“T”) entity ID (TID) to link mentions of the same entity in the document. However, the TID for the same entity may differ from one document to the next.
Rosette supports linking to other knowledge bases, specifically the DBpedia ontology and the Thomson Reuters PermID. Note that these features are still in LABS and subject to change in the future.
Entity linking in Rosette Enterprise is off by default, to improve call speed. When entity linking is turned off, Rosette returns the entities with a TID.
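When processing results, it can be useful to tell linked entities apart from placeholders. The following is a minimal sketch, assuming that a QID is the letter Q followed by digits and that a TID is the letter T followed by digits; the function name classify_entity_id is hypothetical and not part of Rosette. IDs from other knowledge bases fall into the "other" bucket.

```python
def classify_entity_id(entity_id: str) -> str:
    """Classify an entity ID string by its prefix.

    Assumption: QIDs look like "Q1" and TIDs look like "T0"
    (a single letter followed by digits). Anything else is "other".
    """
    if len(entity_id) > 1 and entity_id[1:].isdigit():
        if entity_id[0] == "Q":
            return "linked"      # resolved to a Wikidata entry
        if entity_id[0] == "T":
            return "temporary"   # placeholder, only stable within one document
    return "other"


if __name__ == "__main__":
    for sample in ("Q1", "T0", "not-an-id"):
        print(sample, "->", classify_entity_id(sample))
```

Because a TID is only meaningful inside one document, a caller would normally avoid using "temporary" IDs as keys when merging results across documents.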
You can enable entity linking in Rosette Enterprise for a single call or as a system default in your environment.
- Per-call: add {"options": {"linkEntities": true}} to your call.
- Default: edit the /launcher/config/rosapi/rex-factory-config.yaml file as shown below:
#The option to link mentions to knowledge base entities with
#disambiguation model.
#Enabling this option also enables calculateConfidence.
linkEntities: true
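The per-call form can be sketched as a small helper that builds the JSON request body. This is an illustration only: the "content" field name and the helper build_entities_request are assumptions, and the snippet does not send the request, so it makes no claim about the endpoint URL of your installation.

```python
import json


def build_entities_request(text: str, link_entities: bool = True) -> str:
    """Return a JSON request body that turns entity linking on or off per call.

    Assumption: the text to analyze is passed in a "content" field.
    The "options" object is taken from the documentation above.
    """
    body = {
        "content": text,
        "options": {"linkEntities": link_entities},
    }
    return json.dumps(body)


if __name__ == "__main__":
    print(build_entities_request("Bill Clinton visited Boston."))
```

A per-call option applies only to that request, so it is a convenient way to compare results with and without linking before changing the system default.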
Loading Regex and Gazetteer Files
By default, the regex and gazetteer files are not loaded when the entity extraction endpoint is loaded. To add either default files or new files, edit the /launcher/config/rosapi/rex-factory-config.yaml file, adding the supplementalRegularExpressionPaths statements to the file, as shown below.
Tip
The files are named following the pattern data/regex/<lang>/accept/regexes.xml, where <lang> is either the ISO 639-3 language code for the supported language, or xxx for all or any languages.
#The option to add supplemental regex files, usually for entity types that are excluded by
#default. The supplemental regex files are located at data/regex/<lang>/accept/supplemental and
#are not used unless specified.
supplementalRegularExpressionPaths:
- "${rex-root}/data/regex/eng/accept/supplemental/date-regexes.xml"
- "${rex-root}/data/regex/eng/accept/supplemental/geo-regexes.xml"
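If you maintain several languages, the entries above can be generated rather than typed by hand. The sketch below renders the same YAML fragment for a given language and list of file names; the helper supplemental_regex_yaml is hypothetical, and the file names you pass must exist in your own data/regex/<lang>/accept/supplemental directory.

```python
def supplemental_regex_yaml(lang: str, filenames: list) -> str:
    """Render a supplementalRegularExpressionPaths fragment for one language.

    The ${rex-root} placeholder is emitted literally, as in the
    configuration file shown above.
    """
    lines = ["supplementalRegularExpressionPaths:"]
    for name in filenames:
        path = f"${{rex-root}}/data/regex/{lang}/accept/supplemental/{name}"
        lines.append(f'- "{path}"')
    return "\n".join(lines)


if __name__ == "__main__":
    print(supplemental_regex_yaml("eng", ["date-regexes.xml", "geo-regexes.xml"]))
```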
Entity extraction can try to resolve pronouns with their antecedent entities. For example, in the sentences:
John Smith lives in Boston. He is originally from New York.
pronominal resolution would resolve "He" with "John Smith". By default, pronominal resolution is disabled.
To enable it, edit the /launcher/config/rosapi/rex-factory-config.yaml file as shown below:
#The option to resolve pronouns to person entities.
resolvePronouns: true
Case sensitivity refers to the capitalization (aka 'case') used in the input texts. Entity extraction can use case to help identify named entities (such as proper nouns) in documents.
Valid values for caseSensitivity:
caseSensitive: (default) Case found in standard documents, those in which case follows grammar for the most part.
caseInsensitive: Used for documents with all-caps, no-caps, or headline capitalization. These are documents in which capitalization is not a good indicator for named entities.
automatic: Rosette detects the case of the input and chooses an appropriate model to use.
To change the default case sensitivity, edit the /launcher/config/rosapi/rex-factory-config.yaml file as shown below:
#The capitalization (aka 'case') used in the input texts. Processing standard documents
#requires caseSensitive, which is the default. Documents with all-caps, no-caps or headline
#capitalization may yield higher accuracy if processed with the caseInsensitive value.
caseSensitivity: automatic
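To get a feel for which documents fall into the caseInsensitive category, the following rough heuristic flags text that is entirely upper case or entirely lower case. It is only a sketch for triaging your own corpus: it is not how Rosette's automatic setting works, it does not detect headline capitalization, and the function name suggest_case_sensitivity is hypothetical.

```python
def suggest_case_sensitivity(text: str) -> str:
    """Suggest a caseSensitivity value from the letters in the text.

    All-caps or no-caps text gives "caseInsensitive"; anything with
    mixed case (or with no letters at all) gives "caseSensitive".
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "caseSensitive"
    upper_count = sum(1 for ch in letters if ch.isupper())
    if upper_count == 0 or upper_count == len(letters):
        return "caseInsensitive"
    return "caseSensitive"


if __name__ == "__main__":
    for sample in ("JOHN SMITH LIVES IN BOSTON.",
                   "john smith lives in boston",
                   "John Smith lives in Boston."):
        print(repr(sample), "->", suggest_case_sensitivity(sample))
```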
The gazetteer and regex files for customizing entity extraction are in the <rexroot>/data directory.
The structure for regex and gazetteer entries is the same. For each type of customization, the files are organized by language. The xxx directory contains files that are not language-dependent, but are relevant for all languages. Each language directory can have both accept and reject subdirectories, which determine whether matches are added to or removed from the entity extraction results.
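The layout described above can be summarized as a path-building rule. The helper below is a sketch with a hypothetical name (customization_dir); it only composes the path and does not check that the directory exists in your installation.

```python
from pathlib import PurePosixPath


def customization_dir(rex_root: str, kind: str, lang: str = "xxx",
                      mode: str = "accept") -> PurePosixPath:
    """Build <rex_root>/data/<kind>/<lang>/<mode>.

    kind is "regex" or "gazetteer"; lang is an ISO 639-3 code or "xxx"
    for language-independent files; mode is "accept" or "reject".
    """
    if kind not in ("regex", "gazetteer"):
        raise ValueError(f"unknown customization kind: {kind!r}")
    if mode not in ("accept", "reject"):
        raise ValueError(f"unknown mode: {mode!r}")
    return PurePosixPath(rex_root) / "data" / kind / lang / mode


if __name__ == "__main__":
    print(customization_dir("/opt/rex", "gazetteer", "fra"))
    print(customization_dir("/opt/rex", "regex", mode="reject"))
```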
Adding or Modifying Regexes
Regular expressions (regexes) are used for finding entities which follow a strict pattern with a rigid form and infinite combinations, such as URLs and credit card numbers. Within the default REX, the language-specific entities that are extracted by regexes are defined in data/regex/<lang>/accept/regexes.xml, where <lang> is the ISO 639-3 language code. Regexes for finding generic cross-language entities are defined in data/regex/xxx/accept/regexes.xml, where xxx is the cross-language code.
You can modify these files if you want to add new patterns to extract the same entity type. To extract new entity types that have predictable patterns, you can add a new XML regex file to either the language-specific or generic location. REX uses the Tcl regex format for defining the regex patterns.
Each regex is defined in a regexp element, which may contain a lang attribute and may refer to define elements. The lang attribute designates the language to which this regex applies. If the regex applies to text in any language, then do not include a lang attribute. All the regexes in data/regex/eng, for example, should include lang="eng". The regexes in data/regex/xxx do not include the lang attribute, which means that they may apply to text in any language.
A define element contains a regex and a name attribute, allowing you to include this regex in multiple regexp elements.
Example:
<define lang="eng" name="time_ampm">(?:[pa]\.?\s?m\.?)</define>
<!-- ... stands for the rest of the regular expression in which the reference is embedded -->
<regexp lang="eng" type="TEMPORAL:TIME">...${time_ampm}...</regexp>
If ${time_ampm} appears in a regexp lang="eng" element, REX substitutes the matching define for this expression. If it does not find a define name="time_ampm" lang="eng" element, REX looks for a define name="time_ampm" element without the lang attribute. If it does not find such an element, an error occurs.
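The lookup order described above (language-specific define first, then the define without a lang attribute, then an error) can be sketched as follows. This is an illustration of the rule, not REX's implementation: the function expand_defines and the dictionary layout are hypothetical, the "digits" define is invented for the example, and nested defines are not handled.

```python
import re

# Hypothetical store: (name, lang) -> regex text; lang is None when the
# define element has no lang attribute.
DEFINES = {
    ("time_ampm", "eng"): r"(?:[pa]\.?\s?m\.?)",
    ("digits", None): r"\d+",
}


def expand_defines(pattern: str, lang: str, defines: dict) -> str:
    """Replace each ${name} reference with the text of its define."""

    def lookup(match: re.Match) -> str:
        name = match.group(1)
        # Prefer the language-specific define, then the language-neutral one.
        for key in ((name, lang), (name, None)):
            if key in defines:
                return defines[key]
        raise KeyError(f"no define named {name!r} for lang {lang!r}")

    return re.sub(r"\$\{(\w+)\}", lookup, pattern)


if __name__ == "__main__":
    print(expand_defines(r"${digits}\s?${time_ampm}", "eng", DEFINES))
```

Resolving references before compiling the pattern means a missing define is reported once, at load time, rather than at match time.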
If you include an id attribute setting, that value is returned as the "subsource" of an entity returned by this regexp.
Define an entity type (or add to an existing type) from a regular expression
A regex can be used to extract new entity types that have predictable patterns, by adding a new XML regex file to either the language-specific or generic location. For example, to add a regex for battalion designations, create a regex file such as:
Example 1. Battalion designation regex
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<!DOCTYPE regexps PUBLIC "-//basistech.com//DTD RLP Regular Expression Config 7.1//EN"
"urn:basistech.com:7.1:rlpregexp.dtd">
<!-- File: military-unit-regex.xml -->
<regexps>
<!-- Ordinals, as in: 101st -->
<define lang="eng" name="num_ordinal">\d*(?:1st|2nd|3rd|4th|5th|6th|7th|8th|9th|0th|11th|12th|13th)</define>
<!-- Battalion labels, partial list, as in: Bn -->
<define lang="eng" name="label_battalion">(?:bn|battalion|squadron)</define>
<!-- Regiment labels, partial list, as in: Cavalry [Regiment] -->
<define lang="eng" name="label_regiment">(?:armor|cavalry|infantry)(?:\s+regiment)?</define>
<!-- Battalion designation, as in: 2nd Battalion, 1st Infantry -->
<regexp lang="eng" type="MILITARY:UNIT">(?i)${num_ordinal}\s+${label_battalion}
(?:,\s+${num_ordinal}\s+${label_regiment})?</regexp>
</regexps>
Place this file in the product regex directory: <install-directory>/roots/rex/<version>/data/regex/eng/accept/military-unit-regex.xml.
To make the regex apply across all languages, replace eng with xxx in the file path and remove the lang="eng" attributes from the regex file.
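Before deploying a pattern, it can help to try it against sample text. The sketch below rebuilds the battalion pattern with Python's re module, with the define references expanded by hand. Two assumptions apply: REX uses Tcl regex syntax, which differs from Python's in some features, and the pattern is treated as case-insensitive so that the lowercase labels match "Battalion" and "Infantry".

```python
import re

# The three define bodies from the example file, copied verbatim.
NUM_ORDINAL = r"\d*(?:1st|2nd|3rd|4th|5th|6th|7th|8th|9th|0th|11th|12th|13th)"
LABEL_BATTALION = r"(?:bn|battalion|squadron)"
LABEL_REGIMENT = r"(?:armor|cavalry|infantry)(?:\s+regiment)?"

# Battalion designation with an optional regiment part.
BATTALION = re.compile(
    rf"{NUM_ORDINAL}\s+{LABEL_BATTALION}"
    rf"(?:,\s+{NUM_ORDINAL}\s+{LABEL_REGIMENT})?",
    re.IGNORECASE,
)


def find_units(text: str) -> list:
    """Return every battalion designation found in the text."""
    return [m.group(0) for m in BATTALION.finditer(text)]


if __name__ == "__main__":
    print(find_units("He joined the 2nd Battalion, 1st Infantry Regiment in 1990."))
    print(find_units("The 101st Bn moved north."))
```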
Regexes for rejecting entities follow the same format as accept regexes, except that the wildcard (*) is allowed in the entity type. To create regexes that reject entities, put the file in data/regex/<lang>/reject, or in data/regex/xxx/reject for multiple languages. For example, the following .xml file in data/regex/eng/reject rejects "Baltimore" as a LOCATION entity when processing English documents.
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE regexps PUBLIC "-//basistech.com//DTD RLP Regular Expression Config 7.1//EN"
"urn:basistech.com:7.1:rlpregexp.dtd">
<regexps>
<regexp lang="eng" type="LOCATION">Baltimore</regexp>
</regexps>
Note: Lookbehind assertions are not supported.
Regexes can improve the extraction when used with a statistical model or a gazetteer. The supplemental regexes are in the data/regex/<lang>/accept/supplemental directory.
In Rosette Enterprise, the supplemental regexes are not loaded unless they are listed under supplementalRegularExpressionPaths in the rex-factory-config.yaml file. To remove a file that has been activated, comment out its entry in that file.
The following languages have supplemental regexes:
A gazetteer is a list of exact matches in a predefined, closed class. For example, you can use a gazetteer to match all the countries in the world, as there is a precise and unambiguous list of countries. An entry would count as ambiguous if it has multiple possible meanings, such as "Apple" which could be either an ORGANIZATION or a fruit. The gazetteers are very fast at extracting entities. If you are searching for specific words or phrases in your data, you can use a custom gazetteer to quickly find them.
When creating custom gazetteers, put the new file in the appropriate location in the data/gazetteer tree. If the file is for finding French entities, put it in data/gazetteer/fra/accept. If it extracts entities in multiple languages, put it in data/gazetteer/xxx/accept.
A gazetteer file is a .txt file that is encoded in UTF-8, and each comment line is prefixed with #. The first non-comment line is TYPE[:SUBTYPE], where the type is required and subtype is optional. They are applied to the entire gazetteer and define the entity type name for output. Type and subtype may be predefined or user-defined.
In order to match entities with differences in whitespace, gazetteer entries and potential matches are space normalized to treat any whitespace between words as a single space.
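Space normalization of this kind can be sketched in a couple of lines. The helper names below are hypothetical, and the sketch covers only whitespace: it makes no claim about how REX treats case or punctuation. Note that this version also drops leading and trailing whitespace.

```python
def normalize_spaces(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    return " ".join(text.split())


def matches_entry(entry: str, candidate: str) -> bool:
    """Compare a gazetteer entry and a candidate after space normalization."""
    return normalize_spaces(entry) == normalize_spaces(candidate)


if __name__ == "__main__":
    print(matches_entry("Mycobacterium tuberculosis", "Mycobacterium \n tuberculosis"))
```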
For example, to track common infectious diseases you can create a gazetteer like this:
# File: infectious-diseases-gazetteer.txt
#
DISEASE:INFECTIOUS
tuberculosis
e. coli
malaria
influenza
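The file format can be checked with a small reader before the file is deployed. The sketch below follows the rules stated above (comment lines start with #, the first non-comment line is TYPE[:SUBTYPE], the remaining lines are entries). Skipping blank lines is an assumption, and parse_gazetteer is a hypothetical helper, not part of REX.

```python
def parse_gazetteer(text: str):
    """Return (type, subtype_or_None, entries) for gazetteer file content."""
    header = None
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # comment or (by assumption) blank line
        if header is None:
            header = line  # first non-comment line: TYPE[:SUBTYPE]
        else:
            entries.append(line)
    if header is None:
        raise ValueError("gazetteer has no TYPE[:SUBTYPE] line")
    type_name, _, subtype = header.partition(":")
    return type_name, (subtype or None), entries


SAMPLE = """# File: infectious-diseases-gazetteer.txt
#
DISEASE:INFECTIOUS
tuberculosis
e. coli
malaria
influenza
"""

if __name__ == "__main__":
    print(parse_gazetteer(SAMPLE))
```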
A single gazetteer may not be enough; you can create as many gazetteers as you need. To search for the scientific names of the infectious disease, you can create a file like this:
# File: latin-infectious-gazetteer.txt
#
DISEASE:INFECTIOUS
Mycobacterium tuberculosis
Escherichia coli
Plasmodium malariae
Orthomyxoviridae
To track certain diseases by their causes:
# File: infectious-bacterial-gazetteer.txt
#
DISEASE:BACTERIAL
Escherichia coli
E. coli
Staphylococcus aureus
Streptococcus pneumoniae
Salmonella
Or to track the drugs used to treat them:
# File: antimicrobial-drugs-gazetteer.txt
#
DRUG:ANTIMICROBIAL
methicillin
vancomycin
macrolide
fluoroquinolone
For enhanced performance, REX internally compiles gazetteer text files into a binary format before performing entity extraction.
Add New Values to an Existing Type
In this example, we add a list of military ranks to the existing entity type TITLE by creating a gazetteer file such as:
Example 2. Military ranks gazetteer
# File: military-rank-gazetteer.txt
TITLE
private
private first class
lance corporal
corporal
sergeant
...
When creating custom gazetteers, place the new file in the appropriate location in the data/gazetteer tree, based on language. If the file contains French entities, put it in data/gazetteer/fra/accept. If it contains entities in multiple languages, put it in data/gazetteer/xxx/accept.
Since this file contains only English entities, place it in <install-directory>/roots/rex/<version>/data/gazetteer/eng/accept/military-rank-gazetteer.txt.
Restart Rosette Enterprise to use the new definitions.
Note
This example will cause the word "private" to be identified as a TITLE wherever it occurs, which may not be the desired behavior. Addressing this problem requires retraining the statistical models, which is described in the REX Application Developer's Guide.
Instead of adding entity types you can define a list of entities to reject if they are matched. These are reject gazetteers.
The format of a reject gazetteer is identical to the format of an accept gazetteer, except that the wildcard (*) is allowed in the entity type. As with accept gazetteers, they are arranged by language. If, for example, it is for rejecting German entities, put it in data/gazetteer/deu/reject. If it is for rejecting entities in multiple languages, put it in data/gazetteer/xxx/reject.
For example, the following .txt file in data/gazetteer/eng/reject rejects the PERSON entity named "George Watson" when processing English documents.
PERSON
George Watson
A wildcard entity type matches any type. With the following file, the value "George Watson" is rejected from all entity types, not just PERSON.
*
George Watson
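The effect of typed and wildcard reject entries can be sketched as a filter over extracted entities. This illustrates the behavior described above and is not REX's implementation; the function apply_reject and the data shapes (tuples of type and mention, a dictionary of reject rules) are hypothetical.

```python
def apply_reject(entities: list, reject_rules: dict) -> list:
    """Drop entities whose mention is rejected for their type or for "*".

    entities:     list of (entity_type, mention) tuples
    reject_rules: maps an entity type, or "*" for every type,
                  to a set of mentions to reject
    """
    rejected_for_all = reject_rules.get("*", set())
    kept = []
    for entity_type, mention in entities:
        if mention in rejected_for_all:
            continue  # wildcard rule: rejected whatever the type
        if mention in reject_rules.get(entity_type, set()):
            continue  # typed rule: rejected for this type only
        kept.append((entity_type, mention))
    return kept


if __name__ == "__main__":
    found = [("PERSON", "George Watson"),
             ("ORGANIZATION", "George Watson"),
             ("LOCATION", "Boston")]
    print(apply_reject(found, {"PERSON": {"George Watson"}}))
    print(apply_reject(found, {"*": {"George Watson"}}))
```

With the typed rule, the ORGANIZATION mention of "George Watson" survives; with the wildcard rule, both mentions are removed.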