Entities are the key actors in your text data: the organizations, people, locations, products, and dates mentioned in documents. Rosette uncovers these entities, delivering structure, clarity, and insight to your data with adaptability, easy deployment, and consistent accuracy and performance across a broad range of languages and text genres.
Rosette Entity Extractor (REX) ingests text and identifies people, locations, and organizations, in addition to many other entity types including product, date/time, URL, and email. These entities can be used to add structured metadata to a document or in downstream natural language processing (NLP) tasks, such as extracting themes and ideas, sentiment analysis, and relationship extraction.
Entity Extraction
REX comes with multiple entity extraction processors along with a linker processor to link entities to a knowledge base. In case of conflicting entities, a redactor decides which entity extraction result “wins.” REX has extensive customization features, including adding new entity patterns to the pattern-matching processor and new entity lists to the exact match processor. You can add a custom processor to systematically process REX results. Numerous configuration settings let you fit REX to your specific use case.
Entity Linking
REX has an entity linking processor that identifies the real-world entities behind the mentions extracted from the text and disambiguates between different entities with the same name. Entity linking can determine not only that "Tim Cook" is a person, but also which "Tim Cook" is meant. For example, is he the CEO of Apple or a political science scholar? The entity linking processor looks at the context of each extracted entity to link entities against Wikidata. REX supports linking to other public knowledge bases as well as your organization's custom knowledge bases.
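The idea of context-based disambiguation can be illustrated with a small sketch. This is not REX's linking algorithm: the knowledge base, its IDs (`KB-001`, `KB-002`), the descriptions, and the word-overlap scoring are all hypothetical and chosen only to show how surrounding context can separate two entities that share a name.

```python
import re

# Hypothetical mini knowledge base: IDs and descriptions are illustrative only
# and do not correspond to real Wikidata entries.
KNOWLEDGE_BASE = {
    "Tim Cook": [
        {"id": "KB-001", "description": "chief executive officer of Apple technology company"},
        {"id": "KB-002", "description": "political science scholar and university professor"},
    ]
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def link(mention, context):
    """Return the ID of the candidate whose description shares the most
    words with the context, or None if the mention is unknown."""
    candidates = KNOWLEDGE_BASE.get(mention, [])
    if not candidates:
        return None
    context_words = tokenize(context)
    best = max(
        candidates,
        key=lambda c: len(tokenize(c["description"]) & context_words),
    )
    return best["id"]


if __name__ == "__main__":
    print(link("Tim Cook", "Tim Cook unveiled the new Apple iPhone as chief executive."))
    print(link("Tim Cook", "Tim Cook published a political science paper at the university."))
```

A production linker would use far richer signals than word overlap; the sketch only shows why the same mention string can resolve to different knowledge base entries depending on context.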
Adaptation & Customization
REX gives you a good start, but as with any natural language processor, Basis Technology assumes you will need to adapt REX to your specific task for best results. A field training kit (FTK) lets you optimize REX’s performance on your specific data or add new entity types to the statistical model. The statistical model is context-sensitive: it identifies entities based on the context they appear in, and thus can find names of people even when a name has been misspelled. See Customizing the Statistical Models with the FTK. The FTK also enables you to perform entity linking against your own entity knowledge base.
Basic Entity Extraction with REX:
Using Rosette base linguistics, REX processes plain text input into sentences and tokens.
Entities are extracted by running the tokens through the statistical processor or DNN, regexes, and gazetteers. If the linker is enabled, the tokens are also run through the linker processor to link entities to a knowledge base.
Reject regexes and gazetteers may remove entities from the output. Some adjacent entities may be combined by the joiner into a single result. The final entities are selected by running the extractor results through the redactor.
The final extracted entities are returned as output.
Processors for Entity Extraction
REX uses multiple complementary methods to identify entity mentions in the input text: statistical models, pattern matching, and exact matching. With REX version 7.32, we added a deep neural network model which is currently in beta. Pattern-matching and exact matching processors can run in parallel with the statistical or the deep neural network processors, but the statistical and deep neural network processors cannot be used simultaneously.
Statistical Processor: The statistical processor uses contextual features of the input to identify entities. Built with computational linguistics techniques, it has been trained on a body of annotated news stories to extract a variety of entities in a number of languages.
Pattern Matching Processor (regular expressions): Regular expressions (regexes) are a good way to identify language-specific entities and generic entities that appear in a variety of languages. You can modify the standard regexes that we supply, and add your own regexes.
Exact Matching Processor (gazetteers): Gazetteers (entity lists) return exact matches to a predefined list. The REX distribution includes gazetteers for each language and a number of entity types, and a cross-language gazetteer for corporation names (as the name of the corporation does not generally change when it enters international markets). You can modify the standard gazetteers that we supply and add your own gazetteers to extract new entities or entity types.
Deep Neural Network Processor: This processor uses a model trained using a deep neural network. It is slower than the statistical processor, but has shown an error reduction of about 10% for English and Arabic and 30% for Korean, as measured by F-score, when extracting person, location, organization, and title entities. The model is trained on the same data as the statistical model. It is based on an LSTM neural network and is backed by the TensorFlow library.
Note
The deep neural network processor is currently available in English, Arabic, Hebrew, and Korean.
Name Classifier Processor: This processor predicts entity types for text that lacks the syntactic context of complete sentences. It can extract entities from structured text, such as list items and tables, which typically contains text fragments instead of full sentences.
Redaction: When two processors return the same or overlapping entities, the redactor chooses an entity based on the length of the competing entity strings. You can also configure the redactor to choose which same-length mention to return based on entity type and/or processor.
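The interplay between pattern matching, exact matching, and redaction described above can be sketched as follows. This is a toy illustration, not REX code: the email pattern, the gazetteer contents, the `Mention` type, and the tie-breaking rule (earliest start wins among same-length mentions) are assumptions made for the example.

```python
import re
from typing import NamedTuple


class Mention(NamedTuple):
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    text: str
    type: str
    source: str  # which toy processor produced the mention


# Illustrative regex and entity list; not the ones shipped with REX.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
GAZETTEER = {"ORGANIZATION": ["Apple", "Apple Inc"]}


def regex_mentions(text):
    """Pattern-matching step: find email addresses."""
    return [
        Mention(m.start(), m.end(), m.group(), "IDENTIFIER:EMAIL", "regex")
        for m in EMAIL_RE.finditer(text)
    ]


def gazetteer_mentions(text):
    """Exact-matching step: find whole-word occurrences of listed names."""
    found = []
    for entity_type, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                found.append(Mention(m.start(), m.end(), m.group(), entity_type, "gazetteer"))
    return found


def redact(mentions):
    """Among overlapping mentions keep the longest; ties go to the earliest start."""
    kept = []
    for cand in sorted(mentions, key=lambda m: (-(m.end - m.start), m.start)):
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda m: m.start)


if __name__ == "__main__":
    sample = "Contact Apple Inc at press@apple.example for details."
    for mention in redact(regex_mentions(sample) + gazetteer_mentions(sample)):
        print(mention.type, repr(mention.text), mention.source)
```

Here the gazetteer returns both "Apple" and "Apple Inc" for the same span, and the redaction step keeps only the longer string, mirroring the length-based rule described for the redactor.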
Processors to Customize Results
These processors run on the extracted entities to further customize the results.
Joining: You can use a configuration file and the API to establish rules for joining adjacent entities into one (such as joining titles with personal names).
Rejections: You can define regexes and gazetteers to reject entities that otherwise may be returned.
Indoc Coref: In a single document, REX chains together mentions that refer to the same entity (i.e., in-document coreference).
Linker Processor: This processor extracts and links entity mentions to a knowledge base of known entities, each with a unique ID. This processor is disabled by default. REX is shipped with a prepackaged default knowledge base linking entity mentions to a Wikidata QID. You can replace the default entity knowledge base with a custom knowledge base.
Notice
Currently, the linker performs its own entity extraction and does NOT use entities found by the default entity extraction processors (statistical, pattern-matching, exact-matching). Therefore, the linker processor’s entities will not necessarily match those from the default entity extraction processors.
Pronominal Resolver: REX tries to resolve pronouns to their antecedent entities. This processor is disabled by default and is available only for English.
Custom Processor by User: REX allows the user to write custom processors and insert them into the REX pipeline prior to the input phase or the entity redactor phase.
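To make the joining and rejection steps more concrete, here is a minimal sketch. It does not use the REX configuration file or API: the `Entity` type, the `span` helper, the reject list, and the rule "a TITLE immediately followed by a PERSON becomes one PERSON" are hypothetical stand-ins for whatever rules you would configure.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int  # character offset where the entity begins
    end: int    # character offset just past the entity
    type: str


# Hypothetical reject gazetteer: exact strings that should never be returned.
REJECT_LIST = {"Mr. Nobody"}


def span(text, substring, entity_type):
    """Helper for the example: build an Entity from the first occurrence of substring."""
    start = text.index(substring)
    return Entity(start, start + len(substring), entity_type)


def reject(text, entities):
    """Drop entities whose surface text appears in the reject list."""
    return [e for e in entities if text[e.start:e.end] not in REJECT_LIST]


def join_title_person(text, entities):
    """Merge a TITLE that is followed by a PERSON, with only whitespace
    between them, into a single PERSON entity."""
    joined = []
    for ent in sorted(entities, key=lambda e: e.start):
        prev = joined[-1] if joined else None
        if (
            prev is not None
            and prev.type == "TITLE"
            and ent.type == "PERSON"
            and prev.end <= ent.start
            and text[prev.end:ent.start].strip() == ""
        ):
            joined[-1] = Entity(prev.start, ent.end, "PERSON")
        else:
            joined.append(ent)
    return joined


if __name__ == "__main__":
    sample = "President Lincoln met Mr. Nobody in Washington."
    extracted = [
        span(sample, "President", "TITLE"),
        span(sample, "Lincoln", "PERSON"),
        span(sample, "Mr. Nobody", "PERSON"),
        span(sample, "Washington", "LOCATION"),
    ]
    for e in join_title_person(sample, reject(sample, extracted)):
        print(e.type, repr(sample[e.start:e.end]))
```

The order in which the two steps run here (reject, then join) is a choice made for the sketch.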
REX is pre-trained to extract the following entity types.
LOCATION
A city, state, country, region, or other location that contains both a population and a government.
A geographic place such as a body of water, mountain, park, or address.
A structure such as a building or monument.
ORGANIZATION
A corporation, institution, government agency, or other group of people defined by an established organizational structure.
PERSON
TITLE
NATIONALITY
RELIGION
IDENTIFIER:CREDIT_CARDNUM
IDENTIFIER:DISTANCE*
IDENTIFIER:EMAIL
IDENTIFIER:LATITUDE_LONGITUDE*
IDENTIFIER:MONEY
IDENTIFIER:CURRENCY_AMT and IDENTIFIER:CURRENCY_TYPE
If CURRENCY is enabled, MONEY extractions are replaced with CURRENCY_AMT and CURRENCY_TYPE whenever possible, that is, when both the amount and the type can be extracted. If the extracted value cannot be split, MONEY may be extracted instead.
To enable CURRENCY, set regexCurrencySplit to true. By default, it is set to false.
IDENTIFIER:PERSONAL_ID_NUM
IDENTIFIER:PHONE_NUMBER
IDENTIFIER:URL
IDENTIFIER:UTM*
TEMPORAL:DATE
TEMPORAL:TIME
Entity types marked with a * are not returned by default. Activate them by instructing REX to load the supplemental regexes in each language’s supplemental directory.
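The MONEY versus CURRENCY_AMT/CURRENCY_TYPE behavior described in the list above can be pictured with a toy function. This is only a sketch of the documented fallback logic, not the regexes REX uses: the two patterns, the supported symbols, and the `currency_split` parameter (standing in for the `regexCurrencySplit` setting) are assumptions.

```python
import re

# Illustrative patterns only: an amount followed by a three-letter code,
# or a currency symbol followed by an amount.
AMOUNT_THEN_CODE = re.compile(r"^(?P<amt>\d[\d,]*(?:\.\d+)?)\s*(?P<type>[A-Z]{3})$")
SYMBOL_THEN_AMOUNT = re.compile(r"^(?P<type>[$€£¥])\s*(?P<amt>\d[\d,]*(?:\.\d+)?)$")


def split_money(money_text, currency_split=False):
    """Return (entity type, text) pairs for one MONEY extraction.

    With currency_split enabled, emit CURRENCY_AMT and CURRENCY_TYPE when both
    parts can be found; otherwise fall back to a single MONEY entity.
    """
    if currency_split:
        for pattern in (AMOUNT_THEN_CODE, SYMBOL_THEN_AMOUNT):
            match = pattern.match(money_text.strip())
            if match:
                return [
                    ("IDENTIFIER:CURRENCY_AMT", match.group("amt")),
                    ("IDENTIFIER:CURRENCY_TYPE", match.group("type")),
                ]
    return [("IDENTIFIER:MONEY", money_text)]


if __name__ == "__main__":
    print(split_money("$1,500.00", currency_split=True))
    print(split_money("250 USD", currency_split=True))
    print(split_money("a few dollars", currency_split=True))
    print(split_money("250 USD"))
```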
When the call includes {"options": {"includeDBpediaTypes": true}}, Rosette supports additional top-level entity types and over 700 additional types drawn from the DBpedia ontology. Entity linking must be enabled to return DBpedia entity types.
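As a rough sketch of how such a request body could be assembled, the snippet below builds the JSON with the `includeDBpediaTypes` option from the text. The `content` field name and the `build_entities_request` helper are assumptions for illustration; check the API reference for the exact request shape and for how entity linking is enabled.

```python
import json


def build_entities_request(text, include_dbpedia_types=True):
    """Build a JSON request body as a string.

    'content' is an assumed field name for the input text; the 'options'
    object follows the form quoted in the documentation.
    """
    payload = {"content": text}
    if include_dbpedia_types:
        payload["options"] = {"includeDBpediaTypes": True}
    return json.dumps(payload)


if __name__ == "__main__":
    print(build_entities_request("Tim Cook leads Apple."))
```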
There are several ways to train REX to extract entity types beyond the standard set.
Create new gazetteers (i.e., entity lists).
Create new regexes for entities that fit a pattern, such as telephone numbers.
Use the FTK to retrain the statistical processor on your data, adapting it to the syntax and vocabulary of your text and domain.
Note
The FTK is not included with the standard REX distribution but is free to any licensee of REX. Contact support@rosette.com to get the FTK package.
The following tables describe the entity types returned by the different REX processors for each supported language.
Key to processor used to identify each entity type:
S = statistical processor
G = exact matching processor (gazetteer)
R = pattern matching processor (regex)
L = entity linking available
D = deep neural network processor
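When reading the tables that follow, each cell is a slash-separated list of the codes above. A small helper like the one below can expand a cell into readable names; the function name `decode_cell` is hypothetical and the mapping simply restates the key.

```python
# Mapping taken directly from the key above.
PROCESSOR_KEY = {
    "S": "statistical processor",
    "G": "exact matching processor (gazetteer)",
    "R": "pattern matching processor (regex)",
    "L": "entity linking available",
    "D": "deep neural network processor",
}


def decode_cell(cell):
    """Expand a table cell such as 'S/G/D/L' into a list of descriptions.

    An empty cell means the entity type is not listed for that language
    and yields an empty list.
    """
    codes = [code for code in cell.strip().split("/") if code]
    return [PROCESSOR_KEY[code] for code in codes]


if __name__ == "__main__":
    print(decode_cell("S/G/D/L"))
    print(decode_cell(""))
```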
Table 1. Statistical, Exact Match (Gazetteer) Extracted Entities, and Linked Entities
| Language (ISO code) \ Entity Type | LOC | ORG | PER | PROD | TTL | NAT | REL |
|---|---|---|---|---|---|---|---|
| Arabic (ara) | S/G/D/L | S/G/D/L | S/D/L | L | S | G | G |
| Chinese, Script-insensitive (zho) | S/G/L | S/G/L | S/L | L | S | G | G |
| Chinese, Simplified (zhs) | S/G/L | S/G/L | S/L | L | S | G | G |
| Chinese, Traditional (zht) | S/G/L | S/G/L | S/L | L | S | G | G |
| Dutch (nld) | S/L | S/G/L | S/L | L | G |  |  |
| English (eng) | S/G/L/D | S/R/G/L/D | S/L/D | S/L | S | G | G |
| French (fra) | S/L | S/G/L | S/L | L | S |  |  |
| German (deu) | S/L | S/G/L | S/L | L | S |  |  |
| Hebrew (heb) | S/L/D | S/G/L/D | S/L/D | L |  |  |  |
| Hungarian (hun) | S/G/L | S/G/L | S/G/L | S/L | S |  |  |
| Indonesian (ind) | S/G/L | S/G/L | S/L | L |  |  |  |
| Italian (ita) | S/L | S/G/L | S/L | L | S |  |  |
| Japanese (jpn) | S/L | S/G/L | S/L | L | S | G | G |
| Korean (kor) | S/D/L | S/G/D/L | S/D/L | L | S | G | G |
| Malay, Standard (zsm) | S/G/L | S/G/L | S/L | L |  |  |  |
| Pashto (pus) | S/L | S/G/L | S/L | L | S |  |  |
| Persian (fas) | S/L | S/G/L | S/L | L | G | G | G |
| Portuguese (por) | S/L | S/G/L | S/L | L | S |  |  |
| Russian (rus) | S/L | S/G/L | S/L | L | S | G | G |
| Spanish (spa) | S/L | S/G/L | S/L | L | S |  |  |
| Swedish (swe) | S/L | S/G/L | S/L | L | S | S/G | S/G |
| Urdu (urd) | S/L | S/G/L | S/L | L | G |  |  |
| Vietnamese (vie) | S/L | S/L | S/L | L | G | G | G |
Table 2. Rule-based Extracted Entities
Entity types marked [a] are not returned by default.
| Language (ISO code) \ Entity Type | CC# | Dist [a] | EM | LATLNG [a] | MONEY/CURRENCY | PERS ID | TEL# | URL | UTM [a] | DATE [a] | TIME |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Arabic (ara) | R | R | R | R | R | R | R | R | R | R | R |
| Chinese, Script-insensitive (zho) | R | R | R | R | R | R | R | R | R | R | R |
| Chinese, Simplified (zhs) | R | R | R | R | R | R | R | R | R | R | R |
| Chinese, Traditional (zht) | R | R | R | R | R | R | R | R | R | R | R |
| Dutch (nld) | R | R | R | R | R | R | R | R | R | R | R |
| English (eng) | R | R | R | R | R | R | R | R | R | R | R |
| French (fra) | R | R | R | R | R | R | R | R | R | R | R |
| German (deu) | R | R | R | R | R | R | R | R | R | R | R |
| Hebrew (heb) | R | R | R | R | R | R | R | R | R | R | R |
| Hungarian (hun) | R | R | R | R | R | R | R | R | R | R | R |
| Indonesian (ind) | R | R | R |  | R | R | R | R | R | R | R |
| Italian (ita) | R | R | R | R | R | R | R | R | R | R | R |
| Japanese (jpn) | R | R | R | R | R | R | R | R | R | R | R |
| Korean (kor) | S | S/G | S |  | S | G | G |  |  |  |  |
| Malay, Standard (zsm) | R | R | R |  | R | R | R | R | R | R | R |
| Pashto (pus) | R | R | R | R | R | R | R | R | R | R | R |
| Persian (fas) | R | R | R | R | R | R | R | R | R | R | R |
| Portuguese (por) | R | R | R | R | R | R | R | R | R | R | R |
| Russian (rus) | R | R | R | R | R | R | R | R | R | R | R |
| Spanish (spa) | R | R | R | R | R | R | R | R | R | R | R |
| Swedish (swe) | R | R | R | R | R | R | R | R | R | R | R |
| Urdu (urd) | R |  | R |  | R | R | R | R | R |  |  |
| Vietnamese (vie) | R | R | R |  | R | R | R | R |  | R | R |