The trained entity extraction models are moved from the REX Training Server to the production instance of Rosette Server through the following steps:
The REX training server is not used for entity extraction once the model is trained.
This script performs the steps above:
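#!/bin/bash
# Arguments: <lang> <model_id>
set -e
lang=$1
model_id=$2
model_location=/custom_profiles/rex/data/statistical/$lang/model_LE.bin
# Download the trained model from the REX Training Server.
curl -OJ "http://RTS_SERVER:9080/model/rex/download-model?language=$lang&modelId=$model_id"
# Copy the model to the production instance of Rosette Server.
scp "${lang}.bin" "ROSETTE:$model_location"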
Export the Entity Extraction Model
Export the trained model from the Rosette Model Training Suite.
From Rosette Adaptation Studio:
- Open the project that trained the model you are interested in.
- Select Manage from the project navigation bar.
- From the Model Training Status block, select Export Model.
If Export Model is not enabled, the model is not ready to be exported.
The trained model will download to your machine.
To avoid unintentionally overwriting the model on the production server, the model downloaded from Rosette Adaptation Studio does not follow the REX naming convention. Rename the model before uploading it to the production instance of Rosette Server.
Tip
Model Naming Convention
The prefix must be "model." and the suffix must be "-LE.bin". Any alphanumeric ASCII characters are allowed in between.
Example valid model names:
- model.fruit-LE.bin
- model.customer4-LE.bin
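The naming convention above can be checked before a model is uploaded. The following is a minimal sketch, assuming a POSIX shell with grep available. The function name is_valid_model_name is hypothetical and not part of Rosette Server, and the check assumes at least one alphanumeric character is required between the prefix and the suffix.

```shell
# Hypothetical helper: succeeds only if the file name follows the
# model.<alphanumeric>-LE.bin convention (assumes a non-empty middle part).
is_valid_model_name() {
  printf '%s\n' "$1" | grep -Eq '^model\.[A-Za-z0-9]+-LE\.bin$'
}

# Example: rename an exported model only if the new name is valid.
# "exported.bin" is a placeholder for the file downloaded from the Studio.
# new_name=model.fruit-LE.bin
# is_valid_model_name "$new_name" && mv exported.bin "$new_name"
```

Validating the name first avoids copying a file that does not match the convention to the production server.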
Upload the Model to the Production Server
Copy the model file to the production server. You can either:
- Copy the model file to the default data directory in the REX root folder:
  <RosetteServerInstallDir>/roots/rex/<version>/data/statistical/<lang>/<modelfile>
  where <lang> is the 3-letter language code for the model.
- Copy the model to the data directory of a custom profile:
  <profile-data-root>/<profileId>/data/statistical/<lang>/<modelfile>
  where <lang> is the 3-letter language code for the model.
  The custom profile must be set up as described in Setting up Custom Profiles.
A custom profile allows multiple configurations, each with its own data files, models, gazetteers, and settings, to exist on the same instance of Rosette Server. For complete instructions on how to create and configure a custom profile, refer to Creating a Custom Profile.
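The two destination layouts above can be assembled from their parts. The following is a minimal sketch: the function names are hypothetical, and the values in the usage comment (profile data root, profile ID, language code) are placeholders to be replaced with those of your installation.

```shell
# Path in the default data directory of the REX root folder.
# Arguments: install dir, REX version, 3-letter language code, model file name.
rex_root_model_path() {
  printf '%s/roots/rex/%s/data/statistical/%s/%s\n' "$1" "$2" "$3" "$4"
}

# Path in the data directory of a custom profile.
# Arguments: profile data root, profile ID, 3-letter language code, model file name.
profile_model_path() {
  printf '%s/%s/data/statistical/%s/%s\n' "$1" "$2" "$3" "$4"
}

# Usage with placeholder values (ROSETTE is the production host, as in the script above):
# scp model.fruit-LE.bin "ROSETTE:$(profile_model_path /custom_profiles rex eng model.fruit-LE.bin)"
```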
Calling the /entities Endpoint
https://<PRODSERVER>/rest/v1/entities
Rosette Entity Extraction (REX) uses statistical or deep neural network based models, patterns, and exact matching to identify entities in documents. An entity refers to an object of interest such as a person, organization, location, date, or email address. Identifying entities can help you classify documents and the kinds of data they contain.
The statistical models are based on computational linguistics and human-annotated training documents. The patterns are regular expressions that identify entities such as dates, times, and geographical coordinates. The exact matcher uses lists of entities to match words exactly in one or more entities.
Through the Rosette Model Training Suite you can customize, retrain, or train new statistical models to improve the extraction results in your domain. The two primary types of customization are:
The custom models can be deployed alongside the BasisTech statistical model.
Call the /info method to list all entity types known by the /entities endpoint:
https://<PRODSERVER>/rest/v1/entities/info
Tip
Entity linking must be enabled to return DBpediaTypes and PermIDs.
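The /info call can be issued with curl in the same way as the extraction requests. The following is a minimal sketch that assumes the endpoint answers a plain GET request. The helper name entities_info_url is hypothetical, and <PRODSERVER> remains a placeholder for your production host.

```shell
# Hypothetical helper: build the /entities/info URL from a base URL,
# tolerating one trailing slash on the base.
entities_info_url() {
  printf '%s/rest/v1/entities/info\n' "${1%/}"
}

# Usage (assumes GET is accepted; replace <PRODSERVER> with your host):
# curl -s -H "Accept: application/json" "$(entities_info_url "https://<PRODSERVER>")"
```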
The models trained by the REX Training Server (RTS) are statistically-trained models. Multiple statistical models can be deployed and used in each call to the entities endpoint.
Example 4. Without Custom Profiles
If you are not using custom profiles, the custom models are automatically used with each call to the entities endpoint.
curl -s -X POST \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-H "Cache-Control: no-cache" \
-d '{"content": "Sample text for extraction"}' \
"http://<PRODSERVER>/rest/v1/entities"
Example 5. With Custom Profiles
If your installation is using custom profiles, you must specify the profileId of the profile where the model is installed.
curl -s -X POST \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-H "Cache-Control: no-cache" \
-d '{"content": "Sample text for extraction",
"profileId": "<profileId>"}' \
"http://<PRODSERVER>/rest/v1/entities"
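A hand-written JSON body breaks as soon as the text contains a double quote or a backslash. The following is a minimal sketch of a request-body builder, assuming a POSIX shell with sed. The function names are hypothetical, and only backslashes and double quotes are escaped; newlines and other control characters are not handled.

```shell
# Escape backslashes and double quotes for use inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build the /entities request body.
# Arguments: content text, optional profileId.
entities_payload() {
  if [ -n "${2:-}" ]; then
    printf '{"content": "%s", "profileId": "%s"}\n' \
      "$(json_escape "$1")" "$(json_escape "$2")"
  else
    printf '{"content": "%s"}\n' "$(json_escape "$1")"
  fi
}

# Usage (replace <PRODSERVER> and <profileId> with your own values):
# curl -s -X POST -H "Content-Type: application/json" \
#   -d "$(entities_payload "Sample text for extraction" "<profileId>")" \
#   "http://<PRODSERVER>/rest/v1/entities"
```

Omitting the second argument produces the request body for an installation without custom profiles.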
The redactor determines which entity to choose when multiple mentions for the same entity are extracted. The redactor first chooses longer entity mentions over shorter ones. If the mentions are the same length, the redactor uses weightings to select an entity mention.
Different processors can extract overlapping entities. For example, a gazetteer extracts "Newton", Massachusetts as a LOCATION, and the statistical processor extracts "Isaac Newton" as a PERSON. When two processors return the same or overlapping entities, the redactor chooses an entity based on the length of the competing entity strings. By default, a conflict between overlapping entities is resolved in favor of the longer candidate, "Isaac Newton".
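The default length rule can be pictured with a toy comparison. This is only an illustration of the rule stated above, not the redactor's implementation. The function name prefer_longer is hypothetical, and a tie (which the redactor resolves with weightings) simply returns the first argument here.

```shell
# Toy illustration: print the longer of two competing mentions.
# On a tie the first argument is printed; the real redactor uses weightings.
prefer_longer() {
  if [ "${#1}" -ge "${#2}" ]; then
    printf '%s\n' "$1"
  else
    printf '%s\n' "$2"
  fi
}

# Usage:
# prefer_longer "Newton" "Isaac Newton"
```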
Tip
The correct entity mention is almost always the longer mention. There are exceptions, such as the "Newton" example above, where the shorter mention is the correct one. While turning off the option to prefer length might seem like the easiest fix, it usually fixes only a specific instance while reducing overall accuracy. We strongly recommend keeping the default redactorPreferLength set to true.
The redactor can be configured to set weights by: