The full production installation consists of the following components installed on the same machine:
Model deployment architecture
Trained models from Rosette Model Training Suite must be copied to the Rosette Server production instance to perform entity and event extraction.
- Entity extraction models: Custom-trained models are copied into a directory. This directory may be part of an optional custom profile.
- Events extraction models: Trained models are copied into the production server instance of ETS.
The production instance of Rosette Server must include the Events Training Server. The REX Training Server is not required in the production instance.
Event Extraction Server requirements
The optimal system configuration for the production server depends on the size of the input provided for event extraction. Benchmarks for different server configurations are provided to help you select the proper hardware for the production environment.
Table 15. Server Configurations
System Size | CPU Cores | CPU Threads | Total RAM | RAM allocated to Java Heap
Small | 4 | 8 | 32 GB | 20 GB
Medium | 8 | 16 | 64 GB | 24 GB
Large | 16 | 32 | 64 GB | 32 GB
Overall combined throughput across 20 concurrent users (requests/second)
Table 16. Throughput Measurements (requests/second)
System Size | SMS (50 characters) | Tweet (200 characters) | Email (1000 characters) | Book Chapter (16000 characters)
Small | 49.3 | 26.3 | 8.43 | 0.6
Medium | 107.9 | 58.7 | 18.2 | 1.2
Large | 154.3 | 91.6 | 28.8 | 2.1
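As a rough illustration of how the benchmarks in Table 16 might guide hardware selection, the following Python sketch picks the smallest system size whose measured throughput meets a target rate. The helper name and the rule of rounding the document length up to the next benchmarked length are assumptions made for this example, not part of the product.

```python
# Throughput figures copied from Table 16 (requests/second, 20 concurrent users).
# Inner keys are document lengths in characters.
THROUGHPUT = {
    "Small": {50: 49.3, 200: 26.3, 1000: 8.43, 16000: 0.6},
    "Medium": {50: 107.9, 200: 58.7, 1000: 18.2, 16000: 1.2},
    "Large": {50: 154.3, 200: 91.6, 1000: 28.8, 16000: 2.1},
}


def smallest_sufficient_size(doc_chars, required_rps):
    """Return the smallest system size whose benchmarked throughput meets
    required_rps for documents of about doc_chars characters, or None.

    Hypothetical helper: the document length is rounded UP to the next
    benchmarked length, which is a conservative simplification.
    """
    for size in ("Small", "Medium", "Large"):
        table = THROUGHPUT[size]
        buckets = [n for n in sorted(table) if n >= doc_chars]
        if not buckets:
            return None  # longer than any benchmarked document
        if table[buckets[0]] >= required_rps:
            return size
    return None
```

For example, `smallest_sufficient_size(180, 50)` compares against the 200-character column. Treat the result as a starting point for your own load tests rather than a sizing guarantee.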
Install Event Training Server (ETS)
The Events Training Server must be installed on both the training and the Rosette Server production instance (extraction). The same ETS file is installed, either in training or extraction mode.
You must have Docker, docker-compose, and unzip installed.
The product can be installed interactively or with a headless installer.
To install interactively:
- Unzip the file ets-installation-<version>.zip.
- Start the installation:
./install-ets.sh
To run the headless install, use the --headless flag. The .properties file is in the same directory as the installation script.
Use the --dry-run flag to validate the properties file, print the settings, and exit without changing anything.
The Event Training Server installer will prompt you for the following information:
Table 17. Event Training Server Installer Prompts
Prompt | Purpose | Options | Notes
ETS mode | Determine if installation is for training or extraction (production) mode | 1) Training 2) Extraction 3) Exit Installer | Sets the mode. Training mode prompts for location of Rosette Server; extraction mode does not.
Installation directory | Installation directory for Events Training Server files | Default: /<installDir>/ets | This is now the <ETSinstallDir>
Port Event Training Server should listen on | | Default: 9999 | This port and hostname will be required when installing the other servers.
Directory for ETS workspaces | This directory will be mounted as a volume. | Default: /<ETSinstallDir>/workspaces | This directory holds the events models.
Fully qualified host name where Rosette Server is installed | Not asked when installing in extraction mode (production server) | The suggested value will be the host name of your current machine | Cannot be empty, localhost or 127.0.0.1
Port Rosette Server is listening on | Not asked when installing in extraction mode (production server) | Default: 8181 |
Configure Rosette Server for event extraction
Important
The Rosette Server configuration must be updated to support events. The rex-factory-config.yaml installed by the install scripts contains the correct values. You only need to run this update script if you are using a different copy of the YAML file.
- Copy the file ./scripts/update-rs-configuration.sh from the ETS directory to the Rosette Server machine or directory.
- Run the script from the Rosette Server directory:
./update-rs-configuration.sh
The script will prompt you for the following information:
Table 18. Rosette Server Events Update Prompts
Prompt | Purpose | Options | Notes
Should Rosette Server be updated to communicate with Events Training Server? | Rosette Server only communicates with ETS in production. | Y for the production server; N for the training server |
Fully qualified host name where Events Training Server is installed | | The suggested value will be the host name of your current machine | Cannot be empty, localhost or 127.0.0.1
Port Events Training Server is listening on | | Default: 9999 |
Location of Rosette Server configuration | This directory will be mounted as a volume. | Default: /basis/rs/config | The configuration file to customize Rosette Server.
Location of Rosette Server roots | This directory will be mounted as a volume. | Default: /basis/rs/roots |
Event extraction requires specific REX configuration parameters. The install scripts install a version of the rex-factory-config.yaml file containing the correct values for the parameters. The parameters added or modified by the install scripts are in the table below.
Table 19. REX Configuration Parameters for Event Extraction
Parameter | Value for Events | Default Value | Notes
structuredRegionProcessingType | nerModel | NULL | Entire document processed as unstructured text.
calculateConfidence | true | false | Entity confidence values are returned.
resolvePronouns | true | false | REX will resolve pronouns to person entities.
linkEntities | true | false | Entities are disambiguated to a known knowledge base, Wikidata.
caseSensitivity | automatic | caseSensitive | REX determines case sensitivity.
startingWithDefaultConfiguration | true | |
supplementalRegularExpressionPaths | "${rex-root}/data/regex/<lang>/accept/supplemental/date-regexes.xml", "${rex-root}/data/regex/<lang>/accept/supplemental/time-regexes.xml", "${rex-root}/data/regex/<lang>/accept/supplemental/geo-regexes.xml", "${rex-root}/data/regex/<lang>/accept/supplemental/distance-regexes.xml" | | Activate the supplemental regexes for date, time, geo, and distance. These are shipped with REX but need to be activated for each installed language, along with the unspecified (xxx) language.
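Because the supplemental regex paths in Table 19 must be repeated for each installed language, it can help to generate them instead of typing them by hand. The Python sketch below is only an illustration: the function names are hypothetical, and the language codes passed in are examples that you would replace with the languages licensed in your installation.

```python
# File names and path layout are taken from Table 19.
SUPPLEMENTAL_FILES = (
    "date-regexes.xml",
    "time-regexes.xml",
    "geo-regexes.xml",
    "distance-regexes.xml",
)


def supplemental_regex_paths(languages, rex_root="${rex-root}"):
    """Build one path per language and supplemental file.

    languages: iterable of 3-letter language codes, e.g. ["eng", "xxx"].
    rex_root defaults to the ${rex-root} placeholder used in the configuration.
    """
    return [
        f"{rex_root}/data/regex/{lang}/accept/supplemental/{name}"
        for lang in languages
        for name in SUPPLEMENTAL_FILES
    ]


def as_yaml_list(paths, key="supplementalRegularExpressionPaths"):
    """Render the paths as a YAML list of quoted strings under key."""
    lines = [f"{key}:"]
    lines += [f'- "{p}"' for p in paths]
    return "\n".join(lines)


if __name__ == "__main__":
    print(as_yaml_list(supplemental_regex_paths(["eng", "xxx"])))
```

The printed text is intended to be pasted into rex-factory-config.yaml and reviewed before use.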
Rosette event extraction takes advantage of the advanced entity extraction capabilities provided by Rosette entity extractor (REX). REX uses pre-trained statistical models to extract the following entity types:
- Location
- Organization
- Person
- Title
- Product
You can also use custom-trained entity extraction models, trained by the Rosette Model Training Suite, to extract additional entity types. These models are loaded into Rosette Server. They can be called in the default configuration or through a custom profile.
REX also includes rule-based extractors, including regex extractors, that can extract additional entity types such as:
- Date
- Time
- Credit Card numbers
- Phone Numbers
Entities from the rule-based extractors are not returned by default. To use rule-based REX extractors, modify the supplementalRegularExpressionPaths parameter in the REX configuration file (rex-factory-config.yaml). You can also add custom regex files to create new Exact extractors.
Note
Any models, gazetteers, and regular expressions used when training a model must also be used when performing event extraction. Use the same custom profile to configure REX for model training and event extraction. The custom profile is set in the schema definition for event model training.
Rosette Server can support multiple profiles, each with different data domains (such as user dictionaries, regular expressions files, and custom models) as well as different parameter and configuration settings. Each profile is defined by its own root directory, thus any data or configuration files that live in the root directory of an endpoint can be part of a custom profile.
Using custom profiles, a single endpoint can simultaneously support users with different processing requirements within a single instance of Rosette Server. For example, one user may work with product reviews and have a custom sentiment analysis model they want to use, while another user works with news articles and wants to use the default sentiment analysis model.
Each unique profile in Rosette Server is identified by a string, profileId. The profile is specified when calling the API by adding the profileId parameter, which indicates the set of configuration and data files to be used for that call.
Custom profiles and their associated data are contained in a <profile-data-root> directory. This directory can be anywhere in your environment; it does not have to be in the Rosette Server install directory.
Table 20. Examples of types of customizable data by endpoint
Endpoint | Applicable data files for custom profile
/categories | Custom models
/entities | Gazetteers, regular expression files, custom models, linking knowledge base
/morphology | User dictionaries
/sentiment | Custom models
/tokens | Custom tokenization dictionaries
Note
Custom profiles are not currently supported for the address-similarity, name-deduplication, name-similarity, and name-translation endpoints.
Setting up custom profiles
- Create a directory to contain the configuration and data files for the custom profile.
The directory name must be 1 to 80 characters long and consist only of 0-9, A-Z, a-z, underscore, or hyphen. It cannot contain spaces. It can be anywhere on your server; it does not have to be in the Rosette Server directory structure. This is the profile-data-root.
- Create a subdirectory for each profile, identified by a profileId.
For each profile, create a subdirectory named profileId in the profile-data-root. The profile-path for a profile is profile-data-root/profileId.
- Edit the Rosette Server configuration files to look for the profile directories.
The configuration files are in the launcher/config/ directory. Set the profile-data-root value in these files:
# profile data root folder that may contain profile-id/{rex,tcat} etc
profile-data-root=file:///Users/rosette-users
- Add the customization files for each profile. They may be configuration and/or data files.
When you call the API, add "profileId": "myProfileId" to the body of the call.
{"content": "The black bear fought the white tiger at London Zoo.",
"profileId": "group1"
}
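The naming rule from the first step and the request body shown above can be checked and built programmatically. The following Python sketch is a minimal illustration; the function names are hypothetical, and applying the same naming rule to profileId subdirectories would be an assumption that the text does not state.

```python
import json
import re

# Naming rule from the first step: 1 to 80 characters drawn from
# 0-9, A-Z, a-z, underscore, or hyphen (so no spaces).
_NAME_RULE = re.compile(r"[0-9A-Za-z_-]{1,80}")


def is_valid_directory_name(name):
    """Return True if name satisfies the documented naming rule."""
    return _NAME_RULE.fullmatch(name) is not None


def request_body(content, profile_id):
    """Serialize a request body that selects a custom profile."""
    return json.dumps({"content": content, "profileId": profile_id})
```

The string returned by `request_body` can be passed as the `-d` payload of a curl call or as the body of any HTTP client request.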
New profiles are automatically loaded in Rosette Server. You do not have to bring down or restart the instance to add new models or data to Rosette Server.
When editing an existing profile, you may need to restart Rosette Server. If the profile has been called since Rosette Server was started, the Server must be restarted for the changes to take effect. If the profile has not been called since Rosette Server was started, there is no need to restart.
To add or update models or data, use one of the following approaches. The examples assume the custom profile root rosette-users and the profiles group1 and group2.
- Add a new profile with the new models or new data, for example group3.
- Delete the profile and re-add it. Delete group1 and then recreate the group1 directory with the new models and/or data.
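The delete-and-recreate approach can be scripted. This Python sketch assumes the new models and data have been staged in a separate directory first; the function name and the staging directory are hypothetical. Whether a restart is still needed afterwards depends on whether the profile has already been called, as described above.

```python
import shutil
from pathlib import Path


def replace_profile(profile_data_root, profile_id, staged_dir):
    """Delete <profile_data_root>/<profile_id> if it exists, then recreate it
    as a copy of staged_dir. Returns the path of the recreated profile."""
    profile_path = Path(profile_data_root) / profile_id
    if profile_path.exists():
        shutil.rmtree(profile_path)  # remove the old models and data
    shutil.copytree(staged_dir, profile_path)  # recreate with the new content
    return profile_path
```

For example, `replace_profile("/Users/rosette-users", "group1", "/tmp/group1-new")` would rebuild the group1 profile from the staged copy; the staging path here is only a placeholder.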
The configurations for each endpoint are contained in the factory configuration files. The worker-config.yaml file describes which factory configuration files are used by each endpoint as well as the pipelines for each endpoint. To modify parameter values or any other configuration values, copy the factory configuration file into the profile path and modify the values.
Example 2. Modifying entities parameters default values
Let's go back to our example with profile-ids of group1 and group2. Group1 wants to modify the default entities parameters, setting entity linking to true and case sensitivity to false. These parameters are set in the rex-factory-config.yaml file.
- Copy the file /launcher/config/rosapi/rex-factory-config.yaml to rosette-users/group1/config/rosapi/rex-factory-config.yaml.
- Edit the new rex-factory-config.yaml file as needed. This is an excerpt from a sample file.
# rootDirectory is the location of the rex root
rootDirectory: ${rex-root}
# startingWithDefaultConfiguration sets whether to fill in the defaults with CreateDefaultExtractor
startingWithDefaultConfiguration: true
# calculateConfidence turns on confidence calculation
# values: true | false
calculateConfidence: true
# resolvePronouns turns on pronoun resolution
# values: true | false
resolvePronouns: true
# rblRootDirectory is the location of the rbl root
rblRootDirectory: ${rex-root}/rbl-je
# case sensitivity model defaults to auto
caseSensitivity: false
# linkEntities is default true for the Cloud
linkEntities: true
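The copy step in this example can be sketched in Python as follows. The function name is hypothetical, and the directory layout (config/rosapi under the profile path) mirrors the paths used in the example above.

```python
import shutil
from pathlib import Path


def copy_factory_config(launcher_config_dir, profile_path,
                        name="rex-factory-config.yaml"):
    """Copy rosapi/<name> from the launcher config directory to
    <profile_path>/config/rosapi/<name>, creating directories as needed."""
    source = Path(launcher_config_dir) / "rosapi" / name
    target = Path(profile_path) / "config" / "rosapi" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    return target
```

After the copy, edit the file under the profile path; the original under launcher/config stays unchanged and continues to serve calls that do not specify a profileId.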
Each profile can include custom data sets. For example, the entities endpoint includes multiple types of data files, including regex and gazetteers. These files can be put into their own directory for entities, known as an overlay directory. This is an additional data directory which takes priority over the default entities data directory.
Note
If the data overlay directory is named rex, the contents of the overlay directory will completely replace all supplied REX data files, including models, regex, and gazetteer files.
- If your custom data sets are intended to supplement the shipped files, the directory name must not be rex.
- If your custom data sets are intended to completely replace the shipped files, use the directory name rex.
Example 3. Custom Gazetteer for the Entities Endpoint
We will create a custom gazetteer file called custom_gaz.txt specifying "John Doe" as an ENGINEER entity type. Full details on how to create custom gazetteer files are in the section Creating a Custom Gazetteer in the Rosette Entity Extractor Application Developer Guide.
- Create the custom gazetteer file in /Users/rosette-users/group1/custom-rex/data/gazetteer/eng/accept/custom_gaz.txt.
It should consist of just two lines:
ENGINEER
John Doe
- Copy the file /launcher/config/rosapi/rex-factory-config.yaml to /Users/rosette-users/group1/config/rosapi/rex-factory-config.yaml.
- Edit the new rex-factory-config.yaml file, setting the dataOverlayDirectory.
# rootDirectory is the location of the rex root
rootDirectory: ${rex-root}
dataOverlayDirectory: "/Users/rosette-users/group1/custom-rex/data"
- Call the entities endpoint with the profileId set to group1:
curl -s -X POST \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-H "Cache-Control: no-cache" \
-d '{"content": "John Doe is employed by Basis Technology", "profileId": "group1"}' \
"http://localhost:8181/rest/v1/entities"
You will see "John Doe" extracted as type ENGINEER from the custom gazetteer.
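The gazetteer file from the first step of this example can also be generated by a script. The Python sketch below follows the two-line layout shown above (entity type first, then a name); writing additional names one per line is an assumption here, so check the Application Developer Guide for the full format. The function name is hypothetical.

```python
from pathlib import Path


def write_gazetteer(path, entity_type, names):
    """Write a gazetteer file with the entity type on the first line,
    followed by one name per line. Parent directories are created."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join([entity_type, *names]) + "\n", encoding="utf-8")
    return path
```

Calling `write_gazetteer(".../gazetteer/eng/accept/custom_gaz.txt", "ENGINEER", ["John Doe"])` with the full overlay path from the example would reproduce the file shown above.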
You can train and deploy a custom model to the entities endpoint for entity extraction. You can either:
- Copy the model file to the default data directory in the REX root folder.
<RosetteServerInstallDir>/roots/rex/<version>/data/statistical/<lang>/<modelfile>
where <lang> is the 3-letter language code for the model.
- Copy the model to the data directory of a custom profile.
<profile-data-root>/<profileId>/data/statistical/<lang>/<modelfile>
where <lang> is the 3-letter language code for the model.
The custom profile must be set up as described in Setting up custom profiles.
Tip
Model Naming Convention
The prefix must be model. and the suffix must be -LE.bin. Any alphanumeric ASCII characters are allowed in between.
Example valid model names:
- model.fruit-LE.bin
- model.customer4-LE.bin
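The naming convention can be checked before a model file is copied into place. This Python sketch encodes the rule as a regular expression; the function name is hypothetical, and requiring at least one character between the prefix and suffix is an assumption.

```python
import re

# Prefix "model.", one or more ASCII letters or digits, suffix "-LE.bin".
_MODEL_NAME = re.compile(r"model\.[0-9A-Za-z]+-LE\.bin")


def is_valid_model_name(filename):
    """Return True if filename follows the model naming convention."""
    return _MODEL_NAME.fullmatch(filename) is not None
```

Both example names above pass this check, while a name such as `model.my_fruit-LE.bin` is rejected because the underscore is not alphanumeric.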
In this example, we're going to add the entity types COLOR and ANIMAL to the entities endpoint, using a regex file.
- Create a profile-data-root called rosette-users in the Users directory.
- Create a user with the profileId of group1. The new profile-path is:
/Users/rosette-users/group1
- Edit the Rosette Server configuration files, adding the profile-data-root.
# profile data root folder that may contain app-id/profile-id/{rex,tcat} etc
profile-data-root=file:///Users/rosette-users
- Copy the rex-factory-config.yaml file from /launcher/config/rosapi into the new directory:
/Users/rosette-users/group1/config/rosapi/rex-factory-config.yaml
- Edit the copied file, setting the dataOverlayDirectory parameter and adding the path for the new regex file. The overlay directory is a directory shaped like the data directory. The entities endpoint will look for files in both locations, preferring the version in the overlay directory.
dataOverlayDirectory: "/Users/rosette-users/group1/custom-rex/data"
supplementalRegularExpressionPaths:
- "/Users/rosette-users/group1/custom-rex/data/regex/eng/accept/supplemental/custom-regexes.xml"
- Create the file custom-regexes.xml in the /Users/rosette-users/group1/custom-rex/data/regex/eng/accept/supplemental directory.
<regexps>
<regexp type="COLOR">(?i)red|white|blue|black</regexp>
<regexp type="ANIMAL">(?i)bear|tiger|whale</regexp>
</regexps>
- Call the entities endpoint without using the custom profile:
curl -s -X POST \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-H "Cache-Control: no-cache" \
-d '{"content": "The black bear fought the white tiger at London Zoo." }' \
"http://localhost:8181/rest/v1/entities"
The only entity returned is London Zoo:
{
"entities": [
{
"type": "LOCATION",
"mention": "London Zoo",
"normalized": "London Zoo",
"count": 1,
"mentionOffsets": [
{
"startOffset": 41,
"endOffset": 51
}
],
"entityId": "T0"
}
]
}
- Call the entities endpoint, adding the profileId to the call:
curl -s -X POST \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-H "Cache-Control: no-cache" \
-d '{"content": "The black bear fought the white tiger at London Zoo.",
"profileId": "group1"}' \
"http://localhost:8181/rest/v1/entities"
The new colors and animals are also returned:
{
"entities": [
{
"type": "COLOR",
"mention": "black",
"normalized": "black",
"count": 1,
"mentionOffsets": [
{
"startOffset": 4,
"endOffset": 9
}
],
"entityId": "T0"
},
{
"type": "ANIMAL",
"mention": "bear",
"normalized": "bear",
"count": 1,
"mentionOffsets": [
{
"startOffset": 10,
"endOffset": 14
}
],
"entityId": "T1"
},
{
"type": "COLOR",
"mention": "white",
"normalized": "white",
"count": 1,
"mentionOffsets": [
{
"startOffset": 26,
"endOffset": 31
}
],
"entityId": "T2"
},
{
"type": "ANIMAL",
"mention": "tiger",
"normalized": "tiger",
"count": 1,
"mentionOffsets": [
{
"startOffset": 32,
"endOffset": 37
}
],
"entityId": "T3"
},
{
"type": "LOCATION",
"mention": "London Zoo",
"normalized": "London Zoo",
"count": 1,
"mentionOffsets": [
{
"startOffset": 41,
"endOffset": 51
}
],
"entityId": "T4"
}
]
}
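Before deploying a custom regex file, it can be useful to try the patterns locally. The Python sketch below runs the two patterns from custom-regexes.xml against the sample sentence using Python's `re` module. This is only an approximation: REX's regex engine may differ from Python's in some details, and the function name is hypothetical.

```python
import re

# Patterns copied from the custom-regexes.xml file in this example.
PATTERNS = {
    "COLOR": re.compile(r"(?i)red|white|blue|black"),
    "ANIMAL": re.compile(r"(?i)bear|tiger|whale"),
}


def find_mentions(text):
    """Return (type, mention, startOffset, endOffset) tuples sorted by offset."""
    hits = [
        (entity_type, match.group(0), match.start(), match.end())
        for entity_type, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    return sorted(hits, key=lambda hit: hit[2])


if __name__ == "__main__":
    for hit in find_mentions("The black bear fought the white tiger at London Zoo."):
        print(hit)
```

On the sample sentence, this yields black (4 to 9), bear (10 to 14), white (26 to 31), and tiger (32 to 37), matching the offsets in the response above. Note that these patterns have no word boundaries, so in Python "red" would also match inside a word such as "hundred"; consider adding boundaries if that matters for your data.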
Configuring Rosette Server
For a full description of installing Rosette Server and all configuration parameters, refer to the Server User Guide. This section describes a few of the more common configuration parameters.
Enable passing files to endpoints
Most endpoints can take either a text block, a file, or a link to a webpage as the input text. The webpage link is in the form of a URI. To enable passing a URI to an endpoint, the enableDte flag must be set in the file com.basistech.ws.worker.cfg.
By default, the flag is set to true; URI passing is enabled.
#download and text extractor
enableDte=true
Modify the input constraints
The limits for the input parameters are in the file /rosapi/constraints.yaml. Modify the values in this file to increase the limits on the maximum input character count and maximum input payload per call. You can also increase the number of names per list for each call to the name deduplication endpoint.
The default values were determined as optimal during early rounds of performance tests targeting response times under 2 seconds. Larger values may degrade system performance.
Table 21. constraints.yaml
Parameter | Minimum | Maximum | Default Value | Description
maxInputRawByteSize | 1 | 10,000,000 | 614400 | The maximum number of input bytes per raw doc
maxInputRawTextSize | 1 | 1,000,000 | 50000 | The maximum number of input characters per submission
maxNameDedupeListSize | 1 | 100,000 | 1000 | The maximum number of names to be deduplicated.
To modify the input constraints:
- Edit the file /rosapi/constraints.yaml.
- Modify the value for one or more parameters.
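A client can check its input against these limits before sending a request. The Python sketch below uses the default values from Table 21; the function names are hypothetical, and if you change constraints.yaml you would pass your own limits instead.

```python
# Default limits from Table 21 (constraints.yaml).
DEFAULT_MAX_RAW_BYTES = 614400   # maxInputRawByteSize
DEFAULT_MAX_TEXT_CHARS = 50000   # maxInputRawTextSize
DEFAULT_MAX_DEDUPE_NAMES = 1000  # maxNameDedupeListSize


def text_within_limit(text, max_chars=DEFAULT_MAX_TEXT_CHARS):
    """Return True if a text submission has at least 1 and at most max_chars characters."""
    return 1 <= len(text) <= max_chars


def raw_doc_within_limit(data, max_bytes=DEFAULT_MAX_RAW_BYTES):
    """Return True if a raw document (bytes) has at least 1 and at most max_bytes bytes."""
    return 1 <= len(data) <= max_bytes


def name_list_within_limit(names, max_names=DEFAULT_MAX_DEDUPE_NAMES):
    """Return True if a name list for deduplication has 1 to max_names entries."""
    return 1 <= len(names) <= max_names
```

Checking on the client side avoids sending requests that the server would reject for exceeding the configured limits.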
Setting Rosette to pre-warm
To speed up first call response time, Rosette can be pre-warmed by loading data files at startup at the cost of a larger memory footprint.
Most components load their data lazily, meaning that the data required for processing will only be loaded into memory when an actual call hits. This is particularly true for language-specific data. The consequence is that when the very first call with text in a given language arrives at a worker, the worker can take quite a bit of time loading data before it can process the request.
Pre-warming is Rosette's attempt to address the first-call penalty by hitting the worker with text in every licensed language it supports at boot time. Then, when an actual customer request comes in, all data will have already been memory mapped and you won't experience a first-call delay as the data is loaded. Only languages licensed for your installation will be pre-warmed.
The default is false; pre-warm is not enabled.
To set Rosette to warm up the worker upon activation
On macOS/Linux or Windows:
- Edit the file /com.basistech.ws.worker.cfg
- Set warmUpWorker=true
Tip
When installing on macOS or Linux, Rosette can be set to pre-warm in the installation. Select Y when asked Pre-warm Rosette at startup? You can always change the option by editing the com.basistech.ws.worker.cfg file.
With Docker:
- Edit the file docker-compose.yml
- Set ROSETTE_PRE_WARM=true
Configuring worker threads for HTTP transport
Multiple worker threads allow you to implement parallel request processing. Generally, we recommend that the number of threads should be less than the number of physical cores or less than the total number of hyperthreads, if enabled.
You can experiment with 2-4 worker threads per core. More worker threads may improve throughput a bit, but typically won't improve latency. The default value of worker threads is 2.
If the URLs for all licensed endpoints are set to local: (not distributed):
- Edit the file /config/com.basistech.ws.transport.embedded.cfg.
- Modify the value of workerThreadCount.
If using transport rules in a distributed deployment on macOS/Linux or Windows:
- Edit the file /config/com.basistech.ws.transport.embedded.cfg.
- Modify the value of workerThreadCount.
- Edit the file /config/com.basistech.ws.worker.cfg.
- Modify the value of workerThreadCount.
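Settings such as workerThreadCount and warmUpWorker are edited by hand in the steps above, but the change can also be scripted. The Python sketch below assumes the .cfg files consist of key=value lines with # comments, as in the excerpts shown in this section; the function name is hypothetical, and the sketch works on the file's text rather than touching the file itself.

```python
def set_property(cfg_text, key, value):
    """Return cfg_text with key=value set.

    An existing uncommented entry for key is replaced in place; if the key
    is absent (or only appears in a comment), a new line is appended."""
    new_line = f"{key}={value}"
    lines = cfg_text.splitlines()
    for index, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # leave comments untouched
        name, separator, _ = stripped.partition("=")
        if separator and name.strip() == key:
            lines[index] = new_line
            break
    else:
        lines.append(new_line)
    return "\n".join(lines) + "\n"
```

To apply it, read the configuration file, pass its contents through `set_property(text, "workerThreadCount", 4)`, and write the result back, keeping a backup of the original. Restarting Rosette Server afterwards is likely required for the new value to be picked up.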