The entity extraction endpoint (https://localhost:8181/rest/v1/entities) comes fully configured to extract entities. This guide explains how to modify the configuration of the extractor for your use case.
General Configuration Options
Fragment Boundary Detector
REX detects entities within sentences. By default, REX uses a Fragment Boundary Detector to add sentence boundaries at tabs, newlines, and multiple whitespace characters (such as 3 or more spaces) in text fragments, such as lists and tables. This enables the detection of multiple entities in text fragments that do not form standard sentences. Consider the following text:
George Washington
John Adams
Thomas Jefferson
Without the Fragment Boundary Detector, the statistical model identifies the preceding text as a single PERSON entity. With the Fragment Boundary Detector, the statistical model identifies three separate PERSON entities.
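The boundary rule described above can be illustrated with a minimal sketch. This is not REX's implementation; the regular expression and the `split_fragments` helper are hypothetical, and they assume that "multiple whitespace characters" means a run of three or more spaces.

```python
import re

# Assumed boundary pattern: a tab, a newline, or a run of 3 or more spaces.
# This approximates the behavior described in the text; it is not REX's code.
FRAGMENT_BOUNDARY = re.compile(r"\t|\r?\n| {3,}")


def split_fragments(text: str) -> list[str]:
    """Split text at fragment boundaries and drop empty pieces."""
    parts = FRAGMENT_BOUNDARY.split(text)
    return [p.strip() for p in parts if p.strip()]


names = split_fragments("George Washington\nJohn Adams\nThomas Jefferson")
```

Each resulting fragment would then be treated as its own sentence, which is why the three names can be tagged as three separate entities rather than one.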
Turn off the Fragment Boundary Detector in the rex-factory-config.yaml file.
#Regular expressions and gazetteers may be configured to match tokens
#partially independent from token boundaries. If true, reported offsets
#correspond to token boundaries.
snapToTokenBoundaries: false
Note
While the Fragment Boundary Detector improves REX's performance on tables, lists, and other non-prose content, REX is, by design, tuned for prose and may not return high accuracy results on content with significant non-prose elements.
If your project has a set of unique data files that you would like to keep separate from other data files, you can put them in their own directory, also known as an overlay directory. This is an additional data directory, which takes priority over the default REX data directory.
The overlay directory must have the same directory tree as the provided data directory. If an overlay directory is set, REX searches both it and the default data directory.
If a file exists in both places, the version in the overlay directory is used.
If there is an empty file in the overlay directory, REX will ignore the corresponding file in the default data directory.
If there is no file in the overlay directory, REX will use the file in the default directory.
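The three lookup rules above can be sketched as a small function. This is only an illustration of the precedence order, not how REX resolves files internally; `resolve_data_file` and its parameters are hypothetical names.

```python
from pathlib import Path
from typing import Optional


def resolve_data_file(relative: str, overlay: Path, default: Path) -> Optional[Path]:
    """Return the file to load, or None when no file should be used."""
    candidate = overlay / relative
    if candidate.is_file():
        # An empty overlay file suppresses the default file of the same name.
        if candidate.stat().st_size == 0:
            return None
        # A non-empty overlay file takes priority over the default file.
        return candidate
    # No overlay file: fall back to the default data directory.
    fallback = default / relative
    return fallback if fallback.is_file() else None
```

The function only works as intended when both directories share the same tree, which is why the overlay directory must mirror the layout of the provided data directory.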
To specify the overlay directory:
- Create an overlay directory:
<install-directory>/my-data
- Add the overlay directory to the rex-factory-config.yaml file:
dataOverlayDirectory: <install-directory>/my-data
Example 1. Turn Off a Specific Language Gazetteer
- Create an overlay directory.
- Add an empty file (gaz-LE.bin) to the overlay directory:
my-data/gazetteer/eng/accept/gaz-LE.bin
- Add the overlay directory to the rex-factory-config.yaml file:
dataOverlayDirectory: <install-directory>/my-data
The default English gazetteer will not be used in calls.
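As a rough sketch, the empty file from Example 1 could be created with a few lines of Python. The helper name `add_empty_overlay_file` is hypothetical, and the overlay root shown in the usage comment stands in for the `<install-directory>/my-data` placeholder used in this guide.

```python
from pathlib import Path


def add_empty_overlay_file(overlay_root: Path, relative: str) -> Path:
    """Create a zero-byte file at overlay_root/relative, creating parent directories."""
    target = overlay_root / relative
    target.parent.mkdir(parents=True, exist_ok=True)
    # Writing zero bytes creates the file, or truncates it if it already exists.
    target.write_bytes(b"")
    return target


# Example usage (replace the root with your own overlay directory):
# add_empty_overlay_file(Path("my-data"), "gazetteer/eng/accept/gaz-LE.bin")
```

The relative path has to match the location of the file in the default data directory, since the overlay is matched against the default tree by path.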
Example 2. Use a Custom German Reject Gazetteer
Continuing the example above, add a reject gazetteer file to the overlay directory:
my-data/gazetteer/deu/reject/reject-names.txt
REX can return a salience score for each extracted entity. Salience indicates whether the entity is important to the overall scope of the document. Returned salience scores are binary, either 0 (not salient) or 1 (salient). The decision is made according to several parameters, such as frequency, distance from document start, etc. Salience is not calculated by default.
To include salience in the results of a single call, add the option to the request:
"options": {"calculateSalience": true}
Or, to get the salience by default, set the calculateSalience parameter to true in the rex-factory-config.yaml file.
#An option to calculate entity-chain salience values.
calculateSalience: true
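For the per-call option, a request could be assembled along the following lines. This is a sketch under assumptions: the `content` field name for the input text and the JSON content type are assumed rather than taken from this guide, and `build_entities_request` is a hypothetical helper. The code only builds the request object and does not send it.

```python
import json
import urllib.request

# Endpoint named at the top of this guide.
ENDPOINT = "https://localhost:8181/rest/v1/entities"


def build_entities_request(text: str, calculate_salience: bool = True) -> urllib.request.Request:
    """Build (but do not send) a POST request that asks for salience scores."""
    body = {
        "content": text,  # assumed field name for the input text
        "options": {"calculateSalience": calculate_salience},
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_entities_request("George Washington was the first president.")
```

Passing the request to `urllib.request.urlopen` would send it; authentication and certificate handling for a local server are left out here because they depend on the deployment.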
Retrieving Base Linguistics Configuration
REX internally uses Rosette Base Linguistics to analyze the text before processing it. If the user application already uses Base Linguistics for other purposes, it's possible to save processing time and have REX annotate pre-tokenized documents by passing a tokenized AnnotatedText instance, instead of a string, to the annotate function of REX's annotator. However, if the user's instance of Base Linguistics and REX's internal instance of Base Linguistics are configured differently, REX's results might be affected.
To solve the problem, EntityExtractor provides a getBaseLinguisticsParameters function that returns the set of Rosette Base Linguistics options REX uses internally, given a language. This function should be called after the EntityExtractor has been otherwise configured. It returns an EnumSet of keys to the values REX configures them to.