To run REX, you need to define an entity extractor. The createDefaultExtractor
method creates an extractor holding all the default data. The license file and data files must be in their default locations under the rex-je-<VERSION>
directory.
-
Create an extractor using the default configuration.
EntityExtractor extractor = EntityExtractor.createDefaultExtractor(rootDirectory);
-
(Optional) Add your own extraction rules.
-
(Optional) Set the processors to be used.
-
Use the extractor's createAnnotator(LanguageCode) method to create an Annotator for processing the input text and extracting entities.
Annotator annotator = extractor.createAnnotator(LanguageCode.ENGLISH);
Starting with an Empty Extractor
If you do not want to use the default configuration, you can start with an empty extractor and build from there.
-
Create an empty extractor, that is, an extractor with no rules.
EntityExtractor extractor = EntityExtractor.createEmptyExtractor();
-
Designate the Rosette license, either by specifying the Rosette license file or by supplying the text contained in that file.
extractor.setLicense(new File("path/to/rlp-license.xml"));
or
extractor.setLicense("content-of-rlp-license-file");
-
Designate the RBL directory. For some languages, REX requires an RBL root directory to provide data used in segmentation and/or morphological analysis. A suitable installation of RBL is provided as part of the REX distribution, but is not set when using an empty extractor.
extractor.setBaseLinguisticsRootDirectory("path/to/RBLroot");
-
Add your own extraction rules.
-
(Optional) Set the processors to be used.
-
Use the extractor's createAnnotator(LanguageCode) method to create an Annotator for processing the input text and extracting entities.
Annotator annotator = extractor.createAnnotator(LanguageCode.ENGLISH);
General Configuration Options
Handling Structured Regions
The REX statistical model is trained to extract entities from unstructured text, where the model uses the syntactic context in sentences to help identify entities and entity types. But not all data is unstructured. Often input documents contain some sections of structured text, such as tables and lists, along with the unstructured text. Structured text usually does not contain full sentences and is often missing the syntactic context that REX expects. This can lead to noisy results and false positives.
In addition to sentences and tokens, the Rosette Base Linguistics (RBL) processor identifies structured and unstructured regions. For structured regions, REX disables the statistical processor. The text in structured regions is still processed by the rule-based processors (gazetteers and regexes) and the linker. Additionally, for some languages, another extractor, the name classifier, can extract entities from structured regions of text.
By default, structured regions are processed the same as unstructured regions.
To change how structured text is processed, set the enum structuredRegionsProcessingType when configuring the entity extractor. It has three values:
-
none: (default) Disables the statistical/DNN models from processing structured regions. When set to none, REX does not attempt to extract entities from structured regions using the statistical processor or DNN models. The rule-based extractors (gazetteers, regex) and the linker are used to process structured regions.
-
nerModel: Processes the entire document as unstructured text. Structured regions are processed the same as unstructured regions.
-
nameClassifier: Disables the statistical/DNN models from processing structured regions and enables the name classifier on the structured regions.
Some structured regions may contain enough syntactic context for the statistical/DNN models to accurately extract entities. You can set a minimum number of tokens required in a structured region to override the structured region processor setting. If the number of tokens in the region exceeds this minimum, the region will be processed with the statistical/DNN models. The default value is 0. With this default, all structured regions are processed as defined by the structuredRegionsProcessingType.
public REXFactoryConfiguration
EntityExtractor.setstructuredRegionProcessingSentenceTokensMin(Integer tokensMin);
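The interaction between structuredRegionsProcessingType and the token minimum can be sketched as a small decision function. This is an illustrative model only, not REX code: the class, enum, and method names are hypothetical, and it assumes that a minimum of 0 means the override is switched off, which is how the default is described above. The rule-based processors and the linker run in every case, so the sketch only reports which additional extractor would handle a structured region.

```java
/** Illustrative model of the structured-region decision; hypothetical names, not the REX API. */
final class StructuredRegionSketch {

    /** Mirrors the three documented values of structuredRegionsProcessingType. */
    enum ProcessingType { NONE, NER_MODEL, NAME_CLASSIFIER }

    /** Extractor applied to a structured region on top of the rule-based processors and linker. */
    enum ExtraExtractor { NOTHING, STATISTICAL_MODEL, NAME_CLASSIFIER }

    static ExtraExtractor extractorFor(ProcessingType type, int regionTokens, int tokensMin) {
        // Assumption: tokensMin == 0 (the default) disables the override entirely.
        if (tokensMin > 0 && regionTokens > tokensMin) {
            // The region is long enough to be treated like unstructured text.
            return ExtraExtractor.STATISTICAL_MODEL;
        }
        switch (type) {
            case NER_MODEL:
                return ExtraExtractor.STATISTICAL_MODEL;
            case NAME_CLASSIFIER:
                return ExtraExtractor.NAME_CLASSIFIER;
            default:
                return ExtraExtractor.NOTHING;
        }
    }

    public static void main(String[] args) {
        // A short table cell stays with the name classifier.
        System.out.println(extractorFor(ProcessingType.NAME_CLASSIFIER, 4, 10));  // NAME_CLASSIFIER
        // A long region exceeds the minimum and goes to the statistical model.
        System.out.println(extractorFor(ProcessingType.NAME_CLASSIFIER, 25, 10)); // STATISTICAL_MODEL
    }
}
```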
Fragment Boundary Detector
Note
Disabling the fragment boundary detector will classify the entire text as unstructured. This has a similar effect to setting structuredRegionsProcessingType to nerModel.
REX detects entities within sentences. By default, REX uses a fragment boundary detector to identify structured regions, adding sentence boundaries at tabs, newlines, and multiple whitespace characters (such as 3 or more spaces) in text fragments, such as lists and tables. This enables the detection of multiple entities in text fragments that do not form standard sentences. Consider the following text:
George Washington
John Adams
Thomas Jefferson
Without the fragment boundary detector, the statistical model identifies the preceding text as a single PERSON entity. With the fragment boundary detector, the statistical model identifies three separate PERSON entities.
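To make the boundary rule concrete, the following sketch splits text at tabs, line breaks, and runs of three or more spaces. It is not REX's implementation: the class and method names are hypothetical, and the three-space threshold is taken from the "such as 3 or more spaces" example above rather than from a documented constant.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative fragment splitter; hypothetical names, not the REX API. */
final class FragmentBoundarySketch {

    // A boundary is a tab, any line break (\R), or a run of three or more spaces.
    private static final Pattern BOUNDARY = Pattern.compile("\\t|\\R| {3,}");

    static List<String> splitFragments(String text) {
        List<String> fragments = new ArrayList<>();
        for (String piece : BOUNDARY.split(text)) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                // Skip empty pieces produced by consecutive boundaries.
                fragments.add(trimmed);
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        String list = "George Washington\nJohn Adams\nThomas Jefferson";
        System.out.println(splitFragments(list));
        // [George Washington, John Adams, Thomas Jefferson]
    }
}
```

Each fragment is then treated as its own sentence, which is what lets the statistical model see three candidate names instead of one long one.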
Turn off the fragment boundary detector using the EntityExtractor#setUseFragmentBoundaryDetector(boolean useFragmentBoundaryDetector) method.
Note
While the fragment boundary detector improves REX's performance on tables, lists, and other non-prose content, REX is, by design, tuned for prose and may not return high-accuracy results on content with significant non-prose elements.
If your project has a set of unique data files that you would like to keep separate from other data files, you can put them in their own directory, also known as an overlay directory. This is an additional data directory, which takes priority over the default REX data directory.
The overlay directory must have the same directory tree as the provided data directory. If an overlay directory is set, REX searches both it and the default data directory.
-
If a file exists in both places, the version in the overlay directory is used.
-
If there is an empty file in the overlay directory, REX will ignore the corresponding file in the default data directory.
-
If there is no file in the overlay directory, REX will use the file in the default directory.
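The three lookup rules above can be sketched as a single resolution function. This is only a model of the documented behavior, not REX's own lookup code; the class and method names are hypothetical, and an empty result stands for "REX uses no file for this path".

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Illustrative overlay lookup; hypothetical names, not the REX API. */
final class OverlayLookupSketch {

    /** Resolves a data file given as a path relative to the data directory root. */
    static Optional<Path> resolve(Path overlayDir, Path defaultDir, String relativePath)
            throws IOException {
        Path overlayFile = overlayDir.resolve(relativePath);
        if (Files.isRegularFile(overlayFile)) {
            if (Files.size(overlayFile) == 0) {
                // An empty overlay file masks the corresponding default file.
                return Optional.empty();
            }
            // A non-empty overlay file takes priority over the default file.
            return Optional.of(overlayFile);
        }
        // Nothing in the overlay: fall back to the default data directory.
        Path defaultFile = defaultDir.resolve(relativePath);
        return Files.isRegularFile(defaultFile) ? Optional.of(defaultFile) : Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        Path overlay = Files.createTempDirectory("overlay");
        Path defaults = Files.createTempDirectory("defaults");
        String gazetteer = "gazetteer/eng/accept/gaz-LE.bin";

        Path defaultFile = defaults.resolve(gazetteer);
        Files.createDirectories(defaultFile.getParent());
        Files.write(defaultFile, new byte[] {1, 2, 3});
        System.out.println(resolve(overlay, defaults, gazetteer).isPresent()); // true

        Path overlayFile = overlay.resolve(gazetteer);
        Files.createDirectories(overlayFile.getParent());
        Files.createFile(overlayFile); // empty file switches the gazetteer off
        System.out.println(resolve(overlay, defaults, gazetteer).isPresent()); // false
    }
}
```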
To specify the overlay directory use:
EntityExtractor#setOverlayDataDirectory(Path overlayDataDirectory)
Example 1. Turn Off a Specific Language Gazetteer
EntityExtractor extractor = EntityExtractor.createDefaultExtractor(new File("/path/to/root"));
// 'my-data' has an empty file at "gazetteer/eng/accept/gaz-LE.bin"
// so 'American' will not be extracted as a Nationality
extractor.setOverlayDataDirectory(Paths.get("my-data"));
String input = "George Washington was an American president.";
Annotator ann = extractor.createAnnotator(LanguageCode.ENGLISH);
AnnotatedText anText = ann.annotate(input);
for (Entity e : anText.getEntities()) {
    System.out.println(e.toString());
}
Output:
Entity{extendedProperties={}, type=PERSON, Mentions=[Mention{extendedProperties={},
startOffset=0, endOffset=17, source=statistical,
subsource=/path/to/root/data/statistical/eng/model-LE.bin, normalized=George Washington}]}
Entity{extendedProperties={}, type=TITLE, mentions=[Mention{extendedProperties={},
startOffset=34, endOffset=43, source=statistical,
subsource=/path/to/root/data/statistical/eng/model-LE.bin, normalized=president}]}
Because of the empty overlay file, the default English gazetteer is not used in subsequent calls to the extractor.
REX can return a salience score for each extracted entity. Salience indicates whether the entity is important to the overall scope of the document. Returned salience scores are binary, either 0 (not salient) or 1 (salient). The decision is made according to several parameters, such as frequency, distance from document start, etc. Salience is not calculated by default.
To enable entity salience use the method:
EntityExtractor#setCalculateSalience(boolean calculateSalience)
To return the salience value:
Entity.getSalience()
Invalidating Internal Data Caches
For the most part, REX uses memory-mapping techniques to keep Java heap memory usage low. However, there are certain cases, such as specific hardware configurations or heavy use of dynamic data, where REX's internal caches might cause memory problems. In such cases, REX provides some APIs to invalidate its caches so Java can reclaim the memory.
Cache eviction is currently supported for the statistical and gazetteer processors. A list of currently cached language data for a specific processor can be retrieved with the getCachedLanguagesForProcessor() function in either com.basistech.rosette.rex.EntityExtractor or com.basistech.rosette.rex.REXAnnotatorFactory.
To invalidate the cache:
-
For specific language/processor pairs, use invalidateProcessorCacheForLanguage().
-
To invalidate all data for specific languages, use invalidateCacheForLanguage().
-
To invalidate all of REX's internal caches, use invalidateCache().
Invalidating cached data simply removes references from REX's caches so that Java's garbage collector can reclaim the memory. If there are still other references to an annotator for a specific language in the user process, memory won't be freed until those references are also disposed of.
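The effect of outside references can be illustrated with a toy cache. This sketch is not REX code and does not call the REX API: the class, its methods, and the LanguageData stand-in are hypothetical. It shows that invalidation only drops the cache's own reference, so data still held by the caller stays alive, and the next request loads a second copy.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Toy per-language cache; hypothetical names, not the REX API. */
final class LanguageCacheSketch {

    /** Stand-in for the data loaded for one language. */
    static final class LanguageData {
        final String language;

        LanguageData(String language) {
            this.language = language;
        }
    }

    private final Map<String, LanguageData> cache = new ConcurrentHashMap<>();

    /** Returns the cached data for a language, loading it on first use. */
    LanguageData get(String language) {
        return cache.computeIfAbsent(language, LanguageData::new);
    }

    /** Snapshot of the languages that currently have cached data. */
    Set<String> cachedLanguages() {
        return new HashSet<>(cache.keySet());
    }

    /** Drops the cache's reference; the data is reclaimable only if nothing else holds it. */
    void invalidate(String language) {
        cache.remove(language);
    }

    public static void main(String[] args) {
        LanguageCacheSketch cache = new LanguageCacheSketch();
        LanguageData held = cache.get("eng");   // the caller keeps its own reference
        cache.invalidate("eng");

        System.out.println(cache.cachedLanguages().isEmpty()); // true
        System.out.println(held.language);                     // eng (still reachable)
        System.out.println(cache.get("eng") == held);          // false (a fresh copy was loaded)
    }
}
```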
Retrieving Base Linguistics Configuration
REX internally uses Rosette Base Linguistics (RBL) to analyze the text before processing it. If the user application already uses Base Linguistics for other purposes, it can save processing time by having REX annotate pre-tokenized documents: pass a tokenized AnnotatedText instance, instead of a string, to the annotate function of REX's annotator. However, if the user's instance of Base Linguistics and REX's internal instance of Base Linguistics are configured differently, REX's results might be affected.
To solve the problem, EntityExtractor provides a getBaseLinguisticsParameters function that returns the set of Rosette Base Linguistics options REX uses internally, given a language. This function should be called after the EntityExtractor has been otherwise configured. It returns an EnumSet of keys to the values REX configures them to.
Tip
REX provides a sample (rex-je-<version>/samples/RBLParametersSample.java) which demonstrates how to retrieve RBL parameters from REX and use RBL directly to process documents before running the REX extractor.