To run REX, you need to define an entity extractor. The createDefaultExtractor method creates an extractor holding all the default data. The license file and data files must be in their default locations under the rex-je-<VERSION> directory.
- Create an extractor using the default configuration.
  EntityExtractor extractor = EntityExtractor.createDefaultExtractor(rootDirectory);
- (Optional) Add your own extraction rules.
- (Optional) Set the processors to be used.
- Use the extractor's createAnnotator(LanguageCode) method to create an Annotator for processing the input text and extracting entities.
  extractor.createAnnotator(LanguageCode.ENGLISH);
Starting with an Empty Extractor
If you do not want to use the default configuration, you can start with an empty extractor and build from there.
- Create an empty extractor, that is, an extractor with no rules.
  EntityExtractor extractor = EntityExtractor.createEmptyExtractor();
- Designate the Rosette license, either by specifying the Rosette license file or by passing the text contained in that file.
  extractor.setLicense(new File("path/to/rlp-license.xml"));
  or
  extractor.setLicense("content-of-rlp-license-file");
- Designate the RBL directory. For some languages, REX requires an RBL root directory to provide data used in segmentation and/or morphological analysis. A suitable installation of RBL is provided as part of the REX distribution, but it is not set when you use an empty extractor.
  extractor.setBaseLinguisticsRootDirectory("path/to/RBLroot");
- Add your own extraction rules.
- (Optional) Set the processors to be used.
- Use the extractor's createAnnotator(LanguageCode) method to create an Annotator for processing the input text and extracting entities.
  extractor.createAnnotator(LanguageCode.ENGLISH);
General Configuration Options
Fragment Boundary Detector
REX detects entities within sentences. By default, REX uses a Fragment Boundary Detector to add sentence boundaries at tabs, newlines, and multiple whitespace characters (such as 3 or more spaces) in text fragments, such as lists and tables. This enables the detection of multiple entities in text fragments that do not form standard sentences. Consider the following text:
George Washington
John Adams
Thomas Jefferson
Without the Fragment Boundary Detector, the statistical model identifies the preceding text as a single PERSON entity. With the Fragment Boundary Detector, the statistical model identifies three separate PERSON entities.
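As a rough illustration of the boundary rule described above, the following self-contained Java sketch splits text at tabs, newlines, and runs of three or more spaces. It is not REX's detector: the class name FragmentBoundarySketch, the regular expression, and the three-space threshold are assumptions made only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: models the boundary rule, not REX's actual implementation.
class FragmentBoundarySketch {
    // Assumed rule: a boundary is any run of tabs/newlines, or three or more spaces.
    private static final Pattern BOUNDARY = Pattern.compile("[\\t\\r\\n]+| {3,}");

    static List<String> splitFragments(String text) {
        List<String> fragments = new ArrayList<>();
        for (String piece : BOUNDARY.split(text)) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                fragments.add(trimmed); // keep only non-empty fragments
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        String list = "George Washington\nJohn Adams\nThomas Jefferson";
        // Each line becomes its own fragment, so each name can be tagged separately.
        System.out.println(splitFragments(list));
        // prints [George Washington, John Adams, Thomas Jefferson]
    }
}
```

Ordinary prose with single spaces between words stays in one fragment under this rule, which is why the split only affects list-like or tabular text.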
Turn off the Fragment Boundary Detector using the EntityExtractor#setUseFragmentBoundaryDetector(boolean useFragmentBoundaryDetector) method.
Note
While the Fragment Boundary Detector improves REX's performance on tables, lists, and other non-prose content, REX is, by design, tuned for prose and may not return high accuracy results on content with significant non-prose elements.
If your project has a set of unique data files that you would like to keep separate from other data files, you can put them in their own directory, also known as an overlay directory. This is an additional data directory, which takes priority over the default REX data directory.
The overlay directory must have the same directory tree as the provided data directory. If an overlay directory is set, REX searches both it and the default data directory:
- If a file exists in both places, the version in the overlay directory is used.
- If there is an empty file in the overlay directory, REX ignores the corresponding file in the default data directory.
- If there is no file in the overlay directory, REX uses the file in the default directory.
To specify the overlay directory, use:
EntityExtractor#setOverlayDataDirectory(Path overlayDataDirectory)
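The lookup rules above can be modeled with a few lines of plain Java. This is a sketch under stated assumptions, not REX code: OverlayLookupSketch and its resolve method are hypothetical names, and the sketch only mirrors the three documented cases (overlay wins, empty overlay file masks the default, missing overlay file falls back to the default).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper, not part of the REX API: models the documented lookup rules only.
class OverlayLookupSketch {
    static Optional<Path> resolve(Path overlayDir, Path defaultDir, String relativePath)
            throws IOException {
        Path overlayFile = overlayDir.resolve(relativePath);
        if (Files.isRegularFile(overlayFile)) {
            // An empty overlay file masks the default file, so nothing is loaded.
            return Files.size(overlayFile) == 0 ? Optional.empty() : Optional.of(overlayFile);
        }
        // No overlay file: fall back to the default data directory.
        Path defaultFile = defaultDir.resolve(relativePath);
        return Files.isRegularFile(defaultFile) ? Optional.of(defaultFile) : Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        Path overlay = Files.createTempDirectory("overlay");
        Path defaults = Files.createTempDirectory("defaults");
        String gazetteer = "gazetteer/eng/accept/gaz-LE.bin";

        // Only the default copy exists, so the default file is used.
        Path defaultFile = defaults.resolve(gazetteer);
        Files.createDirectories(defaultFile.getParent());
        Files.write(defaultFile, new byte[] {1, 2, 3});
        System.out.println(resolve(overlay, defaults, gazetteer).isPresent()); // prints true

        // An empty overlay copy switches the file off.
        Path overlayFile = overlay.resolve(gazetteer);
        Files.createDirectories(overlayFile.getParent());
        Files.createFile(overlayFile);
        System.out.println(resolve(overlay, defaults, gazetteer).isPresent()); // prints false
    }
}
```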
Example 1. Turn Off a Specific Language Gazetteer
EntityExtractor extractor = EntityExtractor.createDefaultExtractor(new File("/path/to/root"));
// 'my-data' has an empty file at "gazetteer/eng/accept/gaz-LE.bin"
// so 'American' will not be extracted as a Nationality
extractor.setOverlayDataDirectory(Paths.get("my-data"));
String input = "George Washington was an American president.";
Annotator ann = extractor.createAnnotator(LanguageCode.ENGLISH);
AnnotatedText anText = ann.annotate(input);
for (Entity e : anText.getEntities()) {
    System.out.println(e.toString());
}
Output:
Entity{extendedProperties={}, type=PERSON, mentions=[Mention{extendedProperties={},
startOffset=0, endOffset=17, source=statistical,
subsource=/path/to/root/data/statistical/eng/model-LE.bin, normalized=George Washington}]}
Entity{extendedProperties={}, type=TITLE, mentions=[Mention{extendedProperties={},
startOffset=34, endOffset=43, source=statistical,
subsource=/path/to/root/data/statistical/eng/model-LE.bin, normalized=president}]}
With this overlay in place, the default English gazetteer is not used when annotating.
REX can return a salience score for each extracted entity. Salience indicates whether the entity is important to the overall scope of the document. Returned salience scores are binary, either 0 (not salient) or 1 (salient). The decision is made according to several parameters, such as frequency, distance from document start, etc. Salience is not calculated by default.
To enable entity salience, use the method:
EntityExtractor#setCalculateSalience(boolean calculateSalience)
To return the salience value:
Entity.getSalience()
Invalidating Internal Data Caches
For the most part, REX uses memory-mapping techniques to keep Java heap memory usage low. However, there are certain cases, such as specific hardware configurations or heavy use of dynamic data, where REX's internal caches might cause memory problems. In such cases, REX provides some APIs to invalidate its caches so Java can reclaim the memory.
Cache eviction is currently supported for the statistical and gazetteer processors. A list of currently cached language data for a specific processor can be retrieved with the getCachedLanguagesForProcessor() function in either com.basistech.rosette.rex.EntityExtractor or com.basistech.rosette.rex.REXAnnotatorFactory.
To invalidate the cache:
- For specific language/processor pairs, use invalidateProcessorCacheForLanguage().
- To invalidate all data for specific languages, use invalidateCacheForLanguage().
- To invalidate all of REX's internal caches, use invalidateCache().
Invalidating cached data simply removes references from REX's caches so that Java's garbage collector can reclaim the memory. If there are still other references to an annotator for a specific language in the user process, memory won't be freed until those references are also disposed of.
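The following toy model illustrates the three invalidation granularities and the point about outstanding references. It is a sketch only: CacheInvalidationSketch, its method names, and the "language/processor" key format are hypothetical and do not reflect REX's internal cache or the signatures of the REX methods named above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model, not REX code: a cache keyed by "language/processor".
class CacheInvalidationSketch {
    private final Map<String, byte[]> cache = new HashMap<>();

    // Load (or reuse) the data for one language/processor pair.
    byte[] load(String language, String processor) {
        return cache.computeIfAbsent(language + "/" + processor, key -> new byte[1024]);
    }

    // Drop one language/processor pair.
    void invalidate(String language, String processor) {
        cache.remove(language + "/" + processor);
    }

    // Drop every processor's data for one language.
    void invalidateLanguage(String language) {
        cache.keySet().removeIf(key -> key.startsWith(language + "/"));
    }

    // Drop everything.
    void invalidateAll() {
        cache.clear();
    }

    Set<String> cachedKeys() {
        return new TreeSet<>(cache.keySet());
    }

    public static void main(String[] args) {
        CacheInvalidationSketch model = new CacheInvalidationSketch();
        byte[] held = model.load("eng", "statistical"); // the caller keeps this reference
        model.load("eng", "gazetteer");
        model.load("jpn", "statistical");

        model.invalidate("eng", "gazetteer");
        System.out.println(model.cachedKeys()); // prints [eng/statistical, jpn/statistical]
        model.invalidateLanguage("eng");
        System.out.println(model.cachedKeys()); // prints [jpn/statistical]
        model.invalidateAll();
        System.out.println(model.cachedKeys()); // prints []

        // 'held' is still reachable here, so the garbage collector cannot reclaim it yet.
        System.out.println(held.length); // prints 1024
    }
}
```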
Retrieving Base Linguistics Configuration
REX internally uses Rosette Base Linguistics to analyze the text before processing it. If the user application already uses Base Linguistics for other purposes, it's possible to save processing time and have REX annotate pre-tokenized documents by passing a tokenized AnnotatedText instance, instead of a string, to the annotator's annotate function. However, if the user's instance of Base Linguistics and REX's internal instance of Base Linguistics are configured differently, REX's results might be affected.
To solve the problem, EntityExtractor provides a getBaseLinguisticsParameters function that returns the set of Rosette Base Linguistics options REX uses internally, given a language. This function should be called after the EntityExtractor has been otherwise configured. It returns an EnumSet of keys to the values REX configures them to.