REX has a plugin architecture that lets users create custom processors and insert them into the REX pipeline at two points:
preExtractor phase - a custom processor may add text pre-processing after input but before tokenization and sentence breaking (whether provided by Rosette Base Linguistics or by the user’s own tokenizer and sentence breaker).
preRedaction phase - a custom processor may correct or modify the output of the default extractors (statistical, regex, gazetteer), using the full information and context available to those extractors (e.g., plain text data, sentence boundaries, tokens, the full list of extracted entities with the source extractors that found them, boundaries, and processor types).
Pre-Extraction Custom Processors: For Additional Text Pre-Processing
Custom processors at the preExtractor phase can provide additional text pre-processing. For example, if the files contain boilerplate, footers, and navigation bar text that are not the target of the analysis, including these parts of the document in the analysis may trip up the tokenization process and thus decrease the overall quality of extraction results. A preExtractor custom processor can strip footers of emails or add metadata to the target files.
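The footer-stripping case might look roughly like the following. This is a minimal sketch of the text transformation only: FooterStripper and stripFooter are hypothetical names, the "--" signature-delimiter convention is an assumption about the input, and the wiring into REX's Annotator interface is omitted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the REX API: removes an email footer
// before the text is handed on to tokenization.
final class FooterStripper {
    // Assumed convention: a line holding only "--" (optionally followed by
    // one space) marks the start of the footer.
    private static final Pattern FOOTER_START = Pattern.compile("(?m)^-- ?$");

    static String stripFooter(String text) {
        Matcher m = FOOTER_START.matcher(text);
        // Keep everything before the first delimiter line;
        // text without a delimiter is returned unchanged.
        return m.find() ? text.substring(0, m.start()) : text;
    }

    public static void main(String[] args) {
        String email = "Hi Joe,\nSee you Monday.\n-- \nSent from Acme Corp webmail\n";
        System.out.print(stripFooter(email));
    }
}
```

Because removing text shifts character offsets relative to the original file, an application that maps entities back to the source would need to account for the removed span.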
The code sample at samples/src/SampleCustomProcessor.java includes a custom processor called MetadataAnnotator, which adds metadata to each file.
Pre-Redaction Custom Processors: For Correcting/Modifying Extractor Output
Custom processors at the preRedaction phase run after the default processors (statistical, gazetteer, regex) and any filters (reject files for regex and gazetteer), but before the redactor. A custom processor at the preRedaction phase receives the full intermediate results of the default extraction processors, with their context, and can modify those results before the redactor phase adjudicates conflicts among the statistical, gazetteer, and regex results.
Only the entities and metadata attribute fields can be updated by a pre-redaction custom processor. If the custom processor attempts to change forbidden fields, specifically the data (input), token, or sentence attributes, those changes are ignored and a warning is logged.
Examples of cases that are correctable with a custom processor include:
Rejecting a mention as an entity: If REX incorrectly extracts a mention that is not an entity, the custom processor can exclude it from the new list of entity results.
Correcting the entity type: If, for example, your dataset consists of personal letters, and you have high confidence that the entity following a closing such as “Love,” or “Sincerely yours,” should be a PERSON, but REX identifies it as an ORGANIZATION, the custom processor can change the type to PERSON.
Modifying entity boundaries: If, for example, REX incorrectly extracts “Hi” as part of a PERSON entity, as in “Hi Joe” instead of just “Joe”, the custom processor can adjust the boundaries so that only “Joe” is extracted.
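The three kinds of correction listed above can be sketched over a simplified mention list. Everything here is an assumption made for illustration: Mention and MentionCorrections are hypothetical stand-ins, not the REX data model, and the rule that a bare “Hi” is never an entity is an invented example of a rejection.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an extracted mention; the real ADM differs.
final class Mention {
    final int start;   // inclusive character offset
    final int end;     // exclusive character offset
    final String type; // e.g. "PERSON", "ORGANIZATION"

    Mention(int start, int end, String type) {
        this.start = start;
        this.end = end;
        this.type = type;
    }

    String text(String doc) {
        return doc.substring(start, end);
    }
}

final class MentionCorrections {
    static List<Mention> correct(String doc, List<Mention> mentions) {
        List<Mention> out = new ArrayList<>();
        for (Mention m : mentions) {
            String text = m.text(doc);
            // 1. Reject: a bare greeting is not an entity (assumed rule).
            if (text.equals("Hi")) {
                continue;
            }
            int start = m.start;
            String type = m.type;
            // 2. Boundary fix: drop a leading "Hi " from a PERSON mention.
            if (type.equals("PERSON") && text.startsWith("Hi ")) {
                start += "Hi ".length();
            }
            // 3. Type fix: an ORGANIZATION right after a letter closing
            //    is relabeled as PERSON.
            String before = doc.substring(0, start).trim();
            boolean afterClosing = before.endsWith("Love,")
                    || before.endsWith("Sincerely yours,");
            if (type.equals("ORGANIZATION") && afterClosing) {
                type = "PERSON";
            }
            out.add(new Mention(start, m.end, type));
        }
        return out;
    }
}
```

The method builds a new list rather than editing the input in place, which keeps the original extractor output available for comparison.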
Filters (reject files) vs. Pre-Redaction Custom Processors
The reject files for regexes and gazetteers simply filter out a list of words or a pattern-matched set of words the user does not want to extract as entities. These reject functions operate without considering the context in which these words appear. By contrast, custom processors at the preRedaction phase have access to the entire context in which an extracted entity appears, and thus can implement smarter rules.
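The difference between the two approaches can be illustrated with a toy comparison. This is a sketch under stated assumptions: RejectVsContext, the word “Apple”, and the “pie” rule are all invented for illustration and do not reflect how REX reads reject files.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; neither method is part of the REX API.
final class RejectVsContext {
    // Context-free: behaves like a reject list, dropping the word
    // wherever it appears.
    static final Set<String> REJECT_LIST = new HashSet<>(Arrays.asList("Apple"));

    static boolean rejectedByList(String mentionText) {
        return REJECT_LIST.contains(mentionText);
    }

    // Context-aware: drop "Apple" only when the text that follows
    // suggests the fruit rather than the company (assumed rule).
    static boolean rejectedByContext(String doc, int start, int end) {
        String mention = doc.substring(start, end);
        String after = doc.substring(end).trim();
        return mention.equals("Apple") && after.startsWith("pie");
    }
}
```

With the reject list, “Apple” is dropped in both “Apple pie is tasty.” and “Apple released a phone.”; the context-aware rule drops it only in the first sentence.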
Implementing the Custom Processor
You can implement the Annotator interfaces in Java in your own JAR and register them via the extractor’s setCustomProcessors. Your custom processor acts as the factory for your annotator implementations, so it should know the requirements of your annotators and supply them with the correct parameters for the requested language and phase. The Annotator is the interface to the ADM (i.e., the annotated text): as directed by the custom processor, it manipulates the ADM and passes it on to the next phase.
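The factory relationship described above can be pictured with the following sketch. It deliberately avoids the real REX types: SketchAnnotator and SketchCustomProcessor are hypothetical names, the ADM is reduced to a plain String, and the language code "eng" and the phase strings are assumptions. Consult samples/src/SampleCustomProcessor.java for the actual interfaces.

```java
import java.util.Optional;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for an annotator: receives the (simplified) document
// and returns the version passed on to the next phase.
interface SketchAnnotator extends UnaryOperator<String> { }

// Hypothetical stand-in for a custom processor acting as an annotator factory.
final class SketchCustomProcessor {
    // Returns an annotator suited to the requested language and phase,
    // or an empty Optional when this processor has nothing to contribute.
    Optional<SketchAnnotator> createAnnotator(String language, String phase) {
        if ("eng".equals(language) && "preExtractor".equals(phase)) {
            SketchAnnotator stripper = text -> text.replace("Sent from my phone", "");
            return Optional.of(stripper);
        }
        return Optional.empty();
    }
}
```

Keeping the language and phase checks in the factory means each annotator can assume it is only ever invoked in the setting it was built for.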
Walk-Through Example of preRedaction Phase Custom Processors
A preRedaction annotator receives entity mentions from all extraction processors, after the reject processors run and before the redactor and coref processors run. It can reject (remove) entity mentions, modify entity types, or adjust entity mention offsets. These modifications affect the input of the subsequent processors in the pipeline. For example, coref does not chain PERSON and ORGANIZATION mentions into the same entity, so a mention whose entity type a custom processor changed from ORGANIZATION to PERSON would be chained only to other PERSON entities. After the Redactor phase, the rest of the pipeline runs as usual.
The code samples at samples/src/SampleCustomProcessor.java demonstrate a sample custom processor and how it might be used in an application.
Create a default entity annotator (to compare its output with the modified entity annotator’s output).
Register the custom processor by
Configure the entity extractor to use the custom processor by
Create an entity annotator with three custom processors:
personContextAnnotator specifies that the entity following the letter closing “Love,” is a PERSON. Note that the annotator changes the “source” (indicating which processor produced the result) from “statistical” to “custom processor”. Remember to edit the redactor configuration file, data/etc/ne-types.xml, to give greater weight to the custom processor if you would like it to “win” when there is a conflict with results from a default processor at the
boundaryAdjustAnnotator corrects a REX mention boundary issue in which “Hi Joe” is extracted instead of just “Joe”.
The third processor is an example of a custom processor at the preExtraction phase that adds metadata to
Test (serves as the “application”)