REX has a field training system to customize and retrain the statistical models to improve the extraction results in your domain. The two primary types of customization are unsupervised training, which uses unannotated text, and supervised training, which uses annotated text.
The REX FTK is also used to create custom knowledge base models for linking, as described in Creating a Custom Knowledge Base for Linking.
If your domain of information is similar to the default REX domain, news stories, then we recommend that you first use Unsupervised Training and unannotated text to retrain the models. If your domain is very different from the default REX domain, then we recommend Supervised Training with annotated text to retrain the statistical models.
You can customize the models to statistically extract new types of entities. The default models extract the following entity types: Person, Location, Organization, Product, and Title. The models use linguistic context to disambiguate between "Apple Inc." and an "apple". You can extract car parts, medical terms, weapons, and other entity types specific to your use case.
The FTK can be shipped as a Docker container or as a package of scripts (currently supported only on CentOS). The prerequisites and setup instructions for the two packages differ somewhat, as discussed in Requirements and in the installation instructions for each package.
Installing the Field Training Kit
- Installed REX package
- Field training kit, one of: the base kit rex-training-kit-<version>.tar.gz or the Docker image rex-training-docker-<version>.tar.gz
- Field training language resources, rex-training-lang-<lang>.tar.gz, where <lang> is the language code in ISO 639-3 Language Codes.
Third party software and hardware
For base Field Training Kit
- 12 GB of RAM, minimum
- Java 11 or above
- Python 3.5
- CentOS 6+
For Docker Field Training Kit
- Docker installed, with at least 12 GB of RAM available to the Docker host or VM. For example, if using Oracle VirtualBox as the VM platform that hosts Docker, run this command:
docker-machine create --driver virtualbox --virtualbox-memory "12288" default
Warning
Windows with Docker.
Due to Docker limitations, the customization process may be slower and more complicated when running on Windows than with Mac or Linux. Plan your time and hardware resources accordingly.
Corpus of text in UTF-8, without markup.
This corpus may be annotated for Named Entities or left in its plain-text form, depending on the type of statistical customization (supervised or unsupervised) and whether you require an objective quantitative accuracy estimation.
We measure the size of a corpus by the number of tokens, or words, that comprise it. This includes both entities and non-entities.
For both customization methods, we recommend that you annotate a corpus that meets the minimum required size, which depends on the type of customization. Annotate content from your domain that closely resembles the input that REX will process.
If you train REX with supervised training (or add a new entity type), you must have a separate annotated corpus for evaluation. You should annotate a large enough corpus for both training and evaluation.
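For a rough estimate of corpus size, a simple whitespace-based word count is usually close enough for planning purposes; REX's own tokenization may count somewhat differently. For example, on a Unix-like system, assuming your plain-text files are gathered in a hypothetical directory named corpus:
wc -w corpus/*.txt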
Installing without Docker
The following section guides you through the common setup steps that are needed for statistical training (supervised or unsupervised), annotation, and evaluation.
Important
Before installing the FTK, REX must already be installed in your environment.
Install the FTK:
- Unzip the field training kit rex-training-kit-<version>.tar.gz. A directory named basis will be created where you unzipped the file. Let <basis-ftk-path> be the absolute path to the directory where the basis directory was created.
- Create a directory named asset. Unzip the field training language resource rex-training-lang-<lang>.tar.gz within the asset directory.
From within the unzipped field training language resource:
- Create a directory for each language, asset/input/<lang>:
mkdir -p asset/input/<lang>
- Set environment variables, with <rex-field-training-home> being the directory one level above the asset directory and <rex-installation-path> being the root directory of the unzipped REX package rex-je-<version>.zip (a filled-in example appears after these steps):
PATH=<basis-ftk-path>/basis/ftk/bin:$PATH
export REX_JE_ROOT=<rex-installation-path>
export ASSET=<rex-field-training-home>/asset
- Copy your example text files to <rex-field-training-home>/asset/input/<lang>. Reminder:
Important
The training process requires clean UTF-8 input documents with no markup. If acquiring text from the web, please make sure to remove HTML tags, JavaScript, CSS, metadata, etc.
- To display the usage menu:
ftk help
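For reference, here is a filled-in example of the environment setup above, assuming the hypothetical paths /opt/rex-ftk (where the basis directory was created), /opt/rex-je (the unzipped REX package), and /opt/rex-ftk-home (the directory one level above asset):
PATH=/opt/rex-ftk/basis/ftk/bin:$PATH
export REX_JE_ROOT=/opt/rex-je
export ASSET=/opt/rex-ftk-home/asset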
Load the Docker image
- Create a working directory, which we will refer to as <rex-field-training-home>.
- Load the image:
docker load -i rex-training-docker-<version>.tar.gz
- Validate that the image is loaded:
docker images
Confirm there is an image named rex-field-training in the images list.
Prepare the training files directory
- Make the directories asset/input/<lang> in <rex-field-training-home>:
mkdir -p asset/input/<lang>
The asset directory will be mounted to the Docker container and is used for transferring data in and out of it.
- Extract the language resource to <rex-field-training-home>/asset:
cd asset
tar xvf ../rex-training-lang-<lang>.tar.gz
After extracting the files, make the asset directory and sub-directories writable by anyone (an example command appears after these steps).
- Copy your example text files to <rex-field-training-home>/asset/input/<lang>.
Important
The training process requires clean UTF-8 input documents with no markup. If acquiring text from the web, please make sure to remove HTML tags, JavaScript, CSS, metadata, etc.
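One way to make the extracted asset directory and its sub-directories writable by anyone, as required above (a sketch assuming a POSIX shell; adapt to your own security policy):
chmod -R a+rwX <rex-field-training-home>/asset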
Run the Docker container
docker run -it --rm -v <local-asset-dir-full-path>:/asset -v <rex-installation-path>:/basis/rex rex-field-training
Optional: If you're planning to perform any annotations using the bundled Brat server (see Annotating below), please add port mapping to the run command:
docker run -it --rm -v <local-asset-dir-full-path>:/asset -v <rex-installation-path>:/basis/rex -p <local_port_number>:8080 rex-field-training
Windows: Note the special syntax used on Windows machines to denote mounted path names:
docker run -it --rm -v '//<rex-field-training-home>:/asset' -v '//<rex-installation-path>:/basis/rex' rex-field-training
An example filepath for Windows is //c/Users/basis/asset:/asset
Tip
You may find it convenient to run the container inside a screen session (or another terminal multiplexer of your choice), so you can later detach and reattach to your session from a terminal.
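For example, a complete launch on Linux or macOS might look like the following sketch, assuming a hypothetical asset directory at /home/user/rex-ftk-home/asset, a REX installation at /opt/rex-je, and Brat mapped to local port 8080 (all paths and the port are illustrative):
screen -S rex-training
docker run -it --rm -v /home/user/rex-ftk-home/asset:/asset -v /opt/rex-je:/basis/rex -p 8080:8080 rex-field-training
Press Ctrl-a d to detach from the screen session and run screen -r rex-training to reattach later.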
A startup message similar to the one below should now appear in your console window, confirming that your Docker container is set up correctly:
REX field training tools (docker) version
Please refer to the legal notices in /basis/ftk/dependencies/ThirdPartyLicenses.txt.
Copyright (c) 2016 Basis Technology Corporation All Rights Reserved.
Support@basistech.com
http://www.basistech.com
brat instance started on port 8080 in this container. brat data: /asset/bratdata
Available commands:
Statistical model:
generate-ngram, cluster-ngram, train-model, evaluate
Advanced tools: learning-curve
Linker model:
generate-linker-binaries, train-linker-model
Binary gazetteer:
build-binary-gazetteer
The previous command placed you at a shell prompt inside a running field training docker container. From here, you may proceed to perform supervised training, unsupervised training, evaluation, or annotation.
Using the Field Training Kit
The field training system enables you to customize and retrain the statistical models with your input to improve the extraction results in your domain. This customization expands the extraction to include entities REX has not encountered.
You have the option of retraining the models on unannotated (unsupervised) or annotated (supervised) text. If your domain of information is similar to the default REX domain, news stories, then we recommend that you first use Unsupervised Training and unannotated text to retrain the models. If your domain is very different from the default REX domain, then we recommend Supervised Training with annotated text to retrain the statistical models.
The major benefit of unsupervised training is that the process does not require the human-intensive effort to annotate example data. The model will discover entities using the context of words within the plain text input. It will generate groupings of words that appear in similar contexts and assign them to the same cluster, like "Boston", "Texas", and "France". The model then uses that cluster information to extract entities from your input.
If your domain is very different from the news stories that REX was trained on, after performing Unsupervised Training, you can use Supervised Training to improve the extraction results using annotated data from your domain. To perform supervised training, you need to annotate a corpus of text. You can use any annotation tool. The REX training system includes the Brat Rapid Annotation Tool.
Generating N-Gram Distributions
Taking the customer-provided example text from /asset/input, the system splits it into unigram and bigram counts. It breaks the input into sentences, then tokenizes and normalizes the input to begin generating the n-grams. Using Rosette Base Linguistics, we determine the normalized form for each input token. For some languages we use lemmatization to determine the dictionary form of a token, and then disambiguate to return the correct meaning of the word from multiple lemmas.
Next, the system scans the normalized input to calculate the distribution of unigrams and bigrams. The new distributions are combined with the previously trained and annotated corpus in /asset/corpora/<lang>/news/train. The combined n-grams are placed in /asset/combined/<lang>.ftk.{uni,bi}.gz. The n-gram distributions are the building blocks for the clusters, which are created in the next step.
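As a rough illustration of what unigram and bigram counts are (this is not the FTK's actual pipeline or file format, just a sketch using standard bash tools on a hypothetical file text.txt):
tr -s '[:space:]' '\n' < text.txt | sort | uniq -c | sort -rn | head
tr -s '[:space:]' '\n' < text.txt > tokens.txt
paste -d' ' tokens.txt <(tail -n +2 tokens.txt) | sort | uniq -c | sort -rn | head
The first command counts single tokens (unigrams); the last two pair each token with its successor and count the pairs (bigrams).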
Creating Clusters
Reading the n-grams produced in /asset/combined/<lang>.ftk.{uni,bi}.gz, the system applies a mutual-information clustering algorithm to determine the correlation between n-grams. The system creates the final word clusters and stores them in /asset/wordclasses/<lang>.gz.
The algorithm groups up to one thousand words into each cluster, a limit chosen empirically to yield the best extraction accuracy.
Training Models
Using the newly created word clusters from the previous step, the system can now retrain the model. The system reads the word clusters generated and produces the files model-{LE,BE}.bin and model_uc-{LE,BE}.bin in /asset/models/<lang>.
The supervised training algorithm creates a sequence labeler, a structured, averaged perceptron. It is informed by the annotated text and the unlabeled clusters generated in the previous step.
The field training language resources from BasisTech contain annotated news documents used for training the default REX statistical model. These documents are intentionally encrypted for BasisTech use only.
The model is then compiled into a binary format, which can be read by the REX system. Once the binary model is deployed, REX will use it to extract entities from your domain.
Customer input
- 100 MB (or more) of text in UTF-8 without markup. This is the recommended minimum to improve the model's accuracy. The amount of input you need depends on how different the target domain is from the original text used to train the model.
Important
Creating word clusters can take a significant amount of time to complete. Our tests indicate that retraining the models on high-resource languages like English, Spanish, and Chinese could take up to a few days to complete, when using server-grade machines.
Performing Unsupervised Training
- Generate the n-gram distributions:
generate-ngram <lang>
The local asset/generated directory now contains compressed word n-gram statistics derived from your corpus.
- Create the new clusters:
cluster-ngram <lang>
The local <rex-field-training-home>/asset/wordclasses/ directory now contains the results of the clustering algorithm applied to the word n-grams.
- Train the models on the new clusters:
train-model <lang>
The output is one or more binary files that comprise the retrained statistical model (for languages in Latin script, two files are created: a case-sensitive and a case-insensitive one).
- Copy the new models from <rex-field-training-home>/asset/models/<lang> to <rex-home>/data/statistical/<lang>, and remove the default models. Keep the naming convention of the models (see the example after these steps).
Alternatively, call the EntityExtractor.setStatisticalModels method and point the system at the new model(s) for that language.
- Continue to Evaluating the Retrained Statistical Model to measure the accuracy of the new statistical models on your input.
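For example, deploying retrained English models might look like the following sketch, assuming a hypothetical <rex-home> of /opt/rex-je. The backup step is our suggestion so you can revert; if the retrained files keep the same names as the default models they simply replace them, otherwise delete the remaining default files afterwards:
cp -r /opt/rex-je/data/statistical/eng /tmp/eng-models-backup
cp <rex-field-training-home>/asset/models/eng/*.bin /opt/rex-je/data/statistical/eng/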
Manual annotation of documents is required for the following use cases: supervised training, adding a new statistical entity type, and evaluating model accuracy on your own data.
Tip
We strongly recommend that you first perform Unsupervised Training, which does not require any annotation, before attempting supervised training, in order to improve the extraction results with minimal human effort.
To help with the annotation process, the Brat Rapid Annotation Tool is included in the Docker field training container. Once you map the Brat port to a port on your machine, you can access Brat with your web browser. See the Brat Manual for more information on using and customizing Brat.
When you've finished annotating the corpus, this guide will instruct you on how to convert the files to Rosette's Annotated Data Model (ADM), which REX subsequently uses to retrain the statistical models.
The following section guides you through the process of annotating data files using the bundled Brat server.
Tip
The bundled Brat server is configured for Left-To-Right languages by default. To annotate Right-To-Left languages (Hebrew, Arabic, etc.), edit visual.conf in your Brat collection (default bratdata) to include:
[options]
Text direction:rtl
Instructions
- Create a world-writable directory in <rex-field-training-home>/asset/ named bratdata:
mkdir <rex-field-training-home>/asset/bratdata
- Copy your corpus of plain, unannotated text (<filename>.txt) files into the bratdata directory. This can be a subset of the corpus you had previously placed in asset/input/<lang>.
- For each <filename>.txt file, create an empty <filename>.ann file in the bratdata directory (a shell loop example appears after these instructions).
- Make all of the individual files (<filename>.txt and <filename>.ann) as well as the directory itself readable and writable by anyone. For example:
chmod -R ugo+rw <rex-field-training-home>/asset/bratdata
- If you haven't done so already, run the Docker container while mapping the Brat port (8080) to a free port on your machine. See the General Setup Instructions above.
The container startup scripts populate the empty asset/bratdata directory with configuration files.
- Open a browser and go to http://<docker_machine_ip>:<local_port_number>.
Note that on systems that use a docker-machine, you'll need to obtain its IP to be able to access it. See the Docker documentation for additional details.
Follow the instructions to select the files to annotate.
- In the top right-hand corner, click Login. You will need to log in to Brat every time you start the Docker container.
- Username: brat
- Password: brat
- If you're annotating for a new entity type, open the asset/bratdata/annotation.conf file. Add the name of the new entity type to the list on a new line, for example "MEDICAL".
- Annotate the examples with Brat in the browser.
The Brat output will be in the <filename>.ann files. Brat uses a 'standoff' annotation method where the original .txt files are read-only and the annotations are stored in the readable/writable .ann files, which represent the annotations in a simple tabular format. Below is an excerpt of an .ann file that represents annotations of the entity type FRUIT in a Spanish text file:
T1 FRUIT 277 284 Manzana
T2 FRUIT 313 320 toronja
T3 FRUIT 571 579 manzanas
- Create a directory for the ADM output files:
mkdir /asset/<admoutput>
- When you're done annotating, convert the Brat files to ADM using the corpuscmd utility:
corpuscmd Brat2Adm --bratInput /asset/bratdata --output /asset/<admoutput>
The output json-serialized ADM files created will have the .txt.adm.json extension.
- Continue to Supervised Training to retrain the models, to Add a New Entity Type, or to Evaluate the Retrained Statistical Model.
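For the step above that creates an empty <filename>.ann file for every <filename>.txt file, a bash loop such as the following saves manual work (a sketch; run it from inside the bratdata directory):
cd <rex-field-training-home>/asset/bratdata
for f in *.txt; do touch "${f%.txt}.ann"; done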
Alternate Ways to Generate Annotated Data
You can also generate annotated data in ADM json format directly. This is the representation and serialization used by REX for training and evaluation. REXCmd, REX's command-line utility, also produces json-serialized ADMs as its output, so you may simply generate (or convert) your corpus following the REXCmd-generated json as an example.
Some users find Brat's standoff annotation format (consisting of .ann and .txt files) a convenient way to represent annotations. If you wish to write Brat files yourself (without using the bundled Brat server), you may do so and then start the process discussed above from the ADM conversion step (creating the ADM output directory and running corpuscmd Brat2Adm).
Please contact <support@basistech.com> for more information about the Rosette Annotated Data Model and ADM files.
Supervised Training to Improve Accuracy in a Domain
You can improve the extraction of the default entity types from your input by customizing the statistical processor with annotated data. The more your domain differs from the default REX domain, the larger the impact this customization will have on the REX results.
- Set up Brat and annotate data.
For Supervised Training, we recommend that you annotate a corpus containing a minimum of 60,000 tokens. However, if your target domain is very different from the default REX domain, then we recommend a larger corpus. The greater the difference between the domains, the more tokens are necessary to create a new statistical model.
- Copy your corpora of ADM files into a directory (<train corpus path>) accessible from the container. (You may want to replace the Basis-provided training corpus for "New Model Only" training, explained below. To do so, copy your ADM files to /asset/corpora/<lang>/news/train and remove the *.enc files in the directory; see the example commands after this list. Please make sure your corpus contains at least 1000 unique tokens.)
- Put all available raw, unannotated plain text into <rex-field-training-home>/asset/input/<lang>/.
- Set up Docker.
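If you choose to replace the Basis-provided training corpus as mentioned above, the commands might look like the following sketch, assuming English and a hypothetical directory my-adm-files holding your annotated ADM files (paths are illustrative; equivalent commands can be run inside the container against /asset/corpora/eng/news/train):
cp my-adm-files/*.adm.json <rex-field-training-home>/asset/corpora/eng/news/train/
rm <rex-field-training-home>/asset/corpora/eng/news/train/*.enc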
Supervised Training
- From your Docker console, generate the n-gram distributions:
generate-ngram <lang>
- From your Docker console, create the new clusters:
cluster-ngram <lang>
Instructions for New or Mixed Statistical Models
You must choose whether to extract entities using both the new and the default statistical models together, which we call model mixing, or whether to use the new statistical model exclusively.
With model mixing, REX runs both the new and the default models in parallel and uses the redactor module to adjudicate the overlapping results.
Model Mixing
- From your Docker console, train the model on the new clusters:
train-model -T <entity types> -t <train corpus path> <lang>
where <entity types> is a comma-delimited list of entity types in the annotated corpus, e.g. -T FRUIT,DRUGS
The output is a binary file of the retrained statistical model. For languages in Latin script there will be two models: case-sensitive and case-insensitive.
- Copy the new models from <rex-field-training-home>/asset/models/<lang> to <rex-home>/data/statistical/<lang>, alongside the existing models. Keep the naming convention of the models.
Alternatively, call the EntityExtractor.setStatisticalModels method and point the system at both the new and the default model(s) for that language. Pointing only to the new model(s) will overwrite the default model(s).
- Once you have retrained the models, proceed to Evaluating the Retrained Statistical Model.
Using New Model Only
- From your Docker console, train the model on the new clusters:
train-model -w /asset/wordclasses/<lang>.customer.gz -T <entity types> -t <train corpus path> <lang>
The output is a binary file of the retrained statistical model. For languages in Latin script there will be two statistical models: case-sensitive and case-insensitive.
- Copy the new models from <rex-field-training-home>/asset/models/<lang> to <rex-home>/data/statistical/<lang>, and remove the default models. Keep the naming convention of the models.
Alternatively, call the EntityExtractor.setStatisticalModels method and point the system at the new model(s) for that language.
- Once you have retrained the models, proceed to Evaluating the Retrained Statistical Model.
Tip
Model Naming Convention
The prefix must be model. and the suffix must be -LE.bin. Any alphanumeric ASCII characters are allowed in between.
Example valid model names:
- model.fruit-LE.bin
- model.customer4-LE.bin
Adding New Statistical Entity Types
The REX statistical processor can also be retrained to extract types of entities that are specific to your domain. This customization functions best with a distinct category of entities which do not have a set pattern, or which have an unlimited number of possibilities. If the entities appear in a pattern, such as license plate numbers, then you should Create a Regex for extracting plate numbers. If there is a finite number of possible entities, such as movies that have won an Academy Award, then you should Create a Gazetteer to extract the movie titles.
When you annotate content, you can add new entity types to the Brat configuration. Then REX will take the annotated content and statistically extract domain-specific entities, such as medical terms.
- Set up Brat and annotate data.
If you are adding a new entity type, we recommend that you annotate a corpus with a minimum of 100,000 tokens. The corpus must be larger because REX needs more data to accurately extract a new entity type and to avoid conflicts with similar entities. More annotated data increases accuracy for that new entity type.
- Retrain the model, following the steps in Supervised Training.
- Continue to Evaluating the Retrained Statistical Model to review the new model's performance.
Evaluating the Retrained Statistical Model
Initial Accuracy Quick Test
When the statistical training is complete, the system will evaluate the resulting model against a standard data set and report F-scores. These numbers are part of the output generated by train-model and measure the accuracy of the freshly retrained model on the original evaluation input from BasisTech. They are a useful quick test to ensure that the models are functioning properly. If you encounter dramatically low or unexpectedly high F-scores, it could be a sign that something went wrong in the training process.
Since the model has been adapted to your target domain, these F-scores do not reflect the new model's actual accuracy on your data. The scores may be moderately lower than before because the retrained model is no longer tuned for news documents.
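For reference, precision (P) is the fraction of extracted entity mentions that are correct, recall (R) is the fraction of true entity mentions that were extracted, and the F-score conventionally reported is F1, the harmonic mean of the two: F1 = 2PR / (P + R).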
Measuring Accuracy on Your Data
To perform a methodical evaluation of the performance of the default or customized models on your dataset, you must first annotate a corpus of text from your domain. Do not reuse documents that were already used for training.
Instructions
- Annotate data.
For evaluating the performance of REX, we recommend that you annotate a corpus with a minimum of 30,000 tokens. A larger corpus will produce more statistically significant F-scores.
- Replace the files in /asset/corpora/<lang>/news/eval with the new ADM files.
- If you haven't done so already, make sure the Field Training Docker container is ready for use.
- From your Docker console, evaluate the models on your domain:
evaluate <lang>
As in train-model, you can use a specific set of entity types with -T <entity types>. The -v flag will produce verbose output, which is especially useful when using custom entity types (see the example after these steps).
- The Docker console will report the F-scores, precision, and recall.
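For example, a hypothetical evaluation of a Spanish model that includes the custom entity type FRUIT, with verbose output (a sketch; check the usage output from ftk help if the flag syntax differs in your version):
evaluate -v -T FRUIT spa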