Entities are the key actors in your text data: the organizations, people, locations, products, and dates mentioned in documents.
An entity model is trained to extract a set of entity types. Before starting any type of entity recognition project, you must identify the entity types you want to extract.
Each named entity extraction project in Rosette Adaptation Studio results in a model trained by the REX training server (RTS). As the training samples in the project are annotated, the annotated data is sent to RTS to train the model. The training server can train multiple models simultaneously and support multiple projects concurrently.
The current status of the model being trained is displayed on the Manage page.
NER Model Training Lifecycle
- Documents are uploaded into Rosette Adaptation Studio and stored in the Rosette Adaptation Studio database.
- Unsupervised training
  - Word classes are automatically generated as a background task.
- Supervised training
  - Samples in the training set are annotated and adjudicated (adjudication can be automatic).
  - The training server receives the annotations as adjudicated samples become available and trains a new version of the model. This model is held in memory.
  - The trained model is written out to disk. This process can take a few minutes; while it occurs, suggestions come from the newly trained model in memory.
- Training is complete.
  - Suggestions are provided by the model on disk.
- The model can be deployed to another machine through the Export Model task. Exporting a model has no impact on training or training resources. Export Model is only enabled after all components have completed all training tasks.
Supervised training is repeated as additional samples are annotated and adjudicated.
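As a quick orientation, the following minimal Python sketch models where suggestions come from in the later lifecycle stages; the stage names are illustrative and are not Adaptation Studio identifiers.

```python
from enum import Enum, auto

class TrainingStage(Enum):
    """Illustrative stages from the lifecycle above (not Studio identifiers)."""
    WRITING_TO_DISK = auto()   # newly trained model is being written out; can take a few minutes
    COMPLETE = auto()          # training is complete and the model is on disk

def suggestion_source(stage: TrainingStage) -> str:
    """While the model is being written to disk, suggestions come from the
    newly trained in-memory model; afterwards they come from the model on disk."""
    if stage is TrainingStage.WRITING_TO_DISK:
        return "newly trained model in memory"
    return "model on disk"

print(suggestion_source(TrainingStage.WRITING_TO_DISK))  # newly trained model in memory
print(suggestion_source(TrainingStage.COMPLETE))         # model on disk
```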
Model Management and Caching
Rosette Adaptation Studio includes a robust model management system, with a cache to manage multiple models and support model permanence across restarts. Be aware that training a new model takes several minutes. If the server shuts down before writing the model to disk, the current in-memory models are lost, but they will be retrained when the training server is restarted.
When no annotation is occurring, the model is not being trained. After a period of inactivity, an inactive model is evicted from the Rosette Training Server, freeing up memory. The model is automatically reloaded once training resumes.
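To illustrate the eviction and reload behavior, here is a minimal sketch of the idea, not the actual training server implementation; the class, idle threshold, and loader are hypothetical.

```python
import time

class ModelCache:
    """Minimal sketch: evict models after a period of inactivity, reload on demand."""

    def __init__(self, max_idle_seconds: float, loader):
        self.max_idle_seconds = max_idle_seconds
        self.loader = loader      # callable that reloads a model from disk, e.g. loader(project_id)
        self._models = {}         # project_id -> (model, last_used_timestamp)

    def get(self, project_id):
        """Return the model for a project, reloading it if it was evicted."""
        self.evict_idle()
        if project_id not in self._models:
            self._models[project_id] = (self.loader(project_id), time.time())
        model, _ = self._models[project_id]
        self._models[project_id] = (model, time.time())   # refresh the last-used time
        return model

    def evict_idle(self):
        """Drop models that have been idle longer than the threshold, freeing memory."""
        now = time.time()
        for project_id in list(self._models):
            _, last_used = self._models[project_id]
            if now - last_used > self.max_idle_seconds:
                del self._models[project_id]

# Hypothetical usage: models idle for more than an hour are dropped and reloaded later.
cache = ModelCache(max_idle_seconds=3600, loader=lambda project_id: f"model-for-{project_id}")
print(cache.get("my-ner-project"))
```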
Unsupervised vs Supervised Training
Adaptation Studio uses two types of training for entity extraction:
- Unsupervised training: uses a clustering algorithm to create word classes from the uploaded documents. It doesn't require any input from the user.
- Supervised training: uses annotated text to train models.
The major benefit of unsupervised training is that the process does not require the labor-intensive effort of annotating example data. The model discovers entities using the context of words within the plain text input. It generates groupings of words that appear in similar contexts and assigns them to the same cluster, such as "Boston", "Texas", and "France". The model then uses that cluster information to extract entities from your input.
However, if your domain is different from the default training domain or you are extracting new entity labels, you will need to use supervised training.
Ideally, the model is trained on both word classes and annotated text. The supervised training algorithm uses the word classes as well as the annotated data as input.
Unsupervised training evaluates a corpus of data and extracts word classes, which are then used when training the model. The advantage of unsupervised training is that it doesn't require annotating large amounts of data, a resource-intensive process. Unsupervised training works best with a large amount of data; a minimum of 100 MB is recommended, though a few GB is better.
Rosette Adaptation Studio automatically performs unsupervised training and generates word classes when you upload documents into the system. To get the greatest impact from unsupervised training, you should upload a large corpus of data at once. You do not have to annotate all the uploaded samples. Because generating word classes is a time-consuming task, ranging from a few hours for 100 MB to a few days for a few GB of data, we recommend uploading the documents a couple of days before starting your annotation process. Unsupervised training proceeds through the following steps:
- Documents are uploaded into the system.
- Using Rosette Base Linguistics, the input is broken into sentences, tokenized, and normalized to generate the normalized form of each input token.
- The system scans the normalized input to calculate the distribution of unigrams and bigrams.
- Using the n-gram distributions, the system applies a clustering algorithm to determine the correlation between n-grams (illustrated in the sketch after this list).
- The system creates the final word classes. The algorithm groups up to one thousand words into a word class to yield the optimal extraction accuracy.
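To make the n-gram and clustering steps concrete, here is a toy Python sketch (not the Rosette algorithm): it counts unigrams and bigrams over pre-tokenized sentences, turns the bigram counts into a neighbor-context vector for each word, and clusters those vectors with k-means. Words that appear in similar contexts, such as "Boston", "Texas", and "France" in the example, tend to land in the same class.

```python
from collections import Counter, defaultdict

import numpy as np
from sklearn.cluster import KMeans

def word_classes(sentences, n_clusters=2):
    """Toy word-class generation from unigram/bigram statistics."""
    # Unigram and bigram counts over the (already tokenized and normalized) input.
    unigrams = Counter(token for sentence in sentences for token in sentence)
    bigrams = Counter(
        (sentence[i], sentence[i + 1])
        for sentence in sentences
        for i in range(len(sentence) - 1)
    )

    # Context vector per word: how often each word appears as a right or left neighbor.
    vocab = sorted(unigrams)
    index = {word: i for i, word in enumerate(vocab)}
    vectors = np.zeros((len(vocab), 2 * len(vocab)))
    for (left, right), count in bigrams.items():
        vectors[index[left], index[right]] += count               # right-neighbor counts
        vectors[index[right], len(vocab) + index[left]] += count  # left-neighbor counts

    # Words with similar neighbor distributions end up in the same cluster.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)
    classes = defaultdict(list)
    for word, label in zip(vocab, labels):
        classes[int(label)].append(word)
    return dict(classes)

sentences = [
    ["she", "moved", "to", "Boston", "last", "year"],
    ["she", "moved", "to", "Texas", "last", "year"],
    ["she", "moved", "to", "France", "last", "year"],
]
# "Boston", "Texas", and "France" share identical contexts, so they are likely
# to be grouped into the same word class.
print(word_classes(sentences))
```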
Supervised training uses annotated documents to train the model. Supervised training is useful when the target domain is significantly different from the default BasisTech training domain (news stories) and when training a model to extract new entity types (labels).
Rosette Entity Extractor (REX) is shipped with a fully trained statistical model. The default model is trained on annotated news documents. You can improve the extraction of the default entity types by customizing this statistical model with more annotated data. The greater the difference between your domain and the default REX domain, the larger the impact on the results.
When training your new model, you can choose to supplement the annotations from the Studio with the data that was used to train the default statistical model. This is recommended when training a model using the same labels as the default labels. Select Use Basis Training Data when you create the model. This will include the training data used to train the statistical model shipped with Rosette Entity Extractor.
If you choose to Use Basis Training Data you can only train on the default training labels. You cannot add new labels for annotation and training.
Note
The time to train the model when Use Basis Training Data is enabled may be a few minutes longer than without the extra training data. The training time is determined by the number of annotated documents as well as the language.
If this option is not selected, the model is trained exclusively on the adjudicated annotation data provided by the Studio. This is required when there is at least one label that is not one of the standard REX entity types.
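The constraint can be summarized in a small sketch. The label set below is an illustrative subset of the default REX entity types; the actual default label set depends on the language.

```python
# Illustrative subset of the default REX entity types (the real set varies by language).
DEFAULT_REX_LABELS = {"PERSON", "LOCATION", "ORGANIZATION", "PRODUCT", "TITLE"}

def can_use_basis_training_data(project_labels):
    """Use Basis Training Data is only allowed when every project label is a
    default training label; a single new label rules it out."""
    return set(project_labels) <= DEFAULT_REX_LABELS

print(can_use_basis_training_data(["PERSON", "PRODUCT"]))     # True
print(can_use_basis_training_data(["PERSON", "EQUIPMENT"]))   # False: EQUIPMENT is a new label
```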
Models from Multiple Domains
The most effective machine learning models are domain-specific because general models don't perform well on specific domains. For example, if you want to analyze tweets, you would want to train a model on tweets. If you also want to analyze financial data, you would train another model on financial data.
To extract entity types that are not part of the default Entity Extractor, train a new model using additional labels. For example, if your application has domain-specific types, or the terminology and usage are distinct, collect separate data corpora from each domain and train a separate model for each domain. One example could be legal terms vs. medical terms. In this case, in Rosette Adaptation Studio you would create a new project for each domain and define the labels (entity types) for that domain. Each project trains its own domain-specific statistical model with its own set of entity types.
When annotating, documents should be assigned to annotators based on their domain. This can be automated by evaluating the metadata (for example, the document source) or by a classifier that is trained to detect domains.
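For example, metadata-based routing with a classifier fallback might look like the following sketch; the metadata fields, domain names, and classifier are hypothetical.

```python
# Hypothetical mapping from document source (metadata) to annotation domain.
SOURCE_TO_DOMAIN = {
    "court-filings": "legal",
    "clinical-notes": "medical",
}

def classify_domain(text):
    """Stand-in for a classifier trained to detect domains."""
    return "legal" if "plaintiff" in text.lower() else "medical"

def assign_domain(document):
    """Route a document to a domain using its source metadata,
    falling back to the domain classifier when the source is unknown."""
    domain = SOURCE_TO_DOMAIN.get(document.get("source"))
    if domain is None:
        domain = classify_domain(document["text"])
    return domain

doc = {"source": "clinical-notes", "text": "Patient presents with a persistent cough."}
print(assign_domain(doc))   # medical
```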
Multiple models can be deployed in Rosette Server. Using a custom statistical model alongside the REX standard models to extract entities is called model mixing. With model mixing, the models are run in parallel. The default model can be mixed with a domain-specific model because it is trained on a corpus of well-edited text in a relatively generic domain. If the specific domain does not contain well-edited text, for example tweets or other social media sources, the default model should be turned off, as it is likely to generate noisy results.
We don't recommend applying more than two models to process a single document, where the two models are the default REX model and a domain-specific model. Trying to process medical documents with multiple models, such as the default model + medical domain model + legal domain model, will lead to noisy results, i.e., false positives from the legal domain model.
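These guidelines can be captured in a small validation sketch; the model names and the well-edited-text flag are illustrative, not Rosette Server configuration.

```python
def validate_model_mix(models, well_edited_text=True):
    """Check a proposed model mix against the guidelines above: at most the
    default REX model plus one domain-specific model, and no default model
    for noisy text such as tweets or other social media sources."""
    domain_models = [m for m in models if m != "rex-default"]
    if len(domain_models) > 1:
        raise ValueError("Use at most one domain-specific model per document.")
    if "rex-default" in models and not well_edited_text:
        raise ValueError("Turn off the default model for text that is not well edited.")
    return models

print(validate_model_mix(["rex-default", "legal-domain"]))              # OK for well-edited text
print(validate_model_mix(["medical-domain"], well_edited_text=False))   # OK for noisy text
```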
In Rosette Server, use custom profiles to route users to the correct models. Custom profiles allow a single Rosette Server instance to support a different data domain for each user group. One profile may use the standard model alongside the legal terms model, while another profile uses the medical terms model along with the REX model.
Models to Extract New Entity Types
Rosette Entity Extractor (REX) is shipped with a statistical model that extracts a set of default labels. There are times when you can choose between using an existing entity type and creating a new entity type for a particular kind of entity.
For example, let's assume we are training a model to extract computer hardware, such as specific mentions of graphics cards, routers, and other devices. These could be extracted as product, which is a default label in some languages. Or, you could decide that your business requires the mentions to be extracted as equipment. To extract them as product mentions, you can use the Basis Training Data and extend the existing REX statistical model. To extract them as equipment mentions, you are adding a new label and must not use the Basis Training Data.
Similar to the issue of multiple domains, Rosette handles multilingual scenarios by applying language-specific models to different languages. Before applying any language-specific models, Rosette uses a language classifier to identify the language and script of a document, and then applies language- and script-specific models based on the document's classification. A model trained on English text is not expected to perform adequately on Spanish text, much less on a language such as Arabic, which is typologically even more different from English than Spanish is and uses a different orthographic system.
In Rosette Adaptation Studio, create a separate project for each language. A language-specific model is then trained for each language. If you have multiple domains and languages, create a project for each domain-language pair.
Case-Sensitive vs Case-Insensitive Models
Models may be built to be case-sensitive or case-insensitive. This refers to the capitalization (also known as 'case') of the input text.
- Case-sensitive models are most appropriate for well-formed text where case is an informative model feature.
- Case-insensitive models are most appropriate for text with no or imprecise casing, for example, tweets, all-caps, no-caps, or text with headline capitalization.
Case-sensitive models are only available for English in Rosette Adaptation Studio.
We recommend splitting your corpus into case-sensitive and case-insensitive documents. At project creation, the option Train Case Sensitive Model is checked by default; if you are training a case-insensitive model, clear the checkbox. You will train two separate models, one for each input type.
If you don't have enough annotated documents to split the corpus, but your documents have headline and body components (the former case-insensitive, the latter case-sensitive), you can train case-sensitive and case-insensitive models on the same corpus. At runtime, you will have to split the input documents into their separate components and process each part of the document with a different model.
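At runtime, that split might look like the following sketch; the headline/body heuristic and the two model stubs are hypothetical placeholders for however you invoke the deployed models.

```python
def split_headline_and_body(document: str):
    """Hypothetical splitter: treat the first line as the headline, the rest as the body."""
    headline, _, body = document.partition("\n")
    return headline, body

def extract_entities(document, case_sensitive_model, case_insensitive_model):
    """Process each component with the model trained for its casing style."""
    headline, body = split_headline_and_body(document)
    # Headlines have all-caps or headline capitalization, so use the case-insensitive model;
    # body text is well formed, so use the case-sensitive model.
    return case_insensitive_model(headline) + case_sensitive_model(body)

# Stand-ins for the two deployed models (each returns a list of extracted entities).
def case_sensitive_stub(text):
    return [("case-sensitive model", text)]

def case_insensitive_stub(text):
    return [("case-insensitive model", text)]

doc = "ACME BUYS WIDGETCO\nAcme Corp. announced the acquisition of WidgetCo on Monday."
print(extract_entities(doc, case_sensitive_stub, case_insensitive_stub))
```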
When the model is deployed in production, set the caseSensitivity parameter to automatic in the rex-factory-config.yaml file in Rosette Server to let REX decide which model to use at run time.