This section provides an overview of the tasks and issues to be considered when developing a data corpus and training models. Previous sections describe how to perform these tasks with Rosette Adaptation Studio.
One of the most critical tasks when training machine learning systems for NLP is developing the data corpus used to train the models. Acquisition and curation of the data may be the single most important and difficult part of training machine learning models. The Studio can reduce the time required to annotate data and train a model.
The tasks for collecting data and training models are described below.
The first step is determining the linguistic annotations required for your specific NLP task. Each task requires its own set of annotations, and those annotations may depend on other NLP tasks. For example, language identification and word segmentation may prepare samples for named entity recognition or sentiment analysis.
Examples of annotation types required for different tasks include:
POS-tagging: POS tags, at either the character or word level
Named entity recognition: entity types
Event extraction: events
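For example, before any annotation begins, you might sketch the label inventory for an NER task as a simple mapping from label to definition; the labels below are purely illustrative and would be replaced by the types your use case requires.

    # Sketch: an illustrative NER label inventory, written out before annotation
    # so every annotator works from the same definitions. The labels are assumptions.
    ENTITY_TYPES = {
        "PERSON": "Names of individual people, including aliases",
        "ORGANIZATION": "Companies, agencies, and other institutions",
        "LOCATION": "Geographic and political locations",
        "PRODUCT": "Named commercial products",
    }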
Named Entity Recognition
For named entity recognition you need to understand the business and determine the types of entities to extract. Are you looking for names of people, locations, and organizations for a content management system? Are you analyzing financial news stories to extract companies and financial metrics? Are you sifting through social media posts to find the names of products and companies, and then looking at sentiment towards those entities? Each use case requires data that is representative of the task at hand.
Questions to answer:
What type of data will you be analyzing? Financial news stories? Tweets? Company press releases?
What are the specific types of entities you are interested in? Can you define them clearly and unambiguously?
What are your requirements for precision vs. recall? Is it more important to return only the most "correct" results, at the risk of missing some, or to return all possible results?
The data must reflect the real-world data you'll be analyzing or processing, in both content and format. You also want to be sure that it is balanced, with relatively similar numbers of examples of the different types of data. And you need to have enough data. When collecting and annotating data, collect more than you think you will need, if possible. This helps ensure the data is representative and balanced, and provides enough for future customization and training.
The documents must represent the type of text you want Rosette to process, in terms of content as well as format, including:
domain vocabulary - financial, scientific, legal
format - electronic medical records, email, patent applications, news stories
genre - novel, sports news, tweets
If you are interested in analyzing tweets, then you should collect tweets for your corpus (rather than Wikipedia articles, instruction manuals, or sports commentary transcripts). Similarly, if the goal is analyzing news articles, your corpus should consist of news articles (not tweets or novels). Language (English, German), domain (financial, scientific, legal), genre (novel, sports news, tweets), register (formality of speech), and document size (tweets, blogs, magazine articles) are all aspects of the content that you should consider when assessing representativeness.
Remove "noisy documents" from the dataset, including documents that are not relevant for the task or do not contain sufficient lexical content. For example, a document which consists of a URL (and nothing else) is not a useful sample (unless you are analyzing URLs).
Note that you can use Rosette’s language identification functionality to select only documents in the target language.
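As an illustration, this kind of filtering can be sketched in a few lines; the example below uses the open-source langdetect package as a stand-in for Rosette's language identification, and the length threshold and target language are assumptions to adjust for your task.

    # Sketch: drop documents that are too short or not in the target language.
    # langdetect stands in for Rosette's language identification here;
    # MIN_CHARS and TARGET_LANG are illustrative assumptions.
    from pathlib import Path
    from langdetect import detect
    from langdetect.lang_detect_exception import LangDetectException

    TARGET_LANG = "en"   # assumed target language
    MIN_CHARS = 100      # assumed minimum amount of lexical content

    def keep(path: Path) -> bool:
        text = path.read_text(encoding="utf-8", errors="replace").strip()
        if len(text) < MIN_CHARS:
            return False          # too little lexical content (e.g., a bare URL)
        try:
            return detect(text) == TARGET_LANG
        except LangDetectException:
            return False          # no language could be identified

    corpus = [p for p in Path("corpus").glob("*.txt") if keep(p)]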
The documents in the test dataset should collectively contain enough instances of each type you want the system to extract, and each type should have roughly the same number of examples.
A balanced dataset avoids generating biased results or creating a biased model. Your model may be much better at extracting people than organizations. If your test data is primarily people, your results will look better than they actually are. Or conversely, if your test data is primarily organizations, your model may appear to perform a lot worse than it actually does.
Each type should have roughly the same number of examples and the data should be balanced across sources. For example, if you are drawing on multiple data sources, there shouldn’t be disproportionately more data from one source than another. If the data comes from different time periods, try to get a balanced distribution across time. Depending on the variables and the task, you may need to downsample your data in order to maintain balance, even if it reduces the size of your dataset. Don't hesitate to remove documents from your dataset in order to maintain balance as long as you have enough data.
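A quick balance check can be as simple as counting annotated examples per entity type and per source. The sketch below assumes each record is a dict with "source" and "mentions" fields; adapt the field names to however your annotations are actually stored.

    # Sketch: report label and source balance in an annotated corpus.
    # The record structure ({"source": ..., "mentions": [{"type": ...}, ...]})
    # is an assumption about how your annotations are stored.
    from collections import Counter

    def balance_report(records):
        by_type, by_source = Counter(), Counter()
        for rec in records:
            by_source[rec["source"]] += 1
            for mention in rec["mentions"]:
                by_type[mention["type"]] += 1
        print("examples per entity type:", dict(by_type))
        print("documents per source:   ", dict(by_source))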
Preparing Data for Annotation
An ideal corpus includes properly formatted and selected documents whose contents are clean plain text. This section lists how to prepare your data before you start annotating.
Avoid duplicate or near-duplicate documents in your corpus. Duplicate data skews the frequencies of different vocabulary terms. Duplicates or near-duplicates can also create "data leakage" if they occur across your training/testing partition. Data in your testing set should never overlap with your training set. There are various tools for finding duplicate documents in a corpus before you partition your dataset. We recommend:
Dedup - A Python script for finding duplicate text files.
Onion (ONe Instance ONly) - A de-duplicator for large collections of texts. It can measure the similarity of paragraphs or whole documents and drop duplicates based on the threshold you set.
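If you only need to catch exact duplicates, a content hash is often enough; the sketch below assumes one document per text file and keeps the first copy of each distinct document. Near-duplicate detection still calls for a tool such as Onion.

    # Sketch: drop byte-identical duplicates by hashing whitespace-normalized contents.
    # Near-duplicate detection (paragraph-level similarity) is left to tools like Onion.
    import hashlib
    from pathlib import Path

    def unique_documents(paths):
        seen = set()
        for path in paths:
            text = Path(path).read_text(encoding="utf-8", errors="replace")
            digest = hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()
            if digest not in seen:     # keep only the first copy of each document
                seen.add(digest)
                yield path

    kept = list(unique_documents(Path("corpus").glob("*.txt")))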
Converting Documents to Unicode
Most plain-text documents are encoded as Unicode UTF-8. If you find data in other encodings, we recommend converting all of your data to UTF-8. A number of tools can detect and convert character encodings.
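A minimal conversion sketch using the chardet package (an assumption; any encoding-detection library will do) is shown below. Because encoding detection is probabilistic, spot-check the converted files for garbled characters.

    # Sketch: detect a file's encoding with chardet and rewrite it as UTF-8.
    import chardet
    from pathlib import Path

    def to_utf8(path: Path) -> None:
        raw = path.read_bytes()
        guess = chardet.detect(raw)        # e.g. {"encoding": "windows-1252", ...}
        text = raw.decode(guess["encoding"] or "utf-8", errors="replace")
        path.write_text(text, encoding="utf-8")

    for p in Path("corpus").glob("*.txt"):
        to_utf8(p)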
Extracting Plain Text from Markup
If the text in your documents includes markup tags (such as HTML, XML, Markdown, etc.), you will need to extract the plain text. A number of tools can automatically extract text from marked-up documents.
For documents in a particularly cumbersome markup language, consider converting them to a friendlier format using a tool such as Pandoc.
If you need to write your own software to extract text, start by researching what your go-to programming language offers; most languages have libraries for parsing markup of various kinds. Python's standard library, for example, offers several tools for working with HTML, XML, and similar markup languages, and third-party packages are available for many other formats.
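As a starting point, here is a minimal HTML extractor built only on Python's standard library; real-world pages usually warrant a fuller parser such as BeautifulSoup or lxml.

    # Sketch: strip HTML tags using the standard library's html.parser,
    # skipping text inside <script> and <style> elements.
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        SKIP = {"script", "style"}

        def __init__(self):
            super().__init__()
            self.chunks = []
            self._skip_depth = 0

        def handle_starttag(self, tag, attrs):
            if tag in self.SKIP:
                self._skip_depth += 1

        def handle_endtag(self, tag):
            if tag in self.SKIP and self._skip_depth:
                self._skip_depth -= 1

        def handle_data(self, data):
            if not self._skip_depth:
                self.chunks.append(data)   # collect text outside script/style

    def html_to_text(markup: str) -> str:
        extractor = TextExtractor()
        extractor.feed(markup)
        return " ".join(" ".join(extractor.chunks).split())   # collapse whitespace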
Once the requirements for the task are defined, create annotation guidelines, completely defining the labels to be used when annotating the corpus. When composing annotation guidelines, describe each label and what it is intended to capture. While descriptions are useful, examples are essential. For each label, include positive examples, cases where the label in question applies. Sometimes negative examples, where the label in question does not apply, are equally instructive. For examples that are not straightforward, such as those that annotators disagree on, it is important to discuss those examples and explain the reasoning behind the final decision that was made.
The annotation guidelines should be uploaded into Rosette Adaptation Studio by selecting the Manage link from the navigation bar. Anyone working on the project can click on the Guidelines link in the system menu to review the guidelines.
Manual annotation is the part of the cycle where trained humans augment raw linguistic content according to the formal data model designed in the previous step. Rosette Adaptation Studio improves the process of manually annotating a dataset: it applies active learning to reduce the number of annotations required and provides a fluid interface for annotators and annotation managers. See Annotate for details on annotating with Rosette Adaptation Studio.
Once you've collected and annotated your data, the data must be partitioned into training and validation sets before moving on to training the model.
If you evaluate a model on documents that it has seen during training, you are not measuring the expected performance on the new, unseen data the model will encounter in the real world. Evaluating a model against the data it was trained on can be a helpful quick check, but you should expect the results to be near perfect and highly inflated compared to expected real-world performance.
Training set: The model learns from this subset.
Validation set: This subset (sometimes called a "hold-out" or "evaluation" set) should never be inspected by you or your model. It is imperative that this dataset be left uninspected to prove that the trained and tuned model can perform at an acceptable level of accuracy on new, unseen data, which this dataset simulates. If you accidentally inspect this subset, or accidentally train your model on it, you will no longer be measuring your model's ability to generalize.
When you split your dataset into training/validation subsets, try to ensure that each subset maintains the desired properties of balance and representativeness. Additionally, if there are metadata in your corpus that represent different dimensions of your data, it is ideal to maintain the distributions of those attributes within each subset. That is to say, if you collected your data from different sources, from different time periods, etc., each subset should represent a similar distribution of sources, time periods, etc. If these properties can be maintained, it will help mitigate skewing your classifier.
Ideally you want to give your classifier as much training data as possible while maintaining a sufficiently representative validation subset.
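One common way to preserve those distributions when splitting outside the Studio is a stratified split, sketched here with scikit-learn's train_test_split; the "source" metadata field and the 80/20 ratio are assumptions to adapt to your corpus.

    # Sketch: an 80/20 split stratified by a metadata attribute (here, "source"),
    # so each subset keeps roughly the same mix of sources as the full corpus.
    from sklearn.model_selection import train_test_split

    def split_corpus(documents, test_size=0.2, seed=42):
        strata = [doc["source"] for doc in documents]   # assumed metadata field
        train_docs, valid_docs = train_test_split(
            documents, test_size=test_size, stratify=strata, random_state=seed
        )
        return train_docs, valid_docs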
In Rosette Adaptation Studio, when you add documents you assign them to either the training or validation set, or you can let the Studio assign them for you.
When using the Studio's automatic assignment of training and validation sets, there is no guarantee that the proper balance of metadata, categories, or other dimensions of your data will be preserved.
The model is trained on the data in the training set. Scores (precision, recall, and F1 measure) are calculated on the data in the validation set. You can also at any point select Test Document to test a specific document against the current model.
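For reference, these scores are derived from counts of correct, spurious, and missed extractions; a minimal sketch of the arithmetic:

    # Sketch: precision, recall, and F1 from mention-level counts.
    # tp = correctly extracted mentions, fp = spurious mentions, fn = missed mentions.
    def scores(tp: int, fp: int, fn: int):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1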
Rosette Adaptation Studio continually trains and tests as you annotate more documents. Each time a new sample is annotated from the training set, the model is retrained.
As you evaluate your model, you may identify problems with the data corpus or the annotations. Fixing these may require one or more of the following actions:
updating the annotation guidelines
collecting more data
annotating new data or re-annotating existing data
cleaning or modifying the existing data
Common pitfalls in developing and annotating the data corpus are described below.
When you are developing your corpus, it is not possible to know, a priori, how much data will be sufficient. Sometimes you will select a corpus and perform careful annotation and train a system only to discover that the model does not perform well. One way to assess if your model is hampered by a lack of data is to plot a so-called learning curve.
To plot a learning curve:
Divide your training data set into parts.
Train your model on the first part and plot its performance with respect to size (number of documents) trained on, evaluating against a held out test set.
Add the next part of the training set and plot the performance again, evaluating against the same held out test set.
Generally, the trend of this plot should show an increase in performance as more data is added. If this is not the case, it usually indicates a problem with the annotations.
If the trend of this plot increases, but levels off after a certain amount of training data has been added, then adding additional data is unlikely to help.
If the trend of the plot continues to increase without leveling off, add additional data to your training set until you reach a point where adding more data achieves diminishing returns.
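The sketch below illustrates the procedure; train_model and evaluate_f1 are hypothetical stand-ins for however you train and score a model (they are not Studio APIs).

    # Sketch: plot a learning curve by training on increasingly large slices of the
    # training data and evaluating each model on the same held-out test set.
    # train_model and evaluate_f1 are hypothetical stand-ins for your own pipeline.
    import matplotlib.pyplot as plt

    def learning_curve(train_docs, test_docs, n_parts=5):
        sizes, scores = [], []
        for i in range(1, n_parts + 1):
            subset = train_docs[: i * len(train_docs) // n_parts]
            model = train_model(subset)                     # hypothetical training call
            sizes.append(len(subset))
            scores.append(evaluate_f1(model, test_docs))    # hypothetical scoring call
        plt.plot(sizes, scores, marker="o")
        plt.xlabel("training documents")
        plt.ylabel("F1 on held-out test set")
        plt.show()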
If your training set is imbalanced, the types that have few examples relative to other types may perform poorly. In this case, you may need to add more samples to the categories or labels that have insufficient data.
An ideal corpus includes only documents whose contents are clean plain-text. Undesirable qualities of data which can have an impact on your models include:
Badly encoded data: Most plain-text documents are encoded as UTF-8; however, you will sometimes find data in other encodings. All data must be encoded as UTF-8.
Markup: If the text is embedded within markup (such as HTML, XML, Markdown, etc.), then you will need to extract the plain-text.
If you have documents marked up with a particularly cumbersome markup language, consider converting them to a friendlier format.
Metadata: Some documents, especially those collected from structured websites, may include text that is not part of the document body, such as titles, headers, footers, site map information, or other metadata. At best, such metadata can just be noise, essentially artificially over-representing the frequencies of certain terms in your corpus. At worst, the metadata can cause your model to learn based on the metadata itself, instead of the document content.
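As an illustration, obvious boilerplate lines can often be filtered with a few patterns before annotation; the patterns below are illustrative assumptions and would need to be tuned to your own sources.

    # Sketch: drop obvious boilerplate lines (headers, footers, navigation text).
    # The patterns are illustrative assumptions only.
    import re

    BOILERPLATE = [
        re.compile(r"^copyright\b", re.IGNORECASE),
        re.compile(r"^all rights reserved", re.IGNORECASE),
        re.compile(r"^(home|about|contact)\s*\|", re.IGNORECASE),
    ]

    def strip_boilerplate(text: str) -> str:
        kept = [
            line for line in text.splitlines()
            if line.strip() and not any(p.search(line.strip()) for p in BOILERPLATE)
        ]
        return "\n".join(kept)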