You must be registered as a manager.
Reports about annotation/adjudication progress and inter-annotator agreement (IAA) can be accessed from the Reports option on the project navigation bar. Other metrics, such as precision, recall, and F1 measure, can be viewed directly on the project dashboard.
Reports give project managers information about the quantity and quality of the data being created by annotators and adjudicators. When training a model, it is critical to have the correct volume and distribution of high-quality data. Only managers have access to reports.
Annotation/Adjudication: Shows the distribution of annotations and adjudications among all labels. Click Download CSV to download the data.
Annotation Progress: Displays the number of annotations per label. The top row displays the total number of annotations for each label in parentheses next to the label name. Use these numbers to monitor how balanced your data set is. For example, if you are training a model to retrieve location entities, you should use this report to make sure there is a significant number of location annotations.
Adjudication Progress: Displays the number of adjudications per label.
Inter-Annotator Agreement: Shows Krippendorff's Alpha for all annotators and for each pair of annotators. Krippendorff's Alpha is a number between -1.0 and 1.0 that quantifies how much annotators agreed with each other. A higher score indicates higher agreement and therefore better data. This score should ideally be 0.80 or greater.
IAA History: Displays inter-annotator agreement for all annotators (represented by Krippendorff's Alpha on the y-axis) over time (represented by day on the x-axis). Hover your cursor over the data point for each day to see the Krippendorff's Alpha for each pair of annotators for that day. Ideally, IAA should increase at the beginning as initial points of disagreement are resolved, and then level off as annotators improve and approach a consistently high level of agreement.
Precision, Recall, and F1 Measure
Most NLP tools are evaluated based on their precision and recall. Accuracy is measured as a combination of precision, recall, and F1 measure.
Precision answers the question "of the answers you found, what percentage were correct?" Precision is sensitive to false positives; higher is more precise.
Recall answers the question "of all possible correct answers, what percentage did you find?" Recall is sensitive to false negatives; higher is better recall.
F1 measure is the harmonic mean of precision and recall. The F1 measure is sensitive to both false positives and false negatives; a higher value means better accuracy. It isn't quite an average of the two scores, because it penalizes cases where the precision and recall scores are far apart. For example, if the system finds 10 answers that are all correct (high precision) but misses 1,000 correct answers (low recall), you wouldn't want the F1 measure to be misleadingly high.
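The 10-found, 1,000-missed scenario above can be checked numerically. The following is a minimal sketch, not product code; the helper name `f1_score` and the variable names are chosen here for illustration only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Scenario from the text: 10 answers found, all correct; 1,000 correct answers missed.
true_positives = 10
false_positives = 0
false_negatives = 1000

precision = true_positives / (true_positives + false_positives)  # 10 / 10
recall = true_positives / (true_positives + false_negatives)     # 10 / 1010

arithmetic_mean = (precision + recall) / 2
f1 = f1_score(precision, recall)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"arithmetic mean={arithmetic_mean:.3f} F1={f1:.3f}")
```

A plain average of the two scores would come out just above 0.5, while the harmonic mean stays close to the much lower recall value, which is the behavior described above.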
The correct measure depends on your application. In applications where you can only handle a few responses, such as voice applications (e.g. Amazon's Alexa), high precision with low recall is ideal, because the system can only present the user with a few options in a reasonable time frame. However, in other applications, such as redacting text to remove personally identifiable information, redacting too much (low precision) is much better than missing even one item that should have been redacted. In that case, high recall is preferred.
The precision, recall, and F1 measures are based only on the samples in the validation set. The values are not calculated for the training set samples.
The values displayed on the project dashboard are calculated using the annotated validation data as the gold data. As the model is trained, it generates new predictions and the scores are recalculated. The predictions generated by the model for the validation samples are compared with the annotated values in the samples.
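As a rough illustration of comparing predictions with gold annotations, the sketch below counts true positives, false positives, and false negatives over a small made-up validation set. It assumes exact-match scoring on (start, end, label) spans; the documentation does not state how Adaptation Studio matches predictions to annotations, so the matching rule, the `Span` type, and the sample data are all assumptions for illustration.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int   # character offset where the mention begins
    end: int     # character offset where the mention ends (exclusive)
    label: str


def count_matches(gold: set[Span], predicted: set[Span]) -> tuple[int, int, int]:
    """Return (TP, FP, FN) for one sample under exact-match scoring (an assumption)."""
    tp = len(gold & predicted)   # predicted spans that are also in the gold data
    fp = len(predicted - gold)   # predicted spans with no gold counterpart
    fn = len(gold - predicted)   # gold spans the model did not predict
    return tp, fp, fn


# Hypothetical validation set: each item pairs gold annotations with model predictions.
validation = [
    ({Span(0, 5, "PERSON"), Span(20, 26, "LOCATION")},
     {Span(0, 5, "PERSON"), Span(30, 34, "LOCATION")}),
    ({Span(3, 9, "LOCATION")},
     {Span(3, 9, "LOCATION")}),
]

tp = fp = fn = 0
for gold, predicted in validation:
    sample_tp, sample_fp, sample_fn = count_matches(gold, predicted)
    tp += sample_tp
    fp += sample_fp
    fn += sample_fn

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(f"TP={tp} FP={fp} FN={fn} precision={precision:.3f} recall={recall:.3f}")
```

Rerunning the same comparison after each training round, with new predictions against the same gold validation samples, is what makes the scores change over time.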
Named Entity Recognition Metrics
Named entity recognition models are trained to recognize and extract entity mentions. Each annotated sample may contain multiple entities.
The statistics displayed on the project dashboard are the macro averages of the precision, recall, and F1 measures for the labels.
Event extraction models are trained to recognize and extract multiple components. Each event mention contains a key phrase and one or more roles. A single annotated sample may contain multiple event mentions, and therefore, multiple key phrases and multiple roles.
The statistics displayed on the project dashboard are macro averages of the precision, recall, and F1 measures for the key phrases only. The displayed values are weighted averages across all event types in the model.
Calculating Precision, Recall, and F-scores
Let's look at how precision, recall, and F-score are calculated. Assume a search system that has retrieved a number of documents.
TP: True positive. Number of documents retrieved that are correct.
FP: False positive. Number of documents retrieved that are incorrect.
FN: False negative. Number of documents that should have been retrieved, but weren't.
Retrieved: All documents retrieved = TP + FP
Relevant: All documents that should have been retrieved = TP + FN
Precision is the fraction of correct retrievals among all retrieved instances.
Recall is the fraction of relevant documents that are successfully retrieved.
F-score is the harmonic mean of precision and recall.
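Written out with the counts defined above, these three definitions correspond to the standard formulas:

```latex
\begin{align*}
\text{Precision} &= \frac{TP}{TP + FP} = \frac{TP}{\text{Retrieved}} \\[4pt]
\text{Recall}    &= \frac{TP}{TP + FN} = \frac{TP}{\text{Relevant}} \\[4pt]
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
```

As an illustrative example with made-up counts, TP = 8, FP = 2, and FN = 8 give a precision of 8/10 = 0.8, a recall of 8/16 = 0.5, and an F-score of (2 × 0.8 × 0.5) / (0.8 + 0.5) ≈ 0.615.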
The project dashboard displays precision, recall, and F1 measures for the labels. There are two different ways of calculating these scores across multiple labels:
Macro Average: The macro averages for precision and recall are the means of the per-label precision and recall scores. The macro F1 is the harmonic mean of the macro precision and macro recall scores.
Micro Average: This score is calculated from the total true positives, false positives, and false negatives across all labels. The micro scores are sensitive to imbalanced datasets. If one label is overrepresented (i.e., there are more gold instances of this label in the evaluation subset relative to the others), it will have a bigger impact on the micro averaged precision, recall, and F1 measures.
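The difference between the two averaging methods is easiest to see with an imbalanced example. The sketch below uses made-up per-label counts and hypothetical helper names (`safe_div`, `harmonic_mean`); the macro F1 follows the definition given above (harmonic mean of macro precision and macro recall).

```python
def safe_div(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def harmonic_mean(p: float, r: float) -> float:
    return safe_div(2 * p * r, p + r)


# Hypothetical per-label counts as (TP, FP, FN); LOCATION is heavily overrepresented.
counts = {
    "LOCATION": (90, 10, 10),
    "PERSON": (5, 5, 15),
}

# Macro: score each label separately, then average the per-label scores.
per_label_precision = [safe_div(tp, tp + fp) for tp, fp, fn in counts.values()]
per_label_recall = [safe_div(tp, tp + fn) for tp, fp, fn in counts.values()]
macro_precision = sum(per_label_precision) / len(counts)
macro_recall = sum(per_label_recall) / len(counts)
macro_f1 = harmonic_mean(macro_precision, macro_recall)

# Micro: pool the counts across all labels, then score once.
total_tp = sum(tp for tp, _, _ in counts.values())
total_fp = sum(fp for _, fp, _ in counts.values())
total_fn = sum(fn for _, _, fn in counts.values())
micro_precision = safe_div(total_tp, total_tp + total_fp)
micro_recall = safe_div(total_tp, total_tp + total_fn)
micro_f1 = harmonic_mean(micro_precision, micro_recall)

print(f"macro: P={macro_precision:.3f} R={macro_recall:.3f} F1={macro_f1:.3f}")
print(f"micro: P={micro_precision:.3f} R={micro_recall:.3f} F1={micro_f1:.3f}")
```

With these counts the micro scores are pulled toward the well-performing, frequent LOCATION label, while the macro scores give the weaker PERSON label equal weight and come out lower.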
You must be registered as a manager.
Machine-learning models are completely dependent on the quality of the data used for training. Inconsistent annotation or lack of adherence to the project annotation guidelines will lead to less accurate models. Especially when starting a new project or on-boarding new human annotators, check for reliable annotations by having a subset of data annotated in parallel by multiple human annotators.
Krippendorff’s alpha is a statistical inter-rater reliability metric used to measure inter-annotator agreement. Krippendorff’s alpha scores range from -1.0 to 1.0, with 1.0 indicating perfect agreement between annotators. A score of 0.0 indicates agreement no better than random chance (as if your annotators picked their labels randomly out of a hat). A reliability score of 0.80 or greater is generally considered sufficient (though the higher the better). Lower scores may indicate potential issues with your data, your annotation guidelines, or your annotators’ understanding of the task. A low level of inter-annotator agreement will ultimately lead to a less accurate model, so we recommend repeatedly measuring the reliability of your annotators until they achieve a satisfactory level of agreement. The cases where annotators disagree are usually good examples to include in your annotation guidelines. It can be useful to discuss points of disagreement with your annotators as a group to reach a consensus.
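To make the score more concrete, the sketch below computes Krippendorff's alpha for nominal (categorical) labels using the textbook coincidence-matrix formulation. This is an illustration only: the documentation does not say which variant of the metric Adaptation Studio uses or how it defines the units being compared, and the function name and sample ratings are made up.

```python
from collections import Counter
from itertools import permutations
from typing import Optional, Sequence


def krippendorff_alpha_nominal(units: Sequence[Sequence[Optional[str]]]) -> float:
    """Krippendorff's alpha for nominal labels.

    `units` holds one sequence per annotated item, with one entry per annotator
    (None where an annotator did not label the item).
    """
    coincidences: Counter = Counter()
    for unit in units:
        labels = [label for label in unit if label is not None]
        if len(labels) < 2:
            continue  # an item needs at least two labels to be pairable
        weight = 1 / (len(labels) - 1)
        # Every ordered pair of labels from different annotators on the same item.
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += weight

    totals: Counter = Counter()
    for (a, _), value in coincidences.items():
        totals[a] += value
    n = sum(totals.values())
    if n == 0:
        raise ValueError("no item was labeled by at least two annotators")

    # Observed disagreement: coincidences between different labels.
    observed = sum(v for (a, b), v in coincidences.items() if a != b)
    # Disagreement expected by chance, given how often each label was used.
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used, so no disagreement is possible
    return 1 - observed / expected


# Hypothetical ratings: four items, two annotators, one disagreement.
ratings = [["A", "A"], ["A", "A"], ["B", "B"], ["A", "B"]]
alpha = krippendorff_alpha_nominal(ratings)
print(f"alpha={alpha:.3f}")
```

In this small example a single disagreement out of four items already brings alpha down to roughly 0.53, well below the 0.80 guideline, because the metric corrects for the agreement that would be expected by chance.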
Adaptation Studio calculates Krippendorff's alpha as you adjudicate samples. To view the current values, select Manage from the project toolbar. The values are displayed for all annotators together and for each pair of annotators.
The IAA history report displays the progress of the inter-annotator agreement over time.