Most NLP tools are evaluated based on their precision and recall. Overall accuracy is reported using a combination of precision, recall, and the F1 measure.
Precision answers the question "of the answers you found, what percentage were correct?" Precision is sensitive to false positives; a higher value means greater precision.
Recall answers the question "of all possible correct answers, what percentage did you find?" Recall is sensitive to false negatives; a higher value means better recall.
The F1 measure is the harmonic mean of precision and recall. It is sensitive to both false positives and false negatives; a higher value means better accuracy. It isn't quite an average of the two scores, because it penalizes cases where the precision and recall scores are far apart. For example, if the system finds 10 answers that are correct (high precision) but misses 1,000 correct answers (low recall), you wouldn't want the F1 measure to be misleadingly high.
The correct measure depends on your application. In applications that can only handle a few responses, such as voice applications (e.g. Amazon's Alexa), high precision with low recall is ideal, because the system can only present the user with a few options in a reasonable time frame. In other applications, however, such as redacting text to remove personally identifiable information, redacting too much (low precision) is much better than missing even one item that should have been redacted. In that case, high recall is preferred.
Notice
The precision, recall, and F1 measures are based only on the samples in the validation set. The values are not calculated for the training set samples.
The values displayed on the project dashboard are calculated using the annotated validation data as the gold data. As the model is trained, it generates new suggestions and the scores are recalculated. The suggestions generated by the model for the validation samples are compared with the annotated values in the samples.
Calculating Precision, Recall, and F-scores
Let's look at how precision, recall, and F-score are calculated, using as an example a search system that retrieves a number of documents.
TP: True positive. Number of documents retrieved that are correct.
FP: False positive. Number of documents retrieved that are incorrect.
FN: False negative. Number of documents that should have been retrieved, but weren't.
Retrieved: All documents retrieved = TP + FP
Relevant: All documents that should have been retrieved = TP + FN
Precision is the fraction of correct retrievals among all retrieved instances: Precision = TP / (TP + FP).
Recall is the fraction of relevant documents that are successfully retrieved: Recall = TP / (TP + FN).
F-score is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall).
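To make the arithmetic concrete, here is a minimal Python sketch (the helper function is ours, not part of any product API) that computes the three scores from raw counts, reusing the earlier example of 10 correct answers found and 1,000 missed, with no incorrect answers returned for simplicity:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# 10 correct answers found, 0 incorrect answers returned, 1,000 correct answers missed
p, r, f1 = precision_recall_f1(tp=10, fp=0, fn=1000)
print(f"precision={p:.3f}  recall={r:.3f}  F1={f1:.3f}")
# precision=1.000  recall=0.010  F1=0.020
```

Even with perfect precision, the very low recall keeps the F1 measure low, which is exactly the behavior described above.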
The project dashboard displays precision, recall, and F1 measures for the labels. There are two different ways of calculating these scores across multiple labels:
Macro Average: The macro averages for precision and recall are the means of the per-label precision and recall scores. The macro F1 is the harmonic mean of the macro precision and macro recall scores.
Micro Average: The micro average pools the true positives, false positives, and false negatives across all labels before computing the scores. The micro scores are sensitive to imbalanced datasets: if one label is over-represented (i.e., there are more gold instances of that label in the evaluation subset relative to the others), it will have a bigger impact on the micro-averaged precision, recall, and F1 measures.
Note
The statistics on the project dashboard are micro averages.
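As a rough illustration of the difference, the following Python sketch computes both averages for two hypothetical labels, one well represented and one rare; the label names and counts are made up for this example:

```python
# Hypothetical per-label counts on a validation set.
counts = {
    "PERSON":   {"tp": 90, "fp": 10, "fn": 10},  # well-represented label
    "LOCATION": {"tp": 5,  "fp": 5,  "fn": 15},  # rare label
}

def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Macro: average the per-label precision and recall scores, then take the
# harmonic mean of those averages for the macro F1.
per_label = [prf(**c) for c in counts.values()]
macro_p = sum(p for p, _, _ in per_label) / len(per_label)
macro_r = sum(r for _, r, _ in per_label) / len(per_label)
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)

# Micro: pool the counts across labels, then compute the scores once.
totals = {k: sum(c[k] for c in counts.values()) for k in ("tp", "fp", "fn")}
micro_p, micro_r, micro_f1 = prf(**totals)

print(f"macro: P={macro_p:.2f} R={macro_r:.2f} F1={macro_f1:.2f}")
print(f"micro: P={micro_p:.2f} R={micro_r:.2f} F1={micro_f1:.2f}")
# The micro scores track the dominant PERSON label much more closely.
```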
Inter-Annotator Agreement
Note
You must be registered as a manager for the project.
Machine-learning models are completely dependent on the quality of the data used for training. Inconsistent annotation or lack of adherence to the project annotation guidelines will lead to less accurate models. Especially when starting a new project or on-boarding new human annotators, check for reliable annotations by having a subset of data annotated in parallel by multiple human annotators.
Krippendorff’s alpha is a statistical inter-rater reliability metric used to measure inter-annotator agreement. Krippendorff’s alpha scores range from -1.0 to 1.0 with 1.0 indicating perfect agreement between annotators. A score of 0.0 indicates agreement no better than random chance (as if your annotators picked their labels randomly out of a hat). A reliability score of 0.80 or greater is generally considered sufficient (though, the higher the better). Lower scores may indicate potential issues with your data, your annotation guidelines, or your annotators’ understanding of the task. A low level of inter-annotator agreement will ultimately lead to a less accurate model, so we recommend repeatedly measuring the reliability of your annotators until they achieve a satisfactory level of agreement. The cases where annotators disagree are usually good examples to include in your annotation guidelines. It can be useful to have a discussion about points of disagreement with your annotators as a group to reach a consensus.
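If you want to sanity-check agreement outside the project dashboard, one option is NLTK's AnnotationTask, whose alpha() method computes Krippendorff's alpha; the annotator names, sample IDs, and labels below are hypothetical:

```python
# A minimal sketch using NLTK's agreement metrics (pip install nltk).
from nltk.metrics.agreement import AnnotationTask

# Each record is (annotator, item, label). Two annotators label four samples;
# they disagree on sample_2.
annotations = [
    ("annotator_1", "sample_1", "PERSON"),
    ("annotator_2", "sample_1", "PERSON"),
    ("annotator_1", "sample_2", "LOCATION"),
    ("annotator_2", "sample_2", "ORGANIZATION"),
    ("annotator_1", "sample_3", "PERSON"),
    ("annotator_2", "sample_3", "PERSON"),
    ("annotator_1", "sample_4", "LOCATION"),
    ("annotator_2", "sample_4", "LOCATION"),
]

task = AnnotationTask(data=annotations)
print(f"Krippendorff's alpha: {task.alpha():.3f}")
```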
Rosette Adaptation Studio calculates Krippendorff's alpha scores as you adjudicate samples. To view the current values, select Manage from the project toolbar. The values are displayed for all annotators together and for each pair of annotators.