When training an event model in Rosette Adaptation Studio, you should monitor statistics to see how well the model is performing. Many statistics, ranging from accuracy to annotation progress, are exposed in the detailed reports menu or right on the project dashboard, letting you check on your model at a glance.
The first thing you'll notice when you look at the project is a green bar at the bottom. The fullness of this bar indicates how close the project is to being fully annotated and adjudicated. Above this bar are basic metrics about the number of samples in the project, how many have been annotated, and how many have been adjudicated. To the right of these are three statistics related to accuracy: precision, recall, and F1 measure. Precision answers the question "of the answers you found, what percentage were correct?" Recall answers the question "of all possible correct answers, what percentage did you find?" Since precision is sensitive to false positives and recall is sensitive to false negatives, we also include F1 measure, the harmonic mean of precision and recall. You can think of it as a single summary of your model's accuracy.
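As an illustration, here is a minimal sketch of how these three metrics relate, assuming simple true-positive, false-positive, and false-negative counts; the function name and numbers are hypothetical, not part of Adaptation Studio:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for the given counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of the answers found, how many were correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of all correct answers, how many were found
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)           # harmonic mean of precision and recall
    return precision, recall, f1

# Example: 40 correct guesses, 10 spurious guesses, 20 missed events
print(precision_recall_f1(40, 10, 20))  # (0.8, 0.666..., 0.727...)
```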
It's worth noting that all statistics on the project dashboard are macro averages, as opposed to weighted averages. In a macro average, every event type counts equally, regardless of how frequently it occurs. The Reports section also includes weighted averages, in which more common event types have a greater impact on the average; in other words, weighted averages treat event types with fewer samples as less important.
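For example, the sketch below contrasts the two kinds of average using hypothetical per-event-type F1 scores and sample counts; the event types and numbers are invented for illustration:

```python
scores = {"ACQUISITION": 0.90, "LAWSUIT": 0.60}   # hypothetical F1 per event type
support = {"ACQUISITION": 900, "LAWSUIT": 100}    # hypothetical sample count per event type

# Macro average: every event type counts equally, regardless of frequency.
macro = sum(scores.values()) / len(scores)                        # 0.75

# Weighted average: common event types pull the average toward their score.
total = sum(support.values())
weighted = sum(scores[t] * support[t] for t in scores) / total    # 0.87

print(macro, weighted)
```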
The Reports section also includes more detailed breakdowns of event detection and event extraction scores. Event detection refers to the model's ability to recognize that an event occurred; it is only concerned with identifying the key phrase of an event. Event extraction, on the other hand, refers to the model's ability to identify all the roles in an event, including the key phrase.
All of the above statistics are possible thanks to the evaluation set. This is a set of documents whose human-applied labels serve as "gold data." The model compares its own guesses on these documents to the human labels in order to track how well it's doing. This is why it's important to have as many documents as possible in the evaluation set; if the set is too small, the statistics can be misleading.
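As a rough illustration of that comparison, the sketch below counts matches between hypothetical model guesses and human labels, assuming events are compared by exact match of document, key phrase, and event type; Adaptation Studio's actual matching rules may differ:

```python
# Hypothetical gold labels and model predictions as (document, key phrase, event type)
gold = {("doc1", "acquired", "ACQUISITION"), ("doc2", "sued", "LAWSUIT")}
predicted = {("doc1", "acquired", "ACQUISITION"), ("doc2", "bought", "ACQUISITION")}

tp = len(gold & predicted)   # model guesses that match a human label
fp = len(predicted - gold)   # model guesses with no matching human label
fn = len(gold - predicted)   # human labels the model missed

print(tp, fp, fn)  # 1 1 1
```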
In projects with a large number of annotators, you will also want to keep an eye on inter-annotator agreement, a statistic available in the Reports section. Inter-annotator agreement quantifies the consistency of annotation between annotators. Scores range from -1.0 to 1.0, with 1.0 indicating perfect agreement between annotators. Lower scores may indicate potential issues with your data, your annotation guidelines, or your annotators’ understanding of the task.
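The exact agreement statistic Adaptation Studio uses is not specified here, but Cohen's kappa is one common chance-corrected measure with the same -1.0 to 1.0 range. The following is a minimal sketch for two annotators, using hypothetical labels:

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators' labels."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Probability that both annotators assign the same label by chance
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

a = ["EVENT", "EVENT", "NONE", "EVENT", "NONE"]
b = ["EVENT", "NONE", "NONE", "EVENT", "NONE"]
print(cohens_kappa(a, b))  # ~0.615
```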
Remember that the success of any language model depends on the quality and quantity of the data provided. When using Adaptation Studio, if a model's metrics are unsatisfactory, consider the following questions:
Are there sufficient samples in the project?
Do enough samples include valid examples of the event?
Are the samples annotated well?
Is annotation generally consistent between annotators?
For more information on using Rosette Adaptation Studio, see the Adaptation Studio User Guide.