Adaptation Studio includes a wealth of options for monitoring the performance of a given model. These are split between two areas: at-a-glance metrics on the project dashboard, and more specific reports in the Reports tab.
Project Dashboard Metrics
On the project dashboard you can easily see the progress of your annotators and adjudicators, as well as metrics pertaining to accuracy. The three accuracy metrics are precision, recall, and F1.
Precision answers the question "of the answers you found, what percentage were correct?" Higher is more precise.
Recall answers the question "of all possible correct answers, what percentage did you find?" Higher is better recall.
The F1 measure is the harmonic mean of precision and recall, so it is high only when both precision and recall are high. A higher value means better accuracy.
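As a rough illustration of the three definitions above, the following sketch computes precision, recall, and F1 from raw counts. The function name and the example counts are hypothetical and are not taken from Adaptation Studio; the formulas are the standard ones.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1) from raw counts.

    true_positives:  answers the model found that were correct
    false_positives: answers the model found that were wrong
    false_negatives: correct answers the model missed
    """
    found = true_positives + false_positives
    possible = true_positives + false_negatives
    # Guard against division by zero when nothing was found or nothing exists.
    precision = true_positives / found if found else 0.0
    recall = true_positives / possible if possible else 0.0
    total = precision + recall
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 correct answers found, 2 wrong answers found, 4 missed.
precision, recall, f1 = precision_recall_f1(8, 2, 4)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two scores than a simple average would.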
Next, let's take a closer look in the Reports tab. Select Reports on the project navigation bar. You will see three sections: Annotation, Evaluation, and Inter-Annotator Agreement.
The Annotation section leads you to a graph showing more detailed annotation progress, including the total number of annotations performed by each annotator, and the number of each event they annotated. Since you annotated this tutorial project yourself, you are unlikely to find any surprises here. But this can be a useful place to keep track of projects with a larger number of annotators.
The Evaluation section includes two options: event detection and event extraction.
Event Detection refers to the model's ability to identify an event. It does not concern the event roles, only the key phrase. The detection scores table shows you the model's performance in identifying each event. You can expand the sample breakdown section to compare the annotations and model predictions for each sample in the training set.
Remember, the samples in the evaluation set are the samples used to calculate the model's evaluation metrics. If you have a very small evaluation set, a handful of errors can shift the scores dramatically, so you may see misleading values.
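To see why a small evaluation set can mislead, consider how much a single wrong prediction moves precision. This is a minimal sketch with made-up counts, not output from Adaptation Studio.

```python
def precision(correct, found):
    """Fraction of the answers found that were correct."""
    return correct / found


# Hypothetical small evaluation set: the model finds 4 events.
small_swing = precision(4, 4) - precision(3, 4)

# Hypothetical larger evaluation set: the model finds 100 events.
large_swing = precision(100, 100) - precision(99, 100)

# One extra mistake costs far more in the small set than in the large one.
print(f"small set: {small_swing:.2f}, large set: {large_swing:.2f}")
```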
Event Extraction refers to the model's ability to identify all the roles in an event, including the key phrase. The extraction scores (overall) table shows you the model's general ability to identify roles for each event. The extraction scores (event breakdown) tables go one step further by showing the model's performance in identifying each individual role. You can expand the sample breakdown section to compare the annotations and model predictions for each sample in the training set.
There are two other rows in the detection and extraction tables which merit some additional explanation: model macro, and model weighted. Both of these provide a holistic, rather than event-specific, picture of the model's performance. The key difference is how much each event is weighted when the average is calculated.
Model Macro refers to the unweighted averages of the per-event precision and recall scores. The macro F1 is the harmonic mean of the macro precision and macro recall scores. All event types contribute equally to the average, regardless of how often they occur.
Model Weighted refers to the model's overall precision and recall scores, where the most frequent event types will have the greatest impact on the average. Less frequent event types will impact the score less. Weighted metrics assume that event types with fewer samples are less important.
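The difference between the two rows can be sketched as follows. The event names, scores, and sample counts are hypothetical, and weighting by the number of annotated instances of each event type is an assumption about how "most frequent" is measured; the product's exact calculation may differ.

```python
# Hypothetical per-event scores; "support" is the assumed number of
# annotated instances of each event type in the evaluation set.
events = {
    "hire":   {"precision": 0.9, "recall": 0.8, "support": 90},
    "resign": {"precision": 0.5, "recall": 0.4, "support": 10},
}


def macro_average(scores, metric):
    """Unweighted mean: every event type counts equally."""
    return sum(s[metric] for s in scores.values()) / len(scores)


def weighted_average(scores, metric):
    """Mean weighted by support: frequent event types count more."""
    total = sum(s["support"] for s in scores.values())
    return sum(s[metric] * s["support"] for s in scores.values()) / total


macro_p = macro_average(events, "precision")
macro_r = macro_average(events, "recall")
# Macro F1 as described above: harmonic mean of macro precision and recall.
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)

weighted_p = weighted_average(events, "precision")
weighted_r = weighted_average(events, "recall")

print(f"macro:    P={macro_p:.2f} R={macro_r:.2f} F1={macro_f1:.2f}")
print(f"weighted: P={weighted_p:.2f} R={weighted_r:.2f}")
```

In this example the rare "resign" event pulls the macro scores down sharply but barely affects the weighted scores, which is why comparing the two rows can reveal weak performance on infrequent event types.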
The Inter-Annotator Agreement section provides more detail on the consistency of annotation on projects with multiple annotators. Scores range from -1.0 to 1.0, with 1.0 indicating perfect agreement between annotators. A reliability score of 0.80 or greater is generally considered sufficient, though higher is better. Lower scores may indicate potential issues with your data, your annotation guidelines, or your annotators’ understanding of the task. Since this tutorial only has one annotator (you), this section will be unavailable in your current project.
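This page does not say which agreement coefficient Adaptation Studio reports. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected agreement score for two annotators that also ranges from -1.0 to 1.0. The labels and the function name are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty items.")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators on the same ten samples.
annotator_a = ["hire", "hire", "resign", "none", "hire",
               "none", "resign", "hire", "none", "hire"]
annotator_b = ["hire", "hire", "resign", "none", "none",
               "none", "resign", "hire", "hire", "hire"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa={kappa:.2f}")
```

Here the annotators agree on 8 of 10 samples, yet the chance-corrected score falls below the 0.80 threshold, which shows why raw percentage agreement alone can overstate consistency.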
Improving Model Performance
Remember that the success of any language model depends on the quality and quantity of the data provided. When using Adaptation Studio, if a model's metrics are unsatisfactory, start by exploring the following questions:
Are there sufficient samples in the project?
Do enough samples include valid examples of the event?
Are the samples annotated well?
Is annotation generally consistent between annotators?