While these specific terms may have broader definitions in other contexts, within the scope of this document, we will use the following definitions.
Written prose (we are only concerned with natural language data)
A discrete, cohesive unit of data (e.g. a tweet, a letter, or a news article); this is intentionally loosely defined
A collection (literally, a "body") of documents
A class or division of documents
The name of a particular category
A set of categories or labels, and possibly the relationships between categories; in this document, ‘model’ is used in this more specific sense of a ‘data model’
A model exhibiting a hierarchy or tree-like structure branching from most general to most specific
Natural language processing: a set of technologies for analyzing text written by people
The process of assigning labels to documents
The process of annotating by hand (by humans)
A system that assigns labels to documents
The process of automatically annotating documents using a classifier (by machine)
A representation of the true positives, false positives, and false negatives for a given evaluation run
A representation of the false positives and false negatives for a given evaluation run
A quick and simple way to validate a model once it's been created, even if you have a small amount of data
Cases where the classifier's prediction correctly matches a label
Cases where the classifier predicts a label that does not actually apply
Cases where the classifier fails to predict a label that actually applies
The proportion of the positive identifications which are correct
The proportion of actual positives that were correctly identified
The harmonic mean of precision and recall. An F1 score reaches its best value at 1 and its worst at 0.
A validation methodology where the original sample is randomly partitioned into n equal-sized subsamples. Of the n subsamples, a single subsample is retained as the validation data for testing the model, and the remaining n − 1 subsamples are used as training data. The process is typically repeated n times, so that each subsample serves as the validation data exactly once.
The data used to train the classifier
The data used for testing, as opposed to training, the classifier. It comprises the development set and the evaluation set.
A subset of data held out from training, but used during the tuning process to improve the classifier
A subset of data used to evaluate the quality of the classifier. This subset should never be inspected or reviewed.
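As an illustration of the evaluation definitions above, the following is a minimal Python sketch, not a reference implementation. It assumes a single label, with the gold (hand-annotated) and predicted decisions given as booleans. The function names `precision_recall_f1` and `n_fold_splits` are hypothetical and introduced here only for illustration. When n does not divide the sample size evenly, the folds differ in size by at most one item.

```python
import random
from typing import List, Sequence, Tuple


def precision_recall_f1(gold: Sequence[bool],
                        predicted: Sequence[bool]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for one label (hypothetical helper)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # true positives
    fp = sum(1 for g, p in pairs if not g and p)    # false positives
    fn = sum(1 for g, p in pairs if g and not p)    # false negatives

    # Return 0.0 when a denominator is zero (an assumption; other
    # conventions exist for these undefined cases).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0  # harmonic mean
    return precision, recall, f1


def n_fold_splits(items: Sequence, n: int,
                  seed: int = 0) -> List[Tuple[list, list]]:
    """Randomly partition items into n folds and return one
    (training, validation) pair per fold (hypothetical helper)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal items round-robin so fold sizes differ by at most one.
    folds = [shuffled[i::n] for i in range(n)]
    splits = []
    for i in range(n):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((training, validation))
    return splits
```

For example, with gold decisions `[True, True, True, False, False]` and predictions `[True, True, False, True, False]` there are two true positives, one false positive, and one false negative, so precision, recall, and F1 each come out to 2/3. Calling `n_fold_splits(range(10), 5)` yields five splits, each with eight training items and two validation items.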