The config.yaml file provided to bin/Train create determines the features with which your model will be trained. These settings are hyperparameters: they are set in the configuration file before training, and can be modified in subsequent training runs until they are optimal, as measured against your development dataset. Parameters are values that the model learns from your training data. Hyperparameters, instead of being learned from data, are predefined before training begins.
The config.yaml file has three main components: features, featureSelectors, and tokenFilters.
features lists the specific features to be used at training time and by the classifier at run-time. General categorization models will use one or both of the features TOKEN_UNIGRAM and TOKEN_BIGRAM, although there are a few additional features available to test.
featureSelectors has only one available member, FREQUENCY_THRESHOLD, which filters out features that occur fewer than five times in the training data. We recommend leaving this selector active, as it generally reduces a model's training time and memory footprint without sacrificing accuracy.
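The effect of a frequency threshold can be illustrated with a minimal Python sketch. This is not the FTK's implementation: the function name apply_frequency_threshold and the min_count parameter are hypothetical, and the sketch assumes the threshold applies to the total number of occurrences across the training data rather than to the number of documents containing the feature.

```python
from collections import Counter

def apply_frequency_threshold(documents, min_count=5):
    """Drop features that occur fewer than min_count times in the corpus.

    documents is a list of feature lists, one per training document.
    Assumption: occurrences are counted in total across all documents.
    """
    counts = Counter(feature for doc in documents for feature in doc)
    kept = {feature for feature, count in counts.items() if count >= min_count}
    return [[feature for feature in doc if feature in kept] for doc in documents]

# Five identical documents plus one containing a rare feature.
docs = [["cheap", "flights", "to", "paris"]] * 5 + [["rare", "cheap"]]
filtered = apply_frequency_threshold(docs)

# "rare" occurs only once, so it is removed from the last document.
print(filtered[-1])
```

Rare features contribute little evidence to the model, which is why removing them tends to shrink the model without hurting accuracy.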
tokenFilters controls which tokens are removed from the input before feature generation. There are two options: STOP_WORD, which filters based on a specified list of stop words, and ASCII_NUMERIC, which filters out numeric tokens.
Stop words are words that are blacklisted by the classifier so that they are ignored at training time and at prediction time. In English, it is often useful to ignore certain high-frequency but uninformative words such as "a", "an", "the", "but", "or", "of", "to", "if", etc. Certain classes of words, including determiners, prepositions, particles, and conjunctions, are generally not informative for document classification, so they are good candidates for stop words. Intuitively, these are good stop words because there is no reason to believe that their presence or absence in a document will tell you much, if anything, about the content of the document. In fact, most English-language documents of any significant length are likely to include these words. You can use the default list of stop words provided by the FTK (classifier-field-training-kit-<version>/etc/stopwords-default.txt), or create your own. Stop word lists are one example of a hyperparameter because they are predefined before training your classifier, and the effect of varying the stop word list on classifier accuracy can be measured at evaluation time.
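As a rough illustration of what the two token filters do, the following Python sketch removes stop words and numeric tokens from a token list. It is a sketch under stated assumptions, not the FTK's code: STOP_WORDS is a small hypothetical list taken from the examples above (not the contents of the default stop words file), filter_tokens is a made-up helper, and "numeric" is assumed to mean a token consisting only of ASCII digits.

```python
# Hypothetical stop word list for illustration only.
STOP_WORDS = {"a", "an", "the", "but", "or", "of", "to", "if"}

def filter_tokens(tokens, stop_words=STOP_WORDS, drop_numeric=True):
    """Return tokens with stop words and all-digit ASCII tokens removed."""
    kept = []
    for token in tokens:
        # Stop word matching is assumed to be case-insensitive.
        if token.lower() in stop_words:
            continue
        # Assumption: a numeric token is one made up entirely of ASCII digits.
        if drop_numeric and token.isascii() and token.isdigit():
            continue
        kept.append(token)
    return kept

tokens = "The price of the ticket rose to 250 dollars".split()
print(filter_tokens(tokens))  # ['price', 'ticket', 'rose', 'dollars']
```

Because filtering happens before feature generation, a removed token never appears in any unigram or bigram feature.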
Feature selection is the process of selecting functions that will extract information from the data that will be informative for classification. An example of a feature in the sample configuration is the TERM_UNIGRAM feature. This feature simply takes the set of words that occur in a document, individually. In NLP, this is commonly referred to as a "bag-of-words": the sequence in which the words occur in the document is ignored. While a "bag-of-words" is sometimes sufficient for a classifier to perform well, more complex features are sometimes required.
The TERM_BIGRAM feature takes pairs of words in sequence from the document, but ignores the overall structure of the document. Unigrams, bigrams, and so on are instances of the general term "n-gram", which is a subsequence of n words extracted from a longer sequence. For example, for the following sentence, the table lists the unigrams and bigrams:
Budgerigars are popular pets around the world due to their small size, low cost and ability to mimic human speech.
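The unigrams and bigrams of this sentence can also be generated with a short Python sketch. The tokenize function here is a simplifying assumption (lowercasing and splitting on non-alphanumeric characters); the FTK's own tokenizer may split text differently.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (simplified)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = ("Budgerigars are popular pets around the world due to their "
            "small size, low cost and ability to mimic human speech.")
tokens = tokenize(sentence)

unigrams = ngrams(tokens, 1)  # ('budgerigars',), ('are',), ('popular',), ...
bigrams = ngrams(tokens, 2)   # ('budgerigars', 'are'), ('are', 'popular'), ...

# A bag-of-words discards order and keeps only which words occur.
bag_of_words = set(tokens)

print(bigrams[:3])
```

A sequence of k tokens yields k unigrams and k - 1 bigrams, so bigram features capture local word order at the cost of a larger and sparser feature space.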
Selecting appropriate features for your task involves some combination of linguistic intuition and experimentation. For this reason, feature selection is sometimes described as an art.
TOKEN_UNIGRAM - each individual token is used as a feature.
TOKEN_BIGRAM - each pair of adjacent tokens is used as a feature.
TF_IDF_TOKEN_UNIGRAM - all tokens are used as features, with weights representing their TF-IDF (term frequency-inverse document frequency) score. IDF scores are computed from the training corpus.
NEGATION_BIGRAM - all token pairs where the first token is a "negation word" are used as features. Custom negation words can be provided as an argument to bin/Train create. This feature is for English only, and negation words should NOT also be included in the stop words list.
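The last two features in this list can be sketched in Python as follows. Both functions are illustrative assumptions rather than the FTK's implementation: tf_idf_weights uses one common TF-IDF formulation (relative term frequency multiplied by the natural log of the inverse document frequency), which may differ from the weighting the FTK applies, and NEGATION_WORDS is a hypothetical list, not the FTK's default.

```python
import math
from collections import Counter

def tf_idf_weights(doc_tokens, training_docs):
    """Weight each token of a document by TF-IDF.

    IDF is computed from training_docs, a list of token lists.
    Tokens never seen in the training corpus are skipped.
    """
    n_docs = len(training_docs)
    doc_freq = Counter(tok for doc in training_docs for tok in set(doc))
    term_freq = Counter(doc_tokens)
    return {
        tok: (count / len(doc_tokens)) * math.log(n_docs / doc_freq[tok])
        for tok, count in term_freq.items()
        if doc_freq[tok] > 0
    }

# Hypothetical negation word list for illustration only.
NEGATION_WORDS = {"not", "no", "never"}

def negation_bigrams(tokens, negation_words=NEGATION_WORDS):
    """Return adjacent token pairs whose first token is a negation word."""
    return [(a, b) for a, b in zip(tokens, tokens[1:]) if a in negation_words]

training = [
    ["the", "food", "was", "good"],
    ["the", "food", "was", "not", "good"],
    ["the", "service", "was", "slow"],
]
weights = tf_idf_weights(training[1], training)

# "the" appears in every training document, so its weight is zero,
# while the rarer "not" receives the highest weight in this document.
print(weights["the"], weights["not"])
print(negation_bigrams(training[1]))  # [('not', 'good')]
```

The second function also shows why negation words should stay out of the stop words list: if "not" were filtered out before feature generation, the pair ("not", "good") could never be produced.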