Once you've gathered and annotated your data, the next step is to train and test your classification model using the Rosette Classification Field Training Kit (FTK). This section describes how to install the FTK and use the command line tools it provides. To integrate the models into a real application, you must configure Rosette Enterprise to use the new models with the Rosette Enterprise runtime API.
To build a machine-learned model from labeled training data, you need a directory of training files, arranged by category: the directories under the root directory are the category labels, and the files under each category are the training examples. These files should be UTF-8-encoded plain text files containing "clean" data. For best results, the input to the classifier at run time should be cleaned in the same way as the training data.
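As a concrete illustration, such a layout can be generated with a short script. This is only a sketch: the category names ("sports", "finance") and example texts are hypothetical; the FTK requires only that each subdirectory of the root be a category label holding UTF-8 plain text files.

```python
import tempfile
from pathlib import Path

# Root of the training data; each subdirectory name is a category label.
root = Path(tempfile.mkdtemp()) / "train_data"
examples = {
    "sports": "The home team won the match in overtime.",
    "finance": "Shares fell sharply after the earnings report.",
}
for category, text in examples.items():
    cat_dir = root / category          # directory name = category label
    cat_dir.mkdir(parents=True, exist_ok=True)
    # Each file under the category directory is one training example.
    (cat_dir / "doc0.txt").write_text(text, encoding="utf-8")

print(sorted(p.name for p in root.iterdir()))
```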
The FTK provides four command line tools:
Train is the only command needed to build models and run cross-validation.
TCatCLI is useful for quick testing of a model,
MeasurementCLI is useful if you prepare your own train/test splits, and
Evaluate can measure your model’s performance on an annotated, gold-standard data set.
The Train command is the only command needed to build models. It can also run cross-validation, which provides a quick and simple way to validate a model once it has been created.
Train has the following four subcommands:
create: Prepare the data files and create the model directory.
append: Add more training data. The model must be retrained after adding the data.
train: Train the model.
xval: Train the model and run cross-validation from a single dataset, without partitioning the data into training and evaluation sets. This subcommand is a sanity check that your dataset is prepared and your model is configured correctly, though it will tend to overestimate the performance of the model. It helps you determine whether you have enough correct data to train the model. It does not save the model.
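The cross-validation that xval performs can be illustrated with a minimal k-fold split: the labeled examples are divided into n folds, and each fold serves once as the held-out evaluation set while the rest are used for training. This is a sketch of the general technique, not the FTK's actual implementation.

```python
def k_fold_splits(items, n_folds):
    """Yield (train, test) partitions for simple k-fold cross-validation."""
    folds = [items[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

docs = [f"doc{i}" for i in range(10)]
for train, test in k_fold_splits(docs, 5):
    print(len(train), len(test))   # each fold: 8 train, 2 test
```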
The following parameters are used by the Train subcommands:
lang is the 3-letter ISO 639-3 language code for the model language.
config is the name of the config file containing the hyperparameters for your model.
dataRootDir is the directory containing the training data files.
modelDir is the name of the trained model directory.
n-folds is an integer indicating the number of folds to be used in cross-validation.
usage: Train [options]... create lang config dataRootDir modelDir
Train [options]... append dataRootDir modelDir
Train [options]... train modelDir
Train [options]... xval n-folds modelDir
-c,--cost <arg> one or more (comma separated) cost params
(C); default 0.01 - for train and xval. Only
xval supports multiple cost values.
--negationWords <arg> path to negation word list (create only)
--negativeLexicon <arg> path to negative lexicon (create only)
--positiveLexicon <arg> path to positive lexicon (create only)
--stopwords <arg> path to stopwords list (create only)
--train train model after adding examples
TCatCLI for Quickly Testing Models
TCatCLI runs the classifier that you trained with the Train command. It is a command-line version of the categories endpoint, allowing you to exercise the trained model before integrating it into Rosette Enterprise.
usage: TCatCLI [options] file [...]
--adm print out results in ADM JSON format
-ct,--confidenceThreshold <arg> show confidences above threshold
-et,--elbowThresholding use elbow thresholding (invalidates
--explanationSet <arg> show top N positive features
-f,--format <arg> (line|file|file-list), default=file
--featureMatrixCategories <arg> show matrix view of top N categories
-m,--model <arg> model directory
--maxResults <arg> show top N categories, default 1
-o,--output <arg> output file or directory. Output
directory required for -f file-list
-st,--scoreThreshold <arg> show scores above threshold
-v,--verbose verbose results
The command requires the -m (model) and -o (output) options, in addition to the input documents. The documents can be a line of text, a file, or a file-list; the default is file. If running with a line of text or a file-list, the -f (format) option is required.
TCatCLI should be run on documents not seen at training time.
When providing a file list to -f, the file list must contain one input file path per line. Additionally, your output argument (-o) must be a directory and not a single file, as the command will create one output file per input file in the list.
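The file-list format and the one-output-per-input behavior described above can be sketched as follows. All paths, file names, and the placeholder output contents here are hypothetical; only the one-path-per-line layout and the one-output-file-per-input rule come from the documentation.

```python
import tempfile
from pathlib import Path

work = Path(tempfile.mkdtemp())

# Create two small input documents.
inputs = []
for name in ("a.txt", "b.txt"):
    p = work / name
    p.write_text("example document text", encoding="utf-8")
    inputs.append(p)

# A file list contains one input file path per line.
file_list = work / "inputs.lst"
file_list.write_text("\n".join(str(p) for p in inputs) + "\n", encoding="utf-8")

# The output argument must be a directory; one output file is produced
# per input file (mimicked here with placeholder results).
out_dir = work / "results"
out_dir.mkdir()
for line in file_list.read_text(encoding="utf-8").splitlines():
    src = Path(line)
    (out_dir / (src.name + ".json")).write_text("{}", encoding="utf-8")

print(sorted(p.name for p in out_dir.iterdir()))
```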
By default, TCatCLI returns a single category. The maxResults, scoreThreshold, and confidenceThreshold options allow you to evaluate the multilabel performance of your model alongside the default single-label behavior. To return a set number of categories, simply set the maxResults option to a value greater than 1. Alternatively, set a threshold on a result's raw score or confidence score (or both) with the scoreThreshold and confidenceThreshold options, respectively. Note that if maxResults is set alongside either threshold option, it acts as a cap on the number of returned results, not an exact count: TCatCLI will return up to maxResults results as long as their scores exceed the specified threshold value.
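The interaction between maxResults and a threshold can be illustrated with a small sketch. The category names and scores below are hypothetical, and the function is an illustration of the selection logic, not TCatCLI itself.

```python
def select_categories(scored, max_results=None, threshold=None):
    """Filter (category, score) pairs by threshold, then cap at max_results."""
    ranked = sorted(scored, key=lambda cs: cs[1], reverse=True)
    if threshold is not None:
        ranked = [(c, s) for c, s in ranked if s > threshold]
    if max_results is not None:
        ranked = ranked[:max_results]   # cap, not an exact count
    return ranked

scores = [("sports", 0.62), ("finance", 0.35), ("politics", 0.10)]
# politics falls below the threshold, so only two results are returned
# even though max_results would allow two anyway.
print(select_categories(scores, max_results=2, threshold=0.2))
```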
MeasurementCLI for Measuring Performance
MeasurementCLI generates the same statistics as Train xval, but instead of using cross-validation, it compares two lists of data: actual versus predicted. The first list contains the gold-standard, annotated category each item belongs to; the second contains the trained model's predicted category for each item. Each list contains one category per line. MeasurementCLI is useful if you prepare your own train/test splits, as described in Evaluating on a Train/Test Partition.
usage: MeasurementCLI [options]... actual predicted
-m,--mode <arg> (stats|matrix|list), default=stats, matrix supports up
to 26 categories
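Conceptually, the stats mode reduces to per-category precision and recall computed from the paired actual/predicted labels. The sketch below illustrates that computation with hypothetical labels; it is not the tool itself, and MeasurementCLI reports additional statistics.

```python
from collections import Counter

def per_category_stats(actual, predicted):
    """Per-category precision and recall from paired label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for a, p in zip(actual, predicted):
        if a == p:
            tp[a] += 1          # correct prediction for category a
        else:
            fp[p] += 1          # p was predicted but a was the gold label
            fn[a] += 1          # a was missed
    cats = set(actual) | set(predicted)
    return {
        c: {
            "precision": tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0,
            "recall": tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0,
        }
        for c in cats
    }

actual = ["sports", "sports", "finance", "finance"]
predicted = ["sports", "finance", "finance", "finance"]
print(per_category_stats(actual, predicted))
```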
If you have a gold set of annotated documents in ADM format, you can score your model’s performance using Evaluate. By default, this evaluation calculates per-category and overall (micro and macro) precision (P), recall (R), and F1 scores based on the model's performance in a single-label context. By specifying the --multilabel flag, the evaluation can also be performed in a multilabel context. In this case, the same P/R/F1 metrics are calculated, along with two additional metrics: Hamming loss and subset loss. Note that by default, the multilabel evaluation uses a raw score threshold of -0.25; only categories with a score greater than this threshold are returned. If you find that this value is suboptimal for your dataset and not all expected categories are being returned, it can be adjusted using the --score-threshold option. Additionally, the --include-categories and --exclude-categories options can be used to include or exclude specific categories from the evaluation.
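Hamming loss and subset loss over gold and predicted label sets can be sketched as below. The label sets are hypothetical, and this is an illustration of the standard definitions of these metrics, not Evaluate's implementation.

```python
def hamming_loss(gold, pred, n_labels):
    """Fraction of label slots that disagree, averaged over all documents."""
    total = sum(len(g ^ p) for g, p in zip(gold, pred))  # symmetric difference
    return total / (len(gold) * n_labels)

def subset_loss(gold, pred):
    """Fraction of documents whose predicted label set is not an exact match."""
    return sum(g != p for g, p in zip(gold, pred)) / len(gold)

gold = [{"sports"}, {"finance", "politics"}]
pred = [{"sports"}, {"finance"}]
print(hamming_loss(gold, pred, n_labels=3))  # one slot wrong out of six
print(subset_loss(gold, pred))               # one of two docs mismatched
```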
usage: Evaluate modelDir dataDir [options]
-et,--elbow-thresholding use elbow thresholding (only applicable
in multilabel mode)
-i,--include-categories <arg> categories to include in evaluation
-m,--multilabel multilabel evaluation
-mr,--max-results <arg> max number of results to return
 -s,--score-threshold <arg>     raw score threshold (only applicable in
                                multilabel mode)
-t,--tokenize Ignore gold tokens