Once you've gathered and annotated your data, the next step is to train and test your classification model using the Rosette Classification Field Training Kit (FTK). This section describes how to install the FTK and use the command line tools it provides. To integrate the models into a real application, you must configure Rosette Enterprise to use the new models with the Rosette Enterprise runtime API.
To build a machine-learned model from labeled training data, you need a directory of training files, arranged by category: the directories under the root directory are the category labels, and the files under each category are the training examples. These files should be UTF-8-encoded plain text files containing "clean" data. For best results, the input to the classifier at run time should be cleaned in the same way as the training data.
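As a concrete illustration, such a layout can be generated with a short script. This is only a sketch: the category names ("sports", "finance") and example texts are hypothetical; the FTK requires only that each subdirectory of the root be a category label holding UTF-8 plain text files.

```python
import tempfile
from pathlib import Path

# Root of the training data; each subdirectory name is a category label.
root = Path(tempfile.mkdtemp()) / "train_data"
examples = {
    "sports": "The home team won the match in overtime.",
    "finance": "Shares fell sharply after the earnings report.",
}
for category, text in examples.items():
    cat_dir = root / category          # directory name = category label
    cat_dir.mkdir(parents=True, exist_ok=True)
    # Each file under the category directory is one training example.
    (cat_dir / "doc0.txt").write_text(text, encoding="utf-8")

print(sorted(p.name for p in root.iterdir()))
```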
The FTK provides four command line tools:
Train is the only command needed to build models and run cross-validation.
TCatCLI is useful for quick testing of a model,
MeasurementCLI is useful if you prepare your own train/test splits, and
Evaluate can measure your model’s performance on an annotated, gold-standard data set.
The Train command is the only command needed to build models. It can also run cross-validation, which provides a quick and simple way to validate a model once it has been created.
Train has the following four subcommands:
create: Prepare the data files and create the model directory.
append: Add more training data. The model must be retrained after adding the data.
train: Train the model.
xval: Train the model and run cross-validation from a single dataset, without partitioning the data into training and evaluation sets. This subcommand is a sanity check that your dataset is prepared and your model is configured correctly, though it will tend to overestimate the performance of the model. It helps you determine whether you have enough correct data to train the model. It does not save the model.
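The cross-validation that xval performs can be illustrated with a minimal k-fold split: the labeled examples are divided into n folds, and each fold serves once as the held-out evaluation set while the rest are used for training. This is a sketch of the general technique, not the FTK's actual implementation.

```python
def k_fold_splits(items, n_folds):
    """Yield (train, test) partitions for simple k-fold cross-validation."""
    folds = [items[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

docs = [f"doc{i}" for i in range(10)]
for train, test in k_fold_splits(docs, 5):
    print(len(train), len(test))   # each fold: 8 train, 2 test
```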
The following parameters are used by the Train subcommands:
lang is the 3-letter ISO 639-3 language code for the model language.
config is the name of the config file containing the hyperparameters for your model.
dataRootDir is the directory containing the training data files.
modelDir is the name of the trained model directory.
n-folds is an integer indicating the number of folds to be used in cross-validation.
usage: Train [options]... create lang config dataRootDir modelDir
Train [options]... append dataRootDir modelDir
Train [options]... train modelDir
Train [options]... xval n-folds modelDir
-c,--cost <arg> one or more (comma separated) cost params
(C); default 0.01 - for train and xval. Only
xval supports multiple cost values.
--negationWords <arg> path to negation word list (create only)
--negativeLexicon <arg> path to negative lexicon (create only)
--positiveLexicon <arg> path to positive lexicon (create only)
--stopwords <arg> path to stopwords list (create only)
--train train model after adding examples
TCatCLI for Quickly Testing Models
TCatCLI runs the classifier that you trained with the Train command. It is a command-line version of the categories endpoint, allowing you to exercise the trained model before integrating it into Rosette Enterprise.
usage: TCatCLI [options] file [...]
--adm print out results in ADM JSON format
-ct,--confidenceThreshold <arg> show confidences above threshold
-et,--elbowThresholding use elbow thresholding (invalidates
--explanationSet <arg> show top N positive features
-f,--format <arg> (line|file|file-list), default=file
--featureMatrixCategories <arg> show matrix view of top N categories
-m,--model <arg> model directory
--maxResults <arg> show top N categories, default 1
-o,--output <arg> output file or directory. Output
directory required for -f file-list
-st,--scoreThreshold <arg> show scores above threshold
-v,--verbose verbose results
The command requires the -m (model) and -o (output) options, in addition to the input documents. The documents can be a line of text, a file, or a file-list; the default is file. If running with a line of text or a file-list, the -f (format) option is required.
TCatCLI should be run on documents not seen at training time.
When providing a file list to -f, the file list must contain one input file path per line. Additionally, your output argument (-o) must be a directory and not a single file, as the command will create one output file per input file in the list.
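The file-list format and the one-output-per-input behavior described above can be sketched as follows. All paths, file names, and the placeholder output contents here are hypothetical; only the one-path-per-line layout and the one-output-file-per-input rule come from the documentation.

```python
import tempfile
from pathlib import Path

work = Path(tempfile.mkdtemp())

# Create two small input documents.
inputs = []
for name in ("a.txt", "b.txt"):
    p = work / name
    p.write_text("example document text", encoding="utf-8")
    inputs.append(p)

# A file list contains one input file path per line.
file_list = work / "inputs.lst"
file_list.write_text("\n".join(str(p) for p in inputs) + "\n", encoding="utf-8")

# The output argument must be a directory; one output file is produced
# per input file (mimicked here with placeholder results).
out_dir = work / "results"
out_dir.mkdir()
for line in file_list.read_text(encoding="utf-8").splitlines():
    src = Path(line)
    (out_dir / (src.name + ".json")).write_text("{}", encoding="utf-8")

print(sorted(p.name for p in out_dir.iterdir()))
```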
By default, TCatCLI returns a single category. The maxResults, scoreThreshold, and confidenceThreshold options allow you to evaluate the multilabel performance of your model alongside the default single-label behavior. To return a set number of categories, simply set the maxResults option to a value greater than 1. Alternatively, set a threshold on a result's raw score or confidence score (or both) with the scoreThreshold and confidenceThreshold options, respectively. Note that if maxResults is set alongside either threshold option, it acts as a cap on the number of returned results, not an exact count: TCatCLI will return up to maxResults results as long as their scores exceed the specified threshold value.
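The interaction between maxResults and a threshold can be illustrated with a small sketch. The category names and scores below are hypothetical, and the function is an illustration of the selection logic, not TCatCLI itself.

```python
def select_categories(scored, max_results=None, threshold=None):
    """Filter (category, score) pairs by threshold, then cap at max_results."""
    ranked = sorted(scored, key=lambda cs: cs[1], reverse=True)
    if threshold is not None:
        ranked = [(c, s) for c, s in ranked if s > threshold]
    if max_results is not None:
        ranked = ranked[:max_results]   # cap, not an exact count
    return ranked

scores = [("sports", 0.62), ("finance", 0.35), ("politics", 0.10)]
# politics falls below the threshold, so only two results are returned
# even though max_results would allow two anyway.
print(select_categories(scores, max_results=2, threshold=0.2))
```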
MeasurementCLI for Measuring Performance
MeasurementCLI generates the same statistics as Train xval, but instead of using cross-validation, it compares two lists of data: actual versus predicted. The first list contains the gold-standard, annotated category each item belongs to; the second contains the trained model's predicted category for each item. Each list contains one category per line. MeasurementCLI is useful if you prepare your own train/test splits, as described in Evaluating on a Train/Test Partition.
usage: MeasurementCLI [options]... actual predicted
-m,--mode <arg> (stats|matrix|list), default=stats, matrix supports up
to 26 categories
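Conceptually, the stats mode reduces to per-category precision and recall computed from the paired actual/predicted labels. The sketch below illustrates that computation with hypothetical labels; it is not the tool itself, and MeasurementCLI reports additional statistics.

```python
from collections import Counter

def per_category_stats(actual, predicted):
    """Per-category precision and recall from paired label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for a, p in zip(actual, predicted):
        if a == p:
            tp[a] += 1          # correct prediction for category a
        else:
            fp[p] += 1          # p was predicted but a was the gold label
            fn[a] += 1          # a was missed
    cats = set(actual) | set(predicted)
    return {
        c: {
            "precision": tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0,
            "recall": tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0,
        }
        for c in cats
    }

actual = ["sports", "sports", "finance", "finance"]
predicted = ["sports", "finance", "finance", "finance"]
print(per_category_stats(actual, predicted))
```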
If you have a gold set of annotated documents in ADM format, you can score your model’s performance using Evaluate. By default, this evaluation calculates per-category and overall (micro and macro) precision (P), recall (R), and F1 scores based on the model's performance in a single-label context. By specifying the --multilabel flag, the evaluation can also be performed in a multilabel context. In this case, the same P/R/F1 metrics are calculated, along with two additional metrics: Hamming loss and subset loss. Note that by default, the multilabel evaluation uses a raw score threshold of -0.25; only categories with a score greater than this threshold are returned. If you find that this value is suboptimal for your dataset and not all expected categories are being returned, it can be adjusted using the --score-threshold option. Additionally, the --include-categories and --exclude-categories options can be used to include or exclude specific categories from the evaluation.
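Hamming loss and subset loss over gold and predicted label sets can be sketched as below. The label sets are hypothetical, and this is an illustration of the standard definitions of these metrics, not Evaluate's implementation.

```python
def hamming_loss(gold, pred, n_labels):
    """Fraction of label slots that disagree, averaged over all documents."""
    total = sum(len(g ^ p) for g, p in zip(gold, pred))  # symmetric difference
    return total / (len(gold) * n_labels)

def subset_loss(gold, pred):
    """Fraction of documents whose predicted label set is not an exact match."""
    return sum(g != p for g, p in zip(gold, pred)) / len(gold)

gold = [{"sports"}, {"finance", "politics"}]
pred = [{"sports"}, {"finance"}]
print(hamming_loss(gold, pred, n_labels=3))  # one slot wrong out of six
print(subset_loss(gold, pred))               # one of two docs mismatched
```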
usage: Evaluate modelDir dataDir [options]
-et,--elbow-thresholding use elbow thresholding (only applicable
in multilabel mode)
-i,--include-categories <arg> categories to include in evaluation
-m,--multilabel multilabel evaluation
-mr,--max-results <arg> max number of results to return
 -s,--score-threshold <arg>     raw score threshold (only applicable in
                                multilabel mode)
-t,--tokenize Ignore gold tokens