If you have sufficient training data, you can use the techniques described earlier to build and evaluate a classifier. If you don't have any data, you might consider a keyword-based classifier, which is quick and easy to create: you select your own keywords, and there is no data to collect, clean, and annotate. Currently, a keyword-based classifier can only be trained for English using the FTK.
First, create a directory to hold the keyword model, and copy in the "semantic index":
$ mkdir keyword_model
$ cp -r data/esa/index keyword_model
Next, create a taxonomy file with your keywords and phrases. We ship a sample one based on the IAB QAG Taxonomy:
$ head -20 etc/iab_taxonomy.txt
"Arts & Entertainment"
"Books & Literature"
Note that keyphrases are enclosed in quotes, while keywords are not. Without the quotes, each token in a keyphrase is considered individually, as if it appeared on its own line. Because of this, you will often want overlapping keyphrases and keywords; e.g., include both "Auto Parts" and Auto for AUTOMOTIVE.
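To make the quoting rule concrete, here is a hypothetical taxonomy entry for an AUTOMOTIVE category (the layout shown here is an assumption for illustration; use the shipped etc/iab_taxonomy.txt as the authoritative reference for the real format):
AUTOMOTIVE
"Auto Parts"
Auto
The quoted "Auto Parts" contributes to the category as a single phrase, while the unquoted Auto contributes as an individual token.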
Next, train your model by building a concept vector for each category. Below, we choose the best 1000 concepts per category:
$ bin/ConceptVectorsBuilder 1000 keyword_model/index \
This command turns your keywords into a list of Wikipedia pages related to your categories. We treat each Wikipedia page as a separate concept. You can look at the resulting concepts file to make sure things make sense:
$ grep AUTOMOTIVE keyword_model/concepts | head
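The exact columns of the concepts file are not reproduced here, but the idea is that each line ties a category to a Wikipedia page (the concept) and a weight. A hypothetical excerpt might resemble:
AUTOMOTIVE    Ford Motor Company    0.0312
AUTOMOTIVE    Automobile            0.0287
The page titles and weights above are invented for illustration; inspect your own generated file for the real values.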
You can fine-tune your classifier by removing concepts that don't look useful or adding ones you feel are missing, either by adjusting your keyword list and retraining, or by editing the generated concepts file directly.
Now you are ready to classify new documents:
$ bin/TCatCLI -m keyword_model -o /dev/stdout \
<(echo "the red sox won at fenway") \
<(echo "i drive a ford focus") | cut -f1 2>/dev/null
How the Keyword-Based Classifier Works
The keyword-based classifier works by pivoting through Wikipedia. We have processed Wikipedia into a local artifact, so no internet connection is needed. Our copy has been pruned to remove uninteresting pages; we call the result our semantic index.
Consider each page in this index to be a concept. We represent a document as a concept vector. A concept vector is a mathematical representation of the meaning (concept) of a document. Documents about similar concepts will have vectors that are closer together.
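This section does not pin down the distance measure, but for sparse concept vectors cosine similarity is the standard choice and is a reasonable mental model:
sim(u, v) = (u . v) / (||u|| ||v||)
Two documents about the same topic share many high-scoring concepts, so the dot product u . v is large relative to the vector lengths and sim(u, v) approaches 1; unrelated documents share few concepts and score near 0.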
For example, the sentence "The Red Sox play at Fenway Park" might have a vector that looks like this:
$ bin/SemanticInterpreterCLI --max-concepts 10000 eng keyword_model/index \
You can imagine this vector as having one slot for every page in Wikipedia, but most slots will hold fairly low scores. For example, Roland Winters (an actor) is ranked 1000 on this list and is only vaguely related to the Red Sox: he was born in Boston. Item 10000 on this list is Yu (wind instrument), with a score of 4.427081, and it's not clear how it's related to the Red Sox at all.
When you train a category using keywords, you are combining weighted concept vectors. The overall vector for your category should end up having clearly related concepts at the top.
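As a sketch of what this combination step does (this is illustrative Python, not the actual ConceptVectorsBuilder; interpret() below is a hypothetical stand-in for the semantic index lookup):

def interpret(keyword):
    # Hypothetical stand-in: the real lookup queries the local Wikipedia
    # index and returns a sparse concept vector {wikipedia_page: score}.
    return {}

def build_category_vector(keywords, max_concepts=1000):
    # Sum the concept vectors of every keyword and keyphrase in the
    # category (a plain sum here; the real combination may weight
    # keywords differently), then keep only the best-scoring concepts
    # (1000 in the ConceptVectorsBuilder example above).
    combined = {}
    for keyword in keywords:
        for page, score in interpret(keyword).items():
            combined[page] = combined.get(page, 0.0) + score
    top = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return dict(top[:max_concepts])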
At runtime, the input document is also turned into a concept vector by querying our local copy of Wikipedia. This input concept vector is then compared against the concept vector of each category your classifier has been trained on, and we choose the category whose vector is closest.
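A minimal sketch of the runtime comparison, assuming cosine similarity as the closeness measure (the product's actual metric is not specified here, and the vectors below are invented for illustration):

import math

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(score * v.get(page, 0.0) for page, score in u.items())
    norm_u = math.sqrt(sum(s * s for s in u.values()))
    norm_v = math.sqrt(sum(s * s for s in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def classify(doc_vector, category_vectors):
    # Pick the trained category whose concept vector is closest to the
    # document's concept vector.
    return max(category_vectors,
               key=lambda cat: cosine(doc_vector, category_vectors[cat]))

# Hypothetical vectors, for illustration only:
doc = {"Fenway Park": 9.1, "Boston Red Sox": 8.7}
categories = {
    "SPORTS": {"Boston Red Sox": 5.0, "Fenway Park": 4.2},
    "AUTOMOTIVE": {"Ford Motor Company": 5.1},
}
print(classify(doc, categories))  # -> SPORTS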
Because this works by pivoting through Wikipedia, keyword-based classifiers will not work well for categories without clear Wikipedia-based concepts. For example, a sentiment model would not be a good fit. Fine-grained distinctions, such as iPhone 6 vs. iPhone 7, are also likely to be a poor fit.