Rosette identifies and separates each word into one or more tokens through advanced statistical modeling. A token is an atomic element such as a word, number, possessive affix, or punctuation.
The resulting token output minimizes index size, enhances search accuracy, and improves relevance.
We offer two algorithms for Chinese and Japanese tokenization and morphological analysis. Prior to August 2018, the default algorithm was a perceptron. To return to that algorithm, add this to the body of your call:
"options": {"modelType": "perceptron"}
If you'd like to try out Rosette's tokenization capabilities, sign up for a free trial of Rosette.
The tokens REST endpoint is:
https://api.rosette.com/rest/v1/tokens
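A minimal call to the endpoint might look like the following Python sketch, which assumes the API key is sent in an X-RosetteAPI-Key header and that the response JSON carries the token list in a tokens field; verify both against your Rosette credentials and the current response schema.

import requests

API_KEY = "your-rosette-api-key"  # placeholder; substitute your own key

response = requests.post(
    "https://api.rosette.com/rest/v1/tokens",
    headers={"X-RosetteAPI-Key": API_KEY, "Content-Type": "application/json"},
    json={"content": "東京都に住んでいます。"},
)
response.raise_for_status()
print(response.json().get("tokens"))  # assumed field holding the returned tokens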
More Info: