How does tokenization of European languages work (RBL-JE)?

See also: How does tokenization of European languages work (RLP)?

By default, RBL-JE uses language-independent tokenization for European languages. For example, English "in front of" is parsed as three tokens, ["in", "front", "of"], and French "donne-le-lui" is parsed as five tokens, ["donne", "-", "le", "-", "lui"]. This default is preferred for certain downstream applications, such as named entity extraction and search.
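
To make the token boundaries described above concrete, here is a minimal sketch in plain Java. It does not call RBL-JE and is not how RBL-JE is implemented. The class name TokenSketch and the regular expression are assumptions made purely for illustration. It treats each run of letters or digits as one token and each punctuation character as its own token, which happens to reproduce the two example outputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: a stand-in for language-independent
// tokenization, NOT the RBL-JE tokenizer.
public class TokenSketch {
    // Either a run of letters/digits, or a single character that is
    // neither a letter/digit nor whitespace (for example, a hyphen).
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\p{L}\\p{N}\\s]");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("in front of"));   // [in, front, of]
        System.out.println(tokenize("donne-le-lui"));  // [donne, -, le, -, lui]
    }
}
```

Because a rule like this knows nothing about any particular language, it cannot recognize "in front of" as a single preposition, which is the kind of decision language-specific tokenization is meant to make.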

RBL-JE also provides language-specific tokenization tailored to each European language. With this option, English "in front of" is parsed as a single token, identified as a preposition, and French "donne-le-lui" is parsed into two tokens, ["donne-", "-le-lui"], each containing an intermediate "-". This option is recommended for downstream applications that perform linguistic analysis. To enable it, set the fstTokenize option on TokenizerFactory, e.g.:

TokenizerFactory factory = new TokenizerFactory();
factory.setOption(TokenizerOption.fstTokenize, "true");
Tokenizer tokenizer = factory.create();