By default, RBL-JE uses language-independent tokenization for European languages. For example, English "in front of" is parsed as three tokens, and French "donne-le-lui" is parsed as five tokens, ["donne", "-", "le", "-", "lui"]. This option is preferred for certain downstream applications, such as named entity extraction and search.
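To make the default token boundaries above concrete, here is a minimal, self-contained sketch that reproduces the two examples in plain Java. It is not RBL-JE code and does not use the RBL-JE API: the class `SimpleTokenizerSketch` and its `tokenize` method are hypothetical names invented for illustration. It assumes a very simple rule (split on whitespace, emit every other non-alphanumeric character as its own token), which is only a rough approximation of language-independent tokenization.

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleTokenizerSketch {
    // Hypothetical helper, NOT part of RBL-JE. Runs of letters and digits
    // become one token, whitespace separates tokens, and any other
    // character (such as "-") is emitted as a token of its own.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
                continue;
            }
            // A non-alphanumeric character ends the current word, if any.
            if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            // Whitespace is dropped; punctuation is kept as its own token.
            if (!Character.isWhitespace(cp)) {
                tokens.add(new String(Character.toChars(cp)));
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("in front of"));   // prints [in, front, of]
        System.out.println(tokenize("donne-le-lui"));  // prints [donne, -, le, -, lui]
    }
}
```

Because a rule like this needs no lexicon or grammar for any particular language, the same boundaries apply to English and French alike, which is the property that downstream components such as search indexing tend to rely on.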
RBL also provides an option for language-specific tokenization tailored to each European language. For instance, English "in front of" is parsed as a single token, identified as a preposition. French "donne-le-lui" is parsed into two tokens, ["donne-", "-le-lui"], each containing an intermediate "-". This option is recommended for downstream applications doing linguistic analysis.
The option is set using the fstTokenize option for TokenizerFactory, for example:
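TokenizerFactory factory = new TokenizerFactory();
// Turn on language-specific tokenization (the default is language-independent).
factory.setOption(TokenizerOption.fstTokenize, "true");
// Tokenizers created from this factory use the option set above.
Tokenizer tokenizer = factory.create();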