How does tokenization of European languages work (RLP)?

See also: How does tokenization of European languages work (RBL-JE)?

In native RLP, the European Language Analyzer[2] (ELA) processor tokenizes text in a language-specific manner designed to facilitate linguistic analysis. For instance, the English phrase "in front of" is parsed as a single token and identified as a preposition, while the French "sur-le-champs" is parsed into two tokens, ["sur-", "-le-champs"], each containing a hyphen.

Some applications, such as named entity extraction and search, may work better with language-independent tokenization, in which "in front of" is parsed as three tokens and "sur-le-champs" as five: ["sur", "-", "le", "-", "champs"]. To get this behavior, reorder the processors in your context file so that Word Breaker comes before European Language Analyzer[2].
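To make the language-independent output concrete, here is a minimal Python sketch. It is not RLP code and does not reproduce Word Breaker's actual rules; it assumes a simple rule (split on whitespace, and emit each punctuation mark as its own token) that happens to yield the token lists described above. The function name simple_tokenize is hypothetical.

```python
import re

# Illustration only: a rough stand-in for language-independent tokenization.
# Assumption: runs of word characters are tokens, and every non-space
# punctuation character is a token of its own. This is NOT how RLP's
# Word Breaker is implemented.
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Return the tokens of `text` without any language-specific knowledge."""
    return _TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(simple_tokenize("in front of"))
    print(simple_tokenize("sur-le-champs"))
```

A language-specific analyzer such as ELA would instead consult lexical knowledge, which is why it can keep "in front of" together as one preposition.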

You can read more about this in the "European Language Tokenizers" section of the Rosette Linguistics Platform Application Developer’s Guide.
