The tokenizer is a language-specific processor that evaluates documents and identifies tokens. RBL supports tokenization and sentence boundary detection for all languages. For many languages, you can choose the tokenizer by setting tokenizerType.
Table 1. Tokenizer Types

| TokenizerType | Description | Supported Languages |
|---|---|---|
| ICU | Uses the ICU tokenizer | All, except Chinese and Japanese |
| FST | Uses the FST tokenizer | Czech, Dutch, English, French, German, Greek, Hungarian, Italian, Polish, Portuguese, Romanian, Russian, Spanish |
| SPACELESS_LEXICAL | Uses a lexicon and rules to tokenize input without spaces. Uses the Chinese Language Analyzer (CLA) or Japanese Language Analyzer (JLA). | Chinese, Japanese |
| SPACELESS_STATISTICAL | Uses a statistical approach to tokenize input without spaces. | Chinese, Japanese, Korean, Thai |
| DEFAULT | Selects the default tokenizer for each language: SPACELESS_STATISTICAL for Chinese, Japanese, and Thai; ICU for all other languages. | All |
Note
When creating Tokenizers and Analyzers, the tokenizerType must be the same for both: set TokenizerOption#tokenizerType and AnalyzerOption#tokenizerType to the same value.
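For example, a minimal sketch using options maps (TokenizerOption and AnalyzerOption are the enum classes named above; the Maps helper and string-valued options follow the factory example later in this section):
EnumMap<TokenizerOption, String> tokenizerOptions = Maps.newEnumMap(TokenizerOption.class);
tokenizerOptions.put(TokenizerOption.tokenizerType, "FST");
EnumMap<AnalyzerOption, String> analyzerOptions = Maps.newEnumMap(AnalyzerOption.class);
// The analyzer's tokenizerType must match the tokenizer's.
analyzerOptions.put(AnalyzerOption.tokenizerType, "FST");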
For most languages, the default tokenizer is the ICU tokenizer. It implements standard Unicode guidelines for determining boundaries between sentences and for breaking each sentence into individual tokens. Many languages have an alternate tokenizer, the FST tokenizer, enabled by setting tokenizerType to FST. The FST tokenizer produces somewhat different sentence and token boundaries. For example, the FST tokenizer keeps hyphenated tokens together, while the ICU tokenizer breaks them into separate tokens. For applications that don't want tokens or lemmas that contain spaces, the ICU tokenizer provides the best accuracy. To determine which tokenizer is best for your use case, we recommend running each against a test dataset and reviewing the output.
For Chinese, Japanese, and Thai, the default tokenizer determines sentence boundaries and then uses statistical models to segment each sentence into individual tokens. If fragments in Latin script, or another script that is not Chinese, Japanese, or Thai, are embedded in the text and exceed a certain length (defined by minNonPrimaryScriptRegionLength), the tokenizer applies default Unicode tokenization to those fragments. If a non-primary script region is shorter than this length and adjacent to a primary script region, it is appended to the primary script region.
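For example, when tokenizing Japanese text with embedded English runs, the relevant options can be set as follows (a sketch; the threshold value is illustrative):
EnumMap<BaseLinguisticsOption, String> options = Maps.newEnumMap(BaseLinguisticsOption.class);
options.put(BaseLinguisticsOption.language, "jpn"); // primary language: Japanese
options.put(BaseLinguisticsOption.minNonPrimaryScriptRegionLength, "8"); // illustrative; the default is 10
options.put(BaseLinguisticsOption.defaultTokenizationLanguage, "eng"); // tokenize embedded Latin-script regions as English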
To use the Chinese Language Analyzer (CLA) or Japanese Language Analyzer (JLA) tokenization algorithm, set tokenizerType to SPACELESS_LEXICAL. This disables post-tokenization analysis; an analyzer created with this option will leave its input tokens unchanged.
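For instance, a sketch selecting the JLA algorithm for Japanese, following the same option pattern as the other examples in this section:
options.put(BaseLinguisticsOption.language, "jpn");
options.put(BaseLinguisticsOption.tokenizerType, "SPACELESS_LEXICAL"); // JLA for Japanese, CLA for Chinese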
For all languages, the RBL tokenizer can apply Normalization Form KC (NFKC), as specified in Unicode Standard Annex #15, to normalize the tokens. This normalization includes normalizing fullwidth numerals to halfwidth numerals, fullwidth Latin letters to halfwidth Latin letters, and halfwidth Katakana characters to fullwidth Katakana characters. NFKC normalization is turned off by default. To turn it on, set the nfkcNormalize option with a tokenizerType of ICU. To apply NFKC for Chinese and Japanese, tokenizerType must be SPACELESS_STATISTICAL or DEFAULT.
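For example, a sketch enabling NFKC normalization for English (for Chinese or Japanese, use SPACELESS_STATISTICAL or DEFAULT instead of ICU, as noted above):
options.put(BaseLinguisticsOption.language, "eng");
options.put(BaseLinguisticsOption.nfkcNormalize, "true");
options.put(BaseLinguisticsOption.tokenizerType, "ICU");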
Table 2. General Tokenizer Options

| Option | Description | Type (Default) | Supported Languages |
|---|---|---|---|
| caseSensitive | Indicates whether tokenizers produced by the factory are case sensitive. If false, they ignore case distinctions. | Boolean (true) | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Malay (Standard), Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Tagalog |
| defaultTokenizationLanguage | Specifies the language to use for script regions in a script other than that of the overall language. | Language code (xxx) | Chinese, Japanese, Thai |
| minNonPrimaryScriptRegionLength | Minimum length of a sequence of characters that are not in the primary script. If a non-primary script region is shorter than this length and adjacent to a primary script region, it is appended to the primary script region. | Integer (10) | Chinese, Japanese, Thai |
| nfkcNormalize | Turns on Unicode NFKC normalization before tokenization. tokenizerType must not be FST or SPACELESS_LEXICAL. | Boolean (false) | All |
| query | Indicates the input will be queries, likely incomplete sentences. If true, tokenizers may change their behavior. | Boolean (false) | All |
| tokenizeForScript | Indicates whether to use a different word-breaker for each script. If false, uses the script-specific breaker for the primary script and the default breaker for other scripts. | Boolean (false) | Chinese, Japanese, Thai |
| tokenizerType | Selects the tokenizer to use. | TokenizerType (SPACELESS_STATISTICAL for Chinese, Japanese, Thai; ICU for all other languages) | All |
Enum Classes: BaseLinguisticsOption, TokenizerOption
A document may contain tables and lists in addition to regular sentences. Structured text is composed of fragments, such as list items, table cells, and short lines of text. The tokenizer emits sentence offsets for each fragment it encounters.
One way fragments are identified is by detecting fragment delimiters. A delimiter is restricted to one character; the default delimiters are U+0009 (tab), U+000B (vertical tab), and U+000C (form feed). To modify the set of recognized delimiters, pass a string containing all desired delimiter values to the fragmentBoundaryDelimiters option. The string must include any default values you want to keep. Non-whitespace delimiters within a token are ignored.
The following rules determine where fragments are identified, in descending priority:
- Each line in a list is a fragment, where a list is defined as 3 or more lines containing the same punctuation mark within the first 5 characters of the line.
- A delimiter or three or more consecutive whitespace characters breaks a line into fragments.
- A short line is a fragment if it is preceded by another short line, is preceded by a fragment, or is the first line of text. The length of a short line is configurable with the maxTokensForShortLine option; the default is 6 or fewer tokens.
Fragments always include trailing whitespace.
Example:
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
factory.setOption(BaseLinguisticsOption.language, "eng");
EnumMap<BaseLinguisticsOption, String> options = Maps.newEnumMap(BaseLinguisticsOption.class);
options.put(BaseLinguisticsOption.fragmentBoundaryDelimiters, "|~");
options.put(BaseLinguisticsOption.maxTokensForShortLine, "5");
factory.createSingleLanguageAnnotator(options);
By default, fragment detection is enabled. Use the fragmentBoundaryDetection option to disable it.
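For example, adding one line to the options map from the example above turns it off:
options.put(BaseLinguisticsOption.fragmentBoundaryDetection, "false");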
Table 3. Structured Text Options

| Option | Description | Type (Default) | Supported Languages |
|---|---|---|---|
| fragmentBoundaryDetection | Turns on fragment boundary detection. | Boolean (true) | All |
| fragmentBoundaryDelimiters | Specifies the fragment boundary delimiters. | String ("\u0009\u000B\u000C") | All |
| maxTokensForShortLine | The maximum length, in tokens, of a short line. | Integer (6) | All |
Enum Classes: BaseLinguisticsOption, TokenizerOption
Social Media Tokens: Emoji & Emoticons, Hashtags, @Mentions, Email Addresses, URLs
RBL supports POS-tagging of emoji, emoticons, @mentions, email addresses, hashtags, and URLs in all supported languages.
Tokenization of emoji is always enabled. The other token types are disabled by default but can be enabled with the options listed below. When tokenization of a type is disabled, its characters may be split into multiple tokens.
Table 4. Social Media Token Options

| Option | Description | Default | Supported Languages |
|---|---|---|---|
| n/a | Enables emoji tokenization | true | All |
| emoticons | Enables emoticon tokenization | false | All |
| atMentions | Enables @mention tokenization | false | All |
| hashtags | Enables hashtag tokenization | false | All |
| emailAddresses | Enables email address tokenization | false | All |
| urls | Enables URL tokenization | false | All |
Enum Classes: BaseLinguisticsOption, TokenizerOption
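For example, a sketch enabling several of these token types, following the same options-map pattern as the earlier examples:
options.put(BaseLinguisticsOption.hashtags, "true");
options.put(BaseLinguisticsOption.atMentions, "true");
options.put(BaseLinguisticsOption.emailAddresses, "true");
options.put(BaseLinguisticsOption.urls, "true");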
The tokenization results when an option is disabled depend on the language and tokenizerType options. The samples provided here are for language=eng and tokenizerType=ICU.
Emoji & Emoticon Recognition
Emoji are defined by Unicode Technical Standard #51. In tokenizing emoji, RBL recognizes the emoji presentation selector (VS16; U+FE0F) and text presentation selector (VS15; U+FE0E), which indicate if the preceding character should be treated as emoji or text.
Although RBL detects sideways, Western-style emoticons, it does not currently support Japanese-style emoticons, called kaomoji, such as (o^ ^o).
Emoji Normalization & Lemmatization
RBL normalizes emoji, placing the result into the lemma
field. The simplest example is when an emoji presentation selector follows a character that is already an emoji. In this case, RBL will simply remove the emoji presentation selector.
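For instance, U+1F44D (thumbs up) is presented as emoji by default, so the sequence U+1F44D U+FE0F normalizes to the lemma U+1F44D alone.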
Lemmatization applies to an emoji character in multiple ways:
- Emoji that depict people or body parts may be followed by an emoji modifier indicating skin tone. Lemmatization removes the skin-tone modifier from the emoji character, on the reasoning that the skin tone is of secondary importance to the meaning of the emoji.
- Emoji depicting people may be followed by an emoji component indicating hair color or style. Lemmatization removes the hair component from the emoji character.
- Where a gender symbol has been added to create a gendered occupation emoji, lemmatization removes the gender symbol (see the example below).
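As an illustration (a reconstructed example, not RBL output): Woman Police Officer is the ZWJ sequence U+1F46E (police officer) + U+200D (ZWJ) + U+2640 (female sign) + U+FE0F (VS16). Lemmatizing this gendered occupation emoji removes the gender symbol, leaving the ungendered U+1F46E.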
Finally, RBL can normalize non-fully-qualified emoji ZWJ sequences to fully-qualified emoji ZWJ sequences. In the above example, it is possible to omit the VS16 (though discouraged by Unicode): since Police Officer is an emoji, anything joined to it by a ZWJ is implicitly an emoji too. RBL adds the missing VS16.
Customizing the ICU Tokenizer
The ICU tokenizer is the default tokenizer for European languages. Its behavior is defined by a rule file. If the default behavior is not exactly what you want, RBL lets you supply a custom rule file that determines the behavior of the tokenizer. How to make these customizations is briefly outlined here. Be careful with any changes you make to the tokenizer behavior; Basis does not support customizations made by the user.
BaseLinguisticsFactory and TokenizerFactory both have an addCustomTokenizerRules method that can be used to specify a custom rule file. RBLCmd also has the -ctr option to specify a path on the command line. All of these methods accept a case-sensitivity value (for -ctr, cs and ci mean case-sensitive and case-insensitive); this matters because a rule file is selected only when BaseLinguisticsOption.caseSensitive matches the rule file's case-sensitivity value. Custom rule files are not cumulative, i.e. only one set of rules may be used at a time for any one combination of case sensitivity and language.
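A sketch of the call (the exact signature is not shown here and may differ in your RBL version, so treat the parameter order and types as assumptions):
// Hypothetical signature: language, case-sensitivity flag, rules source.
try (Reader rules = Files.newBufferedReader(Paths.get("/path/to/custom-rules.rbx"))) {
    factory.addCustomTokenizerRules(LanguageCode.ENGLISH, true, rules);
}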
Note
Basis reserves the right to change the version of ICU used in RBL. Thus any rule file provided by Basis for a particular version of RBL may or may not work with newer versions.
Tokenization Rule File Format
A tokenization rule file is an ICU break rule file encoded in UTF-8. A custom file replaces Basis's tokenization rules, so a custom rule file should include all the rules for basic tokenization as well as the new custom rules. The default rule files that RBL uses can be obtained by contacting Basis support, or you can copy the rule file from ICU.
RBL also provides the ability to pass in a subrule file if desired. This is for splitting tokens produced according to rules in the main file. The subrule file is a list of subrules, each of which is a number and a regex separated by a tab character. This number corresponds to the “rule status” of the main rule whose tokens the subrule splits. Each capturing group in the subrule regex corresponds to a token that will be produced by the tokenizer.
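For example, a subrule file might contain a single line like this (a hypothetical subrule; 200 must match a rule status assigned in your main rule file, and a tab separates the status from the regex):
200	([^-]+)(-)(.+)
This would split a token produced by a status-200 rule at its first hyphen into three tokens: the text before the hyphen, the hyphen itself, and the remainder.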
The rule file and the subrule file can be placed anywhere. In particular, they need not be placed anywhere within your RBL installation directory.
There is one Basis-specific extension, !!btinclude <filename>. This command tells the preprocessor to replace the !!btinclude line with the contents of the specified file. Relative paths are relative to the location of the file containing the !!btinclude line. Recursive inclusion is allowed.
The ICU tokenizer does not normally treat emoticons as single tokens, but perhaps that is important to your use case. You could make a copy of the default rule file and add the following:
...
$Smiley = [\:=][)}\]];
!!forward;
$Smiley;
...
For the input =), the Basis default rules produce two tokens, = and ). With the custom rule above, you get back a single token, =).
Unknown Language Tokenization
RBL provides basic tokenization support when the language is Unknown (xxx). The tokenizer uses generic rules to tokenize, such as whitespace and punctuation delimitation. Using the language code xxx provides basic tokenization support for languages not otherwise supported by RBL.
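A minimal sketch, following the factory pattern shown earlier:
factory.setOption(BaseLinguisticsOption.language, "xxx"); // generic whitespace and punctuation tokenization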