alternativeTokenization
|
Directs the tokenizer to use the Chinese Language Analyzer (CLA) or Japanese Language Analyzer (JLA). |
Boolean
(false)
|
Chinese, Japanese |
caseSensitive
|
Indicates whether tokenizers produced by the factory are case sensitive. If false, they ignore case distinctions. |
Boolean
(true)
|
Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish |
defaultTokenizationLanguage
|
Specifies the language to use for script regions other than the script of the overall language. |
Language code
(xxx)
|
Chinese, Japanese, Thai |
fstTokenize
|
Turns on FST tokenization. |
Boolean
(false)
|
Czech, Dutch, English, French, German, Greek, Hungarian, Italian, Polish, Portuguese, Romanian, Russian, Spanish |
minNonPrimaryScriptRegionLength
|
Minimum length of a run of consecutive characters that are not in the primary script. If a non-primary script region is shorter than this length and is adjacent to a primary script region, it is appended to the primary script region. |
Integer
(10)
|
Chinese, Japanese, Thai |
tokenizeForScript
|
Indicates whether to use a different word-breaker for each script. If false, uses the script-specific breaker for the primary script and the default breaker for other scripts. |
Boolean
(false)
|
Chinese, Japanese, Thai |
nfkcNormalize
|
Turns on Unicode NFKC normalization before tokenization.
fstTokenize must be false.
For Japanese and Chinese, alternativeTokenization must also be false.
|
Boolean
(false)
|
All except Hebrew |
query
|
Indicates that the input consists of queries, which are likely to be incomplete sentences. If true, tokenizers may change their behavior. |
Boolean
(false)
|
All |
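
The nfkcNormalize row carries constraints on other options, and these are easy to overlook when several options are set together. The following is a minimal sketch of a pre-flight check for those constraints. It assumes a hypothetical `TokenizerOptions` container and a hypothetical `nfkc_conflicts` helper, neither of which is part of the product's API. Only the option names, defaults, and rules are taken from the table.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TokenizerOptions:
    """Hypothetical container; field names and defaults mirror the table rows."""
    alternativeTokenization: bool = False
    fstTokenize: bool = False
    nfkcNormalize: bool = False


def nfkc_conflicts(language: str, opts: TokenizerOptions) -> List[str]:
    """Return the table's nfkcNormalize rules that this configuration violates."""
    problems: List[str] = []
    if not opts.nfkcNormalize:
        # The rules below only apply when NFKC normalization is turned on.
        return problems
    if language == "Hebrew":
        # The table lists nfkcNormalize as supported for all languages except Hebrew.
        problems.append("nfkcNormalize is not supported for Hebrew")
    if opts.fstTokenize:
        problems.append("nfkcNormalize requires fstTokenize to be false")
    if language in ("Chinese", "Japanese") and opts.alternativeTokenization:
        problems.append(
            "nfkcNormalize requires alternativeTokenization to be false "
            "for Chinese and Japanese"
        )
    return problems


if __name__ == "__main__":
    opts = TokenizerOptions(nfkcNormalize=True, alternativeTokenization=True)
    for message in nfkc_conflicts("Japanese", opts):
        print(message)
```

Returning a list of messages instead of raising on the first violation lets a caller report every conflicting setting in one pass.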