For Chinese and Japanese, in addition to the statistical model described above, RBL includes the Chinese Language Analyzer (CLA) and Japanese Language Analyzer (JLA) modules, which are optimized for search. They are activated by setting `tokenizerType` to `SPACELESS_LEXICAL`.
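For example, the following sketch creates a JLA tokenizer through the RBL Java API. It is an illustration only: the root directory path is a placeholder, and the import paths and the exact `createTokenizer` signature are assumptions that should be checked against your RBL version's Javadoc.

```java
import java.io.StringReader;

import com.basistech.rosette.bl.BaseLinguisticsFactory;
import com.basistech.rosette.bl.BaseLinguisticsOption;
import com.basistech.rosette.bl.Tokenizer;

public class JlaExample {
    public static void main(String[] args) throws Exception {
        BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
        // Placeholder path to the RBL root directory.
        factory.setOption(BaseLinguisticsOption.rootDirectory, "/path/to/rbl-root");
        factory.setOption(BaseLinguisticsOption.language, "jpn");
        // Selecting SPACELESS_LEXICAL activates JLA for Japanese (CLA for Chinese).
        factory.setOption(BaseLinguisticsOption.tokenizerType, "SPACELESS_LEXICAL");

        // Assumed overload (a Reader plus a document name); consult the
        // Javadoc for the createTokenizer signatures in your version.
        Tokenizer tokenizer = factory.createTokenizer(
                new StringReader("東京都に住んでいます"), "doc1");
    }
}
```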
Table 12. Chinese and Japanese Lexical Options

| Option | Description | Default value | Supported languages |
|---|---|---|---|
| `breakAtAlphaNumIntraWordPunct` | Indicates whether to consider punctuation between alphanumeric characters as a break. Has no effect when `consistentLatinSegmentation` is `true`. | `false` | Chinese |
| `consistentLatinSegmentation` | Indicates whether to provide consistent segmentation of embedded text not in the primary script. If `false`, the setting of `segmentNonJapanese` is ignored. | `true` | Chinese, Japanese |
| `decomposeCompounds` | Indicates whether to decompose compounds. | `true` | Chinese, Japanese |
| `deepCompoundDecomposition` | Indicates whether to recursively decompose each token marked in the dictionary as decomposable into smaller tokens. Has no effect when `decomposeCompounds` is `false`. | `false` | Chinese, Japanese |
| `favorUserDictionary` | Indicates whether to favor words in the user dictionary during segmentation. | `false` | Chinese, Japanese |
| `ignoreSeparators` | Indicates whether to ignore whitespace separators when segmenting input text. If `false`, whitespace separators are treated as morpheme delimiters. Has no effect when `whitespaceTokenization` is `true`. | `true` | Japanese |
| `ignoreStopwords` | Indicates whether to filter stop words out of the output. | `false` | Chinese, Japanese |
| `joinKatakanaNextToMiddleDot` | Indicates whether to join sequences of Katakana tokens adjacent to a middle dot token. | `true` | Japanese |
| `minLengthForScriptChange` | Sets the minimum length of non-native text to be considered for a script change. A script change indicates a boundary between tokens, so the length may influence how a mixed-script string is tokenized. Has no effect when `consistentLatinSegmentation` is `false`. | `10` | Chinese, Japanese |
| `pos` | Indicates whether to add parts of speech to morphological analyses. | `true` | Chinese, Japanese |
| `segmentNonJapanese` | Indicates whether to segment each run of numbers or Latin letters into its own token, without splitting on medial number/word joiners. Has no effect when `consistentLatinSegmentation` is `true`. | `true` | Japanese |
| `separateNumbersFromCounters` | Indicates whether to return numbers and counters as separate tokens. | `true` | Japanese |
| `separatePlaceNameFromSuffix` | Indicates whether to segment place names from their suffixes. | `true` | Japanese |
| `whiteSpaceIsNumberSep` | Indicates whether to treat whitespace as a number separator. Has no effect when `consistentLatinSegmentation` is `true`. | `true` | Chinese |
| `whitespaceTokenization` | Indicates whether to treat whitespace as a morpheme delimiter. | `false` | Chinese, Japanese |
Enum Classes:
- BaseLinguisticsOption
- TokenizerOption
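As a sketch of how the options in Table 12 might be applied, reusing the factory from the earlier example: the document states these options live in the enum classes above, so constants such as `BaseLinguisticsOption.ignoreStopwords` are assumed to match the table's option names, and passing values as strings is likewise an assumption.

```java
// Chinese lexical tokenizer with stop word filtering and deep
// compound decomposition enabled (values passed as strings).
factory.setOption(BaseLinguisticsOption.language, "zho");
factory.setOption(BaseLinguisticsOption.tokenizerType, "SPACELESS_LEXICAL");
factory.setOption(BaseLinguisticsOption.ignoreStopwords, "true");
factory.setOption(BaseLinguisticsOption.deepCompoundDecomposition, "true");
// Per Table 12, deepCompoundDecomposition has no effect unless
// decomposeCompounds is true; it defaults to true, so nothing to set here.
```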
Chinese and Japanese Readings
Table 13. Chinese and Japanese Readings

| Option | Description | Default value | Supported languages |
|---|---|---|---|
| `generateAll` | Indicates whether to return all the readings for a token. For characters with multiple readings, all the readings are returned in brackets and separated by semicolons. Has no effect when `readings` is `false`. | `false` | Chinese |
| `readingByCharacter` | Indicates whether to skip directly to the fallback behavior of `readings` without considering readings for whole words. Has no effect when `readings` is `false`. | `false` | Chinese, Japanese |
| `readings` | Indicates whether to add readings to morphological analyses. The annotator will try to add readings by whole words. If it cannot, it will concatenate the readings of individual characters. | `false` | Chinese, Japanese |
| `readingsSeparateSyllables` | Indicates whether to add a separator character between readings when concatenating readings by character. Has no effect when `readings` is `false`. | `false` | Chinese, Japanese |
| `readingType` | Sets the representation of Chinese readings. Possible values (case-insensitive): `cjktex` (macros for the CJKTeX pinyin.sty style); `no_tones` (pinyin without tones); `tone_marks` (pinyin with diacritics over the appropriate vowels); `tone_numbers` (pinyin with a number from 1 to 4 suffixed to each syllable, or no number for neutral tone). | `tone_marks` | Chinese |
| `useVForUDiaeresis` | Indicates whether to use 'v' instead of 'ü' in pinyin readings, a common substitution in environments that lack diacritics. The value is ignored when `readingType` is `cjktex` or `tone_marks`, which always use 'v' and 'ü' respectively. It is probably most useful when `readingType` is `tone_numbers`. Has no effect when `readings` is `false`. | `false` | Chinese |
|
Enum Classes:
- BaseLinguisticsOption
- TokenizerOption
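For instance, to request pinyin readings with tone numbers for Chinese, the Table 13 options could be set as follows. This is again a sketch against the assumed factory API above, with option values passed as strings.

```java
// Add readings to Chinese analyses, rendered as pinyin with tone numbers.
factory.setOption(BaseLinguisticsOption.language, "zho");
factory.setOption(BaseLinguisticsOption.readings, "true");
factory.setOption(BaseLinguisticsOption.readingType, "tone_numbers");
// Substitute 'v' for 'ü'; per Table 13 this is ignored for the
// cjktex and tone_marks reading types.
factory.setOption(BaseLinguisticsOption.useVForUDiaeresis, "true");
```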
Editing the stop words list
The `ignoreStopwords` option uses a stop words list to define stop words. The path to the stop words list is language-dependent: Chinese uses `root/dicts/zho/cla/zh_stop.utf8` and Japanese uses `root/dicts/jpn/jla/JP_stop.utf8`.

You can add stop words to these files. When you edit one of these files, you must follow these rules (a sample edit is sketched below):
- The file must be encoded in UTF-8.
- The file may include blank lines.
- Comment lines begin with `#`.
- Each non-blank, non-comment line represents exactly one lexeme (stop word).
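A hypothetical fragment of an edited zh_stop.utf8 illustrating these rules; the entries below are example additions, not the shipped contents of the file:

```
# Custom additions (this file must remain UTF-8 encoded)
的
了

# The blank line above is allowed; every other non-comment line is one stop word.
```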