The following options are described in more detail in Initial and Path Options. If the rootDirectory option is specified, then the string ${rootDirectory} takes that value in the dictionaryDirectory, modelDirectory, and licensePath options.
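For illustration, the following minimal Java sketch shows how this substitution yields the default paths. The class and helper method are illustrative, not part of the RBL API, and /opt/rbl is a hypothetical install location.

import java.util.LinkedHashMap;
import java.util.Map;

public final class RootDirectoryDefaults {

    // Expands the ${rootDirectory} placeholder in a path option value.
    static String expand(String value, String rootDirectory) {
        return value.replace("${rootDirectory}", rootDirectory);
    }

    public static void main(String[] args) {
        String root = "/opt/rbl"; // hypothetical install location
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("dictionaryDirectory", expand("${rootDirectory}/dicts", root));
        defaults.put("modelDirectory", expand("${rootDirectory}/models", root));
        defaults.put("licensePath", expand("${rootDirectory}/licenses/rlp-license.xml", root));
        // Prints, for example: dictionaryDirectory = /opt/rbl/dicts
        defaults.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}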
Table 31. Initial and Path Options

Option | Description | Type (Default) | Supported Languages
dictionaryDirectory | The path of the lemma and compound dictionary, if it exists. | Path (${rootDirectory}/dicts) | All
language | The language to be processed by analyzers or tokenizers created by the factory. | Language code | All
licensePath | The path of the RBL license file. | Path (${rootDirectory}/licenses/rlp-license.xml) | All
licenseString | The XML license content; overrides licensePath. | String | All
modelDirectory | The directory containing the model files. | Path (${rootDirectory}/models) | All
rootDirectory | Sets the root directory. Also sets default values for other required options (dictionaryDirectory, licensePath, and modelDirectory). | Path | All
The following options are described in more detail in Tokenizers.
Table 32. General Tokenizer Options

Option | Description | Type (Default) | Supported Languages
tokenizerType | Selects the tokenizer to use. | TokenizerType (SPACELESS_STATISTICAL for Chinese, Japanese, and Thai; ICU for all other languages) | All
caseSensitive | Indicates whether tokenizers produced by the factory are case sensitive. If false, they ignore case distinctions. | Boolean (true) | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Malay (Standard), Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Tagalog, Ukrainian
defaultTokenizationLanguage | Specifies the language to use for script regions in a script other than that of the overall language. | Language code (xxx) | Chinese, Japanese, Thai
minNonPrimaryScriptRegionLength | The minimum length of a run of sequential characters that are not in the primary script. If a non-primary script region is shorter than this length and adjacent to a primary script region, it is appended to the primary script region. | Integer (10) | Chinese, Japanese, Thai
tokenizeForScript | Indicates whether to use a different word-breaker for each script. If false, uses the script-specific breaker for the primary script and the default breaker for other scripts. | Boolean (false) | Chinese, Japanese, Thai
nfkcNormalize | Turns on Unicode NFKC normalization before tokenization. tokenizerType must not be FST or SPACELESS_LEXICAL. | Boolean (false) | All
query | Indicates that the input will be queries, likely incomplete sentences. If true, tokenizers may change their behavior. | Boolean (false) | All
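As an illustration of minNonPrimaryScriptRegionLength, the following simplified Java sketch merges a short non-primary-script run into the preceding primary-script region. It is our approximation of the rule described above, not the RBL implementation.

import java.util.ArrayList;
import java.util.List;

public final class ScriptRegionMerging {

    static final class Region {
        final StringBuilder text;
        final boolean primaryScript;
        Region(String text, boolean primaryScript) {
            this.text = new StringBuilder(text);
            this.primaryScript = primaryScript;
        }
    }

    // A non-primary region shorter than minLength that is adjacent to a
    // primary region is appended to that region rather than kept separate.
    // (Simplified: only merges into the preceding region.)
    static List<Region> merge(List<Region> regions, int minLength) {
        List<Region> out = new ArrayList<>();
        for (Region r : regions) {
            Region last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (!r.primaryScript && r.text.length() < minLength
                    && last != null && last.primaryScript) {
                last.text.append(r.text); // absorb the short foreign run
            } else {
                out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Region> regions = new ArrayList<>();
        regions.add(new Region("日本語のテキスト", true)); // primary script
        regions.add(new Region("OK", false));             // short Latin run
        for (Region r : merge(regions, 10)) {
            System.out.println((r.primaryScript ? "primary: " : "other: ") + r.text);
        }
    }
}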
The following options are described in more detail in Structured Text.
Table 33. Structured Text Options

Option | Description | Type (Default) | Supported Languages
fragmentBoundaryDetection | Turns on fragment boundary detection. | Boolean (true) | All
fragmentBoundaryDelimiters | Specifies the fragment boundary delimiters. | String ("\u0009\u000B\u000C") | All
maxTokensForShortLine | The maximum number of tokens in a short line. | Integer (6) | All
The following options are described in more detail in Social Media Tokens: Emoji & Emoticons, Hashtags, @Mentions, Email Addresses, URLs.
Table 34. Social Media Token Options

Option | Description | Default | Supported Languages
emoji | Enables emoji tokenization. | true | All
emoticons | Enables emoticon tokenization. | false | All
atMentions | Enables @mention tokenization. | false | All
hashtags | Enables hashtag tokenization. | false | All
emailAddresses | Enables email address tokenization. | false | All
urls | Enables URL tokenization. | false | All
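A hypothetical sketch of enabling these token types follows. The option keys come from Table 34; the map is only a stand-in for however your RBL entry point accepts options.

import java.util.Map;

public final class SocialMediaOptions {
    public static void main(String[] args) {
        Map<String, String> options = Map.of(
                "hashtags", "true",       // #topic becomes one token
                "atMentions", "true",     // @user becomes one token
                "emailAddresses", "true",
                "urls", "true",
                "emoticons", "false");    // leave emoticons off (the default)
        options.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}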
The following options are described in more detail in Analyzers.
Table 35. General Analyzer Options

Option | Description | Type (Default) | Supported Languages
analysisCacheSize / cacheSize | Maximum number of entries in the analysis cache. Larger values increase throughput but use extra memory. If zero, caching is off. | Integer (100,000) | All
caseSensitive | Indicates whether analyzers produced by the factory are case sensitive. If false, they ignore case distinctions. | Boolean (true) | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish
deliverExtendedTags | Indicates whether the analyzers should return extended tags with the raw analysis. If true, the extended tags are returned. | Boolean (false) | All
normalizationDictionaryPaths | A list of paths to user many-to-one normalization dictionaries, separated by semicolons or the OS-specific path separator. | List of paths | All
query | Indicates that the input will be queries, likely incomplete sentences. If true, analyzers may change their behavior (e.g., disable disambiguation). | Boolean (false) | All
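The value of normalizationDictionaryPaths is a joined list of paths. A minimal sketch, assuming two hypothetical dictionary files:

import java.io.File;
import java.util.List;

public final class NormalizationDictionaryPaths {
    public static void main(String[] args) {
        List<String> dictionaries = List.of(
                "/data/rbl/norm/acronyms.bin",   // hypothetical paths
                "/data/rbl/norm/variants.bin");
        // Either ';' or the OS-specific separator (File.pathSeparator) works.
        String value = String.join(File.pathSeparator, dictionaries);
        System.out.println("normalizationDictionaryPaths = " + value);
    }
}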
The following options are described in more detail in Compounds.
Table 36. Compound Options

Option | Description | Type (Default) | Supported Languages
decomposeCompounds | Indicates whether to decompose compounds. For Chinese and Japanese, tokenizerType must be SPACELESS_LEXICAL. If koreanDecompounding is enabled but decomposeCompounds is disabled, compounds will be decomposed. | Boolean (true) | Chinese, Danish, Dutch, German, Hungarian, Japanese, Korean, Norwegian (Bokmål, Nynorsk), Swedish
compoundComponentSurfaceForms | Indicates whether to return the surface forms of compound components. When this option is enabled and ADM results are returned, getText returns the surface form of a component Token, and its lemma can be retrieved with Token#getAnalyses() and MorphoAnalysis#getLemma(). When this option is enabled and the results are not in ADM format, getCompoundComponentSurfaceForms returns the surface forms of the components of a compound word's Analysis. This option has no effect when decomposeCompounds is false. | Boolean (false) | Dutch, German, Hungarian
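A hypothetical sketch of a German decompounding configuration using the option keys from Table 36; the compound named in the comment is an illustrative example, not RBL output.

import java.util.Map;

public final class GermanCompoundOptions {
    public static void main(String[] args) {
        // With both options enabled, a compound such as "Hausaufgaben" is
        // decomposed, and the surface form of each component is returned
        // alongside its lemma (illustrative example).
        Map<String, String> options = Map.of(
                "decomposeCompounds", "true",
                "compoundComponentSurfaceForms", "true");
        options.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}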
The following options are described in more detail in Disambiguation.
Table 37. Disambiguation Options

Option | Description | Type (Default) | Supported Languages
disambiguate | Indicates whether the analyzers should disambiguate the results. | Boolean (true) | Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish
alternativeEnglishDisambiguation | Enables faster part-of-speech disambiguation for English. | Boolean (false) | English
alternativeGreekDisambiguation | Enables faster part-of-speech disambiguation for Greek. | Boolean (false) | Greek
alternativeSpanishDisambiguation | Enables faster part-of-speech disambiguation for Spanish. | Boolean (false) | Spanish
The following options are described in more detail in Returning Universal Part-of-Speech (POS) Tags.
Table 38. Universal POS Tag Options

Option | Description | Type (Default) | Supported Languages
universalPosTags | Indicates whether POS tags should be converted to their universal versions. | Boolean (false) | POS tags are defined for Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Malay (Standard), Persian, Polish, Portuguese, Russian, Spanish, Tagalog, and Urdu.
customPosTagsUri | URI of a POS tag map. | URI | POS tags are defined for Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Malay (Standard), Persian, Polish, Portuguese, Russian, Spanish, Tagalog, and Urdu.
The following options are described in more detail in Contraction Splitting Rule File Format.
Table 39. Contraction Splitting Options

Option | Description | Type (Default) | Supported Languages
tokenizeContractions | Indicates whether to deliver contractions as multiple tokens. If false, they are delivered as a single token. | Boolean (false) | All
customTokenizeContractionRulesUri | URI of a contraction rules file. | URI | All
The following options are only available when using the ADM API.
Table 40. Annotator Object Options

Option | Description | Type (Default) | Supported Languages
analyze | Enables analysis. If false, the annotator will only perform tokenization. | Boolean (true) | All
customPosTagsUri | URI of a POS tag map file for use by the universalPosTags option. | URI | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish
Chinese and Japanese Options
The following options are described in more detail in Chinese and Japanese Lexical Tokenization.
Table 41. Chinese and Japanese Lexical Options

Option | Description | Default value | Supported languages
breakAtAlphaNumIntraWordPunct | Indicates whether to consider punctuation between alphanumeric characters as a break. Has no effect when consistentLatinSegmentation is true. | false | Chinese
consistentLatinSegmentation | Indicates whether to provide consistent segmentation of embedded text not in the primary script. If true, the setting of segmentNonJapanese is ignored. | true | Chinese, Japanese
decomposeCompounds | Indicates whether to decompose compounds. | true | Chinese, Japanese
deepCompoundDecomposition | Indicates whether to recursively decompose each token into smaller tokens, if the token is marked in the dictionary as decomposable. If deep decompounding is enabled, decomposable tokens are further decomposed into additional tokens. Has no effect when decomposeCompounds is false. | false | Chinese, Japanese
favorUserDictionary | Indicates whether to favor words in the user dictionary during segmentation. | false | Chinese, Japanese
ignoreSeparators | Indicates whether to ignore whitespace separators when segmenting input text. If false, whitespace separators are treated as morpheme delimiters. Has no effect when whitespaceTokenization is true. | true | Japanese
ignoreStopwords | Indicates whether to filter stop words out of the output. | false | Chinese, Japanese
joinKatakanaNextToMiddleDot | Indicates whether to join sequences of Katakana tokens adjacent to a middle-dot token. | true | Japanese
minLengthForScriptChange | Sets the minimum length of non-native text to be considered for a script change. A script change indicates a boundary between tokens, so the length may influence how a mixed-script string is tokenized. Has no effect when consistentLatinSegmentation is false. | 10 | Chinese, Japanese
pos | Indicates whether to add parts of speech and secondary parts of speech to morphological analyses. | true | Chinese, Japanese
segmentNonJapanese | Indicates whether to segment each run of numbers or Latin letters into its own token, without splitting on medial number/word joiners. Has no effect when consistentLatinSegmentation is true. | true | Japanese
separateNumbersFromCounters | Indicates whether to return numbers and counters as separate tokens. | true | Japanese
separatePlaceNameFromSuffix | Indicates whether to segment place names from their suffixes. | true | Japanese
whiteSpaceIsNumberSep | Indicates whether to treat whitespace as a number separator. Has no effect when consistentLatinSegmentation is true. | true | Chinese
whitespaceTokenization | Indicates whether to treat whitespace as a morpheme delimiter. | false | Chinese, Japanese
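A hypothetical sketch of a Japanese segmentation configuration drawn from Table 41; the keys are the documented option names, and the map again stands in for the actual configuration mechanism.

import java.util.Map;

public final class JapaneseLexicalOptions {
    public static void main(String[] args) {
        Map<String, String> options = Map.of(
                "decomposeCompounds", "true",         // the default
                "deepCompoundDecomposition", "false",
                "ignoreStopwords", "true",            // filter stop words from output
                "joinKatakanaNextToMiddleDot", "true",
                "separateNumbersFromCounters", "true");
        options.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}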
The following options are described in more detail in Chinese and Japanese Readings.
Table 42. Chinese and Japanese Readings

Option | Description | Default value | Supported languages
generateAll | Indicates whether to return all the readings for a token. Has no effect when readings is false. | false | Chinese
readingByCharacter | Indicates whether to skip directly to the fallback behavior of readings without considering readings for whole words. Has no effect when readings is false. | false | Chinese, Japanese
readings | Indicates whether to add readings to morphological analyses. The annotator tries to add readings for whole words; if it cannot, it concatenates the readings of individual characters. | false | Chinese, Japanese
readingsSeparateSyllables | Indicates whether to add a separator character between readings when concatenating readings by character. Has no effect when readings is false. | false | Chinese, Japanese
readingType | Sets the representation of Chinese readings. Possible values (case-insensitive) are cjktex (macros for the CJKTeX pinyin.sty style), no_tones (pinyin without tones), tone_marks (pinyin with diacritics over the appropriate vowels), and tone_numbers (pinyin with a number from 1 to 4 suffixed to each syllable, or no number for neutral tone). | tone_marks | Chinese
useVForUDiaeresis | Indicates whether to use 'v' instead of 'ü' in pinyin readings, a common substitution in environments that lack diacritics. The value is ignored when readingType is cjktex or tone_marks, which always use 'v' and 'ü' respectively. It is probably most useful when readingType is tone_numbers. Has no effect when readings is false. | false | Chinese
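A hypothetical sketch requesting numbered-tone pinyin readings; the comment shows how the same two syllables would differ across readingType values (illustrative).

import java.util.Map;

public final class ChineseReadingOptions {
    public static void main(String[] args) {
        // The same two syllables under each representation:
        //   tone_marks   -> zhōng wén
        //   tone_numbers -> zhong1 wen2
        //   no_tones     -> zhong wen
        Map<String, String> options = Map.of(
                "readings", "true",
                "readingType", "tone_numbers",
                "useVForUDiaeresis", "true"); // e.g. lv4 instead of lǜ4
        options.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}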
The following options are described in more detail in Hebrew Analyses.
Table 43. Hebrew Options

Option | Description | Type (Default)
guessHebrewPrefixes | Splits prefixes off unknown Hebrew words. | Boolean (false)
includeHebrewRoots | Indicates whether to generate Semitic root forms. | Boolean (false)

Table 44. Hebrew Disambiguation Options

Option | Description | Type (Default) | Supported Languages
disambiguatorType | Selects which disambiguator to use for Hebrew. | DisambiguatorType (PERCEPTRON) | Hebrew
Chinese Script Converter Options
The following options are described in more detail in Chinese Script Converter (CSC).
Table 45. CSC Options

Option | Description | Type (Default) | Supported Languages
conversionLevel | Indicates the most complex conversion level to use. | CSConversionLevel (lexemic) | Chinese
language | The language from which the CSCAnalyzer is converting. | LanguageCode | Chinese, Simplified Chinese, Traditional Chinese
targetLanguage | The language to which the CSCAnalyzer is converting. | LanguageCode | Chinese, Simplified Chinese, Traditional Chinese
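A hypothetical sketch of a Simplified-to-Traditional conversion setup. The option keys are from Table 45, but the language-code values shown here are assumptions; check the language-code reference for the actual values.

import java.util.Map;

public final class CscOptions {
    public static void main(String[] args) {
        Map<String, String> options = Map.of(
                "language", "zhs",             // assumed code: Simplified Chinese
                "targetLanguage", "zht",       // assumed code: Traditional Chinese
                "conversionLevel", "lexemic"); // the default level
        options.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}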
The following options are described in more detail in Using RBL in Apache Lucene.
Table 46. Lucene Filter Options

Option | Description | Type (Default) | Supported Languages
addLemmaTokens | Indicates whether the token filter should add the lemmas (if none, the stems) of each surface token to the tokens being returned. | Boolean (true) | All
addReadings | Indicates whether the token filter should add the readings of each surface token to the tokens being returned. | Boolean (false) | Chinese, Japanese
identifyContractionComponents | Indicates whether the token filter should identify contraction components as contraction components rather than as lemmas. | Boolean (false) | All
replaceTokensWithLemmas | Indicates whether the token filter should replace a surface token with its lemma. Disambiguation must be enabled. | Boolean (false) | All
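A sketch of wiring these options into a Lucene analysis chain with Lucene's CustomAnalyzer builder. The SPI name "baseLinguistics" and the paths are placeholders, not the documented factory names; see Using RBL in Apache Lucene for the actual names and parameters.

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

public final class RblLuceneExample {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = CustomAnalyzer.builder()
                .withTokenizer("baseLinguistics",      // placeholder SPI name
                        "language", "eng",
                        "rootDirectory", "/opt/rbl")   // hypothetical path
                .addTokenFilter("baseLinguistics",     // placeholder SPI name
                        "language", "eng",
                        "addLemmaTokens", "true",
                        "replaceTokensWithLemmas", "false")
                .build();
        System.out.println(analyzer);
    }
}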
Table 47. Lucene User Dictionary Path Options

Option | Description | Type | Supported Languages
lemDictionaryPath | A list of paths to user lemma dictionaries. | List of paths | Chinese, Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai
segDictionaryPath | A list of paths to user segmentation dictionaries. | List of paths | All
userDefinedDictionaryPath | A list of paths to user dictionaries. | List of paths | All
userDefinedReadingDictionaryPath | A list of paths to reading dictionaries. | List of paths | Japanese