The analyzer is a language-specific processor that uses dictionaries and statistical analysis to add analysis objects to tokens.
To extend the coverage that RBL provides for each supported language, you can create user dictionaries. Segmentation user dictionaries are supported for all languages. Lemma user dictionaries are supported for Chinese, Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, and Thai.
A stem is the substring of a word that remains after prefixes and suffixes are removed, while the lemma is the dictionary form of a word. RBL supports stems for Arabic, Finnish, Persian, and Urdu.
Semitic roots are generated for Arabic and Hebrew.
The option name for setting the analysis cache depends on the accepting factory: analysisCacheSize is a BaseLinguisticsOption, while cacheSize is an option for both AnalyzerOption and CSCAnalyzerOption. They all perform the same function.
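RBL's cache implementation is internal, but the behavior this option governs can be sketched as a bounded least-recently-used lookup cache. The class below is a hypothetical illustration (not part of the RBL API), assuming LRU eviction and that a size of zero turns caching off, as the option description states:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a bounded analysis cache like the one
// analysisCacheSize configures. A size of zero disables caching entirely.
public class AnalysisCache {
    private final int maxEntries;
    private final Map<String, String> cache;

    public AnalysisCache(int maxEntries) {
        this.maxEntries = maxEntries;
        // accessOrder=true gives LRU order; removeEldestEntry enforces the bound
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > AnalysisCache.this.maxEntries;
            }
        };
    }

    // Returns the cached analysis for a surface form, computing and storing
    // it on a miss. With maxEntries == 0, every lookup is computed fresh.
    public String lookup(String surfaceForm, Function<String, String> analyze) {
        if (maxEntries == 0) {
            return analyze.apply(surfaceForm); // caching off
        }
        return cache.computeIfAbsent(surfaceForm, analyze);
    }

    public int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        AnalysisCache cache = new AnalysisCache(2);
        cache.lookup("dogs", w -> "dog");   // miss: analyzed and stored
        cache.lookup("cats", w -> "cat");   // miss
        cache.lookup("mice", w -> "mouse"); // miss: evicts the least recently used entry
        System.out.println(cache.size());   // 2
    }
}
```

The trade-off in the option description is visible here: a larger bound keeps more analyses resident (higher throughput, more memory), while zero skips the cache entirely.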
Table 5. General Analyzer Options

| Option | Description | Type (Default) | Supported Languages |
| --- | --- | --- | --- |
| analysisCacheSize / cacheSize | Maximum number of entries in the analysis cache. Larger values increase throughput but use extra memory. If zero, caching is off. | Integer (100,000) | All |
| caseSensitive | Indicates whether analyzers produced by the factory are case sensitive. If false, they ignore case distinctions. | Boolean (true) | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Malay (Standard), Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Tagalog |
| deliverExtendedTags | Indicates whether the analyzers should return extended tags with the raw analysis. If true, the extended tags are returned. | Boolean (false) | All |
| normalizationDictionaryPaths | A list of paths to user many-to-one normalization dictionaries, separated by semicolons or the OS-specific path separator. | List of paths | All |
| query | Indicates the input will be queries, likely incomplete sentences. If true, analyzers may change their behavior (e.g. disable disambiguation). | Boolean (false) | All |
| tokenizerType | Selects the tokenizer to use with this analyzer. | TokenizerType (SPACELESS_STATISTICAL for Chinese, Japanese, Thai; ICU for all other languages) | All |
Note
When creating tokenizers and analyzers, the tokenizerType must be the same for both: set TokenizerOption#tokenizerType and AnalyzerOption#tokenizerType to the same value.
Enum Classes:
AnalyzerOption
BaseLinguisticsOption
CSCAnalyzerOption
For each token and normalized form in the token stream, the analyzer performs a dictionary lookup starting with any user dictionaries followed by the RBL dictionary. During lookup, RBL ignores the context in which the token or normalized form appears.
Once the analyzer has found one or more lemmas in a dictionary, it does not consult additional dictionaries. In other words, if two user dictionaries are specified and the analyzer finds a lemma in the first dictionary, it does not consult the second user dictionary or the RBL dictionary.
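The first-match-wins lookup order can be sketched as follows. The class and method names are illustrative, not RBL API; the dictionaries are modeled as plain maps from surface form to lemmas:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the lookup order described above: user dictionaries
// are consulted in the order they were specified, then the RBL dictionary,
// and the first dictionary containing the token wins.
public class DictionaryChain {
    private final List<Map<String, List<String>>> dictionaries;

    public DictionaryChain(List<Map<String, List<String>>> dictionaries) {
        this.dictionaries = dictionaries; // user dictionaries first, RBL dictionary last
    }

    public List<String> lemmasFor(String token) {
        for (Map<String, List<String>> dict : dictionaries) {
            List<String> lemmas = dict.get(token);
            if (lemmas != null) {
                return lemmas; // first match wins; later dictionaries are never consulted
            }
        }
        return List.of(); // no dictionary knows this token
    }

    public static void main(String[] args) {
        Map<String, List<String>> userDict1 = Map.of("leaves", List.of("leaf"));
        Map<String, List<String>> userDict2 = Map.of("leaves", List.of("leave"));
        Map<String, List<String>> rblDict = Map.of("leaves", List.of("leaf", "leave"));
        DictionaryChain chain = new DictionaryChain(List.of(userDict1, userDict2, rblDict));
        // userDict1 has an entry, so userDict2 and the RBL dictionary are skipped
        System.out.println(chain.lemmasFor("leaves")); // [leaf]
    }
}
```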
Unless overridden by an analysis dictionary, the only lemmatization done in Chinese and Thai is number normalization. Other Chinese and Thai tokens' lemmas are equal to their surface forms.
There is no analysis dictionary available for Finnish, Pashto, or Urdu. All other languages are supported.
No dictionary can ever be complete: new words enter languages, and languages change and borrow from one another. So, in general, analysis for each language includes some sort of guessing capability. The job of a guesser is to take a word and come up with some analysis of it. Any fact RBL can generate for a language is a possible output of that language's guesser.
In European languages, guessers deliver lemmas and parts of speech. In Korean, guessers provide morphemes, morpheme tags, compound components, and parts of speech.
By default, the analyzer returns any lemma that contains whitespace as multiple lemmas (each with no whitespace). To allow lemmas with whitespace (such as International Business Machines as a lemma for the token IBM) to be placed as such in the token stream, you can create a user analysis dictionary with an entry that defines the lemma. For example:

IBM International[^_]Business[^_]Machines[+PROP]
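A sketch of how such an entry could be decoded: [^_] stands for a space within the lemma, and a bracketed [+TAG] suffix carries the POS tag. The parser shown here is illustrative only, not RBL's actual dictionary compiler:

```java
// Hypothetical sketch of decoding a user analysis dictionary entry of the
// form "TOKEN LEMMA", where "[^_]" escapes a space inside the lemma and an
// optional trailing "[+TAG]" carries the POS tag.
public class UserDictEntry {
    public final String token;
    public final String lemma;
    public final String posTag; // null if the entry carries no tag

    private UserDictEntry(String token, String lemma, String posTag) {
        this.token = token;
        this.lemma = lemma;
        this.posTag = posTag;
    }

    public static UserDictEntry parse(String line) {
        String[] parts = line.split("\\s+", 2); // token, then the lemma part
        String token = parts[0];
        String rest = parts[1];
        String posTag = null;
        int tagStart = rest.indexOf("[+");
        if (tagStart >= 0) { // peel off the "[+TAG]" suffix, if present
            posTag = rest.substring(tagStart + 2, rest.indexOf(']', tagStart));
            rest = rest.substring(0, tagStart);
        }
        String lemma = rest.replace("[^_]", " "); // escaped space inside the lemma
        return new UserDictEntry(token, lemma, posTag);
    }

    public static void main(String[] args) {
        UserDictEntry e = UserDictEntry.parse("IBM International[^_]Business[^_]Machines[+PROP]");
        System.out.println(e.lemma);  // International Business Machines
        System.out.println(e.posTag); // PROP
    }
}
```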
The analyzer decomposes Chinese, Danish, Dutch, German, Hungarian, Japanese, Korean, Norwegian, and Swedish compounds, returning the lemmas of each of the components.
The lemmas may differ from their surface form in the compound, such that the concatenation of the components is not the same as the original compound (or its lemma). Components are often connected by elements that are present only in the compound form.
For example, the German compound Eingangstüren (entry doors) is made up of two components, Eingang (entry) and Tür (door), and the connecting 's' is not present in the component list. For this input token, the RBL tokenizer and analyzer return the following entries:
- Original form: Eingangstüren
- Lemma for the compound: Eingangstür
- Component lemmas: Eingang, Tür
Other German examples include letter removal (Rennrad ⇒ rennen + Rad), vowel changes (Mängelliste ⇒ Mangel + Liste), and capitalization changes (Blaugrünalge ⇒ blau + grün + Alge).
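In data form, the Eingangstüren example above looks like this. The Decomposition record is hypothetical (not an RBL type); it simply shows that the component lemmas need not concatenate back to the surface form:

```java
import java.util.List;

// Illustrative data sketch: the component lemmas of a German compound
// do not necessarily concatenate to the original surface form.
public class CompoundDemo {
    record Decomposition(String surface, String lemma, List<String> componentLemmas) {}

    public static void main(String[] args) {
        Decomposition d = new Decomposition(
                "Eingangstüren", "Eingangstür", List.of("Eingang", "Tür"));
        // the connecting "s" and the plural ending are absent from the components
        System.out.println(String.join("", d.componentLemmas())); // EingangTür
        System.out.println(String.join("", d.componentLemmas()).equals(d.surface())); // false
    }
}
```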
Table 6. Compound Options

| Option | Description | Type (Default) | Supported Languages |
| --- | --- | --- | --- |
| decomposeCompounds | Indicates whether to decompose compounds. For Chinese and Japanese, tokenizerType must be SPACELESS_LEXICAL. If koreanDecompounding is enabled but decomposeCompounds is disabled, compounds will be decomposed. | Boolean (true) | Chinese, Danish, Dutch, German, Hungarian, Japanese, Korean, Norwegian (Bokmål, Nynorsk), Swedish |
| compoundComponentSurfaceForms | Indicates whether to return the surface forms of compound components. When this option is enabled and ADM results are returned, getText returns the surface form of a component Token, and its lemma can be retrieved using Token#getAnalyses() and MorphoAnalysis#getLemma(). When this option is enabled and the results are not in ADM format, getCompoundComponentSurfaceForms returns the surface forms of a compound word's Analysis, and its surface form is not available. This option has no effect when decomposeCompounds is set to false. | Boolean (false) | Dutch, German, Hungarian |
Enum Classes:
AnalyzerOption
BaseLinguisticsOption
For some languages, the analyzer can disambiguate between multiple analysis objects and return the disambiguated analysis object. The disambiguate option enables the disambiguator. When true, the disambiguator determines the best analysis for each word given the context in which it appears.
When using an annotator, the disambiguated result is at the head of the list of all possible analyses; the remainder of the list is ordered randomly. When using a tokenizer/analyzer, use the method getSelectedAnalysis to return the disambiguated result.
For all languages except Japanese, disambiguation is enabled by default. For performance reasons, disambiguation is disabled by default for Japanese when using the statistical model.
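The head-of-list contract above can be sketched as follows. The class is hypothetical; only the getSelectedAnalysis name mirrors the tokenizer/analyzer API:

```java
import java.util.List;

// Hypothetical sketch of the contract described above: with disambiguation
// enabled, the selected analysis sits at the head of the list of all
// possible analyses; the remainder are alternatives in no guaranteed order.
public class DisambiguatedToken {
    private final List<String> analyses; // head = disambiguated result

    public DisambiguatedToken(List<String> analyses) {
        this.analyses = analyses;
    }

    // mirrors the role of getSelectedAnalysis in the tokenizer/analyzer API
    public String getSelectedAnalysis() {
        return analyses.isEmpty() ? null : analyses.get(0);
    }

    public static void main(String[] args) {
        DisambiguatedToken token = new DisambiguatedToken(List.of("saw+VERB", "saw+NOUN"));
        System.out.println(token.getSelectedAnalysis()); // saw+VERB
    }
}
```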
Table 7. Disambiguation Options

| Option | Description | Type (Default) | Supported Languages |
| --- | --- | --- | --- |
| disambiguate | Indicates whether the analyzers should disambiguate the results. | Boolean (true) | Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish |
| alternativeEnglishDisambiguation | Enables faster part-of-speech disambiguation for English. | Boolean (false) | English |
| alternativeGreekDisambiguation | Enables faster part-of-speech disambiguation for Greek. | Boolean (false) | Greek |
| alternativeSpanishDisambiguation | Enables faster part-of-speech disambiguation for Spanish. | Boolean (false) | Spanish |
Enum Classes:
AnalyzerOption
BaseLinguisticsOption
Part-of-Speech (POS) Tags
In RBL, each language has its own set of POS tags, and a few languages have multiple tag sets. Each tag set is identified by an identifier, which is a value of the TagSet enum. When RBL outputs a POS tag, it also lists the identifier for the tag set it came from. Output from a single language may contain POS tags from multiple tag sets, including the language-neutral set.
POS tags are defined for Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Russian, Spanish, and Urdu.
A POS tag map file is a YAML file encoded in UTF-8. It is a sequence of mapping rules.
A mapping rule is a sequence of two elements: the POS tag to be mapped and a sequence of submappings. Rules are checked in the order they appear in the rule file. A token which matches a rule is not checked against any further rules.
A submapping is a mapping with the keys m, s, and t. m is a Java regular expression. s is a surface form. m and s are optional: they can be omitted or null. t specifies the output POS tag to use when the following criteria are met:
- The input token's POS tag equals the POS tag to be mapped.
- m (if any) matches a substring of the input token's raw analysis.
- s (if any) equals the input token's surface form, compared case-insensitively.
-
  - NUM_VOC
  -
    - { m: \+Total, t: PRON }
    - { s: moc, t: DET }
    - { s: oba, t: DET }
    - { t: NUM }
This rule maps tokens with Basis's NUM_VOC POS tag. If the input token's raw analysis matches the regular expression \+Total, the token becomes a PRON. Otherwise, if the token's surface form is moc or oba, the token becomes a DET. Otherwise, the token becomes a NUM.
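The matching logic described above can be sketched as follows. This is an illustrative implementation, not RBL's; the raw analysis strings in the demo are invented for the example:

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of the submapping semantics described above.
// Submappings are checked in order; the first one whose m matches a
// substring of the raw analysis and whose s equals the surface form
// (case-insensitively) supplies the output tag t.
public class TagMapper {
    record Submapping(String m, String s, String t) {}

    static String map(String posTag, String rawAnalysis, String surfaceForm,
                      String ruleTag, List<Submapping> submappings) {
        if (!posTag.equals(ruleTag)) {
            return posTag; // rule does not apply; tag passes through unchanged
        }
        for (Submapping sub : submappings) {
            boolean mOk = sub.m() == null
                    || Pattern.compile(sub.m()).matcher(rawAnalysis).find();
            boolean sOk = sub.s() == null || sub.s().equalsIgnoreCase(surfaceForm);
            if (mOk && sOk) {
                return sub.t(); // first matching submapping wins
            }
        }
        return posTag; // no submapping matched
    }

    public static void main(String[] args) {
        // the NUM_VOC rule from the example above
        List<Submapping> rule = List.of(
                new Submapping("\\+Total", null, "PRON"),
                new Submapping(null, "moc", "DET"),
                new Submapping(null, "oba", "DET"),
                new Submapping(null, null, "NUM"));
        System.out.println(map("NUM_VOC", "pět+Total+Num", "pěti", "NUM_VOC", rule)); // PRON
        System.out.println(map("NUM_VOC", "moc+Num", "moc", "NUM_VOC", rule));        // DET
        System.out.println(map("NUM_VOC", "tři+Num", "třemi", "NUM_VOC", rule));      // NUM
    }
}
```

The trailing { t: NUM } submapping acts as a catch-all, since both m and s are absent.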