The analyzer is a language-specific processor that uses dictionaries and statistical analysis to add analysis objects to tokens.
To extend the coverage that RBL provides for each supported language, you can create user dictionaries. Segmentation user dictionaries are supported for all languages. Lemma user dictionaries are supported for Chinese, Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, and Thai.
A stem is the substring of a word that remains after prefixes and suffixes are removed, while the lemma is the dictionary form of a word. RBL supports stems for Arabic, Finnish, Persian, and Urdu.
Semitic roots are generated for Arabic and Hebrew.
The option name to set the analysis cache depends on the accepting factory. The option analysisCacheSize is a BaseLinguisticsOption, while cacheSize is an option for both AnalyzerOption and CSCAnalyzerOption. They all perform the same function.
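An analysis cache of this kind is, in effect, a size-bounded lookup table with eviction. As an illustration only (this is not RBL's implementation, and the class name is hypothetical), a least-recently-used cache with a maximum entry count, where a size of zero disables caching, might be sketched in Java:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch, not RBL code: a size-bounded LRU cache. When the
// entry count would exceed the maximum, the least-recently-accessed entry
// is evicted. A maximum of zero disables caching entirely.
class AnalysisCache<K, V> {
    private final int maxEntries;
    private final Map<K, V> entries;

    AnalysisCache(int maxEntries) {
        this.maxEntries = maxEntries;
        // accessOrder = true keeps entries ordered least- to most-recently used
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > AnalysisCache.this.maxEntries;
            }
        };
    }

    V get(K key) {
        return maxEntries == 0 ? null : entries.get(key);
    }

    void put(K key, V value) {
        if (maxEntries > 0) {
            entries.put(key, value);
        }
    }

    int size() {
        return entries.size();
    }
}
```

Larger caches trade memory for throughput exactly as the option description says: repeated tokens skip re-analysis at the cost of one stored entry each.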
Table 7. General Analyzer Options

| Option | Description | Type (Default) | Supported Languages |
|---|---|---|---|
| analysisCacheSize, cacheSize | Maximum number of entries in the analysis cache. Larger values increase throughput but use extra memory. If zero, caching is off. | Integer (100,000) | All |
| caseSensitive | Indicates whether analyzers produced by the factory are case sensitive. If false, they ignore case distinctions. | Boolean (true) | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish |
| deliverExtendedTags | Indicates whether the analyzers should return extended tags with the raw analysis. If true, the extended tags are returned. | Boolean (false) | All |
| normalizationDictionaryPaths | A list of paths to user many-to-one normalization dictionaries, separated by semicolons or the OS-specific path separator. | List of paths | All |
| query | Indicates the input will be queries, likely incomplete sentences. If true, analyzers may change their behavior (e.g. disable disambiguation). | Boolean (false) | All |
| tokenizerType | Selects the tokenizer to use with this analyzer. | TokenizerType (SPACELESS_STATISTICAL for Chinese, Japanese, and Thai; ICU for all other languages) | All |
Note: When creating tokenizers and analyzers, the tokenizerType must be the same for both.
Enum Classes:
- AnalyzerOption
- BaseLinguisticsOption
- CSCAnalyzerOption
For each token and normalized form in the token stream, the analyzer performs a dictionary lookup starting with any user dictionaries followed by the RBL dictionary. During lookup, RBL ignores the context in which the token or normalized form appears.
Once the analyzer has found one or more lemmas in a dictionary, it does not consult additional dictionaries. In other words, if two user dictionaries are specified and the analyzer finds a lemma in the first dictionary, it does not consult the second user dictionary or the RBL dictionary.
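This first-hit lookup order can be sketched as follows; the Dictionary interface and the names here are hypothetical, for illustration only:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical dictionary interface: maps a token to zero or more lemmas.
interface Dictionary {
    List<String> lookup(String token);
}

class LemmaLookup {
    // Consults dictionaries in order (user dictionaries first, then the
    // built-in RBL dictionary) and returns the lemmas from the first
    // dictionary that has any; later dictionaries are never consulted.
    static List<String> findLemmas(List<Dictionary> dictionaries, String token) {
        for (Dictionary dict : dictionaries) {
            List<String> lemmas = dict.lookup(token);
            if (!lemmas.isEmpty()) {
                return lemmas; // first hit wins
            }
        }
        return Collections.emptyList(); // no dictionary knows this token
    }
}
```

Note that an entry in an earlier dictionary completely shadows entries for the same token in later dictionaries, which is why a user dictionary can override RBL's built-in analysis.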
Unless overridden by an analysis dictionary, the only lemmatization done in Chinese and Thai is number normalization. Other Chinese and Thai tokens' lemmas are equal to their surface forms.
There is no analysis dictionary available for Finnish, Pashto, or Urdu. All other languages are supported.
No dictionary can ever be complete: new words are added to languages, and languages change and borrow from one another. So, in general, analysis for each language includes some guessing capability. The job of a guesser is to take a word and produce some analysis of it. Whatever facts RBL generates for a language are all possible outputs of a guesser.
In European languages, guessers deliver lemmas and parts of speech. In Korean, guessers provide morphemes, morpheme tags, compound components, and parts of speech.
By default, the analyzer returns any lemma that contains whitespace as multiple lemmas (each with no whitespace). To allow lemmas with whitespace (such as International Business Machines as a lemma for the token IBM) to be placed as such in the token stream, you can create a user analysis dictionary with an entry that defines the lemma. For example:
IBM International[^_]Business[^_]Machines[+PROP]
The analyzer decomposes Chinese, Danish, Dutch, German, Hungarian, Japanese, Korean, Norwegian, and Swedish compounds, returning the lemmas of each of the components.
The lemmas may differ from their surface form in the compound, such that the concatenation of the components is not the same as the original compound (or its lemma). Components are often connected by elements that are present only in the compound form.
For example, the German compound Eingangstüren (entry doors) is made up of two components, Eingang (entry) and Tür (door), and the connecting 's' is not present in the component list. For this input token, the RBL tokenizer and analyzer return the following entries:
- Original form: Eingangstüren
- Lemma for the compound: Eingangstür
- Component lemmas: Eingang, Tür
Other German examples include letter removal (Rennrad ⇒ rennen + Rad), vowel changes (Mängelliste ⇒ Mangel + Liste), and capitalization changes (Blaugrünalge ⇒ blau + grün + Alge).
Table 8. Compound Options

| Option | Description | Type (Default) | Supported Languages |
|---|---|---|---|
| decomposeCompounds | Indicates whether to decompose compounds. For Chinese and Japanese, tokenizerType must be SPACELESS_LEXICAL. If koreanDecompounding is enabled but decomposeCompounds is disabled, compounds will be decomposed. | Boolean (true) | Chinese, Danish, Dutch, German, Hungarian, Japanese, Korean, Norwegian (Bokmål, Nynorsk), Swedish |
| compoundComponentSurfaceForms | Indicates whether to return the surface forms of compound components. When this option is enabled and ADM results are returned, getText returns the surface form of a component Token, and its lemma can be retrieved using Token#getAnalyses() and MorphoAnalysis#getLemma(). When this option is enabled and the results are not in ADM format, getCompoundComponentSurfaceForms returns the surface forms of a compound word's Analysis, and its surface form is not available. This option has no effect when decomposeCompounds is set to false. | Boolean (false) | Dutch, German, Hungarian |
Enum Classes:
- AnalyzerOption
- BaseLinguisticsOption
For some languages, the analyzer can disambiguate between multiple analysis objects and return the disambiguated analysis object. The disambiguate option enables the disambiguator. When true, the disambiguator determines the best analysis for each word given the context in which it appears.
When using an annotator, the disambiguated result is at the head of the list of all possible analyses. The remainder of the list is ordered randomly. When using a tokenizer/analyzer, use the method getSelectedAnalysis to return the disambiguated result.
For all languages except Japanese, disambiguation is enabled by default. For performance reasons, disambiguation is disabled by default for Japanese when using the statistical model.
Table 9. Disambiguation Options

| Option | Description | Type (Default) | Supported Languages |
|---|---|---|---|
| disambiguate | Indicates whether the analyzers should disambiguate the results. | Boolean (true) | Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish |
| alternativeEnglishDisambiguation | Enables faster part-of-speech disambiguation for English. | Boolean (false) | English |
| alternativeGreekDisambiguation | Enables faster part-of-speech disambiguation for Greek. | Boolean (false) | Greek |
| alternativeSpanishDisambiguation | Enables faster part-of-speech disambiguation for Spanish. | Boolean (false) | Spanish |
Enum Classes:
- AnalyzerOption
- BaseLinguisticsOption
Part-of-Speech (POS) Tags
In RBL, each language has its own set of POS tags, and a few languages have multiple tag sets. Each tag set is identified by an identifier, which is a value of the TagSet enum. When RBL outputs a POS tag, it also lists the identifier for the tag set it came from. Output from a single language may contain POS tags from multiple tag sets, including the language-neutral set.
POS tags are defined for Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Malay (Standard), Persian, Polish, Portuguese, Russian, Spanish, Tagalog, and Urdu.
Returning Universal Part-of-Speech (POS) Tags
The universalPosTags option converts BasisTech POS tags to universal POS tags, as defined by the Universal Dependencies project. The POS tag mappings are defined by POS tag map files. By default, the annotator uses the map in rootDirectory/upt-16/upt-16-<language>.yaml, where <language> is a language code. customPosTagsUri allows you to specify custom POS tag mappings.
If you want to return universal part-of-speech tags in place of the language-specific tags that RBL ordinarily returns, set universalPosTags to true.
For an ADM sample that follows the same pattern as the preceding sample and returns universal POS tags for each token, see rbl-je-<version>/samples/universal-pos-tags.
Table 10. Universal POS Tag Options

| Option | Description | Type (Default) | Supported Languages |
|---|---|---|---|
| universalPosTags | Indicates whether POS tags should be converted to universal versions. | Boolean (false) | Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Malay (Standard), Persian, Polish, Portuguese, Russian, Spanish, Tagalog, Urdu |
| customPosTagsUri | URI of a POS tag map. | URI | Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Malay (Standard), Persian, Polish, Portuguese, Russian, Spanish, Tagalog, Urdu |
Enum Classes:
A POS tag map file is a YAML file encoded in UTF-8. It is a sequence of mapping rules.
A mapping rule is a sequence of two elements: the POS tag to be mapped and a sequence of submappings. Rules are checked in the order they appear in the rule file. A token which matches a rule is not checked against any further rules.
A submapping is a mapping with the keys m, s, and t. m is a Java regular expression. s is a surface form. m and s are optional: they can be omitted or null. t specifies the output POS tag to use when the following criteria are met:
- The input token's POS tag equals the POS tag to be mapped.
- m (if any) matches a substring of the input token's raw analysis.
- s (if any) equals the input token's surface form, compared case-insensitively.
For example:

- - NUM_VOC
  - - { m: \+Total, t: PRON }
    - { s: moc, t: DET }
    - { s: oba, t: DET }
    - { t: NUM }
This rule maps tokens with BasisTech's NUM_VOC POS tag. If the input token's raw analysis matches the regular expression \+Total, the token becomes a PRON. Otherwise, if the token's surface form is moc or oba, the token becomes a DET. Otherwise, the token becomes a NUM.
You can split contractions and return analyses with tokens, lemmas, and POS tags for each constituent. For example, given the English contraction can't, RBL returns analyses for can and for not. To split contractions, set tokenizeContractions to true.
Contractions are defined by contraction rule files. By default, the tokenizer uses the rules in rootDirectory/contractions/contraction-rules-<language>.yaml, where <language> is the language code. RBL comes with contraction rules for English, German, and Portuguese. To add rules for these languages or to support another language, edit the default files or create a custom rule file. The URI for the custom file is defined by customTokenizeContractionRulesUri.
Table 11. Contraction Splitting Options

| Option | Description | Type (Default) | Supported Languages |
|---|---|---|---|
| tokenizeContractions | Indicates whether to deliver contractions as multiple tokens. If false, they are delivered as a single token. | Boolean (false) | All |
| customTokenizeContractionRulesUri | URI of a contraction rule file. | URI | All |
Enum Classes:
For a sample, see rbl-je-<version>/samples/contractions.
Contraction Splitting Rule File Format
A contraction rule file is a YAML file encoded in UTF-8. It must be a sequence of contraction rules.
A contraction rule is a sequence of two elements: a contraction key and a contraction replacement. Any token which matches the key is replaced with the replacement. Rules are checked in the order they appear in the rule file. A token which matches a rule is not checked against any further rules. A token which matches no rule is not rewritten.
A contraction key is a sequence of a surface form and a POS tag. A token matches a key if and only if its surface form and POS tag match the key's surface form and POS tag.
A contraction replacement is a sequence of replacement tokens.
A replacement token is a sequence of a replacement surface form, POS tag, lemma, and raw analysis. All four are strings. The raw analysis can also be null.
For example:

- - [ "ain't", "VBPRES" ]
  - - [ "am", "VBPRES", "be", null ]
    - [ "not", "NOT", "not", null ]
- - [ "amn't", "ADJ" ]
  - - [ "am", "VBPRES", "be", null ]
    - [ "not", "NOT", "not", null ]
- - [ "amn't", "NOUN" ]
  - - [ "am", "VBPRES", "be", null ]
    - [ "not", "NOT", "not", null ]
The first entry is for ain't with POS tag VBPRES. This splits into am and not. The next is for amn't as an ADJ, and the third is for amn't as a NOUN.
The replacement surface form uses the same capitalization format as the original surface form. Using the first entry of the above example, ain't becomes am not, Ain't becomes Am not, and AIN'T becomes AM NOT.
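This capitalization transfer can be sketched as follows (an illustration of the behavior described above, not RBL's implementation), distinguishing all-uppercase, initial-uppercase, and other surface forms:

```java
class CapitalizationTransfer {
    // Applies the capitalization pattern of the original surface form
    // (ALL CAPS, Initial cap, or as-is) to a replacement token.
    static String transfer(String original, String replacement) {
        if (replacement.isEmpty()) {
            return replacement;
        }
        if (!original.isEmpty() && original.equals(original.toUpperCase())) {
            return replacement.toUpperCase(); // AIN'T -> AM NOT
        }
        if (!original.isEmpty() && Character.isUpperCase(original.charAt(0))) {
            // Ain't -> Am not: uppercase the first letter only
            return Character.toUpperCase(replacement.charAt(0)) + replacement.substring(1);
        }
        return replacement; // ain't -> am not
    }
}
```

Applying this to the first rule of the example, each replacement token inherits the pattern of ain't, Ain't, or AIN'T respectively.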