Hebrew Tokenization and Analyses
For Hebrew, the tokenizer also generates a lemma and a Semitic root for each token. Hebrew doesn't use an analyzer; the tokenizer returns tokens with the analyses already generated.
Table 7. Hebrew Options
| Option | Description | Type | Default |
| --- | --- | --- | --- |
| guessHebrewPrefixes | Splits prefixes off unknown Hebrew words | Boolean | false |
| includeHebrewRoots | Indicates whether to generate Semitic root forms | Boolean | false |
Enum Classes:
BaseLinguisticsOption
TokenizerOption
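For example, to enable both options when creating a Hebrew tokenizer, you could build an option map like the one below. This is a minimal sketch: the option constants come from Table 7 and the enum classes listed above, but the surrounding wiring (the language option, the "heb" code, and how the map is passed to a tokenizer factory) is an assumption, not a confirmed API.

```java
import java.util.EnumMap;
import java.util.Map;

// Sketch only. guessHebrewPrefixes and includeHebrewRoots come from
// Table 7; the language option and the value "heb" are assumptions.
Map<BaseLinguisticsOption, String> options = new EnumMap<>(BaseLinguisticsOption.class);
options.put(BaseLinguisticsOption.language, "heb");
options.put(BaseLinguisticsOption.guessHebrewPrefixes, "true"); // split prefixes off unknown words
options.put(BaseLinguisticsOption.includeHebrewRoots, "true");  // also generate Semitic roots
```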
Hebrew Disambiguator Types
RBL includes multiple disambiguators for Hebrew. Set the value of the disambiguatorType option to select which type to use. The valid values for DisambiguatorType are:
PERCEPTRON: a perceptron model
DICTIONARY: a dictionary-based reranker
DNN: a deep neural network. TensorFlow, which is not supported on all systems, must be installed. If DNN is selected and TensorFlow is not supported, RBL throws a RosetteRuntimeException.
Table 8. Hebrew Disambiguation Options
| Option | Description | Type | Default | Supported Languages |
| --- | --- | --- | --- | --- |
| disambiguatorType | Selects which disambiguator to use for Hebrew. | DisambiguatorType | PERCEPTRON | Hebrew |
Enum Classes:
AnalyzerOption
BaseLinguisticsOption
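A hedged sketch of selecting a disambiguator and falling back when TensorFlow support is missing. The factory and the createAnalyzer call are assumptions used only to show where the exception would surface; only the option name, its values, and the RosetteRuntimeException behavior come from this section.

```java
import java.util.EnumMap;
import java.util.Map;

// factory and analyzer declarations omitted; their types depend on the RBL API.
Map<BaseLinguisticsOption, String> options = new EnumMap<>(BaseLinguisticsOption.class);
options.put(BaseLinguisticsOption.disambiguatorType, "DNN");
try {
    analyzer = factory.createAnalyzer(options); // hypothetical creation call
} catch (RosetteRuntimeException e) {
    // DNN requires TensorFlow, which is not supported on all systems;
    // fall back to the default perceptron model.
    options.put(BaseLinguisticsOption.disambiguatorType, "PERCEPTRON");
    analyzer = factory.createAnalyzer(options);
}
```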
Arabic, Persian, and Urdu Token Analysis
For Arabic, Persian (Western Persian and Dari), and Urdu, RBL may return multiple analyses for each token. Each analysis contains the normalized form of the token, a part-of-speech tag, and a stem. For Arabic, the analysis also includes a lemma and a Semitic root. For Persian, some analyses include a lemma.
This appendix provides information on token normalization and the generation of variant tokens. For Arabic, it also provides information on stems and Semitic roots.
Token normalization is performed in two stages:
Generic Arabic script normalization
Language-specific normalization
Generic Arabic Script Token Normalization
Generic Arabic script normalization includes the following:
The following diacritics are removed: dammatan, kasratan, fatha, damma, kasra, shadda, sukun.
The following characters are removed: kashida, left-to-right marker, right-to-left marker, zero-width joiner, BOM, non-breaking space, soft hyphen, space.
Alef maksura is converted to yeh unless it is at the end of the word or followed by hamza.
All numbers are converted to Arabic numbers: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
Thousand separators are removed, and the decimal separator is changed to a period (U+002E). The normalizer handles cases where ر (reh) is (incorrectly) used as the decimal separator.
Alef with hamza above: ٵ (U+0675), ٲ (U+0672), or ا (U+0627) combined with hamza above (U+0654) is converted to أ (U+0623).
Alef with madda above: ا (U+0627) combined with madda above (U+0653) is converted to آ (U+0622).
Alef with hamza below: ٳ (U+0673) or ا (U+0627) combined with hamza below (U+0655) is converted to إ (U+0625).
Misra sign to ain: ؏ (U+060F) is converted to ع (U+0639).
Swash kaf to kaf: ڪ (U+06AA) is converted to ک (U+06A9).
Heh: ە (U+06D5) is converted to ه (U+0647).
Yeh with hamza above: The following combinations are converted to ئ (U+0626).
ی (U+06CC) combined with hamza above (U+0654)
ى (U+0649) combined with hamza above (U+0654)
ي (U+064A) combined with hamza above (U+0654)
Waw with hamza above: و (U+0648) combined with hamza above (U+0654), ٷ (U+0677), or ٶ (U+0676) is converted to ؤ (U+0624).
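To make these mappings concrete, here is a small illustrative sketch covering a handful of them. It is not RBL's implementation; it only demonstrates the kinds of removals and substitutions listed above.

```java
/**
 * Illustrative sketch of a few generic Arabic-script normalizations.
 * Not RBL's implementation.
 */
public final class ArabicScriptNormalizationSketch {
    public static String normalize(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            switch (c) {
                case '\u064C': case '\u064D': case '\u064E': case '\u064F':
                case '\u0650': case '\u0651': case '\u0652':
                    break; // dammatan, kasratan, fatha, damma, kasra, shadda, sukun: removed
                case '\u0640':
                    break; // kashida: removed
                case '\u060F':
                    sb.append('\u0639'); break; // misra sign -> ain
                case '\u06AA':
                    sb.append('\u06A9'); break; // swash kaf -> kaf
                case '\u06D5':
                    sb.append('\u0647'); break; // U+06D5 -> heh
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }
}
```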
For Arabic input, the following language-specific normalizations are performed on the output of the Arabic script normalization:
Zero-width non-joiner (U+200C) and superscript alef ٰ (U+0670) are removed.
Fathatan (U+064B) is removed.
Persian yeh (U+06CC) is normalized to yeh (U+064A) if it is initial or medial; if final, it is normalized to alef maksura (U+0649).
Persian kaf ک (U+06A9) is converted to ك (U+0643).
Heh ہ (U+06C1) or ھ (U+06BE) is converted to ه (U+0647).
Following morphological analysis, the normalizer does the following:
Alef wasla ٱ (U+0671) is replaced with plain alef ا (U+0627).
If a word starts with the incorrect form of an alef, the normalizer retrieves the correct form: plain alef ا (U+0627), alef with hamza above أ (U+0623), alef with hamza below إ (U+0625), or alef with madda above آ (U+0622).
The analyzer can generate a number of variant forms for each Arabic token to account for the orthographic irregularity seen in contemporary written Arabic. Each token variant is generated in normalized form.
If a token contains a word-final hamza preceded by yeh or alef maksura, then a variant is created that replaces these with hamza seated on yeh.
If a token contains waw followed by hamza on the line, a variant is created that replaces these with hamza seated on waw.
Variants are created where word-final heh is replaced by teh marbuta, and word-final alef maksura is replaced by yeh.
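As an illustration of these three rules (again, not RBL's implementation), the following sketch derives the described variants from an already-normalized token:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of the Arabic variant-generation rules above. */
public final class ArabicVariantSketch {
    public static List<String> variants(String token) {
        List<String> result = new ArrayList<>();
        int n = token.length();
        // Word-final hamza (U+0621) preceded by yeh (U+064A) or
        // alef maksura (U+0649) -> hamza seated on yeh (U+0626).
        if (n >= 2 && token.charAt(n - 1) == '\u0621'
                && (token.charAt(n - 2) == '\u064A' || token.charAt(n - 2) == '\u0649')) {
            result.add(token.substring(0, n - 2) + '\u0626');
        }
        // Waw (U+0648) followed by hamza on the line (U+0621)
        // -> hamza seated on waw (U+0624).
        if (token.contains("\u0648\u0621")) {
            result.add(token.replace("\u0648\u0621", "\u0624"));
        }
        // Word-final heh (U+0647) -> teh marbuta (U+0629).
        if (n >= 1 && token.charAt(n - 1) == '\u0647') {
            result.add(token.substring(0, n - 1) + '\u0629');
        }
        // Word-final alef maksura (U+0649) -> yeh (U+064A).
        if (n >= 1 && token.charAt(n - 1) == '\u0649') {
            result.add(token.substring(0, n - 1) + '\u064A');
        }
        return result;
    }
}
```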
The stem returned is the normalized token with affixes (such as prepositions, conjunctions, the definite article, proclitic pronouns, and inflectional prefixes) removed.
In the process of stripping morphemes (affixes) from a token, the analyzer produces a stem, a lemma, and a Semitic root. Stems and lemmas result from stripping most of the inflectional morphemes, while Semitic roots result from stripping derivational morphemes.
Inflectional morphemes indicate grammatical features such as plurality or verb tense. Different forms of a word, such as the singular and plural of a noun or the past and present tense of a verb, share the same stem if the forms are regular. If some of the forms are irregular, they do not share the same stem, but they do share the same lemma. Because stems and lemmas preserve the meaning of words, they are very useful in text retrieval and search in general.
Words that have a more distant linguistic relationship share the same Semitic root.
Examples. The singular form الكتابة (al-kitaaba, the writing) and plural form كتابات (kitaabaat, writings) share the same stem: كتاب (kitaab). On the other hand, كُتُب (kutub, books) is an irregular form and does not have the same stem as كِتَاب (kitaab, book), but both forms do share the same lemma, which is the singular form كِتَاب (kitaab). The words مكتبة (maktaba, library), المَكْتَب (al-maktab, the desk), كُتُب (kutub, books), and الكتابة (al-kitaaba, the writing) are related in the sense that a library contains books and desks, a desk is used to write on, and writings are often found in books. All of these words share the same Semitic root: كتب (ktb).
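When consuming analyses in code, the three levels are typically exposed as separate fields. The accessor names below are hypothetical, for illustration only; this section establishes only that each Arabic analysis carries a stem, a lemma, and a Semitic root.

```java
// Hypothetical accessor names, for illustration only.
for (MorphoAnalysis analysis : token.getAnalyses()) {
    String stem = analysis.getStem();   // shared by regular inflections: kitaab
    String lemma = analysis.getLemma(); // shared even by irregular forms: kitaab for kutub
    String root = analysis.getRoot();   // shared across derivations: ktb
}
```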
Persian Token Normalization
The following Persian-specific normalizations are performed on the output of the Arabic script normalization:
Fathatan (U+064B) and superscript alef (U+0670) are removed.
Alef أ (U+0623), إ (U+0625), or ٱ (U+0671) is converted to ا (U+0627).
Arabic kaf ك (U+0643) is converted to Persian kaf ک (U+06A9).
Heh goal (U+06C1) or heh doachashmee (U+06BE) is converted to heh (U+0647).
Heh with hamza ۂ (U+06C2) is converted to ۀ (U+06C0).
Arabic yeh ي (U+064A) or ى (U+0649) is converted to Persian yeh ی (U+06CC).
Following morphological analysis, the analyzer can generate a variant form for some tokens to account for the orthographic irregularity seen in contemporary written Persian. Each variant is generated in normalized form.
If a word contains hamza on yeh (U+0626), a variant is generated replacing the hamza on yeh with Persian yeh (U+06CC).
If a word contains hamza on waw (U+0624), a variant is generated replacing the hamza on waw with waw (U+0648).
If a word contains a zero-width non-joiner (U+200C), a variant is generated without the zero-width non-joiner.
If a word ends in teh marbuta (U+0629), two variants are generated. The first replaces the teh marbuta with teh (U+062A); the second replaces the teh marbuta with heh (U+0647).
The Persian analyzer produces both stems and lemmas. A stem is the substring of a word that remains after all prefixes and suffixes are removed. A lemma is the dictionary form of a word. The lemma may differ from the stem if a word is irregular or if it undergoes regular transformations. The distinction between stems and lemmas is especially important for Persian verbs: the typical Persian verb inflection table includes a past stem and a present stem that cannot be derived from each other.
Examples. The present subjunctive verb بگویم (beguyam, that I say) has the stem گوی (guy). The past tense verb گفتم (goftam, I said) has the stem گفت (goft). These two have different stems because the word-internal strings are different, but they share the same lemma گفت (goft) because they are inflections of the same word.
Urdu Token Normalization
The following Urdu-specific normalizations are performed on the output of the Arabic script normalization:
Fathatan (U+064B), zero-width non-joiner (U+200C), and jazm (U+06E1) are removed.
Alef أ (U+0623), إ (U+0625), or ٱ (U+0671) is converted to ا (U+0627).
Kaf ك (U+0643) is converted to ک (U+06A9).
Heh with hamza ۀ (U+06C0) is converted to ۂ (U+06C2).
Yeh ي (U+064A) or ى (U+0649) is converted to ی (U+06CC).
The analyzer can generate a number of variant forms for each Urdu token to account for the orthographic irregularity seen in contemporary written Urdu. Each variant is generated in normalized form.
If a word contains hamza on yeh (U+0626), a variant is generated replacing the hamza on yeh with Persian yeh (U+06CC).
If a word contains hamza on waw (U+0624), a variant is generated replacing the hamza on waw with waw (U+0648).
If a word contains heh doachashmee (U+06BE), a variant is generated replacing the heh doachashmee with heh goal (U+06C1).
If a word ends with teh marbuta (U+0629), a variant is generated replacing the teh marbuta with heh goal (U+06C1).
Chinese and Japanese Alternative Tokenization
For Chinese and Japanese, in addition to the statistical model described above, RBL includes the Chinese Language Analyzer (CLA) and Japanese Language Analyzer (JLA) modules, which are optimized for search. They are activated by setting alternativeTokenization to true, as in the sketch below.
Enum Classes:
BaseLinguisticsOption
TokenizerOption
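A sketch in the same style as the Hebrew example above; the option name comes from this section, while the wiring and language codes are assumptions.

```java
// Same imports and assumed wiring as the Hebrew sketch above.
options.put(BaseLinguisticsOption.language, "jpn");                 // or "zho" for Chinese (assumed codes)
options.put(BaseLinguisticsOption.alternativeTokenization, "true"); // activate JLA/CLA
```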
Chinese and Japanese Readings
Enum Classes:
BaseLinguisticsOption
TokenizerOption
Editing the Stop Words List
The ignoreStopwords option uses a stop words list to define stop words. The path to the list is language-dependent: Chinese uses root/dicts/zho/cla/zh_stop.utf8 and Japanese uses root/dicts/jpn/jla/JP_stop.utf8.
You can add stop words to these files. When you edit one of these files, you must follow these rules:
The file must be encoded in UTF-8.
The file may include blank lines.
Comment lines begin with #.
Each non-blank non-comment line represents exactly one lexeme (stop word).
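For example, a user-added block in one of these files might look like this (the entries themselves are illustrative):

```
# user-added stop words (one lexeme per line)
これ
それ
あれ
```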
Japanese Lemma Normalization
In Japanese, foreign and borrowed words may vary in their phonetic transcription to Katakana, and some words may be expressed with either an older or a modern Kanji form. The Japanese lemma dictionary maps Katakana variants to a standard form and old Kanji forms to their modern forms.
You can include orthographic normalization in lemma user dictionaries for Japanese. This information can be accessed at runtime from the Analysis or MorphoAnalysis object.
Unknown Language Tokenization
RBL provides basic tokenization support when the language is Unknown (xxx). The tokenizer uses generic rules, such as whitespace and punctuation delimitation, to tokenize.
Using the language code xxx provides basic tokenization support for languages that RBL does not otherwise support.