RBL-Elasticsearch is an Elasticsearch plugin for using Rosette Base Linguistics for analysis, indexing, and queries.
October 2022
Supports Elasticsearch 8.4.0 and includes RBL 7.45.0.c67.0
September 2022
Supports Elasticsearch 8.2.2 and includes RBL 7.45.0.c67.0
June 2022
Supports Elasticsearch 8.1.1 and includes RBL 7.44.1.c67.0
Known Issues
April 2022
Supports Elasticsearch 7.17.2 and includes RBL 7.43.0.c66.0
March 2022
Supports Elasticsearch 7.17.1 and includes RBL 7.43.0.c66.0
July 2021
Supports Elasticsearch 7.13.4 and includes RBL 7.41.1.c65.0
New:
New language support: Tokenization is now supported for Indonesian, Standard Malay, and Tagalog. POS tags are not supported for these languages, so the universalPosTags option is ignored. (ETROG-3443, ETROG-3465)
Fragment definition: A single line followed by an empty line is no longer always considered a fragment. It is still considered a fragment if the line is short, as specified by the maxTokensForShortLine parameter. (ETROG-3431)
July 2021
Supports Elasticsearch 7.13.3 and includes RBL 7.40.1.c64.1
July 2022
Supports Elasticsearch 7.13.2 and includes RBL 7.44.2.c67.0
New
TensorFlow supported: This plugin supports TensorFlow. (ESPI-168)
If you will be using any of the features that require TensorFlow:
Copy the file plugins/analysis-rbl-je/analysis-rbl-je.options to config/jvm.options.d/analysis-rbl-je.options
Edit the file, uncommenting the line that matches your operating system and CPU.
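The copy step above can be scripted. The following is a minimal Python sketch, not part of the plugin: the function name install_tensorflow_options and the es_home parameter are hypothetical, and it assumes both paths in the instructions are relative to the Elasticsearch home directory. Uncommenting the line for your operating system and CPU remains a manual edit.

```python
import shutil
from pathlib import Path


def install_tensorflow_options(es_home: Path) -> Path:
    """Copy the plugin's options file into config/jvm.options.d/.

    Assumption: es_home is the Elasticsearch home directory and both
    paths from the release note are relative to it.
    """
    source = es_home / "plugins" / "analysis-rbl-je" / "analysis-rbl-je.options"
    target_dir = es_home / "config" / "jvm.options.d"
    # Create jvm.options.d if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copyfile(source, target)
    return target
```

Calling install_tensorflow_options(Path("/path/to/elasticsearch")) returns the path of the copied file, which you would then open and edit by hand.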
Features that require TensorFlow include:
June 2021
Supports Elasticsearch 7.13.2 and includes RBL 7.40.1.c64.1
New
Improved Hebrew tokenizer and new analyzer: The Hebrew tokenizer is now more consistent with the tokenizers of other languages. Hebrew tokenization and analysis are now done in separate steps. (ETROG-3290)
TokenizerOption.includeHebrewRoots and TokenizerOption.guessHebrewPrefixes have been deprecated and replaced by AnalyzerOption.includeHebrewRoots and AnalyzerOption.guessHebrewPrefixes.
NFKC normalization is now supported for Hebrew.
We've improved tokenization of certain sequences involving digits, periods, and number-related symbols like ⟨%⟩.
We've added additional acronyms and abbreviations to the Hebrew tokenizer. (ETROG-3249)
Double apostrophes are now treated like gershayim. (ETROG-3249)
Normalized characters: Normalized half-width and full-width characters are processed the same as their counterparts. (ETROG-3351)
Improved directory structure: The contents of the models/ directory are now separated into subdirectories by language. (ETROG-1218)
Statistical models moved to models/ directory: The following files have been moved from dicts/ to models/: (ETROG-1218)
cat/ca-ud-train.downcased.mdl
est/et-ud-train.downcased.mdl
fas/posLemma.mdl
lav/lv-ud-train.downcased.mdl
nno/lemma.mdl
nob/lemma.mdl
slk/sk-ud-train.downcased.mdl
srp/sr-ud-train.downcased.mdl
New option for tokenizers: We've added a new option, tokenizerType, to specify which tokenizer to use. The options alternativeTokenization and fstTokenize are deprecated in favor of tokenizerType. (ETROG-3419)
New Korean tokenizer: We've added a new tokenizer for spaceless Korean input. The previous tokenizer was not trained on spaceless Korean and did not perform well without spaces between tokens. Activate it by setting tokenizerType to spaceless_statistical. (ETROG-3392)
Bug fixes
Hebrew tokens containing a geresh are now tokenized properly. Previously, only the part up to the geresh would be returned as the token text, and the part after the geresh would sometimes be considered a suffix. Now the whole token is returned as the token's text. (ETROG-3262, ETROG-3290)
Example: מע'רב
Previously:
Token{text=מע'}
MorphoAnalysis{extendedProperties={hebrewPrefixes=[], hebrewSuffixes=[]},
partOfSpeech=noun, lemma=מע', tagSet=MILA_HEBREW}
MorphoAnalysis{extendedProperties={hebrewPrefixes=[מ, ב], hebrewSuffixes=[ר, ב]},
partOfSpeech=numeral, lemma=70, tagSet=MILA_HEBREW}
Now:
Token{text=מע'רב}
MorphoAnalysis{extendedProperties={com.basistech.rosette.bl.hebrewPrefixes=[],
com.basistech.rosette.bl.hebrewSuffixes=[]}, partOfSpeech=unknown, lemma=מע'רב,
tagSet=MILA_HEBREW}
Third-party component updates
This release includes the following third-party component changes:
Table 1. Added
Package | Version
JavaCPP | 1.5.4
Table 2. Upgraded
Package | Old Version | New Version
TensorFlow | 1.14.0 | 2.3.1
February 2021
Supports Elasticsearch 7.10.2 and includes RBL 7.38.1.c63.0
January 2021
Supports Elasticsearch 7.6.1 and includes RBL 7.38.1.c63.0
New
The tools directory is now included in the RBL-Elasticsearch package. (ESPI-31)
Greek lexicon: The Greek lexicon has additional words. (ETROG-3288)
Greek disambiguation improved: Certain Greek forms are now disambiguated to prefer a modern analysis over an archaic analysis. alternativeGreekDisambiguation must be set to false, which is the default. (ETROG-3289)
Example: δείξε
New Greek disambiguator added: The new Greek disambiguator is more accurate, but slower. The new disambiguator is enabled by default. To use the old disambiguator, set alternativeGreekDisambiguation to true. (ETROG-3304)
RBL-JE no longer normalizes certain emoji ZWJ sequences to U+1F48F KISS, U+1F491 COUPLE WITH HEART, and U+1F46A FAMILY, to be consistent with Unicode’s efforts to make emoji more gender-neutral by default. (ETROG-3350)
Bug fixes
Third-party component updates
This release includes the following third-party component changes:
November 2020
Supports Elasticsearch 7.6.1 and includes RBL 7.37.0.c62.2
New
Support for unknown language: If the language is unknown (xxx), tokenization and sentence breaking are supported. (ETROG-3278)
Tokenization rule preprocessor: The preprocessor command !!btinclude is now supported in tokenization rule files, allowing a rule file to include other files. (ETROG-2497)
Bug fixes
Hebrew combining characters are no longer erroneously split into tokens separate from their base characters. (ETROG-3277)
A clear exception (RosetteUnsupportedLanguageException) is now thrown when tokenizing some unsupported languages. Previously, these languages appeared to work. The same tokenizer is still available by specifying the unknown language (xxx). The languages impacted are Albanian, Bulgarian, Croatian, Indonesian, Malay, Slovenian, Standard Malay, and Ukrainian. (ETROG-3278, ETROG-3326)
RBL no longer crashes on some Japanese and Chinese inputs when alternativeTokenization and fragmentBoundaryDetection are both enabled. (ETROG-3285)
Correct start and end offsets are now produced when fstTokenize is set to true. Previously, some Spanish inputs would produce tokens with start and end offsets of 0. (ETROG-3292)
Tokens no longer have null token types. (ETROG-3316)
When an NFKC-normalized character results in multiple tokens, those tokens no longer have equal start and end offsets. Previously this could occur when nfkcNormalize was set to true. (ETROG-2505)
Example: ﷺ
Previously: Offsets:
صلى start 0 end 0
الله start 0 end 0
عليه start 0 end 0
وسلم start 0 end 1
Now: Offsets:
صلى start 0 end 1
الله start 0 end 1
عليه start 0 end 1
وسلم start 0 end 1
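To see why one input character can yield several tokens, the sketch below expands each character with Python's standard unicodedata module and splits the expansion on whitespace. This is only an illustration of the offset convention shown above, not RBL's tokenizer: the function name nfkc_token_offsets is hypothetical, and the per-character whitespace split is an assumption made to keep the example small.

```python
import unicodedata


def nfkc_token_offsets(text: str) -> list[tuple[str, int, int]]:
    """Return (token, start, end) triples for a toy NFKC-aware tokenizer.

    Each character is NFKC-normalized on its own, the expansion is split
    on whitespace, and every resulting token is given the offsets of the
    original character it came from.
    """
    tokens = []
    for index, char in enumerate(text):
        expanded = unicodedata.normalize("NFKC", char)
        for piece in expanded.split():
            # All pieces map back to the single source character.
            tokens.append((piece, index, index + 1))
    return tokens
```

For the single character U+FDFA, this returns four tokens, each with start 0 and end 1, which matches the "Now" listing above.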