RBL-Elasticsearch is an Elasticsearch plugin that uses Rosette Base Linguistics (RBL) for analysis, indexing, and queries.
February 2021
Supports Elasticsearch 7.10.2 and includes RBL 7.38.1.c63.0
January 2021
Supports Elasticsearch 7.6.1 and includes RBL 7.38.1.c63.0
New
The tools directory is now included in the RBL-Elasticsearch package. (ESPI-31)
Greek lexicon: The Greek lexicon has additional words. (ETROG-3288)
Greek disambiguation improved: Certain Greek forms are now disambiguated to prefer a modern analysis over an archaic analysis. This requires alternativeGreekDisambiguation to be set to false, which is the default. (ETROG-3289)
Example: δείξε
New Greek disambiguator added: The new Greek disambiguator is more accurate, but slower. The new disambiguator is enabled by default. To use the old disambiguator, set alternativeGreekDisambiguation to true. (ETROG-3304)
RBL-JE no longer normalizes certain emoji ZWJ sequences to U+1F48F KISS, U+1F491 COUPLE WITH HEART, and U+1F46A FAMILY, to be consistent with Unicode’s efforts to make emoji more gender-neutral by default. (ETROG-3350)
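As background on what an emoji ZWJ sequence is, the following Python sketch takes one such sequence (kiss: woman, man) and splits it at U+200D ZERO WIDTH JOINER. It is illustrative only and does not show how RBL-JE processes emoji; the helper names are hypothetical.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# "kiss: woman, man": WOMAN, ZWJ, HEAVY BLACK HEART, VS16, ZWJ, KISS MARK, ZWJ, MAN
kiss_woman_man = "\U0001F469\u200D\u2764\uFE0F\u200D\U0001F48B\u200D\U0001F468"


def zwj_components(sequence):
    """Split an emoji ZWJ sequence into the pieces joined by U+200D."""
    return sequence.split(ZWJ)


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in text]


for part in zwj_components(kiss_woman_man):
    print(code_points(part))
```

The sequence is built from several code points and does not itself contain U+1F48F KISS, so leaving it unnormalized preserves the gender-specific components.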
Bug fixes
Third-party component updates
This release includes the following third-party component changes:
November 2020
Supports Elasticsearch 7.6.1 and includes RBL 7.37.0.c62.2
New
Support for unknown language: If the language is unknown (xxx), tokenization and sentence breaking are supported. (ETROG-3278)
Tokenization rule preprocessor: The preprocessor command !!btinclude is now supported in tokenization rule files, allowing one rule file to include other files. (ETROG-2497)
Bug fixes
Hebrew combining characters are no longer erroneously split into tokens separate from their base characters. (ETROG-3277)
A clear exception (RosetteUnsupportedLanguageException) is now thrown when tokenizing some unsupported languages. Previously, these languages appeared to work. The same tokenizer is still available by specifying the unknown language (xxx). The languages impacted are Albanian, Bulgarian, Croatian, Indonesian, Malay, Slovenian, Standard Malay, and Ukrainian. (ETROG-3278, ETROG-3326)
RBL no longer crashes on some Japanese and Chinese inputs when alternativeTokenization and fragmentBoundaryDetection are both enabled. (ETROG-3285)
Correct start and end offsets are now produced when fstTokenize is set to true. Previously, some Spanish inputs would produce tokens with start and end offsets of 0. (ETROG-3292)
Tokens no longer have null token types. (ETROG-3316)
When an NFKC normalized character results in multiple tokens, those tokens no longer have equal start and end offsets. Previously this could occur when nfkcNormalize was set to true. (ETROG-2505)
Example: ﷺ
Previously, the offsets were:
صلى start 0 end 0
الله start 0 end 0
عليه start 0 end 0
وسلم start 0 end 1
Now, the offsets are:
صلى start 0 end 1
الله start 0 end 1
عليه start 0 end 1
وسلم start 0 end 1
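The behavior in this example can be reproduced in outline with Python's standard unicodedata module. The sketch below is not RBL's implementation: it assumes a simple whitespace tokenizer and per-character NFKC normalization, and the function name is hypothetical. It shows why every token derived from ﷺ (U+FDFA) maps back to the same original span, start 0 and end 1.

```python
import re
import unicodedata


def tokenize_with_original_offsets(text):
    """Whitespace-tokenize the NFKC form of text, reporting offsets into the original text.

    Assumption: each source character is normalized on its own, which is enough
    to illustrate offset mapping but is not full-string NFKC normalization.
    """
    normalized = []  # characters of the normalized text
    origin = []      # origin[i] = (start, end) of the source character behind normalized[i]
    for index, char in enumerate(text):
        for out in unicodedata.normalize("NFKC", char):
            normalized.append(out)
            origin.append((index, index + 1))

    joined = "".join(normalized)
    tokens = []
    for match in re.finditer(r"\S+", joined):
        start = origin[match.start()][0]    # start of the first contributing source character
        end = origin[match.end() - 1][1]    # end of the last contributing source character
        tokens.append((match.group(), start, end))
    return tokens


for token, start, end in tokenize_with_original_offsets("\uFDFA"):
    print(token, "start", start, "end", end)
```

Because the single source character expands to four words, each token inherits the full span of that character rather than an empty span.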