Rosette Base Linguistics Java Edition (RBL) is a Java Software Development Kit (SDK) for building analytic applications to process text in a variety of languages.
September 2023
New
Expanded Chinese lexicon: We've expanded the lexicon of multi-character Chinese surnames when tokenizerType is set to spaceless_lexical. (ETROG-3616)
Expanded Japanese lexicon: We have expanded the Japanese lexicon that is used when tokenizerType is set to spaceless_lexical. (ETROG-3632)
Added secondary parts of speech: We've added support for secondary parts of speech to Chinese and Japanese when tokenizerType is set to spaceless_lexical. (ETROG-3636)
Improved support for Chinese readings when tokenizerType is set to spaceless_lexical:
Readings are merged into a single reading if the readings become the same string after tone mark removal. (ETROG-3625)
Chinese readings are returned in a list. Previously, a token with multiple possible readings was returned as a single string with brackets and semicolons. (ETROG-3626)
Example: "蔭權"
Solr and Lucene support: Lucene 9.5 - 9.7 and Solr 9.3 are now supported. (ETROG-3643)
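The merging of Chinese readings after tone mark removal (ETROG-3625) can be illustrated with a minimal sketch that uses only the JDK. It is not RBL code: the class and method names are hypothetical, and it assumes that the merged reading is the tone-stripped form and that only the four pinyin tone marks (macron, acute, caron, grave) are removed, so that "ü" is preserved.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not part of the RBL API.
class ReadingMergeSketch {
    // Decomposes the reading, drops the four pinyin tone marks
    // (U+0300 grave, U+0301 acute, U+0304 macron, U+030C caron),
    // then recomposes so that other marks (such as the diaeresis in "ü") survive.
    static String stripToneMarks(String reading) {
        String decomposed = Normalizer.normalize(reading, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("[\\u0300\\u0301\\u0304\\u030C]", "");
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    // Keeps one entry per distinct tone-stripped reading, in first-seen order.
    static List<String> mergeReadings(List<String> readings) {
        Set<String> merged = new LinkedHashSet<>();
        for (String reading : readings) {
            merged.add(stripToneMarks(reading));
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        System.out.println(mergeReadings(List.of("h\u01CEo", "h\u00E0o", "zh\u014Dng")));
    }
}
```

The result stays a list, which mirrors the list-valued readings described in ETROG-3626.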
Bug Fixes
We fixed a bug where an ArrayIndexOutOfBoundsException occurred when the Chinese dictionaries produced more than 6 matches and tokenizerType was set to spaceless_lexical. (ETROG-3635)
When Chinese readings are constructed by character and tokenizerType is set to spaceless_lexical, an apostrophe is now inserted before any pinyin syllable that starts with "a", "e", or "o" and is not the first syllable. (ETROG-3637)
We fixed a bug in the UPT-16 conversion where the part of speech of some Japanese particles was not tagged correctly. The particles are now tagged correctly as ADP. (ETROG-3526)
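The apostrophe rule for pinyin readings (ETROG-3637) can be sketched with plain JDK code. This is an illustration of the rule as stated in the note, not the RBL implementation; the class and method names are hypothetical, and it assumes syllables may carry tone marks, so the first letter is checked after canonical decomposition.

```java
import java.text.Normalizer;
import java.util.List;

// Hypothetical illustration; not part of the RBL API.
class PinyinJoinSketch {
    // Returns true if the syllable begins with a, e, or o, ignoring any tone mark.
    static boolean startsWithAEO(String syllable) {
        if (syllable.isEmpty()) {
            return false;
        }
        String decomposed = Normalizer.normalize(syllable, Normalizer.Form.NFD);
        char first = Character.toLowerCase(decomposed.charAt(0));
        return first == 'a' || first == 'e' || first == 'o';
    }

    // Joins syllables, inserting an apostrophe before every non-initial
    // syllable that starts with a, e, or o.
    static String join(List<String> syllables) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < syllables.size(); i++) {
            String syllable = syllables.get(i);
            if (i > 0 && startsWithAEO(syllable)) {
                sb.append('\'');
            }
            sb.append(syllable);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(List.of("xi", "an")));
    }
}
```

The apostrophe keeps a two-syllable reading such as "xi" + "an" distinguishable from the single syllable "xian".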
Known Issues
Third-party component updates
Table 1. Updated
Package | Old Version | New Version |
Jackson Annotations | 2.15.0 | 2.15.2 |
Jackson Core | 2.15.0 | 2.15.2 |
Jackson Databind | 2.15.0 | 2.15.2 |
Jackson Dataformat XML | 2.15.0 | 2.15.2 |
Jackson Dataformat YAML | 2.15.0 | 2.15.2 |
Jackson Datatype: Guava | 2.15.0 | 2.15.2 |
Jackson Module: Old JAXB Annotations | 2.15.0 | 2.15.2 |
Guava: Google Core Libraries for Java | 31.1-jre | 32.1.2-jre |
Protocol Buffers [Core] | 3.21.7 | 3.23.4 |
June 2023
New
Third-party component updates
Table 2. Updated
Package | Old Version | New Version |
Apache Log4J | 2.19.0 | 2.20.0 |
fastutil | 8.5.9 | 8.5.12 |
Jackson Annotations | 2.14.0 | 2.15.0 |
Jackson Core | 2.14.0 | 2.15.0 |
Jackson Databind | 2.14.0 | 2.15.0 |
Jackson Dataformat XML | 2.14.0 | 2.15.0 |
Jackson dataformats: Text | 2.14.0 | 2.15.0 |
Jackson datatypes: collections | 2.14.0 | 2.15.0 |
Jackson modules: Base | 2.14.0 | 2.15.0 |
SnakeYAML | 1.33 | 2.0 |
March 2023
Bug Fixes
Known Issues
March 2023
New
Known Issues
November 2022
New
Ukrainian support added: Tokenization, sentence boundary detection, segmentation user dictionaries, and many-to-one normalization dictionaries are supported for Ukrainian. (ETROG-3594)
Improved part of speech tags: Language-neutral tokens (numbers, symbols, and punctuation) now get part of speech tags in Indonesian, Standard Malay, and Tagalog. (ETROG-3574)
GPU support: Features that use TensorFlow now use a GPU if available. (ETROG-3564)
Emoji support: Emoji 15.0 is now supported. (ETROG-3577)
New option for Katakana: We've added the option joinKatakanaNextToMiddleDot to control whether sequences of Japanese Katakana tokens adjacent to a middle dot should be merged into a single Katakana token. By default, it is true, which matches the behavior in previous versions of RBL-JE. (ETROG-3592)
Solr 9.1 support: Lucene and Solr 9.1 are supported. (ETROG-3597)
Bug Fixes
Third-party component updates
Table 3. Upgraded
Package | Old version | New version |
Apache Log4j | 2.17.1 | 2.19.0 |
fastutil | 8.5.6 | 8.5.9 |
Jackson | 2.11.1 | 2.14.0 |
JavaCPP | 1.58-alpha.20220614.013710.426 | 1.58 |
SLF4J | 1.7.33 | 1.7.36 |
SnakeYAML | 1.30 | 1.33 |
September 2022
New
-
Tagalog support:
RBL now supports Part of Speech (POS) tagging in Tagalog. (ETROG-3559)
RBL now supports lemmatization for Tagalog. (ETROG-3570)
The Tagalog sentence-breaker now recognizes certain abbreviations that end with periods and doesn’t break sentences after them. The tokenizer keeps the period in the token with the rest of the abbreviation. (ETROG-3573)
Indonesian (ind) support: RBL now supports lemmatization for Indonesian, which is the standardized form of Malay spoken in Indonesia. (ETROG-3563)
Standard Malay (zsm) support: RBL now supports lemmatization for Standard Malay, the standardized form of Malay spoken in Malaysia. (ETROG-3563)
Bug Fixes
June 2022
New
Indonesian support added: RBL now supports Part of Speech (POS) tagging in Indonesian. (ETROG-3543)
Malay (Standard) support added: RBL now supports Part of Speech (POS) tagging in Malay (Standard). (ETROG-3545)
Russian lexicon improved: We've added many words related to computer technology to the Russian lexicon. (ETROG-3523, ETROG-3538)
Java 17 support added: Java 8 and 9 support has been removed. (ETROG-3524)
Solr 9 support added: RBL now supports Lucene and Solr 9. (ETROG-3549)
Solr 6 support removed: RBL no longer supports Lucene or Solr 6 or earlier. (ETROG-3519)
Bug Fixes
Third-party component updates
This release includes the following third-party component changes:
Table 4. Added
Package | Version | License |
Jakarta Annotations API | 1.3.3 | Eclipse Public License 2.0 and GPL 2 with classpath exception |
February 2022
Notice
Solr 6 and earlier support is deprecated as of this release.
Java 8 and Java 9 support is deprecated as of this release.
New
Solr 8.11 support: This release supports Solr 8.11 (ETROG-3502)
Deprecated methods: Token#getType has been deprecated because token types are not used in RBL-JE without the Lucene/Solr plugins, and the plugins use a different API. (ETROG-3503)
Solr 6 support deprecated: Support for Solr versions 6.x and earlier is deprecated as of this release and will be removed in the next version.
Permission changes: We removed group and other write permissions from model files. All files are now only writable by the owner. (ETROG-3516)
Third-party component updates
This release includes the following third-party component changes:
Table 5. Upgraded
Package | Old Version | New Version |
Apache Commons IO | 2.7 | 2.11.0 |
Apache Commons Lang | 2.6 | 3.12.0 |
Apache Log4j | 1.2.17 | 2.17.1 |
ICU4J | 59.1 | 70.1 |
fastutil | 8.4.0 | 8.5.6 |
SLF4J | 1.7.28 | 1.7.33 |
SnakeYAML | 1.26 | 1.30 |
TensorFlow for Java | 0.2.0 | 0.3.3 |
November 2021
New
Deprecated factories: TokenizerFactory, AnalyzerFactory, and CSCAnalyzerFactory have been deprecated in favor of BaseLinguisticsFactory. (ETROG-3453)
Katakana tokenization: The fullwidth and halfwidth Katakana middle dots (U+30FB and U+FF65) are now treated as decimal points in numeric contexts, for Japanese with tokenizerType set to spaceless_lexical. (ETROG-3474)
Example: Input: 三・一四
Emojis: U+3030 and U+303D are now tagged as emojis even when not followed by U+FE0F. (ETROG-3478)
Emoji support: We now support the emoji in Unicode 14.0 (ETROG-3476)
Japanese tokenization: In Japanese, when tokenizerType is set to spaceless_lexical, numeric tokens tagged NN are lemmatized to their ASCII values. For example, “七” is lemmatized to “7”. This is consistent with the default algorithm, spaceless_statistical. (ETROG-3475)
Solr 8.10 support: This release supports Solr 8.10. (ETROG-3482)
Improved POS tags: Many number, punctuation, and symbol characters are now POS-tagged appropriately as numbers, punctuation, and symbols instead of being marked as unknown or given some other tag. This applies to all languages with POS tags. (ETROG-3481)
Hungarian improvements: We've added some Hungarian abbreviations and improved sentence boundary detection around Hungarian abbreviations. (ETROG-3479, ETROG-3484)
Bug Fixes
In Japanese, when tokenizerType is set to spaceless_lexical, the combining marks U+3099 and U+309A are now tokenized with the preceding character as a single token. Previously, they were tokenized as 2 separate tokens. (ETROG-3472)
-
We've reverted two of the POS changes made in version 7.39.0.63.0 as they introduced regressions in Chinese and Japanese. (ETROG-3466)
The values are now:
RBL-JE no longer detects characters as emoji when followed by the text presentation selector (U+FE0E). (ETROG-3480)
In English, the lowercase abbreviations of the titles “dr.”, “drs.”, “mr.”, and “mrs.” are now tokenized the same as the uppercase “Dr.”, “Drs.”, “Mr.”, and “Mrs.”. (ETROG-3485)
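The handling of the combining marks U+3099 and U+309A (ETROG-3472) can be pictured with a toy splitter. This is only a sketch: it emits one unit per code point, which is not how the RBL tokenizer segments Japanese, and the class and method names are hypothetical. What it shows is the attachment rule, where a combining sound mark stays with the character before it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of the RBL API.
class CombiningMarkSketch {
    // U+3099 and U+309A are the combining (semi-)voiced sound marks for kana.
    static boolean isKanaSoundMark(int codePoint) {
        return codePoint == 0x3099 || codePoint == 0x309A;
    }

    // Splits text into one unit per code point, except that a kana sound
    // mark is appended to the preceding unit instead of starting a new one.
    static List<String> units(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (isKanaSoundMark(cp) && !out.isEmpty()) {
                int last = out.size() - 1;
                out.set(last, out.get(last) + s);
            } else {
                out.add(s);
            }
        });
        return out;
    }

    public static void main(String[] args) {
        // KA + combining voiced sound mark, then KI: two units, not three.
        System.out.println(units("\u304B\u3099\u304D").size());
    }
}
```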
July 2021
Bug Fixes
Enabling universalPosTags for Indonesian, Tagalog, or Standard Malay no longer throws a RosetteUnsupportedLanguageException. POS tags are not supported for these languages, so the universalPosTags option is ignored. (ETROG-3465)
July 2021
New
New language support: Tokenization is now supported for Indonesian, Standard Malay, and Tagalog. (ETROG-3443)
Fragment definition: A single line followed by an empty line is no longer always considered a fragment. Such a line is still considered a fragment if it is short, as specified by the maxTokensForShortLine parameter. (ETROG-3431)
Solr 8.9 support: This release supports Solr 8.9. (ETROG-3457)
May 2021
Bug Fixes
We fixed a bug where enabling alternativeSpanishDisambiguation for Spanish caused a NullPointerException to be thrown. (ETROG-3435)
We fixed a bug where setting disambiguatorType to DNN for Hebrew caused a RosetteRuntimeException to be thrown. (ETROG-3437)
May 2021
New
New option for tokenizers: We've added a new option, tokenizerType, to specify which tokenizer to use. The options alternativeTokenization and fstTokenize are deprecated in favor of tokenizerType. (ETROG-3419)
New Korean tokenizer: We've added a new tokenizer for spaceless Korean input. The previous tokenizer was not trained on spaceless Korean and did not perform well without spaces between tokens. Activate it by setting tokenizerType to spaceless_statistical. (ETROG-3392)
Bug Fixes
Third-party component updates
This release includes the following third-party component changes:
Table 6. Added
Package | Version |
JavaCPP | 1.5.4 |
Table 7. Upgraded
Package | Old Version | New Version |
TensorFlow | 1.14.0 | 2.3.1 |
March 2021
New
Improved Hebrew tokenizer and new analyzer: The Hebrew tokenizer is now more consistent with the tokenizers of other languages. Hebrew tokenization and analysis are now done in separate steps. (ETROG-3290)
TokenizerOption.includeHebrewRoots and TokenizerOption.guessHebrewPrefixes have been deprecated and replaced by AnalyzerOption.includeHebrewRoots and AnalyzerOption.guessHebrewPrefixes.
NFKC normalization is now supported for Hebrew.
We've improved tokenization of certain sequences involving digits, periods, and number-related symbols like ⟨%⟩.
We've added additional acronyms and abbreviations to the Hebrew tokenizer. (ETROG-3249)
Double apostrophes are now treated like gershayim. (ETROG-3249)
Normalized characters: Normalized half-width and full-width characters are processed the same as their counterparts. (ETROG-3351)
Solr 8.8: This release supports Solr 8.8. (ETROG-3369)
Improved CSCAnnotator output: The CSCAnnotator now emits tokens in addition to translations, even if no tokens were specified in the input. (ETROG-3356)
Improved directory structure: The contents of the models/ directory are now separated into subdirectories by language. (ETROG-1218)
Statistical models moved to models/ directory: The following files have been moved from dicts/ to models/: (ETROG-1218)
cat/ca-ud-train.downcased.mdl
est/et-ud-train.downcased.mdl
fas/posLemma.mdl
lav/lv-ud-train.downcased.mdl
nno/lemma.mdl
nob/lemma.mdl
slk/sk-ud-train.downcased.mdl
srp/sr-ud-train.downcased.mdl
Bug Fixes
-
Hebrew tokens containing a geresh are now tokenized properly. Previously, only the part up to the geresh would be returned as the token text, and the part after the geresh would sometimes be considered a suffix. Now the whole token is returned as the token's text. (ETROG-3262, ETROG-3290)
Example: מע'רב
-
Previously:
Token{text=מע'}
MorphoAnalysis{extendedProperties={hebrewPrefixes=[], hebrewSuffixes=[]},
partOfSpeech=noun, lemma=מע', tagSet=MILA_HEBREW}
MorphoAnalysis{extendedProperties={hebrewPrefixes=[מ, ב], hebrewSuffixes=[ר, ב]},
partOfSpeech=numeral, lemma=70, tagSet=MILA_HEBREW}
-
Now:
Token{text=מע'רב}
MorphoAnalysis{extendedProperties={com.basistech.rosette.bl.hebrewPrefixes=[],
com.basistech.rosette.bl.hebrewSuffixes=[]}, partOfSpeech=unknown, lemma=מע'רב,
tagSet=MILA_HEBREW}
-
A structured region containing two new lines is now properly labeled as STRUCTURED. Previously, the layout region would be labeled as UNSTRUCTURED. (ETROG-3378)
Example: * item\n* item\n* item\n\n
-
Previously:
{"startOffset": 0,"endOffset": 14,"layout": "STRUCTURED"}
{"startOffset": 14,"endOffset": 22,"layout": "UNSTRUCTURED"}
-
Now:
{"startOffset": 0,"endOffset": 22,"layout": "STRUCTURED"}
January 2021
New
RBL-JE no longer normalizes certain emoji ZWJ sequences to U+1F48F KISS, U+1F491 COUPLE WITH HEART, and U+1F46A FAMILY, to be consistent with Unicode’s efforts to make emoji more gender-neutral by default. (ETROG-3350)
Bug Fixes
Third-party component updates
This release includes the following third-party component changes:
December 2020
New
Greek lexicon: The Greek lexicon has additional words. (ETROG-3288)
Greek disambiguation improved: Certain Greek forms are now disambiguated to prefer a modern analysis over an archaic analysis. alternativeGreekDisambiguation must be set to false, which is the default. (ETROG-3289)
Example: δείξε
New Greek disambiguator added: The new Greek disambiguator is more accurate, but slower. The new disambiguator is enabled by default. To use the old disambiguator, set alternativeGreekDisambiguation to true. (ETROG-3304)
Deprecated classes: The classes BufferWordBreaker and WordBreakResults have been deprecated. (ETROG-3318)
Bug Fixes
The Greek guesser now handles tokens with non-alphanumeric characters. (ETROG-3286)
Example: Start+
GenericTokenizer#hasNext is now implemented to be consistent with the documentation for Iterator#hasNext. Previously it always returned false. (ETROG-2140)
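For readers unfamiliar with the contract referenced in ETROG-2140, Iterator#hasNext must report whether next() would return an element, and calling it repeatedly must not consume anything. The sketch below shows that contract over an in-memory list. It is not the GenericTokenizer implementation; the class name and the use of plain strings as tokens are assumptions made for illustration.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration; not part of the RBL API.
class TokenIteratorSketch implements Iterator<String> {
    private final List<String> tokens;
    private int index = 0;

    TokenIteratorSketch(List<String> tokens) {
        this.tokens = tokens;
    }

    // True exactly when next() would return an element; has no side effects.
    @Override
    public boolean hasNext() {
        return index < tokens.size();
    }

    // Returns the next token, or throws once the tokens are exhausted.
    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return tokens.get(index++);
    }

    public static void main(String[] args) {
        Iterator<String> it = new TokenIteratorSketch(List.of("Start", "+"));
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

An implementation whose hasNext always returns false would make the while loop above skip every token, which is the symptom the fix addresses.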
November 2020
New
Performance improvement: Spanish disambiguation with alternativeSpanishDisambiguation set to false is now faster. (ETROG-3271)
Performance improvement: Korean disambiguation is now faster. (ETROG-3280, ETROG-3282)
Support for unknown language: If the language is unknown (xxx), tokenization and sentence breaking are supported. (ETROG-3278)
Solr 8.7.0: We now support Solr 8.7.0. (ETROG-3315)
Tokenization rule preprocessor: The preprocessor command !!btinclude is supported in tokenization rule files, allowing rule files to include other files. (ETROG-2497)
Updated sample: The tokenize-analyze sample has been changed from two applications running in sequence to a single application that both tokenizes and analyzes. (ETROG-3291)
New sample: The sample csc-annotate demonstrates using CSC with the ADM API. (ETROG-3317)
Deprecated option: TokenizerOption#includeRoots has been deprecated and replaced with TokenizerOption#includeHebrewRoots. (ETROG-3314)
Deprecated option: The alternative tokenization option deliverExtendedAttributes is now deprecated. Previously it delivered an unsupported extended property. (ETROG-3311)
Bug Fixes
Combining characters in Hebrew are no longer erroneously split into tokens separate from their bases. (ETROG-3277)
A clear exception (RosetteUnsupportedLanguageException) is now thrown when tokenizing some unsupported languages. Previously, these languages appeared to work. The same tokenizer is still available by specifying the unknown language (xxx). The languages impacted are Albanian, Bulgarian, Croatian, Indonesian, Malay, Slovenian, Standard Malay, and Ukrainian. (ETROG-3278, ETROG-3326)
RBL no longer crashes when alternativeTokenization and fragmentBoundaryDetection are both enabled for some inputs in Japanese and Chinese. (ETROG-3285)
Correct start and end offsets are now produced when fstTokenize is set to true. Previously, some Spanish inputs would produce tokens with start and end offsets of 0. (ETROG-3292)
-
The mappings of default Basis POS tags to universal POS tags (UPT-16) have been corrected for Greek. (ETROG-3306)
Previously: COSUBJ maps to CONJ, ORD maps to ADJ, and POSS maps to DET
Now: COSUBJ maps to ADP, ORD maps to NUM, and POSS maps to PRON
Tokens no longer have null token types. (ETROG-3316)
When an NFKC normalized character results in multiple tokens, those tokens no longer have equal start and end offsets. Previously this could occur when nfkcNormalize was set to true. (ETROG-2505)
Example: ﷺ
-
Previously: Offsets:
صلى start 0 end 0
الله start 0 end 0
عليه start 0 end 0
وسلم start 0 end 1
-
Now: Offsets:
صلى start 0 end 1
الله start 0 end 1
عليه start 0 end 1
وسلم start 0 end 1
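The NFKC case above (ETROG-2505) arises because a single source character can expand into several words. The sketch below uses the JDK's java.text.Normalizer to show the expansion of U+FDFA and prints each resulting word with the offsets of the one source character it came from. It is an illustration only: the class and method names are hypothetical, and splitting on spaces stands in for real tokenization.

```java
import java.text.Normalizer;

// Hypothetical illustration; not part of the RBL API.
class NfkcExpansionSketch {
    // Applies NFKC normalization and splits the result on spaces.
    static String[] normalizedWords(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC).split(" ");
    }

    public static void main(String[] args) {
        String source = "\uFDFA"; // a single UTF-16 code unit
        for (String word : normalizedWords(source)) {
            // Every word maps back to the same source span, offsets 0 to 1.
            System.out.println(word + " start 0 end " + source.length());
        }
    }
}
```

Because all of the words originate from one character, they share the span 0 to 1 in the original text, which matches the corrected offsets listed above.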
September 2020
New
Lucene/Solr: Versions up through 8.6.0 are now supported. (ETROG-3250)
Decompose compounds: The option to control decomposition of compounds is now available in Dutch, German, Hungarian, Danish, Bokmål, Nynorsk, Swedish, and Korean. The default for decomposeCompounds is true. (ETROG-3263, ETROG-3264, ETROG-3265)
Performance improvement: English and Spanish disambiguation is now faster. Alternative disambiguation (alternativeEnglishDisambiguation or alternativeSpanishDisambiguation) must be set to false. (ETROG-3246, ETROG-3243)
Bug Fixes
-
In Hebrew, prefixes in some acronym tokens are now listed correctly in the list of prefixes, instead of being duplicated in the lemma. (ETROG-3214)
Example: “ומש"ס”
Previously: lemma: “ומומש"ס”, empty prefix list
Now: lemma: “ש"ס”, prefix list = [“ו”, “מ”]
Sentence breaks are now correct when there are two line breaks and fragmentBoundaryDetection is enabled. (ETROG-3241)
Example: "a very very very very long line\nshort\n\n"
-
Previously: 2 sentences
{"startOffset":0,"endOffset":20}
{"startOffset":20,"endOffset":27}
-
Now: 1 sentence
{"startOffset":0,"endOffset":26}
-
In Hebrew, lemmas starting or ending with spaces now have the spaces removed. (ETROG-3248)
Example: "אאורקה"
Previously: “אאורקה ”
Now: "אאורקה"
-
Analyses of unknown Hebrew words with guessed prefixes no longer have duplicate prefixes in their prefix lists. (ETROG-3253)
Example: "בפיירפוקס"
In Chinese and Japanese, the system no longer crashes when both fragmentBoundaryDetection and alternativeTokenization are enabled. (ETROG-3260)
In Japanese, adjacent tokens are no longer erroneously joined when alternativeTokenization is enabled. (ETROG-3261)
When universalPosTags is enabled, the UPT-16 POS tags are now marked as having the tag set UPT16_V1 instead of the default tag set of the language. (ETROG-3273)
Example: French
We've fixed the tokenize-analyze example in the samples directory. It now correctly produces results for Hebrew analysis. (ETROG-3252)
July 2020
New Features
Layout regions added: Layout regions, describing each section of input text as STRUCTURED or UNSTRUCTURED, are now identified by the annotator. In order to detect layout regions, fragment boundary detection must be enabled. (ETROG-3172)
New short line parameter: The option maxTokensForShortLine has been added to configure how many tokens can be in a line for it to be considered short for fragment boundary detection. The default value is 6. (ETROG-3179)
Greek time abbreviations: The time abbreviations "π.μ." and "μ.μ." are now identified and annotated in Greek. The option fstTokenize must be set to true. (ETROG-3226)
Greek coverage expanded: POS tags and lemmas are now recognized for some Greek words previously not identified. (ETROG-3225)
Hebrew user-defined dictionaries added: Static and dynamic user-defined Hebrew analysis dictionaries are now supported. (ETROG-3230)
Deprecated method: HebrewAnalysis#characteristicString is now deprecated. (ETROG-3209)
Order of user-defined dictionaries: The order in which user-defined dictionaries are consulted has been standardized. Refer to the RBL-JE Application Developer's Guide for details. (ETROG-3148)
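The short-line threshold introduced by maxTokensForShortLine (ETROG-3179) can be pictured with a small sketch. It is not the RBL fragment boundary detector: the class and method names are hypothetical, whitespace splitting stands in for real tokenization, and it assumes a line counts as short when its token count is at most the configured maximum.

```java
// Hypothetical illustration; not part of the RBL API.
class ShortLineSketch {
    // Default stated in the release note for maxTokensForShortLine.
    static final int DEFAULT_MAX_TOKENS_FOR_SHORT_LINE = 6;

    // Counts whitespace-delimited tokens; a stand-in for real tokenization.
    static int countTokens(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Assumption: "short" means the token count does not exceed maxTokens.
    static boolean isShortLine(String line, int maxTokens) {
        return countTokens(line) <= maxTokens;
    }

    public static void main(String[] args) {
        int max = DEFAULT_MAX_TOKENS_FOR_SHORT_LINE;
        System.out.println(isShortLine("short", max));
        System.out.println(isShortLine("a very very very very long line", max));
    }
}
```

Whitespace splitting would not work for Chinese, Japanese, or Thai, which is why accurate token counting matters for those languages.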
Bug Fixes
Whitespace-delimited fragment boundaries are no longer skipped when they fall within tokens. This only occurred in some languages when fstTokenize was enabled. (ETROG-3159)
Example: "1\n234" (embedded newline within the number string)
This example assumes fstTokenize is enabled and the language is French.
Fragment detection now counts tokens correctly to determine short lines. This mostly impacts languages without spaces: Chinese, Japanese, and Thai. (ETROG-3177)
-
Tokens with digits are now eligible for the Greek guesser. (ETROG-3231)
Previously: "HDMI1" defaulted to possible PROP, ADJ, NOUN POS tags
Now: "HDMI1" gets FM POS tag
In Hebrew, tokens with an unknown part of speech are no longer assigned the part of speech of one of their prefixes. This only occurred when the guessHebrewPrefixes option was set to true. (ETROG-3221)
Example: "ומפיפרנו"
-
Russian perfective verbs are now lemmatized correctly. Previously some were lemmatized to their imperfective counterparts' lemmas or other incorrect lemmas. (ETROG-3112)
Example: "разложу" where "разложу" is perfective and its lemma is "разложить". Its imperfective counterpart’s lemma is "раскладывать"
Previously: Two analyses: one lemmatized to "раскладывать", the other to "разлагать"
Now: One analysis, lemmatized to "разложить"
-
German lemmas that consist of a separable prefix and a noun are now correctly capitalized. (ETROG-3235)
Example: Input "Mitbehandlung"; "mit" is a separable prefix
-
In Hebrew, terminal combining characters are no longer getting split into their own tokens. (ETROG-3224)
Example: "1" (keycap)
Previously: Tokenized to two tokens, <U+0031 DIGIT ONE> <U+20E3 COMBINING ENCLOSING KEYCAP>.
Now: Tokenized to one token, "1"
May 2020
New Features
-
Hebrew tokens that have prefixes but not stems now get appropriate parts of speech. Previously, they got the POS tag "unknown". (ETROG-3207)
Example: “ה” from the string “ה70”
Lucene/Solr up through version 8.5.1 is now supported. (ETROG-3208)
When guessHebrewPrefixes is true, unrecognized Hebrew tokens will now get analyses with and without potential prefixes. Previously, they would only get analyses with potential prefixes. (ETROG-3188)
Example: Token: "ומפיפרנו"
-
Previously: 2 analyses:
hebrewPrefixes=[ו] lemma=מפיפרנו
hebrewPrefixes=[ו, מ] lemma=פיפרנו
-
Now: 3 analyses:
hebrewPrefixes=[ו] lemma=מפיפרנו
hebrewPrefixes=[ו, מ] lemma=פיפרנו
hebrewPrefixes=[] lemma=ומפיפרנו
Bug Fixes
-
Minimally-qualified emoji are no longer split apart. (ETROG-3185)
Example: The emoji for "man tipping hand" (<U+1F481, U+200D, U+2642>)
Previously: U+1F481 and <U+200D, U+2642> (2 tokens)
Now: <U+1F481, U+200D, U+2642> (1 token)
-
Capitalized nouns are no longer being detected as verbs. (ETROG-3186)
Example: The noun "Service" from the phrase "Price and Quality of Service"
When creating multiple analyzers for Chinese, Japanese, or Thai with alternativeTokenization set to false (the default), the analyzers will now share the same model data. This will improve memory usage when creating multiple analyzers. (ETROG-3200)
Note: While memory usage has been improved, the process is still memory intensive. If RBL throws an OutOfMemoryError, increase the heap space.
March 2020
New Features
Lucene/Solr: RBL-JE now supports Lucene/Solr up through version 8.4.1. (ETROG-3156)
Unicode 13.0 emojis: Unicode 13.0 emoji sequences are now tokenized. (ETROG-3164)
Additional emoji support: Emoji hair components are now lemmatized. (ETROG-3167)
German professions: Additional German professions have been added to the German lexicon. (ETROG-3163)
Spanish performance improvements: Spanish disambiguation is now faster when alternativeSpanishDisambiguation is false. (ETROG-3169)
Hebrew lemmatization: We increased proper noun coverage in the Hebrew lexicon. (ETROG-3161, ETROG-3162)
Bug Fixes
Low surrogates are no longer stripped from the ends of tokens in Hebrew. (ETROG-3165)
Number tokens with embedded spaces are no longer split into multiple tokens when preceded or followed by a symbol when fstTokenize is true. (ETROG-3158)
January 2020
New Features
The delimiters for the fragment boundary detector are now configurable. (ETROG-3116)
The fragment boundary detector now marks a boundary after any spaces following the fragment boundary delimiter. (ETROG-3116)
An underscore (U+005F) is no longer treated as a token separator in German when fstTokenize is enabled. (ETROG-3144)
Bug Fixes
We fixed a bug where tokens from multi-script Russian text sometimes had incorrect offsets if fstTokenize was enabled. (ETROG-3142)
We fixed a bug where multi-script Russian text would have a sentence break each time the script changed. (ETROG-3145)
We fixed a bug where there were unexpected sentence breaks after some short lines not ending in whitespace. (ETROG-3146)
We fixed a bug where sentence breaks were missing when the sentence break did not align with a token boundary. (ETROG-3140)
December 2019
New Features
Added support for Lucene/Solr up through version 8.3.0. (ETROG-3128)
Added support for tokenizing and lemmatizing Latvian. (ETROG-2798)
Latin-script regions within Russian documents are now tokenized and analyzed as English. (ETROG-3126)
TokenizerOption.licenseString, AnalyzerOption.licenseString, and BaseLinguisticsOption.licenseString may now be passed into a create method. Previously, these options had to be set on the factory itself. (ETROG-3134)
Bug Fixes
We fixed a bug where guessed German compounds were sometimes lemmatized as verbs but tagged as nouns. (ETROG-3094)
We fixed a bug where the fragment boundary detector would mark a sentence break after every Windows newline. (ETROG-3133)
November 2019
New Features
The Hebrew files dinflections.bin, dprefixes.data, and gimatria.data have been moved from the root/models directory to root/dicts/heb. (ETROG-3088)
Specifying the universalPosTags option now adds the deliverExtendedTags option as well. (ETROG-2185)
Dynamic user dictionaries can now be created and populated at runtime. See the section User-Defined Dictionaries in the Application Developer's Guide for details. (ETROG-3086, ETROG-3100, ETROG-3109, ETROG-3110, ETROG-3111)
Fragment boundary detection is now enabled by default. Previously it was disabled by default. (ETROG-3108)
TokenizerOption.alternativeTokenizationOptions has been deprecated in favor of separate options for each YAML key. See the Javadoc for details. (ETROG-3109)
The UPT-16 files upt-16-pes.yaml and upt-16-prs.yaml have been removed from the distribution package, as they were unused. (ETROG-3122)
The -order option in rbl-build-csc-dictionary has been removed. All dictionaries are now built as little-endian (LE), as LE dictionaries still work on big-endian (BE) machines. (ETROG-3120)
We've added imperative forms for 2000 verbs to the Arabic lexicon. (ETROG-3090)
Bug Fixes
Fragment boundary detection is now enabled for Hebrew. (ETROG-1442)
When lemmatizing numbers in Russian, numbers containing spaces will now be lemmatized without the space. For example, "1 234" will now be lemmatized as "1234" instead of "1 234". (ETROG-3101)
We fixed a bug introduced in 7.30.1.c61.0 which raised an ArrayIndexOutOfBoundsException when processing Japanese with alternativeTokenization and favorUserDictionary set to true. (ETROG-3118)
We fixed a bug where a middle dot would be ignored if it preceded white space when using alternativeTokenization in Japanese. (ETROG-3113)
Third-party component updates
September 2019
Bug Fixes
We fixed a bug where an AssertionError might be thrown when analyzing Hungarian with Java assertions enabled.
Russian words hyphenated with a number are now tagged with the part of speech of the word without the number.
Previously: Аполлона-11 (Apollo-11) was tagged as PROP, MISC, and NOUN
Now: Аполлона-11 (Apollo-11) is tagged as NOUN
Correct token offsets are now returned from a Japanese annotator where a non-katakana character precedes a user-defined katakana token and alternativeTokenization and favorUserDictionary are enabled.
We fixed a bug where constructors of factory classes in the Lucene/Solr plugin would throw an UnsupportedOperationException if passed a Map that did not support the remove method.
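The Map-related failure in the last fix is a general Java pitfall: calling remove on a read-only map throws UnsupportedOperationException. One common way to avoid it, shown in the sketch below, is to work on a private mutable copy of the caller's map. This is an assumption about a typical remedy, not a description of how the plugin was actually fixed, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not part of the RBL or Lucene/Solr plugin API.
class FactoryArgsSketch {
    // Reads and removes a key from a mutable copy, so read-only maps
    // passed by the caller are never modified.
    static String consume(Map<String, String> args, String key) {
        Map<String, String> copy = new HashMap<>(args);
        return copy.remove(key);
    }

    public static void main(String[] args) {
        // Map.of returns an immutable map; calling remove on it directly would throw.
        Map<String, String> readOnly = Map.of("language", "jpn");
        System.out.println(consume(readOnly, "language"));
    }
}
```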
August 2019
New Features
Bug Fixes
When alternativeTokenization was set to true, the Chinese tokenizer could create tokens at the end of the input string with the part of speech NT without checking that the context was valid for NT.
Analyzing Chinese and Japanese with alternativeTokenization enabled is now much faster on sentences that are thousands of characters long.
August 2019
New Features
Segmentation user dictionaries can be used for all languages, not just Chinese, Japanese, and Thai.
The option compoundComponentSurfaceForms has been added to return the surface forms of the components of compound words. By default, RBL-JE only returns the lemmas.
Added support for Lucene/Solr up through version 8.1.1.
Some Polish words ending in “-cku”, “-ska”, or “-sku” are lemmatized to forms ending in “-cki” or “-ski”.
Bug Fixes
The Japanese POS tag NE was not converted correctly to UPT-16.
The French POS tag CONJQUE was converted to UPT-16 CONJ instead of the more appropriate SCONJ.
When alternativeTokenization was disabled, Chinese punctuation was tagged as GUESS instead of PUNCT or EOS.
June 2019
New Features
Setting alternativeTokenization to true enables an alternative tokenizer for Thai, for parity with the Thai tokenizer in Basis Technology's C++ API (RLP).
All Hebrew tokens have analyses. The main change was adding the new part of speech punctuation. Non-punctuation tokens that formerly had empty analysis lists now have the part of speech unknown.
Third-party component updates
May 2019
New Features
Updated the English lexicon.
Added support for Lucene/Solr up through version 8.0.0.
Updated the German lexicon.
Updated the Swedish lexicon.
Arabic analysis will attempt to replace leading hamzated alefs with plain alefs for unrecognized tokens.
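The Arabic fallback in the last item can be sketched as a simple string rewrite. This is an illustration under stated assumptions, not the RBL implementation: the class and method names are hypothetical, and "hamzated alef" is taken to mean alef with hamza above (U+0623) or below (U+0625), which may not match the exact set RBL uses.

```java
// Hypothetical illustration; not part of the RBL API.
class AlefFallbackSketch {
    // Replaces a leading alef with hamza above (U+0623) or below (U+0625)
    // with a plain alef (U+0627); other tokens are returned unchanged.
    static String plainLeadingAlef(String token) {
        if (token.isEmpty()) {
            return token;
        }
        char first = token.charAt(0);
        if (first == '\u0623' || first == '\u0625') {
            return "\u0627" + token.substring(1);
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(plainLeadingAlef("\u0623\u062D\u0645\u062F"));
    }
}
```

In a lookup pipeline, a rewrite like this would typically be tried only after the original form fails to match, which is consistent with the note's restriction to unrecognized tokens.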
Bug Fixes
The surface forms of Hebrew tokens consisting of multiple prefixes without a base, like “מה”, are now the entire token text, instead of just the first prefix.
Russian hyphenated words that end in numbers, like “Аполлона-11”, are no longer tagged as DIG. They are now tagged with the same parts of speech they had before 7.27.2.c60.0.
Closing parentheses, brackets, and braces that follow URLs when urls is enabled are no longer merged into the URLs.
The disambiguator is now more likely to select analyses with the POS tags ATMENTION, EMAIL, HASHTAG, and URL over other analyses.
When the Hebrew tokenizer encounters a character not used in Hebrew immediately following a character used in Hebrew, it starts a new token. Formerly, it would delete that character and any following characters up to the next token separator (e.g. white space).
RBL-JE can now successfully read in ICU tokenization rule files that begin with a BOM.
Hebrew tokens consisting of multiple prefixes without a base are now tagged with the part of speech “unknown”, to match single-prefix tokens.
The English token “than” is tagged only as COTHAN. The candidate part of speech COORD has been removed for this token.
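The byte order mark (BOM) fix for ICU tokenization rule files can be illustrated with a generic sketch. Once a file has been decoded to text, a leading BOM appears as the character U+FEFF, and dropping it before parsing is one common approach. The class and method names are hypothetical, and this is not a description of how RBL-JE reads rule files.

```java
// Hypothetical illustration; not part of the RBL API.
class BomSkipSketch {
    // Drops a single leading U+FEFF, if present, from already-decoded text.
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        String rules = "\uFEFF!!chain;";
        System.out.println(stripBom(rules));
    }
}
```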
May 2019
New Features
A perceptron-based disambiguator is available for Hebrew. It is used by default and when the option disambiguatorType is set to DisambiguatorType.PERCEPTRON. It was measured to have higher lemma and part of speech accuracies than the alternatives. To use the previous default, set disambiguatorType to DisambiguatorType.DICTIONARY.
Added support for Lucene/Solr up through version 7.6.0.
Running on Java 11 is now supported.
Bug Fixes
Some white space characters could be part of Chinese tokens when alternativeTokenization was enabled.
Tokens that are thousands of characters long slow down the tokenizer.
Polish tokens that can appear in multiword expressions are no longer lemmatized to the full expressions. For example, “dzień” is not lemmatized to “dzień_dobry”.
We fixed two bugs in the lemmatization of Russian compound words. The non-final components of compounds with more than one hyphen were not lemmatized, and the non-final components of hyphenated compounds with the interfix “е” or “о” that coincidentally looked like the short forms of adjectives were lemmatized as if they were short forms.
RBL-JE Release Note Archive 7.27.0.c60.0 and earlier
The Chinese Script Converter must be licensed separately from the rest of RBL. Old licenses no longer work for it. (ETROG-2916)
Lemmatization is supported for Persian. (ETROG-2924)
A dictionary-based disambiguator is available for Hebrew and is now the default. To run disambiguation in TensorFlow, set the option disambiguatorType to DisambiguatorType.DNN. (ETROG-2928)
Analyzing German tokens with default ignorable code points, including U+00AD SOFT HYPHEN, U+200C ZERO WIDTH NON-JOINER, and U+200D ZERO WIDTH JOINER, produces the same analyses as if the tokens did not include those characters. (ETROG-2824)
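The effect described above can be pictured as removing the ignorable characters from a token before it is looked up. The following is a minimal sketch under that assumption; the helper name is hypothetical, it covers only the three code points named in the note rather than the full set of default ignorable code points, and it is not RBL-JE's implementation.

```java
// Hypothetical helper for illustration; not part of the RBL-JE API.
class IgnorableStripper {
    /** Removes U+00AD, U+200C, and U+200D from a token. */
    static String strip(String token) {
        StringBuilder out = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c != '\u00AD' && c != '\u200C' && c != '\u200D') {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The input is a German compound with a soft hyphen after "Haus".
        String token = "Haus\u00ADt\u00FCr";
        System.out.println(strip(token).equals("Haust\u00FCr")); // prints true
    }
}
```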
Improved the lemma accuracy of the Spanish disambiguator. (ETROG-2856)
Improved disambiguation of English proper nouns. (ETROG-2867)
The North Korean (qkp) and South Korean (qkr) dialects are both treated as Korean (kor). (ETROG-2878)
Added support for tokenizing and lemmatizing Catalan, Estonian, Serbian, and Slovak. (ETROG-2752, ETROG-2774)
Added support for Lucene/Solr 7.0.0 through 7.1.0. (ETROG-2706)
POS-tagging and disambiguation are supported for Hebrew. (ETROG-2707, ETROG-2717)
Added the ArabicMorphoAnalysis class to allow an Annotated Data Model application to get more information for Arabic, Persian, and Urdu text than the MorphoAnalysis class would provide. (ETROG-2623)
Improved speed and memory footprint for English and Spanish disambiguation. (ETROG-2607, ETROG-2618, ETROG-2635)
Added the alternativeEnglishDisambiguation and alternativeSpanishDisambiguation options to specify the use of the old disambiguator in English and Spanish. The new disambiguator, introduced in version 7.18.0.c58.3 and enhanced in the current release, is more accurate but slower. (ETROG-2626)
Added the guessHebrewPrefixes option to control whether to split possible prefixes off unknown Hebrew words. (ETROG-2642)
Normalized U+05F3 HEBREW PUNCTUATION GERESH and U+05F4 HEBREW PUNCTUATION GERSHAYIM to U+0027 APOSTROPHE and U+0022 QUOTATION MARK in Hebrew. (ETROG-2647)
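The mapping in this note is a simple one-to-one character substitution. A minimal sketch of that substitution follows; the class name is hypothetical and the code only illustrates the mapping, not where in the RBL-JE pipeline it is applied.

```java
// Hypothetical helper for illustration; not part of the RBL-JE API.
class HebrewPunctuationNormalizer {
    /** Maps GERESH (U+05F3) to an apostrophe and GERSHAYIM (U+05F4) to a quotation mark. */
    static String normalize(String text) {
        return text.replace('\u05F3', '\'').replace('\u05F4', '"');
    }

    public static void main(String[] args) {
        // A Hebrew acronym written with GERSHAYIM before its last letter.
        String acronym = "\u05E6\u05D4\u05F4\u05DC";
        String normalized = normalize(acronym);
        System.out.println(normalized.indexOf('"')); // prints 2
    }
}
```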
Punctuation is filtered out in Lucene/Solr when query is set. (ETROG-2648)
Added support for Lucene/Solr 6.6. (ETROG-2656)
Added tokenization and POS-tagging for at-mentions and hashtags in all languages. (ETROG-2571)
Added the options atMentions, emailAddresses, emoticons, hashtags, and urls to enable tokenization and POS-tagging of @mentions, email addresses, emoticons, hashtags, and URLs. They are all disabled by default. (ETROG-2583)
Implemented the many-to-one normalizer. (ETROG-1961)
Deprecated many classes and methods that are for internal use only. (ETROG-2065)
Added BaseLinguisticsFactory#addUserCscDictionary. (ETROG-2098)
Removed obsolete big-endian models and dictionaries. (ETROG-2214)
Overhauled RBLCmd. ANNOTATE is the default command. -showTokenDetails, -showRawResults, and -verboseResults are removed. -inputJson interprets the input as an ADM. -outputJson is a boolean option. (ETROG-1392, ETROG-2343)
Decomposed compound verbs in Japanese when using alternativeTokenization. (ETROG-2350)
Introduced more advanced disambiguation for English and Spanish. (ETROG-2367, ETROG-2370, ETROG-2372, ETROG-2371, ETROG-2467)
Improved decompounding accuracy in Dutch. (ETROG-2408)
Added tokenization, lemmatization, and POS-tagging for emoticons and emoji in all languages. (ETROG-2474, ETROG-2512, ETROG-2516, ETROG-2520, ETROG-2522, ETROG-2538)
Supplemented analysis dictionaries for English and Spanish. (ETROG-2481, ETROG-2532, ETROG-2535)
Added support for Lucene/Solr 6.3. (ETROG-2501)
Introduced the ability to specify a user-defined reading dictionary in Lucene/Solr (userDefinedReadingDictionaryPath). (ETROG-2527)
The Chinese script converter is an entitlement with a standard Chinese license. (ETROG-1605)
Arabic reh is normalized as a decimal separator in numeric contexts. (ETROG-1650)
Provide disambiguation of Dutch compounds. (ETROG-1736)
A custom reading dictionary can be specified on the RBLCmd command line. (ETROG-1938)
Alternative tokenization options are included in BaseLinguisticsOption. (ETROG-1946)
Improve speed by caching Arabic analyses. (ETROG-1992)
Added support for alternative Chinese segmentation. (ETROG-2034)
Return Hebrew sentence boundaries. (ETROG-2036)
Added support for POS tag mappings for alternative Japanese and Chinese segmentation. (ETROG-2152)
Changed CompoundDictionary to provide its components in an order that reflects the contents of the lemma it returns. (ETROG-2154)
AnalyzerFactory#addUserAnalysisDictionary now throws an informative exception when either the root or dictionary directory is invalid. (ETROG-2166)
Augmented RBLCmd with the ability to return the RBL-JE version number. (ETROG-2168)
Improve handling of hiragana tokens homophonous to verbs in the alternative Japanese tokenizer (JLA). (ETROG-2188)
Improve handling of POS-ambiguous verb stems in the alternative Japanese tokenizer (JLA). (ETROG-2189)
The RBLCmd help command now sorts its options alphabetically. (ETROG-2195)
Han readings are now returned for all Katakana tokens. (ETROG-2208)
In the Russian FST tokenizer, initials are tokenized and given the +Init morpho-tag. (ETROG-2209)
Memory requirements of the FST tokenizer were reduced. (ETROG-2200, ETROG-2226)
Reduced the memory allocated for tokens by the FST tokenizer. (ETROG-2235)
Terminated support for Lucene/Solr 4.1-4.2. Added support for Lucene/Solr 6.0-6.1. (ETROG-2016, ETROG-2241, ETROG-2299)
Note: 7.15.0 was forked directly from 7.14.0 and thus does not have the changes in 7.14.1+.
The specification of options to RBLCmd was refactored. (ETROG-1503)
Added UPT-16 support for Persian and Urdu. (ETROG-1830)
Changed UPT-16 mappings for Czech and Hungarian numbers. (ETROG-1841)
Removed incorrect analyses for Polish adjectives and participles ending in m/mi. (ETROG-1916)
Removed archaic Polish analyses containing "być". (ETROG-1917)
Added raw analyses for English contractions. (ETROG-1944)
The command line tool RBLCmd supports Hebrew tokenization. (ETROG-1973)
Added support for Finnish stemming. (ETROG-2012)
Removed the spurious generation of an accusative case analysis for some Polish nouns. (ETROG-2020)
The Hebrew tokenizer overzealously guessed that periods were part of an abbreviation. (ETROG-2024)
Refactored the position metadata for Lucene tokens of compound components. (ETROG-2042)
Lucene tokens for components of a contraction are identified with type "CONT". To invoke this functionality, set FilterOption.identifyContractionComponents to true. (ETROG-2044)
AnalysesAttributes are formatted as JSON in Elasticsearch. (ETROG-2057)
Added API support for Lucene & Solr 5.0-5.3. (ETROG-1647)
Added support for Persian and Urdu. (ETROG-1636, ETROG-1667)
The 'nor' (Norwegian) language code is accepted. (ETROG-1690)
Exposed support for using the Rosette Annotated Data Model (ADM) to perform RBL-JE operations. (ETROG-1713)
The Arabic analysis candidate generation code now uses the same algorithm that the Arabic Language Processor in the native (C++) version of Rosette Base Linguistics does. (ETROG-1722)
Provided an alternative Japanese analyzer. This provides parity with the Japanese analyzer in Basis Technology's C++ API (RLP). It offers improved accuracy with query strings and names and provides greater user control of the analysis. (ETROG-1727)
For English, Portuguese, and German text, added ADM support for splitting contractions and analyzing the constituents. (ETROG-1769)
Provided support for returning the set of 16 universal part-of-speech (POS) tags rather than the set of 12 that were introduced in version 7.12.0. (ETROG-1771)
The RBLCmd tool now lists the BaseLinguisticsOption options. To use these options you must set analyzerType=none, lang, and BaseLinguisticsOption.language. (ETROG-1862)
Version 7.12.1.c56.6 introduced the use of the "compatibility" version number extension (c56.6 in this case). If you intend to use more than one Basis JVM SDK (e.g. RBL-JE, RLI-JE, REX-JE) in a single application, then choose versions that have the same compatibility number. (ETROG-1700)
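The compatibility rule above can be checked mechanically when pinning SDK versions. The sketch below assumes the version shape seen in these notes (three numeric parts followed by ".c" and a two-part compatibility number); that pattern is inferred from the examples, not taken from a published specification, and the class is hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any Basis SDK.
class CompatibilityNumber {
    // Assumed shape, inferred from version strings such as 7.12.1.c56.6.
    private static final Pattern VERSION =
            Pattern.compile("\\d+\\.\\d+\\.\\d+\\.c(\\d+\\.\\d+)");

    /** Extracts the compatibility number, e.g. "56.6" from "7.12.1.c56.6". */
    static Optional<String> of(String version) {
        Matcher m = VERSION.matcher(version);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** True only when both versions carry the same compatibility number. */
    static boolean compatible(String a, String b) {
        Optional<String> ca = of(a);
        return ca.isPresent() && ca.equals(of(b));
    }

    public static void main(String[] args) {
        System.out.println(compatible("7.12.1.c56.6", "7.13.0.c56.6")); // prints true
        System.out.println(compatible("7.26.5.c59.3", "7.26.6.c60.0")); // prints false
    }
}
```

Versions without the extension, such as 7.12.0, yield an empty result and are never reported as compatible by this sketch.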
Moved the Tokenize and Analyze samples into samples/tokenize-analyze and created a single Ant build script to compile and run both samples. (ETROG-1264)
Provided support for returning universal part-of-speech (POS) tags rather than the language-specific POS tags we already return. The universal tags (UPT) are coarser than the language-specific tags, but enable tracking and comparison across languages. (ETROG-1472)
Added support for returning a disambiguated analysis for each token in Japanese text. For performance, this feature is turned off by default. (ETROG-1324)
Added support for returning morphological tags, where available, and placed an example illustrating the procedure for obtaining morphological tags in samples/morpho-tags. (ETROG-1485)
Removed the small number of dubious acronym expansions from the lemmatization of English, French, Italian, German, Spanish, and Portuguese input. (ETROG-1547)
Improved the German lemma parser, which now returns the same lemma for German nouns that differ only in gender. (ETROG-1548)
Added API support for Lucene & Solr 4.10. (ETROG-1571)
Enhanced support for Korean linguistic analysis, and integrated a guesser for generating morphemes, morpheme tags, compound components, and parts of speech. (ETROG-1486, ETROG-1512, ETROG-1528)
Added support for Korean user lemma dictionaries. (ETROG-1518)
Added stop words to the Japanese analysis dictionary. (ETROG-1525)
Added the Chinese Script Converter, which can convert tokens in Traditional Chinese text to Simplified Chinese and vice versa. (ETROG-1462)
Terminated support for Lucene/Solr 3.6. (ETROG-1298)
Implemented support for Chinese part-of-speech (POS) tags and readings. (ETROG-1280)
Added support for normalization of Chinese and Japanese numbers. (ETROG-1310)
Implemented generation of Korean part-of-speech (POS) tags. (ETROG-1357)
Added a tool for building user dictionaries. (ETROG-210)
For those cases in which you want to use your own whitespace tokenizer and you are processing text that requires segmentation (such as Chinese, Japanese, or Thai), we have added support for a base linguistics segmentation token filter to be used after a whitespace tokenizer and before other filters, such as a base linguistics token filter. See the Javadoc for the RBL-JE API for Lucene 4.3-4.7. (ETROG-1240)
For Japanese, modified the base linguistics token filter to exclude lemmas for auxiliary verbs, particles, and adverbs from the token stream. (ETROG-1217)
Added support for using AnalysesAttribute to get the analyses and disambiguated analysis for each token in a token stream. (ETROG-1279)
Added SLF4J support for logging RBL-JE applications. (ETROG-1318)
Added support for turning case sensitivity on/off when analyzing text. (ETROG-1365)
Deprecated void com.basistech.rosette.bl.AnalyzerFactory#addUserDefinedDictionary(LanguageCode language, String path) in favor of void com.basistech.rosette.bl.AnalyzerFactory#addUserDefinedDictionary(LanguageCode language, String path, EnumMap<AnalyzerOption, String> options), where options is used to set AnalyzerOption.caseSensitive to "true" or "false".
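As a rough sketch of the options argument described above, the following builds an EnumMap that sets caseSensitive to "true" or "false". To keep the example self-contained, it declares a stand-in AnalyzerOption enum with only the caseSensitive constant; in a real application the SDK's own enum would be used, and the resulting map would be passed as the third argument of addUserDefinedDictionary. The wrapper class and method are hypothetical.

```java
import java.util.EnumMap;

// Hypothetical wrapper for illustration; not part of the RBL-JE API.
class CaseSensitivityOptions {
    // Stand-in for the SDK's AnalyzerOption enum so the sketch compiles alone.
    // Only the caseSensitive constant named in the note is modeled.
    enum AnalyzerOption { caseSensitive }

    /** Builds an options map with caseSensitive set to "true" or "false". */
    static EnumMap<AnalyzerOption, String> caseSensitive(boolean on) {
        EnumMap<AnalyzerOption, String> options = new EnumMap<>(AnalyzerOption.class);
        options.put(AnalyzerOption.caseSensitive, Boolean.toString(on));
        return options;
    }

    public static void main(String[] args) {
        // The option value is a string, matching the EnumMap<AnalyzerOption, String> signature.
        System.out.println(caseSensitive(false).get(AnalyzerOption.caseSensitive));
    }
}
```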
Unused analyzer parameter removed from the BaseLinguisticsSegmentationTokenFilter constructor. (ETROG-1316)
Updated the Japanese normalization dictionary. (ETROG-1229)
Added API support and samples for Lucene 4.9. (ETROG-1446)
Added a Lucene Analyzer that combines the RBL-JE Tokenizer and TokenFilter, along with the LowerCaseFilter, CJKWidthFilter, and optional support for the StopFilter: com.basistech.rosette.lucene.BaseLinguisticsAnalyzer. Added a Lucene 4.3-4.7 sample application that illustrates its use. (ETROG-1138, ETROG-1172)
Improved support for returning Japanese Hiragana readings. The API for adding readings to the token stream has moved from TokenizerFactory#SetOption to BaseLinguisticsTokenFilter#setAddReadings. You can also add ("addReadings", "true") to the map of options you use to instantiate the BaseLinguisticsAnalyzer. (ETROG-1054)
Added support for Japanese Hiragana readings.
Factored in support for Lucene 3.6, 4.1-4.2, and 4.3.
For this release, this product has been refactored and renamed to Rosette Base Linguistics Java Edition. This release concentrates on the core API instead of implementations for different versions of Lucene and Solr. This release returns part-of-speech tags for a core set of European languages and Japanese.
For licensing and business reasons, support for Bulgarian, Catalan, Estonian, Croatian, Indonesian, Latvian, Malay, Slovak, Slovenian, Serbian, Albanian, and Ukrainian has been removed from the RSE package. (ETROG-921)
Added support for tokenizing and lemmatizing Arabic, Czech, Hungarian, Korean, and Turkish. (ETROG-876)
Added support for segmenting (tokenizing) Thai. (ETROG-448)
Added a tokenizer option (turned off by default) for returning Hebrew roots. (ETROG-788)
Changed required Java platform from 1.5 to 1.6. (ETROG-765)
Added support for using RSE with LucidWorks Enterprise 1.7, which supports a pre-release version of Lucene and Solr 4.0.
Added support for tokenizing and lemmatizing Albanian, Bulgarian, Catalan, Croatian, Estonian, Greek, Hebrew, Indonesian, Latvian, Malay, Polish, Serbian, Slovakian, Slovenian, Russian, and Ukrainian. (ETROG-656, 658, 668, 677)
Added a command line driver for running RSE. For usage details, see the Javadoc for com.basistech.rosette.bl.RBLCmd. (ETROG-603)
Added support for tokenizing and lemmatizing Norwegian Nynorsk text. (ETROG-637)
Consolidated support for Lucene 2.2, Lucene 2.4, Lucene 2.9, Lucene 3.1, Solr 1.3, Solr 1.4, and Solr 3.1 in a single SDK package with an associated documentation package.
Deprecated support in the com.basistech.rosette.breaks package (GenericTokenizer and TokenizerOption) for returning EOS (end-of-sentence) tokens. includeEOS is off by default and should not be turned on; it interferes with Lucene searches. (ETROG-706)
Deprecated Lucene 2.9 LemmaFilterFactory.supportedLanguages(). Use getSupportedLanguages() instead. (ETROG-726)
Added support for Lucene 3.0.
Improved support for Japanese and Chinese tokenization.
Added the Japanese lemmatization dictionary and support of Japanese lemma user dictionaries. The Japanese lemmatization dictionary also provides orthographic normalization in the case of Katakana spelling variants and input text with archaic Kanji.
Added the production of normalized numbers to the lemmatization process.
Added support for Chinese lemma user dictionaries. Apart from numbers, which are already handled by the lemma guesser, lemmas do not ordinarily apply to Chinese, but a lemma user dictionary may be used for orthographic normalization.
Added support for Danish and Norwegian (Bokmål). Improved support for Chinese token segmentation and Romanian.
-
To enhance clarity and consistency, and to avoid duplication of package names in class names, made a number of API changes that are not backwards compatible.
-
Renamed some factory classes: (ETROG-436)
All these factory classes include a create() method for instantiating the Tokenizer or LemmaFilter. The getTokenFilter(), getLuceneTokenizer(), and getLemmatizer() methods have been removed.
Promoted classes introduced in Release 1.5.beta.1 for setting tokenizer and lemmatizer options from inner Enums to top-level Enums: com.basistech.rosette.breaks.TokenizerOption and com.basistech.rosette.bl.LemmatizerOption. (ETROG-434)
Removed the TokenizerFactory, LemmaFilterFactory, and LemmatizerFactory option-specific methods for setting options that predate the introduction of setOption().
-
The com.basistech.breaks.BreakerFactory methods for creating breakers have been renamed.
Added support for Chinese, and limited support for Japanese. For these languages, RSE adds statistically trained models/dictionaries to enable the tokenization of non-whitespace-delimited text. Support for user dictionaries has also been expanded to include token dictionaries for Chinese, Japanese, and Thai.
Enhanced support for Dutch, Italian, and Portuguese.
Replaced Lucene 2.9 and Solr 1.4 packages with Lucene 3.0 package.
Revised the API for defining tokenizer and lemmatizer options.
Reorganized the documentation to reflect standard RSE usage patterns.
Compiling a Swedish User Dictionary. As described in the RSE Application Developer's Guide, you must use RLP to create a user dictionary. "Chapter 12. User-Defined Data" in the RLP Application Developer's Guide provides instructions on creating the source file for a user-defined dictionary and compiling the dictionary. The current release of RLP (RLP 7.1.0) does not include support for creating a Swedish user dictionary. To create a Swedish dictionary, you must add a file that we provide in the extras directory to the corresponding location in your RLP installation: rlp/bl1/dicts/sv/tags.txt.
When you create your source file, you can use [+DUMMY] as the POS tag for each entry.
The syntax for compiling a Swedish user dictionary from rlp/bl1/dicts/tools is
build_user_dict.sh sv input output
Removed the Rosette Language Identifier (RLI) 100% Java implementation, which is now a separate product.
Provided separate SDK packages with support for Lucene 2.2, Lucene 2.4, and Lucene 3.0. (ETROG-198)
Added TokenizerFactory, which provides a language-specific Tokenizer for parsing input text. In addition to using the Sentence Breaker and Word Breaker, the Tokenizer normalizes the tokens (Unicode NFC normalization and lowercasing). (ETROG-185)
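The normalization step mentioned above (Unicode NFC followed by lowercasing) can be sketched with the standard library alone. The order of the two steps and the use of Locale.ROOT are assumptions made for this illustration; the helper is hypothetical and is not the Tokenizer's actual code.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the RSE or RBL-JE API.
class TokenNormalizer {
    /** Applies Unicode NFC normalization, then lowercases the result. */
    static String normalize(String token) {
        String nfc = Normalizer.normalize(token, Normalizer.Form.NFC);
        // Locale.ROOT is an assumption; a language-specific tokenizer may use its own locale.
        return nfc.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "Cafe" followed by a combining acute accent (U+0301), i.e. decomposed input.
        String decomposed = "Cafe\u0301";
        // NFC composes the final two code points into U+00E9, so the length drops from 5 to 4.
        System.out.println(normalize(decomposed).length()); // prints 4
    }
}
```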
Added support for Swedish, including tokenization, lemmatization, and decompounding. (ETROG-201)
Added preliminary, limited support for Dutch, Danish, Norwegian, Italian, Portuguese, and Romanian.
Expanded support for German decompounding.
Added support for generating a separate lemma for each space-delimited element in lemmas that contain whitespace.
This distribution provides support for Lucene 2.2.
Upgraded Token Filter Factory support from Lucene 2.2 to Lucene 2.4.
Added the Rosette Language Identifier (RLI), Sentence Breaker, and Word Breaker:
Introduced support for the creation of Lucene 2.2 Base Linguistics token filters for English, French, German, and Spanish text.
Bugs fixed in 7.27.2.c60.0
Bugs fixed in 7.27.1.c60.0
Bugs fixed in 7.26.6.c60.0
Bugs fixed in 7.26.5.c59.3
Bugs fixed in 7.26.4.c60.0
Bugs fixed in 7.26.3.c59.3
Bugs fixed in 7.26.2.c59.3
Bugs fixed in 7.26.1.c59.3
Bugs fixed in 7.26.0.c59.3
Bugs fixed in 7.25.0.c59.3
Bugs fixed in 7.24.6.c59.2
Bugs fixed in 7.24.5.c59.2
Bugs fixed in 7.24.4.c59.2
Bugs fixed in 7.24.2.c59.2
Bugs fixed in 7.24.1.c59.2
Bugs fixed in 7.24.0.c59.2
Bugs fixed in 7.23.3.c59.0
Bugs fixed in 7.23.2.c59.0
Bugs fixed in 7.23.0.c59.0
Bugs fixed in 7.22.2.c59.0
Bugs fixed in 7.22.1.c59.0
Bugs fixed in 7.22.0.c59.0
Bugs fixed in 7.21.3.c59.0
Bugs fixed in 7.21.0.c58.3
Bugs fixed in 7.20.4.c58.3
Bugs fixed in 7.20.3.c58.3
Bugs fixed in 7.20.2.c58.3
Bugs fixed in 7.20.1.c58.3
Bugs fixed in 7.20.0.c58.3
Bugs fixed in 7.19.0.c58.3
Bugs fixed in 7.18.0.c58.3
Bugs fixed in 7.17.0.c58.2
Bugs fixed in 7.16.1.c58.2
Bugs fixed in 7.16.0.c58.2
Bugs fixed in 7.14.2.c57.2
Bugs fixed in 7.14.1.c57.2
Bugs fixed in 7.14.0.c57.2
Bugs fixed in 7.13.0.c56.6
Bugs fixed in 7.12.1.c56.6
Bugs fixed in 7.12.0
For a list of third-party components that are used in Basis Technology products, see ThirdPartyLicenses.txt.
Third-party component updates in 7.27.1.c60.0
Third-party component updates in 7.26.4.c60.0
Third-party component updates in 7.25.0.c59.3
Third-party component updates in 7.24.6.c59.2
Third-party component updates in 7.24.0.c59.2
Third-party component updates in 7.23.0.c59.0
Third-party component updates in 7.21.1.c59.0
Third-party component updates in 7.18.0.c58.3
Third-party component updates in 7.16.0.c58.2
Third-party component updates in 7.14.0.c57.2
Third-party component updates in 7.13.0.c56.6
If disambiguate is set to false, or if no disambiguator for the language exists, BaseLinguisticsTokenFilter does not set the type correctly for compound components when adding them to the token stream. It marks compound components as <LEMMA> instead of <COMP> when a non-disambiguating analysis is performed. (ETROG-1552)
The prefixes and suffixes that the RSE tokenizer returns for Hebrew may include punctuation attached to the underlying tokens, such as parentheses (prefix, suffix) and comma (suffix). Accordingly, prefixes and suffixes are assigned a Token PositionIncrement of 1. A multicharacter prefix or suffix may be reported as a sequence of one-character prefixes or suffixes. (ETROG-697)
To avoid a potential out-of-memory error, RSE does not attempt to decompound words longer than 30 characters. For languages with support for decompounding, if a word is longer than 30 characters and is not found in a user dictionary or the standard dictionary, RSE classifies the word as a guessed lemma. (ETROG-191)
Inconsistent handling of numbers and punctuation during lemmatization. (ETROG-266)
RSE expects valid Unicode strings as input. If the input includes illegal Unicode sequences, such as un-paired UTF-16 surrogate characters, the behavior is undefined. (ETROG-284)
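Because behavior is undefined for illegal Unicode sequences, callers may want to validate input before passing it on. The following is a minimal sketch of a check for unpaired UTF-16 surrogates; the class is hypothetical and is not provided by RSE.

```java
// Hypothetical pre-check for illustration; not part of the RSE API.
class SurrogateCheck {
    /** Returns true if every surrogate in the text is part of a valid high/low pair. */
    static boolean isWellFormedUtf16(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed immediately by a low surrogate.
                if (i + 1 >= text.length() || !Character.isLowSurrogate(text.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate with no preceding high surrogate is unpaired.
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedUtf16("a\uD83D\uDE00")); // prints true
        System.out.println(isWellFormedUtf16("a\uD83D"));       // prints false
    }
}
```

An application could reject such input or replace the offending code units with U+FFFD before analysis; which policy fits is left to the caller.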
Known Problems in 1.3-beta and 1.4.x
Incorrect capitalization in some lemmas, including some German compounds (e.g., unAbhängigkeit).
Incorrect lemma formation of some words with suffixes (e.g., Brötchen).
Over-generation of German compound components (e.g., übergreifen, über, and greifen as separate components).
Failure to recognize some extended written-out German numbers (e.g., zweitausendzwölf).