Rosette Base Linguistics for Java (RBL-JE) is a 100% pure Java software development kit (SDK) for building search applications that support many languages.
[en] Release 7.47.0.c71.0
[en] September 2023
[en] New
Expanded Chinese lexicon: We've expanded the lexicon of multi-character Chinese surnames when tokenizerType is set to spaceless_lexical. (ETROG-3616)
Expanded Japanese lexicon: We've expanded the Japanese lexicon that is used when tokenizerType is set to spaceless_lexical. (ETROG-3632)
Added secondary parts of speech: We've added support for secondary parts of speech to Chinese and Japanese when tokenizerType is set to spaceless_lexical. (ETROG-3636)
Improved support for Chinese readings when tokenizerType is set to spaceless_lexical:
Readings are merged into a single reading if they become the same string after tone mark removal. (ETROG-3625)
Chinese readings are returned in a list. Previously, a token with multiple possible readings was returned as a single string with brackets and semicolons. (ETROG-3626)
Example: "蔭權"
Solr and Lucene support: Lucene 9.5 - 9.7 and Solr 9.3 are now supported. (ETROG-3643)
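The merging of readings that differ only by tone mark can be pictured with a small standalone sketch. It does not use the RBL-JE API: the class ReadingMerger, its method names, and the choice to return the toneless form are all assumptions made for illustration, and only the four combining marks used as pinyin tone marks are stripped.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of RBL-JE.
class ReadingMerger {
    // Combining grave, acute, macron, and caron: the four pinyin tone marks.
    // The diaeresis (U+0308) is deliberately kept so that u-umlaut survives.
    private static final String TONE_MARKS = "[\\u0300\\u0301\\u0304\\u030C]";

    // Decompose, drop the tone marks, then recompose what is left.
    static String stripToneMarks(String reading) {
        String decomposed = Normalizer.normalize(reading, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll(TONE_MARKS, "");
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    // Readings that collapse to the same toneless string are merged;
    // the order of first appearance is preserved.
    static List<String> merge(List<String> readings) {
        Set<String> unique = new LinkedHashSet<>();
        for (String reading : readings) {
            unique.add(stripToneMarks(reading));
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Two made-up readings that differ only in the tone of the first syllable.
        System.out.println(merge(List.of("y\u012Bnqu\u00E1n", "y\u00ECnqu\u00E1n")));
    }
}
```

Whether RBL-JE returns the merged reading with or without tone marks is not stated above; the sketch only shows the comparison step.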
[en] Bug Fixes
We fixed a bug where an ArrayIndexOutOfBoundsException occurred when the Chinese dictionaries produced more than 6 matches and tokenizerType was set to spaceless_lexical. (ETROG-3635)
When Chinese readings are constructed by character and tokenizerType is set to spaceless_lexical, an apostrophe is now inserted before pinyin syllables that start with "a", "e", or "o" and are not the first syllable. (ETROG-3637)
We fixed a bug in the UPT-16 conversion where the part of speech of some Japanese particles was not tagged correctly. The particles are now tagged correctly as ADP. (ETROG-3526)
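The apostrophe rule for pinyin readings described above can be sketched as follows. This is an illustration only, not RBL-JE code: PinyinJoiner and its methods are hypothetical names, and treating tone-marked vowels such as "ā" the same as their base letters is an assumption.

```java
import java.text.Normalizer;
import java.util.List;

// Hypothetical helper, not part of RBL-JE.
class PinyinJoiner {
    // True if the syllable starts with a, e, or o, ignoring case and any tone mark.
    static boolean startsWithAEO(String syllable) {
        if (syllable.isEmpty()) {
            return false;
        }
        // NFD splits a tone-marked vowel into base letter + combining mark.
        String decomposed = Normalizer.normalize(syllable, Normalizer.Form.NFD);
        char base = Character.toLowerCase(decomposed.charAt(0));
        return base == 'a' || base == 'e' || base == 'o';
    }

    // Concatenates per-character syllables, inserting an apostrophe before any
    // non-initial syllable that starts with a, e, or o.
    static String join(List<String> syllables) {
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < syllables.size(); i++) {
            String syllable = syllables.get(i);
            if (i > 0 && startsWithAEO(syllable)) {
                joined.append('\'');
            }
            joined.append(syllable);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(List.of("xi", "an")));    // apostrophe inserted
        System.out.println(join(List.of("bei", "jing"))); // no apostrophe needed
    }
}
```

The apostrophe keeps a two-syllable reading such as "xi" + "an" distinguishable from the single syllable "xian".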
[en] Known Issues
[en] Third-party component updates
Table 1. Updated
Package | Old Version | New Version
Jackson Annotations | 2.15.0 | 2.15.2
Jackson Core | 2.15.0 | 2.15.2
Jackson Databind | 2.15.0 | 2.15.2
Jackson Dataformat XML | 2.15.0 | 2.15.2
Jackson Dataformat YAML | 2.15.0 | 2.15.2
Jackson Datatype: Guava | 2.15.0 | 2.15.2
Jackson Module: Old JAXB Annotations | 2.15.0 | 2.15.2
Guava: Google Core Libraries for Java | 31.1-jre | 32.1.2-jre
Protocol Buffers [Core] | 3.21.7 | 3.23.4
[en] Release 7.46.4.c70.0
[en] June 2023
[en] New
[en] Third-party component updates
Table 2. Updated
Package | Old Version | New Version
Apache Log4J | 2.19.0 | 2.20.0
fastutil | 8.5.9 | 8.5.12
Jackson Annotations | 2.14.0 | 2.15.0
Jackson Core | 2.14.0 | 2.15.0
Jackson Databind | 2.14.0 | 2.15.0
Jackson Dataformat XML | 2.14.0 | 2.15.0
Jackson dataformats: Text | 2.14.0 | 2.15.0
Jackson datatypes: collections | 2.14.0 | 2.15.0
Jackson modules: Base | 2.14.0 | 2.15.0
SnakeYAML | 1.33 | 2.0
[en] Release 7.46.3.c69.0
[en] March 2023
[en] Bug Fixes
[en] Known Issues
[en] Release 7.46.2.c69.0
[en] March 2023
[en] New
[en] Known Issues
[en] Release 7.46.1.c68.0
[en] December 2022
[en] Bug Fixes
[en] Release 7.46.0.c68.0
[en] November 2022
[en] New
[en] Ukrainian support added: Tokenization, sentence boundary detection, segmentation user dictionaries, and many-to-one normalization dictionaries are supported for Ukrainian. (ETROG-3594)
[en] Improved part of speech tags: Language-neutral tokens (numbers, symbols, and punctuation) now get part of speech tags in Indonesian, Standard Malay, and Tagalog. (ETROG-3574)
[en] GPU support: Features that use TensorFlow now use a GPU if available. (ETROG-3564)
[en] Emoji support: Emoji 15.0 is now supported. (ETROG-3577)
New option for Katakana: We've added the option joinKatakanaNextToMiddleDot to control whether sequences of Japanese Katakana tokens adjacent to a middle dot should be merged into a single Katakana token. By default, it is true, which matches the behavior in previous versions of RBL-JE. (ETROG-3592)
[en] Solr 9.1 support: Lucene and Solr 9.1 are supported. (ETROG-3597)
[en] Bug Fixes
[en] Third-party component updates
Table 3. Upgraded
Package | Old version | New version
Apache Log4j | 2.17.1 | 2.19.0
fastutil | 8.5.6 | 8.5.9
Jackson | 2.11.1 | 2.14.0
JavaCPP | 1.58-alpha.20220614.013710.426 | 1.58
SLF4J | 1.7.33 | 1.7.36
SnakeYAML | 1.30 | 1.33
[en] Release 7.45.0.c67.0
[en] September 2022
[en] New
-
[en] Tagalog support:
[en] RBL now supports Part of Speech (POS) tagging in Tagalog. (ETROG-3559)
[en] RBL now supports lemmatization for Tagalog. (ETROG-3570)
[en] The Tagalog sentence-breaker now recognizes certain abbreviations that end with periods and doesn’t break sentences after them. The tokenizer keeps the period in the token with the rest of the abbreviation. (ETROG-3573)
[en] Indonesian (ind) support: RBL now supports lemmatization for Indonesian, which is the standardized form of Malay spoken in Indonesia. (ETROG-3563)
[en] Standard Malay (zsm) support: RBL now supports lemmatization for Standard Malay, the standardized form of Malay spoken in Malaysia. (ETROG-3563)
[en] Bug Fixes
[en] Release 7.44.2.c67.0
[en] July 2022
[en] New
[en] Release 7.44.1.c67.0
[en] June 2022
[en] Bug Fixes
[en] Release 7.44.0.c67.0
[en] June 2022
[en] New
[en] Indonesian support added: RBL now supports Part of Speech (POS) tagging in Indonesian. (ETROG-3543)
[en] Malay (Standard) support added: RBL now supports Part of Speech (POS) tagging in Malay (Standard). (ETROG-3545)
[en] Russian lexicon improved: We've added many words related to computer technology to the Russian lexicon. (ETROG-3523, ETROG-3538)
[en] Java 17 support added: Java 8 and 9 support has been removed. (ETROG-3524)
[en] Solr 9 support added: RBL now supports Lucene and Solr 9. (ETROG-3549)
Solr 6 support removed: RBL no longer supports Lucene or Solr 6 or earlier. (ETROG-3519)
[en] Bug Fixes
-
[en] In Japanese, negative forms of ichidan verbs written all in hiragana are no longer lemmatized to end with “なう”. (ETROG-3534)
[en] Example: Input: くれない
[en] Third-party component updates
[en] This release includes the following third-party component changes:
Table 4. Added
Package | Version | License
Jakarta Annotations API | 1.3.3 | Eclipse Public License 2.0 and GPL 2 with classpath exception
[en] Release 7.43.0.c66.0
[en] February 2022
Note
[en] Solr 6 and earlier support is deprecated as of this release.
[en] Java 8 and Java 9 support is deprecated as of this release.
[en] New
[en] Solr 8.11 support: This release supports Solr 8.11 (ETROG-3502)
Deprecated methods: Token#getType has been deprecated as token types are not used in RBL-JE without the Lucene/Solr plugins and the plugins use a different API. (ETROG-3503)
[en] Solr 6 support deprecated: Support for Solr versions 6.x and earlier is deprecated as of this release and will be removed in the next version.
[en] Permission changes: We removed group and other write permissions from model files. All files are now only writable by the owner. (ETROG-3516)
[en] Third-party component updates
[en] This release includes the following third-party component changes:
Table 5. Upgraded
Package | Old Version | New Version
Apache Commons IO | 2.7 | 2.11.0
Apache Commons Lang | 2.6 | 3.12.0
Apache Log4j | 1.2.17 | 2.17.1
ICU4J | 59.1 | 70.1
fastutil | 8.4.0 | 8.5.6
SLF4J | 1.7.28 | 1.7.33
SnakeYAML | 1.26 | 1.30
TensorFlow for Java | 0.2.0 | 0.3.3
[en] Release 7.42.2.c65.0
[en] November 2021
[en] Bug Fixes
[en] Release 7.42.1.c65.0
[en] November 2021
[en] Bug Fixes
[en] Release 7.42.0.c65.0
[en] November 2021
[en] New
Deprecated factories: TokenizerFactory, AnalyzerFactory, and CSCAnalyzerFactory have been deprecated in favor of BaseLinguisticsFactory. (ETROG-3453)
Katakana tokenization: The fullwidth and halfwidth Katakana middle dots (U+30FB and U+FF65) are now treated as decimal points in numeric contexts, for Japanese with tokenizerType set to spaceless_lexical. (ETROG-3474)
Example: Input: 三・一四
Emojis: U+3030 and U+303D are now tagged as emojis even when not followed by U+FE0F. (ETROG-3478)
Emoji support: We now support the emoji in Unicode 14.0. (ETROG-3476)
Japanese tokenization: In Japanese, when tokenizerType is set to spaceless_lexical, numeric tokens tagged NN are lemmatized to their ASCII values. For example, “七” is lemmatized to “7”. This is consistent with the default algorithm, spaceless_statistical. (ETROG-3475)
[en] Solr 8.10 support: This release supports Solr 8.10. (ETROG-3482)
[en] Improved POS tags: Many number, punctuation, and symbol characters are now POS-tagged appropriately as numbers, punctuations, and symbols instead of being marked as unknown or some other tag. This applies to all languages with POS tags. (ETROG-3481)
[en] Hungarian improvements: We've added some Hungarian abbreviations and improved sentence boundary detection around Hungarian abbreviations. (ETROG-3479, ETROG-3484)
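To make the two Japanese numeric changes in this release concrete (middle dots acting as decimal points, and numeric tokens lemmatized to ASCII), here is a rough standalone sketch. It is not how RBL-JE implements either feature: KanjiNumber is a hypothetical class, it handles only positional digits (no 十, 百, or other multipliers), and the exact lemma RBL-JE produces for an input such as 三・一四 is not stated above.

```java
// Hypothetical helper, not part of RBL-JE.
class KanjiNumber {
    // Kanji digits zero through nine, in order, written as escapes
    // (U+3007, U+4E00, U+4E8C, U+4E09, U+56DB, U+4E94, U+516D, U+4E03, U+516B, U+4E5D).
    private static final String KANJI_DIGITS =
            "\u3007\u4E00\u4E8C\u4E09\u56DB\u4E94\u516D\u4E03\u516B\u4E5D";

    static boolean isDigit(char c) {
        return KANJI_DIGITS.indexOf(c) >= 0 || (c >= '0' && c <= '9');
    }

    // U+30FB is the fullwidth Katakana middle dot, U+FF65 the halfwidth one.
    static boolean isMiddleDot(char c) {
        return c == '\u30FB' || c == '\uFF65';
    }

    // Maps kanji digits to ASCII digits and turns a middle dot into '.'
    // only when it sits between two digits (a "numeric context").
    static String toAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int digit = KANJI_DIGITS.indexOf(c);
            boolean betweenDigits = i > 0 && i + 1 < text.length()
                    && isDigit(text.charAt(i - 1)) && isDigit(text.charAt(i + 1));
            if (digit >= 0) {
                out.append((char) ('0' + digit));
            } else if (isMiddleDot(c) && betweenDigits) {
                out.append('.');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAscii("\u4E09\u30FB\u4E00\u56DB")); // the input from the example above
    }
}
```

A middle dot between Katakana words is left untouched by this sketch, since neither neighbor is a digit.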
[en] Bug Fixes
In Japanese, when tokenizerType is set to spaceless_lexical, the combining marks U+3099 and U+309A are now tokenized with the preceding character as a single token. Previously, they were tokenized as 2 separate tokens. (ETROG-3472)
-
We've reverted two of the POS changes made in version 7.39.0.c63.0 as they introduced regressions in Chinese and Japanese. (ETROG-3466)
[en] The values are now:
[en] RBL-JE no longer detects characters as emoji when followed by the text presentation selector (U+FE0E). (ETROG-3480)
[en] In English, the lowercase abbreviations of the titles “dr.”, “drs.”, “mr.”, and “mrs.” are now tokenized the same as the uppercase “Dr.”, “Drs.”, “Mr.”, and “Mrs.”. (ETROG-3485)
[en] Release 7.41.1.c65.0
[en] July 2021
[en] Bug Fixes
Enabling universalPosTags for Indonesian, Tagalog, or Standard Malay no longer throws a RosetteUnsupportedLanguageException. POS tags are not supported for these languages, so the universalPosTags option is ignored. (ETROG-3465)
[en] Release 7.41.0.c65.0
[en] July 2021
[en] New
[en] New language support: Tokenization is now supported for Indonesian, Standard Malay, and Tagalog. (ETROG-3443)
Fragment definition: A single line followed by an empty line is no longer always considered a fragment. It is still considered a fragment if the line is short, as specified by the maxTokensForShortLine parameter. (ETROG-3431)
[en] Solr 8.9 support: This release supports Solr 8.9. (ETROG-3457)
[en] Release 7.40.1.c64.1
[en] May 2021
[en] Bug Fixes
We fixed a bug where enabling alternativeSpanishDisambiguation for Spanish caused a NullPointerException to be thrown. (ETROG-3435)
We fixed a bug where setting disambiguatorType to DNN for Hebrew caused a RosetteRuntimeException to be thrown. (ETROG-3437)
[en] Release 7.40.0.c64.1
[en] May 2021
[en] New
New option for tokenizers: We've added a new option, tokenizerType, to specify which tokenizer to use. The options alternativeTokenization and fstTokenize are deprecated in favor of tokenizerType. (ETROG-3419)
New Korean tokenizer: We've added a new tokenizer for spaceless Korean input. The previous tokenizer was not trained on spaceless Korean and did not perform well without spaces between tokens. Activate it by setting tokenizerType to spaceless_statistical. (ETROG-3392)
[en] Bug Fixes
Multiple punctuation characters are no longer returned as a single token in Chinese when alternativeTokenization is true or tokenizerType is set to spaceless_lexical. Now each character is its own token. (ETROG-3402)
[en] Example: Input: 天津??
-
[en] Previously:
[en] Token{text=天津}
[en] HanMorphoAnalysis{extendedProperties={}, partOfSpeech=NP, lemma=天津, tagSet=BT_CHINESE}
[en] Token{text=??}
[en] HanMorphoAnalysis{extendedProperties={}, partOfSpeech=U, lemma=??, tagSet=BT_CHINESE}
-
[en] Now:
[en] Token{text=天津}
[en] HanMorphoAnalysis{extendedProperties={}, partOfSpeech=NP, lemma=天津, tagSet=BT_CHINESE}
[en] Token{text=?}
[en] HanMorphoAnalysis{extendedProperties={}, partOfSpeech=EOS, lemma=?, tagSet=BT_CHINESE}
[en] Token{text=?}
[en] HanMorphoAnalysis{extendedProperties={}, partOfSpeech=EOS, lemma=?, tagSet=BT_CHINESE}
[en] Third-party component updates
[en] This release includes the following third-party component changes:
Table 6. Added
Package | Version
JavaCPP | 1.5.4
Table 7. Upgraded
Package | Old Version | New Version
TensorFlow | 1.14.0 | 2.3.1
[en] Release 7.39.0.c63.0
[en] March 2021
[en] New
Improved Hebrew tokenizer and new analyzer: The Hebrew tokenizer is now more consistent with the tokenizers of other languages. Hebrew tokenization and analysis are now done in separate steps. (ETROG-3290)
TokenizerOption.includeHebrewRoots and TokenizerOption.guessHebrewPrefixes have been deprecated and replaced by AnalyzerOption.includeHebrewRoots and AnalyzerOption.guessHebrewPrefixes.
[en] NFKC normalization is now supported for Hebrew.
[en] We've improved tokenization of certain sequences involving digits, periods, and number-related symbols like ⟨%⟩.
[en] We've added additional acronyms and abbreviations to the Hebrew tokenizer. (ETROG-3249)
[en] Double apostrophes are now treated like gershayim. (ETROG-3249)
[en] Normalized characters: Normalized half-width and full-width characters are processed the same as their counterparts. (ETROG-3351)
[en] Solr 8.8: This release supports Solr 8.8. (ETROG-3369)
Improved CSCAnnotator output: The CSCAnnotator now emits tokens in addition to translations, even if no tokens were specified in the input. (ETROG-3356)
Improved directory structure: The contents of the models/ directory are now separated into subdirectories by language. (ETROG-1218)
Statistical models moved to models/ directory: The following files have been moved from dicts/ to models/: (ETROG-1218)
[en] cat/ca-ud-train.downcased.mdl
[en] est/et-ud-train.downcased.mdl
[en] fas/posLemma.mdl
[en] lav/lv-ud-train.downcased.mdl
[en] nno/lemma.mdl
[en] nob/lemma.mdl
[en] slk/sk-ud-train.downcased.mdl
[en] srp/sr-ud-train.downcased.mdl
[en] Bug Fixes
-
[en] Hebrew tokens containing a geresh are now tokenized properly. Previously, only the part up to the geresh would be returned as the token text, and the part after the geresh would sometimes be considered a suffix. Now the whole token is returned as the token's text. (ETROG-3262, ETROG-3290)
[en] Example: מע'רב
-
[en] Previously:
Token{text=מע'}
MorphoAnalysis{extendedProperties={hebrewPrefixes=[], hebrewSuffixes=[]},
partOfSpeech=noun, lemma=מע', tagSet=MILA_HEBREW}
MorphoAnalysis{extendedProperties={hebrewPrefixes=[מ, ב], hebrewSuffixes=[ר, ב]},
partOfSpeech=numeral, lemma=70, tagSet=MILA_HEBREW}
-
[en] Now:
Token{text=מע'רב}
MorphoAnalysis{extendedProperties={com.basistech.rosette.bl.hebrewPrefixes=[],
com.basistech.rosette.bl.hebrewSuffixes=[]}, partOfSpeech=unknown, lemma=מע'רב,
tagSet=MILA_HEBREW}
-
[en] A structured region containing two new lines is now properly labeled as STRUCTURED. Previously, the layout region would be labeled as UNSTRUCTURED. (ETROG-3378)
[en] Example: * item\n* item\n* item\n\n
-
[en] Previously:
{"startOffset": 0,"endOffset": 14,"layout": "STRUCTURED"}
{"startOffset": 14,"endOffset": 22,"layout": "UNSTRUCTURED"}
-
[en] Now:
{"startOffset": 0,"endOffset": 22,"layout": "STRUCTURED"}
[en] Release 7.38.1.c63.0
[en] January 2021
[en] New
[en] RBL-JE no longer normalizes certain emoji ZWJ sequences to U+1F48F KISS, U+1F491 COUPLE WITH HEART, and U+1F46A FAMILY, to be consistent with Unicode’s efforts to make emoji more gender-neutral by default. (ETROG-3350)
[en] Bug Fixes
[en] Third-party component updates
[en] This release includes the following third-party component changes:
[en] Release 7.38.0.c62.2
[en] December 2020
[en] New
[en] Greek lexicon: The Greek lexicon has additional words. (ETROG-3288)
Greek disambiguation improved: Certain Greek forms are now disambiguated to prefer a modern analysis over an archaic analysis. alternativeGreekDisambiguation must be set to false, which is the default. (ETROG-3289)
Example: δείξε
New Greek disambiguator added: The new Greek disambiguator is more accurate, but slower. The new disambiguator is enabled by default. To use the old disambiguator, set alternativeGreekDisambiguation to true. (ETROG-3304)
Deprecated classes: The classes BufferWordBreaker and WordBreakResults have been deprecated. (ETROG-3318)
[en] Bug Fixes
The Greek guesser now handles tokens with non-alphanumeric characters. (ETROG-3286)
Example: Start+
GenericTokenizer#hasNext is now implemented to be consistent with the documentation for Iterator#hasNext. Previously it always returned false. (ETROG-2140)
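For readers unfamiliar with the contract mentioned in the hasNext fix above, the sketch below shows a generic iterator that honors it: hasNext reports whether another element remains, and next throws NoSuchElementException once the elements are exhausted. TokenIterator is a hypothetical class written for this note; it is not GenericTokenizer and says nothing about how RBL-JE implements it.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a tokenizer that exposes its tokens as an Iterator.
class TokenIterator implements Iterator<String> {
    private final List<String> tokens;
    private int position = 0;

    TokenIterator(List<String> tokens) {
        this.tokens = tokens;
    }

    // Must be true exactly when next() would return an element.
    @Override
    public boolean hasNext() {
        return position < tokens.size();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return tokens.get(position++);
    }

    public static void main(String[] args) {
        Iterator<String> it = new TokenIterator(List.of("Base", "Linguistics"));
        while (it.hasNext()) { // a loop like this never runs if hasNext always returns false
            System.out.println(it.next());
        }
    }
}
```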
[en] Release 7.37.0.c62.2
[en] November 2020
[en] New
Performance improvement: Spanish disambiguation with alternativeSpanishDisambiguation set to false is now faster. (ETROG-3271)
Performance improvement: Korean disambiguation is now faster. (ETROG-3280, ETROG-3282)
Support for unknown language: If the language is unknown (xxx), tokenization and sentence breaking are supported. (ETROG-3278)
Solr 8.7.0: We now support Solr 8.7.0. (ETROG-3315)
Tokenization rule preprocessor: The preprocessor command !!btinclude is supported in tokenization rule files, allowing rule files to include other files. (ETROG-2497)
Updated sample: The tokenize-analyze sample has been changed from two applications running in sequence to a single application that both tokenizes and analyzes. (ETROG-3291)
New sample: The sample csc-annotate demonstrates using CSC with the ADM API. (ETROG-3317)
Deprecated option: TokenizerOption#includeRoots has been deprecated and replaced with TokenizerOption#includeHebrewRoots. (ETROG-3314)
Deprecated option: The alternative tokenization option deliverExtendedAttributes is now deprecated. Previously it delivered an unsupported extended property. (ETROG-3311)
[en] Bug Fixes
Combining characters in Hebrew are no longer erroneously split into tokens separate from their bases. (ETROG-3277)
A clear exception (RosetteUnsupportedLanguageException) is now thrown when tokenizing some unsupported languages. Previously, these languages appeared to work. The same tokenizer is still available by specifying the unknown language (xxx). The languages impacted are Albanian, Bulgarian, Croatian, Indonesian, Malay, Slovenian, Standard Malay, and Ukrainian. (ETROG-3278, ETROG-3326)
RBL no longer crashes when alternativeTokenization and fragmentBoundaryDetection are both enabled for some inputs in Japanese and Chinese. (ETROG-3285)
Correct start and end offsets are now produced when fstTokenize is set to true. Previously, some Spanish inputs would produce tokens with start and end offsets of 0. (ETROG-3292)
-
[en] The mappings of default Basis POS tags to universal POS tags (UPT-16) have been corrected for Greek. (ETROG-3306)
[en] Previously: COSUBJ maps to CONJ, ORD maps to ADJ, and POSS maps to DET
[en] Now: COSUBJ maps to ADP, ORD maps to NUM, and POSS maps to PRON
[en] Tokens no longer have null token types. (ETROG-3316)
When an NFKC normalized character results in multiple tokens, those tokens no longer have equal start and end offsets. Previously this could occur when nfkcNormalize was set to true. (ETROG-2505)
[en] Example: ﷺ
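The NFKC example above is easier to follow once you see how far a single character can expand. The sketch below uses only the JDK's java.text.Normalizer, not RBL-JE; NfkcExpansion and its methods are hypothetical names, and splitting on spaces is a stand-in for real tokenization.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of RBL-JE.
class NfkcExpansion {
    static String nfkc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Crude stand-in for tokenization: split the normalized text on spaces.
    static String[] words(String text) {
        return nfkc(text).split(" ");
    }

    public static void main(String[] args) {
        String source = "\uFDFA"; // the single-character ligature from the example
        System.out.println("source chars:     " + source.length());
        System.out.println("normalized chars: " + nfkc(source).length());
        System.out.println("space-separated words after NFKC: " + words(source).length);
    }
}
```

Because every word in the normalized text comes from the same one-character span of the input, any offsets reported for those words have to refer back to that single source character.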
[en] Release 7.36.0.c62.2
[en] September 2020
[en] New
[en] Lucene/Solr: Versions up through 8.6.0 are now supported. (ETROG-3250)
Decompose compounds: The option to control decomposition of compounds is now available in Dutch, German, Hungarian, Danish, Bokmål, Nynorsk, Swedish, and Korean. The default for decomposeCompounds is true. (ETROG-3263, ETROG-3264, ETROG-3265)
Performance improvement: English and Spanish disambiguation is now faster. Alternative disambiguation (alternativeEnglishDisambiguation or alternativeSpanishDisambiguation) must be set to false. (ETROG-3246, ETROG-3243)
[en] Bug Fixes
-
[en] In Hebrew, prefixes in some acronym tokens are now listed correctly in the list of prefixes, instead of being duplicated in the lemma. (ETROG-3214)
[en] Example: “ומש"ס”
[en] Previously: lemma: “ומומש"ס”, empty prefix list
[en] Now: lemma: “ש"ס”, prefix list = [“ו”, “מ”]
Sentence breaks are now correct when there are two line breaks and fragmentBoundaryDetection is enabled. (ETROG-3241)
[en] Example: "a very very very very long line\nshort\n\n"
-
[en] Previously: 2 sentences
{"startOffset":0,"endOffset":20}
{"startOffset":20,"endOffset":27}
-
[en] Now: 1 sentence
{"startOffset":0,"endOffset":26}
-
[en] In Hebrew, lemmas starting or ending with spaces now have the spaces removed. (ETROG-3248)
[en] Example: "אאורקה"
-
[en] Analysis of unknown Hebrew words with guessed prefixes no longer have duplicate prefixes in their prefix list. (ETROG-3253)
[en] Example: "בפיירפוקס"
In Chinese and Japanese, the system no longer crashes when both fragmentBoundaryDetection and alternativeTokenization are enabled. (ETROG-3260)
In Japanese, adjacent tokens are no longer erroneously joined when alternativeTokenization is enabled. (ETROG-3261)
When universalPosTags is enabled, the UPT-16 POS tags are now marked as having the tag set UPT16_V1 instead of the default tag set of the language. (ETROG-3273)
[en] Example: French
[en] We've fixed the tokenize-analyze example in the samples directory. It now correctly produces results for Hebrew analysis. (ETROG-3252)
[en] Release 7.35.0.c62.2
[en] July 2020
[en] New Features
Layout regions added: Layout regions, describing each section of input text as STRUCTURED or UNSTRUCTURED, are now identified by the annotator. In order to detect layout regions, fragment boundary detection must be enabled. (ETROG-3172)
New short line parameter: The option maxTokensForShortLine has been added to configure how many tokens can be in a line for it to be considered short for fragment boundary detection. The default value is 6. (ETROG-3179)
Greek time abbreviations: The time abbreviations "π.μ." and "μ.μ." are now identified and annotated in Greek. The option fstTokenize must be set to true. (ETROG-3226)
Greek coverage expanded: POS tags and lemmas are now recognized for some Greek words previously not identified. (ETROG-3225)
Hebrew user-defined dictionaries added: Static and dynamic user-defined Hebrew analysis dictionaries are now supported. (ETROG-3230)
Deprecated method: HebrewAnalysis#characteristicString is now deprecated. (ETROG-3209)
[en] Order of user-defined dictionaries: The order in which user-defined dictionaries are consulted has been standardized. Refer to the RBL-JE Application Developer's Guide for details. (ETROG-3148)
[en] Bug Fixes
Whitespace-delimited fragment boundaries are no longer skipped when they fall within tokens. This only occurred in some languages when fstTokenize was enabled. (ETROG-3159)
Example: "1\n234" (embedded newline within the number string)
This example assumes fstTokenize is enabled and the language is French.
[en] Fragment detection now counts tokens correctly to determine short lines. This mostly impacts languages without spaces: Chinese, Japanese, and Thai. (ETROG-3177)
-
[en] Tokens with digits are now eligible for the Greek guesser. (ETROG-3231)
[en] Previously: "HDMI1" defaulted to possible PROP, ADJ, NOUN POS tags
[en] Now: "HDMI1" gets FM POS tag
In Hebrew, tokens with an unknown part of speech are no longer assigned the part of speech of one of their prefixes. This only occurred when the guessHebrewPrefixes option was set to true. (ETROG-3221)
[en] Example: "ומפיפרנו"
-
[en] Russian perfective verbs are now lemmatized correctly. Previously some were lemmatized to their imperfective counterparts' lemmas or other incorrect lemmas. (ETROG-3112)
[en] Example: "разложу" where "разложу" is perfective and its lemma is "разложить". Its imperfective counterpart’s lemma is "раскладывать"
[en] Previously: Two analyses: one lemmatized to "раскладывать", the other to "разлагать"
[en] Now: One analysis, lemmatized to "разложить"
-
[en] German lemmas that consist of a separable prefix and a noun are now correctly capitalized. (ETROG-3235)
[en] Example: Input "Mitbehandlung"; "mit" is a separable prefix
-
[en] In Hebrew, terminal combining characters are no longer getting split into their own tokens. (ETROG-3224)
[en] Example: "1" (keycap)
[en] Previously: Tokenized to two tokens, <U+0031 DIGIT ONE> <U+20E3 COMBINING ENCLOSING KEYCAP>.
[en] Now: Tokenized to one token, "1"
[en] Release 7.34.2.c62.2
[en] May 2020
[en] New Features
-
[en] Hebrew tokens that have prefixes but not stems now get appropriate parts of speech. Previously, they got the POS tag "unknown". (ETROG-3207)
[en] Example: “ה” from the string “ה70”
[en] Lucene/Solr up through version 8.5.1 is now supported. (ETROG-3208)
When guessHebrewPrefixes is true, unrecognized Hebrew tokens will now get analyses with and without potential prefixes. Previously, they would only get analyses with potential prefixes. (ETROG-3188)
[en] Example: Token: "ומפיפרנו"
-
[en] Previously: 2 analyses:
[en] hebrewPrefixes=[ו] lemma=מפיפרנו
[en] hebrewPrefixes=[ו, מ] lemma=פיפרנו
-
Now: 3 analyses:
[en] hebrewPrefixes=[ו] lemma=מפיפרנו
[en] hebrewPrefixes=[ו, מ] lemma=פיפרנו
[en] hebrewPrefixes=[] lemma=ומפיפרנו
[en] Bug Fixes
-
[en] Minimally-qualified emoji are no longer split apart. (ETROG-3185)
Example: The emoji for "man tipping hand" (<U+1F481, U+200D, U+2642>)
[en] Previously: U+1F481 and <U+200D, U+2642> (2 tokens)
[en] Now: <U+1F481, U+200D, U+2642> (1 token)
-
[en] Capitalized nouns are no longer being detected as verbs. (ETROG-3186)
[en] Example: The noun "Service" from the phrase "Price and Quality of Service"
When creating multiple analyzers for Chinese, Japanese, or Thai with alternativeTokenization set to false (the default), the analyzers now share the same model data. This reduces memory usage when creating multiple analyzers. (ETROG-3200)
Note: While memory usage has been improved, the process is still memory intensive. If RBL throws an OutOfMemoryError, increase the heap space.
[en] Release 7.34.1.c62.2
[en] March 2020
[en] Bug Fixes
[en] Release 7.34.0.c62.2
[en] March 2020
[en] New Features
[en] Lucene/Solr: RBL-JE now supports Lucene/Solr up through version 8.4.1. (ETROG-3156)
[en] Unicode 13.0 emojis: Unicode 13.0 emoji sequences are now tokenized. (ETROG-3164)
[en] Additional emoji support: Emoji hair components are now lemmatized. (ETROG-3167)
[en] German professions: Additional German professions have been added to the German lexicon. (ETROG-3163)
Spanish performance improvements: Spanish disambiguation is now faster when alternativeSpanishDisambiguation is false. (ETROG-3169)
[en] Hebrew lemmatization: We increased proper noun coverage in the Hebrew lexicon. (ETROG-3161, ETROG-3162)
[en] Bug Fixes
[en] Low surrogates are no longer stripped from the ends of tokens in Hebrew. (ETROG-3165)
Number tokens with embedded spaces are no longer split into multiple tokens when preceded or followed by a symbol when fstTokenize is true. (ETROG-3158)
[en] Release 7.33.0.c62.2
[en] January 2020
[en] New Features
[en] The delimiters for the fragment boundary detector are now configurable. (ETROG-3116)
[en] The fragment boundary detector now marks a boundary after any spaces following the fragment boundary delimiter. (ETROG-3116)
An underscore (U+005F) is no longer treated as a token separator in German when fstTokenize is enabled. (ETROG-3144)
[en] Bug Fixes
We fixed a bug where tokens from multi-script Russian text sometimes had incorrect offsets if fstTokenize was enabled. (ETROG-3142)
[en] We fixed a bug where multi-script Russian text would have a sentence break each time the script changed. (ETROG-3145)
[en] We fixed a bug where there were unexpected sentence breaks after some short lines not ending in whitespace. (ETROG-3146)
[en] We fixed a bug where sentence breaks were missing when the sentence break did not align with a token boundary. (ETROG-3140)
[en] Release 7.32.0.c62.1
[en] December 2019
[en] New Features
[en] Added support for Lucene/Solr up through version 8.3.0. (ETROG-3128)
[en] Added support for tokenizing and lemmatizing Latvian. (ETROG-2798)
[en] Latin-script regions within Russian documents are now tokenized and analyzed as English. (ETROG-3126)
TokenizerOption.licenseString, AnalyzerOption.licenseString, and BaseLinguisticsOption.licenseString may now be passed into a create method. Previously, these options had to be set on the factory itself. (ETROG-3134)
[en] Bug Fixes
[en] We fixed a bug where guessed German compounds were sometimes lemmatized as verbs but tagged as nouns. (ETROG-3094)
[en] We fixed a bug where the fragment boundary detector would mark a sentence break after every Windows newline. (ETROG-3133)
[en] Release 7.31.0.c62.0
[en] November 2019
[en] New Features
The Hebrew files dinflections.bin, dprefixes.data, and gimatria.data have been moved from the root/models directory to root/dicts/heb. (ETROG-3088)
Specifying the universalPosTags option now adds the deliverExtendedTags option as well. (ETROG-2185)
[en] Dynamic user dictionaries can now be created and populated at runtime. See the section User-Defined Dictionaries in the Application Developer's Guide for details. (ETROG-3086, ETROG-3100, ETROG-3109, ETROG-3110, ETROG-3111)
[en] Fragment boundary detection is now enabled by default. Previously it was disabled by default. (ETROG-3108)
TokenizerOption.alternativeTokenizationOptions has been deprecated in favor of separate options for each YAML key. See the Javadoc for details. (ETROG-3109)
The UPT-16 files upt-16-pes.yaml and upt-16-prs.yaml have been removed from the distribution package, as they were unused. (ETROG-3122)
The -order option in rbl-build-csc-dictionary has been removed. All dictionaries are now built as little-endian (LE), because LE dictionaries still work on big-endian (BE) machines. (ETROG-3120)
[en] We've added imperative forms for 2000 verbs to the Arabic lexicon. (ETROG-3090)
[en] Bug Fixes
[en] Fragment boundary detection is now enabled for Hebrew. (ETROG-1442)
[en] When lemmatizing numbers in Russian, numbers containing spaces will now be lemmatized without the space. For example, "1 234" will now be lemmatized as "1234" instead of "1 234". (ETROG-3101)
We fixed a bug introduced in 7.30.1.c61.0 which raised an ArrayIndexOutOfBoundsException when processing Japanese with alternativeTokenization and favorUserDictionary set to true. (ETROG-3118)
We fixed a bug where a middle dot would be ignored if it preceded white space when using alternativeTokenization in Japanese. (ETROG-3113)
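The Russian number lemmatization change in this list ("1 234" lemmatized as "1234") amounts to dropping the group separators from a digit string. A minimal sketch follows, assuming a token that consists only of digit groups separated by single spaces. NumberLemma is a hypothetical class, and accepting the no-break space (U+00A0) and narrow no-break space (U+202F) as separators is an added assumption, not something the note states.

```java
// Hypothetical helper, not part of RBL-JE.
class NumberLemma {
    // Digit groups separated by exactly one space-like character.
    private static final String SPACED_NUMBER = "[0-9]+(?:[ \\u00A0\\u202F][0-9]+)*";
    private static final String SEPARATORS = "[ \\u00A0\\u202F]";

    // Removes the separators from a spaced number; any other token is returned unchanged.
    static String lemmatize(String token) {
        if (!token.matches(SPACED_NUMBER)) {
            return token;
        }
        return token.replaceAll(SEPARATORS, "");
    }

    public static void main(String[] args) {
        System.out.println(lemmatize("1 234"));
    }
}
```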
[en] Third-party component updates
[en] Release 7.30.2.c61.0
[en] September 2019
[en] Bug Fixes
We fixed a bug where an AssertionError might be thrown when analyzing Hungarian with Java assertions enabled.
Russian words hyphenated with a number are now tagged with the part of speech of the word without the number.
Previously: Аполлона-11 (Apollo-11) was tagged as PROP, MISC, and NOUN
Now: Аполлона-11 (Apollo-11) is tagged as NOUN
Correct token offsets are now returned from a Japanese annotator where a non-katakana character precedes a user-defined katakana token and alternativeTokenization and favorUserDictionary are enabled.
We fixed a bug where constructors of factory classes in the Lucene/Solr plugin would throw an UnsupportedOperationException if passed a Map that did not support the remove method.
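The last fix in this list concerns maps that reject remove, such as those returned by Map.of or Collections.unmodifiableMap. One common way to tolerate such input is to copy the arguments into a mutable map before consuming them. The sketch below illustrates that pattern only: FactoryArgs is a hypothetical class and is not the plugin's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical argument holder, not part of the RBL-JE Lucene/Solr plugin.
class FactoryArgs {
    private final Map<String, String> remaining;

    // Defensive copy: the caller's map is never mutated, so it may be immutable.
    FactoryArgs(Map<String, String> args) {
        this.remaining = new HashMap<>(args);
    }

    // Returns the value for the key and marks it as consumed; null if absent.
    String take(String key) {
        return remaining.remove(key);
    }

    // True once every argument has been consumed.
    boolean isEmpty() {
        return remaining.isEmpty();
    }

    public static void main(String[] args) {
        Map<String, String> immutable = Map.of("language", "jpn");
        FactoryArgs factoryArgs = new FactoryArgs(immutable);
        System.out.println(factoryArgs.take("language"));
        System.out.println(factoryArgs.isEmpty());
    }
}
```

Calling remove directly on the map from Map.of would throw UnsupportedOperationException, which is the failure mode described in the fix.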
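August 2019
New Features
Segmentation user dictionaries can now be used for all languages, not just Chinese, Japanese, and Thai.
The option compoundComponentSurfaceForms has been added to return the surface forms of compound components. By default, RBL-JE returns only the lemmas.
Lucene/Solr up through version 8.1.1 is now supported.
Lemmas of Polish words ending in "-cku", "-ska", and "-sku" now end in "-cki" or "-ski".
Bug Fixes
The Japanese POS tag NE is now converted to UPT-16 correctly.
The French POS tag CONJQUE was previously converted to UPT-16 CONJ. It is now converted to the more appropriate SCONJ.
When alternativeTokenization is disabled, Chinese punctuation is now tagged as PUNCT or EOS. Previously, it was tagged as GUESS.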
2019年6月
新機能
[en] Third-party component updates
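[en] May 2019
[en] New
[en] Bug Fixes
[en] The surface form of a Hebrew token that consists of multiple prefixes without a stem, such as "מה", is now the entire token text rather than just the first prefix.
[en] Hyphenated Russian words that end in a number, such as "Аполлона-11", are no longer tagged as DIG. They are tagged with the same parts of speech as in 7.27.2.c60.0 and earlier.
[en] When URL tokenization is enabled, a closing parenthesis, square bracket, or curly brace that follows a URL is no longer merged into the URL.
[en] The disambiguation module is now more likely to select analyses with the POS tags ATMENTION, EMAIL, HASHTAG, and URL over other analyses.
[en] When the Hebrew tokenizer encounters a character not used in Hebrew immediately after a character that is used in Hebrew, it now starts a new token. Previously, that character and all following characters up to the next token separator (such as white space) were dropped.
[en] RBL-JE can now load ICU tokenization rule files that begin with a BOM.
[en] Hebrew tokens that consist of multiple prefixes without a stem are now tagged with the part of speech "unknown", consistent with single-prefix tokens.
[en] The English token "than" is now tagged only as COTHAN. The candidate part of speech COORD has been removed from this token.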
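[en] For readers unfamiliar with the BOM issue in the list above: a UTF-8 file saved with a byte order mark decodes in Java to text whose first character is U+FEFF, which a rule parser can then reject. The sketch below shows one generic way to tolerate it. The class and method names are hypothetical, and this is not a description of how RBL-JE itself loads rule files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: BomSketch is a hypothetical helper, not an RBL-JE class.
final class BomSketch {

    private static final char BOM = '\uFEFF';

    // Drops a single leading U+FEFF if present; otherwise returns the text unchanged.
    static String stripBom(String text) {
        return (!text.isEmpty() && text.charAt(0) == BOM) ? text.substring(1) : text;
    }

    // Reads a UTF-8 rule file and removes the BOM that Java's UTF-8 decoder leaves in place.
    static String readRules(Path path) throws IOException {
        return stripBom(new String(Files.readAllBytes(path), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("rules", ".txt");
        try {
            Files.write(tmp, "\uFEFF!!chain;".getBytes(StandardCharsets.UTF_8));
            System.out.println(readRules(tmp));
        } finally {
            Files.delete(tmp);
        }
    }
}
```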
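[en] May 2019
[en] New
[en] A perceptron-based disambiguator is now available for Hebrew. The DisambiguatorType option is set to DisambiguatorType.PERCEPTRON by default. It analyzes lemmas and parts of speech more accurately than the previous processor. To restore the previous behavior, set DisambiguatorType to DisambiguatorType.DICTIONARY. (ETROG-2985, ETROG-3006)
[en] Added support for Lucene/Solr up to version 7.6.0. (ETROG-2996)
[en] Added support for Java 11. (ETROG-2999)
[en] Bug Fixes
[en] With alternative tokenization enabled, some whitespace characters could become part of Chinese tokens.
[en] Tokens thousands of characters long slowed down the tokenizer.
[en] Polish tokens that appear in multiword expressions are no longer lemmatized to the full expression. For example, "dzień" is not lemmatized as "dzień_dobry".
[en] Non-final elements of Russian compounds joined by one or more hyphens were not lemmatized. Non-final elements of hyphenated Russian compounds that have the linking vowel "е" or "о" and happened to look like short-form adjectives were lemmatized as if they were short forms.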
[en] RBL-JE Release Note Archive 7.27.0.c60.0 and earlier
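[en] The Chinese Script Converter is now licensed separately. It is not included in the RBL Chinese license and must be purchased separately. (ETROG-2916)
[en] Added lemma support for Persian. (ETROG-2924)
[en] Added a disambiguator option for Hebrew; the existing dictionary-based model remains the default. To use the TensorFlow-based deep learning model, set the disambiguationType option to DisambiguatorType.DNN. (ETROG-2928)
[en] For German, characters that can be safely ignored are now removed from the results. (ETROG-2824)
[en] Improved lemma accuracy for Spanish. (ETROG-2856)
[en] Improved disambiguation accuracy for English. (ETROG-2867)
[en] The North Korean (qkp) and South Korean (qkr) dialects are both treated as Korean (kor). (ETROG-2878)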
[en] Release 7.24.0.c59.2
[en] Added support for tokenizing and lemmatizing Catalan, Estonian, Serbian, and Slovak. (ETROG-2752, ETROG-2774)
[en] Release 7.23.1.c59.0
[en] Release 7.23.0.c59.0
[en] Added support for Lucene/Solr 7.0.0 through 7.1.0. (ETROG-2706)
[en] POS-tagging and disambiguation are supported for Hebrew. (ETROG-2707, ETROG-2717)
[en] Release 7.22.2.c59.0
[en] Release 7.22.0.c59.0
[en] Release 7.21.2.c59.0
[en] Release 7.21.1.c59.0
[en] Release 7.21.0.c58.3
[en] Added the ArabicMorphoAnalysis class to allow an Annotated Data Model application to get more information for Arabic, Persian, and Urdu text than the MorphoAnalysis class would provide. (ETROG-2623)
[en] Improved speed and memory footprint for English and Spanish disambiguation. (ETROG-2607, ETROG-2618, ETROG-2635)
[en] Added the alternativeEnglishDisambiguation and alternativeSpanishDisambiguation options to specify the use of the old disambiguator in English and Spanish. The new disambiguator, introduced in version 7.18.0.c58.3, and enhanced in the current release, is more accurate, but slower. (ETROG-2626)
[en] Added the guessHebrewPrefixes option to control whether to split possible prefixes off unknown Hebrew words. (ETROG-2642)
[en] Normalized U+05F3 HEBREW PUNCTUATION GERESH and U+05F4 HEBREW PUNCTUATION GERSHAYIM to U+0027 APOSTROPHE and U+0022 QUOTATION MARK in Hebrew. (ETROG-2647)
[en] Filter out punctuation from Lucene/Solr when query is set. (ETROG-2648)
[en] Added support for Lucene/Solr 6.6. (ETROG-2656)
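[en] The Hebrew punctuation normalization listed above (U+05F3 to U+0027, U+05F4 to U+0022) amounts to a two-character mapping. The following sketch shows that mapping in isolation. The class and method names are hypothetical and the code is not taken from RBL-JE; it only makes the described substitution concrete.

```java
// Illustrative only: HebrewPunctuationSketch is a hypothetical helper, not an RBL-JE class.
final class HebrewPunctuationSketch {

    // Maps HEBREW PUNCTUATION GERESH (U+05F3) to APOSTROPHE (U+0027)
    // and HEBREW PUNCTUATION GERSHAYIM (U+05F4) to QUOTATION MARK (U+0022).
    static String normalize(String text) {
        return text.replace('\u05F3', '\'').replace('\u05F4', '"');
    }

    public static void main(String[] args) {
        // An abbreviation written with gershayim between its last two letters.
        String withGershayim = "\u05E6\u05D4\u05F4\u05DC";
        System.out.println(normalize(withGershayim));
    }
}
```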
[en] Release 7.20.0.c58.3
[en] Added tokenization and POS-tagging for at-mentions and hashtags in all languages. (ETROG-2571)
[en] Added the options atMentions, emailAddresses, emoticons, hashtags, and urls to enable tokenization and POS-tagging of @mentions, email addresses, emoticons, hashtags, and URLs. They are all disabled by default. (ETROG-2583)
[en] Release 7.19.0.c58.3
[en] Release 7.18.0.c58.3
[en] Implemented the many-to-one normalizer. (ETROG-1961)
[en] Deprecated many classes and methods that are for internal use only. (ETROG-2065)
[en] Added BaseLinguisticsFactory#addUserCscDictionary. (ETROG-2098)
[en] Removed obsolete big-endian models and dictionaries. (ETROG-2214)
[en] Overhauled RBLCmd. ANNOTATE is the default command. -showTokenDetails, -showRawResults, and -verboseResults are removed. -inputJson interprets the input as an ADM. -outputJson is a boolean option. (ETROG-1392, ETROG-2343)
[en] Decomposed compound verbs in Japanese when using alternativeTokenization. (ETROG-2350)
[en] Introduced more advanced disambiguation for English and Spanish. (ETROG-2367, ETROG-2370, ETROG-2372, ETROG-2371, ETROG-2467)
[en] Improved decompounding accuracy in Dutch. (ETROG-2408)
[en] Added tokenization, lemmatization, and POS-tagging for emoticons and emoji in all languages. (ETROG-2474, ETROG-2512, ETROG-2516, ETROG-2520, ETROG-2522, ETROG-2538)
[en] Supplemented analysis dictionaries for English and Spanish. (ETROG-2481, ETROG-2532, ETROG-2535)
[en] Added support for Lucene/Solr 6.3. (ETROG-2501)
[en] Introduced the ability to specify a user-defined reading dictionary in Lucene/Solr (userDefinedReadingDictionaryPath). (ETROG-2527)
[en] Release 7.17.2.c58.3
[en] Release 7.17.1.c58.3
[en] Release 7.17.0.c58.2
[en] Release 7.16.0.c58.2
[en] The Chinese script converter is an entitlement with a standard Chinese license. (ETROG-1605)
[en] Arabic reh is normalized as a decimal separator in numeric contexts. (ETROG-1650)
[en] Provide disambiguation of Dutch compounds. (ETROG-1736)
[en] A custom reading dictionary can be specified on the RBLCmd command line. (ETROG-1938)
[en] Alternative tokenization options are included in BaseLinguisticsOption. (ETROG-1946)
[en] Improve speed by caching Arabic analyses. (ETROG-1992)
[en] Added support for alternative Chinese segmentation. (ETROG-2034)
[en] Return Hebrew sentence boundaries. (ETROG-2036)
[en] Added support for POS tag mappings for alternative Japanese and Chinese segmentation. (ETROG-2152)
[en] Changed CompoundDictionary to provide its components in an order that reflects the contents of the lemma it returns. (ETROG-2154)
[en] AnalyzerFactory#addUserAnalysisDictionary now throws an informative exception when either the root or dictionary directory is invalid. (ETROG-2166)
[en] Augmented RBLCmd with the ability to return the RBL-JE version number. (ETROG-2168)
[en] Improve handling of hiragana tokens homophonous to verbs in the alternative Japanese tokenizer (JLA). (ETROG-2188)
[en] Improve handling of POS-ambiguous verb stems in the alternative Japanese tokenizer (JLA). (ETROG-2189)
[en] The RBLCmd help command now sorts its options alphabetically. (ETROG-2195)
[en] Han readings are now returned for all Katakana tokens. (ETROG-2208)
[en] In the Russian FST tokenizer, initials are tokenized and given the +Init morpho-tag. (ETROG-2209)
[en] Memory requirements of the FST tokenizer were reduced. (ETROG-2200, ETROG-2226)
[en] Reduce the memory allocated for tokens by the FST tokenizer. (ETROG-2235)
[en] Terminated support for Lucene/Solr 4.1-4.2. Added support for Lucene/Solr 6.0-6.1. (ETROG-2016, ETROG-2241, ETROG-2299)
[en] Release 7.15.0.c57.2
[en] Note: 7.15.0 was forked directly from 7.14.0 and thus does not have the changes in 7.14.1+.
[en] Release 7.14.0.c57.2
[en] The specification of options to RBLCmd was refactored. (ETROG-1503)
[en] Added UPT-16 support for Persian and Urdu. (ETROG-1830)
[en] Changed UPT-16 mappings for Czech and Hungarian numbers. (ETROG-1841)
[en] Removed incorrect analyses for Polish adjectives and participles ending in m/mi. (ETROG-1916)
[en] Removed archaic Polish analyses containing "być". (ETROG-1917)
[en] Added raw analyses for English contractions. (ETROG-1944)
[en] The command line tool RBLCmd supports Hebrew tokenization. (ETROG-1973)
[en] Added support for Finnish stemming. (ETROG-2012)
[en] Removed the spurious generation of an accusative case analysis for some Polish nouns. (ETROG-2020)
[en] The Hebrew tokenizer overzealously guessed that periods were part of an abbreviation. (ETROG-2024)
[en] Refactored the position metadata for Lucene tokens of compound components. (ETROG-2042)
[en] Lucene tokens for components of a contraction are identified with type "CONT". To invoke this functionality, set FilterOption.identifyContractionComponents to true. (ETROG-2044)
[en] AnalysesAttributes are formatted as JSON in Elasticsearch. (ETROG-2057)
[en] Release 7.13.0.c56.6
[en] Added API support for Lucene & Solr 5.0-5.3. (ETROG-1647)
[en] Added support for Persian and Urdu. (ETROG-1636, ETROG-1667)
[en] The 'nor' (Norwegian) language code is accepted. (ETROG-1690)
[en] Exposed support for using the Rosette Annotated Data Model (ADM) to perform RBL-JE operations. (ETROG-1713)
[en] The Arabic analysis candidate generation code now uses the same algorithm that the Arabic Language Processor in the native (C++) version of Rosette Base Linguistics does. (ETROG-1722)
[en] Provided an alternative Japanese analyzer. This provides parity with the Japanese analyzer in Basis Technology's C++ API (RLP). It offers improved accuracy with query strings and names and provides greater user control of the analysis. (ETROG-1727)
[en] For English, Portuguese, and German text, added ADM support for splitting contractions and analyzing the constituents. (ETROG-1769)
[en] Provided support for returning the set of 16 universal part-of-speech (POS) tags rather than the set of 12 that were introduced in version 7.12.0. (ETROG-1771)
[en] The RBLCmd tool now lists the BaseLinguisticsOption options. To use these options you must set analyzerType=none, lang, and BaseLinguisticsOption.language. (ETROG-1862)
[en] Release 7.12.1.c56.6
[en] Version 7.12.1.c56.6 introduced the use of the "compatibility" version number extension (c56.6 in this case). If you intend to use more than one Basis JVM SDK (e.g. RBL-JE, RLI-JE, REX-JE) in a single application, then choose versions that have the same compatibility number. (ETROG-1700)
[en] Moved the Tokenize and Analyze samples into samples/tokenize-analyze and created a single Ant build script to compile and run both samples. (ETROG-1264)
[en] Provided support for returning universal part-of-speech (POS) tags rather than the language-specific POS tags we already return. The universal tags (UPT) are coarser than the language-specific tags, but enable tracking and comparison across languages. (ETROG-1472)
[en] Added support for returning a disambiguated analysis for each token in Japanese text. For performance, this feature is turned off by default. (ETROG-1324)
[en] Added support for returning morphological tags, where available, and placed an example illustrating the procedure for obtaining morphological tags in samples/morpho-tags. (ETROG-1485)
[en] Removed a small number of dubious acronym expansions from the lemmatization of English, French, Italian, German, Spanish, and Portuguese input. (ETROG-1547)
[en] Improved the German lemma parser, which now returns the same lemma for German nouns that differ only in gender. (ETROG-1548)
[en] Added API support for Lucene & Solr 4.10. (ETROG-1571)
[en] Enhanced support for Korean linguistic analysis, and integrated a guesser for generating morphemes, morpheme tags, compound components, and parts of speech. (ETROG-1486, ETROG-1512, ETROG-1528)
[en] Added support for Korean user lemma dictionaries. (ETROG-1518)
[en] Added stop words to the Japanese analysis dictionary. (ETROG-1525)
[en] Added the Chinese Script Converter, which can convert tokens in Traditional Chinese text to Simplified Chinese and vice versa. (ETROG-1462)
[en] Terminated support for Lucene/Solr 3.6. (ETROG-1298)
[en] Implemented support for Chinese part-of-speech (POS) tags and readings. (ETROG-1280)
[en] Added support for normalization of Chinese and Japanese numbers. (ETROG-1310)
[en] Implemented generation of Korean part-of-speech (POS) tags. (ETROG-1357)
[en] Added a tool for building user dictionaries. (ETROG-210)
[en] For those cases in which you want to use your own whitespace tokenizer and you are processing text that requires segmentation (such as Chinese, Japanese, or Thai), we have added support for a base linguistics segmentation token filter to be used after a whitespace tokenizer and before other filters, such as a base linguistics token filter. See the Javadoc for the RBL-JE API for Lucene 4.3-4.7. (ETROG-1240)
[en] For Japanese, modified the base linguistics token filter to exclude lemmas for auxiliary verbs, particles, and adverbs from the token stream. (ETROG-1217)
[en] Added support for using AnalysesAttribute to get the analyses and disambiguated analysis for each token in a token stream. (ETROG-1279)
[en] Added SLF4J support for logging RBL-JE applications. (ETROG-1318)
[en] Added support for turning case sensitivity on/off when analyzing text. (ETROG-1365)
[en] Deprecated void com.basistech.rosette.bl.AnalyzerFactory#addUserDefinedDictionary(LanguageCode language, String path) in favor of void com.basistech.rosette.bl.AnalyzerFactory#addUserDefinedDictionary(LanguageCode language, String path, EnumMap<AnalyzerOption, String> options), where options is used to set AnalyzerOption.caseSensitive to "true" or "false".
[en] Unused analyzer parameter removed from the BaseLinguisticsSegmentationTokenFilter constructor. (ETROG-1316)
[en] Updated the Japanese normalization dictionary. (ETROG-1229)
[en] Added API support and samples for Lucene 4.9. (ETROG-1446)
[en] Added a Lucene Analyzer that combines the RBL-JE Tokenizer and TokenFilter, along with the LowerCaseFilter, CJKWidthFilter, and optional support for the StopFilter: com.basistech.rosette.lucene.BaseLinguisticsAnalyzer. Added a Lucene 4.3-4.7 sample application that illustrates its use. (ETROG-1138, ETROG-1172)
[en] Improved support for returning Japanese Hiragana readings. The API for adding readings to the token stream has moved from TokenizerFactory#SetOption to BaseLinguisticsTokenFilter#setAddReadings. You can also add ("addReadings", "true") to the map of options you use to instantiate the BaseLinguisticsAnalyzer. (ETROG-1054)
[en] Added support for Japanese Hiragana readings.
[en] Factored in support for Lucene 3.6, 4.1-4.2, and 4.3.
[en] For this release, this product has been refactored and renamed to Rosette Base Linguistics Java Edition. This release concentrates on the core API instead of implementations for different versions of Lucene and Solr. This release returns part-of-speech tags for a core set of European languages and Japanese.
[en] For licensing and business reasons, support for Bulgarian, Catalan, Estonian, Croatian, Indonesian, Latvian, Malay, Slovak, Slovenian, Serbian, Albanian, and Ukrainian has been removed from the RSE package. (ETROG-921)
[en] Added support for tokenizing and lemmatizing Arabic, Czech, Hungarian, Korean, and Turkish. (ETROG-876)
[en] Added support for segmenting (tokenizing) Thai. (ETROG-448)
[en] Added a tokenizer option (turned off by default) for returning Hebrew roots. (ETROG-788)
[en] Changed required Java platform from 1.5 to 1.6. (ETROG-765)
[en] Added support for using RSE with LucidWorks Enterprise 1.7, which supports a pre-release version of Lucene and Solr 4.0.
[en] Added support for tokenizing and lemmatizing Albanian, Bulgarian, Catalan, Croatian, Estonian, Greek, Hebrew, Indonesian, Latvian, Malay, Polish, Serbian, Slovakian, Slovenian, Russian, and Ukrainian. (ETROG-656, 658, 668, 677)
[en] Added a command line driver for running RSE. For usage details, see the Javadoc for com.basistech.rosette.bl.RBLCmd. (ETROG-603)
[en] Added support for tokenizing and lemmatizing Norwegian Nynorsk text. (ETROG-637)
[en] Consolidated support for Lucene 2.2, Lucene 2.4, Lucene 2.9, Lucene 3.1, Solr 1.3, Solr 1.4, and Solr 3.1 in a single SDK package with an associated documentation package.
[en] Deprecated support in the com.basistech.rosette.breaks package (GenericTokenizer and TokenizerOption) for returning EOS (end-of-sentence) tokens. includeEOS is off by default and should not be turned on; it interferes with Lucene searches. (ETROG-706)
[en] Deprecated Lucene 2.9 LemmaFilterFactory.supportedLanguages(). Use getSupportedLanguages() instead. (ETROG-726)
[en] Added support for Lucene 3.0.
[en] Improved support for Japanese and Chinese tokenization.
[en] Added the Japanese lemmatization dictionary and support of Japanese lemma user dictionaries. The Japanese lemmatization dictionary also provides orthographic normalization in the case of Katakana spelling variants and input text with archaic Kanji.
[en] Added the production of normalized numbers to the lemmatization process.
[en] Added support for Chinese lemma user dictionaries. Apart from numbers, which are already handled by the lemma guesser, lemmas do not ordinarily apply to Chinese, but a lemma user dictionary may be used for orthographic normalization.
[en] Added support for Danish and Norwegian (Bokmål). Improved support for Chinese token segmentation and Romanian.
[en] To enhance clarity and consistency, and to avoid duplication of package names in class names, made a number of API changes that are not backwards compatible.
[en] Renamed some factory classes: (ETROG-436)
[en] All these factory classes include a create() method for instantiating the Tokenizer or LemmaFilter. The getTokenFilter(), getLuceneTokenizer(), and getLemmatizer() methods have been removed.
[en] Promoted classes introduced in Release 1.5.beta.1 for setting tokenizer and lemmatizer options from inner Enums to top-level Enums: com.basistech.rosette.breaks.TokenizerOption and com.basistech.rosette.bl.LemmatizerOption. (ETROG-434)
[en] Removed the TokenizerFactory, LemmaFilterFactory and LemmatizerFactory option-specific methods for setting options that predate the introduction of setOption().
[en] The com.basistech.breaks.BreakerFactory methods for creating breakers have been renamed.
[en] Added support for Chinese, and limited support for Japanese. For these languages, RSE adds statistically trained models/dictionaries to enable the tokenization of non-whitespace-delimited text. Support for user dictionaries has also been expanded to include token dictionaries for Chinese, Japanese, and Thai.
[en] Enhanced support for Dutch, Italian, and Portuguese.
[en] Replaced Lucene 2.9 and Solr 1.4 packages with Lucene 3.0 package.
[en] Revised the API for defining tokenizer and lemmatizer options.
[en] Reorganized the documentation to reflect standard RSE usage patterns.
[en] Compiling a Swedish User Dictionary. As described in the RSE Application Developer's Guide, you must use RLP to create a user dictionary. "Chapter 12. User-Defined Data" in the RLP Application Developer's Guide provides instructions on creating the source file for a user-defined dictionary and compiling the dictionary. The current release of RLP (RLP 7.1.0) does not include support for creating a Swedish user dictionary. To create a Swedish dictionary, you must add a file that we provide in the extras directory to the corresponding location in your RLP installation: rlp/bl1/dicts/sv/tags.txt.
[en] When you create your source file, you can use [+DUMMY] as the POS tag for each entry.
[en] The syntax for compiling a Swedish user dictionary from rlp/bl1/dicts/tools is
build_user_dict.sh sv input output
[en] Removed Rosette Language Analyzer (RLI) 100% Java implementation, which is now a separate product.
[en] Provided separate SDK packages with support for Lucene 2.2, Lucene 2.4, and Lucene 3.0. (ETROG-198)
[en] Added TokenizerFactory, which provides a language-specific Tokenizer for parsing input text. In addition to using the Sentence Breaker and Word Breaker, the Tokenizer normalizes the tokens (Unicode NFC normalization and lowercasing). (ETROG-185)
[en] Added support for Swedish, including tokenization, lemmatization, and decompounding. (ETROG-201)
[en] Added preliminary, limited support for Dutch, Danish, Norwegian, Italian, Portuguese, and Romanian.
[en] Expanded support for German decompounding.
[en] Added support for generating a separate lemma for each space-delimited element in lemmas that contain whitespace.
[en] This distribution provides support for Lucene 2.2.
[en] Upgraded Token Filter Factory support from Lucene 2.2 to Lucene 2.4.
[en] Added the Rosette Language Identifier (RLI), Sentence Breaker, and Word Breaker:
[en] Introduced support for the creation of Lucene 2.2 Base Linguistics token filters for English, French, German, and Spanish text.
[en] Bugs fixed in 7.24.5.c59.2
[en] Bugs fixed in 7.24.4.c59.2
[en] Bugs fixed in 7.24.2.c59.2
[en] Bugs fixed in 7.24.1.c59.2
[en] Bugs fixed in 7.24.0.c59.2
[en] Bugs fixed in 7.23.3.c59.0
[en] Bugs fixed in 7.23.2.c59.0
[en] Bugs fixed in 7.23.0.c59.0
[en] Bugs fixed in 7.22.2.c59.0
[en] Bugs fixed in 7.22.1.c59.0
[en] Bugs fixed in 7.22.0.c59.0
[en] Bugs fixed in 7.21.3.c59.0
[en] Bugs fixed in 7.21.0.c58.3
[en] Bugs fixed in 7.20.4.c58.3
[en] Bugs fixed in 7.20.3.c58.3
[en] Bugs fixed in 7.20.2.c58.3
[en] Bugs fixed in 7.20.1.c58.3
[en] Bugs fixed in 7.20.0.c58.3
[en] Bugs fixed in 7.19.0.c58.3
[en] Bugs fixed in 7.18.0.c58.3
[en] Bugs fixed in 7.17.0.c58.2
[en] Bugs fixed in 7.16.1.c58.2
[en] Bugs fixed in 7.16.0.c58.2
[en] Bugs fixed in 7.14.2.c57.2
[en] Bugs fixed in 7.14.1.c57.2
[en] Bugs fixed in 7.14.0.c57.2
[en] Bugs fixed in 7.13.0.c56.6
[en] Bugs fixed in 7.12.1.c56.6
[en] Bugs fixed in 7.12.0
[en] Bugs Fixed in 1.10.1
[en] Third-Party Components
[en] For a list of third-party components that are used in Basis Technology products, see ThirdPartyLicenses.txt.
[en] Third-party component updates in 7.27.1.c60.0
[en] Third-party component updates in 7.26.4.c60.0
[en] Third-party component updates in 7.25.0.c59.3
[en] Third-party component updates in 7.24.6.c59.2
[en] Third-party component updates in 7.24.0.c59.2
[en] Third-party component updates in 7.23.0.c59.0
[en] Third-party component updates in 7.21.1.c59.0
[en] Third-party component updates in 7.18.0.c58.3
[en] Third-party component updates in 7.16.0.c58.2
[en] Third-party component updates in 7.14.0.c57.2
[en] Third-party component updates in 7.13.0.c56.6
[en] Known Problems in 2.x
[en] If disambiguate is set to false, or if no disambiguator for the language exists, BaseLinguisticsTokenFilter does not set the type correctly for compound components when adding them to the token stream. It marks compound components as <LEMMA> instead of <COMP> when a non-disambiguating analysis is performed. (ETROG-1552)
[en] Known Problems in 1.8.0
[en] The prefixes and suffixes that the RSE tokenizer returns for Hebrew may include punctuation attached to the underlying tokens, such as parentheses (prefix, suffix) and comma (suffix). Accordingly, prefixes and suffixes are assigned a Token PositionIncrement of 1. A multicharacter prefix or suffix may be reported as a sequence of one-character prefixes or suffixes. (ETROG-697)
[en] Known Problems in 1.7.0
[en] Known Problems in 1.4.1
[en] To avoid a potential out-of-memory error, RSE does not attempt to decompound words longer than 30 characters. For languages with support for decompounding, if a word is longer than 30 characters and is not found in a user dictionary or the standard dictionary, RSE classifies the word as a guessed lemma. (ETROG-191)
[en] Known Problems in 1.4.0
[en] Inconsistent handling of numbers and punctuation during lemmatization. (ETROG-266)
[en] RSE expects valid Unicode strings as input. If the input includes illegal Unicode sequences, such as un-paired UTF-16 surrogate characters, the behavior is undefined. (ETROG-284)
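[en] Because behavior is undefined for ill-formed input, callers may want to screen strings before passing them on. The sketch below is one possible pre-check using only standard Java; the class and method names are hypothetical and nothing here is part of RSE. It reports whether every UTF-16 surrogate in a string is correctly paired.

```java
// Illustrative only: SurrogateCheckSketch is a hypothetical helper, not an RSE class.
final class SurrogateCheckSketch {

    // Returns true if every high surrogate is immediately followed by a low surrogate
    // and no low surrogate appears on its own.
    static boolean isWellFormedUtf16(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // high surrogate without its partner
                }
                i++; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no preceding high surrogate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedUtf16("abc"));
        System.out.println(isWellFormedUtf16("a\uDE00"));
    }
}
```

[en] A caller could reject or repair input for which this check returns false before analysis.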
[en] Known Problems in 1.3-beta and 1.4.x
[en] Incorrect capitalization in some lemmas, including some German compounds (e.g., unAbhängigkeit).
[en] Incorrect lemma formation of some words with suffixes (e.g., Brötchen).
[en] Over-generation of German compound components (e.g., übergreifen, über, and greifen as separate components).
[en] Failure to recognize some extended written-out German numbers (e.g., zweitausendzwölf).