RLI is a 100% Java implementation of the Rosette Language Identifier. It supports the detection of language, encoding, and writing script for input data in any of 364 language profiles, involving 56 languages, 48 encodings, and 18 writing scripts.
September 2023
This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.
Third-Party Component Updates
Table 1. Updated
Package | Old Version | New Version |
Jackson Annotations | 2.15.0 | 2.15.2 |
Jackson Core | 2.15.0 | 2.15.2 |
Jackson Databind | 2.15.0 | 2.15.2 |
Jackson Dataformat XML | 2.15.0 | 2.15.2 |
Jackson Dataformat YAML | 2.15.0 | 2.15.2 |
Jackson Module: Old JAXB Annotations | 2.15.0 | 2.15.2 |
Guava: Google Core Libraries for Java | 31.1-jre | 32.1.2-jre |
June 2023
This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.
Third-Party Component Updates
Table 2. Updated
Package | Old Version | New Version |
Apache Log4J | 2.19.0 | 2.20.0 |
fastutil | 8.5.9 | 8.5.12 |
Jackson Annotations | 2.14.0 | 2.15.0 |
Jackson Core | 2.14.0 | 2.15.0 |
Jackson Databind | 2.14.0 | 2.15.0 |
Jackson Dataformat XML | 2.14.0 | 2.15.0 |
Jackson dataformats: Text | 2.14.0 | 2.15.0 |
Jackson modules: Base | 2.14.0 | 2.15.0 |
SnakeYAML | 1.33 | 2.0 |
March 2023
This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.
Third-Party Component Updates
Table 3. Updated
Package | Old Version | New Version |
Google Guava | 26.0-jre | 31.1-jre |
December 2022
Bug Fixes
Third-party component updates
This release includes the following third-party component changes:
Table 4. Upgraded
Package | Old version | New version |
Apache Log4j | 2.17.1 | 2.19.0 |
fastutil | 8.5.6 | 8.5.9 |
Jackson | 2.11.1 | 2.14.0 |
SLF4J | 1.7.33 | 1.7.36 |
SnakeYAML | 1.30 | 1.33 |
February 2022
This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.
Notice
Java 8 and Java 9 support is deprecated as of this release.
Third-party component updates
This release includes the following third-party component changes:
Table 5. Upgraded
Package | Old Version | New Version |
Apache Commons IO | 2.7 | 2.11.0 |
Apache Commons Lang | 2.6 | 3.12.0 |
Apache Log4j | 1.2.17 | 2.17.1 |
ICU4J | 59.1 | 70.1 |
fastutil | 8.4.0 | 8.5.6 |
SLF4J | 1.7.28 | 1.7.33 |
SnakeYAML | 1.26 | 1.30 |
May 2021
This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.
January 2021
New
We added -input-json (-ij) as an option to RLICmd to specify that the input is an ADM format file. (RLIJE-533)
When -output-json is specified as an option in RLICmd, the resulting ADM now has a data field containing the input data. If the encoding of the data is not recognized by the JVM, the data field is not populated. (RLIJE-532)
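The condition on the data field can be pictured with the standard java.nio.charset.Charset API. The following is a minimal sketch, not RLICmd's actual code: the class and method names are hypothetical, and it assumes that "recognized by the JVM" means Charset.isSupported returns true for the detected encoding name.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper for illustration; not part of the RLI-JE API.
final class DataFieldSketch {

    // Returns the decoded text when the JVM supports the named encoding,
    // or an empty Optional when it does not (the case in which the
    // data field would be left unpopulated).
    static Optional<String> decodeIfSupported(byte[] input, String encodingName) {
        try {
            if (!Charset.isSupported(encodingName)) {
                return Optional.empty();
            }
        } catch (IllegalCharsetNameException e) {
            // The name is not even a syntactically legal charset name.
            return Optional.empty();
        }
        return Optional.of(new String(input, Charset.forName(encodingName)));
    }

    public static void main(String[] args) {
        byte[] bytes = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeIfSupported(bytes, "UTF-8").isPresent());
        System.out.println(decodeIfSupported(bytes, "x-no-such-charset").isPresent());
    }
}
```

The set of supported charsets varies between JVM builds, so the same input file may produce a populated data field on one installation and not on another.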
Bug Fixes
Third-party component updates
This release includes the following third-party component changes:
September 2020
Bug Fixes
RLI-JE now correctly identifies the primary language of short documents which contain small fragments of a language in another script. Previously, the language of the fragments might be erroneously detected as the primary language. The lengths of the document's script regions are now taken into account when identifying the primary language. (RLIJE-523)
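To illustrate why region length matters, the sketch below counts how many code points of a document fall in each Unicode script and picks the script that covers the most text. It is not RLI-JE's algorithm or API: the class and method names are hypothetical, and it uses only the JDK's Character.UnicodeScript.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Illustrative only; not RLI-JE's algorithm or API.
final class ScriptLengthSketch {

    // Counts code points per Unicode script, skipping characters that
    // belong to no particular script (spaces, digits, punctuation).
    static Map<UnicodeScript, Integer> scriptLengths(String text) {
        Map<UnicodeScript, Integer> lengths = new EnumMap<>(UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            if (script != UnicodeScript.COMMON
                    && script != UnicodeScript.INHERITED
                    && script != UnicodeScript.UNKNOWN) {
                lengths.merge(script, 1, Integer::sum);
            }
        });
        return lengths;
    }

    // Returns the script covering the most code points, or UNKNOWN if none.
    static UnicodeScript dominantScript(String text) {
        UnicodeScript best = UnicodeScript.UNKNOWN;
        int bestLength = 0;
        for (Map.Entry<UnicodeScript, Integer> e : scriptLengths(text).entrySet()) {
            if (e.getValue() > bestLength) {
                best = e.getKey();
                bestLength = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // A mostly Cyrillic sentence (written with Unicode escapes)
        // that ends in the Latin word "Tokyo".
        String text = "\u042D\u0442\u043E \u0434\u043E\u043A\u0443\u043C\u0435\u043D\u0442 Tokyo";
        System.out.println(scriptLengths(text));
        System.out.println(dominantScript(text));
    }
}
```

In this sample the Latin fragment is much shorter than the Cyrillic text, so a length-aware approach would not let it decide the primary language.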
API Changes
January 2020
This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.
December 2019
New Features
Bugs Fixed
August 2019
New Features
Added support for Albanian, Bulgarian, Catalan, Croatian, Estonian, Icelandic, Kurdish (Arabic script), Kurdish (Latin script), Latvian, Lithuanian, Macedonian, Polish, Serbian (Cyrillic script), Serbian (Latin script), Slovak, Slovenian, Somali, Tagalog, Ukrainian, Urdu (Arabic script), Uzbek (Cyrillic script), Uzbek (Latin script), and Vietnamese to the short string algorithm.
Rosette Language Identifier now returns Malaysian (zsm) instead of Malay (msa).
If shortStringThreshold is set, the LanguageRegionAnnotator will utilize short-string detection on sufficiently short script regions.
Rescaled confidence scores for the default (not short string) algorithm such that high-confidence results get scores around 0.9 instead of around 0.03. (RLIJE-447)
Added the option minNonScriptioContinuaRegionLength to control when short regions of scripts like Latin are merged into regions of scripts like Han. The default is to never merge the regions. To return to the behavior of previous versions, set this option to 10. (RLIJE-454)
The API now supports the use of Path, which makes it file-system-agnostic. (RLIJE-331)
Version of OSGi (internal use only) upgraded. (RLIJE-380)
Added -dontBreakRegionOnScriptBoundary to RLICmd. (RLIJE-324)
Language weight adjustments can now be used to boost any given language, not just demote it. (RLIJE-111)
The weight adjustment API now works for the short string algorithm. (RLIJE-204)
Refactored RLICmd's command line options. (RLIJE-230)
The short string and legacy algorithms now consistently handle cases when a language cannot be determined. Both now return language Unknown. NoMatchException, NotEnoughDataException, and LanguageIdentificationException are deprecated. (RLIJE-272, RLIJE-304)
Achieved a modest accuracy gain by changing the internals of the matching algorithm to be more tolerant of noise. (RLIJE-277)
Added an option to return only one result per language. See LanguageIdentifierBuilder#uniqueLanguages(boolean). (RLIJE-298)
The languageHint and encodingHint methods of LanguageIdentifierBuilder are deprecated. Use the weight adjustment API instead. (RLIJE-320)
Profiles for transliterated languages (e.g. Arabic in Latin script) are disabled by default. To enable them, see LanguageIdentifierBuilder#languageWeightAdjustment(LanguageCode, ISO15924, int). (RLIJE-225)
RLICmd can continue running if it fails to find a file in a provided list of files to analyze. (RLIJE-216)
RLICmd can use multiple threads to analyze documents in parallel. (RLIJE-221)
icu4j and args4j have been shaded into the rli-je-shaded jar. (RLIJE-237)
Basis' common-lib jar has been shaded into the rli-je-shaded jar. (RLIJE-248)
RLI-JE now depends on adm-model instead of adm-shaded. With this change, RLI-JE no longer depends on Apache Commons Betwixt or Javassist. (RLIJE-252)
The short string detection algorithm is 20% faster than in 7.13.0. (RLIJE-254)
Added alternative analysis for improved accuracy when detecting the language of short strings. (RLIJE-96)
Renamed the license directory from license to licenses to be consistent with other Basis products. (RLIJE-124)
Moved the command line utility, RLICmd, from tools/bin to bin. (RLIJE-200)
Refactored the identification of Chinese to separate language and script. Chinese language (zho) is now detected when script is Han, Simplified (Hans) and when script is Han, Traditional (Hant). RLI-JE used to identify these variants as Simplified Chinese (zhs) and Traditional Chinese (zht). (RLIJE-152)
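Code that stored or compared the earlier zhs and zht results may need to translate them into the new language and script pair. The sketch below shows one way to express that mapping. The class, its nested type, and the method name are hypothetical and not part of RLI-JE; only the codes themselves come from this note.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for callers that stored the legacy codes;
// not part of the RLI-JE API.
final class ChineseCodeMigrationSketch {

    // A language code paired with an ISO 15924 script code.
    static final class LanguageAndScript {
        final String language;
        final String script;

        LanguageAndScript(String language, String script) {
            this.language = language;
            this.script = script;
        }

        @Override
        public String toString() {
            return language + "/" + script;
        }
    }

    private static final Map<String, LanguageAndScript> LEGACY_CODES = new HashMap<>();

    static {
        LEGACY_CODES.put("zhs", new LanguageAndScript("zho", "Hans"));
        LEGACY_CODES.put("zht", new LanguageAndScript("zho", "Hant"));
    }

    // Returns the language/script pair for a legacy Chinese code,
    // or null if the code is not one of the two legacy codes.
    static LanguageAndScript migrate(String legacyCode) {
        return LEGACY_CODES.get(legacyCode);
    }

    public static void main(String[] args) {
        System.out.println(migrate("zhs"));
        System.out.println(migrate("zht"));
    }
}
```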
Added support for identifying language regions in a document that contains blocks of text in multiple languages. (RLIJE-88)
Deprecated LanguageIdentifierFactory and LanguageIdentifier in favor of LanguageIdentifierBuilder, which you can use to set options and create Annotator objects that detect language, encoding, and writing script, as well as language regions in multilingual input. This implementation employs the new data model package (com.basistech.rosette.dm), which is used in a variety of Rosette products. (RLIJE-93)
Added a factory class (LanguageIdentifierFactory) for creating instances of the LanguageIdentifier. (RLIJE-15)
Moved the command line utility from LanguageIdentifier to RLICmd. (RLIJE-32)
When returning UTF-16 encoding, RLI-JE now identifies whether the encoding is Little Endian (UTF-16LE) or Big Endian (UTF-16BE). (RLIJE-51)
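For readers unfamiliar with the two variants, the sketch below tells them apart in the simplest case, where the data begins with a byte-order mark (BOM). This note does not describe how RLI-JE makes the decision, and data without a BOM needs other evidence, so treat this as a hypothetical illustration rather than the product's method.

```java
// Illustrative only; not RLI-JE's detection logic.
final class Utf16BomSketch {

    // Classifies data by its leading byte-order mark:
    // FE FF means big endian, FF FE means little endian.
    // A UTF-32LE BOM also starts with FF FE; this sketch ignores that case.
    static String utf16Variant(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return "UTF-16BE";
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return "UTF-16LE";
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // The letter "A" preceded by a BOM, in each byte order.
        byte[] bigEndian = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(utf16Variant(bigEndian));
        System.out.println(utf16Variant(littleEndian));
    }
}
```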
Extended support for specifying the license. LanguageIdentifierFactory includes constructors for getting the license from a file, an input stream, or an XML string.
Added support, through the LanguageIdentifier setLanguageWeightAdjustment methods, for reducing the weight associated with a specific language to a percentage of its original weight. This assists in the detection of other languages in documents that mix multiple languages.
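The arithmetic behind a percentage adjustment can be sketched as follows. This is illustrative only: it is not RLI-JE's scoring code, the class and method names are hypothetical, and the weights in the example are made-up numbers. It shows how scaling one language down lets another candidate come out on top.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative arithmetic only; not RLI-JE's scoring code or API.
final class WeightAdjustmentSketch {

    // Scales a weight to the given percentage of its original value.
    static double adjust(double originalWeight, int percent) {
        return originalWeight * percent / 100.0;
    }

    // Returns the language with the highest weight after scaling
    // the named language to the given percentage.
    static String topLanguage(Map<String, Double> weights, String language, int percent) {
        String best = null;
        double bestWeight = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            double w = e.getKey().equals(language)
                    ? adjust(e.getValue(), percent)
                    : e.getValue();
            if (w > bestWeight) {
                best = e.getKey();
                bestWeight = w;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("eng", 0.6); // made-up numbers
        weights.put("fra", 0.5);
        System.out.println(topLanguage(weights, "eng", 100)); // unadjusted
        System.out.println(topLanguage(weights, "eng", 50));  // eng halved
    }
}
```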
Adjusted the default weights assigned to Pushto/Latn and Urdu/Latn to return more accurate results. Lowered the default weight for Serbian/Latn to 0, so that Croatian is returned. Pushto/Arab, Urdu/Arab, and Serbian/Cyrillic are not affected by these adjustments. (RLIJE-37)
Placed new .jar files in the lib directory. mahout-collections-1.0.jar is no longer used by RLI-JE.
This is the first release of RLI - Java separated from the Rosette Search Essentials SDK. It matches the support in the C++ implementation of RLI 6.5.1.
For a list of third-party licenses for components that are used in Basis Technology products, see ThirdPartyLicenses.txt.
Third-party component updates in 7.21.4
Third-party component updates in 7.21.2
Third-party component updates in 7.21.0
Third-party component updates in 7.20.1
Third-party component updates in 7.18.0
Third-party component updates in 7.16.0
Third-party component updates in 7.15.0
Third-party component updates in 7.14.0
RLI 6.5.1 may occasionally misidentify buffers containing UTF-16 data (e.g., Java Strings). The workaround is to extract a UTF-8 byte array and pass that to detect(byte[] data).
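The workaround can be sketched in plain Java. Only String.getBytes(StandardCharsets.UTF_8) is a real JDK call here. The ByteDetector interface is a hypothetical stand-in for whatever object exposes detect(byte[] data), since this note does not give that method's declaring class or return type.

```java
import java.nio.charset.StandardCharsets;

// Sketch of the UTF-8 workaround; names other than detect are hypothetical.
final class Utf8WorkaroundSketch {

    // Hypothetical stand-in for the object that exposes detect(byte[] data).
    interface ByteDetector<R> {
        R detect(byte[] data);
    }

    // Extracts a UTF-8 byte array from a Java String.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Passes UTF-8 bytes, rather than UTF-16 data, to the detector.
    static <R> R detectViaUtf8(String text, ByteDetector<R> detector) {
        return detector.detect(toUtf8(text));
    }

    public static void main(String[] args) {
        // A dummy detector that just reports how many bytes it received.
        Integer byteCount = detectViaUtf8("caf\u00E9", data -> data.length);
        System.out.println(byteCount);
    }
}
```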
In some cases, for input consisting of Han script, LanguageIdentifier may return results that are not sorted by confidence. This reflects heuristics that do not factor into the confidence calculation.