Rosette Entity Extractor Java Edition (REX-JE) includes a statistical extractor (with Structured, Averaged Perceptron statistical models), a pattern matcher (regular expressions), an exact matcher (gazetteers), a redactor for resolving conflicts, and a joiner for joining adjacent entities. It also includes entity linking functionality and a field training kit for optimizing REX results for your data.
September 2023
New
Bug Fixes
Third-party component updates
Table 1. Updated
Package | Old Version | New Version
Jackson Annotations | 2.15.0 | 2.15.2
Jackson Core | 2.15.0 | 2.15.2
Jackson Databind | 2.15.0 | 2.15.2
Jackson Dataformat XML | 2.15.0 | 2.15.2
Jackson Dataformat T | 2.15.0 | 2.15.2
Jackson Datatype: Guava | 2.15.0 | 2.15.2
Jackson Module: Old JAXB Annotations | 2.15.0 | 2.15.2
Guava: Google Core Libraries for Java | 31.1-jre | 32.1.2-jre
Protocol Buffers [Core] | 3.21.7 | 3.23.4
June 2023
Bug Fixes
Known Issues
Third-party component updates
This release includes the following third-party component changes:
Table 2. Updated
Package | Old Version | New Version
Apache Commons Compress | 1.22 | 1.23
Apache Log4J API | 2.19.0 | 2.20.0
Apache Log4J Core | 2.19.0 | 2.20.0
Apache Log4J SLF4J Binding | 2.19.0 | 2.20.0
fastutil | 8.5.9 | 8.5.12
Jackson Annotations | 2.14.0 | 2.15.0
Jackson Core | 2.14.0 | 2.15.0
Jackson Databind | 2.14.0 | 2.15.0
Jackson Dataformat CSV | 2.14.0 | 2.15.0
Jackson Dataformat YAML | 2.14.0 | 2.15.0
Jackson Dataformat XML | 2.14.0 | 2.15.0
Jackson datatype: Guava | 2.14.0 | 2.15.0
Jackson JAXRS:base | 2.14.0 | 2.15.0
Jackson JAXRS:JSON | 2.14.0 | 2.15.0
Jackson module:OLD JAXB Annotations | 2.14.0 | 2.15.0
SnakeYAML | 1.33 | 2.0
March 2023
This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.
Third-party component updates
This release includes the following third-party component changes:
Table 3. Upgraded
Package | Old Version | New Version
Guava: Google Core Libraries for Java | 26.0-jre | 31.1-jre
Protocol Buffers [Core] | 3.12.2 | 3.21.7
March 2023
This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.
Third-party component updates
This release includes the following third-party component changes:
Table 4. Upgraded
Package | Old Version | New Version
Guava: Google Core Libraries for Java | 26.0-jre | 31.1-jre
Protocol Buffers [Core] | 3.12.2 | 3.21.7
December 2022
New
Wikidata refreshed: We've updated the knowledge base data for the provided linking knowledge base. The QID assigned to some extracted entities may differ from previous versions. (RWIKI-119, ELK-274, ELK-276)
New currency regex: We've introduced a new option, regexCurrencySplit, that, when set to true, attempts to split entities of type IDENTIFIER:MONEY extracted with the regex engine into two new entities: IDENTIFIER:CURRENCY_AMT and IDENTIFIER:CURRENCY_TYPE. These two new types represent the amount of the currency (50,000) and the currency type ($), respectively. By default, regexCurrencySplit is set to false. (TEJ-1792)
Tagalog support: We've added case-insensitive NER support for Tagalog. Previously we released a case-sensitive model and we've now added the case-insensitive model as well. (TEJ-1858)
Parameter removed: We've removed the deprecated genre extraction option. This option was used to turn on the linker, which has been, and remains, available through the linkEntities option. The genre option is no longer available in the REX SDK, the Rosette Server REX configuration, or the Rosette API bindings. (TEJ-1855)
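The split that regexCurrencySplit performs can be pictured with a small standalone sketch. This is not REX's implementation: the class name, the pattern, and the rule that a currency is either a Unicode currency symbol or a three-letter code are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: splits a money string into an amount part and a currency part. */
final class CurrencySplitSketch {
    // Assumed shape: optional leading symbol or 3-letter code, a numeric amount,
    // optional trailing symbol or 3-letter code.
    private static final Pattern MONEY = Pattern.compile(
        "^\\s*(\\p{Sc}|[A-Z]{3})?\\s*([0-9][0-9.,]*)\\s*(\\p{Sc}|[A-Z]{3})?\\s*$");

    /** Returns {amount, currencyType}, or null when the text cannot be split. */
    static String[] split(String money) {
        Matcher m = MONEY.matcher(money);
        if (!m.matches()) {
            return null;
        }
        // Prefer a leading currency marker, fall back to a trailing one.
        String type = m.group(1) != null ? m.group(1) : m.group(3);
        if (type == null) {
            return null; // a bare number carries no currency type
        }
        return new String[] { m.group(2), type };
    }

    public static void main(String[] args) {
        String[] parts = split("$50,000");
        System.out.println("IDENTIFIER:CURRENCY_AMT  = " + parts[0]);
        System.out.println("IDENTIFIER:CURRENCY_TYPE = " + parts[1]);
    }
}
```

In this sketch a string that cannot be split yields null, which loosely mirrors the "will attempt to split" wording: when the attempt fails, the original MONEY entity would presumably be left as it is.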
September 2022
New
Tagalog (tgl) support: We've added Tagalog to our list of languages. The following processors are supported: gazetteer, regex, statistical NER, linking. (TEJ-1812, TEJ-1822, TEJ-1785, TEJ-1786)
New linking option: We've added a new option for entity linking. When linkMentionMode is set to entities, the linker attempts to link the entities extracted by other processors (regex, gazetteers, and the statistical processor) instead of using its own processor to extract entity candidates. Depending on your data, this may provide higher accuracy and speed. (TEJ-1806)
REXCmd parameter change: The linkEntities parameter can now act as a toggle instead of taking a true/false value, matching how other REXCmd boolean parameters are handled. (TEJ-1806)
Parameter deprecated: The parameter genre is deprecated and will be removed in the next release.
Bug Fixes
REX no longer produces an exception when token normalization produces an empty token string. (TEJ-1803)
When looking for candidate mentions in text, if there is an overlap between these mentions the linker now resolves the longest spanning mention before disambiguation. (ELK-277)
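The overlap rule described in ELK-277 (prefer the longest spanning mention when candidates overlap) can be sketched as a greedy filter. The class and method names below are hypothetical, and the linker's actual tie-breaking is not documented here; this sketch breaks ties by earliest start offset.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of longest-span-wins overlap resolution. */
final class LongestSpanSketch {
    /** A candidate mention as a half-open character span [start, end). */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
        int length() { return end - start; }
        boolean overlaps(Span o) { return start < o.end && o.start < end; }
        @Override public String toString() { return "[" + start + "," + end + ")"; }
    }

    /** Visits spans longest first and keeps a span only if it overlaps nothing kept so far. */
    static List<Span> resolve(List<Span> candidates) {
        List<Span> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt(Span::length).reversed()
                .thenComparingInt((Span s) -> s.start));
        List<Span> kept = new ArrayList<>();
        for (Span s : sorted) {
            boolean clash = false;
            for (Span k : kept) {
                if (s.overlaps(k)) { clash = true; break; }
            }
            if (!clash) {
                kept.add(s);
            }
        }
        kept.sort(Comparator.comparingInt((Span s) -> s.start)); // back to document order
        return kept;
    }

    public static void main(String[] args) {
        // Offsets into "New York City Council": "New York", "New York City", "City Council".
        List<Span> candidates = new ArrayList<>();
        candidates.add(new Span(0, 8));
        candidates.add(new Span(0, 13));
        candidates.add(new Span(9, 21));
        System.out.println(resolve(candidates));
    }
}
```

Only the surviving spans would then be passed on to disambiguation.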
June 2022
New
Configure knowledge base linking priority: With multiple knowledge bases, it is possible to set the order in which to try linking against each knowledge base. Set the priority in the redactor configuration file (ne_types.xml). (TEJ-1726, TEJ-1754)
Example: The following XML element sets the custom-kb priority higher than the default knowledge base (kb-linker) when linking a PRODUCT entity type:
<ne_type>
<name>PRODUCT</name>
<weight name="kb-linker" value="100" />
<weight name="kb-linker:custom-kb" value="1" />
</ne_type>
relatedEntities renamed to contextWords: When creating a custom knowledge base, the feature contextWords, which was previously called relatedEntities, is required. Context words are language-specific words that are strongly related to the entity. The term relatedEntities has been deprecated. (TEJ-1756)
Java 17 support added: Java 8 and 9 support has been removed. (TEJ-1728, TEJ-1763)
Solr 9 support added: REX now supports Lucene and Solr 9. (TEJ-1731)
Solr 6 support removed: REX no longer supports Solr 6 or earlier. (TEJ-1731)
Bug Fixes
Bug fix: An error is no longer generated when there are null prefixes in Arabic morphological analyses. (TEJ-1765)
Bug fix: We fixed a bug so that the noisy_context_vector feature can be used for disambiguation. (ELK-265, ELK-268, ELS-272, TEJ-1776)
March 2022
Notice
Solr 6 and earlier support is deprecated as of this release.
Java 8 and Java 9 support is deprecated as of this release.
Bug Fixes
Third-party component updates
This release includes the following third-party component changes:
Table 5. Upgraded
Package | Old Version | New Version
Apache Commons Compress | 1.9 | 1.21
Apache Commons IO | 2.7 | 2.11.0
Apache Commons Lang3 | 3.32 | 3.12.0
Apache Log4j | 1.2.17 | 2.17.1
Auto Common Libraries | 0.3 | 0.8
AutoService | 1.0-r3 | 0.8
ICU4J | 58.1 | 70.1
fastutil | 8.4.0 | 8.5.6
LibLinear | 2.30 | 2.42
SLF4J | 1.7.28 | 1.7.33
SnakeYAML | 1.26 | 1.30
TensorFlow for Java | 0.2.0 | 0.3.3
Table 6. Added
Package | Version | License
AOP alliance | 1.0 | Public Domain
Apache Commons Logging | 1.2 | Apache License 2.0
Apache Commons Math | 2.0 | Apache License 2.0
Apache POI | 3.9 | Apache License 2.0
DOM4J | 1.6.1 | DOM4J License
JCommon | 1.0.17 | GNU Lesser General Public Licence
JFreeChart | 1.0.14 | GNU Lesser General Public Licence
JUnit | 4.13.2 | Eclipse Public License 1.0
JVM Integration for Metrics | 3.0.4 | Apache License 2.0
Java Architecture for XML Binding | 2.3.2 | Eclipse Distribution License - v 1.0
Java Common Annotations API | 1.3.2 | CDDL + GPLv2 with classpath exception
Java Message Service | 1.1 | Common Development and Distribution License (CDDL) v1.0
JavaBeans Activation Framework (JAF) | 1.1 | Common Development and Distribution License (CDDL) v1.0
JavaBeans Activation Framework API jar | 1.2.1 | EDL 1.0
JavaMail API | 1.4 | Common Development and Distribution License (CDDL) v1.0
Javax WS-RS API | 2.1.5 | EPL 2.0
JetBrains Java Annotations | 23.0.0 | Apache License 2.0
Jimfs | 1.1 | Apache License 2.0
Legion of the Bouncy Castle Java Cryptography APIs | 138 | Bouncy Castle License
Lib TensorFlow | 1.5.0 | Apache License 2.0
Mockito | 1.9.5 | The MIT License
ODFDOM | 0.8.6 | Apache License 2.0
Project Lombok | 1.18.22 | The MIT License
Spring | 4.2.4.RELEASE | Apache License 2.0
StAX API | 1.0.1 | Apache License 2.0
Sun Multi-Schema XML Validator | 20050913 | The BSD License
TensorFlow | 1.5.0 | Apache License 2.0
XML Commons External Components XML APIs | 1.3.04 | Apache License 2.0
Xerces2 Java Parser | 2.9.4 | Apache License 2.0
XMLBeans | 2.3.0 | Apache License 2.0
ZIP4J | 1.3.2 | Apache License 2.0
iText | 2.1.5 | Mozilla Public License
Table 7. Removed
Package
Apache Geronimo |
JAX-WS |
JBoss RMI |
JSR203 Hadoop |
Jacorb Omg |
Jakarta Activation |
Jakarta WS-RS API |
Jakarta XML Bind API |
Javax Activation |
Javax Annotation |
Javax XML Soap |
MIME Pull |
SAAJ Impl |
STAX-EX |
December 2021
New
Bug Fixes
Hungarian dates are now extracted correctly. Previously, dates with embedded periods followed by a space were not being extracted. (TEJ-1681)
rexcmd info no longer lists TEMPORAL types by default for SWEDISH. (TEJ-1687)
August 2021
New
Wikidata refreshed: The internal database for Wikidata linking has been refreshed and re-indexed. QIDs for some entities may change from previous versions. (TEJ-1657, TEJ-1658)
New RBL version: Entity extraction now consumes the latest version of Rosette Base Linguistics (RBL) 7.41.1.c65.0. (TEJ-1667)
Bug Fixes
A single line followed by an empty line is no longer always considered a fragment. (ETROG-3431)
The Field Training Kit (FTK) no longer returns erroneous error messages from generating wordclasses. (TEJ-1636)
The RBL models directory is now correctly specified in the FTK. (TEJ-1655)
The REX Training Server (RTS) no longer fails when the request contains the language code msa. msa is now mapped to zsm, the language code supported by REX for Malay. (TEJ-1669)
May 2021
Bug Fixes
Open Source Changes
Table 8. Upgraded
Package | Old Version | New Version
jackson | 2.10.0 | 2.11.1
commons-io | 2.6 | 2.7
fastutil | 8.3.0 | 8.4.0
liblinear | 1.95 | 2.42
snakeyaml | 1.25 | 1.26
stax2-api | 4.2 | 4.2.1
Table 9. New
Package | Version | License
JavaCPP | 1.5.4 | Apache 2.0
TensorFlow Core API | 0.2.0 | Apache 2.0
TensorFlow NDArray | 0.2.0 | Apache 2.0
Table 10. Deleted
Package
libtensorflow |
libtensorflow jni |
protobuf |
April 2021
New
Language-specific joiner rules: Custom joiner rules can now be language-specific or apply to all languages. (TEJ-178)
New default processing for structured text regions (lists, tables): Because structured text is often just words or phrases, and thus lacks the syntactic context that REX was trained on, some REX users pre-processed input text to remove structured regions, on which REX performed poorly. Users no longer have to pre-process the input, because the statistical/DNN model is now turned off by default for structured regions. This mode increases precision but may reduce recall in these regions. Note that the other REX processors (pattern match, exact match, entity linking), which do not rely on context, continue to analyze the structured regions. To turn on the statistical/DNN model for structured regions, set the parameter structuredRegionProcessingType to nerModel. (TEJ-1502)
New name classifier model for structured regions (LABS): We've added a new model for processing structured regions. The name classifier classifies a text fragment as PERSON, LOCATION, ORGANIZATION, or NONE. The entire structured region is classified as a single label, an entity type or NONE. It is disabled by default. (TEJ-1613, TEJ-1621)
Japanese organization gazetteers: The gazetteers for Japanese organizations have been updated to improve extraction of Japanese organizations. (TEJ-1612)
New RBL version: Entity extraction now consumes the latest version of Rosette Base Linguistics (RBL) 7.39.0. (TEJ-1618)
Rosette Training Server (RTS) results: When using REX with Rosette Adaptation Studio (RAS), the results returned by RTS are now preferred by default. (TEJ-1605)
Bug Fixes
Entities are no longer extracted when they cross a sentence boundary. To enable entity linking across sentence boundaries, set disableApplySentenceBoundaries to true. (ELK-259)
Entities are now checked to ensure they are normalized. (TEJ-1615)
Third-party component updates
This release includes the following third-party component changes:
December 2020
New
Updated the internal database for Wikidata linking. QIDs for some entities may change from previous versions, as Wikidata has been refreshed and re-indexed. (TEJ-1579, ELK-249, ELK-251, RWIKI-77)
Updated RBL version (TEJ-1579)
Bug Fixes
The sqlite-kb-connector sample now works correctly. Runtime issues with sqlite dependencies have been corrected. (ELK-245, ELK-257)
Extraction no longer fails when a custom processor returns a NULL annotator; instead a warning is generated. (TEJ-1580)
Mentions normalized by the custom processor are no longer ignored. (TEJ-1573)
Windows-formatted carriage returns (\r, \r\n) are now handled correctly.
September 2020
New Features
Joiner runs before redactor: The joiner now runs before the redactor by default, providing more flexibility and control over the joiner results. Set runJoinerPostRedactor to true to run the joiner after the redactor. (TEJ-1534)
Improved phone number recognition: Regular expressions for phone number extraction have been improved and now extract more phone number patterns. (TEJ-1556)
REXCmd input from stdin: REXCmd can now accept input from stdin by specifying the command line option -stdin.
Example:
$ echo "Basis Technology is a company in Massachusetts" | REXCmd extract -stdin -langCode eng
Bug Fixes
We fixed a bug where sometimes a null pointer exception was returned when the custom processor and the linker had overlapping results. (TEJ-1561)
Custom processors can now only modify the entity and metadata sections of the ADM. Previously, any modification could be made which could override annotation data. (TEJ-1537)
We've partially fixed a problem in Japanese ORG extraction where sometimes the model extracts multiple ORG entities or includes non-related adjacent tokens. (TEJ-1534)
The Field Training Kit no longer generates invalid models when creating custom knowledge bases. This occurred for all languages except eng, jpn, and zho. (ELK-252)
June 2020
New Features
Improved sample: The sample files to build the SQLite connector described in the Custom Knowledge Base Connectors section now include all files required to build with Maven. The configuration to run the connector with Rosette Enterprise is now provided as well. (TEJ-1508)
Language-specific alias: Custom knowledge bases compiled with the Field Training Kit (FTK) now maintain the language of the alias. Aliases are only extracted in documents of the language the alias is defined for. Aliases can be defined for all languages or for a specific language. (ELK-241)
Custom knowledge bases can be compiled without disambiguation. While adding a knowledge base without a disambiguation model will not provide the best results, it will function as an enhanced gazetteer that attaches an assigned ID to each gazetteer entry and supports multiple aliases per entry. To compile a custom knowledge base without compiling a disambiguation model, pass -d as an argument to train-linker-model. (ELK-233)
New method: A method getBaseLinguisticsParameters has been added to retrieve the base linguistics parameters that were used in training the model. Use the retrieved parameters to configure an external instance of RBL to produce tokens consistent with the training tokenization. A new sample application, RBLParametersSample.java, is available in the samples directory. (TEJ-1501)
Base linguistics added: The FTK can now use input ADM files containing base linguistics annotations, such as tokens, sentence boundaries, and morphological analysis for languages such as Korean and Arabic. For REX to produce the optimal results, tokenize with the options provided by the getBaseLinguisticsParameters method when creating the ADM file from RBL. (APE-1793)
Hebrew improvements: REX has improved Hebrew normalization and added the ability of the disambiguator to identify prefixes removed from the entity's normalized form. Improvements are a result of enhancements in Hebrew base linguistics. (ETROG-3189)
Bug Fixes
A new line character in a regex (\n) will now also match carriage returns (\r) and a combination of both (\r\n). (TEJ-1525)
Confidence scores for entity linking now use the same scale, whether linking to Wikidata or a custom knowledge base. Previously, the confidence scores given for links to custom knowledge bases were much lower than those calculated for the Wikidata knowledge base. (ELK-240)
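The newline fix above (TEJ-1525) concerns REX's own regex engine. As a rough illustration of the same idea in java.util.regex, the sketch below widens each \n escape in a pattern's source so that it also accepts \r and \r\n. The class and method names are hypothetical, and the simple text substitution is an assumption that ignores edge cases such as an escaped backslash followed by n, or \n inside a character class.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: make a "\n" escape in a pattern accept any common line ending. */
final class NewlineRegexSketch {
    static Pattern compileLenient(String regex) {
        // Replace the two-character escape \n with an alternation that tries \r\n
        // first, so a Windows line ending is consumed as a single unit.
        String widened = regex.replace("\\n", "(?:\\r\\n|\\r|\\n)");
        return Pattern.compile(widened);
    }

    public static void main(String[] args) {
        Pattern p = compileLenient("Name:\\nJohn");
        System.out.println(p.matcher("Name:\nJohn").matches());
        System.out.println(p.matcher("Name:\r\nJohn").matches());
        System.out.println(p.matcher("Name:\rJohn").matches());
    }
}
```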
March 2020
New Features
Connector framework for custom Knowledge Bases added. See section 5.6 in the Application Developer's Guide. (TEJ-1476, TEJ-1477, TEJ-1485)
Added a Deep Neural Network model for Hebrew for improved accuracy. To use it in place of the statistical model, specify the flag -useDeepNeuralNetworkProcessor. (TEJ-1503)
Hebrew normalization improved: instead of using the lemma form, just the prefixes are being removed, except the definite article. (TEJ-1505)
New statistical model for Hebrew trained on news and finance data. (TEJ-1497)
Solr plugin now available as a Docker container. (TEJ-1492)
Supplemental regex support for ISO-6709 geo-coordinates. (TEJ-1431, DATA-761)
Support for setting prioritization for multiple custom Knowledge Bases. See section 5.2 in the Application Developer's Guide. (ELK-236)
Redactor weights can now be configured for specific subsources. See section 3.2.1 in the Application Developer's Guide. (TEJ-1480)
Separate license key required for linker custom Knowledge Bases. Note: extractions against existing custom Knowledge Bases will fail unless licenses are updated. (TEJ-1483)
Custom Knowledge Bases can be set in Rosette Enterprise profiles. Note: To support this feature, the flinx directory was moved into {rex-installation}/data. Any custom data inside must also be moved to the new location. (TEJ-1494)
Bug Fixes
TEJ-1499 REXAnnotatorFactory failed to assign linking confidence thresholds.
TEJ-1479 Fixed dynamic gazetteers for Malay.
TEJ-1506 Deep Neural Network extractions failed in REXCmd.
December 2019
New Features
Bug Fixes
August 2019
New Features
Tested and confirmed compatibility with Java 11.
Updated internal database for Wikidata linking. The DBPedia Type field now supports multiple subtypes. QIDs for some of the entities may change from previous versions, as Wikidata has been refreshed and re-indexed.
Entity linking returns PermIDs (IDs from Thomson Reuters knowledge base) in addition to QIDs (Wikidata IDs) for some of the entities.
Bug Fixes
June 2019
New Features
Flinx disambiguation models are packaged with optional parameter files that control some parameters at runtime. These files were missing from several previous distribution packages, which may have affected accuracy. They have now been re-added to the distribution packages.
Fixed additional cases where Japanese characters were wrongly normalized into their simplified Chinese equivalents in entity linking, an issue addressed also in the previous release.
Chinese language code is now composed of three characters uniformly throughout the file system.
Entity extraction and entity linking now consume the latest version of RBL (Rosette Base Linguistics), which includes several improvements and bug fixes.
Improved installation by providing a script to facilitate unzip and installation of documentation and language packages.
Bug Fixes
In Japanese, a middle dot appears within Western names and acts as a sort of whitespace separating the words in the name. Previously, some entities containing a middle dot were split into two entities, so that only part of the name was extracted. This is now handled correctly, and entities with a middle dot are not split. (TEJ-1341)
In Japanese entity linking, in some cases the last character of the Japanese word changed into a Chinese character. This is now fixed. (ELK-118)
Previously, when the includeDbPediaTypes option was off, entity linking occasionally extracted an inaccurate entity type. Now, fine types of linked entities are also identified when the includeDbPediaTypes option is off. (ELK-115)
Provided a distribution package per language. (TEJ-1361)
Reduced linker data package size (TEJ-1306, TEJ-1321)
Updated the linking confidence calculation and thresholds to improve accuracy (TEJ-1343)
Improved the accuracy of Korean extraction, largely through better handling of Josa (postpositions) and compound words.
Added support for Entity linking to Wikipedia for both the top level types (PERSON, LOCATION, ORGANIZATION, ETC.) as well as the over 700 DBpedia types in the remaining 16 languages supported by Entity Extraction. This is in addition to the languages currently supported by entity linking: Chinese, English, Japanese, and Spanish.
The linker process now has the option of returning over 700 new entity types drawn from the DBpedia ontology. To access these entity types, turn on the kbLinker processor and add the includeDBpediaType flag to the factory configuration. You’ll notice more than 10 additional primary types in the type field as well as the all-new DBpedia type field. Note that this is a LABS (experimental) API and subject to change. Send us your feedback!
New language: Entity extraction now supports Hungarian.
Enabled string normalization for Hebrew based on the DNN disambiguation model to improve indoc-coref results (chaining Mentions into a single Entity) and to present a more proper form of the name. (TEJ-1139, TEJ-1173)
Social-media characters such as '@' and '#' are removed from a Mention's normalized string; offsets into the original string data field remain the same. This feature can be disabled with EntityExtractor.setRetainSocialMediaSymbols(). (TEJ-418)
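The social-media normalization above can be illustrated with a small sketch in which the normalized string drops the symbol while the offsets still cover the original text. The classes here are hypothetical stand-ins rather than the REX Mention type, and stripping only leading '@' and '#' characters is an assumption.

```java
/** Hypothetical sketch: normalize a mention's text while keeping its original offsets. */
final class SocialMediaNormalizeSketch {
    static final class Mention {
        final int start;          // offsets into the original document text
        final int end;
        final String normalized;  // display form without social-media symbols
        Mention(int start, int end, String normalized) {
            this.start = start;
            this.end = end;
            this.normalized = normalized;
        }
    }

    /** Builds a mention for data[start, end) and strips leading '@' and '#' from its normalized form. */
    static Mention mention(String data, int start, int end) {
        String surface = data.substring(start, end);
        String normalized = surface.replaceAll("^[@#]+", "");
        return new Mention(start, end, normalized);
    }

    public static void main(String[] args) {
        String data = "Follow @BasisTech today";
        int start = data.indexOf('@');
        Mention m = mention(data, start, start + "@BasisTech".length());
        System.out.println(m.normalized + " at " + m.start + "-" + m.end);
        System.out.println(data.substring(m.start, m.end)); // still includes the '@'
    }
}
```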
Improved statistical model confidence score to emit maximal confidence less frequently (TEJ-1146)
Added static and dynamic capabilities to adding entries to the custom knowledge base for entity linking (ELK-30, ELK-41, ELK-44, TEJ-1150)
Added a new deep neural network processor (BETA) as an alternative entity extraction processor, which can be used in place of the standard statistical extractor for English, Arabic and Korean (TEJ-1132, TEJ-1142, TEJ-1150)
Accuracy of Korean statistical model is improved (APE-1737)
Default linking confidence thresholds are set (TEJ-1080, TEJ-1068)
The method setUseDeepNeuralNetworkProcessor() in com.basistech.rosette.rex.EntityExtractor is part of a new experimental API to replace the statistical model with a new deep learning model. Alternatively, provide ProcessorType.deepNeuralNetwork to the method setProcessors. Currently available only for English and Arabic. Some operating systems do not support the deep neural network model, and some do not provide good latency.
FTK supports training a disambiguation model for custom knowledge base (ELK-13, ELK-14, ELK-16, ELK-22, ELK-34)
Added default linking confidence threshold for linked entities (TEJ-1068)
Updated RBL version (TEJ-1048)
Application developer’s guide and customization guide are merged into a single manual (ELK-22)
The new salience classifier has been incorporated. The salience calculation is enabled via EntityExtractor.setCalculateSalience() or by setting calculateSalience in either REXFactoryConfiguration or REXAnnotatorConfiguration. (TEJ-936)
Added manual custom processor registration API. (TEJ-972, TEJ-982)
Deduped partial duplicated regex. (TEJ-785)
Improved multiple gazetteer (exact-match) processor support. (TEJ-960, TEJ-1005)
Confidence score calculation is improved to correlate well with precision, may be used for thresholding and removal of false positives (TEJ-910, TEJ-919)
Statistical models are trained with new emoticon-sensitive tokenizer (TEJ-924)
New script allows repacking REX with minimal configuration per language (TEJ-893)
Automatic case sensitivity mode prefers case-sensitive for short text by default (TEJ-931)
Added Custom Processor for rejection. (TEJ-840, TEJ-841, TEJ-843, TEJ-880)
Redactor improvements: dynamic rules prioritization and subtypes handling (TEJ-863, TEJ-858)
Pronominal resolver is fully supported for English. Added as a processor type, as well as indoc-coref (TEJ-867)
Indoc-coref allows partial match for ORGANIZATION type (INDOC-26)
Added full support for Vietnamese. (APE-1691)
Added an example of how to use REX over Hadoop DFS. The README illustrating how to use this example can be found at ./samples/MapReduceExample/README.md. (TEJ-807)
The method setResolvePronouns() in com.basistech.rosette.rex.EntityExtractor is part of a new experimental API to resolve pronouns like 'he' and 'she' to entities of type Person. It may be changed or removed in future releases. Available only in English. (TEJ-831)
Reject regex and gazetteers allow wildcard entity type. (TEJ-853, TEJ-817)
The kb-linker experimental processor now supports Chinese and Japanese in addition to English. This functionality may be changed or removed in future minor releases. (TEJ-857)
Automatic case sensitivity improved (English only). (TEJ-861)
Added an example of how to use REX over Hadoop DFS. The README illustrating how to use this example can be found at ./samples/SparkEntityCount/README.md. (TEJ-155)
Information about what languages are licensed can now be accessed. The methods getLanguageInformation() and getSupportedEntityTypes() in com.basistech.rosette.rex.EntityExtractor now take a flag for whether to return information on all languages REX supports, or just those that are licensed. Additionally, REXCmd info now has a -onlyLicensed option. (TEJ-767)
Added partial support for Vietnamese to extract phone numbers and dates using regexes. (TEJ-740)
REX annotators are now created faster and can feasibly be created on a document level when using the new com.basistech.rosette.rex.REXAnnotatorFactory API. The current com.basistech.rosette.rex.EntityExtractor API now also has a faster startup time. See the javadocs, the API Overview section in the Application Developer’s guide, and the sample program at ./samples/EntityAnnotatorFactorySample.java for additional details. (TEJ-773)
REX now reports its results using two new classes, Entity and Mention, such that each Entity in a document has one or more Mentions that refer to the same real-world identity. Moving forward, this API will replace EntityMention and its coreferenceChainId. This version of REX is backwards-compatible and still supports the deprecated EntityMention. (TEJ-702)
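The relationship between the older flat mentions with a chain ID and the Entity and Mention model can be sketched as a grouping step. The classes below are hypothetical stand-ins written for this illustration and are not the REX Entity, Mention, or EntityMention types; treating a missing chain ID as a singleton entity is also an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model: one Entity owns one or more Mentions of the same real-world identity. */
final class EntityMentionSketch {
    static final class Mention {
        final String text; final int start; final int end;
        Mention(String text, int start, int end) { this.text = text; this.start = start; this.end = end; }
    }

    static final class Entity {
        final String type;
        final List<Mention> mentions = new ArrayList<>();
        Entity(String type) { this.type = type; }
    }

    /** Old-style flat record: a mention tagged with an optional chain ID. */
    static final class FlatMention {
        final String text; final int start; final int end; final String type; final Integer chainId;
        FlatMention(String text, int start, int end, String type, Integer chainId) {
            this.text = text; this.start = start; this.end = end; this.type = type; this.chainId = chainId;
        }
    }

    /** Groups flat mentions sharing a chain ID into one Entity; a null chain ID gives a singleton. */
    static List<Entity> group(List<FlatMention> flat) {
        Map<Integer, Entity> byChain = new LinkedHashMap<>();
        List<Entity> entities = new ArrayList<>();
        for (FlatMention fm : flat) {
            Entity e = fm.chainId == null ? null : byChain.get(fm.chainId);
            if (e == null) {
                e = new Entity(fm.type);
                entities.add(e);
                if (fm.chainId != null) {
                    byChain.put(fm.chainId, e);
                }
            }
            e.mentions.add(new Mention(fm.text, fm.start, fm.end));
        }
        return entities;
    }

    public static void main(String[] args) {
        List<FlatMention> flat = new ArrayList<>();
        flat.add(new FlatMention("Basis Technology", 0, 16, "ORGANIZATION", 0));
        flat.add(new FlatMention("Massachusetts", 33, 46, "LOCATION", 1));
        flat.add(new FlatMention("Basis", 48, 53, "ORGANIZATION", 0));
        for (Entity e : group(flat)) {
            System.out.println(e.type + " has " + e.mentions.size() + " mention(s)");
        }
    }
}
```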
Improved Malaysian statistical model and added a new Malaysian gazetteer. (TEJ-711, TEJ-715)
kbLinker (flinx) is part of a new experimental API to link entities from social media text to knowledge bases and may be changed or removed in future minor releases. (TEJ-722, TEJ-725, TEJ-757)
REX has been upgraded to Rosette Platform compatibility level 58.2. If you intend to use more than one Rosette JVM SDK in a single application, then you should choose versions that have the same compatibility number. (TEJ-702)
Improved case-insensitivity detection in European languages. (TEJ-687)
Entity mentions extracted with the statistical model now also specify the model’s path as a subsource. (TEJ-724)
Added a new method, EntityExtractor setOverlayDataDirectory(Path overlayDataDirectory), that allows you to specify an additional data directory for REX to use. (TEJ-731)
The REXCmd command line utility now allows you to specify any additional regex files you want to use, besides just the default. (TEJ-628)
Added support for extraction using two statistical models operating in tandem. (TEJ-674)
In order to reduce disk footprint, Big Endian binaries are no longer shipped. REX will correctly memory map Little Endian models and dictionaries even on Big Endian systems. (TEJ-664)
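Reading little-endian data correctly on any platform comes down to setting the byte order explicitly instead of relying on the platform or buffer default. The sketch below shows this with a plain ByteBuffer; a memory-mapped file yields a MappedByteBuffer, which is a ByteBuffer and takes the same order() call. The class name and the single-integer layout are assumptions for illustration and say nothing about the real REX model format.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch: read little-endian data regardless of the host's native byte order. */
final class LittleEndianReadSketch {
    static int readFirstInt(byte[] littleEndianBytes) {
        ByteBuffer buf = ByteBuffer.wrap(littleEndianBytes);
        // A new ByteBuffer defaults to big-endian, so the order must be set explicitly.
        buf.order(ByteOrder.LITTLE_ENDIAN);
        return buf.getInt();
    }

    public static void main(String[] args) {
        byte[] data = { 0x2A, 0x00, 0x00, 0x00 }; // the integer 42 stored least significant byte first
        System.out.println(readFirstInt(data));
        System.out.println(ByteBuffer.wrap(data).getInt()); // default big-endian view of the same bytes
    }
}
```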
Optional new packaging: RBL and REX classes are available in one combined jar. (TEJ-692)
Added a new setting for REX to automatically choose the most accurate CaseSensitivity model (case-insensitive or case-sensitive) for the input text. This is not activated by default, see the sample programs or javadocs for reference on how to enable this feature. (TEJ-568)
Added case-insensitive models for German, Italian, Dutch, and Spanish. (TEJ-566)
REX is now built with JDK 1.7, so users can no longer run REX on Java Virtual Machines versioned 1.6 and earlier. (TEJ-551)
Improved accuracy of the English statistical model by using multiple Brown clusters. (TEJ-396, TEJ-559)
New disableStatisticalCleaner option added to REXCmd and EntityExtractor. (TEJ-379)
You can now reactivate regular expression-based entities that are disabled by default by instructing REX to load the regex files in each language’s supplemental directory. See the Javadoc for EntityExtractor.addRegularExpressions(). (TEJ-587)
Refined redaction rules for PERSON entities. (TEJ-115)
EntityMentions and the returned fields are now documented in the Application Developer Guide. (TEJ-623)
Added a boolean caseSensitive parameter to the EntityExtractor’s addGazetteer and addGazetteerEntity methods, to allow case-insensitive string matching of user-provided textual gazetteer entries. (TEJ-56)
A RosetteUnsupportedLanguageException is now thrown when REX cannot find data for the requested language, instead of a generic runtime exception. (TEJ-536)
Adding a duplicate gazetteer entry will overwrite the existing one. (TEJ-74)
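The case-insensitive gazetteer matching (TEJ-56) and overwrite-on-duplicate behavior (TEJ-74) described above can be modeled with a keyed map. This is a hypothetical in-memory sketch, not the EntityExtractor API: the class, its methods, and the use of lowercasing as the case-folding rule are all assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of an exact-match gazetteer with optional case-insensitive lookup. */
final class GazetteerSketch {
    private final Map<String, String> entries = new HashMap<>();
    private final boolean caseSensitive;

    GazetteerSketch(boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
    }

    /** Folds case only when the gazetteer is case-insensitive. */
    private String key(String text) {
        return caseSensitive ? text : text.toLowerCase(Locale.ROOT);
    }

    /** Adds an entry; an entry with the same key replaces the existing one. */
    void addEntry(String text, String entityType) {
        entries.put(key(text), entityType);
    }

    /** Returns the entity type for the text, or null when there is no entry. */
    String lookup(String text) {
        return entries.get(key(text));
    }

    public static void main(String[] args) {
        GazetteerSketch gazetteer = new GazetteerSketch(false);
        gazetteer.addEntry("Basis Technology", "ORGANIZATION");
        System.out.println(gazetteer.lookup("basis technology"));
    }
}
```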
Support for script-insensitive Chinese added: Entities are now extracted from Chinese input documents for which the 'Simplified' or 'Traditional' writing system is not specified. Applications may now submit text using the zho language code instead of specifying zhs or zht. (TEJ-525)
Version 7.14.0.c56.6 introduced the use of the "compatibility" version number extension (c56.6 in this case). If you intend to use more than one Basis Technology JVM SDK in a single application, then choose versions that have the same compatibility number. (TEJ-524)
Indonesian (Bahasa Indonesia) support added. (TEJ-441)
To enhance performance, and in response to customer feedback, we deactivated the regular expressions for extracting the following entity types: IDENTIFIER:DISTANCE, IDENTIFIER:LATITUDE_LONGITUDE, IDENTIFIER:UTM, TEMPORAL:DATE, and TEMPORAL:TIME. You can restore support for any of these entity types by removing the @ignore=rex-je attribute value that appears in front of the relevant regular expressions in the regexes.xml files. (TEJ-510)
Improved speed and reduced memory consumption of regex and gazetteer matching. (TEJ-489)
Improved the accuracy of the statistical models for case-insensitive Portuguese and French. (TEJ-475)
Added the ability to modify the indoc shut-off threshold: setMaxResolvedEntities(). (TEJ-456)
Introduced an experimental EntityExtractor API for excluding entity types in which you are not interested. See the Javadoc for {get,set}ExcludedEntityTypes for details. Note that this is an experimental API which can change or be removed in future minor versions of REX. (TEJ-480)
REX emits an exception when asked to extract in a language it has no data for. (TEJ-447)
Missing values for confidence and coreferenceChainId are now represented as nulls instead of -1. (TEJ-466, TEJ-467)
Improved Korean Entity Extraction and In-Document Co-reference Resolution
REX uses a new statistical model that achieves higher accuracy (about 25% overall error reduction). (APE-1111)
In-document co-reference resolution now recognizes Korean prefixes and suffixes, and will attempt to chain morphological variations of Korean entity mentions. (TEJ-366)
Added features to the REXCmd
utility
Plaintext output now pretty-prints the chain ID for each entity mention it returns. Mentions with the same chain ID refer to the same entity. (TEJ-410)
The -context option marks the entities in their original text context, with embedded entity type and chain ID. (TEJ-410)
REXCmd now supports pre-annotated input in JSON-serialized Annotated Text format. (TEJ-455)
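Because mentions with the same chain ID refer to the same entity, a consumer of the output can group mentions by that ID. The following is a minimal sketch: the Mention class is a hypothetical stand-in, not a REX or Annotated Text class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ChainGrouping {
    /** Hypothetical stand-in for an entity mention; not a REX class. */
    static final class Mention {
        final String text;
        final Integer chainId; // may be null when no chain was assigned

        Mention(String text, Integer chainId) {
            this.text = text;
            this.chainId = chainId;
        }
    }

    /** Groups mention texts by chain ID in first-seen order, skipping mentions without one. */
    static Map<Integer, List<String>> byChain(List<Mention> mentions) {
        Map<Integer, List<String>> chains = new LinkedHashMap<>();
        for (Mention m : mentions) {
            if (m.chainId != null) {
                chains.computeIfAbsent(m.chainId, k -> new ArrayList<>()).add(m.text);
            }
        }
        return chains;
    }
}
```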
Modified the reporting of text offsets for partial regular-expression and gazetteer matches to align with the token boundaries of the tokens that contain the matched text. (TEJ-393)
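The effect of this change can be pictured as widening a partial match until it covers every token it touches. The sketch below illustrates that idea only; it is not REX's implementation, and the class, method, and array-based token representation are assumptions made for the example.

```java
// Illustrative only; REX's internal alignment logic is not shown in these notes.
final class TokenAlignment {
    /**
     * Expands the half-open span [start, end) to cover every token it overlaps.
     * tokenStarts and tokenEnds are parallel arrays of half-open token offsets.
     */
    static int[] align(int start, int end, int[] tokenStarts, int[] tokenEnds) {
        int alignedStart = start;
        int alignedEnd = end;
        for (int i = 0; i < tokenStarts.length; i++) {
            boolean overlaps = tokenStarts[i] < end && tokenEnds[i] > start;
            if (overlaps) {
                alignedStart = Math.min(alignedStart, tokenStarts[i]);
                alignedEnd = Math.max(alignedEnd, tokenEnds[i]);
            }
        }
        return new int[] {alignedStart, alignedEnd};
    }
}
```

For the text "call 555-1234 now" with tokens at [0,4), [5,13), and [14,17), a match on "1234" at [9,13) is widened to [5,13), the whole token "555-1234".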
The REX Tcl implementation of regular expressions now supports characters in the Supplementary Multilingual Plane (SMP) (TEJ-454). Previous releases represented SMP codepoints as two characters each. (TEJ-330)
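The "two characters each" behavior mirrors how Java strings store SMP characters as a surrogate pair of UTF-16 code units. The following sketch uses only standard java.lang calls to show the difference between counting code units and counting code points. The class name is hypothetical, and the example says nothing about how the REX regex engine is implemented internally.

```java
// Standard-library demonstration; the class name is hypothetical.
final class SupplementaryChars {
    /** Number of UTF-16 code units; an SMP character counts twice. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points; an SMP character counts once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }
}
```

For a string built with new String(Character.toChars(0x1F600)), codeUnits returns 2 and codePoints returns 1.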
Provided an EntityExtractor option to instruct the REX statistical processor to ignore the lowercase/uppercase distinction. This feature is currently supported for English, French, and Portuguese. (TEJ-458)
Added the EntityExtractor.createDispatchAnnotator method to allow annotating documents from a predefined set of languages. (TEJ-432)
Added Portuguese regular expressions for temporal and monetary expressions. (TEJ-444)
Added the Fragment Boundary Detector to enable the extractor to separate entities in text fragments that do not form sentences. (TEJ-117)
Refined the usage pattern for the com.basistech.rosette.rex.REXCmd command-line utility. The JSON output this utility generates has been trimmed to represent the serialization of an AnnotatedText object. (TEJ-374)
Deprecated com.basistech.rosette.rex.EntityCursor. Use com.basistech.rosette.rex.EntityExtractor and com.basistech.rosette.dm.Annotator to extract entities from an input document.
Re-established pattern matcher support for the regular expressions disabled in 2.1.0.
Added a document-level API for extracting entities. Over time, we expect to deprecate EntityCursor (streaming) in favor of EntityExtractor and Annotator (document-level extraction).
Added an API (the EntityExtractor setPostConfidence method) for extracting a floating-point confidence value for each entity that REX Java Edition finds. A potential entity is ignored if its confidence score is below the threshold set with the setConfidenceThreshold method. (TEJ-60)
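The threshold rule can be pictured as a simple filter over scored candidates. The sketch below is illustrative only and does not call the REX API: the class and method are hypothetical, and it assumes that a score exactly equal to the threshold is kept, since the text above says only that scores below it are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative filter; not the REX implementation.
final class ConfidenceFilter {
    /** Keeps scores at or above the threshold; scores below it are dropped. */
    static List<Double> keep(List<Double> scores, double threshold) {
        List<Double> kept = new ArrayList<>();
        for (double score : scores) {
            if (score >= threshold) {
                kept.add(score);
            }
        }
        return kept;
    }
}
```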
REX Java Edition returns normalized entities from all sources: statistical, pattern matching, and exact matches (gazetteers). (TEJ-288)
We have disabled the use of pattern matcher regular expressions that do not terminate properly, consuming large amounts of CPU time. This change could cause REX to miss some temporal, distance, and long/lat expressions. If your use case requires high recall on these numeric types, please contact support@rosette.com for assistance on enabling these regular expressions. (TEJ-283)
The REXCmd command-line utility now includes the REX Java Edition version number in the output. (TEJ-261)
Added a shell script (Unix) and .bat file (Windows) to simplify the running of the REXCmd command-line utility. (TEJ-247)
The JSON document generated by the command-line utility is much more verbose than in previous releases. Accordingly, if you are not writing the output to a file, you may want to pipe the output through a JSON parser (e.g., | python -mjson.tool) and concentrate on the EntityMention elements. (TEJ-260)
Added support for Arabic, Simplified Chinese, Traditional Chinese, Korean, and Japanese.
Added support for using the REX Field Training Kit to enhance accuracy when handling a particular category of documents and to return new entity types. (TEJ-220)
Added support for Dutch, Hebrew, Persian (Western Farsi and Dari), Portuguese, Pashto, and Urdu.
Incorporated improvements to the statistical language models, gazetteers, and regular expressions introduced in the Rosette C++ REX implementation since the release of REX Java Edition 1.1.
Enhanced the command-line utility with new options. (TEJ-212)
Added support for resetting the maximum number of tokens that an entity may include, which defaults to 8. Use the EntityExtractor setMaxEntityTokens(int) method. (TEJ-188)
Added support for Uppercase English, French, German, Italian, Russian, and Spanish.
Added support for resolving coreferences to the same entity. Use EntityExtractor setResolveNamedEntities(true) to put coreferences to the same entity in an entity chain: see EntityCursor getChainId(). (TEJ-52)
In response to customer feedback, removed IDENTIFIER:NUMBER from the default set of entity types returned by regular expressions. We commented out the IDENTIFIER:NUMBER entries in the regexes.xml files in data/regex/lang/accept, so you can re-activate any of these entries if you wish. (TEJ-171)
Added a public EntityCursor hasNext() method that can be used to determine whether there are any more entities in the result set, without advancing to the next entity. (TEJ-150)
Added the following EntityExtractor methods:
public void setStatisticalModel(LanguageCode, InputStream);
public void addGazetteer(LanguageCode, InputStream, boolean);
public void addGazetteer(LanguageCode, InputStream);
public void addRegularExpressions(LanguageCode, InputStream, boolean);
public void setRedactorWeights(InputStream);
public void addJoinerRules(InputStream);
public void setLicense(InputStream);
These methods enable access to data files placed in a JAR file (perhaps for use in a Hadoop environment). (TEJ-59)
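A data file bundled in a JAR is typically read as a classpath resource, which yields the InputStream these methods accept. The following is a minimal sketch using only standard Java calls. The JarResources helper and the resource path are hypothetical, and the extractor call appears only in a comment because it needs the REX library and an already configured EntityExtractor.

```java
import java.io.FileNotFoundException;
import java.io.InputStream;

// Hypothetical helper; not part of the REX API.
final class JarResources {
    /** Opens a classpath resource (for example, one bundled in a JAR) as a stream. */
    static InputStream open(String resourcePath) throws FileNotFoundException {
        InputStream in = JarResources.class.getClassLoader().getResourceAsStream(resourcePath);
        if (in == null) {
            throw new FileNotFoundException("Resource not on classpath: " + resourcePath);
        }
        return in;
    }

    // Possible usage, assuming an existing EntityExtractor named "extractor", a LanguageCode
    // value named "language", and a hypothetical resource path:
    //
    //   try (InputStream in = JarResources.open("rex-data/my-gazetteer.txt")) {
    //       extractor.addGazetteer(language, in);
    //   }
}
```

Failing fast on a missing resource avoids handing a null stream to the extractor, which would be harder to diagnose.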
Bug number is followed by a brief bug description.
TEJ-1281 Set log level to Debug instead of Warn when linkEntities and genre don’t agree
TEJ-1319, TEJ-1346, ELK-114 Picked up new TVEC to improve the efficiency of the initial load time
TEJ-1324 Fixed a bug where the salience score was not always returned for entities with pronominal mentions, when requested.
TEJ-1327 Consumed new RBL to fix null pointer exception with pronoun resolver
TEJ-1331 Removed xxx from reported supported languages
APE-1766 Fixed a bug where all entity mentions were not always returned by statistical model
TEJ-1282 Relocated LIBLINEAR and TVEC, tested with rli
TEJ-1283 Fixed cases in which QID was not returned although DBPedia result was available, when using REX’s Kblinker
TEJ-1292 Custom processors made configurable via REXFactoryConfiguration
ELK-82 Fixed an acronym feature issue.
TEJ-1160 Moved ORGANIZATION eng regexes to supplemental directory to improve performance
TEJ-1167 Improved failure message for missing flinx data directory
TEJ-1168 Fixed null pointer exception with long chains
TEJ-1067 Fixed calculateConfidence configuration bug
TEJ-1020 Added regex for Israeli ID number
TEJ-1054 Cleaned MD5 codes extracted as PRODUCT entity type
TEJ-1049 Fixed case-insensitive text file gazetteer treated as case-sensitive
TEJ-1041 Improved redactor rules for extractor and linking overlaps and indoc-coref to prefer chaining based on linking ID.
TEJ-1039 Enabled emoticon mode for RBL for RosAPI
TEJ-1042 Fixed error handling for requesting salience score while indoc-coref is disabled
TEJ-950 Improved custom processor examples.
TEJ-951 Apply American SSN regex for English only.
TEJ-962 Fixed personal ID entity type in Vietnamese regex.
TEJ-1005 Fixed source and subsource for static user gazetteer.
TEJ-755 Relocated RBL classes in REX to prevent version conflicts when using RBL in conjunction with REX.
TEJ-735, TEJ-738 Fixed a bug in which Entity Type Survey didn’t support non-default field training models or customer-defined types.
TEJ-747 Fixed a bug in which REXCmd produced unexpected offsets for files with DOS-style line endings.
TEJ-718 Added Malaysian sample text.
TEJ-676, TEJ-691 Shaded and relocated all 3rd party dependencies.
TEJ-683 Removed extraneous pom.xml’s from META-INF in distro jar.
TEJ-573 REXCmd now ignores a BOM at the beginning of its input file.
TEJ-574 Lookbehind assertions are not supported, and this is now included in the Application Developer Guide.
TEJ-459, TEJ-507 indoc chaining applied only to PER/LOC/ORG
TEJ-498 Application Developer Guide claims that REX extracts full postal addresses
TEJ-460 Partial match regex produces wrong offsets
TEJ-438 ICU dependency not relocated/shaded
TEJ-415 REX requires an RBL license
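Applications that read input files themselves, rather than through REXCmd, may want the same treatment of a leading BOM that TEJ-573 describes. The following is a minimal sketch, not REXCmd's actual code. It assumes the text has already been decoded into a Java String, where a byte order mark appears as the single character U+FEFF, and the class name is hypothetical.

```java
// Hypothetical helper; not REXCmd's implementation.
final class BomStripper {
    /** Removes a leading U+FEFF (the decoded byte order mark), if present. */
    static String stripBom(String text) {
        boolean hasBom = !text.isEmpty() && text.charAt(0) == '\uFEFF';
        return hasBom ? text.substring(1) : text;
    }
}
```

Stripping the mark shifts every later character offset down by one relative to the raw file, so apply it consistently before computing or comparing offsets.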