A Solr plugin is available for REX. The plugin integrates into Solr's update chain, processing requested text fields and creating new fields containing extracted entities.
You can find a short tutorial walking through a basic installation and configuration of a simple REX sample in the doc directory inside the REX Solr Plugin package.
Installing the Solr Plugin
To install the Solr plugin, copy all files from the lib directory inside the REX Solr Plugin package into the lib directory of your Solr core. In addition, a reference to your REX installation is required in your core's solrconfig.xml file:
<lib dir="${rexje.root}/lib" regex=".*\.jar"/>
Either set the rexje.root property in your Solr installation to point to your REX installation directory, or replace ${rexje.root} in the line above with the path itself.
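As a middle ground between the two options, Solr's property substitution syntax accepts a default value after a colon. The following is a minimal sketch; the path /opt/rex is a hypothetical installation directory, not one defined by the REX distribution.

```xml
<!-- Uses the rexje.root property when it is defined; otherwise falls back
     to /opt/rex, a hypothetical install location used only for illustration. -->
<lib dir="${rexje.root:/opt/rex}/lib" regex=".*\.jar"/>
```

With this form, the property only needs to be defined (for example as a JVM system property) on hosts where REX is installed somewhere other than the default path.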
To use the plugin, configure a processor chain that includes it in solrconfig.xml. You can either create a dedicated processor chain or integrate the plugin into an existing one. The simplest configuration looks like this:
<updateRequestProcessorChain name="rex">
  <processor class="com.basistech.rosette.solr.EntityExtractorUpdateProcessorFactory">
    <str name="rootDirectory">${rexje.root}</str>
    <str name="fields">text_eng</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
rootDirectory and fields are mandatory parameters. rootDirectory should point to your REX installation directory. fields tells the plugin which document fields to process, as described below.
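A named chain such as "rex" only runs when an update request selects it. One common way to do that in Solr is to set update.chain as a default on the update request handler. The sketch below assumes you want every request to /update to pass through the chain; adapt the handler name to your own setup.

```xml
<!-- Route all requests sent to /update through the "rex" chain by default.
     Assumption: no other handler definition for /update conflicts with this one. -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">rex</str>
  </lst>
</requestHandler>
```

Alternatively, a client can select the chain per request by passing update.chain=rex as a request parameter, which leaves other update traffic unaffected.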
The REX plugin creates new output fields when it runs. Make sure your schema allows dynamic fields to be created, or configure the required dynamic fields ahead of time. REX output fields are multi-valued string fields. Each is named after the field being processed, suffixed with _REX_{ENTITYTYPE}. For example, in the simple processor chain above, a field called text_eng_REX_PERSON might be created. A schema entry for this field (and for the other entity types extracted from the text_eng field) can be set up like this in your schema definition:
<dynamicField name="text_eng_REX_*" type="string" indexed="true" stored="true" multiValued="true"/>
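To make the naming rule concrete, the following sketch shows a document in Solr's XML update format as it might be submitted to the chain. The id value, the sample sentence, and the extracted value mentioned in the trailing comment are hypothetical; the entities and entity types actually produced depend on your REX models and configuration.

```xml
<add>
  <doc>
    <field name="id">doc-1</field>
    <!-- Source field listed in the plugin's "fields" parameter. -->
    <field name="text_eng">Jane Doe visited the office on Monday.</field>
  </doc>
</add>
<!-- Hypothetical outcome: after the chain runs, the indexed document might
     also carry a multi-valued field text_eng_REX_PERSON holding "Jane Doe",
     which the dynamicField pattern text_eng_REX_* would match. -->
```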
Solr Plugin Docker Container
The REX Solr plugin is also available as a Docker container. To install it, unzip the REX Solr plugin container distribution package and run:
docker load -i rex-solr-docker.tar.gz
To run the image with its default configuration, which contains the demo Solr project, locate your REX license file (rlp-license.xml) and run:
REX_LICENSE_PATH=[path_to_license_file]/rlp-license.xml docker-compose up
The Solr instance inside the image should run on its default port, 8983, with the REX Solr demo active.
The default docker-compose.yaml provided loads the image with English data files. To use other languages, edit docker-compose.yaml as follows before running docker-compose up:
-
Add a service reference to the desired language, following this template (English should already be defined in the default configuration and can be copied and modified):
    root-rex-[lang]-[version]:
      image: rosette/root-rex-[lang]:[version]
      volumes:
        - rosette-roots-vol:/roots-vol
-
Add the new service reference to the depends_on section of the rex-solr-demo reference, as follows:
    rex-solr-demo:
      image: rex-solr-demo:latest
      depends_on:
        - root-rex-[lang]-[version]
        - root-rex-[lang]-[version]
        ...
        - root-rex-root-[version]
To replace the demo Solr configuration with your own, connect to the running container: identify its container ID by running docker ps and then run
docker exec -it [container_id] /bin/bash
The Solr installation can be found in var/solr. You may also mount external database storage locations using regular Docker commands and reference them from the Solr configuration.
When the plugin is run as part of a Solr processor chain, REX processes all fields listed in the fields parameter in the plugin configuration, and populates multi-valued entity fields for every entity type extracted.
Fields processed by the plugin must conform to a naming convention: each field name must end with an underscore followed by the ISO 639 language code of the field's text. For example, REX can run on fields named text_eng or article_content_jpn. You should either set up your fields with these names in the original schema, or use other update processor plugins to identify a field's language and copy its content to a dynamic field with a compatible name.
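When a field's language is already known but its name lacks the language suffix, one way to satisfy the naming convention is to copy it with Solr's CloneFieldUpdateProcessorFactory before the REX processor runs. The sketch below is only one possible setup: the chain name rex-clone and the field names body and body_eng are hypothetical, and body_eng must be accepted by your schema (as an explicit field or through a dynamic field).

```xml
<updateRequestProcessorChain name="rex-clone">
  <!-- Copy the hypothetical "body" field into "body_eng" so that the name
       carries the language suffix the plugin expects. -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">body</str>
    <str name="dest">body_eng</str>
  </processor>
  <!-- The REX processor must come after the clone step. -->
  <processor class="com.basistech.rosette.solr.EntityExtractorUpdateProcessorFactory">
    <str name="rootDirectory">${rexje.root}</str>
    <str name="fields">body_eng</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

If the language is not known in advance, a language identification processor would need to run earlier in the chain and produce a field name with a compatible code; that setup is not shown here.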
Every configuration option available in the SDK for the EntityExtractor class is available in the plugin, except for those related to entity linking, which is not currently supported. Dynamically adding individual regular expressions or gazetteer entries is also not supported. Refer to the chart below for the appropriate parameter to add to the processor definition in solrconfig.xml. All parameters are strings.
For example, setting up the plugin to use a confidence threshold might look like this:
<processor class="com.basistech.rosette.solr.EntityExtractorUpdateProcessorFactory">
  <str name="rootDirectory">${rexje.root}</str>
  <str name="fields">text_eng</str>
  <str name="calculateConfidence">True</str>
  <str name="confidenceThreshold">0.8</str>
</processor>