You can configure Apache Solr search to use RBL for both indexing documents and processing queries.
To index and search documents with RBL in a Solr application, you must add JARs to the Solr classpath and define Solr analysis chains that apply the RBL analysis components to process text at index and query time.
Setting Solr 9.0 Permissions
In Solr 9.0.0 and later, you must update the policy file at server/etc/security.policy
to grant Solr permission to read the files in RBL-JE's root directory. Add the following block to the file:
grant {
  permission java.io.FilePermission "<RBLJE_ROOT>${/}-", "read";
};
where you replace <RBLJE_ROOT>
with the path to the root of your RBL installation.
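For example, if RBL is installed at /opt/local/bt/rbl-je (the installation path used in the examples below), the block becomes:
grant {
  permission java.io.FilePermission "/opt/local/bt/rbl-je${/}-", "read";
};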
Adding to the Solr Classpath
Add the following lib
elements to the solrconfig.xml
for each Solr collection you are using.
<lib path="<RBLJE_ROOT>/lib/btrbl-je-<version>.jar"/>
<lib path="<RBLJE_ROOT>/lib/btcommon-api-<version>.jar"/>
<lib path="<RBLJE_ROOT>/lib/slf4j-api-<version>.jar"/>
<lib path="<RBLJE_ROOT>/lib/btrbl-je-lucene-solr-<version>-<version>.jar"/>
where you replace <RBLJE_ROOT> with the path to the root of your RBL installation, and take the full file names, with the correct <version> values, from the RBL lib directory. The file names can change with each new release.
For example, if the root of the RBL installation is /opt/local/bt/rbl-je
, the version of Solr is 8.6, and the version of RBL is 7.44.0.c67.0, the lib paths are:
<lib path="/opt/local/bt/rbl-je/lib/btrbl-je-7.44.0.c67.0.jar"/>
<lib path="/opt/local/bt/rbl-je/lib/btcommon-api-37.0.1.jar"/>
<lib path="/opt/local/bt/rbl-je/lib/slf4j-api-1.7.28.jar"/>
<lib path="/opt/local/bt/rbl-je/btrbl-je-lucene-solr-7_0-7.44.0.c67.0.jar"/>
The correct Lucene/Solr version files are listed in the section Lucene/Solr Versions.
The SLF4J JARs enable logging.
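The lib elements are direct children of the config element in solrconfig.xml, typically near the top of the file. A sketch using the example paths above:
<config>
  <!-- RBL JARs; adjust paths and versions to match your installation -->
  <lib path="/opt/local/bt/rbl-je/lib/btrbl-je-7.44.0.c67.0.jar"/>
  <lib path="/opt/local/bt/rbl-je/lib/btcommon-api-37.0.1.jar"/>
  <lib path="/opt/local/bt/rbl-je/lib/slf4j-api-1.7.28.jar"/>
  <lib path="/opt/local/bt/rbl-je/lib/btrbl-je-lucene-solr-7_0-7.44.0.c67.0.jar"/>
  <!-- ... the rest of your solrconfig.xml ... -->
</config>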
Defining a Solr Analysis Chain
In the Solr schema.xml
or managed-schema.xml
, add a fieldType
element and a corresponding field
element for the language of the documents processed by the application.
Field Type. The fieldType
includes two analyzers: one for indexing documents and one for querying documents. Each analyzer contains a tokenizer and a token filter.
Here, for example, is a fieldType
for Japanese:
<fieldType name="basis-japanese" class="solr.TextField">
<analyzer type="index">
<tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
language="jpn"
rootDirectory="<RBLJE_ROOT>"
/>
<filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
language="jpn"
rootDirectory="<RBLJE_ROOT>"
/>
</analyzer>
<analyzer type="query">
<tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
language="jpn"
rootDirectory="<RBLJE_ROOT>"
/>
<filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
language="jpn"
rootDirectory="<RBLJE_ROOT>"
query="true"
/>
</analyzer>
</fieldType>
where you replace <RBLJE_ROOT> with the path to the root of your RBL installation. The fieldType name indicates the language, and each language attribute is set to the ISO 639-3 language code for Japanese.
Note
You can incorporate any Solr filter you need, such as the Solr lowercase filter; however, add such filters to the chain after the Base Linguistics token filter. If you modify the token stream too radically before RBL processes it, you degrade RBL's ability to analyze the text.
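For example, a sketch of the Japanese index analyzer above with the Solr lowercase filter appended (the query analyzer would be extended the same way):
<analyzer type="index">
  <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
             language="jpn"
             rootDirectory="<RBLJE_ROOT>"/>
  <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
          language="jpn"
          rootDirectory="<RBLJE_ROOT>"/>
  <!-- Additional Solr filters go after the Base Linguistics token filter -->
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>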
Field. The analysis chain requires a field
definition with a type
attribute that maps to the fieldType
. For the Japanese example above, add the following field
definition to schema.xml
.
<field name="text-japanese" type="basis-japanese" indexed="true" stored="true"/>
In your Solr application, you can now index and query Japanese documents placed in the text-japanese
field.
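For example, assuming the schema also defines an id field (standard in the default Solr configsets, but an assumption here), a document posted to the collection's update handler might look like this:
<add>
  <doc>
    <field name="id">doc1</field>
    <field name="text-japanese">東京都に住んでいます。</field>
  </doc>
</add>
A query such as q=text-japanese:東京 is then processed by the query-time analyzer defined above.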
Most API options can be used in a Solr analysis chain. In Solr, you do not directly use the option enum classes. Instead, options are specified in the schema using the format option="value"
.
When specifying options for the tokenizer
(class BaseLinguisticsTokenizerFactory
), use options in the TokenizerOption
enum class. When specifying options for a filter
(class BaseLinguisticsTokenFilterFactory
), use options in the AnalyzerOption
and FilterOption
enum classes.
Example:
<fieldType class="solr.TextField" name="basis-french">
<analyzer type="index">
<tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory" language="fra"
rootDirectory="${bt_root}"/>
<filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory" language="fra"
rootDirectory="${bt_root}"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory" language="fra"
query="true" rootDirectory="${bt_root}"/>
<filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory" language="fra"
query="true" rootDirectory="${bt_root}"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
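As with the Japanese example, pair this field type with a field definition; the field name text-french below is illustrative:
<field name="text-french" type="basis-french" indexed="true" stored="true"/>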
Using Neural Models with Solr
When using the neural models with Solr, you must grant the appropriate permissions and may want to increase the JVM memory.
The security manager is enabled by default in Solr 9. To use any of the neural models, grant the following permissions in server/etc/security.policy
:
grant {
  permission java.io.FilePermission "$HOME/.javacpp/cache", "read,write,delete,execute";
  permission java.lang.RuntimePermission "loadLibrary.*";
};
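If you use both the RBL analysis components and the neural models, the permissions can go in a single grant block. A sketch, assuming RBL is installed at /opt/local/bt/rbl-je:
grant {
  permission java.io.FilePermission "/opt/local/bt/rbl-je${/}-", "read";
  permission java.io.FilePermission "$HOME/.javacpp/cache", "read,write,delete,execute";
  permission java.lang.RuntimePermission "loadLibrary.*";
};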
Neural models may require more memory. We recommend increasing the JVM memory to at least 1 GB by editing bin/solr.in.sh:
SOLR_JAVA_MEM="-Xms1g -Xmx1g"
Activating User Dictionaries in Solr
You can use user dictionaries with RBL in Solr. Specify the userDefinedDictionaryPath option as shown in this example:
<fieldType class="solr.TextField" name="basis-japanese">
<analyzer>
<tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
tokenizerType="spaceless_lexical"
language="jpn"
rootDirectory="${bt_root}"
userDefinedDictionaryPath="/path/to/my/jpn-udd.bin"/>
<filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
tokenizerType="spaceless_lexical"
language="jpn"
rootDirectory="${bt_root}"/>
</analyzer>
</fieldType>
Here is an example of using a JLA reading dictionary:
<fieldType class="solr.TextField" name="basis-japanese-rd">
<analyzer>
<tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
tokenizerType="spaceless_lexical"
readings="true"
language="jpn"
rootDirectory="${bt_root}"
userDefinedReadingDictionaryPath="/path/to/my/readings.bin"/>
<filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
tokenizerType="spaceless_lexical"
language="jpn"
rootDirectory="${bt_root}"/>
</analyzer>
</fieldType>