RBL provides an API for integrating with Apache Lucene. See the Javadoc that accompanies this package for the complete API documentation.
A Lucene analyzer examines the text of fields and generates a token stream. Rosette provides an RBL base linguistics analyzer for Lucene, which uses RBL's linguistic tools.
RBL supports two usage patterns for incorporating Rosette Base Linguistics in a Lucene application.
Samples are included in the use cases below. The code comes from the Lucene 7.0 samples found in rbl-je-<version>/samples/lucene-7_0.
RBL supports Lucene versions 7.0 - 9.7 and Solr versions 7.0 - 9.3 with the following files, where <version> is the RBL version:
The following options are passed in through an options Map.
Table 26. Lucene Filter Options

| Option | Description | Type (Default) | Supported Languages |
| --- | --- | --- | --- |
| addLemmaTokens | Indicates whether the token filter should add the lemmas (if none, the stems) of each surface token to the tokens being returned. | Boolean (true) | All |
| addReadings | Indicates whether the token filter should add the readings of each surface token to the tokens being returned. | Boolean (false) | Chinese, Japanese |
| identifyContractionComponents | Indicates whether the token filter should identify contraction components as contraction components rather than as lemmas. | Boolean (false) | All |
| replaceTokensWithLemmas | Indicates whether the token filter should replace a surface token with its lemma. Disambiguation must be enabled. | Boolean (false) | All |
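For example, a minimal sketch of passing these options through the Map (modeled on the factory examples later in this section; rootDirectory and licensePath are assumed to be defined as in those examples):
Map<String, String> options = new HashMap<>();
options.put("language", "eng");
options.put("rootDirectory", rootDirectory);
options.put("licensePath", licensePath);
// replaceTokensWithLemmas requires disambiguation to be enabled.
options.put("disambiguate", "true");
options.put("replaceTokensWithLemmas", "true");
TokenFilterFactory filterFactory = new BaseLinguisticsTokenFilterFactory(options);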
Using the RBL Lucene Base Linguistics Analyzer
The RBL Lucene base linguistics analyzer (BaseLinguisticsAnalyzer) provides an analysis chain with a language-specific tokenizer and token filter.
You can configure the analyzer with a number of tokenizer and token filter options.
The analysis chain includes the Lucene filters LowerCaseFilter and CJKWidthFilter. It also provides support for including a StopFilter.
Japanese Analyzer Sample. This Lucene sample, JapaneseAnalyzerSample.java, uses the base linguistics analyzer to generate an enriched token stream from Japanese text.
- Assemble a set of options that include the root directory, the path to the Rosette license, the language to analyze (Japanese), and the generation of readings for the tokens.
File rootPath = new File(rootDirectory);
String licensePath = new File(
rootPath, "licenses/rlp-license.xml").getAbsolutePath();
Map<String, String> options = new HashMap<>();
options.put("language", "jpn");
options.put("rootDirectory", rootDirectory);
options.put("licensePath", licensePath);
options.put("addReadings", "true");
- Instantiate a Japanese base linguistics analyzer with these options.
rblAnalyzer = new BaseLinguisticsAnalyzer(options);
- Read in a Japanese text file.
- Use the analyzer to generate a token stream. The token stream contains tokens, lemmas that are not identical to their tokens, and readings. Disambiguation is turned off by default for Japanese, so multiple analyses may be returned for each token. To turn on disambiguation, add options.put("disambiguate", "true"); to the construction of the options Map.
- Write each element in the token stream with its type attribute to an output file, as sketched below.
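A sketch of this step, assuming the rblAnalyzer created above and hypothetical japaneseText and out variables (the field name "text" is arbitrary):
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

try (TokenStream stream = rblAnalyzer.tokenStream("text", japaneseText)) {
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    TypeAttribute type = stream.addAttribute(TypeAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        out.println(term.toString() + " " + type.type()); // e.g. 行う <LEMMA>
    }
    stream.end();
}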
To run the sample, in a Bash shell (Unix) or Command Prompt (Windows), navigate to rbl-je-<version>/samples/lucene-7_0 and use the Ant build script.
The sample reads rbl-je-<version>/samples/data/jpn-text.txt and writes the output to jpn-Analyzed-byAnalyzer.txt.
The output includes each token, lemma, and reading in the token stream on a separate line with the type attribute for each element. A token may have more than one analysis (hence the multiple lemmas and readings in the sample output); lemmas are not put into the token stream when they are identical to the token.
Table 27. Supported Token Types

| Type Attribute | Meaning | Option |
| --- | --- | --- |
| <ALNUM> | token | |
| <COMP> | compound component | |
| <CONT> | contraction component | identifyContractionComponents |
| <LEMMA> | lemma | addLemmaTokens, identifyContractionComponents |
| <READING> | reading | addReadings |
For example:
メルボルン <ALNUM>
メルボルン <READING>
で <ALNUM>
行わ <ALNUM>
行う <LEMMA>
オコナワ <READING>
れ <ALNUM>
る <LEMMA>
れる <LEMMA>
レ <READING>
Creating Your Own RBL Analysis Chain
When creating an analysis chain, you can do the following:
- Use the BaseLinguisticsTokenizerFactory to generate a language-specific tokenizer that applies Rosette Base Linguistics to tokenize text.
- Use the BaseLinguisticsTokenFilterFactory to generate a language-specific token filter that enhances a stream of tokens.
- Add other token filters to the analysis chain.
One of the tasks of the token filter is to set tokens’ token types. For example, given a lemma token, the token filter gives it the type <LEMMA>. Given a contraction component token, the token filter has a choice: by default, the type is set to <LEMMA>, but when the identifyContractionComponents option is enabled, the type is set to <CONT>.
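As a hedged illustration (the option values here are assumptions, not taken from the samples):
Map<String, String> options = new HashMap<>();
options.put("language", "eng");
options.put("rootDirectory", rootDirectory);
options.put("licensePath", licensePath);
// With this option, a contraction component (for example, a component of
// English "don't") is typed <CONT>; without it, the type would be <LEMMA>.
options.put("identifyContractionComponents", "true");
TokenFilterFactory factory = new BaseLinguisticsTokenFilterFactory(options);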
Japanese Tokenizer and Filter Sample
This Lucene sample, JapaneseTokenizerAndFilterSample.java, creates an analysis chain to generate an enriched token stream.
- Use a factory to set up a language-specific base linguistics tokenizer, which puts tokens in the token stream.
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
factory.setOption(BaseLinguisticsOption.licensePath, licensePath);
Map<String, String> options = new HashMap<>();
options.put("language", "jpn");
options.put("rootDirectory", rootDirectory);
options.put("addReadings", "true");
BaseLinguisticsTokenFilterFactory tokenFilterFactory = new BaseLinguisticsTokenFilterFactory(options);
- Use a base linguistics token filter factory to set up a language-specific base linguistics token filter, which adds lemmas and readings to the tokens in the token stream.
Tokenizer tokenizer = new BaseLinguisticsTokenizer(factory.createTokenizer(null, LanguageCode.JAPANESE));
tokenizer.setReader(input);
TokenStream tokens = tokenFilterFactory.create(tokenizer);
- To replicate the behavior of the analyzer in the previous example, this sample also includes the LowerCaseFilter and CJKWidthFilter.
tokens = new LowerCaseFilter(tokens);
tokens = new CJKWidthFilter(tokens);
- Write each element in the token stream with its type attribute to an output file.
To run the sample, in a Bash shell (Unix) or Command Prompt (Windows), navigate to rbl-je-<version>/samples/lucene-7_0 and use the Ant build script:
ant runTokenizerAndFilter
The example reads the same file as the previous sample and writes the output to jpn-analyzed-byTokenizerAndFilter.txt. The content matches the content generated by the previous example.
Using the BaseLinguisticsSegmentationTokenFilter
If you are using your own whitespace tokenizer and processing text that requires segmenting Chinese, Japanese, or Thai, you can use the BaseLinguisticsSegmentationTokenFilterFactory to create a BaseLinguisticsSegmentationTokenFilter, then place the segmentation token filter in an analysis chain following the whitespace tokenizer and preceding other filters, such as a base linguistics token filter.
The segmentation token filter segments each of the tokens from the whitespace tokenizer into individual tokens where necessary. Refer to the Javadoc for the RBL API for Lucene for more information.
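A hedged sketch of such a chain (the constructor arguments are assumed to mirror the other RBL factories in this section; check the Javadoc for the actual signatures):
Map<String, String> options = new HashMap<>();
options.put("language", "zho");
options.put("rootDirectory", rootDirectory);
options.put("licensePath", licensePath);
BaseLinguisticsSegmentationTokenFilterFactory segFactory =
        new BaseLinguisticsSegmentationTokenFilterFactory(options);

Tokenizer whitespace = new WhitespaceTokenizer(); // your own whitespace tokenizer
whitespace.setReader(reader);
TokenStream tokens = segFactory.create(whitespace);
// Follow with other filters, such as a base linguistics token filter.
tokens = new BaseLinguisticsTokenFilterFactory(options).create(tokens);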
Using the AnalysesAttribute
You can use the com.basistech.rosette.lucene.AnalysesAttribute object to gather linguistic data about the text in a document. Depending on the language, the data may include tokens, normalized tokens, lemmas, part-of-speech tags, readings, compound components, and Semitic roots.
The Lucene sample, AnalysesAttributeSample.java, illustrates this.
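A hedged sketch of retrieving the attribute (stream is assumed to be a TokenStream produced by an RBL analysis chain; the attribute's accessor methods are documented in the Javadoc and not reproduced here):
AnalysesAttribute analyses = stream.addAttribute(AnalysesAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    // Inspect the per-token linguistic data via the analyses attribute here.
}
stream.end();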
To run the sample with the German sample file, navigate to rbl-je-<version>/samples/lucene-<version>, and call ant as follows:
ant -Dtest.language=deu runAnalysesAttribute
The sample writes the output to deu-analysesAttributes.txt.
Case Sensitivity During the Analysis
In some languages, case distinctions are meaningful. For example, in German, a word may be a noun if it begins with an upper-case letter, and not a noun if it does not. As a result, RBL delivers higher accuracy in selecting lemmas and splitting compounds when it can process text with correct casing. On the other hand, users typing in queries may be sloppy with capital letters.
For this reason, the default behavior of the Lucene integration is to perform the following analysis steps:
- tokenize
- determine lemmas
- map to lowercase
The result is that the index contains the lowercase form of the most accurately selected lemma.
However, some applications work with text in which case distinctions are not reliably present, even in languages where they are important. These applications need to determine lemmas and compound components even though the spelling is nominally incorrect with respect to case.
To support these applications, RBL provides a 'case-insensitive' mode of operation. In this mode, RBL performs the following analysis steps:
- tokenize, ignoring the case of abbreviations and such
- determine lemmas, ignoring case in choosing lemmas and compound components
- map to lowercase
The mapping is still required to ensure that the index or query ends up with uniformly lowercase text.
To specify case sensitivity for the analysis, set com.basistech.rosette.bl.AnalyzerOption.caseSensitive to true or false. By default, the setting is true, except for Danish, Norwegian, and Swedish, for which our dictionaries are lowercase and the setting is false irrespective of the user setting.
When you are making this setting in the com.basistech.rosette.lucene package, include the caseSensitive option as a string. For example:
Map<String, String> options = new HashMap<>();
options.put("language", LanguageCode.ITALIAN.ISO639_30);
options.put("caseSensitive", "true");
TokenFilterFactory factory = new BaseLinguisticsTokenFilterFactory(options);
Activating User Dictionaries in Lucene
User dictionaries can be used when running RBL with Lucene.
In the com.basistech.rosette.lucene package, BaseLinguisticsTokenizerFactory and BaseLinguisticsTokenFilterFactory can load segmentation and analysis dictionaries, respectively.
The path options are provided as a list of paths, separated by semicolons or the OS-specific path separator.
Table 28. Lucene User Dictionary Path Options

| Option | Description | Type | Supported Languages |
| --- | --- | --- | --- |
| lemDictionaryPath | A list of paths to user lemma dictionaries. | List of Paths | Chinese, Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai |
| segDictionaryPath | A list of paths to user segmentation dictionaries. | List of Paths | All |
| userDefinedDictionaryPath | A list of paths to user dictionaries. | List of Paths | All |
| userDefinedReadingDictionaryPath | A list of paths to reading dictionaries. | List of Paths | Japanese |
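For example, a hedged sketch of supplying two lemma dictionaries (the paths are placeholders):
Map<String, String> options = new HashMap<>();
options.put("language", "deu");
options.put("rootDirectory", rootDirectory);
options.put("licensePath", licensePath);
// Two dictionaries, separated by a semicolon (or the OS path separator).
options.put("lemDictionaryPath", "/path/to/deu-lemmas-a.bin;/path/to/deu-lemmas-b.bin");
TokenFilterFactory factory = new BaseLinguisticsTokenFilterFactory(options);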
BaseLinguisticsTokenizerFactory provides the method addUserDefinedDictionary for adding a segmentation dictionary. For example:
Map<String, String> args = new HashMap<>();
args.put(TokenizerOption.language.name(), LanguageCode.JAPANESE.ISO639_3());
args.put(TokenizerOption.nfkcNormalize.name(), "true");
BaseLinguisticsTokenizerFactory factory = new BaseLinguisticsTokenizerFactory(args);
factory.addUserDefinedDictionary(LanguageCode.JAPANESE, "/path/to/my/jpn-dict.bin");
The constructor for BaseLinguisticsTokenFilterFactory takes a Map of options. Use the userDefinedDictionaryPath option to load an analysis dictionary:
Map<String, String> options = new HashMap<>();
options.put("language", LanguageCode.ITALIAN.ISO639_30);
options.put("userDefinedDictionaryPath", "/path/to/my/ita-dict.bin");
options.put("caseSensitive", "true");
TokenFilterFactory factory = new BaseLinguisticsTokenFilterFactory(options);
To convert tokens between Traditional Chinese (TC) and Simplified Chinese (SC) in a Lucene analysis chain (a sketch follows the list):
- Set up a com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory.
- Use the BaseLinguisticsTokenizerFactory to create a com.basistech.rosette.lucene.BaseLinguisticsTokenizer, which contains a Lucene Tokenizer.
- Set up a com.basistech.rosette.lucene.BaseLinguisticsCSCTokenFilterFactory.
- Use the BaseLinguisticsCSCTokenFilterFactory to create a com.basistech.rosette.lucene.BaseLinguisticsCSCTokenFilter to convert from TC to SC or vice versa.
- Use the BaseLinguisticsCSCTokenFilter to convert each Token found by the Tokenizer.
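A hedged sketch of this chain (the option names mirror the other factories in this section; CSC-specific options such as the conversion direction are omitted here, so consult the Javadoc):
Map<String, String> options = new HashMap<>();
options.put("language", "zho");
options.put("rootDirectory", rootDirectory);
options.put("licensePath", licensePath);
BaseLinguisticsTokenizerFactory tokenizerFactory =
        new BaseLinguisticsTokenizerFactory(options);
Tokenizer tokenizer = tokenizerFactory.create();
tokenizer.setReader(reader);
TokenFilterFactory cscFactory = new BaseLinguisticsCSCTokenFilterFactory(options);
TokenStream converted = cscFactory.create(tokenizer); // each token converted TC<->SC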
RBL/Lucene Distribution Sample. For supported versions of Lucene, the RBL distribution includes a sample (CSCCharTermAttributeSample) that you can compile and run with an ant build script.
In a Bash shell (Unix) or Command Prompt (Windows), navigate to the samples directory (rbl-je-<version>/samples) and the directory for your version of Lucene (csc-analyze-<luceneversion>) and call:
ant run
The sample reads an input file in SC and prints the TC conversion for each token to standard out.