RBL-JE provides an API for integrating with Apache Lucene 4.3–8.3. See the Javadoc that accompanies this package.
RBL-JE supports two usage patterns for incorporating Rosette Base Linguistics in a Lucene application.
Samples are included in the use cases below. The code comes from the Lucene 5.0 samples found in rbl-je-<rblversion>/samples/lucene-5_0. Samples for other versions of Lucene are located in the following directories:
Using the RBL-JE Lucene Base Linguistics Analyzer
The RBL-JE Lucene base linguistics analyzer (BaseLinguisticsAnalyzer) provides an analysis chain with a language-specific tokenizer and token filter.
You can configure the analyzer with a number of tokenizer and token filter options (see the Javadoc).
The analysis chain also includes the LowerCaseFilter and CJKWidthFilter, and it provides support for including a StopFilter.
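As a sketch of this pattern, an analyzer might be constructed as follows. The constructor signature and the option keys other than "language" are assumptions here; the Javadoc lists the options your RBL-JE version actually accepts.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import com.basistech.rosette.lucene.BaseLinguisticsAnalyzer;

// Hypothetical setup; "rootDirectory" and "partOfSpeech" are placeholder
// option keys -- consult the Javadoc for the real names.
Map<String, String> options = new HashMap<>();
options.put("language", "jpn");
options.put("rootDirectory", "/path/to/rbl-je");  // RBL-JE installation root
options.put("partOfSpeech", "true");              // emit part-of-speech tags

Analyzer analyzer = new BaseLinguisticsAnalyzer(options);
```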
Japanese Analyzer Sample
This Lucene sample, JapaneseAnalyzerSample.java, uses the base linguistics analyzer to generate an enriched token stream from Japanese text. The sample does the following:
Assembles a set of options that include the root directory, the path to the Rosette license, the language to analyze (Japanese), and the generation of part-of-speech tags for the tokens.
Instantiates a Japanese base linguistics analyzer with these options.
Reads in a Japanese text file.
Uses the analyzer to generate a token stream. The token stream contains tokens, lemmas that are not identical to their tokens, and readings. Disambiguation is turned off by default for Japanese, so multiple analyses may be returned for each token. To turn on disambiguation, add options.put("disambiguate", "true"); to the construction of the options Map.
Writes each element in the token stream with its type attribute to an output file.
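The token-stream loop at the core of the sample can be sketched with the standard Lucene attribute API. The analyzer, reader, and writer setup is elided; only the attribute retrieval and iteration are shown.

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// analyzer, reader, and writer are set up earlier in the sample
try (TokenStream stream = analyzer.tokenStream("text", reader)) {
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    TypeAttribute type = stream.addAttribute(TypeAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        // Tokens, lemmas, and readings share one stream; the type
        // attribute (<ALNUM>, <LEMMA>, <READING>) tells them apart.
        writer.write(term.toString() + " " + type.type() + "\n");
    }
    stream.end();
}
```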
To run the sample, in a Bash shell (Unix) or Command Prompt (Windows), navigate to rbl-je-<rblversion>/samples/lucene-5_0 and use the Ant build script:
The sample reads rbl-je-<rblversion>/samples/data/jpn-text.txt and writes the output to jpn-Analyzed-byAnalyzer.txt.
The output includes each token, lemma, and reading in the token stream on a separate line with the type attribute for each element: <ALNUM> for tokens, <LEMMA> for lemmas, and <READING> for readings. A token may have more than one analysis (hence the multiple lemmas and readings in the sample output below); lemmas are not put into the token stream when they are identical to the token.
For example:
メルボルン <ALNUM>
メルボルン <READING>
で <ALNUM>
行わ <ALNUM>
行う <LEMMA>
オコナワ <READING>
れ <ALNUM>
る <LEMMA>
れる <LEMMA>
レ <READING>
Creating Your Own RBL-JE Analysis Chain
When creating an analysis chain, you can do the following:
Use the BaseLinguisticsTokenizerFactory to generate a language-specific tokenizer that applies Rosette Base Linguistics to tokenize text.
Use the BaseLinguisticsTokenFilterFactory to generate a language-specific token filter that enhances a stream of tokens.
Add other token filters to the analysis chain.
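Put together, such a chain might look like the sketch below. It assumes the factory constructors follow the Map-of-options pattern shown in the case-sensitivity example later in this section, and uses the Lucene 5.x create() signatures; check the Javadoc for your version.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cjk.CJKWidthFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory;
import com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory;

Map<String, String> options = new HashMap<>();
options.put("language", "jpn");

// Language-specific tokenizer, wrapped by the base linguistics token filter.
Tokenizer tokenizer = new BaseLinguisticsTokenizerFactory(options).create();
TokenStream stream = new BaseLinguisticsTokenFilterFactory(options).create(tokenizer);

// Optional additional filters, mirroring the analyzer's built-in chain:
stream = new LowerCaseFilter(stream);
stream = new CJKWidthFilter(stream);
```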
Japanese Tokenizer and Filter Sample
This Lucene sample, JapaneseTokenizerAndFilterSample.java, creates an analysis chain to generate an enriched token stream. The sample does the following:
Uses a tokenizer factory to set up a language-specific base linguistics tokenizer, which puts tokens in the token stream.
Uses a base linguistics token filter factory to set up a language-specific base linguistics token filter, which adds lemmas and readings to the tokens in the token stream.
To replicate the behavior of the analyzer in the previous example, this sample also includes the LowerCaseFilter and CJKWidthFilter.
Writes each element in the token stream with its type attribute to an output file.
To run the sample, in a Bash shell (Unix) or Command Prompt (Windows), navigate to rbl-je-<rblversion>/samples/lucene-5_0 and use the Ant build script:
ant runTokenizerAndFilter
The example reads the same file as the previous sample and writes the output to jpn-analyzed-byTokenizerAndFilter.txt. The content matches the content generated by the previous example.
Using the BaseLinguisticsSegmentationTokenFilter
If you are using your own whitespace tokenizer and processing text that requires segmenting Chinese, Japanese, or Thai, you can use the BaseLinguisticsSegmentationTokenFilterFactory to create a BaseLinguisticsSegmentationTokenFilter, then place the segmentation token filter in an analysis chain after the whitespace tokenizer and before other filters, such as a base linguistics token filter.
The segmentation token filter segments each of the tokens from the whitespace tokenizer into individual tokens where necessary. See the Javadoc for the RBL-JE API for Lucene 4.3–4.8, Lucene 4.9, Lucene 4.10, Lucene 5.0–5.5, Lucene 6.0–6.1, or Lucene 6.2–8.2.
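For example, a segmentation chain might be assembled as follows. This is a sketch that assumes the same Map-of-options factory pattern as the other factories in this package, with Lucene 5.x constructors.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import com.basistech.rosette.lucene.BaseLinguisticsSegmentationTokenFilterFactory;
import com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory;

Map<String, String> options = new HashMap<>();
options.put("language", "jpn");

// Whitespace tokenizer first, then the segmentation filter,
// then any other filters such as the base linguistics token filter.
Tokenizer whitespace = new WhitespaceTokenizer();
TokenStream stream =
        new BaseLinguisticsSegmentationTokenFilterFactory(options).create(whitespace);
stream = new BaseLinguisticsTokenFilterFactory(options).create(stream);
```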
You can use the com.basistech.rosette.lucene.AnalysesAttribute object to gather linguistic data about the text in a document. Depending on the language, the data may include tokens, normalized tokens, lemmas, part-of-speech tags, readings, compound components, and Semitic roots.
The Lucene sample, AnalysesAttributeSample.java, illustrates this.
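A sketch of retrieving the attribute from a token stream follows. The accessors on AnalysesAttribute vary by RBL-JE version, so only the attribute retrieval is shown; see the Javadoc for what each analysis exposes.

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import com.basistech.rosette.lucene.AnalysesAttribute;

// analyzer and reader are set up as in the earlier samples
try (TokenStream stream = analyzer.tokenStream("text", reader)) {
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    AnalysesAttribute analyses = stream.addAttribute(AnalysesAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        // For each token, 'analyses' carries the linguistic data listed
        // above (lemmas, POS tags, readings, compound components, ...).
    }
    stream.end();
}
```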
To run the sample with the German sample file, navigate to rbl-je-<rblversion>/samples/lucene-5_0 and call ant as follows:
ant -Dtest.language=deu runAnalysesAttribute
The sample writes the output to deu-analysesAttributes.txt.
Case Sensitivity During Analysis
In some languages, case distinctions are meaningful. For example, in German, a word beginning with an upper-case letter may be a noun, while the same word without the capital is not. As a result, RBL-JE delivers higher accuracy in selecting lemmas and splitting compounds when it can process text with correct casing. On the other hand, users typing queries are often careless about capitalization.
For this reason, the default behavior of the Lucene integration is to perform the following analysis steps:
- tokenize
- determine lemmas
- map to lower case
The result is that the index contains the lowercase form of the most accurately selected lemma.
However, some applications work with text in which case distinctions are not reliably present, even in languages where they are important. These applications need to determine lemmas and compound components even though the spelling is nominally incorrect with respect to case.
To support these applications, RBL provides a 'case-insensitive' mode of operation. In this mode, RBL-JE performs the following analysis steps:
- tokenize, ignoring case distinctions (in abbreviations, for example)
- determine lemmas, ignoring case in choosing lemmas and compound components
- map to lower case
The mapping is still required to ensure that the index or query ends up with uniformly lower-case text.
To specify case sensitivity for the analysis, set com.basistech.rosette.bl.AnalyzerOption.caseSensitive to "true" or "false". By default, the setting is "true", except for Danish, Norwegian, and Swedish: the dictionaries for these languages are lower case, so the setting is always "false", regardless of the user setting.
When you are making this setting in the com.basistech.rosette.lucene package, include the "caseSensitive" option as a string. For example:
Map<String, String> options = new HashMap<>();
options.put("language", LanguageCode.ITALIAN.ISO639_3());
options.put("caseSensitive", "true");
TokenFilterFactory factory = new BaseLinguisticsTokenFilterFactory(options);