Rosette Base Linguistics (RBL) provides a set of linguistic tools to prepare your data for analysis. Language-specific modules provide base forms (lemmas) of words, part-of-speech tags, compound components, normalized tokens, stems, and roots. RBL also includes a Chinese Script Converter (CSC), which converts tokens in Traditional Chinese text to Simplified Chinese and vice versa.
You can use RBL in your own JVM application, use its Apache Lucene compatible API in a Lucene application, or integrate it directly with either Apache Solr or Elasticsearch.
- JVM Applications
To integrate base linguistics functionality in your applications, RBL includes two sets of Java classes and interfaces:
- ADM API: A collection of Java classes and interfaces that generate and represent Rosette's linguistic analysis as a set of annotations. This collection is called the Annotated Data Model (ADM) and is used in other Rosette tools, such as Rosette Language Identifier and Rosette Entity Extractor, as well as RBL. Some advanced features are supported only in the ADM API, not in the classic API.
When using the ADM API, you create an annotator which includes both tokenizer and analyzer functions.
- Classic API: A collection of Java classes and interfaces that generate and represent Rosette's linguistic analysis. It is analogous to the ADM API, except that it is not compatible with any other Rosette products. It also supports streaming: a user can start processing a document before the entire document is available, and it can produce results for pieces of a document without storing the results for the entire document in memory at once.
When using the classic API, you create tokenizers and analyzers.
- Lucene
In an Apache Lucene application, you use a Lucene analyzer which incorporates a Base Linguistics tokenizer and token filter to produce an enriched token stream for indexing documents and for queries.
- Solr
With the Solr plugin, an Apache Solr search server uses RBL both for indexing documents and for queries.
- Elasticsearch
Install the Elasticsearch plugin to use RBL for analysis, indexing, and queries.
Note
The Lucene, Solr, and Elasticsearch plugins use APIs based on the classic API. All options in the TokenizerOption and AnalyzerOption enum classes are available, along with some additional plugin-specific options.
Most of the linguistic work is performed by tokenizers and analyzers. Tokenizers identify the words in text. Analyzers determine their linguistic attributes such as lemmas and parts of speech.
Depending on the language, each analysis object may contain one or more of the following:
- Lemma
- Part of Speech
- Normalized Token
- Compound Components
- Readings
- Stem
- Semitic Root
For some languages, the analyzer can disambiguate between multiple candidate analysis objects and return the single best one.
In the ADM API, the BaseLinguisticsFactory sets the linguistic options and instantiates an Annotator, which annotates the input text with the linguistic objects.
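As a rough illustration of this flow, the sketch below configures a factory and annotates a string. The option, method, and accessor names here are assumptions modeled on the pattern described above, not verified signatures; consult the Javadoc for the actual API.

```java
// Hypothetical ADM API sketch -- names are assumptions, see the Javadoc.
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
factory.setOption(BaseLinguisticsOption.language, "eng");

// The annotator bundles tokenizer and analyzer functions.
Annotator annotator = factory.createAnnotator();
AnnotatedText results = annotator.annotate("The quick brown fox jumped.");

// Each token carries the annotations (lemma, part of speech, etc.).
for (Token token : results.getTokens()) {
    System.out.println(token.getText());
}
```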
In the classic API, the TokenizerFactory and AnalyzerFactory configure and create these objects. When using the classic API, you instantiate separate factories for tokenizers and analyzers. The TokenizerFactory produces a language-specific tokenizer that processes documents, producing a sequence of tokens. The AnalyzerFactory produces a language-specific analyzer that uses dictionaries and statistical analysis to add analysis objects to tokens. If your application requires streaming, use this API. The Lucene, Solr, and Elasticsearch integrations use these factories.
For the complete API documentation, consult the Javadoc for these classes.
Use the TokenizerFactory to create a language-specific tokenizer that extracts tokens from a plain text source. Before using the factory to create a tokenizer, set TokenizerOption values on the factory to define the root of your RBL installation, as illustrated in the following sample. See the Javadoc for other options you may set.
Tip
If the license is not in the default directory (${rootDirectory}/licenses), you need to pass in the licensePath.
The Tokenizer uses a word breaker to establish token boundaries and detect sentences. For each token, it also provides the offset, the length, and a tag. Some tokenizers calculate morphological analysis information as part of the tokenization process, filling in the appropriate analysis entries in the token objects they return. For other languages, you use the analyzer described below to return analysis objects for each token.
Create a tokenizer factory
TokenizerFactory factory = new TokenizerFactory();
factory.setOption(TokenizerOption.rootDirectory, rootDirectory);
File rootPath = new File(rootDirectory);
factory.setOption(TokenizerOption.licensePath,
new File(rootPath, "licenses/rlp-license.xml").getAbsolutePath());
Set tokenization options
factory.setOption(TokenizerOption.nfkcNormalize, "true");
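Once the factory is configured, you create a tokenizer from it and iterate over the tokens it produces. The create(...) signature and Token accessors below are assumptions for illustration only; check the Javadoc for the actual method names.

```java
// Hypothetical continuation of the factory setup above.
Tokenizer tokenizer = factory.create(new StringReader(text), LanguageCode.ENGLISH);
Token token;
while ((token = tokenizer.next()) != null) {
    // Each token carries its surface form, offset, length, and tag.
    System.out.println(token.getSurfaceForm());
}
```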
Use the AnalyzerFactory to create a language-specific analyzer. Before creating the analyzer, set AnalyzerOption values on the factory to define the RBL root, as illustrated in the sample below. See the Javadoc for other options you may set.
Tip
If the license is not in the default directory (${rootDirectory}/licenses), you need to pass in the licensePath.
AnalyzerFactory factory = new AnalyzerFactory();
factory.setOption(AnalyzerOption.rootDirectory, rootDirectory);
File rootPath = new File(rootDirectory);
factory.setOption(AnalyzerOption.licensePath,
new File(rootPath, "licenses/rlp-license.xml").getAbsolutePath());
Use the Analyzer to return an array of Analysis objects for each token.
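For example, a sketch of analyzing the tokens produced by a tokenizer might look like the following. The create(...) and analyze(...) signatures and the Analysis accessors are assumptions for illustration; see the Javadoc for the actual API.

```java
// Hypothetical usage of the configured AnalyzerFactory.
Analyzer analyzer = factory.create(LanguageCode.ENGLISH);

// Each token may yield one or more Analysis objects.
for (Analysis analysis : analyzer.analyze(token)) {
    System.out.println(analysis.getLemma());
}
```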