Rosette Base Linguistics (RBL) provides a set of linguistic tools to prepare your data for analysis. Language-specific modules provide base forms (lemmas) of words, part-of-speech tags, compound components, normalized tokens, stems, and roots. RBL also includes a Chinese Script Converter (CSC), which converts tokens in Traditional Chinese text to Simplified Chinese and vice versa.
You can use RBL in your own JVM application, use its Apache Lucene-compatible API in a Lucene application, or integrate it directly with either Apache Solr or Elasticsearch.
- JVM Applications
  To integrate base linguistics functionality into your applications, RBL includes two sets of Java classes and interfaces:
  - ADM API: A collection of Java classes and interfaces that generate and represent Rosette's linguistic analysis as a set of annotations. This collection is called the Annotated Data Model (ADM) and is used in other Rosette tools, such as Rosette Language Identifier and Rosette Entity Extractor, as well as in RBL. Some advanced features are supported only in the ADM API, not in the classic API.
    When using the ADM API, you create an annotator, which includes both tokenizer and analyzer functions.
  - Classic API: A collection of Java classes and interfaces, analogous to the ADM API, that generates and represents Rosette's linguistic analysis but is not compatible with other Rosette products. It also supports streaming: you can start processing a document before the entire document is available, and produce results for pieces of a document without storing the results for the entire document in memory at once.
    When using the classic API, you create tokenizers and analyzers.
- Lucene
  In an Apache Lucene application, you use a Lucene analyzer that incorporates a Base Linguistics tokenizer and token filter to produce an enriched token stream for indexing documents and for queries.
- Solr
  With the Solr plugin, an Apache Solr search server uses RBL for both indexing documents and queries.
- Elasticsearch
  Install the Elasticsearch plugin to use RBL for analysis, indexing, and queries.
Note
The Lucene, Solr, and Elasticsearch plugins use APIs based on the classic API. All options in the Enum classes TokenizerOption and AnalyzerOption are available, along with some additional plugin-specific options.
Most of the linguistic work is performed by tokenizers and analyzers. Tokenizers identify the words in text. Analyzers determine their linguistic attributes such as lemmas and parts of speech.
Depending on the language, each analysis object may contain one or more of the following:
- Lemma
- Part of Speech
- Normalized Token
- Compound Components
- Readings
- Stem
- Semitic Root
For some languages, the analyzer can disambiguate between multiple analysis objects and return the disambiguated analysis object.
In the ADM API, the BaseLinguisticsFactory sets the linguistic options and instantiates an Annotator, which annotates the input text with the linguistic objects. In the classic API, the TokenizerFactory and AnalyzerFactory configure and create these objects.
Basis Technology has created a collection of Java classes that generate and represent Rosette's linguistic analyses as a set of annotations. This collection of Java classes is called the Annotated Data Model (ADM) and may be used in Rosette Language Identifier (RLI) and Rosette Entity Extractor (REX) as well as in RBL.
The ADM provides advanced functionality not available with the classic API.
When using the ADM API, you use BaseLinguisticsFactory to set the options and instantiate an Annotator. Depending on the analysis and the language, you may get information about sentences, layout regions, tokens, compounds, and readings.
The ADM API supports more options than the classic API. Whenever possible, we recommend using this API.
For complete API documentation of the ADM, consult the Javadoc for the package.
The standard procedure for using the ADM is as follows:
Use BaseLinguisticsFactory to set the BaseLinguisticsOptions and to instantiate an Annotator. Options may be set on the factory itself or passed in to the create method.
At a minimum, you should set the rootDirectory and language options.
Tip
If the license is not in the default directory (${rootDirectory}/licenses), you need to pass in the licensePath.
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
File rootPath = new File(rootDirectory);
factory.setOption(BaseLinguisticsOption.licensePath,
new File(rootPath, "licenses/rlp-license.xml").getAbsolutePath());
Then use the BaseLinguisticsFactory to create the Annotator. This sample sets the language to English (eng).
Annotator annotator;
EnumMap<BaseLinguisticsOption, String> options = new EnumMap<>(BaseLinguisticsOption.class);
options.put(BaseLinguisticsOption.language, "eng");
annotator = factory.createSingleLanguageAnnotator(options);
Use the Annotator to annotate the input text, which returns an AnnotatedText object.
The AnnotatedText object provides an API for gathering data from the linguistic analysis that RBL performs during the annotation process. Depending on the analysis and the language, you may get information about sentences, layout regions, tokens, compounds, and readings. getTokens() returns a list of tokens, each of which contains a list of morphological analyses.
AnnotatedText results = annotator.annotate(getInput(inputFilePathname));
int index = 0;
for (Token token : results.getTokens()) {
    outputData.format("token %d:\t%s%n", index, token.getText());
    int aindex = 0;
    List<MorphoAnalysis> analyses = token.getAnalyses();
    if (null != analyses) {
        outputData.format("\tindex\tlemma\tpart-of-speech%n");
        for (MorphoAnalysis ma : analyses) {
            outputData.format("\t%d %s\t%s%n", aindex, ma.getLemma(),
                    ma.getPartOfSpeech());
            aindex++;
        }
    }
    index++;
}
These options are only valid when using the ADM API.
Table 1. Annotator Object Options

| Option | Description | Type (Default) | Supported Languages |
| --- | --- | --- | --- |
| analyze | Enables analysis. If false, the annotator performs only tokenization. | Boolean (true) | All |
| customPosTagsUri | URI of a POS tag map file for use by the universalPosTags option. | String (none) | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish |
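Options such as these are supplied as string values in the options map passed when creating the annotator, just as language was in the earlier English example. The sketch below illustrates the pattern only; AnnotatorOpt is a stand-in enum defined here so the snippet is self-contained, whereas in RBL you would use BaseLinguisticsOption and pass the map to the factory.

```java
import java.util.EnumMap;
import java.util.Map;

// Stand-in for BaseLinguisticsOption so this sketch compiles on its own.
// RBL option values are always supplied as strings, even for booleans.
enum AnnotatorOpt { language, analyze, customPosTagsUri }

final class AnnotatorOptions {
    // Build an options map that requests tokenization only (analyze=false).
    static Map<AnnotatorOpt, String> tokenizeOnly(String language) {
        EnumMap<AnnotatorOpt, String> options = new EnumMap<>(AnnotatorOpt.class);
        options.put(AnnotatorOpt.language, language);
        options.put(AnnotatorOpt.analyze, "false"); // skip analysis, keep tokens
        // Real RBL call: factory.createSingleLanguageAnnotator(options)
        return options;
    }
}
```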
When using the classic API, you instantiate separate factories for tokenizers and analyzers.
The TokenizerFactory produces a language-specific tokenizer that processes documents, producing a sequence of tokens. The AnalyzerFactory produces a language-specific analyzer that uses dictionaries and statistical analysis to add analysis objects to tokens.
If your application requires streaming, use this API. The Lucene, Solr, and Elasticsearch integrations use these factories.
For the complete API documentation, consult the Javadoc for these classes.
Use the TokenizerFactory to create a language-specific tokenizer that extracts tokens from a plain text source. Prior to using the factory to create a tokenizer, you use the factory with TokenizerOption to define the root of your RBL installation, as illustrated in the following sample. See the Javadoc for other options you may set.
Tip
If the license is not in the default directory (${rootDirectory}/licenses), you need to pass in the licensePath.
The Tokenizer uses a word breaker to establish token boundaries and detect sentences. For each token, it also provides offset information, the length of the token, and a tag. Some tokenizers calculate morphological analysis information as part of the tokenization process, filling in the appropriate analysis entries in the token objects they return. For other languages, you use the analyzer described below to return analysis objects for each token.
Create a tokenizer factory
TokenizerFactory factory = new TokenizerFactory();
factory.setOption(TokenizerOption.rootDirectory, rootDirectory);
File rootPath = new File(rootDirectory);
factory.setOption(TokenizerOption.licensePath,
new File(rootPath, "licenses/rlp-license.xml").getAbsolutePath());
Set tokenization options
factory.setOption(TokenizerOption.includeHebrewRoots, "true");
factory.setOption(TokenizerOption.nfkcNormalize, "true");
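The samples above only configure the factory; the tokenizer it creates then emits tokens carrying the text, offset, and length described earlier. The following is a toy illustration of that token contract, not RBL code: SimpleToken and the whitespace word breaker are stand-ins defined here so the snippet runs on its own, while RBL's real tokenizers are language-specific objects obtained from TokenizerFactory (see the Javadoc for exact signatures).

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for RBL's Token: each token carries its text, its character
// offset in the source, and its length.
record SimpleToken(String text, int offset, int length) {}

final class WhitespaceTokenizer {
    // Toy word breaker illustrating the token/offset/length contract.
    // RBL's tokenizers additionally detect sentences and attach tags.
    static List<SimpleToken> tokenize(String input) {
        List<SimpleToken> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            // Skip whitespace between tokens.
            while (i < input.length() && Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one run of non-whitespace characters.
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new SimpleToken(input.substring(start, i), start, i - start));
            }
        }
        return tokens;
    }
}
```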
Use the AnalyzerFactory to create a language-specific analyzer. Prior to creating the analyzer, use the factory and AnalyzerOption to define the RBL root, as illustrated in the sample below. See the Javadoc for other options you may set.
Tip
If the license is not in the default directory (${rootDirectory}/licenses), you need to pass in the licensePath.
AnalyzerFactory factory = new AnalyzerFactory();
factory.setOption(AnalyzerOption.rootDirectory, rootDirectory);
File rootPath = new File(rootDirectory);
factory.setOption(AnalyzerOption.licensePath,
new File(rootPath, "licenses/rlp-license.xml").getAbsolutePath());
Use the Analyzer to return an array of Analysis objects for each token.
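Conceptually, the analyzer maps each token to zero or more analyses drawn from its dictionaries, each carrying attributes such as the lemma and part of speech. The sketch below is a self-contained toy version of that idea; SimpleAnalysis and DictionaryAnalyzer are illustrative stand-ins, not RBL classes, and a real analyzer also applies statistical analysis rather than a plain lookup.

```java
import java.util.List;
import java.util.Map;

// Stand-in for an RBL Analysis object: lemma plus part of speech.
record SimpleAnalysis(String lemma, String partOfSpeech) {}

final class DictionaryAnalyzer {
    // Tiny dictionary standing in for RBL's language-specific dictionaries.
    private static final Map<String, SimpleAnalysis> DICT = Map.of(
            "ran", new SimpleAnalysis("run", "VERB"),
            "cats", new SimpleAnalysis("cat", "NOUN"));

    // Return the analyses for one token; unknown tokens get an empty list.
    static List<SimpleAnalysis> analyze(String token) {
        SimpleAnalysis a = DICT.get(token);
        return a == null ? List.of() : List.of(a);
    }
}
```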
Annotator Management for Multithreaded Applications
For a multithreaded application, we recommend constructing a pool of annotators. Annotators must be used on a per-thread basis, and pooling avoids the overhead of creating an annotator each time one is required. The settings of an annotator cannot be changed after it is built, so if multiple configurations are required, the annotators for each configuration should be stored separately.
RBL integrates easily with multithreaded architectures, but, to avoid performance penalties, it does not make liberal use of locks. Instead, most objects and interfaces in RBL are either read-only, reentrant objects or read-write, per-thread objects. TokenizerFactory and AnalyzerFactory objects are hybrids: they have both thread-safe and per-thread methods. The create methods of these factories are thread-safe because they do not alter any data within the factories, but the setOption, addDynamicUserDictionary, and addUserDefinedDictionary methods are not thread-safe because they do alter data within the factory. The tokenizers and analyzers created by these factories are meant to be used on a per-thread basis because they are not reentrant and do alter data within the objects. You can use a factory across multiple threads to create objects as long as calls to the factory methods for setting options or adding user dictionaries are synchronized appropriately. The objects created by the factory must each be created and used by only one thread, which need not be the thread that initialized the factory.
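A simple way to satisfy the per-thread rule is a blocking pool: each worker thread borrows an annotator, uses it exclusively, and returns it. The sketch below shows one such pool using java.util.concurrent; the Annotator interface here is a stand-in defined so the snippet is self-contained, and in a real application the pool would be filled with the objects built by BaseLinguisticsFactory.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Stand-in for an RBL annotator; the pooling logic is what matters here.
interface Annotator {
    String annotate(String text);
}

final class AnnotatorPool {
    private final BlockingQueue<Annotator> pool;

    AnnotatorPool(int size) {
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            // Real application: pool.add(factory.createSingleLanguageAnnotator(options));
            pool.add(text -> "annotated:" + text);
        }
    }

    // Borrow an annotator, use it on the calling thread only, then return it.
    // Blocks if all annotators are currently in use by other threads.
    String annotate(String text) {
        try {
            Annotator a = pool.take();
            try {
                return a.annotate(text);
            } finally {
                pool.put(a); // always return the annotator to the pool
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }
}
```

Because the pool hands each borrowed annotator to exactly one thread at a time, no locking inside the annotator itself is needed, which matches RBL's lock-free design.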