Rosette Base Linguistics (RBL) provides a set of linguistic tools to prepare your data for analysis. Language-specific modules provide base forms (lemmas) of words, parts-of-speech tagging, compound components, normalized tokens, stems, and roots. RBL also includes a Chinese Script Converter (CSC) which converts tokens in Traditional Chinese text to Simplified Chinese and vice versa.
You can use RBL in your own JVM application, use its Apache Lucene compatible API in a Lucene application, or integrate it directly with either Apache Solr or Elasticsearch.
-
JVM Applications
To integrate base linguistics functionality into your applications, RBL includes two sets of Java classes and interfaces:
-
ADM API: A collection of Java classes and interfaces that generate and represent Rosette's linguistic analysis as a set of annotations. This collection is called the Annotated Data Model (ADM) and is used in other Rosette tools, such as Rosette Language Identifier and Rosette Entity Extractor, as well as RBL. There are some advanced features which are only supported in the ADM API and not the classic API.
When using the ADM API, you create an annotator which includes both tokenizer and analyzer functions.
-
Classic API: A collection of Java classes and interfaces that generate and represent Rosette's linguistic analysis that is analogous to the ADM API, except it is not compatible with any other Rosette products. It also supports streaming: a user can start processing a document before the entire document is available and it can produce results for pieces of a document without storing the results for the entire document in memory at once.
When using the classic API, you create tokenizers and analyzers.
-
Lucene
In an Apache Lucene application, you use a Lucene analyzer which incorporates a Base Linguistics tokenizer and token filter to produce an enriched token stream for indexing documents and for queries.
-
Solr
With the Solr plugin, an Apache Solr search server uses RBL for both indexing documents and for queries.
-
Elasticsearch
Install the Elasticsearch plugin to use RBL for analysis, indexing, and queries.
Note
The Lucene, Solr, and Elasticsearch plugins use APIs based on the classic API. All options in the TokenizerOption and AnalyzerOption enums are available, along with some additional plugin-specific options.
RBL performs multiple types of analysis. Depending on the language, one or more of the following may be identified in the input text:
- Lemma
- Part of Speech
- Normalized Token
- Compound Components
- Readings
- Stem
- Semitic Root
For some languages, the analyzer can disambiguate between multiple analysis objects and return the disambiguated analysis object.
In the ADM API, use the BaseLinguisticsFactory to set the linguistic options and instantiate an Annotator, which annotates the input text. The ADM API creates a single annotator for all linguistic objects, including tokens.
In the classic API, use the BaseLinguisticsFactory to configure and create tokenizers, analyzers, and CSC analyzers. The classic API creates separate tokenizers and analyzers.
The following table indicates the type of support that RBL provides for each supported language. The RBL tokenizer provides normalization, tokenization, and sentence boundary detection. The RBL analyzer provides lemma lookup (including orthographic normalization for Japanese), lemma guessing (when the lookup fails), decompounding, and supports lemma, segmentation, and many-to-one normalization user dictionaries.
For unknown languages (language code xxx), RBL uses generic rules, such as whitespace and punctuation delimitation, to tokenize. It also identifies some common acronyms and abbreviations, as well as sentence boundaries. Segmentation user dictionaries are supported for unknown languages.
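As a rough illustration of this kind of generic, language-neutral tokenization, the plain-Java sketch below uses java.text.BreakIterator to split text at word boundaries (whitespace and punctuation). It is illustrative only, not RBL code, and the class and method names are hypothetical:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative generic tokenizer: splits on word boundaries, similar in
// spirit to the whitespace/punctuation fallback rules described above.
// Not RBL code; GenericTokenizerDemo is a hypothetical name.
final class GenericTokenizerDemo {
    static List<String> genericTokens(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String candidate = text.substring(start, end).trim();
            if (!candidate.isEmpty()) {   // drop whitespace-only segments
                tokens.add(candidate);
            }
        }
        return tokens;
    }
}
```

RBL's actual unknown-language tokenizer does more than this (acronyms, abbreviations, sentence boundaries, segmentation user dictionaries), but the boundary-splitting idea is the same.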
Some RBL features are implemented with deep learning (neural) models, which require the native TensorFlow library. A neural model is used for the following features:
- Part-of-speech (POS) tagging for Indonesian, Standard Malay, and Tagalog.
- Hebrew disambiguation, when disambiguatorType is set to DNN.
- Tokenization for spaceless Korean, when tokenizerType is set to SPACELESS_STATISTICAL.
BasisTech has created a collection of Java classes that generate and represent Rosette's linguistic analyses as a set of annotations. This collection of Java classes is called the Annotated Data Model (ADM) and may be used in RLI and REX as well as in RBL.
The ADM provides advanced functionality not available with the classic API.
When using the ADM API, you use BaseLinguisticsFactory to set the options and instantiate an Annotator. Depending on the analysis and the language, you may get information about sentences, layout regions, tokens, compounds, and readings.
The ADM API supports more options than the classic API. Whenever possible, we recommend using this API.
For complete API documentation of the ADM, consult the Javadoc for the package.
The standard procedure for using the ADM is as follows:
- Instantiate an Annotator.
- Use the Annotator to annotate the input text.
- Get the analytical data you want from the returned AnnotatedText object.
Use BaseLinguisticsFactory to set the BaseLinguisticsOptions and to instantiate an Annotator. Options may be set on the factory itself or passed in to a create method, such as createSingleLanguageAnnotator or createCSCAnnotator.
-
At a minimum, you should set options for rootDirectory and language.
Tip
If the license is not in the default directory (${rootDirectory}/licenses), you need to pass in the licensePath.
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
File rootPath = new File(rootDirectory);
-
Then use the BaseLinguisticsFactory to create the Annotator. This sample sets the language to English (eng).
EnumMap<BaseLinguisticsOption, String> options = new EnumMap<>(BaseLinguisticsOption.class);
options.put(BaseLinguisticsOption.language, "eng");
Annotator annotator = factory.createSingleLanguageAnnotator(options);
Use the Annotator to annotate the input text, which returns an AnnotatedText object.
The AnnotatedText
object provides an API for gathering data from the linguistic analysis that RBL performs during the annotation process. Depending on the analysis and the language, you may get information about sentences, layout regions, tokens, compounds, and readings.
getTokens() returns a list of tokens, each of which contains a list of morphological analyses.
AnnotatedText results = annotator.annotate(getInput(inputFilePathname));
int index = 0;
for (Token token : results.getTokens()) {
    outputData.format("token %d:\t%s%n", index, token.getText());
    int aindex = 0;
    List<MorphoAnalysis> analyses = token.getAnalyses();
    if (null != analyses) {
        outputData.format("\tindex\tlemma\tpart-of-speech%n");
        for (MorphoAnalysis ma : analyses) {
            outputData.format("\t%d %s\t%s%n", aindex, ma.getLemma(),
                    ma.getPartOfSpeech());
            aindex++;
        }
    }
    index++;
}
When using the classic API, you instantiate separate factories for tokenizers and analyzers.
BaseLinguisticsFactory#createTokenizer produces a language-specific tokenizer that processes documents, producing a sequence of tokens.
BaseLinguisticsFactory#createAnalyzer produces a language-specific analyzer that uses dictionaries and statistical analysis to add analysis objects to tokens.
If your application requires streaming, use this API. The Lucene, Solr, and Elasticsearch integrations use these methods.
For the complete API documentation, consult the Javadoc for BaseLinguisticsFactory.
Use BaseLinguisticsFactory#createTokenizer to create a language-specific tokenizer that extracts tokens from a plain text source. Prior to using the factory to create a tokenizer, use the factory with BaseLinguisticsOption to define the root of your RBL installation, as illustrated in the following sample. See the Javadoc for other options you may set.
Tip
If the license is not in the default directory (${rootDirectory}/licenses), you need to pass in the licensePath.
The Tokenizer uses a word breaker to establish token boundaries and detect sentences. For each token, it also provides the offset, the length of the token, and a tag. Some tokenizers calculate morphological analysis information as part of the tokenization process, filling in the appropriate analysis entries in the token objects they return. For other languages, you use the analyzer described below to return analysis objects for each token.
Create a factory
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
File rootPath = new File(rootDirectory);
Set tokenization options
factory.setOption(BaseLinguisticsOption.nfkcNormalize, "true");
Create the tokenizer
Tokenizer tokenizer = factory.createTokenizer();
Use BaseLinguisticsFactory#createAnalyzer to create a language-specific analyzer. Prior to creating the analyzer, use the factory and BaseLinguisticsOption to define the RBL root, as illustrated in the sample below. See the Javadoc for other options you may set.
Tip
If the license is not in the default directory (${rootDirectory}/licenses), you need to pass in the licensePath.
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
File rootPath = new File(rootDirectory);
factory.setOption(BaseLinguisticsOption.licensePath,
new File(rootPath, "licenses/rlp-license.xml").getAbsolutePath());
Analyzer analyzer = factory.createAnalyzer();
Use the Analyzer to return an array of Analysis objects for each token.
Annotator Management for Multithreaded Applications
For a multithreaded application, we recommend constructing a pool of annotators. Annotators must be used on a per-thread basis, and pooling avoids the overhead of creating an annotator each time one is required. The settings of an annotator cannot be changed after it is built, so if multiple configurations are required, they should be stored separately.
RBL integrates easily with multithreaded architectures but, to avoid performance penalties, it does not make liberal use of locks. Instead, most objects and interfaces in RBL are either read-only, reentrant objects or read-write, per-thread objects. BaseLinguisticsFactory objects are hybrids: they have both thread-safe and per-thread methods. The create methods of the factory are thread-safe because they do not alter any data within the factory, but the setOption, user dictionary, and dynamic user dictionary methods are not thread-safe because they do alter data within the factory. The tokenizers and analyzers created by the factory are always meant to be used on a per-thread basis because they are not reentrant and do alter data within the objects. You can use a factory across multiple threads to create objects as long as calls to the factory methods for setting options or adding user dictionaries are synchronized appropriately. The objects created by the factory must each be created and used by only one thread, which need not be the thread that initialized the factory.
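The per-thread requirement above is commonly met with a blocking pool built on Java's standard concurrency utilities. The sketch below is a minimal illustration, not RBL code: ResourcePool and its methods are hypothetical names, and in a real RBL application the pooled type would be an Annotator (or Tokenizer/Analyzer) created by a fully configured BaseLinguisticsFactory.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical generic pool: each thread borrows an instance, uses it,
// and returns it, so no instance is ever shared between threads.
final class ResourcePool<T> {
    private final BlockingQueue<T> pool;

    ResourcePool(int size, Supplier<T> creator) {
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(creator.get());   // creation cost paid once, up front
        }
    }

    // Borrow an instance, apply the work to it, and always return it.
    <R> R use(Function<T, R> work) {
        T resource;
        try {
            resource = pool.take();    // blocks while all instances are busy
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted waiting for pool", e);
        }
        try {
            return work.apply(resource);
        } finally {
            pool.add(resource);        // cannot be full: we just took one out
        }
    }
}
```

With RBL, the Supplier passed to the pool would call a create method such as factory.createSingleLanguageAnnotator(options), so that each pooled annotator is built once with a fixed configuration; keeping a separate pool per configuration matches the guidance above.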