This chapter introduces two RBL-JE factories, the objects they create, and the RBL-JE command line driver. For the API, consult the Javadoc in this package.
Use TokenizerFactory to create a language-specific Tokenizer that extracts tokens from a plain text source. Before creating a tokenizer, use the factory with TokenizerOption to define the root of your RBL-JE installation and the path to your RLP license, as illustrated in the following sample. See the Javadoc for other options you may set.
The Tokenizer uses a Sentence Breaker and Word Breaker (for European languages) or a Segmentation Tokenizer and Script Region Breaker (for Chinese, Japanese, and Thai) to establish token boundaries. For each token, it also provides offset information, the length of the token, and a tag. Some tokenizers perform morphological analysis as part of tokenization, filling in the appropriate Analysis entries in the Token object that they return. For other languages, use the Analyzer described below to return Analysis objects for each token.
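To make the offset and length that accompany each token concrete, here is a minimal sketch that uses the JDK's java.text.BreakIterator as a stand-in for a word breaker. It is not RBL-JE code: the BreakSketch and Span names are hypothetical, and RBL-JE's own breakers may place boundaries differently.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in illustration only: uses the JDK BreakIterator, not RBL-JE.
final class BreakSketch {

    /** Hypothetical holder for a token's text, start offset, and length. */
    static final class Span {
        final String text;
        final int offset;
        final int length;

        Span(String text, int offset, int length) {
            this.text = text;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Splits text at word boundaries and drops whitespace-only pieces. */
    static List<Span> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<Span> spans = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (!piece.trim().isEmpty()) {
                spans.add(new Span(piece, start, end - start));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // Prints one line per token: text, offset, length (tab-separated).
        for (Span s : words("Hello world.", Locale.ENGLISH)) {
            System.out.println(s.text + "\t" + s.offset + "\t" + s.length);
        }
    }
}
```

Each Span records where the token starts in the original string and how many characters it covers, which is the kind of positional information the Tokenizer reports alongside each token.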
Here is a sample:
Use the AnalyzerFactory to create a language-specific Analyzer. Before creating the analyzer, use the factory and AnalyzerOption to define the RBL-JE root and your license file, as illustrated in the sample below. See the Javadoc for other options you may set.
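// rootDirectory holds the path to the root of the RBL-JE installation.
AnalyzerFactory factory = new AnalyzerFactory();
factory.setOption(AnalyzerOption.rootDirectory, rootDirectory);
// The license file is in the licenses subdirectory of the installation root.
File rootPath = new File(rootDirectory);
factory.setOption(AnalyzerOption.licensePath,
        new File(rootPath, "licenses/rlp-license.xml").getAbsolutePath());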
Use the Analyzer to return an array of Analysis objects for each token. Depending on the language, each Analysis object may contain one or more of the following:
- Lemma
- Part of Speech
- Normalized Token
- Compound Components
- Readings
- Stem
- Semitic Root
For some languages, the Analyzer can disambiguate between multiple Analysis objects and return the disambiguated Analysis object.
Here is an example:
The samples are in rbl-je-<rblversion>/samples/tokenize-analyze. In a Bash shell (Unix) or Command Prompt (Windows), navigate to this directory and use the Ant build script to compile and run both of these samples. Your license (rlp-license.xml) must be in the licenses subdirectory of the RBL-JE installation.
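Because the samples depend on the license being in the licenses subdirectory, a small pre-flight check can report a missing file early. The following is a minimal sketch using only java.io.File; the LicenseCheck class and its methods are hypothetical helpers, not part of RBL-JE.

```java
import java.io.File;

// Hypothetical pre-flight helper; not part of RBL-JE.
final class LicenseCheck {

    /** Resolves licenses/rlp-license.xml under the given installation root. */
    static File licenseFile(File root) {
        return new File(new File(root, "licenses"), "rlp-license.xml");
    }

    /** Returns the license file, or throws if it is not where the samples expect it. */
    static File requireLicense(File root) {
        File license = licenseFile(root);
        if (!license.isFile()) {
            throw new IllegalStateException("License not found: " + license.getAbsolutePath());
        }
        return license;
    }

    public static void main(String[] args) {
        // Assumption: the installation root is passed as the first argument.
        File root = new File(args.length > 0 ? args[0] : ".");
        System.out.println("Using license " + requireLicense(root).getAbsolutePath());
    }
}
```

The absolute path returned by requireLicense is the kind of value that the licensePath option expects in the factory setup shown earlier.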
To compile and run both samples, call:
ant run
Tokenize tokenizes the sample German document, and Analyze provides a disambiguated analysis of each token.
The output appears in two files: deu-tokenized.txt and deu-analyzed.txt. Each line of the first file contains a token, a tab, and a tag such as <ALNUM>, with a blank line following the end of each sentence.
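If you want to post-process the tokenized output, the layout described above (token, tab, tag, with blank lines between sentences) is simple to read back. Here is a minimal sketch; the TokenizedFile and TaggedToken names are hypothetical, and the UTF-8 encoding of the file is an assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for the token<TAB>tag layout; not part of RBL-JE.
final class TokenizedFile {

    static final class TaggedToken {
        final String text;
        final String tag;

        TaggedToken(String text, String tag) {
            this.text = text;
            this.tag = tag;
        }
    }

    /** Groups token lines into sentences; a blank line closes the current sentence. */
    static List<List<TaggedToken>> parse(List<String> lines) {
        List<List<TaggedToken>> sentences = new ArrayList<>();
        List<TaggedToken> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            int tab = line.indexOf('\t');
            String text = tab < 0 ? line : line.substring(0, tab);
            String tag = tab < 0 ? "" : line.substring(tab + 1).trim();
            current.add(new TaggedToken(text, tag));
        }
        if (!current.isEmpty()) {
            sentences.add(current); // file did not end with a blank line
        }
        return sentences;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the output file is UTF-8 and in the current directory.
        String path = args.length > 0 ? args[0] : "deu-tokenized.txt";
        List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
        System.out.println(parse(lines).size() + " sentences");
    }
}
```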
Each line of the second file contains the token, lemma, part of speech, and compound components (where relevant). For languages that do not support disambiguation, a token may have multiple rows, one for each analysis, with the token repeated in the first column. Here is a fragment with a sentence from deu-analyzed.txt:
TOKEN          LEMMA       POS     COMPOUNDS
-----          -----       ---     ---------
3.11.06        3.11.06     CARD
-              -           PUNCT
Not            not         ADV
und            und         COORD
Elend          Elend       NOUN
in             in          PREP
ihren          ihr         POSDET
Heimatländern  Heimatland  NOUN    [Heimat, Land]
lassen         lassen      VVFIN
immer          immer       ADV
mehr           mehr        INDADJ
Afrikaner      Afrikaner   NOUN
die            die         ART
Reise          Reise       NOUN
nach           nach        PREP
Europa         Europa      NOUN
antreten       antreten    VVINF
.              .           SENT
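A row of the analyzed output can be split into its four columns as sketched below. The listing above does not show the column delimiter, so this sketch assumes the columns in the actual file are tab-separated and that compound components appear as a bracketed, comma-separated list; the AnalyzedRow class is a hypothetical helper, not part of RBL-JE.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical row parser; assumes tab-separated columns.
final class AnalyzedRow {
    final String token;
    final String lemma;
    final String pos;
    final List<String> compounds;

    AnalyzedRow(String token, String lemma, String pos, List<String> compounds) {
        this.token = token;
        this.lemma = lemma;
        this.pos = pos;
        this.compounds = compounds;
    }

    /** Parses one data row: token, lemma, part of speech, optional compounds. */
    static AnalyzedRow parse(String line) {
        String[] cols = line.split("\t", -1);
        String token = cols[0];
        String lemma = cols.length > 1 ? cols[1] : "";
        String pos = cols.length > 2 ? cols[2].trim() : "";
        List<String> compounds = new ArrayList<>();
        if (cols.length > 3) {
            String raw = cols[3].trim();
            if (raw.startsWith("[") && raw.endsWith("]")) {
                raw = raw.substring(1, raw.length() - 1); // strip the brackets
            }
            for (String part : raw.split(",")) {
                if (!part.trim().isEmpty()) {
                    compounds.add(part.trim());
                }
            }
        }
        return new AnalyzedRow(token, lemma, pos, compounds);
    }

    public static void main(String[] args) {
        // Sample row taken from the fragment above (\u00e4 is a-umlaut).
        AnalyzedRow row = parse("Heimatl\u00e4ndern\tHeimatland\tNOUN\t[Heimat, Land]");
        System.out.println(row.lemma + " " + row.pos + " " + row.compounds);
    }
}
```

For a language without disambiguation, consecutive rows that repeat the same token would each yield one AnalyzedRow, one per analysis.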
To run the samples with sample text in a different language, set the test.language parameter to the ISO 639-3 code for the language. For example, to tokenize and analyze the Spanish sample, call:
ant -Dtest.language=spa run
RBL-JE Command Line Utility
RBLCmd is a general-purpose command line utility for RBL-JE. It provides a simple way to produce RBL-JE output without writing code. It is also useful for ad hoc speed and thread testing.
A Bash shell script (RBLCmd) and a Windows script (RBLCmd.bat) for running this utility are in rbl-je-<rblversion>/tools/bin. For more information, see RBLCmd's online help (RBLCmd -h). Examples of its use are shown below:
$ echo Hello world. | ./tools/bin/RBLCmd --language eng --rootDirectory .
Token{text=Hello}
MorphoAnalysis{extendedProperties={}, partOfSpeech=ITJ, lemma=hello, raw=hello[+ITJ]}
MorphoAnalysis{extendedProperties={}, partOfSpeech=VI, lemma=hello, raw=hello[+VI]}
MorphoAnalysis{extendedProperties={}, partOfSpeech=VPRES, lemma=hello, raw=hello[+VPRES]}
MorphoAnalysis{extendedProperties={}, partOfSpeech=NOUN, lemma=hello, raw=hello[+NOUN]}
MorphoAnalysis{extendedProperties={}, partOfSpeech=PROP, lemma=Hello}
Token{text=world}
MorphoAnalysis{extendedProperties={}, partOfSpeech=NOUN, lemma=world, raw=world[+NOUN]}
Token{text=.}
MorphoAnalysis{extendedProperties={}, partOfSpeech=SENT, lemma=., raw=.[+SENT]}
$
$
$ echo 'Hello world! :)' | ./tools/bin/RBLCmd --language eng --rootDirectory . --emoticons
Token{text=Hello}
MorphoAnalysis{extendedProperties={}, partOfSpeech=ITJ, lemma=hello, raw=hello[+ITJ]}
MorphoAnalysis{extendedProperties={}, partOfSpeech=VI, lemma=hello, raw=hello[+VI]}
MorphoAnalysis{extendedProperties={}, partOfSpeech=VPRES, lemma=hello, raw=hello[+VPRES]}
MorphoAnalysis{extendedProperties={}, partOfSpeech=NOUN, lemma=hello, raw=hello[+NOUN]}
MorphoAnalysis{extendedProperties={}, partOfSpeech=PROP, lemma=Hello}
Token{text=world}
MorphoAnalysis{extendedProperties={}, partOfSpeech=NOUN, lemma=world, raw=world[+NOUN]}
Token{text=!}
MorphoAnalysis{extendedProperties={}, partOfSpeech=SENT, lemma=!, raw=![+SENT]}
Token{text=:)}
MorphoAnalysis{extendedProperties={}, partOfSpeech=EMO, lemma=:)}
MorphoAnalysis{extendedProperties={}, partOfSpeech=ADJ, lemma=:), raw=:)[+ADJ]}
MorphoAnalysis{extendedProperties={}, partOfSpeech=NOUN, lemma=:), raw=:)[+NOUN]}
MorphoAnalysis{extendedProperties={}, partOfSpeech=PROP, lemma=:), raw=:)[+PROP]}
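If you need to consume RBLCmd output in a script, the Token and MorphoAnalysis lines shown above can be picked apart with regular expressions. The sketch below assumes the output keeps exactly the textual layout of these examples, which is not a documented, stable format; the RblCmdOutput and Entry names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical parser for the line layout shown above; not part of RBL-JE.
final class RblCmdOutput {
    private static final Pattern TOKEN = Pattern.compile("Token\\{text=(.*)\\}");
    // The raw=... field is optional, as in the PROP and EMO lines above.
    private static final Pattern ANALYSIS = Pattern.compile(
            "MorphoAnalysis\\{.*?partOfSpeech=([^,}]+), lemma=(.*?)(?:, raw=(.*))?\\}");

    /** One token with the parts of speech and lemmas of its analyses. */
    static final class Entry {
        final String text;
        final List<String> partsOfSpeech = new ArrayList<>();
        final List<String> lemmas = new ArrayList<>();

        Entry(String text) {
            this.text = text;
        }
    }

    /** Attaches each MorphoAnalysis line to the most recent Token line. */
    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            Matcher token = TOKEN.matcher(trimmed);
            if (token.matches()) {
                entries.add(new Entry(token.group(1)));
                continue;
            }
            Matcher analysis = ANALYSIS.matcher(trimmed);
            if (analysis.matches() && !entries.isEmpty()) {
                Entry last = entries.get(entries.size() - 1);
                last.partsOfSpeech.add(analysis.group(1));
                last.lemmas.add(analysis.group(2));
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Reads RBLCmd output from standard input; other lines are ignored.
        List<String> lines = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))
                .lines().collect(Collectors.toList());
        for (Entry e : parse(lines)) {
            System.out.println(e.text + "\t" + e.partsOfSpeech);
        }
    }
}
```

With this sketch, the output of an RBLCmd invocation like those above could be piped into java RblCmdOutput to list each token with its candidate parts of speech.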