public final class LanguageIdentifierBuilder extends Object
Builds Annotator objects that perform encoding, script, language, and language region detection.
Construct an object of this class by supplying an RLI root directory. Root directories contain
data required by RLI, and optionally a license file in the file licenses/rlp-license.xml.
If you do not wish to put your license in a file in that location, you may pass the contents of your
XML license to the license(byte[]) or license(String) method.
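As an illustration of the second option, the sketch below loads license XML from an arbitrary location so that it can be handed to license(byte[]) or license(String). LicenseLoader and its method names are hypothetical helpers, not part of the library, and decoding the file as UTF-8 is an assumption about how the license was saved.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the library: loads license XML that is
// kept somewhere other than rootDirectory/licenses.
final class LicenseLoader {
    private LicenseLoader() { }

    // Raw file contents, in the form expected by license(byte[]).
    static byte[] readLicenseBytes(Path licenseFile) throws IOException {
        return Files.readAllBytes(licenseFile);
    }

    // Decoded file contents, in the form expected by license(String).
    // UTF-8 is an assumption; use the encoding the file was written in.
    static String readLicenseString(Path licenseFile) throws IOException {
        return new String(Files.readAllBytes(licenseFile), StandardCharsets.UTF_8);
    }
}

// Intended use, left as a comment because it needs the library on the classpath:
//   new LanguageIdentifierBuilder(rootDirectory)
//       .license(LicenseLoader.readLicenseBytes(licensePath));
```

Passing bytes avoids any decoding decision on the caller's side, which may be preferable when the encoding of the license file is not known.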
Once you have created an instance of this class, call its fluent methods to set any options
that control the identification process. Then call buildSingleLanguageAnnotator() or buildLanguageRegionAnnotator()
to get a single-language or language region annotator.
Single-language annotators assume that the entire input is in one language. They accept bytes or
characters, and return detailed information on the possible language and encoding, including
multiple possible results, ranked by confidence.
Language region annotators accept only characters, and return a division of the text into regions
by language and script. They do not return multiple alternatives.
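The thread-safety notes in this class description say that building annotators is thread safe but that the annotators themselves are not. One way to respect that, sketched here under stated assumptions, is to configure the builder once and let each worker thread build and keep its own annotator. PerThreadAnnotator is a hypothetical wrapper, not part of the library; the factory passed to it is assumed to wrap a build call such as buildSingleLanguageAnnotator() on a builder whose settings are no longer being changed.

```java
import java.util.function.Supplier;

// Hypothetical wrapper, not part of the library: gives each thread its own
// annotator instance, created lazily from a shared factory.
final class PerThreadAnnotator<A> {
    private final ThreadLocal<A> local;

    // The factory is assumed to call a build method on a fully configured
    // builder. If that build method throws a checked exception, the factory
    // has to catch and wrap it, because Supplier.get() declares none.
    PerThreadAnnotator(Supplier<A> factory) {
        this.local = ThreadLocal.withInitial(factory);
    }

    // Returns the calling thread's instance, building it on first use.
    A get() {
        return local.get();
    }
}
```

In a thread pool, the per-thread instances live as long as the pooled threads do, so this trades memory for the absence of locking.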
Building annotators with this builder is thread safe. Changing settings is typically not thread safe.
Annotators are not thread safe.

Modifier and Type | Class and Description |
---|---|
static class |
LanguageIdentifierBuilder.WeightAdjustmentKey
A pair, consisting of a language and (optional) script, used for language hints.
|
Constructor and Description |
---|
LanguageIdentifierBuilder(File rootDirectory)
Constructs a builder from a directory.
|
LanguageIdentifierBuilder(String licenseFromXML)
Constructs a builder using a license from an XML-formatted string.
|
Modifier and Type | Method and Description |
---|---|
double |
ambiguityThreshold() |
LanguageIdentifierBuilder |
ambiguityThreshold(double ambiguityThreshold)
Set the N-gram distance ambiguity threshold.
|
boolean |
breakRegionOnScriptBoundary() |
LanguageIdentifierBuilder |
breakRegionOnScriptBoundary(boolean breakRegionOnScriptBoundary)
Sets whether to force a language boundary when the script changes.
|
LanguageIdentifier |
buildLanguageIdentifier()
Deprecated.
|
Annotator |
buildLanguageRegionAnnotator()
Build an annotator that divides the input document into regions by language and script.
|
Annotator |
buildLanguageRegionAnnotator(Annotator scriptRegionAnnotator,
Annotator sentenceBoundaryAnnotator)
Build an annotator that divides the input document into regions by language and script, taking annotators for
script regions and sentences as inputs.
|
LanguageIdentificationAnnotator |
buildSingleLanguageAnnotator()
Create a LanguageIdentificationAnnotator that accepts characters or bytes and performs language identification.
|
EncodingCode |
encodingHint()
Deprecated.
|
LanguageIdentifierBuilder |
encodingHint(EncodingCode encoding,
Double weight)
Deprecated.
|
double |
encodingHintWeight()
Deprecated.
|
double |
invalidityThreshold() |
LanguageIdentifierBuilder |
invalidityThreshold(double invalidityThreshold)
Set the N-gram distance invalidity threshold; valid values range from 0 to 100.
|
LanguageCode |
languageHint()
Deprecated.
|
LanguageIdentifierBuilder |
languageHint(LanguageCode language,
Double weight)
Deprecated.
|
double |
languageHintWeight()
Deprecated.
|
LanguageIdentifierBuilder |
languageWeightAdjustment(LanguageCode language,
ISO15924 script,
int weight)
Set a hint value for a language, or a language and a script.
|
Map<LanguageIdentifierBuilder.WeightAdjustmentKey,Integer> |
languageWeightAdjustments() |
byte[] |
license() |
LanguageIdentifierBuilder |
license(byte[] license)
Supply a license, overriding any license file in the rootDirectory/licenses directory.
|
LanguageIdentifierBuilder |
license(String license)
Supply a license, overriding any license file in the rootDirectory/licenses directory.
|
int |
maxRegionLength() |
LanguageIdentifierBuilder |
maxRegionLength(int maxRegionLength)
Set the maximum region length (amount of text examined in a script region)
for the language region detector.
|
int |
maxResults() |
LanguageIdentifierBuilder |
maxResults(int maxResults)
Specify the maximum number of results to return.
|
int |
minRegionLength() |
LanguageIdentifierBuilder |
minRegionLength(int minRegionLength)
Set the minimum length for a region.
|
int |
minValidChars() |
LanguageIdentifierBuilder |
minValidChars(int minValidChars)
Set the minimum number of valid characters required for identification.
|
int |
profileDepth() |
LanguageIdentifierBuilder |
profileDepth(int profileDepth)
Control whether the identifier uses low-frequency character sequences in the identification process.
|
File |
rootDirectory() |
LanguageIdentifierBuilder |
rootDirectory(File rootDirectory)
Specify the root directory.
|
int |
shortStringThreshold() |
LanguageIdentifierBuilder |
shortStringThreshold(int shortStringThreshold)
Set the threshold for short string processing.
|
boolean |
uniqueLanguages()
Gets whether to ignore script and encoding differences in language
detection results.
|
LanguageIdentifierBuilder |
uniqueLanguages(boolean uniqueLanguages)
Sets whether to ignore script and encoding differences when returning
language detection results.
|
boolean |
useModelsInJar() |
LanguageIdentifierBuilder |
useModelsInJar(boolean useModelsInJar)
Set whether to use the short-string models in a JAR file, instead of the models in the file system.
|
public LanguageIdentifierBuilder(File rootDirectory)
rootDirectory - the RLI root directory
public LanguageIdentifierBuilder(String licenseFromXML)
If you call shortStringThreshold(int), then also use rootDirectory(File) or useModelsInJar(boolean)
so that the builder can find its models, e.g.:
    new LanguageIdentifierBuilder(xmlLicense)
        .rootDirectory(new File(rootDirectory))
        .shortStringThreshold(threshold)
        .buildSingleLanguageAnnotator();
or:
    new LanguageIdentifierBuilder(xmlLicense)
        .useModelsInJar(true)
        .shortStringThreshold(threshold)
        .buildSingleLanguageAnnotator();
licenseFromXML - an XML-formatted string containing a valid license
public LanguageIdentifierBuilder rootDirectory(File rootDirectory)
rootDirectory - the root directory.
public File rootDirectory()
public LanguageIdentifierBuilder license(byte[] license)
license - XML license content.
public LanguageIdentifierBuilder license(String license)
license - XML license content.
public byte[] license()
public LanguageIdentifierBuilder minRegionLength(int minRegionLength)
minRegionLength - the region length, in characters.
public int minRegionLength()
public LanguageIdentifierBuilder maxRegionLength(int maxRegionLength)
maxRegionLength - the maximum length.
public int maxRegionLength()
public LanguageIdentifierBuilder minValidChars(int minValidChars)
minValidChars - the minimum number of valid characters.
public int minValidChars()
public LanguageIdentifierBuilder profileDepth(int profileDepth)
profileDepth - how many N-grams to consider in the profile.
public int profileDepth()
public LanguageIdentifierBuilder ambiguityThreshold(double ambiguityThreshold)
ambiguityThreshold - the threshold.
public double ambiguityThreshold()
public LanguageIdentifierBuilder invalidityThreshold(double invalidityThreshold)
invalidityThreshold - the threshold.
public double invalidityThreshold()
@Deprecated public LanguageIdentifierBuilder encodingHint(EncodingCode encoding, Double weight)
encoding - the encoding to hint.
weight - the weight of the hint.
@Deprecated public EncodingCode encodingHint()
@Deprecated public double encodingHintWeight()
@Deprecated public LanguageIdentifierBuilder languageHint(LanguageCode language, Double weight)
language - the language to hint.
weight - the weight of the hint.
@Deprecated public LanguageCode languageHint()
@Deprecated public double languageHintWeight()
public LanguageIdentifierBuilder languageWeightAdjustment(LanguageCode language, ISO15924 script, int weight)
language - the language.
script - the script. If this is null, it resets the weight adjustment for all scripts
for the given language, including any that may have been specifically set
previously via this API.
weight - the weight adjustment. The default is 100. If a different
number is given, the weight is changed to a percentage of the default weight.
For example, if 80 is set, the weight becomes 80% of the original weight. This
function can be called multiple times for multiple language/script pairs.
public Map<LanguageIdentifierBuilder.WeightAdjustmentKey,Integer> languageWeightAdjustments()
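To make the percentage arithmetic and the null-script reset described for languageWeightAdjustment concrete, here is a minimal, self-contained model. It is not the library's implementation: WeightAdjustments is a hypothetical class, plain strings stand in for LanguageCode and ISO15924, and the lookup order (script-specific entry first, then the language-wide entry, then the default of 100) is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the documented semantics, not the library's code.
final class WeightAdjustments {
    static final int DEFAULT_PERCENT = 100;

    // language -> (script, or null meaning "all scripts") -> percentage
    private final Map<String, Map<String, Integer>> byLanguage = new HashMap<>();

    // A null script first clears every script-specific entry for the
    // language, then records a language-wide percentage.
    WeightAdjustments adjust(String language, String script, int percent) {
        Map<String, Integer> scripts =
                byLanguage.computeIfAbsent(language, k -> new HashMap<>());
        if (script == null) {
            scripts.clear();
        }
        scripts.put(script, percent);
        return this;
    }

    // Assumed lookup order: script-specific, then language-wide, then default.
    int percentFor(String language, String script) {
        Map<String, Integer> scripts = byLanguage.get(language);
        if (scripts == null) {
            return DEFAULT_PERCENT;
        }
        Integer specific = scripts.get(script);
        if (specific != null) {
            return specific;
        }
        return scripts.getOrDefault(null, DEFAULT_PERCENT);
    }

    // A percentage of 80 turns a base weight of 200.0 into 160.0.
    double apply(double baseWeight, String language, String script) {
        return baseWeight * percentFor(language, script) / 100.0;
    }
}
```

For example, after adjust("eng", "Latn", 80), a later adjust("eng", null, 120) discards the Latn-specific entry, so every script of that language is then weighted at 120% in this model.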
public LanguageIdentifierBuilder useModelsInJar(boolean useModelsInJar)
public boolean useModelsInJar()
public LanguageIdentifierBuilder shortStringThreshold(int shortStringThreshold)
shortStringThreshold - the threshold.
public int shortStringThreshold()
public LanguageIdentifierBuilder breakRegionOnScriptBoundary(boolean breakRegionOnScriptBoundary)
breakRegionOnScriptBoundary - whether to force a language boundary when the script changes.
public boolean breakRegionOnScriptBoundary()
public LanguageIdentifierBuilder maxResults(int maxResults) throws IllegalArgumentException
maxResults - maximum number of results produced by a call to LanguageAnnotator.annotate(com.basistech.rosette.dm.AnnotatedText).
IllegalArgumentException - if maxResults is non-positive
public int maxResults()
public LanguageIdentifierBuilder uniqueLanguages(boolean uniqueLanguages)
uniqueLanguages - whether to treat language detection results the
same if they share a language but have different
scripts or encodings
public boolean uniqueLanguages()
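The combined effect of uniqueLanguages and maxResults on a ranked result list can be pictured with the following sketch. It only models the documented behavior: ResultFilter and Candidate are hypothetical stand-ins for the library's own types, and collapsing duplicates before applying the cap is an assumption about ordering.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not the library's code.
final class ResultFilter {
    // Stand-in for one detection result.
    static final class Candidate {
        final String language;
        final String script;
        final double confidence;

        Candidate(String language, String script, double confidence) {
            this.language = language;
            this.script = script;
            this.confidence = confidence;
        }
    }

    private ResultFilter() { }

    // Expects candidates already ranked by confidence, best first.
    static List<Candidate> filter(List<Candidate> ranked,
                                  boolean uniqueLanguages,
                                  int maxResults) {
        if (maxResults <= 0) {
            // Mirrors the documented contract of maxResults(int).
            throw new IllegalArgumentException(
                    "maxResults must be positive: " + maxResults);
        }
        List<Candidate> kept = new ArrayList<>();
        Set<String> seenLanguages = new HashSet<>();
        for (Candidate c : ranked) {
            if (kept.size() == maxResults) {
                break;
            }
            // With uniqueLanguages, only the best-ranked entry per language
            // survives, whatever its script.
            if (uniqueLanguages && !seenLanguages.add(c.language)) {
                continue;
            }
            kept.add(c);
        }
        return kept;
    }
}
```

With candidates for the same language in two scripts followed by a second language, enabling uniqueLanguages in this model keeps one entry per language, while disabling it keeps both script variants until the cap is reached.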
public LanguageIdentificationAnnotator buildSingleLanguageAnnotator() throws RosetteException
Create a LanguageIdentificationAnnotator that accepts characters or bytes and performs language identification.
For character input, call annotate(CharSequence input) to return an AnnotatedText object.
For byte input, call LanguageIdentificationAnnotator.annotate(com.basistech.rosette.dm.RawData) to return an AnnotatedText object.
AnnotatedText provides a method for returning a LanguageDetection object for
single-language detection: getWholeTextLanguageDetection().
RosetteException
@Deprecated public LanguageIdentifier buildLanguageIdentifier() throws RosetteException
Build a LanguageIdentifier. This class is obsolete, and this
method is provided to aid in migrating code to the LanguageIdentificationAnnotator interface.
RosetteException
public Annotator buildLanguageRegionAnnotator() throws RosetteException
RosetteException
public Annotator buildLanguageRegionAnnotator(Annotator scriptRegionAnnotator, Annotator sentenceBoundaryAnnotator) throws RosetteException
scriptRegionAnnotator - a script region annotator obtained from RBL.
RosetteException
Copyright © 2016 Basis Technology Corporation. All Rights Reserved.