Deprecated. Use LanguageIdentifierBuilder to create an Annotator.
Rosette Language Identifier: the starting point for identifying language, encoding, and script.
You can obtain a LanguageIdentifier instance from a LanguageIdentifierFactory instance. Once that is done, LanguageIdentifier settings can be configured on a per-instance basis.
LanguageIdentifier instances cannot be shared by different threads.

Typical usage:

    LanguageIdentifierFactory lif = new LanguageIdentifierFactory(validLicense);
    LanguageIdentifier languageId = lif.create();
    Result[] results = languageId.detect("Data string");

The return value of any of the detect methods is an ordered array of Result objects. Each Result represents a possibility for the detected language and encoding (see com.basistech.rli.Result).
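As a rough illustration of consuming an ordered result array, the sketch below picks the first entry and guards against an empty array. Candidate is a hypothetical stand-in for com.basistech.rli.Result (whose accessors are not shown on this page), and treating the first element as the strongest possibility is an assumption about the ordering.

```java
import java.util.Optional;

// Hypothetical stand-in for com.basistech.rli.Result; the real accessors may differ.
final class Candidate {
    final String language;
    final String encoding;

    Candidate(String language, String encoding) {
        this.language = language;
        this.encoding = encoding;
    }
}

final class BestCandidate {
    // Assumes the array is ordered with the strongest possibility first.
    // Returns empty for a null or zero-length array instead of throwing.
    static Optional<Candidate> best(Candidate[] ordered) {
        if (ordered == null || ordered.length == 0) {
            return Optional.empty();
        }
        return Optional.of(ordered[0]);
    }
}
```

Returning an Optional keeps the "no usable result" case explicit at the call site rather than relying on an index check.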
Any call to a detect method takes the given input and forms an InputProfile. Characteristics of the InputProfile and of the Results may be set on the LanguageIdentifier using its set methods before calling detect.
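Because settings are per instance and instances cannot be shared between threads, one possible pattern is to give each thread its own configured instance through a ThreadLocal. The sketch below uses StandInIdentifier, a hypothetical stand-in for LanguageIdentifier so that it runs without the SDK. The value 20 is purely illustrative and is not a documented default.

```java
// Hypothetical stand-in for LanguageIdentifier so the sketch compiles without the SDK.
final class StandInIdentifier {
    private int minValidChars;

    void setMinValidChars(int count) {
        this.minValidChars = count;
    }

    int getMinValidChars() {
        return minValidChars;
    }
}

final class PerThreadIdentifier {
    // Each thread lazily builds and configures its own instance on first use.
    // With the SDK, this initializer would call create() on a LanguageIdentifierFactory.
    private static final ThreadLocal<StandInIdentifier> LOCAL = ThreadLocal.withInitial(() -> {
        StandInIdentifier id = new StandInIdentifier();
        id.setMinValidChars(20); // illustrative value, not a documented default
        return id;
    });

    static StandInIdentifier get() {
        return LOCAL.get();
    }

    // Returns true when a second thread receives a different instance than the caller.
    static boolean otherThreadGetsOwnInstance() {
        StandInIdentifier mine = get();
        StandInIdentifier[] theirs = new StandInIdentifier[1];
        Thread worker = new Thread(() -> theirs[0] = get());
        worker.start();
        try {
            worker.join(); // join() makes the worker's write visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return theirs[0] != null && theirs[0] != mine;
    }
}
```

Configuring inside the initializer means every thread sees the same settings without any instance ever crossing a thread boundary.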
@Deprecated public final class LanguageIdentifier extends Object
Modifier and Type | Method and Description |
---|---|
Result[] | detect(byte[] data) Deprecated. Run RLI to detect the language, script, and encoding of an array of bytes. |
Result[] | detect(ByteBuffer data) Deprecated. Run RLI to detect the language, script, and encoding of a ByteBuffer. |
Result[] | detect(char[] data) Deprecated. Run RLI to detect the language, script, and encoding of data represented as a char[]. |
Result[] | detect(char[] data, int start, int length) Deprecated. Run RLI to detect the language, script, and encoding of a portion of a char[]. |
Result[] | detect(CharBuffer data) Deprecated. Run RLI to detect the language, script, and encoding of a CharBuffer. |
Result[] | detect(String data) Deprecated. Run RLI to detect the language, script, and encoding of data of a String. |
double | getAmbiguityThreshold() Deprecated. Get the N-gram distance ambiguity threshold, value range [0-100]. |
static double | getDefaultEncodingHintWeight() Deprecated. Get the Encoding hint weight factor. |
static double | getDefaultFineAmbiguityThreshold() Deprecated. Get the default N-gram distance ambiguity threshold. |
static double | getDefaultInvalidityThreshold() Deprecated. Get the default N-gram distance invalidity threshold, value range [0-100]. |
static double | getDefaultLanguageHintWeight() Deprecated. Return the default weight for the hint used to help resolve ambiguous results. |
static int | getDefaultMinValidChars() Deprecated. Get the default minimum number of valid characters required for identification. |
static int | getDefaultProfileDepth() Deprecated. Returns the default number of most frequent ngrams to consider in detection. |
EncodingCode | getEncodingHint() Deprecated. Get the EncodingCode used to help resolve ambiguous results, results matching the hint encoding will be favored. |
double | getEncodingHintWeight() Deprecated. Get the weight used to help resolve ambiguous results. |
double | getInvalidityThreshold() Deprecated. Get N-gram distance invalidity threshold, value range [0-100]. |
LanguageCode | getLanguageHint() Deprecated. Return the current LanguageCode for the hint used to help resolve ambiguous results. |
double | getLanguageHintWeight() Deprecated. Return the current weight for the hint used to help resolve ambiguous results. |
int | getMinValidChars() Deprecated. Get the minimum number of valid characters required for identification. |
int | getNumProfiles() Deprecated. Get the number of language/script/encoding profiles supported by LanguageIdentifier. |
int | getProfileDepth() Deprecated. Returns the current number of most frequent ngrams to consider in detection. |
static Map<LanguageCode,com.google.common.collect.Multimap<ISO15924,EncodingCode>> | getSupportedProfiles() Deprecated. List supported profiles, i.e., triples of languages, scripts, and encodings that can be returned by detect(byte[]). |
double[] | getWeightAdjustments() Deprecated. For testing, allow a test to fish these back out. |
void | setAmbiguityThreshold(double threshold) Deprecated. Set the N-gram distance ambiguity threshold. |
void | setEncodingHint(EncodingCode encoding) Deprecated. Set the EncodingCode used to help resolve ambiguous results, results matching the hint encoding will be favored. |
void | setEncodingHint(EncodingCode encoding, double weight) Deprecated. Set the EncodingCode and weight used to help resolve ambiguous results, results matching the hint encoding will be favored. |
void | setInvalidityThreshold(double threshold) Deprecated. Set N-gram distance invalidity threshold, value range [0-100]. |
void | setLanguageHint(LanguageCode language) Deprecated. Set the LanguageCode for the hint used to help resolve ambiguous results. |
void | setLanguageHint(LanguageCode language, double weight) Deprecated. Set the LanguageCode and weight for the hint used to help resolve ambiguous results. |
void | setLanguageWeightAdjustment(LanguageCode language, int weight) Deprecated. Specifies a language weight adjustment. |
void | setLanguageWeightAdjustment(LanguageCode language, ISO15924 script, int weight) Deprecated. Specifies a language and script weight adjustment. |
void | setMaxResults(int max) Deprecated. Set the maximum number of results to return. |
void | setMinValidChars(int count) Deprecated. Set the minimum number of valid characters required for identification. |
void | setProfileDepth(int depth) Deprecated. Sets the current number of most frequent ngrams to consider in detection. |
public int getNumProfiles()
public static int getDefaultMinValidChars()
public int getMinValidChars()
public void setMinValidChars(int count)
public static int getDefaultProfileDepth()
public int getProfileDepth()
public void setProfileDepth(int depth)
public static double getDefaultFineAmbiguityThreshold()
public double getAmbiguityThreshold()
public void setAmbiguityThreshold(double threshold)
public static double getDefaultInvalidityThreshold()
public double getInvalidityThreshold()
public void setInvalidityThreshold(double threshold)
public LanguageCode getLanguageHint()
public double getLanguageHintWeight()
public static double getDefaultLanguageHintWeight()
public void setLanguageHint(LanguageCode language)
language - Hint language. Should be one of the system's supported LanguageCodes. LanguageCode.Unknown can be used to disable the hint.
public void setLanguageHint(LanguageCode language, double weight)
language - Hint language. LanguageCode.Unknown can be used to disable the hint.
weight - Value range [1-99]. A value of 1 is the lightest hint; a value of 99 is the strongest.
public EncodingCode getEncodingHint()
public double getEncodingHintWeight()
public static double getDefaultEncodingHintWeight()
public void setEncodingHint(EncodingCode encoding)
encoding - Hint encoding. EncodingCode.Unknown can be used to disable the hint.
public void setEncodingHint(EncodingCode encoding, double weight)
encoding - Hint encoding. EncodingCode.Unknown can be used to disable the hint.
weight - Value range [1-100]. A value of 100 forces only those results that match the encoding hint to be considered during detection.
public void setLanguageWeightAdjustment(LanguageCode language, int weight)
language - the language to be adjusted
weight - the weight adjustment percentage. Must not be negative.
public void setLanguageWeightAdjustment(LanguageCode language, ISO15924 script, int weight)
language - the language to be adjusted
script - the script to be adjusted
weight - the weight adjustment percentage. Must not be negative.
public void setMaxResults(int max) throws RosetteIllegalArgumentException
max - number of results
Throws: RosetteIllegalArgumentException - if max is non-positive
public static Map<LanguageCode,com.google.common.collect.Multimap<ISO15924,EncodingCode>> getSupportedProfiles()
List supported profiles, i.e., triples of languages, scripts, and encodings that can be returned by detect(byte[]).
public Result[] detect(byte[] data) throws LanguageIdentificationException
data - data to examine.
Throws: LanguageIdentificationException
public Result[] detect(ByteBuffer data) throws LanguageIdentificationException
data - the data to examine. RLI will examine the bytes from data.position() to data.limit().
Throws: LanguageIdentificationException
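Since only the bytes between position() and limit() are examined, it can help to see how that window is set up with plain java.nio before handing a buffer to detect. The sketch below uses only standard-library calls. The class and method names are illustrative and are not part of RLI.

```java
import java.nio.ByteBuffer;

final class ByteBufferWindow {
    // Wraps bytes [offset, offset + length) without copying:
    // position() == offset and limit() == offset + length.
    static ByteBuffer window(byte[] bytes, int offset, int length) {
        return ByteBuffer.wrap(bytes, offset, length);
    }

    // Copies the bytes between position() and limit().
    // Reads through a duplicate so the caller's position is left untouched.
    static byte[] visibleBytes(ByteBuffer data) {
        ByteBuffer view = data.duplicate();
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }
}
```

Checking visibleBytes on a buffer is a cheap way to confirm which part of a larger array a consumer that honors position() and limit() would actually see.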
public Result[] detect(char[] data) throws LanguageIdentificationException
data - data to examine.
Throws: LanguageIdentificationException
public Result[] detect(char[] data, int start, int length) throws LanguageIdentificationException
data - the data to examine.
start - the index of the first character to examine.
length - how many characters to examine.
Throws: LanguageIdentificationException
public Result[] detect(CharBuffer data) throws LanguageIdentificationException
data - the data to examine. RLI will examine the characters from the position() to the limit().
Throws: LanguageIdentificationException
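The start/length pair of the char[] overload and the position()/limit() window of a CharBuffer describe the same kind of sub-range. The sketch below shows the correspondence using only java.nio. CharWindow is an illustrative name, and no claim is made here that the two detect overloads behave identically.

```java
import java.nio.CharBuffer;

final class CharWindow {
    // Wraps chars [start, start + length) without copying:
    // position() == start and limit() == start + length.
    static CharBuffer window(char[] data, int start, int length) {
        return CharBuffer.wrap(data, start, length);
    }

    // CharBuffer.toString() returns the characters from position() to limit()
    // and does not move the position.
    static String visibleText(CharBuffer data) {
        return data.toString();
    }
}
```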
public Result[] detect(String data) throws LanguageIdentificationException
data - the data
Throws: LanguageIdentificationException
public double[] getWeightAdjustments()
Copyright © 2016 Basis Technology Corporation. All Rights Reserved.