Rosette Language Identifier (RLI) identifies the language, encoding, and writing script of input text. For some languages, RLI also provides a separate algorithm for detecting the language of short strings.
Standard Language Detection
The standard algorithm identifies the language, encoding, and writing script of the contents of a buffer, a file, or a string. In documents that contain blocks of text in different languages, it also identifies each language region, along with the language, encoding, and script of that region. The input may be in, or contain regions in, any of the supported language-encoding-script combinations listed in RLI Languages.
RLI uses an n-gram algorithm to detect language, encoding, and script. Each built-in profile contains the quad-grams (sequences of four consecutive bytes) that are most frequently encountered in documents in a given language, encoding, and script. When input text is submitted for detection, RLI builds a similar n-gram profile from that input. It then compares the input profile with every built-in profile by calculating a vector distance between the two. The built-in profiles are returned in ascending order of distance from the input, so the closest match comes first.
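The profile-and-compare approach described above can be illustrated with a minimal Java sketch. This is not RLI's implementation: the class name QuadGramSketch, its method names, and the sample texts are hypothetical. The choice of cosine distance is also an assumption, since the text only states that a vector distance measure is used. For brevity the sketch keeps every quad-gram rather than only the most frequent ones.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the RLI API.
class QuadGramSketch {

    // Build a relative-frequency profile of every 4-byte window in the UTF-8 bytes.
    public static Map<String, Double> profile(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        Map<String, Double> counts = new HashMap<>();
        int windows = bytes.length - 3; // number of 4-byte windows (may be <= 0)
        for (int i = 0; i < windows; i++) {
            // ISO_8859_1 maps each byte to exactly one char, so the key is lossless.
            String gram = new String(bytes, i, 4, StandardCharsets.ISO_8859_1);
            counts.merge(gram, 1.0, Double::sum);
        }
        counts.replaceAll((gram, count) -> count / windows);
        return counts;
    }

    // Cosine distance (assumed measure): near 0 for identical profiles,
    // exactly 1 when the profiles share no quad-grams.
    public static double cosineDistance(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 1.0; // an empty profile matches nothing
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Return the built-in profile names in ascending order of distance from the input.
    public static List<String> rank(String input, Map<String, Map<String, Double>> builtIn) {
        Map<String, Double> inputProfile = profile(input);
        List<String> names = new ArrayList<>(builtIn.keySet());
        names.sort(Comparator.comparingDouble(
                (String name) -> cosineDistance(inputProfile, builtIn.get(name))));
        return names;
    }

    public static void main(String[] args) {
        // Toy "built-in" profiles derived from one sentence each.
        Map<String, Map<String, Double>> builtIn = new HashMap<>();
        builtIn.put("english", profile("the quick brown fox jumps over the lazy dog"));
        builtIn.put("spanish", profile("el zorro salta sobre el perro perezoso"));
        System.out.println(rank("the lazy dog", builtIn)); // prints [english, spanish]
    }
}
```

In this toy setup the input shares all of its quad-grams with the English sample and none with the Spanish one, so the English profile is returned first. Real profiles would be built from large corpora for each language-encoding-script combination.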
Short-String Language Detection
RLI provides a different, rule- and model-based algorithm that is more accurate for short inputs. By default, short-string detection will read its models from disk. Alternatively, you can direct it to read models from a JAR file using LanguageIdentifierBuilder#useModelsInJar(boolean).
For a list of the language–encoding–script combinations supported by the standard algorithm and the languages supported by the short-string algorithm, see RLI Languages.