The Rosette language identifier endpoint identifies the language(s) in which the input is written. The endpoint returns a list of language identification results in descending order of confidence, so the first result is the best match.
Input data may be in any of the supported language–encoding–script combinations. The language identifier uses an n-gram algorithm to detect language. Each built-in profile contains the quad-grams (i.e., sequences of four consecutive bytes) that are most frequently encountered in documents in a given language, encoding, and script. By default, a profile holds 10,000 n-grams for double-byte encodings and 5,000 n-grams for single-byte encodings.
When input text is submitted for detection, a similar n-gram profile is built from that data. The input profile is then compared with every built-in profile by calculating a vector distance between the two. The built-in profiles are returned in ascending order of their distance from the input profile, so the closest match comes first.
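The profile comparison described above can be illustrated with a minimal sketch. It is not Rosette's implementation: the source only says "a vector distance measure", so the Euclidean distance used here is an assumption, as are the helper names (build_profile, distance, rank_profiles) and the toy profiles built from two short sample strings.

```python
from collections import Counter
import math


def build_profile(data: bytes, n: int = 4, top_k: int = 5000) -> dict:
    """Map each of the top_k most frequent byte n-grams to its relative frequency."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    most_common = counts.most_common(top_k)
    total = sum(count for _, count in most_common)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in most_common}


def distance(a: dict, b: dict) -> float:
    """Euclidean distance between two sparse frequency vectors (an assumed metric)."""
    keys = a.keys() | b.keys()
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def rank_profiles(text: str, profiles: dict, encoding: str = "utf-8") -> list:
    """Return (profile name, distance) pairs, closest profile first."""
    input_profile = build_profile(text.encode(encoding))
    scored = [(name, distance(input_profile, p)) for name, p in profiles.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Toy stand-ins for built-in profiles; real profiles come from large corpora.
toy_profiles = {
    "eng": build_profile(b"the quick brown fox jumps over the lazy dog and the cat"),
    "spa": build_profile(b"el rapido zorro marron salta sobre el perro perezoso y el gato"),
}

ranking = rank_profiles("the dog and the fox", toy_profiles)
print(ranking[0][0])
```

Sorting by ascending distance puts the closest profile first, which matches the descending-confidence ordering of the results the endpoint returns.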
For a number of languages, the endpoint provides a different, proprietary algorithm for detecting the language of short strings (140 characters or fewer).
You can use the GET /language/supported-languages method to dynamically retrieve the list of supported languages and scripts for the language identification endpoint.
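As a rough sketch, the request for the supported-languages list could be built as follows. Only the /language/supported-languages path comes from this page; the base URL, the X-RosetteAPI-Key header name, and the placeholder key are assumptions to check against your own Rosette account details. The request is constructed but not sent, so the snippet runs without network access.

```python
import urllib.request

# Assumptions, not taken from this page: base URL, header name, and key value.
BASE_URL = "https://api.rosette.com/rest/v1"
API_KEY = "your-api-key"


def build_supported_languages_request(base_url: str = BASE_URL,
                                      api_key: str = API_KEY) -> urllib.request.Request:
    """Build (but do not send) the GET request for the supported-languages list."""
    return urllib.request.Request(
        url=f"{base_url}/language/supported-languages",
        method="GET",
        headers={"X-RosetteAPI-Key": api_key, "Accept": "application/json"},
    )


request = build_supported_languages_request()
print(request.get_method(), request.full_url)

# To actually send it (requires a valid key and network access):
# import json
# with urllib.request.urlopen(request) as response:
#     payload = json.load(response)
```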
If you'd like to try out Rosette's language identification capabilities, sign up for Rosette here.
The Language Identification REST endpoint is: