The Rosette language identifier endpoint identifies in which language(s) the input is written. The endpoint returns a list of language identification results in descending order of confidence, so the first result is the best.
The input data may be in any of 364 language–encoding–script combinations, involving 56 languages, 48 encodings, and 18 writing scripts. The language identifier uses an n-gram algorithm to detect language. Each of the 155 built-in profiles contains the quad-grams (i.e., four consecutive bytes) that are most frequently encountered in documents in a given language, encoding, and script. The default number of n-grams is 10,000 for double-byte encodings and 5,000 for single-byte encodings.
When input text is submitted for detection, a similar n-gram profile is built from that data. The input profile is then compared with each built-in profile by calculating a vector distance between the two. The built-in profiles are returned in ascending order of their distance from the input profile, so the closest match comes first.
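The profile-and-compare approach described above can be illustrated with a minimal Python sketch. This is not Rosette's implementation: the function names (`build_profile`, `distance`, `rank_profiles`), the frequency normalization, the choice of Euclidean distance, and the tiny in-line training samples are all assumptions made for illustration. The source only states that byte quad-grams are profiled and that a vector distance is computed.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List, Tuple

Profile = Dict[bytes, float]


def build_profile(data: bytes, n: int = 4, top_k: int = 5000) -> Profile:
    """Keep the top_k most frequent byte n-grams, as relative frequencies."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    most_common = counts.most_common(top_k)
    total = sum(count for _, count in most_common)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in most_common}


def distance(a: Profile, b: Profile) -> float:
    """Euclidean distance between two sparse frequency vectors (an assumed measure)."""
    keys = a.keys() | b.keys()
    return sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def rank_profiles(data: bytes, profiles: Dict[str, Profile]) -> List[Tuple[str, float]]:
    """Return (profile name, distance) pairs, closest profile first."""
    input_profile = build_profile(data)
    scored = [(name, distance(input_profile, p)) for name, p in profiles.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical toy "built-in" profiles; real profiles would be trained on large corpora.
samples = {
    "eng": "the quick brown fox jumps over the lazy dog and the cat sat on the mat with the other animals",
    "spa": "el rápido zorro marrón salta sobre el perro perezoso y el gato se sentó en la alfombra con los otros animales",
}
profiles = {name: build_profile(text.encode("utf-8")) for name, text in samples.items()}

ranked = rank_profiles("the dog sat on the mat".encode("utf-8"), profiles)
for name, dist in ranked:
    print(name, round(dist, 4))
```

Because the profiles are built over bytes rather than characters, the same text in a different encoding yields a different profile, which is consistent with the built-in profiles being specific to a language, encoding, and script.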
For a number of languages, the endpoint provides a different, proprietary algorithm for detecting the language of short strings (140 characters or fewer).
If you'd like to try out Rosette's language identification capabilities, sign up for the Rosette API here.
The Language Identification REST endpoint is: