The Rosette language identifier endpoint identifies the language of the input.
The input data may be in any of 364 language-encoding-script combinations, involving 56 languages, 48 encodings, and 18 writing scripts. The language identifier uses an n-gram algorithm to detect the language. Each of the 155 built-in profiles contains the quad-grams (that is, sequences of four consecutive bytes) that occur most frequently in documents in a given language, encoding, and script. The default number of n-grams is 10,000 for double-byte encodings and 5,000 for single-byte encodings.
When input text is submitted for detection, a similar n-gram profile is built from that data. The input profile is then compared with each of the built-in profiles by calculating a vector distance between the two. The built-in profiles are then returned in ascending order of their distance from the input profile, so the closest match comes first.
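The profile-and-compare approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the endpoint's actual implementation: the function names (build_profile, profile_distance, rank_profiles) are hypothetical, the profiles store relative frequencies, and Euclidean distance stands in for the unspecified vector distance measure.

```python
import math
from collections import Counter


def build_profile(data: bytes, n: int = 4, top_k: int = 5000) -> dict:
    """Map the top_k most frequent byte n-grams to their relative frequency."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.most_common(top_k)}


def profile_distance(a: dict, b: dict) -> float:
    """Euclidean distance between two sparse n-gram frequency vectors.

    Assumption: the real distance measure is not documented here.
    """
    grams = a.keys() | b.keys()
    return math.sqrt(sum((a.get(g, 0.0) - b.get(g, 0.0)) ** 2 for g in grams))


def rank_profiles(data: bytes, profiles: dict) -> list:
    """Return (profile name, distance) pairs, closest profile first."""
    input_profile = build_profile(data)
    scored = [(name, profile_distance(input_profile, profile))
              for name, profile in profiles.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Toy "built-in" profiles trained on one sentence each (illustrative only).
profiles = {
    "eng": build_profile(b"the quick brown fox jumps over the lazy dog"),
    "spa": build_profile(
        "el rápido zorro marrón salta sobre el perro perezoso".encode("utf-8")),
}
ranked = rank_profiles(b"the quick brown fox jumps over the lazy dog", profiles)
```

Because the profiles are built over bytes rather than characters, the same text in two different encodings produces two different profiles, which is consistent with having one profile per language, encoding, and script.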
For a number of languages, the endpoint provides a different proprietary algorithm for detecting the language of short strings.
The endpoint returns a list of language detection results in descending order of confidence, so the first result is the best.
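As a rough sketch of how a client might consume that ordered list, the snippet below reads the first entry of a JSON response body. The field names (languageDetections, language, confidence) are an assumption about the response shape, the helper best_detection is hypothetical, and the sample payload is made up for illustration rather than taken from a real response.

```python
import json
from typing import Optional


def best_detection(response_body: str) -> Optional[dict]:
    """Return the first (highest-confidence) detection, or None if there is none.

    Assumes the body is JSON with a "languageDetections" list that is
    already sorted in descending order of confidence.
    """
    payload = json.loads(response_body)
    detections = payload.get("languageDetections", [])
    return detections[0] if detections else None


# Made-up sample body, used only to exercise the helper.
sample_body = json.dumps({
    "languageDetections": [
        {"language": "spa", "confidence": 0.9},
        {"language": "por", "confidence": 0.1},
    ]
})
best = best_detection(sample_body)
```

Relying on the documented ordering keeps the client simple. A more defensive client could instead select the entry with the maximum confidence value.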