How does Rosette language identification work?

The input data to the Rosette language identifier may be in any of 377 (and counting) language-encoding-script combinations, involving 56 languages, 48 encodings, and 18 writing scripts.

The core algorithm builds language models based on sequences of four characters, known as character n-grams, specifically quad-grams. For instance, the string "Rosette is great!" contains the quad-grams: "Rose", "oset", "sett", "ette", "tte ", "te i", "e is", " is ", "is g", "s gr", " gre", "grea", "reat", and "eat!".
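
As a minimal sketch of the extraction step (not Rosette's actual implementation; the function name `char_ngrams` is made up for illustration), a sliding window over the string yields the quad-grams listed above:

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Return every run of n consecutive characters in text, in order."""
    # A string of length L has L - n + 1 windows; shorter strings yield none.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


grams = char_ngrams("Rosette is great!")
print(len(grams))   # 14
print(grams[:4])    # ['Rose', 'oset', 'sett', 'ette']
```

Spaces and punctuation count as characters here, which is why "tte " and "eat!" appear in the list.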

Each of the built-in language profiles lists the quad-grams most frequently encountered in documents of a given language-encoding-script combination. The default number of n-grams in a language profile is 10,000 for double-byte encodings and 5,000 for single-byte encodings.
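
To illustrate the idea of a size-capped profile, here is a rough sketch that keeps only the most frequent n-grams of a training text. The function name `build_profile`, the `max_ngrams` parameter, and the use of relative frequencies as weights are assumptions made for this example; the article does not say how Rosette weights or stores its profiles.

```python
from collections import Counter


def build_profile(text: str, n: int = 4, max_ngrams: int = 5000) -> dict[str, float]:
    """Map the most frequent character n-grams in text to relative frequencies."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    # most_common() keeps only the top max_ngrams entries, mirroring the size cap.
    # Relative frequency is an assumed weighting, not a documented one.
    return {gram: count / total for gram, count in counts.most_common(max_ngrams)}


# Toy example with bigrams and a cap of one entry.
print(build_profile("abababab", n=2, max_ngrams=1))
```

In the toy call, "ab" occurs four times and "ba" three times, so only "ab" survives the cap.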

When input text is submitted for detection, a similar n-gram profile is built for that data. The input profile is then compared with each built-in profile using a vector distance measure. The built-in profiles are returned in ascending order of distance from the input profile, so the closest match comes first.
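
The comparison step can be sketched as follows. The article does not name the distance measure, so Euclidean distance over relative n-gram frequencies is used here purely as a stand-in; the helper names `profile`, `distance`, and `rank`, and the two toy models, are likewise hypothetical.

```python
import math
from collections import Counter


def profile(text: str, n: int = 4) -> dict[str, float]:
    """Relative frequency of each character n-gram in text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values()) or 1  # avoid dividing by zero on short input
    return {gram: count / total for gram, count in counts.items()}


def distance(a: dict[str, float], b: dict[str, float]) -> float:
    """Euclidean distance between two sparse profiles (assumed measure)."""
    keys = a.keys() | b.keys()
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def rank(input_text: str, models: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Return (model name, distance) pairs, nearest model first."""
    input_profile = profile(input_text)
    scored = [(name, distance(input_profile, model)) for name, model in models.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Toy models built from one sentence each; real profiles come from large corpora.
models = {
    "english": profile("the quick brown fox jumps over the lazy dog and the cat"),
    "spanish": profile("el rápido zorro marrón salta sobre el perro perezoso y el gato"),
}
for name, dist in rank("the dog and the fox", models):
    print(name, round(dist, 3))
```

With these toy models the English profile ranks first, because the input shares quad-grams such as "the " and " dog" with the English sample and none with the Spanish one.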

Several languages also use a different, proprietary algorithm optimized for short strings, such as social media posts.

The /language endpoint returns a list of language detection results in descending order of confidence, so the first result is the best.
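
Because the list is already sorted, a client only needs the first entry to get the best guess. The sketch below assumes the parsed JSON response exposes the results under a `languageDetections` key, with `language` and `confidence` fields on each entry; those field names and the sample values are assumptions for illustration and should be checked against the API reference.

```python
from typing import Optional


def best_language(response: dict) -> Optional[str]:
    """Return the top-ranked language code, or None if nothing was detected."""
    # Assumed response shape: {"languageDetections": [{"language": ..., "confidence": ...}]}
    detections = response.get("languageDetections", [])
    return detections[0]["language"] if detections else None


# Hypothetical response body, already parsed from JSON.
sample = {
    "languageDetections": [
        {"language": "eng", "confidence": 0.92},
        {"language": "fra", "confidence": 0.05},
    ]
}
print(best_language(sample))  # eng
```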
