Rosette text analytics uses linguistic analysis, statistical modeling, and machine learning to accurately process unstructured text and names, revealing valuable information and actionable data. Many of the calls return scores measuring confidence, salience, or matching along with the data. Each endpoint and SDK may have a unique definition and calculation for these scores.
Threshold values may be used with these scores to determine which values to return as part of the output when analyzing text documents. This article attempts to explain the meanings behind these values by endpoint, along with some best practices for using them.
In general, your results will depend on your data. No single value will represent relevance or accuracy across all possible applications. By testing with and evaluating your own data, you can determine what values work for your environment.
Depending on the endpoint, Rosette may return a confidence score, a salience score, a raw score, a match score, or some combination of scores. What are these scores measuring?
The dictionary defines confidence as "the quality or state of being certain". In machine learning, a confidence score is the probability the model assigns to its prediction. In the context of Rosette, confidence is a measure of how correct Rosette believes its response to be. It attempts to answer the question: How sure am I that this answer is the right answer?
In the context of Rosette, salience is a measure of how relevant the output is to the overall content of the input text. It attempts to answer the question: Does this analysis or information matter? A result may be completely correct, yet not of interest to the problem being evaluated.
Some endpoints return raw scores, which are normalized into confidence scores. Confidence scores are values between 0 and 1, while raw scores may be outside that range.
The name processing endpoints sometimes use the term match score instead of confidence scores, as they pertain to how well names match when compared. The match scores are also values between 0 and 1, and provide a similar functionality as confidence scores.
In the context of Rosette, a threshold determines a boundary. When a threshold is defined, only results with scores above the threshold are returned. Determining the correct threshold value allows you to see all the positive results of interest, without extraneous negative results. Increasing the threshold removes false positives, but if the threshold is set too high, true positives may be removed as well.
While Rosette may have a default value defined for a threshold, the optimal value will depend on your input text, data, and the problem you are solving.
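The precision/recall tradeoff described above can be sketched with a toy example. The scored results and their hand-labeled "correct" flags below are entirely hypothetical; the point is only to show how raising a threshold first removes false positives and then starts removing true positives.

```python
# Hypothetical scored results with hand-labeled ground truth, illustrating
# how raising a threshold trades recall for precision.
results = [
    {"text": "Acme Corp", "score": 0.92, "correct": True},
    {"text": "Boston",    "score": 0.85, "correct": True},
    {"text": "the firm",  "score": 0.40, "correct": True},
    {"text": "Tuesday",   "score": 0.35, "correct": False},
    {"text": "run",       "score": 0.10, "correct": False},
]

def precision_recall(results, threshold):
    # Keep only results scoring above the threshold, then measure quality.
    kept = [r for r in results if r["score"] > threshold]
    true_pos = sum(r["correct"] for r in kept)
    total_pos = sum(r["correct"] for r in results)
    precision = true_pos / len(kept) if kept else 1.0
    recall = true_pos / total_pos
    return precision, recall

print(precision_recall(results, 0.2))  # low threshold: a false positive slips in, but recall is full
print(precision_recall(results, 0.5))  # high threshold: false positives gone, but a true positive is lost too
```

At 0.2 the false positive "Tuesday" is kept; at 0.5 it is removed, but the correct low-scoring mention "the firm" is removed along with it. This is why the optimal threshold depends on your data and on whether precision or recall matters more to your application.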
Entity Extraction and Entity Linking
The Entity Extraction and Linking endpoint uses statistical or deep neural network (DNN) based models, as well as regex rules and gazetteer lists to identify entity mentions and label them with an entity type. Entity linking disambiguates between similar entity mentions, connecting multiple occurrences of an entity and linking to an external knowledge base.
Scores Returned
confidence: A value between 0 and 1. Only returned for statistical models. No confidence score is returned from the DNN based models, or from results generated by regex rules and gazetteers. This is the confidence value for the extracted entity.
linkingConfidence: A value between 0 and 1.
salience: A value of 0 or 1. Salience indicates whether a given entity is important to the overall scope of the document. Salience values are binary, either 0 (not salient) or 1 (salient). Salience is determined by a classifier trained to predict which entities would be included in a summary or abstract of an article.
Options
The Entity Extraction and Linking endpoint returns separate confidence scores for extraction and linking.
Options Example
"options": {"linkEntities": true,
"linkingConfidence": true,
"calculateSalience": true}
Recommendation
Confidence scores correlate well with precision and may be used as a threshold for removing false positives. There is no suggested threshold; you should set the value based on evaluation of Rosette's responses to your data and the needs of your application. We recommend starting with a lower threshold, around 0.2 or 0.3, and evaluating the effect as you increase it. Increasing the confidence threshold removes false positives; after a certain point, however, true positives will be removed as well.
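A client-side filter over extracted entities might look like the sketch below. The response shape is a simplified, hypothetical example of an entities result; note that rule-based and DNN results carry no confidence score, and a missing score does not mean low confidence, so those entities are passed through.

```python
# Hypothetical, simplified entity-extraction response.
response = {
    "entities": [
        {"mention": "Acme Corp",   "type": "ORGANIZATION",  "confidence": 0.91},
        {"mention": "Springfield", "type": "LOCATION",      "confidence": 0.27},
        {"mention": "January 5",   "type": "TEMPORAL:DATE"},  # no confidence (rule-based)
    ]
}

def filter_entities(response, threshold):
    # Keep entities scoring above the threshold; pass through entities with
    # no confidence score rather than discarding them.
    return [e for e in response["entities"]
            if e.get("confidence") is None or e["confidence"] > threshold]

print(filter_entities(response, 0.3))  # drops the low-confidence "Springfield"
```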
Language Identification
Rosette's Language Identification endpoint returns a list of detected languages in descending order of confidence.
Scores Returned
The algorithm returns a score based on the similarity of a mathematical representation of the input text to a mathematical representation of an entire language. These scores are then scaled to fall within the range 0-1, and the scaled value is returned as the confidence score.
If Rosette is unable to detect a language, the endpoint returns Unknown (xxx) with a confidence value of 1.
The short string language identification algorithm, applied when the input text is less than 141 characters, can return confidence scores greater than 1 if the user has set a language weight. The short string threshold is disabled in the RLI-JE SDK by default.
Options
Recommendation
The language with the highest confidence, which is the first language returned, is considered the best result and is usually the correct language. As with all statistical systems, no single value will always signal correctness; set baselines based on your own applications and data.
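A minimal sketch of taking the top result while guarding against low-confidence calls is shown below. The response values and the 0.5 baseline are hypothetical; the field names are illustrative of the endpoint's descending-confidence list, not a guaranteed wire format.

```python
# Hypothetical language-identification response, already sorted by
# descending confidence as the endpoint returns it.
response = {
    "languageDetections": [
        {"language": "spa", "confidence": 0.71},
        {"language": "por", "confidence": 0.21},
        {"language": "ita", "confidence": 0.08},
    ]
}

def best_language(response, min_confidence=0.5):
    top = response["languageDetections"][0]  # highest confidence first
    if top["confidence"] < min_confidence:
        return "und"  # below the baseline established for your own data
    return top["language"]

print(best_language(response))  # "spa"
```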
Topic Extraction
The Topic Extraction endpoint returns the key words, phrases, and concepts present in a text. Provided some input text, Rosette returns a list of concepts and a list of key phrases.
Scores Returned
With each concept or key phrase, a salience score is returned. No confidence score is returned.
Options
The topic extraction endpoint allows an optional threshold on the salience scores. When specified, Rosette will only return the concepts and key phrases whose associated salience is greater than the threshold value. Concept salience scores tend to be lower than key phrase salience scores.
Example:
"options": {"keyphraseSalienceThreshold": value,
"conceptSalienceThreshold": other_value}
Recommendation
Concept salience scores are more sensitive to threshold changes; that is, lowering the concept threshold by 0.05 will have a much greater impact than doing the same to the key phrase threshold. If you believe you are getting non-salient phrases (false positives), try increasing the threshold values. If relevant key phrases or concepts are not being identified, lower the threshold values.
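The effect of the two thresholds can be mirrored client-side, which is useful when tuning values before setting them as options. The response below is a simplified, hypothetical topics result; note the concept thresholds are deliberately lower than the key phrase threshold, since concept salience scores run lower.

```python
# Hypothetical, simplified topics response.
response = {
    "concepts":   [{"phrase": "Machine learning", "salience": 0.15},
                   {"phrase": "Linguistics",      "salience": 0.04}],
    "keyPhrases": [{"phrase": "text analytics",   "salience": 0.62},
                   {"phrase": "press release",    "salience": 0.18}],
}

def apply_thresholds(response, concept_t, keyphrase_t):
    # Keep only items whose salience exceeds the matching threshold,
    # mirroring conceptSalienceThreshold and keyphraseSalienceThreshold.
    return {
        "concepts":   [c for c in response["concepts"]   if c["salience"] > concept_t],
        "keyPhrases": [k for k in response["keyPhrases"] if k["salience"] > keyphrase_t],
    }

# Concept scores run lower, so the concept threshold is set lower.
print(apply_thresholds(response, concept_t=0.10, keyphrase_t=0.40))
```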
Categorization
The Categorization endpoint identifies the category or categories associated with a document.
Scores Returned
Two scores are returned with each category:
score: A value between -1 and 1. All raw score values are independent of each other.
confidence: A value between 0 and 1. The raw scores are normalized such that the sum of all the confidence scores for a given document = 1.
Confidence score values for the Categorization endpoint reflect the likelihood that a given category label is accurate relative to all other possible category labels. For every document, all twenty-one possible category labels are assigned a confidence score, and these scores together sum to one. If multiple labels are relevant to a given document, individual confidence scores can be relatively low.
Rosette supports both single-label and multi-label categorization. By default, the endpoint is set to multi-label and returns all relevant category labels whose raw score is above the scoreThreshold. The default threshold is -0.25; you can override it by setting scoreThreshold to any value. In general, a negative raw score for a category indicates the content probably doesn't fall under that category.
In single-label mode, Rosette returns the category with the highest confidence score. Both the raw score (score) and the confidence score (confidence) are returned.
Options
Example
"options": {"singleLabel": true,
"scoreThreshold": -.20}
Recommendation
To establish expectations and set an appropriate baseline threshold, run a series of tests with your data and see what kind of values the model returns. Don't assume there is an absolute confidence level that indicates a correct result.
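Multi-label filtering on the raw score, as the endpoint does with scoreThreshold, can be sketched as below. The three categories shown (with the remaining labels omitted, which is why the confidences here do not sum to one) are hypothetical values for illustration.

```python
# Hypothetical, truncated categorization response; remaining category
# labels are omitted, so the confidences shown do not sum to 1.
response = {
    "categories": [
        {"label": "SCIENCE",      "score": 0.48,  "confidence": 0.41},
        {"label": "EDUCATION",    "score": -0.10, "confidence": 0.22},
        {"label": "ARTS_CULTURE", "score": -0.60, "confidence": 0.09},
    ]
}

def multi_label(response, score_threshold=-0.25):
    # Keep labels whose raw score exceeds the threshold, mirroring the
    # endpoint's default multi-label behavior (default threshold -0.25).
    return [c["label"] for c in response["categories"]
            if c["score"] > score_threshold]

print(multi_label(response))       # ['SCIENCE', 'EDUCATION']
print(multi_label(response, 0.0))  # ['SCIENCE'] -- a stricter threshold
```

Note that EDUCATION survives the default threshold despite its negative raw score; whether that is desirable for your application is exactly what threshold testing on your own data should tell you.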
Sentiment Analysis
The Sentiment Analysis endpoint is a specialized case of the Categorization endpoint, with only three categories based on the subjective attitude of the content: positive (pos), negative (neg), or neutral (neu).
Scores Returned
Sentiment Analysis returns a sentiment label for each unique entity in the document, along with its confidence score. A label for the entire document is returned as well.
label: The sentiment: pos, neg, or neu.
confidence: A value between 0 and 1. The sum of all the confidence scores for a given document or entity = 1.
As with the Categorization endpoint, the confidence scores over all possible sentiment labels for a given document always sum to one. With only three possible labels, however, confidence scores will appear higher and more meaningful.
Confidence scores are returned when using either the standard or deep neural network (DNN) based sentiment model. The DNN model may return different scores than the standard model.
Options
Options Example
"options": {"modelType": "dnn"}
Relationship Extraction
The Relationship Extraction endpoint identifies specific connections between individual entities mentioned in an input text.
Scores Returned
While the endpoint identifies a number of relationship types (predicates), confidence scores are only returned for two types:
citizen of (CIT-OF)
educated at (EDU-AT)
For each relationship with one of the above predicates, a confidence score is returned.
Options
This endpoint does not support specifying a threshold.
Name Translation
The Name Translation endpoint translates or transliterates the name of an entity between English and another language.
Scores Returned
Different languages and names can return very different results, even if the translations are equally "good", so you can't use the match score to compare the quality of translations across languages or names. Within a single comparison, however, the higher the number returned, the better the translation.
Options
You can improve the accuracy of the match score by providing the entity type (PERSON, LOCATION, or ORGANIZATION) and the three-letter language code for the name.
Recommendation
Run a set of names through the Name Translation endpoint to see the values returned for your data. You may be able to determine a minimum value for acceptable translations.
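Once such a minimum value has been established, screening results is straightforward. The names, translations, and scores below are hypothetical, and the 0.5 cutoff is a placeholder for whatever minimum your own evaluation produces.

```python
# Hypothetical name-translation results with match scores.
translations = [
    {"name": "Мария Шарапова", "translation": "Mariya Sharapova", "score": 0.84},
    {"name": "北京",           "translation": "Beijing",          "score": 0.97},
    {"name": "xyzzy",          "translation": "Unknown",          "score": 0.12},
]

MIN_ACCEPTABLE = 0.5  # placeholder: derive this value from your own data

# Keep only translations meeting the minimum acceptable score.
accepted = [t for t in translations if t["score"] >= MIN_ACCEPTABLE]
print([t["translation"] for t in accepted])  # ['Mariya Sharapova', 'Beijing']
```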
Name Similarity
The Name Similarity endpoint compares two entity names (Person, Location, or Organization) and returns a score between 0 and 1 representing their degree of similarity.
Scores Returned
Options
You can improve the accuracy of the match score by providing the entity type (PERSON, LOCATION, or ORGANIZATION) and the three-letter language code for the name.
Recommendation
To improve the score returned, provide the language and the entity type. Run a set of names through the Name Similarity endpoint to see the values returned for your data. You may be able to determine a minimum value for acceptable matches.
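One way to turn that test run into a concrete threshold is to label a small sample of name pairs as match/non-match and score each candidate threshold by how many pairs it classifies correctly. The scores and labels below are hypothetical; a real evaluation would use many more pairs.

```python
# Hypothetical (score, is_match) pairs from a labeled sample of name
# comparisons, e.g. "John Smith" vs "Jon Smith" labeled as a true match.
labeled_pairs = [
    (0.93, True),
    (0.81, True),
    (0.74, False),
    (0.35, False),
]

def accuracy(threshold):
    # Classify each pair as a match when its score reaches the threshold,
    # then count agreement with the human labels.
    correct = sum((score >= threshold) == is_match
                  for score, is_match in labeled_pairs)
    return correct / len(labeled_pairs)

# Try each observed score as a candidate threshold; keep the best.
best = max({s for s, _ in labeled_pairs}, key=accuracy)
print(best, accuracy(best))  # 0.81 separates this sample perfectly
```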
Name Deduplication
The Name Deduplication endpoint separates a list of names into clusters, where each cluster contains names that are potential duplicates. A threshold parameter determines how much variation between names is permitted in any given cluster. The threshold uses the confidence (match) scores calculated when any two names in the list are compared.
Scores Returned
Options
All pairs of names drawn from the same cluster will have match scores above the specified threshold.
Recommendation
To establish expectations and set an appropriate threshold, run a series of tests with your data and see what kind of clusters the model returns. If the names in the cluster are not similar enough, increase the threshold. If similar names are not being identified in the same cluster, decrease the threshold.
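The effect of the threshold on cluster membership can be illustrated with a simple single-link clustering sketch. The names and pairwise scores below are hypothetical, and this union-find toy is only a stand-in for the endpoint's actual clustering; the point is that raising the threshold breaks clusters apart while lowering it merges them.

```python
# Hypothetical names and pairwise match scores.
names = ["John Smith", "Jon Smith", "Jonathan Smith", "Jane Smythe"]
scores = {
    ("John Smith", "Jon Smith"):       0.92,
    ("John Smith", "Jonathan Smith"):  0.78,
    ("Jon Smith", "Jonathan Smith"):   0.80,
    ("John Smith", "Jane Smythe"):     0.31,
    ("Jon Smith", "Jane Smythe"):      0.30,
    ("Jonathan Smith", "Jane Smythe"): 0.28,
}

def clusters(names, scores, threshold):
    # Union-find: names joined by any pair scoring above the threshold
    # end up in the same cluster (single-link behavior).
    parent = {n: n for n in names}
    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n
    for (a, b), s in scores.items():
        if s > threshold:
            parent[find(a)] = find(b)  # merge the two clusters
    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values(), key=len, reverse=True)

print(clusters(names, scores, 0.5))   # one Smith cluster, Jane Smythe alone
print(clusters(names, scores, 0.85))  # stricter: only John/Jon merge
```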