To interpret name match scores, it helps to understand how they are determined. The match score is a value between 0.0 and 1.0; the higher the score, the stronger the match. The score is a relative indication of how similar the two names are; it is not an absolute value. When comparing different name matches, how the scores rank against each other is more relevant than their exact values. Similar name matches in different languages may generate different match scores.
A value of 1.0 is returned if and only if the two names are identical. Character strings, languages, languages of origin, and entity types must all match for the two names to be considered identical.
Calculating the match score involves multiple steps and algorithms:
Identify and normalize the tokens in each name. Each name will usually have multiple tokens.
Calculate the score for every token pair.
Pick the best matching pairs to maximize the final combined score.
Score remaining tokens as deletions or conflicts.
Compute a weighted average score.
Adjust the final score. For example, the score is decreased if the gender of the two names does not appear to match.
Tokenize, Normalize, and Transliterate
Before any matching algorithms can be run, the names have to be transformed into tokens that can be compared. This step, as with many of the steps in name matching, often has language-specific components.
This step includes:
Removing stop words, such as Mr., Senator, or General. Stop words are language-dependent.
Transliterating into English and/or translating if necessary, including:
Adding vocalizations, or vowels, in the correct location for languages such as Arabic and Thai, whose writing systems often leave vowels unwritten.
Adding spaces (segmentation) to languages such as Chinese, Japanese, Korean, and Thai that don't use spaces, separating given and surname tokens.
Normalization, including removing diacritical marks, to get a canonical representation of the token.
The resulting token output enhances search accuracy and increases relevancy.
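The normalization part of this step can be illustrated with a minimal sketch. It covers only stop-word removal and diacritic stripping for Latin-script names; transliteration, vocalization, and segmentation are not shown. The function names and the stop-word list are hypothetical and chosen for illustration; they are not the product's actual implementation.

```python
import unicodedata

# Hypothetical stop-word list for illustration; real lists are language-dependent.
STOP_WORDS = {"mr", "mrs", "dr", "senator", "general"}

def strip_diacritics(text):
    """Decompose characters and drop combining marks (for example, 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def tokenize_name(name):
    """Split on whitespace, normalize each token, and drop stop words."""
    tokens = []
    for raw in name.split():
        # Canonical form: no diacritics, case-folded, no trailing punctuation.
        token = strip_diacritics(raw).casefold().strip(".,")
        if token and token not in STOP_WORDS:
            tokens.append(token)
    return tokens

print(tokenize_name("Senator José Martínez"))  # ['jose', 'martinez']
```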
Calculate Scores for Token Pairs
Every token in Name 1 is scored against every token in Name 2. Scoring all candidate token pairs makes it possible to find the alignment of matching pairs that results in the highest total score for the two names.
Token scorers are modular, allowing different scorers to be chosen for each token pair. The choice of scorer applied will depend on the type of tokens, the type of matches, and the languages of the names.
Multiple types of scorers are used.
Each matcher returns a ts score for the pair.
Once all the token pairs have been scored, the best combination of tokens is selected to maximize the complete score. The chosen pairs are displayed along with their scores.
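The selection of the best combination can be sketched as an exhaustive search over one-to-one pairings. This is an illustration under stated assumptions, not the product's algorithm: `token_score` is a stand-in that uses plain string similarity from Python's `difflib`, and `best_alignment` is a hypothetical helper. Brute force is only practical for short names; an assignment algorithm would be the usual choice for longer token lists.

```python
from difflib import SequenceMatcher
from itertools import permutations

def token_score(a, b):
    """Stand-in token scorer (assumption): string similarity in [0.0, 1.0]."""
    return SequenceMatcher(None, a, b).ratio()

def best_alignment(tokens1, tokens2):
    """Try every one-to-one pairing and keep the one with the highest total score."""
    swapped = len(tokens1) > len(tokens2)
    short, long_ = (tokens2, tokens1) if swapped else (tokens1, tokens2)
    best_total, best_pairs = -1.0, []
    # Each permutation assigns every token of the shorter name to a distinct
    # token of the longer name; extra tokens in the longer name stay unmatched.
    for perm in permutations(range(len(long_)), len(short)):
        pairs = [(s, long_[j]) for s, j in zip(short, perm)]
        total = sum(token_score(a, b) for a, b in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    if swapped:
        # Restore (Name 1 token, Name 2 token) order in the result.
        best_pairs = [(b, a) for a, b in best_pairs]
    return best_pairs, best_total

pairs, total = best_alignment(["john", "smith"], ["smith", "john"])
print(pairs, total)  # [('john', 'john'), ('smith', 'smith')] 2.0
```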
Score Deletions and Conflicts
Once the token pair matches have been selected, there may be some tokens remaining in one or both names that don't have matches. Tokens that are left unmatched are assigned scores depending on whether they appear to be conflicts, in-order deletions, or out-of-order deletions. Whether an unmatched token is a conflict or some kind of deletion depends on the tokens in the immediate surroundings and how they were matched in the other name.
A conflict threshold sets the score below which a token pair is considered a conflict rather than a very unlikely match. A conflict exists if the two sequences of tokens are positioned in a way that suggests that they should have matched each other, but no satisfactory match was found. For example, when "Johann Sebastian Bach" is matched against "Johann Ambrosius Bach", Sebastian is in conflict with Ambrosius.
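The positional idea behind conflicts and deletions can be sketched with a simplified rule: unmatched tokens that sit in the same gap between matched pairs in both names are treated as a conflict, while unmatched tokens with nothing opposite them are treated as deletions. This is an assumption made for illustration. The function `classify_unmatched` is hypothetical, it expects the matched pairs to appear in the same order in both names, and it does not model the conflict threshold or out-of-order deletions.

```python
def classify_unmatched(tokens1, tokens2, matches):
    """Classify unmatched tokens as conflicts or deletions.

    matches is a list of (i, j) index pairs for matched tokens, assumed to be
    in the same order in both names.
    """
    conflicts, deletions = [], []
    # Sentinels mark the gaps before the first and after the last matched pair.
    bounds = [(-1, -1)] + sorted(matches) + [(len(tokens1), len(tokens2))]
    for (i0, j0), (i1, j1) in zip(bounds, bounds[1:]):
        gap1 = tokens1[i0 + 1:i1]  # unmatched tokens of Name 1 in this gap
        gap2 = tokens2[j0 + 1:j1]  # unmatched tokens of Name 2 in this gap
        if gap1 and gap2:
            # Both names have leftovers in the same position: a conflict.
            conflicts.append((gap1, gap2))
        else:
            # Leftovers on one side only: a deletion.
            deletions.extend(gap1 + gap2)
    return conflicts, deletions

name1 = ["johann", "sebastian", "bach"]
name2 = ["johann", "ambrosius", "bach"]
print(classify_unmatched(name1, name2, [(0, 0), (2, 2)]))
# ([(['sebastian'], ['ambrosius'])], [])
```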
Calculate the Weighted Score
The token pair scorer returns 2 values, a ts score and a cs score. The ts score is the score of how well the tokens match, while the cs score includes the placement of the token in the score calculation. The cs score will be lower if the tokens match but are out of order. The cs scores are displayed on the match score computation table.
Under each token is a weight. The weightings determine how important the token pair match is in calculating the final score. Full names are rated higher than initials, and unusual tokens get a higher weighting than more common names because it is more significant when they match. For example, Andrew is less common than John, so it gets a higher weighting. These weightings are used in calculating the final score, which is a weighted average of the cs scores.
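As a minimal sketch of a weighted average of cs scores: each cs score is multiplied by its weight, and the sum is divided by the total weight. The function name and the weights in the example are hypothetical; the actual weights come from the product's own token statistics.

```python
def weighted_score(scored_pairs):
    """Return the weighted average of cs scores.

    scored_pairs is a list of (cs, weight) tuples:
    result = sum(cs * weight) / sum(weight)
    """
    total_weight = sum(weight for _, weight in scored_pairs)
    if total_weight == 0:
        return 0.0  # no weighted tokens to average
    return sum(cs * weight for cs, weight in scored_pairs) / total_weight

# A rarer token (weight 3) matches exactly; a common token (weight 1) matches poorly.
print(weighted_score([(1.0, 3), (0.5, 1)]))  # 0.875
```

Because the rarer token carries three times the weight, the exact match on it dominates the result.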
At the end, all selected tokens and token-pairs are considered together and some final adjustments are made to the score. Examples of adjustments include:
A penalty is applied if the two names do not appear to have the same gender. This, of course, is language-dependent.
Sometimes a weighting penalty applied early in the process turns out, when considered at the end in the full context of the match, to be less significant than initially determined. Such values may be adjusted in the final score.
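A gender adjustment of this kind could look like the following sketch. The multiplicative form, the penalty value, and the function name are all assumptions for illustration; the source does not state how large the penalty is or how it is applied.

```python
def adjust_final_score(score, gender1, gender2, penalty=0.1):
    """Reduce the score when both genders are known and differ.

    penalty is a hypothetical fraction by which the score is reduced.
    A gender of None means it could not be determined, so no penalty applies.
    """
    if gender1 and gender2 and gender1 != gender2:
        score *= 1.0 - penalty
    # Keep the result inside the documented 0.0 to 1.0 range.
    return max(0.0, min(1.0, score))

print(adjust_final_score(0.8, "f", "f"))   # unchanged: genders agree
print(adjust_final_score(0.8, "f", None))  # unchanged: one gender unknown
print(adjust_final_score(0.8, "f", "m"))   # reduced by the penalty
```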