To interpret name match scores, you need to understand how they are determined. The match score is a value between 0.0 and 1.0; the higher the score, the stronger the match. The score is a relative indication of how similar the two names are; it is not an absolute value. When comparing different name matches, the relative values of the match scores are more relevant than the actual scores. Similar name matches in different languages may generate different match scores.
A value of 1.0 is returned if and only if the two names are identical. Character strings, languages, languages of origin, and entity types must all match for the two names to be considered identical.
Calculating the match score is a complex process that involves multiple steps and algorithms.
- Identify and normalize the tokens in each name. Each name will usually have multiple tokens.
- Compare each token from name 1 with each token from name 2, calculating the score for every token pair.
- Once all the token pairs have been scored, select the combination of token pairs that maximizes the overall score.
- Score unmatched tokens as deletions or conflicts.
- Compute a weighted average score.
- Adjust the final score. For example, the score is decreased if the gender of the two names does not appear to match.
The Pairwise Match REST endpoint performs a pairwise match between two names and returns detailed information about how the match scores were determined. You can use Parameter Universe and the Parameters REST endpoint to modify parameters, and use the information returned by the pairwise match to determine the optimal parameter set for your use case.
Tokenize, Normalize, and Transliterate
Before any matching algorithms can be run, the names have to be transformed into tokens that can be compared. This step, as with many of the steps in name matching, often has language-specific components.
This step includes:
- Removing stop words, such as Mr. or Senator or General. Stop words are language-dependent.
- Transliterating into English and/or translating if necessary, including:
  - Adding vocalizations, or vowels in the correct location, for languages such as Arabic, which are often written in an unvocalized form.
  - Adding spaces (segmentation) to languages such as Chinese, Japanese, Korean, and Thai that don't use spaces, separating given and surname tokens.
- Normalization, including removing diacritical marks, to get a canonical representation of the token.
The resulting token output enhances search accuracy and increases relevancy.
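As a rough illustration of the stop-word removal and diacritic normalization described above, here is a minimal Python sketch. It is not the product's implementation: the stop-word list and the function names are hypothetical, and transliteration, vocalization, and segmentation are left out because they require language-specific resources.

```python
import unicodedata

# Hypothetical stop-word list for illustration; real lists are language-dependent.
STOP_WORDS = {"mr", "mrs", "dr", "senator", "general"}

def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def tokenize_name(name: str) -> list[str]:
    """Split on whitespace, normalize each token, and drop stop words."""
    tokens = []
    for raw in name.split():
        # Remove diacritics, lowercase, and trim surrounding punctuation.
        token = strip_diacritics(raw).casefold().strip(".,")
        if token and token not in STOP_WORDS:
            tokens.append(token)
    return tokens

print(tokenize_name("Mr. José Martínez"))  # ['jose', 'martinez']
```

Normalizing both names the same way before comparison means that differences such as case, titles, and accents do not lower the token scores.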
Calculate Scores for Token Pairs
Once the base query is completed, the rescorer selects the names to send to the pairwise matcher. The pairwise matcher takes the query name (Name 1) and performs a pairwise match against each candidate document passed on (Name 2).
Every token in Name 1 is matched and scored against every token in Name 2 to find the matching pairs that will result in the highest total score for the pair. All candidate token pairs are scored to determine the best match alignment.
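The all-pairs comparison and alignment can be sketched as follows. This is an illustrative assumption, not the actual matcher: `token_score` uses Python's `difflib` as a stand-in similarity measure, and `best_alignment` is a hypothetical helper that tries every one-to-one pairing by brute force, which is practical only because names have few tokens.

```python
from difflib import SequenceMatcher
from itertools import permutations

def token_score(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not the product's token scorer."""
    return SequenceMatcher(None, a, b).ratio()

def best_alignment(tokens1, tokens2):
    """Score every token pair, then pick the one-to-one pairing with the highest total."""
    scores = [[token_score(a, b) for b in tokens2] for a in tokens1]
    n1, n2 = len(tokens1), len(tokens2)
    # Assign the tokens of the shorter name to distinct tokens of the longer one.
    if n1 <= n2:
        candidates = ([(i, j) for i, j in enumerate(perm)]
                      for perm in permutations(range(n2), n1))
    else:
        candidates = ([(i, j) for j, i in enumerate(perm)]
                      for perm in permutations(range(n1), n2))
    best_total, best_pairs = -1.0, []
    for pairs in candidates:
        total = sum(scores[i][j] for i, j in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return [(tokens1[i], tokens2[j], scores[i][j]) for i, j in best_pairs]

for t1, t2, score in best_alignment(["john", "smith"], ["smith", "jon"]):
    print(t1, t2, round(score, 3))
```

In this example the out-of-order tokens are still paired correctly ("john" with "jon", "smith" with "smith"), because the pairing is chosen by total score rather than by position.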
Token scorers are modular, allowing different scorers to be chosen for each token pair. The choice of scorer applied will depend on the type of tokens, the type of matches, and the languages of the names.
There are multiple types of scorers used, including:
Each matcher returns a score (ts) for the pair.
Score Deletions and Conflicts
To be considered a match, a token pair must score higher than both the conflictThreshold and estimatedConflictOrDeletion parameters. Tokens that are not part of a pair meeting both thresholds are treated as either conflicts or deletions.
A conflict is a score given to a token pair that context suggests should have matched, but whose score was not high enough to be considered a match. For example, when “Johann Sebastian Bach” is matched against “Johann Ambrosius Bach”, Sebastian is in conflict with Ambrosius.
A deletion is a score given to individual tokens that are not part of successful match pairs or conflict pairs.
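The distinction between matches, conflicts, and deletions can be sketched as below. This is a simplified assumption about the logic: the threshold values are invented for illustration, `classify` is a hypothetical helper, and "context suggests the pair should have matched" is reduced to "the pair was selected by the alignment step".

```python
# Hypothetical values, chosen only for illustration; the real parameters
# are conflictThreshold and estimatedConflictOrDeletion.
CONFLICT_THRESHOLD = 0.5
ESTIMATED_CONFLICT_OR_DELETION = 0.4

def classify(aligned_pairs, unpaired_tokens):
    """Split aligned pairs into matches and conflicts; unpaired tokens become deletions.

    aligned_pairs: list of (token1, token2, score) tuples.
    """
    # A pair must score higher than both parameters to count as a match.
    cutoff = max(CONFLICT_THRESHOLD, ESTIMATED_CONFLICT_OR_DELETION)
    matches = [p for p in aligned_pairs if p[2] > cutoff]
    conflicts = [p for p in aligned_pairs if p[2] <= cutoff]
    deletions = list(unpaired_tokens)
    return matches, conflicts, deletions

# The 0.2 score for the middle names is a made-up value.
pairs = [("johann", "johann", 1.0),
         ("sebastian", "ambrosius", 0.2),
         ("bach", "bach", 1.0)]
matches, conflicts, deletions = classify(pairs, unpaired_tokens=[])
print(conflicts)  # [('sebastian', 'ambrosius', 0.2)]
```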
There are two types of deletions:
Calculate the Weighted Score
The token pair scorer returns two values: a ts score and a cs score. The ts score indicates how well the tokens match, while the cs score also takes the placement of the tokens into account. The cs score will be lower if the tokens match but are out of order.
Each token has a weight. The weights determine how important each token pair match is in calculating the final score. Full names are weighted higher than initials, and unusual tokens are weighted higher than common ones because a match on them is more significant. For example, Andrew is less common than John, so it gets a higher weight. These weights are used in calculating the final score, which is a weighted average of the cs scores.
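A weighted average of this kind can be sketched in a few lines of Python. The weights and cs scores below are invented, and `weighted_score` is a hypothetical helper; how the product derives token weights is not shown here.

```python
def weighted_score(pairs):
    """Return the weighted average of cs scores.

    pairs: list of (cs_score, weight) tuples, one per token pair.
    """
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        return 0.0
    return sum(cs * weight for cs, weight in pairs) / total_weight

# Made-up numbers: the rarer token gets weight 2.0, the common one 1.0.
print(weighted_score([(0.9, 2.0), (0.6, 1.0)]))  # about 0.8
```

Because the rarer token carries twice the weight, the result sits closer to its cs score (0.9) than a plain average of the two scores (0.75) would.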
Tip
We apply specific cultural knowledge to the weighted scoring. For example, in Spanish we handle parsing of matrilineal vs. patrilineal surnames, with the onomastic understanding that the patrilineal name is likely to be treated as the person's primary surname.
At the end, all selected tokens and token-pairs are considered together and some final adjustments are made to the score. Examples of adjustments include:
- A penalty is applied if the two names do not appear to have the same gender. This, of course, is language-dependent.
- A weighting penalty applied early in the process may turn out, in the full context of the match, to be less significant than initially determined. Such penalties may be adjusted in the final score.
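The gender adjustment above could look something like the following sketch. The penalty size, the subtractive form, and the `adjust_for_gender` helper are all assumptions made for illustration; the actual adjustment may be computed differently.

```python
def adjust_for_gender(score, gender1, gender2, penalty=0.1):
    """Lower the score when both genders are known and differ.

    `penalty` is a hypothetical value; unknown genders (None) incur no penalty.
    """
    if gender1 and gender2 and gender1 != gender2:
        score -= penalty
    # Keep the final score within the documented 0.0 to 1.0 range.
    return max(0.0, min(1.0, score))

print(adjust_for_gender(0.8, "f", None))  # 0.8
```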
Score adjustment examples
This table lists a few of the adjustments that may be applied to a name match.