RNI provides a Java API for matching names across the boundaries of writing scripts. For the complete list of the languages and writing scripts that name matching supports, see Supported Text Domains for and Name Matching.
In the RNI context, name matching means comparing two names, performing linguistic analysis, and returning a score (a double greater than zero and less than or equal to one) that indicates how similar the two names are. A value of 1.0 is returned if and only if the two names are identical (the strings, languages, languages of origin, and entity types match). A score of less than 1.0 is returned for names that potentially match but exhibit one or more name variations.
Names are complex to match because of the large number of variations that occur within a language and across languages. RNI breaks a name into tokens and compares the matching tokens. RNI can identify variations between matching tokens including, but not limited to, typographical errors, phonetic spelling variations, transliteration differences, initials, and nicknames.
RNI scores range from 0 to 1. The higher the score, the greater the confidence that this is a relevant match. A score of 1.0 indicates that the query name string and result name string are identical (including all name properties).
The match score is a relative indication of how similar the match is; it is not an absolute value. When comparing different name matches, the relative values of the scores are more relevant than the actual score. Similar name matches in different languages may generate different match scores. To understand how RNI calculates the score, see Understanding Name Match Scores.
Scores less than 1.0 for similar names indicate the query name and index name vary with respect to one or more properties (such as language of origin) and/or one or more of the following:
Scoring is commutative: the scores for two given names are always the same, regardless of which name is in the index and which name is in the query.
You can configure RNI to customize how it scores different match phenomena.
The score weighting associated with a token may vary depending on the token's characteristics, such as the frequency with which it appears in the language model (the more frequent, the lower the weighting).
For a Java sample that illustrates the handling of these and other phenomena, see MatchPhenomenaSample.
The entityType field identifies the type of name being matched and is used to select the algorithms to use for matching. Where supported, stop words and override files are specific to an entity type. Parameters can be set for specific languages and entity types.
Important
The entityType should always be specified to utilize all available methods when indexing and matching names. If you don't specify an entityType, the type PERSON will be used.
Table 3. Entity Types
| Type | Description | Features |
| --- | --- | --- |
| PERSON | A human identified by name, nickname, or alias. | Values are tokenized and token pairs are compared. Stop words, overrides, frequency and gender models are supported. |
| LOCATION | A city, state, country, region or other location. | Values are tokenized and token pairs are compared. Stop words, overrides, and frequency models are supported. |
| ORGANIZATION | A corporation, institution, government agency, or other group of people defined by an established organizational structure. | Values are tokenized and token pairs are compared. Stop words, overrides, frequency models, and embeddings are supported. Real World Ids are supported. |
| IDENTIFIER, IDENTIFIER:DRIVERS_LICENSE, IDENTIFIER:LICENSE_PLATE, IDENTIFIER:NATIONAL_ID_NUM | An alphanumeric identifier. | Values are not tokenized. The entire identifier is treated as a string. Scoring is primarily by string edit distance. |
By using a string array (such as String[] nameData = {"John", "Smith"};), you can create a name with data fields. The maximum number of data fields is 5. We assign no explicit semantics to each field (such as given name or surname), but the order of the fields does matter when comparing two names that have fields. RNI assigns lower scores to matches that cross field boundaries (e.g., the first field in one name matches the second field in another name). The use of fields may enhance accuracy when you are performing queries and matches with PERSON names in languages where standard name ordering is not the norm. By dictating a consistent name ordering, you can avoid penalties for mis-ordered tokens.
For consistency, you may want to adopt a paradigm for name fields, such as {title, given names, surname, suffix}. Include empty fields in the appropriate position for names that do not contain all these elements. If a trailing field is empty, you can leave it out. For example:
{"Mr", "John Miles", "Doe", "Jr"}
{"Queen", "Elizabeth", "", "II"}
{"Mr", "Anthony Charles", "Blair"}
{"Ms", "Rosanne Christine", "Atwood"}
{"", "Martin Luther", "King", "Jr"}
Note
When scoring a potential match between a name with data fields and a name without data fields, RNI treats the name without data fields as if it were a name with one data field.
RNI treats trailing empty fields as if they were not present. For example, {"Rosanne", "Taylor Smith",""} is treated the same as {"Rosanne", "Taylor Smith"}.
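The field conventions above can be illustrated with a small sketch. It does not use any RNI classes: NameFieldsSketch and dropTrailingEmpty are hypothetical names introduced here only to show how trailing empty fields can be ignored while interior empty fields keep their position, under the five-field limit stated earlier.

```java
import java.util.Arrays;

final class NameFieldsSketch {
    // Limit taken from the text: a name may have at most 5 data fields.
    static final int MAX_FIELDS = 5;

    /**
     * Hypothetical helper: returns a copy of the fields without trailing
     * empty entries. Interior empty fields are kept so positions stay aligned.
     */
    static String[] dropTrailingEmpty(String[] fields) {
        if (fields.length > MAX_FIELDS) {
            throw new IllegalArgumentException("at most " + MAX_FIELDS + " fields are allowed");
        }
        int end = fields.length;
        while (end > 0 && fields[end - 1].isEmpty()) {
            end--;
        }
        return Arrays.copyOf(fields, end);
    }

    public static void main(String[] args) {
        String[] a = dropTrailingEmpty(new String[] {"Rosanne", "Taylor Smith", ""});
        String[] b = dropTrailingEmpty(new String[] {"Rosanne", "Taylor Smith"});
        System.out.println(Arrays.equals(a, b)); // true

        // The empty surname slot is interior, so it is preserved.
        String[] queen = dropTrailingEmpty(new String[] {"Queen", "Elizabeth", "", "II"});
        System.out.println(queen.length); // 4
    }
}
```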
Alternatively, you have the option of specifying that there is an unknown value in a field. To specify an unknown name field, replace the field with Name.UNKNOWN_FIELD_MARKER.
Understanding Name Match Scores
To fully understand the name match scores, you need to understand how the match scores are determined. The match score is a value between 0.0 and 1.0; the higher the score, the stronger the match. The score is a relative indication of how similar the two names are; it is not an absolute value. When comparing different name matches, the relative values of the match scores are more relevant than the actual score. Similar name matches in different languages may generate different match scores.
A value of 1.0 is returned if and only if the two names are identical. Character strings, languages, languages of origin, and entity types must all match for the two names to be considered identical.
Calculating the match score is a complex process that involves multiple steps and algorithms.
- Identify and normalize the tokens in each name. Each name will usually have multiple tokens.
- Compare each token from name 1 with each token from name 2, calculating the score for every token pair.
- Once all the token pairs have been scored, the best combination of tokens is selected to maximize the complete score.
- Score unmatched tokens as deletions or conflicts.
- Compute a weighted average score.
- Adjust the final score. For example, the score is decreased if the gender of the two names does not appear to match.
Tokenize, Normalize, and Transliterate
Before any matching algorithms can be run, the names have to be transformed into tokens that can be compared. This step, as with many of the steps in name matching, often has language-specific components.
This step includes:
- Removing stop words, such as Mr. or Senator or General. Stop words are language-dependent.
- Transliterating into English and/or translating if necessary, including:
  - Adding vocalizations, or vowels in the correct location for languages such as Arabic which are often written in an unvocalized form.
  - Adding spaces (segmentation) to languages such as Chinese, Japanese, Korean, and Thai that don't use spaces, separating given and surname tokens.
- Normalization, including removing diacritical marks, to get a canonical representation of the token.
The resulting token output enhances search accuracy and increases relevancy.
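As a rough illustration of the stop-word removal and diacritic normalization described above, the following sketch uses only the Java standard library. It is not RNI's implementation: the class name, the tiny stop-word set, and the lowercasing step are assumptions made for the example, and transliteration, vocalization, and segmentation are not covered.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TokenNormalizationSketch {
    // Hypothetical English stop words; real stop-word lists are language-dependent.
    static final Set<String> STOP_WORDS = Set.of("mr", "senator", "general");

    /** Decomposes accented characters (NFD) and removes the combining marks. */
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    /** Splits on whitespace, normalizes each token, and drops stop words. */
    static List<String> tokenize(String name) {
        List<String> tokens = new ArrayList<>();
        for (String raw : name.trim().split("\\s+")) {
            String token = stripDiacritics(raw)
                    .toLowerCase(Locale.ROOT)
                    .replaceAll("\\p{Punct}+$", ""); // drop trailing punctuation such as "Mr."
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Mr. José Núñez" written with Unicode escapes to avoid source-encoding issues.
        System.out.println(tokenize("Mr. Jos\u00e9 N\u00fa\u00f1ez")); // [jose, nunez]
    }
}
```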
Calculate Scores for Token Pairs
Every token in Name 1 is matched and scored against every token in Name 2 to find the matching pairs that will result in the highest total score for the pair. All candidate token pairs are scored to determine the best match alignment.
Token scorers are modular, allowing different scorers to be chosen for each token pair. The choice of scorer applied will depend on the type of tokens, the type of matches, and the languages of the names.
There are multiple types of scorers used, including:
Each matcher returns a ts score for the pair.
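The idea of scoring every candidate token pair and then choosing the best alignment can be sketched as follows. This is only an illustration under stated assumptions: the stand-in token scorer is a normalized edit-distance similarity, not one of RNI's scorers, and the exhaustive search over one-to-one pairings is a simple substitute for whatever selection method RNI actually uses. All class and method names are hypothetical.

```java
import java.util.Arrays;

final class TokenAlignmentSketch {
    /** Stand-in token scorer: 1 minus the normalized Levenshtein distance. */
    static double tokenScore(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / max;
    }

    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(sub, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Scores every token of name 1 against every token of name 2. */
    static double[][] scoreMatrix(String[] name1, String[] name2) {
        double[][] scores = new double[name1.length][name2.length];
        for (int i = 0; i < name1.length; i++)
            for (int j = 0; j < name2.length; j++)
                scores[i][j] = tokenScore(name1[i], name2[j]);
        return scores;
    }

    /**
     * Returns, for each token of name 1, the index of its partner in name 2
     * (or -1 if unpaired) in the one-to-one pairing with the highest total.
     * Exhaustive search; fine for short names (fewer than 32 tokens in name 2).
     */
    static int[] bestAlignment(double[][] scores) {
        int[] current = new int[scores.length];
        int[] best = new int[scores.length];
        Arrays.fill(current, -1);
        Arrays.fill(best, -1);
        search(scores, 0, 0, 0.0, current, best, new double[] {-1.0});
        return best;
    }

    private static void search(double[][] scores, int row, int usedMask, double total,
                               int[] current, int[] best, double[] bestTotal) {
        if (row == scores.length) {
            if (total > bestTotal[0]) {
                bestTotal[0] = total;
                System.arraycopy(current, 0, best, 0, current.length);
            }
            return;
        }
        current[row] = -1; // option 1: leave this token unpaired
        search(scores, row + 1, usedMask, total, current, best, bestTotal);
        for (int col = 0; col < scores[row].length; col++) { // option 2: pair with a free token
            if ((usedMask & (1 << col)) == 0) {
                current[row] = col;
                search(scores, row + 1, usedMask | (1 << col), total + scores[row][col],
                        current, best, bestTotal);
            }
        }
        current[row] = -1;
    }

    public static void main(String[] args) {
        double[][] scores = scoreMatrix(new String[] {"john", "smith"}, new String[] {"smyth", "jon"});
        System.out.println(Arrays.toString(bestAlignment(scores))); // [1, 0]
    }
}
```

In the usage example, "john" pairs with "jon" and "smith" with "smyth" even though the tokens appear in a different order in the second name.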
Score Deletions and Conflicts
To be considered a match, token pairs must score higher than both the conflictThreshold and estimatedConflictOrDeletion parameters. Tokens not part of a satisfactory pair in this regard will be considered either conflicts or deletions.
A conflict is a score given to a token pair which context suggests should have matched, but whose score was not high enough to be considered a match. For example, when “Johann Sebastian Bach” is matched against “Johann Ambrosius Bach”, Sebastian is in conflict with Ambrosius. The token pair is considered a conflict.
A deletion is a score given to individual tokens that are not part of successful match pairs or conflict pairs.
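The distinction between matches, conflicts, and deletions can be made concrete with a toy classifier. Everything specific in it is an assumption for illustration: the threshold values are invented, the stand-in scorer only checks case-insensitive equality, and the rule used to tell conflicts from deletions (leftover tokens on both sides are paired as conflicts, tokens with no counterpart are deletions) is a simplification of the contextual decision described above, not RNI's logic.

```java
import java.util.ArrayList;
import java.util.List;

final class ConflictDeletionSketch {
    // Hypothetical values; the real parameters are configurable in RNI.
    static final double CONFLICT_THRESHOLD = 0.5;
    static final double ESTIMATED_CONFLICT_OR_DELETION = 0.4;

    /** Stand-in token scorer: 1.0 for equal tokens (ignoring case), else 0.0. */
    static double tokenScore(String a, String b) {
        return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
    }

    /** A pair counts as a match only if it beats both thresholds. */
    static boolean isMatch(double score) {
        return score > CONFLICT_THRESHOLD && score > ESTIMATED_CONFLICT_OR_DELETION;
    }

    static List<String> classify(List<String> name1, List<String> name2) {
        List<String> left = new ArrayList<>(name1);   // unmatched tokens of name 1
        List<String> right = new ArrayList<>(name2);  // unmatched tokens of name 2
        List<String> out = new ArrayList<>();
        for (String t1 : name1) {
            for (String t2 : right) {
                if (isMatch(tokenScore(t1, t2))) {
                    out.add("MATCH " + t1 + "/" + t2);
                    left.remove(t1);
                    right.remove(t2);
                    break; // stop iterating right after modifying it
                }
            }
        }
        // Assumed rule: leftovers on both sides are treated as conflicting pairs.
        int conflicts = Math.min(left.size(), right.size());
        for (int i = 0; i < conflicts; i++) {
            out.add("CONFLICT " + left.get(i) + "/" + right.get(i));
        }
        // Assumed rule: any remaining token has no counterpart and is a deletion.
        for (int i = conflicts; i < left.size(); i++) out.add("DELETION " + left.get(i));
        for (int i = conflicts; i < right.size(); i++) out.add("DELETION " + right.get(i));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(classify(
                List.of("Johann", "Sebastian", "Bach"),
                List.of("Johann", "Ambrosius", "Bach")));
        // [MATCH Johann/Johann, MATCH Bach/Bach, CONFLICT Sebastian/Ambrosius]
        System.out.println(classify(
                List.of("Johann", "Bach"),
                List.of("Johann", "Sebastian", "Bach")));
        // [MATCH Johann/Johann, MATCH Bach/Bach, DELETION Sebastian]
    }
}
```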
There are two types of deletions:
Calculate the Weighted Score
The token pair scorer returns 2 values, a ts score and a cs score. The ts is the score of how well the tokens match, while the cs score includes the placement of the token in the score calculation. The cs will be lower if the tokens match but are out of order.
Each token has a weight. The weightings determine how important the token pair match is in calculating the final score. Full names are rated higher than initials, and unusual tokens get a higher weighting than more common names because it is more significant when they match. For example, Andrew is less common than John, so it gets a higher weighting. These weightings are used in calculating the final score, which is a weighted average of the cs scores.
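The weighted average itself is simple arithmetic: each cs score is multiplied by its token weight, and the sum is divided by the total weight. The sketch below shows that calculation only; the class name, the sample cs scores, and the sample weights are invented for illustration and do not reflect RNI's actual weights.

```java
final class WeightedScoreSketch {
    /** Returns sum(weight[i] * cs[i]) / sum(weight[i]). */
    static double weightedAverage(double[] csScores, double[] weights) {
        if (csScores.length == 0 || csScores.length != weights.length) {
            throw new IllegalArgumentException("need one weight per cs score");
        }
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < csScores.length; i++) {
            weightedSum += weights[i] * csScores[i];
            totalWeight += weights[i];
        }
        if (totalWeight <= 0.0) {
            throw new IllegalArgumentException("total weight must be positive");
        }
        return weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        // Hypothetical example: a rare token matched perfectly (cs 1.0, weight 3.0)
        // and a common token matched weakly (cs 0.5, weight 1.0).
        double score = weightedAverage(new double[] {1.0, 0.5}, new double[] {3.0, 1.0});
        System.out.println(score); // 0.875
    }
}
```

Because the rarer token carries three times the weight, the result (0.875) sits much closer to its cs score than a plain average (0.75) would.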
Tip
We apply specific cultural knowledge to the weighted scoring; for example in Spanish we handle parsing of matrilineal vs patrilineal surnames, with the onomastic understanding that the patrilineal name is likely to be treated as the person's primary surname.
At the end, all selected tokens and token-pairs are considered together and some final adjustments are made to the score. Examples of adjustments include:
-
A penalty is applied if the two names do not appear to have the same gender. This, of course, is language-dependent.
-
Sometimes there is a weighting penalty applied early in the process, which when considered at the end, in the full context of the match, is not as significant as initially determined. These values may be adjusted in the final score.
Score adjustment examples
This table lists a few of the adjustments that may be applied to a name match.
Name Matching Usage Model
Identify two names to compare. They may be in different languages (languages of use) and writing scripts.
Use MatchScorer to score the similarity of two Name objects. MatchScorer and Name are in the com.basistech.rnm package.
https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/match_2names.java
For the Arabic name نايف أبو شرخ and its IC transliteration Nayif Abu-Sharakh, this comparison returns a score of 0.99.
If you want to compare one name to many names, for improved efficiency you can cache the scorer with the one name (the query name) and use the cached scorer to compare that name to multiple names. As illustrated in the following code snippet, you must prepare each name that you use with the cached scorer.
https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/match_1name_tomany.java
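For readers who want to see the general one-to-many pattern without the RNI classes, here is a self-contained sketch. It does not call MatchScorer or Name, and it is not the linked sample: CachedQueryScorer, prepare, and the token-overlap scoring are hypothetical stand-ins chosen only to show the shape of the workflow (prepare the query once, keep the scorer, prepare each candidate, score it).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class OneToManySketch {
    /** Hypothetical preparation step: lowercase and split into a token set. */
    static Set<String> prepare(String name) {
        return new HashSet<>(Arrays.asList(
                name.trim().toLowerCase(Locale.ROOT).split("\\s+")));
    }

    /** Hypothetical scorer that holds on to the prepared query name. */
    static final class CachedQueryScorer {
        private final Set<String> query;

        CachedQueryScorer(Set<String> preparedQuery) {
            this.query = preparedQuery;
        }

        /** Toy score: shared tokens divided by all distinct tokens. */
        double score(Set<String> preparedCandidate) {
            Set<String> common = new HashSet<>(query);
            common.retainAll(preparedCandidate);
            Set<String> all = new HashSet<>(query);
            all.addAll(preparedCandidate);
            return all.isEmpty() ? 0.0 : (double) common.size() / all.size();
        }
    }

    public static void main(String[] args) {
        // The query is prepared once and reused for every candidate.
        CachedQueryScorer scorer = new CachedQueryScorer(prepare("John Smith"));
        for (String candidate : new String[] {"john  smith", "John Smyth", "Jane Doe"}) {
            // Each candidate is prepared before it is scored.
            System.out.println(candidate + " -> " + scorer.score(prepare(candidate)));
        }
    }
}
```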
For a sample Java application that matches two names and matches a query name against multiple reference names, see MatchNamesSample.