Compare two values of the same field type and view the match score, along with details about how the match score was determined. The comparison can be between a search value and a matched value or two new terms.
To open the Compare tab:
To compare two values:
-
Enter two values to compare. The values might be the search value with a matched value, or two new values. These will be pre-filled for you when navigating from the search results list.
-
(Optional) If comparing Names, select the language of each value from the drop-down list. Adding the source languages may improve the match scores.
-
Select Compare.
The Match Score Computation displays the computation table matrix, a tabular representation of component scores, and, if selected, a textual explanation describing how the final match score was determined. For more details on how name matching works, see Understanding Name Match Scores.
For Name fields, the computation table matrix includes:
The tokens from Name 1 are listed down the first column, while the tokens from Name 2 are along the top row.
The shaded boxes highlight the token pairs selected during matching that produce the best score. A token pair is a token from Name 1 and its matching token from Name 2. Each shaded box contains a Match reason and a match value.
-
The top line indicates Match reason.
-
The match value takes into consideration the placement of the token in the score calculation. A penalty is applied if the tokens are out of order. When the tokens line up on the diagonal, they are all in order.
Under each token is a weight. The weightings determine how important the token pair match is in calculating the final score. For example, unusual tokens get a higher weighting than common names because it is more significant when they match, and initials are weighted less than full names.
For more detail about the match score, select Explanation.
You can modify the match parameters in real-time to see how each value impacts the final match score.
Rosette Match Studio and Rosette Name Indexer are tuned to perform well in a variety of name matching scenarios. However, every use case uses different data with distinct match requirements. In Rosette Match Studio, you can easily modify and tune a set of the most common parameters to improve the accuracy of matches.
To view and modify these settings:
-
Click on the settings icon in the upper right corner of the compare window.
-
Select the configuration values to use. There are two ways to set the values:
-
Select a defined match configuration from the drop-down menu. You will not be able to manually change the values.
-
Change the value or values of interest. You may have to scroll down to view the entire list of available parameters.
-
Select Compare.
Changes are applied to the pairwise match currently displayed; overall system search settings are not modified.
-
The default parameter values are based on the entity type (person, organization, or location) and the language.
-
Reset sets the values back to their default settings.
-
To save a set of parameter values, click on the copy URL icon in the upper right corner. The URL contains the parameter values.
This table lists the most common match reasons, but is not an exhaustive list of all possible match reasons.
Rosette Match Studio calculates the match score based on five different components:
-
Time: The number of days between Date 1 and Date 2.
-
Year: The difference of the year fields of Date 1 and Date 2.
-
Month: The difference of the month fields of Date 1 and Date 2.
-
Day: The difference of the day fields of Date 1 and Date 2. For example, 1 and 30 are far apart in value, even if they are close in time.
-
String: The string distance is calculated by converting Date 1 and Date 2 to a standard format. The score is based on the edit distance between the two strings.
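As a rough illustration of the raw quantities behind the five components, the sketch below computes each difference for one pair of dates. It is not Rosette Match Studio's implementation: the function names are hypothetical, the YYYYMMDD standard format is an assumption, and the conversion of these raw differences into similarity scores is not shown because it is not described here.

```python
from datetime import date


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def component_distances(d1: date, d2: date) -> dict:
    """Raw differences for the time, year, month, day, and string components."""
    fmt = "%Y%m%d"  # assumed standard format; the product's format may differ
    return {
        "time": abs((d1 - d2).days),
        "year": abs(d1.year - d2.year),
        "month": abs(d1.month - d2.month),
        "day": abs(d1.day - d2.day),
        "string": edit_distance(d1.strftime(fmt), d2.strftime(fmt)),
    }


distances = component_distances(date(1979, 12, 31), date(1980, 1, 1))
print(distances)
# {'time': 1, 'year': 1, 'month': 11, 'day': 30, 'string': 5}
```

The two dates are one day apart, yet the month, day, and string differences are large, which is why the components can disagree about the same pair.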
For more detail about the match score, select Explanation.
Rosette Match Studio supports a wide variety of date formats.
-
Days can be represented by 1 or 2 digits
-
Months can be entered as numerics (1 or 2 digits) or English characters (full name or 3-character abbreviation)
-
Years can be represented as 1, 2, 3 or 4 digits
-
Supported delimiters include , . - / as well as a space
-
Partial fields can be entered
Examples: All of the following are acceptable:
-
3/24/1984
-
March 1984
-
3-24
-
-24-84
-
March 24, 1984
-
24-3-84
-
March
-
1984
You can modify the weights given to each type of date match to see how each value impacts the final match score.
To view and modify the date settings:
-
Click on the settings icon in the upper right corner of the compare window.
-
Change the weightings of the date component matches. You can also choose to disable swap.
-
Select Compare.
Changes are applied to the date pair currently displayed; overall date search settings are not modified.
-
Reset sets the values back to their default settings.
-
To save a set of parameter values, click on the copy URL icon in the upper right corner. The URL contains the parameter values.
Because dates are sometimes written month day and other times written day month, swap tries matching the date fields as written as well as with the month and day fields switched. The best score is returned as the match score. For example, if the dates in question are 1970-3-5 and 1970-6-4, this feature will match the following four pairs:
The maximum score of the four pairs is then returned as the match score.
Check the Enable Swap box in the Advanced settings if you think there may be formatting inconsistencies in your dates and the dates you are matching may not always have days and months in the same positions.
If the selected match score is from a swapped pair, a penalty score is applied, indicating less certainty in the match. The displayed Pre Swap Penalty Score is the score returned for the selected swapped pair, before the penalty is applied.
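The swap behavior can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's algorithm: `toy_score` is a placeholder scorer, the 0.9 penalty value and its multiplicative form are hypothetical, and the function names are invented for the example. The sketch scores the dates as written and with month and day switched on either side, keeps the best of the four pairs, and applies the penalty only when the selected pair involves a swap.

```python
from typing import Callable, Tuple

DateFields = Tuple[int, int, int]  # (year, month, day) as written


def swap_month_day(d: DateFields) -> DateFields:
    """Return the same date with the month and day fields switched."""
    year, month, day = d
    return (year, day, month)


def toy_score(a: DateFields, b: DateFields) -> float:
    """Placeholder scorer: fraction of fields that are equal."""
    return sum(x == y for x, y in zip(a, b)) / 3


def best_swap_score(d1: DateFields, d2: DateFields,
                    score: Callable[[DateFields, DateFields], float] = toy_score,
                    swap_penalty: float = 0.9) -> dict:
    """Score all four pairs and return the best one, penalized if swapped."""
    candidates = []
    for a, a_swapped in ((d1, False), (swap_month_day(d1), True)):
        for b, b_swapped in ((d2, False), (swap_month_day(d2), True)):
            candidates.append({
                "pair": (a, b),
                "swapped": a_swapped or b_swapped,
                "pre_swap_penalty_score": score(a, b),
            })
    # max() keeps the first maximum, so the as-written pair wins ties.
    best = max(candidates, key=lambda c: c["pre_swap_penalty_score"])
    factor = swap_penalty if best["swapped"] else 1.0
    best["score"] = best["pre_swap_penalty_score"] * factor
    return best


result = best_swap_score((1970, 3, 5), (1970, 5, 3))
print(result["swapped"], result["pre_swap_penalty_score"], result["score"])
```

Here the as-written pair agrees only on the year, while switching month and day on one side makes the dates identical, so a swapped pair is selected and the penalty applies.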
The date weighting fields control the relative strength of each aspect of the date-matching algorithm. A separate score is calculated for each match type. The final match score is calculated by performing a weighted arithmetic mean over each of the similarity scores. If a field is missing from a record, that field is ignored and its weight evenly distributed across other fields.
Table 3. Date Weighting Parameters
Display Name | Parameter Name | Score based on | Example
Time Weight | timeDistanceWeight | The number of days between the two input dates | 1979-12-31 and 1980-1-1 look different, but they are very close in time. They will have a high match score.
Year Weight | yearDistanceWeight | The difference of the year fields | Close years will have a high match score.
Month Weight | monthDistanceWeight | The difference of the month fields | 1 and 12 are far apart, even if they are close in time. They will have a low match score.
Day Weight | dayDistanceWeight | The difference of the day fields | 1 and 30 are far apart, even if they are close in time. They will have a low match score.
String Weight | stringDistanceWeight | The edit distance between the two dates, when converted to a standard string (05021974 for 5/2/1974) | 1979-12-31 and 1980-1-1 will be 19791231 and 19800101. They will have a low match score.
Dates with a high time match score may have a very low string match score. Time finds dates that are close together; string gives high scores to similarly formatted dates.
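The weighted arithmetic mean described for the date weighting fields, including the handling of missing fields, might look like the sketch below. The weight and score values are hypothetical, the function name is invented for the example, and the mapping of a missing date field to a missing component score (represented here as `None`) is an assumption.

```python
def weighted_date_score(scores: dict, weights: dict) -> float:
    """Weighted mean of component scores; a missing component (None) is
    ignored and its weight is split evenly across the remaining ones."""
    present = {k: s for k, s in scores.items() if s is not None}
    if not present:
        raise ValueError("at least one component score is required")
    missing_weight = sum(weights[k] for k, s in scores.items() if s is None)
    share = missing_weight / len(present)
    adjusted = {k: weights[k] + share for k in present}
    total = sum(adjusted.values())
    if total == 0:
        raise ValueError("weights must not all be zero")
    return sum(adjusted[k] * present[k] for k in present) / total


# Hypothetical values; keys correspond to timeDistanceWeight,
# yearDistanceWeight, monthDistanceWeight, dayDistanceWeight, and
# stringDistanceWeight.
weights = {"time": 2.0, "year": 1.0, "month": 1.0, "day": 1.0, "string": 1.0}
scores = {"time": 1.0, "year": 1.0, "month": 0.0, "day": None, "string": 0.5}

final_score = weighted_date_score(scores, weights)
print(final_score)  # 0.6875
```

In this example the day weight of 1.0 is divided into four shares of 0.25, so the remaining weights become 2.25, 1.25, 1.25, and 1.25 before the mean is taken.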
Rosette Match Studio compares addresses by comparing the fields within each address. You can enter the address as a single field or as separate fields.
-
Select the format of the address - single-field or multi-field. Both address 1 and address 2 must be in the same format.
-
Enter the addresses.
-
Select Compare.
RMS optimizes the matching algorithm to the field type. Named entity fields, such as street name, city, and state are matched using an algorithm similar to name matching. Numeric and alphanumeric fields such as house number, postal code, and unit, are matched using numerically-based methods.
The Match Score Computation displays the match matrix for the address fields. Each entered field in address 1 is compared with each field in address 2, similar to how name tokens are scored.
Addresses can be defined either as a single field or as a set of address fields. When defined as a single field, the jpostal library is used to parse the address string into address fields.
When entered as a set of fields, the address may include any of the fields below. At least one field must be specified, but no specific fields are required.
Table 4. Supported Address Fields
Field Name | Description | Example(s)
house | venue and building names | "Brooklyn Academy of Music", "Empire State Building"
houseNumber | usually refers to the external (street-facing) building number | "123"
road | street name(s) | "Harrison Avenue"
unit | an apartment, unit, office, lot, or other secondary unit designator | "Apt. 123"
level | expressions indicating a floor number | "3rd Floor", "Ground Floor"
staircase | numbered/lettered staircase | "2"
entrance | numbered/lettered entrance | "front gate"
suburb | usually an unofficial neighborhood name | "Harlem", "South Bronx", "Crown Heights"
cityDistrict | these are usually boroughs or districts within a city that serve some official purpose | "Brooklyn", "Hackney", "Bratislava IV"
city | any human settlement including cities, towns, villages, hamlets, localities, etc. | "Boston"
island | named islands | "Maui"
stateDistrict | usually a second-level administrative division or county | "Saratoga"
state | a first-level administrative division | "Massachusetts"
countryRegion | informal subdivision of a country without any political status | "South/Latin America"
country | sovereign nations and their dependent territories, which have a designated ISO-3166 code | "United States of America"
worldRegion | currently only used for appending "West Indies" after the country name, a pattern frequently used in the English-speaking Caribbean | "Jamaica, West Indies"
postCode | postal codes used for mail sorting | "02110"
poBox | post office box: typically found in non-physical (mail-only) addresses | "28"
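For readers preparing multi-field addresses programmatically, the sketch below shows one way to represent an address as a mapping keyed by the field names in Table 4 and to check the rule that at least one field must be specified. The function and constant names are hypothetical and this is not a Rosette Match Studio API; it only mirrors the field list above.

```python
SUPPORTED_ADDRESS_FIELDS = frozenset({
    "house", "houseNumber", "road", "unit", "level", "staircase",
    "entrance", "suburb", "cityDistrict", "city", "island",
    "stateDistrict", "state", "countryRegion", "country",
    "worldRegion", "postCode", "poBox",
})


def validate_address(fields: dict) -> dict:
    """Reject unknown field names, drop blank values, and require
    at least one populated field."""
    unknown = set(fields) - SUPPORTED_ADDRESS_FIELDS
    if unknown:
        raise ValueError(f"unsupported address fields: {sorted(unknown)}")
    cleaned = {k: v.strip() for k, v in fields.items() if v and v.strip()}
    if not cleaned:
        raise ValueError("at least one address field must be specified")
    return cleaned


address = validate_address({
    "houseNumber": "123",
    "road": "Harrison Avenue",
    "city": "Boston",
    "state": "",  # blank values are dropped rather than treated as errors
})
print(address)
# {'houseNumber': '123', 'road': 'Harrison Avenue', 'city': 'Boston'}
```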
Understanding Name Match Scores
To fully understand the name match scores, you need to understand how the match scores are determined. The match score is a value between 0.0 and 1.0; the higher the score, the stronger the match. The score is a relative indication of how similar the two names are; it is not an absolute value. When comparing different name matches, the relative values of the match scores are more relevant than the actual score. Similar name matches in different languages may generate different match scores.
A value of 1.0 is returned if and only if the two names are identical. Character strings, languages, languages of origin, and entity types must all match for the two names to be considered identical.
Calculating the match score is a complex process that involves multiple steps and algorithms.
-
Identify and normalize the tokens in each name. Each name will usually have multiple tokens.
-
Compare each token from name 1 with each token from name 2, calculating the score for every token pair.
-
Once all the token pairs have been scored, the best combination of tokens is selected to maximize the complete score.
-
Score unmatched tokens as deletions or conflicts.
-
Compute a weighted average score.
-
Adjust the final score. For example, the score is decreased if the gender of the two names does not appear to match.
Tokenize, Normalize, and Transliterate
Before any matching algorithms can be run, the names have to be transformed into tokens that can be compared. This step, as with many of the steps in name matching, often has language-specific components.
This step includes:
-
Removing stop words, such as Mr. or Senator or General. Stop words are language-dependent.
-
Transliterating into English and/or translating if necessary, including:
-
Adding vocalizations, or vowels in the correct location for languages such as Arabic which are often written in an unvocalized form.
-
Adding spaces (segmentation) to languages such as Chinese, Japanese, Korean, and Thai that don't use spaces, separating given and surname tokens.
-
Normalization, including removing diacritical marks, to get a canonical representation of the token.
The resulting token output enhances search accuracy and increases relevancy.
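A much-simplified sketch of two of these steps, stop word removal and diacritic normalization, is shown below. The stop word list contains only the examples mentioned above, lowercasing and punctuation stripping are assumptions added for the illustration, and transliteration, vocalization, and segmentation are not attempted. The function names are hypothetical.

```python
import unicodedata

# Illustrative only: real stop word lists are language-dependent.
STOP_WORDS = {"mr", "senator", "general"}


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize_name(name: str) -> list:
    """Split on whitespace, normalize each token, and drop stop words."""
    tokens = []
    for raw in name.split():
        token = strip_diacritics(raw).lower().strip(".,")
        if token and token not in STOP_WORDS:
            tokens.append(token)
    return tokens


print(tokenize_name("Mr. José Núñez"))
# ['jose', 'nunez']
```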
Calculate Scores for Token Pairs
Every token in Name 1 is matched and scored against every token in Name 2 to find the matching pairs that will result in the highest total score for the pair. All candidate token pairs are scored to determine the best match alignment.
Token scorers are modular, allowing different scorers to be chosen for each token pair. The choice of scorer applied will depend on the type of tokens, the type of matches, and the languages of the names.
There are multiple types of scorers used, including:
Each matcher returns a (ts) score for the pair.
Once all the token pairs have been scored, the best combination of tokens is selected to maximize the complete score. The chosen pairs are displayed along with their scores.
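The selection of the best combination of token pairs can be pictured with the sketch below. It is an illustration under several assumptions: `token_score` uses Python's `difflib.SequenceMatcher` as a stand-in for the product's token scorers, the search is a brute-force pass over all alignments (practical only for short names), and every token of the shorter name is forced into a pair, whereas the product can leave tokens unmatched.

```python
from difflib import SequenceMatcher
from itertools import permutations


def token_score(a: str, b: str) -> float:
    """Stand-in token scorer: similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def best_alignment(tokens1: list, tokens2: list) -> list:
    """Return (token from name 1, token from name 2, score) triples for
    the one-to-one pairing with the highest total score."""
    swapped = len(tokens1) > len(tokens2)
    short, long_ = (tokens2, tokens1) if swapped else (tokens1, tokens2)
    best_total, best_pairs = -1.0, []
    # Try every way of assigning distinct long-name tokens to the short name.
    for perm in permutations(range(len(long_)), len(short)):
        pairs = [(s, long_[j], token_score(s, long_[j]))
                 for s, j in zip(short, perm)]
        total = sum(score for _, _, score in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    if swapped:
        # Restore the (name 1, name 2) orientation of each pair.
        best_pairs = [(b, a, score) for a, b, score in best_pairs]
    return best_pairs


pairs = best_alignment(["john", "smith"], ["smith", "jon"])
for t1, t2, score in pairs:
    print(t1, t2, round(score, 3))
```

For these inputs the selected pairs are john with jon and smith with smith, even though the tokens appear in a different order in the two names.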
Score Deletions and Conflicts
To be considered a match, token pairs must score higher than both the conflictThreshold and estimatedConflictOrDeletion parameters. Tokens not part of a satisfactory pair in this regard will be considered either conflicts or deletions.
A conflict is a score given to a token pair which context suggests should have matched, but whose score was not high enough to be considered a match. For example, when “Johann Sebastian Bach” is matched against “Johann Ambrosius Bach”, Sebastian is in conflict with Ambrosius. The token pair is considered a conflict.
A deletion is a score given to individual tokens that are not part of successful match pairs or conflict pairs.
There are two types of deletions:
Calculate the Weighted Score
The token pair scorer returns two values, a (ts) score and a (cs) score. The ts score measures how well the tokens match, while the cs score also takes the placement of the token into account. The cs score will be lower if the tokens match but are out of order.
Each token has a weight. The weightings determine how important the token pair match is in calculating the final score. Full names are rated higher than initials, and unusual tokens get a higher weighting than more common names because it is more significant when they match. For example, Andrew is less common than John, so it gets a higher weighting. These weightings are used in calculating the final score, which is a weighted average of the cs scores.
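Written as a formula, and assuming each selected token pair i carries a single weight w_i (how the product combines the two token weights into a pair weight is not specified here), the weighted average before final adjustments would be:

```latex
\text{score} = \frac{\sum_{i=1}^{n} w_i \, cs_i}{\sum_{i=1}^{n} w_i}
```

For example, with two hypothetical pairs, one with cs = 1.0 and weight 0.7 and the other with cs = 0.8 and weight 0.3, the result is (0.7 × 1.0 + 0.3 × 0.8) / (0.7 + 0.3) = 0.94.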
Tip
We apply specific cultural knowledge to the weighted scoring; for example in Spanish we handle parsing of matrilineal vs patrilineal surnames, with the onomastic understanding that the patrilineal name is likely to be treated as the person's primary surname.
At the end, all selected tokens and token-pairs are considered together and some final adjustments are made to the score. Examples of adjustments include:
-
A penalty is applied if the two names do not appear to have the same gender. This, of course, is language-dependent.
-
Sometimes there is a weighting penalty applied early in the process, which when considered at the end, in the full context of the match, is not as significant as initially determined. These values may be adjusted in the final score.