The default values of the RNI match parameters are tuned to perform well on most queries and datasets. However, every use case uses different data with distinct match requirements. You can modify match parameters to optimize match results for your data and business case.
The typical process for tuning parameters is as follows:
Gather a list of names to index and a list of queries to run against them, to use as test data. Ideally, the test data set is large enough to reflect the diversity of your real data and includes at least 100 queries.
After indexing the data, run the queries using RNI and determine a match score threshold that appears to provide the best results.
Analyze the results to find pairs that should match but that RNI scored below the threshold, and pairs that should not match but that RNI scored above it.
Choose a subset of these name pairs, scored too low or too high, to use as examples for tuning your parameters.
Tune the match parameters to change the match scores of this subset of undesirable results, so that each score falls on the correct side of your threshold. For name or address pairs that must match in a specific way and are very dissimilar (e.g., aliases), we recommend that you add them as token or full-name overrides.
Run the full set of queries through RNI again to confirm that the new parameter values still return the desired matches and do not introduce new undesired results.
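The threshold selection and error analysis in the steps above can be sketched in a few lines of Python. This is only an illustration: the `ScoredPair` type, the helper functions, and every score below are hypothetical, and in practice the scores would come from your own RNI query results together with a human judgment of whether each pair should match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredPair:
    query: str
    indexed: str
    score: float        # match score for this pair (made-up values below)
    should_match: bool  # human judgment: is this a desired match?


def errors_at(threshold, pairs):
    """Split the pairs into missed matches and unwanted matches at a threshold."""
    missed = [p for p in pairs if p.should_match and p.score < threshold]
    unwanted = [p for p in pairs if not p.should_match and p.score >= threshold]
    return missed, unwanted


def best_threshold(pairs, candidates):
    """Return the candidate with the fewest total errors (lowest value wins ties)."""
    return min(candidates, key=lambda t: (sum(len(e) for e in errors_at(t, pairs)), t))


# Hypothetical test data; real scores would be collected from RNI.
pairs = [
    ScoredPair("John Mike Smith", "John M Smith", 0.86, True),
    ScoredPair("John Mike Smith", "John Smith", 0.74, True),
    ScoredPair("John Mike Smith", "John Joe Smith", 0.81, False),
    ScoredPair("John Mike Smith", "Jane Doe", 0.12, False),
]

threshold = best_threshold(pairs, [0.70, 0.75, 0.80, 0.85])
missed, unwanted = errors_at(threshold, pairs)
print("threshold:", threshold)
print("scored too low:", [(p.query, p.indexed) for p in missed])
print("scored too high:", [(p.query, p.indexed) for p in unwanted])
```

The pairs that end up in `missed` and `unwanted` are the candidates for the tuning subset. Rerunning the same check after each parameter change shows whether the change fixed them without creating new errors elsewhere.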
Parameter Configuration Files
Individual name tokens are scored by a number of algorithms or rules. These algorithms can be optimized by modifying configuration parameters, thus changing the final similarity score.
The parameters are contained in two .yaml files located in /rlpnc/data/etc. The parameters are defined in parameter_defs.yaml and modified in parameter_profiles.yaml.
parameter_defs.yaml lists each match parameter along with its default value and a description. A parameter may also have a minimum and a maximum value. These are system limits, and a value outside them could cause an error. A parameter may also have a recommended minimum (sane_minimum) and a recommended maximum (sane_maximum) value, which we advise you not to exceed.
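As a minimal sketch of how the two kinds of limits differ, the following Python helper classifies a proposed value against one parameter definition. The dictionary layout and the `minimum` and `maximum` key names are assumptions made for illustration; only `sane_minimum` and `sane_maximum` are named in this guide, and the numbers are invented. Check parameter_defs.yaml for the actual layout and limits.

```python
def check_value(value, definition):
    """Classify a proposed value against hard limits and recommended limits."""
    lo, hi = definition.get("minimum"), definition.get("maximum")
    if (lo is not None and value < lo) or (hi is not None and value > hi):
        return "error: outside system limits"
    sane_lo = definition.get("sane_minimum")
    sane_hi = definition.get("sane_maximum")
    if (sane_lo is not None and value < sane_lo) or (
        sane_hi is not None and value > sane_hi
    ):
        return "warning: outside recommended range"
    return "ok"


# Hypothetical definition; the real limits live in parameter_defs.yaml.
example_def = {
    "minimum": 0.0,
    "maximum": 1.0,
    "sane_minimum": 0.2,
    "sane_maximum": 0.9,
}

for proposed in (0.5, 0.95, 1.5):
    print(proposed, "->", check_value(proposed, example_def))
```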
parameter_profiles.yaml is where you change parameter values based on the language pairs in the match.
Important
Do not modify the parameter_defs.yaml file. All changes should be made in the parameter_profiles.yaml file.
Do refer to the parameter_defs.yaml file for definitions and usage of all available parameters.
Commonly Modified Name Parameters
Given the large number of configurable name match parameters in RNI, you should start by looking at the impact of modifying a small number of parameters. The complete definition of all available parameters is found in the parameter_defs.yaml file.
The following examples describe the impact of parameter changes in more detail.
Example 1. Token Conflict Score (conflictScore)
Consider the following two names: ‘John Mike Smith’ and ‘John Joe Smith’. ‘John’ in the first name is matched with ‘John’ in the second, as is the token ‘Smith’ from each name. This leaves the unmatched tokens ‘Mike’ and ‘Joe’, which are in direct conflict with each other. This parameter determines how that conflict is scored. A value closer to 1.0 treats ‘Mike’ and ‘Joe’ as equal. A value closer to 0.0 has the opposite effect. This parameter is important if you decide that names with dissimilar tokens should receive lower final scores, or that when two of the tokens are the same, the third token (for example, a middle name) is less important.
Example 2. Initials Score (initialsScore)
Consider the following two names: ‘John Mike Smith’ and ‘John M Smith’. ‘Mike’ and ‘M’ trigger an initial match, and this parameter controls how that match is scored. A value closer to 1.0 treats ‘Mike’ and ‘M’ as equal and increases the overall match score. A value closer to 0.0 has the opposite effect. This parameter is important when you know that your data sets contain many initials.
Example 3. Token Deletion Score (deletionScore)
Consider the following two names: ‘John Mike Smith’ and ‘John Smith’. The name token ‘Mike’ is left unpaired, with no corresponding token in the second name. In this example, a value closer to 1.0 does not penalize the missing token. A value closer to 0.0 has the opposite effect. This parameter is important when the number of tokens per name varies widely in your name set.
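The direction of the effect in Examples 1 through 3 can be illustrated with a toy model. The sketch below is explicitly not RNI's scoring algorithm: it assumes, purely for illustration, that the final score is a plain average of three token-pair scores, where the two matching tokens score 1.0 and the middle position receives the parameter value (standing in for conflictScore, initialsScore, or deletionScore, depending on the example). The function name `toy_name_score` is hypothetical.

```python
def toy_name_score(middle_token_score):
    """Average three token-pair scores; a simplification, not RNI's algorithm.

    'John' and 'Smith' are assumed to match exactly (1.0 each). The middle
    position gets the parameter value under study.
    """
    return (1.0 + middle_token_score + 1.0) / 3


# Sweep the parameter from strict (near 0.0) to lenient (near 1.0).
for value in (0.1, 0.5, 0.9):
    print(f"parameter={value:.1f} -> toy score={toy_name_score(value):.3f}")
```

In this toy model, raising the parameter toward 1.0 pulls the overall score toward a perfect match, and lowering it toward 0.0 pulls the score down. That is the qualitative behavior described in the examples, even though the real scoring is more involved.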
Example 4. Token Reorder Penalty (reorderPenalty)
This parameter is applied when tokens match but are in different positions in the two names. Consider the following two names: ‘John Mike Smith’ and ‘John Smith Mike’. This parameter controls the extent to which the token ordering (‘Mike Smith’ vs. ‘Smith Mike’) decreases the final match score. A value closer to 1.0 penalizes the reordering, driving the final score lower. A value closer to 0.0 does not penalize the order. This parameter is important when the order of tokens in the name is known. If you know that all your name data stores the last name in the last token position, you may want to penalize token reordering more by increasing the penalty. If your data is not well-structured, with some but not all last names appearing first, you may want to lower the penalty.
Example 5. Right End Boost/Both Ends Boost (boostWeightAtRightEnd, boostWeightAtBothEnds)
These parameters boost the weights of tokens in the last position, or in both the first and last positions, of a name. They are useful when you are dealing with English names and are confident of the placement of the surname. Consider the following two names: ‘John Mike Smith’ and ‘John Jay M Smith’. By boosting both ends, you effectively give more weight to the ‘John’ and ‘Smith’ tokens. These parameters are important when a name has several tokens and you are confident that the first and last tokens are the most important.
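To see why boosting the end tokens raises the score for a pair like this, consider a toy weighted average. This is a sketch under stated assumptions, not RNI's implementation: the function `toy_weighted_score`, the idea that the boost acts as a simple weight multiplier, and the token-pair scores are all hypothetical.

```python
def toy_weighted_score(token_scores, end_boost=1.0, both_ends=True):
    """Weighted average of token-pair scores with boosted end positions.

    Assumption for illustration only: the boost multiplies the weight of the
    last token and, when both_ends is True, the first token as well.
    """
    if not token_scores:
        raise ValueError("token_scores must not be empty")
    weights = [1.0] * len(token_scores)
    weights[-1] *= end_boost
    if both_ends and len(weights) > 1:
        weights[0] *= end_boost
    return sum(w * s for w, s in zip(weights, token_scores)) / sum(weights)


# Hypothetical pair scores: 'John' and 'Smith' match exactly, while the
# middle tokens ('Mike' against 'Jay M') match only weakly.
scores = [1.0, 0.4, 1.0]

unboosted = toy_weighted_score(scores)
boosted = toy_weighted_score(scores, end_boost=2.0)
print(f"unboosted={unboosted:.2f} boosted={boosted:.2f}")
```

Because the strongly matching tokens sit at the ends, giving them more weight reduces the influence of the weak middle match. With data where the end tokens are unreliable, the same boost would amplify mismatches instead.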