The default values of the RNI match parameters are tuned to perform well on most queries and datasets. However, every use case involves different data with distinct match requirements. If you are getting undesired results, you may need to tune the match parameters for your use case.
The typical process for tuning parameters is as follows:
Gather a list of names to index and a list of queries to run against them; together these form your test data. Ideally, the test data set should contain at least 100 queries and be large enough to reflect the diversity of your real data.
After indexing the data, run the queries using RNI and determine a match score threshold that appears to provide the best results.
Analyze the results to find name pairs that RNI scored too low (correct matches that fall below the threshold) or too high (incorrect matches that score above the threshold).
Choose a subset of these mis-scored name pairs to use as examples for tuning your parameters.
Tune the match parameters to change the match scores of these example pairs so that each score falls on the correct side of your threshold. For name pairs that must match in a specific way but are very dissimilar (e.g., aliases), we recommend adding them as token or full-name overrides.
Run the full set of queries through RNI again to verify that the new parameter values still return the desired matches and do not introduce new undesired results.
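The threshold selection and error analysis steps above can be sketched in a few lines of Python. This is only an illustration, not part of RNI: the `ScoredPair` record, the helper functions, and the scores are all hypothetical, and it assumes you have already exported each query/name pair with its match score and a human judgment of whether the pair should match.

```python
from dataclasses import dataclass


@dataclass
class ScoredPair:
    query: str
    indexed: str
    score: float        # match score exported from your test run (made-up values below)
    should_match: bool  # human judgment: is this pair a correct match?


def find_errors(pairs, threshold):
    """Return (too_low, too_high): missed matches and false matches at a threshold."""
    too_low = [p for p in pairs if p.should_match and p.score < threshold]
    too_high = [p for p in pairs if not p.should_match and p.score >= threshold]
    return too_low, too_high


def best_threshold(pairs, candidates):
    """Pick the candidate threshold that produces the fewest total errors."""
    return min(candidates, key=lambda t: sum(len(e) for e in find_errors(pairs, t)))


# Illustrative data only; these scores were not produced by RNI.
pairs = [
    ScoredPair("John Kennedy", "John Fitzgerald Kennedy", 0.78, True),
    ScoredPair("John Kennedy", "Kennedy John", 0.72, True),
    ScoredPair("John Kennedy", "Joan Kenney", 0.81, False),
    ScoredPair("John Kennedy", "Jon Kennedy", 0.93, True),
    ScoredPair("John Kennedy", "Mary Smith", 0.12, False),
]

threshold = best_threshold(pairs, [0.70, 0.75, 0.80, 0.85])
too_low, too_high = find_errors(pairs, threshold)
```

The pairs left in `too_low` and `too_high` are the candidates for the tuning subset. Minimizing the total error count is just one possible selection rule; a use case that tolerates false matches better than missed matches (or the reverse) would weight the two lists differently.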
To start tuning the parameters, run the RNI Pairwise Match on the test set and look at the Match Reasons in the response. These Match Reasons serve as a guide for which parameters to tune; the parameters themselves are defined in parameters_defs.yaml. For additional support on tuning the parameters, contact support@rosette.com.
An example parameter to tune is reorderPenalty, which penalizes tokens that are out of order. For some use cases, you may decide that token order should not significantly affect whether two names are a match. This may be because the data combines given-name-first and last-name-first data sources (e.g., "John Kennedy" and "Kennedy John" are both expected to be present). To adjust this parameter, find an existing parameter profile or define a new one, add the parameter, and modify its value. Decreasing the reorderPenalty value causes out-of-order tokens to have a smaller effect on the match score.
Another example parameter is deletionScore, which increases the match score when part of the name is missing. For example, "John Fitzgerald Kennedy" and "John Kennedy" have a higher match score if you increase the deletionScore parameter value.
Once you have defined a profile and set a parameter value, rerun the RNI Pairwise Match so that the pairs are scored with the edited parameter_profiles.yaml file.
Evaluating Parameter Configuration
To evaluate the newly tuned parameter values, query a large dataset of names that does not include your test set. For an exact evaluation, query an annotated dataset that includes the correct answers for a number of queries. For a general evaluation, compare the number of pair matches that score above your threshold before and after tuning the parameter values. If there were too many matches before, there should now be fewer; if there were too few, there should now be more. Be cautious if the number of matches changes dramatically: a sharp decrease raises the chance of missing correct matches below the threshold, and a sharp increase raises the chance of including incorrect matches above it.
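Both kinds of evaluation can be sketched in Python. This is a minimal illustration under stated assumptions, not an RNI feature: it assumes the scores from each run have been exported as a dictionary keyed by (query, name) pair, that the annotated dataset is a set of the pairs known to be correct, and the function names and scores are hypothetical.

```python
def precision_recall(scored, gold, threshold):
    """Exact evaluation against annotated data.

    scored: dict mapping (query, name) -> match score
    gold:   set of (query, name) pairs annotated as correct matches
    """
    returned = {pair for pair, score in scored.items() if score >= threshold}
    true_positives = len(returned & gold)
    # Convention chosen here: an empty result set counts as precision 1.0.
    precision = true_positives / len(returned) if returned else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


def match_count(scored, threshold):
    """General evaluation: how many pairs score at or above the threshold."""
    return sum(1 for score in scored.values() if score >= threshold)


# Illustrative data only; these scores were not produced by RNI.
gold = {("John Kennedy", "Kennedy John"), ("John Kennedy", "Jon Kennedy")}
before = {
    ("John Kennedy", "Kennedy John"): 0.72,
    ("John Kennedy", "Jon Kennedy"): 0.93,
    ("John Kennedy", "Joan Kenney"): 0.81,
}
after = dict(before)
after[("John Kennedy", "Kennedy John")] = 0.86  # e.g. after lowering reorderPenalty

threshold = 0.80
precision_before, recall_before = precision_recall(before, gold, threshold)
precision_after, recall_after = precision_recall(after, gold, threshold)
count_before = match_count(before, threshold)
count_after = match_count(after, threshold)
```

Comparing `count_before` with `count_after` gives the general evaluation, while the precision and recall figures are only available when annotated answers exist.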
If you find new pair matches that you want to score above or below your threshold, collect them into a test set to retune the parameters. Then evaluate the parameters again using a large dataset to review results. It is important to frequently evaluate new parameter settings on separate test data to ensure the parameters continue to return correct results.
For further configuration options, such as custom language model training and tuning name match parameters, please contact support@rosette.com.