Setting entity type and language/script
Where possible, you should always set the entity type and language arguments. While these arguments are optional and RNI is flexible enough to work without them, you will see better results when you use them.
RNI has different algorithms for matching PERSON and ORGANIZATION names. If you don't specify the entity type, RNI will fall back to a simplified rule set based on the PERSON entity type. This may lead to scores that do not meet expectations.
Similarly, when language and script are not provided, RNI will attempt to guess the language. Given the similarity between some languages, such as Ukrainian and Russian or Japanese and Chinese, the guess may be incorrect. The language determines which model and rules are applied during scoring, so an incorrect language or script will likely lead to less accurate RNI scores.
It is strongly recommended that you determine and set the entity type and language/script during both indexing and querying.
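As an illustration, a name-comparison request with these arguments set might look like the sketch below. The endpoint URL and the exact payload field names are assumptions made for illustration; consult your RNI deployment's API reference for the actual request format.

```python
import requests

# Illustrative only: the endpoint and payload field names are assumptions;
# adapt them to your RNI deployment's actual API.
RNI_URL = "https://rni.example.com/rest/v1/name-similarity"

payload = {
    "name1": {
        "text": "Иван Петров",
        "entityType": "PERSON",   # explicitly set the entity type
        "language": "rus",        # language code
        "script": "Cyrl",         # script code
    },
    "name2": {
        "text": "Ivan Petrov",
        "entityType": "PERSON",
        "language": "eng",
        "script": "Latn",
    },
}

response = requests.post(RNI_URL, json=payload, timeout=30)
print(response.json())
```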
Name tokenization is the process of breaking a name into pieces such as a first name and a last name. RNI processes full names, but the underlying algorithms work at the token level, so knowing how often a name token appears can provide insight into how to improve accuracy. Use the list of names you have indexed, along with any historical queries you have saved, to determine how often each token appears in your name universe. For English names, tokenization is as simple as splitting on whitespace; for languages that don’t use whitespace, you may need to incorporate Rosette Base Linguistics (RBL), which provides tokenization for 33 languages. The end result of this process is a table of tokens and their frequencies.
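A rough sketch of how such a frequency table could be built for whitespace-delimited names follows; the input file name and the simple lower-casing are illustrative, and for languages without whitespace you would substitute RBL tokenization.

```python
from collections import Counter

# Build a token-frequency table from indexed names and saved queries.
# "names.txt" (one name per line) is a hypothetical input file.
# Simple whitespace tokenization; use RBL for languages without whitespace.
token_counts = Counter()
with open("names.txt", encoding="utf-8") as f:
    for line in f:
        for token in line.strip().lower().split():
            token_counts[token] += 1

# Print the most frequent tokens, i.e. the token-frequency table.
for token, count in token_counts.most_common(20):
    print(f"{token}\t{count}")
```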
Token frequency lets us accomplish a few things.
First, it identifies potential stopword candidates: words you don’t want used in the comparison. Tokens such as “INC” and “of” typically appear very frequently; because they are common and generic, they should have little real impact on the score. RNI includes stopword lists that can be modified, leading to higher-quality scores (the sketch below shows one way to surface such candidates).
Second, RNI takes the uniqueness of a name into consideration when it calculates the final score. Using this output, you can retrain the model on customer-specific data to help improve results.
Finally, this analysis helps identify data integrity issues. RNI is designed to work on name/entity data, but data migration and ingestion are complicated processes that often don’t work perfectly; web URLs or other unintended transformations may have found their way into your data set. Token frequencies help surface these issues and define actions to clean up any inconsistencies.
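Building on the frequency table from the earlier sketch, the snippet below shows one way to flag stopword candidates and obviously suspicious tokens. The cutoff, the regular expression, and the inline sample counts are arbitrary placeholders; tune them against your own data.

```python
import re
from collections import Counter

# token_counts is the Counter built in the earlier sketch; a tiny inline
# sample (made-up counts) is used here so the snippet runs on its own.
token_counts = Counter({"smith": 420, "inc": 380, "of": 350, "john": 310,
                        "xyz": 4, "www.example.com": 2, "19850723": 1})
total = sum(token_counts.values())

# Stopword candidates: tokens that appear in a large share of names.
# The 1% cutoff is an arbitrary placeholder.
stopword_candidates = [t for t, c in token_counts.items() if c / total > 0.01]
print("stopword candidates:", stopword_candidates)

# Data-integrity check: tokens that look like URLs or long digit runs rarely
# belong in a name field and usually point to ingestion problems.
suspicious = [t for t in token_counts if re.search(r"https?://|www\.|\d{4,}", t)]
print("suspicious tokens:", suspicious)
```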
Evaluating your data set will provide insight into the scoring for each name pair and can produce a breakdown of errors. This information shows which parameters have the largest impact on name scoring with your data and directs you toward the parameters to target for adjustment. Given the large number of configurable match parameters in RNI, it is advisable to start with just a few. Once again, this process can be automated, as was done when determining the optimal threshold: take a range of values for a small set of parameters and run your evaluation against all possible parameter configurations to determine the accuracy of each. The end result is a threshold value along with the optimal parameter configuration. Using the data processing techniques from the previous section, you can also extend the evaluation to find how often the phenomena described in the following examples occur in your data sets.
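A minimal sketch of such a sweep is shown below. The parameter names come from the document's own examples, but the value ranges, the labeled pairs, and the difflib-based score_pair placeholder are illustrative; in practice score_pair would call RNI with the given parameter overrides.

```python
from difflib import SequenceMatcher
from itertools import product

# Placeholder scorer so the sketch runs end to end; replace with a real RNI
# call that applies the given parameter overrides.
def score_pair(name1, name2, params):
    return SequenceMatcher(None, name1.lower(), name2.lower()).ratio()

# Ground-truth pairs: (query name, indexed name, is_match). Use your own labeled data.
labeled_pairs = [
    ("John Mike Smith", "John M Smith", True),
    ("John Mike Smith", "Jane Doe", False),
    ("Acme Inc", "Acme Incorporated", True),
]

# Value ranges are illustrative, not recommended settings.
param_grid = {
    "conflictScore":  [0.2, 0.4, 0.6],
    "initialsScore":  [0.6, 0.8, 1.0],
    "reorderPenalty": [0.3, 0.5, 0.7],
}
thresholds = [round(0.05 * i, 2) for i in range(10, 20)]  # 0.50 .. 0.95

best = None
for values in product(*param_grid.values()):
    params = dict(zip(param_grid, values))
    scored = [(score_pair(q, idx, params), truth) for q, idx, truth in labeled_pairs]
    for threshold in thresholds:
        correct = sum((s >= threshold) == truth for s, truth in scored)
        accuracy = correct / len(scored)
        if best is None or accuracy > best[0]:
            best = (accuracy, params, threshold)

print(f"best accuracy {best[0]:.3f} with {best[1]} at threshold {best[2]}")
```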
Commonly Modified Name Parameters
Given the large number of configurable name match parameters in RNI, you should start by looking at the impact of modifying a small number of them. The complete definition of all available parameters is found in the parameter_defs.yaml file.
The following examples describe the impact of parameter changes in more detail.
Example 1. Token Conflict Score (conflictScore)
Let’s look at two names: ‘John Mike Smith’ and ‘John Joe Smith’. The token ‘John’ from each name will be matched, as will the token ‘Smith’. This leaves the unmatched tokens ‘Mike’ and ‘Joe’, which are in direct conflict with each other, and users can determine how this conflict is scored. A value closer to 1.0 will treat ‘Mike’ and ‘Joe’ as equal; a value closer to 0.0 will have the opposite effect. This parameter is important when you decide that names containing dissimilar tokens should have lower final scores, or, conversely, when you decide that if two of the tokens are the same, the third token (for example, a middle name) is not as important.
Example 2. Initials Score (initialsScore)
Consider the following two names: 'John Mike Smith' and 'John M Smith'. 'Mike' and 'M' trigger an initial match. You can control how this gets scored. A value closer to 1.0 will treat ‘Mike’ and ‘M’ as equal and increase the overall match score. A value closer to 0.0 will have the opposite effect. This parameter is important when you know there is a lot of initialism in your data sets.
Example 3. Token Deletion Score (deletionScore)
Consider the following two names: ‘John Mike Smith’ and ‘John Smith’. The name token ‘Mike’ is left unpaired with any token from the second name. In this example, a value closer to 1.0 will not penalize the missing token; a value closer to 0.0 will have the opposite effect. This parameter is important when the number of tokens per name varies widely in your name set.
Example 4. Token Reorder Penalty (reorderPenalty)
This parameter is applied when tokens match but are in different positions in the two names. Consider the following two names: ‘John Mike Smith’ and ‘John Smith Mike’. This parameter controls the extent to which the token ordering (‘Mike Smith’ vs. ‘Smith Mike’) decreases the final match score. A value closer to 1.0 will penalize the final score, driving it lower. A value closer to 0.0 will not penalize the order. This parameter is important when the order of tokens in the name is known. If you know that all your name data stores the last name in the last token position, you may want to penalize token reordering more by increasing the penalty. If your data is not well-structured, with some last names first but not all, you may want to lower the penalty.
Example 5. Right End Boost/Both Ends Boost (boostWeightAtRightEnd, boostWeightAtBothEnds)
These parameters boost the weights of tokens in the first and/or last position of a name. They are useful when dealing with English names and you are confident of the placement of the surname. Consider the following two names: ‘John Mike Smith’ and ‘John Jay M Smith’. By boosting both ends you effectively give more weight to the ‘John’ and ‘Smith’ tokens. These parameters are important when a name has several tokens and you are confident that the first and last tokens are the most important.
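To make the effect of these knobs concrete, the toy scorer below mimics the direction each of the preceding parameters pulls on the example name pairs. It is a pedagogical sketch only, not RNI's actual algorithm, and the default values are arbitrary.

```python
# Toy illustration of conflictScore, initialsScore, deletionScore, and
# reorderPenalty. NOT RNI's algorithm; it only shows the direction each
# parameter pushes the final score.
def toy_score(name1, name2, conflictScore=0.3, initialsScore=0.8,
              deletionScore=0.5, reorderPenalty=0.1):
    t1, t2 = name1.split(), name2.split()
    left1, left2 = list(t1), list(t2)
    pair_scores, positions = [], []

    # 1) exact token matches
    for tok in list(left1):
        if tok in left2:
            positions.append((t1.index(tok), t2.index(tok)))
            left1.remove(tok)
            left2.remove(tok)
            pair_scores.append(1.0)

    # 2) initial matches, e.g. 'Mike' paired with 'M'
    for tok in list(left1):
        init = next((u for u in left2
                     if u[0] == tok[0] and (len(u) == 1 or len(tok) == 1)), None)
        if init is not None:
            left1.remove(tok)
            left2.remove(init)
            pair_scores.append(initialsScore)

    # 3) leftover tokens: pair them up as conflicts, score extras as deletions
    n_conflict = min(len(left1), len(left2))
    pair_scores += [conflictScore] * n_conflict
    pair_scores += [deletionScore] * (max(len(left1), len(left2)) - n_conflict)

    score = sum(pair_scores) / len(pair_scores)

    # 4) penalize exact matches whose relative order differs between the names
    if any(a < b and c > d for (a, c) in positions for (b, d) in positions):
        score -= reorderPenalty
    return max(score, 0.0)

print(toy_score("John Mike Smith", "John Joe Smith"))   # conflictScore applies to Mike/Joe
print(toy_score("John Mike Smith", "John M Smith"))     # initialsScore applies to Mike/M
print(toy_score("John Mike Smith", "John Smith"))       # deletionScore applies to Mike
print(toy_score("John Mike Smith", "John Smith Mike"))  # reorderPenalty applies
```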
Users also have the ability to classify specific tokens as ‘low weight’. These are tokens that you don’t want removed from the scoring of the name but that should not carry the same weight as other name tokens. Consider the following organization name: ‘XYZ Systems’. The token ‘Systems’ is commonly used in company names, so it will contribute to matches against other organization names, likely creating false positive hits. Users can identify ‘Systems’ as a low-weight token and add it to the lowWeighttokens.data files, and RNI will decrease its importance when scoring names that contain ‘Systems’. In this case, the result puts more emphasis on the ‘XYZ’ token.
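As a toy illustration of the effect (again, not RNI's actual computation), if the only shared token between ‘XYZ Systems’ and another organization name is ‘Systems’, down-weighting that token sharply reduces the overall score; the per-token similarities and weights below are made up.

```python
# Toy weighted average over per-token similarities; not RNI's algorithm.
def weighted_name_score(token_scores, weights):
    total = sum(weights.values())
    return sum(token_scores[t] * w for t, w in weights.items()) / total

# 'XYZ Systems' vs. 'ABC Systems': the shared generic token matches perfectly,
# the distinctive tokens do not match at all (assumed values).
token_scores = {"XYZ": 0.0, "Systems": 1.0}

print(weighted_name_score(token_scores, {"XYZ": 1.0, "Systems": 1.0}))  # 0.50 - borderline hit
print(weighted_name_score(token_scores, {"XYZ": 1.0, "Systems": 0.2}))  # ~0.17 - clear non-match
```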
Incorporating and optimizing field weights
Adding additional search signals to your query greatly impacts the accuracy of your results. RNI can also incorporate search signals from field types that it does not support: fields like ‘gender’ or ‘occupation’ are not supported by RNI but can be used in the rescoring process to help optimize accuracy. A key step while assessing your data is to evaluate the various fields and determine how sparse they are; if many fields are optional when collecting data, the result can be missing or null values. Knowing how sparse your data is will give insight into the best way to craft your rescoring process, and users can control how missing or null data is handled. By default, if a queried-for field is null in the index, the field is removed from the score calculation and the weights of the other fields are redistributed. However, you can override this behavior by using the score_if_null option to specify what score should be returned for a field that is null in the index document.
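The sketch below expresses that null-handling logic in plain Python so the behavior is easy to see; it is an illustration of the description above, not the plugin's actual code, and the field names, weights, and scores are made up.

```python
# Combine per-field scores with weights; when an indexed field is null, either
# drop it and redistribute the remaining weights, or use a score_if_null override.
def rescore(field_scores, weights, score_if_null=None):
    score_if_null = score_if_null or {}
    used_scores, used_weights = {}, {}
    for field, weight in weights.items():
        if field_scores.get(field) is not None:
            used_scores[field] = field_scores[field]
            used_weights[field] = weight
        elif field in score_if_null:
            used_scores[field] = score_if_null[field]   # fixed score for the null field
            used_weights[field] = weight
        # otherwise the field is dropped and its weight is redistributed
    total = sum(used_weights.values())
    return sum(used_scores[f] * w for f, w in used_weights.items()) / total

# 'dob' is null in the indexed record: by default it is dropped (score 0.90)...
print(rescore({"name": 0.9, "dob": None}, {"name": 0.85, "dob": 0.15}))
# ...or scored as a fixed 0.5 via a score_if_null-style override (score 0.84).
print(rescore({"name": 0.9, "dob": None}, {"name": 0.85, "dob": 0.15},
              score_if_null={"dob": 0.5}))
```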
The next step is to determine how fields should be weighted by the rescoring algorithm. Each field can be given a weight to reflect its importance in the overall matching logic. Determining the ideal weight can be a challenge that requires understanding your data. How much do you trust the quality of each field and what is its importance to your business goals? It is advisable to think about a range of weights for each field, giving primary fields like person name and organization name high ranges and fields of lesser importance and quality lower ranges. Let’s look at an example to illustrate the impact of field weighting.
Given the following two records:
RECORD 1: NAME: John Doe, DOB: 07-23-1985
RECORD 2: NAME: John Dowse, DOB: 07-23-1985
If the ‘NAME’ and ‘DOB’ fields are given equal weighting, the score between the two records is 87%.
This is a strong RNI score but likely a false positive as the last names are noticeably different. Now let's give the ‘NAME’ a weight of 85% and ‘DOB’ 15%.
Let's assume we apply a threshold of 80%. With this weighting configuration our match score falls below the threshold, turning the once false positive into a true negative, thus improving the accuracy.
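The arithmetic behind this example can be reconstructed as follows; the component scores (a name similarity of roughly 0.74 and an exact DOB match of 1.0) are assumed values chosen to reproduce the numbers above, not actual RNI output.

```python
# Assumed component scores for 'John Doe' vs. 'John Dowse' and identical DOBs.
name_score, dob_score = 0.74, 1.0

equal    = 0.50 * name_score + 0.50 * dob_score   # 0.87 - above an 80% threshold
weighted = 0.85 * name_score + 0.15 * dob_score   # ~0.78 - below the threshold

print(f"equal weights: {equal:.2f}")
print(f"85/15 weights: {weighted:.2f}")
```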
Identify the ranges for your field weights, then update your evaluation framework to include various configurations when identifying an optimal accuracy threshold.