In this section we will dive under the hood of RNI to understand the overall architecture and explain why and how RNI generates its name similarity score. It is critical to understand how an RNI score is computed in order to identify which algorithms were triggered. This insight will help you understand the configuration settings and how to set them.
The fundamental matching function
At the most fundamental level, RNI determines how similar two entities are. These entities can be made up of one or more fields containing Person, Organization, Address, Location, or Date data. Identifying fields are further broken down into subcomponents called tokens. For example, the tokens in a name might be the first name, middle name, and last name. Rosette blends machine learning with traditional name matching techniques such as name lists, common key, and rules to determine the best possible alignment and match score between sets of tokens. These granular scores are then combined into an overall record match score. This score can be used to maximize precision or recall depending upon the application.
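To make this concrete, here is a minimal sketch of rolling granular scores up into a record score. The weights, field names, and combination rules below are invented for illustration; they are not RNI's actual algorithms.

```python
# Illustrative only: roll per-token scores up into a field score, then roll
# field scores up into a record score. Weights and rules are invented.

def combine_token_scores(token_scores, token_weights):
    """Hypothetical rule: weighted average of aligned token-pair scores."""
    return sum(s * w for s, w in zip(token_scores, token_weights)) / sum(token_weights)

def combine_field_scores(field_scores):
    """Hypothetical rule: average the per-field scores (name, date, etc.)."""
    return sum(field_scores.values()) / len(field_scores)

name_score = combine_token_scores([0.49, 0.95, 0.90], [0.5, 0.2, 0.3])
record_score = combine_field_scores({"name": name_score, "date_of_birth": 1.0})
print(round(record_score, 3))
```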
Pairwise & index matching
There are two common usage patterns in record matching: pairwise and index.
Pairwise: In pairwise matching, you have 2 records (or even just 2 names) that you are comparing directly to one another. This comparison results in a single “similarity score” that reflects how similar the records are. There are many use cases where pairwise matching is all that is needed. For example, you already have a common key between 2 records and you simply want to check to see if the name fields are acceptably similar. This is common in identity verification applications.
Index: With index matching, you have a single record that you are comparing to a list of records. This usage pattern can be thought of as a search problem. You have a record (or maybe just the name of an entity) and want to search a large list of records to find a ‘match’. As in typical text search applications, the result of the search is a list of possible matches sorted by relevance, or in the case of fuzzy matching, record similarity. This is the case primarily discussed in this document.
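To make the two usage patterns concrete, here is a toy sketch. It uses Python's difflib as a stand-in scorer and a brute-force search; this is not RNI's API or its algorithms, only an illustration of the two call shapes.

```python
# Toy illustration of the two usage patterns. difflib is a stand-in for
# RNI's scoring, and the brute-force search over the whole list is exactly
# what RNI's two-pass approach (described below) is designed to avoid.
from difflib import SequenceMatcher

def score_pair(name1: str, name2: str) -> float:
    """Pairwise matching: a single similarity score for two names."""
    return SequenceMatcher(None, name1.lower(), name2.lower()).ratio()

def search_index(query: str, indexed_names: list[str]) -> list[tuple[str, float]]:
    """Index matching: compare one query against a list of records and
    return possible matches sorted by similarity."""
    scored = [(name, score_pair(query, name)) for name in indexed_names]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(score_pair("William M Smith", "Smyth Bill Michael"))
print(search_index("William Smith", ["Bill Smith", "Jane Doe", "Wm. Smythe"]))
```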
Your business sets the priorities for optimizing search to find the right balance between accuracy and speed. You also determine which is more critical to your business: catching more potential matches at the expense of returning false matches, or reducing false matches at the increased risk of missing a potential match. RNI's patented two-step approach gives users the flexibility to make these critical decisions.
Overview of Name Matching with RNI
1. Create an index containing the names you will be searching against. High-recall keys are generated from the original name and stored as the data/records are indexed.
2. Search against the index using a two-pass approach:
   - Generate Candidates: The first pass is designed to quickly generate a set of candidates for the second pass to consider.
   - Pairwise Match: The second pass compares every value returned by the first pass against the value in the query and computes a similarity score. Multiple scorers are applied in the second pass to generate the best possible score.
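The skeleton below sketches how these steps fit together. All function names and internals are hypothetical stand-ins (the real key generation and scoring steps are sketched under the first- and second-pass sections that follow); it only illustrates the structure of the flow.

```python
# Structural sketch only; names and internals are hypothetical stand-ins,
# not RNI's actual API or algorithms.

def generate_recall_keys(record: str) -> set[str]:
    """Stand-in for key generation (sketched under 'First Pass' below)."""
    return {record.lower()}

def pairwise_score(query: str, record: str) -> float:
    """Stand-in for the precision scorer (sketched under 'Second Pass' below)."""
    return 1.0 if query.lower() == record.lower() else 0.0

def build_index(records: list[str]) -> list[tuple[str, set[str]]]:
    """Index time: store each record together with its high-recall keys."""
    return [(record, generate_recall_keys(record)) for record in records]

def search(query: str, index: list[tuple[str, set[str]]], window_size: int):
    # Pass 1: cheap key comparison selects at most window_size candidates.
    query_keys = generate_recall_keys(query)
    candidates = [rec for rec, keys in index if keys & query_keys][:window_size]
    # Pass 2: each candidate is compared pairwise against the query and the
    # results are returned sorted by similarity, highest first.
    scored = [(rec, pairwise_score(query, rec)) for rec in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

index = build_index(["Bill Smith", "Jane Doe", "William Smith"])
print(search("william smith", index, window_size=5))
```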
First Pass - Generate Candidates
The first pass takes the name you want to search with and again generates a list of recall keys. These keys are compared against the indexed keys. This approach casts a wide net that reduces false negatives (missed matches) regardless of variation in the name (misspelling, language, etc.). The maximum number of candidates to send to the second pass is referred to as the windowSize. This step creates the list of records to be scored by RNI.
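RNI's actual recall keys are proprietary and far more sophisticated; the sketch below uses a crude stand-in key (lowercase, drop vowels, collapse repeated letters) purely to show how key overlap plus a windowSize cap yields the candidate list.

```python
# Illustrative only: a crude stand-in for high-recall key generation.
import re

def recall_key(token: str) -> str:
    """Lowercase, drop vowels after the first letter, collapse repeats.
    'smith' and 'smyth' both become 'smth'."""
    t = token.lower()
    t = t[0] + re.sub(r"[aeiouy]", "", t[1:])
    return re.sub(r"(.)\1+", r"\1", t)

def name_keys(name: str) -> set[str]:
    return {recall_key(tok) for tok in name.split()}

def generate_candidates(query: str, indexed_names: list[str], window_size: int):
    """Pass 1: keep any indexed name sharing at least one key with the query,
    ranked by key overlap and capped at window_size."""
    query_keys = name_keys(query)
    overlaps = [(name, len(query_keys & name_keys(name))) for name in indexed_names]
    hits = [(name, n) for name, n in overlaps if n > 0]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in hits[:window_size]]

print(generate_candidates("William M Smith",
                          ["Bill Smyth", "Smith William", "Jane Doe"], 10))
```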
Second Pass - Pairwise Match
The second pass acts as the precision filter which takes the original search record and compares it to each of the recall candidates from the first step. It calculates a similarity score for each name pair using RNI’s algorithms and rules. The scores are then sorted into descending order and returned.
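As a stand-in for this step (RNI applies multiple, language-aware scorers and rules), the sketch below rescores the pass-one candidates with a simple character-bigram similarity and sorts the results in descending order.

```python
# Illustrative only: a simple precision-style rescoring of pass-one candidates.

def bigrams(text: str) -> set[str]:
    t = text.lower().replace(" ", "")
    return {t[i:i + 2] for i in range(len(t) - 1)}

def precision_score(name1: str, name2: str) -> float:
    """Stand-in precision scorer: Jaccard similarity of character bigrams.
    RNI instead applies several modular, language-aware scorers per token."""
    a, b = bigrams(name1), bigrams(name2)
    return len(a & b) / len(a | b) if a | b else 0.0

def second_pass(query: str, candidates: list[str]) -> list[tuple[str, float]]:
    scored = [(name, precision_score(query, name)) for name in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(second_pass("William M Smith", ["Smith William", "Bill Smyth"]))
```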
The first pass gives the system the speed necessary for high-transaction environments, quickly eliminating most values in the index from consideration. The slower second pass re-compares each selected value directly in its original script, using enhanced scoring algorithms.
The relationship between accuracy and speed
Within RNI there is a relationship between accuracy and speed. The windowSize set on the first step of the search query will largely determine not just how accurate the search is, but also how fast it is. If you set your window size too small, you might miss records you intend to match on, increasing your false negatives. Increasing the window size will likely reduce false negatives, as represented in the following chart.
But if you set your window size too large, you will see an impact on performance. A smaller window size results in a smaller list of candidates to consider, improving performance. A larger window size results in a larger candidate pool that requires computation, decreasing performance, as represented in the following chart.
Since windowSize is tied to both accuracy and performance, determining the best value for windowSize is a critical component when performing an evaluation.
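One practical way to choose the value is to sweep windowSize over your evaluation set and record how recall and query time change. The sketch below assumes you already have a labelled truth set and a search function to evaluate; all names here are illustrative.

```python
# Illustrative evaluation harness: sweep windowSize and report recall and time.
import time

def evaluate_window_sizes(queries, truth, search_fn, window_sizes):
    """For each windowSize setting, measure how often the known true match is
    returned and how long the queries take. 'search_fn(query, window_size)'
    is whatever search you are evaluating."""
    for window_size in window_sizes:
        found, start = 0, time.perf_counter()
        for query in queries:
            results = search_fn(query, window_size)   # list of matched names
            if truth[query] in results:
                found += 1
        elapsed = time.perf_counter() - start
        print(f"windowSize={window_size:4d}  "
              f"recall={found / len(queries):.2f}  "
              f"time={elapsed:.3f}s")

# Hypothetical usage:
# evaluate_window_sizes(queries, truth, my_search, [10, 50, 200, 1000])
```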
Understanding match scores
Newcomers to RNI are often confused about why some name pairs generate the results they do. Understanding what goes into an RNI score and how to decode it is a key step in executing an effective evaluation. Rosette Match Studio (RMS) is an interactive tool for evaluating and optimizing name matching with RNI by providing insight into the algorithms and parameters used. RMS provides an easy-to-use UI and displays the details of how an RNI score was calculated. Let's walk through an example of an RNI score using the following:
In this example we compare ‘William M Smith’ to ‘Smyth Bill Michael’. The RNI result of these two names is shown below.
Here RMS shows how we arrived at the score of 0.754. There are several things working together to arrive at this score. Let’s review each step of the process.
Tokenization, normalization, and transliteration: Before any matching algorithms can be run, each name is transformed into individual tokens that can be compared. This step includes removing stop words, such as ‘Mr.’, ‘Senator’, or ‘General’, and transliterating names into English. This step is more complicated when you are using languages that do not use white space. Your first step should be to determine if tokenization is being done correctly. In this case it is. Also, look to see if the correct stop words have been removed.
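As a toy illustration of this step only (RNI's tokenization, stop-word handling, and transliteration are language aware and far more sophisticated), normalization of a simple English name might look like this; the stop-word list below is an assumption.

```python
# Toy normalization: lowercase, strip punctuation, drop a few English
# honorifics. RNI's real tokenization handles many languages and scripts,
# including those that do not use white space.
import re

STOP_WORDS = {"mr", "mrs", "ms", "dr", "senator", "general"}  # illustrative list

def tokenize(name: str) -> list[str]:
    tokens = re.findall(r"[^\W\d_]+", name.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(tokenize("Mr. William M. Smith"))   # ['william', 'm', 'smith']
print(tokenize("Smyth, Bill Michael"))    # ['smyth', 'bill', 'michael']
```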
Token alignment: Next, every token in Name 1 is matched and scored against every token in Name 2 to find the matching pairs that produce the highest total score. All candidate token pairs are scored to determine the best match alignment. Token scorers are modular, allowing a different scorer to be chosen for each pair. RNI arranges the tokens from the names into a grid. The aligned tokens are shown in yellow. This is referred to as the token alignment matrix.
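Conceptually, finding the alignment is an assignment problem: choose the one-to-one token pairing that maximizes the total score. The brute-force sketch below, with a made-up token scorer, is only meant to illustrate the idea behind the token alignment matrix, not how RNI actually solves it.

```python
# Illustrative only: exhaustive search for the best one-to-one token pairing.
from itertools import permutations
from difflib import SequenceMatcher

def token_score(a: str, b: str) -> float:
    """Made-up token scorer; RNI chooses from several modular scorers per pair."""
    return SequenceMatcher(None, a, b).ratio()

def best_alignment(tokens1: list[str], tokens2: list[str]):
    """Try every pairing and keep the one with the highest total score.
    Assumes tokens1 is not longer than tokens2; brute force is fine for a demo,
    but RNI aligns tokens far more efficiently."""
    best, best_pairs = -1.0, []
    for perm in permutations(range(len(tokens2)), len(tokens1)):
        pairs = [(tokens1[i], tokens2[j], token_score(tokens1[i], tokens2[j]))
                 for i, j in enumerate(perm)]
        total = sum(score for _, _, score in pairs)
        if total > best:
            best, best_pairs = total, pairs
    return best_pairs

for t1, t2, score in best_alignment(["william", "m", "smith"],
                                    ["smyth", "bill", "michael"]):
    print(f"{t1:8s} <-> {t2:8s} {score:.3f}")
```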
Weighted score: In addition to the token match score, the score displayed takes into consideration the order the tokens are in, the uniqueness of the given token, and the algorithm triggered within RNI. In each yellow cell of the grid you see the rule fired in bold print. In the case of the tokens ‘smith’ and ‘smyth’, the rule fired is ‘HMM_MATCH’, which generated a score of 0.485. Underneath the tokens themselves you will see how much weight each token has in the scoring process. This grid explains which name tokens were used in the scoring, what was used to score them, and the individual scores of the tokens.
Final scoring and bias adjustment: At the end, all selected tokens and token-pairs are considered together and some final adjustments are made to the score. For example, a penalty is applied if there is a gender mismatch between tokens. Adjustments are often language dependent. The scores are combined to form a final score and are adjusted by a final bias. Final bias is a value added to the final score to help names in different languages result in a similar score.
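Very loosely, this last step can be pictured as a combination plus adjustments. The penalty and bias values below are invented for illustration and are not RNI's real values or rules.

```python
# Illustrative only: combine, penalize, and bias-adjust. All numbers invented.

def final_score(weighted_token_score: float,
                gender_mismatch: bool,
                final_bias: float) -> float:
    """Apply a penalty for cross-token inconsistencies (e.g. a gender
    mismatch), then add a language-dependent final bias so that scores are
    comparable across languages. Values here are hypothetical."""
    score = weighted_token_score
    if gender_mismatch:
        score -= 0.1          # hypothetical penalty
    return min(1.0, max(0.0, score + final_bias))

print(final_score(0.70, gender_mismatch=False, final_bias=0.02))
```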
Additional information: RMS provides more detailed match information by checking the Explanation checkbox below the token alignment matrix. This data is also provided in the response of a match query. This information contains low level details of the various tokens and how they were assigned. It will also contain key information regarding the language, script, and language of origin of a name. When dealing with multilingual names this is another area that is critical to understand. Getting the language and script correct is vital to the correctness of a score.
Understanding the token alignment matrix is the key to understanding the RNI similarity score. It allows you to effectively conduct an error analysis for a given name pair. It is important to observe and record how each token was scored. This information can be thought of as error classes and will be important later when you configure the control parameters.
Individual name tokens are scored by a number of algorithms or rules. These algorithms and rules can be manipulated by setting configuration parameters, changing the final RNI similarity score. There are more than 100 RNI configuration parameters. The RNI Application Developer’s Guide describes these parameters in greater detail, along with how to modify them. In RMS, you can adjust a subset of the most commonly modified parameters to see how they change the match scores and to optimize for your specific use cases. In Parameter configuration, we discuss techniques for setting parameters to configure your RNI installation.