There are many ways to configure RNI to better fit your use case and data. The two primary mechanisms are by modifying match parameters and editing overrides. You can also train a custom language model.
The default values of the RNI match parameters are tuned to perform well on most queries and datasets. However, every use case uses different data with distinct match requirements. You can modify match parameters to optimize match results for your data and business case.
The typical process for tuning parameters is as follows:
Gather a list of names to index and queries to run against them to use as a set of test data. Ideally the test data set should be big enough to reflect the diversity in your real data with at least 100 queries.
After indexing the data, run the queries using RNI and determine a match score threshold that appears to provide the best results.
Analyze the results to discover cases that RNI failed to score high enough or that RNI incorrectly scored higher than the threshold.
Choose a subset of these name pairs that RNI scored too low or too high that will be used as examples to tune your parameters.
Tune the match parameters to change the match scores of the test set of undesirable results, so that the score is correctly above or below your threshold. For name or address pairs that have to match in a specific way and are very dissimilar (eg. aliases), we recommend you add them as token or full-name overrides.
Run the large set of queries through RNI again to test that the new parameter values still return the desired matches, and not new undesired results.
Parameter Configuration Files
Individual name tokens are scored by a number of algorithms or rules. These algorithms can be optimized by modifying configuration parameters, thus changing the final similarity score.
The parameter files are contained in two .yaml files located in
BT_ROOT/rlpnc/data/etc. The parameters are defined in
parameter_defs.yaml and modified in
parameter_defs.yaml lists each match parameter along with the default value and a description. Each parameter may also have a minimum and maximum value, which is the system limit and could cause an error if exceeded. A parameter may also have a recommended minimum (
sane_minimum) and recommended maximum (
sane_maximum) value, which we advise you do not exceed.
parameter_profiles.yaml is where you change parameter values based on the language pairs in the match.
Do not modify the
parameter_defs.yaml file. All changes should be made in the
Do refer to the
parameter_defs.yaml file for definitions and usage of all available parameters.
The parameters in the
parameter_profiles.yaml file are organized by parameter profiles. Each profile contains parameter values for a specific language pair. For example, matching "Susie Johnson" and "Susanne Johnson" will use the
eng_eng profile. There is also an
any profile which applies to all language pairs. The
Parameter profiles have the following characteristics:
Parameter profile names are formed from the language pairs they apply to. The 3 letter language codes are always written in alphabetical order, except for English (
eng), which always comes last. The two languages can be the same. Examples:
They can include the entity type being matched, such as
eng_eng_PERSON. The parameter values in this profile will only be used when matching English names with English names, where the entity type is specified as PERSON.
Parameter profiles can inherit mappings from other parameter profiles. The global
any profile applies to all languages; all profiles inherit its values.
any profile can include an entity type;
any_PERSON applies to all PERSON matches regardless of language.
Specific language profiles inherit values from global profiles. The profile matching person names is named
any_PERSON. The profile for matching Spanish person against English person names is named
spa_eng_PERSON. It inherits parameter values from the
spa_eng profile and the
any_PERSON profile. The
any_PERSON profile will not override parameter values from more specific profiles, such as the
Global changes are made with the
Any changes to address parameters should go under the
any profile, and will affect all fields for all addresses.
Modifying Name Parameters
To start tuning the parameters, run the RNI pairwise match on the test set and look at the match reasons in the response. These match reasons will serve as a guide for which parameters to tune, which are defined in
parameter_defs.yaml. For additional support on tuning the parameters, contact email@example.com.
An example parameter to tune is
reorderPenalty, which penalizes tokens that are out of order. For some use cases, you may decide the token order should not significantly affect whether two names are a match. This may be because the data combines given-name-first and last-name-first data sources (i.e. "John Kennedy" and "Kennedy John" are both expected to be present). To adjust this parameter, find an existing parameter profile or define a new one, add the parameter and modify the value. Decreasing the parameter value, the
reorderPenalty will cause the out-of-order tokens to have a smaller effect on the match score.
Another example parameter is
deletionScore, which increases the match score when part of the name is missing. For example, "John Fitzgerald Kennedy" and "John Kennedy" have a higher match score if you increase the
deletionScore parameter value.
Once you define a profile and set a parameter value, rerun the RNI pairwise match, scoring the match with the edited
Commonly Modified Name Parameters
Given the large number of configurable name match parameters in RNI, you should start by looking at the impact of modifying a small number of parameters. The complete definition of all available parameters is found in the
Listed here are some of the more commonly modififed parameters.
Token Conflict Score (
conflictScore). This parameter controls how paired unmatched tokens get scored. Let’s look at the two names: ‘John Mike Smith’ and ‘John Joe Smith’. ‘John’ from the first and second name will be matched as well the token ‘Smith’ from each name. This leaves unmatched tokens ‘Mike’ and ‘Joe’. These two tokens are in direct conflict with each other and users can determine how it is scored. A value closer to 1.0 will treat ‘Mike’ and ‘Joe’ as equal. A value closer to 0.0 will have the opposite effect. This parameter is important when you decide names that have tokens that are dissimilar should have lower final scores. Or you may decide that if two of the tokens are the same, the third token (middle name?) is not as important.
Initials Score (
initialsScore). This parameter controls how tokens that represent initials get scored. Consider the following two names: 'John Mike Smith' and 'John M Smith'. 'Mike' and 'M' trigger an initial match. You can control how this gets scored. A value closer to 1.0 will treat ‘Mike’ and ‘M’ as equal and increase the overall match score. A value closer to 0.0 will have the opposite effect. This parameter is important when you know there is a lot of initialism in your data sets.
Token Deletion Score (
deletionScore). This parameter controls how unpaired name tokens get scored. Consider the following two names: ‘John Mike Smith’ and ‘John Smith’. The name token ‘Mike’ is left unpaired with a token from the second name. In this example a value closer to 1.0 will not penalize the missing token. A value closer to 0.0 will have the opposite effect. This parameter is important when you have a lot of variation of token length in your name set.
Token Reorder Penalty (
reorderPenalty). This parameter is applied when tokens match but are in different positions in the two names. Consider the following two names: ‘John Mike Smith’, and ‘John Smith Mike’. This parameter will control the extent to which the token ordering ( ‘Mike Smith’ vs. ‘Smith Mike’) decreases the final match score. A value closer to 1.0 will penalize the final score, driving it lower. A value closer to 0.0 will not penalize the order. This parameter is important when the order of tokens in the name is known. If you know that all your name data stores last name in the last token position, you may want to penalize token reordering more by increasing the penalty. If your data is not well-structured, with some last names first but not all, you may want to lower the penalty.
Right End Boost/Both Ends Boost (
boostWeightAtBothEndsboost). These parameters boost the weights of tokens in the first and/or last position of a name. These parameters are useful when dealing with English names, and you are confident of the placement of the surname. Consider the following two names: “John Mike Smith’ and ‘John Jay M Smith’. By boosting both ends you effectively give more weight to the ‘John’ and ‘Smith’ tokens. This parameter is important when you have several tokens in a name and are confident that the first and last token are the more important tokens.
Evaluating Parameter Configuration
To evaluate the newly tuned parameter values, query a large dataset of names or addresses that does not include your test set. For an exact evaluation, query an annotated dataset that includes the correct answers for a number of queries. For a general evaluation, measure the number of pair matches that have scores above your threshold, compared to before tuning the parameter values. If there were too many matches before, now there should be fewer matches. If there were too few matches before, there should be more now. If the number of matches increases or decreases dramatically, then there is a higher chance of missing correct matches below the threshold or including incorrect matches above the threshold.
If you find new pair matches that you want to score above or below your threshold, collect them into a test set to retune the parameters. Then evaluate the parameters again using a large dataset to review results. It is important to frequently evaluate new parameter settings on separate test data to ensure the parameters continue to return correct results.
Configuring Name Overrides
RNI includes override files (UTF-8 encoded) to improve name matching. There are different types of override files:
Stop patterns and stop word prefixes designate name elements to strip during indexing and queries, and before running any matching algorithms.
Name pair matches specify scores to be assigned for specified full-name pairs.
Token pair overrides specify name token pairs that match along with a match score.
Token normalization files specify the normalized form for tokens and variants to normalize to that form.
Low weight tokens specify parts of names (such as suffixes) that don't contribute much to name matching accuracy.
The name matching override files are in the
You can modify these files and add additional files in the same subdirectory to extend coverage to additional supported languages. You can also create files that only apply to a specified entity type, such as PERSON.
com.basistech.rni.index.RNIConfiguration provides methods that you can use to define your own override tables (character streams) in place of the tables in the default directory. See the HTML API documentation for
Stop Patterns and Stop Word Prefixes
Before running any matching algorithms, the names are transformed into tokens that can be compared. RNI uses stop patterns and stop word prefixes to remove patterns, including titles such as Mr., Senator, or General, that you do not want to include in name matching. Both stop patterns and stop word prefixes are used to strip matching name elements during indexing and querying. Stop words are string literals and are processed much more quickly than stop patterns, which are regular expressions. You should use stop words for the most efficient removal of prefixes, such as titles. Stop words are language-dependent.
For each name, RNI performs the following steps in order:
Character-level normalization, stripping punctuation (except for periods, commas, and hyphens). White space is reduced to single spaces and all characters are lower-cased. Diacritical marks are removed.
Stop patterns are applied.
Stop words are applied.
RNI cycles its way through the stop patterns then the stop words, each cycle removing the patterns and words that strip nothing, until the list of stop patterns and stop words is empty.
A stop pattern is a regular expression that excludes matching name elements during indexing and queries. You can use any regular expression supported by the Java 1.8
java.util.regex.Pattern; see the Javadoc for detailed documentation.
Stop patterns for a given language are specified in a UTF-8 file with the ISO 639-3 three-letter language code in the filename:
where LANG is a three-letter language code.
Each row in the file, except for rows that begin with
# is a regular expression. Leading and trailing whitespace is removed from regex lines, so use
\s at the beginning and end as needed.
Include _TYPE, where TYPE designates an entity type, such as PERSON if you want the override to apply only if the name, matching names, or matching tokens have been assigned this entity type. If the filename does not include _TYPE, it will be applied to all names, regardless of the entity type.
Name elements matching any of these regular expressions are removed. Longer stop patterns are applied before shorter stop patterns, so the presence of a shorter stop pattern does not prevent the stripping of a longer pattern that includes the shorter pattern. For example, the
brigadier[-]general stop pattern is applied first, but
general is also a stop pattern and will be applied as well.
RNI includes files with stop patterns for names in English (generic and ORGANIZATION), Japanese (PERSON), Spanish (generic), and Chinese (PERSON). These files are in
BT_ROOT /rlpnc/data/rnm/ref/override. The generic (non-entity-specific) English file is
stopregexes_eng.txt. For example, the entries
indicate that the common indicators for first-name-unknown and last-name-unknown followed by nothing are to be removed.
You can also specify which field the regex is to be applied to when processing a fielded name. Simply add Tab
n is the field number. To search multiple fields, include an entry for each field, as illustrated below. When processing a name without fields, the field parameter is ignored. For example,
indicates that the regex is to be applied to fields 2 and 3 in fielded names.
You can modify the contents of this file. To add stop patterns for a different language, create an additional UTF-8 file in the same subdirectory with the three-letter language code in the filename. For example,
stopregexes_ara.txt would include regular expressions with Arabic text;
stopregexes_eng_PERSON.txt would include regular expression to remove elements from PERSON names in English text.
Use of complex patterns may increase processing time. When possible, use stop word prefixes.
A stop word prefix is a string literal that strips the matching prefix from name elements during indexing and querying.
Stop word prefixes for a given language are specified in a UTF-8 file with the ISO 639-3 three-letter language code in the filename:
where LANG is a three-letter language code. Each row in the file, except for rows that begin with
#, is a string literal. Prefixes matching any of these string literals are removed.
Like stop patterns, longer stop word prefixes take precedence over shorter prefixes contained within the longer stop word. For example, the
lieutenant colonel stop word prefix is applied where applicable when
colonel is also a stop word prefix.
RNI includes files with generic stop word prefixes for names in Arabic, English, Greek, Hungarian, Spanish, and Thai. These files are in
stopprefixes_spa.txt. You can modify the contents of these files. To add stop word prefixes for another language, create a UTF-8 file in the same directory with the three-letter language code in the filename. For example,
stopprefixes_rus.txt would include stop word prefixes for use with Russian text.
Overriding Name Pair Matches
You can create UTF-8 text files that specify the scores to be assigned for specified full-name pairs. The filename uses the ISO 639-3 three-letter language codes to designate the language of each full name in each of the full-name pairs:
where LANG1 is the three-letter language code for the first name and LANG2 is the three letter language code for the second name.
Include _TYPE, where TYPE designates an entity type, such as PERSON if you want the override to apply only if the name (for stop patterns), matching names, or matching tokens have been assigned this entity type. If the filename does not include _TYPE, it will be applied to all names, regardless of the entity type.
Each row in the file, except for rows that begin with
#, is a tab-delimited full-name pair and score:
name1 Tab name2 Tab score
The scores must be between 0 and 1.0, where 0 indicates no match, and 1.0 indicates a perfect match.
Since the minimum score for names returned by RNI queries must be greater than 0, an RNI query will not return the name if the override score is 0. Name match operations, on the other hand, will return an override score of 0.
The installation includes a sample file with sample entries commented out:
BT_ROOT/rlpnc/data/rnm/ref/override/fullnames_eng_eng.txt. Any non-commented-out entries in this file assign scores to English queries applied to English names in an RNI index. For example,
John Doe Joe Bloggs 1.0
indicates that the query name
John Doe matches the index name
Joe Bloggs (both used in different regions to indicate 'person unknown') with a score of 1.0.
These match patterns are commutative. The previous entry also specifies a match score of 1.0 if the query name is
Joe Bloggs and the index includes a document with an
rni_name field containing
You can add entries for English to English name matches to
fullnames_eng_eng.txt, and create additional override files, using the filename to specify the languages. For example the following entries could appear in
外山恒 Toyama Koichi 1.0
ヒラリークリントン Hillary Clinton 1.0
Overriding Token Pair Matches
You can create text files that specify token (name-element) pairs that match. Token pair overrides are supported for English-English, Japanese-English, Chinese-English, Russian-English, Spanish-English, Japanese-Japanese, Russian-Russian, English-Korean, Korean-Korean, Spanish-Spanish, Greek-English and Hungarian-English token pairs. Such pairs may include proper name and nickname, such as Peter and Pete, and cognate names such as Peter and Pedro. Tokens cannot contain whitespace. When RNI evaluates two names, each of which contains an element from the pair, it enhances the value of the resulting name match score. For example, if
Abby constitute a token pair, then the match score for
Abigail Harris and
Abby Harris will be higher than it would be if the token pair had not been specified.
The token pairs may be within a language or cross-lingual, as indicated by the file name:
where LANG1 is the three-letter language code for the first token in each pair and LANG2 is the three-letter language code for the second token in each pair. Each entry in the file, except for rows that begin with
#, is a tab-delimited token pair and may include a raw score between 0.0 and 1.0 or an indicator that at least one of the tokens is a nickname or that the tokens are cognates:
Token1 Tab Token2 Tab [[0.0-1.0]|NICKNAME|COGNATE|VARIANT]
A token pair override score (raw score or indicator) serves as a minimum score, but you can write "/force" after a token score to force it to be exactly that value:
Token1 Tab Token2 Tab [([0.0-1.0]|NICKNAME|COGNATE|VARIANT)/force]
If you would like to prevent a token pair from matching, you can use the SUPPRESS indicator as an alias for "0.0/force". If you do not include NICKNAME, COGNATE, VARIANT, or SUPPRESS, RNI assumes NICKNAME.
BT_ROOT/rlpnc/data/rnm/ref/override/tokens_eng_eng.txt, which contains a list of English/English token pairs. For example:
Peter Pete NICKNAME
Peter Pedro COGNATE
This directory also contains Chinese to English token overrides for LOCATION and ORGANIZATION:
When you create an additional file in the same location, use the ISO 639-3 three-letter language name in the filename to identify the language of each name element in the pair. For example
tokens_eng_eng.txt indicates that the contents match English names to English names;
tokens_eng_eng_ORGANIZATION.txt indicates that the contents match English ORGANIZATION names to English ORGANIZATION names. The SDK includes a sample file for matching English/English tokens in LOCATION entities:
We recommend that you enter the language names in alphabetical order in the filename and token pairs. Keep in mind that the order has no influence on the resulting score, since the scoring is commutative.
Normalizing Token Variants
You can create text files that specify the normalized form for tokens (name elements) and variants to normalize to that form. The file name indicates the language and optionally the entity type for the tokens to be normalized:
equivalenceclasses_jpn.txt would contain entries for normalizing Japanese token variants for any entity type to a normalized form.
Each entry in the file contains a normalized form followed by one or more variant forms. The syntax is as follows:
BT_ROOT/rlpnc/data/rnm/ref/override/equivalenceclasses_eng_PERSON.txt, which contains a list of variant renderings to normalize to
You can add lists of variants to this file, including the normalized form in square brackets to start each list.
You can edit the list of tokens that are given low influence in RNI. These low weight tokens are parts of a name (such as suffixes) that don't contribute much to the name matching accuracy.
The file name is
BT_ROOT/rlpnc/data/rnm/ref/lowWeightTokens_eng.txt contains entries for tokens in English that you may want to put less emphasis on: "jr", "sr", "ii", "iii", "iv", "de".
Custom Language Model Training
You can train a language model on your own name data. RNI uses language models in which common names score differently than rare names. For example, "John Jingleheimer" should match "Jingleheimer" better than "John", because Jingleheimer is a rarer name than John. RNI already comes with language models for many supported languages, but you might find it best to train a new language model so that it reflects the statistics of your data. Please note that a large amount of full names are required to train an effective language model.
frequencyModelTrainer.zip to any desired location. Ensure that the
JAVA_HOME environment variable is set and points to a Java version of 1.8 or higher.
Simple usage example
bin/buildLM.sh -root rni-rnt -in eng_PER_LM.tsv
-lang eng -script Latn
frequencyModelTrainer.zip for more details including the full description of arguments.