There are many ways to configure RNI to better fit your use case and data. The two primary mechanisms are modifying match parameters and editing overrides. You can also train a custom language model.
The default values of the RNI match parameters are tuned to perform well on most queries and datasets. However, every use case uses different data with distinct match requirements. You can modify match parameters to optimize match results for your data and business case.
The typical process for tuning parameters is as follows:
Gather a list of names to index and queries to run against them to use as a set of test data. Ideally, the test data set should be large enough to reflect the diversity of your real data and should contain at least 100 queries.
After indexing the data, run the queries using RNI and determine a match score threshold that appears to provide the best results.
Analyze the results to discover cases that RNI failed to score high enough or that RNI incorrectly scored higher than the threshold.
Choose a subset of these name pairs that RNI scored too low or too high that will be used as examples to tune your parameters.
Tune the match parameters to change the match scores of the undesirable results in the test set, so that each score falls on the correct side of your threshold. For name or address pairs that have to match in a specific way and are very dissimilar (e.g., aliases), we recommend you add them as token or full-name overrides.
Run the large set of queries through RNI again to test that the new parameter values still return the desired matches, and not new undesired results.
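The analysis step in this process can be sketched in a few lines of Python. This is a minimal illustration that assumes you have already collected match scores and your own judgment for each pair; the ScoredPair type, the find_tuning_candidates function, and the sample scores are hypothetical and are not part of RNI.

```python
from typing import NamedTuple


class ScoredPair(NamedTuple):
    query: str
    candidate: str
    score: float        # match score returned for the pair
    should_match: bool  # your own judgment for the pair


def find_tuning_candidates(pairs, threshold):
    """Return (too_low, too_high): pairs scored on the wrong side of the threshold."""
    too_low = [p for p in pairs if p.should_match and p.score < threshold]
    too_high = [p for p in pairs if not p.should_match and p.score >= threshold]
    return too_low, too_high


# Hypothetical scores, for illustration only.
pairs = [
    ScoredPair("John Smith", "Jon Smith", 0.91, True),
    ScoredPair("John Smith", "J. Smyth", 0.74, True),
    ScoredPair("John Smith", "Joan Smithers", 0.83, False),
]
too_low, too_high = find_tuning_candidates(pairs, threshold=0.80)
```

The two returned lists are the candidates for the smaller tuning subset; the full list of pairs is what you rerun after changing parameter values.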
Parameter Configuration Files
Individual name tokens are scored by a number of algorithms or rules. These algorithms can be optimized by modifying configuration parameters, thus changing the final similarity score.
The parameters are stored in two .yaml files located in
$BT_ROOT/rlpnc/data/etc. The parameters are defined in
parameter_defs.yaml and modified in
parameter_defs.yaml lists each match parameter along with the default value and a description. Each parameter may also have a minimum and maximum value. These are system limits, and a value outside them can cause an error. A parameter may also have a recommended minimum (sane_minimum) and recommended maximum (sane_maximum) value, which we advise you not to exceed.
parameter_profiles.yaml is where you change parameter values based on the language pairs in the match.
Do not modify the
parameter_defs.yaml file. All changes should be made in the
Do refer to the
parameter_defs.yaml file for definitions and usage of all available parameters.
The parameters in the
parameter_profiles.yaml file are organized by parameter profiles. Each profile contains parameter values for a specific language pair. For example, matching "Susie Johnson" and "Susanne Johnson" will use the
eng_eng profile. There is also an
any profile which applies to all language pairs.
Parameter profiles have the following characteristics:
Parameter profile names are formed from the language pairs they apply to. The three-letter language codes are always written in alphabetical order, except for English (
eng), which always comes last. The two languages can be the same. Examples:
They can include the entity type being matched, such as
eng_eng_PERSON. The parameter values in this profile will only be used when matching English names with English names, where the entity type is specified as PERSON. Any entity type listed in the table can be used.
Parameter profiles can inherit mappings from other parameter profiles. The global
any profile applies to all languages; all profiles inherit its values.
The any profile can include an entity type;
any_PERSON applies to all PERSON matches regardless of language.
Specific language profiles inherit values from global profiles. The profile for matching person names is named
any_PERSON. The profile for matching Spanish person names against English person names is named
spa_eng_PERSON. It inherits parameter values from the
spa_eng profile and the
any_PERSON profile. The
any_PERSON profile will not override parameter values from more specific profiles, such as the
Global changes are made with the
Any changes to address parameters should go under the
any profile, and will affect all fields for all addresses.
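The naming rule for parameter profiles (three-letter codes in alphabetical order, with eng always last, plus an optional entity type) can be sketched as a small helper. The function name profile_name is hypothetical and is not part of RNI; it only illustrates how a profile name is assembled.

```python
def profile_name(lang_a, lang_b, entity_type=None):
    """Build a profile name such as 'spa_eng' or 'eng_eng_PERSON'."""
    # Sort alphabetically, but push 'eng' to the end regardless of order.
    codes = sorted(
        [lang_a.lower(), lang_b.lower()],
        key=lambda code: (code == "eng", code),
    )
    name = "_".join(codes)
    return f"{name}_{entity_type}" if entity_type else name


print(profile_name("eng", "spa"))             # spa_eng
print(profile_name("eng", "eng", "PERSON"))   # eng_eng_PERSON
```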
A parameter universe is a named profile containing a set of RNI parameter profiles with values. Each universe has a name and can contain multiple parameter profiles, including the global
any profile. A parameter universe profile can also include the entity type being matched, just like regular parameter profiles. Examples:
For example, the MyParameterUniverse universe may include the following parameter profiles:
"name": "MyParameterUniverse/any" applies to all language pairs.
"name": "MyParameterUniverse/spa_eng" applies to English - Spanish name pairs.
"name": "MyParameterUniverse/spa_eng_PERSON" applies to all PERSON English - Spanish name pairs.
Each parameter in the profile must match the name of a parameter declared in the
parameter_defs.yaml file, along with a value. Parameter universes are added to the
You can define multiple named parameter profiles.
Define the parameter universe in the
parameter_profiles.yaml file. Example:
Modifying Name Parameters
To start tuning the parameters, run the RNI pairwise match on the test set and look at the match reasons in the response. These match reasons will serve as a guide for which parameters to tune, which are defined in
parameter_defs.yaml. For additional support on tuning the parameters, contact firstname.lastname@example.org.
Once you define a profile and set a parameter value, rerun the RNI pairwise match, scoring the match with the edited
Given the large number of configurable name match parameters in RNI, you should start by looking at the impact of modifying a small number of parameters. The complete definition of all available parameters is found in the
The following examples describe the impact of parameter changes in more detail.
Example 1. Token Conflict Score
Consider the two names ‘John Mike Smith’ and ‘John Joe Smith’. The ‘John’ tokens match each other, as do the ‘Smith’ tokens. This leaves the unmatched tokens ‘Mike’ and ‘Joe’, which are in direct conflict with each other. This parameter determines how that conflict is scored. A value closer to 1.0 will treat ‘Mike’ and ‘Joe’ as equal. A value closer to 0.0 will have the opposite effect. This parameter is important when you decide that names with dissimilar tokens should have lower final scores, or that when two of the tokens are the same, the third token (such as a middle name) is not as important.
Example 2. Initials Score (
Consider the following two names: ‘John Mike Smith’ and ‘John M Smith’. ‘Mike’ and ‘M’ trigger an initial match. You can control how this gets scored. A value closer to 1.0 will treat ‘Mike’ and ‘M’ as equal and increase the overall match score. A value closer to 0.0 will have the opposite effect. This parameter is important when you know your data sets contain many initials.
Example 3. Token Deletion Score (
Consider the following two names: ‘John Mike Smith’ and ‘John Smith’. The name token ‘Mike’ is left unpaired with a token from the second name. In this example a value closer to 1.0 will not penalize the missing token. A value closer to 0.0 will have the opposite effect. This parameter is important when you have a lot of variation of token length in your name set.
Example 4. Token Reorder Penalty (
This parameter is applied when tokens match but are in different positions in the two names. Consider the following two names: ‘John Mike Smith’ and ‘John Smith Mike’. This parameter controls the extent to which the token ordering (‘Mike Smith’ vs. ‘Smith Mike’) decreases the final match score. A value closer to 1.0 will penalize the final score, driving it lower. A value closer to 0.0 will not penalize the order. This parameter is important when the order of tokens in the name is known. If you know that all your name data stores the last name in the last token position, you may want to penalize token reordering more by increasing the penalty. If your data is not well-structured, with some last names first but not all, you may want to lower the penalty.
Example 5. Right End Boost/Left End Boost/Both Ends Boost (
These parameters boost the weights of tokens in the first and/or last position of a name. They are useful when you are dealing with English names and are confident of the placement of the surname. Consider the following two names: ‘John Mike Smith’ and ‘John Jay M Smith’. By boosting both ends you effectively give more weight to the ‘John’ and ‘Smith’ tokens. These parameters are important when you have several tokens in a name and are confident that the first and last tokens are the more important tokens.
boostWeightAtLeftEnd should not be used together.
Language Support Parameters
RNI currently has two levels of language support: complete and limited. Complete support uses a comprehensive set of algorithms to calculate match scores. Fully Supported Text Domains for Rosette Name Indexer and Name Matching lists the languages and scripts with complete support. For all other languages, RNI has limited support.
Prior to release 7.36.0, RNI did not support the limited languages; when presented with names in those languages, an "unsupported language" error would be returned.
To set RNI to behave as it did previously, set
Limited support uses two match score computations:
Two parameters control the level of language support.
Table 4. Language Support Parameters
When set to
true, all languages are supported.
When set to
true, edit distance match scores are enabled for limited support languages.
allLanguageSupport must be
Neural Model for Matching
When matching Japanese names in Katakana to English names, you can replace the HMM with a neural model. This model should improve accuracy, but will have an impact on performance.
To enable the neural model, set
enableSeq2SeqTokenScorer to true in the
jpn_eng profile in the
parameter_profiles.yaml file. This applies to Japanese names in Katakana only. Japanese names in other scripts will still use the HMM.
If your data includes a lot of Korean names written in Han script mixed in with Chinese and/or Japanese names, you may want to enable Korean readings. This is only used when the
language (languageOfUse) of the document is not specified for each request. The following steps may increase accuracy for Korean names, at the cost of decreased throughput.
To enable Korean readings of names in Han script you need to edit the parameter files as follows:
zho_eng profile in the
internal_param_profiles.yaml file and remove
kor from the list of
zho_eng profile in the
parameter_profiles.yaml file to increase the
alternativePairsToCheck parameter by 1 to compensate for the additional reading.
Matching Names with Han Characters
We've added experimental support to leverage mechanisms within the Unicode data to improve matching of Han characters.
The four-corner system is a method for encoding Chinese characters using four numerical digits per character. The digits encode the shapes found in the corners of the symbol, from the top-left to the bottom-right. While this does not uniquely identify a Chinese character, it does limit the list of possibilities.
haniFourCornerCodeMismatchPenalty applies a penalty if the names have different four-corner codes. By default,
haniFourCornerCodeMismatchPenalty is set to 0, which turns it off. Experiments have shown positive accuracy improvements when setting the value of the parameter to 1.
To enable the feature, add the following line to your
This is an experimental feature. As with any experimental feature, we highly recommend experimenting in your environment with your data.
Matching Turkish and Vietnamese Names
Vietnamese and Turkish have their own detectors which must be enabled. If your data includes Turkish and/or Vietnamese names, then you must enable the respective detector.
To enable Turkish detection, add:
To enable Vietnamese detection, add:
Restart the system.
Evaluating Parameter Configuration
To evaluate the newly tuned parameter values, query a large dataset of names or addresses that does not include your test set. For an exact evaluation, query an annotated dataset that includes the correct answers for a number of queries. For a general evaluation, measure the number of pair matches that have scores above your threshold, compared to before tuning the parameter values. If there were too many matches before, now there should be fewer matches. If there were too few matches before, there should be more now. If the number of matches increases or decreases dramatically, then there is a higher chance of missing correct matches below the threshold or including incorrect matches above the threshold.
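Both kinds of evaluation described above can be sketched with a short helper: the match count supports the general evaluation, and precision and recall support the exact evaluation against an annotated dataset. The evaluate function and the sample scores are hypothetical and only illustrate the comparison; they are not part of RNI.

```python
def evaluate(scored, threshold):
    """scored: list of (score, is_correct_match) tuples from an annotated dataset."""
    tp = sum(1 for score, ok in scored if score >= threshold and ok)
    fp = sum(1 for score, ok in scored if score >= threshold and not ok)
    fn = sum(1 for score, ok in scored if score < threshold and ok)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # "matches" is the number of pairs scoring at or above the threshold.
    return {"matches": tp + fp, "precision": precision, "recall": recall}


# Hypothetical scores for the same four pairs before and after tuning.
before = [(0.90, True), (0.85, False), (0.70, True), (0.60, False)]
after = [(0.92, True), (0.75, False), (0.82, True), (0.55, False)]

print(evaluate(before, threshold=0.80))
print(evaluate(after, threshold=0.80))
```

A large swing in the match count between the two runs is the signal described above that correct matches may have dropped below the threshold or incorrect ones risen above it.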
If you find new pair matches that you want to score above or below your threshold, collect them into a test set to retune the parameters. Then evaluate the parameters again using a large dataset to review results. It is important to frequently evaluate new parameter settings on separate test data to ensure the parameters continue to return correct results.
Configuring Name Overrides
RNI includes override files (UTF-8 encoded) to improve name matching. There are different types of override files:
Stop patterns and stop word prefixes designate name elements to strip during indexing and queries, and before running any matching algorithms.
Name pair matches specify scores to be assigned for specified full-name pairs.
Token pair overrides specify name token pairs that match along with a match score.
Token normalization files specify the normalized form for tokens and variants to normalize to that form.
Low weight tokens specify parts of names (such as suffixes) that don't contribute much to name matching accuracy.
The name matching override files are in the
You can modify these files and add additional files in the same subdirectory to extend coverage to additional supported languages. You can also create files that only apply to a specified entity type, such as PERSON.
com.basistech.rni.index.RNIConfiguration provides methods that you can use to define your own override tables (character streams) in place of the tables in the default directory. See the HTML API documentation for
Stop Patterns and Stop Word Prefixes
Before running any matching algorithms, the names are transformed into tokens that can be compared. RNI uses stop patterns and stop word prefixes to remove patterns, including titles such as Mr., Senator, or General, that you do not want to include in name matching. Both stop patterns and stop word prefixes are used to strip matching name elements during indexing and querying. Stop words are string literals and are processed much more quickly than stop patterns, which are regular expressions. You should use stop words for the most efficient removal of prefixes, such as titles. Stop words are language-dependent.
For each name, RNI performs the following steps in order:
Character-level normalization, stripping punctuation (except for periods, commas, and hyphens). White space is reduced to single spaces and all characters are lower-cased. Diacritical marks are removed.
Stop patterns are applied.
Stop words are applied.
RNI cycles through the stop patterns and then the stop words. After each cycle, it drops the patterns and words that stripped nothing, and it repeats until the list of stop patterns and stop words is empty.
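The steps above can be approximated in Python as follows. This is only a sketch of the described behavior, not RNI's implementation. It assumes that a stop word prefix is removed from the start of the whole name, and it simplifies the cycling to "repeat until nothing changes". The function names and the sample patterns are hypothetical.

```python
import re
import unicodedata


def normalize(name):
    """Character-level normalization as described in the first step."""
    # Remove diacritical marks: decompose, then drop combining characters.
    decomposed = unicodedata.normalize("NFD", name)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    lowered = no_marks.lower()
    # Strip punctuation except periods, commas, and hyphens.
    kept = re.sub(r"[^\w\s.,\-]", "", lowered)
    # Reduce white space to single spaces.
    return re.sub(r"\s+", " ", kept).strip()


def strip_stops(name, stop_patterns, stop_prefixes):
    """Apply stop patterns, then stop word prefixes, until nothing changes."""
    current = normalize(name)
    while True:
        previous = current
        for pattern in sorted(stop_patterns, key=len, reverse=True):
            current = re.sub(pattern, " ", current)
        current = re.sub(r"\s+", " ", current).strip()
        for prefix in sorted(stop_prefixes, key=len, reverse=True):
            if current.startswith(prefix + " "):
                current = current[len(prefix) + 1:]
        if current == previous:
            return current


print(strip_stops(
    "Brigadier-General Sir John Smith",
    stop_patterns=[r"brigadier[-]general", r"general"],
    stop_prefixes=["sir", "mr."],
))
```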
A stop pattern is a regular expression that excludes matching name elements during indexing and queries. You can use any regular expression supported by the Java
java.util.regex.Pattern; see the Javadoc for detailed documentation.
Stop patterns for a given language are specified in a UTF-8 file with the ISO 639-3 three-letter language code in the filename:
where LANG is a three-letter language code.
Each row in the file, except for rows that begin with
#, is a regular expression. Leading and trailing whitespace is removed from regex lines, so use
\s at the beginning and end as needed.
Include _TYPE, where TYPE designates an entity type, such as PERSON, if you want the override to apply only if the name, matching names, or matching tokens have been assigned this entity type. If the filename does not include _TYPE, it will be applied to all names, regardless of the entity type.
Name elements matching any of these regular expressions are removed. Longer stop patterns are applied before shorter stop patterns, so the presence of a shorter stop pattern does not prevent the stripping of a longer pattern that includes the shorter pattern. For example, the
brigadier[-]general stop pattern is applied first, but
general is also a stop pattern and will be applied as well.
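The file rules described above (comment rows, whitespace trimming, and longer patterns applied first) can be sketched as a loader. The function name load_stop_patterns is hypothetical, skipping blank rows is an assumption, and the sketch does not handle the optional field column for fielded names.

```python
import re


def load_stop_patterns(lines):
    """Read stop pattern rows and return compiled regexes, longest first."""
    patterns = []
    for line in lines:
        if line.startswith("#"):
            continue  # rows that begin with '#' are not patterns
        text = line.strip()  # leading and trailing whitespace is removed
        if text:
            patterns.append(text)
    # Longer patterns are applied before shorter ones.
    patterns.sort(key=len, reverse=True)
    return [re.compile(p) for p in patterns]


rows = ["# military titles", "general", "  brigadier[-]general  ", ""]
compiled = load_stop_patterns(rows)

name = "brigadier-general john smith"
for pattern in compiled:
    name = pattern.sub(" ", name)
name = " ".join(name.split())
print(name)
```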
RNI includes files with stop patterns for names in English (generic and ORGANIZATION), Japanese (PERSON), Spanish (generic), and Chinese (PERSON). These files are in
$BT_ROOT/rlpnc/data/rnm/ref/override. The generic (non-entity-specific) English file is
stopregexes_eng.txt. For example, the entries
indicate that the common indicators for first-name-unknown at the start of a name and last-name-unknown at the end of a name are to be removed.
You can also specify which field the regex is to be applied to when processing a fielded name. Simply add Tab
n is the field number. To search multiple fields, include an entry for each field, as illustrated below. When processing a name without fields, the field parameter is ignored. For example,
indicates that the regex is to be applied to fields 2 and 3 in fielded names.
You can modify the contents of this file. To add stop patterns for a different language, create an additional UTF-8 file in the same subdirectory with the three-letter language code in the filename. For example,
stopregexes_ara.txt would include regular expressions with Arabic text;
stopregexes_eng_PERSON.txt would include regular expressions to remove elements from PERSON names in English text.
Use of complex patterns may increase processing time. When possible, use stop word prefixes.
A stop word prefix is a string literal that strips the matching prefix from name elements during indexing and querying.
Stop word prefixes for a given language are specified in a UTF-8 file with the ISO 639-3 three-letter language code in the filename:
where LANG is a three-letter language code. Each row in the file, except for rows that begin with
#, is a string literal. Prefixes matching any of these string literals are removed.
Like stop patterns, longer stop word prefixes take precedence over shorter prefixes contained within the longer stop word. For example, the
lieutenant colonel stop word prefix is applied where applicable when
colonel is also a stop word prefix.
RNI includes files with generic stop word prefixes for names in Arabic, English, Greek, Hebrew, Hungarian, Khmer, Spanish, Thai, Turkish, and Vietnamese. These files are in
$BT_ROOT/rlpnc/data/rnm/ref/override. You can modify the contents of these files. To add stop word prefixes for another language, create a UTF-8 file in the same directory with the three-letter language code in the filename. For example,
stopprefixes_rus.txt would include stop word prefixes for use with Russian text.
Overriding Name Pair Matches
You can create UTF-8 text files that specify the scores to be assigned for specified full-name pairs. The filename uses the ISO 639-3 three-letter language codes to designate the language of each full name in each of the full-name pairs:
where LANG1 is the three-letter language code for the first name and LANG2 is the three-letter language code for the second name.
Include _TYPE, where TYPE designates an entity type, such as PERSON, if you want the override to apply only if the name (for stop patterns), matching names, or matching tokens have been assigned this entity type. If the filename does not include _TYPE, it will be applied to all names, regardless of the entity type.
Each row in the file, except for rows that begin with
#, is a tab-delimited full-name pair and score:
name1 Tab name2 Tab score
The scores must be between 0 and 1.0, where 0 indicates no match, and 1.0 indicates a perfect match.
Since the minimum score for names returned by RNI queries must be greater than 0, an RNI query will not return the name if the override score is 0. Name match operations, on the other hand, will return an override score of 0.
The installation includes a sample file with sample entries commented out:
$BT_ROOT/rlpnc/data/rnm/ref/override/fullnames_eng_eng.txt. Any non-commented-out entries in this file assign scores to English queries applied to English names in an RNI index. For example,
John Doe Joe Bloggs 1.0
indicates that the query name
John Doe matches the index name
Joe Bloggs (both used in different regions to indicate 'person unknown') with a score of 1.0.
These match patterns are commutative. The previous entry also specifies a match score of 1.0 if the query name is
Joe Bloggs and the index includes a document with an
rni_name field containing
You can add entries for English to English name matches to
fullnames_eng_eng.txt, and create additional override files, using the filename to specify the languages. For example the following entries could appear in
外山恒 Toyama Koichi 1.0
ヒラリークリントン Hillary Clinton 1.0
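The file format and the commutative lookup described in this section can be sketched as follows. This is an illustration only. The function names are hypothetical, skipping blank rows is an assumption, and the sketch compares names exactly as written without the normalization RNI performs.

```python
def load_fullname_overrides(lines):
    """Parse rows of the form: name1 <Tab> name2 <Tab> score."""
    overrides = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        name1, name2, score_text = line.rstrip("\n").split("\t")
        score = float(score_text)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score must be between 0 and 1.0: {line!r}")
        # A frozenset key makes the lookup independent of name order.
        overrides[frozenset((name1, name2))] = score
    return overrides


def override_score(overrides, query_name, index_name):
    """Return the override score for a pair, or None if there is none."""
    return overrides.get(frozenset((query_name, index_name)))


rows = ["# sample entries", "John Doe\tJoe Bloggs\t1.0"]
table = load_fullname_overrides(rows)
print(override_score(table, "Joe Bloggs", "John Doe"))
```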
Overriding Token Pair Matches
You can create text files that specify token (name-element) pairs that match. Token pair overrides are supported for English-English, Japanese-English, Chinese-English, Russian-English, Spanish-English, Japanese-Japanese, Russian-Russian, English-Korean, Korean-Korean, Spanish-Spanish, Greek-English and Hungarian-English token pairs. Such pairs may include proper name and nickname, such as Peter and Pete, and cognate names such as Peter and Pedro. Tokens cannot contain whitespace. When RNI evaluates two names, each of which contains an element from the pair, it enhances the value of the resulting name match score. For example, if
Abby constitute a token pair, then the match score for
Abigail Harris and
Abby Harris will be higher than it would be if the token pair had not been specified.
The token pairs may be within a language or cross-lingual, as indicated by the file name:
where LANG1 is the three-letter language code for the first token in each pair and LANG2 is the three-letter language code for the second token in each pair. Each entry in the file, except for rows that begin with
#, is a tab-delimited token pair and may include a raw score between 0.0 and 1.0 or an indicator that at least one of the tokens is a nickname or that the tokens are cognates:
Token1 Tab Token2 Tab [[0.0-1.0]|NICKNAME|COGNATE|VARIANT]
A token pair override score (raw score or indicator) serves as a minimum score, but you can write "/force" after a token score to force it to be exactly that value:
Token1 Tab Token2 Tab [([0.0-1.0]|NICKNAME|COGNATE|VARIANT)/force]
If you would like to prevent a token pair from matching, you can use the SUPPRESS indicator as an alias for "0.0/force". If you do not include NICKNAME, COGNATE, VARIANT, or SUPPRESS, RNI assumes NICKNAME.
$BT_ROOT/rlpnc/data/rnm/ref/override/tokens_eng_eng.txt, which contains a list of English/English token pairs. For example:
Peter Pete NICKNAME
Peter Pedro COGNATE
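The entry syntax described above, including the /force suffix, the SUPPRESS alias, and the NICKNAME default, can be sketched as a small parser. The TokenOverride type and the parse_token_override function are hypothetical names used for illustration; this is not RNI's own parser.

```python
from typing import NamedTuple, Union

INDICATORS = {"NICKNAME", "COGNATE", "VARIANT"}


class TokenOverride(NamedTuple):
    token1: str
    token2: str
    value: Union[float, str]  # raw score or indicator name
    force: bool               # True when the score is forced exactly


def parse_token_override(line):
    """Parse: Token1 <Tab> Token2 [<Tab> score-or-indicator[/force]]."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (2, 3):
        raise ValueError(f"expected 2 or 3 tab-separated fields: {line!r}")
    token1, token2 = fields[0], fields[1]
    if any(ch.isspace() for ch in token1 + token2):
        raise ValueError("tokens cannot contain whitespace")
    spec = fields[2].strip() if len(fields) == 3 else ""
    if not spec:
        return TokenOverride(token1, token2, "NICKNAME", False)  # default
    if spec == "SUPPRESS":
        return TokenOverride(token1, token2, 0.0, True)  # alias for 0.0/force
    value_text, slash, suffix = spec.partition("/")
    if slash and suffix != "force":
        raise ValueError(f"unknown suffix: {suffix!r}")
    force = bool(slash)
    if value_text in INDICATORS:
        return TokenOverride(token1, token2, value_text, force)
    score = float(value_text)
    if not 0.0 <= score <= 1.0:
        raise ValueError("raw score must be between 0.0 and 1.0")
    return TokenOverride(token1, token2, score, force)


print(parse_token_override("Peter\tPete\tNICKNAME"))
print(parse_token_override("Peter\tPedro\tCOGNATE"))
```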
This directory also contains Chinese to English token overrides for LOCATION and ORGANIZATION:
When you create an additional file in the same location, use the ISO 639-3 three-letter language code in the filename to identify the language of each name element in the pair. For example,
tokens_eng_eng.txt indicates that the contents match English names to English names;
tokens_eng_eng_ORGANIZATION.txt indicates that the contents match English ORGANIZATION names to English ORGANIZATION names. The SDK includes a sample file for matching English/English tokens in LOCATION entities:
We recommend that you enter the language names in alphabetical order in the filename and token pairs. Keep in mind that the order has no influence on the resulting score, since the scoring is commutative.
Multiple Sets of Token Overrides
There may be situations in which you want to define multiple sets of token overrides for an index. This can be accomplished by combining override file names with the
The value of
overrideSelector is an alphanumeric string, and it controls which set of overrides will be considered during querying and matching. The value is case-insensitive. By default, it will read overrides for the "default" selector.
The value of
overrideSelector can be appended to the name of the override text file containing the token pairs, preceded by a dash (-). For example, a file for person name overrides in English - English matching using the
OverrideGroup1 would be named:
If no valid selector name is found in the override text file filename, overrides for that file will be applied to the "default" selector.
Normalizing Token Variants
You can create text files that specify the normalized form for tokens (name elements) and variants to normalize to that form. The file name indicates the language and optionally the entity type for the tokens to be normalized:
equivalenceclasses_jpn.txt would contain entries for normalizing Japanese token variants for any entity type to a normalized form.
Each entry in the file contains a normalized form followed by one or more variant forms. The syntax is as follows:
$BT_ROOT/rlpnc/data/rnm/ref/override/equivalenceclasses_eng_PERSON.txt, which contains a list of variant renderings to normalize to
You can add lists of variants to this file, including the normalized form in square brackets to start each list.
You can edit the list of tokens that are given low influence in RNI. These low weight tokens are parts of a name (such as suffixes) that don't contribute much to the name matching accuracy.
The file name is
$BT_ROOT/rlpnc/data/rnm/ref/lowWeightTokens_eng.txt contains entries for tokens in English that you may want to put less emphasis on: "jr", "sr", "ii", "iii", "iv", "de".
Matching Organizations with Real World IDs
Organizations and companies often have nicknames which are very different from the company's official name. For example, International Business Machines, or IBM, is known by the nickname Big Blue. As there is no phonetic similarity between the two names, a match query between those two organization names would result in a low score. A real world identifier associates companies, along with their associated nicknames and permutations, with an identifier. When enabled, a search between two company names will include a comparison between the real world identifiers for the two names, thus matching dissimilar names for the same corporate entity.
RNI contains real world identifiers for corporations, which pair an entity id with nicknames and common permutations of the corporation name. Name Matching Within a Language lists the languages with provided real-world ID dictionaries. Customers can also generate their own real-world ID dictionaries to supplement the provided dictionaries.
Table 5. Real World ID Parameters
Enables real world IDs and indexes the real world IDs as corporation names are added to the index. You must reindex if you enable it after indexing.
Enables querying with real world IDs; set by language pair.
Sets the match score when two names match due to matching real world IDs. Set by language pair.
Boosts the value of the real world ID results from the first pass. Increases the likelihood of real world ID matches being returned from the first pass. Set by language pair.
Building a Real World ID File
Many companies have their own file of organizations with their different names. To improve matching between organization names, you can supplement the real world IDs provided in RNI and build your own file of real world IDs. The provided file will build a binary file in the specified output directory named
<LANG>_ORGANIZATION_ids.bin where <LANG> is the three-letter language code of the file.
The input file is a tab-separated file (
.tsv). Each line contains an organization name and a corresponding alphanumeric ID. The file can only contain a single language and script. You must create a separate file for each language.
Big Blue WE1X92
International Business Machines WE1X92
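The idea behind the input file can be illustrated with a short sketch that reads the tab-separated rows and checks whether two names share an identifier. This is not the builder program shipped in realWorldIDBuilder.zip. The function names are hypothetical, and the case-insensitive comparison is an assumption made for the example.

```python
def load_real_world_ids(lines):
    """Map each organization name to the set of IDs listed for it."""
    ids = {}
    for line in lines:
        if not line.strip():
            continue
        name, real_world_id = line.rstrip("\n").split("\t")
        ids.setdefault(name.lower(), set()).add(real_world_id)
    return ids


def share_real_world_id(ids, name_a, name_b):
    """True when both names are listed with at least one common ID."""
    common = ids.get(name_a.lower(), set()) & ids.get(name_b.lower(), set())
    return bool(common)


rows = [
    "Big Blue\tWE1X92",
    "International Business Machines\tWE1X92",
]
table = load_real_world_ids(rows)
print(share_real_world_id(table, "Big Blue", "International Business Machines"))
```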
Unzip the file
realWorldIDBuilder.zip found in the $BT_ROOT directory and run the build command. Instructions on how to run the program are in the
README.md file in the zip file.
You may want to use real world ID matching even if there are some entities which you do not want to match via real world IDs. You can omit specific organizations and QIDs (Wikidata's identifier for entities) from matching by creating an omit file listing the organization names and QIDs you would like to omit.
The omit file is a tab-separated file (
<LANG>_ORGANIZATION_ids.tsv where <LANG> is the three-letter language code of the file. Each omit file can only contain names in one language and separate files must be made for each language. There are three types of lines that can appear in an omit file, which have different effects on omission: pairs, lone names, and lone QIDs.
Pair: A name and a QID on the same line. The QID will no longer be used for matching against the name. The same name can be associated with multiple QIDs to omit by placing each pair on its own line.
Lone name: A name followed by an asterisk in the QID column. The name will not be used at all for RWID matching.
Lone QID: A QID is preceded by an asterisk in the name column. No names in the specified language will be able to match against each other using this QID.
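The three line types can be sketched as a small classifier. The function names and the QIDs in the sample rows are hypothetical and chosen only for illustration; this is not RNI's own reader for omit files.

```python
def load_omit_file(lines):
    """Split omit rows into pairs, lone names, and lone QIDs."""
    pairs, names, qids = set(), set(), set()
    for line in lines:
        if not line.strip():
            continue
        name, qid = line.rstrip("\n").split("\t")
        if name == "*":
            qids.add(qid)            # lone QID: asterisk in the name column
        elif qid == "*":
            names.add(name)          # lone name: asterisk in the QID column
        else:
            pairs.add((name, qid))   # pair: this QID is not used for this name
    return pairs, names, qids


def is_omitted(name, qid, pairs, names, qids):
    """True when real world ID matching should skip this name/QID combination."""
    return name in names or qid in qids or (name, qid) in pairs


rows = [
    "Big Blue\tQ0000001",   # hypothetical QID
    "Acme\t*",
    "*\tQ0000002",          # hypothetical QID
]
pairs, names, qids = load_omit_file(rows)
print(is_omitted("Big Blue", "Q0000001", pairs, names, qids))
```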
To enable an omit file in RNI:
Place the omit file in the
omit_ids.datafiles, which is in the
$BT_ROOT/rlpnc/data/real_world_ids/ref/omit_ids directory by default.
Add a new entry for your omit file following the format
<LANG>_ORGANIZATION tab * tab <file path>, where LANG is the three-letter language code of the file. File paths must be relative to BT_ROOT, meaning absolute paths will not work. For example:
ara_ORGANIZATION * rlpnc/data/real_world_ids/ref/omit_ids/ara_ORGANIZATION_ids.tsv
Custom Language Model Training
You can train a language model on your own name data. RNI uses language models in which common names score differently than rare names. For example, "John Jingleheimer" should match "Jingleheimer" better than "John", because Jingleheimer is a rarer name than John. RNI already comes with language models for many supported languages, but you might find it best to train a new language model so that it reflects the statistics of your data. Please note that a large number of full names is required to train an effective language model.
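The intuition that rare names carry more weight than common ones can be illustrated with a simple frequency count. This sketch is not RNI's language model or its training procedure. The token_weights function and the tiny sample list are hypothetical, and the negative log frequency is just one common way to turn counts into weights.

```python
import math
from collections import Counter


def token_weights(names):
    """Weight each token by the negative log of its relative frequency."""
    counts = Counter(tok for name in names for tok in name.lower().split())
    total = sum(counts.values())
    return {tok: -math.log(n / total) for tok, n in counts.items()}


# Hypothetical training names; a real model needs far more data.
sample = ["John Smith", "John Doe", "John Jingleheimer", "Mary Smith"]
weights = token_weights(sample)

# The rarer token receives the larger weight.
print(weights["jingleheimer"] > weights["john"])
```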
frequencyModelTrainer.zip to any desired location. Ensure that the
JAVA_HOME environment variable is set and points to Java version 11 or higher.
Simple usage example
bin/buildLM.sh -root rni-rnt -in eng_PER_LM.tsv -lang eng -script Latn
frequencyModelTrainer.zip for more details, including the full description of arguments.