There are many ways to configure RNI to better fit your use case and data. The two primary mechanisms are modifying match parameters and editing overrides. You can also train a custom language model.
The default values of the RNI match parameters are tuned to perform well on most queries and datasets. However, every use case uses different data with distinct match requirements. You can modify match parameters to optimize match results for your data and business case.
The typical process for tuning parameters is as follows:
- Gather a list of names to index and queries to run against them to use as a test data set. Ideally, the test data set should be large enough to reflect the diversity of your real data, with at least 100 queries.
- After indexing the data, run the queries using RNI and determine a match score threshold that appears to provide the best results.
- Analyze the results to discover cases that RNI failed to score high enough or that RNI incorrectly scored higher than the threshold.
- Choose a subset of these name pairs that RNI scored too low or too high to use as examples for tuning your parameters.
- Tune the match parameters to change the match scores of this test set of undesirable results, so that each score falls correctly above or below your threshold. For name or address pairs that must match in a specific way and are very dissimilar (e.g. aliases), we recommend adding them as token or full-name overrides.
- Run the large set of queries through RNI again to verify that the new parameter values still return the desired matches without introducing new undesired results.
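The tuning loop above can be sketched in a few lines of code. The helper below is hypothetical (it is not part of RNI): given a labeled test set of scored name pairs and a candidate threshold, it collects the pairs that land on the wrong side of the threshold, which are exactly the examples the steps above say to tune against.

```python
def evaluate_threshold(scored_pairs, threshold):
    """Partition labeled, scored name pairs by a candidate threshold.

    scored_pairs: list of (query, index_name, score, should_match) tuples,
    where score is the RNI match score and should_match is the ground truth.
    Returns the pairs scored too low (missed matches) and the pairs scored
    too high (spurious matches) -- the candidates for parameter tuning.
    """
    too_low = [p for p in scored_pairs if p[3] and p[2] < threshold]
    too_high = [p for p in scored_pairs if not p[3] and p[2] >= threshold]
    return too_low, too_high

# Hypothetical RNI scores for a small labeled test set.
pairs = [
    ("John Smith", "Jon Smith", 0.92, True),
    ("John Smith", "Jane Smythe", 0.55, False),
    ("Susie Johnson", "Susanne Johnson", 0.48, True),
]
missed, spurious = evaluate_threshold(pairs, 0.5)
```

In practice you would populate `pairs` from your RNI query results and rerun this check after each parameter change.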
Parameter Configuration Files
Individual name tokens are scored by a number of algorithms or rules. These algorithms can be optimized by modifying configuration parameters, thus changing the final similarity score.
The parameters are contained in two .yaml files located in plugins/rni/bt_root/rlpnc/data/etc. The parameters are defined in parameter_defs.yaml and modified in parameter_profiles.yaml.
- parameter_defs.yaml lists each match parameter along with its default value and a description. Each parameter may also have a minimum and maximum value; these are system limits, and exceeding them can cause an error. A parameter may also have a recommended minimum (sane_minimum) and recommended maximum (sane_maximum) value, which we advise you not to exceed.
- parameter_profiles.yaml is where you change parameter values based on the language pairs in the match.
Important
Do not modify the parameter_defs.yaml file. All changes should be made in the parameter_profiles.yaml file.
Do refer to the parameter_defs.yaml file for definitions and usage of all available parameters.
The parameters in the parameter_profiles.yaml file are organized into parameter profiles. Each profile contains parameter values for a specific language pair. For example, matching "Susie Johnson" and "Susanne Johnson" uses the eng_eng profile. There is also an any profile, which applies to all language pairs.
Parameter profiles have the following characteristics:
- Parameter profile names are formed from the language pairs they apply to. The three-letter language codes are always written in alphabetical order, except for English (eng), which always comes last. The two languages can be the same. Examples: spa_eng, jpn_eng, eng_eng.
- They can include the entity type being matched, such as eng_eng_PERSON. The parameter values in this profile are only used when matching English names with English names where the entity type is specified as PERSON. Any entity type listed in the table can be used.
- Parameter profiles can inherit mappings from other parameter profiles. The global any profile applies to all languages; all profiles inherit its values.
- The any profile can include an entity type; any_PERSON applies to all PERSON matches regardless of language.
- Specific language profiles inherit values from global profiles. The profile for matching person names is named any_PERSON. The profile for matching Spanish person names against English person names is named spa_eng_PERSON. It inherits parameter values from the spa_eng profile and the any_PERSON profile. The any_PERSON profile will not override parameter values from more specific profiles, such as the spa_eng profile.
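To make the inheritance rules concrete, here is a hedged sketch of how such profiles might appear in parameter_profiles.yaml (reorderPenalty is a real parameter; the values are illustrative only, not recommendations):

```yaml
any:                 # applies to all language pairs
  reorderPenalty: 0.6
any_PERSON:          # all PERSON matches, regardless of language
  reorderPenalty: 0.55
spa_eng:             # Spanish - English matches
  reorderPenalty: 0.5
spa_eng_PERSON:      # most specific profile wins for Spanish - English PERSON
  reorderPenalty: 0.4
```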
Important
Global changes are made with the any profile.
Any changes to address parameters should go under the any profile; they will affect all fields for all addresses.
A parameter universe is a named set of RNI parameter profiles with values. Each universe has a name and can contain multiple parameter profiles, including the global any profile. A parameter universe profile can also include the entity type being matched, just like regular parameter profiles. For example, the MyParameterUniverse universe may include the following parameter profiles:
- "name": "MyParameterUniverse/any" applies to all language pairs.
- "name": "MyParameterUniverse/spa_eng" applies to English - Spanish name pairs.
- "name": "MyParameterUniverse/spa_eng_PERSON" applies to all PERSON English - Spanish name pairs.
Each parameter in the profile must match the name of a parameter declared in the parameter_defs.yaml file, along with a value. Parameter universes are added to the parameter_profiles.yaml file.
A parameter universe can also be defined dynamically. We recommend that you use dynamic parameter universes for testing and tuning only. For production use, add all parameter universes to the parameter_profiles.yaml file.
Tip
You can define multiple named parameter profiles.
Define the parameter universe in the parameter_profiles.yaml file. Example:
parameterUniverseOne/spa_eng_PERSON:
  reorderPenalty: 0.4
  HMMUsageThreshold: 0.8
  stringDistanceThreshold: 0.1
  useEditDistanceTokenScorer: true
parameterUniverseOne/eng_eng:
  reorderPenalty: 0.6
Using a Parameter Universe
To use a parameter universe, add it to the name_score function when rescoring names queried from the index. All parameter values defined in the parameter universe will be used, where appropriate.
curl -XPOST "http://localhost:9200/_search" -H 'Content-Type: application/json' -d'{
  "query": {
    "match": {
      "full_name": "A Ely Taylor"
    }
  },
  "rescore": {
    "window_size": 3,
    "rni_query": {
      "rescore_query": {
        "rni_function_score": {
          "name_score": {
            "field": "full_name",
            "query_name": "A Ely Taylor",
            "score_to_rescore_restriction": 1,
            "window_size_allowance": 0.5,
            "universe": "parameterUniverseOne"
          }
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}'
Parameter universes can also be used in the query phase. To do so, specify the query name as a JSON string and include the universe in the body.
curl -XPOST "http://localhost:9200/_search" -H 'Content-Type: application/json' -d'{
  "query": {
    "match": {
      "full_name": "{ \"data\": \"A Ely Taylor\", \"universe\": \"parameterUniverseOne\"}"
    }
  },
  "rescore": {
    "window_size": 3,
    "rni_query": {
      "rescore_query": {
        "rni_function_score": {
          "name_score": {
            "field": "full_name",
            "query_name": "A Ely Taylor",
            "score_to_rescore_restriction": 1,
            "window_size_allowance": 0.5,
            "universe": "parameterUniverseOne"
          }
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}'
Dynamic Parameter Universes
When tuning RNI, you can use the Parameters REST API endpoint to dynamically create or update a parameter universe, overriding the existing parameter values without having to restart Elasticsearch. Once the optimum values are determined for each parameter, add the parameter universe to the parameter_profiles.yaml file for production use.
Tip
Dynamic parameter universes are best suited for testing and tuning the RNI match parameters. Once you determine the best set of parameters, add the parameter universe to the parameter_profiles.yaml file for production use. Using dynamic parameter universes can slow your system down considerably.
Use the Parameters endpoint to create a parameter universe, with parameters and values.
curl -XPOST "http://localhost:9200/rni_plugin/_parameter_universe" -H 'Content-Type: application/json' -d'{
"profiles": [
{
"name": "parameterUniverseOne/spa_eng_PERSON",
"parameters": {
"reorderPenalty": 0.4,
"HMMUsageThreshold": 0.8,
"stringDistanceThreshold": 0.1,
"useEditDistanceTokenScorer": true
}
}
]
}'
The name of the parameter universe is parameterUniverseOne, and it applies to matching person names between Spanish and English.
Modifying Name Parameters
To start tuning the parameters, run the RNI pairwise match on the test set and look at the match reasons in the response. These match reasons serve as a guide for which parameters to tune; the parameters are defined in parameter_defs.yaml. For additional support on tuning the parameters, contact support@rosette.com.
Once you define a profile and set a parameter value, rerun the RNI pairwise match, scoring the match with the edited parameter_profiles.yaml file.
Given the large number of configurable name match parameters in RNI, start by looking at the impact of modifying a small number of parameters. The complete definition of all available parameters is found in the parameter_defs.yaml file.
The following examples describe the impact of parameter changes in more detail.
Example 1. Token Conflict Score (conflictScore)
Consider the two names 'John Mike Smith' and 'John Joe Smith'. 'John' from the first and second name will be matched, as will the token 'Smith' from each name. This leaves the unmatched tokens 'Mike' and 'Joe'. These two tokens are in direct conflict with each other, and you can determine how the conflict is scored. A value closer to 1.0 treats 'Mike' and 'Joe' as equal; a value closer to 0.0 has the opposite effect. This parameter is important when you decide that names with dissimilar tokens should have lower final scores, or when you decide that if two of the tokens are the same, the third token (often a middle name) is not as important.
Example 2. Initials Score (initialsScore)
Consider the following two names: 'John Mike Smith' and 'John M Smith'. 'Mike' and 'M' trigger an initial match, and you can control how this is scored. A value closer to 1.0 treats 'Mike' and 'M' as equal and increases the overall match score; a value closer to 0.0 has the opposite effect. This parameter is important when you know there are many initials in your data sets.
Example 3. Token Deletion Score (deletionScore)
Consider the following two names: 'John Mike Smith' and 'John Smith'. The name token 'Mike' is left unpaired with any token from the second name. A value closer to 1.0 does not penalize the missing token; a value closer to 0.0 has the opposite effect. This parameter is important when token counts vary widely across your name set.
Example 4. Token Reorder Penalty (reorderPenalty)
This parameter is applied when tokens match but are in different positions in the two names. Consider the following two names: 'John Mike Smith' and 'John Smith Mike'. This parameter controls the extent to which the token reordering ('Mike Smith' vs. 'Smith Mike') decreases the final match score. A value closer to 1.0 penalizes the final score, driving it lower; a value closer to 0.0 does not penalize the reordering. This parameter is important when the order of tokens in the name is known. If you know that all your name data stores the last name in the last token position, you may want to penalize token reordering more by increasing the penalty. If your data is not well-structured, with some last names first but not all, you may want to lower the penalty.
Example 5. Right End Boost/Left End Boost/Both Ends Boost (boostWeightAtRightEnd, boostWeightAtLeftEnd, boostWeightAtBothEnds)
These parameters boost the weights of tokens in the first and/or last position of a name. They are useful when dealing with English names and you are confident of the placement of the surname. Consider the following two names: 'John Mike Smith' and 'John Jay M Smith'. By boosting both ends you effectively give more weight to the 'John' and 'Smith' tokens. These parameters are important when you have several tokens in a name and are confident that the first and last tokens are the most important.
The parameters boostWeightAtRightEnd and boostWeightAtLeftEnd should not be used together.
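Pulling the five examples together, a hedged sketch of a parameter_profiles.yaml entry that tunes these parameters for English-English PERSON matches might look like the following (the parameter names are those named in the examples; the values are purely illustrative, not recommendations, so check parameter_defs.yaml for each parameter's range and default):

```yaml
eng_eng_PERSON:
  conflictScore: 0.2        # conflicting tokens drive the score down
  initialsScore: 0.85       # 'Mike' vs 'M' counts as a near-match
  deletionScore: 0.6        # moderate penalty for an unpaired token
  reorderPenalty: 0.5       # moderate penalty for reordered tokens
  boostWeightAtBothEnds: 1.2  # emphasize first and last tokens
```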
Language Support Parameters
RNI currently has two levels of language support: complete and limited. Complete support uses a comprehensive set of algorithms to calculate match scores. Fully Supported Text Domains for Rosette Name Indexer and Name Matching lists the languages and scripts with complete support. For all other languages, RNI has limited support.
Note
Prior to release 7.36.0, RNI did not support the limited languages; when presented with names in those languages, an "unsupported language" error was returned. To set RNI to behave as it did previously, set allLanguageSupport to false.
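As a sketch, restoring the pre-7.36.0 behavior in parameter_profiles.yaml would look like the following (placing the setting under the global any profile is an assumption, since allLanguageSupport is an installation-wide switch):

```yaml
any:
  allLanguageSupport: false
```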
Limited support uses two match score computations. Two parameters control the level of language support.
Table 3. Language Support Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| allLanguageSupport | When set to true, all languages are supported. | true |
| limitedLanguageEditDistance | When set to true, edit distance match scores are enabled for limited support languages. allLanguageSupport must be true. | true |
Neural Model for Matching
When matching Japanese names in Katakana to English names, you can replace the HMM with a neural model. This model should improve accuracy, but will have an impact on performance.
To enable the neural model, set enableSeq2SeqTokenScorer to true in the jpn_eng profile in the parameter_profiles.yaml file. This applies to Japanese names in Katakana only; Japanese names in other scripts will still use the HMM.
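The corresponding parameter_profiles.yaml entry would be:

```yaml
jpn_eng:
  enableSeq2SeqTokenScorer: true  # Katakana-to-English only; other scripts keep the HMM
```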
To use the neural model:
- Extract the appropriate library files from the platform-specific TensorFlow JAR provided in the rni-es-<version>-seq2seq-libraries.zip bundle.
- Start Elasticsearch with an additional Java property pointing to the directory containing the extracted libraries:
ES_JAVA_OPTS="-Dorg.bytedeco.javacpp.cacheLibraries=false -Djava.library.path=<path-to-extracted-libraries>"
Note
The neural model is currently only available on macOS and Linux platforms, in RNI-ES versions 7.10.2.x and in all plugins including RNI-RNT 7.38.1.67.0 or later.
If your data includes many Korean names written in Han script mixed in with Chinese and/or Japanese names, you may want to enable Korean readings. This is only used when the language (languageOfUse) of the document is not specified for each request. The following steps may increase accuracy for Korean names, at the cost of decreased throughput.
To enable Korean readings of names in Han script, edit the parameter files as follows:
- Edit the zho_eng profile in the internal_param_profiles.yaml file and remove kor from the list in the ignoreTranslationOrigins parameter.
- Edit the zho_eng profile in the parameter_profiles.yaml file to increase the alternativePairsToCheck parameter by 1, to compensate for the additional reading.
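The parameter_profiles.yaml side of this change might look like the following sketch. The value shown assumes a default alternativePairsToCheck of 2 in your installation; check parameter_defs.yaml for the actual default and add 1 to it:

```yaml
zho_eng:
  alternativePairsToCheck: 3  # assumed default of 2, plus 1 for the extra Korean reading
```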
Matching Names with Han Characters
We've added experimental support that leverages mechanisms within the Unicode data to improve matching of Han characters.
The four-corner system is a method for encoding Chinese characters using four numerical digits per character. The digits encode the shapes found in the corners of the symbol, from the top-left to the bottom-right. While this does not uniquely identify a Chinese character, it does limit the list of possibilities.
The parameter haniFourCornerCodeMismatchPenalty applies a penalty if the names have different four-corner codes. By default, haniFourCornerCodeMismatchPenalty is set to 0, which turns it off. Experiments have shown positive accuracy improvements when setting the value of the parameter to 1.
To enable the feature, add the following lines to your parameter_profiles.yaml file:
zho_zho_PERSON:
  haniFourCornerCodeMismatchPenalty: 1
Note
This is an experimental feature. As with any experimental feature, we highly recommend experimenting in your environment with your data.
Matching Turkish and Vietnamese Names
Vietnamese and Turkish have their own detectors which must be enabled. If your data includes Turkish and/or Vietnamese names, then you must enable the respective detector.
- Edit the parameter_profiles.yaml file.
- To enable Turkish detection, add:
detectableLanguages: [tur]
To enable Vietnamese detection, add:
detectableLanguages: [vie]
- Restart the system.
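If your data contains both Turkish and Vietnamese names, a single combined entry would presumably look like the following sketch. Note that the snippets above do not name a profile, so placing the setting under the global any profile is an assumption:

```yaml
any:
  detectableLanguages: [tur, vie]
```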
Ignore malformed and null value parameters for RNI types
You can index null values and empty strings by updating the allowNullValue parameter. If the allowNullValue parameter is enabled, any document containing null values or empty strings in fields of the rni_name, rni_address, and rni_date types will be successfully indexed, but search capabilities will be limited to valid values.
You can direct RNI to index documents with malformed language strings by updating the ignoreBadData parameter. If the ignoreBadData parameter is enabled, any document containing a malformed language string will be successfully indexed, but search capabilities will be limited to valid languages.
By default, these parameters are disabled. These features are useful when performing bulk operations in Elasticsearch.
The file is parameter_profiles.yaml, located in plugins/rni/bt_root/rlpnc/data/etc/.
To turn either of these features on, set the value of the ignoreBadData or allowNullValue parameter in that file to true.
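A sketch of the corresponding parameter_profiles.yaml entry, assuming the global any profile is the right place for these installation-wide settings (enable only the parameters you actually need):

```yaml
any:
  allowNullValue: true  # index documents with null/empty rni_name, rni_address, rni_date fields
  ignoreBadData: true   # index documents with malformed language strings
```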
Evaluating Parameter Configuration
To evaluate the newly tuned parameter values, query a large dataset of names or addresses that does not include your test set. For an exact evaluation, query an annotated dataset that includes the correct answers for a number of queries. For a general evaluation, measure the number of pair matches that have scores above your threshold, compared to before tuning the parameter values. If there were too many matches before, now there should be fewer matches. If there were too few matches before, there should be more now. If the number of matches increases or decreases dramatically, then there is a higher chance of missing correct matches below the threshold or including incorrect matches above the threshold.
If you find new pair matches that you want to score above or below your threshold, collect them into a test set to retune the parameters. Then evaluate the parameters again using a large dataset to review results. It is important to frequently evaluate new parameter settings on separate test data to ensure the parameters continue to return correct results.
Configuring Name Overrides
RNI includes override files (UTF-8 encoded) to improve name matching. There are different types of override files:
- Stop patterns and stop word prefixes designate name elements to strip during indexing and queries, before running any matching algorithms.
- Name pair match overrides specify scores to be assigned to specified full-name pairs.
- Token pair overrides specify name token pairs that match, along with a match score.
- Token normalization files specify the normalized form for tokens and the variants to normalize to that form.
- Low weight token files specify parts of names (such as suffixes) that don't contribute much to name matching accuracy.
The name matching override files are in the plugins/rni/bt_root/rlpnc/data/rnm/ref/override directory.
You can modify these files and add additional files in the same subdirectory to extend coverage to additional supported languages. You can also create files that only apply to a specified entity type, such as PERSON.
Stop Patterns and Stop Word Prefixes
Before running any matching algorithms, the names are transformed into tokens that can be compared. RNI uses stop patterns and stop word prefixes to remove patterns, including titles such as Mr., Senator, or General, that you do not want to include in name matching. Both stop patterns and stop word prefixes are used to strip matching name elements during indexing and querying. Stop words are string literals and are processed much more quickly than stop patterns, which are regular expressions. You should use stop words for the most efficient removal of prefixes, such as titles. Stop words are language-dependent.
For each name, RNI performs the following steps in order:
- Character-level normalization: punctuation is stripped (except for periods, commas, and hyphens), white space is reduced to single spaces, all characters are lower-cased, and diacritical marks are removed.
- Stop patterns are applied.
- Stop words are applied.
RNI cycles through the stop patterns and then the stop words, on each cycle dropping from the list any patterns and words that strip nothing, until the list of stop patterns and stop words is empty.
A stop pattern is a regular expression that excludes matching name elements during indexing and queries. You can use any regular expression supported by Java's java.util.regex.Pattern class; see its Javadoc for detailed documentation.
Stop patterns for a given language are specified in a UTF-8 file with the ISO 639-3 three-letter language code in the filename:
stopregexes_LANG[_TYPE].txt
where LANG is the three-letter language code.
Each row in the file, except for rows that begin with #, is a regular expression. Leading and trailing whitespace is removed from regex lines, so use \s at the beginning and end as needed.
Tip
Include _TYPE, where TYPE designates an entity type such as PERSON, if you want the override to apply only when the name, matching names, or matching tokens have been assigned this entity type. If the filename does not include _TYPE, it is applied to all names, regardless of the entity type.
Name elements matching any of these regular expressions are removed. Longer stop patterns are applied before shorter stop patterns, so the presence of a shorter stop pattern does not prevent the stripping of a longer pattern that includes the shorter pattern. For example, the brigadier[-]general stop pattern is applied first, but general is also a stop pattern and will be applied as well.
RNI includes files with stop patterns for names in English (generic and ORGANIZATION), Japanese (PERSON), Spanish (generic), and Chinese (PERSON). These files are in plugins/rni/bt_root/rlpnc/data/rnm/ref/override. The generic (non-entity-specific) English file is stopregexes_eng.txt. For example, the entries
^fnu\b
\blnu$
indicate that the common indicators for first-name-unknown at the start of a name and last-name-unknown at the end of a name are to be removed.
You can also specify which field a regex applies to when processing a fielded name: add a tab character followed by n, where n is the field number. To apply a regex to multiple fields, include an entry for each field, as illustrated below. When processing a name without fields, the field parameter is ignored. For example,
\blnu$	2
\blnu$	3
indicate that the regex is to be applied to fields 2 and 3 in fielded names.
You can modify the contents of this file. To add stop patterns for a different language, create an additional UTF-8 file in the same subdirectory with the three-letter language code in the filename. For example, stopregexes_ara.txt would include regular expressions with Arabic text; stopregexes_eng_PERSON.txt would include regular expressions to remove elements from PERSON names in English text.
Use of complex patterns may increase processing time. When possible, use stop word prefixes.
A stop word prefix is a string literal that strips the matching prefix from name elements during indexing and querying.
Stop word prefixes for a given language are specified in a UTF-8 file with the ISO 639-3 three-letter language code in the filename:
stopprefixes_LANG[_TYPE].txt
where LANG is the three-letter language code. Each row in the file, except for rows that begin with #, is a string literal. Prefixes matching any of these string literals are removed.
As with stop patterns, longer stop word prefixes take precedence over shorter prefixes contained within them. For example, the lieutenant colonel stop word prefix is applied where applicable even though colonel is also a stop word prefix.
RNI includes files with generic stop word prefixes for names in Arabic, English, Greek, Hungarian, Spanish, and Thai. These files are in plugins/rni/bt_root/rlpnc/data/rnm/ref/override; for example, stopprefixes_eng.txt and stopprefixes_spa.txt. You can modify the contents of these files. To add stop word prefixes for another language, create a UTF-8 file in the same directory with the three-letter language code in the filename. For example, stopprefixes_rus.txt would include stop word prefixes for use with Russian text.
Overriding Name Pair Matches
You can create UTF-8 text files that specify the scores to be assigned to specified full-name pairs. The filename uses the ISO 639-3 three-letter language codes to designate the language of each full name in the full-name pairs:
fullnames_LANG1_LANG2[_TYPE].txt
where LANG1 is the three-letter language code for the first name and LANG2 is the three-letter language code for the second name.
Tip
Include _TYPE, where TYPE designates an entity type such as PERSON, if you want the override to apply only when the name (for stop patterns), matching names, or matching tokens have been assigned this entity type. If the filename does not include _TYPE, it is applied to all names, regardless of the entity type.
Each row in the file, except for rows that begin with #, is a tab-delimited full-name pair and score:
name1 Tab name2 Tab score
The scores must be between 0 and 1.0, where 0 indicates no match and 1.0 indicates a perfect match.
Tip
Since the minimum score for names returned by RNI queries must be greater than 0, an RNI query will not return a name whose override score is 0. Name match operations, on the other hand, will return an override score of 0.
The installation includes a sample file with sample entries commented out: plugins/rni/bt_root/rlpnc/data/rnm/ref/override/fullnames_eng_eng.txt. Any non-commented-out entries in this file assign scores to English queries applied to English names in an RNI index. For example,
John Doe	Joe Bloggs	1.0
indicates that the query name John Doe matches the index name Joe Bloggs (both used in different regions to indicate 'person unknown') with a score of 1.0.
These match patterns are commutative. The previous entry also specifies a match score of 1.0 if the query name is Joe Bloggs and the index includes a document with an rni_name field containing John Doe.
You can add entries for English to English name matches to fullnames_eng_eng.txt, and create additional override files, using the filename to specify the languages. For example, the following entries could appear in fullnames_jpn_eng.txt:
外山恒 Toyama Koichi 1.0
ヒラリークリントン Hillary Clinton 1.0
Overriding Token Pair Matches
You can create text files that specify token (name-element) pairs that match. Token pair overrides are supported for English-English, Japanese-English, Chinese-English, Russian-English, Spanish-English, Japanese-Japanese, Russian-Russian, English-Korean, Korean-Korean, Spanish-Spanish, Greek-English, and Hungarian-English token pairs. Such pairs may include a proper name and nickname, such as Peter and Pete, or cognate names, such as Peter and Pedro. Tokens cannot contain whitespace. When RNI evaluates two names, each of which contains an element from the pair, it enhances the value of the resulting name match score. For example, if Abigail and Abby constitute a token pair, then the match score for Abigail Harris and Abby Harris will be higher than it would be if the token pair had not been specified.
The token pairs may be within a language or cross-lingual, as indicated by the file name:
tokens_LANG1_LANG2[_TYPE].txt
where LANG1 is the three-letter language code for the first token in each pair and LANG2 is the three-letter language code for the second token in each pair. Each entry in the file, except for rows that begin with #, is a tab-delimited token pair and may include a raw score between 0.0 and 1.0 or an indicator that at least one of the tokens is a nickname or that the tokens are cognates:
Token1 Tab Token2 Tab [[0.0-1.0]|NICKNAME|COGNATE|VARIANT]
A token pair override score (raw score or indicator) serves as a minimum score, but you can write "/force" after a token score to force it to be exactly that value:
Token1 Tab Token2 Tab [([0.0-1.0]|NICKNAME|COGNATE|VARIANT)/force]
To prevent a token pair from matching, use the SUPPRESS indicator, an alias for "0.0/force". If you do not include NICKNAME, COGNATE, VARIANT, or SUPPRESS, RNI assumes NICKNAME.
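Putting these pieces together, a hypothetical token override file might contain entries like the following (the pairs are illustrative, not shipped defaults; fields are tab-delimited):

```
# minimum score of 0.9 for this pair
Jon	John	0.9
# force the score to exactly 0.85
Jon	Johan	0.85/force
# never match this pair
Jon	Joan	SUPPRESS
```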
RNI includes plugins/rni/bt_root/rlpnc/data/rnm/ref/override/tokens_eng_eng.txt, which contains a list of English/English token pairs. For example:
Peter	Pete	NICKNAME
Peter	Pedro	COGNATE
This directory also contains Chinese to English token overrides for LOCATION and ORGANIZATION: tokens_zho_eng_LOCATION.txt and tokens_zho_eng_ORGANIZATION.txt.
When you create an additional file in the same location, use the ISO 639-3 three-letter language code in the filename to identify the language of each name element in the pair. For example, tokens_eng_eng.txt indicates that the contents match English names to English names; tokens_eng_eng_ORGANIZATION.txt indicates that the contents match English ORGANIZATION names to English ORGANIZATION names. The SDK includes a sample file for matching English/English tokens in LOCATION entities: tokens_eng_eng_LOCATION.txt.
We recommend that you enter the language codes in alphabetical order in the filename and token pairs. Keep in mind that the order has no influence on the resulting score, since the scoring is commutative.
Multiple Sets of Token Overrides
There may be situations in which you want to define multiple sets of token overrides for an index. This can be accomplished by combining override file names with the overrideSelector parameter.
- The value of overrideSelector is an alphanumeric string, and it controls which set of overrides is considered during querying and matching. The value is case-insensitive. By default, RNI reads overrides for the "default" selector.
- The value of overrideSelector can be appended to the name of the override text file containing the token pairs, preceded by a dash (-). For example, a file for person name overrides in English - English matching using the overrideSelector of OverrideGroup1 would be named:
tokens_eng_eng_PERSON-OverrideGroup1.txt
- If no valid selector name is found in the override file's name, overrides in that file are applied to the "default" selector.
Normalizing Token Variants
You can create text files that specify the normalized form for tokens (name elements) and the variants to normalize to that form. The file name indicates the language and, optionally, the entity type for the tokens to be normalized:
equivalenceclasses_LANG[_TYPE].txt
For example, equivalenceclasses_jpn.txt would contain entries for normalizing Japanese token variants for any entity type to a normalized form.
Each entry in the file contains a normalized form followed by one or more variant forms. The syntax is as follows:
[normal_form1]
variant1_1
variant1_2
variant1_3
[normal_form2]
variant2_1
variant2_2
variant2_3
...
RNI includes plugins/rni/bt_root/rlpnc/data/rnm/ref/override/equivalenceclasses_eng_PERSON.txt, which contains a list of variant renderings to normalize to muhammad:
[muhammad]
mohammed
mahamed
mohamed
mohamad
mohammad
muhammed
muhamed
muhammet
muhamet
md
mohd
muhd
You can add lists of variants to this file, including the normalized form in square brackets to start each list.
Low Weight Tokens
You can edit the list of tokens that are given low influence in RNI. These low weight tokens are parts of a name (such as suffixes) that don't contribute much to name matching accuracy.
The file name is lowWeightTokens_LANG.txt.
For example, plugins/rni/bt_root/rlpnc/data/rnm/ref/lowWeightTokens_eng.txt contains entries for tokens in English that you may want to put less emphasis on: "jr", "sr", "ii", "iii", "iv", "de".
Matching Organizations with Real World Ids
Organizations and companies often have nicknames which are very different from the company's official name. For example, International Business Machines, or IBM, is known by the nickname Big Blue. As there is no phonetic similarity between the two names, a match query between those two organization names would result in a low score. A real world identifier associates companies, along with their associated nicknames and permutations, with an identifier. When enabled, a search between two company names will include a comparison between the real world identifiers for the two names, thus matching dissimilar names for the same corporate entity.
RNI contains real world identifiers for corporations, which pair an entity id with nicknames and common permutations of the corporation name. Name Matching Within a Language lists the languages with provided real-world id dictionaries. Customers can also generate their own real-world id dictionaries to supplement the provided dictionaries.
Table 4. Real World Id Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| useRealWorldIds | Enables real world ids; indexes the real world ids as corporation names are added to the index. You must reindex if you enable it after indexing. | true (enabled) |
| doQueryRealWorldIds | Enables querying with real world ids. Set by language pair. | true (enabled) |
| realWorldIdScore | Sets the match score when two names match due to matching real world ids. Set by language pair. | 0.98 |
| nameRealWorldQueryBoost | Boosts the value of the real world id results from the first pass, increasing the likelihood of real world id matches being returned from the first pass. Set by language pair. | 35 |
Building a Real World Id File
Many companies have their own file of organizations with their different names. To improve matching between organization names, you can supplement the real world ids provided in RNI and build your own file of real world ids. The provided file will build a binary file in the specified output directory named <LANG>_ORGANIZATION_ids.bin
where <LANG> is the three-letter language code of the file.
The input file is a tab-separated file (.tsv). Each line contains an organization name and a corresponding alphanumeric id. The file can only contain a single language and script; you must create a separate file for each language.
IBM WE1X92
Big Blue WE1X92
International Business Machines WE1X92
Unzip the file realWorldIDBuilder.zip, found in the plugins/rni/bt_root directory, and run the build command. Instructions on how to run the program are in the README.md file in the zip file.