RNI includes UTF-8 text files in the plugins/rni/bt_root/rlpnc/data/rnm/ref/override subdirectory that designate name elements to strip during indexing and queries, sample full-name pairs with match scores, token pairs to receive enhanced scores during queries, and token variants to normalize to the same form during queries. This directory also contains sample files for performing these operations on designated entity types.
You can modify these files and add files to the same subdirectory to extend coverage to additional supported languages. You can also create files that apply only to a specified entity type, such as PERSON.
Stop Patterns and Stop Word Prefixes
Before running any matching algorithms, the names are transformed into tokens that can be compared. RNI uses stop patterns and stop word prefixes to remove patterns, including titles such as Mr., Senator, or General, that you do not want to include in name matching. Both stop patterns and stop word prefixes are used to strip matching name elements during indexing and queries. Using string literals to strip prefixes is faster than applying stop patterns (regular expressions), so you should use stop words for the most efficient removal of prefixes, such as titles. Stop words are language-dependent.
For each name, RNI performs the following steps in order:
1. Character-level normalization is applied: punctuation is stripped (with the exception of periods, commas, and hyphens), white space is reduced to single spaces, all characters are lower-cased, and diacritical marks are removed.
2. Stop patterns are applied.
3. Stop words are applied.
4. RNI cycles through the stop patterns and then the stop words. Each cycle drops the patterns and words that stripped nothing, and the cycling ends when the list of stop patterns and stop words is empty.
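The steps above can be sketched in a few lines of Python. This is only an illustration of the described order of operations, not RNI's implementation: the stop lists are hypothetical, Python's re module stands in for Java's regex engine, and the cycling rule is a literal reading of the last step.

```python
import re
import unicodedata

# Hypothetical stop data for illustration; real entries live in the override files.
STOP_PATTERNS = [r"brigadier[- ]general", r"general"]
STOP_PREFIXES = ["mr.", "senator"]


def normalize(name: str) -> str:
    """Drop diacritics, strip punctuation except . , - then lower-case and squeeze spaces."""
    decomposed = unicodedata.normalize("NFD", name)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    kept = "".join(c for c in no_marks if c.isalnum() or c.isspace() or c in ".,-")
    return " ".join(kept.lower().split())


def strip_stops(name: str) -> str:
    """Cycle through patterns, then prefixes, retiring entries that stripped nothing."""
    patterns = sorted(STOP_PATTERNS, key=len, reverse=True)
    prefixes = sorted(STOP_PREFIXES, key=len, reverse=True)
    while patterns or prefixes:
        active_patterns = []
        for pattern in patterns:
            stripped = " ".join(re.sub(pattern, " ", name).split())
            if stripped != name:
                name = stripped
                active_patterns.append(pattern)
        active_prefixes = []
        for prefix in prefixes:
            # Assumption: a prefix must be followed by a space to count as a match.
            if name.startswith(prefix + " "):
                name = name[len(prefix) + 1:]
                active_prefixes.append(prefix)
        patterns, prefixes = active_patterns, active_prefixes
    return name


print(strip_stops(normalize("Brigadier-General  José O'Neil")))
```

Each cycle either shortens the name or retires the entries that stripped nothing, so the loop ends once no entry applies.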
A stop pattern is a regular expression that excludes matching name elements during indexing and queries. You can use any regular expression supported by the Java 1.8 java.util.regex.Pattern class; see the Javadoc for detailed documentation.
Stop patterns for a given language are specified in a UTF-8 file with the ISO 639-3 three-letter language code in the filename:
stopregexes_LANG[_TYPE].txt
where LANG is a three-letter language code. Each row in the file, with the exception of rows that begin with #, is a regular expression. Leading and trailing whitespace is removed from regex lines, so use \s at the beginning and end as needed.
Name elements matching any of these regular expressions are removed. Longer stop patterns are applied before shorter stop patterns, so the presence of a shorter stop pattern does not prevent the stripping of a longer pattern that includes the shorter pattern. For example, the brigadier[- ]general stop pattern is applied first, but general is also a stop pattern and will be applied as well.
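As a rough illustration of the file rules described above, the following Python sketch reads the text of a stop-pattern file, skips comment rows, trims each row, and orders the patterns longest first. The function name is hypothetical, skipping blank rows is an assumption, and Python's re syntax is not identical to Java's, so treat this as a way to reason about the file layout rather than a validator.

```python
import re
from typing import List


def load_stop_patterns(text: str) -> List["re.Pattern[str]"]:
    """Return compiled stop patterns from file text, longest pattern first."""
    rows = []
    for row in text.splitlines():
        if row.startswith("#"):
            continue  # comment row
        row = row.strip()  # leading and trailing whitespace is not significant
        if row:  # assumption: blank rows are ignored
            rows.append(row)
    rows.sort(key=len, reverse=True)
    return [re.compile(row) for row in rows]


sample = "# military titles\ngeneral\n  brigadier[- ]general  \n"
for pattern in load_stop_patterns(sample):
    print(pattern.pattern)
```

Because surrounding whitespace is trimmed, a pattern that needs a leading or trailing space has to spell it as \s.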
RNI includes files with stop patterns for names in English (generic and ORGANIZATION), Japanese (PERSON), Spanish (generic), and Chinese (PERSON). These files are in plugins/rni/bt_root/rlpnc/data/rnm/ref/override. The generic (non-entity-specific) English file is stopregexes_eng.txt. For example, it includes entries indicating that the common indicators for first-name-unknown and last-name-unknown, followed by nothing, are to be removed.
You can also specify which field the regex is to be applied to when processing a fielded name. Simply add Tab n, where n is the field number. To search multiple fields, include an entry for each field. When processing a name without fields, the field parameter is ignored. For example, two entries for the same regex, one with field number 2 and one with field number 3, indicate that the regex is to be applied to fields 2 and 3 in fielded names.
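To make the field suffix concrete, here is a small Python sketch that splits a stop-pattern row into its regex and optional field number. It assumes the field number is appended after the regex and separated by a single tab, which is one reading of the description above; the function name is hypothetical.

```python
from typing import Optional, Tuple


def parse_stop_regex_row(row: str) -> Tuple[str, Optional[int]]:
    """Split 'regex<TAB>n' into (regex, n); rows without a field yield (regex, None)."""
    regex, sep, field = row.strip().partition("\t")
    if sep and field.strip().isdigit():
        return regex, int(field.strip())
    return regex, None


# Two rows for the same regex target fields 2 and 3 of a fielded name.
rows = ["general\t2", "general\t3", "general"]
print([parse_stop_regex_row(row) for row in rows])
```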
You can modify the contents of this file. To add stop patterns for a different language, create an additional UTF-8 file in the same subdirectory with the three-letter language code in the filename. For example, stopregexes_ara.txt would include regular expressions with Arabic text; stopregexes_eng_PERSON.txt would include regular expressions to remove elements from PERSON names in English text.
Use of complex patterns may increase processing time. When possible, use stop word prefixes.
A stop word prefix is a string literal that strips the matching prefix from name elements during indexing and queries.
Stop word prefixes for a given language are specified in a UTF-8 file with the ISO 639-3 three-letter language code in the filename:
stopprefixes_LANG[_TYPE].txt
where LANG is a three-letter language code. Each row in the file, with the exception of rows that begin with #, is a string literal.
Prefixes matching any of these string literals are removed.
Like stop patterns, longer stop word prefixes take precedence over shorter prefixes contained within the longer stop word. For example, the lieutenant colonel stop word prefix is applied where applicable even when colonel is also a stop word prefix.
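The precedence rule can be pictured with a short Python sketch that tries the longest prefix first. The prefix list is hypothetical, and requiring a space after the prefix is an assumption made here to avoid stripping part of a longer word.

```python
from typing import List


def strip_stop_prefix(name: str, prefixes: List[str]) -> str:
    """Remove the longest matching stop word prefix from a normalized name."""
    for prefix in sorted(prefixes, key=len, reverse=True):
        if name == prefix:
            return ""
        if name.startswith(prefix + " "):
            return name[len(prefix) + 1:]
    return name


# Hypothetical prefix list; "lieutenant" alone would leave "colonel" behind.
prefixes = ["colonel", "lieutenant", "lieutenant colonel"]
print(strip_stop_prefix("lieutenant colonel john smith", prefixes))
```

Trying the longer literal first removes the whole title in one step; the shorter entries still apply to names that carry only that shorter title.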
RNI includes files with generic stop word prefixes for names in English and Spanish. These files are in plugins/rni/bt_root/rlpnc/data/rnm/ref/override: stopprefixes_eng.txt and stopprefixes_spa.txt. You can modify the contents of these files. To add stop word prefixes for another language, create a UTF-8 file in the same directory with the three-letter language code in the filename. For example, stopprefixes_rus.txt would include stop word prefixes for use with Russian text.
Overriding Name Pair Matches
You can create UTF-8 text files that specify the scores to be assigned for specified full-name pairs. The filename uses the ISO 639-3 three-letter language codes to designate the language of each full name in each of the full-name pairs:
fullnames_LANG1_LANG2[_TYPE].txt
where LANG1 is the three-letter language code for the first name and LANG2 is the three-letter language code for the second name.
Include _TYPE, where TYPE designates an entity type such as PERSON, if you want the override to apply only when the name (for stop patterns), the matching names, or the matching tokens have been assigned this entity type. If the filename does not include _TYPE, the override is applied to all names, regardless of entity type.
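The naming convention shared by the override files can be checked with a small Python sketch like the one below. The regular expression and the OverrideFile type are illustrative assumptions; the sketch does not verify that a code is a real ISO 639-3 code, nor that a given kind of file takes one language rather than two.

```python
import re
from typing import NamedTuple, Optional


class OverrideFile(NamedTuple):
    kind: str                   # e.g. "stopregexes", "fullnames", "tokens"
    langs: tuple                # one or two three-letter language codes
    entity_type: Optional[str]  # e.g. "PERSON", or None when not type-specific


_FILENAME = re.compile(
    r"^(stopregexes|stopprefixes|fullnames|tokens|equivalenceclasses)"
    r"((?:_[a-z]{3}){1,2})(?:_([A-Z]+))?\.txt$"
)


def parse_override_filename(filename: str) -> Optional[OverrideFile]:
    """Split an override filename into kind, language codes, and entity type."""
    match = _FILENAME.match(filename)
    if match is None:
        return None
    langs = tuple(match.group(2).strip("_").split("_"))
    return OverrideFile(match.group(1), langs, match.group(3))


print(parse_override_filename("fullnames_jpn_eng.txt"))
print(parse_override_filename("stopregexes_eng_PERSON.txt"))
```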
Each row in the file, with the exception of rows that begin with #, is a tab-delimited full-name pair and score:
name1 Tab name2 Tab score
The scores must be between 0 and 1.0, where 0 indicates no match, and 1.0 indicates a perfect match.
The installation includes a sample file with sample entries commented out: plugins/rni/bt_root/rlpnc/data/rnm/ref/override/fullnames_eng_eng.txt. Any non-commented-out entries in this file assign scores to English queries applied to English names in an RNI index. For example, an entry containing John Doe, Joe Bloggs, and 1.0 indicates that the query name John Doe matches the index name Joe Bloggs (both used in different regions to indicate 'person unknown') with a score of 1.0.
These match patterns are commutative. The previous entry also specifies a match score of 1.0 if the query name is Joe Bloggs and the index includes a document with an rni_name field containing John Doe.
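One way to picture the commutative lookup is to key each override by an unordered pair of names. The Python sketch below parses rows in the name1 Tab name2 Tab score layout and looks a pair up in either order. The function names are hypothetical, and the sketch compares names exactly as written, without the normalization RNI applies.

```python
from typing import Dict, FrozenSet, Optional


def load_fullname_overrides(text: str) -> Dict[FrozenSet[str], float]:
    """Parse tab-delimited 'name1, name2, score' rows, skipping comment rows."""
    overrides = {}
    for row in text.splitlines():
        if row.startswith("#") or not row.strip():
            continue
        name1, name2, score = row.split("\t")
        value = float(score)
        if not 0.0 <= value <= 1.0:
            raise ValueError("score out of range: " + score)
        # An unordered key makes the override apply in both directions.
        overrides[frozenset((name1.strip(), name2.strip()))] = value
    return overrides


def override_score(overrides: Dict[FrozenSet[str], float],
                   query: str, indexed: str) -> Optional[float]:
    """Return the override score for a name pair, or None when no override exists."""
    return overrides.get(frozenset((query, indexed)))


table = load_fullname_overrides("# sample entry\nJohn Doe\tJoe Bloggs\t1.0\n")
print(override_score(table, "Joe Bloggs", "John Doe"))
```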
You can add entries for English-to-English name matches to fullnames_eng_eng.txt and create additional override files, using the filename to specify the languages. For example, the following entries could appear in fullnames_jpn_eng.txt:
外山恒 Toyama Koichi 1.0
ヒラリークリントン Hillary Clinton 1.0
Overriding Token Pair Matches
You can create text files that specify token (name-element) pairs that match. Token pair overrides are supported for English-English, Japanese-English, Chinese-English, Russian-English, Spanish-English, Japanese-Japanese, Russian-Russian, English-Korean, Korean-Korean, Spanish-Spanish, Greek-English, and Hungarian-English token pairs. Such pairs may include a proper name and nickname, such as Peter and Pete, and cognate names, such as Peter and Pedro. Tokens cannot contain whitespace. When RNI evaluates two names, each of which contains an element from the pair, it enhances the value of the resulting name match score. For example, if Abigail and Abby constitute a token pair, then the match score for Abigail Harris and Abby Harris will be higher than it would be if the token pair had not been specified.
The token pairs may be within a language or cross-lingual, as indicated by the file name:
tokens_LANG1_LANG2[_TYPE].txt
where LANG1 is the three-letter language code for the first token in each pair and LANG2 is the three-letter language code for the second token in each pair. Each entry in the file, with the exception of rows that begin with #, is a tab-delimited token pair and may include a raw score between 0.0 and 1.0 or an indicator that at least one of the tokens is a nickname, or that the tokens are cognates or variants:
Token1 Tab Token2 Tab [[0.0-1.0]|NICKNAME|COGNATE|VARIANT]
A token pair override score (raw score or indicator) serves as a minimum score, but you can write "/force" after a token score to force it to be exactly that value:
Token1 Tab Token2 Tab [([0.0-1.0]|NICKNAME|COGNATE|VARIANT)/force]
If you would like to prevent a token pair from matching, you can use the SUPPRESS indicator as an alias for "0.0/force". If you do not include NICKNAME, COGNATE, VARIANT, or SUPPRESS, RNI assumes NICKNAME.
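The row syntax described above can be sketched as a small Python parser. The names TokenOverride, parse_token_row, and effective_score are hypothetical. The sketch covers only what the text states: the default of NICKNAME, the /force suffix, SUPPRESS as an alias for 0.0/force, and a raw score acting as a minimum unless forced. The numeric values RNI assigns to the indicators are not given here, so the sketch does not guess them.

```python
from typing import NamedTuple, Union

INDICATORS = {"NICKNAME", "COGNATE", "VARIANT"}


class TokenOverride(NamedTuple):
    token1: str
    token2: str
    value: Union[float, str]  # raw score, or an indicator name
    force: bool


def parse_token_row(row: str) -> TokenOverride:
    """Parse 'Token1<TAB>Token2[<TAB>spec]' where spec may end in '/force'."""
    parts = row.strip().split("\t")
    if len(parts) not in (2, 3):
        raise ValueError("expected 2 or 3 tab-delimited columns")
    token1, token2 = parts[0], parts[1]
    if any(c.isspace() for c in token1 + token2):
        raise ValueError("tokens cannot contain whitespace")
    spec = parts[2] if len(parts) == 3 else "NICKNAME"  # documented default
    if spec == "SUPPRESS":
        return TokenOverride(token1, token2, 0.0, True)  # alias for 0.0/force
    base, sep, flag = spec.partition("/")
    if sep and flag != "force":
        raise ValueError("unknown suffix: " + flag)
    if base in INDICATORS:
        return TokenOverride(token1, token2, base, bool(sep))
    score = float(base)
    if not 0.0 <= score <= 1.0:
        raise ValueError("score out of range: " + base)
    return TokenOverride(token1, token2, score, bool(sep))


def effective_score(computed: float, override: TokenOverride) -> float:
    """Apply a raw-score override: a minimum by default, exact when forced."""
    if isinstance(override.value, str):
        raise ValueError("indicator scores are internal to RNI")
    return override.value if override.force else max(computed, override.value)


print(parse_token_row("Peter\tPete\tNICKNAME"))
print(effective_score(0.9, parse_token_row("Peter\tPierre\t0.8/force")))
```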
RNI includes plugins/rni/bt_root/rlpnc/data/rnm/ref/override/tokens_eng_eng.txt, which contains a list of English/English token pairs. For example:
Peter Pete NICKNAME
Peter Pedro COGNATE
This directory also contains Chinese-to-English token overrides for LOCATION and ORGANIZATION: tokens_zho_eng_LOCATION.txt and tokens_zho_eng_ORGANIZATION.txt.
When you create an additional file in the same location, use the ISO 639-3 three-letter language code in the filename to identify the language of each name element in the pair. For example, tokens_eng_eng.txt indicates that the contents match English names to English names; tokens_eng_eng_ORGANIZATION.txt indicates that the contents match English ORGANIZATION names to English ORGANIZATION names. The SDK includes a sample file for matching English/English tokens in LOCATION entities: tokens_eng_eng_LOCATION.txt.
We recommend that you enter the language codes in alphabetical order in the filename and token pairs. Keep in mind that the order has no influence on the resulting score, since the scoring is commutative.
Normalizing Token Variants
You can create text files that specify the normalized form for tokens (name elements) and variants to normalize to that form. The file name indicates the language and optionally the entity type for the tokens to be normalized:
equivalenceclasses_LANG[_TYPE].txt
For example, equivalenceclasses_jpn.txt would contain entries for normalizing Japanese token variants for any entity type to a normalized form.
Each entry in the file contains a normalized form followed by one or more variant forms. The syntax is as follows:
[normal_form1]
variant1_1
variant1_2
variant1_3
[normal_form2]
variant2_1
variant2_2
variant2_3
...
RNI includes plugins/rni/bt_root/rlpnc/data/rnm/ref/override/equivalenceclasses_eng_PERSON.txt, which contains a list of variant renderings to normalize to muhammad:
[muhammad]
mohammed
mahamed
mohamed
mohamad
mohammad
muhammed
muhamed
muhammet
muhamet
md
mohd
muhd
You can add lists of variants to this file, including the normalized form in square brackets to start each list.
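To illustrate how a file in this layout maps variants to a normalized form, the Python sketch below builds a variant-to-normal-form dictionary. The function names are hypothetical, and treating rows that begin with # as comments is an assumption carried over from the other override files.

```python
from typing import Dict


def load_equivalence_classes(text: str) -> Dict[str, str]:
    """Map each variant to the bracketed normal form that precedes it."""
    mapping = {}
    normal_form = None
    for row in text.splitlines():
        row = row.strip()
        if not row or row.startswith("#"):  # assumption: comments and blanks are skipped
            continue
        if row.startswith("[") and row.endswith("]"):
            normal_form = row[1:-1]
        elif normal_form is None:
            raise ValueError("variant listed before any [normal_form]: " + row)
        else:
            mapping[row] = normal_form
    return mapping


def normalize_token(token: str, mapping: Dict[str, str]) -> str:
    """Return the normal form of a token, or the token itself if it has none."""
    return mapping.get(token, token)


classes = load_equivalence_classes("[muhammad]\nmohammed\nmohd\nmd\n")
print(normalize_token("mohd", classes))
```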
You can edit the list of tokens that are given low influence in RNI. These low-weight tokens are parts of a name (such as suffixes) that do not contribute much to name matching accuracy.
The file name is lowWeightTokens_LANG.txt.
For example, plugins/rni/bt_root/rlpnc/data/rnm/ref/lowWeightTokens_eng.txt contains entries for tokens in English that you may want to put less emphasis on: "jr", "sr", "ii", "iii", "iv", "de".
Ignore malformed and null value parameters for RNI types
You can configure RNI to index documents that contain names in unsupported languages by setting the ignoreBadData parameter, and to index null values by setting the allowNullValue parameter. By default, both parameters are disabled. If the ignoreBadData parameter is enabled, any document containing a name in an unsupported language is indexed successfully, but search capabilities are limited to supported languages. The same applies to documents that contain null values when the allowNullValue parameter is enabled. These features are useful when performing bulk operations on Elasticsearch.
The file name is parameter_profiles.yaml, located in plugins/rni/bt_root/rlpnc/data/etc/.
To turn on either of these features, set the value of the ignoreBadData or allowNullValue parameter in that file to true.