Addresses have their own match parameters and override files that you can customize to achieve the best results for your data.
There are two types of override files for addresses: stop word files (stop patterns and stop word prefixes) and token pair override files.
File Directories
- The parameters are modified in the plugins/rni/bt_root/rlpnc/data/etc/parameter_profiles.yaml file.
- The address matching override files are in the plugins/rni/bt_root/rlpnc/data/addresses/ref/overrides directory.
- The address stop word files are in the plugins/rni/bt_root/rlpnc/data/addresses/ref/stopwords directory.
Modifying Address Parameters
To start tuning the parameters, run address matching on the test set and look for any unexpected results. Tunable parameters are defined in parameter_defs.yaml. The parameter files are described in Parameter Configuration Files.
Note
Changes made to the any profile apply to all supported languages.
An example parameter to tune is addressJoinedTokenLimit, which controls leniency towards joining or separating tokens. For some use cases, you may decide that joining many tokens within a field is acceptable. To adjust this parameter, find an existing parameter profile or define a new one, add the parameter, and modify its value. Increasing the value of addressJoinedTokenLimit allows more tokens to be merged.
Another example parameter is houseNumberAddressFieldWeight, which controls the weight of the houseNumber score when calculating the overall score. This type of parameter is available for all address fields, and each field is weighted evenly at 1 by default. For example, cityAddressFieldWeight controls the weight of the city field when matching addresses.
Once you define a profile and set a parameter value, rerun the address pairwise match, scoring the match with the edited parameter_profiles.yaml file.
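To build intuition for what a field weight does, the following Python sketch combines per-field scores with a weighted mean. The formula, the function name, and the sample scores are assumptions made for illustration only; this section does not describe how RNI actually combines field scores. Only the default weight of 1 and the idea of a per-field weight come from the text above.

```python
def weighted_address_score(field_scores, field_weights=None, default_weight=1.0):
    """Combine per-field scores with a weighted mean (illustrative assumption,
    not RNI's documented formula)."""
    field_weights = field_weights or {}
    total = 0.0
    weight_sum = 0.0
    for field, score in field_scores.items():
        # Fields without an explicit weight use the default of 1.
        weight = field_weights.get(field, default_weight)
        total += weight * score
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


# Hypothetical per-field scores for one address pair.
scores = {"houseNumber": 0.2, "road": 0.9, "city": 1.0}

even = weighted_address_score(scores)
# Doubling the house number weight pulls the overall score toward 0.2.
heavier = weighted_address_score(scores, {"houseNumber": 2.0})
```

Under this assumed formula, raising a field's weight makes a mismatch in that field cost more, which is the kind of effect to look for when you rerun the pairwise match.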
Stop Patterns and Stop Word Prefixes
RNI uses stop patterns and stop word prefixes to remove patterns from address fields during indexing and queries, before matching algorithms are applied. Stripping prefixes with string literals is faster than applying stop patterns (regular expressions), so use stop word prefixes to efficiently remove prefixes, such as "the", that you do not want to include in address matching.
For each address field, RNI performs the following steps in order:
- Character-level normalization strips punctuation, including periods, commas, hyphens, and the number sign. White space is reduced to single spaces and all characters are lower-cased.
- Stop patterns are applied.
- Stop words are applied.
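The three steps above can be sketched in Python as follows. This is a minimal approximation, not RNI's implementation: the function names are hypothetical, the punctuation set contains only the characters named above, punctuation is assumed to be deleted rather than replaced with a space, and Python's re module stands in for Java regular expressions.

```python
import re

# Only the punctuation named in the text; the real normalizer may strip more.
_PUNCTUATION = re.compile(r"[.,\-#]")
_WHITESPACE = re.compile(r"\s+")


def normalize_field(value):
    """Step 1: strip punctuation, collapse white space, lower-case."""
    without_punct = _PUNCTUATION.sub("", value)
    # Trimming the ends is an assumption; the text only mentions collapsing.
    return _WHITESPACE.sub(" ", without_punct).strip().lower()


def preprocess_field(value, stop_patterns=(), stop_prefixes=()):
    """Apply the steps in the documented order."""
    text = normalize_field(value)
    for pattern in stop_patterns:          # Step 2: stop patterns (regexes)
        text = re.sub(pattern, "", text)
    for prefix in stop_prefixes:           # Step 3: stop word prefixes (literals)
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    return _WHITESPACE.sub(" ", text).strip()


cleaned = preprocess_field(
    "The Old Mill,  Unit #5",
    stop_patterns=[r"\bunit \d+\b"],   # hypothetical stop pattern
    stop_prefixes=["the "],            # hypothetical stop word prefix
)
```

Because normalization runs first, stop patterns and stop word prefixes in this sketch are written in lower case and without the stripped punctuation.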
A stop pattern is a regular expression that excludes matching address field elements during indexing and queries. You can use any regular expression supported by the Java java.util.regex.Pattern class; see the Javadoc for detailed documentation.
Stop patterns for a given address field are specified in a UTF-8 file whose name includes the AddressField name:
stopregexes_LANG_ADDRESS_FIELD__FIELD.txt
where LANG is a three-letter language code and FIELD is an AddressField name. Currently, the only supported values for LANG are eng and zho. Each row in the file, except for rows that begin with #, is a regular expression. Leading and trailing whitespace is removed from regex lines, so use \s at the beginning and end of a pattern where needed.
Note
The delimiter before FIELD is a double underscore (__).
Elements in the address fields matching any of these regular expressions are removed. Longer stop patterns are applied before shorter stop patterns, so the presence of a shorter stop pattern does not prevent the stripping of a longer pattern that includes the shorter pattern.
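The following Python sketch shows why the longest-first ordering matters. It is an approximation under stated assumptions: the helper names are hypothetical, "longer" is measured by the length of the pattern text, blank rows are skipped, and Python's re module stands in for java.util.regex.Pattern, whose syntax differs in places.

```python
import re


def load_stop_patterns(rows):
    """Return the regular expressions from file rows, longest first."""
    patterns = []
    for row in rows:
        if row.startswith("#"):
            continue                      # comment row
        stripped = row.strip()            # leading/trailing whitespace is removed
        if stripped:                      # skipping blank rows is an assumption
            patterns.append(stripped)
    return sorted(patterns, key=len, reverse=True)


def apply_stop_patterns(text, patterns):
    """Remove every match of each pattern, in the order given."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return text


# Hypothetical file contents.
rows = ["# building markers", "  bldg  ", r"bldg\s\d+"]
patterns = load_stop_patterns(rows)

longest_first = apply_stop_patterns("bldg 7 main street", patterns)
shortest_first = apply_stop_patterns("bldg 7 main street", list(reversed(patterns)))
```

With the longest pattern applied first, the whole "bldg 7" element is removed. Applied in the opposite order, the short pattern consumes "bldg" and leaves a stray "7" that the longer pattern can no longer match.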
Stop pattern files are arranged by field in plugins/rni/bt_root/rlpnc/data/addresses/ref/stopwords. You can add patterns to existing files, or if the file doesn't exist, create a UTF-8 file in the directory and include the full address field identifier (ADDRESS_FIELD__FIELD) in the filename. For example, stopregexes_eng_ADDRESS_FIELD__CITY.txt would include regular expressions to remove elements from the CITY address field for English.
Use of complex patterns may increase processing time. When possible, use stop word prefixes.
A stop word prefix is a string literal that strips the matching prefix from address field elements during indexing and queries.
Stop word prefixes for a given address field are specified in a UTF-8 file whose name includes the AddressField name:
stopprefixes_LANG_ADDRESS_FIELD__FIELD.txt
where LANG is a three-letter language code and FIELD is an AddressField name. Currently, the only supported values for LANG are eng and zho. Each row in the file, except for rows that begin with #, is a string literal.
Note
The delimiter before FIELD is a double underscore (__).
Prefixes in the address field matching any of these string literals are removed.
Like stop patterns, longer stop word prefixes take precedence over shorter prefixes that the longer stop word contains.
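As a rough illustration of prefix precedence, the sketch below strips the longest matching literal prefix from a single element. The function name and the sample prefixes are hypothetical, and stripping at most one prefix per element is an assumption.

```python
def strip_stop_prefix(element, prefixes):
    """Strip the longest matching literal prefix from one element."""
    # Try longer prefixes first so they take precedence over the shorter
    # prefixes they contain.
    for prefix in sorted(prefixes, key=len, reverse=True):
        if prefix and element.startswith(prefix):
            return element[len(prefix):]
    return element


# Hypothetical stop word prefixes for a city field.
prefixes = ["the ", "the city of "]

with_long = strip_stop_prefix("the city of boston", prefixes)
only_short = strip_stop_prefix("the city of boston", ["the "])
unchanged = strip_stop_prefix("boston", prefixes)
```

Since this is a plain string comparison with no regular expression engine involved, it reflects why prefixes are the cheaper option when a literal is enough.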
RNI includes files with stop word prefixes for selected address fields in English and Chinese. These files are in plugins/rni/bt_root/rlpnc/data/addresses/ref/stopwords. You can modify the contents of these files. To add stop word prefixes for a different address field, create an additional UTF-8 file in the same subdirectory and include the full address field identifier (ADDRESS_FIELD__FIELD) in the filename. For example, stopprefixes_eng_ADDRESS_FIELD__CITY.txt would include stop word prefixes for the CITY address field for English.
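Because the double underscore before FIELD is easy to mistype, a small helper can build the file names for you. The helper below is a hypothetical convenience, not part of RNI; it only encodes the naming rules stated in this section and does not check that FIELD is a valid AddressField name.

```python
SUPPORTED_LANGS = ("eng", "zho")                 # per the text above
STOP_FILE_KINDS = ("stopregexes", "stopprefixes")


def stop_file_name(kind, lang, field):
    """Build a stop pattern or stop word prefix file name."""
    if kind not in STOP_FILE_KINDS:
        raise ValueError(f"unknown stop file kind: {kind!r}")
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language code: {lang!r}")
    # Single underscores separate the parts; a double underscore precedes FIELD.
    return f"{kind}_{lang}_ADDRESS_FIELD__{field}.txt"


prefix_file = stop_file_name("stopprefixes", "eng", "CITY")
regex_file = stop_file_name("stopregexes", "zho", "CITY")
```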
Overriding Token Pair Matches
You can create text files that specify token (address field element) pairs that match. Token pair overrides are supported for English-English, Chinese-English, and Chinese-Chinese. Tokens cannot contain whitespace. When RNI evaluates two address fields, each of which contains an element from the pair, it enhances the value of the resulting address match score. For example, if road and rd constitute a token pair, then the match score for Stuart Road and Stuart Rd will be higher than it would be if the token pair had not been specified.
The token pairs may be within a language or cross-lingual, as indicated by the file name:
LANG1_LANG2_FIELD.txt
where LANG1 is the three-letter language code for the first token in each pair, LANG2 is the three-letter language code for the second token in each pair, and FIELD is the AddressField name. Each entry in the file, except for rows that begin with #, is a tab-delimited token pair and may include a raw score between 0.0 and 1.0. If no score is provided, the addressOverrideDefaultScore parameter value is used.
Token1 Tab Token2 Tab [0.0-1.0]
A token pair override score serves as a minimum score, but you can write /force after a token score to force it to be exactly that value:
Token1 Tab Token2 Tab [0.0-1.0]/force
If you would like to prevent a token pair from matching, you can use the SUPPRESS indicator as an alias for "0.0/force".
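The row format and the minimum, /force, and SUPPRESS semantics can be sketched in Python as below. This is an illustration under assumptions, not RNI's parser: the names are hypothetical, SUPPRESS is assumed to appear in the score column, and the default score is passed in by the caller because the value of addressOverrideDefaultScore is not given here.

```python
from typing import NamedTuple


class TokenPairOverride(NamedTuple):
    token1: str
    token2: str
    score: float
    force: bool


def parse_override_row(row, default_score):
    """Parse one tab-delimited row; return None for comment or blank rows."""
    if row.startswith("#") or not row.strip():
        return None
    columns = row.rstrip("\r\n").split("\t")
    if len(columns) not in (2, 3):
        raise ValueError(f"expected 2 or 3 tab-delimited columns: {row!r}")
    token1, token2 = columns[0], columns[1]
    for token in (token1, token2):
        # Tokens cannot contain whitespace (and must not be empty).
        if not token or any(ch.isspace() for ch in token):
            raise ValueError(f"invalid token in row: {row!r}")
    raw = columns[2].strip() if len(columns) == 3 else ""
    if not raw:
        return TokenPairOverride(token1, token2, default_score, False)
    if raw == "SUPPRESS":                 # alias for "0.0/force"
        return TokenPairOverride(token1, token2, 0.0, True)
    force = raw.endswith("/force")
    if force:
        raw = raw[: -len("/force")]
    score = float(raw)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range in row: {row!r}")
    return TokenPairOverride(token1, token2, score, force)


def apply_override(computed_score, override):
    """An override is a floor unless /force pins the score exactly."""
    if override.force:
        return override.score
    return max(computed_score, override.score)


DEFAULT = 0.9  # hypothetical stand-in for addressOverrideDefaultScore
floor = parse_override_row("road\trd\t0.8", DEFAULT)
pinned = parse_override_row("road\trd\t0.8/force", DEFAULT)
blocked = parse_override_row("north\tsouth\tSUPPRESS", DEFAULT)
```

In this sketch, a computed token score of 0.95 stays at 0.95 with the plain 0.8 override, drops to 0.8 with 0.8/force, and drops to 0.0 for a suppressed pair.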
RNI includes plugins/rni/bt_root/rlpnc/data/addresses/ref/override/eng_eng_state.txt, which contains a list of U.S. state abbreviations. For example:
Massachusetts MA
California CA
When you create an additional file in the same location, use the respective AddressField name in the filename to identify the address field that each token in the pair pertains to. For example, zho_eng_cityDistrict.txt indicates that the contents match Chinese-English cityDistrict address fields.
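When auditing an override directory, it can help to decode each file name back into its language pair and field. The helper below is a hypothetical sketch that only applies the LANG1_LANG2_FIELD.txt convention described above; it does not validate the language codes or the field name.

```python
def parse_override_file_name(file_name):
    """Split LANG1_LANG2_FIELD.txt into (lang1, lang2, field)."""
    suffix = ".txt"
    if not file_name.endswith(suffix):
        raise ValueError(f"not a .txt override file: {file_name!r}")
    stem = file_name[: -len(suffix)]
    # Split on the first two underscores only, so the field part stays whole.
    parts = stem.split("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected LANG1_LANG2_FIELD.txt: {file_name!r}")
    lang1, lang2, field = parts
    return lang1, lang2, field


cross_lingual = parse_override_file_name("zho_eng_cityDistrict.txt")
same_language = parse_override_file_name("eng_eng_state.txt")
```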