Addresses have their own match parameters and override files that you can customize to achieve the best results for your data.
There are two types of override files for addresses: stop word files (stop patterns and stop word prefixes) and token pair match override files.
File Directories
The parameters are modified in the BT_ROOT/rlpnc/data/etc/parameter_profiles.yaml file.
The address matching override files are in the BT_ROOT/rlpnc/data/addresses/ref/overrides directory.
The address stop word files are in the BT_ROOT/rlpnc/data/addresses/ref/stopwords directory.
Modifying Address Parameters
To start tuning the parameters, run address matching on the test set and look for any unexpected results. Tunable parameters are defined in parameter_defs.yaml. The parameter files are described in Parameter Configuration Files.
Note
Only the any profile is supported for address parameters; language-specific profiles are not supported for addresses.
An example parameter to tune is addressJoinedTokenLimit, which controls leniency towards joining or separating tokens. For some use cases, you may decide that joining many tokens within a field is acceptable. To adjust this parameter, find an existing parameter profile or define a new one, add the parameter, and modify its value. Increasing the value of addressJoinedTokenLimit allows more tokens to be merged.
Another example parameter is houseNumberAddressFieldWeight, which controls the weight of the houseNumber score when calculating the overall score. This type of parameter is available for all address fields, and each field has a default weight of 1, so all fields are weighted evenly. For example, cityAddressFieldWeight controls the weight of the city field when matching addresses.
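To make the effect of a field weight concrete, here is a minimal sketch that combines per-field scores as a weighted mean. The exact formula RNI uses to combine field scores is not described here, so the weighted mean, the function name overall_score, and the sample scores are all assumptions for illustration only.

```python
def overall_score(field_scores, field_weights=None):
    """Weighted mean of per-field scores (an assumed combination rule).

    Fields without an explicit weight use the default weight of 1.
    """
    field_weights = field_weights or {}
    weighted_sum = 0.0
    total_weight = 0.0
    for field, score in field_scores.items():
        weight = field_weights.get(field, 1.0)
        weighted_sum += weight * score
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0


# Hypothetical per-field scores for one address pair.
scores = {"houseNumber": 0.5, "road": 1.0, "city": 1.0}

even = overall_score(scores)                           # all weights are 1
boosted = overall_score(scores, {"houseNumber": 3.0})  # houseNumber counts more
```

Under this assumed rule, raising the houseNumber weight pulls the overall score toward the houseNumber score, so a house number mismatch costs more.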
Once you define a profile and set a parameter value, rerun the address pairwise match, scoring the match with the edited parameter_profiles.yaml file.
Stop Patterns and Stop Word Prefixes
RNI uses stop patterns and stop word prefixes to remove patterns from address fields during indexing and queries, before matching algorithms are applied. Stripping prefixes with string literals is faster than applying stop patterns (regular expressions), so use stop word prefixes to efficiently remove prefixes, such as "the", that you do not want to include in address matching.
For each address field, RNI performs the following steps in order:
Character-level normalization, stripping punctuation including periods, commas, hyphens, and the number sign. White space is reduced to single spaces and all characters are lower-cased.
Stop patterns are applied.
Stop words are applied.
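The three steps above can be sketched as follows. This is an illustration, not RNI's implementation: the function names are hypothetical, Python's re module stands in for java.util.regex.Pattern (the two syntaxes differ in places), and the sketch assumes that punctuation is deleted rather than replaced by a space and that a stop word prefix is matched at the start of the field value.

```python
import re


def normalize(text):
    # Step 1: drop periods, commas, hyphens, and the number sign,
    # collapse runs of whitespace to one space, and lower-case.
    text = re.sub(r"[.,\-#]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


def apply_stop_patterns(text, patterns):
    # Step 2: remove regex matches, trying longer patterns first.
    for pattern in sorted(patterns, key=len, reverse=True):
        text = re.sub(pattern, " ", text)
    return re.sub(r"\s+", " ", text).strip()


def strip_stop_prefix(text, prefixes):
    # Step 3: remove the longest literal prefix that matches.
    for prefix in sorted(prefixes, key=len, reverse=True):
        if text.startswith(prefix):
            return text[len(prefix):].strip()
    return text


def preprocess_field(text, stop_patterns, stop_prefixes):
    text = normalize(text)
    text = apply_stop_patterns(text, stop_patterns)
    return strip_stop_prefix(text, stop_prefixes)


result = preprocess_field("The  Old Mill, Apt. #4", [r"\bapt\s*\d+\b"], ["the "])
```

Because normalization runs first, stop patterns and prefixes in this sketch are written in lower case and without the stripped punctuation.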
A stop pattern is a regular expression that excludes matching address field elements during indexing and queries. You can use any regular expression supported by the Java 1.8 java.util.regex.Pattern class; see the Javadoc for detailed documentation.
Stop patterns for a given address field are specified in a UTF-8 file named after the AddressField:
stopregexes_FIELD.txt
where FIELD is an AddressField name. Each row in the file, except for rows that begin with #, is a regular expression. Leading and trailing whitespace is removed from regex lines, so use \s at the beginning and end where needed.
Elements in the address fields matching any of these regular expressions are removed. Longer stop patterns are applied before shorter stop patterns, so the presence of a shorter stop pattern does not prevent the stripping of a longer pattern that includes the shorter pattern.
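The file rules and the longest-first ordering can be sketched like this. The helper load_stop_patterns is hypothetical, Python's re module is used in place of Java's regex engine, blank rows are assumed to be ignored, and "longer" is assumed to mean the length of the pattern text.

```python
import re


def load_stop_patterns(rows):
    """Parse rows of a stopregexes_FIELD.txt file into compiled patterns."""
    patterns = []
    for row in rows:
        if row.startswith("#"):
            continue              # comment row
        row = row.strip()         # leading/trailing whitespace is removed
        if row:                   # assumption: blank rows are ignored
            patterns.append(row)
    # Apply longer patterns before the shorter patterns they contain.
    patterns.sort(key=len, reverse=True)
    return [re.compile(p) for p in patterns]


rows = [
    "# unit designators",
    r"  \bsuite\b  ",
    r"\bsuite\s+number\b",
]
patterns = load_stop_patterns(rows)

cleaned = "suite number 12 main st"
for pattern in patterns:
    cleaned = pattern.sub("", cleaned)
cleaned = " ".join(cleaned.split())
```

If the shorter pattern ran first here, it would remove only "suite" and leave "number" behind, which is the case the longest-first rule avoids. To read a real file, pass an open file object, for example open(path, encoding="utf-8"), as rows.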
Stop pattern files are arranged by field in BT_ROOT/rlpnc/data/addresses/ref/stopwords. You can add patterns to existing files or, if the file doesn't exist, create a UTF-8 file in the directory and include the respective AddressField name in the filename. For example, stopregexes_stateDistrict.txt would include regular expressions to remove elements from the stateDistrict address field.
Use of complex patterns may increase processing time. When possible, use stop word prefixes.
A stop word prefix is a string literal that strips the matching prefix from address field elements during indexing and queries.
Stop word prefixes for a given address field are specified in a UTF-8 file named after the AddressField:
stopprefixes_FIELD.txt
where FIELD is an AddressField name. Each row in the file, except for rows that begin with #, is a string literal.
Prefixes in the address field matching any of these string literals are removed.
Like stop patterns, longer stop word prefixes take precedence over shorter prefixes that the longer stop word contains.
RNI includes files with stop word prefixes for addresses in English (for the house, houseNumber, road, unit, city, state, country, postCode, and poBox fields). These files are in BT_ROOT/rlpnc/data/addresses/ref/stopwords. You can modify the contents of these files. To add stop word prefixes for a different address field, create an additional UTF-8 file in the same subdirectory and include the respective AddressField name in the filename. For example, stopprefixes_stateDistrict.txt would include stop word prefixes for use on the stateDistrict address field.
Overriding Token Pair Matches
Note
Token pair overrides are only supported for English-English token pairs.
You can create text files that specify token (address field element) pairs that match. Tokens cannot contain whitespace. When RNI evaluates two address fields, each of which contains an element from the pair, it enhances the value of the resulting address match score. For example, if road and rd constitute a token pair, then the match score for Stuart Road and Stuart Rd will be higher than it would be if the token pair had not been specified.
The token pairs file name format is:
FIELD.txt
where FIELD is the AddressField name. Each entry in the file, except for rows that begin with #, is a tab-delimited token pair and may include a raw score between 0.0 and 1.0. If no score is provided, the addressOverrideDefaultScore parameter value will be used.
Token1 Tab Token2 Tab [0.0-1.0]
A token pair override score serves as a minimum score, but you can write /force after a token score to force it to be exactly that value:
Token1 Tab Token2 Tab [0.0-1.0]/force
If you would like to prevent a token pair from matching, you can use the SUPPRESS indicator as an alias for "0.0/force".
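The row format described above can be read with a small parser such as the sketch below. The names TokenPairOverride, parse_override_line, and apply_override are hypothetical, and the default score passed in stands for whatever addressOverrideDefaultScore is set to; its actual value is not given here, so 0.9 is only a placeholder.

```python
from dataclasses import dataclass


@dataclass
class TokenPairOverride:
    token1: str
    token2: str
    score: float
    force: bool


def parse_override_line(line, default_score):
    """Parse one row of a FIELD.txt file; return None for comments and blanks."""
    row = line.rstrip("\r\n")
    if not row.strip() or row.startswith("#"):
        return None
    parts = row.split("\t")
    if len(parts) not in (2, 3):
        raise ValueError(f"expected 2 or 3 tab-delimited columns: {row!r}")
    token1, token2 = parts[0], parts[1]
    if not token1 or not token2 or any(ch.isspace() for ch in token1 + token2):
        raise ValueError(f"tokens cannot be empty or contain whitespace: {row!r}")
    spec = parts[2].strip() if len(parts) == 3 else ""
    if not spec:
        # No score column: fall back to the default override score.
        return TokenPairOverride(token1, token2, default_score, False)
    if spec == "SUPPRESS":
        # SUPPRESS is an alias for 0.0/force.
        return TokenPairOverride(token1, token2, 0.0, True)
    force = spec.endswith("/force")
    if force:
        spec = spec[: -len("/force")]
    score = float(spec)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0: {row!r}")
    return TokenPairOverride(token1, token2, score, force)


def apply_override(computed_score, override):
    # A forced score replaces the computed score; otherwise it is a minimum.
    if override.force:
        return override.score
    return max(computed_score, override.score)


DEFAULT = 0.9  # placeholder for addressOverrideDefaultScore
minimum = parse_override_line("road\trd\t0.8", DEFAULT)
forced = parse_override_line("ave\tavenue\t0.7/force", DEFAULT)
suppressed = parse_override_line("st\tsaint\tSUPPRESS", DEFAULT)
```

The apply_override helper shows the difference between the two modes: a plain score can only raise a computed token score, while a forced score (including SUPPRESS) can also lower it.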
RNI includes BT_ROOT/rlpnc/data/addresses/ref/override/state.txt, which contains a list of U.S. state abbreviations. For example:
Massachusetts MA
California CA
This directory also contains English-to-English token overrides for the unit, level, road, city, and state fields: unit.txt, level.txt, road.txt, city.txt, and state.txt.
When you create an additional file in the same location, use the respective AddressField name in the filename to identify the address field that each token in the pair pertains to. For example, cityDistrict.txt indicates that the contents match cityDistrict address fields.