RNI provides a Java API for matching addresses in English, Traditional Chinese, and Simplified Chinese.
In the RNI context, address matching means comparing two addresses, performing linguistic analysis per address field, and returning a score (a double greater than zero and less than or equal to one) that indicates how similar the two addresses are. A value of 1.0 is returned if and only if the two addresses are identical (each address field matches exactly). A score of less than 1.0 is returned for addresses that potentially match, with a score indicating the relative similarity of the two addresses.
Note
Address matching in Latin script is optimized for addresses in English. Non-English addresses in Latin script may also be matched; results will vary by language.
Addresses can be defined either as a set of address fields or as a single string. When defined as a string, the jpostal library is used to parse the address string into address fields.
When entered as a set of fields, the address may include any of the fields in Table 5, “Supported Address Fields”. At least one field must be specified, but no specific fields are required.
RNI optimizes the matching algorithm to the field type. Named entity fields, such as street name, city, and state, are matched using a linguistic, statistically-based algorithm that handles name variations. Numeric and alphanumeric fields, such as house number, postal code, and unit, are matched using character-based methods.
Table 5. Supported Address Fields
Field Name | Description | Example(s)
house | venue and building names | "Brooklyn Academy of Music", "Empire State Building"
houseNumber | usually refers to the external (street-facing) building number | "123"
road | street name(s) | "Harrison Avenue"
unit | an apartment, unit, office, lot, or other secondary unit designator | "Apt. 123"
level | expressions indicating a floor number | "3rd Floor", "Ground Floor"
staircase | numbered/lettered staircase | "2"
entrance | numbered/lettered entrance | "front gate"
suburb | usually an unofficial neighborhood name | "Harlem", "South Bronx", "Crown Heights"
cityDistrict | usually boroughs or districts within a city that serve some official purpose | "Brooklyn", "Hackney", "Bratislava IV"
city | any human settlement including cities, towns, villages, hamlets, localities, etc. | "Boston"
island | named islands | "Maui"
stateDistrict | usually a second-level administrative division or county | "Saratoga"
state | a first-level administrative division | "Massachusetts"
countryRegion | informal subdivision of a country without any political status | "South/Latin America"
country | sovereign nations and their dependent territories, which have a designated ISO-3166 code | "United States of America"
worldRegion | currently only used for appending "West Indies" after the country name, a pattern frequently used in the English-speaking Caribbean | "Jamaica, West Indies"
postCode | postal codes used for mail sorting | "02110"
poBox | post office box: typically found in non-physical (mail-only) addresses | "28"
When an address is parsed into address fields, values can get put into the wrong field. Address field groups encapsulate common transpositions between fields. When scoring matching values in fields, RNI uses address field groups to group related or similar fields. If two field values match, but they are dissimilar fields, RNI applies a penalty to that match, reducing the score for that pair.
When matching two fields, the following penalties are applied:
- If the fields are the same, no penalty is applied. (road - road)
- If the fields are different, but the fields are in the same group, a small penalty is applied. (suburb - city)
- If the fields are in different field groups, a large penalty is applied. (road - city)
Table 6. Address Groups
Group | Fields
house | house
house_number | houseNumber
road | road
unit | unit, level, staircase, entrance
city | suburb, cityDistrict, city
state | island, stateDistrict, state
country | countryRegion, country, worldRegion
post_code | postCode
po_box | poBox
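The penalty rules can be illustrated with a minimal sketch. It is not RNI code: the class name, the `penalty` method, and the two penalty constants are hypothetical, and the numeric values are placeholders because the actual penalty sizes are not documented here. Only the field-to-group assignments follow Table 6, using the field names from Table 5.

```java
import java.util.Map;

// Illustrative sketch only; not RNI's implementation.
final class FieldGroupPenaltySketch {

    // Field-to-group assignments following Table 6 (field names as in Table 5).
    static final Map<String, String> GROUP_OF = Map.ofEntries(
        Map.entry("house", "house"),
        Map.entry("houseNumber", "house_number"),
        Map.entry("road", "road"),
        Map.entry("unit", "unit"),
        Map.entry("level", "unit"),
        Map.entry("staircase", "unit"),
        Map.entry("entrance", "unit"),
        Map.entry("suburb", "city"),
        Map.entry("cityDistrict", "city"),
        Map.entry("city", "city"),
        Map.entry("island", "state"),
        Map.entry("stateDistrict", "state"),
        Map.entry("state", "state"),
        Map.entry("countryRegion", "country"),
        Map.entry("country", "country"),
        Map.entry("worldRegion", "country"),
        Map.entry("postCode", "post_code"),
        Map.entry("poBox", "po_box"));

    // Placeholder values (assumptions); the real penalties are not documented.
    static final double SMALL_PENALTY = 0.1;
    static final double LARGE_PENALTY = 0.5;

    // Returns the penalty for matching a value in fieldA against a value in fieldB.
    static double penalty(String fieldA, String fieldB) {
        String groupA = GROUP_OF.get(fieldA);
        String groupB = GROUP_OF.get(fieldB);
        if (groupA == null || groupB == null) {
            throw new IllegalArgumentException("Unknown field: " + fieldA + " / " + fieldB);
        }
        if (fieldA.equals(fieldB)) {
            return 0.0;                      // same field: no penalty
        }
        return groupA.equals(groupB)
            ? SMALL_PENALTY                  // different fields, same group
            : LARGE_PENALTY;                 // different groups
    }

    public static void main(String[] args) {
        System.out.println("road - road:   " + penalty("road", "road"));
        System.out.println("suburb - city: " + penalty("suburb", "city"));
        System.out.println("road - city:   " + penalty("road", "city"));
    }
}
```

How the penalty is combined with the field score (for example, subtracted or multiplied) is left out on purpose, since the source does not specify it.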
Address Matching Usage Model
Identify two addresses to compare.
Use MatchScorer to score the similarity of two AddressSpec objects. MatchScorer and AddressSpec are in the com.basistech.rni.match and com.basistech.rni.match.address packages respectively.
import com.basistech.rni.match.MatchScorer;
import com.basistech.rni.match.address.AddressSpec;

// Use MatchScorer to match two addresses.
void match2Addresses(AddressSpec addr1, AddressSpec addr2) {
    MatchScorer ms = new MatchScorer();
    try {
        double score = ms.score(addr1, addr2);
        // Handle the score.
        System.out.println("Score: " + score);
    } finally {
        // Release resources used by the match scorer, even if scoring throws.
        ms.close();
    }
}
How Rosette Calculates Address Match Scores
The address match score is a value between 0.0 and 1.0; the higher the score, the stronger the match. The score is a relative indication of how similar two addresses are; it is not an absolute value. Calculating the match score is a complex process that utilizes multiple matching techniques and algorithms, as explained below.
- Identify the address fields. This step is only performed if the address is provided as an unparsed string. In that case, Rosette uses the jpostal library to parse the address into address fields. This process works well for well-formatted addresses, but may have difficulty when addresses are irregularly formatted.
  For example, most addresses are formatted from specific to general:
  houseNumber road city state postCode
  - The parser would provide predictable results for an address in an expected order:
    38 Concord Road, Apt. B Arlington MA
  - The parser would have more difficulty if the address format was in an unexpected order:
    Arlington MA Concord Road #38 Apt B
  If you are getting unexpected match values, check how the addresses are being parsed into address fields.
- Normalize the fields in each address. Address fields are normalized so they can be compared. Normalization includes removing stop words, such as "The" from "The United States".
- Compare each address field. Every field in each address is compared to every field in the other address, and a match score is calculated for each comparison. The algorithm used depends on the field type. Scoring algorithms include:
  - Edit distance: Alphanumeric fields, such as house number, are scored based on the number of character additions, substitutions, and deletions.
  - Fuzzy match: Text fields, such as street names, are scored with intelligent name comparison algorithms to determine how similar they are.
  - Postal codes: Rosette uses the meanings of US, UK, and Canadian postal codes to provide scores for these fields. Even if a postal code is poorly formatted, Rosette can recognize and score the match correctly.
- Select the best scores. Once all scores have been calculated, the best mapping of fields between the two addresses is selected to maximize the complete score.
- Apply field weights. Some fields in an address are considered more important than other fields. The score from each selected match is weighted by field type. These field type weightings can be modified based on the type of address data in your system.
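The normalize, compare, select, and weight steps can be sketched end to end in a small, self-contained example. It is only an illustration under stated assumptions, not Rosette's implementation. Every name in it (`AddressScoreSketch`, `normalize`, `similarity`, `bestSum`, `score`, `CROSS_FIELD_PENALTY`) is hypothetical. It uses plain edit distance for every field instead of the fuzzy and postal-code scorers, a single flat penalty for cross-field matches, an exhaustive search over field mappings, and made-up weights.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; not Rosette's implementation.
final class AddressScoreSketch {

    // Hypothetical penalty for matching values that sit in different fields.
    static final double CROSS_FIELD_PENALTY = 0.3;

    // Normalize: lowercase, drop punctuation, remove the sample stop word "the".
    static String normalize(String value) {
        String cleaned = value.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N} ]", " ");
        return cleaned.replaceAll("\\bthe\\b", " ").trim().replaceAll("\\s+", " ");
    }

    // Compare: Levenshtein edit distance, rescaled to a similarity in [0, 1].
    static double similarity(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1], prev[j]) + 1, prev[j - 1] + cost);
            }
            prev = curr;
        }
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) prev[b.length()] / longest;
    }

    // Select: try every one-to-one mapping of fields and keep the best weighted sum.
    static double bestSum(List<String> fieldsA, int i, Map<String, String> a,
                          Map<String, String> b, Set<String> usedB, Map<String, Double> weights) {
        if (i == fieldsA.size()) {
            return 0.0;
        }
        String fa = fieldsA.get(i);
        double best = bestSum(fieldsA, i + 1, a, b, usedB, weights); // leave fa unmatched
        for (String fb : b.keySet()) {
            if (usedB.contains(fb)) {
                continue;
            }
            double penalty = fa.equals(fb) ? 0.0 : CROSS_FIELD_PENALTY;
            double pair = Math.max(0.0, similarity(normalize(a.get(fa)), normalize(b.get(fb))) - penalty);
            usedB.add(fb);
            double sum = weights.getOrDefault(fa, 1.0) * pair
                + bestSum(fieldsA, i + 1, a, b, usedB, weights);
            usedB.remove(fb);
            best = Math.max(best, sum);
        }
        return best;
    }

    // Sum of the weights of the fields present in an address (default weight 1.0).
    static double totalWeight(Map<String, String> address, Map<String, Double> weights) {
        double total = 0.0;
        for (String field : address.keySet()) {
            total += weights.getOrDefault(field, 1.0);
        }
        return total;
    }

    // Weight: divide the best weighted sum by the larger address's total weight.
    static double score(Map<String, String> a, Map<String, String> b, Map<String, Double> weights) {
        double total = Math.max(totalWeight(a, weights), totalWeight(b, weights));
        if (total == 0.0) {
            return 0.0;
        }
        return bestSum(new ArrayList<>(a.keySet()), 0, a, b, new HashSet<>(), weights) / total;
    }

    public static void main(String[] args) {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("houseNumber", "38");
        first.put("road", "Concord Road");
        first.put("city", "Arlington");
        Map<String, String> second = new LinkedHashMap<>(first);
        second.put("road", "Concord Rd");
        Map<String, Double> weights = Map.of("road", 2.0); // hypothetical weighting
        System.out.println("identical: " + score(first, first, weights));
        System.out.println("variant:   " + score(first, second, weights));
    }
}
```

The exhaustive search is exponential in the number of fields, which is acceptable for a handful of address fields in a sketch; a production system would more likely use an assignment algorithm. Dividing by the larger total weight is one way to keep the result at or below 1.0 and to score an address with extra fields below an exact match.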