RNI provides a Java API for matching addresses in English.
In the RNI context, address matching means comparing two addresses, performing linguistic analysis on each address field, and returning a score (a double greater than zero and less than or equal to one) that indicates how similar the two addresses are. A score of 1.0 is returned if and only if the two addresses are identical, that is, each address field matches exactly. A score below 1.0 is returned for addresses that potentially match but differ in one or more address fields, or that vary as described in Address Variations.
Note
RNI is optimized for addresses in English. Non-English addresses in Latin script may also be matched; results will vary by language.
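As a rough illustration of how a caller might consume the score described above, the sketch below checks that a value lies in the documented range (greater than zero, at most one) and sorts it into three buckets. The class AddressScoreInterpreter, the Verdict enum, and the 0.8 cutoff are hypothetical and are not part of the RNI API; a suitable threshold would need to be chosen for your own data.

```java
// Hypothetical helper for illustration only; not part of the RNI API.
final class AddressScoreInterpreter {

    enum Verdict { EXACT, LIKELY_MATCH, UNLIKELY_MATCH }

    // Assumed cutoff for this sketch; the documentation does not prescribe one.
    static final double ASSUMED_THRESHOLD = 0.8;

    static Verdict interpret(double score) {
        // Documented range is (0, 1]; NaN also fails this check.
        if (!(score > 0.0 && score <= 1.0)) {
            throw new IllegalArgumentException("score outside (0, 1]: " + score);
        }
        // 1.0 is documented as meaning every address field matched exactly.
        if (score == 1.0) {
            return Verdict.EXACT;
        }
        return score >= ASSUMED_THRESHOLD ? Verdict.LIKELY_MATCH : Verdict.UNLIKELY_MATCH;
    }

    public static void main(String[] args) {
        System.out.println(interpret(1.0));
        System.out.println(interpret(0.93));
        System.out.println(interpret(0.41));
    }
}
```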
Addresses can be defined either as a set of address fields or as a single string. When defined as a string, the jpostal library is used to parse the address string into address fields.
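jpostal wraps the libpostal parser, which labels components with snake_case names such as house_number and po_box. The sketch below shows one way those labels could be renamed to the camelCase field names used in this section. The correspondence is an assumption inferred from the similarity of the names, not a documented RNI mapping, and PostalLabelMapper is a hypothetical class. The input is a plain map of label to value so that the example runs without the native libpostal library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of RNI or jpostal.
final class PostalLabelMapper {

    // Assumed correspondence between libpostal labels and RNI field names.
    // Labels spelled the same in both (road, city, state, ...) need no entry.
    private static final Map<String, String> RENAMED = Map.of(
            "house_number", "houseNumber",
            "city_district", "cityDistrict",
            "state_district", "stateDistrict",
            "country_region", "countryRegion",
            "world_region", "worldRegion",
            "postcode", "postCode",
            "po_box", "poBox");

    // Returns the assumed RNI field name; unlisted labels pass through unchanged.
    static String toFieldName(String label) {
        return RENAMED.getOrDefault(label, label);
    }

    // Renames every key of a parsed label-to-value map, preserving order.
    static Map<String, String> toFields(Map<String, String> labeled) {
        Map<String, String> out = new LinkedHashMap<>();
        labeled.forEach((label, value) -> out.put(toFieldName(label), value));
        return out;
    }
}
```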
When entered as a set of fields, the address may include any of the fields in Table 5, “Supported Address Fields”. At least one field must be specified, but no specific fields are required.
RNI optimizes the matching algorithm to the field type. Named entity fields, such as street name, city, and state, are matched using a linguistic, statistically based algorithm that handles name variations. Numeric and alphanumeric fields, such as house number, postal code, and unit, are matched using numeric-based methods.
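The split by field type can be pictured as a lookup from field name to matching strategy. The sketch below records only the six fields that the paragraph above names explicitly and returns an empty result for any other field, since this section does not say how the remaining fields are classified. FieldKinds and its Kind enum are hypothetical names; RNI's internal dispatch is not documented here.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of the RNI API.
final class FieldKinds {

    enum Kind { NAMED_ENTITY, NUMERIC_OR_ALPHANUMERIC }

    // Only the fields named in the text are listed; "road" is the street name field.
    private static final Map<String, Kind> KNOWN = Map.of(
            "road", Kind.NAMED_ENTITY,
            "city", Kind.NAMED_ENTITY,
            "state", Kind.NAMED_ENTITY,
            "houseNumber", Kind.NUMERIC_OR_ALPHANUMERIC,
            "postCode", Kind.NUMERIC_OR_ALPHANUMERIC,
            "unit", Kind.NUMERIC_OR_ALPHANUMERIC);

    // Empty when the text does not state which strategy applies to the field.
    static Optional<Kind> kindOf(String fieldName) {
        return Optional.ofNullable(KNOWN.get(fieldName));
    }
}
```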
Table 5. Supported Address Fields
Field Name | Description | Example(s) |
house | venue and building names | "Brooklyn Academy of Music", "Empire State Building" |
houseNumber | usually refers to the external (street-facing) building number | "123" |
road | street name(s) | "Harrison Avenue" |
unit | an apartment, unit, office, lot, or other secondary unit designator | "Apt. 123" |
level | expressions indicating a floor number | "3rd Floor", "Ground Floor" |
staircase | numbered/lettered staircase | "2" |
entrance | numbered/lettered entrance | "front gate" |
suburb | usually an unofficial neighborhood name | "Harlem", "South Bronx", "Crown Heights" |
cityDistrict | usually boroughs or districts within a city that serve some official purpose | "Brooklyn", "Hackney", "Bratislava IV" |
city | any human settlement including cities, towns, villages, hamlets, localities, etc. | "Boston" |
island | named islands | "Maui" |
stateDistrict | usually a second-level administrative division or county | "Saratoga" |
state | a first-level administrative division | "Massachusetts" |
countryRegion | informal subdivision of a country without any political status | "South/Latin America" |
country | sovereign nations and their dependent territories, anything with an ISO-3166 code | "United States of America" |
worldRegion | currently only used for appending "West Indies" after the country name, a pattern frequently used in the English-speaking Caribbean | "Jamaica, West Indies" |
postCode | postal codes used for mail sorting | "02110" |
poBox | post office box: typically found in non-physical (mail-only) addresses | "28" |
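The rule that an address needs at least one field but no particular field can be sketched with a small container keyed by the field names in Table 5. AddressFieldSet is a hypothetical stand-in and is not RNI's AddressSpec. Rejecting blank values is an assumption of this sketch, not documented behavior.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical container for illustration only; not RNI's AddressSpec.
final class AddressFieldSet {

    // Field names copied from Table 5.
    static final Set<String> SUPPORTED = Set.of(
            "house", "houseNumber", "road", "unit", "level", "staircase",
            "entrance", "suburb", "cityDistrict", "city", "island",
            "stateDistrict", "state", "countryRegion", "country",
            "worldRegion", "postCode", "poBox");

    private final Map<String, String> fields = new LinkedHashMap<>();

    // Adds one field; unknown names are rejected.
    AddressFieldSet with(String name, String value) {
        if (name == null || !SUPPORTED.contains(name)) {
            throw new IllegalArgumentException("unsupported field: " + name);
        }
        // Assumption of this sketch: a blank value does not count as specified.
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("empty value for field: " + name);
        }
        fields.put(name, value);
        return this;
    }

    // Enforces "at least one field, no specific field required".
    Map<String, String> validated() {
        if (fields.isEmpty()) {
            throw new IllegalStateException("at least one address field must be specified");
        }
        return Collections.unmodifiableMap(new LinkedHashMap<>(fields));
    }
}
```

For instance, new AddressFieldSet().with("city", "Boston").validated() succeeds with a single field, while calling validated() on an empty set throws.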
Address Matching Usage Model
Identify two addresses to compare.
Use MatchScorer to score the similarity of two AddressSpec objects. MatchScorer and AddressSpec are in the com.basistech.rni.match and com.basistech.rni.match.address packages respectively.
https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/match_2addresses.java