Rosette's address matching capability takes advantage of the advanced algorithms and fuzzy matching that Rosette uses for name matching, while adding address-specific features, including:
Automatic parsing of address strings into address fields
Intelligent matching of US, UK, and Canadian postal codes
Edit distance scores for alphanumeric fields
Address field groups to detect and correct commonly misidentified address data
Recognition of common address abbreviations in the US, Canada, and the UK
Rosette performs textual address matching, unlike tools that use geolocation to compare the physical distance between addresses. With geolocation, what happens when there is a mistake in the address? A typo in a street name or house number could place the address across town. A mistake in the city or country could place it even further away. Textual address matching accounts for all of these common errors.
Rosette provides a matching score based on the similarity of the addresses, using a number of matching algorithms optimized for the different address field types. Textual address fields, such as street name, city, and state, are matched using the linguistic, statistically-based fuzzy matching algorithms that RNI uses for name matching. Numeric and alphanumeric fields, such as house number, postal code, and unit, are matched using character-based methods.
Rosette understands and handles the variations commonly found in addresses.
Table 1. Common Address Variations
Variation | Example |
Phonetics and spelling differences | 100 Montvale Ave vs. 100 Montvail Av |
Missing address field components | 100 Montvale Ave vs. 100 Montvale |
Differences in upper and lowercase | 100 Montvale Ave vs. 100 MONTVALE AVE |
Reordered address components within a field | 100 Montvale Ave. vs. 100 Avenue Montvale |
Address field abbreviations | Montvale St. vs. Montvale Street |
Address matches are scored on a scale between 0.0 (no match) and 1.0 (perfect match), indicating relative similarity rather than an absolute value.
Addresses can be defined either as a set of address fields or as a single string. When defined as a string, the jpostal library is used to parse the address string into address fields.
Each address consists of a set of address fields. When you specify an address, you specify either the individual fields or a string that Rosette parses into fields. At least one field must be specified for each address, but no specific fields are required.
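The two ways of defining an address can be pictured with a short sketch. This is not Rosette's API: the field names are taken from the Supported Address Fields table, while `validate_address` is a hypothetical helper that only illustrates the rule that at least one field is needed and no particular field is required.

```python
# Field names copied from the Supported Address Fields table.
# Everything else in this block is illustrative, not part of Rosette.
SUPPORTED_FIELDS = {
    "house", "houseNumber", "road", "unit", "level", "staircase",
    "entrance", "suburb", "cityDistrict", "city", "island",
    "stateDistrict", "state", "countryRegion", "country",
    "worldRegion", "postCode", "poBox",
}


def validate_address(address):
    """Hypothetical check: accept a non-empty string or a dict of known fields."""
    if isinstance(address, str):
        # Unparsed form: a parser would split this into fields later.
        return bool(address.strip())
    if not isinstance(address, dict):
        return False
    populated = {name for name, value in address.items() if value}
    # At least one field must carry a value, and every name must be supported.
    return bool(populated) and set(address) <= SUPPORTED_FIELDS


# The same address in fielded form and as a single string.
fielded = {
    "houseNumber": "38",
    "road": "Concord Road",
    "unit": "Apt. B",
    "city": "Arlington",
    "state": "MA",
}
unparsed = "38 Concord Road, Apt. B Arlington MA"
```

Both `fielded` and `unparsed` pass this check, whereas an empty dict or a dict with an unknown key such as `street` does not.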
Table 2. Supported Address Fields
Field Name | Description | Example(s) |
house | venue and building names | "Brooklyn Academy of Music", "Empire State Building" |
houseNumber | usually refers to the external (street-facing) building number | "123" |
road | street name(s) | "Harrison Avenue" |
unit | an apartment, unit, office, lot, or other secondary unit designator | "Apt. 123" |
level | expressions indicating a floor number | "3rd Floor", "Ground Floor" |
staircase | numbered/lettered staircase | "2" |
entrance | numbered/lettered entrance | "front gate" |
suburb | usually an unofficial neighborhood name | "Harlem", "South Bronx", "Crown Heights" |
cityDistrict | these are usually boroughs or districts within a city that serve some official purpose | "Brooklyn", "Hackney", "Bratislava IV" |
city | any human settlement including cities, towns, villages, hamlets, localities, etc. | "Boston" |
island | named islands | "Maui" |
stateDistrict | usually a second-level administrative division or county | "Saratoga" |
state | a first-level administrative division | "Massachusetts" |
countryRegion | informal subdivision of a country without any political status | "South/Latin America" |
country | sovereign nations and their dependent territories, which have a designated ISO-3166 code | "United States of America" |
worldRegion | currently only used for appending "West Indies" after the country name, a pattern frequently used in the English-speaking Caribbean | "Jamaica, West Indies" |
postCode | postal codes used for mail sorting | "02110" |
poBox | post office box: typically found in non-physical (mail-only) addresses | "28" |
When an address is parsed into address fields, values can end up in the wrong field. Rosette created address field groups to encapsulate common transpositions between fields. When scoring matching values in fields, Rosette uses address field groups to group related or similar fields. Thus, even if two field values match, Rosette applies a penalty to that match when they are in dissimilar field types, reducing the score for that pair.
When matching two fields, the following penalties are applied:
If the fields are the same, no penalty is applied. (road - road)
If the fields are different, but the fields are in the same group, a small penalty is applied. (suburb - city)
If the fields are in different field groups, a large penalty is applied. (road - city)
Table 3. Address Groups
Group | Fields |
house | house |
house_number | houseNumber |
road | road |
unit | unit, level, staircase, entrance |
city | suburb, cityDistrict, city |
state | island, stateDistrict, state |
country | countryRegion, country, worldRegion |
post_code | postCode |
po_box | poBox |
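The three penalty rules can be sketched as a lookup over the groups in Table 3. The group membership follows the table; the penalty multipliers and the `field_penalty` helper are assumptions made for illustration, since the actual penalty values are not given here.

```python
# Group membership follows Table 3.
FIELD_GROUPS = {
    "house": ["house"],
    "house_number": ["houseNumber"],
    "road": ["road"],
    "unit": ["unit", "level", "staircase", "entrance"],
    "city": ["suburb", "cityDistrict", "city"],
    "state": ["island", "stateDistrict", "state"],
    "country": ["countryRegion", "country", "worldRegion"],
    "post_code": ["postCode"],
    "po_box": ["poBox"],
}

# Reverse index: field name -> group name.
GROUP_OF = {
    field: group
    for group, fields in FIELD_GROUPS.items()
    for field in fields
}

# Assumed multipliers, chosen only to show "small" versus "large" penalties.
SAME_GROUP_PENALTY = 0.9
CROSS_GROUP_PENALTY = 0.5


def field_penalty(field_a, field_b):
    """Return a multiplier for a match between two field types (hypothetical)."""
    if field_a == field_b:
        return 1.0  # same field: no penalty
    if GROUP_OF[field_a] == GROUP_OF[field_b]:
        return SAME_GROUP_PENALTY  # different fields, same group
    return CROSS_GROUP_PENALTY  # different groups
```

Under these assumed values, a raw field score would be multiplied by 1.0 for road against road, 0.9 for suburb against city, and 0.5 for road against city.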
How Rosette Calculates Address Match Scores
The address match score is a value between 0.0 and 1.0; the higher the score, the stronger the match. The score is a relative indication of how similar two addresses are; it is not an absolute value. Calculating the match score is a complex process that utilizes multiple matching techniques and algorithms, as explained below.
Identify the address fields. This step is performed only if the address is provided as an unparsed string. In that case, Rosette uses the jpostal library to parse the address into address fields. Parsing works well for well-formatted addresses, but may have difficulty when addresses are irregularly formatted.
For example, most addresses are formatted from specific to general:
houseNumber road city state postCode
The parser would provide predictable results for an address in an expected order:
38 Concord Road, Apt. B Arlington MA
The parser would have more difficulty if the address format was in an unexpected order:
Arlington MA Concord Road #38 Apt B
If you are getting unexpected match values, check how the addresses are being parsed into address fields.
Normalize the fields in each address. Address fields are normalized so they can be compared. Normalization includes removing stop words, such as The from The United States.
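A rough picture of this kind of normalization is sketched below. Only stop-word removal is stated above; lowercasing and punctuation stripping are added assumptions, and both `STOP_WORDS` and `normalize_field` are hypothetical names rather than Rosette internals.

```python
import re

# Assumed stop-word list; "The" is the only example given in the text.
STOP_WORDS = {"the"}


def normalize_field(value):
    """Lowercase, drop punctuation, and remove stop words (illustrative only)."""
    tokens = re.findall(r"\w+", value.lower())
    return " ".join(token for token in tokens if token not in STOP_WORDS)
```

With this sketch, `normalize_field("The United States")` yields `"united states"`, so it compares equal to a plain "United States".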
Compare each address field. For the addresses being compared, every field in each address is compared to every field in the other address, with a match score calculated for each comparison. The algorithm used will depend on the field type. Scoring algorithms include:
Edit distance: Alphanumeric fields, such as house number, are scored based on the number of character additions, substitutions, and deletions.
Fuzzy match: Text fields, such as street names, are scored with intelligent name comparison algorithms to determine how similar they are.
Postal codes: Rosette uses meanings of US, UK, and Canadian postal codes to provide scores for these fields. Even if a postal code is poorly formatted, Rosette can recognize and score the match correctly.
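To make the edit distance idea concrete, here is a minimal sketch using the standard Levenshtein distance. Scaling the distance by the longer string's length to obtain a 0.0 to 1.0 value is an assumption for illustration; how Rosette converts edit counts into a score is not described here.

```python
def levenshtein(a, b):
    """Count the insertions, deletions, and substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Map edit distance onto 0.0-1.0 (this scaling is an assumption)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

For example, house numbers "123" and "128" differ by one substitution out of three characters, so this sketch scores them at about 0.67, while identical values score 1.0.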
Select the best scores. Once all scores have been calculated, the best mapping of fields between the two addresses is selected to maximize the complete score.
Field Weights: Some fields in an address are considered more important than other fields. The score from each selected match is weighted by field type. These field type weightings can be modified based on the type of address data in your system.
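The last two steps, choosing the best field mapping and weighting the selected scores, can be sketched together. This is a brute-force illustration under stated assumptions, not Rosette's algorithm: the weights in `FIELD_WEIGHTS` are made up, `exact_pair_score` is a stand-in for the per-field scorers, and unmatched fields in the longer address are simply ignored.

```python
from itertools import permutations

# Assumed weights; the real defaults are not given in this section.
FIELD_WEIGHTS = {"houseNumber": 2.0, "road": 2.0, "city": 1.0, "postCode": 1.0}


def exact_pair_score(field_a, value_a, field_b, value_b):
    """Stand-in scorer: exact text match, halved when the field names differ."""
    if value_a.lower() != value_b.lower():
        return 0.0
    return 1.0 if field_a == field_b else 0.5


def weighted_match_score(addr_a, addr_b, pair_score, weights):
    """Try every one-to-one field pairing and keep the best weighted average."""
    # Pair each field of the shorter address with a distinct field of the longer.
    short, long_ = sorted((addr_a, addr_b), key=len)
    if not short:
        return 0.0
    short_fields = list(short)
    best = 0.0
    for candidate in permutations(long_, len(short_fields)):
        total = 0.0
        weight_sum = 0.0
        for f_short, f_long in zip(short_fields, candidate):
            weight = weights.get(f_short, 1.0)
            total += weight * pair_score(
                f_short, short[f_short], f_long, long_[f_long]
            )
            weight_sum += weight
        best = max(best, total / weight_sum)
    return best
```

Trying every permutation is only practical for a handful of fields; a production system would more likely use an assignment algorithm. Raising the weight for a field such as `postCode` in this sketch makes a mismatch in that field pull the overall score down further.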