As with name and date matching, the process is to create an index containing addresses, then query an address against the index.
Addresses can be defined either as a set of address fields or as a single string. When defined as a string, the jpostal library is used to parse the address string into address fields.
When entered as a set of fields, the address may include any of the fields below. At least one field must be specified, but no specific fields are required.
RNI optimizes the matching algorithm to the field type. Named entity fields, such as street name, city, and state, are matched using a linguistic, statistically based algorithm that handles name variations. Numeric and alphanumeric fields, such as house number, postal code, and unit, are matched using character-based methods.
Table 5. Supported Address Fields
Field Name | Description | Example(s)
house | venue and building names | "Brooklyn Academy of Music", "Empire State Building"
houseNumber | usually refers to the external (street-facing) building number | "123"
road | street name(s) | "Harrison Avenue"
unit | an apartment, unit, office, lot, or other secondary unit designator | "Apt. 123"
level | expressions indicating a floor number | "3rd Floor", "Ground Floor"
staircase | numbered/lettered staircase | "2"
entrance | numbered/lettered entrance | "front gate"
suburb | usually an unofficial neighborhood name | "Harlem", "South Bronx", "Crown Heights"
cityDistrict | usually boroughs or districts within a city that serve some official purpose | "Brooklyn", "Hackney", "Bratislava IV"
city | any human settlement including cities, towns, villages, hamlets, localities, etc. | "Boston"
island | named islands | "Maui"
stateDistrict | usually a second-level administrative division or county | "Saratoga"
state | a first-level administrative division | "Massachusetts"
countryRegion | informal subdivision of a country without any political status | "South/Latin America"
country | sovereign nations and their dependent territories, which have a designated ISO-3166 code | "United States of America"
worldRegion | currently only used for appending "West Indies" after the country name, a pattern frequently used in the English-speaking Caribbean | "Jamaica, West Indies"
postCode | postal codes used for mail sorting | "02110"
poBox | post office box: typically found in non-physical (mail-only) addresses | "28"
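As an illustration of the rule that at least one field must be specified and no specific field is required, the following is a minimal client-side sketch in Python. The function name check_address_fields is hypothetical and is not part of RNI. Rejecting unknown field names is an assumption of this sketch, not documented RNI behavior.

```python
# Field names copied from Table 5.
SUPPORTED_FIELDS = frozenset({
    "house", "houseNumber", "road", "unit", "level", "staircase", "entrance",
    "suburb", "cityDistrict", "city", "island", "stateDistrict", "state",
    "countryRegion", "country", "worldRegion", "postCode", "poBox",
})


def check_address_fields(address):
    """Return the non-blank fields of a fielded address, or raise ValueError.

    Hypothetical helper: the checks run on the client before a document or
    query is sent, and are assumptions rather than RNI's own validation.
    """
    unknown = sorted(set(address) - SUPPORTED_FIELDS)
    if unknown:
        raise ValueError("unsupported address field(s): " + ", ".join(unknown))
    # Drop fields whose value is empty or whitespace only.
    non_blank = {name: value for name, value in address.items()
                 if value and value.strip()}
    if not non_blank:
        raise ValueError("at least one address field must be specified")
    return non_blank


residence = check_address_fields({"road": "Main St", "city": "Boston", "unit": " "})
```

Here residence keeps only road and city, because the unit value is blank.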
When an address is parsed into address fields, values can end up in the wrong field. Address field groups encapsulate common transpositions between fields. When scoring matching values in fields, RNI uses address field groups to group related or similar fields. If two field values match but come from dissimilar fields, RNI applies a penalty to that match, reducing the score for that pair.
When matching two fields, the following penalties are applied:
- If the fields are the same, no penalty is applied. (road - road)
- If the fields are different, but the fields are in the same group, a small penalty is applied. (suburb - city)
- If the fields are in different field groups, a large penalty is applied. (road - city)
Table 6. Address Groups
Group | Fields
house | house
house_number | houseNumber
road | road
unit | unit, level, staircase, entrance
city | suburb, cityDistrict, city
state | island, stateDistrict, state
country | countryRegion, country, worldRegion
post_code | postCode
po_box | poBox
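The three penalty tiers can be sketched as a lookup over the groups in Table 6. This is an illustrative Python model only: the function name penalty_tier is hypothetical, the field names follow Table 5, and the sketch returns a tier label because the actual penalty values RNI applies are not given here.

```python
# Groups and member fields as listed in Table 6.
FIELD_GROUPS = {
    "house": ["house"],
    "house_number": ["houseNumber"],
    "road": ["road"],
    "unit": ["unit", "level", "staircase", "entrance"],
    "city": ["suburb", "cityDistrict", "city"],
    "state": ["island", "stateDistrict", "state"],
    "country": ["countryRegion", "country", "worldRegion"],
    "post_code": ["postCode"],
    "po_box": ["poBox"],
}

# Reverse index: field name -> group name.
GROUP_OF = {field: group
            for group, fields in FIELD_GROUPS.items()
            for field in fields}


def penalty_tier(field_a, field_b):
    """Classify the penalty for a value match between two fields."""
    if field_a == field_b:
        return "none"    # same field
    if GROUP_OF[field_a] == GROUP_OF[field_b]:
        return "small"   # different fields, same group
    return "large"       # fields in different groups


tiers = [penalty_tier("road", "road"),
         penalty_tier("suburb", "city"),
         penalty_tier("road", "city")]
```

The three calls correspond to the three cases in the list above: same field, same group, and different groups.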
- Create an index.
curl -XPUT 'http://localhost:9200/rni-test'
- Define a mapping for fields that will contain addresses. The type for each of these fields is "rni_address".
curl -XPUT 'http://localhost:9200/rni-test/_mapping' -H'Content-Type: application/json' -d '{
  "properties" : {
    "primary_name" : { "type" : "rni_name" },
    "residence" : { "type" : "rni_address" }
  }
}'
- Index documents containing an address field.
curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{
  "primary_name" : "Joe Schmoe",
  "residence" : {
    "houseNumber" : "123",
    "road" : "Main St",
    "city" : "Boston",
    "state" : "Massachusetts",
    "postCode" : "02110"
  }
}'
The address in the document can also be defined as a string.
curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{
  "primary_name" : "Joe Schmoe",
  "residence" : "123 Main St, Boston, Massachusetts, 02110"
}'
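For readers who script these steps instead of using curl, the following Python sketch builds the same three PUT requests with the standard library. It is a sketch under assumptions: the helper name put_request is hypothetical, the host and index name are taken from the curl examples, and the requests are only constructed, not sent.

```python
import json
from urllib.request import Request

# Same host and index as the curl examples.
BASE_URL = "http://localhost:9200/rni-test"


def put_request(path, body):
    """Build (but do not send) a PUT request carrying a JSON body."""
    return Request(
        url=BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Mapping: the address field uses the rni_address type.
mapping_request = put_request("/_mapping", {
    "properties": {
        "primary_name": {"type": "rni_name"},
        "residence": {"type": "rni_address"},
    }
})

# Document with the address given as a set of fields.
fields_doc_request = put_request("/_doc/1", {
    "primary_name": "Joe Schmoe",
    "residence": {
        "houseNumber": "123",
        "road": "Main St",
        "city": "Boston",
        "state": "Massachusetts",
        "postCode": "02110",
    },
})

# The same document with the address given as a single string.
string_doc_request = put_request("/_doc/1", {
    "primary_name": "Joe Schmoe",
    "residence": "123 Main St, Boston, Massachusetts, 02110",
})
```

Each request can be sent by passing it to urllib.request.urlopen once the server is running. Because both documents use the id 1, as in the curl examples, sending the second replaces the first.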
RNI compares the fields in the query with the fields in the index, matching each non-blank field. Addresses do not have to contain all the same fields to be compared and matched.
As with other objects, the query for an address consists of two parts: the base query and the RNI pairwise address match rescore query.
Base Query. The base query is a standard query against the address field. Refer to Query the Index.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
  "query" : {
    "match" : {
      "residence" : "{\"road\" : \"Main\", \"state\" : \"MA\"}"
    }
  }
}'
RNI Rescore with Addresses. Refer to Rescoring with RNI Pairwise Name Match.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
  "query" : {
    "match" : { "residence" : "{\"road\" : \"Main\", \"state\" : \"MA\"}" }
  },
  "rescore" : {
    "query" : {
      "rescore_query" : {
        "function_score" : {
          "address_score" : {
            "field" : "residence",
            "query_address" : {
              "road" : "Main",
              "state" : "MA"
            }
          }
        }
      },
      "query_weight" : 0.0,
      "rescore_query_weight" : 1.0
    }
  }
}'
The query returns a hit with the RNI address match score.
"hits" : {
  "total" : 1,
  "max_score" : 0.6057692,
  "hits" : [
    {
      "_index" : "rni-test",
      "_type" : "_doc",
      "_id" : "1",
      "_score" : 0.6057692,
      "_source" : {
        "primary_name" : "Joe Schmoe",
        "residence" : {
          "houseNumber" : "123",
          "road" : "Main St",
          "city" : "Boston",
          "state" : "Massachusetts",
          "postCode" : "02110"
        }
      }
    }
  ]
}
The address match score is a measure of how similar the addresses are. Similar addresses have a stronger match and their address match score is closer to 1.
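A small Python sketch of reading the address match score out of a search response follows. It assumes the response has already been parsed into a dictionary (for example with json.loads) and that the "hits" object shown above sits at the top level of the response. The function name hit_scores is hypothetical.

```python
def hit_scores(response):
    """Return (document id, score) pairs in the order the hits are listed."""
    return [(hit["_id"], hit["_score"]) for hit in response["hits"]["hits"]]


# Response trimmed to the parts used here, with the values shown above.
sample_response = {
    "hits": {
        "total": 1,
        "max_score": 0.6057692,
        "hits": [
            {"_index": "rni-test", "_id": "1", "_score": 0.6057692},
        ],
    }
}

scores = hit_scores(sample_response)
```

With query_weight set to 0.0 and rescore_query_weight set to 1.0, as in the rescore query above, the _score of a rescored hit is the RNI address match score.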
The address can be structured as a string for queries. The address structure for the query is independent of the format of the address in the original document. A string can be used in the query regardless of whether the indexed address was formatted with fields or as a string.
Base Query. The base query constructed with an address string.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
  "query" : {
    "match" : { "residence" : "Main, MA" }
  }
}'
RNI Rescore with Addresses. The rescore query with an address string.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
  "query" : {
    "match" : { "residence" : "Main, MA" }
  },
  "rescore" : {
    "query" : {
      "rescore_query" : {
        "function_score" : {
          "address_score" : {
            "field" : "residence",
            "query_address" : "Main, MA"
          }
        }
      },
      "query_weight" : 0.0,
      "rescore_query_weight" : 1.0
    }
  }
}'
The response shown here returns the address as a string because the indexed document in this example stored the address as a string. The response always returns the address in the format of the indexed document, regardless of the format used in the query.
"hits" : {
  "total" : {
    "value" : 1,
    "relation" : "eq"
  },
  "max_score" : 0.4552421,
  "hits" : [
    {
      "_index" : "rni-test",
      "_type" : "_doc",
      "_id" : "1",
      "_score" : 0.4552421,
      "_source" : {
        "primary_name" : "Joe Schmoe",
        "residence" : "123 Main St, Boston, Massachusetts, 02110"
      }
    }
  ]
}
The address match score is a measure of how similar the addresses are. Similar addresses have a stronger match and their address match score is closer to 1.
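The two query forms differ only in how the address is written, so one builder can produce both. The Python sketch below mirrors the curl examples: in the base query a fielded address is passed as a JSON-encoded string, while query_address takes the fields as an object or the address as a plain string. The function name address_search_body is hypothetical.

```python
import json


def address_search_body(field, query_address):
    """Build a search body with a base query and an RNI address rescore.

    query_address is either a dict of address fields or an address string.
    """
    if isinstance(query_address, str):
        match_value = query_address              # string form is used as-is
    else:
        match_value = json.dumps(query_address)  # fields go in as a JSON string
    return {
        "query": {"match": {field: match_value}},
        "rescore": {
            "query": {
                "rescore_query": {
                    "function_score": {
                        "address_score": {
                            "field": field,
                            "query_address": query_address,
                        }
                    }
                },
                "query_weight": 0.0,
                "rescore_query_weight": 1.0,
            }
        },
    }


fields_body = address_search_body("residence", {"road": "Main", "state": "MA"})
string_body = address_search_body("residence", "Main, MA")
```

Either body can be serialized with json.dumps and sent to the _search endpoint used in the curl examples.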
Advanced Rescorer for Address Matching
RNI includes a custom rescorer query parameter, rni_query, which uses RNI advanced features. In addition to the parameters above, the RNI custom rescorer uses the following parameters to determine the number of documents that are rescored. Rescoring fewer documents increases speed, but can reduce accuracy if the best documents are not passed to the rescorer.
- score_to_rescore_restriction (a float, defaults to 0.4, cannot be negative) dynamically controls the minimum query score a document needs in order to be passed to the RNI rescorer. A value of 0.0 does not cut off any documents from being rescored. Higher values rescore fewer documents, increasing speed at the cost of accuracy.
- window_size_allowance (a float, defaults to 0.5, must be in the interval (0, 1]) dynamically controls the window size for rescoring. No more than window_size names are scored. A value of 1.0 does not cut off any documents from being rescored. Higher values rescore more documents, increasing accuracy at the cost of speed.
In the following example, pairwise matching is performed on at most the top 200 names returned by the base query. The example matches on the name field primary_name using name_score.
Example with the Advanced Rescorer
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
  "query" : {
    "match" : {
      "primary_name" : "{\"data\" : \"Jo Shmoe\", \"entityType\" : \"PERSON\"}"
    }
  },
  "rescore" : {
    "window_size" : 200,
    "rni_query" : {
      "rescore_query" : {
        "rni_function_score" : {
          "name_score" : {
            "field" : "primary_name",
            "query_name" : {"data" : "Jo Shmoe", "entityType" : "PERSON"},
            "score_to_rescore_restriction" : 1.0,
            "window_size_allowance" : 0.5
          }
        }
      },
      "query_weight" : 0.0,
      "rescore_query_weight" : 1.0
    }
  }
}'
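The documented ranges for the two parameters can be checked on the client before a request is sent. The Python sketch below builds the "rescore" section of the example above and applies those checks. The function name rni_name_rescore is hypothetical, the structure mirrors the name-based example, and the client-side validation is an assumption of this sketch rather than a description of how RNI reports invalid values.

```python
def rni_name_rescore(field, query_name, window_size,
                     score_to_rescore_restriction=0.4,
                     window_size_allowance=0.5):
    """Build the "rescore" section for the advanced rescorer.

    Defaults are the documented defaults for the two parameters.
    """
    if score_to_rescore_restriction < 0:
        raise ValueError("score_to_rescore_restriction cannot be negative")
    if not 0 < window_size_allowance <= 1:
        raise ValueError("window_size_allowance must be in the interval (0, 1]")
    return {
        "window_size": window_size,
        "rni_query": {
            "rescore_query": {
                "rni_function_score": {
                    "name_score": {
                        "field": field,
                        "query_name": query_name,
                        "score_to_rescore_restriction": score_to_rescore_restriction,
                        "window_size_allowance": window_size_allowance,
                    }
                }
            },
            "query_weight": 0.0,
            "rescore_query_weight": 1.0,
        },
    }


rescore = rni_name_rescore(
    "primary_name",
    {"data": "Jo Shmoe", "entityType": "PERSON"},
    window_size=200,
    score_to_rescore_restriction=1.0,
)
```

The call reproduces the "rescore" section of the example, with window_size_allowance left at its default of 0.5.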