At this point you've created an index and loaded data. Now you can start using RNI to search for matches.
A query searches the index and returns a match score. In RNI, the query for a name consists of two parts, a base query and a rescorer.
The base query is a standard Elasticsearch query against a name field. The rescorer takes the results of the base query and uses Elasticsearch rescoring to select the top candidates and perform pairwise matching on them.
The query returns an RNI match score (max_score), the score of the top-scoring document.
Important
The entityType (PERSON, LOCATION, ORGANIZATION) should always be added to a name query to utilize all RNI features. If you don't specify an entityType, the type NONE will be used and RNI may return less accurate results.
The base query is a standard query against a name field:
"query" : {
"match" : {
"primary_name" : "Jo Shmoe"
}
}
Querying supports the same name properties that you may use when indexing documents. Unlike during document creation, you must pass the JSON object containing the name fields as a string. You should always include the entityType property in your query.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
"query" : {
"match" : {
"primary_name" : "{\"data\" : \"Jo Shmoe\", \"entityType\" : \"PERSON\"}"
}
}
}'
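The string encoding of the name object is easy to get wrong when written by hand. As a minimal sketch in Python (the helper name build_base_query is hypothetical and not part of RNI), json.dumps can produce the escaped string for the match clause:

```python
import json


def build_base_query(field, name, entity_type="PERSON"):
    """Build a base match query whose value is the name object encoded as a string."""
    name_object = {"data": name, "entityType": entity_type}
    # The match clause expects the JSON object serialized to a string,
    # not the object itself.
    return {"query": {"match": {field: json.dumps(name_object)}}}


body = build_base_query("primary_name", "Jo Shmoe")
print(json.dumps(body, indent=2))
```

The printed body is equivalent to the payload of the curl command above and can be sent to the _search endpoint with any HTTP client.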
Much like during indexing, RNI creates a set of keys based on the name and then generates a more complex internal query to match against the indexed keys.
Rescore with the RNI Pairwise Name Match
The base query returns a ranked list of matching documents. The rescorer takes the top documents from the list, runs the pairwise matching algorithms on them, and returns a re-ranked list. RNI has a custom rescorer that lets you further tune which candidates are passed to the RNI pairwise matcher. Because pairwise matching is computationally intensive, you want to rescore just enough documents to find the best matches.
Elasticsearch Rescoring includes the following parameters:
- window_size (an integer, defaults to 10) specifies how many documents from the base query should be passed to the RNI pairwise matcher.
Use this parameter to limit the number of compute-intensive name matches that need to be performed. If you set the value too high, the query will take too long, but if you set the value too low, you will increase the number of false negatives.
Tip
A good starting point for window_size is to make it the square root of the size of the index. For example, an index of 10,000 entries would use a window_size of 100.
- query_weight (a float, defaults to 1.0) specifies the weighting of the score returned by the base query.
In the context of RNI pairwise matching, the base query score has little meaning, so we suggest you set it to 0.0.
- rescore_query_weight (a float, defaults to 1.0) specifies the weighting of the maximum RNI pairwise match score.
If query_weight is 0.0 and rescore_query_weight is 1.0, the score that is returned by rescoring is the RNI pairwise match score.
- score_mode controls how the query and rescore query scores are combined. The default value is total, meaning that both scores are added together after being multiplied by their respective weights.
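The arithmetic behind these parameters can be sketched in a few lines of Python. Both function names are hypothetical; they restate the square-root tip and the total score mode described above and are not RNI or Elasticsearch code:

```python
import math


def suggested_window_size(index_size):
    """Starting point from the tip above: the square root of the index size."""
    return max(1, round(math.sqrt(index_size)))


def combined_score(base_score, rni_score, query_weight=1.0, rescore_query_weight=1.0):
    """score_mode 'total': each score is multiplied by its weight, then summed."""
    return query_weight * base_score + rescore_query_weight * rni_score


# 100 for the 10,000-entry index used in the tip.
print(suggested_window_size(10_000))

# With query_weight 0.0 and rescore_query_weight 1.0, only the RNI score remains.
print(combined_score(7.3, 0.8, query_weight=0.0, rescore_query_weight=1.0))
```

Treat the suggested window size as a first guess and adjust it against your own speed and accuracy measurements.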
In the following example, pairwise matching is performed on the top 200 names returned by the base query.
Example with RNI Rescorer:
"rescore" : {
"window_size" : 200,
"query" : {
"rescore_query" : {
"function_score" : {
"name_score" : {
"field" : "primary_name",
"query_name" : {"data" : "Jo Shmoe", "entityType":"PERSON"}
}
}
},
"query_weight" : 0.0,
"rescore_query_weight" : 1.0
}
}
The "name_score" function matches every name in the given field against the query name and returns the maximum score to the rescorer.
The "name_score" function score query must be given at least one object that specifies the document field to match against ("field") and the name to match ("query_name"), as in the example above.
The object passed to the name_score function can also include any of the name properties.
This example illustrates the full query incorporating both match and rescore, using RNI query parameters.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
"query" : {
"match" : {
"primary_name" : "{\"data\" : \"Jo Shmoe\",\"entityType\" : \"PERSON\"}"
}
},
"rescore" : {
"window_size" : 200,
"query" : {
"rescore_query" : {
"function_score" : {
"name_score" : {
"field" : "primary_name",
"query_name" : {"data" : "Jo Shmoe", "entityType":"PERSON"}
}
}
},
"query_weight" : 0.0,
"rescore_query_weight" : 1.0
}
}
}'
This query returns an RNI match score against "Joe Shmoe" in the "_score" field:
{
"_index": "rni-test",
"_type": "_doc",
"_id": "1",
"_score": 0.80217975,
"_source": {
"primary_name": "Joe Shmoe",
"aka": "Bossman",
"occupation": "business owner"
}
}
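In a full Elasticsearch search response, hits like the one above are listed under hits.hits. The Python sketch below assembles the combined request body and reads names and scores back out of a response. The helper names (build_rescored_query, top_matches) are hypothetical, and the response is a hand-built sample shaped like the hit above, not output from a live cluster:

```python
import json


def build_rescored_query(field, name_object, window_size=200):
    """Combine the base match query with the name_score rescorer."""
    return {
        # The base query takes the name object serialized as a string.
        "query": {"match": {field: json.dumps(name_object)}},
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "function_score": {
                        # The rescorer takes the name object itself.
                        "name_score": {"field": field, "query_name": name_object}
                    }
                },
                "query_weight": 0.0,
                "rescore_query_weight": 1.0,
            },
        },
    }


def top_matches(response, field):
    """Return (name, score) pairs in the order the response lists them."""
    return [(hit["_source"][field], hit["_score"]) for hit in response["hits"]["hits"]]


body = build_rescored_query("primary_name", {"data": "Jo Shmoe", "entityType": "PERSON"})

# Hand-built sample for illustration only.
sample_response = {
    "hits": {
        "max_score": 0.80217975,
        "hits": [
            {"_id": "1", "_score": 0.80217975, "_source": {"primary_name": "Joe Shmoe"}}
        ],
    }
}
print(top_matches(sample_response, "primary_name"))
```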
RNI includes a customized RNI rescorer query parameter, rni_query, which utilizes RNI advanced features. The RNI custom rescorer uses the parameters above as well as the following parameters to determine the number of documents that will be rescored. Rescoring fewer documents increases speed, but can be at the cost of accuracy if the best documents are not passed to the rescorer.
- score_to_rescore_restriction (a float, defaults to 0.86 for name matching and 0.0 [off] for address matching, cannot be negative) is used to calculate the minimum query score a document needs to be passed to the RNI rescorer. The larger the value, the more selective RNI is in passing documents to be rescored. This parameter only takes effect if more than 50 documents are returned from the first pass.
A value of 0.0 means the parameter will not have any effect on the number of documents being rescored. Higher values rescore fewer documents, increasing speed at the cost of accuracy.
Note
score_to_rescore_restriction does not apply to rescore queries for nested fields.
- window_size_allowance (a float, defaults to 0.5 for name matching and 1.0 [off] for address matching, must be in the interval [0,1]) dynamically controls the window size for rescoring. No more than window_size names will be scored.
A value of 1.0 will not cut off any documents from being rescored. Higher values rescore more documents, increasing accuracy at the cost of speed.
- score_if_null (a float, defaults to -1.0, indicating the feature is off). Set to a value between 0 and 1 to enable it. When set, that value is returned when the field is missing from the document.
- filter_out_scores_below (a float, must be in the interval [0,1], defaults to 0, indicating that the feature is off). It sets the minimum score required for a given field match to be returned by the rescorer.
Note
Even if every rescored document fails to exceed the filter_out_scores_below threshold, Elasticsearch expects to return a non-empty list of results. In this case, the highest scoring document will be returned with a score of 0 (unless processed by another rescorer).
Note
When using the advanced rescorer, the default value for score_mode is max, not total. In this mode, the maximum of the original score and rescore query score is used.
In the following example, pairwise matching is performed on the top 200 names returned by the base query.
Example with the Advanced Rescorer
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
"query" : {
"match" : {
"primary_name" : "{\"data\" : \"Jo Shmoe\", \"entityType\" : \"PERSON\"}"
}
},
"rescore" : {
"window_size" : 200,
"rni_query" : {
"rescore_query" : {
"rni_function_score" : {
"name_score" : {
"field" : "primary_name",
"query_name" : {"data" : "Jo Shmoe", "entityType":"PERSON"},
"score_to_rescore_restriction": 1.0,
"window_size_allowance": 0.5
}
}
},
"query_weight" : 0.0,
"rescore_query_weight" : 1.0
}
}
}'
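Because the advanced parameters have documented ranges, it can help to check them on the client before sending a query. The following Python sketch uses a hypothetical helper (advanced_name_score) and assumes the ranges given in the parameter list above: score_to_rescore_restriction cannot be negative and window_size_allowance lies in [0,1]. Parameters left as None are omitted so that the server-side defaults apply:

```python
def advanced_name_score(field, query_name,
                        score_to_rescore_restriction=None,
                        window_size_allowance=None):
    """Build the name_score object for rni_function_score, checking ranges first."""
    name_score = {"field": field, "query_name": query_name}
    if score_to_rescore_restriction is not None:
        if score_to_rescore_restriction < 0:
            raise ValueError("score_to_rescore_restriction cannot be negative")
        name_score["score_to_rescore_restriction"] = score_to_rescore_restriction
    if window_size_allowance is not None:
        if not 0.0 <= window_size_allowance <= 1.0:
            raise ValueError("window_size_allowance must be in [0, 1]")
        name_score["window_size_allowance"] = window_size_allowance
    return name_score


name_score = advanced_name_score(
    "primary_name",
    {"data": "Jo Shmoe", "entityType": "PERSON"},
    score_to_rescore_restriction=1.0,
    window_size_allowance=0.5,
)
print(name_score)
```

The returned object corresponds to the name_score entry in the curl example above.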
Advanced Rescore Parameters
The following parameters determine the number of documents that will be rescored. Rescoring fewer documents increases speed, but can be at the cost of accuracy if the best documents are not passed to the rescorer.
- score_to_rescore_restriction (a float, defaults to 0.4, cannot be negative) dynamically controls the minimum query score a document needs to be passed to the RNI rescorer.
A value of 0.0 will not cut off any documents from being rescored. Higher values rescore fewer documents, increasing speed at the cost of accuracy.
- window_size_allowance (a float, defaults to 0.5, must be in interval (0, 1]) dynamically controls the window size for rescoring. No more than window_size names will be scored.
A value of 1.0 will not cut off any documents from being rescored. Higher values rescore more documents, increasing accuracy at the cost of speed.
In the following example, pairwise matching is performed on the top 200 names returned by the base query.
"rescore" : {
"window_size" : 200,
"query" : {
"rescore_query" : {
"function_score" : {
"name_score" : {
"field" : "primary_name",
"query_name" : {"data" : "Jo Shmoe", "entityType":"PERSON"},
"score_to_rescore_restriction": 1.0,
"window_size_allowance": 0.5
}
}
},
"query_weight" : 0.0,
"rescore_query_weight" : 1.0
}
}
Representing Arrays in Elasticsearch
If the name field in your documents is structured as an array, such as first name and last name fields, wrap the field in a nested object. The nested datatype allows arrays of objects to be indexed and queried independently of each other.
Since Elasticsearch flattens object hierarchies into a simple list of field names and values, if you don't use the nested type, you can lose the relationship between the fields. For example, the following document:
"names" : [
{
"first" : "Joe",
"last" : "Smith"
},
{
"first" : "Mike",
"last" : "Shmoe"
}
]
would be transformed internally into a document that looks more like this:
{
"names.first" : [ "mike", "joe" ],
"names.last" : [ "smith", "shmoe" ]
}
The names.first and names.last fields are flattened into multi-value fields, and the association between Joe and Smith is lost. This document would incorrectly match a query for mike and smith.
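The effect of flattening can be imitated in plain Python. This is only a simulation of the behavior described above, with hypothetical helper names, and it does not call Elasticsearch:

```python
def flatten(objects, prefix="names"):
    """Collapse a list of objects into multi-value fields, as flattening does."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(prefix + "." + key, set()).add(value.lower())
    return flat


def flat_match(flat, first, last):
    """Match against flattened fields: each value is checked independently."""
    return first in flat["names.first"] and last in flat["names.last"]


def nested_match(objects, first, last):
    """Match against nested objects: both values must come from the same object."""
    return any(
        obj["first"].lower() == first and obj["last"].lower() == last
        for obj in objects
    )


names = [{"first": "Joe", "last": "Smith"}, {"first": "Mike", "last": "Shmoe"}]

print(flat_match(flatten(names), "mike", "smith"))  # True: a false match
print(nested_match(names, "mike", "smith"))         # False: the pairing is kept
```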
If you wrap an array field in a nested object, you will get more accurate search results.
Include a field of type "nested" containing the name field in the mapping:
"nested_names" : {
"type" : "nested",
"properties" : {
"name" : { "type" :"rni_name" }
}
}
Multiple names can be added to the nested field:
{
"nested_names" : [
{
"name" : "Joe Smith"
},
{
"name" : "Mike Shmoe"
}
]
}
Update the query to refer to the nested object. Set the "score_mode" to "max".
{
"query" : {
"nested" : {
"path" : "nested_names",
"query" : {
"match" : {
"nested_names.name" : "Mike Shmoe"
}
}
}
},
"rescore" : {
"query" : {
"rescore_query" : {
"nested" : {
"path" : "nested_names",
"score_mode" : "max",
"query" : {
"function_score" : {
"name_score" : {
"field" : "nested_names.name",
"query_name" : "Mike Shmoe"
}
}
}
}
},
"query_weight" : 0.0,
"rescore_query_weight" : 1.0
}
}
}
Please see the Elasticsearch documentation for more detailed information on nested objects and queries.
Let's consider an example of a database that includes alias names along with a primary name.
Nested Mapping:
"properties" : {
"primary_name" : {"type" : "rni_name"},
"aliases" : {
"type" : "nested",
"properties" : {
"alias_name" : { "type" : "rni_name" }
}
}
}
The curl command to create the mapping:
curl -XPUT "http://localhost:9200/rni-test/_mapping" -H 'Content-Type: application/json' -d '{
"properties" : {
"primary_name" : { "type" : "rni_name"},
"aliases" : {
"type" : "nested",
"properties": { "alias_name" : { "type" : "rni_name" }
}
}
}
}'
Each record includes a primary name. Each primary name can have multiple aliases.
"primary_name" : "John Smith",
"aliases": [
{"alias_name": "John Shark"},
{"alias_name": "Smithy"},
{"alias_name": "Johnny boy"}
]
The curl command to add the data:
curl -XPUT "http://localhost:9200/rni-test/_doc/null" -H 'Content-Type: application/json' -d '{
"primary_name" : "John Smith",
"aliases" : [
{"alias_name": "John Shark"},
{"alias_name": "Smithy"},
{"alias_name": "Johnny boy"}
]
}'
The query will try to match one of the aliases. Specify score_mode: max to return the highest match score of the aliases.
curl -XGET "http://localhost:9200/rni-test/_search" -H 'Content-Type: application/json' -d '{
"query" : {
"nested" : {
"path" : "aliases",
"query" : {
"match" : { "aliases.alias_name": "Johnny" }
}
}
},
"rescore" : {
"query" : {
"rescore_query" : {
"nested" : {
"path" : "aliases",
"score_mode" : "max",
"query" : {
"function_score" : {
"name_score" : {
"field" : "aliases.alias_name",
"query_name" : "Johnny"
}
}
}
}
},
"query_weight" : 0.0,
"rescore_query_weight" : 1.0
}
}
}'
Sorting Results by rni_name
Elasticsearch supports the ability to sort search results by the values of their document fields. In the case of RNI, one may want to sort on an rni_name field. Because these fields are internally composed of many subfields, it is necessary to specify the subfield to sort on. Below are a couple of subfields that you may be interested in:
As an example, if your field's name is primaryName, you can sort on the original name data by referring to primaryName.bt_rni_name_original in your sort specification.
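As a sketch of how such a sort specification might be assembled, the Python helper below (sort_by_original_name is a hypothetical name) appends the subfield from the example to a field name and emits a standard Elasticsearch sort clause:

```python
def sort_by_original_name(field, order="asc"):
    """Build a sort clause on the original-name subfield of an rni_name field."""
    subfield = field + ".bt_rni_name_original"
    return {"sort": [{subfield: {"order": order}}]}


sort_clause = sort_by_original_name("primaryName")
print(sort_clause)
```

Merge the resulting clause into a search body alongside the query.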
In the Java API, these fields can be referenced through the IndexFields enum. Regarding the previous example, one could refer to the same subfield in Java:
"primaryName." + IndexFields.ORIGINAL_NAME_FIELD.fieldName()