Important
Unless otherwise specified, all inputs to RNI need to be UTF-8 encoded.
Verify that documents that have been copied from another system maintain UTF-8 encoding and have not been converted to another encoding scheme such as ASCII or UTF-16.
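One way to check a copied file before loading it is to ask iconv to decode it as UTF-8. The following is a minimal sketch and is not part of RNI: the helper name is_utf8 and the sample file names are hypothetical, and it assumes an iconv that rejects invalid input, as the GNU and macOS versions do. It detects only byte sequences that are not valid UTF-8. It cannot detect characters that were already lost in a conversion to ASCII.

```shell
#!/bin/sh
# is_utf8 FILE: succeed (exit 0) only if FILE decodes cleanly as UTF-8.
# iconv exits with a non-zero status at the first invalid byte sequence.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical sample files, created only to keep the sketch self-contained.
printf 'Joaqu\303\255n Guzm\303\241n\n' > names_utf8.txt    # UTF-8 bytes
printf 'Joaqu\355n Guzm\341n\n' > names_latin1.txt          # ISO-8859-1 bytes

for f in names_utf8.txt names_latin1.txt; do
  if is_utf8 "$f"; then
    echo "$f: valid UTF-8"
  else
    echo "$f: NOT valid UTF-8, convert before loading"
  fi
done
```

If a file fails the check and you know its original encoding, iconv can usually convert it, for example from ISO-8859-1 with `iconv -f ISO-8859-1 -t UTF-8`.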
Elasticsearch provides real-time search and analytics for all kinds of data. The data is stored in documents, each having a set of fields, some of which are defined as search fields. An Elasticsearch index is a collection of these documents.
The RNI-Elasticsearch plugin uses an Elasticsearch index to store documents containing names, dates, addresses, or other fields to be matched.
Before using RNI to search for matches, you must create the index, define mappings, and load the index with documents.
- Create an Index, or a searchable container for your documents.
- Define a Mapping for fields that contain person, location, organization, or identifier entity types. The type of a name field to be searched by RNI is "rni_name". A mapping defines the data types of each of the searchable fields in a document. The mapping does not have to include every field in the document, just the searchable fields.
- Index Documents that contain one or more name fields along with other fields of interest. This step loads the documents into the index.
- Test the RNI integration before continuing on.
Once you've completed the above steps, you are ready to query the index.
The following snippets use the cURL command-line tool to illustrate the Elasticsearch commands for running the plugin. You can also use Kibana, an open source dashboard for Elasticsearch.
An Elasticsearch index consists of one or more documents, and a document contains one or more fields. A name index is an indexed list of names.
By default, a local Elasticsearch instance listens at localhost:9200.
The following cURL statement creates an index named rni-test.
curl -XPUT 'http://localhost:9200/rni-test'
A mapping defines how a document, along with the fields it contains, is stored and indexed and sets the types of the search fields. For name search fields, set the "type" of the name fields to "rni_name".
The following statement maps the "primary_name" and "aka" (also known as) fields in the document to the "rni_name" type in the "rni-test" index.
curl -XPUT 'http://localhost:9200/rni-test/_mapping' -H'Content-Type: application/json' -d '{
  "properties" : {
    "primary_name" : { "type" : "rni_name" },
    "aka" : { "type" : "rni_name" },
    "occupation" : { "type" : "text" }
  }
}'
Prior to the RNI-RNT 7.35.1.c65.0 release (September 2021), entityType was not considered when querying. For example, a name with an entityType of PERSON could match a name with an entityType of ORGANIZATION. To return to this behavior, set the mapping parameter testEntityType to false. Indexed names with any entityType, or with none, can then be returned regardless of the entityType in the search.
This is the step where you add your data, or documents, to the index. A document is a JSON object containing one or more fields. Each field in a document is a key-value pair, where the key is the field name and the value is the data.
Documents may include fields other than name fields.
curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{
  "primary_name" : "Joe Schmoe",
  "aka" : "Bossman",
  "occupation" : "business owner"
}'
Name fields can include properties in addition to the name string (or "data" property). Properties are used when searching to optimize the search algorithms for the data. The "entityType" property is particularly important for name searching and customizations.
Example:
curl -XPUT 'http://localhost:9200/rni-test/_doc/3' -H'Content-Type: application/json' -d '{ "primary_name" : { "data" : "Joe Schmoe", "language" : "eng", "script" : "Latn", "entityType" : "PERSON" } }'
Tip
When creating a large set of documents, use Bulk Insert for optimal performance.
Tip
You may need to wait a few minutes for the documents to be ready to query. Documents are not always immediately available from Elasticsearch after being added to the index.
The entityType field identifies the type of name being matched and is used to select the algorithms for matching. Where supported, stop words and override files are specific to an entity type. Parameters can be set for specific languages and entity types.
Important
The entityType should always be specified to utilize all available methods when indexing and matching names. If you don't specify an entityType, the type PERSON will be used.
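Because a missing entityType silently falls back to PERSON, it can help to look for records that omit it before indexing. The following is a minimal sketch under stated assumptions: the source documents are stored one JSON object per line in a hypothetical file named docs.ndjson, and the helper name missing_entity_type is also hypothetical. It is a plain text match, not a JSON parser, so treat the result as a hint rather than a validation.

```shell
#!/bin/sh
# missing_entity_type FILE: print every line of FILE that does not contain
# an "entityType" property. Names given as plain strings are reported too,
# since they carry no entityType either.
missing_entity_type() {
  grep -v '"entityType"' "$1"
}

# Hypothetical sample data: only the first record sets entityType.
cat > docs.ndjson <<'EOF'
{"primary_name":{"data":"Basis Technology","entityType":"ORGANIZATION"}}
{"primary_name":{"data":"Amazon"}}
{"primary_name":"Joe Schmoe"}
EOF

# Prints the second and third records, which would be indexed as PERSON.
missing_entity_type docs.ndjson
```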
Table 2. Entity Types
| Type | Description | Features |
|---|---|---|
| PERSON | A human identified by name, nickname, or alias. | Values are tokenized and token pairs are compared. Stop words, overrides, frequency and gender models are supported. |
| LOCATION | A city, state, country, region or other location. | Values are tokenized and token pairs are compared. Stop words, overrides, and frequency models are supported. Real World IDs are supported. |
| ORGANIZATION | A corporation, institution, government agency, or other group of people defined by an established organizational structure. | Values are tokenized and token pairs are compared. Stop words, overrides, frequency models, and embeddings are supported. |
| IDENTIFIER, IDENTIFIER:DRIVERS_LICENSE, IDENTIFIER:LICENSE_PLATE, IDENTIFIER:NATIONAL_ID_NUM | An alphanumeric identifier. | Values are not tokenized. The entire identifier is treated as a string. Scoring is primarily by string edit distance. |
You can process fielded names by separating the fields with "|". RNI assigns no explicit semantics to each field (such as given name or surname), but it does pay attention to the order of the fields when comparing two fielded names. RNI assigns lower scores to matches that cross field boundaries (e.g., the first field in name1 matches the second field in name2). Fields within a name can be empty.
When scoring a potential match between a name with fields and a name without fields, RNI treats the name without fields as if it were a name with a single field.
RNI treats trailing empty fields as if they were not present. For example "Rosanne|Taylor Smith|" is treated the same as "Rosanne|Taylor Smith".
Alternatively, you have the option of specifying that there is an unknown value in a field. To specify an unknown name field, replace the field with *?*.
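The fielded-name forms described above can be sketched with a small helper that joins its arguments with the vertical bar. The helper name join_fields is hypothetical and the snippet only builds the name strings. It does not call RNI and it does not escape vertical bars that appear inside a field.

```shell
#!/bin/sh
# join_fields FIELD...: join the arguments with "|" to form a fielded name.
# An empty argument yields an empty field; pass '*?*' for an unknown field.
join_fields() {
  # "$*" joins the arguments with the first character of IFS.
  ( IFS='|'; printf '%s\n' "$*" )
}

join_fields 'Rosanne' 'Taylor Smith'      # prints: Rosanne|Taylor Smith
join_fields 'Rosanne' 'Taylor Smith' ''   # prints: Rosanne|Taylor Smith|
join_fields '' 'Taylor Smith'             # prints: |Taylor Smith
join_fields '*?*' 'Taylor Smith'          # prints: *?*|Taylor Smith
```

Per the rules above, the first two results are treated the same, because the trailing empty field is ignored.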
Names Containing Special Characters
When using JSON objects with RNI, special characters must be properly escaped when used in strings. RNI requires a backslash to escape the special character, and JSON then requires another backslash to escape the first backslash. Thus, in RNI, the proper escape sequence for a special character in a name is a double backslash (\\).
The | used in fielded names is one example of a special character embedded within a name, where | separates the fields. To process the vertical bar correctly, RNI needs to distinguish a fielded name from a name that contains a literal vertical bar character.
Assume a name that includes a | that does not indicate a fielded name: "John|Smith". RNI requires that you escape the vertical bar with a backslash; e.g. "John\|Smith". JSON then requires that the backslash character be escaped with another backslash. The correct syntax for the name "John|Smith" is therefore "John\\|Smith". If the entry represented a fielded name, the correct syntax would be "John|Smith" without any backslashes.
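The two escaping layers can be applied mechanically. The following is a minimal sketch using sed. The helper name rni_json_escape is hypothetical, and it handles only the vertical bar and the backslash. Other characters that JSON requires to be escaped, such as double quotes, are outside its scope.

```shell
#!/bin/sh
# rni_json_escape NAME: prepare a name containing a literal "|" for use
# inside a JSON string.
#   1. Prefix each "|" with a backslash (the RNI escape).
#   2. Double every backslash (the JSON escape).
# Do not use this for fielded names, where "|" must stay unescaped.
rni_json_escape() {
  printf '%s' "$1" | sed -e 's/|/\\|/g' -e 's/\\/\\\\/g'
}

printf '{"primary_name":"%s"}\n' "$(rni_json_escape 'John|Smith')"
# prints: {"primary_name":"John\\|Smith"}
```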
Verify the RNI SDK Version
To verify the version of the RNI SDK being used by the plugin, send a GET request to {index_name}/rni_plugin/_get_version:
curl -XGET 'localhost:9200/rni-test/rni_plugin/_get_version'
This call also verifies that the RNI plugin is installed and running successfully.
Bulk insert allows you to add multiple documents to Elasticsearch in a single API call, improving the throughput for uploading documents by orders of magnitude. We recommend you use bulk indexing to create and index your data wherever possible.
- Create the index.
- Define the mapping.
- Run Bulk Insert.
Tip
Do not perform any queries or searches on the cluster while indexing data via the bulk index API. Doing so can cause significant performance issues.
The structure for all Elasticsearch bulk API calls is:
{ action_to_be_performed: { metadata_related_to_action}}\n
{ request_body_data_to_index }\n
We're going to continue the example that we started. The index is rni-test. The mapping defines primary_name, aka, and occupation.
- Create the index.
curl -X PUT http://localhost:9200/rni-test
- Define the mapping.
The previously defined mapping:
{
  "properties" : {
    "primary_name" : { "type" : "rni_name" },
    "aka" : { "type" : "rni_name" },
    "occupation" : { "type" : "text" }
  }
}
You can put the mapping in a JSON file and create it from the command line. The following curl command creates the mapping using a file (es_mapping.json in this example):
curl -X PUT -H"Content-Type:application/json" -d @es_mapping.json http://localhost:9200/rni-test/_mapping
- Create a data file in newline delimited JSON (NDJSON) format. Save the file as bulknames.json. The file MUST end with a newline after the final record.
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "Joaquín Guzmán","entityType":"PERSON"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "René Lindström Jones","entityType":"PERSON"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "Guadalupe Hernandez","entityType":"PERSON"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "Chris Joseph Arsenault","entityType":"PERSON"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "ABC","entityType":"ORGANIZATION"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "Basis Technology","entityType":"ORGANIZATION"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "Australian Broadcasting Corporation","entityType":"ORGANIZATION"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "Amazon","entityType":"ORGANIZATION"}}
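A file in this shape can also be generated from a plain list of names. The following is a minimal sketch, not a definitive loader: the helper name make_bulk and the output file name bulknames_sketch.json are hypothetical, and it assumes the names contain no double quotes or backslashes, since it performs no JSON escaping.

```shell
#!/bin/sh
# make_bulk INDEX ENTITY_TYPE: read one name per line from standard input and
# write the action line and the source line for each name.
make_bulk() {
  while IFS= read -r name; do
    [ -n "$name" ] || continue    # skip blank lines
    printf '{"index":{"_index":"%s","_id":null}}\n' "$1"
    printf '{"primary_name":{"data":"%s","entityType":"%s"}}\n' "$name" "$2"
  done
}

# Every printf ends its line with "\n", so the file ends with the newline
# that the bulk API requires.
printf '%s\n' 'Basis Technology' 'Amazon' |
  make_bulk rni-test ORGANIZATION > bulknames_sketch.json
```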
- Load the data file through the _bulk endpoint with the following curl command:
curl -X POST -H"Content-Type:application/json" --data-binary @bulknames.json http://localhost:9200/rni-test/_bulk
Note
If you're providing text file input to curl, use the --data-binary flag instead of plain -d to preserve the newlines.
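Since the bulk file must end with a newline and pairs each action line with a source line, a quick local check before uploading can catch a malformed file. The following is a minimal sketch: the helper name check_bulk_file and the sample file names are hypothetical, and it checks only the file's shape, not whether each line is valid JSON.

```shell
#!/bin/sh
# check_bulk_file FILE: succeed only if FILE is non-empty, ends with a
# newline, and has an even number of lines (one action line per source line).
check_bulk_file() {
  [ -s "$1" ] || return 1
  # Command substitution strips a trailing newline, so the result is empty
  # exactly when the last byte of the file is a newline.
  [ -z "$(tail -c 1 "$1")" ] || return 1
  [ $(( $(wc -l < "$1") % 2 )) -eq 0 ]
}

# Hypothetical sample files: the second one lacks the final newline.
action='{"index":{"_index":"rni-test","_id":null}}'
source_line='{"primary_name":{"data":"Amazon","entityType":"ORGANIZATION"}}'
printf '%s\n%s\n' "$action" "$source_line" > good_bulk.json
printf '%s\n%s' "$action" "$source_line" > bad_bulk.json

for f in good_bulk.json bad_bulk.json; do
  if check_bulk_file "$f"; then
    echo "$f: looks ready for _bulk"
  else
    echo "$f: fix before loading"
  fi
done
```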