Elasticsearch provides real-time search and analytics for all kinds of data. The data is stored in documents, each having a set of fields, some of which are defined as search fields. An Elasticsearch index is a collection of these documents.
The RNI-Elasticsearch plugin uses an Elasticsearch index to store documents containing names, dates, addresses, or other fields to be matched.
Before using RNI to search for matches, you must create the index, define mappings, and load the index with documents.
Create an Index, or a searchable container for your documents.
Define a Mapping for fields that contain person, location, or organization names. The type of a name field to be searched by RNI is "rni_name". A mapping defines the data type of each searchable field in a document. The mapping does not have to include every field in the document, just the searchable fields.
Index Documents that contain one or more name fields along with other fields of interest. This step loads the documents into the index.
Test the RNI integration before continuing.
Once you've completed the above steps, you are ready to query the index.
The following snippets use the cURL command-line tool to illustrate the Elasticsearch commands for running the plugin. You can also use Kibana, an open source dashboard for Elasticsearch.
An Elasticsearch index consists of one or more documents, and a document contains one or more fields. A name index is an indexed list of names.
When running locally, Elasticsearch listens at localhost:9200 by default.
The following cURL statement creates an index named rni-test.
curl -XPUT 'http://localhost:9200/rni-test'
A mapping defines how a document, along with the fields it contains, is stored and indexed and sets the types of the search fields. For name search fields, set the "type" of the name fields to "rni_name".
The following statement maps the "primary_name" and "aka" (also known as) fields to the "rni_name" type in the "rni-test" index, and maps the "occupation" field to the standard "text" type.
curl -XPUT 'http://localhost:9200/rni-test/_mapping' -H'Content-Type: application/json' -d '{
"properties" : {
"primary_name" : { "type" : "rni_name" },
"aka" : { "type" : "rni_name" },
"occupation" : { "type" : "text" }
}
}'
This is the step where you add your data, or documents, to the index. A document is a JSON object containing one or more fields. Each field in a document is defined as a key-value pair, where the key is the field name and the value is the data.
Documents may include fields other than name fields.
curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{
"primary_name" : "Joe Schmoe",
"aka" : "Bossman",
"occupation" : "business owner"
}'
Name fields can include properties in addition to the name string (or "data" property). Properties are used when searching to optimize the search algorithms for the data. The "entityType" property is particularly important for name searching and customizations.
Example:
curl -XPUT 'http://localhost:9200/rni-test/_doc/3' -H'Content-Type: application/json' -d '{
"primary_name" : {
"data" : "Joe Schmoe",
"language" : "eng",
"script" : "Latn",
"entityType" : "PERSON"
}
}'
Important
Always specify the entityType (PERSON, LOCATION, or ORGANIZATION) when adding names to an index so that all RNI features can be used. If you don't specify an entityType, the type NONE is used and RNI may return less accurate results.
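One way to avoid omitting the entityType is to build name objects through a small wrapper. The sketch below is a hypothetical shell helper (make_name_object is an illustrative name, not an RNI or Elasticsearch command) that rejects anything other than the three entity types listed above. It assumes the name contains no characters that need JSON escaping.

```shell
# Hypothetical helper (not part of RNI): build a name object with an
# explicit entityType. Assumes the name needs no JSON escaping.
make_name_object() {
  name=$1
  entity_type=$2
  case "$entity_type" in
    PERSON|LOCATION|ORGANIZATION) ;;
    *) echo "unsupported entityType: $entity_type" >&2; return 1 ;;
  esac
  printf '{"data":"%s","entityType":"%s"}\n' "$name" "$entity_type"
}

# Example: prints {"data":"Joe Schmoe","entityType":"PERSON"}
make_name_object 'Joe Schmoe' PERSON
```

The resulting object can be used as the value of a name field such as "primary_name" in the document body.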
Tip
When creating a large set of documents, use Bulk Insert for optimal performance.
You can process fielded names by separating the fields with "|". RNI assigns no explicit semantics to each field (such as given name or surname), but it does pay attention to the order of the fields when comparing two fielded names. RNI assigns lower scores to matches that cross field boundaries (e.g., the first field in name1 matches the second field in name2). Fields within a name can be empty.
When scoring a potential match between a name with fields and a name without fields, RNI treats the name without fields as if it were a name with a single field.
RNI treats trailing empty fields as if they were not present. For example, "Rosanne|Taylor Smith|" is treated the same as "Rosanne|Taylor Smith".
Alternatively, you can specify that a field has an unknown value. To specify an unknown name field, replace the field with *?*.
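To make the field conventions above concrete, the sketch below joins name components into a fielded name and shows the trailing-empty-field equivalence. Both function names are hypothetical helpers, not part of RNI. RNI applies the trailing-field rule itself, so strip_trailing_empty_fields only mirrors the described behavior for illustration.

```shell
# Hypothetical helper: join the arguments into one fielded name using "|".
join_name_fields() {
  # "$*" joins the arguments with the first character of IFS;
  # the subshell keeps the IFS change local to this call.
  ( IFS='|'; printf '%s\n' "$*" )
}

# Hypothetical helper: drop trailing empty fields, mirroring how the text
# above says RNI treats them. Leading and middle empty fields are kept.
strip_trailing_empty_fields() {
  printf '%s\n' "$1" | sed 's/|*$//'
}

# Example: prints Rosanne|Taylor Smith|
join_name_fields 'Rosanne' 'Taylor Smith' ''

# Example: prints Rosanne|Taylor Smith
strip_trailing_empty_fields 'Rosanne|Taylor Smith|'
```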
Names Containing Special Characters
When using JSON objects with RNI, special characters must be properly escaped when used in strings. JSON requires a backslash to escape the control character, and RNI then requires another backslash to escape the JSON backslash. Thus, in RNI, the proper escape sequence for names containing a control character is a double backslash (\\).
The | used to separate the fields of a fielded name is one example of a control character that can also appear within a name. To process the vertical bar correctly, RNI needs to distinguish a fielded name from a name that contains a literal vertical bar.
Assume a name includes a | that does not indicate a fielded name: "John|Smith". JSON requires that a control character be escaped with a backslash, e.g., "John\|Smith". For RNI, that backslash must itself be escaped, so the correct syntax for the name "John|Smith" is "John\\|Smith". If the entry represented a fielded name, the correct syntax would be "John|Smith" without any backslashes.
Now assume a name includes a double backslash: "John\\Smith". In this case, the double backslash (\\) must also be escaped with the RNI escape sequence, \\. The proper format for this name is "John\\\\Smith".
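As a rough illustration of the vertical-bar case above, the following sketch rewrites each literal | in a non-fielded name as \\| so that the result can be placed in a JSON request body. The function name escape_literal_bars is hypothetical and not part of RNI. It deliberately handles only the vertical bar and does not touch backslashes or quotes already present in the name.

```shell
# Hypothetical helper (not part of RNI): escape literal "|" characters in a
# non-fielded name so the result matches the "John\\|Smith" form shown above.
escape_literal_bars() {
  # In the sed replacement, each \\ yields one backslash, so every "|"
  # becomes two backslashes followed by "|".
  printf '%s\n' "$1" | sed 's/|/\\\\|/g'
}

# Example: prints John\\|Smith
escape_literal_bars 'John|Smith'
```

Do not apply this helper to fielded names, where the | separators must remain unescaped.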
Verify the RNI SDK Version
To verify the version of the RNI SDK being used by the plugin, send a GET request to {index_name}/rni_plugin/_get_version:
curl -XGET localhost:9200/rni-test/rni_plugin/_get_version
This call also verifies that the RNI plugin is installed and running successfully.
Bulk insert allows you to add multiple documents to Elasticsearch in a single API call, improving the throughput for uploading documents by orders of magnitude. We recommend you use bulk indexing to create and index your data wherever possible.
Create the index.
Define the mapping.
Run Bulk Insert.
Tip
Do not perform any queries or searches on the cluster while indexing data via the bulk index API. Doing so can cause significant performance issues.
The structure for all Elasticsearch bulk API calls is:
{ action_to_be_performed: { metadata_related_to_action}}\n
{ request_body_data_to_index }\n
The following steps continue the earlier example. The index is rni-test, and the mapping defines the primary_name, aka, and occupation fields.
- Create the index.
curl -X PUT http://localhost:9200/rni-test
- Define the mapping.
The previously defined mapping:
{
"properties" : {
"primary_name" : { "type" : "rni_name" },
"aka" : { "type" : "rni_name" },
"occupation" : { "type" : "text" }
}
}
You can put the mapping in a JSON file and create it from the command line. The following curl command creates the mapping using a file (es_mapping.json in this example):
curl -X PUT -H"Content-Type:application/json" -d @es_mapping.json http://localhost:9200/rni-test/_mapping
- Create a data file in newline-delimited JSON (NDJSON) format. Save the file as bulknames.json. The file MUST end with a newline after the final record.
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "Joaquín Guzmán","entityType":"PERSON"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "René Lindström Jones","entityType":"PERSON"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "Guadalupe Hernandez","entityType":"PERSON"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "Chris Joseph Arsenault","entityType":"PERSON"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "ABC","entityType":"ORGANIZATION"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "Basis Technology","entityType":"ORGANIZATION"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "Australian Boradcasting Corporation","entityType":"ORGANIZATION"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "Amazon","entityType":"ORGANIZATION"}}
- Use the _bulk method to load the data file with the following curl command:
curl -X POST -H"Content-Type:application/json" --data-binary @bulknames.json http://localhost:9200/rni-test/_bulk
Note
If you're providing text file input to curl, use the --data-binary flag instead of plain -d to preserve the newlines.
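For larger data sets, the NDJSON file from the steps above can be generated rather than typed by hand. The sketch below is a hypothetical helper (bulk_lines and ends_with_newline are illustrative names, not Elasticsearch or RNI commands). It reuses the action line format from the sample file and assumes the names contain no characters that need JSON escaping.

```shell
# Hypothetical helper: print one action line and one document line per name.
# Usage: bulk_lines ENTITY_TYPE NAME [NAME...]
bulk_lines() {
  entity_type=$1
  shift
  for name in "$@"; do
    printf '%s\n' '{"index":{"_index":"rni-test","_id":null}}'
    printf '{"primary_name":{"data":"%s","entityType":"%s"}}\n' "$name" "$entity_type"
  done
}

# Hypothetical check: succeeds when the file is non-empty and its last byte
# is a newline (command substitution strips a trailing newline, leaving "").
ends_with_newline() {
  [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}

# Example: print the bulk lines for two organizations.
bulk_lines ORGANIZATION 'Basis Technology' 'Amazon'
```

Redirecting the output to bulknames.json and running ends_with_newline bulknames.json before the _bulk request gives a quick check of the trailing-newline requirement.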