RNI-Elasticsearch is an Elasticsearch plugin for building fuzzy name retrieval and name matching applications for persons, locations, and organizations. It uses the Rosette Name Indexer (RNI) to provide high-speed, scalable, cross-language, and cross-script searches, and uses the Elasticsearch full-text search engine to store the names and search keys.
This guide describes how to use the RNI-Elasticsearch plugin and RNI features, and is not intended to be a complete guide to Elasticsearch.
RNI-Elasticsearch is supported on the following operating systems and CPUs.
Table 1. Supported Platforms
OS | CPU
Mac OS X v10.9+ (Darwin 13) | AMD64
Linux | AMD64
Linux | AARCH64
Windows | AMD64
Java Only | Any OS and CPU with 64-bit Java SDK 11 through 18

Important: Java 19 SDK is not supported. Using Java 19 may result in unexpected matching behavior.
Names are complex to match because of the large number of variations that occur within a language and across languages. RNI breaks a name into tokens and compares the matching tokens. RNI can identify variations between matching tokens including, but not limited to, typographical errors, phonetic spelling variations, transliteration differences, initials, and nicknames.
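The token-based comparison described above can be illustrated with a toy sketch. This is not RNI's algorithm: the function names (`tokenize`, `token_similarity`, `align_tokens`), the fixed 0.8 value for an initial, and the use of Python's `difflib` for character-level similarity are all assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def tokenize(name: str) -> list[str]:
    """Lowercase a name, drop periods and commas, and split on whitespace."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return cleaned.split()


def token_similarity(a: str, b: str) -> float:
    """Toy similarity in [0, 1] between two tokens (illustrative values only)."""
    if a == b:
        return 1.0
    # Treat a one-letter token as an initial of the other token.
    if len(a) == 1 or len(b) == 1:
        return 0.8 if a[0] == b[0] else 0.0
    # Character-level similarity absorbs small typos and spelling variants.
    return SequenceMatcher(None, a, b).ratio()


def align_tokens(query: str, candidate: str) -> list[tuple[str, str, float]]:
    """Pair each query token with its most similar candidate token."""
    candidate_tokens = tokenize(candidate)
    pairs = []
    for q in tokenize(query):
        if not candidate_tokens:
            break
        best = max(candidate_tokens, key=lambda c: token_similarity(q, c))
        pairs.append((q, best, token_similarity(q, best)))
    return pairs
```

For example, `align_tokens("J. Smith", "John Smyth")` pairs `j` with `john` (an initial) and `smith` with `smyth` (a spelling variation). A real matcher would also need language-aware handling of nicknames and transliteration, which this sketch omits.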
RNI scores range from 0 to 1. The higher the score, the greater the confidence that this is a relevant match. A score of 1.0 indicates that the query name string and result name string are identical (including all name properties).
The match score is a relative indication of how similar two names are; it is not an absolute value. When comparing different name matches, how the scores rank relative to one another is more meaningful than any individual score. Similar name matches in different languages may generate different match scores. To understand how RNI calculates the score, see Understanding Name Match Scores.
Scores less than 1.0 for similar names indicate the query name and index name vary with respect to one or more properties (such as language of origin) and/or one or more of the following:
Scoring is commutative: the scores for two given names are always the same, regardless of which name is in the index and which name is in the query.
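To make the commutative property concrete, the following sketch shows one way a token-based score can be made independent of argument order. It is an illustration under stated assumptions, not RNI's scoring method: `token_sim`, `directed_score`, and `name_score` are hypothetical helpers. Note that `difflib.SequenceMatcher.ratio()` may itself depend on argument order, so the sketch compares tokens in a canonical (sorted) order and averages both matching directions.

```python
from difflib import SequenceMatcher


def token_sim(a: str, b: str) -> float:
    """Order-independent token similarity.

    SequenceMatcher.ratio() can vary with argument order, so the two
    tokens are always compared in sorted order.
    """
    first, second = sorted((a, b))
    return SequenceMatcher(None, first, second).ratio()


def directed_score(src: list[str], dst: list[str]) -> float:
    """Average, over src tokens, of the best similarity found in dst."""
    if not src or not dst:
        return 0.0
    return sum(max(token_sim(s, d) for d in dst) for s in src) / len(src)


def name_score(x: str, y: str) -> float:
    """Symmetric score: the mean of both matching directions."""
    xs, ys = x.lower().split(), y.lower().split()
    return (directed_score(xs, ys) + directed_score(ys, xs)) / 2
```

With this construction, `name_score(a, b) == name_score(b, a)` holds for any two names, which mirrors the guarantee that it does not matter which name is in the index and which is in the query.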
You can configure RNI to customize how it scores different match phenomena.
The score weighting associated with a token may vary depending on the token's characteristics, such as the frequency with which it appears in the language model (the more frequent, the lower the weighting).
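The idea that frequent tokens carry less weight can be sketched with an inverse-frequency weighting, similar in spirit to IDF in text retrieval. The source does not state RNI's actual formula; the counts in `TOKEN_COUNTS`, the `TOTAL_NAMES` value, and the logarithmic weighting below are hypothetical and chosen only to illustrate the behavior.

```python
import math

# Hypothetical token counts in a language model (assumed values).
TOKEN_COUNTS = {"john": 5000, "smith": 3000, "abernathy": 12}
TOTAL_NAMES = 100_000  # assumed size of the model's name corpus


def token_weight(token: str) -> float:
    """Inverse-frequency weight: the more frequent the token, the lower the weight."""
    count = TOKEN_COUNTS.get(token, 0)
    return math.log((TOTAL_NAMES + 1) / (count + 1))


def weighted_score(token_scores: dict[str, float]) -> float:
    """Combine per-token similarity scores into one weighted average."""
    total_weight = sum(token_weight(t) for t in token_scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(token_weight(t) * s for t, s in token_scores.items())
    return weighted_sum / total_weight
```

Under these assumed counts, a mismatch on a rare token such as `abernathy` lowers the combined score more than a mismatch on a common token such as `john`, because the rare token contributes a larger share of the total weight.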