RNI-Elasticsearch is an Elasticsearch plugin for building fuzzy name retrieval and name matching applications for persons, locations, and organizations. It uses Rosette Name Indexer (RNI) to perform high-speed, scalable, cross-language, and cross-script searches, and relies on the Elasticsearch full-text search engine to store the names and search keys.
RNI performs searches across a large set of languages and writing scripts. Refer to Supported Text Domains for Name Indexing and Name Matching for the complete list of supported languages.
This guide describes how to use the RNI-Elasticsearch plugin and RNI features, and is not intended to be a complete guide to Elasticsearch.
Names are complex to match because of the large number of variations that occur within a language and across languages. RNI breaks a name into tokens and compares the matching tokens. RNI can identify variations between matching tokens, including, but not limited to, typographical errors, phonetic spelling variations, transliteration differences, initials, and nicknames.
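The idea of splitting names into tokens and comparing them pairwise can be illustrated with a small toy sketch. This is not RNI's algorithm: the function names (`tokenize`, `token_similarity`, `align_tokens`), the edit-distance measure, and the 0.8 score for an initial are all assumptions chosen for illustration only.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def token_similarity(a, b):
    """Toy similarity in [0, 1] between two non-empty tokens."""
    if a == b:
        return 1.0
    # Hypothetical rule: a single letter is treated as an initial.
    if len(a) == 1 or len(b) == 1:
        return 0.8 if a[0] == b[0] else 0.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def tokenize(name):
    """Lowercase the name and split it on whitespace and periods."""
    return name.lower().replace(".", " ").split()


def align_tokens(query, indexed):
    """Greedily pair each query token with its most similar unused index token."""
    remaining = tokenize(indexed)
    pairs = []
    for q in tokenize(query):
        if not remaining:
            break
        best = max(remaining, key=lambda t: token_similarity(q, t))
        pairs.append((q, best, token_similarity(q, best)))
        remaining.remove(best)
    return pairs
```

For example, `align_tokens("Jon Smith", "Smith John")` pairs `jon` with `john` (a spelling variation) and `smith` with `smith`, regardless of token order. A production matcher would also need to handle phonetic variants, transliteration, and nicknames, which this sketch ignores.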
RNI scores range from 0 to 1. The higher the score, the greater the confidence that this is a relevant match. A score of 1.0 indicates that the query name string and result name string are identical (including all name properties).
The match score is a relative indication of how similar the match is; it is not an absolute value. When comparing different name matches, the relative ranking of the scores is more relevant than the actual score values. Similar name matches in different languages may generate different match scores. To understand how RNI calculates the score, see Understanding Name Match Scores.
Scores less than 1.0 for similar names indicate the query name and index name vary with respect to one or more properties (such as language of origin) and/or one or more of the following:
Scoring is commutative: the scores for two given names are always the same, regardless of which name is in the index and which name is in the query.
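The commutative property can be expressed as a simple check: for every pair of names, the score must be the same in both directions. The sketch below is a hedged illustration only. `toy_score` is a stand-in similarity (a Dice coefficient over character bigrams), not RNI's scoring, and `is_commutative` is a hypothetical helper that accepts any two-argument scoring function.

```python
from itertools import combinations


def bigrams(name):
    """Set of character bigrams of the lowercased, whitespace-normalized name."""
    text = " ".join(name.lower().split())
    return {text[i:i + 2] for i in range(len(text) - 1)}


def toy_score(a, b):
    """Dice coefficient over character bigrams; symmetric by construction."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


def is_commutative(score_fn, names, tolerance=1e-9):
    """Return True if score_fn(a, b) equals score_fn(b, a) for every pair of names."""
    return all(
        abs(score_fn(a, b) - score_fn(b, a)) <= tolerance
        for a, b in combinations(names, 2)
    )
```

Calling `is_commutative(toy_score, ["John Smith", "Jon Smyth", "J. Smith"])` returns `True`. The same helper could wrap whatever function you use to obtain a score for a query name against an indexed name, swapping the two roles to confirm that the scores agree.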
You can configure RNI to customize how it scores different matching phenomena.
The score weighting associated with a token may vary depending on the token's characteristics, such as how frequently it appears in the language model (the more frequent the token, the lower its weighting).
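To see why frequency-based weighting matters, consider a minimal sketch in which rarer tokens contribute more to the overall score. Everything here is assumed for illustration: the token counts are invented, and the smoothed inverse-frequency formula is a common information-retrieval heuristic, not the formula RNI uses.

```python
import math

# Hypothetical token frequencies, invented for illustration only.
TOKEN_COUNTS = {"john": 5000, "smith": 3000, "zbigniew": 12}
TOTAL = sum(TOKEN_COUNTS.values())


def token_weight(token, counts=TOKEN_COUNTS, total=TOTAL):
    """Smoothed inverse-frequency weight: the rarer the token, the larger the weight."""
    return math.log((total + 1) / (counts.get(token, 0) + 1))


def weighted_score(token_scores):
    """Weighted mean of per-token similarity scores, given as {token: score}."""
    weights = {t: token_weight(t) for t in token_scores}
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    return sum(weights[t] * s for t, s in token_scores.items()) / total_weight
```

Under these assumed counts, a weak match on the rare token `zbigniew` lowers the overall score far more than a weak match on the common token `john`: `weighted_score({"john": 1.0, "zbigniew": 0.5})` is lower than `weighted_score({"john": 0.5, "zbigniew": 1.0})`.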