The Rosette Name Indexer (RNI) enables high-speed, scalable, cross-language, and cross-script searches for names.
RNI uses the Apache Lucene full-text search engine to store names with their search keys and a key index. RNI updates and queries with Lucene are transactional.
When you search for a name, RNI generates a search key for each component of the name, locates all the names indexed by those search keys, and uses linguistic matching algorithms to filter that set of names down to the most similar names.
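The lookup-then-filter flow described above can be pictured with a small toy model. This is only a sketch under stated assumptions, not how RNI is implemented: the class TwoPassSearchSketch, the key function (the first three lowercase characters of each name component), and the token-overlap score are all hypothetical stand-ins for RNI's search keys and linguistic matching algorithms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy two-pass name search. Hypothetical; not how RNI is implemented. */
final class TwoPassSearchSketch {
    // Key index: search key -> names containing a component with that key.
    private final Map<String, Set<String>> keyIndex = new HashMap<>();

    // Assumed key function: the first three lowercase characters of a component.
    static String keyFor(String component) {
        String lower = component.toLowerCase();
        return lower.length() <= 3 ? lower : lower.substring(0, 3);
    }

    private static String[] components(String name) {
        return name.trim().split("\\s+");
    }

    void add(String name) {
        for (String component : components(name)) {
            keyIndex.computeIfAbsent(keyFor(component), k -> new LinkedHashSet<>()).add(name);
        }
    }

    // First pass (high recall): every name indexed under any key of the query.
    Set<String> candidates(String query) {
        Set<String> found = new LinkedHashSet<>();
        for (String component : components(query)) {
            found.addAll(keyIndex.getOrDefault(keyFor(component), Collections.emptySet()));
        }
        return found;
    }

    // Stand-in similarity: Jaccard overlap of lowercase components.
    static double score(String a, String b) {
        Set<String> shared = new LinkedHashSet<>(Arrays.asList(components(a.toLowerCase())));
        Set<String> other = new LinkedHashSet<>(Arrays.asList(components(b.toLowerCase())));
        Set<String> union = new LinkedHashSet<>(shared);
        union.addAll(other);
        shared.retainAll(other);
        return (double) shared.size() / union.size();
    }

    // Second pass (high precision): keep candidates at or above the threshold.
    List<String> search(String query, double minScore) {
        List<String> results = new ArrayList<>();
        for (String name : candidates(query)) {
            if (score(query, name) >= minScore) {
                results.add(name);
            }
        }
        return results;
    }
}
```

In this toy model, "John Smith" and "Johanna Smythe" share the key "joh", so both are candidates for the query "John Smith", but only the first survives the second-pass filter at a threshold of 0.5.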
For a list of the languages and writing scripts that RNI supports, see Fully Supported Text Domains for Rosette Name Indexer and Name Matching.
RNI provides a Java API that you can use to embed it in your applications. The RNI classes are in com.basistech.rni.index. Unqualified class names that appear in this section are in com.basistech.rni.index.
For detailed information about the API, see the Java API Reference shipped with RNI.
Constructing a Name Index
A name index is an indexed list of names. The list includes a collection of Name objects and associated keys.
The Name object includes the name, language, and script (RNI infers the script and language if the name definition does not include them). It may also include an entity type (such as person or place), a language of origin, and additional information (for place names, for example, you may want to store the geocoordinates).
Tip
You can also create an index in memory that is never stored on disk.
To create an indexed list of names on disk, you must specify a pathname for the data store, and you must use an IndexStoreDataModelFlags object (the default is fine).
Example:
https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/create_index.java
Once the index is created, use NameBuilder to create Name objects and add them to the index. NameBuilder provides a fluent interface that supports method chaining. The following fragment illustrates the syntax for creating and adding a name to the index.
https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/add_name.java
When you are finished adding names, close the name index, as in the preceding fragment.
Note
NameBuilder also includes static methods that you can use to determine the language and script for a name before creating the Name object: guessLanguage(String nameData) and guessScript(String nameData).
You can use hintLanguage(com.basistech.util.LanguageCode hintLanguage) to suggest the language when you create a Name. The NameBuilder uses the suggestion if it is compatible with the script; otherwise, it uses its own language guess.
When you are adding a large number of names to an index, you can use an INameIndexSession object to batch these additions into a single transaction. A single transaction is faster than adding each name in a separate transaction. For information, see RNI Sessions and Transactions, and for a sample application that adds multiple names in a single transaction, see AddNamesSample.
Once you have an index created, you can use queries to search the index for similar names.
The primary role of a name index is to perform queries. You can also perform updates (insertions and deletions).
StandardNameIndex provides a static method for opening a name index.
INameIndex index = StandardNameIndex.open(String indexPathname);
indexPathname is the path to the directory that contains the name index.
To optimize the index for more efficient queries, call
index.optimize();
When you are done using the name index, you must close it:
index.close();
Defining a Name Search Query
A query includes a Name object and may also include settings to constrain the query. For example, the query can specify the entity type, language, and/or script of the names that it returns. For the details, see the Javadoc for com.basistech.rni.index.IndexStoreDataModelFlags.
You can also define a query to return all the names associated with a specified entity.
Set up a NameIndexQuery object. For example:
// Define a query with a minimum match score of 0.30.
NameIndexQuery defineQuery(Name queryName)
        throws NameIndexException, NameIndexStoreException, RNTException {
    NameIndexQuery query = new NameIndexQuery(queryName);
    query.setNameDataMinimumMatchScore(0.30);
    return query;
}
Running the Query and Accessing the Query Results
INameIndex includes a query method that takes as its parameter the defined NameIndexQuery.
The query returns a NameIndexQueryResult iterator. Each NameIndexQueryResult object provides a Name object and a similarity score. As the following fragment illustrates, you can obtain and process each name and its score. The higher the score (greater than 0 and less than or equal to 1), the greater the confidence that the result is a relevant match. A score of 1.0 indicates that the query name string and result name string are identical. The types of variations matched by RNI are described in Name Variations. Scoring is commutative: the scores for two given names are always the same, regardless of which name is in the index and which name is in the query.
https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/query_index.java
SpanMatches. Each query result may contain information about spans (one or more tokens) in the query name that match or do not match spans in each result name. The NameIndexQueryResult provides a MatchResult object, which in turn provides the match type and a list of SpanMatch objects. For more information, see the Javadoc for com.basistech.rni.match.SpanMatch and com.basistech.rni.match.Span. The Javadoc for MatchResult#getSpanMatches() provides information about the scope and limitations on what is returned for names in various text domains.
When you are done running queries, close the index:
index.close();
For a sample Java application that defines a query, runs the query, and reports the results, see IndexQuerySample.
Retrieving Groups of Names
You may want to retrieve a group of names that share some common characteristic other than name similarity. Perhaps you even want to retrieve all the names in an RNI index.
https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/name_groups.java
The query returns all the names for which the Extra field contains the token used in the query.
Optimizing Query Performance
By adjusting NameIndexQuery parameters, you can optimize queries for your use case.
Tradeoffs Between Accuracy and Speed
RNI passes a subset of the highest scoring names from the first-pass high-recall search to the second-pass high-precision filter. The namesToCheckAllowance and maximumNamesToCheck parameters can be adjusted to control how many names are included in that subset.
- maximumNamesToCheck
  The maximumNamesToCheck parameter sets a hard limit on the number of names passed to the high-precision filter for each query. Use it to control the maximum query latency. The appropriate value is largely determined by the size of your index and should increase as your index grows.
- namesToCheckAllowance
  The namesToCheckAllowance parameter is a value between 0.0 and 1.0 used at query time to dynamically calculate the most efficient number of names to pass to the high-precision filter based on the commonality of the query name in the index. When set to 1.0, the value of maximumNamesToCheck is used for every query. After determining a good value for maximumNamesToCheck, adjust this parameter to fine-tune the performance.
In general, for greater speed and less accuracy (particularly recall), decrease the value of these parameters using:
For greater recall and less speed, increase those settings.
To pass all names found by the high-recall search to the high-precision filter, set:
Optimizing for Duplicate Names. If your index contains duplicate names, you should use setMaximumNamesToConsider(int maxNamesToConsider) to set the maximum number of names to consider to a value higher than the maximum number of names to check. RNI returns up to maximumNamesToConsider names from the first-pass high-recall search and sends up to maximumNamesToCheck of them to the second-pass high-precision filter. If there are any duplicates in the names returned by the first pass, the duplicates are not passed to the second pass. In other words, the score assigned by the second pass to the first instance of a given name is assigned to its duplicates without spending time sending them through the second pass. For optimal behavior, the ratio of maximumNamesToConsider to maximumNamesToCheck should be approximately the same as the average number of times that a name is repeated in the RNI index. So, for example, if each name is entered twice (on average), maximumNamesToConsider should be twice as big as maximumNamesToCheck. If your index does not include duplicates, you can use IndexStoreDataModelFlags to set optimizeDuplicateNames to false (the default setting is true), in which case RNI does not perform this optimization procedure.
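The sizing rule and the score reuse described above can be illustrated with a short sketch. It is a toy model, not RNI code: DuplicateNamesSketch, namesToConsider, and scoreWithReuse are hypothetical names, and the scorer passed in stands for the expensive second-pass filter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Toy model of the duplicate-name optimization; hypothetical, not RNI code. */
final class DuplicateNamesSketch {
    // Sizing rule from the text: names to consider is roughly
    // names to check multiplied by the average repetitions per name.
    static int namesToConsider(int maximumNamesToCheck, double averageRepeatsPerName) {
        return (int) Math.ceil(maximumNamesToCheck * averageRepeatsPerName);
    }

    // Scores each distinct name once; duplicates reuse the first instance's score.
    static List<Double> scoreWithReuse(List<String> firstPassNames,
                                       ToDoubleFunction<String> expensiveScorer) {
        Map<String, Double> scoreByName = new HashMap<>();
        List<Double> scores = new ArrayList<>();
        for (String name : firstPassNames) {
            Double score = scoreByName.get(name);
            if (score == null) {
                // Only the first instance pays for the expensive scoring.
                score = expensiveScorer.applyAsDouble(name);
                scoreByName.put(name, score);
            }
            scores.add(score);
        }
        return scores;
    }
}
```

With 500 names to check and each name entered twice on average, namesToConsider(500, 2.0) gives 1000, which matches the ratio recommended above.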
Constraints on maximum settings. maximumNamesToCheck and maximumResultsToReturn must be less than or equal to maximumNamesToConsider. As described above, maximumNamesToCheck may be less than maximumResultsToReturn. Accordingly, the order in which you make these settings is important. For example, you cannot set maximumResultsToReturn to a value higher than maximumNamesToConsider, so you may need to reset maximumNamesToConsider before you can reset maximumResultsToReturn.
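The ordering constraint can be made concrete with a toy model of the three limits. The following is a sketch under assumptions: QueryLimitsSketch is a hypothetical class, its starting values are illustrative rather than RNI defaults, and rejecting an invalid value with an IllegalArgumentException is an assumption about behavior, not documented RNI behavior.

```java
/** Toy model of the three query limits; hypothetical, not the RNI NameIndexQuery class. */
final class QueryLimitsSketch {
    // Illustrative starting values, not RNI defaults.
    private int maximumNamesToConsider = 1000;
    private int maximumNamesToCheck = 500;
    private int maximumResultsToReturn = 100;

    void setMaximumNamesToConsider(int value) {
        // Lowering this limit below a dependent limit would break the constraint.
        if (value < maximumNamesToCheck || value < maximumResultsToReturn) {
            throw new IllegalArgumentException("lower the dependent limits first");
        }
        maximumNamesToConsider = value;
    }

    void setMaximumNamesToCheck(int value) {
        requireWithinConsider(value);
        maximumNamesToCheck = value;
    }

    void setMaximumResultsToReturn(int value) {
        requireWithinConsider(value);
        maximumResultsToReturn = value;
    }

    int getMaximumResultsToReturn() {
        return maximumResultsToReturn;
    }

    private void requireWithinConsider(int value) {
        if (value > maximumNamesToConsider) {
            throw new IllegalArgumentException("raise maximumNamesToConsider first");
        }
    }
}
```

In this model, raising maximumResultsToReturn to 5000 fails until maximumNamesToConsider has been raised to at least 5000, which is the ordering issue the paragraph above describes.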
To simulate a high-recall search with perfect recall:
This is not recommended for a production environment due to the high amount of computation such a procedure requires, but it can be useful during development to identify recall errors (false negatives) made by the high-recall search but not the high-precision filter.
Tradeoffs Between False Positives and False Negatives
For fewer false positives (bad matches) and more false negatives (missing good matches) in your query results, you can:
The default minimum match score is NameIndexQuery.DEFAULT_MINIMUM_MATCH_SCORE. To reset this threshold, use setNameDataMinimumMatchScore(double nameDataMinimumMatchScore), where nameDataMinimumMatchScore is greater than 0 and less than or equal to 1.
The default maximum number of results to return is NameIndexQuery.DEFAULT_MAXIMUM_RESULTS_TO_RETURN. To reset this value, use setMaximumResultsToReturn(int maximumResultsToReturn).
To return an unlimited number of results, use setMaximumResultsToReturn(NameIndexQuery.UNLIMITED_RESULTS).
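The effect of the two settings on a result list can be sketched independently of RNI. The class ResultTrimSketch and its trim method are hypothetical, and treating the threshold as inclusive is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model of a score threshold plus a result cap; hypothetical, not RNI code. */
final class ResultTrimSketch {
    // Keeps scores at or above minScore, best first, capped at maxResults.
    static List<Double> trim(List<Double> scores, double minScore, int maxResults) {
        List<Double> kept = new ArrayList<>();
        for (double score : scores) {
            if (score >= minScore) {
                kept.add(score);
            }
        }
        kept.sort(Comparator.reverseOrder());
        return kept.size() > maxResults ? new ArrayList<>(kept.subList(0, maxResults)) : kept;
    }
}
```

For the scores 0.95, 0.40, 0.70, and 0.20, a threshold of 0.30 keeps three results, while raising it to 0.50 drops the 0.40 result: fewer false positives if that result was a bad match, one more false negative if it was a good one.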
RNI Sessions and Transactions
In addition to using the INameIndex API for performing operations on an RNI index, you can use the INameIndexSession API for finer-grained control. Sessions allow a set of operations to happen atomically (all occur or nothing occurs), and, especially for write operations, more efficiently. For those familiar with relational databases and SQL, the RNI concept of a session is similar to the JDBC concept of a connection with auto-commit mode off.
- To start a session, call INameIndex.openSession().
- To end the session, call close() on the resultant INameIndexSession object.
While INameIndexSession provides many of the same operations as INameIndex, such as query() and addName(), the difference is when changes to the index become permanent. INameIndex update operations are immediately flushed to disk, but INameIndexSession operations are not made permanent until you call commit(). At any time, you can invoke rollback() to undo all the operations since the last commit(). If you call rollback() before ever calling commit(), all of the operations of the session are undone.
You can run multiple sessions concurrently by having multiple threads call openSession() on the same INameIndex object. When multiple sessions are acting concurrently in separate threads, they are logically isolated from each other in order to not interfere with each other's operations. The isolation level is equivalent to READ COMMITTED, as outlined in the SQL-1992 Specification. This guarantees that one session will not see any uncommitted changes to the index performed by another session. In addition, a session will not see any uncommitted changes that it has made itself. For example, if a session adds a name to the index and then searches for that name before committing, it will not find the name it has added. You can also perform INameIndex auto-commit operations in the midst of one or more sessions; each INameIndex update or query is performed in its own session.
The session objects themselves are thread-safe; a session object may be shared by multiple threads.
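The visibility rules above (no session sees uncommitted changes, including its own) can be modeled with a few lines of plain Java. This is a toy sketch, not the RNI session implementation: ToyIndex and ToySession are hypothetical classes that keep names in memory, and only the commit(), rollback(), query, and addName ideas are taken from the text.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of session visibility; hypothetical, not the RNI session classes. */
final class SessionSketch {
    /** Shared committed state, standing in for the index on disk. */
    static final class ToyIndex {
        private final List<String> committed = new ArrayList<>();

        synchronized boolean contains(String name) {
            return committed.contains(name);
        }

        synchronized void apply(List<String> names) {
            committed.addAll(names);
        }

        ToySession openSession() {
            return new ToySession(this);
        }
    }

    /** Buffers adds until commit; queries read committed state only. */
    static final class ToySession {
        private final ToyIndex index;
        private final List<String> pending = new ArrayList<>();

        ToySession(ToyIndex index) {
            this.index = index;
        }

        synchronized void addName(String name) {
            pending.add(name);
        }

        // Pending adds are invisible, even to the session that made them.
        boolean query(String name) {
            return index.contains(name);
        }

        synchronized void commit() {
            index.apply(pending);
            pending.clear();
        }

        // Discards everything added since the last commit.
        synchronized void rollback() {
            pending.clear();
        }
    }
}
```

In this model, a session that adds a name and queries for it before calling commit() gets no match, and a name added and then rolled back never reaches the committed state.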
The INameIndexSession API is recommended for doing bulk adds to the index. It is much more efficient to create a single session for adding all the names of a bulk add than to use the INameIndex API. The following fragment shows an example.
https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/add_names.java
A Sample. For a sample application that adds multiple names in a single transaction, see AddNamesSample.
Local vs. Distributed Transactions. A local transaction is a set of operations performed atomically (all occur or nothing occurs) on a single index. A distributed transaction is a set of operations performed atomically on multiple data sources, such as a relational database and an RNI index. All the operations on all the data sources must take place, or none of the operations take place.
For local transactions, use the INameIndexSession API, as illustrated above. The transaction object is managed internally and is not visible to the user.
In order to participate in a distributed transaction, an INameIndexTransaction object must be created from the session by calling INameIndexSession.startTransaction(). This transaction object is linked with the session internally. There is a division of labor between the two objects: the session object can only be used for adding/removing/searching, and the transaction object can only be used for committing or rolling back. A typical use case would be to provide the session object to the user application while handing over the transaction object to a transaction manager.
One side effect of this division of labor between the session and transaction objects is that a session cannot call commit() or rollback() once it is associated with a distributed transaction. These operations are only allowed by the linked transaction object. Specifically, after calling INameIndexSession.startTransaction(), you should not call INameIndexSession.commit(). You must call INameIndexTransaction.commit() instead.
A session can be associated with multiple distributed transactions, one at a time. When the work for one transaction is finished, you may call INameIndexSession.startTransaction() again to start a new one.
Two-Phase Commit. INameIndexTransaction supports two-phase commits, a standard protocol for managing transactions robustly among multiple data sources. INameIndexTransaction provides the prepare(), commit(), and rollback() operations necessary for a transaction manager to effectively execute the protocol. RNI does not include a transaction manager.
The following simplified example illustrates the use of INameIndexTransaction in a distributed transaction with a two-phase commit. In this example, both transactions are RNI transactions.
https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/distributed_transaction.java
A Sample. For a sample application that illustrates a distributed transaction with a two-phase commit involving two RNI indexes, see DistributedTransactionSample.
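Because RNI does not include a transaction manager, it may help to see the coordinating side of the protocol in isolation. The following is a minimal sketch under assumptions: the Participant interface, the Recording test double, and the run method are hypothetical and only mirror the prepare(), commit(), and rollback() operations named above. A production transaction manager also handles logging and recovery, which this sketch omits.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy two-phase commit coordinator; hypothetical, not part of RNI. */
final class TwoPhaseCommitSketch {
    /** Hypothetical participant with the three operations named in the text. */
    interface Participant {
        void prepare() throws Exception;
        void commit() throws Exception;
        void rollback() throws Exception;
    }

    /** Participant that records calls and can be told to refuse prepare. */
    static final class Recording implements Participant {
        final List<String> log = new ArrayList<>();
        private final boolean failPrepare;

        Recording(boolean failPrepare) {
            this.failPrepare = failPrepare;
        }

        public void prepare() throws Exception {
            if (failPrepare) {
                throw new Exception("prepare refused");
            }
            log.add("prepare");
        }

        public void commit() {
            log.add("commit");
        }

        public void rollback() {
            log.add("rollback");
        }
    }

    /** Returns true if every participant committed, false if all were rolled back. */
    static boolean run(List<? extends Participant> participants) {
        try {
            // Phase 1: every participant must agree to commit.
            for (Participant p : participants) {
                p.prepare();
            }
        } catch (Exception e) {
            // Any refusal aborts the whole transaction.
            for (Participant p : participants) {
                try {
                    p.rollback();
                } catch (Exception ignored) {
                    // Best effort; a real manager would log and retry.
                }
            }
            return false;
        }
        // Phase 2: all prepared, so commit everywhere.
        for (Participant p : participants) {
            try {
                p.commit();
            } catch (Exception e) {
                throw new IllegalStateException("commit failed after prepare", e);
            }
        }
        return true;
    }
}
```

The design point is that no participant commits until all of them have prepared successfully, which is what makes the combined update atomic across data sources.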
No more than one INameIndex object may exist for a given name index on disk at any time.
Queries and updates may be performed in multiple threads on a single INameIndex object.
One Write Session at a Time
While a write session (which may be shared by multiple threads) is open, all other writing sessions (including optimization) are blocked. If there is an operation that is expected to take a long time (e.g., batch document adds or calls to optimize), care should be taken to ensure it is the only active writing session. If a write attempt needs to wait too long, a timeout exception is thrown, and the transaction is aborted.
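The blocking and timeout behavior described above resembles a single write permit with a bounded wait. The following sketch illustrates that idea with a standard java.util.concurrent.Semaphore. It is an assumption-laden analogy, not RNI's locking code: SingleWriterSketch, tryBeginWrite, and endWrite are hypothetical names, and the sketch reports a timeout by returning false where RNI throws a timeout exception.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Toy single-writer gate with a bounded wait; hypothetical, not RNI code. */
final class SingleWriterSketch {
    // One permit: at most one write session may be open at a time.
    // A semaphore is not owned by a thread, so the holder may be shared by threads.
    private final Semaphore writePermit = new Semaphore(1);

    /** Waits up to timeoutMillis for the permit; returns false on timeout. */
    boolean tryBeginWrite(long timeoutMillis) {
        try {
            return writePermit.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt status and treat it as a failed attempt.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Releases the permit so that a waiting writer can proceed. */
    void endWrite() {
        writePermit.release();
    }
}
```

A long-running writer in this model starves every other writer until it calls endWrite(), which is why long operations such as bulk adds or optimization are best scheduled when no other write session is active.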