RMS supports two different types of search:
-
Ad hoc Search: Returns all matching records from the index for a single query. The query can include one or more fields.
-
Batch Search: Performs multiple queries based on an uploaded file, using each record in the file as the search terms against the index.
Prerequisite: Before searching, you must have created an index by importing data to search against.
The search panel is on the left side of the window. By default, the fields in your index are displayed. Use the Search Configuration to remove fields from the search panel.
Ad hoc search returns a list of records from your index that are potential matches, as determined by the calculated match score. Records with a score equal to or greater than the display threshold are shown.
-
Enter one or more values into the search fields.
-
You can put in a partial name, or an initial.
-
You can enter partial dates in date fields. For example, 1955-12-30, 1955--03, 12/30, -12, and 1955 are all supported date formats.
-
Select Run Search.
Search returns all records with a match score equal to or greater than the display threshold. They are listed by match score, with the highest scores at the top.
-
Match scores equal to or greater than the match threshold are highlighted in green.
-
Use the down arrow to expand a record and see more detail.
-
Under Action, select the option to go to the Compare window. The name fields are preloaded with the search and selected values. Compare shows how the scores were calculated. You can modify match parameters and see the resulting change in match scores.
Each Rosette Match Studio query is processed in two passes to provide the best combination of speed and accuracy.
-
The first pass is designed to quickly generate a set of candidates for the second pass to consider.
-
The second pass compares every value returned by the first pass against the value in the query and computes a similarity score. Multiple scorers are applied in the second pass, to generate the best possible score.
The first pass gives the system the speed necessary for high-transaction environments by eliminating most values in the index from consideration. The slower second pass re-compares each selected value directly in its original script, using enhanced scoring algorithms.
The scores from the first pass are discarded, and the match candidates are re-ranked according to the similarity scores returned by the second pass. The scores for all search terms are combined to generate an overall match score for each record. All entries with a match score equal to or greater than the display threshold are displayed in a list. Entries with a match score equal to or greater than the match threshold are highlighted with their match score.
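The two-pass flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RMS implementation: the first-pass filter (shared token initials), the second-pass scorer (`difflib.SequenceMatcher`), and the threshold values are all stand-ins chosen for the example.

```python
from difflib import SequenceMatcher


def first_pass(query, index):
    """Cheap candidate filter (stand-in): keep records that share at
    least one token initial with the query."""
    initials = {tok[0] for tok in query.lower().split()}
    return [
        rec for rec in index
        if initials & {tok[0] for tok in rec.lower().split()}
    ]


def second_pass(query, candidates):
    """Slower pairwise rescoring (stand-in scorer) of every candidate."""
    return [
        (rec, SequenceMatcher(None, query.lower(), rec.lower()).ratio())
        for rec in candidates
    ]


def search(query, index, display_threshold=0.5, match_threshold=0.8):
    """Filter, rescore, re-rank, then apply both thresholds."""
    scored = second_pass(query, first_pass(query, index))
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return [
        {"record": rec, "score": score, "highlighted": score >= match_threshold}
        for rec, score in scored
        if score >= display_threshold
    ]
```

Calling `search("John Smith", ["John Smith", "Jon Smyth", "Alice Brown"])` returns only the records that clear the display threshold, ordered by score, with a `highlighted` flag for those that also clear the match threshold.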
Batch search performs multiple searches in a single task, using each record in a file as the query value against the index.
Notice
The index must be loaded before starting the batch search. Batch search works best when the batch dataset has mappings similar to those of the index.
-
Select Upload Query Data to choose the file with your query data, a dataset of records similar to the index records.
-
Assign field types to columns.
-
Select Next.
-
The batch search will run automatically. The most recent results are displayed.
-
To run a different dataset, select Upload New Batch to start a new batch search.
-
To see the results from another set, click on the box in the Query Data list on the left.
Tip
By following a few simple guidelines, you can get the best performance for Batch Search.
-
Elasticsearch (ES) restricts parallel searches to a maximum of 1000, so limit each data set to 1000 records. Smaller data sets return results faster. You can process a larger data set, but it will take much longer.
-
To easily review and analyze the results, export the results.
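One way to follow the 1000-record guideline is to split a large query file into smaller files before uploading. The sketch below is a minimal illustration, not an RMS feature: the helper names `split_batches` and `write_batch_csv` are hypothetical, and it assumes the query data is a .csv file whose header row should be repeated in each batch.

```python
import csv
import io

MAX_BATCH = 1000  # record limit taken from the guideline above


def split_batches(rows, size=MAX_BATCH):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def write_batch_csv(header, rows):
    """Serialize one batch as CSV text, repeating the header row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()
```

Each string returned by `write_batch_csv` can be saved as its own file and uploaded as a separate batch.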
For each value in the batch file, Batch Search displays the search term from the file. To see the matching records from the index, click on the plus sign next to a name.
Select Export to download a .csv file with the match results for each record in the evaluation set.
To perform a search, enter your search terms in the Search pane. By default, all the fields in your index are displayed in the Search pane. You can configure the Search pane to display only some fields, if desired.
To remove a field from the Search pane, slide the Display toggle switch to the off position. Select Update to see the results.
When searching for a match, some fields are more important in determining a match than others. For example, the person name is likely more important in determining a match than the location name. Adjust the weight slider for each field based on its relative importance. The weights are the field's relative portion of the final match score. By default, the weights are distributed equally among all fields.
The match score is then calculated as a weighted arithmetic mean of the match scores calculated for each field. If a field is missing from a record, that field is ignored and its weight is evenly distributed across the other fields.
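As a rough illustration of this calculation, the following Python sketch computes a weighted mean and redistributes the weight of missing fields in equal shares. It is one reading of the description above rather than the product's code; the function name, the field names, and the use of `None` to mark a missing field are assumptions.

```python
def combined_score(field_scores, weights):
    """Weighted arithmetic mean of per-field match scores.

    field_scores: field name -> score, or None when the field is missing.
    weights: field name -> relative weight of that field.
    """
    present = [f for f in weights if field_scores.get(f) is not None]
    if not present:
        return 0.0  # nothing to compare (assumed behavior)
    # Weight that belonged to missing fields, split evenly among the rest.
    missing_weight = sum(w for f, w in weights.items() if f not in present)
    share = missing_weight / len(present)
    adjusted = {f: weights[f] + share for f in present}
    total = sum(adjusted.values())
    return sum(adjusted[f] * field_scores[f] for f in present) / total
```

With weights of 0.5 for a name, 0.3 for a date, and 0.2 for a location, a record with no location has the 0.2 split into two shares of 0.1, so scores of 0.9 (name) and 0.8 (date) combine as 0.6 × 0.9 + 0.4 × 0.8 = 0.86.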
To rename a field or create a new field, select Create/Edit Fields under the Field Configuration.
Individual name tokens are scored by a number of algorithms. These algorithms can be optimized by modifying configuration parameters, thus changing the final match score.
A match configuration contains a set of parameters. Within a named match configuration, each group of parameter values applies to a specific language pair and entity type, and a single named match configuration can contain groups for multiple language pairs and entity types.
Select a match configuration to use for ad hoc search and batch search. You can use the default configuration (RMS-<version> Default) or create a new match configuration. Configurations are created in New Configuration.
Once the match score is calculated for all values in the index, those with scores greater than or equal to the match threshold are highlighted in the search results.
Once the match score is calculated for all values in the index, only those with scores greater than or equal to the display threshold are returned in the search results. If you aren't seeing results you expect, try lowering the display threshold value to return more results.
All fields have a type, which is one of the predefined set of Field Types. You can add new fields to the set or rename existing fields.
Adding a New Field
-
Select Create Field.
-
Enter a name for the field in the left-hand box.
-
Select a type from the drop-down list in the right-hand box.
-
Select Update.
Once you've added a new field, you can use it when creating a Mapping.
Tip
If you want to use custom fields in your mapping, add them before importing your data files.
Renaming a Field
-
Select the field name from the left side list.
-
Enter the new name.
-
Select Update.
Before using Rosette Match Studio for searching, you must create the index by uploading a recordset containing your searchable data. If you upload multiple data files, they are concatenated into a single index of data to be searched.
Rosette Match Studio imports structured data. Supported file formats are:
For .csv files, the first row must be a header row containing the names of the fields in the source file. For other file types, the key names are the field names. The field names must be unique. The RMS upload does not currently support the nested fields available in Elasticsearch.
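Before uploading a .csv file, it can help to confirm that the header row meets these requirements. The check below is a small sketch and not part of RMS: `check_header` is a hypothetical helper, and trimming whitespace around names is an assumption about how names should be compared.

```python
import csv
import io
from collections import Counter


def check_header(csv_text):
    """Return the header names; raise ValueError if the header row is
    absent or a field name appears more than once."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if not header:
        raise ValueError("file has no header row")
    # Assumption: surrounding whitespace is not significant in a name.
    names = [name.strip() for name in header]
    dupes = sorted(n for n, count in Counter(names).items() if count > 1)
    if dupes:
        raise ValueError(f"duplicate field names: {dupes}")
    return names
```

For example, `check_header("name,dob,city\nJohn Smith,1955-12-30,Boston\n")` returns `["name", "dob", "city"]`, while a header of `name,dob,name` raises a `ValueError`.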
Tip
Reset System
At the bottom of the window is the Reset System button. This will clear all imported data from your application.
To upload data:
-
Select or drag a file to import.
-
The file name will appear in the Files to Import list.
-
The Mapping window is displayed.
Mapping is the process of assigning fields to the columns in your dataset. Each column must have a field type assigned to it.
-
For each of the fields in the input file:
-
The Column Header name is taken from the source file. It is either the first row of the file (column headers) or the key values of the file, depending on the data file format.
-
Select the Field type from the drop-down for each column you want to import and search on. To not import a column, leave the default, Do Not Import.
Tip
If you want to use custom fields in your mapping, add them before importing your source files.
-
Save the mapping.
-
Select Import File to import the source file into RMS.
Note
If a file with the same columns has already been imported and mapped, the previous mapping is displayed. Select Edit to modify the mapping.
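Conceptually, a mapping pairs each column header with a field type, and any column left at Do Not Import is dropped. The sketch below models that idea with plain Python dictionaries. It is purely illustrative: the data structures, the column names, and the `apply_mapping` helper are hypothetical and do not reflect how RMS stores mappings.

```python
DO_NOT_IMPORT = "Do Not Import"  # default choice for every column


def apply_mapping(records, mapping):
    """Keep only the columns that were assigned a field type.

    records: list of dicts keyed by column header.
    mapping: column header -> field type; unlisted columns are not imported.
    """
    imported = []
    for record in records:
        imported.append({
            column: {"type": mapping[column], "value": value}
            for column, value in record.items()
            if mapping.get(column, DO_NOT_IMPORT) != DO_NOT_IMPORT
        })
    return imported
```

With a mapping of `{"full_name": "Person Name", "dob": "Date", "notes": "Do Not Import"}`, only the `full_name` and `dob` columns of each record are kept.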
The following field types are predefined in RMS and can be selected in the mapping definition.
Table 1. Field Types
| Field Type | Data Type | Entity Type | Description | Match Score Algorithm |
|---|---|---|---|---|
| Person Name | rni_name | PERSON | The name, nickname, or alias of an individual. | Name matching algorithms |
| Company Name | rni_name | ORGANIZATION | The name of a corporation, institution, government agency, or other group of people defined by an established organizational structure. | Name matching algorithms |
| Location | rni_name | LOCATION | The name of a geographic location such as a city, state, country, region, mountain, park, lake, or address. | Name matching algorithms |
| Date | rni_date | | A date contains a year, month, and day. All common delimiters for English dates are supported. Dates can be expressed in various orderings, and months can be written as a numeral, their full English name, or the common three-letter abbreviation. | Mathematical difference between two dates |
| Address | rni_address | | A postal address of a location. | Depends on the fields being compared |
| Age | integer | | Age in years. | Mathematical difference between the numbers |
| Gender | keyword | | The search string is either an exact match to a value in the index, or doesn't match at all. | 0: no match; 1: exact match |
| Id | text | | Unstructured text content. Can be used for phone number, social security number, passport, etc. | Edit distance |
| Description | text | | Unstructured text field. | Not used in matching |
The following additional data types can be used to define new fields.
Table 2. Additional Data Types
| Data Type | Description |
|---|---|
| long | A signed 64-bit integer. |
| short | A signed 16-bit integer. |
| double | A double-precision 64-bit IEEE 754 floating point number, restricted to finite values. |
| float | A single-precision 32-bit IEEE 754 floating point number, restricted to finite values. |
| boolean | true or false |
| geo_point | A latitude-longitude pair. |