Data used to measure accuracy should include a wide variety of phenomena that make name matching challenging, including misspellings, aliases or nicknames, initials, and non-Latin scripts. An ideal starting point for crafting this data set is to apply your organization's domain knowledge to curate name data containing the specific phenomena found in your real-world cases.
Your data for testing accuracy must be labeled (annotated); you can't calculate accuracy without labeled data. Labeled data is often called gold data, a term borrowed from supervised learning, where it refers to the trusted classifications in a training set. For name matching, gold data is a list of name pairs, where each pair is labeled as a match or not a match. Because assigning classification labels can be subjective, have multiple annotators label the same data set as positive or negative name matches, and establish a set of annotation guidelines so that classifications are applied consistently.
Once you've collected and annotated your gold data, create an evaluation file of name pairs to be imported and used in evaluation.
The evaluation data file is a .csv file of annotated name pairs (gold data). It should include both positive (the names are considered a match) and negative (the names are not considered a match) name pairs. All name pairs must be the same entity type.
Table 4. Evaluate Import File

Column Name | Description | Required? | Example
Name1 | First name in the name comparison | Yes | John R. Smith
Name1_Lang | 3-letter ISO 639-3 language code | No | eng
Name2 | Second name in the name comparison | Yes | Smith John
Name2_Lang | 3-letter ISO 639-3 language code | No | eng
Entity_Type | The type of name: Person, Organization, Location, Date, or Address | Yes | PERSON
Match | Does Name1 match Name2? | Yes | Y
The first row of the file is a header row, containing the column names of the fields in the file. Each column is separated by a comma; if a value is not provided, that field is left blank but the column must still be included.
Sample File - PERSON
NAME1,NAME1_LANG,NAME2,NAME2_LANG,ENTITY_TYPE,MATCH
Peter Harding,eng,Pete Harding,eng,PERSON,Y
Peter Harding,eng,Harding Peter,eng,PERSON,Y
Peter Harding,,Pete Michael Harding,eng,PERSON,Y
Peter Harding,eng,P. M. Harding,eng,PERSON,N
Peter Harding,eng,Pat Harding,,PERSON,N
Peter Harding,eng,P. B. Harding,eng,PERSON,N
Peter Harding,eng,Pietro Hardin,eng,PERSON,N
Sample File - ADDRESS
NAME1,NAME1_LANG,NAME2,NAME2_LANG,ENTITY_TYPE,MATCH
123 Fake Street Springfield MO,,123 Fake St Springfield IL,,ADDRESS,N
820 Forest Road,,820 Forrest Rd,,ADDRESS,Y
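If you assemble evaluation files programmatically, a small script can help keep the column layout consistent. The Python sketch below is an illustration only (not part of RMS); it writes gold-data rows in the format described above, using a placeholder file name and pairs taken from the sample file.

import csv

# Columns required by the evaluation import format (see Table 4).
FIELDNAMES = ["NAME1", "NAME1_LANG", "NAME2", "NAME2_LANG", "ENTITY_TYPE", "MATCH"]

# Gold-data rows; optional fields such as the language codes may be left blank.
rows = [
    {"NAME1": "Peter Harding", "NAME1_LANG": "eng", "NAME2": "Pete Harding",
     "NAME2_LANG": "eng", "ENTITY_TYPE": "PERSON", "MATCH": "Y"},
    {"NAME1": "Peter Harding", "NAME1_LANG": "eng", "NAME2": "Pat Harding",
     "NAME2_LANG": "", "ENTITY_TYPE": "PERSON", "MATCH": "N"},
]

# "evaluation_pairs.csv" is a placeholder name; use any file name you like.
with open("evaluation_pairs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    writer.writeheader()    # header row containing the column names
    writer.writerows(rows)  # one name pair per row; blank fields stay empty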
To upload an evaluation data file:
- Select the Evaluate tab from the navigation bar.
- Drag or browse for the desired evaluation data file. When it has finished uploading, it will appear in the file list.
Precision, recall, and F1 score are metrics used to evaluate NLP tools. Accuracy is measured as a combination of the three values.
- Precision answers the question "of the answers you found, what percentage were correct?" Precision is sensitive to false positives; a higher value means greater precision.
- Recall answers the question "of all possible correct answers, what percentage did you find?" Recall is sensitive to false negatives; a higher value means better recall.
- F1 measure is the harmonic mean of precision and recall. It is sensitive to both false positives and false negatives; a higher value means better accuracy. It isn't quite an average of the two scores, as it penalizes the case where the precision and recall scores are far apart. For example, if the system finds 10 answers that are correct (high precision) but misses 1,000 correct answers (low recall), you wouldn't want the F1 measure to be misleadingly high.
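To see why, carry that last example through the F1 formula, F1 = 2 × (precision × recall) / (precision + recall). If all 10 answers found are correct, precision is 1.0, but recall is only 10/1,010 ≈ 0.01, so F1 ≈ 2 × (1.0 × 0.01) / (1.0 + 0.01) ≈ 0.02, far below the simple average of roughly 0.5.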
Export Match Configuration
Exported match configurations can be re-imported into RMS or used as a reference for making a configuration in RNI.
Important
The exported .yaml file cannot be imported directly into RNI.
- Select Configure from the navigation bar.
- Select Export in the Options column for the desired match configuration.
Calculating Precision, Recall, and F1
Let's look at how precision, recall, and F1-score are calculated. We have a set of name comparisons. From this we can calculate:
- TPs: True positives. The number of name pairs the system matched that are labeled a match in the gold data.
- FPs: False positives. The number of name pairs the system matched that are not labeled a match.
- FNs: False negatives. The number of name pairs the system did not match that are labeled a match.
Pairs: The number of name pairs the system identified as matches = TP + FP. This is the denominator for precision.
Matches: The number of name pairs marked as matches in the gold data file = TP + FN. This is the denominator for recall.
Precision is an indication of how many of the identified matches are correct. If 6 pairs are identified as matches but only 2 of them are correct, P = 2/6 = .33; the identified matches are correct 1/3 of the time. If there are no false positives, the precision is 1.
Recall is an indication of how many of the actual matches were found. If the gold data contains 4 actual matches but only 2 of them are identified, R = 2/4 = .5; RMS finds the correct match 1/2 of the time. If there are no false negatives, the recall is 1.
F1-score is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall).
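As an illustration only (not part of RMS), the short Python sketch below computes the three metrics from the TP, FP, and FN counts defined above; the example counts are taken from the precision and recall illustrations in this section.

def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # correct matches / identified matches
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # correct matches / actual matches
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0  # harmonic mean
    return precision, recall, f1

# Example counts: 6 pairs identified as matches, 2 of them correct (TP=2, FP=4),
# and 4 actual matches in the gold data, so 2 were missed (FN=2).
p, r, f1 = precision_recall_f1(tp=2, fp=4, fn=2)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # precision=0.33 recall=0.50 f1=0.40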
A major benefit of RMS is that you can define a threshold for name matching, optimizing for what is most relevant to your use case. For name matching, consider the case where a query name is expected to match one and only one name in the index. If there are three names returned above the threshold, including the correct match, then one name is a true positive (TP), two names are false positives (FP), and there are no false negatives (FN). If the correct match is not returned above the threshold, the number of false negatives will be one.
In this example, where the correct match is returned along with two other matches:
- The precision is 1/3, since there is one correct match returned along with two other matches.
- The recall is 1.0, since there are no false negatives.
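Plugging these counts into the sketch above (TP = 1, FP = 2, FN = 0) gives the same result: precision 1/3, recall 1.0, and an F1-score of 2 × (1/3 × 1.0) / (1/3 + 1.0) = 0.5.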