Note
This version of Rosette Identity Resolution is a pre-release beta intended for evaluation purposes only; it is not intended for general release.
This document describes the scope and functionality of the 0.4.6 beta version of Rosette Identity Resolution.
Rosette Identity Resolution ingests unstructured documents, extracts entities, and links them to previously identified entities. You can use a preloaded Wikidata knowledge base (KB), load your own KB, or resolve entities without a KB.
The current version of Rosette Identity is focused more on precision than recall.
Future releases will let you configure the tradeoff between precision and recall. For now, the system is tuned toward the safer option: it avoids incorrectly linked entities, at the cost of sometimes creating a new identity for a mention that should have been linked to an existing one.
Table 1. Wikidata Accuracy
Precision | 0.79
Recall | 0.77
F1 | 0.78
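The F1 value in Table 1 is the harmonic mean of the precision and recall values. As a quick check, the following minimal sketch recomputes it; the function name `f1_score` is ours for illustration and is not part of Rosette Identity Resolution.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Avoid dividing by zero when both inputs are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in Table 1.
wikidata_f1 = f1_score(0.79, 0.77)
print(round(wikidata_f1, 2))  # prints 0.78
```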
The processing speed you experience with Rosette Identity depends on your hardware resources as well as the KB you're testing. Speed factors include:
The host machine's hard drive speed. We strongly recommend using a machine with an SSD.
The host machine's CPU speed
The number of CPU cores on the host machine
Network speed when calling Rosette Cloud
Current KB size
Rosette Identity supports the extraction and linking of the following entity types:
Person
Location
Organization
Pre-Loaded Wikidata Knowledge Base
This version of Rosette Identity Resolution allows you to choose a Wikidata Knowledge Base (KB) to load. The options are:
Full Wikidata KB: A 66GB full Wikidata Elasticsearch (ES) Docker image.
Pruned Wikidata KB: A 6GB pruned Wikidata ES Docker image. This image contains 1/10 of the entities in the full Wikidata KB and requires less disk space and RAM. Accuracy is lower because some entities might not appear in the KB.
No Wikidata KB: Identity is started without a KB. This can be useful if you are interested in how well the system clusters together entity mentions.
External Wikidata KB: Identity is started without a KB, and then connects to external Elasticsearch and Dgraph services containing the KB. This option allows you to save entities and have them persist across runs. The external Elasticsearch and Dgraph services must be running before you start the Identity application.
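Because the external services must be up before Identity starts, it can help to check them first. The following is a minimal sketch, not part of the product: it only tests whether a TCP connection can be opened, which shows that something is listening but not that the service is healthy. The host names and ports are assumptions (9200 and 8080 are the usual defaults for Elasticsearch HTTP and Dgraph Alpha HTTP); substitute the values from your own deployment.

```python
import socket
from typing import Dict, List, Tuple


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def unreachable_services(services: Dict[str, Tuple[str, int]]) -> List[str]:
    """Return the names of the services that did not accept a connection."""
    return [
        name
        for name, (host, port) in services.items()
        if not is_reachable(host, port)
    ]


# Hypothetical endpoints; replace with the addresses of your own services.
EXTERNAL_SERVICES = {
    "elasticsearch": ("localhost", 9200),
    "dgraph": ("localhost", 8080),
}

if __name__ == "__main__":
    missing = unreachable_services(EXTERNAL_SERVICES)
    if missing:
        print("Not reachable yet: " + ", ".join(missing))
    else:
        print("All external services accepted a connection.")
```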
Note
Rosette Identity Resolution will not save data across restarts except when using the external Wikidata KB. Each time you restart, the Elasticsearch index containing the Wikidata KB is reloaded.
The pruned Wikidata is provided for convenience; users should be able to install and run this evaluation release of Rosette Identity on a laptop without any special resources. For more on system requirements, see the Prerequisites section.
In the UI, extracted entities that have been linked to the pre-loaded Wikidata KB are indicated by a KB ID that starts with the letter Q. The system also provides URL links to the corresponding Wikidata web pages.
Using a Custom Knowledge Base
In addition to the pre-loaded pruned Wikidata KB, you can upload a custom KB, such as the UN Sanctions list or a list of customers.
Once a custom KB is loaded, the linking model attempts to link the entities in each text file it processes against the entries in the custom KB.
In the UI, extracted entities that have been linked to a custom KB are indicated by a KB ID that corresponds to the KB ID in the custom CSV file. No URLs will be returned.
There is a third KB in the system, which stores new identities: entities that the system has encountered in text but could not link to the Wikidata KB or a custom KB. Internally, we refer to these entities as "ghosts".
If the system fails to link an entity mention in a text to either the preloaded Wikidata KB or a custom KB, it will attempt to resolve that entity with a pre-existing "ghost". If no pre-existing "ghost" is a suitable match, the system will add a new ghost.
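The fallback order described above can be sketched as follows. This is an illustration only, not the product's implementation: the real linking step is a model, whereas the stand-ins here are simple dictionary lookups and a case-insensitive string match. The function names, the `ghost-N` keys, the `CUST-001` ID, and the order in which the two KBs are tried are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A linker takes a mention and returns a KB ID, or None when it finds no link.
Linker = Callable[[str], Optional[str]]
GhostStore = Dict[str, List[str]]


def resolve(
    mention: str,
    kb_linkers: List[Linker],
    ghosts: GhostStore,
    match_ghost: Callable[[str, GhostStore], Optional[str]],
) -> Tuple[str, str]:
    """Return (outcome, identity key) for one entity mention."""
    # 1. Try to link against each KB (Wikidata KB, custom KB).
    for link in kb_linkers:
        kb_id = link(mention)
        if kb_id is not None:
            return ("kb", kb_id)
    # 2. No KB link: try to resolve against a pre-existing ghost.
    ghost_key = match_ghost(mention, ghosts)
    if ghost_key is not None:
        ghosts[ghost_key].append(mention)
        return ("ghost", ghost_key)
    # 3. No suitable ghost: add a new one.
    new_key = f"ghost-{len(ghosts) + 1}"
    ghosts[new_key] = [mention]
    return ("new_ghost", new_key)


def match_ghost_casefold(mention: str, ghosts: GhostStore) -> Optional[str]:
    """Toy matcher: a ghost matches if any stored mention is equal ignoring case."""
    for key, mentions in ghosts.items():
        if any(m.casefold() == mention.casefold() for m in mentions):
            return key
    return None


# Toy stand-ins for the two KBs; CUST-001 is a made-up custom KB ID.
wikidata = {"Douglas Adams": "Q42"}
custom = {"Jane Doe": "CUST-001"}
linkers = [wikidata.get, custom.get]

ghosts: GhostStore = {}
print(resolve("Douglas Adams", linkers, ghosts, match_ghost_casefold))  # ('kb', 'Q42')
print(resolve("John Smith", linkers, ghosts, match_ghost_casefold))     # ('new_ghost', 'ghost-1')
print(resolve("john smith", linkers, ghosts, match_ghost_casefold))     # ('ghost', 'ghost-1')
```

The third call shows the point of the ghost store: a later mention of an entity that is in no KB resolves to the ghost created earlier instead of producing a duplicate identity.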
In the UI, ghosts are indicated by an empty KB ID.
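Taken together, the KB ID conventions above let you tell the three sources apart when reading results. The sketch below is a hypothetical helper, not part of the product. It assumes that Wikidata IDs are the letter Q followed by digits, and it builds the public Wikidata page URL itself rather than relying on whatever link the system returns. A custom KB whose own IDs follow the same Q-plus-digits pattern would be misclassified by this check.

```python
import re
from typing import Optional, Tuple

# Assumption: a Wikidata KB ID is "Q" followed by one or more digits.
WIKIDATA_ID = re.compile(r"^Q\d+$")


def describe_kb_id(kb_id: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (source, url) for a KB ID shown in the UI."""
    if not kb_id:
        # An empty KB ID marks a ghost.
        return ("ghost", None)
    if WIKIDATA_ID.match(kb_id):
        return ("wikidata", f"https://www.wikidata.org/wiki/{kb_id}")
    # Anything else is taken to be an ID from the custom CSV file; no URL.
    return ("custom", None)


print(describe_kb_id("Q42"))       # ('wikidata', 'https://www.wikidata.org/wiki/Q42')
print(describe_kb_id("CUST-001"))  # ('custom', None)
print(describe_kb_id(""))          # ('ghost', None)
```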