Entity linking provides a mechanism for disambiguating the identity of similarly named entities mentioned in a document. For example, “Rebecca Cole” is the name of both the second African-American woman to become a doctor in the United States and an Australian professional basketball player. Linking establishes the identity of an entity by disambiguating common names and by matching a variety of names, such as nicknames and formal titles, with an entity ID.
To link entities to a knowledge base, REX uses a statistical disambiguation model trained on a knowledge base. The linker processor is delivered with a model based on a default Wikidata knowledge base. If the entity exists in Wikidata, then REX returns the Wikidata QID, such as Q1 for the Universe, in the entityId
field. Once enabled, the linker can also return DBpedia types and Refinitiv PermIDs for linked entities.
If the linker is disabled (the default), a random string is returned as the entityId. The string starts with a "T" (temporary id) followed by a random number, which is unique per document.
In addition to the default Wikidata knowledge base, you can train a disambiguation model for a custom knowledge base. The custom knowledge base model can replace or run in parallel with the default knowledge base.
Once the custom model has been trained, you can add new entries without retraining the model, as long as the new entries are similar to the ones used for training.
Linker Processor Files
The linker processor is packaged as part of the standard REX distribution. The linker files are in the subdirectory data/flinx.
By default, the linker processor both extracts and links entity candidates. These functions are separate from the default REX entity extraction performed by the statistical, pattern-matching, and exact-matching processors.
You can choose to link the candidates from the statistical, pattern-matching, and exact-matching processors instead of using the linker processor to extract candidates. To do so, set the parameter linkMentionMode to entities. By default, linkMentionMode is set to text, in which case the linker processor extracts the candidate entities from the text.
Important
If you use the linker processor to extract entities, the entities from the linker processor may differ from those returned by the statistical, pattern-matching, and exact-matching processors. The redactor will resolve any overlapping or conflicting entity results.
Table 12. Linker Configuration Parameters
| Parameter | Description | Default |
|---|---|---|
| kbs | Custom list of Knowledge Bases for the linker, in order of priority | null |
| linkEntities | The option to link mentions to knowledge base entities with disambiguation model. Enabling this option also enables calculateConfidence. | false |
| calculateConfidence | If true, entity confidence values are calculated. Can be overridden by specifying calculateConfidence in the API call. | false |
| useDefaultConfidence | The option to assign default confidence value 1.0 to non-statistical entities instead of null. | false |
| linkingConfidenceThreshold | The confidence value threshold below which linking results by the kbLinker processor are ignored. | -1.0 |
| linkMentionMode | If set to entities, the linker processor uses the statistical, pattern-matching, and exact-matching processors. When set to text, the linker extracts its own candidates. | text |
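As a rough illustration of how linkingConfidenceThreshold behaves, the Python sketch below filters a list of hypothetical linking results. The LinkResult class, the apply_threshold function, the placeholder entity IDs, and the assumption that confidence values fall between 0 and 1 are all illustrative and are not part of REX; only the parameter name and its default of -1.0 come from the table.

```python
from dataclasses import dataclass


@dataclass
class LinkResult:
    """Hypothetical stand-in for one linked mention (not a REX class)."""
    mention: str
    entity_id: str
    confidence: float


def apply_threshold(results, threshold=-1.0):
    """Keep only results whose confidence is not below the threshold."""
    return [r for r in results if r.confidence >= threshold]


# Placeholder IDs and confidence values, invented for this example.
results = [
    LinkResult("Rebecca Cole", "ID-EXAMPLE-1", 0.82),
    LinkResult("Cole", "ID-EXAMPLE-2", 0.31),
]

# With the default of -1.0 nothing is filtered, assuming confidences in [0, 1].
kept_default = apply_threshold(results)

# A stricter threshold drops the low-confidence link.
kept_strict = apply_threshold(results, threshold=0.5)
```

Under these assumptions, the default threshold acts as "no filtering", and raising it trades recall for precision.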
Entity linking is enabled by setting the EntityExtractor linkEntities(boolean lnk) to true, and disabled by setting it to false.
By default, the linker processor finds candidate mentions in text and then attempts to link them with a knowledge base entry using the linking disambiguation model. To use pre-existing mentions extracted by the statistical, pattern-matching, and exact-matching processors, set linkMentionMode to entities. Entity linking must be enabled.
Selecting a Knowledge Base for Linking
By default, all knowledge bases under the data/flinx/data/kb directory inside the REX installation will automatically be used for linking. Any custom knowledge bases placed in this directory will be loaded each time REX launches.
You can enable dynamic loading, controlling which custom knowledge bases will be loaded, with EntityExtractor setKbDirs, which takes a List of Paths to knowledge bases.
The list is in priority order; the match from the highest knowledge base on the list will be returned.
Important
Setting the list of knowledge bases completely overwrites the list of knowledge bases the linker uses. If you want the default Wikidata knowledge base to be included, it must be on the list of knowledge bases.
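The priority rule can be pictured with a small Python sketch. It is not the REX implementation: the link_with_priority function, the dictionary-based knowledge bases, and the placeholder ID "ID-EXAMPLE" are assumptions made for illustration. It shows that the first knowledge base on the list that has a match wins, and that leaving the default knowledge base off the list removes its entries from consideration.

```python
def link_with_priority(mention, kbs):
    """Return (kb_name, entity_id) from the first KB on the list that matches."""
    for name, aliases in kbs:
        entity_id = aliases.get(mention)
        if entity_id is not None:
            return name, entity_id
    return None  # no knowledge base on the list knows this mention


# Toy knowledge bases: alias -> entity ID (contents invented for the example).
custom_kb = ("custom-kb", {"BasisTech": "B1"})
wikidata_kb = ("wikidata", {"BasisTech": "ID-EXAMPLE", "Universe": "Q1"})

# The custom KB is listed first, so it takes priority over the default one.
both = [custom_kb, wikidata_kb]
first_match = link_with_priority("BasisTech", both)
fallback_match = link_with_priority("Universe", both)

# Listing only the custom KB overwrites the set entirely: no default fallback.
custom_only = [custom_kb]
no_match = link_with_priority("Universe", custom_only)
```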
DBpedia Types for Linked Entities
The linker processor can associate entities with types drawn from the DBpedia ontology, which provides over 700 types at up to seven levels of granularity.
Providing both primary and secondary entity types increases the usability of the linker processor’s results for many NLP applications. For example, classifying Phoebe Buffay (QID: Q682396) as PERSON is a necessary first step towards effective pronominal resolution, whereas the secondary type path Agent/FictionalCharacter/SoapCharacter paves the way for identifying the relationship of Phoebe Buffay with Lisa Kudrow (Q179041).
Turning on the includeDBpediaType flag increases the recall of the linker processor’s results. When the flag is enabled, the linker will return both non-named entities, like "guitar" (Q6607), type MISC, and named entities of new types, such as "2018 World Cup" (Q170645), type EVENT.
By default, providing DBpedia types is turned off. To enable it, set EntityExtractor includeDBpediaTypes(boolean includeDBpediaTypes) to true.
The linker processor can return the Refinitiv PermID for a subset of entities which are identified with a QID. By default, linking to PermIDs is turned off.
Entity linking to PermIDs is enabled by setting the EntityExtractor includePermID(boolean includePermID) to true, and disabled by setting it to false. To activate PermID linking, both EntityExtractor includePermID(boolean includePermID) and EntityExtractor linkEntities(boolean lnk) must be set to true. When PermID linking isn't explicitly set, its default value is false.
Creating a Custom Knowledge Base for Linking
Note
Linking to a custom knowledge base is licensed separately. Contact Rosette support for a license for this functionality. Custom knowledge base linking is currently only available in the Java Edition of REX.
The linker supports linking from multiple knowledge bases. In addition to the default Wikidata knowledge base, you can train a disambiguation model for a custom knowledge base. Your custom knowledge base model can replace or run in parallel with the default Wikidata knowledge base.
To create a new disambiguation model:
-
Define the custom knowledge base.
-
Train the disambiguation model with the REX FTK.
To use the new model for linking:
-
Copy the new knowledge base folder <rex-field-training-home>/asset/custom-kb
to be a subdirectory of <rex-home>/data/flinx/data/kb/
. Restart REX to use the new model.
-
Optionally, customize the set and priority of linking knowledge bases. REX does not have to be restarted to use the new model; it will be loaded dynamically.
Once the model has been trained, you can add new entries without retraining the model, as long as the new entries are similar to the ones used for training.
You can add Custom Knowledge Base Connectors that retrieve linking information directly from an external data source.
Overview of Linking and Custom Seeding
How does entity linking work?
The linker processor first identifies possible entity mentions (“candidates”) through exact matching. It walks through the input text token by token, looking for matches in an aliases.bin file, which contains the alias phrases in the json input files. Then each candidate mention is resolved to a known entity with an ID or labeled as “NONE”. NONE here means the candidate mention could not be linked to an entry in the knowledge base.
The disambiguation phase applies a machine-learned SVM ranking model to each candidate mention in isolation. That is, the linking of one candidate doesn’t affect the linking of another. Features for each mention can look at context, but the model is still applied separately to each mention. Context is provided by contextWords and relatedEntityIds.
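The candidate-identification step can be approximated with the following Python sketch. It is only an illustration: the real linker reads aliases from the binary aliases.bin file, whereas here the aliases are a plain set of strings, and the find_candidates function, the whitespace tokenization, the max_len limit, and the choice to try the longest phrase first at each position are all assumptions.

```python
def find_candidates(tokens, aliases, max_len=5):
    """Walk tokens left to right and collect exact alias matches.

    At each position the longest phrase (up to max_len tokens) is tried
    first, so "Carl Hoffman" is preferred over "Carl".
    Returns a list of (start, end, phrase) with end exclusive.
    """
    candidates = []
    i = 0
    while i < len(tokens):
        match = None
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in aliases:
                match = (i, i + n, phrase)
                break
        if match:
            candidates.append(match)
            i = match[1]  # continue after the matched phrase
        else:
            i += 1
    return candidates


# Alias phrases as they might appear in the json input files.
aliases = {"Carl", "Carl Hoffman", "Basis Technology"}
tokens = "Carl Hoffman founded Basis Technology".split()
candidates = find_candidates(tokens, aliases)
```

Each candidate found this way would then be passed, independently of the others, to the disambiguation model, which either assigns an entity ID or labels it NONE.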
Generating binaries
In this phase we read and process the knowledge base’s json files, generating easy-to-load binary files. The binary files are used both for candidate generation and selecting features for the disambiguation model.
Different knowledge bases may provide different details about the entities. For example, most knowledge bases can provide information about the prevalence score of each entity, but very few knowledge bases indicate if a phrase is likely to be an entity or not. Those details are used for disambiguation, so a byproduct of this phase is the config
file, which lists the available features for the training process.
Training the disambiguation model
After producing a list of the available features, the actual training is performed on a pre-annotated corpus provided by BasisTech for training the disambiguation model. The first step of the process extracts all available features for each annotated entity sample. While there are over a dozen features that could be used, not all may be usable if the required data is not available in the user’s knowledge base. The second step trains the SVM model.
At the end of this phase, both the binaries and the new model are found in the /asset/custom-kb
directory, ready to be copied to <rex-home>/data/flinx/data/kb/
.
Custom knowledge bases without a disambiguation model
Passing -d as an argument to train-linker-model creates a custom knowledge base without a disambiguation model. Such knowledge bases act as "enhanced gazetteers": like a gazetteer, they will always extract any aliases contained in them. They will also add an ID provided by the knowledge base. Entities extracted from a knowledge base without a disambiguation model will have 1.0 as their linking confidence value. If several entities in the knowledge base have the same alias, one will be returned if the alias is encountered in the text, but which one is returned is indeterminate. When aliases overlap, the longest one is returned.
Defining a Custom Knowledge Base
A custom knowledge base for the linker processor requires the following files:
-
A file kb.json
containing language-agnostic information, such as entity ID or entity type. Each knowledge base typically has a single file of this type.
-
One or more files <lang>-kb.json
containing language-dependent information, such as an entity’s primary name. <lang>
is the three-letter ISO 639-3 code indicating the language of the file contents.
For example, if the knowledge base contains entities with information for Japanese and Korean languages, the expected input files are: kb.json
, jpn-kb.json
and kor-kb.json
. If a knowledge base contains only a single language, the kb.json file is not required; all data can go into a single file, the language-specific file (e.g. jpn-kb.json
).
The kb.json file lists language-agnostic entity definitions.
Example 10. Language-Agnostic Definitions (kb.json
)
[
{
"entityId": "B1",
"entityType": "ORGANIZATION",
"prevalence": 100,
"relatedEntityIds": [
"B2"
]
},
{
"entityId": "B2",
"entityType": "PERSON",
"prevalence": 95,
"relatedEntityIds": [
"B1"
]
}
]
Each language-specific file provides a list of entity definitions, not necessarily corresponding to the list in kb.json. Information for each entity may duplicate any of the language-agnostic fields in kb.json, and also includes language-dependent information:
Example 11. Language-Specific Definitions (eng-kb.json):
[
  {
    "entityId": "B1",
    "entityName": "Basis Technology",
    "entityType": "ORGANIZATION",
    "aliases": [
      { "phrase": "Basis Technology" },
      { "phrase": "BasisTech" }
    ],
    "contextWords": [ "Software", "NLP", "Rosette", "Text Analytics" ]
  },
  {
    "entityId": "B2",
    "entityName": "Carl W. Hoffman",
    "entityType": "PERSON",
    "aliases": [
      { "phrase": "Carl Hoffman" },
      { "phrase": "Carl" }
    ],
    "contextWords": [ "Software", "NLP", "Rosette", "Text Analytics" ]
  }
]
If the knowledge base contains only a single language, all necessary information may be represented in a single file. In this example, eng-kb.json would be the only file necessary.
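Before handing the files to the FTK, it can be useful to run a quick sanity check over them. The Python sketch below is one possible check, not a tool shipped with REX. The check_kb function and the rules it applies (every entry has an entityId, every relatedEntityIds value refers to a defined entity, every alias has a non-empty phrase) are assumptions about what a well-formed knowledge base looks like; the field names come from the examples above. The sample data deliberately contains errors.

```python
import json


def check_kb(agnostic, specific):
    """Return a list of human-readable problems found in the two entry lists."""
    problems = []
    # Collect every entity ID defined in either file.
    known_ids = {e["entityId"] for e in agnostic + specific if "entityId" in e}
    for entry in agnostic + specific:
        if "entityId" not in entry:
            problems.append("entry without entityId")
            continue
        entity_id = entry["entityId"]
        for related in entry.get("relatedEntityIds", []):
            if related not in known_ids:
                problems.append(f"{entity_id}: unknown related id {related}")
        for alias in entry.get("aliases", []):
            if not alias.get("phrase"):
                problems.append(f"{entity_id}: alias without phrase")
    return problems


# Inline stand-ins for kb.json and eng-kb.json, with intentional mistakes.
kb_json = """[{"entityId": "B1", "entityType": "ORGANIZATION",
               "prevalence": 100, "relatedEntityIds": ["B2", "B9"]}]"""
eng_kb_json = """[{"entityId": "B1", "entityName": "Basis Technology",
                   "aliases": [{"phrase": "BasisTech"}, {"phrase": ""}]}]"""

problems = check_kb(json.loads(kb_json), json.loads(eng_kb_json))
for problem in problems:
    print(problem)
```

In practice the two strings would be replaced by the contents of the real files, for example with json.load on an opened kb.json.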
Generating Linker Files with the FTK
The FTK converts your knowledge base definition files into the linker input format, generates binaries from the data, and trains the disambiguation model on your custom knowledge base.
Note
The REX field training kit (FTK) is not part of the standard REX distribution. Contact <support@rosette.com>
to obtain the FTK package.
Set up the FTK and Training Files
Once you have defined the custom knowledge base in the prescribed format, you are ready to set up the FTK in preparation for training the model. These directions assume you have successfully installed REX and have the Docker version of the REX FTK package.
Load the Docker image
-
Create a working directory, which we will refer to as <rex-field-training-home>
.
-
Load the image:
docker load -i rex-training-docker-<version>.tar.gz
-
Validate that the image is loaded:
docker images
Confirm there is an image named rex-field-training
in the images list.
Prepare the training files directory
-
Create the directory asset/seeding-input/
in <rex-field-training-home>
.
mkdir -p <rex-field-training-home>/asset/seeding-input
The asset directory is used for transferring data in and out of the FTK. When the FTK runs as a Docker container, this is your mounted asset directory.
-
Copy your knowledge base json input files to asset/seeding-input/
. After placing them, the directory should look like this:
ls <rex-field-training-home>/asset/seeding-input
eng.json jpn.json kb.json kor.json
Run the Docker Container
docker run -it --rm -v <local-asset-dir-full-path>:/asset -v <rex-installation-path>:/basis/rex rex-field-training
A startup message similar to the one below should now appear in your console window, confirming that Docker is set up correctly:
REX field training tools (docker) version
Please refer to the legal notices in /basis/ftk/dependencies/ThirdPartyLicenses.txt.
Copyright (c) 2016 Basis Technology Corporation All Rights Reserved.
Support@basistech.com
http://www.basistech.com
brat instance started on port 8080 in this container. brat data: /asset/bratdata
Available commands:
Statistical model:
generate-ngram, cluster-ngram, train-model, evaluate
Advanced tools: learning-curve
Linker model:
generate-linker-binaries, train-linker-model
Binary gazetteer:
build-binary-gazetteer
The following is a brief rundown of the steps for custom seeding. For additional information about what is happening behind the scenes as these steps are carried out, please refer to Overview of Linking and Custom Seeding.
-
From your console, generate the binary files:
generate-linker-binaries
-
Edit the existing configuration file for each language located at <rex-field-training-home>/asset/config/features/
to show only the features that you will be using for training your disambiguation model. The feature set will depend partly on what data is available inside your knowledge base’s json files.
NO_ENTITY_THRESHOLD
DO NOT REMOVE this required item. This item does not represent a feature, but rather is the minimum required score for linking to happen.
IS_ENTITY_PROBABILITY
Use this feature if your json file has the isEntityProbability
field filled out. This feature is the probability that a matching candidate is an entity.
ALIAS_LOWERCASE_PROBABILITY
Use this feature if your json file has the isEntityProbability
field filled out. This feature uses the probability that a matching candidate name in lowercase is still an entity.
ALIAS_TITLECASE_PROBABILITY
Use this feature if you want your model to be case-sensitive in disambiguation. This feature uses the probability that the original candidate is titlecased, based on all aliases (pre-computed per candidate).
MENTION_TOKEN_COUNT
An optional feature that may or may not be helpful in your model. It captures how likely a mention is to be linked to an entity, given the number of tokens in the mention.
IS_MENTION_TITLECASE
A feature helpful in languages which have case sensitivity (e.g., English but not Japanese). Looks to see if the mention at runtime is in title case.
IS_MENTION_ACRONYM
A feature helpful in languages which use acronyms, or if your knowledge base contains acronyms. Looks to see if a mention is an acronym.
CANDIDATE_PREVALENCE
Use this feature if your json file has the prevalence
field filled out. There are cases where you may not want to use this feature, such as if you don’t want your model to be biased against entities which have low prevalence.
contextWords = 8
This feature enables context-sensitive entity linking. This feature is required if you have the contextWords field filled out for 5% or more of your knowledge base. If contextWords data doesn’t exist or is sparse, turn this feature off. A high value for this feature indicates that your context words are reliable, i.e., high-quality context words that really help identify the entity, while a low value indicates lower quality, for example, every place a person ever visited or every person they ever met.
Choosing a value for the CONTEXT_WORDS feature
The entities in the knowledge base lie on a continuum, from an entity that has very few context words (for whom every connection is thus precious and distinguishing), such as a near hermit, to an entity that has a great many context words (for whom every connection is less precious and distinguishing), such as Barack Obama. During training, context words are divided into “bins”, and more bins means a finer-grained distinction in this area. We recommend starting with 8 and then trying different values to see what produces the best results for you.
-
From your console, train the disambiguation model
train-linker-model
-
Copy the new knowledge base folder <rex-field-training-home>/asset/custom-kb
to be a subdirectory of <rex-home>/data/flinx/data/kb/
. Restart REX to use the new model.
-
Run REXCmd to observe new model results on sample documents from your domain, with entities that exist in the knowledge base. If there are too few results, update the NIL_BIAS coefficient value in the custom-kb/<lang_code>/params file to a number between 0 and 1. If you want to improve the results further, try updating the feature list in /asset/config/features/<lang_code> and training again.
Note
After you run through the whole process once, we recommend repeating steps 2 and 3 to see how different features and configurations affect the performance of your disambiguation model, until you arrive at a model that works for you. Feel free to reach out to the customer engineering group at <support@rosette.com> if you have questions.
Adding Entries to a Custom Knowledge Base
Once the disambiguation model for a custom knowledge base has been created, you can add entries without retraining the disambiguation model. The new entries must be similar to the entries that the model was trained on.
-
Create JSON files containing the new entries.
-
Add the files to the directory: <rex-je-root>/data/flinx/data/kb/<custom_kb_name>/
-
Modify the entity extractor at runtime to use the files:
EntityExtractor#addKbEntities(String disambiguationModel,
LanguageCode language, File file)
EntityExtractor#addKbEntities(String disambiguationModel, Path path)
Note
If the new knowledge base entries have different attributes (such as having a prevalence score that the existing knowledge base entries do not have), then you should retrain the disambiguation model with the new entries.
Custom Knowledge Base Connectors
You can add a custom Knowledge Base Connector that retrieves linking information from a knowledge base backed by any external data source or code. Knowledge Base Connectors are an advanced customization feature, and should be used with care; unoptimized connector implementations can greatly affect REX's performance.
Custom Knowledge Base Connectors are created by implementing the com.basistech.rosette.flinx.api.internal.KnowledgeBase interface. The interface contains functions that provide information about entities to the linker disambiguation model features, as described in Train the Model. At a minimum, a connector must implement the findCandidates function that identifies potential candidates for linking in a document, the lookupEntityType and getLabels functions that return basic information about entities supported by the Knowledge Base, and the getContextVector function that provides a general context vector for entities.
A complete example of a custom Knowledge Base Connector that uses a knowledge base backed by a SQLite database is provided in the samples
directory of your REX installation.
Warning
The sample uses a SQLite database which is simple and easy to install for demonstration purposes, but is not optimized for performance. Your installation will depend on your requirements and existing knowledge bases.
The file SQLiteKnowledgeBase.java
is well commented and can be used as an example for how to build your own connector.
To build and run the sample, follow the directions in the README.md
file in the repository. The files are dependent on the version of Rosette Server installed.
To build and run the sample, make sure you have a local version of SQLite installed, and then:
-
From the samples/sqlite-kb-connector
directory, run mvn install
.
-
Copy the built sqlite-kb-connector-1.0.jar
from the target
folder to {rex-installation}/lib
.
-
Create a SQLite knowledge base: copy the contents of the sample's kb
directory into a new directory under {rex-installation}/data/flinx/data/kb
. You may name this new folder anything you like.
-
Inside the knowledge base directory, run create-kb-database.sh custom-kb.db
. This will create a new SQLite database with the correct schema.
-
To put data in the knowledge base, use a database browser or connect to the database with the SQLite command-line utility, then add lines to the tables as follows to add an entity. The example code provides the commands using the SQLite command-line utility.
-
From a command-line, start the sqlite3 program and open the sample database.
sqlite3
.open custom-kb.db
-
In the entities
table, add a line with your desired entity ID in the entityId
field (for example, E1
). The entityNum
column should auto increment.
insert into entities(entityNum, entityId) VALUES(1, 'E1');
-
In the aliases
table, add lines for any alias you want the entity to have. For example, add a line with entityId
set to E1
and alias
set to John Doe
.
insert into aliases(alias, entityId) VALUES ('John Doe', 'E1');
-
In the canonicalNames
table, set the canonical name for the entity. For example, add a line with entityId
set to E1
and canonicalName
set to Jonathan Doe
.
insert into canonicalNames(entityId, canonicalName) VALUES ('E1', 'Jonathan Doe');
-
In the entityTypes
table, set the entity type for the entity. For example, add a line with entityId
set to E1
and entityType
set to PERSON
.
insert into entityTypes(entityId, entityType) VALUES ('E1', 'PERSON');
-
In the contextWords
table, set context words to help the linker recognize the entity, separated by spaces. For example, if John Doe works at BasisTech and you expect them to appear in documents related to the company, you might add a line with entityId
set to E1
and contextWords
set to basistech nlp language rex
.
insert into contextWords(entityId, contextWords) VALUES ('E1', 'basistech nlp language rex');
-
Once the database is in place, any extractions with the linker activated will use the connector to search the new SQLite-based knowledge base in addition to the standard Wikidata knowledge base.
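The same rows can also be added from a script rather than typed into the SQLite shell. The Python sketch below uses the standard sqlite3 module and mirrors the insert statements above. The add_entity helper is hypothetical, and the CREATE TABLE statements are only an assumed stand-in so the example can run against an in-memory database; the real schema is whatever create-kb-database.sh produces, and against a real custom-kb.db you would connect to that file and skip the schema step.

```python
import sqlite3

# In-memory database so the sketch is self-contained. The schema below is an
# assumption based on the table and column names used in the steps above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities (entityNum INTEGER PRIMARY KEY AUTOINCREMENT,
                       entityId TEXT UNIQUE);
CREATE TABLE aliases (alias TEXT, entityId TEXT);
CREATE TABLE canonicalNames (entityId TEXT, canonicalName TEXT);
CREATE TABLE entityTypes (entityId TEXT, entityType TEXT);
CREATE TABLE contextWords (entityId TEXT, contextWords TEXT);
""")


def add_entity(conn, entity_id, canonical_name, entity_type, aliases, context_words):
    """Insert one entity across all five tables in a single transaction."""
    with conn:  # commits on success, rolls back if any insert fails
        # entityNum is omitted so that it auto-increments.
        conn.execute("INSERT INTO entities(entityId) VALUES (?)", (entity_id,))
        conn.executemany(
            "INSERT INTO aliases(alias, entityId) VALUES (?, ?)",
            [(alias, entity_id) for alias in aliases],
        )
        conn.execute(
            "INSERT INTO canonicalNames(entityId, canonicalName) VALUES (?, ?)",
            (entity_id, canonical_name),
        )
        conn.execute(
            "INSERT INTO entityTypes(entityId, entityType) VALUES (?, ?)",
            (entity_id, entity_type),
        )
        # Context words are stored as one space-separated string.
        conn.execute(
            "INSERT INTO contextWords(entityId, contextWords) VALUES (?, ?)",
            (entity_id, " ".join(context_words)),
        )


add_entity(conn, "E1", "Jonathan Doe", "PERSON",
           aliases=["John Doe"],
           context_words=["basistech", "nlp", "language", "rex"])

alias_rows = conn.execute(
    "SELECT alias FROM aliases WHERE entityId = ?", ("E1",)).fetchall()
entity_num = conn.execute(
    "SELECT entityNum FROM entities WHERE entityId = ?", ("E1",)).fetchone()[0]
stored_context = conn.execute(
    "SELECT contextWords FROM contextWords WHERE entityId = ?", ("E1",)).fetchone()[0]
```

Wrapping the inserts in one transaction keeps the tables consistent: an entity never ends up with an alias but no type if one statement fails.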