Organizations often have nicknames that bear no resemblance to their official names. For example, International Business Machines, or IBM, is known by the nickname Big Blue. Because the two names have no phonetic similarity, a match query between them would produce a low score. A real world identifier associates a company's official name, nicknames, and common permutations with a single ID. When real world IDs are enabled, a search between two company names also compares the real world identifiers of the two names, so dissimilar names for the same corporate entity can match.
RNI contains real world identifiers for corporations, which pair an entity id with nicknames and common permutations of the corporation name. Name Matching Within a Language lists the languages with provided real-world ID dictionaries. Customers can also generate their own real-world ID dictionaries to supplement the provided dictionaries.
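The idea can be illustrated with a minimal Python sketch. This is not RNI's implementation: the dictionary contents, the `normalize` helper, and the `real_world_id_match` function are hypothetical names introduced only for illustration, and the score constant simply mirrors the documented default of realWorldIdScore.

```python
from typing import Dict, Set

# Hypothetical toy dictionary: normalized name -> set of real world IDs.
REAL_WORLD_IDS: Dict[str, Set[str]] = {
    "ibm": {"WE1X92"},
    "big blue": {"WE1X92"},
    "international business machines": {"WE1X92"},
}

# Mirrors the documented default of realWorldIdScore.
REAL_WORLD_ID_SCORE = 0.98


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace (illustrative normalization only)."""
    return " ".join(name.lower().split())


def real_world_id_match(name_a: str, name_b: str,
                        ids: Dict[str, Set[str]] = REAL_WORLD_IDS) -> float:
    """Return REAL_WORLD_ID_SCORE if the two names share an ID, else 0.0."""
    ids_a = ids.get(normalize(name_a), set())
    ids_b = ids.get(normalize(name_b), set())
    return REAL_WORLD_ID_SCORE if ids_a & ids_b else 0.0


if __name__ == "__main__":
    print(real_world_id_match("IBM", "Big Blue"))
    print(real_world_id_match("IBM", "Nintendo"))
```

In this sketch, two names match only when their ID sets intersect, which is why phonetically unrelated names such as IBM and Big Blue can still receive a high score.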
Table 1. Real World ID Parameters
Parameter | Description | Default
useRealWorldIds | Enables real world IDs and indexes them as corporation names are added to the index. If you enable it after indexing, you must reindex. | true (enabled)
doQueryRealWorldIds | Enables querying with real world IDs. Set by language pair. | true (enabled)
realWorldIdScore | Sets the match score when two names match due to matching real world IDs. Set by language pair. | 0.98
nameRealWorldQueryBoost | Boosts the value of the real world ID results from the first pass, increasing the likelihood that real world ID matches are returned from the first pass. Set by language pair. | 35
Building a Real World ID File
Many companies maintain their own file of organizations and the different names each one goes by. To improve matching between organization names, you can supplement the real world IDs provided in RNI by building your own file of real world IDs. The provided builder program creates a binary file named <LANG>_ORGANIZATION_ids.bin in the specified output directory, where <LANG> is the three-letter language code of the file.
The input file is a tab-separated file (.tsv). Each line contains an organization name and a corresponding alphanumeric ID. The file can only contain a single language and script. You must create a separate file for each language.
IBM WE1X92
Big Blue WE1X92
International Business Machines WE1X92
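Before running the builder, it can help to check that the input file follows the format described above. The following Python sketch is one way to do that; `validate_id_lines` is a hypothetical helper, not part of RNI, and skipping blank lines is an assumption, since the builder's own handling of them is not documented here.

```python
from typing import Iterable, List, Tuple


def validate_id_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'name<TAB>id' lines; raise ValueError on a malformed line."""
    entries: List[Tuple[str, str]] = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 2 tab-separated fields, got {len(fields)}"
            )
        name, rwid = fields[0].strip(), fields[1].strip()
        if not name:
            raise ValueError(f"line {lineno}: empty organization name")
        if not rwid.isalnum():
            raise ValueError(f"line {lineno}: ID {rwid!r} is not alphanumeric")
        entries.append((name, rwid))
    return entries


if __name__ == "__main__":
    sample = ["IBM\tWE1X92", "Big Blue\tWE1X92",
              "International Business Machines\tWE1X92"]
    for name, rwid in validate_id_lines(sample):
        print(f"{rwid}: {name}")
```

A line that separates the name and ID with spaces instead of a tab is rejected, which is a common mistake when the file is edited by hand.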
Unzip the file realWorldIDBuilder.zip found in the $BT_ROOT directory and run the build command. Instructions on how to run the program are in the README.md file in the zip file.
You may want to use real world ID matching even if there are some entities which you do not want to match via real world IDs. You can omit specific organizations and QIDs (Wikidata's identifier for entities) from matching by creating an omit file listing the organization names and QIDs you would like to omit.
The omit file is a tab-separated file (.tsv) named <LANG>_ORGANIZATION_ids.tsv, where <LANG> is the three-letter language code of the file. Each omit file can only contain names in one language, and separate files must be made for each language. There are three types of lines that can appear in an omit file, which have different effects on omission: pairs, lone names, and lone QIDs.
Pair: A name and a QID on the same line. The QID will no longer be used for matching against the name. The same name can be associated with multiple QIDs to omit by placing each pair on its own line.
Lone name: A name followed by an asterisk in the QID column. The name will not be used at all for RWID matching.
Lone QID: A QID preceded by an asterisk in the name column. No names in the specified language will be able to match against each other using this QID.
Example:
IBM Q37156
Nintendo *
* Q45700
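The three line types can be told apart by where the asterisk appears. The Python sketch below classifies omit-file lines accordingly; `parse_omit_lines` is a hypothetical helper written for illustration and does not reflect how RNI reads the file internally.

```python
from typing import Iterable, Set, Tuple


def parse_omit_lines(
    lines: Iterable[str],
) -> Tuple[Set[Tuple[str, str]], Set[str], Set[str]]:
    """Split omit-file lines into (pairs, lone_names, lone_qids)."""
    pairs: Set[Tuple[str, str]] = set()
    lone_names: Set[str] = set()
    lone_qids: Set[str] = set()
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected name<TAB>QID")
        name, qid = fields
        if name == "*" and qid == "*":
            raise ValueError(f"line {lineno}: both columns are '*'")
        if qid == "*":
            lone_names.add(name)      # name is never used for RWID matching
        elif name == "*":
            lone_qids.add(qid)        # QID is never used for matching
        else:
            pairs.add((name, qid))    # only this name/QID link is dropped
    return pairs, lone_names, lone_qids


if __name__ == "__main__":
    example = ["IBM\tQ37156", "Nintendo\t*", "*\tQ45700"]
    print(parse_omit_lines(example))
```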
To enable an omit file in RNI:
Place the omit file in the BT_ROOT directory.
Open omit_ids.datafiles, which is in the $BT_ROOT/rlpnc/data/real_world_ids/ref/omit_ids directory by default.
Add a new entry for your omit file following the format <LANG>_ORGANIZATION tab * tab <file path>, where <LANG> is the three-letter language code of the file. File paths must be relative to BT_ROOT; absolute paths will not work. For example:
ara_ORGANIZATION * rlpnc/data/real_world_ids/ref/omit_ids/ara_ORGANIZATION_ids.tsv
Save omit_ids.datafiles.
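If you maintain several omit files, the entry lines can be generated rather than typed by hand. The following Python sketch builds one entry in the format described above and rejects absolute paths; `datafiles_entry` is a hypothetical helper, and the check only detects POSIX-style absolute paths.

```python
from pathlib import PurePosixPath


def datafiles_entry(lang: str, relative_path: str) -> str:
    """Build '<LANG>_ORGANIZATION<TAB>*<TAB><path>' for omit_ids.datafiles."""
    if len(lang) != 3 or not lang.isalpha():
        raise ValueError("lang must be a three-letter language code")
    path = PurePosixPath(relative_path)
    if path.is_absolute():
        # The documentation requires paths relative to BT_ROOT.
        raise ValueError("path must be relative to BT_ROOT, not absolute")
    return f"{lang}_ORGANIZATION\t*\t{path}"


if __name__ == "__main__":
    print(datafiles_entry(
        "ara",
        "rlpnc/data/real_world_ids/ref/omit_ids/ara_ORGANIZATION_ids.tsv",
    ))
```

The generated line uses literal tab characters as separators, matching the tab * tab layout of the entry format.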