User dictionaries are supplementary dictionaries that change the default linguistic analyses. These dictionaries can be static or dynamic.
Static dictionaries are compiled ahead of time and passed to a factory.
Dynamic dictionaries are created and configured at runtime. Dynamic dictionaries are held completely in memory and state is not saved on disk. When the JVM exits, or the factory becomes unused, the contents are lost.
In all dictionaries, entries should be in Unicode Normalization Form KC (NFKC). Japanese Katakana characters, for example, should be full width, and Latin characters, numerals, and punctuation should be half width. Analysis dictionaries can contain characters of any script, but for the most consistent performance in Chinese, Japanese, and Thai, token dictionaries should contain only characters in the Hanzi (Kanji), Hiragana, Katakana, and Thai scripts.
In Chinese, Japanese, and Thai, input text in a foreign script (such as Latin script) that equals or exceeds the length specified by minNonPrimaryScriptRegionLength (the default is 10) is passed to the standard Tokenizer and is not seen by a user segmentation dictionary.
Types of User Dictionaries
Analysis Dictionary: An analysis dictionary allows users to modify the analysis or add new variations of a word. The analysis associated with a word includes the default lemma as well as part-of-speech tag and additional characteristics for some languages. Use of these dictionaries is not supported for Arabic, Persian, Romanian, Turkish, and Urdu.
Segmentation Dictionary: A segmentation dictionary allows users to specify strings that are to be segmented as tokens.
Chinese and Japanese segmentation user dictionary entries may not contain the ideographic full stop.
Many-to-one Normalization Dictionaries: Users can implement a many-to-one normalization dictionary to map multiple spelling variants to a single normalized form.
CLA/JLA Dictionaries: The Chinese Language Analyzer and Japanese Language Analyzer both include the capability to create and use one or more segmentation (tokenization) user dictionaries for vocabulary specific to an industry or application. A common usage for both languages is to add new nouns like organizational and product names. These and existing nouns can have a compounding scheme specified if, for example, you wish to prevent an otherwise compound product name from being segmented as such. When the language is Japanese, you can also create user reading dictionaries with transcriptions rendered in Hiragana. The readings can override the readings returned from the JLA reading dictionary and override readings that are otherwise guessed from segmentation (tokenization) user dictionaries.
CSC Dictionary: Users can specify conversion for use with the Chinese Script Converter (CSC).
Prioritization of User Dictionaries
All static and dynamic user dictionaries, except for many-to-one normalization dictionaries, are consulted in reverse order of addition. In cases of conflicting information, dictionaries added later take priority over those added earlier. Once a token is found in a user dictionary, RBL stops and will consult neither the remaining user dictionaries nor the RBL dictionary.
Many-to-one normalization dictionaries are consulted in the following order:
All dynamic user dictionaries, in reverse order of addition.
Static dictionaries, in the order they appear in the option list for normalizationDictionaryPaths.
Example of non-many-to-one user dictionary priority:
User adds dynamic dictionary named dynDict1
User adds static dictionary named statDict2
User adds static dictionary named statDict3
User adds dynamic dictionary named dynDict4
Dictionaries are prioritized in the following order:
dynDict4, statDict3, statDict2, dynDict1
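The reverse-order-of-addition rule can be sketched as a simple lookup loop. This is an illustrative model, not RBL's implementation; the dictionary names come from the example above, and the entries are hypothetical:

```python
def lookup(token, user_dicts, rbl_dict):
    """Consult user dictionaries newest-first; stop at the first hit."""
    for d in reversed(user_dicts):          # reverse order of addition
        if token in d["entries"]:
            return d["name"]                # later additions take priority
    return "RBL" if token in rbl_dict else None

# Dictionaries in order of addition: dynDict1, statDict2, statDict3, dynDict4
dicts = [
    {"name": "dynDict1", "entries": {"alpha"}},
    {"name": "statDict2", "entries": {"alpha", "beta"}},
    {"name": "statDict3", "entries": {"beta"}},
    {"name": "dynDict4", "entries": {"alpha"}},
]

print(lookup("alpha", dicts, {"gamma"}))  # dynDict4 wins over dynDict1, statDict2
print(lookup("beta", dicts, {"gamma"}))   # statDict3 wins over statDict2
print(lookup("gamma", dicts, {"gamma"}))  # falls through to the RBL dictionary
```

Once a token hits in dynDict4, neither the earlier user dictionaries nor the RBL dictionary is consulted.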
Example of many-to-one normalization user dictionary priority:
User adds dynamic dictionary named dynDict1
User sets normalizationDictionaryPaths = "statDict2;statDict3"
User adds dynamic dictionary named dynDict4
Dictionaries are prioritized in the following order:
dynDict4, dynDict1, statDict2, statDict3
The Chinese and Japanese language analyzers load all dictionaries, with the user dictionaries at the end of the list. To prioritize the user dictionaries and put them at the front of the list, guaranteeing that matches in the user dictionaries are used, set the option favorUserDictionary to true.
The following formatting rules apply to user dictionary source files.
The source file is UTF-8 encoded.
The file may begin with a byte order mark (BOM).
Each entry is a single line.
Empty lines are ignored.
Once complete, the source file is compiled into a binary format for use in RBL.
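A reader that honors these rules (UTF-8 encoding, optional BOM, one entry per line, empty lines ignored) might look like the following sketch. The compilation to binary format is done by the RBL tooling and is not modeled here:

```python
def read_entries(raw_bytes):
    """Decode a UTF-8 dictionary source file, tolerating a leading BOM,
    and return its non-empty lines as entries."""
    text = raw_bytes.decode("utf-8-sig")   # "utf-8-sig" strips a BOM if present
    return [line for line in text.splitlines() if line.strip()]

source = "\ufeffword1\n\nword2\n".encode("utf-8")
print(read_entries(source))  # ['word1', 'word2']
```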
Dynamic User Dictionaries
A dynamic user dictionary allows users to add user dictionary values at runtime. Instead of creating and compiling the dictionary in advance, the values are added dynamically. Dynamic dictionaries are available for all types of user dictionaries, except the CSC dictionary.
The process for using dynamic dictionaries is the same for each dictionary type:
Create an empty dynamic dictionary for the dictionary type.
Use the appropriate add method to add entries to the dictionary.
Dynamic dictionaries use the same structure as the compiled user dictionaries, but instead of taking a single tab-delimited string, they are composed of separate strings. In a many-to-one normalization dictionary, for example, the compiled source holds the normalized form and its variants in one tab-delimited line, while the dynamic add method takes them as separate strings.
Caution
Dynamic dictionaries are held completely in memory and state is not saved on disk. When the JVM exits, or the annotator, tokenizer, or analyzer becomes unused, the contents are lost.
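The sketch below contrasts the two forms for a many-to-one normalization entry. The helper function is a hypothetical stand-in for a dynamic add method, not an RBL API, and the entry contents are invented:

```python
# Compiled source form: one tab-delimited line, normalized form first.
line = "colour\tcolor\tcolours\tcolors"
normalized, *variants = line.split("\t")

def add_normalization_entry(dictionary, normalized_form, variant_list):
    """Hypothetical stand-in for a dynamic add method: the pieces of the
    entry arrive as separate strings rather than one delimited line."""
    for v in variant_list:
        dictionary[v] = normalized_form

dyn_dict = {}
add_normalization_entry(dyn_dict, normalized, variants)
print(dyn_dict["colors"])  # colour
```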
User dictionary lookups are case-sensitive. RBL provides an option, caseSensitive, to control whether the analysis phase is case-sensitive.
If caseSensitive is true (the default), the token itself is used to query the dictionary.
If caseSensitive is false, the token is lowercased before the dictionary is consulted. If the analysis is intended to be case-insensitive, then the words in the user dictionary must all be lowercase.
If you are using the BaseLinguisticsTokenFilterFactory, then the value of AnalyzerOption.caseSensitive both turns on the corresponding analysis and associates the dictionary with that analysis.
For Danish, Norwegian, and Swedish, the provided dictionaries are lowercase and caseSensitive is automatically set to false.
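The case-handling rules above amount to the following lookup logic (an illustrative model; the dictionary contents are hypothetical):

```python
def query(token, user_dict, case_sensitive=True):
    """Model of the caseSensitive option: lowercase the token before the
    lookup when case-insensitive matching is requested."""
    key = token if case_sensitive else token.lower()
    return user_dict.get(key)

# A dictionary intended for case-insensitive use must itself be all lowercase.
entries = {"dog": "dog[+NOUN]"}
print(query("Dog", entries, case_sensitive=True))   # None: exact match fails
print(query("Dog", entries, case_sensitive=False))  # dog[+NOUN]
```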
Valid Characters for Chinese and Japanese User Dictionary Entries
An entry in a Chinese or Japanese user dictionary must contain characters corresponding to the following Unicode code points, to valid surrogate pairs, or to letters or decimal digits in Latin script. In this listing, .. indicates an inclusive range of valid code points:
0022..007E, 00A2, 00A3, 00A5, 00A6, 00A9, 00AC, 00AF, 00B7, 0370..04FF, 2010..206F, 2160..217B, 2190..2193, 2200..22FF, 2460..24FF, 2502, 25A0..26FF, 2985, 2986, 3001..3007, 300C, 300D, 3012, 3020, 3031..3037, 3041..3094, 3099..309E, 30A1..30FE, 3200..33FF, 4E00..9FFF, D800..FA2D, FF00, FF02..FFEF
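A validator for these code points might look like the sketch below. Single code points are written as one-element ranges; the surrogate-pair case mentioned above is not modeled (a decoded surrogate pair is a single code point above U+FFFF, which would need its own check):

```python
# Inclusive ranges and single code points from the listing above.
VALID = [
    (0x0022, 0x007E), (0x00A2, 0x00A3), (0x00A5, 0x00A6), (0x00A9, 0x00A9),
    (0x00AC, 0x00AC), (0x00AF, 0x00AF), (0x00B7, 0x00B7), (0x0370, 0x04FF),
    (0x2010, 0x206F), (0x2160, 0x217B), (0x2190, 0x2193), (0x2200, 0x22FF),
    (0x2460, 0x24FF), (0x2502, 0x2502), (0x25A0, 0x26FF), (0x2985, 0x2986),
    (0x3001, 0x3007), (0x300C, 0x300D), (0x3012, 0x3012), (0x3020, 0x3020),
    (0x3031, 0x3037), (0x3041, 0x3094), (0x3099, 0x309E), (0x30A1, 0x30FE),
    (0x3200, 0x33FF), (0x4E00, 0x9FFF), (0xD800, 0xFA2D), (0xFF00, 0xFF00),
    (0xFF02, 0xFFEF),
]

def entry_is_valid(entry):
    """Check every code point of an entry against the listed ranges.
    (Latin letters and decimal digits fall inside 0022..007E.)"""
    return all(any(lo <= ord(ch) <= hi for lo, hi in VALID) for ch in entry)

print(entry_is_valid("三菱UFJ銀行"))  # True: Kanji plus Latin letters
print(entry_is_valid("デジカメ"))      # True: Katakana
print(entry_is_valid("word\t"))       # False: tab is outside every range
```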
Compiling a User Dictionary
In the tools/bin directory, RBL includes a shell script for Unix (rbl-build-user-dictionary) and a .bat file for Windows (rbl-build-user-dictionary.bat).
To compile a user dictionary, from the RBL root directory:
tools/bin/rbl-build-user-dictionary -type TYPE_ARGUMENT LANG INPUT_FILE OUTPUT_FILE
where TYPE_ARGUMENT is the dictionary type, LANG is the language code, INPUT_FILE is the pathname of the source file you have created, and OUTPUT_FILE is the pathname of the binary compiled dictionary the tool creates. For example:
tools/bin/rbl-build-user-dictionary -type analysis jpn jpn_lemmadict.txt jpn_lemmadict.bin
Table 12. Type Arguments
Dictionary Type | TYPE_ARGUMENT
Analysis | analysis
Segmentation | segmentation
Many-to-one | m1norm
CLA or JLA segmentation | cla or jla
JLA reading | jla-reading
The script uses Java to compile the user dictionary. The operation is performed in memory, so you may require more than the default heap size. You can set the heap size with the JAVA_OPTS environment variable. For example, to provide 8 GB of heap, set JAVA_OPTS to -Xmx8g.
Unix shell:
export JAVA_OPTS=-Xmx8g
Windows command prompt:
set JAVA_OPTS=-Xmx8g
Segmentation Dictionaries
The format for a segmentation dictionary source file is very simple. Each word is written on its own line, and that word is guaranteed to be segmented as a single token when seen in the input text, regardless of context. Japanese example:
三菱UFJ銀行
酸素ボンベ
Table 13. User Segmentation Dictionary API
Class | Method | Task
BaseLinguisticsFactory | addUserSegDictionary | Adds a user segmentation dictionary for a given language.
BaseLinguisticsFactory | addDynamicSegDictionary | Creates and loads a new dynamic segmentation dictionary.
TokenizerFactory | addUserDefinedDictionary | Adds a user segmentation dictionary.
TokenizerFactory | addDynamicUserDictionary | Creates and loads a new dynamic segmentation dictionary.
Analysis Dictionaries
Note
Analysis dictionaries are not supported for Arabic, Persian, Romanian, Turkish, and Urdu.
Each entry is a word, followed by a tab and an analysis. The analysis must end with a lemma and a part-of-speech (POS) tag.
word lemma[+POS]
For those languages for which RBL does not return POS tags, use DUMMY.
Variations. You can provide more than one analysis for a word or more than one version of a word for an analysis.
The following example includes two analyses for "telephone" (noun and verb), and two renditions of "dog" for the same analysis (noun).
telephone telephone[+NOUN]
telephone telephone[+VI]
dog dog[+NOUN]
Dog dog[+NOUN]
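Entries like these can be read into a multimap from surface form to analyses. The sketch below is illustrative, not RBL code, and uses the four example lines above as its source:

```python
from collections import defaultdict

SOURCE = """\
telephone\ttelephone[+NOUN]
telephone\ttelephone[+VI]
dog\tdog[+NOUN]
Dog\tdog[+NOUN]
"""

# One word may carry several analyses; one analysis may apply to
# several renditions of a word.
analyses = defaultdict(list)
for line in SOURCE.splitlines():
    word, analysis = line.split("\t")
    analyses[word].append(analysis)

print(analyses["telephone"])  # ['telephone[+NOUN]', 'telephone[+VI]']
print(analyses["Dog"])        # ['dog[+NOUN]']
```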
For some languages, the analysis may include special tags and additional information.
Contracted forms. For English, French, Italian, and Portuguese, ^= is a separator for a contraction or elision.
English example:
doesn't does[^=]not[+VDPRES]
Multi-Word Analysis. For English, Italian, Spanish, and Dutch, ^_ indicates a space.
English example:
IBM International[^_]Business[^_]Machines[+PROP]
Compound Boundary. For Danish, Dutch, Norwegian, German, and Swedish, ^# indicates the boundary between elements in a compound word. For Hungarian, the compound boundary tag is ^CB+.
German example:
heimatländern Heimat[^#]Land[+NOUN]
Compound Linking Element. For German, ^/ indicates a compound linking element. For Dutch, use ^//.
German example:
arbeitskreis Arbeit[^/]s[^#]Kreis[+NOUN]
Derivation Boundary or Separator for Clitics. For Italian, Portuguese, and Spanish, ^| indicates a derivation boundary or a separator for clitics.
Spanish example with derivation boundary:
duramente duro[^|][+ADV]
Italian example with separator for clitics:
farti fare[^|]tu[+VINF_CLIT]
Japanese Readings and Normalized Forms. For Japanese, [^r] precedes a reading (there may be more than one), and [^n] precedes a normalization. For example:
行わ 行う[^r]オコナワ[+V]
tv テレビ[^r]テレビ[^n]テレビ[+NC]
アキュムレータ アキュムレーター[^r]アキュムレータ[^n]アキュムレーター[+NC]
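A small parser for the [^r] and [^n] markers, sketched here for illustration (this is not RBL's parser), splits an analysis into its lemma, readings, normalization, and POS tag:

```python
import re

def split_japanese_analysis(analysis):
    """Split off the trailing [+POS] tag, then break the remainder on the
    [^r] (reading) and [^n] (normalization) markers."""
    body, pos = re.fullmatch(r"(.*)\[\+(\w+)\]", analysis).groups()
    parts = re.split(r"\[\^([rn])\]", body)
    lemma, readings, normalization = parts[0], [], None
    for marker, value in zip(parts[1::2], parts[2::2]):
        if marker == "r":
            readings.append(value)   # there may be more than one reading
        else:
            normalization = value
    return lemma, readings, normalization, pos

print(split_japanese_analysis("行う[^r]オコナワ[+V]"))
# ('行う', ['オコナワ'], None, 'V')
print(split_japanese_analysis("テレビ[^r]テレビ[^n]テレビ[+NC]"))
# ('テレビ', ['テレビ'], 'テレビ', 'NC')
```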
Korean Analysis. A Korean analysis uses a different pattern than the analysis for other languages. The pattern for an analysis in a user Korean dictionary is as follows:
Token Mor1[/Tag1][^+]Mor2[/Tag2][^+]Mor3[/Tag3]
Where each MorN is a morpheme consisting of one or more Korean characters, TagN is the POS tag for that morpheme, and [^+] indicates the boundary between morphemes.
Here's an example:
유전자이다 유전자[/NPR][^+]이[/CO][^+]다[/ECS]
If the analysis contains one noun morpheme, that morpheme is the lemma and the POS tag is the POS tag for that morpheme. If more than one of the morphemes are nouns, the lemma is the concatenation of those nouns (a compound). Example:
정보검색	정보[/NNC][^+]검색[/NNC]
Otherwise, the lemma is the first morpheme, and the POS tag is the POS tag associated with that morpheme.
You can override this algorithm for identifying the lemma and/or the POS tag in a user dictionary entry by placing [^L]lemma and/or [^P][/Tag] at the end of the analysis. The lemma may or may not correspond to one of the morphemes in the analysis. For example:
유전자이다 유전자[/NNC][^+]이[/CO][^+]다[/ECS][^L]유전[^P][/NPR]
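The lemma rules above can be sketched as follows. This is an illustrative model, not RBL code; in particular, the set of tags treated as nouns and the POS chosen for a multi-noun compound are assumptions made for the sketch:

```python
import re

NOUN_TAGS = ("NNC", "NPR")  # assumption: the tags treated as nouns here

def korean_lemma(analysis):
    """Apply the lemma/POS rules described above to a user-dictionary analysis."""
    # Optional [^P][/Tag] and [^L]lemma overrides at the end of the analysis.
    lemma_override = pos_override = None
    m = re.search(r"\[\^P\]\[/(\w+)\]$", analysis)
    if m:
        pos_override, analysis = m.group(1), analysis[: m.start()]
    m = re.search(r"\[\^L\]([^\[\]]+)$", analysis)
    if m:
        lemma_override, analysis = m.group(1), analysis[: m.start()]

    morphemes = [re.fullmatch(r"(.+)\[/(\w+)\]", part).groups()
                 for part in analysis.split("[^+]")]
    nouns = [(mor, tag) for mor, tag in morphemes if tag in NOUN_TAGS]
    if len(nouns) == 1:
        lemma, pos = nouns[0]            # the single noun morpheme wins
    elif len(nouns) > 1:
        # Compound: concatenate the nouns (POS choice is an assumption).
        lemma, pos = "".join(mor for mor, _ in nouns), nouns[0][1]
    else:
        lemma, pos = morphemes[0]        # otherwise, the first morpheme
    return lemma_override or lemma, pos_override or pos

print(korean_lemma("정보[/NNC][^+]검색[/NNC]"))            # ('정보검색', 'NNC')
print(korean_lemma("유전자[/NPR][^+]이[/CO][^+]다[/ECS]"))  # ('유전자', 'NPR')
print(korean_lemma("유전자[/NNC][^+]이[/CO][^+]다[/ECS][^L]유전[^P][/NPR]"))
# ('유전', 'NPR')
```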
The KoreanAnalysis interface provides access to the morphemes and tags associated with a given token in either the standard Korean dictionary or a user Korean dictionary.
Table 14. User Analysis Dictionary API
Class | Method | Task
BaseLinguisticsFactory | addUserLemDictionary, addUserAnalysisDictionary | Adds a user analysis dictionary.
BaseLinguisticsFactory | addDynamicAnalysisDictionary | Adds a dynamic analysis dictionary.
AnalyzerFactory | addUserDefinedDictionary | Adds a user analysis dictionary.
AnalyzerFactory | addDynamicUserDictionary | Adds a dynamic analysis dictionary.
Many-to-one Normalization Dictionaries
A many-to-one normalization dictionary maps one or more variants to a normalized form. The first value on each line is the normalized form. The remainder of the entries on the line are the variants to be mapped to the normalized form. All values on the line are separated by tabs.
Example:
norm1 var1 var2
norm1 var3 var4 var5
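Reading such a source file into a variant-to-normalized-form map might look like this illustrative sketch, using the two example lines above:

```python
def load_m1norm(lines):
    """The first tab-separated value on each line is the normalized form;
    the remaining values are variants mapped to it. Later lines may add
    more variants for the same normalized form."""
    mapping = {}
    for line in lines:
        norm, *variants = line.split("\t")
        for v in variants:
            mapping[v] = norm
    return mapping

table = load_m1norm(["norm1\tvar1\tvar2", "norm1\tvar3\tvar4\tvar5"])
print(table["var4"])  # norm1
```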
Table 15. User Many-to-one Normalization Dictionary API
Class | Method | Task
BaseLinguisticsFactory | addDynamicNormalizationDictionary | Creates and loads a new dynamic normalization dictionary.
AnalyzerFactory | addDynamicNormalizationDictionary | Creates and loads a new dynamic normalization dictionary.
Use the option normalizationDictionaryPaths to specify the static user normalization dictionaries.
Chinese and Japanese User Dictionaries
The source file for a Chinese or Japanese user dictionary is UTF-8 encoded (see Valid Characters for Chinese and Japanese User Dictionary Entries). The file may begin with a byte order mark (BOM). Empty lines are ignored. A comment line begins with #. The first line of a Japanese dictionary may begin with !DICT_LABEL followed by a tab and an arbitrary string to set the dictionary's name; the name is not currently used anywhere.
Each entry in the dictionary source file is a single line:
word Tab POS Tab DecompPattern Tab Reading1,Reading2,...
where:
word is the entry or surface form.
POS is one of the user-dictionary part-of-speech tags listed below.
DecompPattern is the decomposition pattern: a comma-delimited list of numbers that specify the number of characters from word to include in each component of the compound (0 for no decomposition).
Reading1,... is a comma-delimited list of one or more transcriptions rendered in Hiragana or Katakana (applicable only to Japanese).
The decomposition pattern and readings are optional, but you must include a decomposition pattern if you include readings. In other words, an entry must include all elements to be included in a reading user dictionary, even though the reading user dictionary does not use the POS tag or decomposition pattern. To include an entry in a segmentation (tokenization) user dictionary, you need only a POS tag and an optional decomposition pattern. Keep in mind that entries that include all elements can be included in both a segmentation (tokenization) user dictionary and a reading user dictionary.
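A parser for this line format might look like the sketch below (illustrative only; by the field order, readings can only appear after a decomposition pattern):

```python
def parse_cla_jla_entry(line):
    """Split an entry into word, POS, optional decomposition pattern,
    and optional comma-delimited readings (Japanese only)."""
    fields = line.split("\t")
    word, pos = fields[0], fields[1]
    decomp = [int(n) for n in fields[2].split(",")] if len(fields) > 2 else None
    readings = fields[3].split(",") if len(fields) > 3 else None
    return word, pos, decomp, readings

print(parse_cla_jla_entry("安倍晋三\tPERSON\t2,2\tあべしんぞう"))
# ('安倍晋三', 'PERSON', [2, 2], ['あべしんぞう'])
print(parse_cla_jla_entry("デジカメ\tNOUN\t0"))
# ('デジカメ', 'NOUN', [0], None)
```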
Chinese User Dictionary POS Tags
ABBREVIATION
ADJECTIVE
ADVERB
AFFIX
CONJUNCTION
CONSTRUCTION
DERIVATIONAL_AFFIX
DIRECTION_WORD
FOREIGN_PERSON
IDIOM
INTERJECTION
MEASURE_WORD
NON_DERIVATIONAL_AFFIX
NOUN
NUMERAL
ONOMATOPE
ORGANIZATION
PARTICLE
PERSON
PLACE
PREFIX
PREPOSITION
PRONOUN
PROPER_NOUN
PUNCTUATION
SUFFIX
TEMPORAL_NOUN
VERB
VERB_ELEMENT
Japanese User Dictionary POS Tags
NOUN
PROPER_NOUN
PLACE
PERSON
ORGANIZATION
GIVEN_NAME
SURNAME
FOREIGN_PLACE_NAME
FOREIGN_GIVEN_NAME
FOREIGN_SURNAME
AJ (adjective)
AN (adjectival noun)
D (adverb)
HS (honorific suffix)
V1 (vowel-stem verb)
VN (verbal noun)
VS (suru-verb)
VX (irregular verb)
Note: For examples of standard (non-user-dictionary) use of the one- and two-letter POS tags in the preceding list, see Japanese POS Tags – BT_JAPANESE_RBLJE_2.
Examples (the last three entries include readings):
!DICT_LABEL New Words 2014
デジタルカメラ NOUN
デジカメ NOUN 0
東京証券取引所 ORGANIZATION 2,2,3
狩野 SURNAME 0
安倍晋三 PERSON 2,2 あべしんぞう
麻垣康三 PERSON 2,2 あさがきこうぞう
商人 NOUN 0 しょうにん,あきんど
The POS and decomposition pattern can be in full-width numerals and Roman letters. For example:
東京証券取引所 organization 2,2,3
The "2,2,3" decomposition pattern instructs the tokenizer to decompose this compound entry into:
東京
証券
取引所
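Applying a decomposition pattern amounts to slicing the entry by the listed character counts, as in this illustrative sketch:

```python
def decompose(word, pattern):
    """Split a compound entry by its decomposition pattern;
    a pattern of 0 means no decomposition."""
    if pattern == [0]:
        return [word]
    parts, start = [], 0
    for length in pattern:
        parts.append(word[start:start + length])
        start += length
    return parts

print(decompose("東京証券取引所", [2, 2, 3]))  # ['東京', '証券', '取引所']
print(decompose("デジカメ", [0]))              # ['デジカメ']
```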
Table 16. CLA and JLA User Analysis Dictionary API
Class | Method | Task
BaseLinguisticsFactory | addUserLemDictionary, addUserAnalysisDictionary | Adds a user analysis dictionary.
BaseLinguisticsFactory | addDynamicAnalysisDictionary | Adds a dynamic analysis dictionary.
AnalyzerFactory | addUserDefinedDictionary | Adds a user analysis dictionary.
AnalyzerFactory | addDynamicUserDictionary | Adds a dynamic analysis dictionary.
Table 17. JLA Readings Dictionary API
Class | Method | Task
BaseLinguisticsFactory | addUserReadingDictionary | Adds a JLA readings dictionary.
BaseLinguisticsFactory | addDynamicReadingDictionary | Creates and loads a new dynamic JLA readings dictionary.
TokenizerFactory | addUserReadingDictionary | Adds a JLA readings dictionary.
TokenizerFactory | addDynamicReadingDictionary | Creates and loads a new dynamic JLA readings dictionary.