The source file for a user dictionary is UTF-8 encoded. The file may begin with a byte order mark (BOM). Each entry is a single line. Empty lines are ignored. The source file must be compiled into a binary format, as described below.
Each entry is a word, followed by a tab and an analysis. The analysis must end with a lemma and a part-of-speech (POS) tag.
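As an illustration of this layout (not the RBL-JE compiler; the class shown is hypothetical), a source file can be read and split into words and analyses like this:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative reader for a user dictionary source file: UTF-8, optional BOM,
// one entry per line, word TAB analysis, empty lines ignored.
final class UserDictionarySource {
    static void dump(Path source) throws IOException {
        boolean first = true;
        for (String line : Files.readAllLines(source, StandardCharsets.UTF_8)) {
            String entry = line;
            if (first && entry.startsWith("\uFEFF")) {
                entry = entry.substring(1); // strip the optional BOM
            }
            first = false;
            if (entry.isEmpty()) {
                continue; // empty lines are ignored
            }
            int tab = entry.indexOf('\t');
            String word = entry.substring(0, tab);
            String analysis = entry.substring(tab + 1); // ends with a lemma and a POS tag
            System.out.println(word + " -> " + analysis);
        }
    }
}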
For POS tags, see Part-of-Speech Tags. For those languages for which RBL-JE does not return POS tags, use
Case. User dictionary lookups are case sensitive and RBL-JE provides an option,
AnalyzerOption.caseSensitive, to control whether or not the analysis phase is case sensitive. If this option is "true", which is the default, then the token itself is used to query the dictionary. If it is "false", the token is lowercased before consulting the dictionary. This requires that the words in a user dictionary intended for use in a case-insensitive analysis be in lowercase. See Activating User Dictionaries to learn how to associate a dictionary with the appropriate analysis. For Danish, Norwegian, and Swedish, the dictionaries we provide are lowercase.
AnalyzerOption.caseSensitive is automatically set to "false" for these languages.
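The effect on lookups can be sketched as follows. This is a simplified illustration in which a plain Map stands in for the compiled dictionary; it is not the actual RBL-JE code path:

import java.util.List;
import java.util.Locale;
import java.util.Map;

// Simplified illustration of how AnalyzerOption.caseSensitive affects lookups.
final class CaseSensitivityIllustration {
    static List<String> lookup(Map<String, List<String>> userDictionary,
                               String token,
                               boolean caseSensitive) {
        // true (the default): the token itself is used to query the dictionary.
        // false: the token is lowercased first, which is why dictionary words
        // intended for case-insensitive analysis must be in lowercase.
        String key = caseSensitive ? token : token.toLowerCase(Locale.ROOT);
        return userDictionary.get(key);
    }
}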
Variations. You may want to provide more than one analysis for a word, or more than one version of a word for the same analysis. Note: The shortcut of repeating a word or analysis by including lines with an empty word or an empty analysis is deprecated.
For example, you might include two analyses for "telephone" (one as a noun and one as a verb), and two renditions of "dog" that share the same noun analysis.
For some languages, the analysis may include special tags and additional information.
Contracted forms. For English, French, Italian, and Portuguese, ^= is a separator for a contraction or elision, as in an English contraction such as "can't".
Multi-Word Analysis. For English, Italian, Spanish, and Dutch, ^_ indicates a space in the analysis of a multi-word entry, such as the English phrase "ice cream".
Compound Boundary. For Danish, Dutch, Norwegian, German, and Swedish, ^# indicates the boundary between elements in a compound word, such as the German compound "Haustür" (Haus + Tür). For Hungarian, the compound boundary tag is ^CB+.
Compound Linking Element. For German, ^/ indicates a compound linking element. For Dutch, use
Derivation Boundary or Separator for Clitics. For Italian, Portuguese, and Spanish, ^| indicates a derivation boundary (for example, before a derivational suffix) or a separator for clitics (for example, before a pronoun attached to a verb).
Japanese Readings and Normalized Forms. For Japanese, [^r] precedes a reading (there may be more than one), and [^n] precedes a normalized form.
Korean Analysis. A Korean analysis uses a different pattern than the analysis for other languages. In our Korean dictionary and in a user Korean dictionary, an analysis is a sequence of morphemes, each paired with its POS tag, in which:
MorN is a morpheme, consisting of one or more Korean characters, and
TagN is the POS tag for that morpheme.
[^+] indicates the boundary between morphemes.
If the analysis contains one noun morpheme, that morpheme is the lemma, and the POS tag for the analysis is the POS tag of that morpheme. If more than one of the morphemes are nouns, the lemma is the concatenation of those nouns (a compound).
Otherwise, the lemma is the first morpheme, and the POS tag is the POS tag associated with that morpheme.
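This default rule can be sketched as follows. The sketch uses plain morpheme/tag pairs rather than the KoreanAnalysis interface, and the test for noun tags is passed in as a parameter rather than assumed:

import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Simplified sketch of the default rule for choosing the lemma of a Korean analysis.
final class KoreanLemmaRule {
    record Morpheme(String surface, String tag) {}

    static String lemma(List<Morpheme> analysis, Predicate<String> isNounTag) {
        List<Morpheme> nouns = analysis.stream()
                .filter(m -> isNounTag.test(m.tag()))
                .collect(Collectors.toList());
        if (nouns.size() == 1) {
            // One noun morpheme: that morpheme is the lemma (and its tag is the POS tag).
            return nouns.get(0).surface();
        }
        if (nouns.size() > 1) {
            // Several noun morphemes: the lemma is their concatenation (a compound).
            return nouns.stream().map(Morpheme::surface).collect(Collectors.joining());
        }
        // Otherwise: the lemma is the first morpheme (and its tag is the POS tag).
        return analysis.get(0).surface();
    }
}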
You can override this algorithm for identifying the lemma and/or POS tag in a user dictionary entry by placing
[^L] lemma and/or [^P][/Tag] at the end of the analysis. The lemma may or may not correspond to one of the morphemes in the analysis.
The com.basistech.rosette.bl.KoreanAnalysis interface provides access to the morphemes and tags associated with a given token in either the standard Korean dictionary or a user Korean dictionary.
The source file for a Chinese or Japanese user dictionary is UTF-8 encoded (see Valid Characters for Chinese and Japanese User Dictionary Entries). The file may begin with a byte order mark (BOM). Empty lines are ignored. A comment line begins with #. The first line of a Japanese dictionary may begin with !DICT_LABEL, followed by a tab and an arbitrary string that sets the dictionary's name; the name is not currently used anywhere.
Each entry in the dictionary source file is a single line:
word Tab POS Tab DecompPattern Tab Reading1,Reading2,...
where:
word is the word to add to the dictionary.
POS is one of the user-dictionary part-of-speech tags listed below.
DecompPattern is the decomposition pattern: a comma-delimited list of numbers that specify the number of characters from word to include in each component of the compound (0 for no decomposition).
Reading1,Reading2,... is a comma-delimited list of one or more transcriptions, rendered in Hiragana or Katakana (applicable to Japanese only).
The decomposition pattern and the readings are optional, but you must include a decomposition pattern if you include readings. In other words, an entry must contain all four elements to be used in a reading user dictionary, even though the reading user dictionary does not use the POS tag or the decomposition pattern. To include an entry in a segmentation (tokenization) user dictionary, you only need the word, a POS tag, and an optional decomposition pattern. Entries that include all four elements can be used in both a segmentation (tokenization) user dictionary and a reading user dictionary.
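For illustration only, an entry line in this format can be split into its fields as follows. The class is hypothetical, not the RBL-JE dictionary compiler, and it does not normalize the full-width digits and letters that are also allowed (see below):

import java.util.Arrays;
import java.util.List;

// Illustrative parser for one Chinese/Japanese user dictionary source line:
// word TAB POS [TAB DecompPattern [TAB Reading1,Reading2,...]]
final class CjUserDictionaryLine {
    final String word;
    final String pos;
    final int[] decompPattern;   // empty when no pattern is given
    final List<String> readings; // empty when no readings are given (Japanese only)

    CjUserDictionaryLine(String line) {
        String[] fields = line.split("\t");
        word = fields[0];
        pos = fields[1];
        decompPattern = fields.length > 2
                ? Arrays.stream(fields[2].split(",")).mapToInt(Integer::parseInt).toArray()
                : new int[0];
        readings = fields.length > 3
                ? Arrays.asList(fields[3].split(","))
                : List.of();
    }
}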
Chinese User Dictionary POS Tags
Japanese User Dictionary POS Tags
AN (adjectival noun)
HS (honorific suffix)
V1 (vowel-stem verb)
VN (verbal noun)
VX (irregular verb)
Note: For examples of standard (non-user-dictionary) use of the one- and two-letter POS tags in the preceding list, see Japanese POS Tags.
Examples (the last three entries include readings):
!DICT_LABEL New Words 2014
デジカメ NOUN 0
東京証券取引所 ORGANIZATION 2,2,3
狩野 SURNAME 0
安倍晋三 PERSON 2,2 あべしんぞう
麻垣康三 PERSON 2,2 あさがきこうぞう
商人 NOUN 0 しょうにん,あきんど
The POS and decomposition pattern can be in full-width numerals and Roman letters. For example:
東京証券取引所 ｏｒｇａｎｉｚａｔｉｏｎ ２,２,３
The "2,2,3" decomposition pattern instructs the tokenizer to decompose this compound entry into
Valid Characters for Chinese and Japanese User Dictionary Entries
Every character in a Chinese or Japanese user dictionary entry must correspond to one of the following Unicode code points, to a valid surrogate pair, or to a letter or decimal digit in Latin script. In this listing,
.. indicates an inclusive range of valid code points:
0025..0039, 0040..005A, 005F..007A, 007E, 00B7, 0370..03FF, 0400..04FF, 2010..206F, 2160..217B, 2200..22FF, 2460..24FF, 25A0..25FF, 2600..26FF, 3003..3007, 3012, 3020, 3031..3037, 3041..3094, 3099..309E, 30A1..30FA, 30FC..30FE, 3200..32FF, 3300..33FF, 4E00..9FFF, D800..DBFF, DC00..DFFF, E000..F8FF, F900..FA2D, FF00, FF02..FFEF
For example, the full stop 。 (3002) indicates a sentence break and must not be included in a dictionary entry. The Katakana middle dot ・ (30FB) must not appear in a dictionary entry either; input strings containing this character match the corresponding dictionary entries without it.
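A rough sketch of such a validity check is shown below. It is illustrative only: it spells out just a few of the ranges in the listing above, and it is not the check that the dictionary compiler performs:

// Illustrative check of whether a code point falls in one of the valid ranges.
final class ValidCodePointExample {
    private static final int[][] RANGES = {
        {0x0025, 0x0039}, {0x0040, 0x005A}, {0x005F, 0x007A},
        {0x3041, 0x3094}, {0x30A1, 0x30FA}, {0x30FC, 0x30FE},
        {0x4E00, 0x9FFF}, {0xFF02, 0xFFEF},
        // ... remaining ranges and single code points from the listing above
    };

    static boolean isValid(int codePoint) {
        for (int[] range : RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    static boolean isValidEntry(String entry) {
        // The full stop 。 (3002) and the Katakana middle dot ・ (30FB), for example,
        // fall outside the valid ranges and would be rejected.
        return entry.codePoints().allMatch(ValidCodePointExample::isValid);
    }
}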