There are two standard forms of written Chinese: Simplified Chinese (SC) and Traditional Chinese (TC). SC is used in the People’s Republic of China (PRC), normally employing the GB2312-80 or GBK character set. TC is used in Taiwan, Hong Kong, and Macau, normally employing the Big Five character set.
Conversion from one script to another is a complex matter. The main problem of SC to TC conversion is that the mapping is one-to-many. For example, the simplified form 发 maps to either of the traditional forms 發 or 髮. Conversion must also deal with vocabulary differences and context-dependence.
The Chinese Script Converter converts text in simplified script to text in traditional script, or vice versa. The conversion can be on any of three levels:
Codepoint Conversion. Codepoint conversion uses a mapping table to convert characters on a codepoint-by-codepoint basis. For example, the simplified form 头发 might be converted to a traditional form by first mapping 头 to 頭, and then 发 to either 髮 or 發. Using this approach, however, there is no recognition that 头发 is a word, so the choice could be 發, in which case the end result 頭發 is nonsense. On the other hand, the choice of 髮 leads to errors for other words. So while conversion mapping is straightforward, it is unreliable.
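The pitfall can be shown with a self-contained sketch in plain Java (this is not the CSC API; the table contents, and the choice of 發 for 发, are assumptions made to demonstrate the failure):

```java
import java.util.HashMap;
import java.util.Map;

public class NaiveCodepointConversion {
    // A naive SC-to-TC codepoint table: 发 must commit to one of 發/髮 up front.
    private static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        TABLE.put('头', '頭');
        TABLE.put('发', '發'); // the wrong choice for the word 头发 "hair"
    }

    public static String convert(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            out.append(TABLE.getOrDefault(c, c)); // unmapped characters pass through
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Produces the nonsense form 頭發; the correct word is 頭髮.
        System.out.println(convert("头发"));
    }
}
```

Because the table commits to one traditional form per codepoint, it cannot get both 头发 (hair) and a word requiring 發 right at the same time.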
Orthographic Conversion. The second level of conversion is orthographic. This level relies upon identification of the words in the input text. Within each word, orthographic variants of each character may be reflected in the conversion. In the above example, 头发 is identified as a word and is converted to a traditional variant of the word, 頭髮. There is no basis for converting it to 頭發, because the conversion considers the word as a whole rather than as a collection of individual characters.
Lexemic Conversion. The third level of conversion is lexemic. This level also relies upon identification of words. But rather than converting a word to an orthographic variant, the aim here is to convert it to an entirely different word. For example, "computer" is usually 计算机 in SC but 電腦 in TC. Whereas codepoint conversion is strictly character-by-character and orthographic conversion is character-by-character within a word, lexemic conversion is word-by-word.
If you request a lexemic conversion and none is available for a given token, CSC falls back to the orthographic conversion; if that too is unavailable, CSC falls back to a codepoint conversion.
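The fallback order can be sketched outside the CSC API with hypothetical per-level tables (the table contents below are assumptions for illustration only): try a lexemic entry first, then an orthographic entry, then fall back to codepoint mapping.

```java
import java.util.HashMap;
import java.util.Map;

public class ConversionFallback {
    // Hypothetical SC-to-TC tables, one per conversion level.
    private static final Map<String, String> LEXEMIC = new HashMap<>();
    private static final Map<String, String> ORTHOGRAPHIC = new HashMap<>();
    private static final Map<Character, Character> CODEPOINT = new HashMap<>();
    static {
        LEXEMIC.put("计算机", "電腦");    // "computer": a different word in TC
        ORTHOGRAPHIC.put("头发", "頭髮"); // same word, TC orthographic variant
        CODEPOINT.put('头', '頭');
        CODEPOINT.put('发', '發');
    }

    public static String convert(String token) {
        String lexemic = LEXEMIC.get(token);
        if (lexemic != null) return lexemic;           // 1. lexemic, if available
        String orthographic = ORTHOGRAPHIC.get(token);
        if (orthographic != null) return orthographic; // 2. orthographic fallback
        StringBuilder out = new StringBuilder();       // 3. codepoint fallback
        for (char c : token.toCharArray()) {
            out.append(CODEPOINT.getOrDefault(c, c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("计算机")); // lexemic hit
        System.out.println(convert("头发"));   // orthographic fallback
        System.out.println(convert("发"));     // codepoint fallback
    }
}
```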
Options. When you create a script converter, you must define three options: source script, target script, and conversion level.
Output. For each token in the input, the Chinese Script Converter posts the conversion.
Mixed Input. The Chinese input may contain a mixture of TC and SC, and even some non-Chinese text. The Chinese Script Converter converts to the target (SC or TC), leaving any tokens already in the target form and any non-Chinese text unchanged.
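The pass-through policy for non-Chinese text can be approximated in plain Java (a sketch of the policy, not the library's detection logic): a token containing no Han characters is not a candidate for conversion and is returned unchanged.

```java
public class HanDetector {
    // True if any code point in the token belongs to the Han script.
    public static boolean containsHan(String token) {
        return token.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    public static void main(String[] args) {
        System.out.println(containsHan("头发"));   // Chinese token, eligible for conversion
        System.out.println(containsHan("RBL-JE")); // non-Chinese, left unchanged
    }
}
```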
Using CSC with RBL-JE Core
In conjunction with the RBL-JE Tokenizer, you use the CSCAnalyzer as described below.

1. Set up a TokenizerFactory to create a com.basistech.rosette.bl.Tokenizer to tokenize Chinese text.
2. Set up a com.basistech.rosette.csc.CSCAnalyzerFactory with a conversion level.
3. Use the CSCAnalyzerFactory to create a com.basistech.rosette.csc.CSCAnalyzer to convert from TC to SC or vice versa.
4. Use the CSCAnalyzer to analyze each com.basistech.rosette.bl.Token found by the Tokenizer.
5. Get the conversion (SC or TC) from each Token.
Example: convert the tokens in TC text to SC, using the orthographic conversion level.
RBL-JE Core Distribution Sample. The RBL-JE distribution includes a sample (CSCAnalyze) that you can compile and run with an ant build script.
In a Bash shell (Unix) or Command Prompt (Windows), navigate to rbl-je-<rblversion>/samples/csc-analyze and call ant.
The sample reads an input file in SC and prints each token with its TC conversion to standard out.
Using CSC with RBL-JE Lucene

1. Set up a TokenizerFactory to create a com.basistech.rosette.lucene.BaseLinguisticsTokenizer, which contains a Lucene Tokenizer.
2. Set up a BaseLinguisticsCSCTokenFilterFactory to create a com.basistech.rosette.lucene.BaseLinguisticsCSCTokenFilter to convert from TC to SC or vice versa.
3. Use the BaseLinguisticsCSCTokenFilter to convert each Token found by the BaseLinguisticsTokenizer.
Example: convert the tokens in TC text to SC, using the orthographic conversion level:
RBL-JE/Lucene Distribution Sample. For supported versions of Lucene, the RBL-JE distribution includes a sample (CSCCharTermAttributeSample) that you can compile and run with an ant build script.
In a Bash shell (Unix) or Command Prompt (Windows), navigate to the samples directory (rbl-je-<rblversion>/samples), then to the subdirectory for your version of Lucene (csc-analyze-<luceneversion>), and call ant.
The sample reads an input file in SC and prints the TC conversion for each token to standard out.
You may create and use user dictionaries for converting from TC to SC and vice versa. A CSC user dictionary supports orthographic conversion and, optionally, lexemic conversion. It is not used for codepoint conversion.
Creating a CSC User Dictionary
The source file for a CSC user dictionary is UTF-8 encoded. The file may begin with a byte order mark (BOM). Each entry is a single line. Empty lines are ignored. The source file must be compiled into a binary format, as described below.
Each entry contains two or three tab-delimited elements:
input_token orthographic_translation [lexemic_translation]
If the input_token is TC, then the orthographic_translation and optional lexemic_translation should be SC. Or vice versa.
Sample entries for a TC to SC user dictionary:
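The sample entries themselves did not survive in this copy; as an illustration built only from words used earlier in this section, a TC-to-SC dictionary could contain entries such as (elements separated by tabs; the second entry has no lexemic translation):

```
電腦	电脑	计算机
頭髮	头发
```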
Compiling a CSC User Dictionary. In the tools/bin directory, RBL-JE includes a shell script for Unix and a .bat file for Windows.
The script uses Java to compile the user dictionary. The operation is performed in memory, so you may require more than the default heap size. You can set the heap size with the JAVA_OPTS environment variable. For example, to provide 8 GB of heap, set JAVA_OPTS to -Xmx8g.

Unix shell:

export JAVA_OPTS=-Xmx8g

Windows command prompt:

set JAVA_OPTS=-Xmx8g
Compile the CSC user dictionary from the RBL-JE root directory:
tools/bin/rbl-build-csc-dictionary INPUT_FILE OUTPUT_FILE
INPUT_FILE is the pathname of the source file you have created, and
OUTPUT_FILE is the pathname of the binary compiled dictionary the tool creates. For example:
tools/bin/rbl-build-csc-dictionary my_tc2sc.txt my_tc2sc.bin
Activating CSC User Dictionaries
Using RBL-JE Core. The com.basistech.rosette.csc.CSCAnalyzerFactory provides a method for loading a user-defined CSC dictionary:
void addUserDefinedDictionary(LanguageCode language, LanguageCode targetLanguage, String path);
For example, using the dictionary compiled above:

CSCAnalyzerFactory caf = new CSCAnalyzerFactory();
caf.addUserDefinedDictionary(LanguageCode.TRADITIONAL_CHINESE, LanguageCode.SIMPLIFIED_CHINESE, "my_tc2sc.bin");
Using RBL-JE Lucene. You can use the com.basistech.rosette.lucene.BaseLinguisticsCSCTokenFilterFactory constructor to add user-defined CSC dictionaries.
public BaseLinguisticsCSCTokenFilterFactory(Map<String, String> args);
The args Map may include an entry that specifies the path to a compiled CSC user dictionary.
Map<String, String> args = new HashMap<String, String>();
// populate args with the desired options, including the user dictionary path
BaseLinguisticsCSCTokenFilterFactory cscTokenFilterFactory =
    new BaseLinguisticsCSCTokenFilterFactory(args);
Dynamic CSC User Dictionaries
You can create or add values to the CSC user dictionaries at runtime, instead of creating and compiling the dictionaries in advance. Dynamic CSC user dictionaries support both orthographic and lexemic conversions. Dynamic dictionaries follow the same structure as the compiled user-defined dictionaries.
Use CSCCandidateGeneratorFactory#addDynamicCscDictionary to create a new, empty DynamicUserDictionary.
Example of creating and populating a dynamic CSC user dictionary:
CSCAnalyzerFactory factory = new CSCAnalyzerFactory();
LanguageCode language = LanguageCode.TRADITIONAL_CHINESE;
LanguageCode targetLanguage = LanguageCode.SIMPLIFIED_CHINESE;
DynamicUserDictionary dictionary = factory.addDynamicCscDictionary(language, targetLanguage);
CSCAnalyzer analyzer = factory.create(language, targetLanguage);
dictionary.add("電腦", "电脑", "计算机");