There are two standard forms of written Chinese: Simplified Chinese (SC) and Traditional Chinese (TC). SC is used in Mainland China. TC is used in Taiwan, Hong Kong, and Macau.
Conversion from one script to another is a complex matter. The main problem of SC to TC conversion is that the mapping is one-to-many. For example, the simplified form 发 maps to either of the traditional forms 發 or 髮. Conversion must also deal with vocabulary differences and context-dependence.
The Chinese Script Converter converts text in simplified script to text in traditional script, or vice versa. The conversion can be performed at any of three levels, listed here from simplest to most complex:
-
Codepoint Conversion: Codepoint conversion uses a mapping table to convert characters on a codepoint-by-codepoint basis. For example, the simplified form 头发 might be converted to a traditional form by first mapping 头 to 頭, and then 发 to either 髮 or 發. Using this approach, however, there is no recognition that 头发 is a word, so the choice could be 發, in which case the end result 頭發 is nonsense. On the other hand, the choice of 髮 leads to errors for other words. So while codepoint mapping is straightforward, it is unreliable.
-
Orthographic Conversion: The second level of conversion is orthographic. This level relies upon identification of the words in the input text. Within each word, orthographic variants of each character may be reflected in the conversion. In the above example, 头发 is identified as a word and is converted to a traditional variant of the word, 頭髮. There is no basis for converting it to 頭發, because the conversion considers the word as a whole rather than as a collection of individual characters.
-
Lexemic Conversion: The third level of conversion is lexemic. This level also relies upon identification of words. But rather than converting a word to an orthographic variant, the aim here is to convert it to an entirely different word. For example, "computer" is usually 计算机 in SC but 電腦 in TC. Whereas codepoint conversion is strictly character-by-character and orthographic conversion is character-by-character within a word, lexemic conversion is word-by-word.
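As a toy illustration of the difference between the first two levels, consider a character table versus a word table. The class and mapping tables below are hypothetical stand-ins for illustration, not RBL code or data:

```java
import java.util.HashMap;
import java.util.Map;

// Toy SC-to-TC converter contrasting codepoint conversion with
// word-level (orthographic) conversion. The mapping tables here are
// hypothetical illustrations, not RBL data.
public class ScriptConversionDemo {

    // One-to-one character table; 发 is naively mapped to 發.
    private static final Map<Character, Character> CHAR_MAP = new HashMap<>();
    // Word-level table: the word as a whole selects the right variants.
    private static final Map<String, String> WORD_MAP = new HashMap<>();

    static {
        CHAR_MAP.put('头', '頭');
        CHAR_MAP.put('发', '發');
        WORD_MAP.put("头发", "頭髮");
    }

    // Codepoint conversion: character by character, no word context.
    static String codepointConvert(String sc) {
        StringBuilder out = new StringBuilder();
        for (char c : sc.toCharArray()) {
            out.append(CHAR_MAP.getOrDefault(c, c));
        }
        return out.toString();
    }

    // Orthographic conversion: look up the whole word first, falling
    // back to codepoint conversion for unknown words.
    static String orthographicConvert(String word) {
        return WORD_MAP.getOrDefault(word, codepointConvert(word));
    }

    public static void main(String[] args) {
        System.out.println(codepointConvert("头发"));    // 頭發 -- nonsense
        System.out.println(orthographicConvert("头发")); // 頭髮 -- correct
    }
}
```

Because the word table is consulted as a whole, the ambiguous character 发 can never be mapped to the wrong variant inside a known word.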
Note
If the converter cannot provide the level of conversion requested (lexemic or orthographic), the next simpler level of conversion is provided.
For example, if you ask for a lexemic conversion, and none is available for a given token, CSC provides the orthographic conversion unless it is not available, in which case CSC provides a codepoint conversion.
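The fallback order described above can be sketched as a chain of lookups. The tables and method below are hypothetical illustrations of the concept, not RBL internals:

```java
import java.util.Map;

// Sketch of the per-token fallback described above: try the lexemic
// table first, then the orthographic table, then convert codepoint by
// codepoint. All tables here are hypothetical stand-ins.
public class FallbackDemo {

    static String convert(String token,
                          Map<String, String> lexemic,
                          Map<String, String> orthographic,
                          Map<Character, Character> codepoint) {
        String result = lexemic.get(token);          // most complex level
        if (result != null) {
            return result;
        }
        result = orthographic.get(token);            // next simpler level
        if (result != null) {
            return result;
        }
        StringBuilder out = new StringBuilder();     // simplest level
        for (char c : token.toCharArray()) {
            out.append(codepoint.getOrDefault(c, c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> lex = Map.of("计算机", "電腦");
        Map<String, String> ortho = Map.of("头发", "頭髮");
        Map<Character, Character> cp = Map.of('发', '發', '头', '頭');
        System.out.println(convert("计算机", lex, ortho, cp)); // lexemic hit
        System.out.println(convert("头发", lex, ortho, cp));   // orthographic hit
        System.out.println(convert("发", lex, ortho, cp));     // codepoint fallback
    }
}
```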
The Chinese input may contain a mixture of TC and SC, and even some non-Chinese text. The Chinese Script Converter converts to the target (SC or TC), leaving any tokens already in the target form and any non-Chinese text unchanged.
Table 24. CSC Options

| Option | Description | Type (Default) | Supported Languages |
| --- | --- | --- | --- |
| conversionLevel | Indicates the most complex conversion level to use | CSConversionLevel (lexemic) | Chinese |
| language | The language from which the CSCAnalyzer is converting | LanguageCode | Chinese, Simplified Chinese, Traditional Chinese |
| targetLanguage | The language to which the CSCAnalyzer is converting | LanguageCode | Chinese, Simplified Chinese, Traditional Chinese |
See Initial and Path Options for additional options.
Enum Classes:
-
BaseLinguisticsOption
-
CSCAnalyzerOption
Using CSC with the ADM API
This example uses the BaseLinguisticsFactory and CSCAnalyzer.
-
Create a BaseLinguisticsFactory and set the required options.
final BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
factory.setOption(BaseLinguisticsOption.licensePath, licensePath);
factory.setOption(BaseLinguisticsOption.conversionLevel, CSConversionLevel.orthographic.levelName());
factory.setOption(BaseLinguisticsOption.language, LanguageCode.SIMPLIFIED_CHINESE.ISO639_3());
factory.setOption(BaseLinguisticsOption.targetLanguage, LanguageCode.TRADITIONAL_CHINESE.ISO639_3());
-
Create an annotator to get translations.
final EnumMap<BaseLinguisticsOption, String> options = new EnumMap<>(BaseLinguisticsOption.class);
options.put(BaseLinguisticsOption.language, LanguageCode.SIMPLIFIED_CHINESE.ISO639_3());
options.put(BaseLinguisticsOption.targetLanguage, LanguageCode.TRADITIONAL_CHINESE.ISO639_3());
final Annotator cscAnnotator = factory.createCSCAnnotator(options);
-
Annotate the input text for tokens and translations.
final AnnotatedText annotatedText = cscAnnotator.annotate(inputText);
final Iterator<Token> tokenIterator = annotatedText.getTokens().iterator();
for (final String translation : annotatedText.getTranslatedTokens().get(0).getTranslations()) {
final String originalToken = tokenIterator.hasNext() ? tokenIterator.next().getText() : "";
outputData.format(OUTPUT_FORMAT, originalToken, translation);
}
Using CSC with the Classic API
The RBL distribution includes a sample (CSCAnalyze) that you can compile and run with an ant build script.
In a Bash shell (Unix) or a Command Prompt (Windows), navigate to rbl-je-<version>/samples/csc-analyze
and call
ant run
The sample reads an input file in SC and prints each token with its TC conversion to standard output.
The following example tokenizes Chinese text and converts from TC to SC.
-
Set up a BaseLinguisticsFactory.
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
factory.setOption(BaseLinguisticsOption.licensePath, licensePath);
-
Use the BaseLinguisticsFactory to create a Tokenizer to tokenize Chinese text.
Tokenizer tokenizer = factory.create(new StringReader(tcInput), LanguageCode.CHINESE);
-
Use the BaseLinguisticsFactory to create a CSCAnalyzer to convert from TC to SC.
CSCAnalyzer cscAnalyzer =
factory.createCSCAnalyzer(LanguageCode.TRADITIONAL_CHINESE, LanguageCode.SIMPLIFIED_CHINESE);
-
Use the CSCAnalyzer to analyze each Token found by the Tokenizer.
Token token;
while ((token = tokenizer.next()) != null) {
String tokenIn = new String(token.getSurfaceChars(),
token.getSurfaceStart(),
token.getLength());
System.out.println("Input: " + tokenIn);
cscAnalyzer.analyze(token);
-
Get the conversion (SC or TC) from each Token.
    System.out.println("SC translation: " + token.getTranslation());
}
CSC User Dictionaries
CSC user dictionaries support orthographic and lexemic conversions between Simplified Chinese and Traditional Chinese. They are not used for codepoint conversion.
CSC user dictionaries follow the same format as other user dictionaries:
-
The source file is UTF-8 encoded.
-
The file may begin with a byte order mark (BOM).
-
Each entry is a single line.
-
Empty lines are ignored.
Once complete, the source file is compiled into a binary format for use in RBL.
Each entry contains two or three tab-delimited elements:
input_token orthographic_translation [lexemic_translation]
The input_token is the form you are converting from; the orthographic_translation and the optional lexemic_translation are the forms you are converting to.
Sample entries for a TC to SC user dictionary:
電腦 电脑 计算机
宇宙飛船 宇宙飞船
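A minimal sketch of how such a tab-delimited entry splits into its fields (illustrative only; this parser is not the RBL dictionary compiler):

```java
// Minimal sketch of parsing one CSC user dictionary source line into
// its tab-delimited fields. Illustrative only; not the RBL compiler.
public class CscEntryParser {

    // Returns [input_token, orthographic_translation] or
    // [input_token, orthographic_translation, lexemic_translation].
    static String[] parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 2 || fields.length > 3) {
            throw new IllegalArgumentException(
                "Expected 2 or 3 tab-delimited fields, got: " + fields.length);
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] withLexemic = parse("電腦\t电脑\t计算机");
        System.out.println(withLexemic[0] + " -> " + withLexemic[1]
                + " (lexemic: " + withLexemic[2] + ")");
        String[] orthoOnly = parse("宇宙飛船\t宇宙飞船");
        System.out.println(orthoOnly[0] + " -> " + orthoOnly[1]);
    }
}
```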
Compiling a CSC User Dictionary. In the tools/bin directory, RBL includes a shell script for Unix (rbl-build-csc-dictionary) and a .bat file for Windows (rbl-build-csc-dictionary.bat).
The script uses Java to compile the user dictionary. The operation is performed in memory, so you may require more than the default heap size. You can set the heap size with the JAVA_OPTS environment variable. For example, to provide 8 GB of heap, set JAVA_OPTS to -Xmx8g.
Unix shell:
export JAVA_OPTS=-Xmx8g
Windows command prompt:
set JAVA_OPTS=-Xmx8g
Compile the CSC user dictionary from the RBL root directory:
tools/bin/rbl-build-csc-dictionary INPUT_FILE OUTPUT_FILE
INPUT_FILE is the pathname of the source file you have created, and OUTPUT_FILE is the pathname of the binary compiled dictionary the tool creates. For example:
tools/bin/rbl-build-csc-dictionary my_tc2sc.txt my_tc2sc.bin
Table 25. CSC User Dictionary API

| Class | Method | Task |
| --- | --- | --- |
| BaseLinguisticsFactory | addUserCscDictionary | Add a user CSC dictionary for a given language. |
| BaseLinguisticsFactory | addDynamicCscDictionary | Add a dynamic CSC dictionary. |