Rosette Name Translator (RNT) supports name translation for languages written in complex, non-Latin scripts, such as Arabic and Chinese. See Supported Languages of Origin for the complete list of supported languages and scripts. RNT supports multiple transliteration standards for translating from non-Latin scripts to English.
The type of translation depends on the characteristics of the source and target domains and the language of origin of the name to be translated. RNT supports the following types of translations:
Translation of a Person Name to English
How the name is translated depends on whether the language of origin of the name is the source language.
-
If the language of origin of the person name is the same as the source language, the name is translated according to the specified target transliteration scheme. An example is a Japanese name that appears in Japanese text.
-
If the language of origin is not the source language, the name is translated to its conventional English form. For example, a non-Japanese name that appears in Japanese.
RNT supports the following translations of names to their conventional English representation:
-
non-Arabic names that appear in the Arabic language
-
non-Chinese names that appear in the Chinese language
-
non-Hebrew names that appear in the Hebrew language
-
non-Japanese names that appear in the Japanese language
-
non-Korean names that appear in the Korean language
-
non-Russian names that appear in the Russian language
Use the languageOfOrigin Name field to inform RNT that the language of origin is not the language of use in which the name appears.
-
If the language of origin is Unknown (the default), the language model may classify the name as foreign (for Japanese, the script must be Katakana).
-
If the language of use is Japanese, the script is Kanji, and the language of origin is Chinese or Korean, RNT attempts to translate the name, using Pinyin for Chinese, and Revised Romanization of Korean for Korean.
-
If the language of use is Chinese and the language of origin is anything other than Chinese, RNT attempts to translate the name to its standard English representation.
-
If the language of use is Korean, the script is Hangul, and the language of origin is any language other than Korean, RNT attempts to translate the name to its standard English representation.
-
For other languages, RNT uses the specified target transliteration scheme to transliterate the name to Latin script, regardless of whether or not the name is etymologically native to the respective source language.
Example - Arabic:
Source domain: Arabic language, Arabic script, native transliteration scheme.
Target domain: English language, Latin script, IC transliteration scheme.
The translation of جورج بوش is George Bush. Note: The IC transliteration is Jwrj Bwsh.
The translation of صفية طالب السهيل (an Arabic name) is the IC transliteration: Safiyyah Talib al-Suhayl.
Example - Pashto with IC transliteration scheme:
For Pashto, if you are using the IC transliteration scheme and the language of origin is Afghan Persian, RNT provides special handling of two short vowels, using 'e' and 'o' in place of 'i' and 'u', as designated in the IC Pashto Standardized Transliteration System for Personal Names.
Source domain: Pashto language, Arabic script, native transliteration scheme.
Target domain: English language, Latin script, IC transliteration scheme.
The standard translation of اسحاق is Ishaq. If the language of origin is Afghan Persian, the translation is Eshaq.
Example - Japanese, Katakana:
Source domain: Japanese language, Katakana script, native transliteration scheme.
Target domain: English language, Latin script, Hebon transliteration scheme.
The translation of ウィリアム・シェイクスピアー is William Shakespeare. Note: The Hebon transliteration is Iriamu Shieikusupiaa.
Example - Japanese, Kanji:
Source domain: Japanese language, Kanji script, native transliteration scheme.
Target domain: English language, Latin script, Hebon transliteration scheme.
With Chinese as the language of origin, the translation (Pinyin transliteration) of 温家宝 is Wen Jiabao. Note: The Hebon transliteration of 温家宝 is On Kahou.
Example - Russian:
Source domain: Russian language, Cyrillic script, native transliteration scheme.
Target domain: English language, Latin script, BGN transliteration scheme.
The translation of Маргарет Этвуд is Margaret Atwood. Note: The BGN transliteration is Margaret Etvud.
The translation of Алекса́ндр Солжени́цын (a Russian name) is the BGN transliteration: Aleksándr Solzhenítsyn.
Example - Thai:
Source domain: Thai language, Thai script, native transliteration scheme.
Target domain: English language, Latin script, ISO11940_2_2007 transliteration scheme.
The translation of นายก รัฐมนตรี (a Thai name) is the ISO11940_2_2007 transliteration: Nayok Ratthamontri.
Example - Greek:
Source domain: Greek language, Greek script, native transliteration scheme.
Target domain: English language, Latin script, ISO843_1997 transliteration scheme.
The translation of Γεώργιος Αθανασιάδης-Νόβας (a Greek name) is the ISO843_1997 transliteration: Geōrgios Athanasiadīs-Novas.
Example - Hebrew:
Source domain: Hebrew language, Hebrew script, English language of origin, native transliteration scheme.
Target domain: English language, Latin script, ISO259_2_1994 transliteration scheme.
The translation of ברברה סטרייסנד is Barbara Streisand. Note: The ISO259_2_1994 transliteration is Brbrah Sṭriysnd.
Note that the translation to Barbara Streisand will be returned only if the user specifies the language of origin as English.
Translation from Native Script to Latin Script
This type of translation applies when the source script and transliteration scheme are native, the target script is Latin, the target transliteration scheme is something other than native, and the name is native to the source language.
Examples:
Source domain: Arabic language, Arabic script, native transliteration.
Target domain: English language, Latin script, IC transliteration.
The translation of صفية طالب السهيل is Safiyyah Talib al-Suhayl.
Reverse Transliterations from Latin Script to Native Script
Some transliteration schemes provide enough information to enable reverse transcription, going from English and Latin script to a native script.
Examples:
Source domain: English language, Latin script, Basis transliteration.
Target domain: Arabic language, Arabic script, native transliteration.
The translation of naayif abuu sharkh is نَايِف أَبُو شَرْخ.
Source domain: English language, Latin script, Basis transliteration.
Target domain: Russian language, Cyrillic script, native transliteration.
The translation of Dmitry Medvedev is Дмитрий Медведев.
Standardization of Arabic-origin Names in English
This translation takes a name in English that is of Arabic origin and translates the Arabic components according to the specified transliteration scheme.
Example:
Source domain: English language, Latin script, native transliteration.
Target domain: English language, Latin script, IC transliteration.
The IC standardization of Moustephah Ehmed ben Samire is Mustafa Ahmad Bin-Samir.
Orthographic completion is available if the source and target languages are Arabic, Hebrew, Iranian Persian, Afghan Persian, Pashto, or Urdu, the source and target scripts are Arabic or Hebrew, and the source and target transliteration schemes are native.
-
Language: Source and target language is Arabic, Hebrew, Iranian Persian, Afghan Persian, Pashto, or Urdu.
-
Script: Source and target script is Arabic or Hebrew script.
-
Transliteration: Source and target transliteration scheme is native.
In conventional Arabic and Hebrew script, short vowels and other diacritics are not included. In orthographic completion, the translator attempts to vocalize the names by adding the short vowels and other diacritics that don't appear in conventional Arabic and Hebrew script.
You can also perform orthographic completion as part of the translation process. See Translation options.
Arabic, Chinese, Japanese, and Korean names are often unsegmented; that is, there are no spaces between the words in the name. The translator attempts to segment the unsegmented names by adding spaces between the words in the name.
Segmentation is available when the source and target languages are Arabic, Chinese, Japanese, or Korean and the source and target transliteration schemes are native.
You can also perform segmentation as part of the translation process. See Translation options.
Variant Latin-Script Representations of Name in non-Latin Script
-
Language: Source and target language is Arabic, Iranian Persian, Afghan Persian, Pashto, or Urdu.
-
Script: Source script is Arabic script; the target script is Latin script.
-
Transliteration: Source transliteration scheme is native; the target transliteration scheme is folk.
The variants of a multi-word name include the cross product of the variants of each word. If, for example, each word in a two-word name has 10 variants, the name has 100 variants. Accordingly, it is a good idea to translate one word at a time. A two-word translation must produce 100 variants to provide the same information as two one-word translations producing 10 variants each.
Use the com.basistech.rnt.ITranslator setMaximumResults method to control the number of variants that are returned.
Example:
Source domain: Arabic language, Arabic script, native transliteration.
Target domain: Arabic language, Latin script, folk transliteration.
نبيل شعث contains two words. Ten variants of نبيل are nabil, nabile, nabille, nabeel, nabiyl, nabiyle, nabiylle, nebil, nebile, nebille. Ten variants of شعث are sha`ath, sha'ath, shaath, sha`th, sha'th, shath, cha`ath, cha'ath, chaath, cha`th. These translations use the orthographic completion option, which is turned on by default.
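The growth in the number of variants can be checked with a small, self-contained Java sketch. The class and method names are hypothetical and are not part of the RNT API; the variant lists are the ones quoted in this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the RNT API: combines per-word variant
// lists into full-name variants to show how quickly the count grows.
final class VariantCrossProduct {

    // Returns every combination of one variant per word, joined by spaces.
    static List<String> combine(List<List<String>> variantsPerWord) {
        if (variantsPerWord.isEmpty()) {
            return new ArrayList<>();
        }
        List<String> names = new ArrayList<>();
        names.add("");
        for (List<String> wordVariants : variantsPerWord) {
            List<String> next = new ArrayList<>();
            for (String prefix : names) {
                for (String variant : wordVariants) {
                    next.add(prefix.isEmpty() ? variant : prefix + " " + variant);
                }
            }
            names = next;
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> first = List.of("nabil", "nabile", "nabille", "nabeel", "nabiyl",
                "nabiyle", "nabiylle", "nebil", "nebile", "nebille");
        List<String> second = List.of("sha`ath", "sha'ath", "shaath", "sha`th", "sha'th",
                "shath", "cha`ath", "cha'ath", "chaath", "cha`th");
        List<String> all = combine(List.of(first, second));
        System.out.println(all.size());  // prints 100
        System.out.println(all.get(0));  // prints nabil sha`ath
    }
}
```

Two one-word translations carry the same information in 20 strings, which is why translating one word at a time is suggested.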
With automated translation, the client provides one or more names, input and output text domains, and types of translation desired. For each name, the application generates a list of translations and associated confidence scores.
Automated usage model for performing RNT translations:
-
Set up your environment.
You must define the directory in which you installed RNI-RNT ($BT_ROOT), and instantiate an Environment object. See Handling the Runtime Environment.
-
Create a Translator Factory and use it to instantiate a Translator.
A given translator can perform translations from one source text domain to one target text domain. The Java API includes support for creating a Translator wrapper that can handle multiple source and target domains.
-
Set translation options (or use the default option settings).
For a listing of the source and target language domains to which each of these translation options applies, see Supported Translation Option Domains.
-
Use the Translator to translate names from the source domain to the target domain.
-
Handle the list of one or more translation results that the Translator generates for each translation. Each result is tagged with a confidence score between 0 and 1.0. The higher the number, the higher the confidence that this is a result of interest. The sum of the confidence score for all the results that the Translator can generate is less than or equal to 1.0.
-
Release resources, such as the Translator and Environment.
translator.close();
environment.close();
Multithreading
RNT translators are multithreadable.
Java Packages: The RNT classes are in com.basistech.rnt and com.basistech.rnt.options (translation options). Utility classes that RNT uses are in com.basistech.util.
Note: Unqualified class names that appear in this section are in the com.basistech.rnt and com.basistech.rnt.options packages.
For detailed information about the API, see the Java API Reference.
RNT provides a factory class for creating translators. The factory is responsible for instantiating the correct RNT internal implementation class, which may vary depending on the source and target text domains you specify. For a table that maps input domains to output domains, see Supported Translation Domains.
The following fragment uses the factory to create a translator for translating names from Arabic documents in Arabic script to their standard English form in Latin script, using the IC transliteration scheme:
https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/create_translator.java
When you are done using the translator, close it:
translator.close();
To create a wrapper object that packages a number of Translators, use RuleSetTranslator and define a list of TranslationRule objects. Each TranslationRule specifies the transliteration scheme for the specified language domain and entity type (NEConstants.NE_TYPE_NONE for all entity types).
The translation options are defined in the package com.basistech.rnt.options.
-
Orthographic Completion. Class: CompleteOrthographyOption
For Arabic, Iranian Persian, Afghan Persian, Pashto, or Urdu names in Arabic script, you can set the option to perform orthographic completion (add short vowels and other diacritics) prior to translation. The translator infers an orthographic completion if it cannot locate the name in its Arabic dictionary. For Pashto or Urdu, the translator omits orthographic completion for any named elements it cannot locate in the appropriate dictionary. The default setting for this option is true.
Given the lack of clear diacritization standards for Iranian Persian, Afghan Persian, Pashto and Urdu, the orthographic completion for these languages reflects BasisTech standards to assist the translation process, and is not intended for external use.
Suppose you are processing نايف أبو شرخ. As is the case in conventional Arabic, this text is not vocalized. For some transliteration schemes (such as IC) the transliteration of unvocalized Arabic is undefined. Without orthographic completion, the translator produces NAyf 'Bw Shrkh. With the orthographic completion option, the Translator adds the missing vowels (giving نَايِف أَبُو شَرْخ) and produces the correct IC transliteration: Nayif Abu-Sharkh.
Orthographic completion is performed for Hebrew names in Hebrew script, but it is not controlled by this option. It is always enabled.
For supported languages and scripts, see Orthographic Completion.
-
Orthographic Minimization. Class: MinimizeOrthographyOption
For Arabic, Iranian Persian, Afghan Persian, Pashto, or Urdu names in Arabic script, you can set the option to devocalize (remove short vowel diacritics). The source domain and the target domain must be one of these languages, Arabic script, and Native transliteration. You can use this option to generate the Arabic script representation of names found in most media, such as news articles. The default setting for this option is false.
For supported languages and scripts, see Orthographic Minimization.
-
Statistical Methods. Class: StatisticalMethodsOption
For personal names in Arabic, Hebrew, Japanese, and Russian, use statistical methods to establish information that is not found in a dictionary. Statistical methods are used to do the following:
-
Classify unknown personal names as native or foreign.
-
(Arabic, Hebrew) Vocalize unknown personal names classified as native, or unknown personal names classified as foreign.
-
(Arabic, Hebrew, Japanese, Russian) Translate unknown personal names classified as foreign.
For Arabic this option is set to true by default. If statistical methods are turned off, performance with Arabic input is faster, but for personal names not found in its dictionary, the translator can only mechanically transliterate the input.
For Hebrew, Japanese, and Russian, statistical analysis is always performed.
For supported languages and scripts, see Statistical Methods.
-
Performance Tradeoff. Class: PerformanceTradeoff
For personal names in Arabic, Japanese, or Russian, you can control the tradeoff the translator makes between speed and correctness when it is performing statistical analysis. For Arabic, statistical methods must be turned on (the default). As mentioned above, statistical analysis is always performed for Japanese and Russian. Four settings are defined in com.basistech.rnt.options.TradeoffEnum.
For supported languages and scripts, see Performance Tradeoff.
-
Segmentation. Class: SegmentOption
For Chinese, Japanese, or Korean names, you can set the option to segment unsegmented names. Unsegmented Thai names are also segmented, but this behavior is not controlled by this option; it is always enabled.
Suppose you are processing 胡錦濤. This name is not segmented and the Pinyin transliteration is hujintao. With the segmentation option, the Translator segments the name into 胡 and 錦濤, and produces the correct Pinyin transliteration: hu jintao.
For Korean, the name may be in Hangul or Han script. For example, the following Hangul and Han representations of the same name are not segmented: 김정일 and 金正日. With the segmentation option, the Translator uses Hangul to segment either of these forms into 김 and 정일.
By default, the Segmentation option is set to true.
For supported languages and scripts, see Segmentation.
-
Normalization. Class: NormalizeOption
The normalization option applies to Arabic, Chinese, and Japanese names.
For Arabic native names, the normalizer applies a set of standardization rules. For example, the normalizer inserts a space in عبدالمجيد, producing the more standard representation: عبد المجيد (the IC transliteration is 'Abd-al-Majid).
For Chinese, normalization converts any characters in the traditional Chinese variant to the simplified Chinese variant (the standard for China). For example, the normalizer converts 張 to 张.
For Japanese names, normalization converts Kanji variants (including old Kanji) to their standard form. For example, the normalizer converts 亞 to 亜.
By default, the Normalization option is set to true.
For supported languages and scripts, see Normalization.
-
Pashto IC: Variant Spelling and Region. For Pashto, when applying the IC standard, these two options implement variations specified in the IC Pashto Standardized Transliteration System for Personal Names. For supported languages and scripts, see Variant Spelling and Region.
-
Korean Geography. For Korean, when applying the BGN standard, the standard that is actually used depends on the Korean Geography Option. For North Korea (the default), McCune-Reischauer is used. For South Korea, Revised Romanization of Korean is used. For supported languages and scripts, see Korean Geography.
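To make the effect of the Orthographic Minimization option in the list above concrete, here is a minimal Java sketch of devocalization. It is an illustration only, not RNT's implementation: the class name is hypothetical, and the choice of marks to strip (U+064B through U+0652, which also covers tanwin, shadda, and sukun) is an assumption. The input is the vocalized name from the Orthographic Completion example, written with Unicode escapes.

```java
// Hypothetical illustration, not RNT code: removes Arabic vowel marks
// so that a fully vocalized name returns to its conventional spelling.
final class DevocalizeSketch {

    // Drops the marks from U+064B to U+0652 (assumed set) and keeps everything else.
    static String devocalize(String vocalized) {
        StringBuilder out = new StringBuilder(vocalized.length());
        for (int i = 0; i < vocalized.length(); i++) {
            char c = vocalized.charAt(i);
            if (c < 0x064B || c > 0x0652) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The vocalized form of the name used in the Orthographic Completion example.
        String vocalized = "\u0646\u064E\u0627\u064A\u0650\u0641 "
                + "\u0623\u064E\u0628\u064F\u0648 "
                + "\u0634\u064E\u0631\u0652\u062E";
        // The conventional, unvocalized spelling of the same name.
        String expected = "\u0646\u0627\u064A\u0641 \u0623\u0628\u0648 \u0634\u0631\u062E";
        System.out.println(devocalize(vocalized).equals(expected)); // prints true
    }
}
```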
You can set various parameters for the ITranslator, and you must instantiate an ITranslatable object with which the Translator performs the translation.
ITranslator Methods for Setting Translation Parameters
- void setMaximumResults (int maxResults)
-
Sets the maximum number of candidate translations that RNT generates. If you are only interested in the best or most likely result, set this to 1.
- void setMinimumConfidence (double confidence)
-
Each result is tagged with a confidence score between 0 and 1.0. The higher the number, the higher the confidence that this is a result of interest.
- <T> void setOption (optionValue)
-
Each option is defined by a class. The options are defined in com.basistech.rnt.options.
By default, PerformanceTradeoff is set to TradeoffEnum.NORMAL, VariantSpellingOption is false, RegionOption is RegionEnum.DEFAULT (region unknown), and the other options are set to true. You can use this method to reset an option. For example:
setOption(new CompleteOrthographyOption(false));
setOption(new PerformanceTradeoff(TradeoffEnum.FAST));
RNT performs the translation on an ITranslatable object. An ITranslatable object contains several properties: data (the name), language, script, and entity type (person, location, organization, etc.). Language and script should match the language and script of the source text domain. Entity type may be unknown (com.basistech.util.NEConstants.NE_TYPE_NONE). The ITranslatable object may be extended to include additional information, such as geocoordinates for locations. For more information, see the Javadoc for the implementation of ITranslatable: com.basistech.rni.match.Name.
The following example translates an Arabic name: "صفية طالب السهيل".
https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/translate.java
Inspecting Translation Results
The ITranslator translate() method returns a list of TranslationResult objects.
Each TranslationResult provides access to the translation, the confidence associated with that translation (a double from 0 to 1.0), and may provide additional information with an associated confidence score. The sum of the confidence of all the results returned by a translation is less than or equal to 1.0. The additional information may include orthographic completion (the diacritization of names in Arabic or Hebrew script), segmentation (of names in Chinese, Korean, Japanese, or Thai), and language of origin (for Arabic, Chinese, or Japanese Katakana script). By default, these options are set to true, in which case the Translator attempts to infer the additional information. You can turn off one or all of these options.
For names in Arabic script, orthographic completion means the addition of short-vowel markers and other diacritics that are absent in conventional Arabic script but required for accurate transliteration.
https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/inspect_translation_results.java
This translation returns a list containing one result.
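When a translation returns several results, an application typically ranks and trims them. The following is a minimal sketch of that post-processing, assuming you have copied each translation and its confidence into your own objects. ScoredTranslation and ResultFilterSketch are hypothetical stand-ins, not RNT classes, and the candidate names and scores are invented for illustration. The sketch also assumes that a minimum-confidence threshold simply drops lower-scored results.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a translation plus its confidence score.
// It is not the RNT TranslationResult class.
final class ScoredTranslation {
    final String text;
    final double confidence;

    ScoredTranslation(String text, double confidence) {
        this.text = text;
        this.confidence = confidence;
    }
}

final class ResultFilterSketch {

    // Keeps results at or above minConfidence, best first, capped at maxResults.
    static List<ScoredTranslation> select(List<ScoredTranslation> results,
                                          double minConfidence, int maxResults) {
        List<ScoredTranslation> kept = new ArrayList<>();
        for (ScoredTranslation r : results) {
            if (r.confidence >= minConfidence) {
                kept.add(r);
            }
        }
        kept.sort(Comparator.comparingDouble((ScoredTranslation r) -> r.confidence).reversed());
        int limit = Math.max(0, Math.min(maxResults, kept.size()));
        return new ArrayList<>(kept.subList(0, limit));
    }

    // Sums the confidence scores; the documentation states this is at most 1.0.
    static double totalConfidence(List<ScoredTranslation> results) {
        double sum = 0.0;
        for (ScoredTranslation r : results) {
            sum += r.confidence;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Invented scores, for illustration only.
        List<ScoredTranslation> results = List.of(
                new ScoredTranslation("candidate-2", 0.25),
                new ScoredTranslation("candidate-1", 0.60),
                new ScoredTranslation("candidate-3", 0.05));
        for (ScoredTranslation r : select(results, 0.10, 2)) {
            System.out.println(r.text + " " + r.confidence);
        }
    }
}
```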
Overriding Name Pair Translations
You can create UTF-8 files that specify how names are to be translated. The filenames specify the language of the source and target domains, and may specify an entity type. The file entries specify the text of the source and target names, the script of the target domain (required if the target language may be written in multiple scripts), and may specify a confidence score for the translation. RNT applies Unicode NFD normalization to the name strings, and performs the translations specified in these files.
Filenames. The filenames use ISO639 three-letter codes to specify the language of the source domain and the language of the target domain. The filename may also specify an entity type.
fullnames_SRCLANG_TARGETLANG[_TYPE].txt
For example, fullnames_eng_zho_PERSON.txt would contain entries for translating English PERSON names to Chinese. fullnames_ara_eng.txt would contain entries for translating Arabic names of any or no entity type to English.
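The naming pattern can be expressed as a small helper. This is a sketch only; the class and method are hypothetical and not part of RNT, and the caller is assumed to pass valid three-letter language codes.

```java
// Hypothetical helper, not part of RNT: builds an override file name
// following the fullnames_SRCLANG_TARGETLANG[_TYPE].txt pattern.
final class OverrideFileNameSketch {

    // Pass null or an empty string for entityType to omit the type suffix.
    static String fileName(String sourceLang, String targetLang, String entityType) {
        StringBuilder name = new StringBuilder("fullnames_")
                .append(sourceLang)
                .append('_')
                .append(targetLang);
        if (entityType != null && !entityType.isEmpty()) {
            name.append('_').append(entityType);
        }
        return name.append(".txt").toString();
    }

    public static void main(String[] args) {
        System.out.println(fileName("eng", "zho", "PERSON")); // prints fullnames_eng_zho_PERSON.txt
        System.out.println(fileName("ara", "eng", null));     // prints fullnames_ara_eng.txt
    }
}
```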
Sample Fullname Override Files. fullnames_ara_eng_LOCATION.txt, fullnames_jpn_eng_LOCATION.txt, and fullnames_rus_eng_LOCATION.txt contain entries for translating LOCATION names to English from Arabic, Japanese, and Russian, respectively.
File entries. Each row in the file, except for rows beginning with #, contains tab-delimited fields with source name, target name, target script (not required if the target language is only written in one script), and optional confidence score.
source name Tab target name[ Tab target_script] [Tab confidence_score]
The confidence score must be between 0 and 1.0. If it is not included, RNT sets the confidence score to 1.0.
The following entry in fullnames_eng_zho_PERSON.txt specifies that Ho Lide should be translated to 贺 利得 with a confidence score of 0.99 if the entity type for the source name is PERSON and the script for the target domain is Hans (simplified Chinese).
Ho Lide	贺 利得	Hans	0.99
The translations you specify apply in one direction only, so the preceding entry has no influence on the translation of the Chinese 贺 利得 to English.
The following entry in fullnames_ara_eng.txt specifies that علي سعيد should be translated to 'Ali Sa'id with a confidence score of 1.0.
علي سعيد	'Ali Sa'id
You can include multiple entries with the same source name, in which case it translates to multiple target names. The sum of the confidence scores for the source name must be between 0 and 1.0.
If you do not include confidence scores, and a source name translates to multiple target names, RNT sets the confidence score for each pair to 1 divided by the number of targets. If, for example, a source name translates to two targets, the confidence score for each translation is 0.5.
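The entry format and default scoring rules can be sketched as a small parser. This is an illustration under stated assumptions, not how RNT reads the files: OverrideEntry and OverrideFileSketch are hypothetical names, a numeric optional field is taken to be the confidence score and any other optional field the target script, and a row without a score receives 1 divided by the number of rows sharing its source name (the text does not say what happens when only some rows for a source name carry scores). Name strings are normalized to NFD with the standard java.text.Normalizer class.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder for one override row; not an RNT class.
final class OverrideEntry {
    final String source;
    final String target;
    final String script;   // null when the row has no script field
    Double confidence;     // null until a default is assigned

    OverrideEntry(String source, String target, String script, Double confidence) {
        this.source = source;
        this.target = target;
        this.script = script;
        this.confidence = confidence;
    }
}

final class OverrideFileSketch {

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Parses tab-delimited rows: source, target, optional script, optional confidence.
    static List<OverrideEntry> parse(List<String> rows) {
        List<OverrideEntry> entries = new ArrayList<>();
        Map<String, List<OverrideEntry>> bySource = new LinkedHashMap<>();
        for (String row : rows) {
            if (row.isEmpty() || row.startsWith("#")) {
                continue; // skip blank rows and comment rows
            }
            String[] fields = row.split("\t");
            if (fields.length < 2) {
                throw new IllegalArgumentException("Expected source and target: " + row);
            }
            String script = null;
            Double confidence = null;
            for (int i = 2; i < fields.length; i++) {
                String field = fields[i].trim();
                if (field.isEmpty()) {
                    continue;
                }
                try {
                    confidence = Double.valueOf(field); // numeric field: confidence
                } catch (NumberFormatException notANumber) {
                    script = field;                     // anything else: script (assumption)
                }
            }
            if (confidence != null && (confidence < 0.0 || confidence > 1.0)) {
                throw new IllegalArgumentException("Confidence out of range: " + row);
            }
            OverrideEntry entry = new OverrideEntry(nfd(fields[0]), nfd(fields[1]), script, confidence);
            entries.add(entry);
            bySource.computeIfAbsent(entry.source, k -> new ArrayList<>()).add(entry);
        }
        // Unscored rows default to 1 / (number of rows for the same source name).
        for (List<OverrideEntry> group : bySource.values()) {
            for (OverrideEntry entry : group) {
                if (entry.confidence == null) {
                    entry.confidence = 1.0 / group.size();
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> rows = List.of(
                "# sample rows",
                "Ho Lide\t\u8D3A \u5229\u5F97\tHans\t0.99",
                "name\tfirst target",
                "name\tsecond target");
        for (OverrideEntry e : parse(rows)) {
            System.out.println(e.source + " -> " + e.target + " (" + e.script + ", " + e.confidence + ")");
        }
    }
}
```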
Location of Override Files. Place your override files in the $BT_ROOT/rlpnc/data/rnt/ref/override directory.
Tip
To define your own override tables (character streams) in place of the tables in the default directory, see the HTML API documentation for the com.basistech.rnt.DictionaryService.replaceConfiguration method.
RNT provides an API that you can use to build interactive applications to translate Arabic, Iranian Persian, Afghan Persian, Pashto, Chinese, Korean, and Russian names from English or native script to standardized English. For supported transliteration schemes, see Supported Translation Domains.
The input is an Arabic, Iranian Persian, Afghan Persian, Pashto, Chinese, Korean, or Russian name in native script or in English that the user wants to translate. In the common case, the name is in English but may not conform to the desired transliteration standard.
-
For Arabic names, the application walks users through the procedure of generating the name in fully vocalized Arabic script (conventional Arabic does not include short vowels), and transliterating the name.
-
For Iranian Persian, Afghan Persian and Pashto names, the application walks the user through the process of generating the names in standard Arabic script (no short-vowel markers).
-
For Chinese, Korean, or Russian names, the application walks the user through the process of generating the name in Hani, Hangul, or Cyrillic.
To take full advantage of the resources that RNT Interactive provides, the user should have some familiarity with Arabic, Iranian Persian, Afghan Persian, Pashto, Chinese, Korean, and/or Russian.
For detailed information about the API, see the com.basistech.rnt.assistant package in the Java API Reference.
Overview of an Interactive Application
An interactive application that walks the user through the process of translating an Arabic, Iranian Persian, Afghan Persian, Pashto, Chinese, Korean, or Russian name does the following:
-
Sets the Basis root directory and instantiates a TranslationAssistant.
-
Collects user input: a name to transliterate and a description of the desired output.
The name is in English (a 'folk' transliteration) or in native script.
The output is defined as one foreign language text domain (such as Arabic, Arab, NATIVE) and one or more English text domains (such as English, Latn, BGN). See Supported Translation Domains.
-
Asks RNT to initialize an output object, which includes segmentation information about the input.
The segmentation may not match the segmentation implied by the user input, and needs to be recirculated to RNT as part of each user interaction. For example, in Arabic the definite article or family/clan indicator 'al' may or may not be joined to the element that follows. In some cases, whether it should be joined is unambiguous. In other cases, either segmentation is possible. The selection you make for one element may undo the selection you have already made for another element and/or may change the options available for that other element.
-
For each segment in the name, provides the users with a set of output alternatives. Each alternative includes the segment transliteration for the specified output text domains. For Arabic input only, the alternative may include a brief gloss and part of speech to assist the user in making a choice: either may be 'Name'; the gloss may contain a Buckwalter annotation, such as 's' for surname or 'f' for feminine name.
When the user selects an alternative, the application passes it to the output object, and passes the current segmentation (which the selection may have changed) back to the input.
-
Publishes the final output: for each output text domain, the combination of alternatives that the user has selected.
-
When the user is done, closes interactive RNT to free resources.
For a sample application that simulates the interactive process described above, see InteractiveTranslationSample and the source code in $BT_ROOT/rlpnc/samples/java/InteractiveTranslationSample.java.
As shipped (you can modify the sample), the input is an Arabic name: Safiyah Talib Al Suhail.
RNT divides this input into a number of segments and generates alternatives for each segment. RNT returns these alternatives in descending order of confidence (the best alternative is the first). For Arabic input only, as the following table shows, RNT provides additional information about each alternative to help the user make the best selection.
As the table also indicates, Al could be an individual component, but in the context of the word that follows, it should be joined with Suhail.
The final output (choosing the first alternative for each segment) is as follows:
IC transliteration: Safiyyah Talib al-Suhayl
Native transliteration: صَفِيَّة تَلِيب اَلسُّهَيْل