CSAIL Research Abstracts - 2005

High-quality Speech Translation for Language Learning

Chao Wang & Stephanie Seneff


We describe a translation framework aimed at achieving high-quality speech translation within restricted conversational domains. Towards this goal, we developed an interlingua-based approach in which a generation-based method is augmented with an example-based method to improve robustness, even for imperfect inputs caused by speech recognition errors. The framework is integrated into a dialogue-based language tutoring system to provide immediate translation assistance to students during the dialogue interaction.


The main components of language learning are reading, writing, listening, and speaking. While diligent students can gain adequate proficiency in the first three areas, the goal of improving conversational skills cannot be achieved simply by working hard. This is mainly due to the lack of a proper environment and adequate opportunity to practice speaking. Dialogue systems can potentially change this situation by providing an entertaining and non-threatening conversational environment [1]. A critical technology in our vision is the ability to generate high-quality translations of speech inputs in the native language, to provide students with immediate assistance when they have difficulty expressing themselves in the new language [2]. With this assistance, their conversation with the computer can continue even before they have sufficient proficiency in the foreign language.

Language learning presents special challenges to a translation system because the translation must be essentially flawless, to avoid teaching the student inappropriate language patterns. However, it need only be fluent and convey the same overall meaning as the original L1 utterance: it need not be an exact translation, and could instead be viewed more as a paraphrase. In fact, we believe that it would be beneficial to the student if there were some randomness in the translation, such that multiple repetitions of the same English utterance produced different ways to render an equivalent intent in Mandarin. Interestingly, since the student is very likely to repeat verbatim what the system proposes, it is of paramount importance that the translation be understandable by the system's L2 grammar.


Our translation framework adopts the interlingua approach and is integrated with our dialogue system development via a shared meaning representation which we call a semantic frame. Given an input sentence, a parse tree is derived and critical syntactic relations and semantic elements in the parse tree are extracted. The resulting semantic frame can be used to generate key-value (KV) information for the dialogue system, and to generate a sentence in the original language (paraphrasing) or in a different language (translation). The generation is controlled by a set of rules and a context-sensitive lexicon, which can be fine-tuned to achieve high quality. We adopt a knowledge-rich approach in both the parsing and generation components, while emphasizing portability of the grammar and generation rules to new domains.
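The pipeline above can be sketched in a few lines of code. This is a hypothetical illustration only: the frame keys, rule format, and function names are stand-ins invented for this sketch, not the actual system's API.

```python
# Hypothetical sketch of the interlingua pipeline: parse an input sentence
# into a language-neutral semantic frame, then derive from that one frame
# (a) key-value pairs for the dialogue manager and (b) a surface sentence
# via language-specific generation rules (paraphrase or translation).

def parse_to_frame(sentence):
    """Map an input sentence to a semantic frame. A real parser derives
    this from a parse tree; here we hard-code one in-domain example."""
    if sentence == "what is the weather in Boston tomorrow":
        return {"clause": "verify", "topic": "weather",
                "city": "Boston", "date": "tomorrow"}
    raise ValueError("unparseable input")

def frame_to_kv(frame):
    """Extract the key-value (KV) pairs the dialogue system consumes."""
    return {k: v for k, v in frame.items() if k in ("topic", "city", "date")}

def generate(frame, rules):
    """Rule-driven generation: the same frame yields a paraphrase under
    L1 rules or a translation under L2 rules."""
    return rules[frame["clause"]].format(**frame)

english_rules = {"verify": "what is the {topic} in {city} {date}"}
frame = parse_to_frame("what is the weather in Boston tomorrow")
print(frame_to_kv(frame))
print(generate(frame, english_rules))
```

Swapping `english_rules` for a set of Mandarin rules over the same frame is what turns paraphrasing into translation in this scheme.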

Our dialogue tutoring system employs two grammars, one to parse the native language (L1) for translation, and one to parse the foreign language (L2) for dialogue processing. We can make use of the L2 grammar to achieve some "quality assurance" on the translation outputs. If the generated translation fails to parse under the L2 grammar, we resort to an example-based method, in which semantic information encoded as key-value pairs is used to look up a suitable candidate in a pre-compiled L2 corpus. If both methods fail, the system prompts the student to rephrase. We believe that a null output is better than an erroneous one, given the intended use of our system. The example-based mechanism complements the rule-based generation in that it tends to be more robust for ill-formed inputs.
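The back-off strategy above can be summarized in a short sketch. The function names and corpus representation here are illustrative assumptions, not the system's actual implementation; the point is the control flow: prefer rule-based generation, accept it only if the L2 grammar parses it, otherwise fall back to the example lookup, and reject rather than risk a bad translation.

```python
# Sketch of the two-stage quality-assurance strategy (hypothetical API).

def translate(utterance, generate_l2, parses_in_l2, kv_of, example_corpus):
    """Return an L2 translation string, or None to prompt a rephrase."""
    candidate = generate_l2(utterance)            # rule-based generation
    if candidate is not None and parses_in_l2(candidate):
        return candidate                          # quality-assured output
    key = kv_of(utterance)                        # semantic key-value pairs
    return example_corpus.get(key)                # example back-off; may be None

# Toy demonstration with stand-in components:
corpus = {("weather", "Boston"): "<L2 sentence for Boston weather>"}
ok = translate("x", lambda u: "good", lambda c: True,
               lambda u: ("weather", "Boston"), corpus)
backoff = translate("x", lambda u: "bad", lambda c: False,
                    lambda u: ("weather", "Boston"), corpus)
rejected = translate("x", lambda u: None, lambda c: False,
                     lambda u: ("no", "match"), corpus)
```

Returning `None` when both stages fail corresponds to prompting the student to rephrase, the "null output over erroneous output" policy described above.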


We evaluated the speech translation quality using English speech data, recorded from phone calls to the publicly available Jupiter weather information system. Our test data consist of 695 utterances selected from a set of held-out data. Utterances whose manually-derived transcription cannot be parsed by the English grammar are excluded from the evaluation, since they are likely to be out-of-domain sentences and would simply contribute to null outputs. The test data contain on average 6.5 words per utterance. The recognizer achieved a 6.9% word error rate and a 19.0% sentence error rate on this set.

A bilingual judge rated the translation quality based on grammaticality and fidelity, where the input is either the manual transcription or the recognizer output. Performance differences between these two modes reflect degradations caused by speech recognition errors.

Table 1 and Table 2 summarize the subjective ratings of the translation outputs given perfect transcription or speech as input. The example-based method is more robust than the generation-based method in that it has many fewer hard failures, though at the expense of more "bad" translations caused by erroneous matches. In the "By-generation/example" mode, we adopt the strategy of preferring the generation-based output if it can be accepted by the Chinese grammar. The generation method is able to achieve high fidelity in the translation, preserving syntactic correspondence between English and Chinese as much as possible. Adding the example-based mechanism as a back-off improves performance over that of either method alone.

The translation accuracy in speech mode is lower, as expected, due to mis-recognized semantic entities, such as city names. This is unlikely to be a serious issue for language students, because our system echoes a paraphrase of the recognized input to keep the user informed during the interaction. A closer look at the system outputs also revealed that minor syntactic errors in speech recognition outputs seldom cause translation degradation, thanks to the example-based mechanism: as long as a robust parse can be found containing all the semantically important fragments, we are able to produce an appropriate translation using the lookup mechanism. The overall translation accuracy, computed as ("perfect" + "acceptable") / ("perfect" + "acceptable" + "bad"), is 97.5% for manual transcriptions, with only a 1.9% rejection rate. For speech, the accuracy drops to 93.6%, and the failure rate increases to 3.0%.

Quality                  Perfect  Acceptable  Bad  Failed  Accuracy (%)
By-generation                612           9    6      68          99.0
By-example                   607          45   23      20          96.7
By-generation/example        644          17   21      13          97.5

Table 1. Translation performance on manual transcriptions of a set of 695 utterances. Accuracy is measured only on those translations that did not fail.

Quality                  Perfect  Acceptable  Bad  Failed  Accuracy (%)
By-generation                582          17   19      77          96.9
By-example                   575          50   42      28          93.7
By-generation/example        610          21   43      21          93.6

Table 2. Translation performance on speech recognition outputs of a set of 695 utterances. Accuracy is measured only on those utterances that did not fail.
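Assuming accuracy is defined over non-failed outputs as stated in the text, and rejection as the fraction of all utterances that failed, a quick check against the "By-generation" row of Table 1 looks like this (the function names are ours, chosen for illustration):

```python
# Metric definitions used in the tables: accuracy over non-failed outputs,
# rejection over all test utterances.

def accuracy(perfect, acceptable, bad):
    """Percent of non-failed translations rated perfect or acceptable."""
    return 100.0 * (perfect + acceptable) / (perfect + acceptable + bad)

def rejection(failed, total):
    """Percent of all utterances for which the system produced no output."""
    return 100.0 * failed / total

# By-generation row of Table 1: 612 perfect, 9 acceptable, 6 bad.
print(round(accuracy(612, 9, 6), 1))   # matches the 99.0 in Table 1
# Rejection rates quoted in the text: 13/695 (text mode), 21/695 (speech).
print(round(rejection(13, 695), 1))
print(round(rejection(21, 695), 1))
```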

Research Support

Support for this research was provided in part by the Cambridge/MIT Initiative.


[1] Stephanie Seneff, Chao Wang, and Julia Zhang. Spoken Conversational Interaction for Language Learning. In Proc. of InSTIL, Venice, Italy, May, 2004.

[2] Chao Wang and Stephanie Seneff. High-Quality Speech Translation for Language Learning. In Proc. of InSTIL, Venice, Italy, May, 2004.
