Research Abstracts - 2006
Pronunciation Assessment and Feedback in Dialogue Systems for Foreign Language Learning

Mitchell Peabody, Chao Wang & Stephanie Seneff


Learning a foreign language as a teenager or as an adult is considered by many to be one of the most difficult learning endeavors a person can undertake. It involves the inculcation of unfamiliar sounds, vocabulary, syntax, gestures, and dialogue structures such that the student can quickly understand and appropriately respond to sentences directed at them. Communicative competence at a level where one could be considered fluent is normally achieved through speaking with native speakers. This sort of communication allows the student to make mistakes, receive feedback from a friendly source, and to make corrections.

An opportunity exists for computer systems to supplement a foreign language curriculum by providing dynamic, interactive, non-threatening, and fun environments for students to explore and practice a foreign language on their own. Dialogue systems, already in use throughout the world in task-based customer service applications, can provide such environments.

One of the attractions of using dialogue systems for language learning is that several domains[1,7,8] are already available within the Galaxy framework. Much of the work involved in constructing such dialogue systems from scratch is already done; thus, to use the system for foreign language learning, the task is "simply" to modify the system for use by non-native speakers[4].

Using the cases of American students learning Mandarin Chinese, Chinese students learning English, and Korean students learning English, the issues of mispronunciation at the segmental and suprasegmental levels are being addressed. This entails the detection of mispronunciations, the identification of mispronunciations deemed the most critical, and the correction of those mispronunciations.

Segmental Error Detection

Native speaker variations in speech make recognition difficult; accented speakers present an even greater challenge. In Computer-Aided Pronunciation Training (CAPT) applications, the interest is not only in recognizing accented speech but also in identifying poorly pronounced speech. Using the explicit phonological rule framework of the SUMMIT recognition system[1], contextual changes in continuous Korean-accented English can be explicitly modeled.

In recent research, linguistic knowledge was incorporated with some success in modeling Korean-accented English. The types of pronunciation errors that native Korean speakers make when speaking English were identified through expert phonological analysis. Rules modeling these errors were incorporated into a SUMMIT recognizer whose phone set consisted entirely of American English phones, with segment models trained on native English acoustic data. Conceptually, the American phonological rules were augmented to include novel variations introduced by Korean speakers.

Forced recognition experiments were conducted to analyze the ability of the recognizer to identify phones for which a mispronunciation had been detected. A 5% improvement in phone-error rate on Korean speech was seen when recognition was done with a recognizer using the augmented phonological rules, compared with a recognizer using only the default phonological rules. This indicates not only that recognizer performance can potentially be improved through explicit modeling of accented speech, but also that the recognizer is correctly identifying phones that were transcribed by hand.
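The abstract does not state how phone-error rate was scored. A common convention is the edit distance between the hand transcription and the recognizer output, divided by the number of reference phones. The following is a minimal sketch under that assumption; the function names and the phone labels in the usage note are illustrative and are not taken from the SUMMIT system.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # phone deleted from ref
                            curr[j - 1] + 1,      # phone inserted in hyp
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(pairs):
    """Total edit errors over total reference phones.

    `pairs` is an iterable of (reference, hypothesis) phone lists.
    """
    pairs = list(pairs)
    total = sum(len(ref) for ref, _ in pairs)
    if total == 0:
        raise ValueError("no reference phones to score against")
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    return errors / total
```

With such a function, two recognizers can be compared on the same hand-transcribed test set, for example by calling phone_error_rate once on the output of the default-rule recognizer and once on the output of the augmented-rule recognizer.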

Ongoing, unpublished work expands on this idea of using phonological modeling to detect mispronunciations, but with three main differences. The first is that the language pair is now Chinese-English. Investigatory work is being done in both directions (Americans learning Mandarin and Chinese speakers learning English) to establish that the techniques developed are generalizable. The second difference concerns the acoustic models: in the Korean-English recognizer, all of the acoustic models were trained on native English speech. In other words, the pronunciation errors were annotated in terms of the English phone set. This new work attempts to improve the accuracy of recognition and error detection by training two distinct sets of acoustic models. The third difference is that the approach is primarily data-driven.

The motivation for the data-driven approach is that the phonological rules for Korean-accented English took years to develop and were relatively expensive. The draw of a turn-key, data-driven approach is clear; it has the potential to be cheaper and faster. To this end, another corpus of Chinese-accented English was collected using the Jupiter[9] weather system.

Two English recognizers using segment models were trained. One of the recognizers (referred to as E) was trained on 104,396 American English utterances; the other recognizer (CE) was trained using 13,289 Chinese-accented English utterances provided by the Information Technology Research Institute (ITRI) in Taiwan. The models in CE were relabeled to indicate that they represented non-native models. For example, /a/ was labeled /ce:a/.

The models from the two recognizers were then combined into a recognizer (ECE) augmented with phonological rules allowing transitions from each of the American English phones to the Chinese equivalents. Because of the finite-state transducer (FST) representation used throughout the recognizer, the changes could be modeled in a hierarchical fashion, with the phonological changes due to the Chinese accent overlaid on the native phonological changes. This differs from the Korean-English recognizer design, which combined the rules into a joint set.
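The relabeling and the native-to-non-native substitution rules can be illustrated with a small sketch. It does not use the FST machinery of the actual recognizer; it simply treats a pronunciation that has already passed through the native phonological rules as a list of phones and overlays, at each position, the option of the corresponding non-native model. Only the /ce:/ prefix comes from the text; the function names and the example phones are assumptions made for illustration.

```python
from itertools import product

NONNATIVE_PREFIX = "ce:"  # label prefix for Chinese-accented models, as in /ce:a/


def relabel(phone):
    """Return the non-native label for a native phone, e.g. 'a' -> 'ce:a'."""
    return NONNATIVE_PREFIX + phone


def overlay_accent(native_pronunciation):
    """Overlay accent alternatives on a native pronunciation.

    The result is a simple lattice: one tuple per position holding the
    native phone and its relabeled non-native counterpart.
    """
    return [(phone, relabel(phone)) for phone in native_pronunciation]


def enumerate_paths(lattice):
    """List every phone sequence the lattice allows (2**n for n phones)."""
    return [list(path) for path in product(*lattice)]
```

In forced recognition, the recognizer would pick the best-scoring path through such a lattice; a position where the non-native model wins is a candidate mispronunciation. Enumerating paths explicitly is only practical for short sequences and is shown here to make the search space visible.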

Confusion matrices comparing the canonical transcriptions with those produced by forced-mode recognition using the ECE recognizer were computed. The normalization factor was the number of times that an American phone occurred in the test set, which consisted of 3,497 American English utterances and 333 Chinese-accented English utterances. Because of the rules used, only substitution errors are explicitly modeled. The rules can be expanded quite easily. Experimental results show that, across the board, the number of confusions for non-native phones is almost twice the rate of corresponding confusions in the native set.
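The normalization described above can be sketched as follows. Because only substitutions are modeled, the canonical and recognized sequences are assumed here to have equal length and to align position by position; each row of the matrix is divided by the number of occurrences of the canonical phone. The function name and data layout are illustrative assumptions, not the authors' tooling.

```python
from collections import Counter, defaultdict


def confusion_rates(aligned_pairs):
    """Row-normalized confusion matrix.

    `aligned_pairs` is an iterable of (canonical, recognized) phone lists
    of equal length. Returns {canonical: {recognized: rate}}, where each
    rate is the count divided by the occurrences of the canonical phone.
    """
    counts = defaultdict(Counter)
    for canonical, recognized in aligned_pairs:
        if len(canonical) != len(recognized):
            raise ValueError("substitution-only alignment needs equal lengths")
        for c, r in zip(canonical, recognized):
            counts[c][r] += 1

    rates = {}
    for c, row in counts.items():
        total = sum(row.values())  # occurrences of the canonical phone
        rates[c] = {r: n / total for r, n in row.items()}
    return rates
```

Computing the matrix separately for the native and the non-native portion of the test set allows the two confusion rates for each phone to be compared directly.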

A final component in pronunciation assessment is examining the specific case of non-native production of tones in Mandarin Chinese. Tonal languages, which comprise the majority of the world's languages, use pitch register (height) and contour to lexically disambiguate words and phrases which share the same phonetic transcription. Using a tone classification system developed in [6], experiments are ongoing to determine whether it can be used to detect tone errors in non-native Mandarin.

What sets this work apart is that, in addition to detecting incorrect tone, it is also working towards automatically correcting tone errors. Signal processing techniques are used to extract and reshape the fundamental frequency of the speech to more closely match appropriate native speaker contours. A student attempting to correct his or her speech for tone can then hear the utterance replayed as it should sound, with the student's individual voice quality preserved. The hope is that, by comparing the modified speech with the original, the student will be better able to perceive the corrections that need to be made.

The motivation for this is that students occasionally get frustrated at having to imitate a voice for which they lack the pitch range. For example, a male student with a female teacher will often find that he is unable to exactly match his teacher's pitch. The lack of a voice for which he can comfortably make changes becomes an obstacle to his learning correct pronunciation.
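One way to picture the reshaping step is as the construction of a target pitch contour that has the shape of a native contour but sits in the student's own pitch range. The sketch below is an assumption-laden illustration and not the authors' method: it takes voiced-frame F0 values in Hz, stretches the native contour to the student's duration by linear interpolation, and shifts it in the log-frequency domain so that its mean matches the student's. Resynthesizing speech with the new contour is a separate step that is not shown.

```python
import math


def resample(contour, n):
    """Linearly interpolate `contour` to exactly `n` points (n >= 1)."""
    if n == 1 or len(contour) == 1:
        return [contour[0]] * n
    step = (len(contour) - 1) / (n - 1)
    out = []
    for i in range(n):
        pos = i * step
        lo = min(int(pos), len(contour) - 1)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out


def target_contour(student_f0, native_f0):
    """Native contour shape, moved to the student's mean log-F0.

    Both inputs are lists of positive F0 values (Hz) for voiced frames.
    """
    if not student_f0 or not native_f0:
        raise ValueError("both contours must be non-empty")
    shape = resample([math.log(f) for f in native_f0], len(student_f0))
    native_mean = sum(shape) / len(shape)
    student_mean = sum(math.log(f) for f in student_f0) / len(student_f0)
    return [math.exp(v - native_mean + student_mean) for v in shape]
```

For a student who produced a flat contour around 100 Hz where a rising tone was expected, a rising native contour recorded around 250 Hz would yield a rising target centered near 100 Hz, so the correction stays within a range the student can comfortably produce.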


One of the critical pieces of work in this thesis will be creating an appropriate corpus. Approximately 5 hours of data provided by the Defense Language Institute (DLI), consisting of recordings of student oral proficiency interviews at various levels, have been digitized and transcribed. While these data have proved useful in providing acoustic data as well as a starting point for understanding student difficulties with the Mandarin tone system, they suffer from several problems.

First, the data were chunked using a very simple end-point detection algorithm based on energy measures. While this partitioned the data into manageable snippets to transcribe, the separation was not perfect, and many utterances start and end in the middle of sentences. This presents a problem in terms of sentence intonation. Second, recording conditions were not uniform: inconsistent microphone placement and use produced very uneven data. Finally, the topic domains as well as student levels are highly varied, leading to a very large lexical and syntactic space.
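The abstract does not give the details of the end-point detector. A typical energy-based scheme, sketched here under that assumption, computes the mean energy of fixed-length frames and closes a segment once the energy has stayed below a threshold for a minimum number of frames. The threshold, frame length, and minimum pause length are hypothetical parameters.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [sum(s * s for s in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def find_segments(energies, threshold, min_silence_frames):
    """Return (start, end) frame indices of speech segments, end exclusive.

    A segment is closed once `min_silence_frames` consecutive frames fall
    below `threshold`; shorter pauses stay inside the segment.
    """
    segments = []
    start = None
    silence = 0
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                segments.append((start, i - silence + 1))
                start = None
                silence = 0
    if start is not None:
        # Input ended mid-segment; drop any trailing low-energy frames.
        segments.append((start, len(energies) - silence))
    return segments
```

The sketch also makes the failure mode visible: a speaker who hesitates in the middle of a sentence for longer than the minimum pause length is split into two snippets, which is how utterances end up starting and ending mid-sentence.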

These shortcomings have been addressed by producing a system for tutoring students on lexical tone as written in the pinyin system of Romanization[2]. This system modifies the MuXing[8] Mandarin weather system to elicit both spoken and typed input in response to prompts concerning weather conditions.

The elicitation approach is similar to that proposed by Tomokiyo [5], where glosses were presented to the student with instructions to speak a given response type. Prompting in this fashion allows students to construct their own responses and provide nearly spontaneous speech, as opposed to read speech, which is important for dialogue systems.

A key difference between our lexical tone system and Tomokiyo's scheme is that simple corrective guidance is provided for lexical tone errors. The ability to translate from English to Chinese[7] is also provided so that a student unsure of their grammar can receive instruction. Correction of lexical tone errors is performed by expanding the possible lexical tones on each of the syllables and converting the result into a word graph parseable by the probabilistic TINA[3] natural language system.
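The tone expansion can be sketched on pinyin typed with tone numbers. Each syllable is expanded to all five tone options (with 5 standing for the neutral tone in this sketch), giving a graph of alternatives per position. In the actual system the graph is parsed by TINA; here a lookup in a small set of known-good sentences stands in for the parser, which is a deliberate simplification. All names and the example sentence are illustrative assumptions.

```python
from itertools import product

TONES = "12345"  # 5 marks the neutral tone in this sketch


def expand_syllable(syllable):
    """All tone variants of a pinyin syllable, e.g. 'yu2' -> 'yu1'..'yu5'."""
    base = syllable.rstrip(TONES)
    return [base + tone for tone in TONES]


def tone_graph(syllables):
    """One list of tone alternatives per syllable position."""
    return [expand_syllable(s) for s in syllables]


def correct_tones(syllables, valid_sentences):
    """Paths through the tone graph that match a known-good sentence.

    `valid_sentences` is a set of syllable tuples and is a toy stand-in
    for parsing the graph with a grammar.
    """
    graph = tone_graph(syllables)
    return [path for path in product(*graph) if path in valid_sentences]


def tone_errors(typed, corrected):
    """(position, typed, corrected) for each syllable whose tone differs."""
    return [(i, t, c)
            for i, (t, c) in enumerate(zip(typed, corrected)) if t != c]
```

Given the typed input "ming1 tian1 hui4 xia4 yu2" and a valid-sentence set containing "ming2 tian1 hui4 xia4 yu3", correct_tones returns the valid sentence and tone_errors reports the first and last syllables as the ones to point out to the student. Exhaustive enumeration grows as 5 to the power of the sentence length, so a real system would search the graph rather than list every path.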

Domain-dependent data have already been collected using this system, and there are immediate plans for collecting more. Data collection runs were conducted at MIT and at DLI. These trials were used to diagnose and fix problems that were not anticipated in the original system. Additionally, surveys were given to participants after the experiment to assess their perceptions of the program.

The results indicate that students are eager to practice speaking and appreciate the feedback on their work. They also indicate that attention must be paid to the appearance and functioning of the interface and that clear expectations must be spelled out for students. This data collection effort is ongoing and has thus far provided approximately 500 samples of complete sentences spoken by American learners of Mandarin.

Future Work

This document has presented a very brief overview of current work. Future work includes more rigorous evaluations of segmental error detection, detection of the tonal errors made by Americans speaking Mandarin, and prosodic correction of English-accented Mandarin. Using a phase vocoder, the F0 portion of the speech signal will be corrected to more closely resemble that of a native speaker of Mandarin. The final step will be to integrate these techniques into a dialogue system to provide feedback and guidance to students learning a foreign language.


This material is based upon work supported under a National Science Foundation Graduate Research Fellowship. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


[1] J. R. Glass, G. Flammia, D. Goodine, M. Phillips, J. Polifroni, S. Sakai, S. Seneff, and V. Zue. Multilingual spoken-language understanding in the MIT Voyager system. Speech Communication, 17:1--18, 1995.

[2] M. Peabody, S. Seneff, and C. Wang. Mandarin tone acquisition through typed interactions. In Proc. of InSTIL/ICALL Symposium: NLP and Speech Technologies in Advanced Language Learning Systems, 2004.

[3] S. Seneff. TINA: A probabilistic syntactic parser for speech understanding systems. In Proc. of the DARPA Workshop, February 1989.

[4] S. Seneff, C. Wang, and J. Zhang. Spoken conversational interaction for language learning. In Proc. of InSTIL/ICALL Symposium: NLP and Speech Technologies in Advanced Language Learning Systems, 2004.

[5] L. M. Tomokiyo and S. Burger. Eliciting natural speech from non-native users: Collecting speech data for LVCSR. In Proc. of the ACL-IALL Joint Workshop on Computer-Mediated Language Assessment and Evaluation in NLP, June 1999.

[6] C. Wang. Prosodic Modeling for Improved Speech Recognition and Understanding. PhD thesis, MIT, June 2001.

[7] C. Wang and S. Seneff. High-quality speech translation for language learning. In Proc. of InSTIL/ICALL Symposium: NLP and Speech Technologies in Advanced Language Learning Systems, 2004.

[8] C. Wang, S. Cyphers, X. Mou, J. Polifroni, S. Seneff, J. Yi, and V. Zue. MuXing: A telephone-access Mandarin conversational system. In Proc. ICSLP'00, pp. 715--718(II), Beijing, China, October 2000.

[9] V. Zue, S. Seneff, J. R. Glass, J. Polifroni, C. Pao, T. J. Hazen, and I. L. Hetherington. Jupiter: A telephone-based conversational interface for weather information. IEEE Trans. on Speech and Audio Processing, 8(1):100--112, 2000.

MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu