CSAIL Research Abstract

A Multi-Pass, Dynamic-Vocabulary Approach to Real-Time, Large-Vocabulary Speech Recognition

I. Lee Hetherington

What

We present a technique to perform speech recognition with a large virtual vocabulary while operating with a relatively small active vocabulary. The technique utilizes multiple recognition passes on a given utterance, with vocabulary refinement between passes.

Why

Traditionally, most speech recognition systems have treated the vocabulary as fixed. However, a system with a flexible vocabulary may have advantages in terms of reacting to dialogue state, user customization, and novel words encountered within an application. For some types of applications, the use of a large static vocabulary may not be optimal compared to a more nimble system that can dynamically refine its vocabulary as needed based on context, even within the same utterance. For example, there are over 30,000 (city, state) pairs in the USA containing more than 16,000 unique words. If we consider the 6.4 million (street, city, state) triples that are part of USA street addresses, it is neither feasible nor desirable to have all of them active at once in order to recognize utterances of the form "32 Vassar Street, Cambridge, Massachusetts." This would require a vocabulary containing nearly 300,000 distinct words, and activating all of them at once would likely hurt performance in terms of memory, speed, and accuracy.

How

We propose to recognize large vocabularies by utilizing multiple passes with vocabulary refinement between them, a general idea first introduced to cope with many compound and inflected words in recognition of broadcast news [1]. The basic idea is to make use of contextual information within the utterance itself to trigger the addition of new words and phrases to the recognizer for subsequent passes.

Consider recognizing street addresses of the form "32 Vassar Street, Cambridge, Massachusetts" in three passes:

1. Recognize the state name while skipping over street and city names: "32 UNKNOWN Street, UNKNOWN, Massachusetts."
2. Add city names compatible with the hypothesized states and recognize city-states while skipping over street names: "32 UNKNOWN Street, Cambridge, Massachusetts."
3. Add street names compatible with the hypothesized city-states and recognize street-city-states: "32 Vassar Street, Cambridge, Massachusetts."

In order to accomplish this process, we need some type of out-of-vocabulary (OOV) or "filler" model to skip over portions of an utterance that may be outside the recognizer's active vocabulary [2], a database to guide vocabulary refinement, letter-to-sound capabilities to produce pronunciations for new words (particularly if the database is dynamically changing) [3], and the ability to rapidly alter the system vocabulary and perform recognition with the refined vocabulary [4].
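The three-pass procedure above can be illustrated with a minimal Python sketch. It is not the system's implementation: the tiny address database, the `toy_recognize` function (which operates on already-known words rather than audio), and all names here are assumptions made for illustration. Words outside the active vocabulary are simply mapped to a filler token, standing in for the OOV model.

```python
# Toy database of (street, city, state) triples; entries are illustrative only.
ADDRESS_DB = [
    ("Vassar", "Cambridge", "Massachusetts"),
    ("Main", "Cambridge", "Massachusetts"),
    ("Main", "Springfield", "Massachusetts"),
    ("Main", "Springfield", "Illinois"),
    ("Vassar", "Reno", "Nevada"),
]

FILLER = "UNKNOWN"  # token emitted by the hypothetical filler (OOV) model


def toy_recognize(spoken_words, active_vocab):
    """Stand-in for one recognition pass (assumption: a real recognizer
    decodes audio; here OOV words are just replaced by the filler token)."""
    return [w if w in active_vocab else FILLER for w in spoken_words]


def multi_pass(spoken_words):
    """Run three passes, refining the active vocabulary between them."""
    # Words assumed to be always active: house numbers and the word "Street".
    base_vocab = {"Street"} | {str(n) for n in range(1, 100)}
    states = {state for _, _, state in ADDRESS_DB}

    # Pass 1: only state names are active; streets and cities are skipped.
    vocab = base_vocab | states
    hyp1 = toy_recognize(spoken_words, vocab)
    hyp_states = {w for w in hyp1 if w in states}

    # Pass 2: add cities compatible with the hypothesized states.
    cities = {c for _, c, s in ADDRESS_DB if s in hyp_states}
    vocab = vocab | cities
    hyp2 = toy_recognize(spoken_words, vocab)
    hyp_cities = {w for w in hyp2 if w in cities}

    # Pass 3: add streets compatible with the hypothesized city-states.
    streets = {
        st for st, c, s in ADDRESS_DB
        if c in hyp_cities and s in hyp_states
    }
    vocab = vocab | streets
    hyp3 = toy_recognize(spoken_words, vocab)
    return [hyp1, hyp2, hyp3]


if __name__ == "__main__":
    utterance = ["32", "Vassar", "Street", "Cambridge", "Massachusetts"]
    for i, hyp in enumerate(multi_pass(utterance), start=1):
        print(f"pass {i}: {' '.join(hyp)}")
```

In this sketch the database lookup between passes is what keeps the active vocabulary small: only cities in the hypothesized state, and then only streets in the hypothesized city-state, are ever added.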

Progress

We have utilized this multi-pass approach within several of our systems, including a system for obtaining information about restaurants [5], a prototype system for entering street addresses (e.g., to a navigation system), and an expanded version of our group's Jupiter weather information system [6,7] that can handle 30,000 city-states in the USA.

In our expanded weather information system, we compared recognition accuracy, measured in terms of word error rate (WER) and city-state token error rate (TER), for our original baseline system with about 500 cities and for static and dynamic vocabulary systems with the expanded 30,000+ cities. Controlling for recognition speed so that all systems operate at real-time speed on a 2.4 GHz Pentium 4, we found that all three systems performed comparably on general utterances, which may contain out-of-vocabulary words, spontaneous speech effects, noises, etc. On utterances containing city-states not in the baseline vocabulary but in the expanded vocabulary (e.g., "Will it snow in St. Charles, Illinois tomorrow?"), we found that both the static and dynamic systems could correctly recognize better than 80% of the city-state tokens. The dynamic system performed a few percentage points better (83.5% vs. 80.8%) than the static system while operating with an active vocabulary averaging about 1,100 city-states per utterance vs. the 30,000 of the static system, nearly a 30-fold reduction in active vocabulary. Part of the reason for the improvement is that there are fewer confusions with smaller vocabularies, and part is due to the flat within-class language model (1/N) imposing less penalty in the dynamic system because N is significantly smaller.
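The effect of the flat within-class language model can be made concrete with a short calculation. The sketch below assumes each of the N active city-states receives probability 1/N and treats the reported average of about 1,100 as a single representative N for the dynamic system. Both are simplifications, and the function and variable names are illustrative rather than taken from the system.

```python
import math


def flat_class_log10_penalty(n):
    """Log-domain cost of choosing one item from a flat class of size n,
    i.e. -log10(1/n)."""
    return -math.log10(1.0 / n)


STATIC_N = 30_000   # city-states active in the static system
DYNAMIC_N = 1_100   # approximate average active city-states, dynamic system

# Reduction in active vocabulary size.
reduction = STATIC_N / DYNAMIC_N

# Difference in within-class penalty, in orders of magnitude of probability.
penalty_gap = (
    flat_class_log10_penalty(STATIC_N) - flat_class_log10_penalty(DYNAMIC_N)
)

if __name__ == "__main__":
    print(f"vocabulary reduction: {reduction:.1f}x")
    print(f"log10 penalty gap:    {penalty_gap:.2f}")
```

Under these assumptions the reduction is roughly 27-fold, and each city-state token in the dynamic system carries a within-class probability more than an order of magnitude higher than in the static system.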

Future Work

We expect to continue this work in various systems, particularly those in which the vocabulary cannot be statically compiled ahead of time because a live database (e.g., the Web) is in use. We are also in the process of shrinking our recognizer's memory footprint so that it can run on small devices (e.g., PDAs), and we expect that keeping the active vocabulary as small as possible while still accessing a larger virtual vocabulary will aid us in this process.

References

[1] P. Geutner, M. Finke, and P. Scheytt. Adaptive vocabularies for transcribing multilingual broadcast news. In Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 925-928, Seattle, May 1997.

[2] I. Bazzi and J. Glass. Modeling out-of-vocabulary words for robust speech recognition. In Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 1613-1616, Beijing, Oct. 2000.

[3] G. Chung, C. Wang, S. Seneff, E. Filisko, and M. Tang. Combining linguistic knowledge and acoustic information in automatic pronunciation lexicon generation. In Proc. Interspeech, Jeju, Oct. 2004.

[4] J. Schalkwyk, L. Hetherington, and E. Story. Speech recognition with dynamic grammars using finite-state transducers. In Proc. Interspeech, pp. 1969-1972, Geneva, Sep. 2003.

[5] G. Chung, S. Seneff, C. Wang, and L. Hetherington. A dynamic vocabulary spoken dialogue interface. In Proc. Interspeech, Jeju, Oct. 2004.

[6] V. Zue, S. Seneff, J. Glass, J. Polifroni, C. Pao, T.J. Hazen, and L. Hetherington. Jupiter: a telephone-based conversational interface for weather information. IEEE Trans. Speech and Audio Processing, vol. 8, no. 1, pp. 85-96, Jan. 2000.

[7] L. Hetherington. A multi-pass, dynamic-vocabulary approach to real-time, large-vocabulary speech recognition. Submitted to Proc. Interspeech, Lisbon, Sep. 2005.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu
(Note: On July 1, 2003, the AI Lab and LCS merged to form CSAIL.)