MIT CSAIL Research Abstracts

Research Abstracts - 2006

Out-Of-Vocabulary Modeling for Open Vocabulary Speech Recognizers

Ghinwa F. Choueiter & James R. Glass

Introduction

In speech recognition applications, out-of-vocabulary (OOV) words are a major cause of increased word error rates. Unfortunately, OOV words are unavoidable because vocabulary sizes are finite. Increasing the vocabulary size is therefore not a permanent solution, since new words will always occur. Furthermore, OOV words are often keywords or content words, such as proper names and city names, that the recognizer should detect properly.

In this research, we are investigating an approach for modeling OOV words using sub-word units consisting of hybrid clusters of graphemes and phonemes. Such a model can generate spellings of OOV words. Its flexibility lies in its ability to generate reasonable spellings for both in-vocabulary and out-of-vocabulary words. The OOV model has many potential applications, such as automatic transcription and query answering, where the query can contain novel keywords.

In this abstract, we briefly describe the implemented OOV model, the experiments set up so far, the results obtained, and the future work that we aim to pursue.

The Out-of-Vocabulary Model

In this research we implement a sub-word based OOV model that allows any unknown word to be represented as a sequence of sub-word units. The units are currently obtained using a grammar defined by hand-written rules that encode positional, phonological, and morphological constraints. The rule-based grammar has the capability to parse words into two different types of sub-lexical units:

Morphs: phoneme clusters augmented with positional information.
Grapheme-phonemes: letter clusters augmented with the corresponding morph they encode.

Figure 1 illustrates two hierarchical parse trees for the words diction and facial generated using the rule-based grammar. One can read off the morphs from the third-level branches and obtain the grapheme-phonemes by combining the graphemes with the morphs. Figure 1 also illustrates how the same morph sh+ occurs in both diction and facial but is spelled with different graphemes: ti and ci, respectively.

Figure 1: Hierarchical parse trees for the words diction and facial.
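
The pairing of letters and morphs described above can be sketched as a small data structure. This is only an illustration: the class name, field names, and helper function are our assumptions, and the only units taken from the text are the morph sh+ spelled ti (in diction) and ci (in facial).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphemePhoneme:
    """A letter cluster tagged with the morph it encodes (hypothetical layout)."""
    graphemes: str  # letter cluster, e.g. "ti"
    morph: str      # morph label, e.g. "sh+"


# The only units given in the text: sh+ spelled "ti" in diction and "ci" in facial.
units = [GraphemePhoneme("ti", "sh+"), GraphemePhoneme("ci", "sh+")]


def spellings_by_morph(units):
    """Group the observed letter clusters under the morph they encode."""
    table = defaultdict(set)
    for unit in units:
        table[unit.morph].add(unit.graphemes)
    return dict(table)


spellings = spellings_by_morph(units)
```

Grouping by morph makes the one-to-many relation between sounds and spellings explicit, which is the property the grapheme-phoneme units are meant to capture.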

With the rule-based grammar, we are able to generate our sub-word units by parsing a large dictionary. For the purpose of our current experiments, we only use the grapheme-phonemes to build a sub-lexical language model.
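
As a rough illustration of building a language model over parsed unit sequences, the sketch below estimates maximum-likelihood bigram probabilities. The n-gram order, the absence of smoothing, the boundary symbols, and the toy unit labels are all assumptions made for brevity; the abstract does not specify these details.

```python
from collections import Counter


def train_bigram(sequences):
    """Maximum-likelihood bigram estimates over sequences of unit labels."""
    context_counts = Counter()
    bigram_counts = Counter()
    for seq in sequences:
        padded = ["<s>", *seq, "</s>"]  # assumed word-boundary symbols
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return {bg: n / context_counts[bg[0]] for bg, n in bigram_counts.items()}


# Hypothetical parsed dictionary: each entry is one word as a list of unit labels.
# The labels "a", "b", "c" are placeholders, not units produced by the grammar.
parsed = [["a", "b"], ["a", "c"], ["b", "c"]]
model = train_bigram(parsed)
```

In practice the input would be the unit sequences obtained by parsing the large dictionary, and some form of smoothing would be needed for unseen unit pairs.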

Figure 2 (left) shows how the OOV model is integrated with the speech recognizer. As illustrated, the recognizer can hypothesize either a word or an OOV, in which case the OOV model is activated and a sequence of grapheme-phonemes is generated.

Figure 2: On the left, an illustration of the grapheme-phoneme OOV model integrated with the speech recognizer. g-p stands for grapheme-phoneme. The structure is similar to that of the Bazzi-Glass model. On the right, an illustration of the Bisani-Ney model where the OOV model is not separate and there is no distinction between a word and a sub-word unit.
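
To make the word-or-OOV alternative concrete, the following sketch turns a recognizer hypothesis into text by concatenating the graphemes of each OOV span. The hypothesis format, the function name, and the segmentation of the example word are hypothetical; the uppercase rendering of generated words only mirrors the convention of Table 1.

```python
def render(hypothesis):
    """Turn a hypothesis into text; OOV spans are spelled from their units."""
    words = []
    for token in hypothesis:
        if isinstance(token, str):
            # In-vocabulary word hypothesized directly by the recognizer.
            words.append(token)
        else:
            # OOV span: a list of (graphemes, morph) pairs from the OOV model.
            words.append("".join(graphemes for graphemes, _ in token).upper())
    return " ".join(words)


# Hypothetical segmentation; "m1" and "m2" are placeholder morph labels,
# not units actually produced by the rule-based grammar.
hyp = ["by", "a", [("dutch", "m1"), ("man", "m2")], "name"]
text = render(hyp)
```

Because every unit carries its own letters, a spelling falls out of the unit sequence without a separate sound-to-letter step.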

The idea of implementing a sub-word based OOV model is not a new one. Bazzi and Glass proposed a phone-based OOV model that represents unknown words as sequences of phones[1]. Our model has the additional capability of generating the spelling of OOV words, but the underlying mechanism for integrating the OOV model with the recognizer is similar to theirs, as illustrated in Figure 2. Bisani and Ney also worked on a sub-lexical OOV model[2]. However, in their work there is no distinction between words and sub-word units, and the recognizer can hypothesize either a word or one of the sub-word units, as illustrated in Figure 2. We have also implemented the Bisani-Ney model using the grapheme-phoneme units.

Experiments

In order to assess the effect of introducing an OOV model into our recognizer, we set up a baseline speech recognizer with a 16k-word vocabulary and no OOV modeling. To build a word trigram, we use lecture data collected at MIT. The baseline is tested on a physics lecture, and language model adaptation is performed using a companion physics textbook.

To evaluate the performance of the baseline recognizer with a larger vocabulary, we increase the vocabulary to 38k words. We also set up a cheating experiment with no OOV words by augmenting the vocabulary with all the unknown words.
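
The effect of vocabulary size on OOV words can be quantified with a simple OOV rate. The sketch below is illustrative only: the function names are ours, and the toy vocabulary and token list are made up rather than taken from the lecture data.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of test tokens that are not in the recognizer vocabulary."""
    if not tokens:
        return 0.0
    unknown = [t for t in tokens if t not in vocabulary]
    return len(unknown) / len(tokens)


def cheating_vocabulary(tokens, vocabulary):
    """Augment the vocabulary with every test word, leaving no OOV words."""
    return set(vocabulary) | set(tokens)


# Toy data (hypothetical), standing in for the real vocabulary and test transcript.
vocab = {"it", "was", "by", "a"}
test_tokens = "it was conceived by a dutchman".split()

rate = oov_rate(test_tokens, vocab)  # "conceived" and "dutchman" are unknown
cheat_rate = oov_rate(test_tokens, cheating_vocabulary(test_tokens, vocab))
```

The cheating setup gives the error rate a recognizer would reach if vocabulary coverage were perfect, which is useful for comparison with the OOV model.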

Results

Reference	it was originally conceived by a dutchman named kaiz hooker
Baseline	it was originally conceive by a dutch hmm and name case book a
Grapheme-phoneme OOV Model	it was originally conceive by a DUTCHMAN name CASEBOOKA
Bisani-Ney Model	it was originally conceived by a DUTCHMAN name CASE BOOK A

Table 1: Sample outputs from the implemented OOV models compared to the baseline and the reference utterance. The words in red are OOV words, and the UPPERCASE words in blue are those generated by the corresponding OOV model.

The sample results in Table 1 are promising and show that our OOV model is capable of generating reasonable and sometimes correct spellings of OOV words. We are still conducting experiments and assessing the performance of the OOV model in terms of word and letter error rate. So far, the OOV model introduces a slight degradation in error rate compared to the baseline.
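
Word and letter error rates are commonly computed from the Levenshtein edit distance between reference and hypothesis. The sketch below follows that standard definition; the exact scoring used in our experiments (for example, whether letter error rate is computed over whole utterances or over individual OOV words) is not specified here, so the choices in this code are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution or match
        prev = cur
    return prev[-1]


def word_error_rate(ref, hyp):
    """Edit distance over words, normalized by the reference length."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def letter_error_rate(ref, hyp):
    """Edit distance over characters, normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


wer = word_error_rate("a dutchman named", "a dutchman name")
ler = letter_error_rate("named", "name")
```

Letter error rate gives partial credit to a nearly correct spelling, whereas word error rate counts any misspelled word as a full error.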

Conclusions and Future Work

We presented a grapheme-phoneme OOV model for detecting and spelling unknown words, and we compared its performance to a baseline that contains no OOV modeling, showing that the model is in fact capable of generating adequate spellings of unknown words. We are still facing several challenges with the OOV model. First, although the model does not overfire, it sometimes fires at the wrong locations, causing unknown words to be merged or split, for example. Furthermore, the recognizer does not always do a good job of recognizing the underlying phonemes, making it even harder to generate a correct spelling. So far, most of our approaches use the grapheme-phoneme units. However, we also have access to morph units, which only incorporate sound information. It would be interesting to implement a morph OOV model and then generate a spelling from the morph sequence. Finally, the sub-word units used so far are obtained using a rule-based grammar. We also plan on investigating the performance of the recognizer with units derived using a data-driven approach.

References:

[1] I. Bazzi and J. R. Glass. Modeling Out-of-Vocabulary Words for Robust Speech Recognition. In The Proceedings of ICSLP, Beijing, China, October 2000.

[2] M. Bisani and H. Ney. Open Vocabulary Speech Recognition with Flat Hybrid Models. In The Proceedings of Eurospeech, pp. 725--728, Lisbon, Portugal, September 2005.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu