Figure 3: Color-coded excerpt from a textbook showing context-dependent topic labels.
To evaluate the effectiveness of HMM-LDA, we performed several language modeling experiments that leverage the context-dependent topic labels. For these experiments, we trained the models on a set of 150 general lectures and an introductory computer science textbook, tuned model parameters on the first 10 lectures from a class based on the textbook, and tested on the remaining 10 lectures. To avoid out-of-vocabulary issues, we used the combined vocabulary of 27,244 words for all of our models.
First, we observed that a standard trigram model trained on the general lectures tends to place too much weight on words from the training-set topics, which are not representative of the topics likely to appear in the target lectures. Thus, by training a trigram model only on n-grams that contain no topic words (LS), we redistribute the probability previously assigned to training-set topic words to general content words. When linearly interpolated with the original trigram model (L), the combined model better matches the target topic distribution and reduces perplexity by more than 4%, as shown in Table 1.
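For illustration, the sketch below shows one way such a two-component interpolation could be set up; the model interface (model.prob), the 2-word context handling, and the grid-based weight search are assumptions for exposition, not the actual toolkit or tuning procedure used.

```python
# Minimal sketch of linearly interpolating two trigram models (L and LS).
# The model interface and the weight-tuning grid are illustrative assumptions.
import math

def mixture_prob(models, weights, word, context):
    """Probability of `word` under a fixed linear mixture of n-gram models."""
    return sum(w * m.prob(word, context) for m, w in zip(models, weights))

def perplexity(models, weights, words):
    """Perplexity of the mixture on a held-out word sequence (2-word contexts)."""
    logprob = 0.0
    for i, word in enumerate(words):
        context = tuple(words[max(0, i - 2):i])
        logprob += math.log(mixture_prob(models, weights, word, context))
    return math.exp(-logprob / len(words))

def tune_weight(model_L, model_LS, dev_words, steps=21):
    """Pick the interpolation weight that minimizes perplexity on tuning lectures."""
    grid = [i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda lam: perplexity(
        [model_L, model_LS], [lam, 1.0 - lam], dev_words))
```

In this sketch the weight is tuned by simple grid search on the tuning lectures; any held-out optimization of the interpolation weight would serve the same purpose.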
By utilizing the context-dependent topic labels, we extend beyond traditional unigram topic models and build topic n-gram models from n-gram sequences that end in a word assigned to each topic. These models capture not only topic key phrases, but also frequent transitions from common words to topic words, as shown in Figure 4.
Figure 4: A sample of n-grams from select topics.
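As a rough sketch of this construction, the code below collects, for each topic, the n-grams whose final word was assigned to that topic; the (word, label) token format and the function name are hypothetical.

```python
# Hypothetical sketch: gather, per topic, the trigrams whose final word the
# HMM-LDA sampler assigned to that topic. Each training token is assumed to be
# a (word, label) pair, where label is a topic id or None for syntactic words.
from collections import defaultdict, Counter

def collect_topic_ngrams(labeled_tokens, order=3):
    """Return {topic: Counter of n-grams ending in a word assigned to that topic}."""
    topic_ngrams = defaultdict(Counter)
    words = [w for w, _ in labeled_tokens]
    for i, (_, topic) in enumerate(labeled_tokens):
        if topic is None:                 # word assigned to a syntactic HMM state
            continue
        ngram = tuple(words[max(0, i - order + 1):i + 1])
        topic_ngrams[topic][ngram] += 1
    return topic_ngrams
```

The resulting per-topic counts can then be smoothed into topic n-gram models and interpolated alongside the lecture and textbook trigram models.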
Since each lecture covers only a subset of topics, we determine the minimum perplexity achievable via an optimal linear interpolation of the topic n-gram models with the lecture and textbook trigram models. From this cheating experiment, we find that matching the mixture weights to the unknown, but skewed, topic weights in the target lectures reduces perplexity by over 10% relative to the baseline, a tuned interpolation of the lecture and textbook models.
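One way to realize such an oracle mixture is to re-estimate fixed weights on the target lecture itself with EM over the component posteriors, as in the sketch below; the per-word component probability array and the iteration count are assumptions, not details reported here.

```python
# Sketch of the "cheating" optimization: EM re-estimation of static mixture
# weights on the target lecture itself. component_probs[t][i] is assumed to
# hold the probability that component i assigns to the t-th target word.
def optimal_static_weights(component_probs, iters=50):
    """EM for fixed mixture weights given per-word component probabilities."""
    k = len(component_probs[0])
    weights = [1.0 / k] * k
    for _ in range(iters):
        totals = [0.0] * k
        for probs in component_probs:
            # E-step: posterior responsibility of each component for this word.
            mix = sum(w * p for w, p in zip(weights, probs))
            for i, (w, p) in enumerate(zip(weights, probs)):
                totals[i] += w * p / mix
        # M-step: renormalize accumulated responsibilities into new weights.
        weights = [t / len(component_probs) for t in totals]
    return weights
```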
To track the unknown topic weights in the target lectures, we adapt the current mixture distribution according to the posterior topic distribution given the current word. Surprisingly, we obtain lower perplexity than with the optimal static mixture interpolation. Evidently, adaptive mixture interpolation not only learns the underlying topic weights of the target lectures, but also tracks them as the distribution shifts over long lectures, yielding an 11.8% reduction in perplexity over the baseline model.
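A minimal sketch of this adaptation follows, assuming per-word component probabilities and a hypothetical update rate alpha: after predicting each word, the mixture weights are nudged toward the posterior over components given that word.

```python
# Sketch of adaptive mixture interpolation. The update rate `alpha` is a
# hypothetical knob, not a value reported in the paper.
import math

def adaptive_perplexity(component_probs, init_weights, alpha=0.05):
    """Perplexity with mixture weights adapted toward the per-word posterior."""
    weights = list(init_weights)
    logprob = 0.0
    for probs in component_probs:          # probs[i] = P_i(current word | context)
        mix = sum(w * p for w, p in zip(weights, probs))
        logprob += math.log(mix)
        # Posterior over components given the observed word, then a small step
        # of the running weights toward that posterior.
        posterior = [w * p / mix for w, p in zip(weights, probs)]
        weights = [(1 - alpha) * w + alpha * q for w, q in zip(weights, posterior)]
    return math.exp(-logprob / len(component_probs))
```

Because the weights keep moving with the observed words, this style of update can follow topic shifts within a long lecture rather than committing to a single static mixture.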
Table 1: Performance of various model combinations compared with the baseline. Percentage of perplexity reduction is shown in parentheses.
Since the transcript from an automatic speech recognizer (ASR) often contains a significant number of errors, we investigated the effect these errors have on the overall trend in the topic distribution. As seen in Figure 5, even with a word error rate of 39.7%, the topic distribution on the ASR output still maintains most of the dominant topic components, suggesting that adaptation at the topic level is less sensitive to recognition errors.
Figure 5: Adaptation of topic model weights on manual and ASR transcription of a single lecture.
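As an illustrative (and purely hypothetical) way to quantify the similarity shown in Figure 5, one could run the same weight tracking on both transcriptions and compare the resulting topic-weight vectors; the variable names below are placeholders.

```python
# Illustrative comparison of adapted topic-weight trajectories on a manual
# transcript versus ASR output of the same lecture. manual_probs, asr_probs,
# and the update rate are hypothetical inputs.
def track_weights(component_probs, init_weights, alpha=0.05):
    """Return the sequence of adapted mixture weights over a transcript."""
    weights, history = list(init_weights), []
    for probs in component_probs:
        mix = sum(w * p for w, p in zip(weights, probs))
        posterior = [w * p / mix for w, p in zip(weights, probs)]
        weights = [(1 - alpha) * w + alpha * q for w, q in zip(weights, posterior)]
        history.append(list(weights))
    return history

def total_variation(p, q):
    """Distance between two topic-weight vectors (0 = identical)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Example: compare the final adapted weights from the two transcriptions.
# dist = total_variation(track_weights(manual_probs, w0)[-1],
#                        track_weights(asr_probs, w0)[-1])
```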
In this work, we have leveraged the context-dependent topic labels from HMM-LDA to build simple adaptive language models that reduce perplexity by 11.8%. In future work, we plan to evaluate the models in speech recognition experiments. Furthermore, we hope to apply the changes in topic distribution to improve lecture segmentation. Finally, we are investigating changes to the HMM-LDA model to better capture content word distributions and speaker-specific variations.
Support for this research was provided in part by the National Science Foundation under grant #IIS-0415865.