CSAIL Research Abstract

A Generative Model for Simultaneous Topic Inference and Text Segmentation

Konrad Körding, Josh Tenenbaum & Tom Griffiths

Many streams of real-world data, such as spoken language, typically consist of segments, during each of which a characteristic set of topics is covered [1]. For instance, during one segment of a conversation, people might speak about their friends and all the topics associated with them, while during another segment they might speak about their work and associated topics. Here we present a generative model for the topic structure of natural language conversations, which explicitly models both the sequence of discourse segments and the distribution of topics within each segment. Inference in the model is done using Markov chain Monte Carlo (MCMC) methods. This approach simultaneously segments the conversation into topically coherent discourse sections and discovers which words are characteristic of each topic [2] in the conversation.

Generative Model

To allow for a Bayesian treatment of the problem, we formulate a generative model: a model that describes how the observed data in the corpus are generated in a hierarchical fashion by latent (unobserved) components, such as topics. The model we explore (sketched in Figure 1) derives from the topic model described in [3].

An email, text document or meeting is represented as a series of sentences or utterances. Each utterance u has a characteristic distribution over topics, a "topic vector", which is a multinomial distribution over T topics. Each topic is in turn represented as a distribution over words, so that the jth topic assigns a probability to every word w. If the utterance starts a new segment (c=1), its topic vector is randomly drawn from a symmetric Dirichlet distribution; otherwise the topic vector is carried over unchanged from the previous utterance.

Each word in an utterance is generated by choosing a topic from the utterance's topic vector and then choosing a word from that topic's distribution over words.
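The generative steps described above can be written compactly as follows. This is a sketch in assumed notation, since the symbols of the original equations are not available here: \theta^{(u)} for the topic vector of utterance u, \phi^{(j)} for the word distribution of topic j, \alpha for the Dirichlet hyperparameter, c_u for the segment-start indicator, and z_{u,i} and w_{u,i} for the topic and identity of the ith word of utterance u.

```latex
\begin{align*}
\theta^{(u)} &\sim \mathrm{Dirichlet}(\alpha) && \text{if } c_u = 1,\\
\theta^{(u)} &= \theta^{(u-1)} && \text{if } c_u = 0,\\
z_{u,i} \mid \theta^{(u)} &\sim \mathrm{Multinomial}\big(\theta^{(u)}\big),\\
w_{u,i} \mid z_{u,i} &\sim \mathrm{Multinomial}\big(\phi^{(z_{u,i})}\big).
\end{align*}
```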

Figure 1: Graphical model indicating the dependencies among variables in the probability distribution defined by the generative model for topic segmentation described in the text.

The utterances in a meeting (or sentences in a document) are divided into topical segments. Associated with each utterance is a binary variable c that indicates whether the utterance starts a new segment. This variable is drawn from a Bernoulli distribution whose parameter is in turn drawn from a symmetric Beta distribution.
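To make the generative process concrete, the following minimal Python sketch draws a synthetic "meeting" by forward sampling. It is an illustration under stated assumptions, not the authors' implementation: the function names, the hyperparameter values (alpha, beta, gamma), the fixed number of words per utterance, the symmetric Dirichlet prior on the topic-word distributions, and the rule that the first utterance always opens a segment are all hypothetical choices made for the example.

```python
import random


def sample_dirichlet(concentration, dim, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_meeting(num_utterances, words_per_utterance, num_topics,
                     vocab_size, alpha=0.5, beta=0.5, gamma=1.0, seed=0):
    """Forward-sample segment indicators c, topic assignments z and words w."""
    rng = random.Random(seed)
    # Topic-word distributions; the Dirichlet(beta) prior is an assumption.
    phi = [sample_dirichlet(beta, vocab_size, rng) for _ in range(num_topics)]
    # Probability of starting a new segment, drawn from a symmetric Beta.
    pi = rng.betavariate(gamma, gamma)
    theta = None
    c, z, w = [], [], []
    for u in range(num_utterances):
        # Assumption: the first utterance always starts a segment.
        starts_segment = 1 if u == 0 else int(rng.random() < pi)
        if starts_segment:
            # New segment: draw a fresh topic vector.
            theta = sample_dirichlet(alpha, num_topics, rng)
        # Otherwise theta is carried over unchanged from the previous utterance.
        topics = [sample_categorical(theta, rng)
                  for _ in range(words_per_utterance)]
        words = [sample_categorical(phi[t], rng) for t in topics]
        c.append(starts_segment)
        z.append(topics)
        w.append(words)
    return c, z, w


if __name__ == "__main__":
    c, z, w = generate_meeting(num_utterances=8, words_per_utterance=5,
                               num_topics=3, vocab_size=12, seed=1)
    for flag, words in zip(c, w):
        print(flag, words)
```

Inference runs this process in reverse: only the words are observed, and the segment indicators and topic assignments must be recovered.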

All of these components, the segmentation as well as the topics, must be inferred from the data: the number of topics, the distribution over words for each topic, the boundaries between meeting segments, and the topic vector for each meeting segment.

Integrating out the parameters associated with the topic vectors, the probability of a given set of word-topic assignments z and topic change variables c can be calculated using Bayes' rule.
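The structure of this posterior follows from the dependencies in the generative model. The factorization below is a sketch rather than the original closed-form expression, which is not reproduced here. It assumes the observed words are written \mathbf{w}, and that each factor is obtained by integrating out the corresponding parameters: the topic-word distributions (under an assumed prior), the topic vectors, and the segment-start probability, respectively.

```latex
\begin{equation*}
P(\mathbf{z}, \mathbf{c} \mid \mathbf{w})
\;\propto\;
P(\mathbf{w} \mid \mathbf{z})\,
P(\mathbf{z} \mid \mathbf{c})\,
P(\mathbf{c})
\end{equation*}
```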

We use MCMC methods to sample from the joint posterior over topic assignments for each word (z) and topic segmentation variables (c).

Results

We applied the algorithm to a portion of the ICSI meeting corpus, which consists of spoken-text transcripts of multiple meetings. Topic boundaries, as judged by two human raters, are also available for some of the meetings, but these data were not used in training the model. Rather, topic segmentation was completely unsupervised, and the inferred boundaries were compared with those judged by human raters.

Data from all meetings were merged into a single dataset that contained approximately 500,000 words. We sampled for 4000 iterations of MCMC. Table 1 shows the most indicative words for four of the topics inferred at iteration 4000. Figure 2 shows how the inferred topic segmentation probabilities at each utterance compare with the segment boundaries as judged by human raters. Figure 2 (left) shows the inferred boundary probabilities along with a segmentation done by a human for a typical part of the text. Figure 2 (right) shows a receiver operating characteristic (ROC) curve for our unsupervised algorithm, parameterized by a threshold on the probability of topic segment shifts. The fit is relatively good considering that the segmentation task is difficult for people and that the algorithm was not given the correct number of segments to find. Performance is at least as good as the current best supervised systems [5].
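An ROC curve of this kind can be traced by sweeping a threshold over the per-utterance boundary probabilities and comparing the thresholded predictions with the human-labelled boundaries. The Python sketch below shows one way to do this; the function name and the toy inputs are hypothetical, and it assumes exact utterance-level agreement is required, which may differ from the scoring used for Figure 2.

```python
def roc_points(boundary_probs, human_boundaries):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    boundary_probs: model probability that each utterance starts a segment.
    human_boundaries: 1 if a human rater marked a boundary there, else 0.
    """
    positives = sum(human_boundaries)
    negatives = len(human_boundaries) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one boundary and one non-boundary")
    points = [(0.0, 0.0)]  # threshold above every probability
    # Lower the threshold step by step; each step admits more predictions.
    for threshold in sorted(set(boundary_probs), reverse=True):
        true_pos = sum(1 for p, y in zip(boundary_probs, human_boundaries)
                       if p >= threshold and y == 1)
        false_pos = sum(1 for p, y in zip(boundary_probs, human_boundaries)
                        if p >= threshold and y == 0)
        points.append((false_pos / negatives, true_pos / positives))
    return points


if __name__ == "__main__":
    # Toy example: four utterances, two of them human-labelled boundaries.
    print(roc_points([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0]))
```

A curve that rises steeply toward the top-left corner indicates that high model probabilities coincide with human-judged boundaries.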

Table 1: The 10 words most associated with each of 4 of the 10 inferred topics.

Here we have shown that it is possible to automatically segment text into topically coherent segments and simultaneously to discover the topics underlying a corpus of conversations. This approach, based on dynamic probabilistic generative models, is quite general and may be applied to many other data sets where at each point of time the data is described by a topic vector that changes only occasionally. We are currently exploring applications of these methods to segmentation and topic identification in internet chat data and video data.

Figure 2: (left) For a typical part of the meeting data, the probability of segmentation assigned by the model is shown in blue, versus a manual segmentation shown in red. (right) The ROC curve for predicting human segmentations.

Acknowledgements

This work was in part funded by the DARPA/SRI CALO project and by a DFG Heisenberg Stipend (KPK). Matthew Purver participated in many helpful discussions and supplied the meeting corpus that we experimented with.

References

[1] Chafe, W.L. (1979). The flow of thought and the flow of language. In Syntax and semantics: Discourse and syntax, ed. by Talmy Givón, volume 12, 159-182. Academic Press.

[2] Blei, D.M., Ng, A.Y., & Jordan, M.I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, vol. 3: 993-1022.

[3] Griffiths, T.L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, vol. 101: 5228-5235.

[4] Barzilay, R., & Lee, L. (2004). Catching the drift: Probabilistic content models, with applications to generation and summarization. HLT-NAACL: Proceedings of the Main Conference, pp. 113--12

[5] Galley, M., McKeown, K., Fosler-Lussier, E., & Jing, H., et al. (2003). Discourse segmentation of multi-party conversation. Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu
(Note: On July 1, 2003, the AI Lab and LCS merged to form CSAIL.)