Abstracts - 2006
Minimum Cut Model for Spoken Lecture Topic Segmentation
Igor Malioutov & Regina Barzilay
The development of computational models of text structure is a central concern in natural language processing. Text segmentation is an important instance of such work. The task is to partition a text into a linear sequence of topically coherent segments and thereby induce a content structure of the document. The applications of the derived representation are broad, encompassing information retrieval, question-answering, and text summarization.
Most unsupervised text segmentation algorithms assume that variations in lexical distribution indicate topic changes. When documents exhibit sharp variations in lexical distribution, these algorithms are likely to detect segment boundaries accurately. For example, most algorithms achieve high performance on synthetic collections generated by concatenation of random text blocks. The difficulty arises, however, when transitions between topics are smooth and distributional variations are subtle. This is evident in the performance of existing unsupervised algorithms on less structured datasets, such as spoken meeting transcripts. Therefore, a more refined analysis of lexical distribution is needed.
Our work addresses this challenge by casting text segmentation in a graph-theoretic framework. We abstract a text into a weighted undirected graph, where the nodes of the graph correspond to sentences and edge weights represent pairwise sentence similarity. In this framework, text segmentation corresponds to a graph partitioning that optimizes the normalized-cut criterion. This criterion measures both the similarity within each partition and the dissimilarity across different partitions. A robust analysis is achieved by jointly considering all the partitions in a document, moving beyond localized decisions. Although minimizing normalized cuts is NP-complete in general, we show that for chain graphs with a linearity constraint on the segmentation, the exact solution can be found in polynomial time using dynamic programming.
There are two broad classes of algorithms devised for text segmentation: supervised, feature-based systems formulated in the discriminative framework [4, 5], and unsupervised systems based on the assumption of lexical cohesion within segments [1, 6]. Our work fits most closely with this latter line of research. Most unsupervised algorithms assume that fragments of text with homogeneous lexical distribution correspond to topically coherent segments. The homogeneity is typically computed by analyzing the similarity in the distribution of words within a segment. Other approaches determine segment boundaries by locating sharp changes in similarity between adjacent blocks of text.
In contrast to previous approaches, the minimum cut algorithm simultaneously optimizes the similarity within each segment and the dissimilarity across segments. Thus, the homogeneity of a segment is determined not only by the similarity of its members, but also by their relation to other words in a text. We show that optimizing this objective refines the analysis of the lexical distribution and enables us to detect subtle topical changes. Another advantage of this formulation is its computational efficiency. Similarly to other segmentation approaches [1, 6], we employ dynamic programming to find the globally optimal solution.
Whereas previous unsupervised approaches to segmentation rested on intuitive notions of similarity density, we formalize the objective of text segmentation through cuts on graphs. We aim to jointly maximize the intra-segmental similarity and minimize the similarity between different segments. In other words, we want to find the segmentation with a maximally homogeneous set of segments that are also maximally different from each other.
Let $G = (V, E)$ be an undirected, fully-connected, weighted graph, where $V$ is the set of nodes corresponding to sentences in the text and $E$ is the set of weighted edges. The edge weights, $w(u, v)$, define a measure of similarity between pairs of nodes $u$ and $v$, where higher scores indicate higher similarity.
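As an illustration, the similarity graph can be sketched as a matrix of cosine similarities over word counts. The function names and the bag-of-words weighting below are our own choices for exposition, not necessarily the exact weighting scheme used in the system:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    num = sum(cnt * b[w] for w, cnt in a.items())
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def similarity_graph(sentences):
    """Edge weights w(u, v) for every pair of sentences (fully connected)."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    return [[cosine(u, v) for v in vecs] for u in vecs]
```

The resulting matrix is symmetric, with higher entries for lexically similar sentence pairs.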
The cut between two disjoint sets of nodes $A$ and $B$ is defined to be the sum of the crossing edges between the two sets of nodes: $\mathrm{cut}(A, B) = \sum_{u \in A, v \in B} w(u, v)$. We need to ensure that the two partitions are not only maximally different from each other, but also that they are themselves homogeneous, by accounting for intra-partition node similarity. We adopt the normalized cut on graphs, introduced by Shi and Malik in the context of image segmentation. The cut is normalized by the volume of the subpartitions, i.e., the sum of edges from the partition to the whole graph: $\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(A)} + \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(B)}$, where $\mathrm{vol}(A) = \sum_{u \in A, v \in V} w(u, v)$.
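These definitions translate directly into code. A minimal sketch over a similarity matrix `w` (function names are ours, for illustration only):

```python
def cut(w, A, B):
    """cut(A, B): sum of edge weights crossing between node sets A and B."""
    return sum(w[u][v] for u in A for v in B)

def volume(w, A, V):
    """vol(A): sum of edge weights from A to every node in the graph."""
    return sum(w[u][v] for u in A for v in V)

def ncut(w, A, B):
    """Normalized cut of a two-way partition (A, B) of the graph."""
    V = list(A) + list(B)
    c = cut(w, A, B)
    return c / volume(w, A, V) + c / volume(w, B, V)
```

A homogeneous partition has large volume relative to its cut, so its Ncut is small.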
We consider the problem of partitioning the graph into $k$ disjoint sets of nodes $A_1, \ldots, A_k$, which form a partition of the graph, so as to minimize the normalized cut: $\mathrm{Ncut}_k(V) = \sum_{i=1}^{k} \frac{\mathrm{cut}(A_i, V \setminus A_i)}{\mathrm{vol}(A_i)}$, where $V \setminus A_i$ is the set difference between the set of nodes in the entire graph and the set of nodes in partition $A_i$. By minimizing this objective we simultaneously minimize the similarity across partitions and maximize the similarity within partitions.
Papadimitriou proved that the problem of minimizing normalized cuts on graphs is NP-complete. However, in our case, the graph is a linear, fully-connected chain of nodes, and the multiway cut is constrained to preserve the linearity of the segmentation. By segmentation linearity, we mean that all of the nodes between the leftmost and the rightmost node of a particular partition have to belong to that partition. With this constraint, we formulate a dynamic programming algorithm for directly finding the minimum normalized multiway cut in polynomial time. The time complexity of our dynamic programming algorithm is $O(KN^2)$, where $K$ is the number of partitions and $N$ is the number of nodes in the graph, i.e., sentences in the transcript.
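Such a dynamic program can be sketched as follows. This is our own illustrative implementation, not the authors' code; for clarity it recomputes each candidate segment's cut and volume from scratch, whereas precomputing those sums is what brings the algorithm down to the polynomial bound:

```python
def min_ncut_segmentation(w, k):
    """Exact DP for the minimum normalized multiway cut on a chain graph.

    w is the sentence-similarity matrix; k is the number of segments.
    C[i][j] holds the best cost of splitting sentences 0..i-1 into j
    contiguous segments, each contributing cut(A, V - A) / vol(A).
    Returns the k - 1 interior boundary positions.
    """
    n = len(w)
    V = range(n)

    def seg_cost(s, e):  # segment A = sentences s..e-1
        c = sum(w[u][v] for u in range(s, e) for v in V if not s <= v < e)
        vol = sum(w[u][v] for u in range(s, e) for v in V)
        return c / vol if vol else 0.0

    INF = float("inf")
    C = [[INF] * (k + 1) for _ in range(n + 1)]
    back = [[0] * (k + 1) for _ in range(n + 1)]
    C[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            for t in range(j - 1, i):  # t = start of the last segment
                cost = C[t][j - 1] + seg_cost(t, i)
                if cost < C[i][j]:
                    C[i][j], back[i][j] = cost, t
    cuts, i = [], n
    for j in range(k, 0, -1):  # walk back through segment starts
        i = back[i][j]
        cuts.append(i)
    return sorted(cuts)[1:]  # drop the leading 0
```

On a similarity matrix with two lexically homogeneous blocks, the recovered boundary falls between the blocks, since any other split incurs a larger cut relative to segment volume.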
We tested our algorithm on a corpus of spoken lectures. Segmentation in this domain is challenging in several respects. Being less structured than written text, lecture material exhibits digressions, disfluencies, and other artifacts of spontaneous communication. In addition, the output of speech recognizers suffers from high word error rates, due to specialized technical vocabulary and a lack of in-domain spoken data for training. Finally, pedagogical considerations call for fluent transitions between different topics in a lecture, further complicating the segmentation task.
Our first corpus consists of spoken lecture transcripts from an undergraduate Physics class taught at MIT. The Physics lecture transcript segmentations were produced by the teaching staff of the course. Their objective was to facilitate access to lecture recordings available on the class website; this segmentation conveys the high-level topical structure of the lectures. We have access both to manual transcriptions of these lectures and to the output of an automatic speech recognition system, the SUMMIT segment-based speech recognizer. We also evaluated our system on lecture transcripts from a graduate Artificial Intelligence class taught at MIT. To evaluate our system, we compute the Pk measure, one of the most widely used metrics for assessing text segmentation performance. The Pk measure estimates the probability that a randomly chosen pair of words a distance of k words apart is inconsistently classified, i.e., placed in the same segment by one segmentation and in different segments by the other.
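A minimal sketch of the Pk computation (the function and its conventions are ours: segmentations are given as lists of segment lengths, and the window size k defaults to half the mean reference segment length, a common convention for this metric):

```python
def pk(reference, hypothesis, k=None):
    """Probability that two positions k words apart are inconsistently
    classified: same-segment in one segmentation, different in the other."""
    def labels(segment_lengths):
        out = []
        for seg_id, length in enumerate(segment_lengths):
            out.extend([seg_id] * length)
        return out

    ref, hyp = labels(reference), labels(hypothesis)
    n = len(ref)
    if k is None:
        k = max(1, round(n / len(reference) / 2))
    errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
                 for i in range(n - k))
    return errors / (n - k)
```

Pk is an error rate, so lower values indicate better segmentations; a perfect hypothesis scores 0.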
We use the state-of-the-art system by Utiyama and Isahara as our point of comparison; it has been shown to outperform or compare favorably with other top-performing segmentation systems. Our system outperforms the baseline on both domains, and the performance improvement is consistent across the two datasets (see Table 1). Another attractive property of the algorithm is robustness to noise: its accuracy does not deteriorate significantly when applied to automatically recognized speech. On the lecture transcripts obtained from the speech recognition system, our method achieves a Pk measure of 0.311, compared to 0.299 on manual transcripts (a 4% relative increase in Pk, where lower is better). The baseline's performance degrades from 0.311 to 0.352 Pk (a 13.2% relative increase).
Considering long-distance lexical relations contributes to the effectiveness of the algorithm. Many other existing unsupervised segmentation algorithms are unable to exploit these non-local dependencies. We analyzed variations in the segmentation performance on a range of testing conditions. Not surprisingly, the performance of the algorithm depends on the distributional properties of the input text. We found that the segmentation accuracy on a synthetic text corpus is not predictive of the performance on the real data. These results strongly suggest that segmentation algorithms have to be evaluated on a collection of texts displaying real-world variability.
This material is based upon work supported under a National Science Foundation Graduate Research Fellowship. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.