Word Acquisition Using Unsupervised
Acoustic Pattern Discovery
Alex S. Park & James R. Glass
Introduction
We are working on a conceptually simple algorithm which can analyze an
audio recording and automatically discover and identify recurring patterns
which are likely to be significant with respect to the content of the
recording. The initial component of our algorithm is a completely unsupervised
method for grouping acoustically similar patterns into clusters using
pair-wise comparisons. Unlike traditional approaches to speech processing,
which use an automatic speech recognizer, this clustering process does
not utilize a pre-specified lexicon. There are many ways in which the
clusters can be used: for directly summarizing unprocessed audio streams,
for initializing a lexicon prior to performing more exhaustive recognition,
or for performing information retrieval tasks on the recording. In [1],
we first introduced and demonstrated the utility of our pattern discovery
method for augmenting audio information retrieval. Here, we discuss extensions
to our original approach which allow us to use the clusters to automatically
identify words in the audio stream without explicitly relying on an ASR
system.
Approach
Our approach can be summarized in three main stages, which are listed and
described below.
- Segmental Dynamic Time Warping
Typically, dynamic time warping (DTW) is used to find the optimal
global alignment between two whole word exemplars. In our work, we
face the problem of aligning two continuous speech utterances together
to find matching regions that correspond to common words or phrases.
This type of scenario requires that we find local alignments of matching
subsequences rather than a single globally optimal alignment.
To address this, we introduced a segmental variation of DTW in [1],
which is illustrated in Figure 1. After computing the distance matrix
between two utterances, constrained diagonal bands of width W are searched
for alignment paths. The constrained bands serve two purposes. First,
via the width parameter W, they limit the amount of temporal distortion
that can occur between two sub-utterances during alignment. Second,
they allow for multiple alignments, as each band corresponds to a different
potential path, with start and end points distinct from those of the
global alignment path.
In practice, we use silence detection to break the audio stream into
separate utterances and then repeat the segmental DTW process for
each pair of utterances. As utterances are compared against each other,
a concentration of path fragments at a particular location in the
audio stream indicates a higher recurrence of the pattern occurring
at that location. A code sketch of this procedure is given after Figure 1.
Figure 1: The segmental DTW algorithm.
As in standard DTW, the distance matrix is computed between
two utterances. The matrix is then cut into overlapping diagonals
(only one shown here) with width W. The optimal alignment
path within each diagonal band is then found using DTW, and
the resulting path is trimmed to the least-average-distortion subsequence.
Finally, the trimmed alignment paths are retained as the subsequence
alignments between the two utterances.
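To make the band-constrained search concrete, the following Python sketch implements one plausible version of segmental DTW. It is a minimal illustration under stated assumptions, not the original system: the Euclidean frame distance, the band step, the minimum fragment length, and all function names are ours, and the least-average-subsequence trimming is done by brute force.

```python
import numpy as np

def segmental_dtw(x, y, width=6, min_len=50):
    """Sketch: find local alignment fragments between utterances x and y.

    x, y: (frames, dims) feature arrays (e.g. MFCCs).
    Returns a list of (path, avg_distortion) pairs, at most one per band.
    """
    # Frame-pair distance matrix (Euclidean here for simplicity).
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    n, m = dist.shape
    fragments = []
    # One diagonal band per offset; the step controls how bands overlap.
    for offset in range(-(n - 1), m, width):
        path = _dtw_in_band(dist, offset, width)
        if path is not None:
            frag = _min_avg_subpath(path, dist, min_len)
            if frag is not None:
                fragments.append(frag)
    return fragments

def _dtw_in_band(dist, offset, width):
    """Standard DTW restricted to cells with |(j - i) - offset| <= width."""
    n, m = dist.shape
    start = (0, offset) if offset >= 0 else (-offset, 0)
    if start[0] >= n or start[1] >= m:
        return None
    cost, back = {start: dist[start]}, {}
    for i in range(start[0], n):
        for j in range(m):
            if (i, j) == start or abs((j - i) - offset) > width:
                continue
            prevs = [c for c in ((i - 1, j - 1), (i - 1, j), (i, j - 1))
                     if c in cost]
            if prevs:
                best = min(prevs, key=cost.get)
                cost[(i, j)] = cost[best] + dist[i, j]
                back[(i, j)] = best
    # A band's path ends where it leaves the matrix (last row or column).
    ends = [c for c in cost if c[0] == n - 1 or c[1] == m - 1]
    if not ends:
        return None
    cur = min(ends, key=cost.get)
    path = [cur]
    while cur in back:
        cur = back[cur]
        path.append(cur)
    return path[::-1]

def _min_avg_subpath(path, dist, min_len):
    """Trim a path to its minimum-average-distortion contiguous subsequence."""
    if len(path) < min_len:
        return None
    costs = np.array([dist[c] for c in path])
    best = None
    for a in range(len(path) - min_len + 1):
        for b in range(a + min_len, len(path) + 1):
            avg = costs[a:b].mean()
            if best is None or avg < best[1]:
                best = (path[a:b], avg)
    return best
```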
- Node Extraction
The result of the segmental DTW phase is a set of alignment paths
distributed throughout the audio stream. Although the alignment paths
overlap common time regions, path boundaries typically do not coincide
with each other and multiple intervals exist for each point in time.
We deal with this by extracting a set of time indices
from the audio stream: we aggregate the inverted distortion profiles
of the alignment paths to form a similarity profile over time. After
smoothing the similarity profile with a triangular averaging window,
we take the peaks of the resulting smoothed profile and use them
to represent the multiple paths overlapping each point in time. The
extracted time indices demarcate locations that bear resemblance to
other locations in the audio stream. This process is shown in Figure 2
and sketched in code after Figure 3.
The reasoning behind this procedure can be understood by noting that
only some portions of the audio stream will have high similarity (i.e.
low distortion) to other portions. By focusing on the peaks of the
aggregated similarity profile, we restrict ourselves to finding those
locations that are most similar to other locations. Since every alignment
path covers only a small segment, the similarity profile will fluctuate
over time. This causes the audio stream to separate naturally into
multiple nodes corresponding to distinct patterns that can be joined
together in an adjacency graph as shown in Figure
3.
Figure 2: Extraction of node time indices from alignment paths.
The inverted distortion profiles of the paths are aggregated
into a similarity profile over time; peaks of the smoothed
profile mark locations that align well to other parts of the
audio stream.
Figure 3: Production of an adjacency graph
from alignment paths and extracted nodes. The audio stream
is shown as a timeline, while the alignment paths are shown
as pairs of colored lines at the same height above the timeline.
Node relations are captured by the graph on the right, with
edge weights given by the path similarities.
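The following is a minimal sketch of the node extraction step described above, assuming the alignment fragments have already been projected onto one audio stream's timeline. The inverse-distortion weighting, window length, and peak-picking rule are illustrative assumptions; the text specifies only that inverted distortion profiles are aggregated, smoothed with a triangular window, and peak-picked.

```python
import numpy as np

def extract_nodes(num_frames, fragments, win=25):
    """Sketch: turn alignment path fragments into node time indices.

    fragments: (start_frame, end_frame, avg_distortion) tuples locating
    each alignment path fragment on this audio stream's timeline.
    """
    profile = np.zeros(num_frames)
    for start, end, distortion in fragments:
        # Invert distortion so low-distortion paths add high similarity.
        profile[start:end] += 1.0 / (distortion + 1e-6)
    # Smooth with a triangular averaging window, as described above.
    tri = np.bartlett(2 * win + 1)
    smoothed = np.convolve(profile, tri / tri.sum(), mode="same")
    # Peaks of the smoothed profile become the extracted node locations.
    return [t for t in range(1, num_frames - 1)
            if smoothed[t - 1] < smoothed[t] >= smoothed[t + 1]
            and smoothed[t] > 0]
```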
- Cluster Identification
The procedure we use to assign words to the clusters generated from
the previous section is relatively straightforward. For a given cluster,
C, a phonetic recognizer is used to transcribe the interval
underlying each node and convert it into a set of n-phones.
Likewise, the pronunciations for all words in a large baseform dictionary,
W, are converted into sets of n-phones. This process
reduces each node, $c \in C$, and each
word, $w \in W$, into sets of n-phone
sequences, $P_c$ and $P_w$. By comparing the similarity of the words in the dictionary
to the nodes in C, the most likely candidate word common
to the cluster can be found. The hypothesized cluster identity is
given by

$$\hat{w} = \operatorname*{arg\,max}_{w \in W} \; \frac{1}{|C|} \sum_{c \in C} \frac{2\,|P_c \cap P_w|}{|P_c| + |P_w|}$$

In this equation, we use the normalized intersection between the
sets $P_c$ and $P_w$ as a measure of similarity and aggregate this over all nodes
in the cluster for each word. We include the
factor $|P_c| + |P_w|$ in the denominator to normalize for the sizes of $P_c$
and $P_w$, so that longer or shorter words are not favored due to their
length. The factor of 2 in the numerator is included so that the overall
score ranges between 0 and 1. Using this similarity score, we can
easily generate an N-best list of word candidates for each
cluster. A minimal sketch of this scoring procedure follows.
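The scoring rule above maps directly to code. Below is a minimal sketch assuming trigram n-phones (n = 3) and in-memory transcriptions and pronunciations; the function and variable names are hypothetical.

```python
def identify_cluster(node_phones, dictionary, n=3):
    """Sketch: hypothesize a cluster identity from n-phone set overlap.

    node_phones: one phone sequence per node in the cluster C.
    dictionary: baseform pronunciations, mapping word -> phone sequence.
    Returns (best_word, score) under the Dice-style measure above.
    """
    def nphones(phones):
        # The set of n-phone subsequences of a phone sequence.
        return {tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)}

    node_sets = [nphones(p) for p in node_phones]
    best_word, best_score = None, -1.0
    for word, pron in dictionary.items():
        pw = nphones(pron)
        if not pw:
            continue
        # Average the normalized intersection over all nodes in the cluster.
        score = sum(2 * len(pc & pw) / (len(pc) + len(pw))
                    for pc in node_sets if pc) / len(node_sets)
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score
```

Sorting all dictionary words by this score, rather than keeping only the maximum, yields the N-best candidate list mentioned above.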
Experiments and Results
The speech data used for our experiments is taken from a corpus of audio
lectures collected at MIT [2]. The entire corpus consists
of approximately 300 hours of lectures from a variety of academic courses
and seminars. The audio was recorded using an omni-directional microphone
in a classroom environment. In a previous paper, we described characteristics
of this lecture data and performed recognition and information retrieval
experiments [3]. Each lecture typically contains a
large amount of speech (from thirty minutes to an hour) from a single
person in an unchanging acoustic environment. On average, each lecture
contains only about 800 unique words, with high usage of subject-specific
words and phrases. The lectures used in our experiments were taken from
courses in computer science (CS), physics, and linear algebra.
Computer Science

| Cluster Size | Common Word(s) | Purity | Hypothesis | Score |
|--------------|-----------------|--------|-------------|-------|
| 72 | square root | 1.00 | square | 0.23 |
| 40 | procedure | 0.97 | procedure | 0.34 |
| 21 | combination | 1.00 | commination | 0.43 |
| 19 | computer | 1.00 | computer | 0.63 |
| 17 | primitive | 0.94 | primitive | 0.12 |
| 14 | definition | 1.00 | definition | 0.29 |
| 12 | parentheses | 1.00 | prentice | 0.29 |
| 12 | product | 0.83 | prada | 0.65 |
| 10 | operator | 0.80 | operator | 0.37 |
| 10 | and | 1.00 | and | 0.29 |

Linear Algebra

| Cluster Size | Common Word(s) | Purity | Hypothesis | Score |
|--------------|-----------------|--------|-------------|-------|
| 73 | combination | 0.52 | combination | 0.26 |
| 46 | be | 0.24 | woodby | 0.03 |
| 45 | column | 1.00 | column | 0.85 |
| 42 | and | 0.64 | manthe | 0.04 |
| 32 | minus | 0.94 | minus | 0.45 |
| 30 | matrix | 0.97 | matrix | 0.69 |
| 27 | is | 0.56 | get | 0.08 |
| 22 | right hand side | 0.95 | righthand | 0.46 |
| 14 | picture | 1.00 | picture | 0.88 |
| 11 | one and | 1.00 | wanna | 0.26 |

Physics

| Cluster Size | Common Word(s) | Purity | Hypothesis | Score |
|--------------|-----------------|--------|-------------|-------|
| 21 | is | 0.24 | paced | 0.05 |
| 18 | charge | 1.00 | charge | 0.76 |
| 17 | positively | 0.76 | positively | 0.48 |
| 16 | electricity | 1.00 | electricity | 0.71 |
| 9 | forces | 0.89 | forces | 0.72 |
| 9 | positive | 0.89 | positive | 0.94 |
| 7 | gravitational | 1.00 | invitational | 0.61 |
| 6 | times ten to | 1.00 | tent | 0.57 |
| 6 | distance | 0.83 | distance | 0.51 |
| 5 | gravity | 1.00 | gravity | 0.44 |
Table 1: Ten largest clusters for
lectures in computer science, linear algebra, and physics. From
left to right, the columns list cluster size, the underlying common
reference word(s) for the cluster, the cluster purity (the fraction
of nodes containing the common reference word(s)), the top cluster
identity hypothesis, and the hypothesis score.
In Table 1, we show example clusters from each
lecture, ranked by size, together with the results of the cluster identification
algorithm. At first glance, we found that the identification procedure
worked surprisingly well given that we did not use a unigram language
model to bias the baseform dictionary prior to search. We show individual
clusters rather than overall identification accuracy because we found
the types of errors made by the algorithm particularly enlightening.
The first class of errors originates
with the clustering component of our algorithm. These clusters erroneously
combine acoustically similar, but lexically dissimilar, nodes consisting
of function word sequences and word components. Examples include {"this
is", "misses", "which is"} and {"would be", "b", "see"}. The lexical disagreement
within these clusters can be seen in their purity scores, which measure
the fraction of nodes containing the most common reference word(s) for
the cluster. An important observation we can make, however, is that the
scores for the hypothesized words are very low, indicating that we can
use the identification score as a metric for rejecting these types of
clusters.
The second and third classes of errors
can be attributed to limitations of the identification procedure.
Multi-word phrase clusters induce the second class of
errors, where the best matching single word in the dictionary only partially
covers the reference phone sequence. In the case of "right hand side"
and "square root", the algorithm finds one of the constituent words, but
for "times ten to", the best matching word is "tent", which occurs phonetically,
but not lexically, in the phrase.
The third class of errors is characterized by single-word
clusters with relatively high purity. Upon examining the node phonetic
transcriptions, we concluded that the errors in these cases are due to
the inability of the phonological rules to account for the surface realization
of the word. Conspicuous examples are all present in the CS lecture, where
the lecturer consistently omits the "b" in "combination" and omits both
schwas in "parentheses". In the future, we may be able to mitigate these
errors by using more powerful phonological rules or by adopting a more
flexible search phase. It is interesting to note that almost all of these
errors occurred in the CS lecture, which suggests that their occurrence
may depend on speaking style.
Ongoing Research
In the future, we plan to improve upon the results we have presented
by iterating the word discovery process and by using phone lattices instead
of the top phone transcription during the cluster identification stage.
We can also incorporate lexical knowledge to help identify clusters by
using the word N-best lists for each cluster to find the set
of words that maximizes some joint probability of all words occurring
in a single document. Even without the cluster identification component,
we believe that our approach has many potential uses. In many tasks involving
the organization of large amounts of audio data, the core idea of pattern
discovery may be more suitable than a traditional speech recognizer because
it is completely language independent and requires no training data. The
unsupervised nature of the algorithm also makes it useful for improving
our understanding of how to learn directly from speech.
References:
[1] A. Park and J. Glass, Towards Unsupervised Pattern
Discovery in Speech. In Proc. IEEE Workshop on Automatic Speech Recognition
and Understanding, San Juan, Puerto Rico, December 2005.
[2] J. Glass, T. J. Hazen, L. Hetherington, and C.
Wang, Analysis and Processing of Lecture Audio Data: Preliminary Investigations.
In Proc. HLT-NAACL 2004 Workshop on Interdisciplinary Approaches to
Speech Indexing and Retrieval, pp. 9--12, Boston, May 2004.
[3] A. Park, T. J. Hazen, and J. Glass, Automatic
Processing of Audio Lectures for Information Retrieval: Vocabulary Selection
and Language Modeling. In Proc. ICASSP, pp. I-497--500, Philadelphia,
2005.