Abstracts - 2006
Automatic Alignment and Error Correction of Human Generated Transcripts for Long Speech Recordings
In recent years, improvements in data storage and networking technology have made it feasible to provide Internet users with access to large amounts of multimedia content. For example, many universities are now providing free web-based access to audio-visual recordings of academic lectures (e.g., MIT's OpenCourseWare). Unlike text, however, audio-visual material is not easy to search and browse without time-aligned transcriptions.
Because manual transcription can be a costly and time-consuming process, the development of automatic approaches for transcribing lectures is an obvious step towards making multimedia content more accessible. However, in many cases, approximate manual transcriptions (i.e., imperfect transcripts that were generated quickly and/or cheaply) may be available. In this case, performing an automatic alignment of these imperfect transcripts with the speech in the audio file is preferable to automatically generating a new transcription.
In this work, we have developed a new approach for forced alignment of long, approximate transcriptions of spoken lectures that is designed to discover and correct errors in the manual transcripts, thereby improving the fidelity and usefulness of the transcripts in the process.
Automatic Alignment Process
In this work we have taken an approach to automatic alignment of long audio files similar to the recursive approach proposed by Moreno et al. [1], but with two primary differences. First, our approach does not use recursion to progressively reduce the alignment process to operate over smaller and smaller segments. Instead, our system uses a fixed number of recognition passes, each performing a specific type of refinement to the proposed transcription alignment. Second, our system explicitly searches for discrepancies between the provided manual transcription and the observed acoustics in the audio file and attempts to correct these discrepancies. The different stages of our processing are explained in detail in the sections that follow.
Stage 1: Automatic Speech Recognition In the first stage of our approach, we use the SUMMIT automatic speech recognition (ASR) system to produce an automatically generated transcript of a long audio file [2,3]. Because we have a manually generated transcript we can strongly bias the recognizer's language model to favor the content of the transcript. In our case we implement a mixture trigram model to combine a trigram model trained on the transcript with a topic-independent trigram trained on a variety of data including the Switchboard corpus and a collection of transcribed lectures covering a variety of topics. By string aligning the speech recognition result against the manual transcript and finding matching word sequences between the two transcripts, anchor points for a refined forced-alignment stage can be determined.
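As a rough illustration of the anchor-finding step, common word sequences between the ASR hypothesis and the manual transcript can be located with a standard string-alignment routine. The sketch below uses Python's `difflib`; the function name and the minimum-match threshold are illustrative assumptions, not details of the actual system:

```python
import difflib

def find_anchors(asr_words, transcript_words, min_match=3):
    """Find runs of words where the ASR output and the manual
    transcript agree; such runs can serve as anchor points for a
    subsequent forced-alignment stage."""
    matcher = difflib.SequenceMatcher(a=asr_words, b=transcript_words)
    anchors = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_match:  # keep only confident matches
            anchors.append((block.a, block.b, block.size))
    return anchors

asr = "so today we will er talk about speech recognition".split()
ref = "so today we will talk about automatic speech recognition".split()
anchors = find_anchors(asr, ref)  # one long agreed-upon run at the start
```

Each anchor records where a matching word run begins in the ASR output, where it begins in the transcript, and its length; the audio between consecutive anchors is then handed to the alignment stage.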
Stage 2: Pseudo-Forced Alignment After obtaining anchor points from the first stage recognition, the second stage produces a pseudo-forced alignment of the manual transcript across the speech segments spanning between the first stage anchor points. We call it a pseudo-forced alignment because we do not force the recognizer to align the exact string present in the manual transcription. Instead, we assume that errors in the transcript are possible, and we allow insertions of new words and substitutions for existing words through the use of a phonetic-based out-of-vocabulary (OOV) word filler model [4]. We also allow any word in the transcript to be deleted. The rates of insertions, substitutions, and deletions can be controlled using penalty weights to ensure that correctly transcribed words are rarely replaced or deleted.
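The effect of the penalty weights can be sketched with a word-level weighted edit distance. This is a simplification for illustration only: the actual system applies these penalties during acoustic decoding with the OOV filler model, not as a text-level alignment, and the weight values below are arbitrary:

```python
def align_cost(ref, hyp, sub_pen=4.0, ins_pen=3.0, del_pen=3.0):
    """Weighted Levenshtein cost of aligning a transcript (ref)
    against a hypothesis (hyp).  High penalties bias the aligner
    toward keeping transcript words, so correctly transcribed
    words are rarely replaced or deleted."""
    n, m = len(ref), len(hyp)
    # cost[i][j]: minimum cost of aligning ref[:i] with hyp[:j]
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + del_pen
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + ins_pen
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = cost[i - 1][j - 1] + (0.0 if ref[i - 1] == hyp[j - 1] else sub_pen)
            cost[i][j] = min(match,
                             cost[i - 1][j] + del_pen,   # delete a ref word
                             cost[i][j - 1] + ins_pen)   # insert a hyp word
    return cost[n][m]
```

Raising `sub_pen`, `ins_pen`, or `del_pen` makes the corresponding edit rarer, which is the same control the penalty weights exert in the acoustic search.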
Stage 3: Alignment Editing After the second stage is complete, the transcript is fully aligned against the speech, and regions containing potential substitutions, deletions and insertions are marked. From this transcript, we can re-run our speech recognizer over local segments containing these marked regions. We allow the recognizer to edit the manual transcript by hypothesizing any word(s) in the recognizer's full vocabulary as a replacement for the marked substitutions and insertions. A trigram language model is used to provide language model constraint for the proposed edits. We also allow the recognizer to reconsider any deletions proposed in the previous stage. This editing process allows the system to correct some of the errors of the human transcriber.
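The final splicing step of the editing stage can be pictured as follows. Here `apply_edits` is a hypothetical helper written for illustration; in the real system each replacement sequence comes from re-running the recognizer over the marked audio segment:

```python
def apply_edits(transcript, edits):
    """Splice recognizer-proposed edits back into the transcript.

    `edits` maps a (start, end) word span in the transcript to the
    replacement word sequence hypothesized by the recognizer over
    that segment; an empty replacement confirms a deletion."""
    out, pos = [], 0
    for (start, end), replacement in sorted(edits.items()):
        out.extend(transcript[pos:start])  # copy untouched words
        out.extend(replacement)            # splice in the recognizer's edit
        pos = end
    out.extend(transcript[pos:])           # copy the remaining tail
    return out
```

For example, replacing the span (1, 2) in `["a", "b", "c", "d"]` with `["x", "y"]` yields `["a", "x", "y", "c", "d"]`, and an empty replacement simply drops the marked words.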
For our experiments we examined three 80-minute-long academic lectures. Human transcripts of these lectures were initially generated by a commercial transcription service. An examination of the initial transcripts yielded several observations. First, they were considerably cleaner than the actual speech. Filled pauses, false starts, partial words, grammatical errors, and other speech errors were typically removed or corrected in the transcriptions. The transcribers also made substitution errors (i.e., substituting an incorrect word for the actual spoken word) for about 2.5% of all words. Often these errors were on subject-specific or technical terms that the transcribers were likely unfamiliar with. An exact transcript of the lectures was later generated in our laboratory. A comparison of the approximate transcripts generated by the commercial service against the carefully produced exact transcript demonstrated a 10% difference between the transcripts. In our experiments we attempt to align and correct the approximate transcript against the underlying audio.
Table 1 below summarizes our experiments using our automatic alignment process. The automatic speech recognition (ASR) first stage achieved a word error rate (WER) of 24.3% across the three lectures. The second stage forced alignment result is reported without consideration for the substitutions, insertions, and deletions proposed by the recognizer. Insertions proposed by the OOV word filler model are ignored in the WER calculation, and transcribed words that are deleted or substituted are retained at the time locations where the substitutions or deletions occurred. The forced alignment WER of 10.3% is directly comparable with the 10% error observed when string aligning the approximate transcript against the exact transcript (i.e., the "oracle" alignment). The automatic alignment procedure shows only a minor 0.3% degradation in performance (i.e., 0.3% of words were misaligned when compared to their actual location in the audio file). This indicates that our system is producing an accurate alignment.
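The WER figures above follow the standard definition: the minimum number of word substitutions, deletions, and insertions needed to turn the reference transcript into the hypothesis, divided by the number of reference words. A minimal sketch of that computation:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / len(ref),
    computed as a word-level Levenshtein distance."""
    n, m = len(ref), len(hyp)
    # d[i][j]: edit distance between ref[:i] and hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                       # i deletions
    for j in range(m + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return d[n][m] / n
```

For instance, a four-word reference with one substitution and one deletion in the hypothesis scores a WER of 0.5.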
The editing stage reduces the WER of the transcription from 10.3% to 8.8%. This is a 14% relative reduction in word error rate from the automatic forced alignment, and a 12% relative reduction in error rate over the optimally aligned human transcription. An examination of the results shows that almost all of the improvement is gained from the insertion of words omitted by the human transcriber. While the fidelity of the corrected transcriptions was improved, a preliminary examination reveals that most of the automatic corrections are not important for human comprehension. The system showed little ability to consistently correct human mistranscriptions of important content words. This is not entirely unexpected as most of the human errors involved substitutions which are phonetically close to the actual words spoken, and hence similarly difficult for the automatic system to discriminate.
Table: Word error rates for the different stages of our automatic alignment and error correction procedure.
[1] P. Moreno, C. Joerg, J.-M. Van Thong, and O. Glickman, A recursive algorithm for the forced alignment of very long audio segments. In Proceedings of the International Conference on Spoken Language Processing, Sydney, Australia, December 1998.
[2] J. Glass, A probabilistic framework for segment-based speech recognition. In Computer Speech and Language, vol. 17, no. 2-3, pp. 137-152, April-July 2003.
[3] A. Park, T. J. Hazen, and J. Glass, Automatic processing of audio lectures for information retrieval: Vocabulary selection and language modeling. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, PA, March 2005.
[4] I. Bazzi and J. Glass, Modeling out-of-vocabulary words for robust speech recognition. In Proceedings of the International Conference on Spoken Language Processing, Beijing, China, October 2000.