Subjectively, the patterns that these patches are "looking for" (in a matched-filter sense) do not correspond well to known acoustic phonetic phenomena [14]. Taken as a whole, however, their outputs do a fair job of representing phonetic units, which makes subsequent classification possible. They are nevertheless not very robust to background noise, channel noise, speaker variation, etc. [1], and could likely be improved.
By adopting the more general view of features as being extracted from a T-F image (by convolving patches), it appears unlikely that those in Figure 1 are the optimal set. There are many possibilities for designing such filters; one approach is to draw on knowledge of acoustic phonetics. One can create a hand-designed set of detectors, each matching a specific time-frequency pattern known to be crucial for phonetic identification (or, more importantly, phonetic discrimination). Such patterns include, for example, formant locations, trajectories, and bandwidths; burst locations; and frication cutoff frequencies [12]. An example set of possible filters is shown in Figure 2. Figure 3 shows an example of using one such filter, tuned to a distinct formant transition: convolving the filter with a wideband spectrogram highlights closely matching T-F regions.
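As a rough illustration of this patch-convolution view, the sketch below cross-correlates a toy formant-transition detector with a log wideband spectrogram to highlight closely matching T-F regions, much as in Figure 3. The patch shape, spectrogram settings, and function names are our own illustrative assumptions, not the actual filters of Figure 2.

    # Illustrative sketch only: a toy "rising formant transition" patch is
    # cross-correlated with a log wideband spectrogram; large scores mark
    # T-F regions that resemble the patch (cf. Figure 3).
    import numpy as np
    from scipy.signal import spectrogram, correlate2d

    def formant_transition_patch(n_freq=12, n_time=20):
        """Toy detector for a rising diagonal ridge (e.g., an F2 transition):
        large positive weights on the ridge, small negative weights elsewhere."""
        patch = -np.ones((n_freq, n_time))
        for t in range(n_time):
            f = int(round(t * (n_freq - 1) / (n_time - 1)))
            patch[f, t] = n_time
        return patch - patch.mean()          # zero mean, like a band-pass filter

    def match_patch(signal, fs, patch):
        """Short analysis window -> wideband spectrogram; then 2-D correlation."""
        _, _, S = spectrogram(signal, fs, nperseg=64, noverlap=48)
        log_S = np.log(S + 1e-10)
        return correlate2d(log_S, patch, mode="same")

    if __name__ == "__main__":
        fs = 8000
        x = np.random.randn(fs)              # stand-in for one second of speech
        scores = match_patch(x, fs, formant_transition_patch())
        print(scores.shape, scores.max())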
Clearly, the proposed filter set of Figure 2 is quite different in nature from the standard representations of Figure 1. Several possible benefits over standard methods include:
There has been other work attempting to move away from frame-based features toward more general T-F regions. One example which has shown benefits is the use of so-called TRAP features [7], which are computed across a single frequency band over wide time ranges (up to 500 ms or more). This amounts to using long horizontal patches as opposed to the tall vertical patches of MFCCs; a rough sketch of the idea is given below. The TRAP parameters are different for each frequency band and are learned from training data. The considerable work in sub-band ASR [2,9] can also be considered a special case in which patches like those in Figure 1 are used, but modified to cover only a portion of the total frequency range.
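The following is a minimal sketch of that horizontal-patch idea, assuming a 10 ms frame rate and a roughly 500 ms context; the band index, normalization, and function names are illustrative assumptions rather than the band-specific parameters learned in [7].

    # Sketch of a TRAP-style feature: the log-energy trajectory of a single
    # frequency band over a long time span (about 500 ms at a 10 ms frame rate),
    # mean/variance normalized. Parameters here are illustrative assumptions.
    import numpy as np

    def trap_vector(log_spec, band, center_frame, context=25):
        """Return a (2*context + 1)-frame trajectory of one band, centered on
        `center_frame` (context=25 spans about 510 ms at 10 ms per frame)."""
        lo, hi = center_frame - context, center_frame + context + 1
        assert 0 <= lo and hi <= log_spec.shape[1], "need full temporal context"
        traj = log_spec[band, lo:hi].astype(float)
        return (traj - traj.mean()) / (traj.std() + 1e-8)

    if __name__ == "__main__":
        log_spec = np.random.randn(23, 300)   # 23 bands x 300 frames (toy data)
        v = trap_vector(log_spec, band=7, center_frame=150)
        print(v.shape)                        # (51,) -- one long horizontal patch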
Recent work in auditory neuroscience suggests that a major function of the primary auditory cortex may be computing "features" very similar to those obtained by convolving "patches" over the T-F image generated in the cochlea [3,4]. A computational model of the auditory cortex has been proposed [4] which is essentially a 2-D wavelet transform over an auditory spectrogram. Individual neurons are tuned to detect very specific patterns (referred to as spectro-temporal receptive fields, or STRFs), each tuned to particular modulation rates in both time and frequency. The work of [8] was an initial attempt to incorporate these ideas into ASR features, using STRFs modeled as 2-D Gabor filters. We believe that the T-F patches we plan to explore have interesting parallels with these biologically motivated features.
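As a sketch of what such an STRF-like patch might look like, the code below builds a 2-D Gabor filter: a sinusoid at chosen spectral and temporal modulation rates under a Gaussian envelope, which could then be convolved with a (log) spectrogram as above. The particular rates, sizes, and sampling steps are illustrative assumptions, not those of [4] or [8].

    # Sketch of a 2-D Gabor patch in the spirit of STRFs [4,8]: a Gaussian
    # envelope times a sinusoid with chosen spectral and temporal modulation
    # rates. All numeric settings here are illustrative assumptions.
    import numpy as np

    def gabor_strf(n_freq=16, n_time=24, spec_mod=0.15, temp_mod=0.10, phase=0.0):
        """Return an (n_freq x n_time) patch.
        spec_mod: cycles per frequency channel; temp_mod: cycles per frame."""
        f = np.arange(n_freq) - (n_freq - 1) / 2.0
        t = np.arange(n_time) - (n_time - 1) / 2.0
        T, F = np.meshgrid(t, f)                       # shapes (n_freq, n_time)
        envelope = np.exp(-(F / (n_freq / 4.0)) ** 2 - (T / (n_time / 4.0)) ** 2)
        carrier = np.cos(2 * np.pi * (spec_mod * F + temp_mod * T) + phase)
        patch = envelope * carrier
        return patch - patch.mean()                    # zero mean

    if __name__ == "__main__":
        g = gabor_strf()
        print(g.shape)   # (16, 24); convolve with a log spectrogram as above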
The ideas mentioned above can lead to a variety of related experiments. The areas we plan to focus on can roughly be divided into features and classifiers.
Some preliminary results suggest that phonetically-inspired features can achieve results competitive with baseline MFCC features on phonetic classification tasks. However, we feel that working on full recognition tasks will be a better way to continue this line of research. A baseline HMM recognizer on the Aurora corpus of noisy digits has been set up, which offers a good task for evaluating ASR front-ends. Our current goal is to improve ASR performance on the Aurora dataset by incorporating novel features and discriminative classifiers into an existing ASR system.
[1] H. Bourlard, H. Hermansky, and N. Morgan. Towards increasing speech recognition error rates. Speech Communication, 18:205-231, 1996.
[2] H. Bourlard and S. Dupont. A new ASR approach based on independent processing and recombination of partial frequency bands. In International Conference on Spoken Language Processing, pages 426-429, 1996.
[3] B. Calhoun and C. Schreiner. Spectral envelope coding in cat primary auditory cortex. J. Aud. Neuroscience, 1:39-61, 1995.
[4] T. Chi, P. Ru, and S. Shamma. Multiresolution spectrotemporal analysis of complex sounds. Journal of the Acoustical Society of America, 118:887-906, 2005.
[5] S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28:357-366, Aug 1980.
[6] H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87(4):1738-1752, April 1990.
[7] H. Hermansky and S. Sharma. Temporal patterns (TRAPS) in ASR of noisy speech. In International Conference on Acoustics, Speech, and Signal Processing, 1997.
[8] M. Kleinschmidt. Localized spectro-temporal features for automatic speech recognition. In Eurospeech, 2003.
[9] J. McAuley, J. Ming, D. Stewart, and P. Hanna. Subband correlation and robust speech recognition. IEEE Transactions on Speech and Audio Processing, 13:956-964, 2005.
[10] R. M. Rifkin. Everything Old Is New Again: A Fresh Look at Historical Approaches to Machine Learning. PhD thesis, Massachusetts Institute of Technology, 2002.
[11] R. Rifkin, K. Schutte, M. Saad, J. Bouvrie, and J. Glass. Noise robust phonetic classification with linear regularized least squares and second-order features. To appear in International Conference on Acoustics, Speech, and Signal Processing, 2006.
[12] K. N. Stevens. Acoustic Phonetics. The MIT Press, Cambridge, MA, 1998.
[13] S. Young. Large vocabulary continuous speech recognition. IEEE Signal Processing Magazine, 13:45-57, 1996.
[14] V. W. Zue. The use of speech knowledge in automatic speech recognition. Proceedings of the IEEE, 73:1602-1615, Nov 1985.
[15] R. P. Lippmann. Speech recognition by machines and humans. Speech Communication, 22:1-15, 1997.