CSAIL Research Abstracts - 2005

Inter-voice Audio Morphing

Jake Bouvrie, Tony Ezzat, and Tomaso Poggio

The Problem

This project seeks to develop a framework, which we call “inter-voice morphing,” for morphing between speech samples that are identical in content but produced by different speakers. Given two spoken phrases, we wish to morph smoothly between the acoustic characteristics that define the two speakers, producing intermediate sequences that lie along a perceptual continuum between the two individuals.

Motivation

The theory and technology behind morphing between audio samples of different speakers have yet to solidify; however, there are several applications for which such technology might prove useful. Speech synthesis and recognition applications that require large template databases (e.g., concatenative speech synthesis), or alternatively the ability to interpolate smoothly within a small library of sequences, might find audio morphing particularly useful. In addition, behavioral experiments in psychology, psychophysics, and linguistics often make use of large corpora of spoken samples possessing a range of desired spectral or perceptual properties. Such corpora are difficult and costly to produce; the automatic generation of a speech database with user-tunable parameters from a small set of exemplars could prove an acceptable alternative to genuine human speech, or even open up new research opportunities in human perception.

Previous Work

Ezzat et al. [6] and Pfitzinger [4,5] have independently laid a foundation for speaker morphing, albeit using slightly different approaches. Ezzat borrows several key ideas originally introduced in the context of morphable models [1,2,3] to interpolate, via 1-D optical flow, between the corresponding smooth spectra resulting from a cepstral decomposition of a pair of audio sequences. Pfitzinger utilizes a source-filter decomposition based on Linear Predictive Coding (LPC), but performs a similar dynamic frequency warping (DFW) step to interpolate the LPC and residual spectra.

Approach

While we adopt a source-filter model of speech, the merits of an LPC-based decomposition over a cepstral source-filter decomposition (or vice versa) are unclear. Each method attempts a clean separation of the vocal-tract filter that shapes the excitation signal from the excitation itself; one goal of this research is therefore to develop a procedure for decomposing commonly encountered speech into components that lend themselves well to interpolation. For the cepstral method, the deconvolution process is an additional concern that warrants attention. Most applications have traditionally derived the excitation signal from the complex cepstrum by performing division in the Fourier domain; however, alternative approaches such as non-negative least-squares deconvolution may produce residuals less prone to the oscillations and large cancellations that make interpolation difficult or error-prone. While we have thus far focused mainly on the above separation methods, the decomposition need not adhere to the source-filter paradigm, and other basis changes may prove more effective.
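
As a concrete illustration of the cepstral branch of this idea, the sketch below separates a single windowed frame into a smooth spectral envelope and an excitation spectrum via low-time liftering, using the conventional Fourier-domain division for deconvolution. The lifter cutoff and the function structure are illustrative choices of ours, not values fixed by this work.

```python
# A minimal sketch of a cepstral source-filter decomposition for one
# analysis frame. The lifter cutoff below is an illustrative choice.
import numpy as np

def cepstral_decompose(frame, lifter_cutoff=30):
    """Split a windowed speech frame into a smooth spectral envelope
    (vocal-tract filter) and an excitation spectrum."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)

    # Real cepstrum: inverse transform of the log-magnitude spectrum.
    cepstrum = np.fft.irfft(log_mag)

    # Low-time liftering keeps only the slowly varying envelope part;
    # the mirrored slice preserves the cepstrum's symmetry.
    lifter = np.zeros_like(cepstrum)
    lifter[:lifter_cutoff] = 1.0
    lifter[-lifter_cutoff + 1:] = 1.0
    envelope = np.exp(np.fft.rfft(cepstrum * lifter).real)

    # Division in the Fourier domain recovers the excitation spectrum;
    # as noted above, this step is error-prone, and alternatives such
    # as non-negative least-squares deconvolution may behave better.
    excitation = spectrum / (envelope + 1e-12)
    return envelope, excitation
```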

Given a suitable decomposition of speech into smooth spectra and excitation signals, we wish to interpolate smoothly between the respective smooth-spectrum and excitation components so as to construct intermediate sequences by traveling along an interpolation trajectory. The approach we plan to take combines optical flow and dynamic programming: one-dimensional correspondences between pairs of signals are computed (via DP) to produce an optimal path through the space of signals defined by the source and destination sequences.
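
The following minimal sketch illustrates this dynamic-programming correspondence step on a pair of 1-D magnitude spectra. The squared-difference local cost and the three permitted path moves are our assumptions for illustration, not details fixed by this work.

```python
# Dynamic-programming alignment of two 1-D spectra: fill a cumulative
# cost table, then backtrack to recover the optimal warping path.
import numpy as np

def dp_correspondence(a, b):
    """Return index pairs (i, j) aligning spectrum a to spectrum b
    along a monotonic warping path of minimal cumulative cost."""
    n, m = len(a), len(b)
    cost = np.full((n, m), np.inf)
    cost[0, 0] = (a[0] - b[0]) ** 2
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                cost[i - 1, j] if i > 0 else np.inf,                # vertical
                cost[i, j - 1] if j > 0 else np.inf,                # horizontal
                cost[i - 1, j - 1] if i > 0 and j > 0 else np.inf,  # diagonal
            )
            cost[i, j] = (a[i] - b[j]) ** 2 + best_prev

    # Backtrack from the terminal cell to recover the optimal path.
    path, i, j = [(n - 1, m - 1)], n - 1, m - 1
    while (i, j) != (0, 0):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        candidates = [(p, q) for p, q in candidates if p >= 0 and q >= 0]
        i, j = min(candidates, key=lambda c: cost[c])
        path.append((i, j))
    return path[::-1]
```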

Once the correspondences have been computed, we can morph between audio samples by setting a mixing parameter that defines the distance to travel along the lines connecting each pair of corresponding points in the source signals, and then reconstructing the audio given the interpolated spectra and the original phase information.
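
A hedged sketch of this morphing step follows: the mixing parameter lam positions the intermediate spectrum along the line joining each corresponding pair of points, interpolating both frequency position and amplitude, and the frame is reconstructed by an inverse FFT. Reusing the source signal's phase is one simple reading of "original phase information," assumed here purely for illustration.

```python
# Interpolate corresponding spectral points under a mixing parameter
# lam in [0, 1], then reconstruct a time-domain frame. Assumes spec_a,
# spec_b, and phase_a are half-spectra of equal length.
import numpy as np

def morph_frame(spec_a, spec_b, path, phase_a, lam=0.5):
    n = len(spec_a)
    morphed = np.zeros(n)
    counts = np.zeros(n)
    for i, j in path:
        # Move each corresponding pair partway along its connecting
        # line: interpolate the bin index and the magnitude together.
        k = int(round((1 - lam) * i + lam * j))
        morphed[k] += (1 - lam) * spec_a[i] + lam * spec_b[j]
        counts[k] += 1
    morphed /= np.maximum(counts, 1)  # average bins hit more than once

    # Reconstruct with the (source) phase and an inverse FFT.
    return np.fft.irfft(morphed * np.exp(1j * phase_a))
```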

Progress

Using a corpus of recorded phrases, we have begun to experiment with the aforementioned algorithms for source-filter decomposition and subsequent interpolation. A comparison between the LPC-based method and the cepstral approach is underway; it remains to be seen which of the two is the more suitable technique for producing signals that can be morphed with high perceptual realism.

Future Work

After choosing a technique for decomposing a source signal into morphable components, we intend to focus specifically on developing an algorithm for morphing the excitation (residual) component. While Ezzat et al. have shown that 1-D optical flow performs well for the smoothed-spectrum component describing the vocal-tract filter, excitation signals are inherently erratic, and we expect that morphing this component of the speech decomposition will be significantly more challenging than interpolating the smoothed spectrum. Preliminary experimentation has also revealed that, as one might expect, we cannot simply cross-fade the residual component, nor reconstruct using the excitation signal from only one of the two source speech signals. Accurate morphing of the pitch residuals is thus an important aspect of this project for future consideration.

Research Support

This research was sponsored by grants from: Office of Naval Research (DARPA) Contract No. MDA972-04-1-0037, Office of Naval Research (DARPA) Contract No. N00014-02-1-0915, National Science Foundation (ITR/SYS) Contract No. IIS-0112991, National Science Foundation (ITR) Contract No. IIS-0209289, National Science Foundation-NIH (CRCNS) Contract No. EIA-0218693, National Science Foundation-NIH (CRCNS) Contract No. EIA-0218506, and National Institutes of Health (Conte) Contract No. 1 P20 MH66239-01A1. Additional support provided by: Central Research Institute of Electric Power Industry (CRIEPI), Daimler-Chrysler AG, Compaq/Digital Equipment Corporation, Eastman Kodak Company, Honda R&D Co., Ltd., Industrial Technology Research Institute (ITRI), Komatsu Ltd., Eugene McDermott Foundation, Merrill-Lynch, NEC Fund, Oxygen, Siemens Corporate Research, Inc., Sony, Sumitomo Metal Industries, and Toyota Motor Corporation.

References

[1] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Alyn Rockwood, editor, Proceedings of SIGGRAPH 99, Computer Graphics Proceedings, Annual Conference Series, pages 187–194, Los Angeles, 1999.

[2] T. Ezzat, G. Geiger, and T. Poggio. Trainable videorealistic speech animation. In Proceedings of SIGGRAPH 2002, volume 21, pages 388–398, San Antonio, Texas, 2002.

[3] M. Jones and T. Poggio. Multidimensional morphable models: A framework for representing and matching object classes. In Proceedings of the Sixth International Conference on Computer Vision, pages 683–688, Bombay, India, January 1998.

[4] H. R. Pfitzinger. DFW-based spectral smoothing for concatenative speech synthesis. In Proceedings of ICSLP 2004, volume 2, pages 1397–1400, Korea, October 2004.

[5] H. R. Pfitzinger. Unsupervised speech morphing between utterances of any speakers. In Proceedings of the 10th Australian International Conference on Speech Science and Technology (SST 2004), pages 545–550, Sydney, December 2004.

[6] T. Ezzat, E. Meyers, J. Glass, and T. Poggio. Morphing spectral envelopes using audio flow. (Submitted.)
