CSAIL Research Abstracts - 2005

Robust Limited-Enrollment Speaker Identification for Handheld Devices

Ram Woo & Timothy J. Hazen


Handheld devices pose two distinct challenges for speaker identification techniques. First, because the devices are mobile, the environmental conditions they experience are highly variable. Thus, when performing speaker identification, the audio captured by these devices may contain highly variable background noise, yielding potentially low signal-to-noise ratios. Second, the quality of the input hardware, such as the microphone, is also a factor. Typical consumer products are constrained to components that are both small and inexpensive, resulting in lower quality data capture than is typical in laboratory experiments. In this work, we focus on the robustness of speaker identification for applications on handheld devices. Specifically, we evaluated our speaker identification technology with respect to its robustness to (1) background noise, (2) microphone transducer variability, and (3) limited speaker enrollment data.

Within the last year we have achieved the following goals:

  • Collected a corpus for studying speaker identification robustness issues.
  • Examined the effectiveness of a variety of different feature extraction methods for speaker identification.
  • Evaluated the robustness of our speaker identification system under the conditions of variable background noise, variable microphones, and limited speaker enrollment data.

We summarize these achievements below.

Data Collection

In cooperation with Intel, we have collected a speech corpus designed specifically to examine the robustness issues discussed above. Data collection was performed at both MIT and Intel using Intel's Morro Bay prototype handheld device. In total, data was collected from 125 people (96 at MIT and 29 at Intel). We have used this corpus as the basis for our experiments. Each person was recorded in three different environments (a quiet office, a moderately noisy lobby, and a noisy street corner) using three different microphones (the Morro Bay's internal far-field microphone and two different external earphone microphones). Of these 125 people, 68 provided multiple sessions of data collection enabling them to be used as "enrolled" users in our experiments. The other 57 people contributed a single session of data and were used as imposters during our evaluations.

Experimental Results

Figure 1 shows detection error tradeoff (DET) curves from our initial experiments, which examine the effect of testing the system in conditions that are mismatched to the enrollment condition. In these experiments, the users enrolled by speaking a short phrase four times during one enrollment session, within a single environment and using a single microphone. During testing, known users were evaluated against unseen imposters speaking the same phrase. Even when the test environment and microphone were matched to the enrollment condition, the equal error rate (the point on the curve where the false acceptance rate of imposters equals the false rejection rate of known users) was over 10%. This is significantly worse than in our previous experiments, where enrolled users spoke 64 enrollment phrases spread over 4 enrollment sessions and our system achieved an equal error rate of 1.6% [1,2]. The significant drop in performance demonstrates the difficulty of performing accurate speaker identification from limited enrollment data collected in a single session.

Figure 1: DET curves showing false alarm probability of accepting an imposter versus the false rejection (or miss) probability for actual users for three experiments: (1) microphone and environment are the same during both training and testing, (2) the microphones are the same but the testing environment is mismatched, (3) the environments are matched but the microphones are mismatched.
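The equal error rate quoted in these experiments can be read off a set of trial scores. The following is a minimal Python sketch, not the scoring code used in this work. It assumes each trial produces a single score where a higher value means the system is more confident that the claimed user is speaking; the function names and the sample scores are hypothetical.

```python
def det_points(genuine_scores, imposter_scores):
    """Return (threshold, false_accept_rate, false_reject_rate) tuples.

    A trial is accepted when its score is >= the threshold.
    Every observed score is tried as a candidate threshold.
    """
    thresholds = sorted(set(genuine_scores) | set(imposter_scores))
    points = []
    for t in thresholds:
        # Imposters that are wrongly accepted at this threshold.
        far = sum(s >= t for s in imposter_scores) / len(imposter_scores)
        # Known users that are wrongly rejected at this threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        points.append((t, far, frr))
    return points


def equal_error_rate(genuine_scores, imposter_scores):
    """Approximate the EER as the DET point where the two rates are closest."""
    _, far, frr = min(
        det_points(genuine_scores, imposter_scores),
        key=lambda p: abs(p[1] - p[2]),
    )
    return (far + frr) / 2


if __name__ == "__main__":
    # Hypothetical scores for known users and imposters.
    genuine = [0.9, 0.8, 0.7, 0.4]
    imposter = [0.6, 0.3, 0.2, 0.1]
    print(equal_error_rate(genuine, imposter))  # prints 0.25
```

Plotting the false accept rate against the false reject rate over all thresholds gives one DET curve; with few trials the two rates may never be exactly equal, which is why the sketch averages them at the closest point.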

Next we examined the effectiveness of multi-style training. In these experiments, the enrolled users recorded a single phrase in each of six different conditions (i.e., one utterance in each of the three different environments with each of the two microphone types, earphone and internal). Figure 2 shows the performance of multi-style training when the environment condition is varied. There are two trends to note. First, all of the curves are significantly better than the mismatched curves in Figure 1. This indicates that even a small amount of enrollment data from a particular environment can significantly improve the system's performance. Second, the figure shows the relative degradation in performance as the background noise conditions get worse. The equal error rate doubles from just under 10% to just under 20% as the user moves from the quiet office condition to the noisy street corner.

Figure 2: DET curves when using multi-style training and testing under three different environment conditions.
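To make the six multi-style enrollment conditions concrete, the sketch below enumerates them and compares how many test conditions are seen during enrollment under single-condition versus multi-style enrollment. It is an illustration only: the condition labels, the function names, and the reading of "six conditions" as three environments times two microphone types are assumptions, not details taken from the experimental setup.

```python
from itertools import product

# Hypothetical labels for the recording conditions described in the text.
ENVIRONMENTS = ["office", "lobby", "street"]
MICROPHONES = ["internal", "earphone"]  # assumed: two microphone types


def multistyle_conditions():
    """All (environment, microphone) pairs: 3 x 2 = 6 conditions."""
    return list(product(ENVIRONMENTS, MICROPHONES))


def coverage(enrolled_conditions):
    """Fraction of possible test conditions that were seen at enrollment."""
    tests = multistyle_conditions()
    seen = sum(cond in enrolled_conditions for cond in tests)
    return seen / len(tests)


if __name__ == "__main__":
    single = [("office", "internal")]   # one enrollment condition
    multi = multistyle_conditions()     # one utterance per condition
    print(len(multi))                   # prints 6
    print(coverage(single) < coverage(multi))  # prints True
```

Under single-condition enrollment most test conditions are mismatched, whereas multi-style enrollment contributes at least one utterance from every test condition, which is the situation Figure 2 evaluates.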

In our research we are exploring speaker identification under the scenario where an enrolled user can select their own random login passphrase. Under this scenario, the user gains added security if the passphrase is kept secret and is unknown to any potential imposters (even when the system is only utilizing the user's voice characteristics and is not explicitly trying to verify that the passphrase was spoken). Figure 3 shows the speaker identification performance of the multi-style training system when the imposters either have or do not have knowledge of the user's passphrase. As can be seen, the equal error rate is cut in half when the user's passphrase is unknown to potential imposters.

Figure 3: DET curves for speaker identification using multi-style training under the condition that the imposters either have or do not have knowledge of the user's passphrase.

Future Work

In the coming year of this project we will continue to work on robust speaker identification. Based on the results obtained in our initial studies, we will pursue improvements to our most promising methods. The incorporation of explicit noise compensation techniques into the speaker ID system is one new avenue we intend to explore. This research will likely include investigations into noise compensation techniques such as parallel model combination [3] or universal compensation [4]. We also intend to explore methods to synthesize multi-style models from single condition data.


Support for this research has been provided by Intel.


[1] T. J. Hazen, E. Weinstein, and A. Park. Towards robust person recognition on handheld devices using face and speaker identification technologies. In Proceedings of the International Conference on Multimodal Interfaces, Vancouver, November 2003.

[2] T. J. Hazen, E. Weinstein, R. Kabir, A. Park, and B. Heisele. Multi-modal face and speaker identification on a handheld device. In Proceedings of the Workshop on Multimodal User Authentication, Santa Barbara, December 2003.

[3] M. Gales and S. Young. Robust continuous speech recognition using parallel model combination. IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, pp. 352-359, Sept. 1996.

[4] J. Ming, D. Stewart, and S. Vaseghi. Speaker identification in unknown noisy conditions - a universal compensation approach. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Philadelphia, March 2005.


Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu
(Note: On July 1, 2003, the AI Lab and LCS merged to form CSAIL.)