Language and Speech Laboratory

Decoding speech in the presence of other sources

Jon Barker, Martin Cooke, Dan Ellis.

Speech Communication, volume 45, pages 5-25.

The statistical theory of speech recognition introduced several decades ago has brought about low word error rates for clean speech. However, it has been less successful in noisy conditions. Since extraneous acoustic sources are present in virtually all everyday speech communication conditions, the failure of the speech recognition model to take noise into account is perhaps the most serious obstacle to the application of ASR technology. Approaches to noise-robust speech recognition have traditionally taken one of two forms. One set of techniques attempts to estimate the noise and remove its effects from the target speech. While noise estimation can work in low-to-moderate levels of slowly varying noise, it fails completely in louder or more variable conditions. A second approach utilises noise models and attempts to decode speech taking into account their presence. Again, model-based techniques can work for simple noises, but they are computationally complex under realistic conditions and require models for all sources present in the signal. In this paper, we propose a statistical theory of speech recognition in the presence of other acoustic sources. Unlike earlier model-based approaches, our framework makes no assumptions about the noise background, although it can exploit such information if it is available. It does not require models for background sources, or an estimate of their number. The new approach extends statistical ASR by introducing a segregation model in addition to the conventional acoustic and language models. While the conventional statistical ASR problem is to find the most likely sequence of speech models which generated a given observation sequence, the new approach additionally determines the most likely set of signal fragments which make up the speech signal. Although the framework is completely general, we provide one interpretation of the segregation model based on missing-data theory. 
We derive an efficient HMM decoder, which searches across both subword states and alternative segregations of the signal between target and interference. We call this modified system the speech fragment decoder. The value of the speech fragment decoder has been verified through experiments on small-vocabulary tasks in high-noise conditions. For instance, in a noise-corrupted connected digit task with factory noise at 5 dB SNR, the new approach reduces the word error rate from over 59% for a standard ASR system to less than 22%.
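
As a rough illustration of the joint search over speech models and segregations described above, the following Python sketch brute-forces a toy single-frame problem. It is not the efficient HMM decoder derived in the paper. It assumes diagonal-Gaussian word models, a missing-data score in which speech-labelled channels use the Gaussian density and background-labelled channels use the probability that the speech energy lies below the observed value, and a uniform prior over segregations. All names and numbers (`decode`, `fragment_of`, the two word models, the feature values) are hypothetical.

```python
import itertools
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian, used for channels labelled as speech."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def log_gauss_cdf(x, mean, var):
    """Log P(speech <= x), used for channels labelled as background.

    If another source dominates a channel, the speech energy there is
    assumed to be no greater than the observed value.
    """
    z = (x - mean) / math.sqrt(2 * var)
    # Clamp to avoid log(0) when the tail probability underflows.
    return math.log(max(0.5 * math.erfc(-z), 1e-300))


def log_likelihood(obs, fragment_of, speech_fragments, means, variances):
    """Score one model under one speech/background labelling of fragments."""
    total = 0.0
    for x, frag, m, v in zip(obs, fragment_of, means, variances):
        if frag in speech_fragments:
            total += log_gauss(x, m, v)
        else:
            total += log_gauss_cdf(x, m, v)
    return total


def decode(obs, fragment_of, models):
    """Exhaustively search every (model, segregation) pair.

    Returns (best_score, best_model_name, set_of_speech_fragments).
    The cost grows as 2**(number of fragments), so this is for
    illustration only.
    """
    fragments = sorted(set(fragment_of))
    best = None
    for r in range(len(fragments) + 1):
        for subset in itertools.combinations(fragments, r):
            for name, (means, variances) in models.items():
                score = log_likelihood(obs, fragment_of, subset, means, variances)
                if best is None or score > best[0]:
                    best = (score, name, set(subset))
    return best


# Hypothetical 4-channel models for two words, with small variances.
models = {
    "one": ([5.0, 5.0, 1.0, 1.0], [0.01] * 4),
    "two": ([1.0, 1.0, 5.0, 5.0], [0.01] * 4),
}

# The word "one" with an interfering source raising channel 2 to 3.0.
obs = [5.0, 5.0, 3.0, 1.0]
# Channels grouped into fragments: A = channels 0-1, B = channel 2, C = channel 3.
fragment_of = ["A", "A", "B", "C"]

best_score, best_model, best_speech = decode(obs, fragment_of, models)
print(best_model, sorted(best_speech))
```

In this toy setting the search selects the model "one" and attributes fragments A and C to speech while treating B as interference. A practical decoder would avoid the exhaustive enumeration and would replace the uniform prior with a segregation model.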