The statistical theory of speech recognition introduced several decades ago has brought about low word error rates
for clean speech. However, it has been less successful in noisy conditions. Since extraneous acoustic sources are present
in virtually all everyday speech communication conditions, the failure of the speech recognition model to take noise into
account is perhaps the most serious obstacle to the application of ASR technology.
Approaches to noise-robust speech recognition have traditionally taken one of two forms. One set of techniques
attempts to estimate the noise and remove its effects from the target speech. While noise estimation can work in
low-to-moderate levels of slowly varying noise, it fails completely in louder or more variable conditions. A second
approach utilises noise models and attempts to decode speech taking into account their presence. Again, model-based
techniques can work for simple noises, but they are computationally complex under realistic conditions and require
models for all sources present in the signal.
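
As an illustration of the first family, the following minimal sketch applies magnitude spectral subtraction, a widely used noise-estimation technique that is chosen here only as an example and is not singled out by this paper. It assumes that the first `n_noise_frames` frames contain noise only and that the noise is roughly stationary, which is the regime in which such methods tend to work. The function name, parameters and default values are hypothetical.

```python
import numpy as np

def spectral_subtraction(mag, n_noise_frames=10, floor=0.01):
    """Subtract a stationary noise estimate from a magnitude spectrogram.

    mag: array of shape (frames, bins) holding non-negative magnitudes.
    n_noise_frames: leading frames assumed to be noise only (an assumption).
    floor: fraction of the noisy magnitude kept as a lower bound.
    """
    mag = np.asarray(mag, dtype=float)
    # Per-bin noise estimate: average over the assumed noise-only frames.
    noise = mag[:n_noise_frames].mean(axis=0)
    # Remove the estimate from every frame (broadcast over the frame axis).
    cleaned = mag - noise
    # Clamp to a small fraction of the input so no magnitude goes negative.
    return np.maximum(cleaned, floor * mag)
```

If the noise changes after the leading frames, the estimate no longer matches the background and the subtraction removes too little or too much, which is consistent with the failure in variable conditions described above.
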
In this paper, we propose a statistical theory of speech recognition in the presence of other acoustic sources. Unlike
earlier model-based approaches, our framework makes no assumptions about the noise background, although it can
exploit such information if it is available. It does not require models for background sources, or an estimate of their
number. The new approach extends statistical ASR by introducing a segregation model in addition to the conventional
acoustic and language models. While the conventional statistical ASR problem is to find the most likely sequence of
speech models which generated a given observation sequence, the new approach additionally determines the most likely
set of signal fragments which make up the speech signal. Although the framework is completely general, we provide one
interpretation of the segregation model based on missing-data theory. We derive an efficient HMM decoder, which
searches across both subword states and alternative segregations of the signal into target and interference.
We call this modified system the speech fragment decoder.
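
The joint search over models and segregations can be sketched on a toy problem. The code below is not the efficient HMM decoder derived in the paper: it scores static models rather than state sequences, and it enumerates every speech/background labelling of the fragments, which is exponential in their number. Cells labelled as speech are matched directly against a diagonal Gaussian; cells labelled as background are treated as masked, so the speech energy is only known to lie below the observed value. The flat background density `1 / Y_MAX`, the model parameters and all names are assumptions made for this illustration.

```python
import itertools
import math

Y_MAX = 20.0  # assumed upper limit of the feature range (hypothetical)

def log_gauss(y, mu, sigma):
    """Log density of a 1-D Gaussian at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def log_cdf(y, mu, sigma):
    """Log probability that a Gaussian variable lies below y."""
    p = 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))
    return math.log(max(p, 1e-300))  # guard against log(0)

def score(y, model, fragments, labels):
    """Missing-data log score of one model under one fragment labelling."""
    mu, sigma = model
    total = 0.0
    for cells, is_speech in zip(fragments, labels):
        for i in cells:
            if is_speech:
                # Cell dominated by speech: match the observation directly.
                total += log_gauss(y[i], mu[i], sigma[i])
            else:
                # Cell masked by background: speech energy is at most y[i];
                # the background itself gets a flat density (assumption).
                total += log_cdf(y[i], mu[i], sigma[i]) - math.log(Y_MAX)
    return total

def decode(y, models, fragments):
    """Exhaustively search all models and speech/background labellings."""
    best = None
    for name, model in models.items():
        for labels in itertools.product([True, False], repeat=len(fragments)):
            s = score(y, model, fragments, labels)
            if best is None or s > best[0]:
                best = (s, name, labels)
    return best

# Toy data: four spectral cells grouped into two fragments.
fragments = [[0, 1], [2, 3]]
models = {
    "one": ([10.0, 10.0, 2.0, 2.0], [1.0] * 4),   # (means, std devs)
    "two": ([2.0, 2.0, 10.0, 10.0], [1.0] * 4),
}
y = [10.0, 10.0, 12.0, 12.0]  # cells 2 and 3 are swamped by loud noise
best_score, best_model, best_labels = decode(y, models, fragments)
```

With these made-up numbers, scoring every cell as speech favours model "two", because the noisy cells resemble its high-energy region. The joint search instead selects model "one" with the first fragment labelled as speech and the second as background.
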
The value of the speech fragment decoder approach has been verified through experiments on small-vocabulary tasks
in high-noise conditions. For instance, in a noise-corrupted connected digit task, the new approach decreases the word
error rate in factory noise at 5 dB SNR from over 59% for a standard ASR system to less than 22%.
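
As a rough check on the size of this improvement, the quoted bounds imply a lower limit on the relative reduction in word error rate. The symbols $E_{\mathrm{base}}$ and $E_{\mathrm{new}}$, for the error rates of the standard system and the new approach, are introduced here for illustration only. Since $E_{\mathrm{base}} > 59$ and $E_{\mathrm{new}} < 22$:

```latex
\[
\frac{E_{\mathrm{base}} - E_{\mathrm{new}}}{E_{\mathrm{base}}}
  = 1 - \frac{E_{\mathrm{new}}}{E_{\mathrm{base}}}
  > 1 - \frac{22}{59} \approx 0.63
\]
```

That is, the word error rate in this condition falls by more than 62% in relative terms.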