Crowdsourcing in Speech Perception

Martin Cooke, Jon Barker, Mª Luisa García Lecumberri
Language and Speech Laboratory

In: Crowdsourcing in Language and Speech, ed. Maxine Eskenazi (John Wiley)


Our understanding of human speech perception is still at a primitive stage, and the best theoretical or computational models lack the kind of detail required to predict listeners' responses to spoken stimuli. It is natural, therefore, for researchers to seek novel methods of gaining insight into one of the most complex aspects of human behaviour. Web-based experiments in principle offer a means to ask new types of question, permitted by the detailed response distributions gleaned from large listener samples. For example, instead of instructing listeners to classify speech stimuli into one of a small number of categories chosen by the experimenter, large-sample experiments allow the luxury of meaningful analysis of what is effectively an open set of responses. This freedom from experimenter bias is more likely to lead to unexpected outcomes than a traditional formal test, which, of necessity, usually involves far fewer participants. Web-based experimentation involving auditory judgements and speech stimuli is in its infancy, but early efforts over the last decade have produced some useful data, and some of these early crowdsourcing experiences are related in this chapter. However, the promise of web-based speech perception experiments must be tempered by the realisation that the combination of audio, linguistic judgement and the web is not a natural one. Notwithstanding browser and other portability issues covered elsewhere in this volume, it is relatively straightforward to guarantee a consistent presentation of textual elements to web-based participants, but the same cannot currently be said of audio, and of speech stimuli in particular. Similarly, while it may be possible to use pre-tests to assess the linguistic ability of a web user whose native language differs from the subject material of a text-based web experiment, it is far more difficult to do so for auditory stimuli.
Here, performance alone is not a reliable indicator of nativeness, since it can be confounded with hearing impairment or equipment problems. We examine these issues in depth. Nevertheless, we will argue that, with careful design and post-processing, useful speech perception data can be collected from web respondents. Technological advances are making it easier to ensure that stimuli reach a listener's ears in a pristine state and that the listener's audio pathway is known. New methodological techniques permit objective confirmation of respondent-provided data. Ingenious task selection can lead to the collection of useful data even if absolute levels of performance fall short of those obtainable in the laboratory. In the latter part of this chapter we present a comprehensive case study illustrating an approach that seems particularly well suited to web-based experimentation in its current evolutionary state, viz. the crowd-as-filter model. This technique uses crowdsourcing solely as a screening process prior to the selection of exemplars, which are then pursued further in formal tests. As we will see, tokens which have the potential to say something interesting about speech perception are rare, and the great benefit of crowdsourcing is to increase the rate at which such tokens are discovered.