Perceptual Adjustments to Speaker Variation

Frank Eisner

Donders Institute for Brain, Cognition and Behaviour, Centre for Cognition, Radboud University

Abstract: Differences between speakers pose a challenge for listeners, as speaker variation is among the main causes of variability in the speech signal. Listeners’ ability to adapt to this variability is essential for successful comprehension. Recent research has explored how the perceptual system learns from variability by adjusting how acoustic cues are mapped onto perceptual categories. This learning can be guided by a number of different types of information, including the linguistic content of the speech or visual cues to articulation from the speaker’s face. Properties of the learning mechanism have been identified, such as the finding that perceptual adjustments can be specific to a particular speaker and are stored for later encounters with that speaker. Under certain conditions, learning can also generalise to other individuals or to a group of speakers. Evidence from behavioural and neuroimaging research implies a top-down process, by which learning can be driven by different types of higher-level information and results in a bias at an early acoustic-phonetic processing stage. This chapter discusses how learning helps listeners to deal with speaker variation, and considers the implications of this line of research for models of speech perception.

1.   Introduction

Spoken-language comprehension requires listeners to adjust to variability in the speech signal. This variation is caused by a range of factors, including differences between speakers (e.g., in the anatomy of their vocal tract or regional accent), variation within speakers (e.g., in register, speech rate, or physiological state), but also variable signal quality (e.g., because of ambient noise, or filtering through a phone connection). While the impact of this variability is often detrimental to the performance of automatic speech recognition systems (Benzeghiba et al., 2007), human listeners can normally adjust their perception quite easily. In this chapter I review an emerging body of research which aims to understand the cognitive mechanisms underlying this plasticity in the perceptual system. This work has revealed learning processes that can act fast and induce long-lasting changes in the mapping of acoustical cues onto linguistically meaningful units. Psychologists have referred to this kind of adjustment as perceptual learning in the sense of Gibson (1969), who defined it as “an increase in the ability to extract information from the environment, as a result of experience and practice with stimulation coming from it.” Listeners can thus be said to become better at understanding potentially difficult speech as a result of perceptual learning.

Since inter- and intra-talker variability is naturally present in speech, the ability of the perceptual system to adjust to it is essential for speech comprehension. In many traditional accounts of speech perception, variability was regarded as a nuisance, something to be discarded or ‘normalised’ in the process of translating the speech signal into more abstract linguistic representations (Pisoni, 1997). Recent evidence suggests, however, that not only are listeners able to adapt dynamically to sources of variability, but that in the process they encode detailed information about those sources. This knowledge can then be useful in similar listening situations in the future. For example, being familiar with a speaker’s voice makes it easier to understand that person in a noisy listening situation (Nygaard and Pisoni, 1998; Nygaard, Sommers, and Pisoni, 1994).

As perceptual learning can become effective quickly, it is amenable to being studied in a laboratory setting. Perceptual adjustments to various sources of variability have been observed after short exposure periods on the order of minutes or hours. The dependent measure in such experiments is typically a shift in perception (e.g., a shift in the location of a phoneme category boundary), or a global increase in intelligibility (e.g., being able to repeat more words correctly) following exposure (Samuel and Kraljic, 2009). Learning can thus be measured respectively at the sublexical, acoustic-phonetic level, or at the lexico-semantic level. Here I will discuss some recent studies that have used perceptual learning paradigms in order to understand basic properties of the adaptation process – when it occurs, what constrains it, how general or specific it is, and what kinds of information in the speech signal can drive it. Although these subtle changes in perception are still quite difficult to track with neuroimaging methods, there is recent evidence showing that this type of learning affects early processing stages in the auditory cortex, supporting the idea that relatively high-level sources of information can drive changes at a relatively low perceptual level. Understanding the mechanisms which enable this adaptability thus gives us a more complete picture of spoken-language processing. I will end by discussing some implications of this literature for computational and neurobiological models of speech perception.

2.   Adjusting perceptual categories

There is ample evidence that listeners can adapt to a range of different types of variability in the speech signal, such as synthetic (Fenn et al., 2003; Greenspan, Nusbaum, and Pisoni, 1988), time-compressed (Dupoux and Green, 1997) or noise-vocoded speech (Rosen et al., 1999), speech embedded in multi-speaker babble noise (Song et al., 2012), and foreign-accented speech (Clarke and Garrett, 2004; Weber et al., 2014). In foreign-accented speech, for example, significant processing gains begin to emerge after exposure to only a few accented sentences (Clarke and Garrett, 2004; Weber et al., 2014). These studies have typically used either an increase in intelligibility, as measured by having listeners repeat or transcribe what they heard, or an increase in processing speed, as measured by reaction times in a comprehension-based task, as the dependent variable.

A central question in the context of speaker-specific listening is whether this kind of learning, such as adapting to a foreign accent, can also generalise and aid in the comprehension of other speakers who speak with the same accent. This was investigated in a series of experiments on Chinese-accented English with American listeners by Bradlow and Bent (2008). In their study, listeners were trained to become better at understanding Chinese-accented speech coming either from only one speaker or from several different speakers. After training, generalisation of learning was tested with speech materials from an unfamiliar speaker. For listeners in both conditions, intelligibility of the accented speech increased during training. However, only after exposure to multiple speakers was there evidence of speaker-independent learning. Thus, the perceptual system seemed to treat the unfamiliar accent initially as a speaker idiosyncrasy, but was able to construct a more abstract representation of that accent after exposure to it from multiple speakers. This behaviour is adaptive in the sense that it would not be beneficial to apply learning about a speaker idiosyncrasy indiscriminately, since any given novel speaker is unlikely to have that same idiosyncrasy in their speech. It is beneficial, however, to have a more abstract representation of non-standard features that apply to a larger group, because the learned representation can be applied immediately rather than having to go through the learning process anew for every new speaker with that accent.

While this type of empirical research has revealed important properties of perceptual learning about speakers, measuring global comprehension by testing at the lexical level cannot identify what exactly it is in the speech signal that listeners are adapting to, or how they do it. However, a related series of studies has investigated how perceptual learning affects processing at a sublexical level, and the mechanisms that may be driving it. These experiments used an ambiguous speech stimulus, that is, a sound that falls on the category boundary between two phonemes, as a proxy for a speaker idiosyncrasy or a feature of an accent. Learning is measured by observing relatively subtle shifts in the categorisation of such ambiguous stimuli following a period of exposure. During exposure, listeners have different types of contextual information available that can disambiguate the perception of such sounds. In fact, there are several sources of information that can drive learning, including lexical, visual, and sublexical cues, which are discussed in turn below.

A seminal study by Norris and colleagues demonstrated that listeners can use lexical knowledge of their language to guide perception of speech sounds at a sublexical level (Norris et al., 2003). For example, an ambiguous fricative that is midway between /s/ and /f/ is perceived as /s/ when placed in a context like “albatro–”, but is perceived as an /f/ at the end of a word like “paragra–” (Ganong, 1980). Repeated exposure to the ambiguous sound in such lexically-biased contexts leads to a recalibration of the category boundary between /s/ and /f/ in a way that is consistent with the lexical context (see Figure 1, Eisner and McQueen, 2006): Listeners who heard the ambiguous sound in words where it replaced an /f/ subsequently categorised more sounds on an /f/-/s/ continuum as /f/, while, conversely, listeners who had heard the same ambiguous sound in /s/-biased contexts subsequently categorised more sounds as /s/ (Norris et al., 2003). A control condition, in which the same ambiguous sound was embedded in non-words, produced no shift in categorisation responses. This pattern of results suggests that listeners use lexical information to adjust their perception of an ambiguous sound after only brief exposure to this speaker idiosyncrasy (in this case, 12 instances of the critical sound during exposure).

Figure 1.  Perceptual learning effect in a pretest–exposure–posttest design analogous to that of Eisner and McQueen (2006; unpublished data). Two groups of listeners first categorised sounds from a 5-step /s/-/f/ continuum. Their responses were equivalent before exposure (left panel). Participants then heard the most ambiguous step 3 embedded in 2.5 minutes of continuous speech, where it replaced all /f/ sounds for one group, and all /s/ sounds for the other group. Categorisation of step 3 shifted following exposure, such that listeners with /s/-biased exposure gave more /s/ responses, and listeners with /f/-biased exposure gave more /f/ responses.
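
The logic of this pretest–posttest comparison can be made concrete with a minimal simulation (a sketch only: the response proportions below are invented, and this is not the analysis used in the studies cited). Categorisation responses along the continuum are fitted with a logistic psychometric function, and the recalibration effect appears as a difference in the 50% crossover point – the category boundary – between the two exposure groups.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical 5-step fricative continuum (1 = clear /f/, 5 = clear /s/).
steps = np.arange(1, 6)

# Invented posttest proportions of /s/ responses for the two exposure groups.
p_after_s_bias = np.array([0.05, 0.20, 0.75, 0.95, 1.00])
p_after_f_bias = np.array([0.00, 0.05, 0.30, 0.80, 0.95])

def logistic(x, boundary, slope):
    """Psychometric function: probability of an /s/ response at step x."""
    return 1.0 / (1.0 + np.exp(-slope * (x - boundary)))

def fitted_boundary(proportions):
    """Fit the psychometric function and return its 50% crossover point."""
    params, _ = curve_fit(logistic, steps, proportions, p0=[3.0, 2.0])
    return params[0]

b_s = fitted_boundary(p_after_s_bias)
b_f = fitted_boundary(p_after_f_bias)
print(f"boundary after /s/-biased exposure: step {b_s:.2f}")
print(f"boundary after /f/-biased exposure: step {b_f:.2f}")
print(f"recalibration effect: {b_f - b_s:.2f} continuum steps")
```

With these invented numbers, the /s/-biased group ends up with a boundary closer to the /f/ end of the continuum, which is what it means to give more /s/ responses to the ambiguous steps.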

While the paradigm by Norris and colleagues is based on the lexical influence on phoneme perception (i.e., the Ganong effect; Ganong, 1980), a related paradigm is based on a similar influence from the visual domain (i.e., the McGurk effect; McGurk and MacDonald, 1976). The original McGurk effect demonstrated that auditory and visual cues are immediately integrated in perception, by showing that a video of a talker articulating the syllable /ga/ combined with a clear auditory /ba/ often results in the fused percept of /da/. A more recent study found that visual cues can also drive auditory recalibration in situations where ambiguous auditory information is disambiguated by visual information: When perceivers repeatedly hear a sound which could be either /d/ or /b/, presented together with a video of a speaker producing /d/, their phonetic category boundary shifts in a way that is consistent with the information they receive through lip-reading, and the ambiguous sound is assimilated into the /d/ category. However, when the same ambiguous sound is presented with the speaker producing /b/, the boundary shift occurs in the opposite direction (Bertelson et al., 2003; Vroomen and Baart, 2009a). Thus, listeners can use information from the visual modality to retune their perception of ambiguous speech input; in this case, the retuning draws on long-term knowledge about the co-occurrence of certain visual and acoustic cues (similar retuning can also be driven by orthographic information; Mitterer and McQueen, 2009).

In addition to visually- and lexically-driven recalibration, two types of sublexical information have been shown to drive similar learning effects. One is the phonotactic regularities of a language. For example, the English sequence “–rul” is phonotactically legal if the initial sound is /f/, but illegal if it is /s/. The reverse case is a sequence like “–nud”, where the nonword ‘snud’ is consistent with English phonotactics, but ‘fnud’ is not. In direct analogy to lexically- and visually-driven learning, listeners can exploit these statistical regularities when the acoustic signal is ambiguous: Repeatedly hearing an ambiguous /s/–/f/ fricative in contexts like “–rul” results in a shift of the category boundary towards /f/, whereas hearing the same sound in contexts like “–nud” results in a shift to /s/ (Cutler et al., 2008). A second type of sublexically-driven adaptation is induced by contingencies between acoustic cues that make up a phonetic category, such as the multidimensional cues to the identity of stop consonants. For example, one of the main differences between /b/ and /p/ is a temporal distinction, voice onset time (VOT), but one of the secondary cues is the fundamental frequency (F0) of a following vowel. Because these two cues co-occur in a predictable manner (shorter VOTs occur with low F0; longer VOTs with high F0), listeners have implicit knowledge which, again, can be exploited when the speech signal is unclear: Repeated exposure to a stop with ambiguous VOT, in an F0 context which is either consistent with /b/ or /p/, will lead listeners to adjust their category boundary for /b/ and /p/ accordingly over time (Idemaru and Holt, 2011).
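
How a reliable secondary cue can guide recalibration of a primary cue can be illustrated with a toy model (all parameter values and the update rule are invented for illustration; this is not the model proposed by Idemaru and Holt). The listener’s /b/–/p/ boundary is assumed to lie midway between running estimates of the typical VOT for each category; when the VOT of a token is ambiguous, the F0 cue determines its category, and the token then updates that category’s VOT estimate, gradually shifting the boundary.

```python
from dataclasses import dataclass

@dataclass
class StopCategoriser:
    """Toy /b/-/p/ categoriser: VOT is the primary cue, F0 the secondary cue."""
    mean_vot_b: float = 10.0    # running estimate of typical /b/ VOT (ms)
    mean_vot_p: float = 50.0    # running estimate of typical /p/ VOT (ms)
    f0_boundary: float = 150.0  # F0 (Hz) above which the context sounds /p/-like
    learning_rate: float = 0.1

    @property
    def vot_boundary(self) -> float:
        # The category boundary is assumed to lie midway between the VOT means.
        return (self.mean_vot_b + self.mean_vot_p) / 2.0

    def categorise(self, vot: float, f0: float) -> str:
        # Use VOT when it is clearly on one side of the boundary;
        # otherwise fall back on the correlated F0 cue.
        if abs(vot - self.vot_boundary) > 5.0:
            return "p" if vot > self.vot_boundary else "b"
        return "p" if f0 > self.f0_boundary else "b"

    def expose(self, vot: float, f0: float) -> None:
        # The (F0-disambiguated) token updates its category's VOT estimate,
        # which gradually shifts the /b/-/p/ boundary.
        if self.categorise(vot, f0) == "b":
            self.mean_vot_b += self.learning_rate * (vot - self.mean_vot_b)
        else:
            self.mean_vot_p += self.learning_rate * (vot - self.mean_vot_p)

listener = StopCategoriser()
print(f"initial boundary: {listener.vot_boundary:.1f} ms VOT")
# Invented exposure: ambiguous 30-ms VOT tokens always paired with low F0.
for _ in range(20):
    listener.expose(vot=30.0, f0=90.0)
print(f"boundary after /b/-biased exposure: {listener.vot_boundary:.1f} ms VOT")
```

In this sketch, ambiguous-VOT tokens that always carry a low, /b/-like F0 push the boundary towards longer VOT values, so that the previously ambiguous token comes to be categorised as /b/ – the direction of adjustment described above.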

Sublexical category adjustments can thus be guided by various kinds of language-specific information. Research using the exposure–test paradigm to induce phonetic recalibration has revealed some fundamental properties of how listeners adjust to speaker idiosyncrasies. The learning is fast and does not require explicit attention (McQueen et al., 2006b). While listeners are not usually conscious of the shift, learning can be modulated by high-level contextual information. For example, learning is blocked when the source of the ambiguity can be attributed to a transient event, such as the speaker having a pen in her mouth, rather than an inherent characteristic of the speaker (Kraljic et al., 2008). Learning has been shown to remain stable for a period of up to one week (Eisner and McQueen, 2006; Witteman et al., 2015), although the effect dissipates after prolonged testing involving unambiguous sounds (van Linden and Vroomen, 2007; Vroomen and Baart, 2009b). In parallel with research on generalisation of learning about a foreign accent, several studies have investigated whether category recalibration is speaker-specific or speaker-independent, by changing the speaker between exposure and test phase. This work so far has produced mixed results, sometimes finding evidence of generalisation across speakers (Kraljic and Samuel, 2006; 2007; Reinisch and Holt, 2014) and sometimes evidence of speaker-specificity (Eisner and McQueen, 2005; Kraljic and Samuel, 2007; Reinisch et al., 2014). The divergent findings might be partly explained by considering the perceptual similarity between tokens from the exposure and test speakers (Kraljic and Samuel, 2007; Reinisch and Holt, 2014). When there is a high degree of similarity in the acoustic-phonetic properties of the critical phoneme, learning appears more likely to transfer from one speaker to another.

There is thus evidence from a variety of sources that speaker-specific information in the signal influences speech perception. Strikingly, there is also evidence that listeners’ beliefs about who is talking are enough to have an impact on perception (Rubin, 1992). For example, the perceived ethnicity of a speaker can affect how intelligible listeners find their speech. In one study, when primed with a photo of a Chinese Canadian speaker, native listeners judged speech materials as more accented, and less intelligible, than when the same speech materials were presented without a photo. No such effect occurred when the prime was a photo of a White Canadian speaker (Babel and Russell, 2015). Effects of perceived speaker identity are not limited to global intelligibility or accentedness ratings, but have also been found at a sublexical level. Listeners take their knowledge of foreign and regional accents into account when making judgements about individual speech sounds (Hay et al., 2006; Jannedy et al., 2011; Niedzielski, 1999). For example, listeners reported hearing more raised variants of the vowel /ɪ/ in spoken sentences when primed with the written word ‘Australian’ than when primed with the word ‘New Zealander’ and hearing the same sentences (Hay et al., 2006). This pattern is in line with the typical /ɪ/ productions of talkers from Australia and New Zealand. In a recent study, we asked whether the perceived accent of a talker would also influence how likely listeners are to make a perceptual adjustment to that talker’s idiosyncratic pronunciations (Eisner et al., 2013). The idiosyncrasy in this case was word-final devoicing of English stop consonants, which often occurs in learners of English whose native language is Dutch, German, or Turkish, among others. Native English listeners were exposed to Dutch-accented English which contained devoiced stop consonants at the end of words (e.g., ‘seed’ pronounced more like ‘seat’), but not in any other positions. These listeners appeared to adjust to the devoicing by expanding their category for the voiced stop /d/, as measured immediately after exposure. The learning generalised to other positions in the word, such that words with initial voiceless stop consonants such as ‘town’ were accepted as instances of words that should begin with a voiced stop, such as ‘down.’ Interestingly, this generalisation from word-final to word-initial position was only found with genuine Dutch-accented speech, but not in a second experiment in which the speaker was a native speaker of English who purposefully mimicked the final devoicing. In that case, listeners adjusted to the devoicing, but did not generalise the learning to other positions. The perceived global accent of the speaker thus appears to constrain not only how listeners perceive individual speech sounds, but also the way in which they adjust to a talker idiosyncrasy.

To summarise, previously acquired knowledge about non-standard productions of a particular speaker, or a group of speakers, can affect sublexical processes in general and perceptual learning in particular. The ability of the system to utilise this kind of previously learned information has implications for models of speech perception.

3.   Speaker idiosyncrasies in models of speech perception

3.1.   Computational models

Adjusting to speaker idiosyncrasies as described above is not yet fully explained by current computational models of speech comprehension. Two broad classes of models of speech perception are distinguished on the basis of the granularity of acoustic-phonetic information as the signal is being processed from sound wave to meaning: abstractionist and episodic models. In abstractionist models such as TRACE (McClelland and Elman, 1986), the Distributed Cohort Model (Gaskell and Marslen-Wilson, 1997), or Shortlist (Norris, 1994), acoustic-phonetic detail, including information about the speaker, does not feature in the computations leading up to word recognition. These models have a layered architecture with a lexical level at the top and abstract, phoneme-like units mediating between the speech signal and the lexicon. TRACE, for example, can in principle account for general learning effects because it has top-down connections across the system by which lexical information can modulate sublexical processing. However, because the input to those models consists of abstract units not containing fine phonetic detail, an adjustment at a sublexical processing stage would always generalise across the system, regardless of who the speaker is: There is no mechanism to incorporate prior knowledge about the speaker into the processing stream. In contrast, episodic models such as MINERVA (Goldinger, 1998) encode detailed memory traces of every spoken word they encounter, and do not feature abstract sublexical units. During word recognition, lexical candidates are activated in proportion to the similarity between the input signal and memory traces. This lack of abstraction means that fine phonetic detail remains part of the representation. Episodic models are thus able to explain speaker-specific learning effects. However, this type of model fails to account for a different finding in the literature on perceptual learning of speaker idiosyncrasies: The learning has a broad effect in the sense that it applies beyond the specific instances heard during exposure, and generalises to other words in the listener’s mental lexicon (McQueen et al., 2006a), even to words of another language spoken by the same talker (Reinisch et al., 2012). This generalisation is difficult to explain without a prelexical processing layer containing abstract representations that are connected to all entries in the lexicon (Cutler et al., 2010). In an episodic model, a learned adjustment remains specific to the exposure items, whereas in an abstractionist model, a prelexical recalibration of a phoneme contrast will affect all words in the lexicon which contain that contrast. In summary, both classes of computational model remain insufficient for explaining recent data on how listeners adjust to speech, and the evidence may point towards some kind of hybrid model. In such a model, fine phonetic detail, for example speaker-specific information, needs to be taken into account in the decoding of the speech signal. The output of these early perceptual processes might be conceived of as probabilistic, such that the input to the word recognition system consists of phoneme likelihoods rather than strings of abstract phoneme categories (as in the revised Shortlist B model; Norris and McQueen, 2008).
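
The contrast between the two model classes, and the probabilistic output suggested at the end of the previous paragraph, can be sketched in a few lines of code (the feature values, the similarity metric, and the logistic mapping are all invented for illustration; neither fragment implements MINERVA, TRACE, or Shortlist B).

```python
import numpy as np

# --- Episodic sketch: lexical candidates are activated in proportion to their
# similarity to stored traces (feature vectors here are purely hypothetical). ---
traces = {
    "knife": np.array([0.9, 0.1]),  # invented features for a word-final /f/
    "nice":  np.array([0.1, 0.9]),  # invented features for a word-final /s/
}

def episodic_activation(signal):
    """Activation of each stored word, decreasing with distance from its trace."""
    return {word: float(np.exp(-np.linalg.norm(signal - trace)))
            for word, trace in traces.items()}

# --- Abstractionist sketch: a prelexical layer converts the signal into graded
# phoneme evidence (cf. the probabilistic output of Shortlist B); recalibrating
# its boundary changes the evidence passed on for every word in the lexicon. ---
def prelexical_phoneme_probs(fricative_cue, boundary):
    """Return /f/ vs /s/ evidence for a cue value, given the current boundary."""
    p_s = 1.0 / (1.0 + np.exp(-10.0 * (fricative_cue - boundary)))
    return {"f": 1.0 - p_s, "s": p_s}

ambiguous = 0.5  # a fricative cue value midway between /f/ and /s/
print("episodic activations:", episodic_activation(np.array([0.5, 0.5])))
print("before recalibration:", prelexical_phoneme_probs(ambiguous, boundary=0.5))
# After /f/-biased exposure the boundary shifts, so the same ambiguous sound
# now yields mostly /f/ evidence, for all words, not just the exposure items.
print("after recalibration: ", prelexical_phoneme_probs(ambiguous, boundary=0.7))
```

In the episodic fragment, any learning would have to be expressed by adding or reweighting stored traces, so it stays tied to the exposure items; in the abstractionist fragment, shifting the single boundary parameter changes the phoneme evidence delivered for every word containing that contrast, which is how lexicon-wide generalisation falls out.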

3.2.   Neurobiological models

The idea of an acoustic-phonetic processing system which can take into account fine phonetic detail of previously learned episodes has also received some support from neuroscience. Research in this area has identified several candidate regions in superior temporal and inferior parietal cortex (Chan et al., 2013; Obleser and Eisner, 2009; Turkeltaub and Coslett, 2010) that are engaged in aspects of processing speech at a sublexical level of analysis. Like some of the computational models, models of the neurobiology of speech perception incorporate the notion of a functional hierarchy in the processing of sound, and speech in particular. A hierarchical division of the auditory cortex underlies the processing of simple to increasingly complex sounds both in non-human primates (Kaas and Hackett, 2000; Petkov et al., 2006; Rauschecker and Tian, 2000) and in humans (e.g., Binder et al., 1997; Liebenthal et al., 2005; Obleser and Eisner, 2009; Scott and Wise, 2004). Beyond these early acoustic-phonetic stages, processing streams extending in antero-ventral and postero-dorsal direction from primary auditory cortex have been identified (Hickok and Poeppel, 2007; Rauschecker and Tian, 2000; Rauschecker and Scott, 2009; Scott and Johnsrude, 2003). In the left hemisphere, the anterior stream is usually associated with the decoding of linguistic meaning (Davis and Johnsrude, 2003; Hickok and Poeppel, 2007; Scott et al., 2000). In contrast, the anterior stream in the right hemisphere appears to be less sensitive to linguistic information, and more sensitive to information about speakers more generally. Studies that have investigated cortical responses to human vocal sounds in general, and to speaker variation in particular, have found activations primarily on the right (Belin and Zatorre, 2003; Belin et al., 2000; Formisano et al., 2008; Kriegstein and Giraud, 2004; Kriegstein et al., 2008; Kriegstein et al., 2003), and there is converging evidence from studies of conspecific vocalisations in non-human primates (Petkov et al., 2008). The literature thus suggests that there are right-lateralised regions in the auditory cortex that are engaged in the processing of speaker-specific information in speech, but it is currently unclear whether these systems support speech perception, for example by making available speaker-specific information that can be integrated by an early acoustic-phonetic processing system in the left hemisphere.

Nevertheless, there is some evidence from neuroscience for this kind of modulation of early acoustic-phonetic processing. Although it did not specifically investigate speaker-specificity, a recent study by Kilian-Hütten et al. (2011) demonstrated that early acoustic-phonetic processing is indeed affected by previously learned biases. This study found direct evidence of dynamic adjustments to a phonetic category in left auditory cortex: Using a visually-guided perceptual recalibration paradigm (Bertelson et al., 2003), regions of primary auditory cortex (specifically, Heschl’s gyrus and sulcus, extending into planum temporale) could be identified whose activity pattern specifically reflected listeners’ adjusted percepts after exposure, rather than simply physical properties of the stimuli. This not only suggests a bottom-up mapping of acoustical cues to perceptual categories in left auditory cortex, but also shows that this mapping involves the integration of previously learned knowledge, in this case coming from the visual system, within the same auditory areas. Whether linguistic processing in left auditory cortex can be driven by other types of information, such as speaker-specific knowledge from the right anterior stream, will be an interesting question for future empirical investigation.

4.   Conclusions

Plasticity in the mapping of acoustic features to perceptual categories underlies listeners’ ability to adjust rapidly to idiosyncratic properties of individual speakers. Once an adjustment has been learned, it can be used again for later encounters with a speaker. The evidence from the perceptual learning literature is compatible with a system in which such learned biases are integrated with bottom-up properties of the signal early on during processing, and suggests that the output of this system is probabilistic in nature. However, these processes cannot yet be fully accounted for by current computational and neurobiological models. Perceptual adjustments can be driven by a variety of different sources of information, such as visual, lexical, and sublexical cues – and possibly more that are yet to be identified. Studying perceptual adaptation in response to speaker variability is becoming feasible with advanced neuroimaging methods, and this promises to be a valuable tool for probing the neural underpinnings of sublexical processing and abstraction.

Acknowledgements

FE is supported by the research consortium “Language in Interaction” from the Dutch Science Foundation (NWO), and part of this work was funded by NWO grant 275-75-009 to the author. Thanks to two anonymous reviewers for their comments on an earlier version of the manuscript.

References

Babel, M., and Russell, J. (2015). Expectations and speech intelligibility. The Journal of the Acoustical Society of America, 137(5), 2823–2833.

Belin, P., and Zatorre, R. J. (2003). Adaptation to speaker’s voice in right anterior temporal lobe. NeuroReport, 14, 2104–2109.

Belin, P., Zatorre, R. J., Lafaille, P., Ahad, P., and Pike, B. (2000). Voice-selective areas in human auditory cortex. Nature, 403, 309–312.

Benzeghiba, M., De Mori, R., Deroo, O., Dupont, S., Erbes, T., Jouvet, D., et al. (2007). Automatic speech recognition and speech variability: A review. Speech Communication, 49(10-11), 763–786.

Bertelson, P., Vroomen, J., and de Gelder, B. (2003). Visual recalibration of auditory speech identification: A McGurk aftereffect. Psychological Science, 14, 592–597.

Binder, J. R., Frost, J. A., Hammeke, T. A., Cox, R. W., Rao, S. M., and Prieto, T. (1997). Human brain language areas identified by functional magnetic resonance imaging. Journal of Neuroscience, 17, 353–362.

Bradlow, A. R., and Bent, T. (2008). Perceptual adaptation to non-native speech. Cognition, 106(2), 707–729.

Chan, A. M., Dykstra, A. R., Jayaram, V., Leonard, M. K., Travis, K. E., Gygi, B., et al. (2013). Speech-specific tuning of neurons in human superior temporal gyrus. Cerebral Cortex, first published online May 16, 2013. doi:10.1093/cercor/bht127

Clarke, C. M., and Garrett, M. F. (2004). Rapid adaptation to foreign-accented English. The Journal of the Acoustical Society of America, 116(6), 3647–3658.

Cutler, A., Eisner, F., McQueen, J. M., and Norris, D. (2010). How abstract phonemic categories are necessary for coping with speaker-related variation. Laboratory Phonology, 10, 91–111.

Cutler, A., McQueen, J., Butterfield, S., and Norris, D. (2008). Prelexically-driven perceptual retuning of phoneme boundaries. In Proceedings of Interspeech-2008, 2056.

Davis, M. H., and Johnsrude, I. S. (2003). Hierarchical processing in spoken language comprehension. Journal of Neuroscience, 23(8), 3423–3431.

Dupoux, E., and Green, K. (1997). Perceptual adjustment to highly compressed speech: effects of talker and rate changes. Journal of Experimental Psychology: Human Perception and Performance, 23(3), 914–927.

Eisner, F., and McQueen, J. M. (2005). The specificity of perceptual learning in speech processing. Perception & Psychophysics, 67(2), 224–238.

Eisner, F., and McQueen, J. M. (2006). Perceptual learning in speech: Stability over time. The Journal of the Acoustical Society of America, 119(4), 1950–1953.

Eisner, F., Melinger, A., and Weber, A. (2013). Constraints on the transfer of perceptual learning in accented speech. Frontiers in Psychology, 4, 148.

Fenn, K. M., Nusbaum, H. C., and Margoliash, D. (2003). Consolidation during sleep of perceptual learning of spoken language. Nature, 425, 614–616.

Formisano, E., De Martino, F., Bonte, M., and Goebel, R. (2008). “Who” is saying “what?” Brain-based decoding of human voice and speech. Science, 322(5903), 970–973.

Ganong, W. F. (1980). Phonetic categorization in auditory word perception. Journal of Experimental Psychology: Human Perception and Performance, 6, 110–125.

Gaskell, M. G., and Marslen-Wilson, W. D. (1997). Integrating form and meaning: A distributed model of speech perception. Language and Cognitive Processes, 12, 613–656.

Gibson, E. J. (1969). Principles of perceptual learning and development. Englewood Cliffs, NJ.

Goldinger, S. D. (1998). Echoes of echoes? An episodic theory of lexical access. Psychological Review, 105, 251–279.

Greenspan, S. L., Nusbaum, H. C., and Pisoni, D. B. (1988). Perceptual learning of synthetic speech produced by rule. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14(3), 421–433.

Hay, J., Nolan, A., and Drager, K. (2006). From fush to feesh: Exemplar priming in speech perception. The Linguistic Review, 23(3), 351.

Hickok, G., and Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8, 393–402.

Idemaru, K., and Holt, L. L. (2011). Word recognition reflects dimension-based statistical learning. Journal of Experimental Psychology: Human Perception and Performance, 37(6), 1939–1956.

Jannedy, S., Weirich, M., and Brunner, J. (2011). The effect of inferences on the perceptual categorization of Berlin German fricatives. In Proceedings of the International Congress of Phonetic Sciences, Hong Kong, pp. 962–965.

Kaas, J. H., and Hackett, T. A. (2000). Subdivisions of auditory cortex and processing streams in primates. Proceedings of the National Academy of Sciences, USA, 97(22), 11793–11799.

Kilian-Hütten, N., Valente, G., Vroomen, J., and Formisano, E. (2011). Auditory cortex encodes the perceptual interpretation of ambiguous sound. Journal of Neuroscience, 31(5), 1715–1720.

Kraljic, T., and Samuel, A. G. (2006). Generalization in perceptual learning for speech. Psychonomic Bulletin and Review, 13(2), 262–268.

Kraljic, T., and Samuel, A. G. (2007). Perceptual adjustments to multiple speakers. Journal of Memory and Language, 56, 1–15.

Kraljic, T., Samuel, A. G., and Brennan, S. E. (2008). First impressions and last resorts: how listeners adjust to speaker variability. Psychological Science, 19(4), 332–338.

Kriegstein, K. V., and Giraud, A.-L. (2004). Distinct functional substrates along the right superior temporal sulcus for the processing of voices. NeuroImage, 22, 948–955.

Kriegstein, K. V., Dogan, O., Grüter, M., Giraud, A.-L., Kell, C. A., Grüter, T., et al. (2008). Simulation of talking faces in the human brain improves auditory speech recognition. Proceedings of the National Academy of Sciences of the United States of America, 105(18), 6747–6752.

Kriegstein, K. V., Eger, E., Kleinschmidt, A., and Giraud, A. L. (2003). Modulation of neural responses to speech by directing attention to voices or verbal content. Cognitive Brain Research, 17(1), 48–55.

Liebenthal, E., Binder, J. R., Spitzer, S. M., Possing, E. T., and Medler, D. A. (2005). Neural substrates of phonemic perception. Cerebral Cortex, 15, 1621–1631.

McClelland, J. L., and Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1–86.

McGurk, H., and MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.

McQueen, J. M., Cutler, A., and Norris, D. (2006a). Phonological abstraction in the mental lexicon. Cognitive Science, 30(6), 1113–1126.

McQueen, J. M., Norris, D., and Cutler, A. (2006b). The dynamic nature of speech perception. Language and Speech, 49, 101–112.

Mitterer, H., and McQueen, J. M. (2009). Foreign subtitles help but native-language subtitles harm foreign speech perception. PloS One, 4(11), e7785.

Niedzielski, N. (1999). The effect of social information on the perception of sociolinguistic variables. Journal of Language and Social Psychology, 18(1), 62–85.

Norris, D. (1994). Shortlist: A connectionist model of continuous speech recognition. Cognition, 52, 189–234.

Norris, D., and McQueen, J. M. (2008). Shortlist B: A Bayesian model of continuous speech recognition. Psychological Review, 115, 357–395.

Norris, D., McQueen, J. M., and Cutler, A. (2003). Perceptual learning in speech. Cognitive Psychology, 47, 204–238.

Nygaard, L. C., and Pisoni, D. B. (1998). Talker-specific learning in speech perception. Perception and Psychophysics, 60, 355–376.

Nygaard, L. C., Sommers, M. S., and Pisoni, D. B. (1994). Speech perception as a talker-contingent process. Psychological Science, 5(1), 42–46.

Obleser, J., and Eisner, F. (2009). Pre-lexical abstraction of speech in the auditory cortex. Trends in Cognitive Sciences, 13(1), 14–19.

Petkov, C. I., Kayser, C., Augath, M., and Logothetis, N. K. (2006). Functional imaging reveals numerous fields in the monkey auditory cortex. PLOS Biology, 4, 1–14.

Petkov, C. I., Kayser, C., Steudel, T., Whittinstall, K., Augath, M., and Logothetis, N. K. (2008). A voice region in the monkey brain. Nature Neuroscience, 11, 367–374.

Pisoni, D. B. (1997). Some thoughts on ‘normalization’ in speech perception. In K. Johnson and J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 9–30). San Diego, CA: Academic Press.

Rauschecker, J. P., and Tian, B. (2000). Mechanisms and streams for processing of “what” and “where” in auditory cortex. Proceedings of the National Academy of Sciences of the United States of America, 97(22), 11800–11806.

Rauschecker, J., and Scott, S. (2009). Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing. Nature Neuroscience, 12(6), 718–724.

Reinisch, E., and Holt, L. L. (2014). Lexically guided phonetic retuning of foreign-accented speech and its generalization. Journal of Experimental Psychology: Human Perception and Performance, 40(2), 539–555.

Reinisch, E., Weber, A., and Mitterer, H. (2012). Listeners retune phoneme categories across languages. Journal of Experimental Psychology: Human Perception and Performance, 39(1), 75–86.

Reinisch, E., Wozny, D. R., Mitterer, H., and Holt, L. L. (2014). Phonetic category recalibration: What are the categories? Journal of Phonetics, 45, 91–105.

Rosen, S., Faulkner, A., and Wilkinson, L. (1999). Adaptation by normal listeners to upward spectral shifts of speech: Implications for cochlear implants. The Journal of the Acoustical Society of America, 106(6), 3629–3636.

Rubin, D. (1992). Nonlanguage factors affecting undergraduates’ judgments of nonnative English-speaking teaching assistants. Research in Higher Education, 33(4), 511–531.

Samuel, A. G., and Kraljic, T. (2009). Perceptual learning for speech. Attention, Perception & Psychophysics, 71(6), 1207–1218.

Scott, S. K., and Johnsrude, I. S. (2003). The neuroanatomical and functional organization of speech perception. Trends in Neurosciences, 26, 100–107.

Scott, S. K., and Wise, R. J. S. (2004). The functional neuroanatomy of prelexical processing in speech perception. Cognition, 92, 13–45.

Scott, S. K., Blank, C. C., Rosen, S., and Wise, R. J. S. (2000). Identification of a pathway for intelligible speech in the left temporal lobe. Brain, 123, 2400–2406.

Song, J. H., Skoe, E., Banai, K., and Kraus, N. (2012). Training to improve hearing speech in noise: Biological mechanisms. Cerebral Cortex, 22(5), 1180–1190.

Turkeltaub, P. E., and Coslett, H. B. (2010). Localization of sublexical speech perception components. Brain and Language, 114(1), 1–15.

van Linden, S., and Vroomen, J. (2007). Recalibration of phonetic categories by lipread speech versus lexical information. Journal of Experimental Psychology: Human Perception and Performance, 33(6), 1483–1494.

Vroomen, J., and Baart, M. (2009a). Phonetic recalibration only occurs in speech mode. Cognition, 110(2), 254–259.

Vroomen, J., and Baart, M. (2009b). Recalibration of phonetic categories by lipread speech: measuring aftereffects after a 24-hour delay. Language and Speech, 52(Pt 2-3), 341–350.

Weber, A., Di Betta, A. M., and McQueen, J. M. (2014). Treack or trit: Adaptation to genuine and arbitrary foreign accents by monolingual and bilingual listeners. Journal of Phonetics, 46, 34–51.

Witteman, M. J., Bardhan, N. P., Weber, A., and McQueen, J. M. (2015). Automaticity and stability of adaptation to a foreign-accented speaker. Language and Speech, 58, 168–189.