
Speech production and perception: Learning and memory


Edited By Susanne Fuchs, Joanne Cleland and Amélie Rochet-Capellan



Spatial and temporal variability of corrective speech movements as revealed by vowel formants during sensorimotor learning

Eugen Klein, Jana Brunner, Phil Hoole

Abstract: Previous perturbation studies have demonstrated that speakers can reorganize their motor strategies to adapt to articulatory or auditory perturbations (Savariaux, Perrier & Orliaguet, 1995; Rochet-Capellan & Ostry, 2011). However, most studies report a varying degree of inter-individual differences with respect to the adaptation outcome. To evaluate the predictions of the hypotheses put forward to explain these differences, we conducted a multidirectional auditory perturbation study investigating F2 perturbation with native Russian speakers. During participants' production of CV syllables containing the close central unrounded vowel /ɨ/, F2 was perturbed in opposing directions depending on the preceding consonant (/d/ or /g/). The bidirectional shift was intended to encourage participants to produce the vowel /ɨ/ with two different motor strategies and allowed us to investigate intra-individual variation of adaptation patterns as a function of the perturbation direction and the consonantal context. To examine the evolution of the adaptation process, we performed generalized additive mixed modelling (GAMM) on the averaged and individual formant data using the experimental trials as discrete time points. In doing so, we were able to examine sudden changes in participants' adaptation strategies, which appeared as non-linearities in the F2 curve. Our results suggest that previously formulated hypotheses regarding individual adaptation processes make empirical predictions which are not confirmed by the bidirectional perturbation data. We therefore propose a more general hypothesis: successful adaptation depends on speakers' ability to coordinate the perceived auditory errors with appropriate compensatory movements, which is in turn influenced by the complexity of the adaptation task. We discuss this hypothesis in the context of individual adaptation patterns and show that it can explain not only the inter-individual but also the inter-study variability observed in previous perturbation studies.

Keywords: auditory feedback, real-time perturbations, formants, variability, individual behavior, generalized additive mixed modelling, Russian

1. Introduction

1.1. Perturbation and sensorimotor learning

Picture the situation of taking a photo of beautiful lakeside scenery and accidentally dropping your camera into the water. Despite your misfortune, you are lucky and can spot the camera within what appears to be a reachable distance at the lake bottom. Hastily, you try to retrieve the camera but grab beside it a few times before you can actually take hold of it. Or, even worse, you realize that the bottom that appeared reachable lies in fact much deeper below the water surface. In this example, the coordination between your visual input and your hand movements is disrupted by the visual distortions caused by the refraction of light at the boundary between water and air. The fact that you can eventually grab the camera after a few attempts, assuming the lake bottom is indeed within reach, provides evidence for the flexibility of the human sensorimotor system, which is able to adapt to visual perturbations and to find alternative motor strategies to reach the intended goal.

The same is largely true for mechanical and auditory perturbations of speech. That is, when you later recount your tale of bad luck to a friend during the conference dinner and get upset about the unreasonable repair costs of your camera, you might speak with a mouth full of food. In this case, your articulators' movements might be impeded by pieces of food, which will force you to find alternative strategies to articulate the words you intend to utter intelligibly. Or, in another scenario, you may have to increase the loudness of your voice to compensate for the loud conversation at the table next to yours.

During experiments applying controlled perturbation, speakers have to produce speech under aggravated conditions, e.g., with their jaw movements blocked or under altered auditory feedback. As in the initial example of hand-eye coordination, speakers need to map the errors transmitted by their sensory input onto appropriate corrective articulator movements in order to retain the intelligibility of their speech. In the case of speech, a particularly intriguing question is which sensory channels (e.g., somatosensory, proprioceptive, or auditory) are involved in the process of adaptation. The answer to this question may provide a better understanding of the different types of sensory information relevant for speech production and, ultimately, of the goals of articulator movements. Thus, the study of perturbed speech provides an empirical means to study the nature of speech sound representations as well as the learning processes that occur in speech production.

1.2. Outcome variability in perturbation studies of speech

Despite the general ability of speakers to reorganize their motor strategies to retain the acoustic make-up of the intended speech sounds under aggravated conditions, the outcome of adaptation processes in speech exhibits high inter-individual and inter-study variability. For instance, Gay, Lindblom and Lubker (1981) examined participants' productions of vowels when a bite block was inserted between their teeth. The authors found that speakers were able to adapt to these static perturbations with very little or no practice and produce acoustic outputs equivalent to their unperturbed speech. However, in a study by Savariaux, Perrier and Orliaguet (1995), in which speakers' lips were blocked with a tube during the production of the French [u], only six out of 11 speakers were able to partially compensate for the labial perturbation, and only one speaker compensated completely by changing the constriction location from a velo-palatal to a velo-pharyngeal region. The remaining four speakers did not compensate at all. Similar variability of experimental outcomes is also observed across other articulatory perturbation studies, e.g., by Baum & McFarland (1997), Jones & Munhall (2003), and Brunner, Hoole & Perrier (2011). To explain this variability, Savariaux et al. (1995) suggest that the varying degree of adaptation among participants is due to “speaker-specific internal representation of articulatory-to-acoustic relationships”.

More recently, it has become possible to study speakers' articulatory-to-acoustic relations by means of real-time perturbation of speakers' auditory feedback. This methodology allows the alteration of acoustic parameters such as fundamental frequency (f0; Jones & Munhall, 2000) and vowel formants (F1 and/or F2; Houde & Jordan, 1998; Purcell & Munhall, 2006; Villacorta, Perkell & Guenther, 2007), and has the advantage that multiple perturbation conditions can be tested within the same study without participants' awareness of any systematic manipulations. For instance, Rochet-Capellan and Ostry (2011) perturbed the first formant (F1) in the vowel /ɛ/ in opposing directions depending on the experimental stimulus in which it was embedded (head or bed), while in a control stimulus (ted) the F1 remained unchanged throughout the experiment. The authors found that speakers were overall able to adapt to the three distinct F1 levels, which means that participants employed three different motor strategies to produce the vowel /ɛ/ during the study. However, as with the articulatory perturbation studies mentioned above, there is a noteworthy proportion of speakers, ranging from 10 to 20 % per study, who fail to adapt to auditory perturbations. Roughly speaking, these speakers exhibit two qualitatively different types of adaptation behavior: they either adjust their response in the same direction as the applied perturbation or hardly react to it.

One of the more recent hypotheses put forward to explain the outcome variability observed in perturbation studies is the idea by Lametti, Nasir and Ostry (2012) that speakers have individual preferences for articulatory or auditory feedback in controlling their speech production. To evaluate their claim empirically, Lametti et al. (2012) tested participants in different experimental conditions in which the authors either perturbed participants' jaw trajectories without altering their speech acoustics, perturbed their auditory feedback, or applied both types of perturbation simultaneously. The authors found a negative correlation between the amount of articulatory and auditory adaptation, which means that speakers who adapted to articulatory perturbations adapted to auditory alterations to a lesser degree.

However, Lametti et al.'s (2012) hypothesis conflicts with observations previously made by Ghosh et al. (2010), who investigated the relation between somatosensory and auditory acuity, where acuity stands for the degree to which speakers are sensitive to changes in articulatory and auditory feedback signals. Running contrary to the idea that speakers exhibit individual preferences for auditory or somatosensory feedback, Ghosh et al. (2010) found that both types of acuity correlated positively with each other as well as with the magnitude of produced sibilant contrasts. In the context of vowels, an analogous finding was previously reported by Perkell et al. (2004). Furthermore, auditory acuity has been shown to influence the adaptation magnitude during auditory perturbation of vowel formants (Villacorta et al., 2007) as well as during articulatory perturbation of sibilants (Brunner, Ghosh, Hoole, Matthies, Tiede & Perkell, 2011). In contrast to Lametti et al.'s (2012) hypothesis, which predicts that speakers who fail to adapt to auditory perturbations should virtually ignore them, individual differences in auditory acuity provide a way to explain the partial compensations frequently observed in auditory perturbation studies.

Another explanation for partial compensations was provided by Katseff, Houde & Johnson (2012), who suggest that they are the result of speakers' attempts to integrate the altered auditory signal with the normal somatosensory signal received during a perturbation experiment. Similar to other authors (e.g., Sato, Schwartz & Perrier, 2014), Katseff et al. (2012) assume that vowel targets are defined as regions in a multidimensional acoustic-somatosensory space. That is, when the acoustic parameters of speakers' productions are diverted from the target during auditory perturbation, speakers will compensate for the acoustic error. However, their compensation will stop when the discrepancy between the auditory and somatosensory signals becomes too large. Katseff et al. (2012) support their view with the observation that in their study of F1 perturbation the relative compensation magnitude decreased from 100 % for 50 Hz perturbations to 40 % for 250 Hz perturbations. An analogous finding was previously made by MacDonald, Goldberg & Munhall (2010) for F1 and F2 perturbation.

At this point, we would like to add that it is alternatively possible that it is not the discrepancy between the altered acoustic and somatosensory signals that is causing the incomplete compensation, but rather physical restrictions which do not allow participants to compensate beyond a certain physical limit. For instance, it seems plausible that large F1 perturbations could require speakers to push their tongue beyond physical limits imposed by the palate, the upper incisors, or other parts of the vocal tract.

1.3. The role of the adaptation task complexity

Although the hypotheses reviewed above are based on different premises, they mostly ascribe the source of the inter-individual outcome variability to the mechanisms of speakers' internal models of speech motor control. This approach leads to a situation in which each of the proposed hypotheses offers a potential explanation for the inter-individual adaptation variability in the context of a specific perturbation task; however, none of them actually provides a general account of the variability observed across different experimental tasks or conditions. In this situation, it appears necessary to investigate whether the complexity of the adaptation task might have an impact on its outcome. Let us illustrate this point with an example.

It is plausible to assume that adaptation to bite-block perturbation during the production of vowels (e.g., Gay et al., 1981) requires an articulatory adjustment that is more similar to the unperturbed condition than in the case of lip-tube perturbation during the production of /u/ (Savariaux et al., 1995). During the first task, participants are merely required to raise their tongue more than usual, since their jaw, which normally assists in this task, is blocked. Furthermore, the direction of the compensatory tongue movement does not change due to the perturbation. During the lip-tube perturbation, on the other hand, participants have to compensate for blocked lip rounding by retracting their tongue. This articulatory adjustment is less obvious, as the articulator used to compensate for the perturbation and its movement direction are less associated with the usual articulatory configuration used to produce the intended sound. As a consequence, the adaptation process may take longer and fewer speakers are able to identify the appropriate articulatory adjustments to compensate for the perturbation. Therefore, in our current study we also investigate whether the outcome variability can be explained by speakers' inability to coordinate the perceived auditory error with appropriate corrective articulatory movements.

1.4. Current study

To investigate in more detail how speakers translate the altered auditory signal into corrective articulatory movements, we conducted a bidirectional auditory perturbation study with native Russian speakers. Unlike Rochet-Capellan & Ostry (2011), who investigated speakers’ adaptation to multiple F1 degrees, we focused our investigation on F2, which is, roughly speaking, an indicator of horizontal tongue displacement. In our experiment, participants had to produce the close central unrounded vowel /ɨ/ embedded in CV syllables /dɨ/ and /gɨ/. Depending on the preceding consonant, F2 in /ɨ/ was perturbed in opposing directions.

The bidirectional perturbation imposed higher adaptation demands on our participants, since they had to coordinate their corrective movements in two different ways depending on the perturbation direction. Based on the hypothesis that higher task complexity influences the adaptation process, we expected to observe a large amount of exploratory corrective movements and possibly also spontaneous changes of behavior in the course of the experiment.

The combination of the place of articulation (alveolar vs. velar) and the perturbation direction (down vs. up) was counterbalanced across all participants, which allowed us to control for the potential influence on compensation of the articulatory restrictions associated with each syllable. To investigate how quickly participants can adapt to abrupt and substantial changes in perturbation magnitude, we increased the perturbation amount in 150 Hz steps across three perturbation phases and excluded ramp trials (gradual changes of perturbation magnitude) from the experiment.

Finally, to understand the spatial and temporal evolution of the adaptation process, we analyzed the formant data with generalized additive mixed models (GAMMs), which allowed us to observe non-linear changes in participants' responses to perturbation. By doing this, we sought to overcome a shortcoming of previous perturbation studies, which concentrate on comparing speakers' performance between the beginning and the end of the adaptation task, i.e., in the most extreme cases during the first and the last trial of the experimental session, or more often during the first 15–20 and the last 15–20 trials of the experiment. Unfortunately, this aggregation approach allows only for pairwise, time-uncorrelated comparisons (e.g., Feng, Gracco & Max, 2011; Trudeau-Fisette, Tiede & Ménard, 2017), while the evolution of the adaptation process is often presented only in exploratory scatterplots in earlier studies (e.g., Rochet-Capellan & Ostry, 2011; Lametti et al., 2012; Mitsuya, Munhall & Purcell, 2017).

2. Methods

2.1. Participants

Eighteen native speakers of Russian (14 female and 4 male) without reported speech, language, or hearing disorders participated in the experiment. All participants were recruited in Berlin. The mean age of the group was 25.8 years (range 20–37). Participants had spent on average three years in Germany prior to the recordings. The study was approved by the local ethics committee and all speakers gave their written consent to participate in the study.

2.2. Equipment

For each experimental session, participants were seated in front of a 19-inch monitor inside a sound-attenuated booth. The monitor served to display the stimuli and experimental instructions, which were presented in Russian. Participants' speech was recorded with a Beyerdynamic Opus-54 neck-worn microphone and fed back via foam-tipped E-A-RTONE 3A insert earphones (Figure 1). The distance between participants' mouths and the microphone was about 3–5 cm. The earphones attenuated the air-conducted sound by 25–30 dB, while the feedback level was amplified relative to the microphone gain to weaken potential effects of air and bone conduction. The feedback volume was fixed across all participants. However, it was not possible to quantify the feedback level in a precise and meaningful manner, since the actual feedback volume is expected to vary slightly due to such parameters as the length and size of participants' ear canals. Real-time tracking and formant perturbation were performed with AUDAPTER, a C++ audio signal processing application executable within a MATLAB environment (cf. Cai et al., 2008, for technical details). The delay of the feedback loop was approximately 14 ms. The original and perturbed audio signals were digitized and saved with a sampling rate of 16 kHz. AUDAPTER also stored data files which contained the formant values (F1, F2, and F3) tracked on each trial.

Figure 1: (A) Scheme of the experimental set-up. (B) Foam tipped insert earphones.

2.3. Speech stimuli and experimental protocol

For our study we chose Russian since its vowel inventory includes the close central vowel /ɨ/ which is flanked within the F2 space on each side by the two phonemes /i/ and /u/. This constellation allowed us to investigate multiple adaptation in /ɨ/ with bidirectional perturbation of the F2 frequency. The vowel /ɨ/ has a special status in the Russian vowel system since it never appears in word initial position or after palatalized consonants (cf. Bolla, 1981, p. 66).

Each recording session lasted approximately 20–25 minutes and consisted of four experimental phases. Before the start of the first experimental phase, participants completed a few practice trials with unrelated speech material to ensure that they understood the task and were able to perform it accurately. During a baseline phase, which lasted for 60 trials, no auditory perturbation was applied and participants were able to familiarize themselves with the experimental situation of receiving auditory feedback over earphones. On each trial, which had an approximate duration of 2 seconds, participants were visually prompted to produce one of the four CV syllables /di/, /dɨ/, /gɨ/, and /gu/. This was done to assess participants' initial F1–F2 formant space. The inter-stimulus interval between trials was approximately 1.5 seconds. The visual presentation of the stimuli was controlled by a customized MATLAB software package developed at the Institute of Phonetics and Speech Processing, LMU Munich.

During the three following perturbation phases, each of which lasted for 50 trials, participants produced CV syllables containing the close central unrounded vowel /ɨ/ embedded in the context of the alveolar and velar consonants /d/ and /g/. Depending on the consonantal context, F2 was perturbed either downwards or upwards on each trial of each perturbation phase. Within each perturbation phase all stimuli were presented in pseudorandom order. This means that a participant could experience one perturbation direction on one trial and the other direction on the immediately following one; also, the same perturbation direction was never applied on more than two consecutive trials. The combination of the place of articulation (alveolar vs. velar) and the perturbation direction (downward vs. upward) was evenly counterbalanced between the 18 participants, resulting in two experimental groups (A and B). The perturbation magnitude amounted to 220 Hz during the first perturbation phase and increased by 150 Hz in each subsequent phase. Consequently, the perturbation magnitude was 370 Hz in the second perturbation phase and reached 520 Hz in the last phase of the experiment. The amount of perturbation did not change within each shift phase. There were no ramp trials between the perturbation phases.
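The ordering constraint described above (the same perturbation direction on no more than two consecutive trials) can be sketched as a constrained shuffle. This is an illustrative reconstruction, not the authors' actual presentation code; the syllable labels are ASCII stand-ins for /dɨ/ and /gɨ/, and the sketch does not enforce exactly equal counts of the two syllables.

```python
import random

def make_trial_order(n_trials=50, max_run=2, seed=1):
    """Generate a pseudorandom sequence of two syllables such that the same
    one (and hence the same perturbation direction) never occurs on more
    than `max_run` consecutive trials."""
    syllables = ["dy", "gy"]  # stand-ins for /dɨ/ and /gɨ/
    rng = random.Random(seed)
    order = []
    for _ in range(n_trials):
        options = list(syllables)
        # If the last `max_run` trials were identical, a further repeat
        # would violate the constraint, so drop that option.
        if len(order) >= max_run and all(s == order[-1] for s in order[-max_run:]):
            options.remove(order[-1])
        order.append(rng.choice(options))
    return order

order = make_trial_order()
```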

Participants were naïve to the purpose of the experiment and were instructed to produce all syllables with prolonged vowels. The prolongation of the vowels maximized the amount of time during which participants were exposed to perturbed vowels. To keep the prolongation duration consistent across participants, they were assisted by a visual go-and-stop signal during their production. The go-and-stop signal had the form of a frame. Between trials, while the frame stayed red, the response syllable of the upcoming trial appeared on the display and stayed within the frame. When a trial started, the frame color turned green, which gave participants the signal to begin their response.

Following the experimental session, all participants were asked whether they had noticed anything unusual in their auditory feedback during the experiment. A few participants reported that their pronunciation was different from what they were used to, or that they perceived an acoustic difference between the syllables /dɨ/ and /gɨ/. Most participants attributed these pronunciation differences to the effect of listening to their own speech on audio recordings; when asked if and how these differences affected their production, participants reported having ignored them. From previous research, however, it is known that participants are not able to voluntarily control their reaction to auditory perturbation even if they are told to ignore it (cf. Munhall, MacDonald, Byrne & Johnsrude, 2009).

The recordings of all 18 participants amounted to 3,780 trials. The onset and offset of the vowel segment produced on each trial were labeled manually in MATLAB using its graphical input facilities. Subsequently, the formant trajectories were extracted from AUDAPTER's data files based on the labeled onset and offset boundaries. A window spanning 50 % of each formant trajectory, centered at its midpoint, was used to compute the formant mean produced on each trial.
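The windowing step can be sketched as follows; the function and parameter names are ours, and the sketch only assumes a formant trajectory sampled at uniform intervals.

```python
import numpy as np

def formant_mean_central(trajectory, window_frac=0.5):
    """Mean of a formant trajectory over a window covering `window_frac`
    of its samples, centered at the trajectory midpoint (a sketch of the
    windowing described in the text)."""
    traj = np.asarray(trajectory, dtype=float)
    n = len(traj)
    half = int(round(n * window_frac / 2))  # half the window, in samples
    mid = n // 2
    lo, hi = max(0, mid - half), min(n, mid + half)
    return float(traj[lo:hi].mean())
```

This discards the transitions at the vowel edges and keeps the steady-state portion, which is what the per-trial formant means are meant to represent.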

2.4. Data analysis

All analyses were performed in R (version 3.4.1; R Core Team, 2017). During the data analysis, we first examined the general adaptation pattern that occurred over the course of the experiment in the syllables containing the central vowel /ɨ/. Next, we looked at individual spatial and temporal changes of vowel formants due to the applied perturbation. Finally, by investigating participants’ initial F1–F2 vowel space, we evaluated the potential influence of the surrounding sound categories /i/ and /u/ on the individual compensation strategies.

To examine average formant changes in participants' production of the two syllables /dɨ/ and /gɨ/ across the four experimental phases, we fitted a generalized additive model (GAM; Hastie & Tibshirani, 1987). A GAM is an extension of the generalized linear regression model which allows the modelling of non-linear relationships between the dependent and independent variables (Wood, 2017a). GAMs are therefore much more flexible than linear regression models. The non-linear relationships are modelled via complex functions (smooths) which are constructed from a set of basis functions (e.g., linear, quadratic, and cubic functions) with an adjustable number of basis dimensions (ten by default). The number of basis dimensions indicates the upper limit of how complex the constructed function can be, while the appropriate degree of non-linearity within this limit is estimated directly from the data during the modelling process. This means that the usage of GAMs does not require the researcher to specify a particular (non-linear) function in advance, as it is derived directly from the data. To prevent overfitting of the data, i.e., modelling of functions which are too complex and therefore might obscure any generalizable patterns in the data, GAMs are estimated using penalized likelihood estimation and cross-validation (cf. Wood, 2006, for details). In the case of (leave-one-out) cross-validation, several subsets of the complete data sample are created, each excluding a single data point, and the model is refitted to all of these subsets to examine how well it predicts the excluded data. One further advantage of GAMs is the possibility to include random effects in the model structure to account for individual response variability across but also within speakers (cf. Baayen, Vasishth, Kliegl & Bates, 2017). To denote the inclusion of random effects in the fitted model, it is dubbed a generalized additive mixed model (GAMM). For a hands-on introduction to GAMMs with a focus on dynamic speech analysis see Sóskuthy (2017).
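As a toy illustration of how cross-validation lets the data select model complexity, the sketch below scores polynomial fits of increasing degree by leave-one-out error. Polynomial degree here merely stands in for the flexibility of a smooth; this is not the penalized spline estimation that mgcv actually performs, and all names and data are invented.

```python
import numpy as np

def loo_cv_error(x, y, degree):
    """Leave-one-out cross-validation error of a polynomial fit: refit the
    model with each point held out and score the prediction on that point."""
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        coefs = np.polyfit(x[mask], y[mask], degree)
        errs.append((np.polyval(coefs, x[i]) - y[i]) ** 2)
    return float(np.mean(errs))

# Noisy non-linear data: one period of a sine plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)

# Complexity is selected by the data: score several candidate degrees
# and keep the one with the lowest held-out error.
best = min(range(1, 8), key=lambda d: loo_cv_error(x, y, d))
```

An underfitting model (degree 1) leaves the sine shape in its held-out errors, so cross-validation prefers a more flexible fit without the researcher specifying the curve's form in advance, which mirrors the logic described above.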

The GAMM offers three main advantages for analyzing the data from the current experiment. First, it is possible to analyze the data as a function of time, which allows us to investigate the whole adaptation process rather than just its outcome. Secondly, the non-linear parameter smooths do not impose any assumptions regarding the temporal or spatial characteristics of the adaptation process. Finally, the parameter smooths can be estimated including random effects, which allows us to capture individual variability in the adaptation process.

Prior to building the GAMM, participants' raw formant frequencies were normalized by subtracting each participant's mean formant frequency produced during the baseline phase for the respective syllable (/dɨ/ or /gɨ/). This was done to exclude participant-specific differences in absolute formant magnitudes (e.g., due to gender differences). By means of this normalization, the average F1 and F2 values for /dɨ/ and /gɨ/ were set to zero for the baseline phase.
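The baseline normalization can be sketched on a toy trial table; the column names, labels, and values below are invented for illustration and do not come from the authors' data.

```python
import pandas as pd

# Toy trial table: two participants, two syllables, baseline + one shift phase.
df = pd.DataFrame({
    "participant": ["p1", "p1", "p1", "p1", "p2", "p2", "p2", "p2"],
    "syllable":    ["dy", "dy", "gy", "gy", "dy", "dy", "gy", "gy"],
    "phase":       ["baseline", "shift1"] * 4,
    "F2":          [1800.0, 1850.0, 1700.0, 1660.0,
                    2100.0, 2160.0, 2000.0, 1950.0],
})

# Subtract each participant's baseline mean for the respective syllable,
# so that baseline-phase averages sit at 0 Hz.
baseline = (df[df["phase"] == "baseline"]
            .groupby(["participant", "syllable"])["F2"].mean()
            .rename("F2_base"))
df = df.join(baseline, on=["participant", "syllable"])
df["F2_norm"] = df["F2"] - df["F2_base"]
```

After this step, positive and negative `F2_norm` values are directly comparable across participants regardless of their absolute formant ranges.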

Subsequently, using the mgcv package (Wood, 2017b), we fitted one GAMM for each formant (F1 and F2) with the normalized frequencies averaged across all participants and all experimental trials as the dependent variable. The data of the unperturbed syllables /di/ and /gu/, which were uttered by participants only during the baseline phase, were not included in the resulting GAMMs. All GAMM models were evaluated, interpreted, and visualized by means of the itsadug package (van Rij, Wieling, Baayen & van Rijn, 2017).

In the model structure, we included random factor smooths with an intercept split by perturbation direction (upward vs. downward) in order to assess (potentially non-linear) individual differences in compensation magnitude over the course of the experiment. The model also included a fixed effect which assessed the ‘constant’ effect of the perturbation direction independently of the temporal variation. The resulting models explained 46.6 % and 66.9 % of the variance in the F1 and F2 data, respectively. In comparison, a model which did not include the random smooths (participant-specific temporal variation) but only random intercepts and random slopes explained only 31.2 % of the variance in the F2 data. Perhaps somewhat surprisingly, the inclusion of the phase number (shift 1, shift 2, and shift 3) as an interaction with the perturbation direction did not significantly improve the model fit. We also refitted the F2 model including an interaction between the perturbation direction (upward vs. downward) and the experimental group (A vs. B), which did not improve the fit either. In both cases, the goodness of fit was assessed by the Akaike Information Criterion (AIC; Akaike, 1974).
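As a reminder of what the AIC trades off, here is a minimal sketch for models with Gaussian residuals. This is a simplification for illustration (mgcv computes the likelihood of the full penalized model); the function name and the profiled-variance formulation are our own.

```python
import numpy as np

def gaussian_aic(residuals, n_params):
    """AIC = 2k - 2*logLik for a model with zero-mean Gaussian residuals,
    with the error variance profiled out (counted as one extra parameter)."""
    r = np.asarray(residuals, dtype=float)
    n = r.size
    sigma2 = np.mean(r ** 2)  # MLE of the residual variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    return 2 * (n_params + 1) - 2 * loglik
```

A lower AIC indicates a better trade-off between fit and complexity: an added interaction term is kept only if its likelihood gain outweighs the penalty for its extra parameters, which is the criterion applied to the phase-number and group interactions above.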

Following the suggestion in Baayen, van Rij, de Cat & Wood (2016), the fitted models were investigated for the presence of autocorrelation in their residuals. Autocorrelation in the present study represents the correlation between the formant frequencies produced by one participant on two consecutive experimental trials. The higher the autocorrelation, the less information each additional experimental trial contributes to the statistical model. Ignoring this issue might result in overconfident estimates of the standard errors, confidence intervals, and p-values. The amount of autocorrelation at lag 1 was relatively moderate in the present data: 0.2 for F1 and 0.17 for F2. The effect of autocorrelation was reduced practically to zero by incorporating AR(1) error models in the specification of the fitted GAMMs. The corrected models explained 23.1 % and 63.4 % of the variance in the F1 and F2 data, respectively. The drop in explained variance is due to the refitted models taking the autocorrelation into account, which worsens their predictions of the actual frequency values. This is especially true for the F1 model, which indicates that much of the variance in the initial model can be explained by autocorrelated errors rather than by the specified model parameters, such as the direction of the applied perturbation. Visual model inspection revealed that the residuals of the adjusted GAMMs followed a normal distribution for both the F1 and F2 data.
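The lag-1 autocorrelation check can be computed directly from the residual series. The AR(1) series below merely illustrates the kind of moderate trial-to-trial dependence reported in the text; it is simulated, not the authors' data.

```python
import numpy as np

def lag1_autocorrelation(residuals):
    """Correlation between a residual series and itself shifted by one
    trial, i.e., the lag-1 autocorrelation discussed in the text."""
    r = np.asarray(residuals, dtype=float)
    return float(np.corrcoef(r[:-1], r[1:])[0, 1])

# Simulate an AR(1) residual series with coefficient 0.2 (the value
# reported for F1): each residual carries over 20 % of the previous one.
rng = np.random.default_rng(42)
eps = rng.normal(size=5000)
ar1 = np.empty_like(eps)
ar1[0] = eps[0]
for t in range(1, len(eps)):
    ar1[t] = 0.2 * ar1[t - 1] + eps[t]
```

In mgcv, such a series is handled by passing the estimated lag-1 value as the AR(1) coefficient of the error model, after which the corrected residuals should show a lag-1 autocorrelation near zero.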

To examine individual spatial and temporal differences in the adaptation process, we extracted the F2 curves estimated for each participant by the GAMM described above.

In order to evaluate whether the occurrence of certain individual compensation patterns was induced by the sound categories surrounding the perturbed vowel, we investigated participants' F1–F2 space using their baseline-phase productions. For this purpose, we fitted two linear mixed models using the lme4 package (Bates, Mächler, Bolker & Walker, 2015). One model was fitted for each of the two average formant frequencies (F1 and F2) produced by participants in the syllables /di/, /dɨ/, /gɨ/, and /gu/ during the baseline phase. The model structure included the produced syllable and the interaction between the syllable and gender as fixed effects and the formant frequency as the dependent variable. Furthermore, both models included an interaction between the syllable and the compensatory pattern observed for each participant (cf. section 3.2 for a detailed discussion of individual compensation patterns). Random intercepts were modeled for each participant, as well as random slopes for each produced syllable.

Visual model inspection revealed that the residuals of the chosen models followed a normal distribution for the F1 and F2 data. P-values were obtained with the lmerTest package (Kuznetsova, Brockhoff & Bojesen-Christensen, 2016).

3. Results

3.1. Overall compensatory behavior

The GAMM estimated for F1 suggested that the applied perturbation did not have a ‘constant’ effect on the produced F1 values, since its average did not significantly differ from the baseline on trials with upward (2.97 Hz, t = 1.09, p > .05) or downward perturbation (-1.14 Hz, t = -0.32, p > .05). These values represent ‘constant’ F1 differences for the whole experiment, since they do not take into account any changes that appeared over time. Taking the temporal variation over the course of the experiment into account, the model did not reveal an F1 difference from the baseline for either of the two perturbation directions (Figure 2A). Furthermore, a direct comparison between trials with upward and downward perturbation revealed no significant difference in their F1 curves (Figure 2B). The average F1 difference amounted to 0.96 Hz (95 % CI [-6.03 7.94]) by the end of the first shift phase, 1.90 Hz (95 % CI [-6.14 9.95]) by the end of the second shift phase, and 2.93 Hz (95 % CI [-7.86 13.72]) by the end of the experiment. Random non-linear smooths of the F1 model suggest that there were unsystematic participant-specific F1 changes which are most likely not related to the applied perturbation (Figure 2C).

The absence of systematic compensatory effects in F1 is expected, as no F1 perturbation was applied during the experiment. This outcome provides additional support for the validity of the applied experimental manipulation and for the assumption that any systematic effects found for F2 are due to the application of the bidirectional perturbation. Given the absence of compensatory effects in F1, we will not discuss this variable further.

Figure 2: Visual summary of the fitted GAMM model for F1: (A) Average compensatory effects (excluding random participant effects) in F1 for downward and upward perturbation over the course of the experiment. Grey bands represent 95 % confidence intervals. (B) The average difference in F1 between trials produced under opposing perturbation directions over the course of the experiment. Grey bands represent 95 % confidence intervals. (C) Random smooths estimated for each participant for her/his average F1 curve split by the perturbation direction.

The GAMM estimated for F2 suggested that the applied perturbation had a ‘constant’ effect on the produced F2 values on trials with upward (-127.58 Hz, t = -5.76, p < .05) as well as on trials with downward perturbation (143.14 Hz, t = 5.36, p < .05). The direction of the ‘constant’ effect was opposed to the direction of the applied perturbation during upward and downward perturbation. Examining the effect of the perturbation over time, the model revealed that this effect increased for both directions (upward and downward) over the course of the experiment (Figure 3A). On average, however, the effect appears to be stronger for the upward perturbation than for the downward perturbation. The F2 difference between trials produced under opposite perturbation directions became significant after the baseline phase and increased, as expected, over the three perturbation phases (Figure 3B). The average F2 difference amounted to -131.51 Hz (95 % CI [-183.5 -79.52]) by the end of the first shift phase and to -193.61 Hz (95 % CI [-254.79 -132.44]) by the end of the second shift phase. By the end of the experiment, the average F2 difference reached -261.13 Hz (95 % CI [-345.41 -176.84]). The model suggested that the average compensatory effect in F2 can be modeled by linear functions for both perturbation directions, as the estimated degrees of freedom (EDF) for both smooth terms amounted to 1. On the other hand, the random smooths fitted for individual participants exhibited a high degree of non-linearity for upward (EDF = 116.66) and downward (EDF = 97.76) perturbation directions.
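The ‘significant difference’ region marked in Figure 3B corresponds to trials where the pointwise 95 % confidence interval of the estimated difference curve excludes zero. A minimal Python sketch with invented illustrative values (the actual curves and intervals come from the fitted GAMM):

```python
def significant_trials(diff_curve, ci_halfwidth):
    """Trial indices where the pointwise 95 % CI of an estimated
    difference curve excludes zero (|difference| > CI half-width)."""
    return [t for t, (d, h) in enumerate(zip(diff_curve, ci_halfwidth))
            if abs(d) > h]

# Invented values: the F2 difference grows across trials; CI half-width constant.
diff = [-10, -40, -80, -130, -190, -260]
half = [60] * 6
print(significant_trials(diff, half))  # [2, 3, 4, 5]
```

With a growing difference and roughly stable uncertainty, the significant region begins partway through the experiment and extends to its end, which is the pattern reported for F2.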

The random F2 smooths fitted individually for each participant demonstrate that, above and beyond the general tendency to counteract the applied perturbation, participants’ adaptation patterns exhibited high variability in both investigated dimensions (formant frequency and time). For instance, the individual smooths refined the general observation that the downward perturbation caused, on average, a weaker compensatory effect over time. In Figure 3C, it is apparent that for most participants the solid lines (F2 curves produced under downward perturbation) remained closer to the baseline than the dashed lines (F2 curves produced under upward perturbation).

Figure 3: Visual summary of the fitted GAMM model for F2: (A) Average compensatory effects (excluding random participant effects) in F2 for downward and upward perturbation over the course of the experiment. Grey bands represent 95 % confidence intervals. (B) The average difference in F2 between trials produced under opposing perturbation over the course of the experiment. The solid thick line denotes the region where the F2 difference was significant. Grey bands represent 95 % confidence intervals. (C) Random smooths estimated for each participant for her/his average F2 curve split by the perturbation direction.

To understand these participant-specific differences, we will examine and discuss individual adaptation patterns in more detail in the next section.

3.2. Individual compensatory patterns

As revealed by the individual F2 curves estimated by the GAMM model, the most distinct characteristic among participants was the magnitude of their compensation for the downward perturbation. Based on this metric, we identified five participants who compensated for the downward perturbation to the same extent as for the upward perturbation throughout the experiment, and ten participants who compensated less (if at all) for the downward than for the upward perturbation. In Figure 4, the first group (‘symmetrical’ compensation pattern) is represented by participants 3, 9, and 13, while participants 4, 16, and 17 can be considered to represent the second group (‘asymmetrical’ compensation pattern).

Examining subfigures for participants 3, 9, and 13, we see that the F2 curves for the two syllables /dɨ/ and /gɨ/ diverged by equal amounts from the baseline as the experiment progressed. For participants 4 and 16, on the other hand, the F2 curve produced under the upward perturbation diverged more strongly from the baseline. In contrast, the F2 curve produced under the downward perturbation appears to have fluctuated around the baseline. For participant 17, the effect of the perturbation direction appears to be flipped with stronger compensation for the downward perturbation.

In addition to the symmetrical and asymmetrical compensation patterns, we identified three participants in the sample who were not able to consistently compensate for the opposite perturbation directions throughout the experiment (see participants 5, 6, and 7 in Figure 4). In summary, all 18 participants in the study exhibited one of the three described adaptation behaviors. Due to space limitations, Figure 4 depicts representative data from only nine of them.


As Figure 4 reveals, the individual adaptation patterns exhibited substantial spatial and temporal non-linearities. This makes it inadvisable to assess whether speakers successfully adapted to the perturbation by plain pairwise comparisons between their productions during the baseline and the last perturbation phase. In the worst case, this approach risks obscuring the specific characteristics of the adaptation patterns. On the grounds of such a comparison alone, participants 5, 6, and 7 would qualify as speakers who failed to compensate in opposite directions. However, examining the evolution of their F2 responses over the course of the experiment, it is apparent that all three participants tried to compensate for the applied shifts, with participant 7 eventually achieving this goal for the upward but not the downward perturbation direction.

Figure 4: Individual compensatory effects in F2 for downward and upward perturbation across all experimental trials. The F2 curves were estimated by the same GAMM model which is depicted in Figure 3. Please note: individual y-axis scales were applied due to large inter-individual differences in compensatory magnitude. Vertical dashed lines denote the beginnings and the ends of the experimental phases. After the baseline phase (Base), the perturbation magnitude amounted to 220 Hz (Shift 1), 370 Hz (Shift 2), and 520 Hz (Shift 3).

Participant 6, for instance, initially increased F2 in both experimental syllables independently of the perturbation direction. After the second perturbation phase, the produced F2 frequency started to drift back in the negative direction but nonetheless remained distinct for the two syllables. This pattern suggests that, although she was not counteracting the applied perturbation, participant 6 was able to perceive the auditory errors caused by the downward and upward perturbations and to differentiate between them. Participant 5, on the other hand, differentiated between the two perturbation directions during the first perturbation phase, but changed her compensatory movements for the downward shifts during the second perturbation phase, such that she produced the same F2 frequency for both syllables at the end of the experiment.

Analogously to participants 5 and 6, participant 7 was not able to develop an appropriate compensation strategy when the perturbation was first applied. However, she corrected her initial strategy in the course of the experiment: she started to counteract the perturbation during the second perturbation phase and eventually developed two consistently different production strategies by the end of the experiment.

The relative compensation magnitude varied substantially across all participants, independently of whether they could successfully compensate for both perturbation directions or not. During the last perturbation phase, for instance, the compensation magnitude ranged between 6.6 % and 103 % across all participants. The amount of change of the compensation magnitude over the course of the experiment also differed among participants. Compare, for instance, the adaptation patterns of participants 9 and 13 in Figure 4. While for participant 9 the compensation magnitude increased with the increasing perturbation magnitude, the compensation magnitude of participant 13 appears to have reached an absolute compensation limit of around 100 Hz for both perturbation directions. Overall, there was a weak negative correlation between the average compensation magnitude and the perturbation magnitude (r = -0.19, t = -10.31, p < .05, 95 % CI [-0.23 -0.16]). This suggests that the average compensation magnitude slightly decreased as the perturbation magnitude increased.
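The relative compensation magnitudes quoted above (6.6–103 %) express the produced F2 shift as a signed percentage of the applied perturbation. A minimal sketch; the function name and sign convention are our illustrative assumptions, not the authors’ code:

```python
def relative_compensation(f2_produced, f2_baseline, perturbation):
    """Produced F2 shift as a signed percentage of the applied perturbation.

    Positive values mean the speaker moved F2 opposite to the shift
    (counteracting); 100 % would cancel the perturbation exactly, and
    negative values indicate following the perturbation.
    """
    return -(f2_produced - f2_baseline) / perturbation * 100

# A speaker lowering F2 by 340 Hz against a +520 Hz upward shift (Shift 3):
print(round(relative_compensation(1660.0, 2000.0, 520.0), 1))  # 65.4
```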

3.3. Role of the initial F1–F2 space

To explain the occurrence of the symmetrical and asymmetrical compensatory patterns, we investigated the influence of participants’ initial F1–F2 space on their compensatory performance.

The mean F1 and F2 frequencies produced by all participants during the baseline phase are summarized in Figure 5 split by participants’ gender.

There were no statistically significant within-speaker differences in F1 between the vowels of the four syllables. In female participants with the asymmetrical compensatory pattern (7 participants), the average F1 difference was 19.27 Hz between /dɨ/ and /di/ (t = 1.96, p > .05, 95 % CI [93.78 32.63]), 4.17 Hz between /gɨ/ and /dɨ/ (t = 1.87, p > .05, 95 % CI [-0.13 7.3]), and 3.73 Hz between /gu/ and /gɨ/ (t = 0.44, p > .05, 95 % CI [-11.72 19.18]).

In female participants with the symmetrical compensatory pattern (4 participants), the average F1 frequencies were lower for every syllable. However, none of these differences was significant (/di/: -2.53 Hz, t = -0.16, p > .05, 95 % CI [-31.47 26.42]; /dɨ/: -3.56 Hz, t = -0.17, p > .05, 95 % CI [-41.2 34.08]; /gɨ/: -2.89 Hz, t = -0.14, p > .05, 95 % CI [-41.31 35.54]; /gu/: -16.49 Hz, t = -1.17, p > .05, 95 % CI [-42.21 9.22]). In female participants who reacted inconsistently to the opposite perturbations (3 participants), the average F1 frequencies were also lower for every syllable compared to the female participants with the asymmetrical compensation pattern. Again, none of these differences was significant (/di/: -9.8 Hz, t = -0.5, p > .05, 95 % CI [-45.61 26.0]; /dɨ/: -23.22 Hz, t = -0.91, p > .05, 95 % CI [-69.78 23.34]; /gɨ/: -23.43 Hz, t = -0.9, p > .05, 95 % CI [-70.97 24.11]; /gu/: -30.53 Hz, t = -1.75, p > .05, 95 % CI [-62.34 1.28]).

The F2 model indicated significant within-speaker differences between the F2 values of the investigated vowels. In female participants with the asymmetrical compensatory pattern (7 participants), the average F2 difference was -302.96 Hz between /dɨ/ and /di/ (t = -6.3, p < .05, 95 % CI [-390.76 -215.13]), -169.52 Hz between /gɨ/ and /dɨ/ (t = -5.26, p < .05, 95 % CI [-228.29 -110.76]), and -1298.46 Hz between /gu/ and /gɨ/ (t = -19.08, p < .05, 95 % CI [-1422.68 -1174.34]).


In female participants with the symmetrical compensatory pattern (4 participants), the average F2 frequencies were lower for every syllable except /di/. However, none of these differences was significant (/di/: 33.61 Hz, t = 0.47, p > .05, 95 % CI [-95.68 162.89]; /dɨ/: -6.53 Hz, t = -0.07, p > .05, 95 % CI [-188.9 175.8]; /gɨ/: -15.74 Hz, t = -0.16, p > .05, 95 % CI [-194.0 162.54]; /gu/: -5.24 Hz, t = -0.18, p > .05, 95 % CI [-57.98 47.51]). In female participants who reacted inconsistently to the opposite perturbations (3 participants), the average F2 frequencies were lower for every syllable except /dɨ/ and /gu/ compared to the female participants with the asymmetrical compensation pattern; only the difference for the syllable /gu/ was significant (83.23 Hz, t = 2.33, p < .05, 95 % CI [18.0 148.48]). The remaining three differences were not significant (/di/: -11.31 Hz, t = -0.13, p > .05, 95 % CI [-171.23 148.63]; /dɨ/: 132.12 Hz, t = 1.07, p > .05, 95 % CI [-93.57 357.61]; /gɨ/: -107.42 Hz, t = -0.89, p > .05, 95 % CI [-327.84 113.42]).

The F1 and F2 frequencies produced by male participants were on average lower for every syllable than those produced by female participants; however, these differences were significant only for the F2 values of the syllable /di/ (-270.41 Hz, t = -3.58, p < .05, 95 % CI [-408.26 -132.55]).

Figure 5: The average F1–F2 vowel space produced by all participants during the baseline phase (no perturbation) for the four syllables /di/, /dɨ/, /gɨ/, and /gu/. The data is split by participants’ gender.

Overall, the observed F1–F2 space of the vowels /i/, /ɨ/, and /u/ was consistent with previous descriptive studies of Russian vowels (Lobanov, 1971; Bolla, 1981). As expected, there was no statistically significant or perceivable difference in F1 between the investigated vowels (previous research on formant perception indicates that, on average, listeners do not perceive F1 differences below 50 Hz; Oglesbee & Kewley-Port, 2009). The vowels were differentiated most prominently by F2, with /i/ having the highest and /u/ the lowest values; the F2 of /ɨ/ lay between the other two vowel categories. F2 was higher in /dɨ/ than in /gɨ/, most likely due to coarticulation. Furthermore, the initial F1–F2 vowel space did not significantly differ between the three participant groups which exhibited different compensatory patterns during the perturbation phases of the experiment.

4. Discussion

In the current investigation we presented results from a bidirectional auditory perturbation experiment conducted with native speakers of Russian. During three perturbation phases of the experiment, participants had to produce the close central unrounded vowel /ɨ/ while its F2 frequency was perturbed in opposing directions depending on the preceding consonant (/d/ or /g/). The bidirectional perturbation was intended to increase the demands associated with the experimental task since participants had to coordinate their corrective movements in two different ways depending on the perturbation direction to produce the target vowel /ɨ/. Based on the recurrent observation that participants counteract the applied auditory perturbation, we expected that the baseline F2 values for the two syllables /dɨ/ and /gɨ/ would diverge over the course of the three perturbation phases since the magnitude of the perturbation increased in opposing directions from one perturbation phase to another. The two consonantal contexts (alveolar vs. velar) were chosen to evaluate the potential influence of physical restrictions on the success of the adaptation outcome.

The average adaptation behavior observed during the study confirmed our main hypothesis. The GAMM model estimated for the normalized F2 frequency suggested that participants were able to adapt simultaneously to two opposing F2 perturbations and employ different strategies to produce the vowel /ɨ/ depending on the direction of the applied perturbation. These results are qualitatively in line with previous articulatory and auditory perturbation studies which show that most speakers are able to remap their initial articulatory-to-acoustics mapping under aggravated speech conditions (e.g., Gay et al., 1981; Savariaux et al., 1995; Feng et al., 2011).

Furthermore, our results are consistent with findings by Rochet-Capellan & Ostry (2011) who demonstrated that participants are able to simultaneously develop multiple strategies to produce the same target vowel. Adding to these results, our data shows that the results obtained by Rochet-Capellan & Ostry (2011) in the context of F1 are generalizable to F2.

The compensatory effects observed in F2 frequencies for both perturbation directions were absent in our F1 data. This result serves as evidence for the validity of the applied experimental manipulations.

The application of generalized additive mixed modelling (GAMM) allowed us to investigate the evolution of the adaptation process over time. Particularly, we were able to observe participant-specific differences in the spatial and temporal dimensions of the compensatory changes and to understand individual differences in how participants were generally able to cope with the demands of the experimental task.

While one group of participants was almost immediately able to compensate for bidirectional formant perturbations during the first perturbation phase, other participants needed longer periods of time to do so and started to compensate only in the second or the third perturbation phase. Also, a few participants failed to identify the appropriate compensatory adjustments altogether. Since these speakers also tended to change the initial direction of their compensatory movements throughout the experiment, their behavior can be best described as exploratory. In several instances, the directional changes of the compensatory movements were quite abrupt as revealed by the non-linearities of the modelled F2 curves.

Although the required corrective tongue movements lay along the same movement axis (forward vs. backward), participants had to figure out two quite different strategies to produce the vowel /ɨ/: they first had to identify the direction of the applied frequency shift for each experimental syllable and then adjust their tongue movements appropriately.


It appears plausible that correctly identifying the direction of the F2 perturbation was not a trivial task, since different participants needed different amounts of time before they started to compensate consistently for the applied perturbations, if they did so at all. Most convincing in this regard is the observation that some participants initially followed the perturbation but started to counteract it after a while. These observations suggest that all participants without exception were indeed perceiving the auditory errors caused by the perturbations; however, not all of them were able to figure out the appropriate articulatory adjustments to minimize those errors. One potential reason for this might be an inability to identify the correct perturbation direction.

This hypothesis can also explain the observation that in typical (unidirectional) auditory perturbation studies, besides a group of participants who counteract the perturbation, there are usually a few participants who appear to follow it. Our data suggest that both reactions to auditory perturbations (counteracting and following) can be understood in more general terms as exploratory compensatory behavior, whereby participants wander through the formant space in order to find the appropriate corrective movements to produce the intended acoustic output. In line with this idea is the observation from the current study that no participant actually followed the applied perturbations in both directions (upward and downward).

The observed temporal non-linearities and abrupt directional changes of the compensatory responses challenge the idea that speakers exhibit either auditory or somatosensory feedback preference during speech production (Lametti et al., 2012). Strictly following the idea of feedback preference, speakers with auditory feedback preference should have always reacted consistently to the applied auditory perturbations, i.e., independently of the perturbation direction. At the same time, we should expect that a subset of speakers with a preference for somatosensory feedback should virtually ignore the auditory perturbations. However, the examination of individual adaptation patterns revealed that both assumptions do not appear to be true.

First, among the 18 participants there were no speakers who ignored the applied perturbations, which might have suggested that they dis-prefer the auditory feedback channel during speech production. Second, and more importantly, for several participants the direction and the magnitude of the compensation were not identical for the two perturbation directions, as they should be under the assumption that a speaker exhibits a permanent preference for the auditory feedback channel. Furthermore, several participants were able to acquire the two appropriate compensatory strategies after some practice. That is, the ability to develop a consistent compensatory strategy does not seem to depend on speakers’ preference for auditory or somatosensory feedback. The observation that the compensation magnitude was not identical for the two opposite perturbation directions across all participants deserves further attention. The estimated GAMM model suggested that although 17 participants compensated for the upward perturbation, only five of them did so simultaneously for the downward perturbation. (Additionally, a single participant significantly shifted her F2 frequency upwards independently of the applied perturbation direction.) We dubbed these two compensatory profiles the asymmetrical and symmetrical compensation patterns.

One potential explanation for the observation that far fewer participants were able to compensate for the downward perturbation is the articulatory effort associated with the required forward compensatory movement of the tongue. Whereas the backward movement of the tongue from the central position of /ɨ/ is physically less restricted, the extent of the forward movement is limited by the alveolar ridge and the upper incisors. Since we do not have articulatory data on participants’ palatal shapes, this possibility cannot be ruled out completely. However, there is some evidence which undermines this hypothesis.

Taking the idea of physical restrictions further, we have to assume that the forward movement of the tongue in /dɨ/ should be even more restricted than in /gɨ/, as the tongue already has a more advanced position in /dɨ/. This positional difference between /dɨ/ and /gɨ/ is supported by the results on participants’ initial F1–F2 formant space presented in section 3.3. Based on this fact, we should expect that the participants who were able to compensate for the downward perturbation did so preferentially for the syllable /gɨ/. However, of the six participants who significantly upshifted their F2 during the perturbation phases, three did so for /dɨ/ and three for /gɨ/. That means there was no advantage for the syllable /gɨ/, as there ought to be if the distance from the produced vowel to the physical limit (i.e., the alveolar ridge/upper incisors) played a crucial role in the success of the forward compensatory tongue movement. This interpretation is further supported by the GAMM modelling, as the inclusion of the interaction between the perturbation direction and the syllable did not improve the overall fit.
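Judgments that an added term ‘did not improve the overall fit’ are conventionally based on an information criterion such as AIC (Akaike, 1974). The text does not state the exact criterion used, so the following Gaussian least-squares sketch is purely illustrative:

```python
from math import log

def aic(n, rss, k):
    """AIC for a Gaussian model fitted by least squares:
    n * log(RSS / n) + 2 * k (Akaike, 1974). Lower is better."""
    return n * log(rss / n) + 2 * k

# Two extra parameters with no reduction in residual error: the richer
# model is penalized, i.e. the interaction does not improve the fit.
print(aic(100, 50.0, 3) < aic(100, 50.0, 5))  # True
```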

The idea of an influential role of physical constraints on the compensatory movement can be restated in the more abstract terms of somatosensory categories. In these terms, the upper limit of the compensatory movement is no longer assumed to be a physical boundary (i.e., the alveolar ridge) but rather a somatosensory category boundary of the neighboring speech sound. In the particular case discussed above, both explanations can be used interchangeably without inducing different interpretations of the results. However, reformulating this hypothesis in somatosensory terms allows us to evaluate the potential influence of somatosensory sound categories on the compensation magnitude in the case of the backward tongue movements, since these were physically less restricted in both /dɨ/ and /gɨ/.

At first glance, it is conceivable that the greater distance from /ɨ/ to the somatosensory boundary of /u/, compared to the distance between /ɨ/ and /i/, facilitated compensation for the upward perturbation. However, as pointed out by Katseff et al. (2012), if we assume that speech sounds are also defined in sensorimotor space, the compensation magnitude should be restricted not only by a neighboring speech sound but foremost by the size of the sensorimotor region of the perturbed category. That means that when speakers deviate too much from the sensorimotor region of the perturbed category, the magnitude of the compensation should decrease. This hypothesis is, however, substantially challenged by the experimental data.

Comparing the compensation magnitudes for the upward perturbation between /dɨ/ and /gɨ/ reveals that participants compensated in both syllables to comparable degrees, despite the somatosensory distance between /dɨ/ and /gu/ being greater than that between /gɨ/ and /gu/. The absence of a difference in the compensation magnitude between /dɨ/ and /gɨ/ was supported by the GAMM modelling, as the inclusion of the interaction between the perturbation direction and the syllable did not improve the overall fit. Furthermore, the examination of individual adaptation patterns revealed that a subset of speakers was able to compensate for 100 % of the applied perturbation even when it reached 520 Hz and thereby induced a high degree of somatosensory error. Speaking to the same issue, for some participants who changed their F2 frequency in the same direction as the applied perturbation, the mismatch between the auditory and somatosensory error grew ever larger over the course of the experiment.

Taking all this evidence into account, the emergence of the asymmetric compensatory pattern is difficult to explain in terms of the violation of somatosensory boundaries. Consistent with this idea is the fact that there were no systematic differences in participants’ initial F1–F2 formant spaces which could predict their different compensatory profiles (symmetrical, asymmetrical, and non-consistent).

Without resorting to somatosensory boundaries, we can think of one alternative explanation for the emergence of the asymmetric compensatory pattern. Central to this hypothesis is the idea that the asymmetric pattern emerged due to an asymmetry in the phonemic space of the Russian high vowels. In particular, while /i/ appears only after palatalized consonants in Russian, both /ɨ/ and /u/ follow only non-palatalized ones (cf. Bolla, 1981). The palatalization contrast is an important part of Russian phonology and is highly salient to Russian speakers. The most common acoustic feature associated with palatalized consonants is a high F2 frequency at the beginning of the following vowel. This acoustic feature is so important for the perception of palatalization by Russian speakers that even cross-spliced syllables combining non-palatalized consonants with vowels with a high initial F2 frequency are perceived as palatalized (cf. Bondarko, 2005). We think that this perceptual effect might have occurred during our experiment.

Since the baseline F2 values of /i/ and /ɨ/ are on average substantially closer to each other than those of /ɨ/ and /u/, it seems reasonable that most participants classified instances of /ɨ/ shifted towards /i/ as phonemic errors of palatalization and corrected for them by lowering their F2. On the other hand, only a few participants reacted to the F2 perturbation of /ɨ/ towards /u/, as it did not change the palatalization status of the perceived syllable. Presumably, those participants who reacted by the same amount to the downward perturbation as to the upward perturbation were more sensitive to general F2 changes, independent of the phonemic status of the perceived syllable. Unfortunately, we do not have participants’ perceptual profiles, which could settle this question completely.

5. Conclusion

Despite a growing body of research, the factors which induce the inter-individual outcome variability during sensorimotor learning are still much debated. Our investigation has shown that there is merit in varying task parameters within the same experimental session and in analyzing the data of perturbation experiments with the temporal dimension of the adaptation process taken into account. By doing so, we could show that the inter- and intra-participant variability present during sensorimotor learning in speech exceeds the predictions of hypotheses which ascribe this variability exclusively to the characteristics of speakers’ internal models of speech motor control.


Acknowledgments

We gratefully acknowledge support from DFG grant 220199 to JB. We are also grateful to two anonymous reviewers for their useful suggestions. We thank Felix Golcher for his advice on the statistical modelling of the data. We thank Miriam Oschkinat for her support during data acquisition and Yulia Guseva for her help with the preparation of the manuscript. We also thank all participants who took part in the study.


References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.

Baayen, R.H., van Rij, J., de Cat, C., & Wood, S.N. (2016). Autocorrelated errors in experimental data in the language sciences: Some solutions offered by generalized additive mixed models. arXiv preprint arXiv:1601.02043.

Baayen, R.H., Vasishth, S., Kliegl, R., & Bates, D. (2017). The cave of shadows: Addressing the human factor with generalized additive mixed models. Journal of Memory and Language, 94, 206–234.


Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.

Baum, S.R., & McFarland, D.H. (1997). The development of speech adaptation to an artificial palate. The Journal of the Acoustical Society of America, 102(4), 2353–2359.

Bolla, K. (1981). A Conspectus of Russian Speech Sounds. Budapest: Hungarian Academy of Science.

Bondarko, L. V. (2005). Phonetic and phonological aspects of the opposition of ‘soft’ and ‘hard’ consonants in the modern Russian language. Speech Communication, 47(1), 7–14.

Brunner, J., Ghosh, S., Hoole, P., Matthies, M., Tiede, M., & Perkell, J.S. (2011). The influence of auditory acuity on acoustic variability and the use of motor equivalence during adaptation to a perturbation. Journal of Speech, Language, and Hearing Research, 54(3), 727–739.

Brunner, J., Hoole, P., & Perrier, P. (2011). Adaptation strategies in perturbed /s/. Clinical Linguistics & Phonetics, 25(8), 705–724.

Cai, S., Boucek, M., Ghosh, S.S., Guenther, F.H., & Perkell, J.S. (2008). A system for online dynamic perturbation of formant trajectories and results from perturbations of the Mandarin triphthong /iau/. In Sock, R., Fuchs, S., & Laprie, Y. (Eds.), Proceedings of the 8th International Seminar on Speech Production 2008, Strasbourg, France, 65–68.

Feng, Y., Gracco, V.L., & Max, L. (2011). Integration of auditory and somatosensory error signals in the neural control of speech movements. Journal of Neurophysiology, 106(2), 667–679.

Gay, T., Lindblom, B., & Lubker, J. (1981). Production of bite-block vowels: Acoustic equivalence by selective compensation. The Journal of the Acoustical Society of America, 69(3), 802–810.

Ghosh, S.S., Matthies, M.L., Maas, E., Hanson, A., Tiede, M., Ménard, L., Guenther, F.H., Lane, H., & Perkell, J.S. (2010). An investigation of the relation between sibilant production and somatosensory and auditory acuity. The Journal of the Acoustical Society of America, 128(5), 3079–3087.

Hastie, T., & Tibshirani, R. (1987). Generalized additive models: some applications. Journal of the American Statistical Association, 82(398), 371–386.


Houde, J.F., & Jordan, M.I. (1998). Sensorimotor adaptation in speech production. Science, 279(5354), 1213–1216.

Jones, J.A., & Munhall, K.G. (2000). Perceptual calibration of F0 production: Evidence from feedback perturbation. The Journal of the Acoustical Society of America, 108(3), 1246–1251.

Jones, J.A., & Munhall, K.G. (2003). Learning to produce speech with an altered vocal tract: The role of auditory feedback. The Journal of the Acoustical Society of America, 113(1), 532–543.

Katseff, S., Houde, J., & Johnson, K. (2012). Partial compensation for altered auditory feedback: A tradeoff with somatosensory feedback? Language and Speech, 55(2), 295–308.

Kuznetsova, A., Brockhoff, P.B., & Bojesen-Christensen, R.H. (2016). lmerTest: Tests in linear mixed effects models. R package version 2.0-30.

Lametti, D.R., Nasir, S.M., & Ostry, D.J. (2012). Sensory preference in speech production revealed by simultaneous alteration of auditory and somatosensory feedback. Journal of Neuroscience, 32(27), 9351–9358.

Lobanov, B.M. (1971). Classification of Russian vowels spoken by different speakers. The Journal of the Acoustical Society of America, 49(2B), 606–608.

MacDonald, E.N., Goldberg, R., & Munhall, K.G. (2010). Compensations in response to real-time formant perturbations of different magnitudes. The Journal of the Acoustical Society of America, 127(2), 1059–1068.

Mitsuya, T., Munhall, K.G., & Purcell, D.W. (2017). Modulation of auditory-motor learning in response to formant perturbation as a function of delayed auditory feedback. The Journal of the Acoustical Society of America, 141(4), 2758–2767.

Munhall, K.G., MacDonald, E.N., Byrne, S.K., & Johnsrude, I. (2009). Talkers alter vowel production in response to real-time formant perturbation even when instructed not to compensate. The Journal of the Acoustical Society of America, 125(1), 384–390.

Oglesbee, E., & Kewley-Port, D. (2009). Estimating vowel formant discrimination thresholds using a single-interval classification task. The Journal of the Acoustical Society of America, 125(4), 2323–2335.

Purcell, D.W., & Munhall, K.G. (2006). Adaptive control of vowel formant frequency: Evidence from real-time formant manipulation. The Journal of the Acoustical Society of America, 120(2), 966–977.


Perkell, J.S., Guenther, F.H., Lane, H., Matthies, M.L., Stockmann, E., Tiede, M., & Zandipour, M. (2004). The distinctness of speakers’ productions of vowel contrasts is related to their discrimination of the contrasts. The Journal of the Acoustical Society of America, 116(4), 2338–2344.

R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Retrieved January 5, 2017 from

van Rij, J., Wieling, M., Baayen, R.H., & van Rijn, H. (2017). itsadug: Interpreting time series and autocorrelated data using GAMMs. R package version 2.3.

Rochet-Capellan, A., & Ostry, D.J. (2011). Simultaneous acquisition of multiple auditory–motor transformations in speech. Journal of Neuroscience, 31(7), 2657–2662.

Sato, M., Schwartz, J.L., & Perrier, P. (2014). Phonemic auditory and somatosensory goals in speech production. Language, Cognition and Neuroscience, 29(1), 41–43.

Savariaux, C., Perrier, P., & Orliaguet, J.P. (1995). Compensation strategies for the perturbation of the rounded vowel [u] using a lip tube: A study of the control space in speech production. The Journal of the Acoustical Society of America, 98(5), 2428–2442.

Sóskuthy, M. (2017). Generalised additive mixed models for dynamic analysis in linguistics: a practical introduction. arXiv preprint arXiv:1703.05339.

Trudeau-Fisette, P., Tiede, M., & Ménard, L. (2017). Compensations to auditory feedback perturbations in congenitally blind and sighted speakers: Acoustic and articulatory data. PLoS ONE, 12(7), e0180300.

Villacorta, V.M., Perkell, J.S., & Guenther, F.H. (2007). Sensorimotor adaptation to feedback perturbations of vowel acoustics and its relation to perception. The Journal of the Acoustical Society of America, 122(4), 2306–2319.

Wood, S.N. (2006). Low-rank scale-invariant tensor product smooths for Generalized Additive Mixed Models. Biometrics, 62(4), 1025–1036.

Wood, S.N. (2017a). Generalized additive models: An introduction with R. Chapman & Hall/CRC Texts in Statistical Science.

Wood, S.N. (2017b). mgcv: Mixed GAM computation vehicle with GCV/AIC/REML smoothness estimation. R package version 1.8-19.
