Open access

Acoustics of the Vowel

Preliminaries

Dieter Maurer

It seems as if the fundamentals of how we produce vowels and how they are acoustically represented have been clarified: we phonate and articulate. Using our vocal cords, we produce a vocal sound or noise which is then shaped into a specific vowel sound by the resonances of the pharyngeal, oral, and nasal cavities, that is, the vocal tract. Accordingly, the acoustic description of vowels relates to vowel-specific patterns of relative energy maxima in the sound spectra, known as patterns of formants.
The intellectual and empirical reasoning presented in this treatise, however, gives rise to scepticism with respect to this understanding of the sound of the vowel. The reflections and materials presented provide reason to argue that, up to now, a comprehensible theory of the acoustics of the voice and of voiced speech sounds is lacking, and consequently, no satisfying understanding of vowels as an achievement and particular formal accomplishment of the voice exists. Thus, the question of the acoustics of the vowel – and with it the question of the acoustics of the voice itself – proves to be an unresolved fundamental problem.

Materials Part III


The third part of the Materials section presents exemplary series of vowel sounds and related acoustic analyses linked to the third part of the main text, including further indications on previously published data.

Note on the Method

Empirical basis

As mentioned in the introduction, the empirical basis of this treatise—and the basis of the series of vowel sounds selected for presentation here—consists of recordings from various areas of everyday life, the entertainment sector and art, that is, stage voices in music and straight theatre. (For an additional investigation of sounds of birds imitating human utterances, see Section M10.A.)

The recordings were collected over a time period of more than 20 years with different techniques related to different sound qualities, and they represent utterances of speakers different in age and gender, producing vowel sounds in different contexts, with different durations and different vocal efforts. However, such variation is not a shortcoming but an intention here, since this treatise focuses on the psychophysical question of the vowel (see the introduction and Section 13.7): given that different vowel sounds are perceived as being related to a single vowel quality—in contrast to the variation of other vocal sound characteristics—, which describable physical characteristic or which ensemble of physical characteristics may be said to represent that quality?

Concerning the acoustic characteristics of vowel sounds, the sound examples presented here were produced in isolation or in word context by native German or Swiss-German speakers, with a few exceptions, and the vowel qualities correspond to Standard German. Because of the psychophysical perspective adopted here, and because of the large fundamental frequency range considered—including many high-pitched vowel sounds produced in isolation or in the context of high-pitched speech by untrained children, women and men as well as by professional actresses and actors—, no principal difference is made between speaking and singing for isolated vowel sounds or extracted vowel nuclei and no corresponding indication is given in the figures which would relate to a classificatory system of modes of vowel production.—Acoustic analysis as well as perceptual identification relates to sounds produced in isolation or extracted as vowel nuclei from words.

Concerning the acoustic characteristics of pitch contours, the examples presented here (see Section 8.2) only concern contours of speech. Thereby, they relate to utterances of speakers of different languages (see the corresponding figure legends).

Whereas one part of these recordings forms the basis of single, published investigations undertaken in the past, which included listening tests, another part is unpublished and the corresponding recordings have not been subject to any further identification tests, apart from the identification by the author: in the course of creating this publication, for each of the sound series of a single figure presented in the Materials section, the author has evaluated the perceptual vowel quality of each sound separately. Moreover, only sounds are presented for which the intended and the perceived vowel quality correspond.

Acoustic analysis

With regard to the acoustic analysis of the sounds in general and to the calculation of fundamental and formant frequencies in particular, automatically calculated values using routines from the PRAAT Software (Boersma & Weenink, 2015) related to corresponding standard parameters are given in the figures of Chapters 7 to 10.

Acoustic analysis was conducted on isolated vowel sounds or on extracted vowel nuclei and concerned F0, spectrum, formant frequencies and LPC curve. (Note that the digital version of the Materials further includes pitch contour, spectrogram, formant tracks and comparison of three formant patterns and three LPC curves related to the three standard parameter settings for children, women and men.)

For longer vowel sounds, a middle sound fragment of 0.3 s, and for shorter sounds, a middle vowel nucleus excluding onset and offset was analysed.
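The fragment selection just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure used for the Materials: the function name `middle_fragment` and the onset/offset margin `margin_s` are hypothetical, since the text does not specify how onset and offset were delimited for shorter sounds.

```python
def middle_fragment(duration_s, fragment_s=0.3, margin_s=0.02):
    """Return (start, end) in seconds of the portion of a sound to analyse.

    fragment_s = 0.3 follows the text; margin_s is a hypothetical
    onset/offset margin applied only to sounds too short for a 0.3 s cut.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    mid = duration_s / 2
    if duration_s >= fragment_s + 2 * margin_s:
        # Longer sound: take the middle 0.3 s.
        return (mid - fragment_s / 2, mid + fragment_s / 2)
    # Shorter sound: keep the nucleus, trimming a small margin at each end
    # (capped at a quarter of the duration so the nucleus never vanishes).
    trim = min(margin_s, duration_s / 4)
    return (trim, duration_s - trim)
```

For a sound of 1 s, this yields the interval from roughly 0.35 s to 0.65 s; for a sound of 0.2 s, roughly 0.02 s to 0.18 s under the assumed margin.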

The fundamental frequency of a sound fragment was calculated as an average value using the Praat command To Pitch. Calculated values were perceptually crosschecked. If calculation errors occurred, the parameters “pitch floor” and “pitch ceiling” were adjusted.

The spectrum of a sound fragment was calculated as an average spectrum for 0–5.5 kHz.

The formant frequencies of a sound fragment were automatically calculated as average values of LPC analysis using the Praat command To Formant (robust), with standard parameters according to the age and/or gender of the speaker and for a frequency range of 0–5.5 kHz. For illustration purposes, an LPC curve was calculated related to the analysis window in the middle of the sound fragment analysed.

Please note:

     Spectrum and numerical formant frequencies are calculated as averaged for the entire sound fragment analysed, but the LPC curve is related to a single window in the middle of the fragment. As a consequence, for a few sounds, the LPC filter curve does not correspond to the vowel spectrum and the numerical formant pattern.

     Because of automatic calculation and averaged values, calculated F1 for sounds of / i, y, u / at middle and high fundamental frequencies is sometimes given as slightly below F0. In these cases, F1 can be estimated as roughly matching F0.

     A few of the calculated frequencies of the formants considered deviate so strongly from the sound spectrum and its amplitude minima and maxima that they are set in parentheses or have been replaced by a rough estimation related to the spectrum. Exceptions are the sounds produced by birds for which the automatically calculated formant frequencies are given without consideration of their validity.
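The second note, on calculated F1 falling below F0 for sounds of / i, y, u /, amounts to a simple rule of thumb, sketched here. The function name `adjusted_f1` and the vowel labels are assumptions made for illustration; the text speaks of F1 being "slightly" below F0 and gives no numerical tolerance, so none is applied.

```python
# Vowels for which the note applies (labels are illustrative).
CLOSE_VOWELS = {"i", "y", "u"}

def adjusted_f1(vowel, f0_hz, f1_calc_hz):
    """Return a rough F1 estimate following the note on / i, y, u /.

    If the automatically calculated F1 lies below F0 for one of these
    vowels, F1 is estimated as roughly matching F0; otherwise the
    calculated value is returned unchanged.
    """
    if vowel in CLOSE_VOWELS and f1_calc_hz < f0_hz:
        return f0_hz
    return f1_calc_hz
```

The rule deliberately leaves other vowels untouched, since the note restricts the estimate to sounds of / i, y, u / at middle and high fundamental frequencies.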

For longer recordings of speech (see Section M8.2), only the pitch contour was analysed and perceptually crosschecked. If major calculation errors occurred, the parameters “pitch floor” and “pitch ceiling” were again adjusted.

Illustrations

Each figure includes a series of vowel sounds (represented as vowel spectra) or examples of speech (represented as pitch contours). The subject matter of illustration is explained in the text and indicated in short form in the figure legend.

A vowel spectrum is given as the sound pressure level (SPL) in dB/Hz (y-coordinate) for a frequency range of 0–5500 Hz (x-coordinate). If, in the text, a vowel spectrum is considered in relation to calculated formants and/or to an LPC curve, this curve is also shown; if not, only the spectrum is presented. Below a spectrum, the following indications are given in the first line: figure number and number of the spectrum in the figure, vowel quality, fundamental frequency (F0), identification number of the speaker, gender of the speaker (w=woman/female, m=man/male), age group of the speaker (C=children, A=adults; note B=birds) and record number (R) of the recording in the database. For some figures, depending on the context of consideration, selected formant frequencies are indicated in addition in the second line.

Since the single vowel spectra relate to single vowel sounds, the vowel quality is given in square brackets. Note that in the figures, the vowel quality of /a–α / is represented by the character “a” with no further differentiation.

Pitch contours of speech are given as the pitch frequency in Hz (y-coordinate) over a time range in s (x-coordinate). Below a pitch contour, the following indications are given in the first line: figure number and number of the contour in the figure, [speech] as the mode of vocal expression and the content of recording, identification number of the speaker, gender and age group of the speaker and record number (R) of the recording in the database. In the second line, the overall F0 range for all contours of a speaker presented in a figure is given.

Note that the order of sound presentation in relation to vowel qualities and to F0 is not uniform throughout the entire Materials section; for each single section, this order follows the subject matter illustrated and the choice of the author.

Digital version of the Materials

More details on the method and, as mentioned, an extended documentation of the results of acoustic analysis are provided in the digital version of the Materials at:

http://www.phones-and-phonemes.org/vowels/acoustics/preliminaries

M7     Unsystematic Correspondence between Vowels, Patterns of Relative Spectral Energy Maxima and Formant Patterns

M7.1      Inconstant Number of Vowel-Specific Relative Spectral Energy Maxima and Incongruence of Vowel-Specific Formant Patterns

Figures 1 to 3 show examples of sounds of the back vowels /u, o / and of /a–α / exhibiting only one relative spectral energy maximum within their vowel-specific frequency range ≤ c. 1.5 kHz. Each series corresponds to sounds produced by speakers of one speaker group (children, women, men). Note that for the sounds of /a–α /, a dominant first harmonic is ignored here when interpreting relative spectral energy maxima. Note also that the examples 1, 3 and 4 in Figure 1 perceptually represent /ɔ / rather than /a–α /.

For each of the speaker groups and each of the three vowels in question, Figures 4 to 6 show three examples exhibiting two relative spectral energy maxima within their vowel-specific frequency range ≤ c. 1.5 kHz, as is usually assumed to be the “normal” case for sounds of these vowels.

Note that the spectra of the sounds of /u, o / shown in Figures 1 to 3 cannot be interpreted as a general manifestation of “formant merging”: if these spectra are compared with the spectra of the corresponding vowel sounds shown in Figures 4 to 6, the lowest spectral envelope peaks occur at similar frequency levels, given similar F0. Thus, the first spectral envelope peak of all sounds corresponds to the vowel quality in question, whereas the second spectral envelope peak for the sounds shown in Figures 4 to 6 may be related to an additional sound “colouring” that, however, does not possess vowel-differentiating value. Figure 7 illustrates this phenomenon by direct comparison of selected sounds of /u, o / in Figures 1 to 3 with selected sounds of /u, o / in Figures 4 to 6.

Figures 8 and 9 show examples of sound pairs of the vowels / i / and /e /, each pair produced by speakers of one speaker group, for which differences in F0 and F1 are small but differences in the higher vowel-related spectral parts are substantial, up to F2 of the second sound matching or exceeding F3 of the first. Figure 10 shows more sound pairs of this kind but, in this case, comparing sounds of children and men, in order to document the phenomenon in its most extreme form.

For earlier accounts, see Maurer, Landis, and d’Heureuse (1991), Maurer and Landis (1995).

M7.2      Partial Lack of Manifestation of Vowel-Specific Relative Spectral Energy Maxima

Figures 11 and 12 show examples of sounds of the vowels /a–α / and of /o / with “flat” or “sloping” spectral portions in their vowel-specific frequency range < c. 1.5 kHz which are lacking a clearly determinable peak. Note that the perceived vowel quality of some sounds intentionally produced as /a–α / lies in between /α / and /ɔ /, and of some sounds intentionally produced as /o / in between /o / and /ɔ /. Note also that for the sounds of /a–α /, a dominant first harmonic is again ignored here when interpreting relative spectral energy maxima. (For cases of “sloping” lower spectral portions in sounds of /u /, see Section M7.1, Figures 1 to 3.)

Figures 13 and 14 show corresponding observations for sounds of the front vowels / i, e / with “flat” higher spectral portions in their upper vowel-specific frequency range of 1.5–5 kHz which are lacking a clearly determinable pattern of vowel-related peaks.

M8     Lack of Correspondence between Vowels and Patterns of Relative Spectral Energy Maxima or Formant Patterns

M8.1      Dependence of Vowel-Specific, Relative Spectral Energy Maxima and Lower Formants ≤ 1.5 kHz on Fundamental Frequency

Figure 1 shows examples of sounds of the vowels /o, ø, e / produced at different F0 by a woman (/o /), a man (/ø /) and a child (/e /; age 8). In the frequency range of F0 of c. 200–400 Hz, the second partial is generally dominant, thus indicating a shift of the lowest spectral peak with rising F0, which is also indicated by the corresponding calculated F1. In more detail: For the sound series of the vowel /o /, the shift in F0 is 170–400 Hz, the frequency shift of the dominant second harmonic is 340–800 Hz and the shift of calculated F1 is c. 380–800 Hz. (Note that for the sound at F0 = 400 Hz, the first calculated formant value at 560 Hz is ignored here because it is associated with a bandwidth of 928 Hz and, as a consequence, the LPC filter curve does not show a corresponding peak.) For the sound series of the vowel /ø /, the shift in F0 is c. 110–360 Hz, the frequency shift of the dominant harmonic (third harmonic up to F0 = 167 Hz, then second harmonic) is c. 330–720 Hz and the shift of calculated F1 is c. 350–710 Hz. For the sound series of the vowel /e /, the shift in F0 is c. 210–360 Hz, the frequency shift of the dominant second harmonic is c. 420–720 Hz (dominance is weak but constant) and the shift of calculated F1 is c. 420–720 Hz.
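The harmonic frequencies quoted for Figure 1 follow directly from the relation that the n-th harmonic lies at n times F0. The short check below reproduces the endpoints of the three ranges from the F0 values and harmonic numbers given in the text; the function and variable names are illustrative only.

```python
def harmonic_hz(f0_hz, n):
    """Frequency of the n-th harmonic (partial) of a sound with fundamental f0_hz."""
    return n * f0_hz

# (F0 in Hz, number of the dominant harmonic) at the lower and upper end
# of each sound series, as stated in the text for Figure 1.
series = {
    "o": ((170, 2), (400, 2)),
    "ø": ((110, 3), (360, 2)),
    "e": ((210, 2), (360, 2)),
}

for vowel, (low, high) in series.items():
    lo_hz = harmonic_hz(*low)
    hi_hz = harmonic_hz(*high)
    print(f"/{vowel}/: dominant harmonic {lo_hz}-{hi_hz} Hz")

# Prints:
# /o/: dominant harmonic 340-800 Hz
# /ø/: dominant harmonic 330-720 Hz
# /e/: dominant harmonic 420-720 Hz
```

The calculated F1 ranges reported for the three series lie close to these harmonic frequencies, which is the sense in which calculated F1 follows the dominant harmonic as F0 rises.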

Figure 2 shows examples of sounds of the vowels /u, y, i / produced at different F0 by a woman (/u /), a child (/y/; age 13, transition to adolescence) and a woman (/ i /). For all sounds, the first partial is generally dominant, thus indicating a shift of the lowest spectral peak with rising F0, which is also indicated by the corresponding calculated F1. (Note that for higher levels of F0, the calculation of F1 is methodologically unsubstantiated; however, the calculated values correspond to the dominant first harmonics.) In more detail: For the sound series of the vowel /u /, the shift in F0 is c. 220–870 Hz, as is true for the frequency shift of the first dominant harmonic, and the shift of calculated F1 is c. 230–870 Hz. For the sound series of the vowel /y/, the shift in F0 is c. 210–710 Hz, as is true for the frequency shift of the first dominant harmonic, and the shift of calculated F1 is c. 380–740 Hz. (Note the problem of automatic calculation of F1 for the example in Figure 2-14.) For the sound series of the vowel / i /, the shift in F0 is c. 210–830 Hz, as is true for the frequency shift of the first dominant harmonic, and the shift of calculated F1 is c. 240–900 Hz.

Note the very pronounced spectral differences for the three sounds of / i, y, u / in the frequency range of F0 of 700–800 Hz, which reinforce the thesis of a parallelism between differences in perceived vowel quality and related acoustic differences, that is, the thesis of vowel-specific harmonic spectra of high-pitched sounds.

However, as mentioned in Section 8.1, indications for an F0-dependence of the lower spectral peaks and lower formants ≤ 1.5 kHz are not systematic: above all, the indications in question relate to frequency ranges of F0, to vowel qualities and to single speakers and their phonation characteristics, including vocal effort.

Concerning the F0 ranges, the indications for the F0-dependence in question are generally weak or absent for F0 < c. 200 Hz for the sounds of all vowels (see, for example, Figure 1 in this chapter, the corresponding sounds of /ø /).

Concerning vowel quality, the indications of the F0-dependence in question are particularly evident in the sounds of / i, y, e, ø, o, u / but often unsystematic, weak or even absent for the sounds of /ε / and of /a–α /. In terms of an illustration, Figure 3 shows examples of sounds of /a–α / produced by a child (age 13, transition to adolescence) on different F0. The harmonic spectrum strongly varies and peak and formant estimation is difficult to conduct. However, no clear indication of a relation between F0 and the lower spectral envelope is evident.

Concerning single speakers and their phonation characteristics, including vocal effort, Figure 4 shows examples of sounds of /o / produced at different F0 by a woman; in contrast to the corresponding sound series in Figure 1, only a very weak indication of a relation between F0 and the lower spectrum is evident.

But, as mentioned in Section 8.1, although the indications for the dependence discussed here prove to be unsystematic, the findings of intelligible vowel sounds at fundamental frequencies > 500 Hz (see next chapter) and of formant pattern ambiguity (see Chapter M9) force us to relate the lower spectral peaks and the lower formants to fundamental frequency.

In addition, such a dependence can also be observed for the second formant for cases of sounds of back vowels (see, for example, Section 10.1, Figure 1).

In the context of such F1 shifts with rising F0, “inverted” frequency levels of the lowest spectral peak and of calculated F1 can be observed for two sounds of two different vowels: where statistical values give lower formant frequencies for F1 for one vowel quality than for the other, higher values can be found for sounds of the former than for sounds of the latter if F0 variations are included in the investigation. Figure 5 shows examples of such cases in terms of sound pairs of /o, u / and /e, i /. (The sound pairs produced by children, women and men are presented separately.) The lowest spectral peaks < 1.5 kHz for the sounds of /u / are above those of the sounds of /o /, as is the case for the sounds of / i / compared with the sounds of /e /. Moreover, no clear indication of a second peak < 1.5 kHz and a corresponding marked F2 is manifest for the sounds of /o, u /, and the calculated F2 for the sound pairs of /e, i / are also “inverted”, i.e. F2 for the sounds of / i / is found below F2 for the sounds of /e /.

This observation foreshadows formant pattern ambiguity of vowel sounds, as documented in detail in Chapter M9.

For earlier accounts, see Maurer, Landis, and d’Heureuse (1991), Maurer and Landis (1995, 1996, 2000); see also Traunmüller (n.d.) for synthesised examples.

M8.2      Vowel Perception at Fundamental Frequencies above Statistical Values of the Respective First Formant Frequency

Figure 6 shows intelligible high-pitched sounds of the vowels /y, e, ø, ε, o / at F0 of c. 750 Hz, and Figure 7 exhibits intelligible high-pitched sounds of the corner vowels / i, a, u / at F0 of c. 850 Hz. Note again the pronounced spectral differences for these high-pitched sounds of different vowels supporting the thesis of a parallelism between differences in perceived vowel quality and related acoustic differences, that is, the thesis of vowel-specific harmonic spectra.

Figures 8 to 10 show examples of speech extracts of untrained speakers, journalists, TV hosts and actresses and actors, which manifest pitch contours for utterances of single speakers exceeding age- and gender-related statistical F1 of the vowels / i, y, u / (450 Hz for children, 400 Hz for women and 350 Hz for men). The ranges of F0 indicated—overall ranges for the speech sounds of a single speaker or a group of speakers (see below)—were determined acoustically in terms of approximations by listening to the sounds. (Please ignore some errors in the graphics exceeding the verified ranges given below. These errors are due, for example, to background noise or music, or the sound of an audience or to automatic pitch calculation.) The order of presentation within a figure accords, firstly, to the number of examples per speaker or a group of speakers, and secondly, to the identification number of the speaker.
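The criterion applied to these pitch contours, F0 exceeding the statistical F1 of / i, y, u / for the speaker group in question, can be written as a simple comparison. The threshold values are those given in the text; the dictionary and function names are assumptions made for this sketch.

```python
# Statistical F1 of / i, y, u / per speaker group, in Hz (values from the text).
STATISTICAL_F1_HZ = {"children": 450, "women": 400, "men": 350}

def exceeds_statistical_f1(group, f0_max_hz):
    """True if the upper end of a speaker's F0 range exceeds the
    statistical F1 of / i, y, u / for that speaker group."""
    return f0_max_hz > STATISTICAL_F1_HZ[group]

def groups_exceeded(f0_max_hz):
    """List the speaker groups whose statistical F1 is exceeded by f0_max_hz."""
    return [g for g, f1 in STATISTICAL_F1_HZ.items() if f0_max_hz > f1]
```

For instance, a male speaker whose F0 range reaches c. 420 Hz exceeds the value for men (350 Hz) and for women (400 Hz) but remains below the value for children (450 Hz).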

Figure 8 shows pitch contours of speech extracts produced by untrained speakers, journalists, TV hosts and actresses talking on TV (not acting), as experienced in everyday life:

     The examples for speaker 172 (see pitch contours 8-1 to 8-3) relate to extracts of a woman selling grilled chicken in a market in Paris. Overall range of F0 = c. 220–700 Hz (excluding very high-pitched exclamations).

     The examples for the two speakers subsumed under the ID number 379 and for the speaker 380 (see pitch contours 8-4 to 8-6) relate to extracts of two American women and one American man demonstrating infant- and child-directed speech. Overall range of F0 = c. 200–800 Hz for the women (except one higher peak at c. 1 kHz) and c. 150–600 Hz for the man.

     The examples for speaker 336 (see pitch contours 8-7 and 8-8, the latter from 0.7 to 2.5 sec.) relate to extracts of a female Indonesian singer talking in a TV show and to an exclamation of her name during the show. Overall range of F0 = c. 350–950 Hz.

     The two examples for the speakers subsumed under the ID number 348 (see pitch contours 8-9 and 8-10) relate to extracts of two female TV hosts announcing the results of a singing contest (announcements in English). Overall range of F0 = c. 200–700 Hz.

     The example for speaker 135 (see pitch contour 8-11) relates to two sentences of a boy (age 6). Range of F0 = c. 220–600 Hz.

     The example for speaker 174 (see pitch contour 8-12) relates to an extract of a female North American journalist speaking on television. Range of F0 = c. 175–600 Hz.

     The example for speaker 217 (see pitch contour 8-13) relates to an extract of a North American woman talking about her child on television. Range of F0 = c. 160–550 Hz.

     The example for speaker 220 (see pitch contour 8-14) relates to an extract of a female French doctor talking on television. Range of F0 = c. 250–520 Hz.

     The example for speaker 238 (see pitch contour 8-15) relates to an extract of a male French TV host. Range of F0 = c. 130–420 Hz (exceeding only gender-related statistical F1 of the vowels / i, y, u /).

     The example for speaker 383 (see pitch contour 8-16) relates to an extract of a French woman talking on television in a TV spot. Range of F0 = c. 220–830 Hz.

     The example for two speakers subsumed under the ID number 379 (see pitch contour 8-17) relates to an extract of a female French journalist (first part) questioning a French woman on the street, and the answer of the latter (second part). Overall range of F0 for the utterances of both women = c. 230–600 Hz.

Figure 9 shows pitch contours of speech extracts of performing actresses (film, comic, voice-over, dubbing):

     The examples for speaker 216 (see pitch contours 9-1 to 9-6) relate to extracts of a female Swiss narrator of fairy tales. Overall range of F0 = c. 150–900 Hz.

     The examples for speaker 177 (see pitch contours 9-7 to 9-9) relate to extracts of a French comic actress performing on stage. Overall range of F0 = c. 180–780 Hz.

     The examples for speaker 178 (see pitch contours 9-10 to 9-12) relate to extracts of another French comic actress performing on stage. Overall range of F0 = c. 200–850 Hz.

     The examples for speaker 212 (see pitch contours 9-13 to 9-15) relate to extracts of the speech of a French actress in a cartoon. Overall range of F0 = c. 300–700 Hz.

     The examples for speakers 251 (see pitch contours 9-16 to 9-18) relate to extracts of two British actresses performing as the voices of the two main characters in a computer-animated fantasy film. Overall range of F0 = c. 150–800 Hz.

     The examples for speaker 276 (see pitch contours 9-19 to 9-21) relate to extracts of a French comedy actress performing on stage. Overall range of F0 = c. 400–780 Hz.

     The example for speaker 175 (see pitch contour 9-22) relates to an extract of a North American actress performing as a female character in a film. Range of F0 = c. 270–700 Hz (excluding one high-pitched exclamation at F0 of c. 880 Hz).

     The example for speaker 223 (see pitch contour 9-23) relates to an extract of a German actress dubbing a female character in a film. Range of F0 = c. 220–780 Hz (excluding one high-pitched exclamation at the end).

     The example for speaker 234 (see pitch contour 9-24) relates to an extract of a French comic actress performing on stage. Range of F0 = c. 200–850 Hz.

     The example for speaker 258 (see pitch contour 9-25) relates to an extract of a French actress performing as the voice of a female character in an animation film. Range of F0 = c. 220–780 Hz.

     The example for speaker 275 (see pitch contour 9-26) relates to an extract of a German comic actress performing on stage. Range of F0 = c. 180–850 Hz.

     The example for speaker 291 (see pitch contour 9-27) relates to an extract of a British actress performing in a fantasy film. Range of F0 = c. 100–700 Hz.

     The example for speaker 296 (see pitch contour 9-28) relates to an extract of a German comic actress. Range of F0 = c. 150–600 Hz.

     The example for speaker 350 (see pitch contour 9-29) relates to an extract of a North American actress performing as a female character in a film. Range of F0 = c. 160–900 Hz (excluding some very high-pitched exclamations).

     The example for speaker 398 (see pitch contour 9-30) relates to an extract of a North American actress performing as a female character in a TV series. Range of F0 = c. 300–980 Hz.

Figure 10 shows pitch contours of speech extracts of performing actors (film, comic, voice-over, dubbing):

     The examples for speaker 225 (see pitch contours 10-1 to 10-4) relate to speech extracts of a Swiss comic actor performing as a female character. Overall range of F0 = c. 220–780 Hz.

     The examples for speaker 163 (see pitch contours 10-5 to 10-7) relate to extracts of an Indonesian comic actor performing on stage in a Drama Gong. Overall range of F0 = c. 300–600 Hz.

     The examples for speaker 169 (see pitch contours 10-8 to 10-10) relate to extracts of a German actor dubbing a male character in a film. Overall range of F0 = c. 100–700 Hz.

     The examples for speaker 214 (see pitch contours 10-11 to 10-13) relate to extracts of a Japanese Kabuki actor. Overall range of F0 = c. 250–700 Hz.

     The examples for speaker 297 (see pitch contours 10-14 to 10-16) relate to extracts of speech of another Swiss comic actor performing in a TV show. Overall range of F0 = c. 130–620 Hz.

     The examples for speaker 194 (see pitch contours 10-17 and 10-18) relate to extracts of a French comic actor performing on stage. Overall range of F0 = c. 130–700 Hz.

     The examples for speaker 394 (see pitch contours 10-19 and 10-20) relate to extracts of two French actors performing as the voices of male characters in an animation film. Overall range of F0 = c. 310–650 Hz.

     The example for speaker 171 (see pitch contour 10-21) relates to extracts of speech of a German actor dubbing the voice of a male character. Range of F0 = c. 180–550 Hz.

     The example for speaker 274 (see pitch contour 10-22) relates to extracts of speech of a Swiss actor performing as ventriloquist. Range of F0 = c. 120–600 Hz.

     The example for speaker 294 (see pitch contour 10-23) relates to an extract of speech of a North American actor performing as the voice of a female character in a comedy-variety film. Range of F0 = c. 200–800 Hz.

     The example for speaker 351 (see pitch contour 10-24) relates to an extract of speech of a German comic actor performing in a TV show. Range of F0 = c. 150–580 Hz (excluding one high-pitched exclamation at F0 of c. 780 Hz).

For earlier accounts, see Maurer and Landis (1996, 2000), Maurer, Mok, Friedrichs, and Dellwo (2014), Friedrichs, Maurer, and Dellwo (2015), Friedrichs, Maurer, Suter, and Dellwo (2015).

M8.3      “Inversions” of Relative Spectral Energy Maxima and Minima and “Inverse” Formant Patterns in Sounds of Individual Vowels

For each of the vowels /a–α, o, u / and for each speaker group, Figures 11 to 13 show pairs of sounds produced at different fundamental frequencies exhibiting “inverse” relative spectral maxima and minima in terms of “inverse” spectral envelope curves ≤ 1.5 kHz: whereas a relative minimum in the spectral envelope occurs for one sound of a pair, a peak for the other sound is manifest, and vice versa; however, the perceived vowel quality is maintained. The same holds true for comparisons of the respective calculated filter curves and, for most cases, for comparisons of patterns of manifest formants.

M9     Ambiguous Correspondence between Vowels and Patterns of Relative Spectral Energy Maxima or Formant Patterns or Complete Spectral Envelopes

M9.1      Ambiguous Patterns of Relative Spectral Energy Maxima and Ambiguous Formant Patterns

Figures 1 to 21 show series of sounds of different vowels produced at different F0 but exhibiting similar patterns of relative spectral energy maxima and/or similar patterns of calculated formant frequencies within their supposed vowel-specific frequency range related to statistical F1 and F2. In all cases, the actual differences of the patterns for the sounds of different vowels presented in a single series are far smaller than the observable differences (variations) of corresponding patterns for sounds of a single vowel.—In some series that include sounds at high fundamental frequencies, the overall spectral envelopes and the harmonic spectra are considered for the comparison in question.

For each series, roughly estimated average frequencies of the two lower relative spectral energy maxima and/or of the calculated frequencies F1–F2 are given below in terms of model patterns for the sounds compared. Exceptions concern a few comparisons of sounds of back vowels, for which only a single spectral peak is manifest in the sound spectra (for these comparisons, the corresponding peak frequency is given), and an additional exception concerns a comparison of sounds /a–α, u /, for which only the spectrum as such > 1.5 kHz is considered.

The first sound series shown include sounds of the vowels /a–α, o, u /, divided into two groups, one presenting sounds of different speakers, the other presenting sounds of single speakers. The second series shown include sounds of front vowels, again divided into the two groups mentioned. (Figures 9 and 11 include exceptions that illustrate the ambiguity discussed for sounds of different and of single speakers.) Within a series, the sounds are organised according to fundamental frequency.

Comparisons of sounds of back vowels and of /a–α / produced by different speakers:

Figure 1    Sounds of /a–α, o, u /; model pattern of spectral peaks and/or of calculated formant frequencies = 600–1200 Hz

Figure 2    Sounds of /a–α, o, u /; model pattern of spectral peaks and/or of calculated formant frequencies = 600–1050 Hz

Figure 3    Sounds of /a–α, o /; model pattern of spectral peaks and/or of calculated formant frequencies = 660–1320 Hz

Sounds of /u / are included in the first three series because the first harmonic corresponds to F1 of the model pattern in question; however, no clear spectral indication can be found for F2 even if LPC analysis gives a (weak) second formant at a frequency level which corresponds to the model pattern of a series.
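The coincidence described here, a harmonic falling at a frequency of the model pattern, can be checked arithmetically, because the harmonics of a voiced sound lie at integer multiples of F0. The following is a minimal sketch only: the function name `nearest_harmonic` and the example F0 of 590 Hz are assumptions for illustration and are not taken from the figures.

```python
def nearest_harmonic(f0_hz, target_hz):
    """Return (harmonic number, harmonic frequency, deviation from target)
    for the harmonic of f0_hz that lies closest to target_hz."""
    # Harmonics are integer multiples of F0; the first harmonic is F0 itself.
    n = max(1, round(target_hz / f0_hz))
    freq = n * f0_hz
    return n, freq, freq - target_hz


# Hypothetical sound of /u / at F0 = 590 Hz, compared with the
# model pattern 600-1200 Hz of Figure 1.
for target in (600, 1200):
    n, freq, dev = nearest_harmonic(590, target)
    print(f"target {target} Hz: harmonic {n} at {freq} Hz (deviation {dev} Hz)")
```

In this hypothetical case the first harmonic lies within 10 Hz of the F1 of the model pattern. The sketch says nothing about the levels of the harmonics, which is why a nearby second harmonic alone gives no clear spectral indication for F2.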

Comparisons of sounds of back vowels and of /a–α / produced by single speakers:

Figure 4    Three comparisons of sounds of /a–α, o, u / produced by a man and two women; model pattern of spectral peaks and/or of calculated formant frequencies = 600–1200 Hz

Figure 5    Two comparisons of sounds of /a–α, o / produced by a man (sounds sung by a tenor); model pattern of spectral peaks and/or of calculated formant frequencies = 600–1200 Hz for the first comparison, similar spectral peaks and spectral envelopes for the second comparison

Figure 6    Sounds of /a–α / and of /u / produced by a woman which exhibit comparable spectral envelopes < 1.5 kHz

Figure 7    Sounds of /ɔ, o, u / produced by a woman; model pattern of spectral peaks and/or of calculated formant frequencies = one clear peak at c. 550 Hz (exceptionally, sounds of the vowel /ɔ / are included in order to show a possible shift in perceived vowel quality from /ɔ / to /o / related to two levels of F0 of c. 175 Hz and c. 260 Hz)

Figure 8    Two comparisons of sounds of /o, u / produced by two children (age 12 and 6); model patterns of spectral peaks and/or of calculated formant frequencies = one clear peak at c. 400 Hz (first sound pair) and at c. 520 Hz (second sound pair), respectively.

Comparisons of sounds of front vowels produced by different speakers:

In contrast to many other comparisons presented in this chapter, the ambiguity illustrated in Figures 9 to 11 relates not only to substantial differences in F0 but also to the configuration of the levels of the harmonics, to the spectrum above F2 and to the levels of calculated formants including F3. This is the case particularly for direct comparisons of sounds of /e / and of /ø /, and of / i / and of /y/, respectively. Moreover, a sound produced with creak phonation is exceptionally included in the comparison (see the first vowel spectrum of Figure 9).

Figure 9    Sounds of /ø, e, y, i /; model pattern of spectral peaks and/or of calculated formant frequencies = 330–2000 Hz; note the ambiguity for sounds of /ø, i / for the single speaker 391 and the ambiguity for the sounds of /ø, y/ for the single speaker 376

Figure 10    Sounds of /ø, e, y, i /; model pattern of spectral peaks and/or of calculated formant frequencies = 350–2150 Hz

Figure 11    Sounds of /ø, e, y, i /; model pattern of spectral peaks and/or of calculated formant frequencies = 420–2150 Hz; note the ambiguity for sounds of /ø, y/ for the single speaker 402

Figure 12    Sounds of /ε, e, i /; model pattern of spectral peaks and/or of calculated formant frequencies = 500–2250 Hz

Figure 13    Sounds of /ε, e, i /; model pattern of spectral peaks and/or of calculated formant frequencies = 600–2450 Hz

Figure 14    Sounds of /e, i /; model pattern of spectral peaks and/or of calculated formant frequencies = 400–2600 Hz

Figure 15    Sounds of /ε, e, y/; model pattern of spectral peaks and/or of calculated formant frequencies = 500–2000 Hz

Figure 16    Sounds of /ε, ø, y/; model pattern of spectral peaks and/or of calculated formant frequencies = 430–2000 Hz

Figure 17    Sounds of /ε, ø, y/; model pattern of spectral peaks and/or of calculated formant frequencies = 475–1900 Hz

Figure 18    Sounds of /ε, y/; model pattern of spectral peaks and/or of calculated formant frequencies = 650–1950 Hz

Comparisons of sounds of front vowels, produced by single speakers:

Figure 19    Two comparisons of sounds of /ε, e, i / produced by two women; model patterns of spectral peaks and/or of calculated formant frequencies = 510–2550 Hz and 600–2400 Hz, respectively

Figure 20    Three comparisons of sounds of /e, i / produced by three children (age 7 to 9); model patterns of spectral peaks and/or of calculated formant frequencies = 450–3000 Hz and 400–3000 Hz, respectively

Figure 21    Three comparisons of sounds of /ø, y/ produced by a man, a woman and a child (age 12); model patterns of spectral peaks and/or of calculated formant frequencies = 320–1600 Hz, 320–2000 Hz and 400–2000 Hz, respectively

For earlier accounts, see Maurer and Landis (2000).

M9.2      Ambiguous Spectral Envelopes

For the frequency range relevant for the perceived vowel qualities in question, many of the sound series presented in the previous chapter do not only show similar patterns of vowel-related spectral peaks and similar patterns of calculated F1–F2 but also similar vowel-related spectral envelope shapes for sounds of different vowels, including similar patterns of calculated F1–F2–F3 for sounds of front vowels (for all calculated formant frequencies refer to the online digital version of the Materials).

M9.3      Ambiguity and Individual Vowels

The series in Section M9.1 present ambiguities as discussed here for all combinations of the long German back vowels and /a–α / and for all combinations of the long German front vowels. Thus, the ambiguities are not a phenomenon of overlapping F1–F2 spaces of neighbouring vowel qualities but, in most cases, a consequence of the dependence of vowel-specific, relative spectral energy maxima and lower formants ≤ 1.5 kHz on fundamental frequency, interrelated with an observable variation of higher vowel-related spectral parts for sounds of front vowels.

However, two restrictions apply.

Concerning the sounds of back vowels and of /a–α / investigated, the demonstration of a possible ambiguity of the lower spectral envelope and of F1–F2 is unquestionable for comparisons of sounds of /u / and of /o /, and of /o / and of /a–α /. For the comparison of sounds of /u / and of /a–α /, however, the demonstration of a possible ambiguity is limited to similar calculated F1–F2, but because of high F0 of the sounds of /u /, this calculation is methodically unsubstantiated. Further direct comparison of the spectral envelope and the configuration of the levels of the harmonics generally provides no clear indication. Notwithstanding, it is important to consider the fact that sounds of /u / can be produced at a level of F0 that can correspond to F1 of sounds of /a / and that, in such cases, exhibit a dominant first harmonic.

Concerning the sounds of front vowels investigated, the demonstration of a possible ambiguity, which is related to differences in F0 of the sounds compared, does not concern the direct comparisons of sounds of /e, ø /, and of / i, y/. As mentioned, in such cases, the ambiguity relates to the configuration of the levels of the harmonics, to the spectrum above F2 and to the levels of calculated formants including F3. This phenomenon is again illustrated in the following three figures.

Figure 22    Three sound pairs of /y, i /, each pair produced by single female speakers; model patterns of spectral peaks and/or of calculated formant frequencies = 290–2150 Hz, 315–2100 Hz and 350–2100 Hz, respectively

Figure 23    Sounds of /y, i / produced by different male speakers; model pattern of spectral peaks and/or of calculated formant frequencies = 230–2050 Hz

Figure 24    A sound pair of /ø, e / produced by a single male speaker; model pattern of spectral peaks and/or of calculated formant frequencies = 350–1700 Hz

M10  Lack of Correspondence between Patterns of Relative Spectral Energy Maxima or Formant Patterns and Age- and Gender-Related Speaker Groups or Vocal-Tract Sizes

M10.1    Similar Patterns of Relative Spectral Maxima and Similar Formant Patterns ≤ 1.5 kHz for Different Age- and Gender-Related Speaker Groups or Vocal-Tract Sizes

Figure 1 shows sounds of the vowel /o / produced by a child (age 8), a woman and a man. Each speaker produced sounds at different F0 in a way that allowed for a comparison of the sounds of the three speakers (representing the three main speaker groups according to age and gender) at different and similar F0. The comparison shows that age- and gender-related differences ≤ 1.5 kHz as given in formant statistics for citation-form words can decrease or even disappear if F0 of the vocalisations correspond for children, women and men. In this regard, comparisons of vocalisations of /o / are of special interest (and shown first) because an F0-dependence of the lower spectral frequency range can be observed for F0 clearly below statistical F1, and because the frequency range ≤ 1.5 kHz covers the entire range related to the vowel identity in question.—Data for speakers, ranges of F0 and calculated F1 and F2:

Spectra 1-1 to 1-6    Child; F0 = 196–322 Hz, F1 = 424–624 Hz, F2 = 777–1092 Hz

Spectra 1-7 to 1-13    Woman; F0 = 162–320 Hz, F1 = 363–576 Hz, F2 = 804–1141 Hz

Spectra 1-14 to 1-21    Man; F0 = 129–326 Hz, F1 = 343–577 Hz, F2 = 672–1143 Hz
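The extent to which the three speakers cover the same frequency regions can be summarised by intersecting the ranges listed for Figure 1. The sketch below is illustrative only: the dictionary layout and the helper name `shared_range` are assumptions, and a shared range is a coarse summary that does not replace the comparison of individual spectra at matching F0.

```python
# Ranges in Hz as listed for Figure 1 (sounds of /o /).
speakers = {
    "child": {"F0": (196, 322), "F1": (424, 624), "F2": (777, 1092)},
    "woman": {"F0": (162, 320), "F1": (363, 576), "F2": (804, 1141)},
    "man": {"F0": (129, 326), "F1": (343, 577), "F2": (672, 1143)},
}


def shared_range(ranges):
    """Return the (low, high) interval common to all ranges, or None if there is none."""
    low = max(r[0] for r in ranges)
    high = min(r[1] for r in ranges)
    return (low, high) if low <= high else None


for key in ("F0", "F1", "F2"):
    print(key, shared_range([data[key] for data in speakers.values()]))
```

For the listed values, all three speakers share the ranges 196–320 Hz for F0, 424–576 Hz for F1 and 804–1092 Hz for F2.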

Figure 2 demonstrates this phenomenon for sounds of the vowel /e / produced by a child (age 10), a woman and a man, concerning the lowest spectral peak and F1.—Data for speakers, ranges of F0 and calculated F1:

Spectra 2-1 to 2-6    Child; F0 = 180–330 Hz, F1 = 395–563 Hz

Spectra 2-7 to 2-13    Woman; F0 = 160–325 Hz, F1 = 389–622 Hz

Spectra 2-14 to 2-21    Man; F0 = 122–336 Hz, F1 = 370–566 Hz (excluding the last sound for which automatic calculation of F1 does not provide a reliable result)

Similar indications as shown for sounds of /e / can be found for sounds of /ø /.

Figure 3 demonstrates this phenomenon for sounds of the vowel /u / produced by a child (age 8), a woman and a man. However, only the first lower peak and calculated F1 are discussed because, for several sounds, an interpretation of F2 lacks methodological substantiation.—Data for speakers, ranges of F0 and calculated F1:

Spectra 3-1 to 3-6    Child; F0 = 237–492 Hz, F1 = 273–492 Hz

Spectra 3-7 to 3-13    Woman; F0 = 177–498 Hz, F1 = 300–502 Hz

Spectra 3-14 to 3-21    Man; F0 = 138–519 Hz, F1 = 303–519 Hz

Figure 4 demonstrates this phenomenon for sounds of the vowel / i / produced by a child (age 8), a woman and a man, concerning the lower spectral peak and calculated F1.—Data for speakers, ranges of F0 and calculated F1:

Spectra 4-1 to 4-6    Child; F0 = 247–533 Hz, F1 = 267–534 Hz

Spectra 4-7 to 4-13    Woman; F0 = 177–518 Hz, F1 = 279–525 Hz

Spectra 4-14 to 4-21    Man; F0 = 134–534 Hz, F1 = 216–550 Hz

Similar indications as shown for sounds of / i / can be found for sounds of /y/.

With regard to sounds of /a–α /, a compilation of corresponding sound series similar to those presented for the other vowels often encounters some difficulties for two main reasons: spectral peaks and formant patterns often do not shift markedly with rising F0, and children often produce a very open /a /, while many adults produce an intermediate sound of /a–α / or even a sound of /α /, although all speakers speak the same language and live in a geographically limited area. However, Figure 5 demonstrates a case of comparable vowel spectra and comparable formant patterns for sounds of /a / produced by a child (age 10), a woman and a man. Data for speakers, ranges of F0 and calculated F1 and F2:

Spectra 5-1 to 5-6    Child; F0 = 196–329 Hz, F1 = 759–1055 Hz, F2 = 1341–1555 Hz

Spectra 5-7 to 5-13    Woman; F0 = 160–329 Hz, F1 = 706–1007 Hz, F2 = 1265–1503 Hz

Spectra 5-14 to 5-21    Man; F0 = 126–324 Hz, F1 = 758–898 Hz, F2 = 1232–1431 Hz

The sounds presented in the previous figures may lead to the question whether, with rising F0 and related shifts of the lower spectral peaks and of the calculated lower formants, the perception of age and gender of the speaker alters, i.e. whether the sounds of adults are perceived as produced by children at F0 > c. 260 Hz, and whether sounds of men are perceived as produced by women > c. 200 Hz. This may indeed be the case for the comparison of the sounds of some speakers, while it does not hold true for others. To demonstrate the latter, Figure 6 shows similar vowel spectra and similar formant patterns for sounds of the vowel /o / produced by a child (age 10), a woman (untrained speaker) and a man (classical opera singer, baritone). For these sounds, the perceived vowel quality corresponds very well. However, the baritone is always perceived as such at all F0 of his singing, which is represented in his vowel spectra by a so-called “singer’s formant cluster”. (Again, only the first lower peak and calculated F1 are discussed since most sounds exhibit only one spectral peak; for these sounds, the calculated F2 is weak and its role for vowel perception is questionable; see Section M7.1.)—Data for speakers, ranges of F0 and calculated F1:

Spectra 6-1 to 6-5    Child; F0 = 181–348 Hz, F1 = 377–674 Hz

Spectra 6-6 to 6-11    Woman; F0 = 168–332 Hz, F1 = 344–593 Hz

Spectra 6-12 to 6-17    Man; F0 = 127–325 Hz, F1 = 386–680 Hz

As a direct consequence of the documented observations, it follows that, for back vowels, the sounds of men (at higher F0) may exhibit higher vowel-related spectral peaks and higher calculated F1 or F1–F2 patterns than the sounds of women (at lower F0). The same holds true for the lowest spectral peak and calculated F1 of front vowels and may also occur when comparing sounds of adults and children.

Figure 7 shows such an “inversion” of expected age- and gender-related differences by comparing sounds of the vowel /o / produced by a child and a man, selected from the sound series of the previous Figure 6. If the F0 of the sounds of the man substantially exceeds the F0 of a sound of the child, the first spectral peak and calculated F1 of the sounds of the man are also above the corresponding peak and F1 of the sound of the child (compare Spectra 7-1 to 7-3). The same holds true for calculated F2, but as mentioned, the measurement and perceptual role of F2 are in question. However, if the comparison relates to the sounds of the man at F0 corresponding to statistical values (given for citation-form words), the first spectral peak and calculated F1 (and F2) are lower for the man than for the child, as is generally expected (see Spectra 7-4 and 7-5). Data for speakers, ranges of F0 and calculated F1 (and F2), in the order of F0:

    “Inverted” age- or size-related difference

Spectra 7-1     Child; F0 = 223 Hz, F1 = 440 Hz (F2 = 764 Hz)

Spectra 7-2, 7-3    Man; F0 = 261–325 Hz, F1 = 511–680 Hz (F2 = 884–950 Hz)

“Expected” age- or size-related difference

Spectra 7-4     Man; F0 = 127 Hz, F1 = 430 Hz (F2 = 535 Hz)

Spectra 7-5    Child; F0 = 264 Hz, F1 = 538 Hz (F2 = 1069 Hz)
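The comparison logic of Figure 7 can be restated as a small check on the values listed above. This is only a sketch: the `Sound` record and the `classify` helper are hypothetical names, and pairing the range endpoints of Spectra 7-2 and 7-3 (261 Hz with 511 Hz, 325 Hz with 680 Hz) is an assumption, since the listing gives ranges rather than values for each sound.

```python
from dataclasses import dataclass


@dataclass
class Sound:
    speaker: str
    f0: float  # fundamental frequency in Hz
    f1: float  # calculated first formant in Hz


def classify(man, child):
    """Label the age- or size-related difference in calculated F1 for one sound pair."""
    if man.f1 > child.f1:
        return "inverted"  # F1 of the man lies above F1 of the child
    if man.f1 < child.f1:
        return "expected"  # F1 of the man lies below F1 of the child
    return "equal"


child_low = Sound("child", 223, 440)   # Spectrum 7-1
man_high_a = Sound("man", 261, 511)    # Spectrum 7-2 (assumed endpoint pairing)
man_high_b = Sound("man", 325, 680)    # Spectrum 7-3 (assumed endpoint pairing)
man_low = Sound("man", 127, 430)       # Spectrum 7-4
child_high = Sound("child", 264, 538)  # Spectrum 7-5

for man, child in ((man_high_a, child_low), (man_high_b, child_low), (man_low, child_high)):
    print(f"man F0 {man.f0} Hz vs child F0 {child.f0} Hz: {classify(man, child)}")
```

With the listed values, both pairs in which the F0 of the man exceeds that of the child are labelled "inverted", and the pair with the man at F0 = 127 Hz is labelled "expected".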

Figure 8 demonstrates this phenomenon < 1.5 kHz by comparing selected sounds of the vowel /e / shown in Figure 2. Data for speakers and ranges of F0 and calculated F1:

    “Inverted” age- or size-related difference

Spectra 8-1    Child; F0 = 222 Hz, F1 = 449 Hz

Spectra 8-2, 8-3    Man; F0 = 260–293 Hz, F1 = 506–566 Hz

“Expected” age- or size-related difference

Spectra 8-4    Man; F0 = 122 Hz, F1 = 370 Hz

Spectra 8-5    Child; F0 = 265 Hz, F1 = 518 Hz

Figure 9 demonstrates this phenomenon < 1.5 kHz by comparing selected sounds of the vowel /u / shown in Figure 3. Data for speakers and ranges of F0 and calculated F1:

    “Inverted” age- or size-related difference

Spectra 9-1    Child; F0 = 237 Hz, F1 = 273 Hz

Spectra 9-2, 9-3    Man; F0 = 410–519 Hz, F1 = 412–519 Hz

“Expected” age- or size-related difference

Spectra 9-4    Man; F0 = 138 Hz, F1 = 303 Hz

Spectra 9-5    Child; F0 = 257 Hz, F1 = 346 Hz

Figure 10 demonstrates this phenomenon < 1.5 kHz by comparing selected sounds of the vowel / i / shown in Figure 4. Data for speakers and ranges of F0 and calculated F1:

    “Inverted” age- or size-related difference

Spectra 10-1    Child; F0 = 247 Hz, F1 = 267 Hz

Spectra 10-2, 10-3    Man; F0 = 441–534 Hz, F1 = 444–550 Hz

“Expected” age- or size-related difference

Spectra 10-4    Man; F0 = 134 Hz, F1 = 269 Hz

Spectra 10-5    Child; F0 = 263 Hz, F1 = 301 Hz

Comparisons are limited to children and men because the corresponding differences in the vocal-tract sizes are assumed to be highest.

For earlier accounts, see Maurer, Cook, Landis, and d’Heureuse (1992), Maurer, Suter, Friedrichs, and Dellwo (2015b); note also some related reflections in Potter and Steinberg (1950).

M10.2    The Dichotomy of the Vowel Spectrum

In Chapter 10.1, we have argued that the spectrum of a vowel sound needs a twofold rather than a uniform consideration, because only the vowel-related spectrum ≤ 1.5 kHz clearly depends on F0 and, therefore, is not generally specific to speaker groups and vocal-tract sizes. Figures 7 to 10 in the previous chapter illustrate this dichotomy of the vowel spectrum.

M10.A    Addition: Vowel Imitations by Birds

The following series show examples of vowel sounds of common hill mynah birds (Gracula religiosa) imitating vocal expressions and words of humans. The examples are selected on the basis of extensive recordings of 21 birds, most of them living in Indonesia. (However, they imitated words of different languages.) The spectra presented relate to vowel nuclei extracted from the expressions or words. Both the entire imitated expressions or words as well as the extracted sound fragments are perceptually recognisable.

In each of the series, the sound spectra are given in the order of the birds and of F0. (Note that in several cases, different sound spectra for the same vowel are shown for a bird, in order to document variations in F0 and the sound spectra.)—Acoustic analysis corresponds to the analysis as described in the Note on the Method section. LPC filter curves relate to a parameter setting of the LPC analysis according to the PRAAT standard for women. However, as mentioned in the text, the LPC analysis is not methodically substantiated.

Figure 11    Examples of sounds of imitated / i / in word context produced by five birds, with F0 ranging from c. 110–380 Hz; perceptual vowel quality is / i /, including intermediate qualities / i–j /, / i–y/ and / i–e /

Figure 12    Examples of sounds of imitated /e / in word context produced by five birds, with F0 ranging from c. 160–330 Hz; perceptual vowel quality is /e /, including intermediate qualities /e–i / and /e–ø /

Figure 13    Examples of sounds of imitated /a / in word context produced by twelve birds, with F0 ranging from c. 110–490 Hz; perceptual vowel quality is /a–α /, including intermediate quality /α– ɔ /

Figure 14    Examples of sounds of imitated /o / in word context produced by eleven birds, with F0 ranging from c. 80–410 Hz; perceptual vowel quality is /o /, including intermediate quality /o–ɔ /

Figure 15    Examples of sounds of imitated /u / in word context produced by seven birds, with F0 ranging from c. 110–660 Hz; perceptual vowel quality is /u /, including intermediate quality /u–o /

Note that many of the sound spectra of these birds are similar to the vowel spectra of humans presented in the previous sections. However, for some examples of imitations of front vowels, the lower part of the spectral configuration < 1 kHz is “unexpected”.

Figure 13. Sounds of /a–α / in word context imitated by mynah birds.

M11  Lack of Correlation between Methodological Limitations of Formant Determination and Limitations of Vowel Perception

M11.1    Vowel Perception at Fundamental Frequencies > 350 Hz

The sound series presented in Sections M8.1 and M8.2 demonstrate that recognisable vowels can be produced at fundamental frequencies substantially exceeding the critical limit above which formants can no longer be reliably determined for methodological reasons.
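One commonly given reason for such a limit is that a harmonic spectrum carries energy only at integer multiples of F0, so that a spectral envelope is sampled more and more sparsely as F0 rises. The sketch below counts the harmonics at or below 1.5 kHz, the frequency limit used in this text for the lower vowel-related range. The function name `harmonics_below` and the example F0 values are assumptions for illustration, not measured data.

```python
def harmonics_below(f0_hz, limit_hz=1500):
    """Return the harmonic frequencies of f0_hz that lie at or below limit_hz."""
    count = int(limit_hz // f0_hz)
    return [n * f0_hz for n in range(1, count + 1)]


# Illustrative F0 values: a low male speaking pitch, the 350 Hz limit
# named in the heading above, and a high sung pitch.
for f0 in (100, 350, 700):
    harmonics = harmonics_below(f0)
    print(f"F0 = {f0} Hz: {len(harmonics)} harmonics <= 1.5 kHz -> {harmonics}")
```

At F0 = 100 Hz, fifteen harmonics fall at or below 1.5 kHz; at 350 Hz only four remain, and at 700 Hz only two, which illustrates why a peak position estimated between harmonics becomes increasingly uncertain.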

M11.2    Lack of Correspondence between Methodological Problems of Formant Pattern Estimation at Fundamental Frequencies ≤ 350 Hz and Impaired Vowel Perception

The sound series presented in Sections M7.1 and M7.2 demonstrate that vowel sounds produced at fundamental frequencies ≤ 350 Hz, for which the estimation of formant patterns proves questionable for reasons other than fundamental frequency (for instance, if expected relative spectral energy maxima are “missing” or if vowel-related parts of a spectrum are “flat”), are not less recognisable than vowel sounds for which formant pattern estimation may be said to be unproblematic.