Open access

Acoustics of the Vowel


Dieter Maurer

It seems as if the fundamentals of how we produce vowels and how they are acoustically represented have been clarified: we phonate and articulate. Using our vocal cords, we produce a vocal sound or noise which is then shaped into a specific vowel sound by the resonances of the pharyngeal, oral, and nasal cavities, that is, the vocal tract. Accordingly, the acoustic description of vowels relates to vowel-specific patterns of relative energy maxima in the sound spectra, known as patterns of formants.
The intellectual and empirical reasoning presented in this treatise, however, gives rise to scepticism with respect to this understanding of the sound of the vowel. The reflections and materials presented provide reason to argue that, up to now, a comprehensible theory of the acoustics of the voice and of voiced speech sounds is lacking, and consequently, no satisfying understanding of vowels as an achievement and particular formal accomplishment of the voice exists. Thus, the question of the acoustics of the vowel – and with it the question of the acoustics of the voice itself – proves to be an unresolved fundamental problem.

10 Lack of Correspondence between Patterns of Relative Spectral Energy Maxima or Formant Patterns and Speaker Groups or Vocal-Tract Sizes

10.1    Similar Patterns of Relative Spectral Maxima and Similar Formant Patterns ≤ 1.5 kHz for Different Speaker Groups or Different Vocal-Tract Sizes

If sounds of a vowel are produced at equal fundamental frequencies by children, women and men, and if these sounds perceptually correspond with each other not only in terms of their general attribution to a vowel quality but also in terms of the respective “vowel-colour” variant (which makes for the greatest possible correspondence as regards perception), then, empirically, both the relative spectral energy maxima (if determinable) and the formant patterns (if methodically substantiated) often prove to be similar in the lower frequency range ≤ 1.5 kHz, apart from possible differences due to the different parameter settings involved in formant analysis. Expected age- and gender-related spectral differences decrease or disappear if the fundamental frequencies of the utterances correspond for children, women and men.

Further, for sounds of back vowels and sounds produced by men at higher fundamental frequencies than women, it follows that the sounds of men (at higher F0) may exhibit higher relative spectral energy maxima (if determinable) and higher F1 or even F1–F2 patterns (if methodically substantiated) than the sounds of women (at lower F0), as holds true for F1 of front vowels. The same may also occur in a corresponding comparison of sounds of adults and children.

No statements are made here on /a–α/ since our observations do not yet allow for general formulations for all sounds of /a–α/ (see Section 8.1).

Thus, the question arises whether the lower range of the vowel spectrum mentioned is indeed dependent on age- and gender-related speaker groups, that is, on vocal-tract size. In the literature, this lower frequency range is referred to as being entirely vowel specific for all back vowels and, concerning F1, vowel specific for all other vowels.

In any event, the general statement that the sounds produced by children exhibit the highest, the sounds of women intermediate and the sounds of men the lowest patterns of vowel-specific relative spectral energy maxima and formant frequencies does not apply.

Within the frequency range of ≤ 1.5 kHz, vowel-specific patterns of relative spectral energy maxima (if determinable) and formant patterns (if methodically substantiated) often prove to be empirically independent of the age- and gender-related speaker group, that is, the vocal-tract size. Given strict perceptual correspondences, then, differences refer directly to the differences in fundamental frequency.
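The role of fundamental frequency in this range can be made concrete with a little arithmetic. The following sketch is an illustration added here, not part of the author's materials: it relies only on the fact that the harmonics of a periodic sound lie at integer multiples of F0, and it counts how many of them fall within the range ≤ 1.5 kHz. The function name `harmonics_below` and the F0 values are assumptions chosen for illustration.

```python
def harmonics_below(f0_hz, limit_hz=1500.0):
    """Return the harmonic frequencies k * f0_hz (k = 1, 2, ...) that are <= limit_hz."""
    if f0_hz <= 0:
        raise ValueError("f0_hz must be positive")
    count = int(limit_hz // f0_hz)  # number of whole multiples of F0 within the limit
    return [k * f0_hz for k in range(1, count + 1)]


# Illustrative F0 values only; they are not taken from the text.
for f0 in (100.0, 200.0, 300.0, 600.0):
    harmonics = harmonics_below(f0)
    print(f"F0 = {f0:5.0f} Hz: {len(harmonics):2d} harmonics <= 1.5 kHz")
```

For the values shown, the counts are 15, 7, 5 and 2. With fewer harmonics in the range, the spectral envelope is sampled more coarsely, which may be one reason why relative spectral energy maxima are qualified here as "if determinable".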

As mentioned, the possible relationship between fundamental frequencies and higher vowel-specific spectral envelope peaks or formants for sounds of front vowels is left open for discussion. In the present context, this also concerns the question of whether or not higher frequency ranges are in principle specific to vocal-tract sizes.

10.2    The Dichotomy of the Vowel Spectrum

As mentioned repeatedly, while the dependence of vowel-specific spectral characteristics and formants on fundamental frequency for the lower frequency range ≤ 1.5 kHz is easily understandable and reproducible empirically, this is not the case for the higher frequency range. At the same time, lower spectral ranges and lower formant frequencies are not generally specific to speaker groups and vocal-tract sizes. Whether this is also the case for higher spectral ranges and formant frequencies is still in question. Thus, the spectrum of a vowel sound needs a twofold rather than a uniform consideration.

The spectrum of a vowel proves to be dichotomous.

In this context, with regard to the sounds of front vowels, it is particularly important to consider that, in certain cases, higher relative spectral energy maxima (if determinable) and higher formants (if methodically substantiated) > 2 kHz may be simultaneously related to vowel identity and perceived speaker group: differences in this higher frequency range can often be observed for sounds of a front vowel produced by children, women and men if the speakers form these sounds at similar fundamental frequencies, even if there is no such difference found in the lower frequency range.

However, it is left open for further investigation whether this is also the case if men imitate so-called “female voices” or if adults imitate “children’s voices”.

10.3    Addition: Whispered Vowel Sounds and Speaker Groups or Vocal-Tract Sizes

To date, no comparative studies of formant patterns for whispered vowel sounds of children, women and men have been published that have attained a reference status comparable to that of the reference statistics for voiced vowel sounds referred to in Part II. However, the studies that do compare whispered sounds of different speaker groups (limited in number and generally not including all vowels of a language) report corresponding differences between formant patterns.

Notwithstanding the reflections and comments made so far, these differences can be understood as an indication of a general relationship between patterns of relative spectral energy maxima and formant patterns on the one hand, and speaker groups, that is, average vocal-tract sizes on the other, including the lower frequency ranges.

This aspect and its significance regarding the relationship between vowels and related spectral characteristics is left open to discussion here and needs to be clarified and discussed elsewhere.

10.4    Addition: Vowel Imitations by Birds

Sounds of animals imitating utterances of humans are also of primary importance in the discussion of vowel sounds, related spectral characteristics, formant patterns, perceived speaker groups and vocal-tract sizes.

Fundamental in this respect is the question of how birds are able to imitate human sounds despite lacking the means of phonation and articulation—in particular, a corresponding vocal-tract resonance.

According to our own preliminary examination of vowel imitation by common hill myna birds who excel at such mimicry (results unpublished, although some clear examples are given in the Materials section), we conclude the following: if these birds imitate words, and if individual imitated vowel sounds are isolated as sound fragments in a way that they possess a quasi-static character in terms of quasi-static spectral characteristics (above all, that transitions are excluded), then vowel perception and a distinction of such sounds by humans is possible. For part of these sound fragments, complete F1–F2–F3 formant patterns comparable to patterns given for human sounds can be interpreted. For the remaining fragments, only a partial correspondence in formant patterns can be observed. (However, this statement must be relativised: strictly speaking, any calculation of vowel-related formant patterns of bird sounds is methodically unsubstantiated; see below.)

The fact that birds are able to imitate human vowel sounds with vowel-specific spectral characteristics and formant patterns comparable to those of humans contradicts, in its turn, a strict correspondence between the spectral characteristics of the produced sound and vocal-tract resonance. The same holds true for a strict correspondence between spectral characteristics of the produced sound and vocal-tract size. Consequently, any critical investigation and discussion of vowels must focus on the possibility that the same sound characteristics can be produced under substantially different physical and physiological conditions.

Besides, if birds are able to mimic human utterances, they must be able to perceptually differentiate different vocal sounds. However, their perception cannot rely on any sensorimotor and conceptual experience of vowel production comparable to the experience of humans. Thus, it can be speculated that their perception relies on a more “abstract” acoustic “form” of the vowel sound. (Such speculation would meet the claim that a phenomenological approach to the physical representation of vowels is needed; see Part V.)

10.5    Addition: Resynthesis and Synthesis

Again, the lack of a general correspondence between patterns of relative spectral energy maxima or formant patterns and speaker groups or vocal-tract sizes can be evaluated and replicated using resynthesis and synthesis.
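One way such a synthesis test might be set up is sketched below. This is a minimal illustration under stated assumptions, not the procedure used for the materials of this treatise: harmonics of a chosen F0 are summed, with amplitudes read from a simple spectral envelope. The envelope shape, the peak frequencies and bandwidths, the sample rate and all function names (`envelope_gain`, `synthesize`, `write_wav`) are hypothetical choices made for illustration.

```python
import math
import struct
import wave


def envelope_gain(freq_hz, peaks):
    """Hypothetical spectral envelope: a sum of simple resonance-shaped peaks.

    peaks is a list of (centre_hz, bandwidth_hz) pairs; each peak has gain 1.0
    at its centre and 0.5 at centre +/- bandwidth / 2.
    """
    gain = 0.0
    for centre_hz, bandwidth_hz in peaks:
        half = bandwidth_hz / 2.0
        gain += half * half / ((freq_hz - centre_hz) ** 2 + half * half)
    return gain


def synthesize(f0_hz, peaks, duration_s=0.5, sample_rate=16000):
    """Additive synthesis: sum the harmonics of f0_hz, weighted by the envelope."""
    nyquist = sample_rate / 2.0
    harmonics = []
    k = 1
    while k * f0_hz < nyquist:  # keep only harmonics below the Nyquist frequency
        freq = k * f0_hz
        harmonics.append((freq, envelope_gain(freq, peaks)))
        k += 1
    n_samples = int(round(duration_s * sample_rate))
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        samples.append(sum(g * math.sin(2.0 * math.pi * f * t) for f, g in harmonics))
    peak = max((abs(s) for s in samples), default=0.0) or 1.0
    return [s / peak for s in samples]  # normalise to the range [-1, 1]


def write_wav(path, samples, sample_rate=16000):
    """Write mono 16-bit PCM so the result can be auditioned."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        frames = b"".join(struct.pack("<h", int(s * 0.9 * 32767)) for s in samples)
        wav.writeframes(frames)


# Illustrative envelope peaks only; the values are not taken from the text.
example_peaks = [(400.0, 100.0), (800.0, 120.0), (2500.0, 200.0)]
low_f0 = synthesize(110.0, example_peaks, duration_s=0.1)
high_f0 = synthesize(330.0, example_peaks, duration_s=0.1)
```

Passing the returned samples to `write_wav` with a file name of one's choice yields sounds that can be compared by listening. Keeping the envelope fixed while varying F0, or shifting the envelope peaks while keeping F0 fixed, gives a rough way of probing whether a given pattern of spectral maxima is tied to a perceived speaker group.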