Open access

Acoustics of the Vowel


Dieter Maurer

It seems as if the fundamentals of how we produce vowels and how they are acoustically represented have been clarified: we phonate and articulate. Using our vocal cords, we produce a vocal sound or noise which is then shaped into a specific vowel sound by the resonances of the pharyngeal, oral, and nasal cavities, that is, the vocal tract. Accordingly, the acoustic description of vowels relates to vowel-specific patterns of relative energy maxima in the sound spectra, known as patterns of formants.
The intellectual and empirical reasoning presented in this treatise, however, gives rise to scepticism with respect to this understanding of the sound of the vowel. The reflections and materials presented provide reason to argue that, up to now, a comprehensible theory of the acoustics of the voice and of voiced speech sounds is lacking, and consequently, no satisfying understanding of vowels as an achievement and particular formal accomplishment of the voice exists. Thus, the question of the acoustics of the vowel – and with it the question of the acoustics of the voice itself – proves to be an unresolved fundamental problem.

8    Lack of Correspondence between Vowels and Patterns of Relative Spectral Energy Maxima or Formant Patterns

8.1    Dependence of Vowel-Specific, Relative Spectral Energy Maxima and Lower Formants ≤ 1.5 kHz on Fundamental Frequency

If investigated empirically and systematically, it becomes evident that the first spectral envelope peak—if it exists—and the first calculated formant of vowel sounds often depend on fundamental frequency.

For a range of fundamental frequencies ≤ 350 Hz for which formant analysis is not critical in principle, this dependence is particularly evident in the sounds of the vowels /e, ø, o / at fundamental frequencies in the range of 200 Hz to 350 Hz.

For a range of fundamental frequencies > 350 Hz, this dependence is, above all, indicated in sounds of the vowels / i, y, u /, because the first harmonic generally exhibits the highest amplitude; thus, the lowest spectral peak rises with increasing fundamental frequency.
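The reasoning for fundamental frequencies > 350 Hz can be made concrete with a minimal sketch. It assumes that a sound is described by a list of harmonic levels in dB; the levels below are invented for illustration and are not measurements, and the helper name `lowest_harmonic_peak` is hypothetical. Whenever the first harmonic is the strongest, the lowest spectral peak coincides with the fundamental and therefore rises with it.

```python
def lowest_harmonic_peak(f0_hz, levels_db):
    """Return the frequency (Hz) of the lowest harmonic that is a local
    maximum of the harmonic levels, or None if the list is empty.

    levels_db[k] is the level of harmonic k + 1 (frequency (k + 1) * f0_hz).
    """
    n = len(levels_db)
    for k, level in enumerate(levels_db):
        left_ok = k == 0 or level > levels_db[k - 1]
        right_ok = k == n - 1 or level > levels_db[k + 1]
        if left_ok and right_ok:
            return (k + 1) * f0_hz
    return None


# Illustrative levels only (not measured data): first harmonic strongest.
falling_levels_db = [0.0, -8.0, -15.0, -22.0]

for f0 in (400, 500, 600):
    # With the first harmonic dominant, the lowest peak equals f0 itself.
    print(f0, lowest_harmonic_peak(f0, falling_levels_db))
```

The sketch operates on harmonic levels only. It makes no claim about how a spectral envelope or a formant frequency would be estimated from the same sound.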

In addition, such a dependence can also be observed for the second formant for cases of sounds of back vowels.

For sounds of / ε / and of / a–α /, however, indications of a dependence of F1 on fundamental frequency may prove to be weak and corresponding observations may require a comparison of sounds with a very extended vocal range.

Moreover, the observation of a dependence of F1 on fundamental frequency is not only related to frequency ranges of the latter and vowel qualities but also to single speakers and their phonation characteristics, including vocal effort. (Note that marked differences in the vocal effort of vowel production have a substantial effect on spectral peaks and calculated formant frequencies, and this effect has to be taken into account when investigating the relationship between F0, spectral peaks and formants.) But although the indications for the dependence discussed here prove to be unsystematic, the findings of intelligible vowel sounds at fundamental frequencies > 500 Hz (see next chapter) and of formant pattern ambiguity (see Chapter 9) force us to relate the lower spectral peaks and the lower formants to fundamental frequency.

The possible relationship between fundamental frequency and higher vowel-specific spectral envelope peaks or formants > 1.5 kHz for sounds of front vowels is left open here for discussion.

These assertions hold true for vowel sounds produced by one and the same speaker. Thus, they apply to vowels and their physical representation.

In this respect, what is of particular importance is the observation that the dependence of lower spectral envelope peaks and lower formants ≤ 1.5 kHz does not represent a phenomenon generally related to “oversinging” the first formant of a vowel: most importantly, the shifts of F1 in the sounds of the vowels /e, ø, o / can already be observed at fundamental frequencies substantially below the corresponding statistical values for F1 as given in the literature for sounds produced in citation-form words. Moreover, given a range of fundamental frequencies of c. 200–350 Hz, the shifts of F1 for the sounds of the vowels /e, ø, o / are in many cases much more pronounced than for the sounds of the vowels / i, y, u /, although, for the former, the literature gives significantly higher statistical values for F1 than for the latter.

Also of particular importance, and foreshadowing formant pattern ambiguity of vowel sounds (see Chapter 9), is the observation that, in many cases of sounds of a vowel produced by a single speaker, the shifts of F1 in relation to fundamental frequency exceed the F1 differences of two neighbouring vowels as given in formant statistics for a corresponding speaker group (for speakers with corresponding vocal-tract size). In line with this, the shifts mentioned also exceed speaker-group differences in F1 for that same vowel as given in the formant statistics mentioned.

Vowel-specific relative spectral energy maxima ≤ 1.5 kHz (if determinable) and calculated vowel-specific formant patterns (if methodologically substantiated) are dependent on fundamental frequency.

8.2    Vowel Perception at Fundamental Frequencies above Statistical Values of the First-Formant Frequency

Speakers possessing a large vocal range and good phonation and articulation are able to form the sounds of the vowels / i, y, e, ø, ε, a, o, u / in a recognisable and distinguishable way up to a fundamental frequency of c. 700–800 Hz. Such sounds can be readily experienced up to a fundamental frequency of c. 600 Hz because they occur frequently in everyday speech, in particular among children and women. However, these sounds can also be evidenced for men using “falsetto”.

Speakers possessing excellent vocal abilities are even able to form the sounds of the corner vowels / i, a, u / in a clearly recognisable and distinguishable way up to a fundamental frequency of c. 800–1000 Hz. (Ongoing research also indicates that other vowels, too, are intelligible in this vocal range.)

Correspondingly, the respective sound spectra exhibit vowel-specific differences, even if these have to be described other than in terms of spectral envelopes and formant patterns, for example in terms of vowel-specific configurations in the levels of the harmonics (see below, Sections 13.2 and 13.3).

Note that a fundamental frequency of 700 Hz lies above the statistical F1 values given for sounds of all long German vowels produced by women or men, except for /a / of women. A fundamental frequency of 800–1000 Hz even lies above the statistical F1 values for all long German vowels, for both women and men (see Section 2.2).

The vowel quality of sounds produced at fundamental frequencies above statistical values of the vowel-related first-formant frequency is intelligible in principle.

The possibility of such vowel production and perception contradicts the designation of established, statistically determined formant patterns as “vowel-specific” patterns, irrespective of the methodological problems of determining envelope peaks and formant frequencies. At the same time, vowel perception and discrimination at such high fundamental frequencies confirm that lower spectral energy maxima (if determinable) and lower formants (if methodologically substantiated) depend on fundamental frequency.

The vowel quality of sounds of back vowels and of /a–α / produced at fundamental frequencies > 500 Hz can be physically represented solely in terms of the first two or three harmonics and their amplitudes. This accentuates the basic problem of assuming that relative spectral energy maxima, that is, envelope peaks in closely delimited frequency ranges, are a pervasive physical characteristic of the sound of a vowel.
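The arithmetic behind this point can be sketched briefly. The example assumes the 1.5 kHz limit used in this chapter for the lower vowel-specific frequency range and simply lists the harmonics that fall within it for a given fundamental frequency; the helper name `harmonics_up_to` is hypothetical.

```python
def harmonics_up_to(f0_hz, limit_hz=1500.0):
    """List the harmonic frequencies (Hz) of f0_hz that do not exceed limit_hz."""
    count = int(limit_hz // f0_hz)
    return [k * f0_hz for k in range(1, count + 1)]


for f0 in (200, 350, 500, 700, 1000):
    # The higher the fundamental, the fewer harmonics sample the range.
    print(f0, harmonics_up_to(f0))
```

At 200 Hz, seven harmonics lie at or below 1.5 kHz; at 500 Hz only three remain, at 700 Hz two, and at 1000 Hz one. With so few components, an envelope peak in a closely delimited frequency range can hardly be determined from the spectrum.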

Here, the question of the maximal fundamental frequency up to which all vowels of any given language can in principle be produced in a recognisable way is left open for discussion.

8.3    “Inversions” of Relative Spectral Energy Maxima and Minima and “Inverse” Formant Patterns in Sounds of Individual Vowels

Given that spectral envelope peaks ≤ 1.5 kHz (if determinable) depend on fundamental frequency, pairs of sounds of a back vowel produced at different fundamental frequencies can exhibit “inverse” relative spectral maxima and minima in the form of “inverse” spectral envelope curves ≤ 1.5 kHz without any change in vowel perception: whereas we see a relative minimum in the spectrum for one sound, we may observe a spectral maximum for the other, and vice versa. The same holds true for comparisons between the respective calculated filter curves and formant patterns (if methodologically substantiated): where for one sound, the filter curve exhibits a relative minimum, for another sound, the curve may exhibit a maximum, and vice versa.

In the case of some front vowels, such “inversions” can also be observed for the higher vowel-specific frequency range, even if the question of the relationship between such “inversions” and fundamental frequency variation is left open here. 

This observation reaffirms the lack of a general correspondence between vowels, vowel-specific spectral envelope curves and corresponding formant patterns.

With regard to vowel-specific frequency ranges, the spectral envelope curves of two sounds of the same vowel produced at two different fundamental frequencies can exhibit “inverse” behaviour. The same holds true for formant patterns.
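One possible way of checking a pair of sounds for such “inverse” behaviour is sketched below. It assumes that both spectral envelope curves have already been estimated and resampled onto the same frequency grid, which is itself a methodological decision. The envelope values are synthetic and serve illustration only, and the function names are hypothetical.

```python
def local_extrema(values):
    """Map index -> 'max' or 'min' for strict interior local extrema."""
    kinds = {}
    for i in range(1, len(values) - 1):
        if values[i] > values[i - 1] and values[i] > values[i + 1]:
            kinds[i] = "max"
        elif values[i] < values[i - 1] and values[i] < values[i + 1]:
            kinds[i] = "min"
    return kinds


def inverse_points(freqs_hz, env_a_db, env_b_db):
    """Return the frequencies at which one envelope has a local maximum
    and the other a local minimum (both sampled on the same grid)."""
    ext_a = local_extrema(env_a_db)
    ext_b = local_extrema(env_b_db)
    return [
        freqs_hz[i]
        for i in sorted(ext_a.keys() & ext_b.keys())
        if ext_a[i] != ext_b[i]
    ]


# Synthetic envelopes for illustration only (not measured vowel sounds).
freqs_hz = [250, 500, 750, 1000, 1250]
env_low_f0_db = [-6.0, 0.0, -12.0, -4.0, -20.0]   # peaks at 500 and 1000, dip at 750
env_high_f0_db = [-3.0, -14.0, 0.0, -9.0, -25.0]  # dip at 500, peak at 750

print(inverse_points(freqs_hz, env_low_f0_db, env_high_f0_db))
```

For the synthetic pair above, the check reports 500 Hz and 750 Hz as points of inversion. Whether two sounds showing this pattern are perceived as the same vowel has to be established separately, by listening tests.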

8.4    Addition: Whispered Vowel Sounds, Fundamental-Frequency Dependence of Vowel-Specific Spectral Characteristics and “Inversions”

As discussed in Section 5.5, formant statistics indicate increased vowel-specific formant frequencies F1 and F2 for whispered sounds when compared to voiced sounds. However, according to the corresponding recording procedures of the comparative investigations, this only applies to the lower range of fundamental frequency of the voiced sounds produced in citation-form words, comparable to relaxed speech in an enclosed space.

Given that a whispered sound exhibits higher first and second formants than a voiced sound of the same vowel and given that the latter’s fundamental frequency is gradually increased during its production, then in many cases it is possible to determine a certain fundamental frequency for which F1 and F2 of the whispered and voiced sound correspond with each other.

Whether this represents an actual rule is left open here.

If the fundamental frequency of a voiced sound is increased further, then there will be cases in which F1 or F1–F2 of the whispered sound are lower than F1 or F1–F2 of the voiced sound.

In any event, the general statement that whispered sounds exhibit fundamentally higher vowel-specific formant patterns than voiced sounds does not apply.

Over the course of such experimentation, cases involving comparisons between whispered and voiced sounds exhibiting the described “inversions” may also be found.

8.5    Addition: Resynthesis and Synthesis

All the above aspects of the lack of correspondence between vowels and patterns of relative spectral energy maxima or formant patterns, discussed in relation to natural vowel sounds, can be evaluated and replicated using resynthesis.

The same holds true for resynthesis at fundamental frequencies > 350 Hz related directly to the harmonic spectra of natural vowel sounds.

The same also applies to synthesis involving formant patterns or harmonic spectra not derived directly from natural vowel sounds.
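As an indication of what synthesis from a harmonic spectrum can look like in its simplest form, the following sketch sums sine components at multiples of the fundamental frequency. It is a generic additive-synthesis example, not the procedure used for the investigations reported here. The function name, the sine starting phase, the sample rate and the harmonic levels are all assumptions made for illustration.

```python
import math


def synthesize_harmonics(f0_hz, levels_db, duration_s=0.5, sample_rate=44100):
    """Additive synthesis: sum sine harmonics of f0_hz with the given levels.

    levels_db[k] is the level of harmonic k + 1 in dB relative to an
    arbitrary reference. All harmonics start in sine phase (an assumption).
    Returns a list of float samples normalised to a peak of 1.0.
    """
    nyquist = sample_rate / 2.0
    partials = [
        ((k + 1) * f0_hz, 10.0 ** (level / 20.0))
        for k, level in enumerate(levels_db)
        if (k + 1) * f0_hz < nyquist  # skip harmonics that would alias
    ]
    n_samples = round(duration_s * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        samples.append(
            sum(a * math.sin(2.0 * math.pi * f * t) for f, a in partials)
        )
    peak = max((abs(s) for s in samples), default=0.0)
    return [s / peak for s in samples] if peak > 0.0 else samples


# Illustrative levels only (not derived from a natural vowel sound).
signal = synthesize_harmonics(600.0, [0.0, -6.0, -18.0], duration_s=0.05)
print(len(signal), max(abs(s) for s in signal))
```

The returned samples can be written to a sound file for listening. A resynthesis would replace the illustrative levels with the harmonic levels measured in a natural vowel sound.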