Acoustics of the Vowel

Preliminaries

Dieter Maurer

It seems as if the fundamentals of how we produce vowels and how they are acoustically represented have been clarified: we phonate and articulate. Using our vocal cords, we produce a vocal sound or noise which is then shaped into a specific vowel sound by the resonances of the pharyngeal, oral, and nasal cavities, that is, the vocal tract. Accordingly, the acoustic description of vowels relates to vowel-specific patterns of relative energy maxima in the sound spectra, known as patterns of formants.
The intellectual and empirical reasoning presented in this treatise, however, gives rise to scepticism with respect to this understanding of the sound of the vowel. The reflections and materials presented provide reason to argue that, up to now, a comprehensible theory of the acoustics of the voice and of voiced speech sounds is lacking, and consequently, no satisfying understanding of vowels as an achievement and particular formal accomplishment of the voice exists. Thus, the question of the acoustics of the vowel – and with it the question of the acoustics of the voice itself – proves to be an unresolved fundamental problem.

Materials Part I

The first part of the Materials section contains selected excerpts from the literature that are related to the first part of the main text.

M1     Prevailing Theory

Vowels

“Vowel […]. 1. (also vocoid) In phonetics, a segment whose articulation involves no significant obstruction of the airstream, such as [a], [i] or [u]. Strictly speaking, a glide such as [j] or [w] may also be regarded as a (brief) vowel in this sense. 2. In phonology, a segment which forms the nucleus of a syllable. 3. Any letter of the alphabet which, generally or in a particular case, represents a vowel in sense 2.” (Trask, 1996, p. 382)

“Vocoid […]. 1. A synonym for vowel in the phonetic sense of that term (sense 1), introduced in an effort to remove the ambiguity between the phonetic and the phonological sense of ‘vowel’. While possibly useful, the term has never become established. Pike (1943). 2. More narrowly, a vocoid in sense 1 which is also syllabic: a true vowel, as opposed to a glide or approximant. Sense 2: Laver (1994).” (Trask, 1996, p. 378)

“Vowels and Consonants. Phonetics has traditionally classified the segments of speech into two basic varieties which are called vowels and consonants. Once again, there has never been a straightforward definition of these terms. Early linguists in India also grappled with the concepts of vowel, consonant, and syllable around 800 BC, and they recognized that the three notions are hopelessly intertwined […]. The definitions used here will be similar to those of the ancient Sanskrit scholars, and in fact, the development of modern phonetics in the West owes much to the transmission of knowledge in translation from the Sanskrit sources.

A vowel is defined as a ‘vowel-like segment’ (what Pike […] termed a vocoid) that occupies the nucleus of a syllable. A segment is considered to be a vocoid when its articulation permits the relatively free passage of air through the center of the mouth. This definition is also rather loose, but in roughly familiar terms, most segments that are at least as open as an English w or y-sound (the latter is transcribed [j] in IPA) are vocoids, all others being non-vocoids. A consonant is then defined simply as a non-vocoid, no matter what syllable position it occupies. This imperfect dichotomy leaves room for a middle category, that of the semivowel, which is defined as a vocoid located outside the nucleus of a syllable. Semivowels, in spite of being vocoids, are usually regarded as a special sort of consonant (often called a ‘glide’) in the interests of preserving the consonant-vowel dichotomy. The interplay of consonants, vowels, and syllables in the speech stream is given a slightly different (more acoustic) view by Orlikoff and Kahane: ‘Consonants differ from vowels primarily by the amount of vocal tract constriction employed in their production […] Speech can be considered to be an overlay of consonants on the vocal signal. The dispersion of consonants results in an amplitude modulation of the acoustic energy that, for the most part, gives rise to our perception of syllables.’” (Fulop, 2011, pp. 8–9)

Speech production: source and filter

“The speech wave is the response of the vocal tract filter systems to one or more sound sources. This simple rule, expressed in the terminology of acoustic and electrical engineering, implies that the speech wave may be uniquely specified in terms of source and filter characteristics. In spite of the technical phrasing it is apparent that this statement also covers essentials of the phonetician’s concept of speech production.” (Fant, 1960, p. 15)

See also Chapter M4.

Formants

“The spectral peaks of the sound spectrum |P(f)| are called formants. Referring to Fig. 1.1-2, it may be seen that one such resonance has its counterpart in a frequency region of relatively effective transmission through the vocal tract. This selective property of |T(f)| is independent of the source. The frequency location of a maximum in |T(f)|, i.e. the resonance frequency, is very close to the corresponding maximum in spectrum P(f) of the complete sound. Conceptually these should be held apart, but in most instances resonance frequency and formant frequency may be used synonymously. Thus, for technical applications dealing with voiced sounds it is profitable to define formant frequency as a property of T(f).

The basic principle of the theory of voiced sounds is that, to a first order of approximation, the filter function is independent of the source. The formant peak will thus only accidentally coincide with the frequency of a harmonic. The formant frequencies can change only as a result of an articulatory change affecting the dimensions of the various parts of the vocal tract cavity system and thus the filter function. Conversely, but with the limitations implied by the concept of compensatory forms of articulation, the formant frequencies provide information about the position of the speaker’s articulatory organs. If these formant frequencies are held constant and the fundamental frequency is raised one octave, the result is ideally that twice as many pulses per second are emitted from the voice organs. The distance between adjacent harmonics in the spectrum will be doubled, and the number of harmonics up to a certain fixed frequency limit will thus be halved. If a specific formant, for instance the first, comes close to the 6th harmonic at the lower pitch, it will be the 3rd harmonic that comes closest to the same formant in the case of the higher pitch. The concepts of formant frequency and harmonic number should not be confused.” (Fant, 1960, p. 20)

See also Chapters M4 and M6.
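
The octave example in the passage quoted from Fant (1960, p. 20) can be checked numerically. The following is a minimal sketch only: the first-formant value of 600 Hz, the fundamental frequencies of 100 and 200 Hz, the 4000 Hz limit, and the function names are illustrative assumptions and do not come from the quoted source.

```python
def nearest_harmonic(formant_hz, f0_hz):
    """Return the number of the harmonic of f0 that lies closest to the formant."""
    # Harmonic n lies at n * f0; numbering starts at 1 (the fundamental).
    return max(1, round(formant_hz / f0_hz))


def harmonics_below(limit_hz, f0_hz):
    """Count the harmonics of f0 at or below a fixed frequency limit."""
    return int(limit_hz // f0_hz)


if __name__ == "__main__":
    formant = 600.0  # assumed first formant in Hz, held constant
    limit = 4000.0   # assumed fixed frequency limit in Hz
    for f0 in (100.0, 200.0):  # the second value is one octave above the first
        print(f0, nearest_harmonic(formant, f0), harmonics_below(limit, f0))
```

Under these assumptions the formant stays at 600 Hz while the number of the harmonic nearest to it changes from 6 to 3, and the count of harmonics below the limit is halved. This is the distinction between formant frequency and harmonic number that the passage insists on.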

Vowel-specific formants

“Usually vowels can be quite well characterized in terms of the frequencies of just the first and second formants, but the third formant should also be measured for high front vowels and for r-colored vowels.” (Ladefoged, 2003, p. 105)

Age- and gender-specific formants

“The length of the pharyngeal-oral tract depends on the physical size of the speaker. The length affects the frequency locations of all of the vowel formants; this fact helps us to predict where the formant peaks in the spectrum will appear for men, women, and children. A very simple rule relates the frequencies of the formants to the overall length of the tract from glottis through lips. The rule for this relation is:

Length Rule. The average frequencies of the vowel formants are inversely proportional to the length of the pharyngeal-oral tract. In other words, the longer the tract, the lower are its average formant frequencies.

The neutral vowel formants for the average man, with an oral tract 17.5 cm in length, are at 500, 1500, 2500 Hz, and so on, with the lowest formant at 500 Hz and frequency spacing of 1000 Hz between all formants.

An easy way to remember the neutral formant frequencies is to think of the odd numbers 1, 3, 5, 7, 9, and so on, because the formant frequencies of a uniform tube that is closed at one end and open at the other, like the pharyngeal-oral tract, are always odd multiples of the frequency of the lowest formant. For example, begin with the basic formant frequency, 500 Hz, as the unit or 1; then the formant frequencies above that are 500 × 3 = 1500 Hz, 500 × 5 = 2500 Hz, and so on. This method, calculating the formants above F1 as multiples of F1, applies only as a model of a neutral tract shape.

The pharyngeal-oral tract length of an infant is approximately half the length of that of a man. Therefore, following our Length Rule about formant frequency locations, the formants of a neutral-shaped infant tract in relation to a man’s would be at frequency locations that are a factor of the reciprocal of ½, or twice those of the man. On this basis the infant formant locations for a neutral vowel would be as follows: F1 is 500 × 2 = 1000 Hz, F2 is 1500 × 2 = 3000 Hz, F3 is 2500 × 2 = 5000 Hz, and so on.

Following the same procedure, a woman’s vocal tract, on the average, is about 15% shorter than that of a man. The ratio corresponding to this amount of shortening is approximately 5/6. The reciprocal of 5/6 is 6/5, which is equal to a factor of 1.20, which, when multiplied by the man’s neutral formant frequencies, gives the woman’s values of 20% higher: F1 is 500 × 1.2 = 600 Hz, F2 is 1500 × 1.2 = 1800 Hz, F3 is 2500 × 1.2 = 3000 Hz, and so on. […]

The Length Rule tells us approximately where we may find the formants for the very young as well as for older, larger persons. However, the neutral locations of F1 and F2 for an individual are also affected by the length proportions of the vocal tract between the oral and pharyngeal cavities (Fant, 1973, Chapter 4). In general, the location and spacing of formants F3 and above are more closely correlated with length of vocal tract than for F1 and F2. The average locations of F1 and F2 for an individual are also affected somewhat by language environment and training.” (Pickett, 1999, pp. 38–40)

See also Chapter M5.
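
The figures in the passage quoted from Pickett (1999) are consistent with the standard quarter-wavelength model of a uniform tube closed at one end, F_n = (2n − 1) · c / (4L). The sketch below reproduces them under one stated assumption: a speed of sound of 350 m/s (35,000 cm/s), which is not given in the quoted passage but yields 500 Hz for a 17.5 cm tract. The function and constant names are hypothetical.

```python
SPEED_OF_SOUND_CM_S = 35_000.0  # assumed value; not stated in the quoted passage


def neutral_formants(tract_length_cm, count=3):
    """Resonances of a uniform tube closed at one end: F_n = (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * tract_length_cm)
            for n in range(1, count + 1)]


def scale_by_length(formants_hz, length_ratio):
    """Length Rule: formants are inversely proportional to tract length.

    length_ratio is the tract length relative to the reference speaker,
    e.g. 0.5 for a tract half as long.
    """
    return [f / length_ratio for f in formants_hz]


if __name__ == "__main__":
    man = neutral_formants(17.5)             # neutral tract of the average man
    infant = scale_by_length(man, 0.5)       # tract about half as long
    woman = scale_by_length(man, 5.0 / 6.0)  # ratio as rounded in the passage
    print(man, infant, woman)
```

The ratio 5/6 is the rounded value used in the passage. Applying the stated 15% shortening directly (a ratio of 0.85) would give a somewhat smaller factor, about 1.18 instead of 1.20.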

M2     Prevailing Empirical References

Illustration: including radiation factor/radiation impedance

For a more differentiated graphic illustration, showing a 12 dB/octave slope of the source and a 6 dB/octave intensity increase because of the radiation impedance, see Ladefoged (1996, p. 104), Figure 7.7 and the related comment: “Figure 7.7 shows a source-filter view of the production of a vowel. The spectrum of the glottal pulse is shown on the left of the figure. In this case we have taken the vocal folds to be vibrating at 100 Hz, so the components are at 100 Hz intervals. To the right of the spectrum is the set of curves specifying the vocal tract response. The output of the vocal tract can be regarded as the input to another box entitled ‘radiation factor,’ which we must now take into account. […] these vibrations […] inside the mouth […] are not themselves the variations in air pressure that we hear. The air in the vocal tract vibrates so that the air particles at the open end between the lips move backward and forward. It is these movements that start the air outside the lips vibrating. The air between the lips acts like a piston, a source of sound producing variations in air pressure that radiate out from the lips just as the variations in air pressure radiate out from a source of sound such as a tuning fork. The movements of this piston of air are more effective in causing variations in pressure in the surrounding air at some frequencies than others. The higher the frequency, the greater the response of the surrounding air to the action of the air vibrating in the vocal tract. This effect, which we have termed the ‘radiation factor’ (‘radiation impedance’ is the term used in more technical books), can be regarded as a kind of filter that boosts the higher frequencies by 6 dB per octave. The curve representing the radiation factor is shown above the third box in figure 7.7.

The output produced at the lips depends on the vocal cord source, the filtering action of the vocal tract, and the further modifications produced by the radiation factor. Normally the vocal cord source is the same for each vowel, apart from variations of pitch. The vocal folds may be vibrating at 100 Hz, or at 200 Hz, as in the examples we have been considering, or at any other frequency in the range of the human voice. But irrespective of the fundamental frequency, the spectral slope of the cord pulse will usually be approximately −12 dB per octave. The filtering action of the vocal tract will be different for each position of the vocal organs, thus producing formants (peaks in the resonance curve) at different frequencies. The spectrum of the waveform beyond the lips (shown on the right of figure 7.7) will have peaks in regions which depend on the filter characteristics of the vocal tract. The general slope of the output spectrum will be influenced by the slope of the spectrum of the glottal pulse (−12 dB/octave) and the radiation factor (+6 dB/octave). Taken together these two slope factors account for a −6 dB/octave slope in the output spectrum. The major characteristics of the output spectrum – the formant peaks – are superimposed on this general slope. They are primarily dependent on the filtering characteristics of the vocal tract.” (Ladefoged, 1996, pp. 104–105)
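
The slope arithmetic in the passage quoted from Ladefoged (1996) can be illustrated with a short calculation. This is a sketch under simplifying assumptions: both slopes are treated as ideal straight lines on a dB-per-octave scale, levels are expressed relative to a 100 Hz fundamental, and the vocal tract filter (the formant peaks) is left out entirely. The function names are hypothetical.

```python
import math

SOURCE_SLOPE_DB_PER_OCT = -12.0   # glottal pulse spectrum, as given in the passage
RADIATION_SLOPE_DB_PER_OCT = 6.0  # radiation factor, as given in the passage


def level_db(freq_hz, ref_hz, slope_db_per_oct):
    """Level relative to ref_hz for an idealized constant dB/octave slope."""
    # log2 of the frequency ratio gives the distance from ref_hz in octaves.
    return slope_db_per_oct * math.log2(freq_hz / ref_hz)


def output_envelope_db(freq_hz, ref_hz=100.0):
    """Source slope plus radiation slope; formant peaks are not modeled."""
    return (level_db(freq_hz, ref_hz, SOURCE_SLOPE_DB_PER_OCT)
            + level_db(freq_hz, ref_hz, RADIATION_SLOPE_DB_PER_OCT))


if __name__ == "__main__":
    for harmonic_hz in (100.0, 200.0, 400.0, 800.0):
        print(harmonic_hz, output_envelope_db(harmonic_hz))
```

Because decibel values add where the underlying filters multiply, the two slopes combine to −6 dB for each doubling of frequency, which matches the general output slope described in the passage.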

Formant statistics by Fant et al.

With regard to the study of Fant (1959; see Section 2.1, Table 3), see also the later study of Fant, Henningsson, and Stalhammar (1969) concerning statistical formant patterns for long Swedish vowels produced by men.

Formant statistics for Standard German

Older studies concerning formant patterns of German vowels were published by Jørgensen (1969), Iivonen (1970, 1986), Rausch (1972), Wängler (1981), and Ramers (1988). For further indications of formant statistics for Standard German, see the online digital version of the materials.

Formant statistics for other languages

For further indications of formant statistics of other languages, see also the online digital version of the materials.