The intellectual and empirical reasoning presented in this treatise, however, gives rise to scepticism with respect to this understanding of the sound of the vowel. The reflections and materials presented provide reason to argue that, up to now, a comprehensible theory of the acoustics of the voice and of voiced speech sounds is lacking, and consequently, no satisfying understanding of vowels as an achievement and particular formal accomplishment of the voice exists. Thus, the question of the acoustics of the vowel – and with it the question of the acoustics of the voice itself – proves to be an unresolved fundamental problem.
1 Prevailing Theory
With respect to human utterances, the following is said to apply: The vocal cords, when oscillating and modulating air expelled from the lungs, produce a sound (a source sound), which is transformed by the resonances of the pharyngeal, oral and nasal cavities: depending on the position of the larynx, velum, tongue, lips and jaw, different shapes of these cavities are formed, thus creating different resonance characteristics and allowing different vocal sounds (phones) to be produced and perceived accordingly. If a vocal sound is perceived to belong to a particular linguistic unit (more precisely, a basic linguistic unit, a phoneme), and if the cavity formed by the pharynx and the mouth remains open, then the sound produced is referred to as a vowel sound and its linguistic identity as a vowel quality or simply as a vowel (see the introduction).
According to this approach, the production of a vowel sound involves two quasi-independent processes: the production of sound and its transformation by resonance, termed phonation and articulation. Sound production or phonation is not vowel specific. By contrast, the respective resonance effect or articulation is vowel specific. The two-part model arising from such an understanding of speech production is known as the source-filter model of speech production.
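The two-part model can be illustrated with a minimal numerical sketch. It is not taken from the treatise: the source slope of -12 dB per octave, the idealised two-pole resonance and all resonance values are assumptions chosen for illustration, and the function names are hypothetical. The same source spectrum is passed through two different resonance patterns, and the harmonics that stand out above their neighbours are reported.

```python
import math

def resonance_gain(freq_hz, centre_hz, bandwidth_hz):
    """Gain of one idealised two-pole resonance, normalised to 1 at 0 Hz."""
    pole = complex(-bandwidth_hz / 2.0, centre_hz)
    point = complex(0.0, freq_hz)
    return abs(pole) ** 2 / (abs(point - pole) * abs(point - pole.conjugate()))

def harmonic_levels(f0_hz, resonances, n_harmonics=30):
    """Return (frequency, level in dB) for each harmonic of the radiated sound."""
    levels = []
    for k in range(1, n_harmonics + 1):
        freq = k * f0_hz
        source = 1.0 / k ** 2  # assumption: source falls by 12 dB per octave
        filt = 1.0
        for centre, bandwidth in resonances:
            filt *= resonance_gain(freq, centre, bandwidth)
        levels.append((freq, 20.0 * math.log10(source * filt)))
    return levels

def envelope_peaks(levels):
    """Frequencies of harmonics that are stronger than both neighbours."""
    return [levels[i][0] for i in range(1, len(levels) - 1)
            if levels[i - 1][1] < levels[i][1] > levels[i + 1][1]]

# Hypothetical resonance patterns (centre, bandwidth) in Hz for two articulations.
articulation_a = [(700.0, 80.0), (1200.0, 90.0)]
articulation_b = [(300.0, 60.0), (2200.0, 100.0)]

# Same source (f0 = 100 Hz) in both cases; only the filter differs.
peaks_a = envelope_peaks(harmonic_levels(100.0, articulation_a))
peaks_b = envelope_peaks(harmonic_levels(100.0, articulation_b))
print(peaks_a, peaks_b)
```

In this sketch the source is identical for both articulations, so any difference between the two lists of peaks stems from the filter alone, which is the separation of phonation and articulation that the model asserts.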
Physiologically, the perceived linguistic identity of a vowel sound corresponds to a vowel-specific articulation in terms of an ensemble of possible positions of the vocal tract, which produce quasi-identical (that is, very similar) patterns of resonances.
Acoustically, the perceived linguistic identity of a vowel sound corresponds to vowel-specific spectral energy maxima, which are quasi-identical for all vowel sounds of the same vowel quality. In acoustic analysis, these spectral energy maxima appear as spectral envelope peaks, generally known as formants.
In cases of whispered vowels, phonation does not involve periodic sound, but noise.
In general, not all formants of a vowel but only the first two (those lowest in frequency) correspond to a perceived vowel quality. The higher formants relate to other qualities of vocal expression.
In certain languages, exceptions to this rule concern sounds of high front vowels and of r-coloured front vowels. In such cases, the frequencies of the first two formants of sounds of two vowels are quasi-identical, and only the difference in the frequency of the third formant corresponds to the difference in the perceived vowel quality.
In general, children have a considerably smaller vocal tract than adults, just as women have a smaller tract than men. Because of this, the acoustic correspondence between vowel qualities and formant patterns, formulated above in general terms, is related to the different speaker groups of children, women and men in terms of age and gender: thus, for each group and the respective average vocal-tract length, the sounds of a given vowel correspond physiologically to a specific articulation involving a specific resonance pattern, and acoustically to a specific formant pattern.
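The dependence of resonance patterns on vocal-tract length can be made concrete with a common textbook idealisation, which is an assumption added here and not part of the treatise: the tract is treated as a uniform tube closed at the glottis and open at the lips, whose resonances lie at odd multiples of c/(4L). The speed of sound and the three tract lengths below are illustrative values only.

```python
def tube_resonances(tract_length_m, count=3, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other."""
    return [(2 * n - 1) * speed_of_sound / (4.0 * tract_length_m)
            for n in range(1, count + 1)]

# Hypothetical average vocal-tract lengths in metres, for illustration only.
group_lengths = {"children": 0.12, "women": 0.15, "men": 0.175}

for group, length in group_lengths.items():
    print(group, [round(r) for r in tube_resonances(length)])
```

Under this idealisation a shorter tract shifts all resonances upwards by the same factor, which is why the correspondence between vowel quality and formant pattern is stated separately for each speaker group.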
The geometry, and thus the resonances, of the glottal area of the vocal tract vary for different types of phonation. Therefore, for example, the formant patterns of voiced and whispered vowel sounds of one perceived vowel quality differ substantially. Consequently, the acoustic correspondence between vowels and formant patterns must also be related to the various types of phonation: thus, for each single speaker group too, depending on the respective average vocal-tract length and type of phonation, the sounds of a given vowel correspond physiologically to a specific articulation involving a specific resonance pattern, and acoustically to a specific formant pattern.
Existing empirical reference values for formant patterns (formant statistics) predominantly concern voiced vowel sounds produced in citation-form words, comparable to relaxed speech with limited variation of fundamental frequency. Statistical reference values for vowel sounds involving other phonation types are rare. Further, the various kinds of phonation are related to different methodological problems of formant pattern estimation. The following discussion therefore concentrates on voiced vowel sounds. Only passing reference is made to vowel sounds involving other types of phonation.
Nasal vowel sounds are also related to specific methodological problems of formant pattern estimation and are therefore not considered here either. Hence, the following discussion is restricted to voiced oral vowel sounds.
The perception of vowel sounds can depend on the semantic context: in some cases, a vowel sound embedded in a syllable or a word may be perceived as a certain vowel quality, which, if extracted from the context and presented as an isolated sound fragment, may be perceived to have a different quality.
Whether or not the perception of vowel sounds can also depend directly on their syntactic context, for example when produced in nonsense syllables or non-words, is left open here.
Consequently, the discussion of the acoustic correspondence between vowels and formant patterns is further restricted to vowel sounds produced in isolation or extracted from a concrete syntactic or semantic context.
In general, single voiced oral vowel sounds that feature a perceivably constant vowel quality, a quasi-constant fundamental frequency and a quasi-constant loudness throughout their entire duration exhibit the characteristics of a quasi-periodic sound wave. With regard to the physical representation of the vowel quality, the corresponding spectral characteristics of such vowel sounds can be described in terms of the average harmonic spectrum of a sound, including the respective spectral envelope and, if occurring, its peaks, and with the latter the corresponding formant patterns.
This does not apply to vowel sounds whose perceived vowel quality, fundamental frequency, or loudness is subject to substantial variation. So as to exclude the ensuing questions about a possible influence of such variations on the perception of vowel qualities and their spectral representation, the following discussion focuses on vowel sounds as monophthongs that possess quasi-constant sound characteristics. Vowel sounds lacking such sound characteristics are again discussed only in passing and by way of incidental comments.
In the first instance, the acoustic correspondence between vowels and formant patterns only applies to speakers and listeners belonging to the same speech community: quasi-constant vowel production and perception exist among the members of such a community, who accordingly attribute sound variations either to one and the same vowel quality or to different vowel qualities.
However, the methodological question of how to determine empirically the consistency of such an attribution is not discussed further here. The present discussion generally assumes that the vowel sounds considered, when subjected to a concrete identification test involving listeners of one speech community, specially trained for such a perception test, will exhibit a consistent attribution substantially above a 50% level for any given vowel quality.
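The assumed criterion can be written down as a small sketch. The treatise does not specify how far above the 50% level an attribution must lie to count as substantial, so the margin used below is a hypothetical parameter, as are the function names and the listener responses.

```python
from collections import Counter

def attribution_rate(responses, intended_vowel):
    """Share of listener responses that name the intended vowel quality."""
    if not responses:
        raise ValueError("at least one listener response is required")
    return Counter(responses)[intended_vowel] / len(responses)

def is_consistent(responses, intended_vowel, level=0.5, margin=0.25):
    """True if the attribution rate exceeds the level by at least the margin.

    The margin is an assumed stand-in for 'substantially above'.
    """
    return attribution_rate(responses, intended_vowel) >= level + margin

# Hypothetical responses of ten listeners to one sound intended as /a/.
responses = ["a", "a", "a", "o", "a", "a", "a", "a", "e", "a"]
rate = attribution_rate(responses, "a")
print(rate, is_consistent(responses, "a"))
```

A sound that only half of the listeners attribute to the intended vowel quality would fail this check, whatever margin is chosen.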
Yet to be discussed elsewhere are correspondences that reach beyond one particular speech community as well as one particular linguistic community.
If:
– vowel sounds are produced by individuals belonging to one of the three speaker groups of children, women, or men of a given speech community;
– vowel sounds are either produced as isolated voiced oral sounds or as voiced oral sound fragments extracted from their concrete syntactic and semantic context of production, with neither transitions at the beginning nor the end;
– vowel sounds are produced with a quasi-constant fundamental frequency and loudness and exhibit the characteristics of a quasi-periodic sound wave;
– vowel sounds are perceived as belonging to one vowel quality by other individuals of the same speech community;
then the following applies to the individual vowel sound:
– physiologically, its perceived linguistic identity as a specific vowel quality corresponds to a specific position of the vocal tract which, by means of its first two resonances in order of frequency (in some cases of high front vowels and r-coloured front vowels of certain languages, the first three), transforms the source sound of the vocal cords into that sound;
– acoustically, its perceived vowel quality hence corresponds to the first two (or the first three) lower formants of the sound spectrum.
Given the same assumptions, for two vowel sounds perceived as two different vowel qualities, this implies that:
– physiologically, the difference in vowel perception corresponds to two different positions of the vocal tract, each with a different pattern of the lower two (or three) resonances;
– acoustically, the difference in vowel perception corresponds to two different patterns of the first two (or first three) lower formants of their respective spectra.
For the sounds of a particular vowel, albeit produced by speakers of different speaker groups, this implies that:
– physiologically, their perceived linguistic identity as the same vowel quality corresponds to different patterns of the first two (or first three) lower resonances of the vocal tract, related to the difference in average vocal tract length of the speaker groups compared;
– acoustically, their perceived linguistic identity as the same vowel quality hence corresponds to different speaker group-specific patterns of the first two (or first three) lower formants of the respective spectra.
These formulations are central to the prevailing theory of the physical representation of the vowel.
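For isolated, voiced oral vowel sounds that possess quasi-constant sound characteristics and are produced by individuals belonging to a given speech community and a given speaker group of children, women, or men, the following applies (with R1, R2 and R3 denoting the first three resonances of the vocal tract, and F1, F2 and F3 the first three formants of the sound spectrum):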
– vowel sounds perceived as one vowel quality correspond to quasi-identical (that is, similar) R1–R2 (R1–R2–R3 in some cases of high front vowels and r-coloured front vowels in certain languages) and, at the same time, quasi-identical F1–F2 (or F1–F2–F3, respectively);
– vowel sounds perceived as different vowel qualities correspond to dissimilar R1–R2 (R1–R2–R3, respectively) and, at the same time, dissimilar F1–F2 (F1–F2–F3, respectively).
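The two statements above can be read as a simple comparison rule, sketched below. The treatise gives no numerical definition of "quasi-identical", so the relative tolerance of 10% is an assumption, and the formant values are invented purely for illustration.

```python
def quasi_identical(pattern_a, pattern_b, tolerance=0.10):
    """True if corresponding formants differ by no more than the relative tolerance.

    The tolerance is a hypothetical stand-in for 'quasi-identical'.
    """
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must list the same number of formants")
    return all(abs(fa - fb) <= tolerance * max(fa, fb)
               for fa, fb in zip(pattern_a, pattern_b))

# Hypothetical F1-F2 patterns in Hz for three sounds of one speaker group.
sound_1 = (300.0, 2300.0)
sound_2 = (320.0, 2250.0)
sound_3 = (700.0, 1200.0)

print(quasi_identical(sound_1, sound_2))  # patterns the theory assigns to one vowel quality
print(quasi_identical(sound_1, sound_3))  # patterns the theory assigns to different qualities
```

For the cases in which the third formant is said to matter, the same function accepts three-element patterns F1–F2–F3 without change.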
Figure 1 is an illustration of this prevailing understanding of vowel production and perception, typical of many publications in the field. (The illustration is simplified in that it lacks any differentiation of the actual characteristics of the source spectrum on the one hand, and of the radiation impedance occurring when a sound is emitted into space on the other. This differentiation is not discussed further here because it is irrelevant to the present argument.)
Figure 2 shows examples of spectra, filter curves (LPC curves) and formant patterns (maxima of filter curves) of specially selected sounds of different vowels. This kind of illustration, which is limited to the acoustic perspective, is also widespread in the literature.
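How a filter curve and its maxima can be obtained from a sound is sketched below. This is not the procedure used for Figure 2, whose analysis settings are not stated here. The sketch applies the autocorrelation method with the Levinson-Durbin recursion, one standard way of computing LPC coefficients, to a synthetic test sound. The sampling rate, the LPC order, the two resonances at 500 and 1500 Hz and the excitation by a single pulse (instead of a quasi-periodic voice source) are all assumptions made to keep the example small.

```python
import cmath
import math

def resonator_poly(freq_hz, bandwidth_hz, sample_rate):
    """Denominator coefficients [1, a1, a2] of a two-pole digital resonator."""
    radius = math.exp(-math.pi * bandwidth_hz / sample_rate)
    angle = 2.0 * math.pi * freq_hz / sample_rate
    return [1.0, -2.0 * radius * math.cos(angle), radius ** 2]

def convolve(p, q):
    """Multiply two polynomials given as coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pv in enumerate(p):
        for j, qv in enumerate(q):
            out[i + j] += pv * qv
    return out

def impulse_response(a, n_samples):
    """Response of the all-pole filter 1/A(z) to a single unit pulse."""
    y = []
    for n in range(n_samples):
        acc = 1.0 if n == 0 else 0.0
        for k in range(1, min(len(a), n + 1)):
            acc -= a[k] * y[n - k]
        y.append(acc)
    return y

def lpc(signal, order):
    """LPC coefficients [1, a1, ..., ap]: autocorrelation plus Levinson-Durbin."""
    n = len(signal)
    r = [sum(signal[i] * signal[i + lag] for i in range(n - lag))
         for lag in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient of stage i
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def filter_curve_maxima(a, sample_rate, step_hz=10.0):
    """Frequencies (Hz) at which the filter curve |1/A| has a local maximum."""
    n_points = int(sample_rate / 2.0 / step_hz)
    freqs = [i * step_hz for i in range(n_points + 1)]
    gains = []
    for f in freqs:
        z = cmath.exp(-2j * math.pi * f / sample_rate)
        gains.append(1.0 / abs(sum(c * z ** k for k, c in enumerate(a))))
    return [freqs[i] for i in range(1, len(gains) - 1)
            if gains[i - 1] < gains[i] > gains[i + 1]]

SAMPLE_RATE = 8000.0  # assumed sampling rate in Hz
true_a = convolve(resonator_poly(500.0, 100.0, SAMPLE_RATE),
                  resonator_poly(1500.0, 100.0, SAMPLE_RATE))
sound = impulse_response(true_a, 400)
estimated_a = lpc(sound, order=4)
formants = filter_curve_maxima(estimated_a, SAMPLE_RATE)
print(formants)
```

With a quasi-periodic source, windowing and a realistic recording, the estimated maxima depend on the analysis settings chosen, in particular the LPC order, which is one aspect of the methodological problems of formant pattern estimation mentioned earlier.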