The intellectual and empirical reasoning presented in this treatise, however, gives rise to scepticism with respect to this understanding of the sound of the vowel. The reflections and materials presented provide reason to argue that, up to now, a comprehensible theory of the acoustics of the voice and of voiced speech sounds is lacking, and consequently, no satisfying understanding of vowels as an achievement and particular formal accomplishment of the voice exists. Thus, the question of the acoustics of the vowel – and with it the question of the acoustics of the voice itself – proves to be an unresolved fundamental problem.
5 Formant Patterns and Speaker Groups
If one further extends the reasoning developed in the previous chapter, namely that—according to prevailing theory—the intelligibility of a vowel sound is expected to relate to the respective fundamental frequency of the sound and the (statistically given) first formant frequency of the vowel, then, correspondingly, the “grade” of vowel perception should also depend upon the speaker group: vowel intelligibility should prove to be best for men, average for women and worst for children.
According to prevailing theory, the above relationship between fundamental and first formant frequencies, spectral characteristics and expected differences in the “grade” of intelligibility of different vowel qualities leads to the assumption that the “grade” of vowels varies for different speaker groups (children, women, or men).
Everyday experience also refutes this generalisation. Thus, again, a theory of vowels as elements of language that formulates an inherently qualitative relationship between age and gender on the one hand, and vowel perception on the other, stands in contrast with the (possibly, yet again, “sensational”) characteristic of a voiced element of language being quasi-independent of a speaker’s constitution (if not impaired).
Vowels as such are related neither to age nor to gender. If direct comparisons of utterances of single speakers show that some speakers produce vowel sounds “better” (better in vowel intelligibility) than others, then, this has to do with the vocal abilities of the individual speakers investigated, not with vowels, speaker groups, or vocal-tract sizes (with the exception of very young children acquiring their first language). As a rule, vowels, as speech sounds of a given language, can potentially be produced with equal intelligibility by speakers of all general speaker groups. Vowels are not attributes of an individual, but elements of language. Vowels are “abstracted” from the individual.
In the literature, empirical reference values for vowel-specific formant patterns are given separately for each speaker group (children, women, or men), that is, in group-specific terms (see, for example, Chapter 2). In the first instance, these differences in formant patterns are not explained in terms of varying average fundamental frequencies, but in terms of varying average vocal-tract size.
This view leads to the assumption that each vowel is physically represented by three different speaker group-specific formant patterns, not only in terms of the different fundamental frequencies, but also in terms of the same fundamental frequency: in general, women and men are able to produce clearly recognisable vowel sounds at a child’s fundamental frequency—for instance, at around 250 Hz (see Section 2.1; note, in this context, that in the statistics of Hillenbrand et al., F0 differences between women and children do not exceed 20 Hz). Given such cases of sounds at similar fundamental frequencies, three sounds of the same vowel, produced by a man, a woman and a child respectively, are expected to exhibit three substantially different formant patterns, despite the similarity in vowel perception.
According to prevailing theory, the relationship between vowel-specific formant patterns and age- and gender-related speaker groups leads to the assumption that the physical representation of a vowel is based upon different formant patterns.
Such reasoning also leads to the assumption that women and men are capable of producing sounds of a given vowel with fundamental frequencies substantially higher than those of children, albeit with substantially lower corresponding formant patterns.
The problem that the particular sound configurations in question pose to the theoretical approach discussed here becomes particularly evident when considering corresponding sounds of the vowels /a, α, ɔ, o, u/, which are low-pass filtered with a cut-off frequency of 2 kHz (note that, for these vowels, statistical values of vowel-specific formant patterns F1–F2 for all three speaker groups discussed here are given as ≤ 2 kHz): then, neither different fundamental frequencies nor different higher spectral energy configurations can play a role in vowel perception or explain why three different patterns of F1–F2 can be expected to represent the same vowel.
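The effect of such low-pass filtering on a harmonic sound can be illustrated with a minimal sketch. It assumes an idealised filter that simply removes every partial above the cut-off, a fundamental frequency of 250 Hz as mentioned above, and an upper spectral limit of 8 kHz for the unfiltered sound; the function names and the 8 kHz limit are illustrative assumptions and are not taken from the studies cited.

```python
def harmonics(f0_hz, max_hz):
    """Return the harmonic frequencies k * f0 that do not exceed max_hz."""
    count = int(max_hz // f0_hz)
    return [k * f0_hz for k in range(1, count + 1)]


def ideal_low_pass(partials_hz, cutoff_hz=2000.0):
    """Idealised low-pass filter: keep only partials at or below the cut-off."""
    return [f for f in partials_hz if f <= cutoff_hz]


F0_HZ = 250.0            # fundamental frequency named in the text
UPPER_LIMIT_HZ = 8000.0  # assumed bandwidth of the unfiltered sound

full = harmonics(F0_HZ, UPPER_LIMIT_HZ)
kept = ideal_low_pass(full)

print(len(full), "harmonics before filtering")
print(len(kept), "harmonics after filtering:", kept)
```

Under these assumptions, only the harmonics up to 2 kHz remain, so that any perceived vowel difference between filtered sounds at the same fundamental frequency would have to be carried by the energy distribution over these few partials.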
It goes without saying that the above also holds true for the restricted comparison between women and men.
The problem described here becomes particularly acute if, instead of natural vocalisations, corresponding sound configurations are studied by means of vowel synthesis, applying similar fundamental frequencies but different patterns F1’–F2’.
However, in its turn, such a conclusion runs counter to the requirement of a psychophysical parallel between perceived vowel quality and physical representation: formant patterns are either vowel specific, which means that clearly distinct formant patterns do not represent the same vowel, regardless of the fundamental frequency, or they are, as such, not directly vowel specific. According to the first stance, the assumption of speaker group-specific formant patterns would have to be questioned. According to the second stance, the assumption of vowel-specific formant patterns in general would have to be questioned.
Disregarding the comment in the previous paragraph, the pursuit of the reasoning developed in Section 5.2 leads to the further assumption that a single formant pattern can represent two different vowels: given that the sounds of a vowel produced by a speaker of one speaker group exhibit higher vowel-specific formant frequencies than the sounds of the same vowel produced by a speaker of another speaker group, and that the fundamental frequency plays no substantial role in the physical representation of the vowel in terms of formant patterns, and also given that the vowel-specific formant frequencies of the sounds of the first speaker lie within the frequency range of the possible vowel-specific formant frequencies of the second speaker, then it must be possible to find cases of comparisons of two sounds, each produced by one of these two speakers, that exhibit similar vowel-specific formant patterns, yet are perceived as different vowels.
According to prevailing theory, the relationship between vowel-specific formant patterns and age- and gender-related speaker groups leads to the assumption that a single formant pattern can physically represent two different vowels.
Again, the problem that such sound configurations pose to the theoretical approach discussed here becomes particularly evident when considering corresponding sounds of the vowels /a, α, ɔ, o, u/, because the vowel-specific formant frequencies of the corresponding sounds of all speaker groups are given in formant statistics ≤ 2 kHz, and in such a frequency range, adults can reproduce sounds exhibiting any of the F1–F2 patterns found in sounds of children. The same holds true when comparing the sounds of men and women.
The problem described here becomes particularly acute again if replicated by means of vowel synthesis, above all including extensive variation of the fundamental frequency.
However, in line with the explanation given above, the assumption of a possibility of twofold representation, according to which a single formant pattern can correspond physically to the sounds of two different vowels, runs counter to the requirement of a psychophysical parallel between perceived vowel quality and physical representation. At the same time, indeed, it directly contradicts prevailing theory.
This consideration engenders a decided scepticism about the claim that vowel-specific formant patterns are both fundamentally and continuously dependent upon the speaker group, that is, upon vocal-tract size. A fundamental dependence is already difficult to understand from an intellectual standpoint because, as mentioned, vowels do not “have” an age or gender. Besides, the simple fact that sounds of back vowels can be synthesised at fundamental frequencies observable in sounds of children as well as in sounds of men paradigmatically illustrates the problem: if, in synthesis, F1–F2 is changed substantially but the fundamental frequency is held constant, in general, the perceived vowel quality also changes, irrespective of whether the F1–F2 of the synthesis corresponds to a pattern observed for natural sounds of a child or of a man.
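The kind of synthesis experiment referred to here can be sketched as follows. This is a minimal illustration only, assuming a simple source-filter arrangement (an impulse train passed through a cascade of two-pole digital resonators); the sample rate, the bandwidths and the two F1–F2 patterns are hypothetical values chosen for the example and do not come from any formant statistics discussed in this treatise.

```python
import cmath
import math

FS = 16000  # sample rate in Hz (assumption)


def impulse_train(f0_hz, n_samples, fs=FS):
    """Crude voiced source: one unit impulse per fundamental period."""
    period = round(fs / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(signal, freq_hz, bw_hz, fs=FS):
    """Two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalises the gain at 0 Hz to one
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesise(f0_hz, formants, n_samples=8000):
    """Pass the impulse train through one resonator per (frequency, bandwidth)."""
    signal = impulse_train(f0_hz, n_samples)
    for freq_hz, bw_hz in formants:
        signal = resonator(signal, freq_hz, bw_hz)
    return signal


def harmonic_level(signal, freq_hz, fs=FS, n_tail=1600):
    """Magnitude of one spectral component, measured on the end of the signal."""
    tail = signal[-n_tail:]
    acc = sum(x * cmath.exp(-2j * math.pi * freq_hz * n / fs)
              for n, x in enumerate(tail))
    return abs(acc) / n_tail


# Hypothetical (frequency, bandwidth) pairs in Hz for F1 and F2.
PATTERN_A = [(500.0, 80.0), (1000.0, 90.0)]
PATTERN_B = [(750.0, 80.0), (1250.0, 90.0)]

F0_HZ = 250.0  # held constant for both syntheses
sound_a = synthesise(F0_HZ, PATTERN_A)
sound_b = synthesise(F0_HZ, PATTERN_B)

for name, sound in (("A", sound_a), ("B", sound_b)):
    levels = [harmonic_level(sound, k * F0_HZ) for k in range(1, 9)]
    strongest = F0_HZ * (1 + levels.index(max(levels)))
    print("pattern", name, "strongest harmonic below 2 kHz:", strongest, "Hz")
```

Both sounds share the same fundamental frequency and differ only in the resonance pattern imposed on the harmonics. Whether two such sounds are heard as the same vowel or as different vowels is a perceptual question that the sketch itself cannot answer; it only makes the physical arrangement of the comparison explicit.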
At the same time, the above reflection suggests an alternative explanation for the existing empirical findings, which seemingly provide evidence for speaker group-specific formant patterns: vowel-specific spectral energy configuration, and with this the calculated formant patterns, can depend upon fundamental frequency.
It is remarkable that, in general, formant statistics deemed worthy of reference in the literature do not give frequency values of formant patterns of the different speaker groups for systematically varied fundamental frequencies. Thus, currently, there is no empirical evidence in the literature to support the claim that observed, speaker group-specific formant patterns of vowels should in principle not be attributed to the different (and simultaneously observed) fundamental frequencies of the respective sounds but, instead, to different average vocal-tract sizes. With regard to the first formant for all vowels, and probably also to the second formant for back vowels, the present reflection indicates that such evidence cannot be furnished.
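One elementary reason why measured spectral peaks may vary with fundamental frequency can be made explicit with a small calculation. It is offered as an illustration only, under the assumption that a spectral peak is read off the strongest harmonic near a fixed envelope maximum; the envelope value of 500 Hz and the function name are hypothetical, and the sketch does not claim to reproduce the reasoning or the measurement procedure of any study cited here.

```python
def nearest_harmonic(target_hz, f0_hz):
    """Return the harmonic of f0 that lies closest to target_hz."""
    k = max(1, round(target_hz / f0_hz))
    return k * f0_hz


ENVELOPE_PEAK_HZ = 500.0  # hypothetical fixed envelope maximum

for f0_hz in (100.0, 125.0, 220.0, 300.0):
    harmonic_hz = nearest_harmonic(ENVELOPE_PEAK_HZ, f0_hz)
    offset_hz = harmonic_hz - ENVELOPE_PEAK_HZ
    print(f0_hz, "Hz fundamental:", harmonic_hz, "Hz harmonic, offset", offset_hz, "Hz")
```

The wider the harmonics are spaced, the further the nearest harmonic can lie from a fixed envelope maximum, which is one reason why statistics gathered at different average fundamental frequencies are difficult to compare without systematic variation of the fundamental frequency.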
As indicated, existing formant statistics suggest that, irrespective of fundamental frequency and perceived vowel quality, adults are capable of producing sounds for almost all variants of F1–F2 patterns as found in children’s vowels. Thus, even though adults have larger vocal tracts than children, for most vowels, they are nevertheless capable of producing sounds that exhibit the same vowel-specific formant patterns, above all F1–F2, as evidenced for the sounds of children.
If it is indeed the case that speakers of all three speaker groups are considered to be capable of producing the same vowel-specific patterns for a substantial part of vowels, then how are the pattern differences discussed above to be understood? (Many scholars assume that the schwa sound defines the midpoint of a speaker’s vowel space and plays a central role for the formant pattern differences discussed: because of different average vocal-tract lengths and different resonance patterns of related open tubes of speakers of different age and gender, it is deduced that different vowel-related formant patterns mirror different midpoint reference patterns. However, in the present context, such an assumption does not dispose of the question posed: sounds of schwa, too, can be produced on different fundamental frequencies, and the independence or dependence of related formant patterns on fundamental frequency for perceptually unaltered schwa quality has not yet been clarified.)
Even though existing statistical values list vowel-specific formant patterns for children exceeding those for adults, and for women exceeding those for men, there are nevertheless exceptions: in some cases, as shown by some statistics, single vowel-specific formant frequencies, or even vowel-specific formant patterns F1–F2 or F1–F2–F3, for sounds produced by men do not differ from those for sounds produced by women; they may even slightly exceed the latter. (Thus, remarkably, the formant patterns given by Fant, 1959, for a single male and a single female speaker do not show a consistent speaker group related difference; see Section 2.1, Table 3. Besides, there are cases in which the statistical F1 of women slightly exceeds the F1 of children, see, for instance, Section 2.1, Table 2, and the corresponding values for the vowel /ʌ/.) This raises the same question as above.
The relationship between vowel-specific formant patterns and age- and gender-related speaker groups described in terms of prevailing theory fails to explain why, despite different vocal-tract sizes, similar vowel-specific formant patterns are basically possible at least for the majority of vowels but are—according to theory—not realised (actually not produced).
In addition, this formulation could also prove to be generally applicable: it could prove to be the case that all vowel-specific formant patterns, F1–F2 and F1–F2–F3 as given in formant statistics for children, can also be produced by women and men. (With regard to this aspect, utterances of voice-over artists are of particular interest.)
To repeat, and to insist: given a psychophysical perspective, the correspondence between intelligible vowel sounds and the vowel-related physical characteristics must be formulated as such. The formulation of speaker-independent and, in a strict and direct sense, vowel-specific acoustic features represents the touchstone for any acoustic theory of the vowel.
Empirical studies comparing voiced and whispered vowel sounds indicate substantial differences in the formant patterns related to the perceived vowel qualities. In particular, the first formant frequency of whispered sounds of a given vowel (and, according to some studies, the second formant frequency, too) is found at significantly higher frequency levels than that of voiced sounds. (As mentioned in Section 1.4, such differences are explained as a consequence of differences in the geometry, and thus the resonances, of the glottal area of the vocal tract for the two different phonation types in question.)
This finding relativises again the attempt to establish a direct correspondence between vowels and formant patterns: the sounds of the same vowel can exhibit different formant patterns, not only because of different average vocal-tract sizes but also because of different kinds of phonation acting upon a configuration of a single vocal tract.
Moreover, comparisons between published formant frequencies of whispered and voiced vowel sounds indicate that all F1, and the majority of F2 ≤ 1.5 kHz, of whispered sounds produced by men generally exceed the corresponding F1 and F2 of voiced sounds produced by women, given the same perceived respective vowel identities and notwithstanding men’s larger vocal tract. The same applies to a comparison between whispered sounds of women and voiced sounds of children. Restricted to F1, this also applies to the comparison between whispered sounds of men and voiced sounds of children.
This observation relativises in turn the assumption of a correspondence between vocal-tract size and vowel-specific formant patterns: based on the values given in the literature, such a correspondence is documented only for sounds of one and the same phonation type, not for a comparison of sounds of different phonation types. Besides, it should be noted that the frequency differences of the lower formants for the sounds of a given vowel, which relate to different types of phonation, e.g. voiced versus whispered sounds, are in general greater than the corresponding formant frequency differences between the different speaker groups.
Thus, most importantly, vowel-related formant patterns produced by one vocal tract can differ more than vowel-related formant patterns produced by different vocal tracts with very different tract sizes.
Moreover, referring to Section 5.3, a single formant pattern seems able to physically represent different vowels not only if the corresponding sounds are produced by speakers belonging to different speaker groups, but also if an individual speaker varies his or her phonation.
Such considerations will be discussed further in Part III: comparisons between the formant patterns of voiced and whispered sounds, as documented in the literature, refer only to the average (lower) fundamental frequency of voiced vowel sounds produced in citation-form words, but not to a comparison including a systematic variation in fundamental frequency of voiced sounds. (Such an experimental arrangement assumes, once again, that formant patterns are independent of fundamental frequency and that the latter is, therefore, negligible when comparing voiced and whispered sounds.)