The intellectual and empirical reasoning presented in this treatise, however, gives rise to scepticism with respect to this understanding of the sound of the vowel. The reflections and materials presented provide reason to argue that, up to now, a comprehensible theory of the acoustics of the voice and of voiced speech sounds is lacking, and consequently, no satisfying understanding of vowels as an achievement and particular formal accomplishment of the voice exists. Thus, the question of the acoustics of the vowel – and with it the question of the acoustics of the voice itself – proves to be an unresolved fundamental problem.
6 Terms of Reference, Methods of Formant Estimation
If the terms “resonance” and “formant” are distinguished from each other in order to separate the characteristics of the vocal tract from those of the sound spectrum, then the psychophysical question of the vowel relates to formants only. Prevailing theory assumes that, in the first instance, the spectrum of a vowel sound exhibits determinable relative energy maxima, which are related to vowel-specific frequency ranges, and that, as a rule, the frequencies of these relative spectral energy maxima correspond to calculated formant frequencies, for example, those obtained by LPC analysis. (Note that, nowadays, formant frequencies are no longer derived as numerical values from the spectral envelope but, instead, are calculated as filters of an analytical model, although the corresponding numerical results are in many cases crosschecked on the basis of a spectrogram.)
As discussed in Sections 3.1 and 3.2, the sound spectra of back vowels and of /a–α / can exhibit only one single vowel-specific spectral energy maximum, although formant analysis using an analytical model (e.g. LPC analysis), which draws on “phonetic knowledge” and sometimes involves interactive manual adjustment of parameter settings, indicates two vowel-specific formants, often close in frequency. This contradicts the assumption that the number and frequency of relative spectral energy maxima, that is, the envelope peaks, always correspond to analytically determined formants.
As mentioned in Section 4.2, due to the increasing frequency spacing of the harmonics, the higher the fundamental frequency, the more difficult it becomes to determine the spectral envelope and its peaks (for further details, see also Section 6.4). This in turn impedes the formulation of a general correspondence between relative spectral energy maxima and calculated formant frequencies.
Regarding the current procedures used in formant analysis and the corresponding numerical values of formant patterns, it follows that in many cases, and thus in principle, the term formant does not designate a characteristic of the sound spectrum itself, but instead a construct or even artefact of the respective method of analysis.
In the current literature, the term formant—if distinguished from resonance—generally refers neither to any actual characteristic of the vocal tract nor to any actual characteristic of the sound spectrum. The term generally refers to filters of an analytical model. At the same time, formants are not determined on the basis of spectra but on the basis of such an analytical model.
Thus, the assumption of a direct correspondence between resonances as a physical property of the vocal tract, spectral energy maxima as a physical characteristic of the vowel sound produced, and filter frequencies derived from methods used in the acoustic analysis of vocal sounds loses its plausibility.
As discussed, prevailing theory supposes a relationship between vowel-specific formant patterns and age- and gender-related speaker groups and explains corresponding differences in terms of the respective average vocal-tract sizes.
It can be assumed that some women have larger vocal tracts than some men. Comparing the vowel sounds of these female and male speakers, the following constellation is of particular interest in the present context: the sounds of the female speakers in question exhibit fundamental frequencies corresponding to the average fundamental frequency values for women in general, as given in formant statistics, and the sounds of the male speakers in question exhibit substantially lower fundamental frequencies. Then, according to prevailing theory, the vowel-specific formants of these female voices would have to exhibit lower frequencies—despite comparatively higher fundamental frequencies—than the corresponding formant patterns of these male voices.
Extending such consideration, this comparison raises the question of a systematic investigation of the relationship between vocal-tract size and vowel-specific formant patterns within a single speaker group.
Besides the lack of an empirical basis for the questions raised here, the above reflections again point to the fact that prevailing theory does not claim that vowel-specific formant patterns depend in principle on age and gender, but that different vowel-specific formant patterns exist for different vocal-tract sizes: prevailing theory only refers to speaker group-specific differences in average vocal-tract sizes.
The term “age- and gender-related speaker group” is related to the term “age- and gender-related average vocal-tract size”.
Concerning natural vocalisations, current analytical methods for determining formants apply a model-like procedure in order to calculate a specific configuration of source sound and filters which, by means of transformation of source by filters, “reproduces” a sound that best corresponds to the real sound. (The same applies to whispered vowel sounds, in relation to the source as noise.)
Such a procedure must not only assume certain characteristics of the source sound but also a certain number and certain characteristics of the filters involved in the frequency range under investigation. (Note that, according to prevailing theory, different numbers of formants are expected for a given frequency range in relation to different speaker groups because of their different average vocal-tract size. Thus, the number of filters for the analysis of a sound must be set accordingly.) How closely the characteristics of the source sound approach actual phonation remains open. The same applies to the question of whether the number of filters and their characteristics actually correspond to real articulation and its resonance.
Thus, formants cannot be determined reliably on the basis of a vowel sound alone. Analysis requires at least some prior knowledge of whether the sound under investigation has been produced by a man, woman, or child, assuming that this information is sufficient to deduce the number of filters (related to the frequency range of interest) to be used in formant analysis.
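The dependence of the result on the number of filters set in advance can be illustrated with a minimal sketch. It is not the procedure of any particular analysis software: the function names are hypothetical, the test signal is a synthetic impulse response with two resonances at 500 Hz and 1500 Hz (bandwidth 100 Hz, sampling rate 8 kHz, all illustrative assumptions), and the formant candidates are read off as peaks of the model envelope obtained by autocorrelation-method LPC with the Levinson-Durbin recursion.

```python
import cmath
import math
from typing import List


def impulse_response(formants_hz: List[float], bandwidth_hz: float,
                     fs: float, n: int) -> List[float]:
    """Impulse response of a cascade of two-pole resonators (all-pole filter)."""
    a = [1.0]  # denominator polynomial A(z), built up resonator by resonator
    radius = math.exp(-math.pi * bandwidth_hz / fs)
    for f in formants_hz:
        section = [1.0, -2.0 * radius * math.cos(2.0 * math.pi * f / fs),
                   radius * radius]
        product = [0.0] * (len(a) + 2)
        for i, ai in enumerate(a):
            for j, sj in enumerate(section):
                product[i + j] += ai * sj
        a = product
    h: List[float] = []
    for t in range(n):
        y = 1.0 if t == 0 else 0.0  # unit impulse as the (assumed) source
        for k in range(1, len(a)):
            if t - k >= 0:
                y -= a[k] * h[t - k]
        h.append(y)
    return h


def lpc(signal: List[float], order: int) -> List[float]:
    """Autocorrelation-method LPC; returns [1, a1, ..., a_order]."""
    n = len(signal)
    r = [sum(signal[i] * signal[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):  # Levinson-Durbin recursion
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= 1.0 - k * k
    return a


def envelope_peaks(a: List[float], fs: float, step_hz: float = 1.0) -> List[float]:
    """Frequencies (Hz) of the local maxima of the model envelope 1/|A|."""
    freqs, mags = [], []
    f = step_hz
    while f < fs / 2:
        w = 2.0 * math.pi * f / fs
        value = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(a))
        freqs.append(f)
        mags.append(1.0 / abs(value))
        f += step_hz
    return [freqs[i] for i in range(1, len(mags) - 1)
            if mags[i - 1] < mags[i] > mags[i + 1]]


FS = 8000.0
signal = impulse_response([500.0, 1500.0], 100.0, FS, 1024)
peaks_order4 = envelope_peaks(lpc(signal, 4), FS)  # two filters assumed
peaks_order2 = envelope_peaks(lpc(signal, 2), FS)  # one filter assumed
print("order 4:", peaks_order4)
print("order 2:", peaks_order2)
```

With four coefficients (two filters), the model envelope shows two peaks close to the resonances used in the synthesis; with two coefficients (one filter), the same signal can yield at most one peak. The signal alone does not determine which of the two descriptions is returned: the number of filters has to be supplied.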
In addition, the automatically calculated formant frequency values are often subsequently double-checked visually on the basis of the sound spectrogram: if the values calculated in the first step (based on analytical parameters according to existing standards and the known speaker group) do not correspond to the relative spectral energy maxima of the analysed sound, then the number of filters is varied and the analysis is repeated until such a correspondence occurs. As a rule, the characteristic of the source sound is not altered. However, this only applies to cases where such an interactive analysis is able to produce vowel-specific numbers and frequencies of formants that correspond to the number and frequency ranges to be expected according to prevailing theory and established statistical patterns, and which are also clearly indicated in the spectrogram. If an interactive procedure of analysis yields no values with such a correspondence, then the respective vowel sounds are often excluded from further studies, irrespective of vowel perception. Exceptions include so-called “formant merging”, as discussed in Section 3.2.
Thus, current methods of formant analysis presuppose that researchers have the necessary analytical skills, that is, a knowledge of the existing phonetic principles and rules of interpretation as well as extensive first-hand experience of conducting such an analysis. Prior training is needed because such an analysis requires contextual knowledge, the ability to interpret a sound spectrogram visually and to compare numerical values with it, and the skills to vary filter settings interactively and to repeat the numerical analysis. Consequently, methods of formant analysis are not completely objectifiable. If they were, then researchers would play no part as individuals in such research.
Strictly speaking, methods of formant analysis are not fully objectifiable; accordingly, they cannot be fully automated.
Most importantly, these procedures are also very time-consuming. Therefore, investigations based on very extensive samples of sounds are methodologically problematic. This is particularly the case if the fundamental frequency is varied: then, specific problems of analysis add to the already costly character of the method as such. Obviously, the same holds true for all repetitions and verifications of existing investigations.
In addition to formant analysis being neither fully objective nor fully automatable, it also depends on the respective fundamental frequencies of the sounds. To repeat: the higher the fundamental frequency, the more difficult it becomes to determine the expected spectral envelope peaks, because the frequency spacing between the harmonics becomes too large to define the spectral envelope accurately. It also becomes increasingly difficult to determine the formants within any of the existing analytical frameworks.
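The effect of harmonic spacing on locating an envelope peak can be sketched as follows. The envelope used here is purely hypothetical: a single Lorentzian-shaped peak whose centre (342 Hz) is borrowed from the F1 value for men quoted in this chapter, with an arbitrarily assumed half-width of 60 Hz. Since a voiced sound carries energy only at multiples of the fundamental frequency, the sketch simply asks which harmonic is strongest.

```python
from typing import List

PEAK_HZ = 342.0        # assumed envelope peak (F1 of /i/ for men, as quoted in the text)
HALF_WIDTH_HZ = 60.0   # hypothetical half-width, chosen for illustration only


def envelope(freq_hz: float) -> float:
    """Hypothetical single-peak spectral envelope (Lorentzian shape)."""
    return 1.0 / (1.0 + ((freq_hz - PEAK_HZ) / HALF_WIDTH_HZ) ** 2)


def strongest_harmonic(f0_hz: float, max_hz: float = 4000.0) -> float:
    """Frequency of the harmonic k * F0 with the largest envelope value."""
    harmonics: List[float] = [k * f0_hz for k in range(1, int(max_hz // f0_hz) + 1)]
    return max(harmonics, key=envelope)


for f0 in (100.0, 250.0, 400.0):
    h = strongest_harmonic(f0)
    print(f"F0 = {f0:.0f} Hz: strongest harmonic at {h:.0f} Hz, "
          f"{abs(h - PEAK_HZ):.0f} Hz away from the envelope peak")
```

Under these assumptions, the strongest harmonic lies at 300 Hz, 250 Hz and 400 Hz, respectively, although the envelope peak itself never moves. The position of the strongest harmonic is tied to multiples of the fundamental frequency, and it can miss the envelope peak by up to half the harmonic spacing.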
With regard to critical limits of fundamental frequencies, above which methods of formant analysis become unreliable, two kinds of reference values need to be considered: firstly, half the frequency of the lowest first formant for a speaker group in terms of an average vocal-tract size, and secondly, the frequency of the lowest first formant for a speaker group.
For a fundamental frequency above half of the first formant frequency (F0 > ½F1), the frequency spacing between the harmonics is already so extended that defining a spectral envelope and evaluating the calculated numerical formant frequencies becomes problematic. (Note that for such sounds, the formants may not be clearly indicated by at least two harmonics.) According to this first kind of limit, and referring to the standard values established by Hillenbrand et al. (1995) for F1 of / i / (the lowest average value for F1 in these reference statistics), formant analysis becomes critical for fundamental frequencies higher than:
– 226 Hz for sounds of children (involving short vocal tracts)
– 219 Hz for sounds of women (involving medium-sized vocal tracts)
– 171 Hz for sounds of men (involving long vocal tracts)
For a fundamental frequency above the lowest first (statistically given) formant frequency for a given speaker group, under the assumption of independence of formants from fundamental frequency, it is basically impossible to distinguish all F1 of all vowels produced by speakers of that group, not to mention the aggravated problem of determining the spectral envelope. According to this second kind of limit, and again referring to the above statistics, methods of formant analysis lack a methodological basis for fundamental frequencies higher than:
– 452 Hz for sounds of children (involving short vocal tracts)
– 437 Hz for sounds of women (involving medium-sized vocal tracts)
– 342 Hz for sounds of men (involving long vocal tracts)
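The two lists can be reproduced from the three F1 values of / i / alone. The following lines are a small arithmetic sketch; the function names are hypothetical, and rounding half up is an assumption made here because it reproduces the 219 Hz given for women (437 / 2 = 218.5).

```python
import math

# Lowest average F1 (vowel /i/) per speaker group in Hz, as quoted in the
# text with reference to Hillenbrand et al. (1995).
LOWEST_F1_HZ = {"children": 452, "women": 437, "men": 342}


def first_kind_limit(f1_hz: float) -> int:
    """F0 limit of the first kind: half of F1, rounded half up (assumption)."""
    return math.floor(f1_hz / 2 + 0.5)


def second_kind_limit(f1_hz: float) -> int:
    """F0 limit of the second kind: the lowest F1 itself."""
    return int(f1_hz)


for group, f1 in LOWEST_F1_HZ.items():
    print(f"{group:8s}  F0 > 1/2 F1: {first_kind_limit(f1)} Hz   "
          f"F0 > F1: {second_kind_limit(f1)} Hz")
```

Above the first limit, fewer than two harmonics lie at or below F1 (since 2 × F0 > F1); above the second limit, not even the first harmonic does.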
Note that, referring to the statistics of Pätzold and Simpson (1997) for German vowels, shown in Section 2.2, the limits would have to be set at even lower frequencies: ½F1 of / i / corresponds to 165 Hz for women (medium-sized vocal tracts) and to 145 Hz for men (long vocal tracts); F1 of / i / corresponds to 329 Hz for women and to 290 Hz for men.
In this context, attention should also be given to the fact that, according to several formant statistics, the frequency distance between F1 and F2 for sounds of some back vowels is given as less than 500 Hz. Thus, the frequency spacing of the first two harmonics in the spectrum of a sound at a fundamental frequency above this limit exceeds the F1–F2 distance mentioned, which renders formant estimation obsolete within the existing theoretical framework.
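The argument can be made concrete by counting the harmonics that fall between F1 and F2. The formant values below (450 Hz and 900 Hz) are hypothetical and chosen only so that their distance is below 500 Hz; the function name is likewise an assumption.

```python
from typing import List

F1_HZ, F2_HZ = 450, 900  # hypothetical back-vowel formants, distance < 500 Hz


def harmonics_between(f0_hz: int, low_hz: int, high_hz: int) -> List[int]:
    """Harmonics k * F0 (k >= 1) inside the closed interval [low_hz, high_hz]."""
    return [k * f0_hz for k in range(1, high_hz // f0_hz + 1)
            if k * f0_hz >= low_hz]


for f0 in (150, 520, 1000):
    inside = harmonics_between(f0, F1_HZ, F2_HZ)
    print(f"F0 = {f0} Hz: harmonics between F1 and F2: {inside}")
```

At 150 Hz, four harmonics (450, 600, 750 and 900 Hz) fall into the interval; at 520 Hz only one remains, and at 1000 Hz none. Once the harmonic spacing exceeds the F1–F2 distance, at most a single harmonic can lie in that range, so that two separate formants can no longer be indicated by the spectrum.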
The first list of frequency limits given above for F0 > ½F1 suggests that, methodologically speaking, the analysis of vowel sounds of children and women must be considered problematic in general. The critical fundamental frequency value mentioned for children is considerably lower than the empirically determined average fundamental frequency that children exhibit when producing vowels in citation-form words, which can be considered as related to relaxed speech on a comparatively low fundamental frequency (see, for example, the statistics in Section 2.1). Thus, most vowel sounds produced by children in their everyday expression exhibit substantially higher fundamental frequencies. According to Hillenbrand et al. (1995), the mentioned critical fundamental frequency value for women corresponds to the average fundamental frequency of women producing vowels in citation-form words. In everyday speech, however, vowel sounds in a fundamental frequency range of up to one octave higher than this value are the norm. Moreover, according to Pätzold and Simpson (1997), the mentioned critical fundamental frequency value for women is again considerably lower than the average fundamental frequency generally given in vowel statistics. The problem discussed here seems to be less pronounced among men than among women and children, but it nevertheless concerns a substantial part of their utterances.
The second list of frequency limits reveals that, for methodological reasons, any determination of formant patterns of vowel sounds exhibiting fundamental frequencies that exceed low first-formant frequencies does not make sense, since general rules for formant estimation can no longer be formulated. In this regard, particular consideration needs to be given to voices exhibiting extensive prosodic variations in fundamental frequency, which can be experienced in everyday speech and, in a very pronounced form, in the field of art and entertainment. (Notably, with regard to everyday speech, the literature does not provide ample documentation of the occurrence and significance of such extensive variation in fundamental frequency that would allow the significance of the methodological problem of formant estimation discussed here to be validated. However, in the Materials section, examples of corresponding utterances are documented; see Section M8.2.)
Within the prevailing theoretical framework, the reliability of formant analysis depends on fundamental frequency and the age- and gender-related speaker group, that is, vocal-tract size. Depending on the latter, formant frequency estimation becomes critical for fundamental frequencies above c. 175 Hz, and formant frequency estimation can no longer be methodologically substantiated for fundamental frequencies substantially above 350 Hz. Consequently, formant analysis cannot be applied to all cases of clearly intelligible vowel sounds.
A part of the literature tends to equate the methodological problem with a particular characteristic of vowel perception, which leads us back to the two assumptions discussed in Sections 4.1 and 5.1: firstly, that vowels produced by children and women are basically less intelligible than those produced by men; and secondly, that at least some vowels of sounds at a fundamental frequency substantially above 350 Hz can no longer be clearly distinguished. As suggested, however, both assumptions contradict actual vowel perception.
On the one hand, formant parameters in current procedures of formant analysis are defined prior to analysis of the sounds depending on the corresponding speaker group, that is, the assumed average vocal-tract size of the speakers. On the other hand, these parameter settings are sometimes interactively altered during the procedure if the calculated numerical values do not yield the expected number of formants in the expected vowel-specific frequency ranges compared to the respective spectrogram.
Thus, for example, with regard to sounds of a single speaker, LPC analysis involving standard parameters according to the related speaker group (average vocal-tract size) may yield the expected values for only a part of the sounds, whereas the analysis of other sounds may require the parameters to be set to the standard of another speaker group (average vocal-tract size) or to a setting that is entirely different from any speaker-group related standard given in the literature.
This reveals an inconsistency in how parameter settings are established: in the first instance, default settings of analytical parameters are related to specific vocal-tract sizes, whereas any corrections of these settings are related to the respective general (not vocal tract related) degree of “formant resolution” of the analysis.
As explained in Section 6.1, current methods of analysis yield no consistent and direct relationship between spectrum, spectral envelope and formant frequencies. Consequently, this raises the question of the existence of a general relationship between a natural vowel sound, the determined formant pattern and resynthesis.
Currently, resynthesis is indeed being used to examine the reliability of calculated formant patterns. However, this kind of verification is unable to substantially relativise the general problems of the existing methods of analysis described above: resynthesis is feasible only if formant analysis is not fundamentally at issue and only with regard to a limited variation of analytical parameters.
Moreover, the question of resynthesis must be discussed against the background of synthesised sounds as discussed in Section 3.1, indicating the possibility of substantial differences in formant patterns of sounds of one vowel: if a certain analytically determined formant pattern used in a resynthesis reveals an “expected” vowel identity in a perceptual test, then this does not mean that another determined formant pattern, based on a different parameter setting, and applied in a second resynthesis, in principle cannot reveal the same vowel identity. Further, the possibility cannot be excluded that there are sounds for which a resynthesis based on “unexpected” formant patterns approximates the perceived vowel quality of the natural sounds in question better than a resynthesis based on “expected” formant patterns.
It is noteworthy that, if a sound is synthesised using a specific pattern of filters and filter bandwidths, the formant pattern of a subsequent analysis may differ from the synthesis filters if the number of filters used is not communicated to the scholar conducting the analysis.
Moreover, the problem of possible differences between the filters used in synthesis and the formant patterns obtained in analysis will be substantially aggravated if the fundamental frequency is varied independently of the filters.
It is also noteworthy that formant patterns can be calculated outside the framework of prevailing theory, for example, by using LPC analysis as a method to decompose any sound into a source and a set of filters, irrespective of the fundamental frequency and the perceptual quality, and without relating the decomposition to existing formant or resonance statistics (and therefore without assuming a direct relationship between spectral peaks and resonances of the vocal tract). If the results of such an analysis are used in resynthesis, then, for many examples of natural utterances, resynthesis reproduces similar intelligible vowel qualities, even for very high fundamental frequencies. Obviously, the formant patterns will then sometimes deviate strongly from the statistical patterns given in the literature.