Open access

Acoustics of the Vowel

Preliminaries

Dieter Maurer

It seems as if the fundamentals of how we produce vowels and how they are acoustically represented have been clarified: we phonate and articulate. Using our vocal cords, we produce a vocal sound or noise which is then shaped into a specific vowel sound by the resonances of the pharyngeal, oral, and nasal cavities, that is, the vocal tract. Accordingly, the acoustic description of vowels relates to vowel-specific patterns of relative energy maxima in the sound spectra, known as patterns of formants.
The intellectual and empirical reasoning presented in this treatise, however, gives rise to scepticism with respect to this understanding of the sound of the vowel. The reflections and materials presented provide reason to argue that, up to now, a comprehensible theory of the acoustics of the voice and of voiced speech sounds is lacking, and consequently, no satisfying understanding of vowels as an achievement and particular formal accomplishment of the voice exists. Thus, the question of the acoustics of the vowel – and with it the question of the acoustics of the voice itself – proves to be an unresolved fundamental problem.

Materials Part II

The second part of the Materials section contains selected excerpts from the literature as well as further indications and discussions relating to the second part of the main text.

M3     Vowels and Number of Formants

Formant merging

“If you know you are analyzing a low back vowel, don’t be surprised to find one thick bar on the spectrogram that really corresponds to two formants close together below 1’000 Hz.” (Ladefoged, 2003, p. 114)

Referring to vocalisations of /ɔ / as in caught: “When the formants are close together […] neither the wide- nor the narrowband spectrum gives a good indication of the formant frequencies. […] The first two formants appear as a single peak below 1’000 Hz. Their frequencies cannot be determined from these spectra.” (Ladefoged, 2003, pp. 119–120)

Spurious formant

“Sometimes it is not immediately obvious whether a particularly wide band represents one formant or two. Figure 5.8 is a spectrogram of the word bud, spoken by a female speaker of Californian English. There is a wide band below 1,000 Hz, but is this one formant or two formants close together as in Figure 5.7? Noting that there is a clear formant at about 1,500 Hz in Figure 5.8, and additional formants higher, we must take it that there is only a single formant below 1,000 Hz. It seems that there is some kind of extra formant near the first formant, making this dark bar wider. From the evidence of this one vowel it is impossible to say whether the additional energy is above or below the first formant. Further analysis of this speaker’s voice showed that there was often energy around the 1,000 Hz region, irrespective of the vowel. This spurious formant is not connected with the vowel quality, but is simply a characteristic of the particular speaker’s voice. This is a good example of the necessity of looking at a representative sample of a speaker’s voice before making any measurements of the formants.” (Ladefoged, 2003, pp. 114–115)

“Flat” vowel spectra

“Flat-spectrum stimuli, consisting of many equal‐amplitude harmonics, produce timbre sensations that can depend strongly on the phase angles of the individual harmonics. For fundamental frequencies in the human pitch range, many realizable timbres have vowel-like perceptual qualities. This observation suggests the possibility of constructing intelligible voiced speech signals that have flat-amplitude spectra.” (Schroeder & Strube, 1986)
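As a minimal sketch of the point that a flat amplitude spectrum still leaves the waveform open, the following Python fragment sums equal-amplitude harmonics with two different sets of phase angles and compares the peak-to-RMS ratio of the resulting waveforms. The number of harmonics, the quadratic phase rule and the function name crest_factor are assumptions made for illustration and are not taken from Schroeder and Strube; the sketch makes no claim about perceived timbre.

```python
import math

def crest_factor(phases, n_samples=1024):
    """Peak-to-RMS ratio over one period of a sum of equal-amplitude harmonics."""
    wave = [
        sum(math.cos(2 * math.pi * (k + 1) * i / n_samples + phase)
            for k, phase in enumerate(phases))
        for i in range(n_samples)
    ]
    peak = max(abs(x) for x in wave)
    rms = math.sqrt(sum(x * x for x in wave) / n_samples)
    return peak / rms

n_harmonics = 20  # assumed value, for illustration only

# Same flat amplitude spectrum, two different phase patterns.
zero_phases = [0.0] * n_harmonics
quadratic_phases = [math.pi * k * k / n_harmonics for k in range(1, n_harmonics + 1)]

crest_zero = crest_factor(zero_phases)
crest_quadratic = crest_factor(quadratic_phases)
print(crest_zero, crest_quadratic)
```

With all phases at zero the harmonics align once per period and produce a pulse-like waveform; with the quadratic phases the same amplitudes give a much less peaked waveform.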

M4     Vowels and Fundamental Frequency

Independence of formants and fundamental frequency

“Obviously, formant frequency is independent from the fundamental frequency […] Changes in formant frequency are due to changes in the shape of the vocal tract cavity or cavities; changes in pitch frequency to stretching of the vocal cords. If the two physiological events are independent, so are the acoustic results of each event […].” (Delattre, 1958/1980)

“[…] when a complex wave consists of a damped waveform repeated at regular intervals, the component frequencies will always have the same relative amplitudes as the corresponding components in the continuous spectrum representing the isolated occurrence of the damped wave. Consequently, altering the rate at which the vocal folds produce pulses will affect the fundamental frequency of the complex wave; but it will not alter the formants (the peaks in the spectrum), which correspond to the basic frequencies of the damped vibrations of the air in the vocal tract. It is in this sense that we may say that the formants of a sound are properties of the corresponding mouth shape.

[…] the formants which characterize a given vowel irrespective of the rate at which pulses are produced by the vocal cords […]

We saw in Chapter 6 that the pitch of a sound depends mainly on the fundamental frequency. Accordingly, when there is a variation in the rate at which pulses are produced by the vocal cords, there will be a change in the pitch of the sound (although there will be no change in the formants, and hence no change in the characteristic vowel quality). It is usually possible to alter the pitch of a vowel sound without altering its characteristic quality, because each of these factors is controlled by a separate physiological mechanism. As we have seen, the pitch depends on the action of the vocal cords, and the characteristic quality depends largely on the formants, which have certain fixed values for each particular shape of the vocal tract.” (Ladefoged, 1996, pp. 98–99)
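Ladefoged's statement that the harmonics of a repeated damped wave keep the relative amplitudes of the continuous spectrum of the isolated wave can be checked numerically under simple assumptions. The sketch below assumes a single damped sinusoid with a resonance at 500 Hz and a bandwidth of 50 Hz (hypothetical values), repeats it at two pulse rates, and compares the Fourier-series amplitudes of the pulse train with the closed-form spectrum of the isolated wave sampled at the harmonic frequencies. All function names are illustrative.

```python
import cmath
import math

def damped_wave(t, f_res=500.0, bandwidth=50.0):
    """One isolated damped oscillation that starts at t = 0 (assumed shape)."""
    return math.exp(-math.pi * bandwidth * t) * math.sin(2 * math.pi * f_res * t)

def continuous_spectrum(f, f_res=500.0, bandwidth=50.0):
    """Magnitude of the closed-form Fourier transform of damped_wave at f Hz."""
    a = math.pi * bandwidth
    w, w_res = 2 * math.pi * f, 2 * math.pi * f_res
    return abs(w_res / ((a + 1j * w) ** 2 + w_res ** 2))

def harmonic_amplitudes(f0, n_harmonics=8, n_samples=2000, n_pulses=60):
    """Fourier-series magnitudes of damped_wave repeated every 1/f0 seconds."""
    period = 1.0 / f0
    # One period of the pulse train; the tails of earlier pulses are summed in.
    wave = [
        sum(damped_wave(i * period / n_samples + n * period) for n in range(n_pulses))
        for i in range(n_samples)
    ]
    amplitudes = []
    for k in range(1, n_harmonics + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n_samples)
            for i, x in enumerate(wave)
        ) / n_samples
        amplitudes.append(abs(coeff))
    return amplitudes

for f0 in (100.0, 125.0):
    amps = harmonic_amplitudes(f0)
    strongest_hz = (amps.index(max(amps)) + 1) * f0
    print(f0, strongest_hz)
```

In this idealised case, changing the pulse rate changes which harmonics are present, but each harmonic amplitude equals f0 times the continuous spectrum at that frequency, so the envelope peak stays at the assumed resonance.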

See also the citation of Hillenbrand (n.d.) in Chapter M6.

“Undersampling” the formants I: formants at middle and high fundamental frequencies

“According to the undersampling account of the effects of f0 on vowel identifiability, the sparser distribution of harmonics at high f0s yields poorer definition of the peaks and valleys in the spectral envelope, creating a more ambiguous stimulus.” (Diehl, Lindblom, Hoemeke, & Fahey, 1996)

“However, in this range of frequency (500 to 1000 Hertz), you could not tell apart different vowels anyway, because the harmonics of the voice are so far apart that they are not ‘sampling’ the locations of the formants enough for you to tell where the formants lie. Therefore operatic writers only put words intended to be intelligible in the lower part of a soprano’s range.” (Moore, 2006, p. 11)

“Oversinging” the first formant

“For the U it is also by no means easy to find the pitch of the resonance by a fork, as the smallness of the opening makes the resonance weak. Another phenomenon has guided me in this case. If I sing the scale from c upwards, uttering the vowel U for each note, and taking care to keep the quality of the vowel correct, and not allowing it to pass into O, I feel the agitation of the air in the mouth, and even on the drums of both ears, where it excites a tickling sensation, most powerfully when the voice reaches f. As soon as f is passed the quality changes, the strong agitation of the air in the mouth and the tickling in the ear cease. […] The resonance of the mouth for U is thus fixed at f with more certainty than by means of tuning forks. But we often meet with a U of higher resonance, more resembling O, which I will represent by the French Ou. Its proper tone may rise as high as f’.” (von Helmholtz, 1885/1954, p. 110; c = 131 Hz, f = 175 Hz, f’ = 349 Hz)

“Above f’, the characterization of U becomes imperfect even if it is closely assimilated to O. But so long as it remains the only vowel of indeterminate sound, and the remainder allow of sensible reinforcement of their upper partials in certain regions, this negative character will distinguish U. On the other hand a soprano voice in the neighbourhood of f’’ should not be able to clearly distinguish U, O, A; and this agrees with my own experience.” (von Helmholtz, 1885/1954, p. 114; f’’ = 699 Hz)

“It is reasonable to assume […] that it is impossible to produce recognizable vowels at musical pitches very much higher than their first formants. […]

The following table is offered as a practical guide: Vowels start seriously losing intelligibility when the fundamental reaches these frequencies:

(i u y)     350 cps (roughly middle F)

(e o ø)    450 cps (roughly middle A)

(ɔ œ)   600 cps (roughly high D)

(æ α a)   750 cps (roughly high G)”

(Howie & Delattre, 1962)
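To relate the rounded values in this table to musical pitch, the following sketch computes equal-tempered note frequencies. It assumes A4 = 440 Hz and reads "middle F", "middle A", "high D" and "high G" as F4, A4, D5 and G5; both the tuning reference and this octave assignment are assumptions, and the function name is illustrative.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_frequency(name, octave, a4_hz=440.0):
    """Frequency in Hz of an equal-tempered note, e.g. ("G", 5)."""
    midi = 12 * (octave + 1) + NOTE_NAMES.index(name)  # MIDI number, A4 = 69
    return a4_hz * 2 ** ((midi - 69) / 12)

# Table value in cps paired with the note assumed to be meant.
TABLE = [(350, "F", 4), (450, "A", 4), (600, "D", 5), (750, "G", 5)]

for cps, name, octave in TABLE:
    print(cps, name + str(octave), round(note_frequency(name, octave), 1))
```

Under these assumptions the table values lie within a few percent of the named notes, which is consistent with the authors' "roughly".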

“[…] only very few correct identifications of isolated vowels can be expected when fundamental frequency reaches or exceeds the usual first formant of a vowel.” (Hollien, Mendes-Schwartz, & Nielsen, 2000)

“[…] vowel identifiability is inevitably compromised once f0 exceeds R1 […]” (Joliveau, Smith, & Wolfe, 2004)

“We have seen that female singers gain considerably in sound level by abandoning the formant frequencies typical of normal speech when they sing at high pitches. At the same time, F1 and F2 are decisive to vowel quality. This leads to the question of how it is possible to understand the lyrics of a song when it is performed with the ‘wrong’ F1 and F2 values. Both vowel intelligibility and syllable/text intelligibility can be expected to be disturbed. This aspect of singing has been studied in several investigations.

As a thought-provoking reminder of the difficulties in arranging well-controlled experimental conditions in the past, an experiment carried out by the German phonetician Carl Stumpf (1926) may be mentioned. He used three singer subjects: a professional opera singer and two amateur singers. Each singer sang various vowels at different pitches, with their backs turned away from a group of listeners who tried to identify the vowels. The vowels that were sung by the professional singer were easier to identify. Also, overall, the percentages of correct identifications dropped as low as 50% for several vowels sung at the pitch of G5 (784 Hz).

Since then, many investigations have been devoted to intelligibility of sung vowels and syllables (see, e.g. Benolken & Swanson, 1990; Gregg & Scherer, 2006; Morozov, 1965). Figure 12 gives an overview of the results in terms of the highest percentage of correct identifications observed in various investigations for the indicated vowels at the indicated pitches. The graph shows that vowel intelligibility is reasonably accurate up to about C5 and then quickly drops with pitch to about 15% correct identification at the pitch of F5. The only vowel that has been observed to be correctly identified more frequently above this pitch is /a /. Apart from pitch and register, larynx position also seems to affect vowel intelligibility (Gottfried and Chew, 1986; Scotto di Carlo and Germain, 1985).

Smith and Scott (1980) strikingly demonstrated the significance of consonants preceding and following a vowel. This is illustrated in the same graph. Above the pitch of F5, syllable intelligibility is clearly better than vowel intelligibility. Thus, vowels are easier to identify when the acoustic signal contains some transitions (Andreas, 2006). Incidentally, this seems to be a perceptual universal: changing stimuli are easier to process than are quasi-stationary stimuli.

The difficulties in identifying vowels and syllables sung at high pitches would result both from singers’ deviations from the formant frequency patterns of normal speech and from the fact that high-pitched vowels contain few partials that are widely distributed over the frequency scale, producing a lack of spectral information.

In addition, a third effect may contribute. Depending on phonation type, the F0 varies in amplitude. At a high pitch, F1 may lie between the first and the second partial. Sundberg and Gauffin (1982) presented synthesized, sustained vowel sounds in the soprano range and asked subjects to identify the vowel. The results showed that an increased amplitude of the F0 was generally interpreted as a drop in F1.” (Sundberg, 2013, pp. 86–88)

“Grade” of vowels

As discussed in Sections 4.1 and 4.2, prevailing theory gives reason to assume that a general but also discontinuous relationship exists between the intelligibility of vowel sounds and their fundamental frequency: accordingly, vowel sounds at lower fundamental frequencies would, as a rule, be more intelligible than vowel sounds at higher frequencies, but vowel intelligibility would also depend upon the respective relationships between fundamental frequency, harmonic spectrum and the vowel-specific formant pattern (as given in formant statistics).

Concerning the former, consider the following model cases:

     Comparison of two sounds of /ε / produced by a woman at F0 of 200 and 400 Hz, related to a common formant pattern F1–F2 = 600–2000 Hz (compare Section 2.2, the formant statistics for Standard German); F1 will be “undersampled” for the sound at higher F0, i.e. F1 lying in between the first and the second harmonics, whereas for the first sound, the third harmonic matches with F1 indicating a “sampled” formant pattern F1–F2 as a better condition for vowel perception.

     Comparison of two sounds of /ɔ / produced by a woman at F0 of 285 and 340 Hz, related to a common formant pattern F1–F2 = 570–1140 Hz (compare Section 2.2, the formant statistics for Standard German); F1–F2 will be “undersampled” for the sound at higher F0, i.e. F1 lying in between the first and the second, and F2 lying in between the third and the fourth harmonics, while for the first sound, the second and the fourth harmonics match with F1 and F2.

     And so on.

Concerning the latter, consider the following model cases:

     Comparison of two sounds of / i / produced by a woman at F0 of 200 and 300 Hz, related to a common formant pattern F1–F2 = 300–2700 Hz (compare Section 2.1, the formant statistics of Peterson and Barney, 1952); F1 and F2 will be “undersampled” for the sound at lower F0, with F1 lying in between the first and the second, and F2 lying in between the thirteenth and the fourteenth harmonics, while for the second sound, the first and the ninth harmonics match with F1 and F2 indicating a “sampled” formant pattern F1–F2 as a better condition for vowel perception.

     Comparison of two sounds of /α / produced by a woman at F0 of 270 and 330 Hz, related to a common formant pattern F1–F2 = 660–990 Hz (compare Section 2.1, the formant statistics of Fant, 1959); F1 and F2 will be “undersampled” for the sound at lower F0, i.e. F1 lying in between the second and the third, and F2 lying in between the third and the fourth harmonics, while for the second sound, the second and the third harmonics match with F1 and F2.

     Comparison of two sounds of /u / produced by a woman at F0 of 200 and 300 Hz, related to a common formant pattern F1–F2 = 300–900 Hz; F1 and F2 will be “undersampled” for the sound at lower F0, i.e. F1 lying in between the first and the second, and F2 lying in between the fourth and the fifth harmonics, while for the second sound, the first and the third harmonics match with F1 and F2.

     And so on.
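The model cases above rest on simple arithmetic: the harmonics lie at integer multiples of F0, and a formant counts as "sampled" when one of them coincides with it. A minimal sketch of that bookkeeping follows; the function name, the tolerance parameter and the dictionary layout are assumptions made for illustration, while the F0 and F1–F2 values are those of the model cases.

```python
def sampling_report(f0, formants, tolerance_hz=0.0):
    """For each formant, report the matching harmonic number or its two neighbours.

    A lower neighbour of 0 means that the formant lies below the first harmonic.
    """
    report = []
    for formant in formants:
        nearest = round(formant / f0)
        if nearest >= 1 and abs(formant - nearest * f0) <= tolerance_hz:
            report.append(("sampled", nearest))
        else:
            lower = int(formant // f0)  # highest harmonic below the formant
            report.append(("undersampled", lower, lower + 1))
    return report

# vowel: ((two F0 values in Hz), (F1, F2) in Hz)
MODEL_CASES = {
    "/ε/": ((200, 400), (600, 2000)),
    "/ɔ/": ((285, 340), (570, 1140)),
    "/i/": ((200, 300), (300, 2700)),
    "/α/": ((270, 330), (660, 990)),
    "/u/": ((200, 300), (300, 900)),
}

for vowel, (f0_values, formants) in MODEL_CASES.items():
    for f0 in f0_values:
        print(vowel, f0, sampling_report(f0, formants))
```

The tolerance is set to zero here because the model cases use exact matches; for measured data a non-zero tolerance would presumably be needed.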

“Undersampling” the formants II: resonances and formants

If a basic distinction is made between the resonances of the vocal tract and the formants of the vowel sound produced, strictly speaking, only resonances can be undersampled in the sense of a large frequency distance between harmonics and no harmonic matching an existing resonance frequency. Formants in their turn are always a result of a method of measurement.

M5     Formant Patterns and Speaker Groups

Thesis of age- and gender-related differences in vowel-specific formant patterns

“Because of shorter cavity lengths females […] have larger average formant spacings and higher average formant frequencies than males. Similar relations hold for children compared with adults […].” (Fant, 1960, p. 21)
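The relation between cavity length and formant frequencies stated here can be illustrated with the textbook idealisation of the vocal tract as a uniform tube closed at one end, whose resonances lie at odd multiples of c/(4L). This is a strong simplification that is not part of the quoted text; the tube lengths and the speed of sound below are assumed values, and the function name is illustrative.

```python
def tube_resonances(length_m, n_resonances=3, speed_of_sound=350.0):
    """Resonances in Hz of a uniform tube closed at one end (quarter-wave model)."""
    return [(2 * n - 1) * speed_of_sound / (4 * length_m)
            for n in range(1, n_resonances + 1)]

longer = tube_resonances(0.175)   # assumed length of a longer tract, in metres
shorter = tube_resonances(0.145)  # assumed length of a shorter tract, in metres

print([round(f) for f in longer])
print([round(f) for f in shorter])
```

In this model the shorter tube has both higher resonance frequencies and a larger spacing between neighbouring resonances (c/(2L)), which matches the direction of the difference described by Fant.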

“Men, women, and children generally differ with respect to average vocal tract length, which is significant for the formant frequencies, as we know. For this reason, the same vowel is usually represented by different formant frequencies in men, women, and children.

[…] average formant frequency differences between male and female adults are expressed as the percentages by which the three lowest formant frequencies of a given vowel in female adults exceed those in male adults (Fant, 1975). […] they vary considerably between vowels, particularly for the lowest two formants. […] these percentage differences occur similarly in various languages. The first formant frequency shows a maximum percentage difference in the open /a: / vowel of the Italian word caro. The second formant frequency shows high values for all front vowels. The difference, averaged over the entire set of vowels, amounts to 12%, 17%, and 18% for the three lowest formants. Children’s average formant frequencies are about 20% higher than those for female adults, or 32%, 37%, and 38% higher than those of male adults. Probably most of these differences are due to inequalities in the vocal tract dimensions between the various groups of speakers. Thus, younger children tend to have higher formant frequencies than older children because of their shorter vocal tracts.

If the proportions of the average female and male vocal tracts are compared, one finds that the female vocal tract is not merely a small-scale version of the male vocal tract. According to Nordstrom (1977), the average mouth length of a female adult is about 85% of that of the average male adult, while the female pharynx length is only 77% of the corresponding male value. In other words, the average female pharynx is much shorter than the average male pharynx, while the average difference is smaller with regard to the mouth.

If one computes the formant frequency differences that would result from these dissimilarities in the mouth and pharynx proportions between adult males and females, one finds a discrepancy between prediction and reality; the differences that have been found in the dimensions do not explain the actual formant frequency differences, according to Nordstrom (1977). The reason for this is not well understood. The existence of sex dialects, or ‘sexolects’, cannot be excluded; it is possible that females and males use a slightly different articulation of some vowels. The reason may be hidden in the largely unknown processes used by our sense of hearing and our brain in order to identify vowels.

We correctly infer that the actual reasons for the formant frequency differences between children and adult males and females are not understood in every detail. However, it is also interesting to see to what extent the voice timbre differences between these groups of speakers can be accounted for by the formant frequency differences. Coleman (1976) has published an interesting investigation on this topic. In an experiment in which subjects tried to identify the sex of speakers by listening to the voice quality, he found that phonation frequency was a much more important factor than formant frequencies as illustrated in Figure 5.10; the average of the three lowest formant frequencies showed little or no correlation with maleness and femaleness in voice timbre. The faint trace of a correlation that appears to exist between the average of the three lowest formant frequencies and the perceived maleness or femaleness was due to an equally low correlation between phonation frequency and this formant frequency average.

It may be important to these results that the three lowest formant frequencies were not separated but were converted into an average in this investigation. It is not clear whether such an average catches all of the timbral voice differences between the sexes, and it is also possible that the results would have come out differently if the fourth formant had been included in the average; the higher the formant frequency, the more its frequency depends on nonarticulatory factors such as vocal tract length.

It seems clear that the perceptually most important difference in voice quality between the two sexes depends on phonation frequency rather than formant frequencies. The mean phonation frequency difference is almost one octave, which is much greater than the formant frequency difference. We realize that our brain is quite smart: it is more impressed by the great phonation frequency difference than by the small formant frequency difference when guessing the sex of a speaker.” (Sundberg, 1978)

Concerning indications of similar formant patterns for sounds of different vowels produced by speakers of different speaker groups, see, for example, the vowel synthesis experiment in Potter and Steinberg (1950), and the [e]–[ø] ambiguity reported by Fant, Carlson, and Granström (1974). See also the indications of similar F1–F2 for / U / and /u /, and for /ʌ / and /o / in the statistics of Hillenbrand et al. (1995), comparing the patterns of women and men, and of children and men, respectively.

Questioning this thesis: von Helmholtz (1885), Potter and Steinberg (1950)

“ […] the proper tones of the cavity of the mouth are nearly independent of age and sex. I have in general found the same resonances in men, women, and children. The want of space in the oral cavity of women and children can be easily replaced by a great closure of the opening, which will make the resonance as deep as in the larger oral cavities of men.” (von Helmholtz, 1885/1954, p. 105)

Note that this statement by von Helmholtz stands in contradiction to his self-experiment, on the basis of which he concluded a vowel-specific resonance for U at 175 Hz (see Chapter M2): particularly for the speech of children, the fundamental frequency is substantially above 175 Hz, not allowing for a production of U, if vowel-specific resonances are independent of age and gender.

“Audible Form and Vowel Identification: Form or pattern of the formant positions appears to be important in discriminating between sounds. One of the first results found was that, for a given vowel sound, the actual formant frequency positions for a man’s voice differ markedly from those for a woman’s or a child’s voice. To illustrate this difference the frequencies of the formants in the vowel sound [æ] as spoken by a man, a woman and a child are shown on the left hand side of Fig. 5 by short horizontal lines designated F1, F2, F3. […] Listening tests indicate that these three sounds are identified as the same vowel. Yet the values of the formant frequencies are quite different. Certainly we cannot regard a vowel as completely specified by fixed regions of energy concentration. […]

If we view the formant positions in relation to positions of fundamental frequency, they fall into better alignment. This suggests that the fundamental frequency of the voiced sounds might offer a means for normalizing the formant positions. However, this seems a dubious possibility because the formant positions for a given vowel are probably directly related to the dimensions of the vocal cavities and only incidentally related to fundamental frequency. For example, whispered vowels can be identified readily. Also there may well be cases of high fundamental frequency with large vocal cavities, and vice versa, that would need to be considered.

To obtain preliminary information on the question of how pitch affects vowel identification we have synthesized sounds having the same formant outlines but different fundamental frequencies. One such case is illustrated in Fig. 6. The two upper charts show the spectra for the [æ] (had) sounds of Fig. 5, for the adult male and child’s voices. The fundamental frequencies are 109 and 264 cycles respectively. The lower chart shows an unnatural spectrum, namely, the adult male’s formant outline with a fundamental frequency of 256 cycles, approximating that of the child’s voice. This frequency was chosen so that the peaks of the formants would not be shifted markedly in position. Sounds corresponding to the three spectra were synthesized by means of a spectrum generator […].

The first two synthesized sounds were readily identified by ear as [æ] sounds. The third sound, however, was neither the man’s nor the child’s [æ]. It seemed to be somewhere between the child’s [æ] and [ε]. This phonetic shift may indicate an association between fundamental frequency and formant position. But the shift could also arise if the ear assigns different pitch centers or positions to the energy concentrations representing the formants in the upper and lower cases.

The effects become more pronounced when the back vowels are used in such a comparison. Figure 7 shows spectra similar to the ones in Fig. 6, except that they are for the [α] (father) sound.

In this case, the first two sounds were clear [α’s]. The third sound was more like a child’s [ɔ] (awl) than the [α] (father). Here there is also a question of association or actual shift in the ear’s assignment of formant position. Still if one considers the bar positions of these sounds as illustrated in Fig. 8, there is some support for an association of fundamental frequency and formant position. […] We have seen that an increase in fundamental frequency seems to require that both bars be raised in frequency position to maintain the identification of a given vowel (Fig. 5). Hence, in the case of the [α] sound, the combination of adult formants with the child’s fundamental frequency shifts the sound toward the [ɔ]. It must be admitted, though, that the association of adult formants and child’s fundamental frequency is an unnatural one giving sounds that do not correspond to any of the natural sounds.” (Potter & Steinberg, 1950)

Exceptions in existing formant statistics

Although in formant statistics, the highest frequency values of vowel-specific formants are generally given for children, middle values for women and the lowest values for men, exceptions can be found. Some examples of such exceptions are listed below, ordered according to vowel quality. Abbreviations used are: “ * ” = values for the comparison of voiced vowel sounds, “ ** ” = values for the comparison of whispered vowel sounds; “SinSp” = values for the comparison of the sounds of a single male and a single female speaker as given in Fant (1959); “Av” = average values for a speaker group in the statistics of Fant (1959). Examples of single formants or formant patterns for which higher frequency values are given for men than for women:

/ i /     F1*-F2*-F3* (Fant, 1959, SinSp); F1* (Fant, 1959, Av); F1* (compare Pols, Tromp, & Plomp, 1973; van Nierop, Pols, & Plomp, 1973)
/ y /     F1* (Fant, 1959, SinSp; marginal difference for F2*); F1* (compare Pols et al., 1973; van Nierop et al., 1973)
/ e /     F1*-F2* (Fant, 1959, SinSp)
/ ɵ /     F1*-F2*-F3* (Fant, 1959, SinSp); F1* (Fant, 1959, Av)
/ ε /     F2* (Fant, 1959, SinSp)
/ æ /     F2* (Fant, 1959, Av); F2** (Sharifzadeh, McLoughlin, & Russell, 2012)
/ ɔ /     F1** (Sharifzadeh et al., 2012; marginal difference for F2**; marginal differences also for F1*-F2*)
/ o /     F1* (Fant, 1959, SinSp)
/ ʊ /     F1*-F2* (Fant, 1959, SinSp); F1* (Fant, 1959, Av); F1*, F1**-F2** (Sharifzadeh et al., 2012)
/ u /     F1* (Fant, 1959, SinSp); F2* (Fant, 1959, Av); F1* (compare Pols et al., 1973; van Nierop et al., 1973); F1* (Zee, 2003); F1** (Sharifzadeh et al., 2012)

See also Hillenbrand et al. (1995) for slightly higher F1 values of /ʌ / for women than for children.

“We have argued […] that for the vowels /u /, / i / and /y/ as well, F1 can be chosen so that its average value is higher for female speakers than for male speakers. However, F1 then becomes about equal to 2xF0 (490 Hz) which is much too high. The data on the vowels /u /, / i / and /y/ do not confirm the usual upward shift of formant frequencies for female speakers. We do not suggest that the anomaly for these three vowels reflects the actual resonance frequencies of the vocal tract.” (van Nierop et al., 1973)

Zee (2003) found lower F1 for women than for men for the vowel /u / when investigating formant frequencies of Cantonese vowels and comments on his finding as follows: “In any case, it is not clear as to why the F1 value for [u] does not follow the general pattern.”

“In looking at the ranges for each vowel formant frequency for the male and female groups, the overlap between genders was considerable. In all cases, the highest formant value for the male group was markedly above the lowest formant value for the female group for each formant of both vowels. This would suggest that in some individual cases, the formants of a male speaker might be the same as, or even higher than, the formants of a female speaker.” (Gelfer & Bennett, 2013)

M6     Terms of Reference, Methods of Formant Estimation

Terms of reference

“Formant […]. A concentration of acoustic energy within a particular frequency band, especially in speech. Any given configuration of the vocal tract produces resonance, and hence formants, in certain frequency ranges. During the articulation of a vowel, these formants show up prominently in a sound spectrogram as thick dark bars; the three lowest of these, known as first, second and third formants (F1, F2 and F3) are highly diagnostic, and vowels are distinguished acoustically by the positions of these formants.” (Trask, 1996, p. 148)

“Some refer to a formant as a peak in the acoustic spectrum. In this usage, a formant is an acoustic feature that may or may not be evidence of a vocal tract resonance. Others use the term formant to designate a resonance, whether or not actual empirical evidence is found for it.” (Kent & Read, 2002, p. 24)

“Resonances, formants and spectral peaks: Unfortunately, the meaning of the word ‘formant’ has expanded to describe two or three different things. Fant (1960) gives this definition: ‘The spectral peaks of the sound spectrum | P( f ) | are called formants.’ Resonance frequencies are then defined in terms of the gain function T( f ) of the tract by ‘The frequency location of a maximum in | T( f ) |, i.e. the resonance frequency, is very close to the corresponding maximum in spectrum | P( f ) | of the complete sound.’ Fant then writes: ‘Conceptually these should be held apart but in most instances resonance frequency and formant frequency may be used synonymously.’ Benade (1976) uses a similar definition of formant: ‘The peaks that are observed in the spectrum envelope are called formants.’ More recently, the acoustical properties of the vocal tract are often modelled using an all-pole autoregressive filter (Atal and Hanauer, 1971). For many voice researchers, formants now refer to the poles of this filter model. To others, formant means the resonance frequency of the tract. Finally, many researchers, particularly in the broader field of acoustics, retain the original meaning: a broad peak in the spectral envelope of a sound (of a voice, musical instrument, room etc.). The original meaning of formant is also retained, almost universally, when discussing the singers formant and actors formant: these terms refer to a peak in the spectral envelope around 3 kHz (discussed below). As Fant observes, while these uses are often closely related, they are conceptually quite distinct. Further, the resonant frequency, the pole of the fitted filter function and the peak spectral maximum need not coincide. Moreover, it is now possible to measure resonances of the vocal tract quite independently of the voice.
Consequently, it is sometimes essential to make a clear distinction among a resonance frequency (a physical property of the tract), a filter pole (a value derived from data processing) and a spectral peak (a property of the sound).” (Wolfe, Garnier, & Smith, 2009)
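That a resonance frequency and the corresponding spectral peak need not coincide can be illustrated with a deliberately simple model: a single second-order resonance standing in for | T( f ) |, multiplied by a source spectrum with a constant slope in dB per octave. The resonance at 500 Hz, the bandwidth of 80 Hz, the slope of -12 dB per octave, the search band and all function names are assumptions for illustration only; none of them is taken from the quoted authors.

```python
import math

def resonance_gain(f, f_res, bandwidth):
    """Gain of a single second-order resonance (hypothetical stand-in for |T(f)|)."""
    q = f_res / bandwidth
    u = (f / f_res) ** 2
    return 1.0 / math.sqrt((1.0 - u) ** 2 + u / q ** 2)

def source_amplitude(f, slope_db_per_octave, f_ref=100.0):
    """Source spectrum with a constant slope in dB per octave (0 = flat)."""
    return 10 ** (slope_db_per_octave * math.log2(f / f_ref) / 20.0)

def envelope_peak(f_res, bandwidth, slope_db_per_octave, lo=300, hi=700):
    """Frequency (1 Hz grid) of the spectral-envelope maximum between lo and hi."""
    return max(
        range(lo, hi + 1),
        key=lambda f: source_amplitude(f, slope_db_per_octave)
        * resonance_gain(f, f_res, bandwidth),
    )

flat_peak = envelope_peak(500.0, 80.0, 0.0)
sloped_peak = envelope_peak(500.0, 80.0, -12.0)
print(flat_peak, sloped_peak)
```

In this sketch the falling source spectrum pulls the envelope maximum below the resonance frequency by some hertz; whether such a shift matters depends on how precisely the peak can be estimated from a harmonic spectrum.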

“Formant is used by James Jeans (1938) to mean the collection of harmonics of a note that are augmented by a resonance.

Formant was defined by Gunnar Fant (1960): ‘The spectral peaks of the sound spectrum | P( f ) | are called formants’.

Benade (1976) writes: ‘The peaks that are observed in the spectrum envelope are called formants’.

In its standards for acoustical terminology, the Acoustical Society of America (1994) defines formant thus: ‘Of a complex sound, a range of frequencies in which there is an absolute or relative maximum in the sound spectrum. Unit, hertz (Hz). NOTE-The frequency at the maximum is the formant frequency.’” (Wolfe, n.d.)

“Does it matter? For the voice, a resonance at a frequency R( i ) gives rise to a spectral maximum at frequency F( i ) which may produce in a filter model a pole at frequency P( i ). Usually, the three frequencies have similar values. However, as Fant observed, they are conceptually distinct. Let’s take some examples:

     Consider a vocal tract with a resonance at 500 Hz, which is being excited by the larynx producing a fundamental frequency of 1 kHz (near C6, the high C for sopranos). There is no spectral maximum at 500 Hz. In this case there is a resonance R1 but no corresponding spectral peak F1. Here of course the difference does matter.

     Consider the singers formant or singing formant, a broad band of enhanced power noticed in the spectral envelope of classically trained male singers (and possibly others) in a range. Sundberg (1974) attributes this formant to a clustering of the third, fourth and fifth resonances of the vocal tract. Here, where three resonances are thought to give rise to one formant, the distinction between formant and resonance is important.

     Consider a glottal source with a negative spectral slope, input to a vocal tract that (including radiation impedance) has a resonance at R1. The peak in the spectral envelope of the radiated sound in this case has a frequency less than R1. In this case, if one is estimating the spectral peak from the harmonic spectrum of the output voice, the difference between the two is less than the precision of the estimation, so the distinction is usually not important.

     Consider a musical wind instrument, whose bore radiates weakly below some frequency f, and which is excited by a reed or lip valve whose spectral envelope falls with frequency. Here the output sound has a spectral envelope peak that has nothing at all to do with the resonances of the bore.

     Consider this quote, from Stevens and House (1961): ‘When resonant frequencies are sufficiently close, however, they are not necessarily identical with the frequencies of the peaks in the spectrum. For example, when two resonances with bandwidths of about 100 cps are about 100 cps apart, the spectrum envelope may show only one prominence: the frequency of the peak will be somewhere between the two resonant frequencies. In the discussion that follows, the levels of the resonances will be defined to be the levels of the spectral envelope at the frequencies of the resonances (rather than at the spectral peaks).’

In our laboratory, the distinction is important. We routinely measure the resonances independently of the voice (Epps et al, 1997; Dowd et al, 1997; Joliveau et al, 2004a, b). We are often interested in comparing formants and resonances.

What to do? Our preference would be to retain the original meaning for the word formant. We prefer to say ‘A resonance at frequency Ri gives rise to a formant at frequency Fi. This may be modelled by a filter with a pole at frequency Pi’. While acousticians will broadly agree with this use, some members of the speech research and modelling community may not. We therefore suggest that, when discussing the voice, the word formant should be defined, to make it clear which meaning is intended. In principle, one could consider abandoning the word. However ‘broad peak in the spectral envelope’ is a long phrase, so it is useful to retain formant for that reason.

[…]

Whatever your choice of definition, you should make it clear. And, in literature and in discussions, prepare for some confusion. For instance, some researchers who use formant to mean resonance will also talk about ‘formant level’. When such people then talk of ‘formant level’, or say that the second formant is 10 dB lower than the first, I suspect that they refer to the amplitude of a peak in the sound spectrum. In a scientific talk, I have heard the sentence: ‘Trained sopranos tune the first formant near the note sung, but they usually don’t have a strong singer’s formant’. When that speaker said ‘first formant’ he presumably meant ‘first resonance’ and when he said ‘singer’s formant’ he meant a spectral peak probably due to two or more resonances. So we have the same person using the word in two of its three different meanings in the one sentence.” (Wolfe, n.d.)
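The first of Wolfe's examples above (a resonance at 500 Hz excited at a fundamental frequency of 1 kHz) can be illustrated with a minimal numerical sketch. It is not taken from the cited text: the Lorentzian-shaped resonance gain, the 80 Hz bandwidth, the 1/k source roll-off and all function names are assumptions chosen for illustration only.

```python
import math

def resonance_gain(freq_hz, resonance_hz, bandwidth_hz):
    """Gain of one resonance, approximated by a Lorentzian shape (assumption)."""
    return 1.0 / math.sqrt((freq_hz - resonance_hz) ** 2 + (bandwidth_hz / 2.0) ** 2)

def harmonic_spectrum(f0_hz, resonance_hz, bandwidth_hz=80.0, fmax_hz=4000):
    """Return (frequency, amplitude) pairs for the harmonics k * f0 up to fmax."""
    spectrum = []
    k = 1
    while k * f0_hz <= fmax_hz:
        freq = k * f0_hz
        source = 1.0 / k  # assumed source amplitude falling with harmonic number
        spectrum.append((freq, source * resonance_gain(freq, resonance_hz, bandwidth_hz)))
        k += 1
    return spectrum

def strongest_harmonic(f0_hz, resonance_hz, bandwidth_hz=80.0):
    """Return the (frequency, amplitude) pair of the strongest harmonic."""
    return max(harmonic_spectrum(f0_hz, resonance_hz, bandwidth_hz), key=lambda p: p[1])

# Low F0: a harmonic falls on the resonance, so the spectral peak sits at 500 Hz.
print(strongest_harmonic(100, 500)[0])
# F0 = 1 kHz: the lowest harmonic is already at 1000 Hz, so the radiated
# sound contains no energy at 500 Hz and no spectral peak can appear there.
print(strongest_harmonic(1000, 500)[0])
```

In this sketch the resonance R1 is the same in both calls; only the harmonic spacing changes, which is why a spectral peak F1 at 500 Hz exists in the first case and not in the second.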

“With regard to airway resonances, historical precedence and current usage of terminology are also slightly at odds. Joe Wolfe and colleagues suggest that the symbol R be used to stand separate from the symbol F for formant (Wolfe, 2014). The distinction is being made because a formant was originally defined as a peak in the output spectrum envelope radiated from the mouth (Hermann, 1894, 1895; Russell, 1929; Fant, 1960, p. 20). A similar definition appears in the current ASA standard of acoustic terminology (Acoustical Society of America, 2004), namely, that a formant is ‘a range of frequencies in which there is an absolute or relative maximum in the sound spectrum. The frequency at the maximum is the formant frequency.’ As such, a formant involves both the source and the filter. However, as speech analysis and synthesis have progressed in a half century, the definition has not been universally maintained. Fant (1960, pp. 20, 53) defined formants as the poles of the transfer function of the supraglottal vocal tract, and labeled the pole frequencies F1, …, Fn and their bandwidths B1, …, Bn. He was followed in this path by many authors, such as Titze (1994, p. 156) or Stevens (1998, p. 131). It is noteworthy that Flanagan (1965, p. 57) was aware of the dual definition (and possible evolution) by using the term ‘formant resonance.’ While Benade (1976) maintained the definition of ‘peaks in the spectral envelope of the radiated sound,’ Badin and Fant (1984) computed formant frequencies and bandwidths on the basis of x-ray area function resonances of the supraglottal vocal tract, not peaks in the output spectrum envelope. Story et al. (1996) did similar calculations based on magnetic resonance imaging (MRI). Differentiation between the formant frequencies and resonance frequencies of the vocal tract can be found in some papers comparing measurements from phonation (formants) to those derived from vocal tract impedance measurements or from calculations based on MRI or computer tomography (CT) data (resonance frequencies) (e.g., Stoffers et al., 2006; Vampola et al., 2013).

What is relevant here for nomenclature and symbolic notation is that the letter R is easily distinguishable from the letter F or f, both in speaking and writing. Hence, it is useful as a subscript to separate source and filter symbols. Discussion can continue on whether or not a formant is a meaningful representation of any particular resonance. Some authors describe resonances pertaining to the supraglottal airway only (assuming no coupling to the glottal or subglottal system), while others describe the net effect of complex interactions of multiple resonators above, below, and within the larynx. […]

Unfortunately, the common definition between a formant and a resonance is yet to be established.” (Titze et al., 2015)

Note that Titze et al. (2015) propose a new and consistent terminology for the frequencies, magnitudes and bandwidths of harmonics, resonances and formants.

“Spectrum Envelope: The term spectrum envelope refers to an imaginary smooth line drawn to enclose an amplitude spectrum. Figure 3-17 shows several examples. This is a rather simple concept that will play a very important role in understanding certain aspects of auditory perception. For example, we will see that our perception of a perceptual attribute called timbre (also called sound quality) is controlled primarily by the shape of the spectrum envelope, and not by the fine details of the amplitude spectrum. The examples in Figure 3-17 show how differences in spectrum envelope play a role in signaling differences in one specific example of timbre called vowel quality (i.e., whether a vowel sounds like / i / vs. /a / vs. /u /, etc.). For example, panels a and b in Figure 3-17 show the vowel /α / produced at two different fundamental frequencies. (We know that the fundamental frequencies are different because one spectrum shows wide harmonic spacing and the other shows narrow harmonic spacing.) The fact that the two vowels are heard as /a / despite the difference in fundamental frequency can be attributed to the fact that these two signals have similar spectrum envelopes. Panels c and d in Figure 3-17 show the spectra of two signals with different spectrum envelopes but the same fundamental frequency (i.e., with the same harmonic spacing). As we will see in the chapter on auditory perception, differences in fundamental frequency are perceived as differences in pitch. So, for signals (a) and (b) in Figure 3-17, the listener will hear the same vowel produced at two different pitches. Conversely, for signals (c) and (d) in Figure 3-17, the listener will hear two different vowels produced at the same pitch.” (Hillenbrand, n.d., pp. 16–17)

Methods of formant estimation I: general aspects

“The difficulties involved in measuring formant frequencies have been well known since the early days of the spectrograph, and involve errors related to (i) the ambiguous definition of the object to be measured, (ii) spectral features of the speech wave, (iii) intermodulation distortion, (iv) the spectrographic record, and (v) the measuring procedure:

     A formant is seen both as a spectral prominence in the speech wave and as a filter property of the vocal tract; a definition comprising both components contradicts itself; a definition embracing just the first component presupposes that the relevant information for speech perception is immediately available in the speech wave; a definition based on the second part alone is production oriented and sees the true formant value as a vocal tract pole frequency that is being measured from its (sometimes poor) reflection in the speech wave.

     The resolution of the spectral envelope depends on the interval between the partials, which is equal to the fundamental frequency; a spectral peak may be asymmetrical within the formant band; individual spectral peaks become less well defined as they approach each other or as their bandwidths increase. […]

Lindblom’s advice is thus still valid today. It is still necessary to apply one’s knowledge and experience of speech production and expected envelope shapes to the problem of how to select samples to measure and where to look for spectral peaks.” (Wood, 1989, referring to Lindblom, 1962)
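The remark that spectral peaks become less well defined as they approach each other can be made concrete with a small sketch. It uses the figures quoted earlier from Stevens and House (1961): two resonances with bandwidths of about 100 Hz that lie about 100 Hz apart. The Lorentzian-shaped power response, the particular centre frequencies and the function names are assumptions made for illustration; the sketch does not reproduce any procedure from the cited sources.

```python
def envelope_power(freq_hz, resonances_hz, bandwidth_hz):
    """Product of Lorentzian-shaped power responses, one per resonance (assumption)."""
    power = 1.0
    for r in resonances_hz:
        power /= (freq_hz - r) ** 2 + (bandwidth_hz / 2.0) ** 2
    return power

def envelope_peaks(resonances_hz, bandwidth_hz, fmin_hz=0, fmax_hz=4000, step_hz=1):
    """Return the frequencies of all local maxima of the envelope on a regular grid."""
    freqs = list(range(fmin_hz, fmax_hz + 1, step_hz))
    vals = [envelope_power(f, resonances_hz, bandwidth_hz) for f in freqs]
    return [freqs[i] for i in range(1, len(freqs) - 1)
            if vals[i - 1] < vals[i] > vals[i + 1]]

# Two resonances 100 Hz apart, each 100 Hz wide: a single prominence
# midway between them.
print(envelope_peaks([700, 800], 100))
# The same bandwidth with 300 Hz spacing: two separate peaks remain.
print(envelope_peaks([600, 900], 100))
```

In the first case a peak-picking procedure can report only one frequency, although the underlying model contains two resonances; widening the bandwidths has the same effect as moving the resonances closer together.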

“[…] At this point we should remember that an LPC filter lumps together several aspects of speech production […]. An LPC spectrum represents not only the formant frequencies due to the resonances of the vocal tract but also the effects of the lip radiation and the spectrum of the pulse from the vocal folds. Nevertheless, the peaks in the LPC spectrum are usually good indicators of the formant frequencies. Problems may arise when two formants are close together, in which case the spectrum may appear to have only a single peak corresponding to both of them, or when one formant has a lower amplitude, so that it appears as only a kink in the curve representing another formant. These problems lead us to another way of considering LPC analysis.

It is also possible to analyze an LPC expression so as to determine the exact frequencies corresponding to the poles (which, however, may not be exactly those of the formants in the vocal tract transfer function). For every pair of LPC terms we get a pair of numbers corresponding to the frequency and the bandwidth of a pole in the filter. We know […] that there will be a formant at 500 Hz, 1,500 Hz, 2,500 Hz, and so on in a neutral vowel for a speaker with a vocal tract of 17.5 cm. In general, for such a speaker there will be one formant for every 1,000 Hz interval. So with a 10,000 Hz sample rate and an upper frequency limit of 5,000 Hz, we can expect to find five formants. This will require ten LPC terms. If we want to allow two further terms to account for higher formants that may be influencing the spectrum or a pole due to the glottal pulse shape, then we should make a twelve-point LPC analysis. If the speaker might have a shorter vocal tract so that we could only expect four formants below 10,000 Hz, then we could use a ten point LPC.

Choosing the right number of coefficients for an LPC analysis is somewhat of an art. If one chooses too many, the analysis will produce poles corresponding to spurious formants; if one chooses too few, formants may be lumped together because the higher formants or the glottal pulse may require more complex specification. The problem is compounded by the fact that an LPC analysis is equivalent to trying to model the spectrum using only poles, and there may be zeros (antiresonances) in the vocal tract transfer function. There certainly will be antiresonances in any vocal tract shape that contains the equivalent of a side tube, such as the oral cavity in the case of a nasal sound. LPC analysis is not reliable for nasalized vowels. A general rule of thumb for the number of coefficients is the sample rate in kHz plus 2, e.g. 10,000 Hz = 10 kHz plus 2 equals 12. But a better rule is to use several different analyses with different numbers of coefficients and see which gives the most interpretable results.” (Ladefoged, 1996, pp. 210–212)
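The arithmetic in the passage above can be sketched in a few lines. The sketch makes three assumptions that are not stated in the quoted text: the neutral vocal tract is treated as a uniform tube closed at one end, with resonances at (2k − 1)·c/(4L); the speed of sound is taken as 35,000 cm/s; and a pole of a discrete-time all-pole filter is converted to a frequency and bandwidth with the standard angle and radius relations. The function names are hypothetical.

```python
import cmath
import math

def tube_formants(length_cm, count, speed_cm_per_s=35000.0):
    """Resonances of a uniform tube closed at one end: (2k - 1) * c / (4 * L)."""
    return [(2 * k - 1) * speed_cm_per_s / (4.0 * length_cm)
            for k in range(1, count + 1)]

def lpc_order_rule_of_thumb(sample_rate_hz):
    """Rule of thumb from the quoted passage: sample rate in kHz plus 2."""
    return int(round(sample_rate_hz / 1000.0)) + 2

def pole_to_frequency_and_bandwidth(pole, sample_rate_hz):
    """Convert one complex pole of an all-pole filter to (frequency, bandwidth) in Hz."""
    frequency = abs(cmath.phase(pole)) * sample_rate_hz / (2.0 * math.pi)
    bandwidth = -math.log(abs(pole)) * sample_rate_hz / math.pi
    return frequency, bandwidth

print(tube_formants(17.5, 3))            # 500, 1500 and 2500 Hz for a 17.5 cm tube
print(lpc_order_rule_of_thumb(10000))    # 10 kHz plus 2 gives 12 coefficients

# A pole constructed for 500 Hz and 80 Hz bandwidth at a 10 kHz sample rate,
# converted back to frequency and bandwidth.
fs = 10000.0
pole = cmath.rect(math.exp(-math.pi * 80.0 / fs), 2.0 * math.pi * 500.0 / fs)
print(pole_to_frequency_and_bandwidth(pole, fs))
```

Each complex-conjugate pole pair accounts for two LPC terms, which is why five expected formants below 5,000 Hz call for ten terms before the two extra terms are added.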

“Good spectrograms are a great help in determining where the formants are. This is often not as easy as one might imagine. You have to know where to look for formants before you can find them. The best practical technique is to look for one formant for every 1,000 Hz. The vowel ə, for example, has formants at about 500, 1,500 and 2,500 Hz for a male speaker (all slightly higher for a female speaker). Other vowels will have formants up or down from this mid range. But there are exceptions to this general rule of one formant per 1,000 Hz. It would be more true to say that there is, on average, one formant for every 1,000 Hz. Low back vowels may have two formants below 1,000 Hz, but nothing between 1,000 and 2,000 Hz, and then the third formant somewhere between 2,000 and 3,000 Hz.” (Ladefoged, 2003, pp. 113–114)

Methods of formant estimation II: methodological limits related to F0

“[…] in the case of female speech, formant analysis is extremely difficult. The fundamental frequency is so high that formants are often poorly defined. […] We had difficulties in determining the position of a formant in about 40% of the 300 vowel segments, if no a priori knowledge was used.” (Van Nierop et al., 1973)

“[…] because formant frequencies are hard to determine when fundamental frequency is higher than about half of the frequency of the first formant.” (Sundberg, 1987, pp. 124–125)

“Accurate measurement of formant frequencies is important in many studies of speech perception and production. Errors in formant frequency estimation by eye, using a spectrogram, or automatically, using linear prediction, have been reported to be as high as 60 Hz at F0 < 300 Hz. This exceeds the typical auditory difference limens (DLs) for formant frequencies and is also greater than some of the variation that one would like to study, e.g. the acoustic effects of varying vocal effort. The problem becomes substantially worse when F0 is as high as 500 to 600 Hz, which is not uncommon in the speech of women and children at high vocal efforts.” (Traunmüller & Eriksson, 1997)

“Measurements of the frequency position of the formants, considered as the resonances of the vocal tract, are affected by substantial errors when F0 is as high as it is when people communicate over large distances. This holds for LPC-based methods as well as when using visual inspection of spectrograms.” (Traunmüller & Eriksson, 2000)

“The problem is that it is difficult to determine reliably the resonance frequencies of the tract from the sound alone, using either spectral analysis or linear prediction, once F0 exceeds 350 Hz (Monson and Engebretson, 1983), and essentially impossible once F0 exceeds 500 Hz.” (Joliveau et al., 2004)

“[…] it is difficult to determine unambiguously the frequencies of the resonances with a resolution much finer than f0/2.” (Swerdlin, Smith, & Wolfe, 2010)
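Two of the limits quoted in this subsection lend themselves to a simple numerical reading: Sundberg's condition that F0 exceeds about half of the first formant frequency, and the resolution of roughly f0/2 named by Swerdlin, Smith and Wolfe. The sketch below only restates these figures and adds the distance from a resonance to the nearest harmonic; the 500 Hz resonance and the function names are hypothetical, and nothing here models the cited measurement procedures.

```python
def resolution_floor(f0_hz):
    """Resolution of about f0/2, as stated by Swerdlin, Smith and Wolfe (2010)."""
    return f0_hz / 2.0

def f0_above_half_f1(f0_hz, f1_hz):
    """Sundberg's (1987) condition under which formants are hard to determine."""
    return f0_hz > f1_hz / 2.0

def nearest_harmonic_offset(resonance_hz, f0_hz):
    """Distance in Hz from a resonance to the closest harmonic k * f0 (k >= 1)."""
    k = max(1, round(resonance_hz / f0_hz))
    return abs(k * f0_hz - resonance_hz)

# A hypothetical first resonance at 500 Hz, probed at increasing F0.
for f0 in (100, 200, 350, 500):
    print(f0,
          resolution_floor(f0),
          f0_above_half_f1(f0, 500),
          nearest_harmonic_offset(500, f0))
```

As long as the resonance lies above f0/2, the nearest harmonic is never further away than f0/2; below that point the fundamental itself is the closest component and the offset can be larger.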

Methods of formant estimation III: “One wonders, for example, if the source-filter theory of speech production would have taken the same course of development if female voices had been the primary model early on.”

“To a large extent, the early work in acoustic phonetics focused on the adult male speaker. There were a number of reasons for this focus, including social and technical factors. Only rather recently has the study of acoustic phonetics been broadened to encompass significant research on populations other than men. This is not to say that children and women were neglected altogether in the early history of acoustic speech research. Peterson and Barney’s (1952) classic study included acoustic data on vowels for men, women and children, making it clear that acoustic values vary markedly with age and gender characteristics of speakers […].

The problem is that the research effort given to the speech of women and children has been on a smaller scale than that given to the speech of men. Consequently, there is a continuing need to gather acoustic data for diverse populations. The concentration on male speakers had several consequences, not all of which facilitated research on the speech of women and children. One consequence was the choice of an analyzing bandwidth (300 Hz for the ‘wide-band’ analysis) on early spectrographs that worked well enough for most adult male voices but was deficient for many women and children. The unsuitability of the analyzing bandwidth probably discouraged acoustic analyses of women’s and children’s speech.

The implications of the male emphasis may have reached even to theory; Titze (1989, p. 1699) commented, ‘One wonders, for example, if the source-filter theory of speech production would have taken the same course of development if female voices had been the primary model early on.’ Klatt and Klatt (1990, p. 820) remarked on the same point: ‘informal observations hint at the possibility that vowel spectra obtained from women’s voices do not conform as well to an all-pole [i.e. all formant] model, due perhaps to tracheal coupling and source/tract interactions.’ The acoustic theory for vowels […] assumed that the vocal tract transfer function is satisfactorily represented by formants (poles) and that antiformants (zeros) are required only for modifications such as nasalization. It is advisable to bear in mind that this theory is predicated largely on the characteristics of adult male speech and that it may have to be altered to account for the characteristics of both children and women.” (Kent & Read, 2002, pp. 189–190)