Open access

Acoustics of the Vowel


Dieter Maurer

It seems as if the fundamentals of how we produce vowels and how they are acoustically represented have been clarified: we phonate and articulate. Using our vocal cords, we produce a vocal sound or noise which is then shaped into a specific vowel sound by the resonances of the pharyngeal, oral, and nasal cavities, that is, the vocal tract. Accordingly, the acoustic description of vowels relates to vowel-specific patterns of relative energy maxima in the sound spectra, known as patterns of formants.
The intellectual and empirical reasoning presented in this treatise, however, gives rise to scepticism with respect to this understanding of the sound of the vowel. The reflections and materials presented provide reason to argue that, up to now, a comprehensible theory of the acoustics of the voice and of voiced speech sounds is lacking, and consequently, no satisfying understanding of vowels as an achievement and particular formal accomplishment of the voice exists. Thus, the question of the acoustics of the vowel – and with it the question of the acoustics of the voice itself – proves to be an unresolved fundamental problem.

13    Preliminaries

13.1    Impediments to Adjusting Prevailing Theory

In response to the principal difficulties in intellectually re-enacting the prevailing theory of the acoustics of the vowel and in response to the empirical observations discussed in the previous chapters, there are several arguments against adjusting or modifying prevailing theory and the corresponding methods of acoustic analysis.

According to prevailing theory, formant patterns are deduced from patterns of vocal-tract resonances. The formulation of a substantial interrelation between these resonances and fundamental frequency in the production of vowel sounds would directly contradict the two-part model of source and filter and the corresponding understanding of phonation and articulation, namely, the production of a general source sound and its transformation by vocal-tract resonances. Fundamental frequency is a primary characteristic of the source, and resonances are a primary characteristic of the vocal tract. These resonances are independent of the sounds or noises affecting them. (Interactions of source and filter, as described in the literature, do not relate to the aspects discussed here.) This amounts to a fundamental conceptual obstacle when it comes to differentiating or modifying prevailing theory.
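The independence of resonances and fundamental frequency assumed by the two-part model can be illustrated with a minimal sketch. It is not taken from the treatise: the resonance values, the source slope of -12 dB per octave and all function names are hypothetical, chosen only to show that the filter gain at a given frequency is the same whatever the fundamental frequency of the source.

```python
import math

# Hypothetical resonances as (centre frequency, bandwidth) in Hz.
# Illustrative values only, not taken from the text.
RESONANCES = [(700.0, 90.0), (1200.0, 110.0), (2600.0, 170.0)]


def resonance_gain_db(freq_hz, centre_hz, bandwidth_hz):
    """Gain in dB of one two-pole resonance, normalised to 0 dB at 0 Hz."""
    sigma = math.pi * bandwidth_hz
    w0 = 2 * math.pi * centre_hz
    w = 2 * math.pi * freq_hz
    numerator = sigma ** 2 + w0 ** 2
    denominator = math.hypot(sigma, w - w0) * math.hypot(sigma, w + w0)
    return 20 * math.log10(numerator / denominator)


def filter_gain_db(freq_hz, resonances=RESONANCES):
    """Total gain of the cascaded resonances; it never refers to the source."""
    return sum(resonance_gain_db(freq_hz, c, b) for c, b in resonances)


def harmonic_spectrum(f0_hz, max_freq_hz=5000.0, slope_db_per_octave=-12.0):
    """Harmonic levels in dB as (frequency, level) pairs for a given f0."""
    spectrum = []
    k = 1
    while k * f0_hz <= max_freq_hz:
        freq = k * f0_hz
        source_db = slope_db_per_octave * math.log2(k)  # assumed source slope
        spectrum.append((freq, source_db + filter_gain_db(freq)))
        k += 1
    return spectrum
```

In this sketch, a change of fundamental frequency alters only the number and spacing of the harmonics that sample the filter curve (50 harmonics up to 5 kHz at 100 Hz, 12 at 400 Hz), while `filter_gain_db` itself remains unchanged. This is the model-internal independence that the argument above refers to.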

Current methods of formant analysis neglect fundamental frequency as a source characteristic in the calculation of filters. There is little scope for changing this approach within the existing procedural framework.

Besides, even if formants are not considered to be directly linked to vocal-tract resonances but are interpreted solely as results of an analytical decomposition of a sound into a source and a set of filters, it proves difficult to imagine a corresponding method of acoustic analysis applicable to all recognisable sounds and all of the aspects discussed in Part III. This lack of projection itself impedes the modification of prevailing methodology.

The observable behaviour of vowel-specific patterns of relative spectral energy maxima (if determinable) and of formants (if methodically substantiated) cannot be formulated in terms of a general rule, such as relating these characteristics to fundamental frequency as a simple ratio, whether or not such a ratio is based on an auditory scale. Empirically, these characteristics prove to be unsystematic: in general, the shifts in the spectral envelope peaks and the formants discussed are distinctly evident only at fundamental frequencies above c. 200 Hz; the shifts affect the lower spectral frequency ranges and the higher ranges differently; thus, the shifts affect the entire vowel-specific frequency range of back vowels in a direct way but only partly affect the vowel-specific frequency range of front vowels; the shifts relate to vowel quality, yet in parallel, they also relate to the frequency levels of the spectral envelope peaks or formants in question; in addition, a strong variation in vocal effort also affects the frequency location of the spectral envelope peaks and the calculated formants.

Because of this lack of systematic empirical evidence, and because there is no uniform method for analysing vowel-specific acoustic characteristics that covers all utterances allowing for vowel perception, no robust basis exists for further differentiating the description of vowel-specific spectral characteristics within the prevailing approach, which relates them to patterns of spectral peaks or patterns of formants.

These reflections, experiences and observations constitute the scepticism expressed in this treatise about attempting to adjust or modify prevailing theory and related methods of further analysis.

13.2    Prevailing Theory as an Index

Given that a voiced vowel sound is produced in isolation and that it exhibits a quasi-constant periodic spectral characteristic, and given its unambiguous perception as belonging to a specific vowel quality (related to a particular language), then its average harmonic spectrum, measured for the entire duration of the respective sound, is said to be vowel specific: for a frequency range concerning the physical representation of all vowels of the corresponding language, a series of harmonics quasi-identical in number, frequencies and levels can only be found for other sounds of the same vowel but not for other sounds of any other vowel. This statement is formulated here as a hypothesis.

The same holds true for corresponding sounds that are isolated from a particular syntactic and semantic context and that are analysed accordingly as sound fragments.

Obviously, a direct comparison of harmonic spectra always relates to sounds at quasi-identical fundamental frequencies.

Harmonic spectra, as claimed here, are vowel-specific and, further, may also prove to be orthogonal in vowel representation: on their basis, the respective sounds are expected to be reproducible without any change in the perceived vowel quality.

Hence, the fundamental aspects of the problems discussed in the previous parts of this treatise can neither be attributed to dynamic processes occurring within a sound nor to the particular characteristics of the syntactic and semantic context, nor indeed to special perceptual processes. Nor can these aspects be relativised accordingly. On the contrary, they constitute an ensemble of individual problems that first needs to be explained, just as the physical representation of the vowel itself, as a phenomenon, needs to be clarified.

Given that voiced vowel sounds are compared at similar fundamental frequencies, and given that the spectral envelope is determined by the amplitude values of the harmonics, obviously, such an envelope is also vowel specific. However, concerning spectral envelope peaks, no simple statement can be derived if all fundamental frequencies of intelligible vowel sounds are considered.

Given that voiced vowel sounds are compared at similar fundamental frequencies, and given a methodological substantiation, it can be expected that calculated formant patterns (including formant bandwidths) may, in most cases, also prove to be vowel specific and that, on their basis and not altering fundamental frequency, the respective sounds can be reproduced without substantial change in the perceived vowel quality.

Thus, prevailing theory “hints” or “points” at the basic characteristic of the physical representation of vowel quality in an indexical manner. Prevailing theory proves to be an index of this representation.

13.3    Excursus: Vowel Quality and Harmonic Spectrum

To repeat: given that a voiced vowel sound is produced in isolation and that it exhibits a quasi-constant periodic spectral characteristic, and given its unambiguous perception as belonging to a specific vowel quality, then its average harmonic spectrum, measured for the entire duration of the respective sound, is said to be vowel specific. For a frequency range concerning the physical representation of all vowels of a language, a series of harmonics quasi-identical in number, frequencies and levels can only be found for other sounds of the same vowel but not for other sounds of any other vowel.

At first glance, such a statement seems trivial. But it is not.

To say that a harmonic spectrum of a vowel sound is specific for the perceived vowel quality (given the above conditions for the sounds under investigation) is not to say that all sounds of a vowel have very similar spectra of this kind. As shown, large spectral variations can be found for the sounds of one vowel, particularly if vocal effort is varied during the sound production, if sounds of different speaker groups are compared and if different speaking and singing modes and styles, including stage voices, are also considered.

Therefore, an attempt to directly assess the spectral difference related to a perceptual difference of two vowels simply by calculating an average harmonic spectrum for all sounds of one vowel at a given fundamental frequency and comparing it with the similarly averaged harmonic spectrum of the other vowel may, in many cases, not result in a clear spectral difference, that is, in a frequency limit from which the two averaged spectra begin to diverge with no overlap. Exceptions may occur at high fundamental frequencies because the perceived vowel quality is represented by a greatly reduced number of harmonics.

Considering both the direct relation between harmonic spectrum and perceived vowel quality on the one hand and the observably large variation of harmonic spectra for sounds of single vowels on the other, and speculating that instead of looking at a static spectral configuration we should consider looking at a kind of spectral foreground-background relation, another attempt may provide more evidence.

If the harmonic spectrum of a reference sound of a vowel is compared with both the spectra of other sounds of the same vowel and the spectra of sounds of a second vowel, then there will be a frequency limit above which the spectrum of the reference sound diverges from any spectrum of the sounds of the second vowel, but not from any spectrum of the sounds of the same vowel.

More precisely, any single sound of a vowel compared with sounds of another vowel (given similar fundamental frequencies of the sounds) is assumed to be describable in terms of a relation of maximal spectral similarity and subsequent—related—spectral difference: for a (lower) frequency range, the harmonic spectrum of the single sound of the first vowel of comparison can resemble some other harmonic spectra of the second vowel, but if the maximum of this frequency range of possible resemblance is reached, its spectrum differs from all the spectra of the second vowel sharing the maximal similarity, while still resembling some other spectra of the first vowel.
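The relation of maximal spectral similarity and subsequent spectral difference described above can be sketched in code. This is only one possible reading, under assumptions that are not part of the text: each spectrum is given as a list of harmonic levels in dB, all sounds share the same fundamental frequency, and two spectra "resemble" each other over a range if every harmonic level in that range differs by no more than a tolerance `tol_db`. The tolerance value, the function names and the example levels are hypothetical.

```python
def resemblance_limit(reference, other, tol_db=3.0):
    """Number of lowest harmonics over which two spectra stay within tol_db."""
    count = 0
    for ref_level, other_level in zip(reference, other):
        if abs(ref_level - other_level) > tol_db:
            break
        count += 1
    return count


def divergence_limit_hz(reference, other_vowel_spectra, f0_hz, tol_db=3.0):
    """Frequency of the lowest harmonic at which the reference has diverged
    from every spectrum of the other vowel, or None if one of them
    resembles the reference over its whole length."""
    max_similar = max(
        (resemblance_limit(reference, s, tol_db) for s in other_vowel_spectra),
        default=0,
    )
    if max_similar >= len(reference):
        return None
    return (max_similar + 1) * f0_hz


def same_vowel_support(reference, same_vowel_spectra, n_harmonics, tol_db=3.0):
    """True if some sound of the same vowel still resembles the reference
    over the first n_harmonics harmonics."""
    return any(
        resemblance_limit(reference, s, tol_db) >= n_harmonics
        for s in same_vowel_spectra
    )


# Hypothetical harmonic levels in dB for sounds at f0 = 200 Hz.
reference = [0, -4, -10, -6, -20, -30]
second_vowel = [[0, -5, -11, -18, -25, -30], [-1, -3, -2, -8, -20, -28]]
same_vowel = [[-1, -5, -9, -7, -21, -31]]

limit = divergence_limit_hz(reference, second_vowel, 200.0)
supported = same_vowel_support(reference, same_vowel, 4)
```

With these invented values, the reference resembles one sound of the second vowel over the first three harmonics at most, so the limit falls at the fourth harmonic (800 Hz), while a sound of the same vowel still resembles the reference there. A failure to find such a limit in real data would count against the principle, which is what makes it testable.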

This principle is taken here as the most conservative but also the most promising approach and basis for future research on the acoustics of the vowel: it is testable and falsifiable in a fully objective manner for all levels of fundamental frequency of comparison, it does not need further differentiations related to speaker groups or vocal effort or speaking or singing styles or modes and, therefore, its testing does not require any integration of further phonetic knowledge. Moreover, it also applies to synthesised sounds which are produced using a harmonic synthesiser. Thus, it “hints” or “points” at the basic characteristic of the physical representation of vowel quality in a much stronger manner than prevailing theory, i.e. it is a stronger index of this representation.

Moreover, if developed in more detail, it leads to an entire system comprising various possible relations of spectral similarities and related spectral differences of sounds of all vowels for a given language.
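For readers unfamiliar with the harmonic synthesiser mentioned above, the following minimal sketch shows the basic idea of reproducing a sound from a harmonic spectrum: sinusoids at integer multiples of the fundamental frequency are summed with amplitudes derived from the harmonic levels. It is an illustration only, not the synthesiser used for this treatise. Zero starting phases, peak normalisation, the default parameters and the function name are assumptions.

```python
import math


def synthesise_harmonic_sound(levels_db, f0_hz, duration_s=0.5, sample_rate=44100):
    """Additive synthesis: sum sinusoids at k * f0 with the given levels in dB.

    Returns a list of samples normalised to a peak of 1.0.
    """
    amplitudes = [10 ** (level / 20) for level in levels_db]
    n_samples = int(duration_s * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        value = 0.0
        for k, amp in enumerate(amplitudes, start=1):
            # Skip harmonics at or above the Nyquist frequency to avoid aliasing.
            if k * f0_hz < sample_rate / 2:
                value += amp * math.sin(2 * math.pi * k * f0_hz * t)
        samples.append(value)
    peak = max((abs(s) for s in samples), default=0.0)
    return [s / peak for s in samples] if peak > 0 else samples
```

Because the fundamental frequency and the harmonic levels are separate arguments, a sketch of this kind allows the same set of levels to be rendered at different fundamental frequencies, or different sets of levels at the same fundamental frequency, for listening tests.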

Although formulated on the basis of a very extended knowledge of vowel spectra, obviously, these short reflections are but general assumptions open to further clarification and empirical verification or falsification, and even if they can be empirically demonstrated as valid, they would still remain a fragmentary and temporary basis in the course of reformulating the acoustics of the vowel. Therefore, the drawbacks of investigating the harmonic spectrum—above all, the impossibility of comparing spectra related to very different fundamental frequencies directly, and the impossibility of including the analysis of vowel sounds not exhibiting quasi-static periodic characteristics of the sound wave—are not further discussed. The same applies to the limitation of the principle formulated, namely, that it only relates to vowel-specific spectral differences but not to a full determination of vowel-related acoustic characteristics.

However, for further advances in the investigation of the acoustics of the vowel, an assessment of the reliability of every given statement is needed, and the possibility of a falsification plays a crucial role in this assessment: it is the falsification of a generalised assumption of vowel-related formant patterns that called for this treatise.

Up to now, concerning the acoustics of the vowel, there are only two statements that apply to all vowel sounds: vowel sounds, perceived as isolated single sounds, are intelligible, and therefore the vowel quality must be physically represented in the corresponding sound wave and its characteristics. According to this view, an investigation of the harmonic spectra is one of the most promising approaches, even if it is limited to quasi-constant voiced vowel sounds.

Such a step-by-step procedure will be needed as long as there is no objective and orthogonal method to describe the acoustic characteristics that physically represent the perceived vowel quality, including all types of vowel sounds (see also below). However, during this procedure, a kind of rule-based knowledge will emerge and provide a basis for the development of an objective and comprehensive method.

13.4    “Forefield”

All of the above leads to the conclusion that, at present, no theory of the acoustic representation of the vowel exists. However, empirical evidence exists that indicates the possibility of such a theory and that will contribute to its development. Thus, such a theory is currently in its preliminary stages.

13.5    Two Approaches

Prevailing theory is characterised by its explanation and description of vowel sounds within a physical model unspecific to speech: all kinds of sounds and noises are transformed by filters in the same way, irrespective of whether or not they concern utterances (speech events).

One possible way to respond to the difficulties of understanding prevailing theory in terms of its intellectual re-enactment and to the fact that empirical findings can contradict its predictions might be to supplement the existing source-filter model or to replace that model by another physical model external to language and speech.

Another approach might be to assume that the production and formation of vocal sounds is speech specific and, based on such a premise, to develop a method for describing vowel sounds in form-related terms. This second approach assumes that the vowel sound and its manifestations elude description within a purely physical model.

Whether this covers all of the possible approaches is left open for discussion here.

As explained in Section 13.1, there are substantial reasons for scepticism about the possibility of adjusting prevailing theory and the related methods of acoustic analysis. One further and important aspect, in addition to the arguments already mentioned, concerns the following consideration: it would be possible for humans to produce a vowel-unspecific source sound and transform that sound using vocal-tract resonances, thereby producing the respective vowel-specific physical characteristics according to which listeners perceive vowels, both unambiguously and independently of fundamental frequency. But it belongs to the actual acoustic phenomenon of the vowel sound to systematically deviate from this. The empirical evidence for vowel sounds suggests that humans do not produce such sounds as systematically as physics and physiology seem to predetermine. This contrast might prove fundamental for future theory building.

Elsewhere, the author has formulated the state of affairs as follows: (i) either resonances as such, and thus the corresponding pharyngeal, oral and nasal resonance patterns of the vocal tract, fail to represent in full the physical quantity to which language and speech directly refer, but another physical quantity can be found instead; if this is the case, then it is simply a matter of replacing the existing (physical) model with another rather than adopting a fundamentally different perspective; (ii) or, aside from the human voice, no construction, no instrument and no process can be found to exist in physics that would explain and allow for the production of vowel sounds including basic variations of sound characteristics, for example, fundamental frequency and phonation type; then, the physical representation of the human voice cannot be related to a simple voice-independent physical quantity, but instead, the voice would produce a “substance” or “quantity”.

Based on all the reflections, experiences and observations presented, this treatise belongs to the second kind of undertaking. This calls for a corresponding phenomenology and for theory building.

13.6    Phenomenology

On the one hand, the existing documentation of vowel sounds hitherto published is no more than fragmentary, and on the other, the methods for describing their acoustic characteristics have substantial shortcomings and limitations. Thus, as argued above, a phenomenology is needed, that is, a step-by-step build-up of systematic compilations of vowel sounds related to individual languages, including the variation of all relevant production parameters. In its course, attempts to describe acoustic characteristics related to vowel qualities in terms of knowledge-based rules will become possible (see above).

In the first instance, such a phenomenology refers to the vowel sounds of a particular language, produced in isolation or detached from sound context, exhibiting quasi-constant spectral characteristics and allowing for high scores of vowel identification in listening tests, involving listeners of the speech community of that particular language.

13.7    Theory Building

As said, vowel sounds perceived as isolated single sounds can be intelligible. This fact is central to human voice and speech. With regard to such sounds, the psychophysical question arises as to which describable physical characteristic or which ensemble of physical characteristics may be said to represent the perceived vowel qualities.

Theory building thus faces a threefold challenge. Firstly, it must produce a uniform, systematic and orthogonal method to describe vowel-specific acoustic characteristics. Only such a descriptive method enables a systematic synthetic reproduction of vowel sounds, based on empirically determined characteristics of natural vocalisations, and thus the verification of the significance of corresponding analyses. Secondly, in relation to the phenomenology discussed, theory building must deduce hypotheses that predict the physical representation of vowel quality irrespective of the individual cases of the vowel sounds, thus extrapolating the phenomenological description. These hypotheses must satisfy the requirements of verification and falsification on the one hand, and be transferable to different languages on the other. Thirdly, theory building must seek to explain empirical findings and the hypotheses deduced from such findings.