Open access

Acoustics of the Vowel


Dieter Maurer

It seems as if the fundamentals of how we produce vowels and how they are acoustically represented have been clarified: we phonate and articulate. Using our vocal cords, we produce a vocal sound or noise which is then shaped into a specific vowel sound by the resonances of the pharyngeal, oral, and nasal cavities, that is, the vocal tract. Accordingly, the acoustic description of vowels relates to vowel-specific patterns of relative energy maxima in the sound spectra, known as patterns of formants.
The intellectual and empirical reasoning presented in this treatise, however, gives rise to scepticism with respect to this understanding of the sound of the vowel. The reflections and materials presented provide reason to argue that, up to now, a comprehensible theory of the acoustics of the voice and of voiced speech sounds is lacking, and consequently, no satisfying understanding of vowels as an achievement and particular formal accomplishment of the voice exists. Thus, the question of the acoustics of the vowel – and with it the question of the acoustics of the voice itself – proves to be an unresolved fundamental problem.



This treatise is bound to raise many questions, which are not discussed in detail here. Moreover, according to previous experiences regarding academic discussions, some of the considerations and arguments presented here are likely to be refuted on principle.

Some major issues discussed in the literature have not been considered in depth in this text, so that its main argument could be presented in straightforward, general and clear terms. Moreover, in-depth consideration of the issues mentioned has also been dispensed with because they appear in a different light from the perspective adopted here and, thus, they need to be discussed in another context than is usually the case. Within the following exemplary comments, however, some indications are given.

Against the background of the present reflections, experiences and observations, we conclude that it is questionable to explain the lacking distinctiveness of expected spectral energy maxima as formant merging, that is, in terms of the characteristics of auditory perception, without taking into account the entire systematics of empirically observable, vowel-specific spectral characteristics, in particular their dependence on fundamental frequency and the possible ambiguity of spectral envelope peaks and of formant patterns.

The same holds true for normalisation attempts with regard to the presumed general differences between the vowel-specific formant patterns among children, women and men: such normalisation attempts would have to be approached quite differently if the comparisons of the formant patterns of the three speaker groups did not only include different but also similar fundamental frequencies of the sounds of all groups.

The same also applies when attempting to generally relate formant shifts, which occur when the fundamental frequency for sounds of one vowel is raised, to paralinguistic characteristics, in particular vocal effort: low- and high-pitched sounds can both be formed loudly and softly, and the calculated formant patterns of vowel sounds do not only depend on the vowel quality but also on the fundamental frequency in principle. Hence, one has to expect the occurrence of ambiguous formant patterns for sounds produced with equal vocal effort. Thus, we conclude that the shifts of the lower formants with rising fundamental frequency and formant pattern ambiguity as such are not necessarily related to paralinguistic aspects.

To mention one last example, the same also applies when attempting to relate, in general terms, formant shifts that occur when raising fundamental frequency in singing to formant tuning: evidence given in the studies published on this matter does not allow for a conclusion as to whether the documented observations refer to the idiosyncrasies of individual singers, to the stylistic characteristics of a particular singing technique or style with changes in vowel quality (possibly caused by so-called vowel modification), to vocal effort, or whether the observations indeed refer to the fundamental characteristics of vowel sounds.

However, the tendency in the literature to consider vowel sounds at lower fundamental frequencies—with limited frequency variation—to be characteristic of speech, and vowel sounds at middle and higher fundamental frequencies—with extensive frequency variation—to be characteristic of singing, and the tendency to conduct investigations based on these assumptions, needs to be refuted in its turn: neither in everyday life, nor in the entertainment sector, nor indeed in musical art and vocal interpretation is there any such thing as “normal speech” for which, in contrast to singing, a single “average” fundamental frequency could be statistically determined.

The corresponding indications in existing formant statistics are not representative of experienceable speech and observable acoustic characteristics of vowel sounds: they are only representative of sounds uttered into a microphone in a small room in a relaxed and quasi-monotonous manner. (Such a restricted formulation still lacks contextual relativisation in terms of a particular language and “culture”.) There is no essential difference between the fundamental frequency ranges for speaking on the one hand, and for singing on the other, no matter how these categories are determined and distinguished in a scientifically reasonable way. If one listens attentively to everyday utterances and to utterances in theatre and film (nowadays easily accessible due to television), the corresponding experiences make this plain, and both fields of experience need to be integrated into a phenomenology of vowel acoustics (see the Materials section for corresponding examples).

With this consideration in mind, as mentioned, some of the major aspects discussed and interpreted in the literature appear in a different context than is often reflected upon.

As indicated at the beginning of this afterword, the present critical take on the prevailing theory of vowel acoustics must, in turn, prompt scepticism, as has already become evident in many scholarly debates, together with the respective counterarguments. This text has attempted to take into account these arguments. Additional comments follow below.

Whatever the extraordinary and often surprising role of perception in the recognition of speech sounds, this role neither relativises the fact that isolated vowel sounds with a quasi-static sound characteristic can be intelligible beyond a concrete syntactic and semantic context nor that their harmonic spectra are vowel specific. Thus, a psychophysical approach to such vowel sounds, that is, a theory of the relationship between perceived vowel quality and physical characteristic or ensemble of characteristics, must not only be deemed possible but also necessary. The psychophysics of vowel sounds constitutes the basis for an investigation of human voice and speech.


We adopt the viewpoint—and, therefore, have written this text—that any attempt to formulate such a theory in terms of formant patterns cannot be successful. Consequently, a different approach needs to be formulated.

What could explain the fact that most previous studies of vowel sounds, and thus of voice and speech, have not integrated such a line of argument? There seem to be several reasons for this shortcoming. The reason considered paramount here, however, is the lack of a robust empirical, extensive, systematic and representative documentation of the aspects discussed in this treatise. As a consequence of the absence of such reference documentation, the discussion lacks a binding empirical basis that any interpretation must account for. At the same time, the basis for formulating an alternative theory is lacking, too.

Thus, whereas existing individual values obtained in studies of vowel sounds apply to the specific conditions under which these data were gathered, the values are often interpreted in terms of a general physical representation of the vowel, which is empirically contradicted. Generalisation is the critical issue at stake.

To repeat: whereas average formant patterns (as determined statistically and separately for each of the three age- and gender-related speaker groups and related to average fundamental frequencies of relaxed and quasi-monotonous speaking into a microphone in a small, enclosed space) are in general vowel specific, the same does not hold true for substantial fundamental frequency variations evident as prosodic characteristics already in everyday language. Whereas for vowel sounds produced by men and involving a fundamental frequency variation of one octave but not exceeding 200 Hz by much, formant patterns (if methodically substantiated) of vowel sounds in most cases appear independently of fundamental frequency, the same does not hold true for the vowel sounds produced by the majority of women and by almost all children also involving a fundamental frequency variation of one octave. Whereas, in sound synthesis, a specific set of filters related to a specific fundamental frequency makes it possible to perceive a certain vowel quality, in many cases it does not hold true that the same vowel quality is perceived if the filter pattern remains constant but the fundamental frequency is significantly altered. And so on.
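The synthesis case can be made concrete with a small numerical sketch. It is only an illustration under stated assumptions, not a method taken from this treatise: the filter pattern (two envelope peaks at 300 Hz and 2300 Hz), the bandwidths, the 3000 Hz limit and all function names are hypothetical. The sketch keeps the filter pattern constant and shows how the harmonics of different fundamental frequencies sample that same envelope more or less densely. It says nothing about which vowel quality a listener would perceive.

```python
import math

# Hypothetical filter pattern: (centre frequency in Hz, bandwidth in Hz).
# These values are illustrative assumptions, not measurements from the text.
FILTER_PATTERN = [(300.0, 80.0), (2300.0, 120.0)]


def envelope(freq_hz, pattern=FILTER_PATTERN):
    """Spectral envelope as a sum of simple resonance-shaped peaks."""
    total = 0.0
    for centre_hz, bandwidth_hz in pattern:
        half_bw = bandwidth_hz / 2.0
        # Each peak has gain 1 at its centre and 1/sqrt(2) at +/- half the bandwidth.
        total += 1.0 / math.sqrt(1.0 + ((freq_hz - centre_hz) / half_bw) ** 2)
    return total


def harmonics(f0_hz, max_hz=3000.0):
    """Frequencies of the harmonics of f0 up to and including max_hz."""
    count = int(max_hz // f0_hz)
    return [k * f0_hz for k in range(1, count + 1)]


def nearest_harmonic(f0_hz, peak_hz, max_hz=3000.0):
    """Harmonic of f0 closest to a given envelope peak (lower one on a tie)."""
    return min(harmonics(f0_hz, max_hz), key=lambda h: abs(h - peak_hz))


if __name__ == "__main__":
    first_peak_hz = FILTER_PATTERN[0][0]
    for f0 in (100.0, 200.0, 500.0):
        nearest = nearest_harmonic(f0, first_peak_hz)
        print(
            f"f0 = {f0:.0f} Hz: {len(harmonics(f0))} harmonics below 3 kHz, "
            f"nearest to the first peak at {nearest:.0f} Hz "
            f"(envelope level {envelope(nearest):.2f})"
        )
```

With these assumed values, the lowest envelope peak coincides with a harmonic only at the lowest fundamental frequency. At 500 Hz no harmonic lies at or below that peak, so the radiated spectrum carries little direct information about it, which is one way of picturing why a constant filter pattern need not yield a constant spectral result.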

Because there is a lack of reliable, extensive, systematic and representative empirical references, including the documentation of variation of all basic production parameters needed in order to evaluate which physical characteristic is related to a single production parameter and which is in general related to vowel quality, and because, in many handbooks of phonetics, the acoustic characteristics of vowels are often treated briefly, in generalised and summary accounts yet without relativisation and problematisation, the reflections, experiences and observations reported in this treatise are partly unfamiliar, are rarely reconsidered and are in general not integrated when interpreting individual findings of other studies. In the first instance, this complicates the discussion within phonetics and psychophysics. Beyond this, however, attention has to be given to the significance of this lack of relativisation and problematisation for other areas of science—not only for fields such as speech recognition, speech pathology, audiology, or neuropsychology, but also for the investigation of voice as such, including philosophy and art, and for voice and speech education and training. How are these fields meant to relate to reliable basic knowledge and understanding of voice and speech production, and how are these fields meant to design reliable experiments if the unresolved problem of generalising individual measurements is not placed at the centre of understanding and investigation?

Moreover, some scholars are fundamentally critical of basing the psychophysics of the vowel on isolated vowel sounds and they question the recognisability and the linguistic function of such sounds. This critical position generally relates to a linguistic definition of the vowel as a vocoid and as syllabic. However, the previous reflections have shown that this treatise does not concur with the resulting notion of a fundamental opposition between isolated versus context-bound sounds, static versus dynamic spectral processes and “functionless” utterances versus those with a linguistic function. Much could be said in response both to such an opposition and to a critical take on the psychophysics of isolated vowel sounds. Within the limited scope of the present study, however, only a few aspects can be mentioned (the problem as such is a matter for future debate and research):

     As said in the introduction, we take the stand that the recognisability of vowels (monophthongs) as single speech sounds perceived in isolation by listeners of a given speech community belongs to the elementarisation as a basic characteristic of vocal expression and speech and thus to the aptitude of the latter for a phonetic system of writing. Thus, structurally, isolated vowel sounds must be intelligible as such.

     Refuting the fundamental recognisability and the function of isolated sounds—their function in its broad sense, emotional and aesthetic qualities included—is borne out neither by any experience of art, vocal interpretation and entertainment, nor by everyday experience. (This order of denomination, artistic utterances first, everyday utterances last, is chosen to indicate that all phenomena discussed here may first be experienced in a direct way in the arts; then, when familiar with the correspondingly various types of possible utterances and expressions, they will continuously also get one’s attention in everyday utterances; see also the corresponding consideration in the introduction.)

     In this respect, it is worth pointing out the central role that is played by sustaining vowel sounds, sometimes for as long as possible, in musical composition and vocal interpretation—either in isolation or in a sound context and with or without fundamental frequency variation as a melody. The same holds true for basic and advanced voice training in the field of interpretation and performance.

     In this respect, it is also worth noting the occurrence of vowel sounds produced in isolation in vocal expressions such as exclamations or affirmations. (The German exclamations “Ahhh”, “Ohhh”, “Uhhh” and “Ihhh”, to give a paradigmatic example, have a different meaning depending on the context of expression and the vowels must be understood as such.)

     Dynamic processes are often represented and considered as formant transitions. Yet the lack of a general correspondence between vowel qualities and related formant patterns and the limited methodological reliability of formant determination (discussed in this study in relation to quasi-static sounds) must also be linked to dynamic descriptions. Thus, for instance, it is not evident how formant transitions for a sound produced at a fundamental frequency of approximately 200 Hz are supposed to be compared with those of another sound of the same vowel but at a fundamental frequency of 500 Hz.

Furthermore, whereas other scholars have recognised some or all of the problems discussed here, they often reproach the present kind of fundamental deliberation for not formulating a new theory. Such argumentation, however, does not correspond to the views and the stance of this treatise. If reasoned, well founded and applicable, any criticism of prevailing acoustic theory has its own intrinsic value, utterly irrespective of whatever it is that is offered or proposed beyond that criticism. Above all, it allows for an identification and formulation of challenges and, spurred by the need to resolve them, it drives the search for a new approach.

Pursuing a phenomenology and building a new theory requires a considerable effort along with the appropriate resources. Doing so presupposes that the scholarly community acknowledges the importance of such a venture. Any such acknowledgment, however, requires a comprehensible critique of prevailing theory to be advanced, together with a reinterpretation of previous empirical findings.

The author has also written this text because he does not know how far-reaching his contribution and that of his research colleagues is to the phenomenology and a new theoretical framework. However, two attempts are in progress. With regard to phenomenology, a research team is currently creating a large corpus of vowel sounds for Standard German, produced by children, women and men, including extensive variation of basic production parameters and including both untrained and trained speakers and singers (see Maurer, n.d.). In this way, we attempt to contribute to the creation of a systematic reference basis for vowels of single languages. With regard to theory, in a subsequent treatise, we will investigate in detail the thesis of vowel-specific harmonic spectra.

To conclude, the general significance of acoustic characteristics of vowel sounds should, as indicated, not be regarded as solely the subject of phonetics. Above all, it concerns the understanding of the voice as such.

The voice is currently attracting particular attention in the humanities. Deliberations in these fields are directly related to the knowledge and experience gained in artistic creation and interpretation, and there is a strong emphasis on the need for an interdisciplinary approach. In line with such a claim, the research culture in the aesthetics of the voice ought to adopt a particular stance toward the acoustics of the voice, too: namely, not only to cite phonetics with regard to existing descriptions of vocal utterances, but to critically discuss these descriptions and link them to considerations and experiences of art, interpretation and entertainment. In this context, a call should emerge not to take the “Western” perspectives and production styles as the starting point of investigation for the acoustics of the voice, but initially to consider any vocal expression, habit and style of any cultural context as equivalent. In doing so and in facing the diversity of possible vocal expressions, at least in the first instance, no classification of “normal” and “differing” phenomena and no hierarchical order should be imposed, but a decidedly descriptive perspective should be adopted. As said, there should be no underestimation or misunderstanding of the fact that raising questions regarding voiced speech sounds raises questions regarding the voice itself.

Our vocal cords produce sound. The resonances of the pharyngeal, oral and nasal cavities could form its characteristics into a formant pattern that always and uniquely represents a vowel physically, and thus allows the listener to perceive it accordingly. Empirical investigation reveals, however, that the spectral characteristics of vowel sounds systematically deviate from such an option. This observation leads to the conclusion that, at present, we are but in the preliminary stages of understanding the physical representation of the vowel and, thus, its materialised form.