Voice Acoustics: an introduction

Speech science has a long history. Speech and voice acoustics are an active area of research in many labs, including our own, which studies the singing and speaking voice. This document gives an introduction and overview. This is followed by a more detailed account, sometimes using experimental data to illustrate the main points. Throughout, a number of simple experiments are suggested to the reader.
The table compares some pairs of phonemes that are pronounced with (nearly) the same articulation but with vocal fold vibration (voiced) and without vibration (unvoiced).
In fricatives, the tract is so constricted (by tongue, palate, teeth, lips or a combination) that sustained turbulent flow contributes broad-band sound to the spectrum. Plosives involve opening and/or closing of the tract with the lips (p, b) or the tongue (t, d; k, g) at different places of articulation. The sudden opening or closing and associated turbulence briefly produce broad band sound in plosives.
The sound of the ‘source’ interacts with the ‘filter’ (and also, as we'll see later, vice versa). Depending on how you position your tongue and the shape of your mouth opening, different frequencies will be radiated out of the mouth more or less well. Another experiment: sing a sustained note at constant pitch and loudness, while varying the opening of your mouth and the position of the tongue. This will allow you to produce most of the vowels of English and some other phonemes, such as the ‘ll’ in ‘all’ or the ‘r’ in ‘or’, as pronounced in some accents.
How you position your velum (soft palate) also makes a difference. In the normal (high) position, all of the air and sound goes through the mouth. Lower it and you connect the nasal pathway to the mouth and lower vocal tract. Lower it further and you seal the mouth off from the pathway from nose to larynx. For the next experiment, observe the differences between a nasal sound (‘ng’) and a non-nasal one (‘ah’), then try sealing and unsealing your nose with your fingers, and also opening and closing your mouth, which will tell you how completely your velum seals one of the pathways.
To a large extent, vowels in English are determined by how much the mouth is opened, and where the tongue constricts the passage through the mouth: front, back or in between. One can ‘map’ the vowels in terms of these articulatory details, or in terms of acoustic parameters that are closely related to them. Here are ‘maps’ for two different accents of English.
The frequencies on the axes correspond to bands of frequencies that are efficiently radiated, about which more later. The vertical axis on these graphs roughly corresponds to the jaw position (high or low) or the size of the lip opening. The horizontal axis corresponds to the position of the tongue constriction. We’ll return later to explain more about such maps, and how they may be obtained.
Vowel planes for two accents of English (Ghonim et al., 2007). These data were gathered in a large, automated survey in which respondents from the US (left) and Australia (right) identified synthesised words of the form h[vowel]d: a form in which most examples are real words. ‘short’ and ‘long’ indicate that more than 75% of the choices fell in these categories. You can map your own accent in this way on this web site.
Vowels and some other phonemes may be sustained over time: for them, the position of the articulators (and so the values of the well-radiated frequency bands) is relatively constant.
For other phonemes (such as the ‘p, b, t, d’ discussed above), the change in articulation is important. Consequently, so are the variations with time of the associated frequency bands, as is the broad band sound associated with the opening or closing (Smits et al., 1996; Clark et al., 2007). In the examples ‘p, b, t, d’, the mouth opening is obviously changing during the consonant. Experiment: try slowing down (a lot) the motion of opening and closing and see if you notice what seems like a change in the vowel.
Like vowels, liquids (r and l) and nasal consonants (m, n, ng) are voiced and have a characteristic set of spectral peaks. For these, the tongue provides a narrower constriction than it does for vowels.
In speech, vowels are in a sense less important than consonants: you can often understand a phrase even –f –ll v–w–l –nf–rm–t–n –s –bs–nt. On the other hand, vowels are more important in singing, because the vowel is sustained to produce a note.
The separation of parts of the voice function into ‘source’ and ‘filter’ is practical, but one should remember that the distinction is incomplete. For instance, the geometry of the vocal folds affects not only the operation of the folds and thus the source, but also affects the acoustic properties of the ‘filter’. The geometry of the vocal and nasal tracts determines how they filter the sound, but the acoustical properties resulting from this geometry are thought to affect the operation of the vocal folds. We talk about these complications below.
Contrasting the voice with wind instruments
If we neglect the influence of the articulators on the larynx, we have the Source-Filter model. Superficially, it may seem obvious to a singer that the larynx and the articulators are independent: to many singers, particularly men in the low range, it seems that we can vary the pitch (~ the source) and the vowel (~ the resonator/filter) independently.
In contrast, an analogous assumption would seem a very odd approximation to someone who plays a brass instrument. A trombonist knows that the resonances in the bore of the instrument (~ the resonator/filter) do indeed affect the motion of the player’s lips (~ the source). In fact, a brass player’s lips generally tend to oscillate at one of the frequencies of the resonances in the bore (see Acoustics of Brass Instruments). We shall return to this below, but let’s first note the following important quantitative difference between the two.
A trombone has a range that overlaps that of a man’s voice. However, the trombone is longer (a few metres) than a man’s vocal tract (0.2 m). The range of fundamental frequencies of the trombone lies within the range of the bore resonances. The range of the voice, especially of a man’s voice, usually lies below the frequencies of the vocal tract resonances. You are probably thinking that this difference – and therefore the approximation that the resonator doesn't affect the source – is most questionable for high pitches, when the fundamental of the voice enters the range of vocal tract resonances. You're right, and we’ll come back to this.
The Source-Filter model
In the Source-Filter model (Fant 1960), interactions between sound waves in the mouth and the source of sound are neglected. Although oversimplified, this model explains many important characteristics of voice production.
Let’s apply it to two of the sources that we have discussed already. In a whispered voice, turbulent flow between the vocal folds produces very many frequencies: the spectrum looks like a continuous line. For normal voiced speech, however, the motion of the vocal folds is a periodic vibration that modulates the flow of air, which produces a harmonic spectrum. (More details and graphs are given below.) The cartoon below uses these to illustrate the Source-Filter model.
A schematic of the source-filter model, from Wolfe et al. (2009). The periodic spectrum corresponds to normal speech and singing. (See What is a sound spectrum?) The vertical lines indicate the harmonics, and the lowest of these, at about 140 Hz, is the fundamental frequency at which the vocal folds vibrate. The continuous spectrum corresponds to whispering. One or other signal is input to the vocal tract, which we treat as a filter whose gain shows peaks at two frequencies in the range sketched. At the mouth, high frequencies are better radiated, as indicated in the next graph. The last pair of graphs sketch the spectra of the output sound. The vertical axes of all graphs are logarithmic, so the gains add; for linear axes, we would replace + with ×. On another site, we give some practical examples of the source-filter model, with sound files.
The spectrum of the output sound depends on the spectrum of the laryngeal source, on the frequency dependent ‘gain’ of the vocal tract, on the efficiency of radiation from the mouth and nose and on interactions among these. We shall discuss these in the more detailed sections below.
In the example sketched above, there are maxima in gain near 500 and 1,500 Hz (corresponding to the vowel /ɜː/, as in ‘heard’). For a fundamental frequency of 140 Hz (as sketched), the harmonics falling nearest these maxima (the third and fourth, and the tenth and eleventh) are more efficiently radiated than are other harmonics. Peaks in the radiation of the whispered sound occur at similar frequencies. In speech, these high power frequency bands – these broad peaks in the spectral envelope – are very important. The frequencies at which they occur are close to (but not exactly equal to) those of the peaks in the gain function of the tract.
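To make the arithmetic concrete, here is a minimal numerical sketch of the Source-Filter idea in Python. The two-peak ‘gain’, its bandwidth, and the −12 dB/octave source tilt are illustrative assumptions, not measured data; only the peak frequencies (500 and 1,500 Hz) and the 140 Hz fundamental come from the example above.

```python
import numpy as np

def tract_gain_db(f, peaks=(500.0, 1500.0), bw=120.0):
    """Toy vocal-tract 'gain': two resonant peaks, in dB.
    Peak shape and bandwidth are illustrative assumptions only."""
    g = sum(1.0 / (1.0 + ((np.asarray(f) - p) / bw) ** 2) for p in peaks)
    return 20.0 * np.log10(g + 1e-3)

f0 = 140.0                                   # fundamental, as sketched
harmonics = f0 * np.arange(1, 15)            # 140, 280, ..., 1960 Hz
source_db = -12.0 * np.log2(harmonics / f0)  # an assumed -12 dB/octave tilt
output_db = source_db + tract_gain_db(harmonics)  # adding dB = multiplying gains

# The harmonics nearest the gain peaks are the most efficiently radiated:
favoured = harmonics[np.argsort(tract_gain_db(harmonics))[::-1][:4]]
```

With these numbers, `favoured` picks out the third, fourth, tenth and eleventh harmonics (420, 560, 1400 and 1540 Hz), the ones straddling the two gain maxima.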
There is a good reason why the various spectra in the preceding figure are sketched: most of them cannot be measured directly. So, in the figures below, where we illustrate the Source-Filter model experimentally, we have to resort to indirect measurements. We can’t measure the flow spectrum through the larynx, but we can measure the vibration of the vocal folds. Here, we do that using an electroglottograph (EGG): we apply a small radio-frequency voltage across the neck using skin electrodes at the level of the vocal folds. The magnitude of the current that flows varies as the folds come into contact and separate. The spectra and sound files at the top of the figure are of an EGG signal. Below that, we show the results of measurements of the resonances of the vocal tract, made at the mouth, during speech. This gives a quasi-continuous line whose peaks identify the resonances; it also shows the harmonics of the voice. (We discuss this technique here.) Below that are the spectra measured for that particular vowel, in the same gesture.
Here we contrast two vowels: At left is the vowel /ɜː/, as in ‘heard’ (like the example used in the preceding figure). At right, [o], as in ‘hot’. The top graphs and sound files are experimental measurements of the vocal fold contact. Note that this measurement of the source shows little difference between the two vowels: the filter has little or no effect on the source. The next pair of graphs are measurements of the vocal tract, made from the mouth, during the vowel. (More on this technique here.) The broad peaks identify resonances of the vocal tract; the sharp lines are the harmonics. Here, because the tract is in a different configuration for the two vowels, the resonances occur at different frequencies. The next two rows show the voice output for voiced speech and for whispering, measured in the same vocal gesture. More detail on these examples here.
Before we leave this brief overview, it is worth noting that much about the voice is still incompletely understood. One of the reasons for this is the difficulty of doing experiments. Some of the data that we should like to have – the gain function of the vocal tract sketched above, or the mass and force distribution in the vocal folds, for instance – cannot be measured while the voice is operating, for practical as well as ethical reasons.
For most human physiology, much information has been obtained from other species, whose organs function in similar ways. When it comes to the voice, however, there is no such similar species – no-one is very interested in the voice of the lab rat. Much of our knowledge comes from experiments using just the sound of the voice as experimental input. Other knowledge comes from medical imaging. Another approach is to use a mathematical model: one can treat the vocal folds as collections of masses on springs, and the vocal tract as an oddly shaped pipe that transmits sound. The next step is to solve the equations for this simple system and to predict the sound it would make, and to see how this correlates with sounds of speech or singing. Another is to make artificial systems with the shape of the vocal tract and some sort of aero-mechanical oscillator at the position of the glottis. Yet other knowledge comes from other experiments and observations that are often, for practical and ethical reasons, somewhat indirect. Because of the importance of the human voice, these are all active research areas.
We now look more closely at some of the topics introduced above. Other reviews are given by, for example, Lieberman and Blumstein, 1988; Stevens, 1999; Hardcastle and Laver, 1999; Johnson, 2003; Clark et al., 2007; Wolfe et al., 2009. References are given below.
The source at the larynx
To speak or to sing, we usually expel air from the lungs. The air passes between the vocal folds, which are muscular tissues in the larynx. If we get the air pressure and the tension and position of the vocal folds just right, the folds vibrate at acoustic frequencies. This means we have an oscillating valve, letting puffs of air flow into the vocal tract at some frequency f0.
These sketches illustrate the larynx, viewed from above, in position for phonation and for breathing.
Technically, we move the arytenoid cartilages closer together than in their separated breathing position, which brings the vocal folds closer to each other: this is called adduction (Scherer, 1991). The reduced aperture between the folds is called the glottis. Compared to the breathing position, the narrow glottis restricts the flow of air, so the steady pressure drop across the larynx is greater when the aperture is small. The higher pressure drop means that the speed of air through the glottis is high, but the small cross section means that the volume flow (in litres per second) is less. Experiment: take a deep breath and time how quickly you can breathe it out completely with your larynx relaxed. Now do the same while pronouncing a whispered ‘ah’, and again while singing a (loud) ‘ah’. Which breath lasts longest (i.e. which has the lowest flow)?
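A rough quasi-steady Bernoulli estimate illustrates the point about flow: the jet speed through the glottis scales as √(2Δp/ρ), while the volume flow is that speed times the (small) glottal area. The pressure and area used below are illustrative guesses of typical orders of magnitude, not measurements.

```python
import math

RHO_AIR = 1.2  # kg/m^3, approximate density of air

def glottal_jet_speed(delta_p):
    """Quasi-steady Bernoulli estimate: v = sqrt(2*dp/rho).
    Ignores viscosity and the oscillation of the folds."""
    return math.sqrt(2.0 * delta_p / RHO_AIR)

def volume_flow(delta_p, glottal_area):
    """Volume flow = jet speed x glottal cross section, in litres/s."""
    return glottal_jet_speed(delta_p) * glottal_area * 1000.0

# Illustrative values: ~1 kPa subglottal pressure, glottis ~0.05 cm^2
v = glottal_jet_speed(1000.0)   # ~ 41 m/s: high speed...
q = volume_flow(1000.0, 5e-6)   # ~ 0.2 L/s: ...but small volume flow
```

A flow of a couple of tenths of a litre per second empties a deep breath far more slowly than relaxed exhalation, which is why the phonated breath in the experiment lasts longest.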
Different registers and vocal mechanisms
How can the voice cover a wide range of pitch? Let’s compare it with musical instruments. On a violin or guitar, one can change the length of a string but, to cover a large range, one can also cross to a new string. On trumpets, trombones, clarinets, flutes, etc., one can change the length of a pipe (with valves, a slide or keys), but one can also change registers, which means changing the mode of vibration in the pipe.
In the voice, we can change the muscle tension and the pressure to vary the pitch. However, to cover a range of a few octaves, we usually need to use different registers (Garcia, 1855). The distinctions among registers in singing are not always clear, however, because changing registers corresponds to both laryngeal and vocal tract adjustments (Miller, 2000). The vocal folds can vibrate in (at least) four different ways, called mechanisms (Roubeau et al., 2004; Henrich, 2006).
Although some people use M0 in speech, especially at the end of sentences, and coloratura sopranos are said to use M3 in their highest range, speech and singing usually use M1 and M2. Men and women typically change from M1 to M2 at about 350-370 Hz (F4-F#4) (Sundberg, 1987). Consequently, with their lower overall range, men typically use M1 for nearly all speech and most singing. However, in some styles of pop music and some operatic styles, men use M2 extensively: men who sing alto are usually using M2.
There is usually a pitch and intensity range over which singers can use either M1 or M2 (Roubeau et al., 2004), and trained singers are good at disguising the transition. Sometimes, as in yodeling, the transition is a feature. Experiment: if you try to produce a smooth pitch change or glissando over your whole range, you will probably notice a discontinuity: a jump in pitch and a change in timbre at a pitch somewhere near the bottom of the treble clef. This is where you change from M1 to M2. At the pitch of that break, you may also produce a break by singing a crescendo or decrescendo at constant pitch (see Svec et al., 1999; Henrich, 2006).
The next figure shows a spectrogram of a glissando through the four mechanisms.
A spectrogram plots frequency (vertical axis) against time (horizontal axis), with sound level shown in colour or grey-scale: here, dark represents high power. This one shows the four laryngeal mechanisms on an ascending glissando sung by a soprano. Notice the discontinuities in frequency (clearer in the higher harmonics) at the boundaries M1-M2 and M2-M3. The horizontal bands in the broad-band M0 section clearly show four broad peaks in the spectral envelope; these may also be seen, to varying degrees, in the subsequent harmonic sections.
Producing a sound
The processes that convert the ‘DC’ or steady pressure in the lungs into ‘AC’ or oscillatory air flow and vocal fold vibration are necessarily nonlinear. First, the ‘Bernoulli’ suction between the folds is proportional to the square of the flow velocity (see this link). Second, the collision of the folds when the glottis closes is also highly nonlinear (Van den Berg, 1957; Flanagan and Landgraf, 1968; Elliot and Bowsher, 1982; Fletcher, 1993).
The terms linear and nonlinear are often used loosely. In science, linear just means that the graph of one variable against another is a straight line: a change in one variable produces a proportional change in the other. We show elsewhere that an oscillator with a linear force law vibrates in a pure sine wave, which has just one spectral component. Conversely, anything with a nonlinear force law does not vibrate sinusoidally, and so has more than one frequency component.
Because of these nonlinearities, the fold vibration is non-sinusoidal and therefore has many frequency components. In M1, M2 and M3, the motion is (almost exactly) periodic, so the spectral components are harmonic: a microphone or flow meter placed at any point in the tract would indicate components at the fundamental frequency f0 and its harmonics 2f0, 3f0, etc., as shown in the figures above. (Follow this link for harmonic spectra.)
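A quick numerical sketch shows the same thing: synthesise a toy periodic, non-sinusoidal ‘glottal flow’ and take its spectrum; the peaks fall at f0 and its integer multiples. The pulse shape below (raised-cosine puffs with a 30% open phase) is arbitrary, chosen only to be periodic and non-sinusoidal, not a model of real glottal flow.

```python
import numpy as np

fs = 44100                       # sample rate, Hz
f0 = 140.0                       # fundamental frequency, Hz
t = np.arange(fs) / fs           # one second of signal

# Toy 'glottal flow': periodic puffs of air, zero during the closed phase
phase = (t * f0) % 1.0
flow = np.where(phase < 0.3, 0.5 * (1 - np.cos(2 * np.pi * phase / 0.3)), 0.0)
flow -= flow.mean()              # remove the DC (steady) component

spectrum = np.abs(np.fft.rfft(flow))
freqs = np.fft.rfftfreq(len(flow), 1 / fs)

# The strongest spectral components fall at integer multiples of f0
peaks = freqs[np.argsort(spectrum)[::-1][:5]]
```

Because the waveform is periodic but not sinusoidal, `peaks` contains only multiples of 140 Hz: a harmonic spectrum, as described above.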
Generally, the amplitude of harmonics decreases with increasing frequency, though there are important exceptions. The negative slope in the spectral envelope (called the ‘spectral tilt’) is different for types of speech or singing (Klatt and Klatt, 1990). To some extent, this slope is compensated by the response of the human ear, which is usually more sensitive to the higher harmonics than to the fundamental (see Hearing). More power in the high harmonics makes a sound bright and clear; weakening the high harmonics makes a mellow, darker or muffled sound. If you have a sound system with bass and treble or tone controls, or a sound editing program, you can experiment with strengthening and weakening the high harmonics using the treble or tone control. (Some filtered voice sound examples here.)
A breathy voice has a spectrum with a strongly negative slope; this voice is produced when the glottis doesn’t close completely. The spectral envelope is flatter (the higher harmonics are less attenuated) in loud speech or singing, which have an abrupt closure of the vocal folds and a short open phase of the glottis (Childers and Lee, 1991; Gauffin and Sundberg, 1989; Novak and Vokral, 1995). This flatter spectrum has relatively more power in the frequency range 1–4 kHz, to which the ear is most sensitive.
It is possible to make high-speed video images of the vocal folds using an optical device (endoscope) inserted in either the mouth or nose (Baken and Orlikoff, 2000; Svec and Schutte, 1996). Electroglottography (Childers and Krishnamurthy, 1985), which is described above, is less invasive but less direct. Although the flow through the glottis cannot be measured, it can be estimated from the flow from the mouth and nose, which can be measured using a face mask (Rothenberg, 1973) or from the sound radiated from the mouth. Both techniques require inverse filtering (Miller 1959), which in turn requires knowledge of or assumptions about the acoustic effects of the vocal tract.
When is the source independent of the filter?
As explained above, one cannot do the direct experiments that would allow us to answer this question directly, so we are obliged to rely on indirect evidence, or on theoretical or numerical models.
Fletcher (1993) uses the ‘Bernoulli’ nonlinearity in a simple but general analysis of resonator-valve interaction with different valve geometries. He derives equations and inequalities relating the natural frequencies of the valve, the resonance frequency of the filter (or resonator) and the fundamental frequency of the sound produced. Treating the vocal folds (or a trombonist’s lips) as a valve that opens when the upstream pressure excess is increased, this model gives results consistent with what we know about the voice and trombones: when the resonance falls at a frequency slightly above that of the valve, a sufficiently strong resonance can ‘control’ the oscillation regime. If the resonances are at much higher frequencies, they have little influence on the fundamental frequency at which the valve vibrates.
Resonances, spectral peaks, formants, phonemes and timbre
Acoustic resonances in the vocal tract can produce peaks in the spectral envelope of the output sound. In speech science, the word ‘formant’ is used to describe either the spectral peak or the resonance that gives rise to it. In acoustics, it usually means the peak in the spectral envelope, which is the meaning on this site. We discuss the different uses in more detail on What is a formant?, but for the moment note that ‘formant’ should be used with care.
In non-tonal languages such as English, vowels are perceived largely according to the values of the formants F1 and F2 in the sound (Peterson and Barney, 1952; Nearey, 1989; Carlson et al., 1970). F3 has a smaller role in vowel identification. F4 and F5 affect the timbre of the voice, but have little influence on which vowel is heard (Sundberg, 1970). We repeat below the plots of (F2, F1) for two accents of English. Note that, in these graphs, the axes do not point in the traditional Cartesian directions: instead, the origin is beyond the top right corner. The reason is historical: phoneticians have long plotted jaw height on the y axis and ‘fronting’, the place of tongue constriction, on the x axis. This choice approximately maintains that tradition.
These maps were obtained in a web experiment, in which listeners judge what vowel has been produced in synthetic words (Ghonim et al., 2007) in which F1, F2 and F3 are varied, as well as the vowel length and the pitch of the voice. Experiment: using that web site you can make a map of the vowel plane of your own accent.
We repeat the figure showing the vowel planes for US and Australian English measured in an on-line survey (Ghonim et al., 2007).
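As a sketch of how such a map can be used, the toy classifier below assigns a measured (F1, F2) pair to the nearest vowel centre by distance in the vowel plane. The centre values are rough, textbook-style illustrations for a handful of h[vowel]d words, not the survey data plotted above.

```python
import math

# Hypothetical (F1, F2) centres in Hz for a few vowels, in the
# h[vowel]d frame used above. Illustrative values only.
VOWELS = {
    "heed":  (300, 2300),
    "had":   (750, 1700),
    "hod":   (650, 1000),
    "who'd": (320, 900),
}

def nearest_vowel(f1, f2):
    """Classify a measured (F1, F2) pair by Euclidean distance
    in the vowel plane: a crude nearest-neighbour sketch."""
    return min(VOWELS, key=lambda v: math.hypot(f1 - VOWELS[v][0],
                                                f2 - VOWELS[v][1]))
```

For example, a measurement near (310, 2200) Hz lands on ‘heed’, and one near (700, 1050) Hz on ‘hod’. A real map, of course, has many more vowels, accent-dependent centres, and fuzzy boundaries.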
The vocal tract as a pipe or duct
To understand how the resonances work in the voice, we can picture the vocal tract (from the glottis to the mouth) as a tube or acoustical waveguide. It has approximately constant length, typically 0.15–0.20 m, a little shorter in women and children. However, the cross section varies along the length, in ways controlled by the geometry of the tongue, mouth, etc. The frequencies of the resonances depend upon this shape. The frequencies of the first, second and ith resonances are called R1, R2, ... Ri ..., and those of the spectral peaks produced by these resonances are called F1, F2, ... Fi ... (See this link for a discussion of the terminology.)
When pronouncing vowels, R1 takes values typically from 200 Hz (small mouth opening) to 800 Hz. Increasing the mouth opening gives a large proportional increase in R1. Opening the mouth also affects R2, but this resonance is more strongly affected by the place at which the tongue most constricts the tract. Typical values of R2 for speech range from about 800 to 2000 Hz. The resonant frequencies can also be changed by rounding or spreading the lips, or by raising or lowering the larynx (Sundberg, 1970; Fant, 1960).
We’ll return to discuss this below, but for the moment, let’s note that, if the open end of a tube is widened, the resonant frequencies rise, which explains the mouth effect. Similarly, reducing or enlarging the cross section near a pressure node respectively lowers or raises the resonance frequency. Conversely, reducing or enlarging the cross section near a pressure anti-node respectively raises or lowers the resonance frequency. This explains some features of the tongue constriction. The nasal tract has its own resonances, and the nasal (nose) and buccal (mouth) tracts together have different resonances. Lowering the velum or soft palate couples the two, which affects the spectral envelope of the output sound (Feng and Castelli, 1996; Chen, 1997).
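The perturbation rules just described can be tabulated in a few lines. This is a purely qualitative sketch: it records only the sign of the frequency shift, not its size.

```python
def shift_sign(site, change):
    """Sign of the resonance-frequency shift for a small change in
    bore cross section. site: 'pressure_node' or 'pressure_antinode';
    change: 'constrict' or 'enlarge'. Returns +1 (f rises) or -1 (f falls)."""
    rules = {
        # Near a pressure node (flow antinode), a constriction adds
        # inertance, lowering the resonance; enlarging raises it.
        ("pressure_node", "constrict"): -1,
        ("pressure_node", "enlarge"): +1,
        # Near a pressure antinode (flow node), a constriction reduces
        # the compliant volume, raising the resonance; enlarging lowers it.
        ("pressure_antinode", "constrict"): +1,
        ("pressure_antinode", "enlarge"): -1,
    }
    return rules[(site, change)]
```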
Nasal vowels or consonants are produced by lowering the velum (or soft palate, see Figure 1). The nasal tract also exhibits resonances. Coupling the nasal to the oral cavity not only modifies the frequency and amplitude of the oral resonances, but also adds further resonances. The interaction can produce minima or pole-zeros of the vocal tract transfer function, with resultant minima or ‘holes’ in the spectrum of the output sound (Feng and Castelli, 1996; Chen, 1997).
Resonances, frequency, pitch and hearing
Some comments about frequency and hearing are appropriate here. The voice pitch we perceive depends largely on the spacing between adjacent harmonics, especially those harmonics with frequencies of several hundred Hz (Goldstein, 1973). For a periodic phonation, the harmonic spacing equals the fundamental frequency of the fold vibration, but the fundamental itself is not needed for pitch recognition.
Except for high voices, the fundamental usually falls below any of the resonances, and so is often weaker than one of the other harmonics. However, its presence is not needed to convey either phonemic information or prosody in speech. The pass band of telephones is typically about 300 to 4000 Hz, so the fundamental is usually much attenuated. The loss of information carried by frequencies above 4000 Hz (e.g. the confusion of ‘f’ and ‘s’ when spelling a name) is noticed in telephone conversation, but the loss of low frequencies is much less important. (An experiment: next time you are put ‘on hold’ on the telephone, listen to the bass instruments in the music. Their fundamental frequencies are not carried by the telephone line. Can you hear their pitch? Of course, they are less ‘bassy’ than if you heard them live, but is the pitch any different?)
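The ‘missing fundamental’ idea can be sketched numerically: estimate the pitch from the common spacing of the harmonics that survive a telephone-like pass band, even though the fundamental itself is absent. The harmonic values below are constructed for illustration.

```python
import numpy as np

def pitch_from_harmonics(harmonic_freqs):
    """Estimate perceived pitch as the common spacing of adjacent
    harmonics: the fundamental itself need not be present."""
    return float(np.median(np.diff(sorted(harmonic_freqs))))

# A 110 Hz voice through a telephone band (~300-4000 Hz): the
# fundamental (110 Hz) and 2nd harmonic (220 Hz) are lost, but
# the spacing of the surviving harmonics is unchanged.
heard = [330.0, 440.0, 550.0, 660.0, 770.0]
pitch = pitch_from_harmonics(heard)
```

Here `pitch` comes out at 110 Hz, even though no component at 110 Hz reaches the ear: the pitch is carried by the harmonic spacing, consistent with Goldstein (1973).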
Our hearing is most sensitive for frequencies from 1000 to 4000 Hz. Consequently, the fundamentals of low voices, especially low men's voices, contribute little to their loudness, which depends more on the power carried by harmonics that fall near resonances and especially those that fall in the range of high aural sensitivity. (Another experiment: you can test your own hearing sensitivity on this site.)
Timbre and singing
Varying the spectral envelope of the voice is part of the training for many singers. They may wish to enhance the energy in some frequency ranges, either to produce a desired sound, to produce a high sound level without a high energy input, or to produce different qualities of voice for different effects. Characteristic spectral peaks or tract resonances have been studied in different singing styles and techniques (Stone et al., 2003; Sundberg et al., 1993; Bloothooft and Pomp, 1986a; Hertegard et al., 1990; Steinhauer et al., 1992; Ekholm et al., 1998; Titze, 2001; Vurma and Ross, 2002; Titze et al., 2003; Bjorkner, 2006; Garnier et al., 2007b; Henrich et al., 2007). In this laboratory, we have been especially interested in three techniques: resonance tuning, harmonic singing and the singers formant.
Now the mouth is open to the outside world, but the sound wave is not completely ‘free’ to escape, because of Zrad, the impedance of the radiation field outside the mouth. A pressure p at the lips is required to accelerate a small mass of air just outside the mouth, so the inertance is not zero, but Zrad is usually small. At high frequency, however, larger accelerations are required for any given amplitude, so Zrad increases with frequency. In a confined space (inside the vocal tract), acoustic flow does not spread out, so impedances are usually rather higher than Zrad.
As we explain in this link, Z in a pipe (or in the vocal tract) depends strongly on reflections that occur at open or closed ends. A strong reflection occurs at the lips, going from generally high Z inside to low Z in the radiation field. Suppose that a pulse of high-pressure air is emitted from the glottis just when a high-pressure pulse returns from a previous reflection: the pressures add and Z is high. Conversely, if a reflected pulse of suction cancels the input pressure excess, Z is small. This effect produces the large range of Z shown in the previous figure. High output levels occur at the lips when the glottal source drives the tract at a frequency near one of these maxima in Z.
For the sake of simplicity, let’s imagine the tract as a tube, nearly closed at the glottis but open at the mouth. In fact, for /3/ (the vowel in the word "heard"), the resonances shown in the figure above fall at the frequencies expected for a cylindrical tube of length 170 mm, open at the mouth and nearly closed at the glottis. Now, for a simple tube of length L, open at the far end, the behaviour is shown by the dashed line in the preceding figure. The wavelengths that give maxima in Z are approximately λ1 = 4L, λ3 = 4L/3, λ5 = 4L/5, etc., and so fall at the frequencies f1 = c/4L, f3 = 3c/4L = 3f1, f5 = 5c/4L = 5f1, etc. Minima occur half way between the maxima.
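For an ideal tube closed at one end (glottis) and open at the other (mouth), these frequencies are easy to compute. This is a sketch only: end corrections and the real tract’s non-uniform bore are ignored.

```python
C_AIR = 343.0  # speed of sound in air, m/s (approx.; warm, moist
               # air in the tract is slightly faster)

def closed_open_resonances(length, n=3):
    """First n resonance frequencies of an ideal closed-open tube:
    f = (2k - 1) * c / (4 L), i.e. odd multiples of c/4L."""
    return [(2 * k - 1) * C_AIR / (4.0 * length) for k in range(1, n + 1)]

freqs = closed_open_resonances(0.17)  # 170 mm tract, as in the text
```

For L = 0.17 m this gives roughly 500, 1500 and 2500 Hz, consistent with the resonances quoted above for the vowel in ‘heard’.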
Now let’s add the glottis, giving a local constriction at the input. The solid line shows the new input impedance Z. The maxima in Z (pressure antinodes or flow nodes) are hardly changed. This makes sense: a local constriction (of small volume) at the input has little effect on a maximum in Z, where flow is small. For modes where the flow is large, however, the air in the glottis must be accelerated by pressures acting on only a small area. So the frequencies of the minima in Z (pressure node, flow antinode) fall at lower frequencies. If the glottis is sufficiently small, Z(f) falls abruptly from each maximum to the next minimum, which thus occur at similar frequencies. So do the maxima in the transfer functions.
So far, we haven’t mentioned the impedance of the subglottal tract leading to the lungs. This is difficult to measure. However, there are good reasons to expect no strong resonances in the audio frequency range. The lungs have complicated geometry, with successively branching tubes, extending to quite small scale at the alveoli. This is expected to produce little reflection in the range of frequencies that interest us (see Fletcher et al., 2006). As mentioned above, many of the obvious experiments for studying vocal tract resonances are impossible. A number of less obvious techniques exist, however. One of our papers reviews these (Wolfe et al, 2009).
Tract-wave interactions: Do the ‘source’ and the ‘filter’ affect each other?
As we explained above, the resonances of the vocal tract occur at frequencies well above those of the fundamental frequency – at least for normal speech and low singing. Further, the frequencies of vocal fold vibration (which gives the voice its pitch) and those of the tract resonances (which determine the timbre and, as we have seen, the phonemes) are controlled in ways that are often nearly independent. In most singing styles, the words and melody of a song are prescribed. Conversely, in speech, we have the subjective impression that we can vary the prosody independently of the phoneme – for example, one can often replace a key word in a sentence without changing the prosody at all.
As mentioned above, the voice is unlike a trombone or other wind instrument*, in which one of the resonances of the air column drives the player's lips or reed (respectively) at a frequency close to its resonant frequency. In the voice, there is usually no simple relation between the frequencies: a singer may cover a range of two or more octaves (i.e. vary the frequency by a factor of 4 or more) with relatively little change in the shape and size of the vocal tract. Further, although there is typically a difference of an octave (a factor of two in wavelength) between the fundamental frequencies of male and female singing voices, there is a much smaller difference in the lengths of the tracts.
From this we can conclude that the resonances of the tract do not normally control the pitch frequency of the voice. Nevertheless, the glottal source and the vocal tract resonances may be interrelated in a number of ways. First, there are direct, physical interactions: the mode of phonation affects the reflections of sound waves at the glottis, and so affects standing waves in the tract (cf. Fig. 4). Second, pressure waves in the tract can influence the air flow through the glottis or the motion of the vocal folds. Third, there is the possibility that speakers and singers may consciously or unconsciously use combinations of fundamental frequency and resonance frequency for different effects, in particular to improve their efficiency. We discuss these in turn.
* Is there an acoustic instrument like the voice? Not really, but one can mention some similarities with the harmonica or mouth organ. In that instrument, the pitch is largely determined by mechanical properties of a metal reed that controls the air flow. The pitch may, however, be affected by the acoustic field nearby, e.g. by cupping the hands over the instrument to ‘bend’ tones. Like the voice, the harmonica may produce sounds whose wavelengths are much larger than the size of the instrument and, like the voice, one can modify the spectral envelope by changing the geometry of the air space through which it radiates.
Does the glottis affect the tract resonances?
The glottis is very much smaller than the cross-section of the vocal tract, which is why, in the simplistic figure above, we treated the vocal tract as a pipe open at the mouth and closed at the glottis. This is an exaggeration, of course! The average opening of the glottis depends on what fraction of the time it is open (its ‘open quotient’) and how far it opens (Klatt and Klatt, 1990; Alku and Vilkman, 1996; Gauffin and Sundberg, 1989), which in turn depend on the voice register and pitch.
For a duct that is almost closed at one end and open at the other, the frequency of the first resonance increases as the opening increases. Various researchers have shown that, when the glottis is somewhat open for whispering, the resonance or formant peaks occur at higher frequencies (Kallail and Emanuel, 1984a,b; Matsuda and Kasuya, 1999; Itoh et al., 2002; Barney et al., 2007; Swerdlin et al., 2010).
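The effect of the glottal opening on the resonances can be illustrated with the two idealized limits of a uniform cylindrical duct: fully closed at the glottal end (a quarter-wavelength resonator) and fully open at both ends (a half-wavelength resonator). This is only a sketch: the tract length of 0.17 m and the cylindrical geometry are assumed, illustrative values, and a partly open glottis gives resonances between the two limits.

```python
# Idealized cylindrical vocal tract. Closed at the glottis, it behaves
# as a quarter-wavelength resonator; open at both ends, as a
# half-wavelength resonator. A partly open glottis (as in whispering)
# lies between these limits, so the resonances move up in frequency
# as the glottal opening increases.

C = 343.0   # speed of sound, m/s (approximate)
L = 0.17    # effective tract length, m (assumed, illustrative)

def closed_open_resonances(length, n_modes=3):
    """f_n = (2n - 1) c / 4L for a pipe closed at one end."""
    return [(2 * n - 1) * C / (4 * length) for n in range(1, n_modes + 1)]

def open_open_resonances(length, n_modes=3):
    """f_n = n c / 2L for a pipe open at both ends."""
    return [n * C / (2 * length) for n in range(1, n_modes + 1)]

print(closed_open_resonances(L))  # roughly 500, 1500, 2500 Hz
print(open_open_resonances(L))    # roughly 1000, 2000, 3000 Hz
```

The first resonance roughly doubles between the two limits, which is consistent with the direction of the formant shifts reported for whispering, though the real shift is much smaller because the glottis opens only slightly.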
Do pressure waves affect the vocal fold vibration?
This is an area in which it’s hard to do the experiments that would most clearly answer the question. However, there has been a lot of work on numerical models. Some of these predict that the air flow through the glottis and the vocal fold vibrations depend on the pressure difference across the glottis and folds, and thus on the waves in the tract (Rothenberg, 1981; Titze, 1988, 2004). Not surprisingly, the phase of the pressure wave is important in these models: whether a pressure decrease outside the vocal folds will tend to open them will depend on when during a cycle it arrives.
Can one observe the effect of pressure waves on the motion of vocal folds experimentally? Hertegard et al. (2003) used an endoscope (a camera looking down the throat) to film the larynx while singers mimed singing, and a tube sealed at the lips provided artificial pressure waves. They reported larger vibrations when the pressure waves had frequencies near those of normal singing. In our lab (Wolfe and Smith, 2008), we used electroglottography (EGG, described above) to monitor the vocal fold vibration, and used a didjeridu to produce the pressure waves. We found that the didjeridu signal could drive the folds at a level comparable with that generated by singing. All the above evidence suggests that the standing waves in the ‘filter’ can interact strongly with the source.
Do singers and speakers use tract resonances and pitch in a coordinated way?
If you want to sing or to speak loudly, you might want to take advantage of the resonances of the vocal tract to improve the efficiency with which energy is transmitted from the glottis to the outside sound field. The most studied example is the problem faced by sopranos. The range of R1 (about 300 to 800 Hz, roughly D4 to G5) overlaps approximately the range of the soprano voice. If a soprano did nothing about this, she’d have a serious problem: first, for many note–vowel combinations, the f0 of the note would fall above R1, so she would lose the power boost from R1. This is a particular difficulty for opera singers, who must compete with an orchestra without the aid of a microphone. There is also the problem that her voice quality would tend to change when she crossed the R1 = f0 line.
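The overlap described above is easy to check with equal-tempered note frequencies (A4 = 440 Hz). The short sketch below converts the note names mentioned in the text to frequencies; the MIDI-style note numbering is just a convenient convention, and the 300–800 Hz R1 range is the rough figure quoted above, not a measurement.

```python
# Equal-tempered frequencies for the notes named in the text, to show
# how a soprano's f0 range overlaps the typical range of R1.

A4 = 440.0

def note_freq(midi_number):
    """A4 is MIDI note 69; f = 440 * 2**((m - 69) / 12)."""
    return A4 * 2 ** ((midi_number - 69) / 12)

D4, G5, C6 = 62, 79, 84        # MIDI numbers for the named notes
R1_RANGE = (300.0, 800.0)      # rough R1 range quoted in the text, Hz

for name, m in [("D4", D4), ("G5", G5), ("C6", C6)]:
    f = note_freq(m)
    inside = R1_RANGE[0] <= f <= R1_RANGE[1]
    print(f"{name}: {f:.0f} Hz, within R1 range: {inside}")
```

D4 (about 294 Hz) and G5 (about 784 Hz) bracket the quoted R1 range, while C6 (about 1047 Hz) lies above it, which is where, as described below, sopranos run out of room to tune R1.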
Sundberg and colleagues pointed out that, in classical training, sopranos learn to increase the mouth opening as they ascend the scale (Lindblom and Sundberg, 1971; Sundberg and Skoog, 1997), and measured this opening as a function of pitch. They deduced that the singers were tuning R1 to a value near f0.
Our experiments, using acoustic excitation at the mouth, confirmed this (Joliveau et al., 2004a,b). When f0 was low enough, sopranos used typical values of R1 and R2 for each vowel. However, when f0 was equal to or greater than the usual value of R1, they increased R1 so that it was usually slightly higher than f0. For vowels with low R1, this tuning of R1 to f0 starts at a lower pitch, and it continues almost up to 1 kHz. Here is a web page about this research, including some sound files.
We don’t know exactly how they learn to do this: it might be that they respond, probably subconsciously, when the sound is louder for a given effort. Or it may be that vibrations are easier to produce when the resonance is appropriately tuned. Either way, they could learn to reproduce this effect. A simple model shows that ‘outward opening’ valves (those that open in the direction of the steady flow) tend to be driven most easily at frequencies a little below the resonance: the model valves ‘drive’ inertive loads better than compliant ones (Fletcher, 1993).
What about other singers? In much of the alto range, and for some vowels in the high range of men’s voices, the same problem arises and, although it is much less studied, similar effects are occasionally, but not universally, observed (Henrich et al., 2011). Further, some singers seem to tune R1 to the second harmonic (i.e. to 2f0) over a limited range (Smith et al., 2007). In another study, a practitioner of a very loud Bulgarian women’s singing style was found to tune R1 to 2f0 (Henrich et al., 2007).
Finally, it is worth noting that it is difficult to tune R1 much above 1 kHz, in part because it is hard to open one's mouth wide enough. Some sopranos who practise the very high range of the coloratura soprano, or the whistle voice in pop music, tune R2 to f0 above about C6, which gives them up to another octave or so in their whistle or M3 mechanism (Garnier et al., 2011).
This figure, from Kob et al. (2011), shows the different tuning strategies that may be used by different voice categories. Oversimplifying for the sake of brevity: low voices may tune R1 (or R2) to harmonics of the voice; altos, especially in belting and in the Bulgarian style, tune R1 to the second harmonic; sopranos tune R1 to f0 up to high C, and above that tune R2 to f0. See Henrich et al. (2011) for details.
In a range of styles known as harmonic or overtone singing, practitioners use a constant, rather low fundamental frequency (in a range where the ear is not very sensitive). They then tune a resonance to select one of the high harmonics, typically from about the fifth to the twelfth (Kob, 2003; Smith et al., 2007). We have a web site about this.
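As a rough illustration of the arithmetic: the drone's harmonics lie at integer multiples of f0, and a narrow tract resonance tuned near one of them makes that harmonic stand out as the melody note. The drone frequency of 140 Hz below is an assumed, illustrative value, not taken from the studies cited.

```python
# In overtone singing, the drone's harmonics lie at n * f0; tuning a
# tract resonance near one of them selects that harmonic.
# f0 = 140 Hz is an assumed, illustrative drone frequency.

f0 = 140.0

def harmonic(n, fundamental=f0):
    """Frequency of the n-th harmonic of the drone."""
    return n * fundamental

def nearest_harmonic(resonance_hz, fundamental=f0):
    """Which harmonic number a tract resonance would select."""
    return round(resonance_hz / fundamental)

# Harmonics 5 to 12: the range the text says is typically selected
print([harmonic(n) for n in range(5, 13)])   # 700.0 Hz up to 1680.0 Hz

# A resonance tuned to, say, 1100 Hz sits nearest harmonic 8 (1120 Hz)
print(nearest_harmonic(1100))
```

With this illustrative drone, harmonics 5 to 12 span roughly 700 to 1700 Hz, comfortably within the range over which R1 and R2 can be varied.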
Is resonance tuning used in speech?
Some speakers (actors, public speakers, teachers) have to speak long and loud. Resonance tuning might be easier for them in one sense: unlike (most) singers, they get to choose the pitch for every word. Some preliminary research suggests that resonance tuning is used in shouting (Garnier et al., submitted).
The singers formant
Male, classically-trained singers often show a spectral peak in the range 2–4 kHz, a range where the ear is quite sensitive. This spectral peak is called the singers formant (Sundberg, 1974, 2001; Bloothooft and Plomp, 1986b). This vocal feature has the further advantage that orchestras have relatively little power in this range, which might allow opera soloists to ‘project’ or to be heard above a large orchestra in a large opera hall.
Singers formants are either weaker, not usually observed, or harder to demonstrate in women singers (Weiss et al., 2001). This is not surprising: high voices have wide harmonic spacing, which makes it hard to define a formant in the spectrum of any single note. (While one can find a peak in the time-averaged spectrum of many notes, this is not necessarily the same as a formant, because it depends on which notes are sung.) Further, a resonance in this range is of less use to a high alto or soprano: if the gain in a singer’s vocal tract had a bandwidth of a few hundred hertz (the typical width of the singers formant), then for many notes in the high range it would fall between two adjacent harmonics. High voices also have the advantage that the fundamental, usually the strongest harmonic, falls in the range of sensitive human hearing. Finally, high voices can use resonance tuning more effectively than other singers, and may therefore have less need of a singers formant.
Sundberg (1974) attributes the singers formant to a clustering of the third, fourth and/or fifth resonances of the tract. (Measuring the resonances associated with it is an ongoing project in our lab.) Singers produce this formant by lowering the larynx and narrowing the vocal tract just above the glottis (Sundberg, 1974; Imagawa et al., 2003; Dang and Honda, 1997; Takemoto et al., 2006). A vocal tract with this geometry should work better to transmit power from the glottis to the sound field outside the mouth.
When a strong singers formant is combined with the strong high harmonics produced by rapid closure of the glottis, the effect is a very considerable enhancement of output sound in the range 2-4 kHz – i.e. in a range in which human hearing is very acute and in which orchestras radiate relatively little power. It is not surprising that these are among the techniques used by some types of professional singers who perform without microphones.
Increasing the fraction of power at high frequencies has a further advantage: at wavelengths long in comparison with the size of the mouth, the voice radiates almost isotropically. As the frequency rises and the wavelength decreases, the voice becomes more directional, and proportionally more of the power is radiated in the direction in which the singer faces, which is usually towards the audience (Flanagan, 1960; Katz and d’Alessandro, 2007; Kob and Jers, 1999). So increasing the power at high rather than low frequencies, via rapid glottal closure and/or a singer’s formant, helps the singer not to ‘waste’ sound energy radiated up, down, behind and to the sides.
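A rough way to quantify 'wavelengths long in comparison with the size of the mouth' is the dimensionless parameter ka = 2πfa/c, where a is an effective mouth radius: radiation is nearly isotropic when ka is much less than 1 and becomes noticeably directional as ka approaches 1. The sketch below uses an assumed, illustrative radius of 3 cm.

```python
# Dimensionless size parameter ka for the radiating mouth: small ka
# means nearly isotropic radiation; ka near or above 1 means the
# radiation becomes directional. a = 0.03 m is an assumed value.
import math

C = 343.0   # speed of sound, m/s (approximate)
a = 0.03    # effective mouth radius, m (assumed, illustrative)

def ka(frequency_hz):
    """ka = 2*pi*f*a / c."""
    return 2 * math.pi * frequency_hz * a / C

for f in (200, 1000, 3000):
    print(f"{f} Hz: wavelength {C/f:.2f} m, ka = {ka(f):.2f}")
```

With these figures, ka is about 0.1 at 200 Hz (nearly isotropic) but about 1.6 at 3 kHz, so power placed in the singers formant range is radiated preferentially forward.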
A number of studies have investigated a speaker’s formant or speaker’s ring in the voices of theatre actors or in the speaking voice of singers (Pinczower and Oates, 2005; Bele, 2006; Cleveland et al., 2001; Barrichelo et al., 2001; Nawka et al., 1997). Leino (1993) observed a spectral enhancement in the voices of actors, but of smaller amplitude than the singers formant, and shifted about 1 kHz towards higher frequencies. This was interpreted as a clustering of F4 and F5. Bele (2006) reported a lowering of F4 in the speech of professional actors, which contributed to the clustering of F3 and F4 in a prominent peak. Garnier (2007) also reported such a speaker's formant in speech produced in a noisy environment, with a formant clustering that depended on the vowel.
More about speech and singing
Voice science is a broad and active area of research. The references quoted in this essay appear below, and below that is a collection of links. One of the aims of this essay is to provide an introduction to our research on the voice and to our publications on voice and music acoustics.