
TLDR: How do we differentiate, say, an "A" from an "O"? How do we identify speech sounds? If formants are the key, how is it possible to identify a vowel regardless of the pitch (fundamental frequency) and regardless of the voice of the person speaking/singing (timbre)? See 1) and 2) below for more specific questions.

I have a question that I thought was fairly basic and turns out to be more complex than expected, for someone who doesn’t have specific knowledge of the field.

Sounds are acoustic waves that we can analytically characterize/visualize with a variety of different concepts. I want to understand how those analytical tools relate to the intuitive understanding of a sound that we have (i.e. how would one read a spectrogram/amplitude envelope of a sound and “predict” exactly what is heard). Specifically, how do we differentiate $\textbf{speech sounds}$?

For reference, here is my qualitative understanding of some of the most notable features of a general sound (not only speech). I mostly focus on sounds with a well-defined fundamental frequency (instrument or person singing a note)

  • Pitch: For a harmonic sound like the pluck of a guitar string, you extract the fundamental frequency from your Fourier analysis and that’s the perceived pitch. (In the case of a non-harmonic sound, i.e. one with no distinct spectral maxima, I guess you can say there’s no well-defined pitch, i.e. you’re just not singing/playing a note.)
  • Loudness: amplitude of the soundwave
  • How “explosive” the sound feels: time dependent amplitude-envelope (e.g. sharp amplitude peak at the attack of the sound)
  • Timbre (by that I mostly mean what allows us to identify what is making the sound: piano? Guitar? Human?): More complex. But as I understand it, the most decisive factors are the distribution of the harmonics in the spectrum (i.e. different instruments have different Fourier coefficients for the same fundamental frequency) and again the time envelope (which echoes the previous point, since it quite clearly comes into play in identifying an instrument). Timbre also allows us to differentiate voices. Thus I would suppose that a timbre is “characteristic” of a person, i.e. John has his own identifiable timbre that is unchanging (unless he’s mimicking Martha, but that’s kinda rude)
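To make the "Pitch" bullet concrete, here is a minimal sketch of reading a fundamental frequency off a Fourier analysis. Everything in it is an illustrative assumption: the test tone (200 Hz with three harmonics), the sample rate, and the helper names are made up, and the method simply picks the strongest spectral peak. That only matches the perceived pitch when the fundamental happens to be the strongest component, which is true for this toy tone but not for every real sound.

```python
import cmath
import math


def harmonic_tone(f0, amplitudes, sample_rate, n_samples):
    """Sum of sine harmonics k*f0 (k = 1, 2, ...) with the given amplitudes."""
    return [
        sum(a * math.sin(2 * math.pi * (k + 1) * f0 * n / sample_rate)
            for k, a in enumerate(amplitudes))
        for n in range(n_samples)
    ]


def dft_magnitudes(signal):
    """Magnitudes of a plain O(N^2) DFT for the bins below the Nyquist frequency."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(signal)))
        for k in range(n // 2)
    ]


def strongest_frequency(signal, sample_rate):
    """Frequency (Hz) of the largest non-DC spectral peak."""
    mags = dft_magnitudes(signal)
    peak_bin = max(range(1, len(mags)), key=mags.__getitem__)
    return peak_bin * sample_rate / len(signal)


# Hypothetical test tone: 200 Hz fundamental plus two weaker harmonics.
tone = harmonic_tone(200.0, [1.0, 0.5, 0.25], sample_rate=8000, n_samples=800)
estimated_f0 = strongest_frequency(tone, 8000)
print(estimated_f0)
```

With 800 samples at 8000 Hz the bins are 10 Hz apart and every harmonic falls exactly on a bin, so there is no spectral leakage in this example; a real recording would need windowing and a more robust pitch estimator.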

Now the last point, which I’m very curious about but fail to really understand, is speech sounds. How can I so accurately differentiate an “AAAA” sound from an “OOOO” sound, regardless of the person speaking (voice timbre) and the pitch used (more precisely, the fundamental frequency)? To satisfy this, I imagined (wrongly) that the characteristic of the soundwave related to the specific speech sound must not be correlated with either frequency or timbre. However, doing some research, I came across the notion of $\textbf{formants}$ (https://en.wikipedia.org/wiki/Formant and https://home.cc.umanitoba.ca/~krussll/phonetics/acoustic/spectrogram-sounds.html), which are broad spectral maxima. It seems that the frequencies of those peaks are characteristic of, say, a given vowel.

I have 2 problems with that. First, I don’t understand how we can vocalize an “A” sound at different pitches and still keep it absolutely identifiable and very distinct from any other vowel. For example, it is said that for the average man, the first formant for E is 390 Hz and for A is 850 Hz. But I can very well sing an “E” sound at a pitch that gives me a fundamental frequency of 850 Hz, can’t I? Then how are we still able to clearly tell the difference? Even if there are other formants, I feel this should at least confuse the ear, no? The relative positions of the first formants of two different vowels seem to imply that "A" should sound higher, but of course that is not necessarily the case...
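One way to picture the distinction between pitch and formant is the sketch below. It assumes a simple source-filter picture that is not stated anywhere in this post: the voice produces harmonics at integer multiples of the fundamental, and a broad resonance centred on the formant sets how strong each harmonic is. Only the two first-formant values (390 Hz and 850 Hz) come from the figures quoted above; the Lorentzian shape, the 100 Hz bandwidth, the fundamentals, and all function names are hypothetical choices for illustration.

```python
def resonance_gain(freq, center, bandwidth=100.0):
    """Lorentzian-shaped bump: 1.0 at `center`, 0.5 at center +/- bandwidth/2."""
    half = bandwidth / 2
    return half ** 2 / ((freq - center) ** 2 + half ** 2)


def harmonic_amplitudes(f0, formant, n_harmonics=20):
    """(frequency, amplitude) of each harmonic k*f0 under a single formant."""
    return [(k * f0, resonance_gain(k * f0, formant))
            for k in range(1, n_harmonics + 1)]


def strongest_harmonic(f0, formant):
    """Frequency of the harmonic that the resonance emphasises most."""
    return max(harmonic_amplitudes(f0, formant), key=lambda pair: pair[1])[0]


F1_E, F1_A = 390.0, 850.0  # first formants quoted in the post (average man)

# Two assumed sung pitches; the fundamental changes, the formant does not.
peaks = {
    f0: (strongest_harmonic(f0, F1_E), strongest_harmonic(f0, F1_A))
    for f0 in (110.0, 150.0)
}
for f0, (peak_e, peak_a) in peaks.items():
    print(f"f0 = {f0} Hz: strongest harmonic near {peak_e} Hz (E), {peak_a} Hz (A)")
```

In this toy model the emphasised harmonic stays close to 390 Hz for "E" and close to 850 Hz for "A" whether the fundamental is 110 Hz or 150 Hz, so the pitch and the location of the spectral maximum are separate quantities. With a fundamental of 850 Hz the lowest harmonic already lies above the quoted "E" formant, so the model has no harmonic near 390 Hz at all; it says nothing about how the ear copes in that case.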

Second problem: Having multiple formants with a specific frequency difference/ratio seems to impose a particular resonance pattern. Said otherwise, it seems to me that having specific relative frequencies for the formants implies having a specific set of Fourier coefficients. And that, to me, was mostly what defined timbre (our ability to differentiate voices). So how come different voices can have the same “formant pattern” that we identify? By shaping our mouth cavity, I imagine we can produce different resonance patterns, but then it seems to me that producing different formants would require us to alter what defines our identifiable voice. Basically, what exactly is the difference between timbre and the vocalization of vowels? I am confused.

So, two questions to summarize:

  1. Do the relative positions of the first formants imply that one vowel is intrinsically higher than another? How do we reconcile that with our ability to sing vowels very low/high?

  2. The formant structure of a speech sound is akin to its Fourier decomposition, in my understanding. However, this is also what determines timbre, so what is the difference between the two, i.e. how can we determine both the timbre and the nature of the speech sound independently?

Sorry for the long post! As it is a very broad question I wanted to try and be clear on what I understand/what I don’t

Barbaud Julien
  • Great question! I'm an amateur and not quite an expert in this subject, so I don't think I can give a proper answer, but it's no accident that you're finding there's no simple distinction that makes an "A" sound. In fact, if you listen to yourself say "A" slowly, you'll notice the sound shifts from beginning to end. You can sometimes convince yourself that you're actually saying "e" if you just cut out the middle part. You often have to say "b as in boy" over the phone but not in person, because the differences in the waveform are so subtle that any loss of quality over the phone can ruin it – Señor O Jun 24 '21 at 05:12
  • Good point! From hearing alone, the distinction is not necessarily as easy and clear as I made it to be, and other factors might come in to help (identifying sounds by correlation from prior knowledge of possible words, visual lip reading, etc...) – Barbaud Julien Jun 24 '21 at 05:16
  • Re, "'explosive'...time dependent...envelope" I think you are asking two separate questions. You started out asking about vowel sounds, "How do we differentiate a 'A' from a 'O'?" There's no time dependency there. It's all about the frequency spectrum which is a somewhat physics-y question. But, when you ask about plosives, that becomes more of a question about human perception and/or linguistics. Not much room for physics in that part of the question. – Solomon Slow Jun 24 '21 at 13:08
  • @SeñorO, Saying the name, "A," is different from making the long-A vowel sound. As you rightly pointed out, to say the name, you have to make a sequence of three sounds: a little plosive kind of sound (I don't know a proper name for it) at the back of your throat, then a long-A vowel sound, and then a smooth transition to a long-E sound at the end. If you "say 'A' slowly" you hold either the long-A part or the long-E part, probably depending on what part of the world you were raised in. – Solomon Slow Jun 24 '21 at 15:08
  • Does this answer your question? Why do different letters sound different? If the question has arisen in the context of studying foreign languages, it is worthwhile investing a bit of time in learning some phonetics, i.e. how different sounds are articulated and classified: https://en.wikipedia.org/wiki/Phonetics – Roger V. Jun 25 '21 at 07:35
  • @BarbaudJulien Have you seen my answer to that question? It would be against forum policy to repost it here. In particular, the extent to which it is about harmonics or pitch depends on the language. – Roger V. Jun 25 '21 at 08:18
  • @Roger Vadim thx for the link, it is a nice discussion and it helps to clarify some ideas. But I don't think it answers the details of my question. I have edited it to make it more obvious. I refer to points 1. and 2., which I believe are not answered in that discussion – Barbaud Julien Jun 25 '21 at 08:21
  • @BarbaudJulien In an Indo-European language, such as English, pitch is non-informative, i.e., vowels can be pronounced with different pitch. This doesn't mean that they are all pronounced with the same pitch in a speech flow - some may tend to be lower and others higher. However, changing the height of a vowel would not change its meaning, even though the speech may sound a bit strange. This is the difference between phonetics and phonology - sounds may vary in their physical characteristics, but still be perceived as the same sound. – Roger V. Jun 25 '21 at 08:37
  • @BarbaudJulien E.g., my native Russian has 5-6 vowels (depending on classification), whereas French has up to 17 and English has even more. This does not mean that Russians are not capable of producing the same range of sounds as native French and English speakers, but they will not perceive many of them as different. Also, English and other Germanic languages have sounds of different length carrying different meaning ('green' vs. 'grin') - a feature that is ignored by French speakers, even though they are perfectly capable of pronouncing vowels a bit longer. – Roger V. Jun 25 '21 at 08:41
  • @BarbaudJulien Variation in the number of vowels in a language also has to do with the fact that some dialects of the same language may pronounce differently sounds that other dialects do not differentiate. E.g., French traditionally has 4 nasal vowels, but two of them have merged in the modern Parisian pronunciation. – Roger V. Jun 25 '21 at 08:42
  • I'm not convinced this question is on-topic for us - it involves acoustics, but it's more about a rather arbitrary classification of specific "sounds" than objective physics. I'll see whether [linguistics.SE] wants this question. – ACuriousMind Jun 25 '21 at 09:54
  • In neuroscience the term for what you discuss in the first half of your question is "invariance". It's the same problem of how do you visually recognize your dog as the same dog whether you look at it from the front, side, or back end, in daylight and dim light, whether she's eating or growling or napping. Consider a bitmap image of any of these circumstances - if you start from pixels (like starting from a spectrogram) you'll find little that is recognizably similar. – Bryan Krause Jun 25 '21 at 14:51
