
I am new to Natural Language Processing, and the first thing I encountered is the phoneme representation of a word. I am wondering how "hello" gets converted to "HH AH0 L OW1". Where do these numbers come from, along with representations like AH0 and OW1? I went through these two charts but could not find any such numerical representation. I am attaching the charts.

[ARPABET phoneme charts]

Can anyone guide me on where these numerical representations like AH0 and OW1 come from?

Mohan Singh
    Can you tell something more about where you encountered the notation you are describing? Was it the output of some tool (name it!), was it in some ready-made language resource (provide a citation!)? – Sir Cornflakes Feb 02 '23 at 12:32
  • http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=caterpiller&stress=-s It is mentioned as "lexical stress" here, but I don't get the idea behind it. – Mohan Singh Feb 02 '23 at 12:56
  • 3
    On the bottom of the page, you find a listing and a reference to https://en.wikipedia.org/wiki/ARPABET After reading this, what question is left? – Sir Cornflakes Feb 02 '23 at 13:43
  • Every speech analyzer that tags phonemes uses what linguists would call allophones, instead of phonemes. They only needed one unit, you see, rather than the two-level phonetics/phonology unit system that linguists use. They were only doing English at the start, after all. So, basically, the systems that recognized more variants correctly won, and those were the ones with the big numbers of phonemes. No real language has more than a hundred phonemes, but many such systems go into the thousands. For that, you need numbers to label them. And the names stick. That's all, really. – jlawler Feb 02 '23 at 16:41
  • 2
    @jlawler This isn't the case in this case, in ARPABET the numbers added at the end of a syllable denote stress level (0 unstressed, 1 primary, 2 secondary, 3 tertiary). – Sir Cornflakes Feb 02 '23 at 17:28
  • @SirCornflakes Sure, each system has its own reason and code for labels. Sometimes they're separable, with independent references. The point is there are too many to keep track of without numbering of some sort -- after all, they're getting changed into binary numbers to do the analysis. – jlawler Feb 02 '23 at 17:31
  • @jlawler A few languages do have more than a hundred phonemes; e.g., West !Xóõn in a very restrictive analysis has 107 + two tonemes. – Janus Bahs Jacquet Feb 03 '23 at 09:31
  • Close, but no cigar, eh? – jlawler Feb 03 '23 at 14:58

1 Answer


This is from the CMU Pronouncing Dictionary, and the numbers reflect degree of stress. It is generally assumed (though not entirely free of controversy) that English has a phonemic distinction between primary stress, secondary stress, and unstressed vowels. For example, "latex" has primary stress on the first syllable and secondary stress on the second syllable, while "latest" has primary stress on the first syllable and no stress on the second.
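Since the stress digit is simply appended to each vowel symbol, the stress pattern can be read off mechanically. A minimal sketch (the transcriptions are the ones quoted on this page; the helper name is mine):

```python
# Extract the lexical stress pattern from an ARPABET transcription,
# as used in the CMU Pronouncing Dictionary: vowel phonemes carry a
# trailing digit (0 = unstressed, 1 = primary, 2 = secondary stress);
# consonant phonemes carry no digit.

def stress_pattern(arpabet: str) -> list[int]:
    """Return the stress digit of each vowel, in order of occurrence."""
    return [int(p[-1]) for p in arpabet.split() if p[-1].isdigit()]

print(stress_pattern("HH AH0 L OW1"))     # "hello" -> [0, 1]
print(stress_pattern("L EY1 T EH2 K S"))  # "latex" -> [1, 2]
```

So in "HH AH0 L OW1" the first vowel (AH) is unstressed and the second (OW) carries primary stress, matching the ordinary pronunciation of "hello".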

The book The Sound Pattern of English gives an extensive phonological analysis of stress in English, and is based on the phonetic values in Kenyon & Knott. It may or may not be useful in understanding the nature of stress in English. The CMU dictionary treats stress as a property of vowels.

It should be noted that vowel reduction is related to stress in English, so there is a chicken-and-egg question: does the difference between "latest" and "latex" come down to degrees of stress or to different vowel qualities? One can write [ˈletɛks] and [ˈletəst], using vowel quality as a substitute for degree of stress; if you include detail on how "t" is pronounced ([ˈletʰɛks] and [ˈleɾəst]), you have additional grounds for distinguishing the words without appeal to degrees of stress. That it is phonologically a matter of degrees of stress is not seriously in doubt, but for simple-minded recording of data, marking secondary stress is not strictly necessary.

I found this file describing the possible input sources for the dictionary:

  • a 20k+ general English dictionary, built by hand at Carnegie Mellon (extensively proofed and used).
  • a 200k+ UCLA-proofed version of the Shoup dictionary.
  • a 32k subset of the Dragon dictionary.
  • a 53k+ dictionary of proper names, synthesiser-generated, unproofed.
  • a 200k dictionary generated with Orator, unproofed.
  • a 200k dictionary generated with Mitalk, unproofed.

They comment that

All of the above sources were preprocessed and the transcriptions in the current cmudict.0.1 were selected from the transcriptions in the sources or a combination thereof. We have removed some potentially unreliable transcriptions from this dictionary, including those based on only one source, and will reintroduce them once we have verified the transcriptions.

It seems to come down to editorial judgment, with no indication of what principles were followed or what the actual sources were (which versions? which dictionaries?).

What is not questionable is that one cannot compute the pronunciation of an English word from its spelling, so any computer dictionary of English has to be based on some other pronouncing dictionary, of which there are many. Most people disagree with some pronunciation claims of every dictionary ("that's not how I pronounce it"), and there is no pretense that the CMU dictionary derives from processing a massive corpus of naturalistic speech from some location. It claims that "Snohomish" is "S N AA1 HH AH0 M IH0 SH" = [ˈsnɑhəmɪʃ], but that is not even a possible pronunciation in US English (it's [snoˈhomɪʃ ~ snəˈhomɪʃ]).
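For what it's worth, the distributed cmudict file is just plain text, one entry per line, with the word followed by its ARPABET phonemes. A minimal parser sketch, assuming that format (the two sample entries are the transcriptions quoted on this page; `;;;` is the comment marker used in the distributed file):

```python
# Parse entries in the CMU Pronouncing Dictionary's plain-text format:
# one entry per line, word and ARPABET phonemes separated by whitespace.
# Sample entries below are the transcriptions quoted in this discussion.
CMUDICT_SAMPLE = """\
HELLO  HH AH0 L OW1
SNOHOMISH  S N AA1 HH AH0 M IH0 SH
"""

def load_cmudict(text: str) -> dict[str, list[str]]:
    """Map each word to its list of ARPABET phonemes."""
    entries = {}
    for line in text.splitlines():
        if not line or line.startswith(";;;"):  # skip blanks and comments
            continue
        word, *phones = line.split()
        entries[word] = phones
    return entries

d = load_cmudict(CMUDICT_SAMPLE)
print(d["HELLO"])  # ['HH', 'AH0', 'L', 'OW1']
```

Whatever one thinks of individual transcriptions, this flat format is part of why the dictionary is so widely used in speech tools: lookup is trivial, and the stress digits ride along on the vowel symbols.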

user6726