
I really hope someone can help me with this question.

I'm trying to implement a Hidden Markov Model (based on this paper: Here).

I understand the procedures, but I do not understand what M would represent in the data I am trying to train the HMM with.

I am given this example:

"N = the number of hidden states
 M = the number of distinct observation symbols
 T = the number of observations

 So, for the English text example, if you let N = 2, M = 27 (26 letters
 plus word-space), and T = 50,000 (number of input letters to use), you
 should see that the 2 hidden states correspond to consonants and
 vowels."

This example works for the English-text case; I understand that. BUT I am attempting to train the HMM with the MFCC coefficients of a file (stop.mfc) which contains 4k+ values. My interpretation would be: T = 4000 (the length of the observation sequence) and N = 2 ("Stop" and "Go"). So what would M represent in my example, which is differentiating between someone saying "Stop" and someone saying "Go"? Would M be the number of training samples I have?

I really hope someone can help me with this.

Phorce
  • M represents an "alphabet" of vocal symbols that you are going to identify using characteristics of speech that include the MFCC coefficient values along with various energy measurements. See: http://www.stanford.edu/class/cs224s/lec/224s.09.lec9.pdf – user2718 Feb 19 '13 at 18:31
  • @BruceZenone Thanks for the reply. My MFCC values contain 13 coefficients for each block, so could M = 13? – Phorce Feb 19 '13 at 18:48
  • That is a good place to start. Maybe that would suffice as a first-order approximation of symbols. If you review the referenced document, a robust set of symbols needs more information than the 13 MFCC coefficients alone. – user2718 Feb 19 '13 at 18:56
  • @BruceZenone Thank you :)! I'll try training with 13; if that isn't enough I will attempt to use more. But thank you – Phorce Feb 19 '13 at 19:14
  • My previous answer is in error. It isn't the number of MFCC values per block that determines the number of vocal symbols; it is the range of possible MFCC values that matters. You have to map the MFCC values, along with other information, to a finite set of "symbols". Identifying how to do such a mapping is quite complex. – user2718 Feb 19 '13 at 20:01
  • Some early speech recognition systems used a vector quantization step to map the acoustic feature vectors into a discrete codebook of acoustic codewords; and then discrete HMMs on these symbols to perform the recognition. This method is more "kludgy" than directly using continuous observation HMMs with one or many gaussians per state - and performs less well (because of the loss of information - "hard decision" during the vector quantization step) – pichenettes Feb 19 '13 at 20:09
  • @BruceZenone Could you recommend any algorithms that will allow me to map such a vector? Or, is there any library out there that I could use to train a HMM with MFCC values? – Phorce Feb 19 '13 at 20:20
  • Vector Quantization could be used to turn your MFCC vectors into a codebook of discrete symbols (a rough sketch of this quantization step is shown just after these comments). As I have said, this is not a road I would recommend! As for code/libraries, for very low-level Matlab stuff there's H2M (http://perso.telecom-paristech.fr/~cappe/h2m/) which handles both discrete and continuous distributions; though I would recommend directly using a speech recognition framework like HTK. – pichenettes Feb 19 '13 at 20:23
  • I am working in handwriting recognition; can anyone advise me on which HMM model I should use, discrete or continuous? If I have 500 word images, and for each word there are 100 samples, what would the parameters of my model be? – Hanadi Nov 27 '16 at 18:08
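As an illustration of the vector-quantization step described in the comments above (a sketch, not code from either commenter), here is one way to map 13-dimensional MFCC frames to a small codebook of discrete symbols with k-means; the codebook size and file name are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans  # third-party library, assumed available

# Hypothetical MFCC data: rows are frames, columns are the 13 coefficients.
mfcc = np.load("stop_mfcc.npy")  # shape (n_frames, 13), assumed file

# Build a codebook of M "symbols" by clustering the feature vectors.
M = 32  # codebook size, chosen arbitrarily for the example
codebook = KMeans(n_clusters=M, n_init=10, random_state=0).fit(mfcc)

# Replace each frame by the index of its nearest codeword; the resulting integer
# sequence is the kind of discrete observation sequence a discrete HMM expects.
symbols = codebook.predict(mfcc)  # integers in 0..M-1, one per frame
```

As pichenettes points out, this quantization throws information away, which is why continuous HMMs on the raw MFCC vectors generally perform better.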

1 Answer


First of all, the fact that you have two words to identify does not mean that you need $N = 2$ states. Your goal is not to train a single model with two states, one for each word to recognize, but to train two models, one for each word to recognize, and each of these models will have as many states as necessary. In fact, each state in your HMM should correspond to a distinct "stage" in the pronunciation of a word, and will very likely correspond to a phoneme. Your vocabulary size (here, two: "stop" and "go") is external to this.

"Stop" has 4 phonemes; "go" has 2. So you train a 4-state left-to-right model on the "stop" data and, independently of this, a 2-state left-to-right model on the "go" data. To recognize a word given its MFCCs, you evaluate which of these two models has the highest likelihood given the data. If you had to recognize words within a lexicon of 10 words, you would similarly train 10 HMMs, one for each word, each of these models having a number of states suited to the length/complexity of the word it has to recognize.
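To make the "one model per word, pick the most likely" recipe above concrete, here is a minimal sketch (not part of the original answer) in Python using the third-party hmmlearn library; the file names are assumptions, and the number of states per model follows the phoneme counts discussed above.

```python
import numpy as np
from hmmlearn import hmm  # third-party library, assumed available

# Hypothetical training data: rows are frames, columns are the 13 MFCC coefficients.
mfcc_stop = np.load("stop_mfcc.npy")  # shape (n_frames, 13), assumed file
mfcc_go = np.load("go_mfcc.npy")      # shape (n_frames, 13), assumed file

# One continuous (Gaussian-emission) HMM per word. Note that hmmlearn fits a fully
# connected transition matrix by default; a strict left-to-right topology would
# require initialising and constraining the transition matrix yourself.
model_stop = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=50)
model_go = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=50)

model_stop.fit(mfcc_stop)
model_go.fit(mfcc_go)

def classify(mfcc_frames):
    """Return the word whose model assigns the higher log-likelihood to the frames."""
    scores = {"stop": model_stop.score(mfcc_frames), "go": model_go.score(mfcc_frames)}
    return max(scores, key=scores.get)
```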

You need to step back and ask yourself "why HMMs in the first place?". We need HMMs for speech recognition because words are made of a sequence of distinct elements (phonemes). If we want to describe/recognize the word "stop", we need to learn a description which is expressive enough to capture that "first it sounds like ssss for a short while, then it is tttt for a short while, then it is oooo for a longer amount of time, then it is pppp for a short moment". HMMs are a good match for expressing that: states are phonemes; the transition matrix (which will here be diagonal + upper diagonal) indicates that we move through the word from first phoneme to last phoneme, staying a variable amount of time in each phoneme; and the distribution associated with each state indicates how each phoneme translates into your acoustic features.
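As an illustration (not from the original answer), a 4-state left-to-right model for "stop" would have a transition matrix with exactly that "diagonal + upper diagonal" shape; the numerical values below are made up:

$$A = \begin{pmatrix} 0.7 & 0.3 & 0 & 0 \\ 0 & 0.8 & 0.2 & 0 \\ 0 & 0 & 0.6 & 0.4 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

Each row corresponds to one phoneme/state: the diagonal entry is the probability of staying in that phoneme for another frame, and the entry just above the diagonal is the probability of moving on to the next phoneme; every row sums to 1.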

It also seems that you are mixing up discrete HMMs (in which the observations are drawn from a discrete distribution associated with each state) with continuous HMMs (in which the observations are scalars or vectors, characterized by a continuous distribution such as a Gaussian). So the parameter $M$, the number of distinct observation symbols, is irrelevant in your case, since your observations are 13-dimensional vectors, an uncountable set! ($M$ would be... the cardinality of the continuum).
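For reference, in the usual HMM notation (not spelled out in the original answer), the difference lives in the emission model $b_j(\cdot)$ attached to each state $j$:

$$b_j(k) = P(o_t = v_k \mid q_t = j), \quad k = 1, \dots, M \qquad \text{(discrete HMM)}$$

$$b_j(\mathbf{o}_t) = \mathcal{N}(\mathbf{o}_t;\, \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \qquad \text{(continuous HMM, one Gaussian per state)}$$

The $N \times M$ emission matrix of the first form only makes sense for a finite symbol set; in the second form each state simply carries a mean vector and a covariance matrix over your 13-dimensional MFCC space, and $M$ never appears.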

I am afraid the introductory material you have picked is not directly relevant to speech recognition, though it is useful for applications in which HMMs are used to recover hidden structure from discrete observations (and there are many of them, for example parsing/tagging in NLP). Try to master this material without thinking too much about your speech recognition problem, and then move on to material about continuous HMMs with multivariate normal distributions, and finally to continuous HMMs with mixtures of multivariate normal distributions (since this is what is likely to work best for speech).

pichenettes
  • Thank you for your long reply, I will definitely look into this! In the paper I gave, the author gave an example (example 3) which showed how this material could be used to solve a task related to speech (identifying whether someone is saying "Yes" or "No"), and my thought process is this: if I can "estimate the HMM", which given the observations will produce a vector of probabilities, I can then compare the given signal (after being put through the HMM) to the training file and then, using a "scoring algorithm", find the best possible match for the sequence. Does this look accurate? – Phorce Feb 19 '13 at 19:40
  • I don't necessarily want to go into more complex versions of Hidden Markov Models (I can understand them at a later date). Because of the time constraints that I have, it might be impossible to understand and implement continuous HMMs - but it is definitely something I will consider after my research is completed. – Phorce Feb 19 '13 at 19:42
  • What the author has written in the paper - that training is "Problem 3" and recognition is "Problem 1" (this applies to speech but also hand-writing, gesture recognition...) is true, but everything else in the document is about discrete HMMs, in which the data you process is in the form of discrete symbols (because this is how they are input or because the data has been quantized). Isn't it obvious to you that the data at the input of your recognition/training process is not discrete symbols from an alphabet, but vectors of reals? This is why the recipes in this document cannot be used as is. – pichenettes Feb 19 '13 at 20:03
  • I am not trying to sell you a "more complicated" version of HMMs. What you need to process are vectors of numbers (13-dimensional MFCC vectors), hence you need a tool to process vectors of numbers, and this tool is continuous HMMs. Not discrete HMMs. – pichenettes Feb 19 '13 at 20:04
  • Thank you for your honesty and your input. Shall I forget about using his documents for attempting to train a Hidden Markov Model with MFCC values and start from the beginning? I'm currently writing the pseudocode he has given in his paper, but I don't know whether or not to bother now. Can you let me know please? – Phorce Feb 19 '13 at 20:12
  • I suggest you continue studying this paper and writing code, but to solve a different problem which is more suitable for discrete HMMs (maybe ask a question here about some ideas of good "toy problems" for learning discrete HMMs). If you don't have a solid understanding of discrete HMMs, you'll struggle when studying continuous HMMs anyway... – pichenettes Feb 19 '13 at 20:21
  • Thank you, I will do this. But to be clear, this version would not be able to solve identifying whether someone is saying "Stop" or "Go" from MFCC values? And could you recommend any papers on continuous HMMs? I've come too far to just give up at this stage. – Phorce Feb 19 '13 at 20:23
  • "But to be clear, this version would not be able to solve identifying whether someone is saying "Stop" or "Go" from MFCC values?" >> No, because your input are vectors, and discrete HMMs process strings of discrete symbols. – pichenettes Feb 19 '13 at 20:24
  • The use of a discrete HMM isn't off base for speech recognition. The problem is in coming up with a discrete set of symbols from a continuum of audio features. This is done using a process called vector quantization. There are references online, but this isn't a simple process. – user2718 Feb 19 '13 at 20:34
  • @BruceZenone So if I transformed the MFCC coefficients into a discrete set of symbols (using vector quantization), I could use a discrete HMM (like the one given in the paper)? Even though it's not a good method to do things, it would solve the problem of identifying whether someone is saying "Stop" or "Go"? – Phorce Feb 19 '13 at 20:37
  • You may be able to get there, but it is no simple task. It looks like the MFCC coefficients represent quite a raw set of features. The paper I listed earlier uses processed MFCC coefficients and energy estimates to create 39-dimensional vectors. These vectors are then mapped into 256 classes. These classes are the symbols for the HMM. What you need to distinguish two words like "Stop" and "Go" may be far less complicated, but it isn't trivial by any stretch. Seems like a great science experiment. Have fun! – user2718 Feb 19 '13 at 21:22
  • @BruceZenone Thanks again for the reply. The paper you listed mentions the "Baum-Welch" algorithm. Couldn't I therefore get all 39 features and use the Baum-Welch algorithm to get the symbols, instead of using vector quantization? Or would it be better just to learn about the continuous HMM and not use the discrete HMM? – Phorce Feb 19 '13 at 21:52
  • The Baum-Welch algorithm is for learning the parameters of the HMM ("problem 3") given data. This algorithm is not specific to either discrete or continuous HMMs. It is not a substitute for vector quantization. – pichenettes Feb 19 '13 at 22:01
  • @pichenettes Thanks :) Do you know of any papers, pseudocode, or even books on continuous Hidden Markov Models where I can use the 13 coefficients I have without having to compute the VQ, etc.? – Phorce Feb 19 '13 at 22:03
  • I don't know which approach is easier to implement, but given @pichenettes's input, I think the continuous HMM approach has proven to be superior, so I would proceed in that direction. – user2718 Feb 19 '13 at 22:08
  • @pichenettes Would this library be sufficient? http://code.soundsoftware.ac.uk/projects/qm-dsp/repository/entry/hmm/hmm.c This library doesn't accept the "M" value, just the number of states, initial probability and transition probability. – Phorce Feb 20 '13 at 13:39
  • yes, this library does continuous HMM (1 gaussian per state) with parameter estimation/decoding/scoring. – pichenettes Feb 20 '13 at 14:20
  • @pichenettes Thanks :) So I can use my 13 coefficients without using any other algorithm to train the HMM? This library should be enough then; thank you for your time and your opinions. Over the summer break (in a few months) I'm going to go back and learn discrete and continuous Markov models and more complex speech recognition tools. – Phorce Feb 20 '13 at 14:35