Will similar enough vectors be "put together" and only treated as 1?
No, this is not how it is done.
So for what p will P(O|p) be computed?
Your assumption that recognition is performed by considering, one by one, all possible phone or word sequences and scoring them is wrong. For just a few seconds of speech the number of possible word sequences would be extremely large! This approach - scoring each candidate model over the entire space and picking the one with the largest score - is actually used only for some small-vocabulary / single-word applications (say, recognizing the options on a voice menu). I think this confusion stems from the fact that the classic Rabiner tutorial on HMMs makes a lot of sense for finite-vocabulary applications, but stays away from all the difficulties/technicalities of continuous applications.
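To make that small-vocabulary decision rule concrete, here is a minimal sketch (the forward-algorithm implementation and the way the models are stored are my own toy choices, not anything from a real recognizer): each menu option has its own trained HMM, you compute P(O|M_i) for every candidate with the forward algorithm, and you keep the argmax.

```python
import numpy as np

def forward_log_likelihood(log_pi, log_A, log_B):
    """Forward algorithm in the log domain.

    log_pi : (N,)   log initial state probabilities of one candidate HMM
    log_A  : (N, N) log transition matrix
    log_B  : (T, N) log emission likelihood of each observation frame under each
             state (e.g. per-state Gaussians evaluated on the MFCC frames)
    Returns log P(O | M) for this model.
    """
    T, _ = log_B.shape
    alpha = log_pi + log_B[0]                                      # first frame
    for t in range(1, T):
        # log-sum-exp over the previous state for each current state
        alpha = log_B[t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)                              # marginalise over the final state

def recognize(candidate_models, emission_logliks):
    """Small-vocabulary decision rule: score the same utterance under every
    word model and keep the best one.

    candidate_models : {word: (log_pi, log_A)}
    emission_logliks : {word: (T, N) log emission matrix under that word's states}
    """
    scores = {w: forward_log_likelihood(*candidate_models[w], emission_logliks[w])
              for w in candidate_models}
    return max(scores, key=scores.get)
```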
In continuous speech recognition there are several layers on top of the phone models. Phone models are joined together to form word models, and word models are joined together according to a language model. The recognition problem is formulated as searching for the least-cost path through this graph.
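A toy sketch of how those layers are stacked (the lexicon, phone names and bigram costs below are made up for illustration): each word expands into the concatenation of its phone HMMs, and a bigram language model supplies the word-to-word transition costs; the decoder then searches this composed graph for the least-cost path, adding the per-frame acoustic costs along the way.

```python
import math

# Hypothetical 3-state left-right HMM per phone; states are named "<phone>_<index>".
def phone_states(phone, n_states=3):
    return [f"{phone}_{i}" for i in range(n_states)]

# Toy lexicon: each word is a phone sequence, so each word model is just
# the concatenation of its phone HMMs' states.
lexicon = {"yes": ["y", "eh", "s"], "no": ["n", "ow"]}
word_models = {w: [s for p in phones for s in phone_states(p)]
               for w, phones in lexicon.items()}

# Toy bigram language model: cost (negative log probability) of following
# one word with another.  The numbers are made up.
bigram_cost = {("yes", "no"): -math.log(0.4), ("no", "yes"): -math.log(0.4),
               ("yes", "yes"): -math.log(0.6), ("no", "no"): -math.log(0.6)}

# The decoding graph connects the last state of each word to the first state
# of each word that may follow it, weighted by the language-model cost.
inter_word_arcs = {(word_models[u][-1], word_models[v][0]): cost
                   for (u, v), cost in bigram_cost.items()}
```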
One could do the same by piecing together HMMs, though... Let's take a simpler example. You have 5 phones and you want to recognize any sequence of them (no word models, no language model - let's consider meaningless speech for this example). You train one left-right HMM per phone, say 3 or 4 states each - maybe on clean, isolated recordings of each phone. To perform continuous recognition, you build one big HMM by adding transitions from the last state of each phone to the first state of each phone. Performing recognition with this new model means finding the most likely sequence of states given the sequence of observation vectors - with the Viterbi algorithm. You will know exactly which sequence of states (and thus phones) is the most compatible with the observation.

Practically, things are never done this way, and this is why I downplayed in my previous reply the importance of Rabiner's "problem 2": in a true continuous application, the size of the vocabulary is such that naively running the Viterbi algorithm on this giant HMM - made of all possible connections between phones to make words, and words to make sentences - would be impossible. We stop reasoning in terms of HMMs and start thinking in terms of FSTs.
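Here is a sketch of that "one big HMM" construction for the 5-phone example (all parameters are toy values; in a real system the per-state emission models would be trained on data): stack the five 3-state left-right phone HMMs into one transition matrix, connect each phone's exit state to every phone's entry state, and run Viterbi to recover the best state sequence, from which the phone sequence is read off.

```python
import numpy as np

N_PHONES, STATES_PER_PHONE = 5, 3
N = N_PHONES * STATES_PER_PHONE

# Composite transition matrix: a left-right topology inside each phone
# (stay or advance), plus the "glue" added for continuous recognition:
# from the last state of every phone to the first state of every phone.
A = np.zeros((N, N))
for p in range(N_PHONES):
    base = p * STATES_PER_PHONE
    for s in range(STATES_PER_PHONE - 1):
        A[base + s, base + s] = 0.5              # self-loop
        A[base + s, base + s + 1] = 0.5          # advance within the phone
    last = base + STATES_PER_PHONE - 1
    A[last, last] = 0.5                          # self-loop on the exit state
    for q in range(N_PHONES):                    # jump to any phone's entry state
        A[last, q * STATES_PER_PHONE] = 0.5 / N_PHONES
log_A = np.log(np.where(A > 0, A, 1e-300))

def viterbi(log_pi, log_A, log_B):
    """Most likely state sequence given per-frame log emission likelihoods log_B (T, N)."""
    T, n = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # [i, j]: best score ending in j via i
        back[t] = np.argmax(scores, axis=0)
        delta = np.max(scores, axis=0) + log_B[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

def states_to_phones(path):
    """Collapse runs of states belonging to the same phone into one phone label."""
    phones, prev = [], None
    for s in path:
        p = s // STATES_PER_PHONE
        if p != prev:
            phones.append(p)
            prev = p
    return phones
```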
If you want to use the Viterbi algorithm, I can only imagine using a fully connected HMM (not left-right), where each state is a phone, or something like that. Is that what you mean?
– Jake1234 Aug 24 '14 at 19:03
Why does this optimal state sequence tell us the answer? The HMM models for phones are all trained so as to maximize P(O|M) given an isolated-word input.
Let's say I want to recognize one phoneme, which is the only input. What if it happens that, while P(O|M_1) is the highest of all the P(O|M_i), there exists a state sequence S in M_2 such that P(O|M_2,S) is bigger than P(O|M_1,Z) for any state sequence Z in M_1? Viterbi would then answer M_2, while we know that the answer is M_1.
– Jake1234 Aug 24 '14 at 20:07
If the most probable path goes through model K_1, how does this imply that P(O|K_1) is the highest of all P(O|K_i)?
– Jake1234 Aug 24 '14 at 21:40
What I mean is, if you've got continuous emissions of feature vectors, it's realistic that you'll find two such that the emission is > 0.5 for both, which doesn't really make sense.
I guess it doesn't really matter, as you just want the emission of a certain vector (and those close to it) to be high?
– Jake1234 Aug 25 '14 at 22:32