Will similar enough vectors be "put together" and only treated as 1?
No, this is not how it is done.
So for what p will P(O|p) be computed?
Your assumption that recognition is performed by considering, one by one, all possible phone or word sequences and scoring them is wrong. For just a few seconds of speech the number of possible word sequences would be extremely large! This approach - scoring each candidate model over the entire space and picking the one with the largest score - is actually used only for some small-vocabulary / single-word applications (say, recognizing the options on a voice menu). I think this confusion stems from the fact that the classic Rabiner tutorial on HMMs makes a lot of sense for finite-vocabulary applications, but stays away from all the difficulties/technicalities of continuous applications.
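To make that small-vocabulary decision rule concrete, here is a minimal sketch (the forward-algorithm implementation and the way the models are stored are my own toy choices, not anything from a real recognizer): each menu option has its own trained HMM, you compute P(O|M_i) for every candidate with the forward algorithm, and you keep the argmax.

```python
import numpy as np

def forward_log_likelihood(log_pi, log_A, log_B):
    """Forward algorithm in the log domain.

    log_pi : (N,)   log initial state probabilities of one candidate HMM
    log_A  : (N, N) log transition matrix
    log_B  : (T, N) log emission likelihood of each observation frame under each
             state (e.g. per-state Gaussians evaluated on the MFCC frames)
    Returns log P(O | M) for this model.
    """
    T, _ = log_B.shape
    alpha = log_pi + log_B[0]                                      # first frame
    for t in range(1, T):
        # log-sum-exp over the previous state for each current state
        alpha = log_B[t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)                              # marginalise over the final state

def recognize(candidate_models, emission_logliks):
    """Small-vocabulary decision rule: score the same utterance under every
    word model and keep the best one.

    candidate_models : {word: (log_pi, log_A)}
    emission_logliks : {word: (T, N) log emission matrix under that word's states}
    """
    scores = {w: forward_log_likelihood(*candidate_models[w], emission_logliks[w])
              for w in candidate_models}
    return max(scores, key=scores.get)
```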
In continuous speech recognition there are several layers on top of the phone models. Phone models are joined together to form word models, and word models are joined together according to a language model. The recognition problem is formulated as searching for the least-cost path through this graph.
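A toy sketch of how those layers are stacked (the lexicon, phone names and bigram costs below are made up for illustration): each word expands into the concatenation of its phone HMMs, and a bigram language model supplies the word-to-word transition costs; the decoder then searches this composed graph for the least-cost path, adding the per-frame acoustic costs along the way.

```python
import math

# Hypothetical 3-state left-right HMM per phone; states are named "<phone>_<index>".
def phone_states(phone, n_states=3):
    return [f"{phone}_{i}" for i in range(n_states)]

# Toy lexicon: each word is a phone sequence, so each word model is just
# the concatenation of its phone HMMs' states.
lexicon = {"yes": ["y", "eh", "s"], "no": ["n", "ow"]}
word_models = {w: [s for p in phones for s in phone_states(p)]
               for w, phones in lexicon.items()}

# Toy bigram language model: cost (negative log probability) of following
# one word with another.  The numbers are made up.
bigram_cost = {("yes", "no"): -math.log(0.4), ("no", "yes"): -math.log(0.4),
               ("yes", "yes"): -math.log(0.6), ("no", "no"): -math.log(0.6)}

# The decoding graph connects the last state of each word to the first state
# of each word that may follow it, weighted by the language-model cost.
inter_word_arcs = {(word_models[u][-1], word_models[v][0]): cost
                   for (u, v), cost in bigram_cost.items()}
```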
One could do the same by piecing together HMMs, though... Let's take a simpler example. You have 5 phones and you want to recognize any sequence of them (no word models, no language model - let's consider meaningless speech for this example). You train one left-right HMM per phone, say 3 or 4 states each - maybe on clean, isolated recordings of each phone. To perform continuous recognition, you build one big HMM by adding transitions from the last state of each phone to the first state of each phone. Performing recognition with this new model means finding the most likely sequence of states given the sequence of observation vectors - with the Viterbi algorithm. You will know exactly which sequence of states (and thus phones) is the most compatible with the observation.

Practically, things are never done this way, and this is why I downplayed in my previous reply the importance of Rabiner's "problem 2": in a true continuous application, the size of the vocabulary is such that naively running the Viterbi algorithm on this giant HMM - made of all possible connections between phones to make words, and words to make sentences - would be impossible. We stop reasoning in terms of HMMs and start thinking in terms of FSTs.
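Here is a sketch of that "one big HMM" construction for the 5-phone example (all parameters are toy values; in a real system the per-state emission models would be trained on data): stack the five 3-state left-right phone HMMs into one transition matrix, connect each phone's exit state to every phone's entry state, and run Viterbi to recover the best state sequence, from which the phone sequence is read off.

```python
import numpy as np

N_PHONES, STATES_PER_PHONE = 5, 3
N = N_PHONES * STATES_PER_PHONE

# Composite transition matrix: a left-right topology inside each phone
# (stay or advance), plus the "glue" added for continuous recognition:
# from the last state of every phone to the first state of every phone.
A = np.zeros((N, N))
for p in range(N_PHONES):
    base = p * STATES_PER_PHONE
    for s in range(STATES_PER_PHONE - 1):
        A[base + s, base + s] = 0.5              # self-loop
        A[base + s, base + s + 1] = 0.5          # advance within the phone
    last = base + STATES_PER_PHONE - 1
    A[last, last] = 0.5                          # self-loop on the exit state
    for q in range(N_PHONES):                    # jump to any phone's entry state
        A[last, q * STATES_PER_PHONE] = 0.5 / N_PHONES
log_A = np.log(np.where(A > 0, A, 1e-300))

def viterbi(log_pi, log_A, log_B):
    """Most likely state sequence given per-frame log emission likelihoods log_B (T, N)."""
    T, n = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # [i, j]: best score ending in j via i
        back[t] = np.argmax(scores, axis=0)
        delta = np.max(scores, axis=0) + log_B[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

def states_to_phones(path):
    """Collapse runs of states belonging to the same phone into one phone label."""
    phones, prev = [], None
    for s in path:
        p = s // STATES_PER_PHONE
        if p != prev:
            phones.append(p)
            prev = p
    return phones
```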
If you want to use the Viterbi algorithm, I can only imagine using a fully connected HMM (not left-right), where each state is a phone, or something like that. Is that what you mean?
– Jake1234 Aug 24 '14 at 19:03
Why does this optimal state sequence tell us the answer? The HMM models for phones are all trained so as to maximize P(O|M) given an isolated-word input.
Let's say I want to recognize one phoneme, which is the only input. What if it happens that, while P(O|M_1) is the highest of all the P(O|M_i), there exists a state sequence S in M_2 such that P(O|M_2,S) is bigger than P(O|M_1,Z) for any state sequence Z in M_1? Viterbi would then answer M_2, while we know that the answer is M_1.
– Jake1234 Aug 24 '14 at 20:07
If the most probable path goes through model K_1, how does this imply that P(O|K_1) is the highest of all P(O|K_i)?
– Jake1234 Aug 24 '14 at 21:40
What I mean is, if you've got continuous emissions of feature vectors, it's realistic that you'll find two such that the emission is > 0.5 for both, which doesn't really make sense.
I guess it doesn't really matter, as you just want the emission of a certain vector (and those close to it) to be high?
– Jake1234 Aug 25 '14 at 22:32