What does a "vector" in a hidden Markov model mean?

Question

I know that a Hidden Markov Model (HMM) is used in speech recognition and understand it to some degree. However, what I don't know is how the input (speech) is "transformed" into a vector which is later used in the HMM.

How do you get a vector from a sound input? Is this vector readable by a human?

Could someone explain to the OP why he was downvoted? That way he can edit his question into something that's perhaps more appropriate — Ivo Flipse, Aug 16 '11 at 21:38
As I understand it, mathematicians use the term "vector" for what normal people would call "a string of numbers". They see your MP3 file as an arrow pointing to a specific point in a "Hilbert space", which has an infinite number of dimensions... — endolith, Aug 17 '11 at 04:09
I presume you are asking about MFCC. It is clearly given in this Wiki Link on MFCC. http://en.wikipedia.org/wiki/Mel-frequency_cepstral_coefficient — Rajesh D, Aug 17 '11 at 08:09
Nice turn-around, yoda! The original post had several shortcomings, but I think that the current form doesn't deserve the downvotes or close votes. — Kevin Vermeer, Aug 17 '11 at 22:26
@rajesh: why don't you put this in an answer ... so far there is not a good one — Peer Stritzinger, Aug 20 '11 at 06:59

score 8 · Accepted Answer · edited Aug 22 '11 at 05:22

The way speech recognition is carried out with HTK (or any other tool) is loosely similar to the way it is carried out in the brain. When you hear a word, you instantly break it down into its constituent phones and compare them with an internal mental "model" of each phone. These "models" are built up over years of listening to speech and give you the ability to distinguish between similar-sounding sentences like "How to wreck a nice beach" and "How to recognize speech". Speech recognition with HTK or any other model-based scheme works in a similar manner. Here, in a few steps, is how you do it:

1. You take the input speech signal and convert it into a feature vector representation.
2. Take a large number of sentences and perform step 1 on each of them.
3. Use the feature vectors in step 2 to build a statistical model for each of the phones/words in the sentences (there are a limited number of phones/words as against an infinite number of ways of saying them - so you reduce the unknowns by modelling).
4. When a new word comes in, break it into phones and compare with each of the known models. The sequence of phones with the highest probability wins!
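
Steps 3 and 4 can be illustrated with a deliberately tiny sketch. It assumes a single diagonal-covariance Gaussian per phone rather than a real HMM (which would also model how the vectors change over time through states and transitions). The two-dimensional feature vectors and the names `fit_gaussian`, `log_likelihood` and `classify` are made up for illustration; this is not how HTK is implemented.

```python
import math

def fit_gaussian(vectors):
    """Estimate per-dimension mean and variance from training feature vectors."""
    n = len(vectors)
    dims = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dims)]
    # Floor the variance so a dimension with no spread cannot cause division by zero.
    var = [max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, 1e-6)
           for d in range(dims)]
    return mean, var

def log_likelihood(vector, model):
    """Log-density of one feature vector under a diagonal Gaussian."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * s) + (x - m) ** 2 / s)
               for x, m, s in zip(vector, mean, var))

# Hypothetical 2-D feature vectors, labelled with the phone they came from.
training = {
    "hh": [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3]],
    "aa": [[-0.8, 2.0], [-1.0, 2.2], [-0.9, 1.9]],
}

# Step 3: build one statistical model per phone.
models = {phone: fit_gaussian(vecs) for phone, vecs in training.items()}

def classify(vector):
    """Step 4: return the phone whose model scores the vector highest."""
    return max(models, key=lambda phone: log_likelihood(vector, models[phone]))

print(classify([1.1, 0.2]))
```

The point of the sketch is only that "comparing with a model" means evaluating a probability for the incoming feature vector under each model and keeping the best-scoring one.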

All of the above steps are critical to any speech recognition task. By decomposing a sound into feature vectors, you move it into a model space: a representation that is better suited to statistical modelling than others (say, the time-amplitude representation). Most such representations lie in the frequency or the time-frequency domain. One of the most popular is the MFCC (Mel-Frequency Cepstral Coefficient) representation. In a way, this technique mimics the human hearing response with a set of filters: each short frame of the input signal is analysed with a bank of filters whose center frequencies follow the mel scale, which is roughly linear at low frequencies and logarithmic at higher ones. The MFCCs of any one sentence (say) are then used to model each of the phones that the sentence is made of. As an example, consider:

Sentence: HI. Phonetic description: hh aa ey
When you feed the MFCC coefficients into HTK, it will associate the MFCC coefficients of a portion of the sentence with hh, another with aa and so on. When this is repeated many times over, the models for the phones begin to form.
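
To make "sound in, vector out" concrete, here is a rough sketch of how one MFCC vector could be computed for a single frame. It uses only the Python standard library (so the DFT is slow and explicit), and the frame length, the 20 filters and the 12 coefficients are illustrative assumptions rather than HTK defaults. Real front ends such as HCopy add further steps (pre-emphasis, liftering, energy and delta terms) that are not shown here.

```python
import cmath
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Hamming-window the frame and return |DFT|^2 for bins 0..N/2."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))) ** 2
            for k in range(n // 2 + 1)]

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in edges_hz]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):          # rising edge of the triangle
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):          # falling edge of the triangle
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate, n_filters=20, n_ceps=12):
    """Return n_ceps cepstral coefficients (one feature vector) for one frame."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # Log of each filter's energy, floored to avoid log(0).
    log_energies = [math.log(max(sum(w * p for w, p in zip(row, spectrum)), 1e-10))
                    for row in bank]
    # A DCT-II of the log energies gives the cepstral coefficients; c0 is skipped.
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_energies))
            for c in range(1, n_ceps + 1)]

# One 32 ms frame of a synthetic 440 Hz tone at 8 kHz stands in for speech.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(256)]
vector = mfcc(frame, sample_rate)
print(len(vector), [round(c, 2) for c in vector[:3]])
```

The result is simply a list of 12 numbers per frame. A whole utterance becomes a sequence of such vectors, one per frame, and that sequence is what the HMM is trained on. The numbers are readable by a human in the sense that you can print them, but they are not easy to interpret by eye.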

HTK uses the tool HCopy to convert an input sentence into its feature-vector representation. There are many "flavours" of MFCCs as well (for example the E_D_A or E_D_A_Z representations, which append energy, delta and acceleration coefficients, with Z applying zero-mean normalisation). It would be a good idea to read the documentation for HCopy in the htkbook.

HTK writes the MFCCs to a file with the extension .mfc. You cannot read that file in a text editor because (I think) the coefficients are written in binary. You can, however, read the file with a short program, in C for example.
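
As a hedged sketch of what reading such a file might look like, the snippet below parses the layout that the htkbook describes for uncompressed parameter files, as far as I recall it: a 12-byte big-endian header (number of frames, frame period in 100 ns units, bytes per frame, parameter kind) followed by 4-byte floats. The function name `parse_htk` is hypothetical, compressed files and native byte order are not handled, and the layout should be checked against the htkbook before relying on it. To stay self-contained, the example builds a tiny fake file in memory instead of opening a real one.

```python
import struct

def parse_htk(data):
    """Parse the bytes of an uncompressed HTK parameter file.

    Assumed layout: 12-byte big-endian header
    (nSamples:int32, sampPeriod:int32, sampSize:int16, parmKind:int16)
    followed by nSamples frames of 4-byte floats.
    """
    n_samples, samp_period, samp_size, parm_kind = struct.unpack(">iihh", data[:12])
    n_coeffs = samp_size // 4            # each coefficient is a 4-byte float
    frames = []
    offset = 12
    for _ in range(n_samples):
        chunk = data[offset:offset + samp_size]
        frames.append(list(struct.unpack(f">{n_coeffs}f", chunk)))
        offset += samp_size
    return {"period_100ns": samp_period, "parm_kind": parm_kind, "frames": frames}

# Fake file: 2 frames of 3 coefficients, 10 ms frame period.
# The parameter kind 6 is assumed here to be the base code for MFCC.
header = struct.pack(">iihh", 2, 100000, 12, 6)
body = struct.pack(">6f", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
parsed = parse_htk(header + body)

for frame in parsed["frames"]:
    print(frame)
```

For a real file you would pass in the bytes read from disk in binary mode. HTK also ships a tool, HList, that prints the contents of such files as text, which is probably the easiest way to look at the vectors.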

HTH.

I have downvoted for multiple reasons. It lacks accuracy. It is highly inaccurate and full of half-truths. Unnecessary mention of tools and gadgets and methods to use them, which is not relevant to the question. First of all, the question itself is not well composed, and your answer doesn't seem to mention that. Moreover, there is a nice Wikipedia article on MFCC which I mention in my comment on the question. — Rajesh D, Aug 18 '11 at 11:42
@Rajesh: thanks for the feedback! I have provided the OP with a link to the MFCC page on wikipedia if you look carefully. If you think this answer is inaccurate, then please highlight the inaccuracies, so we may learn. Mere down-voting does not amount to constructive criticism so I encourage you to make free use of the edit button on the answer, or better still, provide us with an answer of your own. And yes, if the question itself is inaccurate, the answer too will lack accuracy. That has been discussed in the comments section itself. — Sriram, Aug 20 '11 at 06:29
You repeatedly mention "phones" ... shouldn't this be "phonemes" instead? — Peer Stritzinger, Aug 20 '11 at 06:46
@Peer: No. The phones are modelled in a speech recognition task, not the phonemes (IMHO). — Sriram, Aug 20 '11 at 07:38
Ah I see ... unfortunate choice of a technical term (really hard to google ;-) Have added a link to the wikipedia article for phones — Peer Stritzinger, Aug 20 '11 at 07:47

score 0 · Answer 2 · answered Aug 16 '11 at 22:18

Every wave can be decomposed into the addition of many other waves. Using a Fourier transform, you can analyze a wave into its frequency components. The amplitude of these frequency components can then be used as a vector. Here's the documentation on the Sphinx class that does this and here's a good visual explanation of the Fourier transform.
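
A minimal sketch of that idea, using a plain discrete Fourier transform written with the standard library (the function name `magnitude_spectrum` and the test signal are made up for illustration; a real system would use an FFT on short overlapping frames):

```python
import cmath
import math

def magnitude_spectrum(samples):
    """Return the magnitude of each DFT bin from 0 up to N/2."""
    n = len(samples)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples)))
            for k in range(n // 2 + 1)]

# 64 samples holding exactly 4 cycles of one sine plus 10 cycles of a weaker one.
n = 64
signal = [math.sin(2 * math.pi * 4 * i / n) + 0.5 * math.sin(2 * math.pi * 10 * i / n)
          for i in range(n)]

vector = magnitude_spectrum(signal)   # this list of 33 numbers is the "vector"

# The two largest entries sit at the bins of the two sine components.
peaks = sorted(range(len(vector)), key=lambda k: vector[k], reverse=True)[:2]
print(sorted(peaks))  # prints [4, 10]
```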
