The way speech recognition is carried out with HTK (or any other tool) is loosely similar to the way it is carried out in the brain. When you hear a word, you break it down into its constituent phones and compare them with an internal mental "model" of each phone. These "models" are built up over years of listening to speech and give you the ability to distinguish between similar-sounding sentences like "How to wreck a nice beach" and "How to recognize speech". Speech recognition with HTK or any other model-based scheme works in a similar manner. Here, in a few steps, is how you do it:
1. Take the input speech signal and convert it into a sequence of feature vectors.
2. Take a large number of sentences and perform step 1 on each of them.
3. Use the feature vectors from step 2 to build a statistical model for each of the phones/words in the sentences (there are a limited number of phones/words, as against an infinite number of ways of saying them, so you reduce the unknowns by modelling).
4. When a new word comes in, convert it to feature vectors as in step 1 and score it against each of the known models. The sequence of phones with the highest probability wins!
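To make the "highest probability wins" idea concrete, here is a toy sketch in Python. It is not how HTK does it internally (HTK scores whole sequences against hidden Markov models); it only scores a single feature vector against one diagonal Gaussian per phone. The phone names are borrowed from the example further down, and the means and variances in `PHONE_MODELS` are made-up numbers for illustration.

```python
import math

def log_gaussian(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

# Hypothetical two-dimensional "models", one per phone (made-up numbers).
PHONE_MODELS = {
    "hh": {"mean": [0.0, 1.0], "var": [1.0, 1.0]},
    "aa": {"mean": [3.0, -1.0], "var": [1.0, 1.0]},
    "ey": {"mean": [-2.0, 4.0], "var": [1.0, 1.0]},
}

def best_phone(x, models=PHONE_MODELS):
    """Return the phone whose model gives x the highest log-likelihood."""
    return max(
        models,
        key=lambda p: log_gaussian(x, models[p]["mean"], models[p]["var"]),
    )

print(best_phone([2.8, -0.7]))  # prints "aa", the model with the closest mean
```

A real recogniser does the same kind of comparison, but over whole sequences of feature vectors and with a search over possible phone sequences rather than a single lookup.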
All of the above steps are critical to any speech recognition task. By converting a sound into feature vectors, you move it into a representation that is better suited to building models than others (say, the raw time-amplitude waveform). Most such representations lie in the frequency or the time-frequency domain. One of the most popular is the MFCC (Mel Frequency Cepstral Coefficient) representation. In a way, this technique mimics the human hearing response with a set of filters: the input signal is analysed in short frames, and each frame's spectrum is passed through a bank of filters whose center frequencies are evenly spaced on the mel scale (roughly linear at low frequencies and logarithmic at higher ones). The MFCCs of a sentence are then used to model each of the phones that the sentence is made of. As an example, consider:
Sentence: HI. Phonetic description: hh aa ey
When you feed the MFCCs into HTK along with this transcription, it will associate the MFCCs of one portion of the sentence with hh, another with aa, and so on. When this is repeated over many sentences, the models for the phones (hidden Markov models, in HTK's case) begin to form.
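The filter spacing described above can be sketched in a few lines of Python. This assumes the commonly used mel formula mel(f) = 2595 log10(1 + f/700); the function names, the number of filters and the frequency range are my own choices for illustration, not anything HTK requires.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def filter_centers(num_filters, low_hz, high_hz):
    """Center frequencies (Hz) of filters spaced evenly on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]

# Assumed setup: 10 filters covering 0 to 8000 Hz.
centers = filter_centers(num_filters=10, low_hz=0.0, high_hz=8000.0)
for c in centers:
    print(round(c, 1))
```

If you run it you will see the centers packed closely together at low frequencies and spreading further apart as the frequency rises, which is the "hearing-like" resolution the filter bank is after.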
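HTK uses the tool HCopy to convert an input sentence into its feature-vector representation. There are many "flavours" of MFCCs as well, selected with qualifiers appended to the parameter kind (for example MFCC_E_D_A or MFCC_E_D_A_Z, where _E adds log energy, _D delta coefficients, _A acceleration coefficients, and _Z subtracts the mean). It would be a good idea to read the documentation for HCopy in the HTK Book.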
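The D and A in those names stand for delta and acceleration coefficients, which describe how the features change from frame to frame. As a rough sketch, the Python below computes deltas with a regression formula of the kind given in the HTK Book, d_t = sum over k of k (c_{t+k} - c_{t-k}), divided by 2 times the sum of k squared, with k running from 1 to the window size. The window of 2 and the choice to repeat the first and last frame at the edges are assumptions on my part; check them against your own configuration.

```python
def deltas(frames, window=2):
    """Regression-based delta coefficients for a list of feature vectors.

    frames is a list of equal-length lists of floats, one per time frame.
    """
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, last)][d]  # repeat last frame past the end
                behind = frames[max(t - k, 0)][d]    # repeat first frame before the start
                num += k * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out

# A one-dimensional ramp that rises by 1.0 per frame.
ramp = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
d = deltas(ramp)
a = deltas(d)  # acceleration = delta of the deltas
print(d[2], d[3])
```

Away from the edges, the delta of the ramp comes out as its slope, and applying the same function to the deltas gives the acceleration coefficients.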
HTK writes the MFCCs to a file that conventionally has the extension .mfc. You cannot read that file with a text editor because the coefficients are stored in binary (a short header followed by the raw values). You can print the contents with HTK's HList tool, or read the file yourself with a small program in C or any other language.
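Here is a minimal sketch of reading such a file, in Python rather than C for brevity. It assumes the layout described in the HTK Book as I understand it: a 12-byte header (number of frames, sample period in 100 ns units, bytes per frame, parameter kind) followed by 4-byte floats, all big-endian, which is HTK's default byte order. It does not handle compressed (_C) files or files with a CRC appended, and the function names are my own.

```python
import struct

# Header fields: nSamples (int32), sampPeriod (int32), sampSize (int16), parmKind (int16).
HEADER = struct.Struct(">iihh")

def parse_htk(data):
    """Parse the raw bytes of an uncompressed HTK parameter file."""
    n_samples, samp_period, samp_size, parm_kind = HEADER.unpack_from(data, 0)
    if parm_kind & 0o2000:  # _C qualifier bit: compressed data
        raise ValueError("compressed (_C) files are not handled in this sketch")
    dim = samp_size // 4  # each coefficient is assumed to be a 4-byte float
    frames = []
    offset = HEADER.size
    for _ in range(n_samples):
        frames.append(list(struct.unpack_from(">%df" % dim, data, offset)))
        offset += samp_size
    return {
        "period_100ns": samp_period,    # e.g. 100000 means 10 ms between frames
        "base_kind": parm_kind & 0o77,  # low 6 bits; 6 is MFCC
        "has_energy": bool(parm_kind & 0o100),
        "has_delta": bool(parm_kind & 0o400),
        "has_accel": bool(parm_kind & 0o1000),
        "frames": frames,
    }

def read_htk(path):
    """Read and parse an HTK parameter file from disk."""
    with open(path, "rb") as f:
        return parse_htk(f.read())
```

You would call it as `read_htk("sample.mfc")`, where `sample.mfc` is a placeholder for one of your own files, and then look at the `frames` entry. If the numbers look like garbage, the byte order is the first thing to check.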
HTH.