Last modified on 09/26/07 11:25:33

Continuous speech is highly context-dependent and variable, so it is not sufficient to build one model per phone: the acoustics of a phone can differ significantly depending on the context in which it is used. That is why context-dependent models are often used for continuous speech. Such models depend not only on the phone name but also on the names of the previous and next phones, and possibly on many more parameters. Of course, it is not possible to build a model for every combination of parameters, especially since the number of parameters can exceed a hundred. That is why training software usually selects the set of models either automatically or with a little input from the user.
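To make the idea of context dependence concrete, here is a minimal sketch that expands a phone sequence into context-dependent labels. The `left-centre+right` notation follows the HTK triphone convention; the function name `to_triphones`, the sample phones and the inventory size of 40 are assumptions made only for illustration.

```python
def to_triphones(phones):
    """Expand a phone sequence into left-centre+right context labels."""
    labels = []
    for i, centre in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i + 1 < len(phones) else None
        label = centre
        if left is not None:
            label = f"{left}-{label}"
        if right is not None:
            label = f"{label}+{right}"
        labels.append(label)
    return labels


# Hypothetical phone sequence for a single word.
print(to_triphones(["k", "ae", "t"]))  # ['k+ae', 'k-ae+t', 'ae-t']

# With N phones there can be up to N**3 distinct triphones, which is why
# a separate model for every combination cannot be trained in practice.
n_phones = 40  # assumed inventory size
print(n_phones ** 3)  # 64000
```

Even with only the immediate left and right neighbours as context, the number of possible models grows cubically with the size of the phone set, and that is before any further parameters are taken into account.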

For example, Sphinx can build the set of models automatically. HTK requires you to pass the list of properties that model selection will use and does the rest itself. Of course, if you have hand-made questions it is better to pass them to Sphinx too, which it allows.

The important thing is that models are organized in a tree, and the parameters you pass are called questions. The decoder asks questions about the phone context and decides which model to use. So it is important to create a good list of questions. Let's describe how you can do it. In HTK the questions go in a file, tree.hed; in Sphinx they are specified in a config file.
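The way a tree of questions picks a model can be sketched as follows. This is only an illustration under stated assumptions, not the actual data structure used by Sphinx or HTK: the phone classes, the question names, the tree shape and the model names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical phone classes used by the questions.
NASALS = {"m", "n", "ng"}
VOWELS = {"aa", "ae", "iy", "uw"}

# Each question is a yes/no test on the left and right context.
QUESTIONS = {
    "L_Nasal": lambda left, right: left in NASALS,
    "R_Vowel": lambda left, right: right in VOWELS,
}


@dataclass
class Node:
    question: str             # name of the question asked at this node
    yes: Union["Node", str]   # subtree or model name if the answer is yes
    no: Union["Node", str]    # subtree or model name if the answer is no


def choose_model(tree, left, right):
    """Walk the tree, answering questions, until a leaf (model name) is hit."""
    node = tree
    while isinstance(node, Node):
        answer = QUESTIONS[node.question](left, right)
        node = node.yes if answer else node.no
    return node


# Hypothetical tree for the phone "t".
TREE_T = Node("L_Nasal",
              yes="t_model_1",
              no=Node("R_Vowel", yes="t_model_2", no="t_model_3"))

print(choose_model(TREE_T, "n", "aa"))  # t_model_1
print(choose_model(TREE_T, "s", "aa"))  # t_model_2
print(choose_model(TREE_T, "s", "s"))   # t_model_3
```

Contexts that answer every question the same way end up sharing one model, which is how the tree keeps the total number of models manageable.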

We consider the task of tree creation for a new language. For some languages, like English, questions already exist, of course. So what should you do? Just list the important things that affect phone acoustics. Collect sources and look for descriptions of acoustic classes:

  • Books on phonetics
  • Festival TTS voices (often have a precise description)
  • Questions in similar languages
  • (language page often has a phoneset with classification in IPA, but it's not very precise)

Now read the book and try to build the list. Phones are often acoustically related through the following groupings:

  • List of vowels
  • List of consonants
  • List of vowels for each property: front vowels, back vowels, middle vowels, diphthongs, rounded vowels and long vowels
  • List of fricative consonants
  • List of nasals
  • List of liquids
  • List of stops
  • Any other group of phones

I hope you get the idea. Now repeat the questions for each context: a question for the left context, the right context and the phone itself. The result should look like this for Sphinx or like this for HTK.
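As a rough illustration of repeating each class for every context, the sketch below turns a small table of phone classes into question lines in the style of the HTK `QS` command used in tree.hed. The phone classes and their members are hypothetical, the helper `make_questions` is not part of any toolkit, and the pattern used for the phone itself (`*-p+*`) is an assumption; check the result against your own phone set and model naming before using it.

```python
# Hypothetical phone classes; replace them with the classes of your language.
PHONE_CLASSES = {
    "Vowel": ["aa", "ae", "iy", "uw"],
    "Nasal": ["m", "n", "ng"],
    "Stop": ["p", "b", "t", "d", "k", "g"],
}

# Triphone name patterns for left context, right context and the phone itself.
PATTERNS = {"L": "{}-*", "R": "*+{}", "C": "*-{}+*"}


def make_questions(classes):
    """Build one QS-style line per (context, class) pair."""
    lines = []
    for name, members in classes.items():
        for prefix, pattern in PATTERNS.items():
            items = ",".join(pattern.format(p) for p in members)
            lines.append(f'QS "{prefix}_{name}" {{ {items} }}')
    return lines


questions = make_questions(PHONE_CLASSES)
for line in questions:
    print(line)
# The first line printed is:
# QS "L_Vowel" { aa-*,ae-*,iy-*,uw-* }
```

Keeping the classes in one table and generating the per-context questions from it avoids left and right questions drifting apart when a class is edited.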

The number of questions should be small, since otherwise you have to collect too much data to train all the models. It's recommended to have 20-30 questions for the tree.