Ticket #271 (new enhancement)

Opened 12 years ago

Last modified 12 years ago

periodically tuning acoustic model parameters as corpus gets larger

Reported by: kmaclean Owned by: kmaclean
Priority: critical Milestone: Acoustic Model 1.0
Component: Acoustic Model Version: 0.1-alpha
Keywords: Cc:

Description

email from David Gelbart

adding more acoustic model parameters

The general rule I have seen with ASR systems is that, as the amount of training data increases, it eventually becomes necessary to add more acoustic model parameters in order to get the full benefit of the additional data. On the other hand, using too many acoustic model parameters may cause overfitting (in other words, the system starts modeling quirks of the training data to the point where the system's performance on non-training data is worsened).

Thus, you may need to periodically tune the number of acoustic model parameters you are using. I suppose the easiest way to do this is to create a test set which does not overlap with the training set, and measure word recognition accuracy on the test set for various acoustic model sizes.

use more Gaussians in the Gaussian mixtures

One way to increase the number of parameters is to use more Gaussians in the Gaussian mixtures. (One way to do this in HTK is to add one or more additional mixup stages. This has the advantage that you can use your test set to compare recognition accuracy before and after the mixup, so that you can obtain your recognition accuracy numbers without having to retrain a system from scratch each time.)

move from monophones to triphones

Another way to increase the number of parameters is to move from monophones to triphones (unless you are using triphones already).

reduce the amount of state-tying

Another way is to reduce the amount of state-tying.

Change History

comment:1 Changed 12 years ago by kmaclean

My reply:

Hi David,

Thanks for keeping an eye on the VoxForge project!

Thus, you may need to periodically tune the number of acoustic model parameters you are using.

I did not realize I needed to do this on an on-going basis (as the corpus gets larger ...), thanks.

I suppose the easiest way to do this is to create a test set which does not overlap with the training set, and measure word recognition accuracy on the test set for various acoustic model sizes.

I basically have no real acoustic model testing (just some 'sanity-testing' using recordings of my own voice) - I agree it needs to be done.
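For reference, a minimal held-out evaluation in HTK might look like the following (the file names - test.scp, wdnet, dict, testref.mlf - and the model directory are hypothetical placeholders, not part of the current VoxForge recipe; this is a sketch only):

        # Decode the held-out test set with the current model set,
        # then score the output against reference transcriptions.
        # Repeat for each acoustic model size being compared.
        HVite -H hmm22/macros -H hmm22/hmmdefs -S test.scp \
              -i recout.mlf -w wdnet -p 0.0 -s 5.0 dict tiedlist
        HResults -I testref.mlf tiedlist recout.mlf

HResults reports word accuracy, so running this once per candidate model size gives exactly the comparison David describes.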

One way to increase the number of parameters is to use more Gaussians in the Gaussian mixtures.

My training recipe is based on the HTK Tutorial, which describes the creation of "continuous density mixture Gaussian tied-state triphones with clustering performed using phonetic decision trees". I have not looked at increasing the number of Gaussian mixtures per state, and am not sure how. I am not even sure I understand what a Gaussian is (more reading is required on my part...)

Keith Vertanen's HTK training recipe site has a paper where he describes the results of using different combinations of parameters. With respect to the Number of Gaussians (section 2.2) he says:

Recognition experiments were conducted on models with a varying number of Gaussians per state. Results for a single Gaussian per state were omitted from the graphs for clarity. In all cases the omitted single Gaussian model performed much worse than the semi-continuous or two Gaussian model.

Both Nov'92 (figure 1 and 2) and si dt s2 (figure 7 and 8) tasks show continued reductions in WER as exponentially more Gaussians are added to the models. Noticeable gains were made even from 16 to 32 Gaussians suggesting even more Gaussians might prove advantageous.

The large number of Gaussians per state does not come for free, the real-time factor increases significantly as more Gaussians were added (figures 4, 5, 10 and 11).

Using the Sphinx recognizer, further tests were done on models with 64 and 128 Gaussians. As shown in figure 13, more Gaussians provided no additional benefit on either the Nov'92 or si dt s2 test sets. Using so many Gaussians also slows the recognizer to significantly below real-time (figure 14).

(One way to do this in HTK is to add one or more additional mixup stages. This has the advantage that you can use your test set to compare recognition accuracy before and after the mixup, so that you can obtain your recognition accuracy numbers without having to retrain a system from scratch each time.)

I am not sure what you mean by additional "mixup stages". In Keith's training recipe, his train_mixup.sh script seems to be doing what you are talking about. From the comments in the script:

        # Mixup the number of Gaussians per state, from 1 up to 8.
        # We do this in 4 steps, with 4 rounds of reestimation
        # each time.  We mix to 8 to match paper "Large Vocabulary
        # Continuous Speech Recognition Using HTK"
        #
        # Also per Phil Woodland's comment in the mailing list, we
        # will let the sp/sil model have double the number of
        # Gaussians.
        #
        # This version does sil mixup to 2 first, then from 2->4->6->8 for
        # normal and double for sil.

The following is a section from his train_mixup.sh script:

        #######################################################
        # Mixup sil from 1->2
        HHEd -B -H $TRAIN_WSJ0/hmm17/macros -H $TRAIN_WSJ0/hmm17/hmmdefs -M $TRAIN_WSJ0/hmm18 $TRAIN_WSJ0/mix1.hed $TRAIN_WSJ0/tiedlist >$TRAIN_WSJ0/hhed_mix1.log

        #HERest -B -m 0 -A -T 1 -C $TRAIN_COMMON/config -I $TRAIN_WSJ0/wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H $TRAIN_WSJ0/hmm18/macros -H $TRAIN_WSJ0/hmm18/hmmdefs -M $TRAIN_WSJ0/hmm19 $TRAIN_WSJ0/tiedlist >$TRAIN_WSJ0/hmm19.log

        #HERest -B -m 0 -A -T 1 -C $TRAIN_COMMON/config -I $TRAIN_WSJ0/wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H $TRAIN_WSJ0/hmm19/macros -H $TRAIN_WSJ0/hmm19/hmmdefs -M $TRAIN_WSJ0/hmm20 $TRAIN_WSJ0/tiedlist >$TRAIN_WSJ0/hmm20.log

        #HERest -B -m 0 -A -T 1 -C $TRAIN_COMMON/config -I $TRAIN_WSJ0/wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H $TRAIN_WSJ0/hmm20/macros -H $TRAIN_WSJ0/hmm20/hmmdefs -M $TRAIN_WSJ0/hmm21 $TRAIN_WSJ0/tiedlist >$TRAIN_WSJ0/hmm21.log

        #HERest -B -m 0 -A -T 1 -C $TRAIN_COMMON/config -I $TRAIN_WSJ0/wintri.mlf -t 250.0 150.0 1000.0 -S train.scp -H $TRAIN_WSJ0/hmm21/macros -H $TRAIN_WSJ0/hmm21/hmmdefs -M $TRAIN_WSJ0/hmm22 $TRAIN_WSJ0/tiedlist >$TRAIN_WSJ0/hmm22.log

        $TRAIN_TIMIT/train_iter.sh $TRAIN_WSJ0 hmm18 hmm19 tiedlist wintri.mlf 0
        $TRAIN_TIMIT/train_iter.sh $TRAIN_WSJ0 hmm19 hmm20 tiedlist wintri.mlf 0
        $TRAIN_TIMIT/train_iter.sh $TRAIN_WSJ0 hmm20 hmm21 tiedlist wintri.mlf 0
        $TRAIN_TIMIT/train_iter.sh $TRAIN_WSJ0 hmm21 hmm22 tiedlist wintri.mlf 0

where mix1.hed contains:

        MU 2 {sil.state[2-4].mix}

It seems to me that this is one of *many* additional training steps that occur after Step 10 of the HTK Tutorial (which creates the tied-state triphones), in which you incrementally increase the number of Gaussians per state - i.e. the "mixup stages" you were referring to.
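If that reading is right, the later mixup stages would just be further HHEd edit scripts applied in sequence, each followed by rounds of re-estimation. As a hedged sketch (the .hed file names and the exact schedule are illustrative; the MU syntax matches mix1.hed above):

        # mix2.hed - mix all normal states from 1 -> 2 Gaussians
        MU 2 {*.state[2-4].mix}

        # mix4.hed - a later stage: normal states to 4 Gaussians,
        # sil/sp doubled per Phil Woodland's suggestion
        MU 4 {*.state[2-4].mix}
        MU 8 {sil.state[2-4].mix}

Each HHEd pass splits the heaviest mixture components and is then followed by HERest iterations, which is why accuracy can be measured after every stage without retraining from scratch.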

Another way to increase the number of parameters is to move from monophones to triphones (unless you are using triphones already).

Yes, we already use triphones.

Another way is to reduce the amount of state-tying.

Keith also describes his approach to training acoustic models with varying numbers of tied-states:

HTK and Sphinx acoustic models were trained varying the number of tied-states (senones) between 4000, 6000, 8000 and 10000.

In the case of HTK, the exact number of tied-states cannot be specified, but instead thresholds are given to the phonetic decision tree state clustering step. The outlier threshold (RO) was held constant and the threshold controlling clustering termination (TB) was varied (see table 3).

On the "easy" 5K vocabulary Nov'92 task, there was little or no WER advantage in using more tied-states for either Sphinx (figure 1) or HTK (figure 2). On the "harder" 60K vocabulary si dt s2 task there appears to be a modest advantage to more tied-states using Sphinx (figure 7), but little difference using HTK (figure 8).

Of course having more tied-states requires the decoder to compute more Gaussian likelihoods per observation. This is shown by the increased xRT factor for the higher numbers of tied-states in figures 4, 5, 10, and 11.

So from Keith's discussion, I think I can figure out how to adjust the number of tied-states in HTK (using RO and TB).
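For what it's worth, those thresholds live in the decision-tree clustering edit script from Step 10 of the HTK Tutorial. A sketch of the relevant lines (the threshold values and the question/state names are illustrative only and would need tuning against a real test set):

        RO 100.0 stats     # outlier threshold - held constant in Keith's runs
        TR 0
        # ... QS phonetic question definitions go here ...
        TB 350.0 "ST_aa_2_" {("aa","*-aa+*","aa+*","*-aa").state[2]}
        # Raising the TB threshold stops clustering earlier, giving
        # fewer tied-states (fewer parameters); lowering it gives more.

So the number of tied-states is controlled indirectly: sweep TB, retrain, and compare test-set accuracy at each setting.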

One thing I noticed from Keith's scripts is that it seems like you need to chunk the process in order to avoid errors with large speech corpora. From his train_iter.sh script:

        # Does a single iteration of HERest training.
        #
        # This handles the parallel splitting and recombining
        # of the accumulator files.  This is necessary to
        # prevent inaccuracies and eventual failure with large
        # amounts of training data.
        #
        # According to Phil Woodland, one accumulator file
        # should be generated for about each hour of training
        # data.
        #

Do you do something similar in your AM training?
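From reading around, the chunking Keith's script does appears to be HERest's parallel mode: each chunk of the training list is run with -p N to dump an accumulator file, and a final -p 0 pass combines the accumulators and updates the models. A hedged sketch (the chunk file names are hypothetical, and a real run would loop over many chunks):

        # Pass 1: one HERest run per chunk; -p N writes HER<N>.acc
        # into the output directory instead of updating the models.
        HERest -C config -I wintri.mlf -S train_chunk1.scp -p 1 \
               -H hmm18/macros -H hmm18/hmmdefs -M hmm19 tiedlist
        HERest -C config -I wintri.mlf -S train_chunk2.scp -p 2 \
               -H hmm18/macros -H hmm18/hmmdefs -M hmm19 tiedlist

        # Pass 2: -p 0 combines the accumulator files listed as
        # arguments and writes the re-estimated models.
        HERest -C config -p 0 -H hmm18/macros -H hmm18/hmmdefs \
               -M hmm19 tiedlist hmm19/HER1.acc hmm19/HER2.acc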

Basically, it seems like I've got to study Keith's scripts to ensure that the VoxForge Acoustic Models are as accurate as possible as the corpus increases in size.

Can I include this as a thread on the VoxForge site?

thanks,

Ken

comment:2 Changed 12 years ago by kmaclean

I am not even sure I understand what a Gaussian is (more reading is required on my part...)

I have some tutorial material linked at http://www.icsi.berkeley.edu/~gelbart/edu.html that may be useful. Among the online material, I especially recommend the Columbia/IBM slides. Week 3 talks about Gaussians. A Gaussian in speech recognition is the same as a Gaussian probability density function in probability & statistics. Along with the slides for Week 3, you can find a list of textbook readings that go along with it. These books may be hard to find in public libraries but you could try inter-library loan or a university library (or buy them).
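To save a lookup, the density in question, written in plain notation for the one-dimensional case, is:

        N(x; mu, sigma^2) = 1 / (sigma * sqrt(2*pi)) * exp( -(x - mu)^2 / (2*sigma^2) )

An HMM state's output distribution in a system like this is a weighted sum of such densities over the feature vector, sum_m c_m * N(x; mu_m, Sigma_m), with the weights c_m summing to 1 - which is what "more Gaussians in the Gaussian mixtures" refers to.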

(One way to do this in HTK is to add one or more additional mixup stages. This has the advantage that you can use your test set to compare recognition accuracy before and after the mixup, so that you can obtain your recognition accuracy numbers without having to retrain a system from scratch each time.)

I am not sure what you mean by additional "mixup stages". In Keith's training recipe, his train_mixup.sh script seems to be doing what you are talking about.

Yes. I think the section in the HTK manual that describes this is titled 'Mixture Incrementing'.

One thing I noticed from Keith's scripts is that it seems like you need to chunk the process in order to avoid errors with large speech corpora. From his train_iter.sh script:

...

Do you do something similar in your AM training?

I have only used HTK with small corpora and whole-word modeling (not triphones). So I cannot provide much advice regarding chunking or state-tying.

I think the htk-users mailing list is the best forum for your HTK questions. If you write to that list, I think it would be good to include a description of the VoxForge project and what you've accomplished so far. That may help motivate people to help you, and it will spread awareness of your project.

Can I include this as a thread on the VoxForge site?

Please do.

Regards, David

comment:3 Changed 12 years ago by kmaclean

  • Priority changed from major to critical
  • Milestone set to Acoustic Model 1.0