Ticket #16 (new enhancement)

Opened 11 years ago

Last modified 11 years ago

Corpora testing

Reported by: kmaclean Owned by: somebody
Priority: major Milestone: Release 0.1.1
Component: audio Version: 0.1
Keywords: Cc:


See tpavelka's post:

Hi,

I'm sure there are plans to create testing corpora of various difficulties for VoxForge. Here is my suggestion for one:

Spoken numbers, let's say between one and one million


  • Easy task, you can expect over 95% accuracy
  • Can be recognized with a grammar, so there is no need for language model tweaking or special recognizers; HVite is sufficient

Creating a grammar for English numbers should be straightforward; here is a place to start:


When generating the prompts, do not use HSGen, since it produces weird numbers like "one hundred thousand and three". Instead, use a random number generator and then convert the numbers to words. A converter can be found e.g. here:


Also, native and non-native speakers should be clearly separated so that you can see the difference in accuracy.
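
As a rough illustration of the prompt generation suggested above (draw random numbers, then spell them out), here is a minimal Python sketch. It is not the converter linked in the post; the function names, the fixed seed, and the spelling style (no "and", no hyphens) are assumptions made for the example.

```python
import random

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def below_thousand(n):
    """Return the words for 0..999 as a list (empty list for 0)."""
    words = []
    if n >= 100:
        words += [ONES[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if n > 0:
        words.append(ONES[n])
    return words


def number_to_words(n):
    """Spell out an integer between 1 and 1,000,000 (assumed style: no "and")."""
    if not 1 <= n <= 1_000_000:
        raise ValueError("n must be between 1 and 1,000,000")
    if n == 1_000_000:
        return "one million"
    words = []
    if n >= 1000:
        words += below_thousand(n // 1000) + ["thousand"]
        n %= 1000
    words += below_thousand(n)
    return " ".join(words)


def generate_prompts(count, seed=0):
    """Draw `count` random numbers and return them as prompt sentences."""
    rng = random.Random(seed)  # fixed seed so the prompt list is reproducible
    return [number_to_words(rng.randint(1, 1_000_000)) for _ in range(count)]
```

Calling generate_prompts(100) would give one hundred prompt sentences; a different seed gives a different prompt set.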


generator.zip (1.1 MB) - added by kmaclean 11 years ago.

Change History

Changed 11 years ago by kmaclean

comment:1 Changed 11 years ago by kmaclean

User: tpavelka Date: 3/18/2009 9:47 am


Finally, the converter is here. Writing the generator was pretty straightforward using Lingua::EN::Numbers. I got the grammar from a colleague who used it in a different project. It is in JSGF format, so I wrote a simple converter into the EBNF format used by HTK.

The tricky part was ensuring that the grammar covers all the generated sentences. For that I used a parser that converts the sentences back into numbers; the result can then be compared to the original number.
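
The round-trip check described above could look roughly like the following Python sketch. This is not the parser from the attachment: it is a lenient word-by-word reader standing in for a grammar-based parser (it accepts some word sequences a real number grammar would reject), and the names words_to_number and find_mismatches are made up for the example.

```python
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}


def words_to_number(sentence):
    """Convert a spelled-out number back into an integer."""
    total = 0  # value of the groups already closed by "thousand" or "million"
    group = 0  # value of the group currently being read (0..999)
    for word in sentence.lower().replace("-", " ").split():
        if word == "and":
            continue  # tolerate British-style "and"
        if word in UNITS:
            group += UNITS[word]
        elif word in TENS:
            group += TENS[word]
        elif word == "hundred":
            group *= 100
        elif word == "thousand":
            total += group * 1000
            group = 0
        elif word == "million":
            total += group * 1_000_000
            group = 0
        else:
            raise ValueError(f"unknown word: {word!r}")
    return total + group


def find_mismatches(pairs):
    """Return the (number, sentence) pairs whose sentence does not parse back to the number."""
    return [(n, s) for n, s in pairs if words_to_number(s) != n]
```

Feeding every generated (number, sentence) pair through find_mismatches should return an empty list if the generator and the parser agree.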

