| 2 | | #################################################################### |
|---|
| 3 | | ### |
|---|
| 4 | | ### script name : AudioBook.pm |
|---|
| 5 | | ### version: 0.2 |
|---|
| 6 | | ### created by: Ken MacLean |
|---|
| 7 | | ### mail: contact@voxforge.org |
|---|
| 8 | | ### Date: 2008.01.31 |
|---|
| 9 | | ### Command: perl ./AudioBook.pm |
|---|
| 10 | | ### |
|---|
| 11 | | ### Copyright (C) 2008 Ken MacLean |
|---|
| 12 | | ### |
|---|
| 13 | | ### This program is free software; you can redistribute it and/or |
|---|
| 14 | | ### modify it under the terms of the GNU General Public License |
|---|
| 15 | | ### as published by the Free Software Foundation; either version 2 |
|---|
| 16 | | ### of the License, or (at your option) any later version. |
|---|
| 17 | | ### |
|---|
| 18 | | ### This program is distributed in the hope that it will be useful, |
|---|
| 19 | | ### but WITHOUT ANY WARRANTY; without even the implied warranty of |
|---|
| 20 | | ### MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the |
|---|
| 21 | | ### GNU General Public License for more details. |
|---|
| 22 | | ### |
|---|
| 23 | | ### Changes: |
|---|
| 24 | | ### 2008/05/02 - 0.2 - convert to calss; major refacture ; renamed fullrun.pl to AudioBook.pm |
|---|
| 25 | | #################################################################### |
|---|
| | 2 | $VERSION = 0.2; |
|---|
| | 3 | |
|---|
| | 4 | =head1 NAME |
|---|
| | 5 | |
|---|
| | 6 | AudioBook - Split a single transcribed audio file into audio segments averaging 15 words each |
|---|
| | 7 | |
|---|
| | 8 | =cut |
|---|
| | 9 | |
|---|
| | 21 | |
|---|
| | 22 | =head1 SYNOPSIS |
|---|
| | 23 | |
|---|
| | 24 | $./AudioBook -h display help |
|---|
| | 25 | $./AudioBook -a speechfile.wav -t text.txt minimal run configuration |
|---|
| | 26 | |
|---|
| | 27 | =head1 DESCRIPTION |
|---|
| | 28 | |
|---|
| | 29 | This program segments a speech audio file into 15 word (on average) speech segments. It is executable from the command line and uses |
|---|
| | 30 | the following configuration options to help in segmenting speech: |
|---|
| | 31 | |
|---|
| | 32 | -a * audio file name (WAV format only) |
|---|
| | 33 | -b beam width for Forced Alignment with HVite (default = 250) |
|---|
| | 34 | -d pronunciation dictionary (default = AudioBook/input_files/VoxforgeDict) |
|---|
| | 35 | -h show help |
|---|
| | 36 | -l LICENSE file (default = AudioBook/input_files/LICENCE) |
|---|
| | 37 | -m Maximum sentence length (default = 20 words) |
|---|
| | 38 | -p Minimum pause for sentence break (default = 2000000 in units of 100ns) |
|---|
| | 39 | -q log words with single quotes (default = yes) |
|---|
| | 40 | -r README file (default = AudioBook/input_files/README) |
|---|
| | 41 | -s Average sentence length (default = 15 words) |
|---|
| | 42 | -t * text file name (containing transcriptions of speech in audio file) |
|---|
| | 43 | -u username or name you want file stats collected by on VoxForge Metrics |
|---|
| | 44 | page: (http://www.voxforge.org/home/downloads/metrics) |
|---|
| | 45 | -v verify segments created from first pass Forced Alignment |
|---|
| | 46 | -x unique tar file suffix (max 3 characters - remainder is truncated) |
|---|
| | 47 | -S run sanity test |
|---|
| | 48 | -T create gzipped/tar file |
|---|
| | 49 | |
|---|
| | 50 | * required for script to run |
|---|
| | 51 | |
|---|
| | 52 | =head2 NOTES |
|---|
| | 53 | |
|---|
| | 54 | =head3 Text Does not Match Audio |
|---|
| | 55 | |
|---|
| | 56 | If the contents of the text file do not *exactly* match the contents of the speech audio file, the segmentation process necessarily becomes |
|---|
| | 57 | a manual, iterative process. |
|---|
| | 58 | |
|---|
| | 59 | If the text diverges substantially from the speech audio, then you will have to manually listen to the speech audio to determine |
|---|
| | 60 | where the biggest transcription errors lie, and then modify the original text file to match the speech audio file. |
|---|
| | 61 | |
|---|
| | 62 | If the transcription errors are minor, then the first pass forced alignment usually completes successfully. However, if you see "No tokens survived to final node of network at beam" errors in the |
|---|
| | 63 | HVite log (located in interim_files/logs), then using the "-v" verify switch might be helpful in determining where transcription problems |
|---|
| | 64 | might exist. |
|---|
| | 65 | |
|---|
| | 66 | The verify switch performs a forced alignment on the individual segments generated from the first pass forced alignment. Low scores |
|---|
| | 67 | (i.e. the lowest average log likelihood per frame score) indicate that the transcription text might not match the corresponding audio |
|---|
| | 68 | file. Look at the segment text and listen to the corresponding audio file to determine if they match. If they do not match, then fix the |
|---|
| | 69 | text in your original text transcription file and repeat this process (i.e. running the AudioBook program again with the verify switch on) |
|---|
| | 70 | until you can get a clean run. |
|---|
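The review step above amounts to sorting segments by score and checking the worst ones first. A minimal sketch follows, assuming the per-segment scores have already been collected into a hash; the segment names, the score values, and the helper worst_segments are hypothetical and are not taken from AudioBook.pm or from HVite output.

```perl
use strict;
use warnings;

# Hypothetical data: segment name => average log likelihood per frame.
my %score = (
    seg0001 => -62.4,
    seg0002 => -71.9,
    seg0003 => -58.3,
    seg0004 => -88.7,
);

# Return the $n segments with the lowest scores, worst first.
sub worst_segments {
    my ($scores, $n) = @_;
    my @sorted = sort { $scores->{$a} <=> $scores->{$b} } keys %$scores;
    $n = @sorted if $n > @sorted;
    return @sorted[0 .. $n - 1];
}

# These are the segments to listen to first.
print join(", ", worst_segments(\%score, 2)), "\n";
```

The segments printed first are the ones whose text is most likely to disagree with the audio.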
| | 71 | |
|---|
| | 72 | =head3 Segmenting large audio files |
|---|
| | 73 | |
|---|
| | 74 | For larger files (i.e. greater than 30 minutes of audio), you *may* need to manually segment the audio file into 30 minute segments. |
|---|
| | 75 | |
|---|
| | 76 | =head3 Automatically Adding Out-of-vocabulary words to pronunciation dictionary |
|---|
| | 77 | |
|---|
| | 78 | The pronunciations generated by the Sequitur G2P scripts need to be manually reviewed before any new pronunciations are added to the |
|---|
| | 79 | pronunciation dictionary. Make sure you review the pronunciation before committing these changes to SVN. |
|---|
| | 80 | |
|---|
| | 81 | =head1 REQUIREMENTS |
|---|
| | 82 | |
|---|
| | 83 | =item 1 - Sequitur G2P trainable Grapheme-to-Phoneme converter (which requires Python to be installed) |
|---|
| | 84 | http://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html |
|---|
| | 85 | |
|---|
| | 86 | =item 2 - HTK Hidden Markov Model Toolkit - note: the source is "open", but there are distribution restrictions |
|---|
| | 87 | http://htk.eng.cam.ac.uk/ |
|---|
| | 88 | |
|---|
| | 89 | =head1 ALGORITHM |
|---|
| | 90 | |
|---|
| | 91 | This program tries to segment the speech audio file into 15 word sentences. However, if the pause following the 15th word relative to the current |
|---|
| | 92 | sentence start position is too short, the algorithm looks at the previous word (i.e. word 14) to see if it has a pause of suitable duration. |
|---|
| | 93 | If not, it then looks at the word after the 15th (i.e. word 16), and so on until a pause of suitable |
|---|
| | 94 | duration can be found, increasing the number of words to look behind and ahead each time. |
|---|
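The search order described above (word 15, then 14, then 16, then 13, 17, and so on) can be sketched in Perl as follows. This is an illustration only, not the code in AudioBook.pm: the function name find_break, its argument list, the array of per-word pauses, and the fallback of cutting at the maximum sentence length when no suitable pause is found are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from AudioBook.pm.
# $pauses->[$i] is the pause (in units of 100 ns) that follows word $i.
# Returns the index of the last word of the sentence that begins at $start.
sub find_break {
    my ($pauses, $start, $target, $max, $min_pause) = @_;
    my $last   = $#{$pauses};
    my $centre = $start + $target - 1;    # index of the 15th word
    my $limit  = $start + $max - 1;       # never exceed the maximum length
    $limit = $last if $limit > $last;
    return $last if $centre >= $last;     # too few words left: take them all

    for my $offset (0 .. $max) {
        # Look behind first, then ahead, widening the window each time.
        for my $i ($centre - $offset, $centre + $offset) {
            next if $i < $start || $i > $limit;
            return $i if $pauses->[$i] >= $min_pause;
        }
    }
    return $limit;    # assumed fallback: no suitable pause, cut at the maximum
}

# 30 words with no pauses, except after word 14 (index 13) and word 17 (index 16).
my @pauses = (0) x 30;
$pauses[13] = 3_000_000;
$pauses[16] = 2_500_000;
print find_break(\@pauses, 0, 15, 20, 2_000_000), "\n";    # prints 13
```

In this example the pause after the 15th word is too short, so the search steps back one word and breaks after word 14 before it ever looks ahead.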
| | 95 | |
|---|
| | 96 | The default pause duration is 2000000 in units of 100ns. This can be changed (using the "-p" switch) if the speech audio file does not segment well |
|---|
| | 97 | enough with this default. |
|---|
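Because the pause is given in units of 100ns, the default of 2000000 corresponds to 0.2 seconds. The short sketch below shows the conversion in both directions, which may help when choosing a value for "-p"; the constant and function names are hypothetical and do not appear in AudioBook.pm.

```perl
use strict;
use warnings;

# One unit is 100 ns, so there are 10,000,000 units in a second.
use constant UNITS_PER_SECOND => 10_000_000;

# Convert a pause in 100 ns units to seconds.
sub units_to_seconds { return $_[0] / UNITS_PER_SECOND }

# Convert seconds to the nearest whole number of 100 ns units.
sub seconds_to_units { return int($_[0] * UNITS_PER_SECOND + 0.5) }

printf "%.1f s\n", units_to_seconds(2_000_000);    # prints 0.2 s
printf "%d\n",     seconds_to_units(0.35);         # prints 3500000
```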
| | 98 | |
|---|
| | 99 | =head1 METHODS (not user accessible) |
|---|
| | 100 | |
|---|
| | 101 | =cut |
|---|
| | 102 | |
|---|
| 329 | | |
|---|
| 330 | | 1; |
|---|
| | 441 | |
|---|
| | 442 | =head1 Change Log |
|---|
| | 443 | |
|---|
| | 444 | 2008/05/02 - 0.2 - convert to class; major refacture ; renamed fullrun.pl to AudioBook.pm |
|---|
| | 445 | 2008/01/31 - 0.1 - created |
|---|
| | 446 | |
|---|
| | 447 | =cut |
|---|
| | 448 | |
|---|
| | 449 | =head1 AUTHOR |
|---|
| | 450 | |
|---|
| | 451 | Ken MacLean |
|---|
| | 452 | contact@voxforge.org |
|---|
| | 453 | |
|---|
| | 454 | =head1 COPYRIGHT AND LICENSE |
|---|
| | 455 | |
|---|
| | 456 | Copyright (C) 2008 Ken MacLean |
|---|
| | 457 | |
|---|
| | 458 | This program is free software; you can redistribute it and/or |
|---|
| | 459 | modify it under the terms of the GNU General Public License |
|---|
| | 460 | as published by the Free Software Foundation; either version 2 |
|---|
| | 461 | of the License, or (at your option) any later version. |
|---|
| | 462 | |
|---|
| | 463 | This program is distributed in the hope that it will be useful, |
|---|
| | 464 | but WITHOUT ANY WARRANTY; without even the implied warranty of |
|---|
| | 465 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the |
|---|
| | 466 | GNU General Public License for more details. |
|---|