| 54 | | |
|---|
| 55 | | =head1 NOTES |
|---|
| | 60 | =head1 Suggested Segmentation Approach: |
|---|
| | 61 | |
|---|
| | 62 | |
|---|
| | 63 | =head1 Step 1 - First Pass Forced Alignment - Getting it to Run Completely Without Errors |
|---|
| | 64 | |
|---|
| | 65 | Execute the script as follows using only the '-a' and '-t' parameters: |
|---|
| | 66 | |
|---|
| | 67 | $./AudioBook.pm -a audio -t eText.txt |
|---|
| | 68 | |
|---|
| | 69 | This tries to match the words in the text file with the words in the speech audio file, and create time stamps for each word. |
|---|
| | 70 | These time stamps are used to determine where pauses are located, and if the pause is large enough, it will create a segment |
|---|
| | 71 | of the sentence, and put an entry into the prompts file. |
|---|
| | 72 | |
|---|
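As a rough illustration of the pause-based splitting described above, here is a minimal Perl sketch. It is not the script's actual implementation: the `split_on_pauses` helper, the `[start, end, word]` tuple layout, and the sample times are assumptions made for this example. Only the 100 ns time unit is taken from the script's help text for the "-p" switch.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AudioBook.pm): group aligned words into
# segments, starting a new segment whenever the silence before a word is at
# least $min_pause. All times are in units of 100 ns.
sub split_on_pauses {
    my ($words, $min_pause) = @_;
    my @segments;
    my $prev_end;
    for my $w (@$words) {
        my ($start, $end, $text) = @$w;
        # The first word always opens a segment; later words open one only
        # after a long enough pause.
        if (!@segments || $start - $prev_end >= $min_pause) {
            push @segments, [];
        }
        push @{ $segments[-1] }, $text;
        $prev_end = $end;
    }
    return @segments;
}

# Assumed sample alignment: [start, end, word]
my @aligned = (
    [         0,  3_000_000, 'the'    ],
    [ 3_000_000,  8_000_000, 'quick'  ],
    [ 8_500_000, 12_000_000, 'fox'    ],
    [25_000_000, 29_000_000, 'jumped' ],
);

# 10_000_000 x 100 ns = 1 second minimum pause
my @segments = split_on_pauses(\@aligned, 10_000_000);
print join(' ', @$_), "\n" for @segments;
```

With the sample data, only the 1.3 second gap before "jumped" exceeds the threshold, so the sketch prints "the quick fox" and "jumped" on separate lines. Lowering the threshold produces more, shorter segments.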
| | 73 | =head2 NOTES |
|---|
| 59 | | If the contents of the text file do not *exactly* match the contents of the speech audio file, the segmentation process necessarily becomes |
|---|
| 60 | | a manual, iterative process. |
|---|
| | 77 | The text file *must exactly* match the contents of the speech audio file. |
|---|
| | 78 | |
|---|
| | 79 | If there are any errors when you are trying to run the segmentation script for the first time on a new set of text and speech audio files, |
|---|
| | 80 | the likely reason is that there is something in the text file that does not match what was said in the audio file. Figuring this out usually |
|---|
| | 81 | ends up being an iterative process (i.e. you fix an error, run the script, fix another error, ... until you get an error-free run). |
|---|
| 65 | | If the transcription errors are minor, then the first pass forced alignment usually completes successfully. However, if you see "No tokens survived to final node of network at beam" errors in the |
|---|
| 66 | | HVite log (located in interim_files/logs), then using the "-v" verify switch might be helpful in determining where transcription problems |
|---|
| 67 | | might exist. |
|---|
| | 86 | =head3 Dealing With Out-of-vocabulary Words |
|---|
| | 87 | |
|---|
| | 88 | Forced Alignment is performed with HTK's HVite tool. HVite requires that each word in the text to be forced aligned have a pronunciation entry |
|---|
| | 89 | in the pronunciation lexicon. The script uses Sequitur G2P (trained on the VoxForge pronunciation lexicon) to provide initial |
|---|
| | 90 | pronunciations for Out-Of-Vocabulary words so that the first pass forced alignment will work. This seems to be "good-enough" to find silences |
|---|
| | 91 | of reasonable lengths. Using this information, the script can create prompt entries and corresponding audio segments. |
|---|
| | 92 | |
|---|
| | 93 | =head3 Segmenting Large Audio Files |
|---|
| | 94 | |
|---|
| | 95 | For larger files (i.e. greater than 30 minutes of audio), you *may* need to manually split the audio file into 30 minute segments, with |
|---|
| | 96 | corresponding text files. |
|---|
| | 97 | |
|---|
| | 98 | =head1 Step 2 - First Pass Forced Alignment - Runs OK, but there are Errors |
|---|
| | 99 | |
|---|
| | 100 | If the transcription errors are minor, then the first pass forced alignment usually completes successfully. |
|---|
| | 101 | |
|---|
| | 102 | However, you might see "No tokens survived to final node of network at beam" errors in the HVite log (located in interim_files/logs). |
|---|
| | 103 | Ensure that the prompt text matches the prompt audio. |
|---|
| | 104 | |
|---|
| | 105 | =head1 Step 3 - First Pass Forced Alignment - Verify the Segments |
|---|
| | 106 | |
|---|
| | 107 | Get the script to perform a forced alignment on each of the segments, and display the worst 15 "average log likelihood per frame" |
|---|
| | 108 | scores. Check the transcription, listen to the corresponding audio, make corrections to the text, and repeat as needed. |
|---|
| | 109 | |
|---|
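The idea of listing the worst-scoring segments can be sketched in a few lines of Perl. This is only an illustration under stated assumptions: the `worst_segments` helper, the segment IDs, and the scores are invented for the example, and the real script's data structures may differ. Lower (more negative) average log likelihood per frame is treated as a worse match.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AudioBook.pm): return the $n segment IDs
# with the lowest average log likelihood per frame, worst first.
sub worst_segments {
    my ($scores, $n) = @_;
    # Sort ascending by score; break ties by ID so the order is stable.
    my @ids = sort { $scores->{$a} <=> $scores->{$b} or $a cmp $b } keys %$scores;
    $n = @ids if $n > @ids;
    return @ids[0 .. $n - 1];
}

# Assumed sample scores keyed by segment ID
my %score = (
    seg0001 => -62.4,
    seg0002 => -71.9,
    seg0003 => -58.3,
);

printf "%s\t%.1f\n", $_, $score{$_} for worst_segments(\%score, 2);
```

Here the two lowest scores belong to seg0002 and seg0001, so those are the segments you would listen to first.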
| | 110 | Run the script as follows: |
|---|
| | 111 | |
|---|
| | 112 | $./AudioBook.pm -a audio -t eText.txt -v |
|---|
| 71 | | file. Look at the segment text and listen to the corresponding audio file to determine if they match. If they do not match, then fix the |
|---|
| 72 | | text in your original text transcription file, and repeat this process (i.e. running the AudioBook program again with the verify switch on) |
|---|
| 73 | | until you can get a clean run. |
|---|
| 74 | | |
|---|
| 75 | | =head3 Segmenting large audio files |
|---|
| 76 | | |
|---|
| 77 | | For larger files (i.e. greater than 30 minutes of audio), you *may* need to manually segment the audio file into 30 minute segments. |
|---|
| 78 | | |
|---|
| 79 | | =head3 Automatically Adding Out-of-Vocabulary Words to Pronunciation Dictionary |
|---|
| | 116 | file. Look at the segment text and listen to the corresponding audio file to determine if they match. If they do not match (they might |
|---|
| | 117 | still match, but just have a low score), then fix the text in your original text transcription file, and repeat this process (i.e. running |
|---|
| | 118 | the AudioBook program again with the verify switch on) until you can get a clean run. |
|---|
| | 119 | |
|---|
| | 120 | =head1 Step 4 - First Pass Forced Alignment - Adjusting Prompt Length |
|---|
| | 121 | |
|---|
| | 122 | After you can get the First Pass Forced Alignment to run without errors, check the AudioBook.log log file (in the output_files directory) and |
|---|
| | 123 | review the length of the created prompts. If there are too many prompts over 30 words long, reduce the size of the pause ("-p" switch) |
|---|
| | 124 | and run First Pass Forced Alignment again - something like this: |
|---|
| | 125 | |
|---|
| | 126 | $./AudioBook.pm -a audio -t eText.txt -v -p 1000000 |
|---|
| | 127 | |
|---|
| | 128 | Continue making adjustments until you can get reasonable prompt lengths. |
|---|
| | 129 | |
|---|
| | 130 | =head3 Note |
|---|
| | 131 | |
|---|
| | 132 | The worst case scenario is that you cannot segment your audio because it does not have any pauses that are long enough to use for a |
|---|
| | 133 | segment. This is unlikely, given that people need to breathe every once in a while. What will occur is that you will have a few very long |
|---|
| | 134 | segments because the person spoke continuously for a long period of time. You will likely have to segment these longer prompts manually. |
|---|
| | 135 | |
|---|
| | 136 | =head1 Step 5 - Validate Suggested Out-of-Vocabulary Word Pronunciations |
|---|
| 82 | | pronunciation dictionary. Make sure you review the pronunciation before committing these changes to SVN. |
|---|
| 83 | | |
|---|
| 84 | | =head1 REQUIREMENTS |
|---|
| 85 | | |
|---|
| 86 | | =item 1 - Sequitur G2P trainable Grapheme-to-Phoneme converter (GPL v2; requires Python to be installed) |
|---|
| 87 | | |
|---|
| 88 | | http://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html |
|---|
| 89 | | |
|---|
| 90 | | =item 2 - HTK Hidden Markov Model Toolkit (note: the source is "open", but there are distribution restrictions) |
|---|
| 91 | | |
|---|
| 92 | | http://htk.eng.cam.ac.uk/ |
|---|
| 93 | | |
|---|
| 94 | | The HTK toolkit needs to be in your path (see http://www.voxforge.org/home/dev/acousticmodels/linux/create/htkjulius/tutorial/download) |
|---|
| | 139 | pronunciation dictionary. One way to do this is to use Speech Recognition to determine the pronunciation of the word in the actual audio file. |
|---|
| | 140 | You can do this with the '-w' switch: |
|---|
| | 141 | |
|---|
| | 142 | $./AudioBook.pm -a audio -t eText.txt -v -p 1000000 -w |
|---|
| | 143 | |
|---|
| | 144 | The script then generates a report (MissingWords_combined) that contains a list of all the OOV words, with the speech segment ID and text |
|---|
| | 145 | (so you can listen to the audio segment), the g2p recommended phone list, and HVite phone list recommendations (determined using speech |
|---|
| | 146 | recognition), so you can manually validate the final pronunciations. |
|---|
| | 147 | |
|---|
| | 148 | =head2 Note |
|---|
| | 149 | |
|---|
| | 150 | This approach is only as good as the acoustic model you are using. The pronunciations still need to be validated against the Sequitur G2P recommended |
|---|
| | 151 | pronunciations. |
|---|
| | 152 | |
|---|
| | 153 | Please donate some speech to VoxForge to help improve our acoustic models. |
|---|
| | 154 | |
|---|
| | 155 | =head1 Step 6 - Update Pronunciation Lexicon |
|---|
| | 156 | |
|---|
| | 157 | If you are submitting your segmented audio to VoxForge, please include your validated Out-of-Vocabulary word pronunciations |
|---|
| | 158 | with your submission as a separate file called: "OOV_pron.txt". |
|---|
| | 159 | |
|---|
| | 160 | Thanks. |
|---|
| | 173 | |
|---|
| | 174 | =head2 Generating Out-of-Vocabulary Word Pronunciations |
|---|
| | 175 | |
|---|
| | 176 | The script gets Sequitur to generate the 20 likeliest pronunciations for each OOV word, and then adds them to the dict file. It then performs |
|---|
| | 177 | another forced alignment on the audio segment containing the Out-of-Vocabulary word. HVite will take the sequence of phoneme sounds that it |
|---|
| | 178 | recognizes and try to match it to one of the possible pronunciations in the dictionary. We are therefore using the audio to automatically |
|---|
| | 179 | help generate the correct pronunciation. Because the VoxForge acoustic models are not that accurate, these suggestions need to be validated |
|---|
| | 180 | and compared with the pronunciations generated by Sequitur, and a judgment call needs to be made to select the correct pronunciations. |
|---|
| | 181 | |
|---|
| | 182 | =head1 REQUIREMENTS |
|---|
| | 183 | |
|---|
| | 184 | =item 1 - Sequitur G2P trainable Grapheme-to-Phoneme converter (GPL v2; requires Python to be installed) |
|---|
| | 185 | |
|---|
| | 186 | http://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html |
|---|
| | 187 | |
|---|
| | 188 | =item 2 - HTK Hidden Markov Model Toolkit (note: the source is "open" - i.e. you can read the code, but there are distribution restrictions) |
|---|
| | 189 | |
|---|
| | 190 | http://htk.eng.cam.ac.uk/ |
|---|
| | 191 | |
|---|
| | 192 | The HTK toolkit needs to be in your path |
|---|
| | 193 | (see http://www.voxforge.org/home/dev/acousticmodels/linux/create/htkjulius/tutorial/download) |
|---|
| 160 | | my $tempPronDict = "AudioBook/interim_files/pronDict"; |
|---|
| 161 | | copy($pronDict,$tempPronDict); |
|---|
| 162 | | |
|---|
| 163 | | my $textContents = AudioBook::Text->new($self,$textfile); |
|---|
| 164 | | $textContents->createWLISTFile("AudioBook/interim_files/wlist"); |
|---|
| | 238 | my $chapter = AudioBook::Chapter->new($self); |
|---|
| | 239 | # need draft missing word pronunciations before audio can be processed |
|---|
| | 240 | my $missingWords = $chapter->processText(); |
|---|
| | 241 | $chapter->processAudio(); |
|---|
| | 242 | |
|---|
| | 243 | my $segments = AudioBook::Segments->new($self,$chapter); |
|---|
| | 244 | $segments->processAudio(); |
|---|
| 166 | | my $dictionary = AudioBook::Dictionary->new($self); |
|---|
| 167 | | my $missingwordfound = $dictionary->findOutOfVocabularyWords($pronDict,"AudioBook/interim_files/MissingWords"); |
|---|
| 168 | | if ($missingwordfound) { |
|---|
| 169 | | $dictionary->getRecommendedPronunciations("AudioBook/interim_files/MissingWords_out"); # uses g2p |
|---|
| 170 | | $dictionary->updatePronDict($tempPronDict); |
|---|
| 171 | | copy($dict,$originalDict); # save dict before suggested pronunciations are added - only need these pronunciations for segmentation of audio |
|---|
| 172 | | # need to update dict with missing words |
|---|
| 173 | | # can't seem to change default HDMan log file with "-l" parameter |
|---|
| 174 | | $command = ("HDMan -A -D -T 1 -g $htk_files/global.ded -m -w AudioBook/interim_files/wlist -i -l AudioBook/interim_files/dlog $dict $tempPronDict"); system($command) == 0 or confess "fullrun $command failed: $?"; |
|---|
| 175 | | $command = ("mv AudioBook/interim_files/dlog AudioBook/interim_files/logs/dlog2"); print "cmd:$command\n" if $debug; system($command); |
|---|
| 176 | | # no longer required$command = ("cp AudioBook/interim_files/MissingWords_out AudioBook/output_files/MissingWords"); print "cmd:$command\n" if $debug; system($command); |
|---|
| 177 | | } else { |
|---|
| 178 | | open(LOG,">>$log") or confess ("cannot open log file $log: $!"); |
|---|
| 179 | | print LOG "\nMissing Words that need to be added to Pronunciation Dictionary, with suggested pronunciations::\n"; |
|---|
| 180 | | print LOG "------------------------------------------------\n"; |
|---|
| 181 | | print LOG "no missing words\n"; |
|---|
| 182 | | close LOG |
|---|
| 183 | | } |
|---|
| 184 | | # dict may get manually updated; dict only includes suggested prompts, therefore do not copy to output - suggested pronunciations are in the log regardless ... |
|---|
| 185 | | #... $command = ("cp AudioBook/interim_files/dict AudioBook/output_files"); print "cmd:$command\n" if $debug; system($command); |
|---|
| 186 | | |
|---|
| 187 | | my $audio = AudioBook::Audio->new($self); |
|---|
| 188 | | $audio->segment($audiofile,$textContents); |
|---|
| 189 | | if ($self->{"verify_segments"}) { |
|---|
| 190 | | $audio->verifySegments; |
|---|
| 191 | | } |
|---|
| 192 | | if ($missingwordfound) { |
|---|
| 193 | | if ($self->{"verify_out_of_vocabulary_pronunciations"}) { |
|---|
| 194 | | $dictionary->getAlternatePronunciations("AudioBook/interim_files/MissingWords_alt",15); # uses Sequitur g2p to get top N pronunciation variations |
|---|
| 195 | | $dictionary->createAltDict($originalDict,$altDict); # merge & sort missing_words_alt and originalDict into altDict |
|---|
| 196 | | $dictionary->validateAlternatePronunciations($originalDict,$altDict,$prompts); |
|---|
| 197 | | } |
|---|
| 198 | | $dictionary->updatePronDict($pronDict); |
|---|
| 199 | | } |
|---|
| | 246 | # !!!!!! not completed |
|---|
| | 247 | # $missingWords->getAudio($segments); |
|---|
| 297 | | if ($opt_S) { # Sanity test switch |
|---|
| | 345 | if ($opt_h) { |
|---|
| | 346 | print "\nVoxForge Audio Segmentation Script Parameters\n"; |
|---|
| | 347 | print "=============================================\n"; |
|---|
| | 348 | print "-a\t* audio file name (WAV format only)\n"; |
|---|
| | 349 | print "-b\tnotify if beam width for Forced Alignment exceeds a certain level (default = 250)\n"; |
|---|
| | 350 | print "\t(does not set HVite's beam width parameter)\n"; |
|---|
| | 351 | print "-d\tpronunciation dictionary (default = AudioBook/input_files/VoxforgeDict)\n"; |
|---|
| | 352 | print "-h\tshow help\n"; |
|---|
| | 353 | print "-l\tLICENSE file (default = AudioBook/input_files/LICENCE)\n"; |
|---|
| | 354 | print "-m\tTarget maximum sentence length (default = $default_max_sentence_length words)\n"; |
|---|
| | 355 | print "-p\tMinimum pause for sentence break (default = $default_min_pause_for_sentence_break in units of 100ns)\n"; |
|---|
| | 356 | print "-q\tlog words with single quotes (default = yes)\n"; |
|---|
| | 357 | print "-r\tREADME file (default = AudioBook/input_files/README)\n"; |
|---|
| | 358 | print "-s\tAverage sentence length (default = $default_average_sentence_length words)\n"; |
|---|
| | 359 | print "-t\t* text file name (containing transcriptions of speech in audio file)\n"; |
|---|
| | 360 | |
|---|
| | 361 | print "-u\tusername or name you want file stats collected by on VoxForge Metrics \n"; |
|---|
| | 362 | print "\tpage:\t(http://www.voxforge.org/home/downloads/metrics)\n"; |
|---|
| | 363 | |
|---|
| | 364 | print "-v\tvalidate segment audio files to prompt text using forced alignment\n"; |
|---|
| | 365 | print "-w\tvalidate missing word pronunciations to audio recordings\n"; |
|---|
| | 366 | print "-x\tunique tar file suffix (max 3 characters - remainder is truncated)\n"; |
|---|
| | 367 | print "-S\trun sanity test\n"; |
|---|
| | 368 | print "-T\tcreate gzipped/tar file\n"; |
|---|
| | 369 | print "\n\t* minimum required for script to run\n"; |
|---|
| | 370 | print "\n"; |
|---|
| | 371 | print "--\n"; |
|---|
| | 372 | print "Free Speech... Recognition\n"; |
|---|
| | 373 | print "http://www.voxforge.org\n\n"; |
|---|
| | 374 | exit; |
|---|
| | 375 | } elsif ($opt_S) { # Sanity test switch |
|---|
| 405 | | } elsif ($opt_h) { |
|---|
| 406 | | print "\nVoxForge Audio Segmentation Script Parameters\n"; |
|---|
| 407 | | print "=============================================\n"; |
|---|
| 408 | | print "-a\t* audio file name (WAV format only)\n"; |
|---|
| 409 | | print "-b\tnotify if beam width for Forced Alignment exceeds a certain level (default = 250)\n"; |
|---|
| 410 | | print "\t(does not set HVite's beam width parameter)\n"; |
|---|
| 411 | | print "-d\tpronunciation dictionary (default = AudioBook/input_files/VoxforgeDict)\n"; |
|---|
| 412 | | print "-h\tshow help\n"; |
|---|
| 413 | | print "-l\tLICENSE file (default = AudioBook/input_files/LICENCE)\n"; |
|---|
| 414 | | print "-m\tTarget maximum sentence length (default = $default_max_sentence_length words)\n"; |
|---|
| 415 | | print "-p\tMinimum pause for sentence break (default = $default_min_pause_for_sentence_break in units of 100ns)\n"; |
|---|
| 416 | | print "-q\tlog words with single quotes (default = yes)\n"; |
|---|
| 417 | | print "-r\tREADME file (default = AudioBook/input_files/README)\n"; |
|---|
| 418 | | print "-s\tAverage sentence length (default = $default_average_sentence_length words)\n"; |
|---|
| 419 | | print "-t\t* text file name (containing transcriptions of speech in audio file)\n"; |
|---|
| 420 | | |
|---|
| 421 | | print "-u\tusername or name you want file stats collected by on VoxForge Metrics \n"; |
|---|
| 422 | | print "\tpage:\t(http://www.voxforge.org/home/downloads/metrics)\n"; |
|---|
| 423 | | |
|---|
| 424 | | print "-v\tvalidate segment audio files to prompt text using forced alignment\n"; |
|---|
| 425 | | print "-w\tvalidate missing word pronunciations to audio recordings\n"; |
|---|
| 426 | | print "-x\tunique tar file suffix (max 3 characters - remainder is truncated)\n"; |
|---|
| 427 | | print "-S\trun sanity test\n"; |
|---|
| 428 | | print "-T\tcreate gzipped/tar file\n"; |
|---|
| 429 | | print "\n\t* required for script to run\n"; |
|---|
| 430 | | print "\n"; |
|---|
| 431 | | print "--\n"; |
|---|
| 432 | | print "Free Speech... Recognition\n"; |
|---|
| 433 | | print "http://www.voxforge.org\n\n"; |
|---|
| 434 | | exit; |
|---|
| 475 | | return $self->{"max_sentence_length"}; |
|---|
| 476 | | } |
|---|
| 477 | | |
|---|
| | 532 | return $self->{"min_pause_for_sentence_break"}; |
|---|
| | 533 | } |
|---|
| | 534 | |
|---|
| | 535 | =item * getLog_single_quotes() |
|---|
| | 536 | |
|---|
| | 537 | =cut |
|---|
| | 538 | |
|---|
| | 539 | sub getLog_single_quotes { |
|---|
| | 540 | my $self = shift; |
|---|
| | 541 | return $self->{"log_single_quotes"}; |
|---|
| | 542 | } |
|---|
| | 543 | |
|---|
| | 544 | |
|---|
| | 545 | =item * getTextFile() |
|---|
| | 546 | |
|---|
| | 547 | =cut |
|---|
| | 548 | |
|---|
| | 549 | sub getTextFile { |
|---|
| | 550 | my $self = shift; |
|---|
| | 551 | return $self->{"textFile"}; |
|---|
| | 552 | } |
|---|
| | 553 | |
|---|
| | 554 | =item * getAudiofile() |
|---|
| | 555 | |
|---|
| | 556 | =cut |
|---|
| | 557 | |
|---|
| | 558 | sub getAudiofile { |
|---|
| | 559 | my $self = shift; |
|---|
| | 560 | return $self->{"audiofile"}; |
|---|
| | 561 | } |
|---|
| | 562 | |
|---|
| | 563 | =item * getUsername() |
|---|
| | 564 | |
|---|
| | 565 | =cut |
|---|
| | 566 | |
|---|
| | 567 | sub getUsername { |
|---|
| | 568 | my $self = shift; |
|---|
| | 569 | return $self->{"username"}; |
|---|
| | 570 | } |
|---|
| | 571 | |
|---|
| | 572 | =item * getLog() |
|---|
| | 573 | |
|---|
| | 574 | =cut |
|---|
| | 575 | |
|---|
| | 576 | sub getLog { |
|---|
| | 577 | my $self = shift; |
|---|
| | 578 | return $self->{"log"}; |
|---|
| | 579 | } |
|---|
| | 580 | |
|---|
| | 581 | |
|---|
| | 582 | =item * getPronDict() |
|---|
| | 583 | |
|---|
| | 584 | =cut |
|---|
| | 585 | |
|---|
| | 586 | sub getPronDict { |
|---|
| | 587 | my $self = shift; |
|---|
| | 588 | return $self->{"pronDict"}; |
|---|
| | 589 | } |
|---|
| | 590 | |
|---|
| | 591 | =item * getHtk_files() |
|---|
| | 592 | |
|---|
| | 593 | =cut |
|---|
| | 594 | |
|---|
| | 595 | sub getHtk_files { |
|---|
| | 596 | my $self = shift; |
|---|
| | 597 | return $self->{'htk_files'}; |
|---|
| | 598 | } |
|---|
| | 599 | |
|---|
| | 600 | =item * getG2p_model() |
|---|
| | 601 | |
|---|
| | 602 | =cut |
|---|
| | 603 | |
|---|
| | 604 | sub getG2p_model { |
|---|
| | 605 | my $self = shift; |
|---|
| | 606 | return $self->{'g2p_model'}; |
|---|
| | 607 | } |
|---|
| | 608 | |
|---|
| | 609 | =item * getDebug() |
|---|
| | 610 | |
|---|
| | 611 | =cut |
|---|
| | 612 | |
|---|
| | 613 | sub getDebug { |
|---|
| | 614 | my $self = shift; |
|---|
| | 615 | return $self->{'debug'}; |
|---|
| | 616 | } |
|---|
| | 617 | |
|---|
| | 618 | =item * getVerify_segments() |
|---|
| | 619 | |
|---|
| | 620 | =cut |
|---|
| | 621 | |
|---|
| | 622 | sub getVerify_segments { |
|---|
| | 623 | my $self = shift; |
|---|
| | 624 | return $self->{'verify_segments'}; |
|---|
| | 625 | } |
|---|
| | 626 | |
|---|
| | 627 | |
|---|
| | 628 | |
|---|