voxforge.org
VoxForge Dev
Timestamp:
06/09/08 17:06:50 (6 months ago)
Author:
kmaclean
Message:

refactor to create Chapter, Segments & MissingWords classes - snapshot

Files:

  • Trunk/Scripts/Audio_scripts/AudioSegmentation/AudioBook.pm

    r2604 r2606  
    11#! /usr/bin/perl 
    2 $VERSION = 0.2
     2$VERSION = '0.2.1';
    33 
    44=head1 NAME 
    55 
    6 AudioBook - Convert a single transcribed audio file into 15 word audio segments (approximately)  
     6AudioBook - Convert a single transcribed audio file into audio segments  
    77 
    88=cut  
    99 
    10 package AudioBook; 
     10package AudioBook;  
    1111use strict; 
    1212use diagnostics; 
     
    1515use File::Basename; 
    1616use File::Copy; 
    17 use lib '/home/kmaclean/VoxForge-dev/Main/Scripts/Audio_scripts/AudioSegmentation'; 
     17 
    1818use AudioBook::Audio; 
    1919use AudioBook::Text; 
    2020use AudioBook::Dictionary; 
    2121 
     22use AudioBook::Segments;  
     23use AudioBook::MissingWords; 
     24use AudioBook::Chapter; 
     25 
     26 
     27 
    2228=head1 SYNOPSIS 
    2329 
    24  $./AudioBook -h                                                                       display help 
     30 $./AudioBook -h                                                                display help 
    2531 $./AudioBook -a speechfile.wav -t text.txt             minimal run configuration 
    2632 
     
    5258                 * required for script to run 
    5359 
    54  
    55 =head1 NOTES 
     60=head1 Suggested Segmentation Approach: 
     61 
     62 
     63=head1 Step 1 - First Pass Forced Alignment - Getting it to Run Completely Without Errors 
     64 
     65Execute the script as follows using only the '-a' and '-t' parameters: 
     66 
     67  $./AudioBook.pm -a audio -t eText.txt 
     68   
     69This tries to match the words in the text file with the words in the speech audio file, and create time stamps for each word.   
     70These time stamps are used to determine where pauses are located, and if the pause is large enough, it will create a segment  
     71of the sentence, and put an entry into the prompts file. 
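
The pause-based splitting described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script's actual implementation: the helper name split_at_pauses and the [word, start, end] record layout are hypothetical, and the times are assumed to be in units of 100ns, as the script's "-p" option uses.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AudioBook.pm): split aligned words into
# segments wherever the silence before the next word is long enough.
# Each word is [ $text, $start, $end ], with times in units of 100ns.
sub split_at_pauses {
    my ($words, $min_pause) = @_;
    my (@segments, @current);
    for my $i (0 .. $#$words) {
        push @current, $words->[$i][0];
        my $is_last = ($i == $#$words);
        # Pause = start of the next word minus end of this word.
        my $pause = $is_last ? 0 : $words->[$i + 1][1] - $words->[$i][2];
        if ($is_last || $pause >= $min_pause) {
            push @segments, join(' ', @current);
            @current = ();
        }
    }
    return @segments;
}

# Illustrative time stamps only; real values would come from forced alignment.
my @aligned = (
    [ 'call',    0,          3_000_000 ],
    [ 'me',      3_100_000,  5_000_000 ],
    [ 'ishmael', 5_200_000,  11_000_000 ],
    [ 'some',    14_000_000, 16_000_000 ],
    [ 'years',   16_100_000, 19_000_000 ],
    [ 'ago',     19_200_000, 22_000_000 ],
);
print "$_\n" for split_at_pauses(\@aligned, 2_000_000);
```

With a minimum pause of 2000000, the only gap that is long enough falls between "ishmael" and "some", so the sketch produces two segments.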
     72 
     73=head2 NOTES 
    5674 
    5775=head3 Text Does not Match Audio 
    5876 
    59 If the contents of the text file do not *exactly* match the contents of the speech audio file, the segmentation process necessarily becomes  
    60 a manual, iterative process. 
     77The text file *must exactly* match the contents of the speech audio file. 
     78 
     79If there are any errors when you are trying to run the segmentation script for the first time on a new set of text and speech audio files, 
     80the likely reason is that there is something in the text file that does not match what was said in the audio file.  Figuring this out usually 
     81ends up being an iterative process (i.e. you fix an error, run the script, fix another error, ... until you get an error-free run). 
    6182 
    6283If there is a large divergence in the text from the speech audio, then you will have to manually listen to the speech audio to determine  
    6384where the biggest transcription errors lie, and then modify the original text file to match the speech audio file.   
    6485 
    65 If the transcription errors are minor, then the first pass forced alignment usually completes successfully.  However, if you see "No tokens survived to final node of network at beam" errors in the  
    66 HVite log (located in interim_files/logs), then using the "-v" verify switch might be helpful in determining where transcription problems  
    67 might exist. 
     86=head3 Dealing With Out-of-vocabulary Words  
     87 
     88Forced Alignment is performed with HTK's HVite tool.  HVite requires that each word in the text to be forced aligned have a pronunciation entry 
     89in the pronunciation lexicon.  The script uses Sequitur G2P (trained on the VoxForge pronunciation lexicon) to provide initial  
     90pronunciations for Out-Of-Vocabulary words so that the first pass forced alignment will work.  This seems to be "good-enough" to find silences  
     91of reasonable lengths.  Using this information, the script can create prompt entries and the corresponding audio segments.   
     92 
     93=head3 Segmenting Large Audio Files 
     94 
     95For larger files (i.e. greater than 30 minutes of audio), you *may* need to manually split the audio file into 30 minute segments, with  
     96corresponding text files. 
     97 
     98=head1 Step 2 - First Pass Forced Alignment - Runs OK, but there are Errors  
     99 
     100If the transcription errors are minor, then the first pass forced alignment usually completes successfully.   
     101 
     102However, you might see "No tokens survived to final node of network at beam" errors in the HVite log (located in interim_files/logs). 
     103Ensure that the prompt text matches the prompt audio. 
     104 
     105=head1 Step 3 - First Pass Forced Alignment - Verify the Segments 
     106 
     107Get the script to perform a forced alignment on each of the segments, and display the worst 15 "average log likelihood per frame" 
     108scores.  Check the transcription and listen to the corresponding audio, and make corrections to the text, repeat as needed. 
     109 
     110Run the script as follows: 
     111 
     112  $./AudioBook.pm -a audio -t eText.txt -v 
    68113 
    69114The verify switch performs a forced alignment on the individual segments generated from the first pass forced alignment.  Low scores  
    70115(i.e. the lowest average log likelihood per frame score) indicate that the transcription text might not match the corresponding audio  
    71 file.  Look at the segment text and listen to the corresponding audo file to determine if they match.  If they do not match, then fix the  
    72 text in your original text transcription file, repeat this process (i.e. running the AudioBook program again with the verify switch on)  
    73 until you can get a clean run. 
    74  
    75 =head3 Segmenting large audio files 
    76  
    77 For larger files (i.e. greater than 30 minutes of audio), you *may* need to manually segment the audio file into 30 minute segments. 
    78  
    79 =head3 Automatically Adding Out-of-Vocabulary Words to Pronunciation Dictionary  
     116file.  Look at the segment text and listen to the corresponding audio file to determine if they match.  If they do not match (they might  
     117still match, but just have a low score), then fix the text in your original text transcription file, repeat this process (i.e. running  
     118the AudioBook program again with the verify switch on) until you can get a clean run.   
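
As a rough illustration of how the worst-scoring segments might be picked out for review, the sketch below sorts segments by score and keeps the lowest N. The helper name worst_segments, the segment IDs, and the score values are all made up for the example; they are not real HVite output and not the script's own code.

```perl
use strict;
use warnings;

# Hypothetical helper: return the IDs of the $n lowest-scoring segments.
# $scores maps segment ID => average log likelihood per frame.
sub worst_segments {
    my ($scores, $n) = @_;
    # Lowest (worst) score first; ties are broken by ID for stable output.
    my @ids = sort { $scores->{$a} <=> $scores->{$b} or $a cmp $b } keys %$scores;
    $n = @ids if $n > @ids;
    return @ids[0 .. $n - 1];
}

# Illustrative scores only.
my %score_for = (
    seg0001 => -68.2,
    seg0002 => -91.7,
    seg0003 => -70.4,
    seg0004 => -84.9,
);
print "$_\t$score_for{$_}\n" for worst_segments(\%score_for, 2);
```

A low score is only a hint that the text and audio may differ, so each listed segment still has to be checked by ear.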
     119 
     120=head1 Step 4 - First Pass Forced Alignment - Adjusting Prompt Length 
     121 
     122After you can get the First Pass Forced Alignment to run without errors, check the AudioBook.log log file (in the output_files directory) and  
     123review the length of the created prompts. If there are too many prompts over 30 words long, reduce the size of the pause ("-p" switch)  
     124and run First Pass Forced Alignment again - something like this: 
     125 
     126  $./AudioBook.pm -a audio -t eText.txt -v -p 1000000 
     127   
     128Continue making adjustments until you can get reasonable prompt lengths. 
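
One way to check prompt lengths between runs is sketched below. It assumes each prompt line has the form "ID word word ...", which is an assumption about the prompts layout rather than something this changeset confirms, and the helper name long_prompts is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the IDs of prompts longer than $max_words.
# Each line is assumed to look like "ID word word ...".
sub long_prompts {
    my ($lines, $max_words) = @_;
    my @long;
    for my $line (@$lines) {
        my ($id, @words) = split ' ', $line;
        next unless defined $id;    # skip blank lines
        push @long, $id if @words > $max_words;
    }
    return @long;
}

# Illustrative prompt lines only; the limit is lowered to 3 to keep the demo short.
my @prompt_lines = (
    'seg0001 call me ishmael',
    'seg0002 some years ago never mind how long precisely',
    '',
);
print "too long: $_\n" for long_prompts(\@prompt_lines, 3);
```

If many prompts are reported with the limit set to 30, that suggests trying a smaller "-p" value on the next run.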
     129 
     130=head3 Note 
     131 
     132The worst case scenario is that you cannot segment your audio because it does not have any pauses that are long enough to use for a  
     133segment.  This is unlikely, given that people need to breathe in every once in a while.  What will occur is that you will have a few very long
     134segments because the person spoke continuously for a long period of time.  You will likely have to segment these longer prompts manually. 
     135 
     136=head1 Step 5 - Validate Suggested Out-of-Vocabulary Word Pronunciations  
    80137 
    81138The pronunciations generated by the Sequitur G2P scripts need to be manually reviewed before any new pronunciations are added to the 
    82 pronunciation dictionary.  Make sure you review the pronunciation before commiting these changes to SVN.     
    83  
    84 =head1 REQUIREMENTS 
    85  
    86 =item 1 - Sequitor G2P trainable Grapheme-to-Phoneme converter (GPL v2; requires Python to be installed) 
    87  
    88         http://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html 
    89  
    90 =item 2 - HTK Hidden Markov Model Toolkit (note: the source is "open", but there are distribution restrictions) 
    91  
    92         http://htk.eng.cam.ac.uk/ 
    93          
    94         The HTK toolkit needs to be in your path (see http://www.voxforge.org/home/dev/acousticmodels/linux/create/htkjulius/tutorial/download) 
     139pronunciation dictionary.  One way to do this is to use Speech Recognition to determine the pronunciation of the word in the actual audio file. 
     140You can do this with the '-w' switch: 
     141 
     142  $./AudioBook.pm -a audio -t eText.txt -v -p 1000000 -w 
     143 
     144The script then generates a report (MissingWords_combined) that contains a list of all the OOV words, with the speech segment ID and text  
     145(so you can listen to the audio segment), the g2p recommended phone list, and HVite phone list recommendations (determined using speech 
     146recognition), so you can manually validate the final pronunciations.  
     147 
     148=head2 Note  
     149 
     150This approach is only as good as the acoustic model you are using.  The pronunciations still need to be validated against the Sequitur G2P recommended  
     151pronunciations. 
     152 
     153Please donate some speech to VoxForge to help improve our acoustic models. 
     154 
     155=head1 Step 6 - Update Pronunciation Lexicon 
     156 
     157If you are submitting your segmented audio to VoxForge, please include your validated Out-of-Vocabulary word pronunciations 
     158with your submission as a separate file called: "OOV_pron.txt". 
     159 
     160Thanks. 
    95161 
    96162=head1 ALGORITHM 
     163 
     164=head2 Audio Segmentation 
    97165 
    98166This program tries to segment the speech audio file into 15 word sentences.  However, if the pause following the 15th word relative to the current  
     
    103171The default pause duration is 2000000 in units of 100ns (i.e. 0.2 seconds).  This can be changed (using the "-p" switch) if the speech audio file does not segment well  
    104172enough with this default. 
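
Because the "-p" value is given in units of 100ns, a small conversion helper can make it easier to reason about in seconds. The function names below are hypothetical and are not part of the script; this is just a sketch of the arithmetic.

```perl
use strict;
use warnings;

# Times are expressed in units of 100ns, so there are 10,000,000 units per second.
use constant HTK_UNITS_PER_SECOND => 10_000_000;

# Hypothetical helper: convert a pause in 100ns units to seconds.
sub htk_units_to_seconds {
    my ($units) = @_;
    return $units / HTK_UNITS_PER_SECOND;
}

# Hypothetical helper: convert seconds to the nearest whole 100ns unit.
sub seconds_to_htk_units {
    my ($seconds) = @_;
    return int($seconds * HTK_UNITS_PER_SECOND + 0.5);
}

printf "default pause: %.2f s\n", htk_units_to_seconds(2_000_000);
printf "0.1 s pause:   -p %d\n",  seconds_to_htk_units(0.1);
```

So the default of 2000000 corresponds to a 0.2 second pause, and "-p 1000000" halves that to 0.1 seconds.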
     173 
     174=head2 Generating Out-of-Vocabulary Word Pronunciations 
     175 
     176The script gets Sequitur to generate the 20 likeliest pronunciations for each OOV word, and then adds them to the dict file.  It then performs 
     177another forced alignment on the audio segment containing the Out-of-Vocabulary word.  HVite will take the sequence of phoneme sounds that it  
     178recognizes and try to match it to one of the possible pronunciations in the dictionary.  We are therefore using the audio to automatically  
     179help generate the correct pronunciation.  Because the VoxForge acoustic models are not that accurate, these suggestions need to be validated  
     180and compared with the pronunciations generated by Sequitur, and a judgment call needs to be made to select the correct pronunciations. 
     181 
     182=head1 REQUIREMENTS 
     183 
     184=item 1 - Sequitur G2P trainable Grapheme-to-Phoneme converter (GPL v2; requires Python to be installed) 
     185 
     186        http://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html 
     187 
     188=item 2 - HTK Hidden Markov Model Toolkit (note: the source is "open" - i.e. you can read the code, but there are distribution restrictions) 
     189 
     190        http://htk.eng.cam.ac.uk/ 
     191         
     192        The HTK toolkit needs to be in your path  
     193        (see http://www.voxforge.org/home/dev/acousticmodels/linux/create/htkjulius/tutorial/download) 
    105194 
    106195=cut  
     
    139228=head2 process 
    140229 
    141 Segement the user designated speech audio file (-a) sing the supplied text file (-t)  
     230Segment the user designated speech audio file (-a) using the supplied text file (-t)  
    142231 
    143232=cut 
     
    145234sub process { 
    146235        my ($self)= @_; 
    147         my $debug = $self->{'debug'}; 
    148         my $audiofile = $self->{"audiofile"}; 
    149         my $textfile = $self->{"textfile"}; 
    150         my $username = $self->{"username"}; 
    151236        my $tarSuffix = $self->{"tarSuffix"}; 
    152         my $pronDict = $self->{"pronDict"}; 
    153         my $htk_files = $self->{'htk_files'}; 
    154         my $log = $self{'log'}; 
    155         my $dict = "AudioBook/interim_files/dict"; 
    156         my $originalDict = "AudioBook/interim_files/originalDict"; 
    157         my $altDict = "AudioBook/interim_files/altDict";         
    158         my $prompts = "AudioBook/interim_files/prompts";         
    159237         
    160         my $tempPronDict = "AudioBook/interim_files/pronDict"; 
    161         copy($pronDict,$tempPronDict);   
    162  
    163         my $textContents = AudioBook::Text->new($self,$textfile); 
    164         $textContents->createWLISTFile("AudioBook/interim_files/wlist"); 
     238        my $chapter = AudioBook::Chapter->new($self); 
     239        # need draft missing word pronunciations before audio can be processed 
     240        my $missingWords = $chapter->processText();  
     241        $chapter->processAudio();                
     242 
     243        my $segments = AudioBook::Segments->new($self,$chapter); 
     244        $segments->processAudio();       
    165245         
    166         my $dictionary = AudioBook::Dictionary->new($self); 
    167         my $missingwordfound = $dictionary->findOutOfVocabularyWords($pronDict,"AudioBook/interim_files/MissingWords"); 
    168         if ($missingwordfound) {  
    169                 $dictionary->getRecommendedPronunciations("AudioBook/interim_files/MissingWords_out"); # uses g2p 
    170                 $dictionary->updatePronDict($tempPronDict); 
    171                 copy($dict,$originalDict); # save dict before suggested pronunications are added - only need these pronunciations for segmentation of audio      
    172                 # need to update dict with missing words 
    173                 # can't seem to change default HDMan log file with "-l" parameter 
    174                 $command = ("HDMan -A -D -T 1 -g $htk_files/global.ded -m -w AudioBook/interim_files/wlist -i -l AudioBook/interim_files/dlog $dict $tempPronDict"); system($command) == 0 or confess "fullrun $command failed: $?"; 
    175                 $command = ("mv AudioBook/interim_files/dlog AudioBook/interim_files/logs/dlog2"); print "cmd:$command\n" if $debug; system($command); 
    176                 # no longer required$command = ("cp AudioBook/interim_files/MissingWords_out AudioBook/output_files/MissingWords"); print "cmd:$command\n" if $debug; system($command); 
    177         } else { 
    178                 open(LOG,">>$log") or confess ("cannot open AudioBook/output_files/MissingWords file"); 
    179                 print LOG "\nMissing Words that need to be added to Pronunciation Dictionary, with suggested pronunciations::\n";        
    180                 print LOG "------------------------------------------------\n";                          
    181                 print LOG "no missing words\n"; 
    182                 close LOG 
    183         }  
    184         # dict may get manually updated; dict only includes suggested prompts, therefore do not copy to output - suggested pronunications are in the log regardless ... 
    185         #... $command = ("cp AudioBook/interim_files/dict AudioBook/output_files"); print "cmd:$command\n" if $debug; system($command);          
    186  
    187         my $audio = AudioBook::Audio->new($self); 
    188         $audio->segment($audiofile,$textContents); 
    189         if ($self->{"verify_segments"}) { 
    190                 $audio->verifySegments; 
    191         }        
    192         if ($missingwordfound) { 
    193                 if ($self->{"verify_out_of_vocabulary_pronunciations"}) {  
    194                         $dictionary->getAlternatePronunciations("AudioBook/interim_files/MissingWords_alt",15); # uses Sequitor g2p to get top N pronunication vairations 
    195                         $dictionary->createAltDict($originalDict,$altDict);     # merge & sort missing_words_alt and originalDict into altDict  
    196                         $dictionary->validateAlternatePronunciations($originalDict,$altDict,$prompts); 
    197                 } 
    198                 $dictionary->updatePronDict($pronDict);  
    199         }        
     246# !!!!!! not completed           
     247#       $missingWords->getAudio($segments); 
    200248         
    201249        if (defined($tarSuffix)){ 
    202250                _createTarFile($self); 
    203         } 
     251        }       
    204252} 
    205253 
     
    295343        my $debug = $self->{'debug'};    
    296344        getopts('a:b:d:hl:m:p:r:s:t:u:x:q:vwST');    #  sets $opt_* as a side effect. 
    297         if ($opt_S) { # Sanity test switch 
     345        if ($opt_h) { 
     346                print "\nVoxForge Audio Segmentation Script Parameters\n";       
     347                print   "=============================================\n";       
     348                print "-a\t* audio file name (WAV format only)\n"; 
     349                print "-b\tnotify if beam width for Forced Alignment exceeds a certain level (default = 250)\n"; 
     350                print "\t(does not set HVite's beam width parameter)\n"; 
     351                print "-d\tpronunciation dictionary  (default = AudioBook/input_files/VoxforgeDict)\n"; 
     352                print "-h\tshow help\n";         
     353                print "-l\tLICENSE file (default = AudioBook/input_files/LICENCE)\n"; 
     354                print "-m\tTarget maximum sentence length (default = $default_max_sentence_length words)\n"; 
     355                print "-p\tMinimum pause for sentence break (default = $default_min_pause_for_sentence_break in units of 100ns)\n";              
     356                print "-q\tlog words with single quotes (default = yes)\n";              
     357                print "-r\tREADME file (default = AudioBook/input_files/README)\n";                              
     358                print "-s\tAverage sentence length (default = $default_average_sentence_length words)\n";                                
     359                print "-t\t* text file name (containing transcriptions of speech in audio file)\n"; 
     360                 
     361                print "-u\tusername or name you want file stats collected by on VoxForge Metrics \n"; 
     362                print "\tpage:\t(http://www.voxforge.org/home/downloads/metrics)\n";     
     363                 
     364                print "-v\tvalidate segment audio files to prompt text using forced alignment\n"; 
     365                print "-w\tvalidate missing word pronunciations to audio recordings\n";          
     366                print "-x\tunique tar file suffix (max 3 characters - remainder is truncated)\n"; 
     367                print "-S\trun sanity test\n";           
     368                print "-T\tcreate gzipped/tar file\n"; 
     369                print "\n\t* minimum required for script to run\n";      
     370                print "\n";      
     371                print "--\n";                    
     372                print "Free Speech... Recognition\n"; 
     373                print "http://www.voxforge.org\n\n"; 
     374                exit; 
     375        } elsif ($opt_S) { # Sanity test switch 
    298376                $self->{"audiofile"}="AudioBook/test/audio.wav"; 
    299                 #$self->{"textfile"}="AudioBook/test/text-simple.txt"; 
    300                 $self->{"textfile"}="AudioBook/test/text-original.txt"; 
     377                #$self->{"textFile"}="AudioBook/test/text-simple.txt"; 
     378                $self->{"textFile"}="AudioBook/test/text-original.txt"; 
    301379                $command = ("cp AudioBook/input_files/VoxForgeDict AudioBook/interim_files/VoxForgeDict"); print "cmd:$command\n"; system($command); 
    302380                $self->{"pronDict"}="AudioBook/interim_files/VoxForgeDict"; 
     
    319397                } 
    320398                if (-r $opt_t) { 
    321                         $self->{"textfile"}=$opt_t; 
    322                 } else { 
    323                         die "can't open -t" . $self->{"textfile"} . "\n";              
     399                        $self->{"textFile"}=$opt_t; 
     400                } else { 
     401                        die "can't open -t $opt_t\n";   # report the path that was given; textFile is not set on this branch
    324402                } 
    325403                if (defined($opt_d)) { 
     
    403481                        } 
    404482                } 
    405         } elsif ($opt_h) { 
    406                 print "\nVoxForge Audio Segmentation Script Parameters\n";       
    407                 print   "=============================================\n";       
    408                 print "-a\t* audio file name (WAV format only)\n"; 
    409                 print "-b\tnotify if beam width for Forced Alignment exceeds a certain level (default = 250)\n"; 
    410                 print "\t(does not set HVite's beam width parameter)\n"; 
    411                 print "-d\tpronunciation dictionary  (default = AudioBook/input_files/VoxforgeDict)\n"; 
    412                 print "-h\tshow help\n";         
    413                 print "-l\tLICENSE file (default = AudioBook/input_files/LICENCE)\n"; 
    414                 print "-m\tTarget maximum sentence length (default = $default_max_sentence_length words)\n"; 
    415                 print "-p\tMinimum pause for sentence break (default = $default_min_pause_for_sentence_break in units of 100ns)\n";              
    416                 print "-q\tlog words with single quotes (default = yes)\n";              
    417                 print "-r\tREADME file (default = AudioBook/input_files/README)\n";                              
    418                 print "-s\tAverage sentence length (default = $default_average_sentence_length words)\n";                                
    419                 print "-t\t* text file name (containing transcriptions of speech in audio file)\n"; 
    420                  
    421                 print "-u\tusername or name you want file stats collected by on VoxForge Metrics \n"; 
    422                 print "\tpage:\t(http://www.voxforge.org/home/downloads/metrics)\n";     
    423                  
    424                 print "-v\tvalidate segment audio files to prompt text using forced Aligment\n"; 
    425                 print "-w\tvalidate missing word pronunciations to audio recordings\n";          
    426                 print "-x\tunique tar file suffix (max 3 characters - remainder is truncated)\n"; 
    427                 print "-S\trun sanity test\n";           
    428                 print "-T\tcreate gzipped/tar file\n"; 
    429                 print "\n\t* required for script to run\n";      
    430                 print "\n";      
    431                 print "--\n";                    
    432                 print "Free Speech... Recognition\n"; 
    433                 print "http://www.voxforge.org\n\n"; 
    434                 exit; 
    435483        } else { 
    436484                print "\nVoxForge Audio Segmentation Script\n";  
     
    443491        } 
    444492        print "audiofile:" . $self->{"audiofile"}. "\n"; 
    445         print "textfile:" . $self->{"textfile"}. "\n"; 
     493        print "textFile:" . $self->{"textFile"}. "\n"; 
    446494        print "pronDict:" . $self->{"pronDict"} . "\n\n";        
    447495} 
    448496 
    449 =head2 Gettors - Public (used by methods in other sub-classes) 
     497=head2 Getters  
    450498 
    451499=item * getAverage_sentence_length() 
     
    458506} 
    459507 
     508=item * getBeam_width() 
     509 
     510=cut 
     511 
     512sub getBeam_width { 
     513        my $self = shift; 
     514        return $self->{"beam_width"}; 
     515} 
     516 
    460517=item * getMax_sentence_length() 
    461518 
     
    473530sub getMin_pause_for_sentence_break { 
    474531        my $self = shift; 
    475         return $self->{"max_sentence_length"}; 
    476 
    477      
     532        return $self->{"min_pause_for_sentence_break"}; 
     533}
     534 
     535=item * getLog_single_quotes() 
     536 
     537=cut 
     538 
     539sub getLog_single_quotes { 
     540        my $self = shift; 
     541        return $self->{"log_single_quotes"}; 
     542}
     543 
     544 
     545=item * getTextFile() 
     546 
     547=cut 
     548 
     549sub getTextFile { 
     550        my $self = shift; 
     551        return $self->{"textFile"}; 
     552}
     553 
     554=item * getAudiofile() 
     555 
     556=cut 
     557 
     558sub getAudiofile { 
     559        my $self = shift; 
     560        return $self->{"audiofile"}; 
     561}
     562 
     563=item * getUsername() 
     564 
     565=cut 
     566 
     567sub getUsername { 
     568        my $self = shift; 
     569        return $self->{"username"}; 
     570}     
     571 
     572=item * getLog() 
     573 
     574=cut 
     575 
     576sub getLog { 
     577        my $self = shift; 
     578        return $self->{"log"}; 
     579}
     580 
     581 
     582=item * getPronDict() 
     583 
     584=cut 
     585 
     586sub getPronDict { 
     587        my $self = shift; 
     588        return $self->{"pronDict"}; 
     589}        
     590 
     591=item * getHtk_files() 
     592 
     593=cut 
     594 
     595sub getHtk_files { 
     596        my $self = shift; 
     597        return $self->{'htk_files'}; 
     598}   
     599 
     600=item * getG2p_model() 
     601 
     602=cut 
     603 
     604sub getG2p_model { 
     605        my $self = shift; 
     606        return $self->{'g2p_model'}; 
     607}
     608 
     609=item * getDebug() 
     610 
     611=cut 
     612 
     613sub getDebug { 
     614        my $self = shift; 
     615        return $self->{'debug'}; 
     616}
     617 
     618=item * getVerify_segments() 
     619 
     620=cut 
     621 
     622sub getVerify_segments { 
     623        my $self = shift; 
     624        return $self->{'verify_segments'}; 
     625}
     626 
     627 
     628 
    478629=head1 Change Log     
    479630 
    480         2008/05/02 - 0.2 - convert to class; major refacture ; renamed fullrun.pl to AudioBook.pm                                                        
    481         2008/01/31 - 0.1 - created 
     6312008/06/09 - 0.2.1 - refactor to create Chapter, Segments & MissingWords classes 
     6322008/05/02 - 0.2 - convert to class; major refactor; renamed fullrun.pl to AudioBook.pm
     6332008/01/31 - 0.1 - created 
    482634         
    483635=cut 
     
    501653MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the 
    502654GNU General Public License for more details. 
     655 
     656=cut 
     657 
     6581;