voxforge.org
VoxForge Dev

Changeset 2624

Timestamp:
06/26/08 21:51:06 (4 months ago)
Author:
kmaclean
Message:

doc updates

Files:

  • Trunk/Scripts/Audio_scripts/AudioSegmentation/AudioBook.pm

    r2616 r2624  
    11#! /usr/bin/perl 
    2 $VERSION = 0.2.1
     2$VERSION = 0.2.2
    33 
    44=head1 NAME 
     
    2525use AudioBook::Chapter; 
    2626 
    27  
    28  
    2927=head1 SYNOPSIS 
    3028 
    31  $./AudioBook -h                                                               display help 
     29 $./AudioBook -h                                                display help 
    3230 $./AudioBook -a speechfile.wav -t text.txt             minimal run configuration 
    3331 
     
    6563=head1 Suggested Segmentation Approach: 
    6664 
    67  
    68 =head2 Step 1 - First Pass Forced Alignment - Getting it to Run Completely Without Errors 
    69  
    70 Execute the script as follows using only the '-a' and '-t' parameters: 
     65=head2 Step 1  
     66 
     67Spell check the text for the audiobook and remove any mistakes or archaic spellings (it's good to remove 
     68these to ensure that the pronunciation dictionary does not get cluttered). 
     69 
     70 
     71=head2 Step 2 - First Pass Forced Alignment - Getting it to Run Completely Without Errors 
     72 
     73Execute the script as follows using only the audio file ('-a') and text file ('-t') parameters: 
    7174 
    7275  $./AudioBook.pm -a audio -t eText.txt 
     
    8083=head4 Text Does not Match Audio 
    8184 
    82 The text file *must exactly* match the contents of the speech audio file. 
     85The text file should match the contents of the speech audio file as closely as possible (in practice, it will run OK even if some words 
     86do not match exactly; it only needs about 98-99% accuracy). 
    8387 
    8488If there are any errors when you are trying to run the segmentation script for the first time on a new set of text and speech audio files, 
    8589the likely reason is that there is something in the text file that does not match what was said in the audio file.  Figuring this out usually 
    86 ends up being an interative process (i.e. you fix an error, run the script, fix another error, ... until you get an error-free run). 
     90ends up being an iterative process (i.e. you fix an error, run the script, fix another error, ... until you get an error-free run).  Look 
     91for non-alphanumeric characters that the script does not remove automatically, such as multiple dashes (---) or multiple periods (...),  
     92and remove them from the text. 
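
As a rough illustration of the manual cleanup described above, the following Perl sketch strips runs of dashes and periods from a line of text. The helper name strip_punctuation_runs is hypothetical and is not part of AudioBook.pm; it only covers the two examples named in the text, so other stray characters would need their own patterns.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AudioBook.pm): remove runs of dashes
# and periods that can prevent the text from matching the audio.
sub strip_punctuation_runs {
    my ($line) = @_;
    $line =~ s/-{2,}/ /g;     # runs of two or more dashes
    $line =~ s/\.{2,}/ /g;    # runs of two or more periods
    $line =~ s/\s{2,}/ /g;    # collapse the whitespace left behind
    return $line;
}

print strip_punctuation_runs("He paused---then went on... slowly."), "\n";
```

A single period at the end of a sentence is left alone, since only runs of two or more are matched.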
    8793 
    8894If there is a large divergence between the text and the speech audio, then you will have to manually listen to the speech audio to determine  
    89 where the biggest transcription errors lie, and then modify the original text file to match the speech audio file.   
    90  
     95where the biggest transcription errors lie, and then modify the original text file to match the speech audio file.  This may involve mistakes  
     96(e.g. the reader missing a line while reading the text) or formatting issues in the text (e.g. there might be columnar data in the text  
     97that the reader read column by column; you then need to rewrite the text to match how the reader read the passage). 
     98  
    9199=head4 Dealing With Out-of-vocabulary Words  
    92100 
    93 Forced Alignment is performed with HTK's HVite tool.  HVite requires that each word in the text to be forced aligned have a pronunciation entry 
    94 in the pronunications lexicon.  The script uses Sequitor G2P (trained on the VoxForge pronunciation lexicon) to provide initial  
    95 pronunciations for Out-Of-Vocabulary words so that the first pass forced alignment will work.  This seems to be "good-enough" to find silences  
    96 of reasonable lengths.  Using this information, the script can create a prompt entries and corresponding audio segment.   
     101Forced Alignment is performed with HTK's HVite tool as part of the segmentation process.  Forced Alignment simply means that HVite takes  
     102the audio and its known transcription, looks up the phone sequence for each word in the pronunciation dictionary, and returns the times  
     103at which each word most probably occurs in the audio. 
     104 
     105HVite requires each word in the text being force-aligned to have a pronunciation entry in the pronunciation lexicon.  The AudioBook.pm  
     106script uses Sequitor G2P (trained on the VoxForge pronunciation lexicon) to provide draft pronunciations for Out-Of-Vocabulary words so that  
     107the first pass forced alignment will work.   
     108 
     109This seems to be "good-enough" to find silences of reasonable lengths.  Using this information, the script can create prompt entries and  
     110corresponding audio segments.   
    97111 
    98112=head4 Segmenting Large Audio Files 
     
    101115corresponding text files. 
    102116 
    103 =head2 Step 2 - First Pass Forced Alignment - Runs OK, but there are Errors  
    104  
    105 If the transcription errors are minor, then the first pass forced alignment usually completes successfully.   
    106  
    107 However, you might see "No tokens survived to final node of network at beam" errors in the HVite log (located in interim_files/logs). 
    108 Ensure that the prompt text matches the prompt audio. 
    109  
    110 =head2 Step 3 - First Pass Forced Alignment - Verify the Segments 
     117=head4 Automatic Numeric Conversion 
     118 
     119This script converts numbers to their word equivalent using these Perl packages: 
     120 
     121        Lingua::EN::Numbers qw(num2en num2en_ordinal); 
     122        Lingua::EN::Numbers::Years; 
     123 
     124These packages make assumptions that need to be validated.  Usually 1, 2, and 3 digit numbers get processed OK.   
     1254 digit numbers can be pronounced a couple of ways, and should be checked.  For example, the script will convert  
     126these numbers as follows:  
     127 
     128        converted number:7500: to seven thousand five hundred 
     129        converted number:8500: to eight thousand five hundred 
     130 
     131But the actual pronunciation the user used is Seventy Five Hundred and Eighty Five Hundred.  These need to be  
     132corrected manually. 
     133 
     134This script makes the assumption that 4 digit numbers between 1000 and 2100 are years; this needs to be validated. 
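
To help with the manual check described above, here is a minimal Perl sketch that lists every 4 digit number in a piece of text and notes whether it falls in the 1000 to 2100 range that is assumed to be a year. The helper find_four_digit_numbers is hypothetical and is not part of AudioBook.pm, and treating both ends of the range as inclusive is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AudioBook.pm): list every 4 digit number
# in a transcript so each one can be checked against the audio by hand.
sub find_four_digit_numbers {
    my ($text) = @_;
    my @found;
    # Match exactly four digits that are not part of a longer number.
    while ($text =~ /(?<!\d)(\d{4})(?!\d)/g) {
        my $number = $1;
        # Assumption: the 1000..2100 year range is inclusive at both ends.
        my $guess = ($number >= 1000 && $number <= 2100) ? 'year' : 'number';
        push @found, { value => $number, treated_as => $guess };
    }
    return @found;
}

my @hits = find_four_digit_numbers("In 1850 the mill produced 7500 barrels.");
printf "%s will be treated as a %s\n", $_->{value}, $_->{treated_as} for @hits;
```

Each reported number can then be compared with what the reader actually said, and the text corrected where the spoken form differs.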
     135 
     136=head2 Step 3 - First Pass Forced Alignment - Runs completely, but there are Errors 
     137 
     138If the transcription errors are only minor, then the first pass forced alignment usually completes successfully.  However, you might  
     139see "No tokens survived to final node of network at beam" errors in the HVite log (located in interim_files/logs). 
     140 
     141You need to fix these errors by ensuring that the prompt text matches the prompt audio. 
     142 
     143=head2 Step 4 - First Pass Forced Alignment - Verify the Segments 
    111144 
    112145Get the script to perform a forced alignment on each of the segments, and display the worst 15 "average log likelihood per frame" 
    113 scores.  Check the transcription and listen to the corresponding audio, and make corrections to the text, repeat as needed. 
     146scores.  Then check the transcription and listen to the corresponding audio, and make corrections to the text, repeat as needed. 
    114147 
    115148Run the script as follows: 
     
    118151 
    119152The verify switch performs a forced alignment on the individual segments generated from the first pass forced alignment.  Low scores  
    120 (i.e. the lowest average log likelihood per frame score) indicate that the transcription text might not match the corresponding audio  
     153(i.e. the lowest average log likelihood per frame score) indicate that the transcription text *might* not match the corresponding audio  
    121154file.  Look at the segment text and listen to the corresponding audio file to determine if they match.  If they do not match (they might  
    122155still match, but just have a low score), then fix the text in your original text transcription file, repeat this process (i.e. running  
    123156the AudioBook program again with the verify switch on) until you can get a clean run.   
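
The idea of picking out the worst scoring segments can be sketched in Perl as follows. This is only an illustration: the segment IDs and scores are made up, the helper worst_segments is hypothetical, and it does not show how AudioBook.pm itself stores or reports the scores.

```perl
use strict;
use warnings;

# Hypothetical data: segment ID => average log likelihood per frame.
# These IDs and values are made up for illustration.
my %score_for = (
    seg0001 => -62.4,
    seg0002 => -71.9,
    seg0003 => -58.3,
    seg0004 => -80.1,
);

# Return the IDs of the $count lowest scoring segments, worst first.
sub worst_segments {
    my ($scores, $count) = @_;
    my @sorted = sort { $scores->{$a} <=> $scores->{$b} } keys %$scores;
    $count = @sorted if $count > @sorted;   # fewer segments than requested
    return @sorted[0 .. $count - 1];
}

print "$_\t$score_for{$_}\n" for worst_segments(\%score_for, 15);
```

The segments at the top of the list are the ones to listen to first, keeping in mind that a low score does not always mean the text is wrong.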
    124157 
    125 =head2 Step 4 - First Pass Forced Alignment - Adjusting Prompt Length 
     158=head2 Step 5 - First Pass Forced Alignment - Adjusting Prompt Length 
    126159 
    127160After you can get the First Pass Forced Alignment to run without errors, check the AudioBook.log log file (in the output_files directory) and  
    128 review the length of the created prompts. If there are too many prompts over 30 words long, reduce the size of the pause ("-p" switch)  
    129 and run First Pass Forced Alignment again - something like this: 
     161review the length of the created prompts. If there are too many prompts over 30 words long (one or two prompts in the low 30s is passable...), 
     162reduce the size of the pause ("-p" switch) and run First Pass Forced Alignment again - something like this: 
    130163 
    131164  $./AudioBook.pm -a audio -t eText.txt -v -p 1000000 
     
    136169 
    137170The worst case scenario is that you cannot segment your audio because it does not have any pauses that are long enough to use for a  
    138 segment.  This is unlikely, given that people need to breath in every once in a while.  What will occur is that you will have a few very long 
    139 segments because the person spoke continuously for a long period of time.  You will likely have to segment these longer prompts manually. 
    140  
    141 =head2 Step 5 - Validate Suggested Out-of-Vocabulary Word Pronunciations  
    142  
    143 The pronunciations generated by the Sequitor G2P scripts need to be manually reviewed before any new pronunciations are added to the 
    144 pronunciation dictionary.  One way to do this is to use Speech Recognition to determine the pronunciation of the word in the actual audio file. 
     171segment.  This is unlikely, given that people need to breathe in every once in a while.  What will likely occur is that you will have a few  
     172very long segments because the person spoke continuously for a long period of time.  You will probably have to segment these longer  
     173prompts manually. 
     174 
     175=head2 Step 6 - Validate Suggested Out-of-Vocabulary Word Pronunciations 
     176 
     177The pronunciations generated by the Sequitor G2P scripts need to be manually reviewed before they are added to the 
     178pronunciation dictionary.  One way to do this is to use Speech Recognition to determine the phone set of the word in the actual audio file. 
    145179You can do this with the '-w' switch: 
    146180 
    147181  $./AudioBook.pm -a audio -t eText.txt -v -p 1000000 -w 
    148182 
    149 The script then generates a report (MissingWords_combined) that contains a list of all the OOV words, with the speech segment ID and text  
     183The -w switch generates a report (MissingWords_combined) that contains a list of all the OOV words, with the speech segment ID and text  
    150184(so you can listen to the audio segment), the g2p recommended phone list, and HVite phone list recommendations (determined using speech 
    151185recognition), so you can manually validate the final pronunciations.  
     
    154188 
    155189Note that this approach is only as good as the acoustic model you are using.  The pronunciations still need to be validated against the Sequitor G2P recommended  
    156 pronunciations. 
    157  
    158 Please donate some speech to Voxforge to help improve our acoutic models. 
    159  
    160 =head2 Step 6 - Update Pronunciation Lexicon 
    161  
    162 If you are submitting your segmented audio to VoxForge, please include your validated Out-of-Vocabulary word pronunciations 
    163 with your submission as a separate file called: "OOV_pron.txt". 
    164  
    165  
    166 =head2 Step 7 - Missing word processing 
    167  
    168 Use interactive command line tool (using the -i switch, after having run with -v and -w swtiches - this class requires the missingword.xml to  
    169 work properly) to line to generate suggested pronunciations (phone lists) using Sequitor G2P and HVite forced alignment to generate most  
    170 probable pronunciation. 
     190pronunciations.  Please donate some speech to Voxforge to help improve our acoustic models. 
     191 
     192=head2 Step 7 - Interactive Missing Word Validation 
     193 
     194You can also use the script interactively (using the -i switch) to review the Sequitor G2P suggested phone lists and HVite pronunciations.  It  
     195is a simple command line script. 
     196 
     197This mode requires the output (an xml version of the MissingWords_combined file called MissingWords.xml) from the -w switch (which needs the -v  
     198switch).  This parameter uses the contents of the missingword.xml file to prompt the user to select or edit a suggested pronunciation.  Results  
     199are placed in the MissingWords_final file, and if the -d switch is selected, then that dictionary will be updated with the results, like this:  
     200 
     201  $./AudioBook.pm -i -d /home/me/voxforge/VoxForgeDict 
     202 
     203=head2 Step 8 - Validated Pronunciation Lexicon 
     204 
     205If you are submitting your segmented audio to VoxForge, please include your *validated* Out-of-Vocabulary word pronunciations 
     206with your submission as a separate file called: "OOV_pron.txt" 
    171207 
    172208=head1 ALGORITHM 
     
    179215duration can be found, increasing the number of words to look behind and ahead each time.  
    180216 
    181 The default pause duration is 2000000 in units of 100ns.  This can be changed (using the "-p" switch") if the speech audio file does segment well  
     217The default pause duration is 2000000 in units of 100ns.  This can be changed (using the "-p" switch) if the speech audio file doesn't segment well  
    182218enough with this default. 
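
Since the pause duration is given in units of 100ns, it can help to convert a "-p" value to seconds before choosing one. The short Perl sketch below does that arithmetic; the helper pause_units_to_seconds is hypothetical and is not part of AudioBook.pm.

```perl
use strict;
use warnings;

# One pause unit is 100 nanoseconds, as stated in the text above.
use constant SECONDS_PER_UNIT => 100e-9;

# Hypothetical helper (not part of AudioBook.pm): convert a "-p" value
# in 100ns units to seconds.
sub pause_units_to_seconds {
    my ($units) = @_;
    return $units * SECONDS_PER_UNIT;
}

printf "%d units = %.2f seconds\n", $_, pause_units_to_seconds($_)
    for (2000000, 1000000);
```

So the default of 2000000 corresponds to a pause of 0.2 seconds, and a value of 1000000 to 0.1 seconds.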
    183219 
     
    196232        http://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html 
    197233 
    198 =item 2 - HTK Hidden Markov Model Toolkit (note: the source is "open" - i.e. you can read the code, but there are distribution restrictions) 
     234=item 2 - HTK Hidden Markov Model Toolkit (note: the source is "open" - i.e. you can read the code - but there are distribution restrictions) 
    199235 
    200236        http://htk.eng.cam.ac.uk/ 
     
    206242 
    207243        Term::ReadLine::Gnu 
    208  
    209  
     244        Audio::Wav 
     245        Lingua::EN::Numbers 
     246        XML::LibXML 
     247        File::Copy 
    210248 
    211249=cut  
     
    415453                $self->{"verify_segments"}=1;    
    416454                $self->{"verify_out_of_vocabulary_pronunciations"}=1;    
    417                 $self->{"README"}="AudioBook/input_files/README"; 
    418                 $self->{"LICENSE"}="AudioBook/input_files/LICENSE"; 
     455                $self->{"README"}="AudioBook/test/README"; 
     456                $self->{"LICENSE"}="AudioBook/test/LICENSE"; 
    419457        } elsif ($opt_a and $opt_t) {    
    420458                if (-r $opt_a) { 
     
    695733=head1 Change Log     
    696734 
    697   2008/06/12 - 0.1 - created CommandLine class to permit interactive validation of missing word pronunciations 
     735  2008/06/12 - 0.2.2 - created CommandLine class to permit interactive validation of missing word pronunciations 
    698736  2008/06/1 - 0.2.1 - refactor to create Chapter, Segments & MissingWords classes 
    699737  2008/06/09 - 0.2.1 - refactor to create Chapter, Segments & MissingWords classes