Changeset 2624
- Timestamp: 06/26/08 21:51:06
- Files:
Trunk/Scripts/Audio_scripts/AudioSegmentation/AudioBook.pm
--- Trunk/Scripts/Audio_scripts/AudioSegmentation/AudioBook.pm (r2616)
+++ Trunk/Scripts/Audio_scripts/AudioSegmentation/AudioBook.pm (r2624)
@@ -1,4 +1,4 @@
 #! /usr/bin/perl
-$VERSION = 0.2.1;
+$VERSION = 0.2.2;
 
 =head1 NAME
@@ -25,9 +25,7 @@
 use AudioBook::Chapter;
 
-
-
 =head1 SYNOPSIS
 
-$./AudioBook -h display help
+$./AudioBook -h display help
 $./AudioBook -a speechfile.wav -t text.txt minimal run configuration
 
@@ -65,8 +63,13 @@
 =head1 Suggested Segmentation Approach:
 
-
-=head2 Step 1 - First Pass Forced Alignment - Getting it to Run Completely Without Errors
-
-Execute the script as follows using only the '-a' and '-t' parameters:
+=head2 Step 1
+
+Spell check the text for the audiobook and remove any mistakes or archiac spellings (its good to remove
+these to ensure that the pronunciation dictionary does not get cluttered).
+
+
+=head2 Step 2 - First Pass Forced Alignment - Getting it to Run Completely Without Errors
+
+Execute the script as follows using only the audio file ('-a') and and text file ('-t') parameters:
 
 $./AudioBook.pm -a audio -t eText.txt
@@ -80,19 +83,30 @@
 =head4 Text Does not Match Audio
 
-The text file *must exactly* match the contents of the speech audio file.
+The text file *must exactly* match the contents of the speech audio file (...well actually, it will run OK even if some words do not exactly
+match... it only needs about 98-99% accuracy).
 
 If there are any errors when you are trying to run the segmentation script for the first time on a new set of text and speech audio files,
 the likely reason is that there is something in the text file that does not match what was said in the audio file. Figuring this out usually
-ends up being an interative process (i.e. you fix an error, run the script, fix another error, ... until you get an error-free run).
+ends up being an interative process (i.e. you fix an error, run the script, fix another error, ... until you get an error-free run). Look
+for non-alphanumeric characters and remove them from the text - like multiple dashes (---), multiple periods (...) - any weird non-alphanumeric
+characters not being automatically removed by the script.
 
 If there are a large divergence in the text from the speech audio, then you will have to manually listen to the speech audio to determine
-where the biggest transcription errors lie, and then modify the original text file to match the speech audio file.
-
+where the biggest transcription errors lie, and then modify the original text file to match the speech audio file. This may involves mistakes
+(e.g the reader missing a line while reading the text) or formatting issues in the text (e.g. there might be columnar data in the text,
+and it is read by column by the reader - you then need to rewrite the text to match how the reader read the passage).
+
 =head4 Dealing With Out-of-vocabulary Words
 
-Forced Alignment is performed with HTK's HVite tool. HVite requires that each word in the text to be forced aligned have a pronunciation entry
-in the pronunications lexicon. The script uses Sequitor G2P (trained on the VoxForge pronunciation lexicon) to provide initial
-pronunciations for Out-Of-Vocabulary words so that the first pass forced alignment will work. This seems to be "good-enough" to find silences
-of reasonable lengths. Using this information, the script can create a prompt entries and corresponding audio segment.
+Forced Alignment is performed with HTK's HVite tool as part of the segmentation process. Force Alignment simply means that the HTK tools
+listens to the audio and looks up the most probable phone sequence in the pronunciation dictionary, and returns the word that corresponds
+to this phone sequence.
+
+HVite requires that each word in the text to be "forced aligned" to have a pronunciation entry in the pronunications lexicon. The AudioBook.pm
+script uses Sequitor G2P (trained on the VoxForge pronunciation lexicon) to provide draft pronunciations for Out-Of-Vocabulary words so that
+the first pass forced alignment will work.
+
+This seems to be "good-enough" to find silences of reasonable lengths. Using this information, the script can create a prompt entries and
+corresponding audio segment.
 
 =head4 Segmenting Large Audio Files
@@ -101,15 +115,34 @@
 corresponding text files.
 
-=head2 Step 2 - First Pass Forced Alignment - Runs OK, but there are Errors
-
-If the transcription errors are minor, then the first pass forced alignment usually completes successfully.
-
-However, you might see "No tokens survived to final node of network at beam" errors in the HVite log (located in interim_files/logs).
-Ensure that the prompt text matches the prompt audio.
-
-=head2 Step 3 - First Pass Forced Alignment - Verify the Segments
+=head4 Automatic Numeric Conversion
+
+This script converts numbers to their word equivalent using these Perl packages:
+
+Lingua::EN::Numbers qw(num2en num2en_ordinal);
+Lingua::EN::Numbers::Years;
+
+These packages make assumptions that need to be validated. Usually 1, 2, and 3 digit numbers get processed OK.
+4 digit numbers can be pronounced a couple of ways, and should be checked. For example, the script will converted
+these numbers as follows:
+
+converted number:7500: to seven thousand five hundred
+converted number:8500: to eight thousand five hundred
+
+But the actual pronunciation the user used is Seventy Five Hundred and Eighty Five Hundred. These need to be
+corrected manually.
+
+This script makes the assumptin that 4 digit numbers between 1000 and 2100 are years - this needs to be validated.
+
+=head2 Step 3 - First Pass Forced Alignment - Runs completely, but there are Errors
+
+If the transcription errors are only minor, then the first pass forced alignment usually completes successfully. However, you might
+see "No tokens survived to final node of network at beam" errors in the HVite log (located in interim_files/logs).
+
+You need to fix these errors by ensuring that the prompt text matches the prompt audio.
+
+=head2 Step 4 - First Pass Forced Alignment - Verify the Segments
 
 Get the script to perform a forced alignment on each of the segments, and display the worst 15 "average log likelihood per frame"
-scores. Check the transcription and listen to the corresponding audio, and make corrections to the text, repeat as needed.
+scores. Then check the transcription and listen to the corresponding audio, and make corrections to the text, repeat as needed.
 
 Run the script as follows:
@@ -118,14 +151,14 @@
 
 The verify switch performs a forced alignment on the individual segments generated from the first pass forced alignment. Low scores
-(i.e. the lowest average log likelihood per frame score) indicate that the transcription text might not match the corresponding audio
+(i.e. the lowest average log likelihood per frame score) indicate that the transcription text *might* not match the corresponding audio
 file. Look at the segment text and listen to the corresponding audo file to determine if they match. If they do not match (they might
 still match, but just have a low score), then fix the text in your original text transcription file, repeat this process (i.e. running
 the AudioBook program again with the verify switch on) until you can get a clean run.
 
-=head2 Step 4 - First Pass Forced Alignment - Adjusting Prompt Length
+=head2 Step 5 - First Pass Forced Alignment - Adjusting Prompt Length
 
 After you can get the First Pass Forced Alignment to run without errors, check the AudioBook.log log file (in the output_files directory) and
-review the length of the created prompts. If there are too many prompts over 30 words long, reduce the size of the pause ("-p" switch)
-and run First Pass Forced Alignment again - something like this:
+review the length of the created prompts. If there are too many prompts over 30 words long (one or two prompts in the low 30s is passable...),
+reduce the size of the pause ("-p" switch) and run First Pass Forced Alignment again - something like this:
 
 $./AudioBook.pm -a audio -t eText.txt -v - p 1000000
@@ -136,16 +169,17 @@
 
 The worst case scenario is that you cannot segment your audio because it does not have any pauses that are long enough to use for a
-segment. This is unlikely, given that people need to breath in every once in a while. What will occur is that you will have a few very long
-segments because the person spoke continuously for a long period of time. You will likely have to segment these longer prompts manually.
-
-=head2 Step 5 - Validate Suggested Out-of-Vocabulary Word Pronunciations
-
-The pronunciations generated by the Sequitor G2P scripts need to be manually reviewed before any new pronunciations are added to the
-pronunciation dictionary. One way to do this is to use Speech Recognition to determine the pronunciation of the word in the actual audio file.
+segment. This is unlikely, given that people need to breath in every once in a while. What will likely occur is that you will have a few
+very long segments because the person spoke continuously for a long period of time. You will probably have to segment these longer
+prompts manually.
+
+=head2 Step 6 - Validate Suggested Out-of-Vocabulary Word Pronunciations
+
+The pronunciations generated by the Sequitor G2P scripts need to be manually reviewed before any they are added to the
+pronunciation dictionary. One way to do this is to use Speech Recognition to determine the phone set of the word in the actual audio file.
 You can do this with the '-w' switch:
 
 $./AudioBook.pm -a audio -t eText.txt -v - p 1000000 -w
 
-The script then generates a report (MissingWords_combined) that contains a list of all the OOV words, with the speech segment ID and text
+The -w generates a report (MissingWords_combined) that contains a list of all the OOV words, with the speech segment ID and text
 (so you can listen to the audio segment), the g2p recommended phone list, and HVite phone list recommendations (determined using speech
 recognition), so you can manually validate the final pronunciations.
@@ -154,19 +188,21 @@
 
 That this approach is only as good as the acoustic model you are using. The pronunciations still need to be validated against the Sequitor G2P recommended
-pronunciations.
-
-Please donate some speech to Voxforge to help improve our acoutic models.
-
-=head2 Step 6 - Update Pronunciation Lexicon
-
-If you are submitting your segmented audio to VoxForge, please include your validated Out-of-Vocabulary word pronunciations
-with your submission as a separate file called: "OOV_pron.txt".
-
-
-=head2 Step 7 - Missing word processing
-
-Use interactive command line tool (using the -i switch, after having run with -v and -w swtiches - this class requires the missingword.xml to
-work properly) to line to generate suggested pronunciations (phone lists) using Sequitor G2P and HVite forced alignment to generate most
-probable pronunciation.
+pronunciations. Please donate some speech to Voxforge to help improve our acoutic models.
+
+=head2 Step 7 - Iteractive Missing Word Validation
+
+You can also use the script interactively (using the -v switch) to review the Sequitor G2P suggested phone lists and HVite pronunciations. It
+is a simple command line script.
+
+This mode requires the output (an xml version of the MissingWords_combined file called MissingWords.xml) from the -w switch (which needs the -v
+switch). This parameter uses the contents of the missingword.xml file to prompt the user to select or edit a suggested pronunciation. Results
+are placed in the MissingWords_final file, and if the -d switch is selected, then that dictionary will be updated with the results, like this:
+
+$./AudioBook.pm -i -d /home/me/voxforge/VoxForgeDict
+
+=head2 Step 8 - Validated Pronunciation Lexicon
+
+If you are submitting your segmented audio to VoxForge, please include your *validated* Out-of-Vocabulary word pronunciations
+with your submission as a separate file called: "OOV_pron.txt"
 
 =head1 ALGORITHM
@@ -179,5 +215,5 @@
 duration can be found, increasing the number of words to look behind and ahead each time.
 
-The default pause duration is 2000000 in units of 100ns. This can be changed (using the "-p" switch") if the speech audio file does segment well
+The default pause duration is 2000000 in units of 100ns. This can be changed (using the "-p" switch") if the speech audio file doesn't segment well
 enough with this default.
 
@@ -196,5 +232,5 @@
 http://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html
 
-=item 2 - HTK Hidden Markov Model Toolkit (note: the source is "open" - i.e. you can read the code, but there are distribution restrictions)
+=item 2 - HTK Hidden Markov Model Toolkit (note: the source is "open" - i.e. you can read the code - but there are distribution restrictions)
 
 http://htk.eng.cam.ac.uk/
@@ -206,6 +242,8 @@
 
 Term::ReadLine::Gnu
-
-
+Audio::Wav
+Lingua::EN::Numbers
+XML::LibXML
+File::Copy
 
 =cut
@@ -415,6 +453,6 @@
 $self->{"verify_segments"}=1;
 $self->{"verify_out_of_vocabulary_pronunciations"}=1;
-$self->{"README"}="AudioBook/input_files/README";
-$self->{"LICENSE"}="AudioBook/input_files/LICENSE";
+$self->{"README"}="AudioBook/test/README";
+$self->{"LICENSE"}="AudioBook/test/LICENSE";
 } elsif ($opt_a and $opt_t) {
 if (-r $opt_a) {
@@ -695,5 +733,5 @@
 =head1 Change Log
 
-2008/06/12 - 0.1 - created CommandLine class to permit interactive validation of missing word pronunciations
+2008/06/12 - 0.2.2 - created CommandLine class to permit interactive validation of missing word pronunciations
 2008/06/1 - 0.2.1 - refacture to create Chapter, Segments & MissingWords classes
 2008/06/09 - 0.2.1 - refacture to create Chapter, Segments & MissingWords classes
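The "Automatic Numeric Conversion" text added in this changeset says that 4 digit numbers need checking, and that numbers between 1000 and 2100 are assumed to be years. As a rough aid for that manual check, the sketch below lists every standalone 4 digit number in a text and notes which way that rule would treat it. It is only an illustration using core Perl: the helper name is hypothetical and not part of AudioBook.pm, and treating the 1000 to 2100 range as inclusive is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of AudioBook.pm): return one [number, kind]
# pair for each standalone 4 digit number found in the text.
# Assumption: the "year" range of 1000 to 2100 is inclusive at both ends.
sub four_digit_numbers_to_review {
    my ($text) = @_;
    my @review;
    # Match exactly four digits that are not part of a longer digit run.
    while ($text =~ /(?<!\d)(\d{4})(?!\d)/g) {
        my $number = $1;
        my $kind = ($number >= 1000 && $number <= 2100) ? 'year' : 'cardinal';
        push @review, [$number, $kind];
    }
    return @review;
}

# Made-up sample sentence, for illustration only.
my $sample = "In 1850 the mill produced 7500 yards; by 1900 it was 8500.";
printf "%s: treated as %s, check against the audio\n", @$_
    for four_digit_numbers_to_review($sample);
```

Each reported number still has to be compared with what the reader actually said, since a reading such as "seventy five hundred" cannot be detected from the text alone.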
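The documentation in this changeset gives the default pause duration as 2000000 in units of 100ns and suggests trying "-p 1000000" when prompts come out too long. To make those values easier to reason about, here is a minimal sketch of the unit conversion. The helper names are hypothetical and not part of AudioBook.pm; the only fact relied on is the 100ns unit stated in the POD.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One second contains 10,000,000 units of 100 ns.
use constant HTK_UNITS_PER_SECOND => 10_000_000;

# Hypothetical helper: convert a "-p" value (100 ns units) to seconds.
sub htk_units_to_seconds {
    my ($units) = @_;
    return $units / HTK_UNITS_PER_SECOND;
}

# Hypothetical helper: convert a pause length in seconds to a "-p" value,
# rounded to the nearest whole unit.
sub seconds_to_htk_units {
    my ($seconds) = @_;
    return int($seconds * HTK_UNITS_PER_SECOND + 0.5);
}

# The default and the reduced value mentioned in the documentation.
printf "%d units = %.2f s\n", $_, htk_units_to_seconds($_) for (2000000, 1000000);
```

Read this way, the default corresponds to a 0.2 second pause and the reduced value to 0.1 second, which may help when picking an intermediate value to try.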