wiki:PhoneSet
Last modified on 01/02/08 17:46:56

Phoneme Inventory

Most likely, we will have to write our own pronunciation dictionary, so we are free to design our own phoneme inventory. Personally, I (timo) don't know much about phoneme inventories for ASR (my background is in TTS). I think the guiding principle is "as few as possible, as many as necessary".

Phoneme Inventories for German

Comparison:

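|| GSAMPA, VMlex, Bas, TimoDA || my opinion || Example || IPA ||
|| Plosives ||
|| p, p, p, p || + || Pein || [p] ||
|| b, b, b, b || + || Bein || [b] ||
|| t, t, t, t || + || Teich || [t] ||
|| d, d, d, d || + || Deich || [d] ||
|| k, k, k, k || + || Kunst || [k] ||
|| g, g, g, g || + || Gunst || [g] ||
|| ?, ?, Q, ? || +(Q) || Verein || [ʔ] ||
|| Affricates ||
|| pf, pf || p f || Pfahl || ||
|| ts, ts || t s || Zahl || ||
|| tS, tS || t S || deutsch || ||
|| dZ || d Z || Dschungel || ||
|| Fricatives ||
|| f, f, f, f || + || fast || [f] ||
|| v, v, v, v || + || was || [v] ||
|| T || + || engl. thing || [θ] ||
|| D || + || engl. this || [ð] ||
|| s, s, s, s || + || Tasse || [s] ||
|| z, z, z, z || + || Hase || [z] ||
|| S, S, S, S || + || waschen || [ʃ] ||
|| Z, Z, Z, Z || + || Loge || [ʒ] ||
|| C, C, C, C || + || sicher || [ç] ||
|| w || + || engl. what || [w] ||
|| j, j, j, j || + || Jahr || [j] ||
|| x, x, x, x || + || Buch || [x] ||
|| h, h, h, h || + || Hand || [h] ||
|| Sonorants ||
|| m, m, m, m || + || mein || [m] ||
|| n, n, n, n || + || nein || [n] ||
|| N, N, N, N || + || Ding || [ŋ] ||
|| l, l, l, l || + || Leim || [l] ||
|| R, r, r, r || + || Reim || [ʁ] ||
|| R || || engl. wrong || [ʀ] ||
|| Short vowels ||
|| i || i:/I || Politik || [i] ||
|| I, I, I, I || + || Sitz || [ɪ] ||
|| e, e || e: || Meteor || [e] ||
|| E, E, E, E || + || Bett || [ɛ] ||
|| a, a, a, a || + || Satz || [a] ||
|| o || o:/O || Politik || [o] ||
|| O, O, O, O || + || Trotz || [ɔ] ||
|| u, u || u:/U || Kulisse || [u] ||
|| U, U, U, U || + || Schutz || [ʊ] ||
|| y, y || y:/Y || Kyrillisch || [y] ||
|| Y, Y, Y, Y || + || hübsch || [ʏ] ||
|| 2 || 2:/Y || Ökonom || [ø] ||
|| 9, 9, 9, 9 || + || plötzlich || [œ] ||
|| Long vowels ||
|| i:, i:, i:, i: || + || Lied || [iː] ||
|| e:, e:, e:, e: || + || Beet || [eː] ||
|| E:, E:, E:, E: || + || spät || [ɛː] ||
|| a:, a:, a:, a: || + || Tat || [aː] ||
|| o:, o:, o:, o: || + || rot || [oː] ||
|| u:, u:, u:, u: || + || Blut || [uː] ||
|| y:, y:, y:, y: || + || süß || [yː] ||
|| 2:, 2:, 2:, 2: || + || blöd || [øː] ||
|| Diphthongs ||
|| aI, aI, aI, aI || + || Eis || [aɪ̯] ||
|| EI || E I || hey || [ɛɪ̯] ||
|| aU, aU, aU, aU || + || Haus || [aʊ̯] ||
|| OY, OY, OY, OY || + || Kreuz || [ɔʏ̯] ||
|| Nasalized vowels ||
|| a~, a~, a~ || a N/O || Restaurant || [ɑ̃] ||
|| E~, e~ || E:(N) || Teint || [ɛ̃] ||
|| O~, o~ || O N || Saison || [ɔ̃] ||
|| 9~, u~ || y m/9 || Parfum || [œ̃] ||
|| Schwa vowels ||
|| @, @, @, @ || + || bitte || [ə] ||
|| 6, 6, 6, 6 || + || besser || [ɐ] ||
|| 6-Diphthongs || 6 V || || ||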

Proposition

I (timo) suggest that we use the phonemes I marked with a '+' sign in the table above. For the others, I have given the best matches within the proposed set. The phone set doesn't contain any foreign-language phones (see the BITS article).
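The "best match" idea can be sketched in a few lines of Python. This is only an illustration: the substitutions are read off the "my opinion" column of the table, and wherever the table offers alternatives (such as "i:/I" or "y m/9") the sketch simply picks one of them, which is an assumption and not a decision made on this page. The names `BEST_MATCH` and `to_proposed` are made up for this example.

```python
# Best-match substitutions taken from the "my opinion" column of the table.
# ASSUMPTION: where the table lists alternatives, one is picked arbitrarily
# (e.g. "i:" for "i:/I", "a N" for "a N/O", "9" for "y m/9").
BEST_MATCH = {
    "pf": ["p", "f"],
    "ts": ["t", "s"],
    "tS": ["t", "S"],
    "dZ": ["d", "Z"],
    "EI": ["E", "I"],
    "i": ["i:"],
    "e": ["e:"],
    "o": ["o:"],
    "u": ["u:"],
    "y": ["y:"],
    "2": ["2:"],
    "a~": ["a", "N"],
    "E~": ["E:", "N"],
    "O~": ["O", "N"],
    "9~": ["9"],
}


def to_proposed(phones):
    """Replace every phone that has a best match; keep all others unchanged."""
    result = []
    for phone in phones:
        result.extend(BEST_MATCH.get(phone, [phone]))
    return result


# "Zahl" with the affricate /ts/ split into its two parts.
print(to_proposed("ts a: l".split()))  # ['t', 's', 'a:', 'l']
```

Keeping the substitutions in one table means a later decision (for example, to keep the affricates after all) only requires deleting entries.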

We should probably mark words that would benefit from, or even require, foreign-language phones. For the time being, we can probably just avoid foreign words and add them to our speech database later on.

Once we decide on a phone set, we can write a mapping from the eSpeak output to our phone set and start building a dictionary. Any opinions?
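Such a mapping could look roughly like the following Python sketch. It rewrites an unspaced phoneme string into a list of phones by always trying the longest source symbol first. The source symbols in `EXAMPLE_MAPPING` are placeholders chosen for illustration; they have not been checked against real eSpeak output, and treating `'` as a stress mark to be dropped is likewise an assumption.

```python
def convert(source, mapping):
    """Greedy longest-match rewrite of a phoneme string into target phones."""
    # Try longer source symbols first so that "aI" wins over "a" + "I".
    keys = sorted(mapping, key=len, reverse=True)
    phones = []
    pos = 0
    while pos < len(source):
        for key in keys:
            if source.startswith(key, pos):
                phones.extend(mapping[key])
                pos += len(key)
                break
        else:
            # No symbol matched: report the problem instead of guessing.
            raise ValueError(f"no mapping for {source[pos:]!r}")
    return phones


# HYPOTHETICAL source symbols, not verified eSpeak mnemonics.
EXAMPLE_MAPPING = {
    "'": [],            # assumed stress mark, dropped
    "aI": ["aI"],
    "a:": ["a:"],
    "ts": ["t", "s"],   # affricate split into two phones
    "a": ["a"],
    "I": ["I"],
    "s": ["s"],
    "t": ["t"],
    "l": ["l"],
}

print(convert("'aIs", EXAMPLE_MAPPING))   # ['aI', 's']
print(convert("tsa:l", EXAMPLE_MAPPING))  # ['t', 's', 'a:', 'l']
```

Raising an error on unknown symbols is a deliberate choice here: while the phone set is still under discussion, a silent fallback would hide gaps in the mapping.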

Discussion

Hi Timo,

You might want to post a note in the German forum (with a link to this page) asking for feedback.

Ken

Hello Timo,

Great work. Here is my opinion:

  1. Why do you need "@" (bitte) or "6" (besser) if you are already using "E" (Bett)? OK, maybe this is necessary. It seems similar to the problem with the English words "item" and "pet": the vowel "e" is pronounced differently in each, but the difference is nearly indistinguishable.
  2. The fricative "j" (German: Jahr) is necessary. But in English, this sound is written with the character "y" (English: yes). Why not use "y" instead of "j"? English speakers would read the character "j" as in "judge."
  3. It is OK to avoid foreign words. But why not include, for example, the fricative "T" (English: thing)? Then it would be possible to include words like "Thunderbird" in the German dictionary. In the long term, we need the fricative "T", so why not include it from the beginning?

Greetings, Ralf

Reply

Hi Ralf, hi Ken,

thanks for your feedback. I have reworked the table above and will post a link in the forums. Also, I have started work on translating eSpeak output to this format (it will be flexible enough to allow for future changes). Now the technical part:

  1. Both schwa vowels (/@/ and /6/) are definitely necessary, even though they seem close to /E/. They differ from /E/ in that they are never stressed, are much shorter, and have a different tongue position (/E/ is a front vowel, whereas the schwa vowels are central). This shows in the formants and thus in the features.
  2. The proposed phoneme set is based on SAMPA, which is in turn based on the International Phonetic Alphabet. We should stay as close to it as possible and only deviate if necessary. Thus, using 'y' instead of 'j' for the voiced palatal approximant (the sound in "ja" and in "yes") would be wrong. Also, 'y' is already taken by the 9th cardinal vowel (front, close and rounded) as in "süß" (there with a length mark). In our system, the consonants in "judge" would be transcribed as /d Z/ for the first (as in "Dschungel") and maybe /t S/ for the second (if it is devoiced).
  3. I am really unsure about the foreign-language phones. As I see it, keeping the number of phonemes low is even more important in ASR than in TTS. It's really a trade-off between precision and data sparsity. That is also the reason why I left out the affricates /pf/, /ts/, /tS/ and /dZ/. Now for the phonemes in question (and more thoughts about the phonemes in the BITS paper):

3a. I admit that /T/, /D/, /R/ and /w/ are probably necessary. We can also assume that German speakers now speak English well enough not to mispronounce them as /s/, /z/, /r/ and /v/. Whether we need them now or later is a matter of procedure, and we should probably include them right from the start.

3b. At the same time, I am very unsure about /L/. I don't even know what it is. For American and British English, SAMPA only defines /l/. SAMPA /L/ is defined as the sound in Italian "famiglia". I don't think this is what the BITS paper means. The BITS paper seems to use it to pronounce the American English velarized ("dark") /l/, as in "well", without a German accent. I do not think that Germans (while speaking German) correctly distinguish these two sounds. Also, Germans who learned British instead of American pronunciation will not velarize /l/. Thus, I do not think that we should add /L/ to our phoneme set.

3c. The diphthong /@U/ (as in British English "nose") is definitely not accounted for in our system. On the other hand, "nose" is transcribed /n o z/ in American English. Again, Germans speaking German will likely stay within the range of their native vowel system. We should ignore /@U/ and transcribe it as /o/ instead.

3d. The diphthong /EI/ occurs in colloquial German ("Hey", "ey", "e-bay", ...). But how does it differ from its two constituent vowels next to each other, /E I/? I have thought hard about examples, but I really cannot find a word that contains /E/ and /I/ next to each other (for example at a syllable boundary) and that would sound different with the diphthong /EI/.

3e. About the nasals /a~/, /E~/, /o~/ and /9~/ (the latter in addition to BITS) I am quite unsure: who ever says /p a r f 9~/ (for "Parfum")? I say /p a @ f y m/ when speaking German. I say /r E s t o: r a N/ and I say /z @ z O N/. I never say the funny word "Teint" for the color of my skin (this is in part because /E~/ is just outside the scope of my German vowel system and I don't know how else to pronounce the word). There *are* people who speak 'correctly', but I think we would not even improve recognition rates for them, because of the additional ambiguity due to the extra phonemes in our models. There wouldn't be enough data to reliably train these phonemes either. This is especially true for the phoneme /E~/, which on the other hand would be the most valuable, because there is really no other way to say things like "Teint". To sum up: I'd say we stick to the best matches in the table. Then again: does anybody have experience with ASR and nasalization? How different are the feature vectors? Might we *have* to model all the nasals if we ever expect nasalized vowels in our input?
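Two of the points above (whether any word really contains /E/ directly followed by /I/, and how little data a rare phone would get) can be checked mechanically once a dictionary exists. The Python sketch below shows one way to do that. The three-word `LEXICON` only reuses the transcriptions given in 3e and stands in for a real dictionary; the function names are made up for this example. Note that counting dictionary entries is only a rough proxy for the amount of training audio per phone.

```python
from collections import Counter

# Toy lexicon built from the transcriptions quoted in 3e; a real dictionary
# would be loaded from a file instead.
LEXICON = {
    "Parfum": "p a @ f y m".split(),
    "Restaurant": "r E s t o: r a N".split(),
    "Saison": "z @ z O N".split(),
}


def phone_counts(lexicon):
    """Count how often each phone occurs across all dictionary entries."""
    counts = Counter()
    for phones in lexicon.values():
        counts.update(phones)
    return counts


def words_with_pair(lexicon, first, second):
    """Return the words in which `first` is directly followed by `second`."""
    return [
        word
        for word, phones in lexicon.items()
        if any(a == first and b == second for a, b in zip(phones, phones[1:]))
    ]


counts = phone_counts(LEXICON)
print(counts["N"], counts["@"])             # 2 2
print(words_with_pair(LEXICON, "E", "I"))   # []
print(words_with_pair(LEXICON, "a", "N"))   # ['Restaurant']
```

Run over the full dictionary, an empty result for the pair /E I/ would support treating /EI/ and /E I/ as equivalent, and very low counts for a phone would argue for its best match instead.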