Version 24 (modified by DavidGelbart, 12 years ago)


Automatic Speech Recognition Theory and Algorithms

This page is about resources for learning more about the theory and algorithms behind automatic speech recognition (ASR) technology.

Online educational resources:

Book recommendations:

David Gelbart's book recommendations:

  • Spoken Language Processing: A Guide to Theory, Algorithm and System Development by Xuedong Huang, Alex Acero, Hsiao-Wuen Hon. The table of contents can be viewed here.
  • Speech and Audio Signal Processing: Processing and Perception of Speech and Music by Ben Gold and Nelson Morgan. The table of contents can be viewed online.
  • Speech Processing -- A Dynamic and Optimization-Oriented Approach by Li Deng and D. O'Shaughnessy. The table of contents can be viewed online.
  • Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition by Daniel Jurafsky and James H. Martin. The table of contents (and long excerpts) can be viewed here.
  • Pattern Classification by Duda, Hart, and Stork. This is about pattern recognition in general, not ASR in particular. The table of contents can be viewed here.
  • A Course in Phonetics by Ladefoged. This is a good place to learn about phonemes (which are used in ASR pronunciation dictionaries), acoustic phonetics (which relates to the design of ASR feature extraction methods such as MFCC), and articulatory phonetics (which is often used in designing decision tree rules for HMM state tying). The audio and video files accompanying the book are here and you might find them interesting even if you don't own the book.

(The books vary in focus, so I've mentioned how to find tables of contents online. I also recommend checking out reviews of the books online.)

Gunnar Evermann's book recommendations (a summary of this page):

  • Pattern Classification by Duda, Hart, and Stork
  • Introduction to Statistical Pattern Recognition by Fukunaga
  • Automatische Spracherkennung -- Grundlagen, statistische Modelle und effiziente Algorithmen by Schukat-Talamazzini
  • The above-mentioned book by Huang, Acero and Hon.
  • Automatic Speech Recognition -- The Development of the SPHINX Recognition System by Lee.
  • Statistical Methods for Speech Recognition by Jelinek.
  • Corpus-Based Methods in Language and Speech Processing, edited by Young and Bloothooft

Applications of ASR:

Speech Technology Magazine is a good source for information about applications of speech technology. The magazine can be read free online. They also have a blog.

Current ASR research:

If you want to know what techniques are currently attracting attention at the cutting edge of ASR research, papers that describe speech recognition systems that were built for official project benchmarks can be a good source of information. These systems tend to use a lot of different, carefully chosen techniques. Many of these papers have a year (the year of the benchmark) and the word "system" in the title, which makes them easy to find. For example, many such papers can be found with a Google Scholar search for:

intitle:2007 intitle:system speech recognition

Some of these papers use the word "recent" in the title instead of the year, in which case the search becomes:

intitle:recent intitle:system speech recognition
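
The two searches above differ only in the first title term, so they can be generated for any benchmark year. The following is a minimal sketch, not part of the original page: the function names are hypothetical, and the Google Scholar URL pattern is an assumption that may need adjusting.

```python
from urllib.parse import urlencode


def benchmark_query(tag: str) -> str:
    """Build the title-restricted query for a year (e.g. "2007") or the word "recent"."""
    return f"intitle:{tag} intitle:system speech recognition"


def scholar_url(query: str) -> str:
    """Wrap a query in a search URL (assumed Google Scholar URL pattern)."""
    return "https://scholar.google.com/scholar?" + urlencode({"q": query})


# Print the query and the corresponding URL for a few title terms.
for tag in ["2006", "2007", "recent"]:
    query = benchmark_query(tag)
    print(query)
    print(scholar_url(query))
```

Changing the list of title terms is enough to cover other benchmark years.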

NIST organizes many of these benchmarks and they have information on their web site.

Different types of papers:

Sometimes you will have a choice whether to read about some work in a conference paper, a journal paper, or a master's or PhD thesis. A journal paper will often have more background and details than a conference paper, and a thesis will often have more than a journal paper.

A conference paper usually reaches publication sooner after it is finished than a journal paper does, and a PhD thesis often sooner still. So a conference or journal paper may be published at the same time as the author's PhD thesis, yet the thesis contains many more results because it was actually finished later.