Speech Processing

Automatic Speech Recognition

Speech recognition is, in its most general form, a conversion from an acoustic waveform to a written equivalent of the message information. The nature of the speech recognition problem depends heavily on the constraints placed on the speaker, the speaking situation and the message context. The potential applications of speech recognition systems are many and varied, e.g. a voice-operated typewriter, voice communication with computers, and a speech recognition system combined with a speech synthesis system.
A source-channel mathematical model is often used to formulate speech recognition problems. The source, the word sequence the speaker intends to convey, is passed through a noisy communication channel that consists of the speaker's vocal apparatus, which produces the speech waveform, and the speech signal processing component of the speech recognizer. Finally, the speech decoder aims to decode the acoustic signal into a word sequence that is, ideally, close to the original word sequence.
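The decoding step of the source-channel model is commonly written with Bayes' rule. The symbols below are assumptions introduced here for illustration, not notation taken from the text: W is the source word sequence, X is the acoustic observation that reaches the decoder, and \hat{W} is the decoded word sequence.

```latex
\[
\begin{aligned}
\hat{W} &= \arg\max_{W} P(W \mid X) \\
        &= \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)} \\
        &= \arg\max_{W} P(X \mid W)\, P(W)
\end{aligned}
\]
```

P(X) does not depend on W, so it can be dropped from the maximization. Under this reading, P(X | W) is the quantity an acoustic model estimates and P(W) is the quantity a language model estimates.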
A typical practical speech recognition system consists of the basic components shown in the dotted box of Figure 1.2. Applications interface with the decoder to get recognition results, which may also be used to adapt other components in the system. Acoustic models include the representation of knowledge about acoustics, phonetics, microphone and environment variability, gender and dialect differences among speakers, etc. Language models refer to a system's knowledge of what constitutes a possible word, what words are likely to co-occur, and in what sequence. The semantics and functions related to an operation a user may wish to perform may also be necessary for the language model.
Many uncertainties exist in these areas, associated with speaker characteristics, speech style and rate, recognition of basic speech segments, possible words, likely words, unknown words, grammatical variation, noise interference, nonnative accents, and confidence scoring of results. A successful speech recognition system must contend with all of these uncertainties, and that is only the beginning: the acoustic uncertainties of the different accents and speaking styles of individual speakers are compounded by the lexical and grammatical complexity and variations of spoken language, which are all represented in the language model.
The speech signal is processed in the signal processing module, which extracts salient feature vectors for the decoder. The decoder uses both acoustic and language models to generate the word sequence that has the maximum posterior probability for the input feature vectors. It can also provide the information the adaptation component needs to modify either the acoustic or the language models so that performance improves.
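How a decoder might combine acoustic and language model scores can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the candidate word sequences, the probability values, and the names ACOUSTIC_LOGPROB, BIGRAM_PROB, language_logprob and decode are invented for illustration and do not come from any real recognizer. A practical decoder searches a very large hypothesis space rather than scoring a fixed list of candidates.

```python
import math

# Hypothetical acoustic scores: log P(X | W) for each candidate word sequence W.
# The values are made up so that acoustics alone slightly favor the wrong answer.
ACOUSTIC_LOGPROB = {
    ("recognize", "speech"): math.log(0.20),
    ("wreck", "a", "nice", "beach"): math.log(0.25),
}

# Hypothetical bigram language model: P(word | previous word).
# "<s>" marks the start of the utterance.
BIGRAM_PROB = {
    ("<s>", "recognize"): 0.10,
    ("recognize", "speech"): 0.50,
    ("<s>", "wreck"): 0.01,
    ("wreck", "a"): 0.20,
    ("a", "nice"): 0.05,
    ("nice", "beach"): 0.02,
}


def language_logprob(words, floor=1e-6):
    """Return log P(W) under the bigram model; unseen bigrams get a small floor."""
    total = 0.0
    previous = "<s>"
    for word in words:
        total += math.log(BIGRAM_PROB.get((previous, word), floor))
        previous = word
    return total


def decode(candidates):
    """Return the candidate W that maximizes log P(X | W) + log P(W)."""
    return max(
        candidates,
        key=lambda words: ACOUSTIC_LOGPROB[words] + language_logprob(words),
    )


acoustic_only = max(ACOUSTIC_LOGPROB, key=ACOUSTIC_LOGPROB.get)
best = decode(list(ACOUSTIC_LOGPROB))
print("acoustic score only:", " ".join(acoustic_only))
print("acoustic + language:", " ".join(best))
```

With these invented numbers, the acoustic score alone prefers "wreck a nice beach", while adding the language model score makes "recognize speech" the winning hypothesis. Scores are added in the log domain to avoid numerical underflow when many probabilities are multiplied.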

References:

[1] Xuedong Huang, Alex Acero, Hsiao-Wuen Hon: Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall, 2001

[2] L.R. Rabiner, R.W. Schafer: Digital Processing of Speech Signals, ISBN 0-13-213603-1, Prentice-Hall, Inc., Englewood Cliffs, New Jersey 07632