Automatic Speech Recognition
Speech recognition is, in its most general form, a conversion from an acoustic waveform to a written equivalent of the message information. The nature of the speech recognition problem depends heavily on the constraints placed on the speaker, the speaking situation, and the message context. The potential applications of speech recognition systems are many and varied, e.g. a voice-operated typewriter, voice communication with computers, and a speech recognition system combined with a speech synthesis system.
A source-channel mathematical model is often used to formulate
speech recognition problems. The source, the word sequence the speaker intends to say,
is passed through a noisy communication channel that consists of the speaker’s vocal
apparatus, which produces the speech waveform, and the speech signal processing component
of the speech recognizer. Finally, the speech decoder aims to decode the acoustic signal
into a word sequence that is, ideally, close to the original word sequence.
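The source-channel view above can be summarized in one line of notation. The symbols W, X and \hat{W} are labels introduced here for illustration; the text itself does not name them.

```latex
% Assumed notation (not from the text):
%   W        word sequence intended by the speaker (the source)
%   X        acoustic signal after the vocal apparatus and signal processing (channel output)
%   \hat{W}  word sequence produced by the speech decoder
% \xrightarrow requires the amsmath package.
\[
  W \;\xrightarrow{\text{noisy channel}}\; X \;\xrightarrow{\text{decoder}}\; \hat{W},
  \qquad \hat{W} \approx W
\]
```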
A typical practical speech recognition system consists of the basic components shown in
the dotted box of Figure 1.2. Applications interface with the decoder to get recognition results
that may be used to adapt other components in the system. Acoustic models include the
representation of knowledge about acoustics, phonetics, microphone and environment variability,
gender and dialect differences among speakers, etc. Language models refer to a system’s
knowledge of what constitutes a possible word, what words are likely to co-occur, and
in what sequence. The semantics and functions related to an operation a user may wish to
perform may also be necessary for the language model. Many uncertainties exist in these
areas, associated with speaker characteristics, speech style and rate, recognition of basic
speech segments, possible words, likely words, unknown words, grammatical variation, noise
interference, nonnative accents, and confidence scoring of results. A successful speech recognition
system must contend with all of these uncertainties. But that is only the beginning.
The acoustic uncertainties of the different accents and speaking styles of individual speakers
are compounded by the lexical and grammatical complexity and variations of spoken language,
which are all represented in the language model.
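As an illustration of the language-model knowledge described above (which words are likely to co-occur, and in what sequence), the following is a minimal sketch of a bigram model estimated by counting. The bigram form, the function names, and the three-sentence corpus are assumptions made for this example; the text does not prescribe a particular kind of language model.

```python
from collections import Counter


def train_bigram(sentences):
    """Count context words and adjacent word pairs in whitespace-tokenized sentences."""
    unigrams = Counter()  # how often each word appears as a left context
    bigrams = Counter()   # how often each (previous, current) pair appears
    for sentence in sentences:
        # <s> and </s> are conventional sentence-boundary markers.
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    return unigrams, bigrams


def bigram_prob(unigrams, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev was never seen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


# Made-up toy corpus, for illustration only.
corpus = ["call home", "call the office", "call home now"]
unigrams, bigrams = train_bigram(corpus)

p_home = bigram_prob(unigrams, bigrams, "call", "home")      # pair seen in the corpus
p_office = bigram_prob(unigrams, bigrams, "call", "office")  # pair never seen
print(p_home, p_office)
```

The unseen pair receives probability zero here. A practical system would smooth these estimates so that word sequences missing from the training data are not ruled out entirely.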
The speech signal is processed in the signal processing module, which extracts salient
feature vectors for the decoder. The decoder uses both acoustic and language models to generate
the word sequence that has the maximum posterior probability for the input feature
vectors. It can also provide information needed for the adaptation component to modify either
the acoustic or language models so that improved performance can be obtained.
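The maximum-posterior rule above can be sketched in a few lines. By Bayes' rule, the posterior probability of a word sequence given the feature vectors is proportional to the acoustic-model score of the features given the words multiplied by the language-model probability of the words, so the decoder can pick the candidate that maximizes the sum of the two log scores. The candidate list, the scores, and the function name below are hypothetical values chosen for illustration, and a real decoder searches a far larger hypothesis space than a short explicit list.

```python
import math


def map_decode(candidates, acoustic_logprob, language_logprob):
    """Return the candidate maximizing log P(X | W) + log P(W)."""
    return max(candidates, key=lambda w: acoustic_logprob(w) + language_logprob(w))


# Hypothetical scores for two competing hypotheses (illustrative numbers only).
ACOUSTIC = {  # log P(X | W): how well each word sequence matches the audio
    ("recognize", "speech"): math.log(0.30),
    ("wreck", "a", "nice", "beach"): math.log(0.35),
}
LANGUAGE = {  # log P(W): how plausible each word sequence is on its own
    ("recognize", "speech"): math.log(0.010),
    ("wreck", "a", "nice", "beach"): math.log(0.0001),
}

best = map_decode(list(ACOUSTIC), ACOUSTIC.get, LANGUAGE.get)
print(" ".join(best))
```

In this toy setup the acoustic score alone slightly favors the second hypothesis, and the language model tips the decision to the first, which shows why the decoder combines both models.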
References:
[1] Xuedong Huang, Alex Acero, Hsiao-Wuen Hon: Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall, 2001
[2] L.R. Rabiner, R.W. Schafer: Digital Processing of Speech Signals, ISBN 0-13-213603-1, Prentice-Hall, Inc., Englewood Cliffs, New Jersey 07632