Lawrence R. Rabiner et al., "Speech Recognition by Machine."
2000 CRC Press LLC.
Speech Recognition by Machine

Lawrence R. Rabiner
AT&T Labs — Research

B.H. Juang
Bell Laboratories, Lucent Technologies

47.1  Introduction
47.2  Characterization of Speech Recognition Systems
47.3  Sources of Variability of Speech
47.4  Approaches to ASR by Machine
      The Acoustic-Phonetic Approach [1] • "Pattern-Matching" Approach [2] • Artificial Intelligence Approach [3, 4]
47.5  Speech Recognition by Pattern Matching
      Speech Analysis • Pattern Training • Pattern Matching • Decision Strategy • Results of Isolated Word Recognition
47.6  Connected Word Recognition
      Performance of Connected Word Recognizers
47.7  Continuous Speech Recognition
      Sub-Word Speech Units and Acoustic Modeling • Word Modeling From Sub-Word Units • Language Modeling Within the Recognizer • Performance of Continuous Speech Recognizers
47.8  Speech Recognition System Issues
      Robust Speech Recognition [18] • Speaker Adaptation [25] • Keyword Spotting [26] and Utterance Verification [27] • Barge-In
47.9  Practical Issues in Speech Recognition
47.10 ASR Applications
References
47.1 Introduction
Over the past several decades a need has arisen to enable humans to communicate with machines in order to control their actions or to obtain information. Initial attempts at providing human-machine communications led to the development of the keyboard, the mouse, the trackball, the touch screen, and the joystick. However, none of these communication devices provides the richness or the ease of use of speech, which has been the most natural form of communication between humans for tens of centuries. Hence, a need has arisen to provide a voice interface between humans and machines. This need has been met, to a limited extent, by speech processing systems which enable a machine to speak (speech synthesis systems) and which enable a machine to understand (speech recognition systems) human speech. We concentrate on speech recognition systems in this section.
Speech recognition by machine refers to the capability of a machine to convert human speech to a textual form, providing a transcription or interpretation of everything the human speaks while the machine is listening. This capability is required for tasks in which the human is controlling the actions of the machine using only limited speaking capability, e.g., while speaking simple commands or sequences of words from a limited vocabulary (e.g., digit sequences for a telephone number). In
the more general case, usually referred to as speech understanding, the machine need only recognize
a limited subset of the user input speech, namely the speech that specifies enough about the action
requested so that the machine can either respond appropriately, or initiate some action in response
to what was understood.
Speech recognition systems have been deployed in applications ranging from control of desktop
computers, to telecommunication services, to business services, and have achieved varying degrees
of success and commercialization.
In this section we discuss a range of issues involved in the design and implementation of speech
recognition systems.
47.2 Characterization of Speech Recognition Systems
A number of issues define the technology of speech recognition systems. These include:
1. The manner in which a user speaks to the machine. There are generally three modes of
speaking, including:
• isolated word (or phrase) mode in which the user speaks individual words (or
phrases) drawn from a specified vocabulary;
• connected word mode in which the user speaks fluent speech consisting entirely of
words from a specified vocabulary (e.g., telephone numbers);
• continuous speech mode in which the user can speak fluently from a large (often
unlimited) vocabulary.
2. The size of the recognition vocabulary, including:
• small vocabulary systems which provide recognition capability for up to 100 words;
• medium vocabulary systems which provide recognition capability for 100 to 1000 words;
• large vocabulary systems which provide recognition capability for over 1000 words.
3. The knowledge of the user’s speech patterns, including:
• speaker dependent systems which have been custom tailored to each individual
talker;
• speaker independent systems which work on broad populations of talkers, most of
which the system has never encountered or adapted to;
• speaker adaptive systems which customize their knowledge to each individual user
over time while the system is in use.
4. The amount of acoustic and lexical knowledge used in the system, including:
• simple acoustic systems which have no linguistic knowledge;
• systems which integrate acoustic and linguistic knowledge, where the linguistic
knowledge is generally represented via syntactical and semantic constraints on the
output of the recognition system.
5. The degree of dialogue between the human and the machine, including:
• one-way (passive) communication in which each user spoken input is acted upon;
• system-driven dialog systems in which the system is the sole initiator of a dialog,
requesting information from the user via verbal input;
• natural dialogue systems in which the machine conducts a conversation with the speaker, solicits inputs, acts in response to user inputs, or even tries to clarify ambiguity in the conversation.
47.3 Sources of Variability of Speech
Speech recognition by machine is inherently difficult because of the variability in the signal. Sources of this variability include:
1. Within-speaker variability in maintaining consistent pronunciation and use of words and phrases.
2. Across-speaker variability due to physiological differences (e.g., different vocal tract lengths), regional accents, foreign languages, etc.
3. Transducer variability while speaking over different microphones/telephone handsets.
4. Variability introduced by the transmission system (the media through which speech is transmitted, telecommunication networks, cellular phones, etc.).
5. Variability in the speaking environment, including extraneous conversations and acoustic background events (e.g., noise, door slams).
47.4 Approaches to ASR by Machine
47.4.1 The Acoustic-Phonetic Approach [1]
The earliest approaches to speech recognition were based on finding speech sounds and providing appropriate labels to these sounds. This is the basis of the acoustic-phonetic approach, which postulates that there exist finite, distinctive phonetic units (phonemes) in spoken language, and that these units are broadly characterized by a set of acoustic properties that are manifest in the speech signal over time. Even though the acoustic properties of phonetic units are highly variable, both with speakers and with neighboring sounds (the so-called coarticulation), it is assumed in the "acoustic-phonetic approach" that the rules governing the variability are straightforward and can be readily learned (by a machine). The first step in the acoustic-phonetic approach is a segmentation and labeling phase in which the speech signal is segmented into stable acoustic regions, followed by attaching one or more phonetic labels to each segmented region, resulting in a phoneme lattice characterization of the speech (see Fig. 47.1). The second step attempts to determine a valid word (or string of words) from the phonetic label sequences produced in the first step. In the validation process, linguistic constraints of the task (i.e., the vocabulary, the syntax, and other semantic rules) are invoked in order to access the lexicon for word decoding based on the phoneme lattice. The acoustic-phonetic approach has not been widely used in most commercial applications.
47.4.2 “Pattern-Matching” Approach [2]
The "pattern-matching approach" involves two essential steps, namely, pattern training and pattern comparison. The essential feature of this approach is that it uses a well-formulated mathematical framework, and establishes consistent speech pattern representations for reliable pattern comparison from a set of labeled training samples via a formal training algorithm. A speech pattern representation can be in the form of a speech template or a statistical model, and can be applied to a sound (smaller than a word), a word, or a phrase. In the pattern-comparison stage of the approach, a direct comparison is made between the unknown speech (the speech to be recognized) and each possible pattern learned in the training stage, in order to determine the identity of the unknown according to the goodness of match of the patterns. The pattern-matching approach has become the predominant method of speech recognition in the last decade and we shall elaborate on it in subsequent sections.

FIGURE 47.1: Segmentation and labeling for the word sequence "seven-six".
47.4.3 Artificial Intelligence Approach [3, 4]
The “artificial intelligence approach” attempts to mechanize the recognition procedure according to
the way a person applies intelligence in visualizing, analyzing, and characterizing speech based on a
set of measured acoustic features. Among the techniques used within this class of methods are the use of an expert system (e.g., a neural network) which integrates phonemic, lexical, syntactic, semantic, and even pragmatic knowledge for segmentation and labeling, and the use of tools such as artificial neural networks for learning the relationships among phonetic events. The focus in this approach has been mostly in the representation of knowledge and integration of knowledge sources. This method has not been used widely in commercial systems.
47.5 Speech Recognition by Pattern Matching
Figure 47.2 is a block diagram that depicts the pattern-matching framework. The speech signal is first analyzed and a feature representation is obtained for comparison with either stored reference templates or statistical models in the pattern-matching block. A decision scheme determines the word or phonetic class of the unknown speech based on the matching scores with respect to the stored reference patterns.
There are two types of reference patterns that can be used with the model of Fig. 47.2. The first type, called a nonparametric reference pattern [5] (or often a template), is a pattern created from one or more spoken tokens (exemplars) of the sound associated with the pattern. The second type, called a statistical reference model, is created as a statistical characterization (via a fixed type of model) of the behavior of a collection of tokens of the sound associated with the pattern. The hidden Markov model [6] is an example of the statistical model.

FIGURE 47.2: Block diagram of pattern-recognition speech recognizer.
The model of Fig. 47.2 has been used (either explicitly or implicitly) for almost all commercial and industrial speech recognition systems for the following reasons:
1. It is invariant to different speech vocabularies, user sets, feature sets, pattern matching algorithms, and decision rules.
2. It is easy to implement in software (and hardware).
3. It works well in practice.
We now discuss the elements of the pattern recognition model and show how it has been used in
isolated word, connected word, and continuous speech recognition systems.
47.5.1 Speech Analysis
The purpose of the speech analysis block is to transform the speech waveform into a parsimonious representation which characterizes the time-varying properties of the speech. The transformation is normally done on successive and possibly overlapped short intervals 10 to 30 msec in duration (i.e., short-time analysis) due to the time-varying nature of speech. The representation [7] could be spectral parameters, such as the output from a filter bank, a discrete Fourier transform (DFT), or a linear predictive coding (LPC) analysis, or it could be temporal parameters, such as the locations of various zero or level crossing times in the speech signal.
Empirical knowledge gained over decades of psychoacoustic studies suggests that the power spectrum contains the acoustic information necessary for high-accuracy sound identification. Studies in psychoacoustics also suggest that our auditory perception of sound power and loudness involves compression, leading to the use of the logarithmic power spectrum and the cepstrum [8], which is the Fourier transform of the log-spectrum. The low-order cepstral coefficients (up to 10 to 20) provide a parsimonious representation of the short-time speech segment which is usually sufficient for phonetic identification.
The cepstral parameters are often augmented by the so-called delta cepstrum [9], which characterizes dynamic aspects of the time-varying speech process.
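To make the analysis step concrete, the following sketch computes low-order cepstral coefficients from the log power spectrum of each short-time frame and augments them with a delta cepstrum. It is only a minimal illustration: the frame length, hop, FFT size, number of coefficients retained, and the delta window are assumed values, not parameters prescribed by the chapter.

```python
import numpy as np

def cepstral_features(signal, fs, frame_ms=25, hop_ms=10, n_cep=12):
    """Short-time log-spectral analysis followed by cepstral truncation."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame, n=512)
        log_power = np.log(np.abs(spectrum) ** 2 + 1e-10)  # compressed (log) power spectrum
        cepstrum = np.fft.irfft(log_power)                  # inverse DFT of the log spectrum
        feats.append(cepstrum[1:n_cep + 1])                 # keep the low-order cepstral coefficients
    return np.array(feats)

def delta(features, k=2):
    """Delta cepstrum: a simple regression over +/- k neighboring frames."""
    padded = np.pad(features, ((k, k), (0, 0)), mode="edge")
    num = sum(i * (padded[k + i:len(features) + k + i] - padded[k - i:len(features) + k - i])
              for i in range(1, k + 1))
    return num / (2 * sum(i * i for i in range(1, k + 1)))
```

A per-frame feature vector would then typically be the concatenation of the cepstral and delta-cepstral coefficients.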
47.5.2 Pattern Training
Pattern training is the method by which representative sound patterns (for the unit being trained) are converted into reference patterns for use by the pattern matching algorithm. There are several ways in which pattern training can be performed, including:
1. Casual training in which a single sound pattern is used directly to create either a template or a crude statistical model (due to the paucity of data).
2. Robust training in which several (typically 2 to 4) versions of the sound pattern (usually extracted from the speech of a single talker) are used to create a single merged template or statistical model.
3. Clustering training in which a large number of versions of the sound pattern (extracted from a wide range of talkers) is used to create one or more templates or a reliable statistical model of the sound pattern.
In order to better understand how and why statistical models are so broadly used in speech recognition, we now formally define an important class of statistical models, namely the hidden Markov model (HMM) [6].
The HMM
The HMM is a statistical characterization of both the dynamics (time-varying nature) and statics (the spectral characterization of sounds) of speech during speaking of a sub-word unit, a word, or even a phrase. The basic premise of the HMM is that a Markov chain can be used to describe the probabilistic nature of the temporal sequence of sounds in speech, i.e., the phonemes in the speech, via a probabilistic state sequence. The states in the sequence are not observed with certainty because the correspondence between linguistic sounds and the speech waveform is probabilistic in nature; hence the concept of a hidden model. Instead, the states manifest themselves through the second component of the HMM, which is a set of output distributions governing the production of the speech features in each state (the spectral characterization of the sounds). In other words, the output distributions (which are observed) represent the local statistical knowledge of the speech pattern within the state, and the Markov chain characterizes, through a set of state transition probabilities, how these sound processes evolve from one sound to another. Integrated together, the HMM is particularly well suited for modeling speech processes.

FIGURE 47.3: Characterization of a word (or phrase, or subword) using an N (= 5) state, left-to-right HMM, with continuous observation densities in each state of the model.
An example of an HMM of a speech pattern is shown in Fig. 47.3. The model has five states (corresponding to five distinct "sounds" or "phonemes" within the speech), and the state (corresponding to the sound being spoken) proceeds from left to right (as time progresses). Within each state (assumed to represent a stable acoustical distribution) the spectral features of the speech signal are characterized by a mixture Gaussian density of spectral features (called the observation density), along with an energy distribution and a state duration probability. The states represent the changing temporal nature of the speech signal; hence indirectly they represent the speech sounds within the pattern.
The training problem for HMMs consists of estimating the parameters of the statistical distributions within each state (e.g., means, variances, mixture gains, etc.), along with the state transition probabilities for the composite HMM. Well-established techniques (e.g., the Baum-Welch method [10] or the segmental K-means method [11]) have been defined for doing this pattern training efficiently.
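As a minimal illustration of these ideas, the sketch below defines a left-to-right HMM with a single diagonal-covariance Gaussian output density per state (a simplification of the mixture densities described above) and computes the forward-algorithm log-likelihood of a feature sequence. The five-state topology, self-loop probability, and placeholder means and variances are assumptions made for illustration; in practice the parameters would be estimated with Baum-Welch or segmental K-means training as cited above.

```python
import numpy as np

class LeftToRightHMM:
    """Minimal left-to-right HMM with one diagonal-covariance Gaussian per state
    (a simplification of the mixture observation densities described in the text)."""

    def __init__(self, n_states=5, n_dims=12, self_loop=0.6):
        self.A = np.zeros((n_states, n_states))      # left-to-right transition matrix
        for i in range(n_states - 1):
            self.A[i, i] = self_loop                 # stay in the current state
            self.A[i, i + 1] = 1.0 - self_loop       # or advance to the next state
        self.A[-1, -1] = 1.0
        self.pi = np.zeros(n_states)
        self.pi[0] = 1.0                             # always start in the first state
        self.means = np.zeros((n_states, n_dims))    # placeholders; set by training
        self.vars = np.ones((n_states, n_dims))

    def log_b(self, x):
        """Log output density of observation x in every state (diagonal Gaussian)."""
        diff = x - self.means
        return -0.5 * np.sum(np.log(2 * np.pi * self.vars) + diff ** 2 / self.vars, axis=1)

    def log_likelihood(self, X):
        """Forward algorithm in the log domain: log P(X | model) for a (frames x dims) sequence."""
        log_alpha = np.log(self.pi + 1e-300) + self.log_b(X[0])
        for x in X[1:]:
            m = log_alpha.max()
            log_alpha = m + np.log(np.exp(log_alpha - m) @ self.A + 1e-300) + self.log_b(x)
        m = log_alpha.max()
        return m + np.log(np.exp(log_alpha - m).sum())
```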
47.5.3 Pattern Matching
Pattern matching refers to the process of assessing the similarity between two speech patterns, one of which represents the unknown speech and one of which represents the reference pattern (derived from the training process) of each element that can be recognized. When the reference pattern is a "typical" utterance template, pattern matching produces a gross similarity (or dissimilarity) score. When the reference pattern consists of a probabilistic model, such as an HMM, the process of pattern matching is equivalent to using the statistical knowledge contained in the probabilistic model to assess the likelihood of the speech (which led to the model) being realized as the unknown pattern.
FIGURE 47.4: Results of time aligning two versions of the word "seven", showing linear alignment of the two utterances (top panel); optimal time-alignment path (middle panel); and nonlinearly aligned patterns (lower panel).
A major problem in comparing speech patterns is due to speaking rate variations. HMMs provide an implicit time normalization as part of the process for measuring likelihood. However, for template
approaches, explicit time normalization is required. Figure 47.4 demonstrates the effect of explicit time normalization between two patterns representing isolated word utterances. The top panel of the figure shows the log energy contour of the two patterns (for the spoken word "seven") — one called the reference (known) pattern and the other called the test (or unknown input) pattern. It can be seen that the inherent duration of the two patterns, 30 and 35 frames (where each frame is a 15-ms segment of speech), is different and that linear alignment is grossly inadequate for internally aligning events within the two patterns (compare the locations of the vowel peaks in the two patterns). A basic principle of time alignment is to nonuniformly warp the time scale so as to achieve the best possible matching score between the two patterns (regardless of whether the two patterns are of the same word identity or not). This can be accomplished by a dynamic programming procedure, often called dynamic time warping (DTW) [12] when applied to speech template matching. The "optimal" nonlinear alignment result of dynamic time warping is shown at the bottom of Fig. 47.4 in contrast to the linear alignment of the patterns at the top. It is clear that the nonlinear alignment provides a more realistic measure of similarity between the patterns.
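A minimal sketch of such a dynamic time warping computation is given below, using Euclidean frame distances and a simple symmetric local constraint; practical template recognizers add slope constraints, path weights, and endpoint relaxation.

```python
import numpy as np

def dtw_distance(test, ref):
    """Dynamic time warping distance between two feature sequences (frames x dims)."""
    T, R = len(test), len(ref)
    # Local frame-to-frame distances.
    local = np.array([[np.linalg.norm(test[i] - ref[j]) for j in range(R)] for i in range(T)])
    D = np.full((T, R), np.inf)
    D[0, 0] = local[0, 0]
    for i in range(T):
        for j in range(R):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                D[i - 1, j] if i > 0 else np.inf,                 # stretch the reference
                D[i, j - 1] if j > 0 else np.inf,                 # stretch the test
                D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # advance both
            )
            D[i, j] = local[i, j] + best_prev
    # Normalize by path length so scores are comparable across templates.
    return D[T - 1, R - 1] / (T + R)
```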
47.5.4 Decision Strategy
The decision strategy takes all the matching scores (from the unknown pattern to each of the stored reference patterns) into account, finds the "closest" match, and decides if the quality of the match is good enough to make a recognition decision. If not, the user is asked to provide another token of the speech (e.g., the word or phrase) for another recognition attempt. This is necessary because often the user may speak words that are incorrect in some sense (e.g., hesitation, incorrectly spoken word, etc.) or simply outside of the vocabulary of the recognition system.
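A toy version of this decision rule, assuming DTW-style dissimilarity scores such as those produced by the sketch above and an arbitrary illustrative rejection threshold, might look as follows; a real system would tune the threshold on held-out data.

```python
def decide(scores, reject_threshold=0.35):
    """Pick the best-matching reference pattern and reject poor matches.

    `scores` maps each vocabulary word to its (normalized) dissimilarity score;
    the rejection threshold here is an arbitrary illustrative value.
    """
    word, best = min(scores.items(), key=lambda kv: kv[1])
    if best > reject_threshold:
        return None   # no confident match: ask the user to repeat the utterance
    return word

# Example: distances against each stored template.
print(decide({"yes": 0.21, "no": 0.48, "repeat": 0.55}))   # -> "yes"
```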
47.5.5 Results of Isolated Word Recognition
Using the pattern recognition model of Fig. 47.2, and using either the non-parametric template
approach or the statistical HMM method to derive reference patterns, a wide variety of tests of the
recognizer have been performed on telephone speech with isolated word inputs in both speaker-
dependent (SD) and speaker-independent (SI) modes. Vocabulary sizes have ranged from as few
as 10 words (i.e., the digits zero–nine) to as many as 1109 words. Table 47.1 gives a summary of recognizer performance under the conditions described above.
TABLE 47.1 Performance of Isolated Word Recognizers

  Vocabulary           Mode   Word error rate (%)
  10 Digits            SI     0.1
                       SD     0.0
  39 Alphadigits       SI     7.0
                       SD     4.5
  129 Airline terms    SI     2.9
                       SD     1.0
  1109 Basic English   SD     4.3
47.6 Connected Word Recognition
The systems we have been describing in previous sections have all been isolated word recognition
systems. In this section we consider extensions of the basic processing methods described in previous sections in order to handle recognition of sequences of words, the so-called connected word recognition system.
The basic approach to connected word recognition is shown in Fig. 47.5. Assume we are given a fluently spoken sequence of words, represented by the (unknown) test pattern T, and we are also given a set of V reference patterns, {R_1, R_2, ..., R_V}, each representing one of the words in the vocabulary. The connected word recognition problem consists of finding the concatenated reference pattern, R^S, which best matches the test pattern, in the sense that the overall similarity between T and R^S is maximum over all sequence lengths and over all combinations of vocabulary words.
FIGURE 47.5: Illustration of the problem of matching a connected word string, spoken fluently,
using whole word patterns concatenated together to provide the best match.
There are several problems associated with solving the connected word recognition problem, as formulated above. First of all, we do not know how many words were spoken; hence, we have to consider solutions with a range on the number of words in the utterance. Second, we do not know, nor can we reliably find, word boundaries within the test pattern. Hence, we cannot use word boundary information to segment the problem into simple "word-matching" recognition problems. Finally, since the combinatorics of trying to solve the problem exhaustively (by trying to match every possible string) are exponential in nature, we need to devise efficient algorithms to solve this problem. Such efficient algorithms have been developed and they solve the connected word recognition problem by iteratively building up time-aligned matches between sequences of reference patterns and the unknown test pattern, one frame at a time [13, 14, 15].
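As a rough illustration of this frame-synchronous idea, the sketch below runs a heavily simplified one-pass dynamic programming search over whole-word templates and returns only the best overall match cost. The local path constraints are minimal, and backpointers (needed to recover the decoded word string), path normalization, and duration constraints, all present in the algorithms of [13, 14, 15], are omitted to keep the sketch short.

```python
import numpy as np

def connected_word_cost(test, templates):
    """One-pass, frame-synchronous DP over concatenations of whole-word templates.

    `test` is a (frames x dims) array; `templates` maps each vocabulary word to
    its own (frames x dims) template.  Returns the best accumulated distance
    over all word sequences.
    """
    names = list(templates)
    # D[v][j]: best cost of reaching frame j of template v at the current test frame.
    D = {v: np.full(len(templates[v]), np.inf) for v in names}

    for t, frame in enumerate(test):
        local = {v: np.linalg.norm(templates[v] - frame, axis=1) for v in names}
        if t == 0:
            for v in names:
                D[v][0] = local[v][0]          # every word may start at the first test frame
            continue
        best_word_end = min(D[v][-1] for v in names)   # best path that ended a word at t-1
        new_D = {}
        for v in names:
            prev = D[v]
            nd = np.empty_like(prev)
            # Word-start frame: continue within this word, or start a new word
            # after any word that ended at the previous test frame.
            nd[0] = local[v][0] + min(prev[0], best_word_end)
            # Interior frames: stay on the same template frame or advance by one.
            nd[1:] = local[v][1:] + np.minimum(prev[1:], prev[:-1])
            new_D[v] = nd
        D = new_D

    return min(D[v][-1] for v in names)
```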
47.6.1 Performance of Connected Word Recognizers
Typical recognition performance for connected word recognizers is given in Table 47.2 for a range
of vocabularies, and for a range of associated tasks. In the next section we will see how we exploit
linguistic constraints of the task to improve recognition accuracy for word strings beyond the level
one would expect on the basis of word error rates of the system.
TABLE 47.2 Performance of Connected Word Recognizers

  Vocabulary           Mode   Word error   Task                        String error
                              rate (%)                                 rate (%)
  10 Digits            SD     0.1          Variable length digit       0.4
                       SI     0.2          strings (1–7 digits)        0.8
  26 Letters of        SD     10.0         Name retrieval from a       4.0
  the alphabet         SI     10.0         directory of 1700 names     10.0
  129 Airline terms    SD     0.1          Sentences in a grammar      1.0
                       SI     3.0                                      10.0
47.7 Continuous Speech Recognition
The techniques used in connected word recognition systems cannot be extended to the problem of
continuous speech recognition for several reasons. First of all, as the size of the vocabulary of the
recognizer grows, it becomes impractical to train patterns for each individual word in the vocabulary.
Hence, continuous speech recognizers generally use sub-word speech units as the basic patterns to
be trained, and use a lexicon to define the structure of word patterns in terms of the sub-word
units. Second, the words spoken during continuous speech generally have a syntax associated with
the word order, i.e., they are spoken according to a grammar. In order to achieve good recognition
performance, account must be taken of the word grammar so as to constrain the set of possible
recognized sentences. Finally, the spoken sentence often must make sense according to a semantic
model of the task which the recognizer is asked to perform. Again, by explicitly including these
semantic constraints on the spoken sentence, as part of the recognition process, performance of the
system improves.
Based on the discussion above, there are three distinct new problems associated with continuous
speech recognition [16], namely:
1. Choice of sub-word unit used to represent the sounds of speech, and methods of creating
appropriate acoustic models for these sub-word units;
2. Choice of a representation of words in the recognition vocabulary, in terms of the sub-
word units;
3. Choice of a method for integrating syntactic (and possibly semantic) information into
the recognition process so as to properly constrain the sentences that are allowed by the
system.
47.7.1 Sub-Word Speech Units and Acoustic Modeling
For the basic sub-word speech recognition unit, one could consider a range of linguistic units, including syllables, half syllables, dyads, diphones, or phonemes. The most common choice is a simple phoneme set, which for English comprises about 40 to 50 units, depending on fine choices as to what constitutes a unique phoneme. Since the number of phonemes is limited, it is usually straightforward to collect sufficient speech training data for reliable estimation of statistical models of the phonemes. The resulting set of sub-word speech models is usually referred to as "context independent" phone-like units (CI-PLU) since each unit is trained independently of the context of neighboring units. The problem with using such CI-PLU models is that phonemes are highly variable according to different contexts, and therefore using models which cannot represent this variability properly leads to inferior speech recognition performance.
A straightforward way to improve the modeling of phonemes is to augment the CI-PLU set with phoneme models that are context dependent. In this manner, a target phoneme is modeled differently depending on the phonemes that precede and follow it. By using such context dependent PLUs (in addition to the CI-PLUs) the "resolution" of the acoustic models is increased, and the performance of the recognition system improves.
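To make the notion of context dependence concrete, the small sketch below expands a context independent phone string into triphone-style labels of the form left-center+right. The label format and the example pronunciation of "seven" are illustrative assumptions, not conventions taken from the chapter.

```python
def context_dependent_units(phones):
    """Turn a context independent phone string into triphone-style labels;
    utterance boundaries use a silence context."""
    padded = ["sil"] + list(phones) + ["sil"]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

# "seven" as an illustrative ARPAbet-style phoneme sequence.
print(context_dependent_units(["s", "eh", "v", "ax", "n"]))
# ['sil-s+eh', 's-eh+v', 'eh-v+ax', 'v-ax+n', 'ax-n+sil']
```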
47.7.2 Word Modeling From Sub-Word Units
Once the base set of sub-word units is chosen, one can use standard lexical modeling techniques to
represent words in terms of these units. The key problem here is variability of word pronunciation
across talkers with different regional accents. Hence, for each word in the recognition vocabulary,
the lexicon contains a baseform (or standard) pronunciation of the word, as well as alternative
pronunciations, as appropriate.
The lexicon used in most recognition systems is extracted from a standard pronouncing dictionary, and each word pronunciation is represented as a linear sequence of phonemes. This lexical definition is basically data independent because no speech or text data are used to derive the pronunciation. Hence the lexical variability of a word in speech is characterized only indirectly through the sub-word unit models. To improve lexical modeling capability, the use of (multiple) pronunciation networks has been proposed [17].
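The sketch below illustrates this lexical composition step with a tiny hand-written lexicon (including one alternative pronunciation) and a function that builds word models by concatenating per-phone sub-word models. The lexicon entries and the representation of each phone model as a list of HMM states are assumptions made for the example.

```python
# Illustrative lexicon: each word maps to one or more phoneme-sequence
# pronunciations (baseform first, alternatives after).
LEXICON = {
    "data":  [["d", "ey", "t", "ax"], ["d", "ae", "t", "ax"]],
    "seven": [["s", "eh", "v", "ax", "n"]],
}

def build_word_models(word, phone_models):
    """Concatenate per-phone models into one model per pronunciation.

    `phone_models` maps each phoneme to its trained sub-word model, here
    represented abstractly as a list of HMM states.
    """
    models = []
    for pronunciation in LEXICON[word]:
        states = []
        for phone in pronunciation:
            states.extend(phone_models[phone])   # glue the phone models end to end
        models.append(states)
    return models
```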
47.7.3 Language Modeling Within the Recognizer
In order to determine the best match to a spoken sentence, a continuous speech recognition system has to evaluate both an acoustic match score (corresponding to the "local" acoustic matches of the words in the sentence) and a language match score (corresponding to the match of the words to the grammar and syntax of the task). The acoustic matching score is readily determined using dynamic programming methods much like those used in connected word recognition systems. The language match scores are computed according to a production model of the syntax and the semantics. Most often the language model is represented as a finite state network (FSN) for which the language score is computed according to arc scores along the best decoded path (according to an integrated model where acoustic and language modeling are combined) in the network. Other models of language include word pair models as well as N-gram word probabilities.
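As a hedged illustration of how a language match score enters the overall score, the snippet below combines a hypothesis's acoustic log-likelihood with weighted bigram language model log-probabilities. The bigram table, the example word string, the language weight, and the backoff floor are all invented for the illustration.

```python
import math

# Toy bigram probabilities P(word | previous word); "<s>" marks sentence start.
BIGRAM = {
    ("<s>", "show"): 0.20, ("show", "flights"): 0.30, ("flights", "to"): 0.40,
    ("to", "boston"): 0.05,
}

def total_score(words, acoustic_logprob, lm_weight=8.0, floor=1e-4):
    """Combined score = acoustic log-likelihood + weighted bigram LM log-probability."""
    lm = 0.0
    prev = "<s>"
    for w in words:
        lm += math.log(BIGRAM.get((prev, w), floor))   # fall back to a small floor probability
        prev = w
    return acoustic_logprob + lm_weight * lm

# The acoustic log-likelihood would come from HMM scoring of the hypothesis.
print(total_score(["show", "flights", "to", "boston"], acoustic_logprob=-1520.0))
```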
47.7.4 Performance of Continuous Speech Recognizers
Table 47.3 illustrates current capabilities in continuous speech recognition, for three distinct tasks, namely database access (Resource Management), natural language queries (ATIS) for air travel reservations, and read text from a set of business publications (NAB).
TABLE 47.3 Performance of Continuous Speech Recognition Systems

  Task                       Syntax                   Mode               Vocabulary      Word error rate (%)
  Resource management        Finite state grammar     SI, fluent input   1,000 Words     4.4
  (DARPA)                    (perplexity = 60)
  Air travel information     Backoff trigram          SI, natural        2,500 Words     3.6
  system (DARPA)             (perplexity = 18)        language
  North American business    Backoff 5-gram           SI, fluent input   60,000 Words    10.8
  (DARPA)                    (perplexity = 173)

47.8 Speech Recognition System Issues
This section discusses some key issues in building “real world” speech recognition systems.
47.8.1 Robust Speech Recognition [18]
Robust speech recognition refers to the problem of designing an ASR system that works equally
well in various unknown or adverse operating environments. Robustness is important because the
performance of existing ASR systems, whose designs are predicated on known or clean environments, often degrades rapidly under field conditions.
There are basically four types of sound degradation, namely, noise, distortion, articulation effects, and pronunciation variations. Noise is an inevitable component of the acoustic environment and is normally considered additive with the speech. Distortion refers to modification to the spectral characteristics of the signal by the room, the transducer (microphone), the channel (e.g., transmission), etc. Articulation effects result from the factors that affect a talker's speaking manner when responding to a machine rather than a human. One well-known phenomenon is the Lombard effect, which is related to the changes in articulation when the talker speaks in a noisy environment. Finally, different speakers will pronounce a word differently depending on the regional accent. These conditions are often not known a priori when the recognizer is trained in the laboratory and are often detrimental to the recognizer performance.
There are essentially two broad categories of techniques that have been proposed for dealing with
adverse conditions. These are invariant methods and adaptive methods, respectively.
Invariant methods use speech features (or the associated similarity measures) that are invariant under a wide range of conditions, e.g., liftering and RASTA [19] (which suppress speech features that are more susceptible to signal variabilities), the short-time modified coherence (SMC) [20] (which has a built-in noise averaging advantage), and the Ensemble Interval Histogram (EIH) [21] (which mimics the human auditory mechanism). Robust distortion measures include the group-delay measure [22] and a family of distortion measures based on the projection operator [23] which were shown to be effective in conditions involving additive noise.
Adaptive methods differ from invariant methods in the way the characteristics of the operating environment are taken into account. Invariant methods assume no explicit knowledge of the signal environment, while adaptive methods attempt to estimate the adverse condition and adjust the signal (or the reference models) accordingly in order to achieve reliable matching results.
When channel or transducer distortions are the major factor, it is convenient to assume that the linear distortion effect appears as an additive signal bias in the cepstral domain. This distortion model leads to the method of cepstral mean subtraction and, more generally, signal bias removal [24], which makes a maximum likelihood estimate of the bias due to distortion and subtracts the estimated bias from the cepstral features before pattern matching is performed.
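Cepstral mean subtraction itself reduces to a one-line operation on the feature matrix; the sketch below applies it per utterance, estimating the bias simply as the utterance-level cepstral mean rather than via the full maximum likelihood signal bias removal of [24].

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Remove an additive cepstral bias by subtracting the per-utterance mean.

    `cepstra` is a (frames x coefficients) array, e.g., the output of the
    cepstral analysis sketched in the speech analysis section.
    """
    bias_estimate = cepstra.mean(axis=0)     # crude estimate of the channel bias
    return cepstra - bias_estimate
```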
47.8.2 Speaker Adaptation [25]
Given sufficient training data, a SD recognition system usually performs better than a SI system for the same task. Many systems are designed for SI applications, however, due to the fact that it is often difficult to collect speaker-specific training data that would be adequate for reliable performance. One way to bridge the performance gap is to apply the method of speaker adaptation, which uses a very limited amount of speaker-specific data to modify the model parameters of a SI recognition system in order to achieve a recognition accuracy approaching that of a well-trained SD system.
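One common way to realize such adaptation is maximum a posteriori (MAP) style re-estimation of the Gaussian means of a speaker-independent model from a small amount of speaker-specific data, sketched below for a single state. The relevance factor tau is an assumed tuning parameter, and this is only one simple instance of the adaptation methods studied in [25].

```python
import numpy as np

def map_adapt_mean(si_mean, adaptation_frames, tau=10.0):
    """MAP-style interpolation between a speaker-independent mean and the
    speaker-specific sample mean; tau controls how much new data is needed
    before the adapted mean moves away from the SI prior."""
    n = len(adaptation_frames)
    if n == 0:
        return si_mean
    speaker_mean = np.mean(adaptation_frames, axis=0)
    return (tau * si_mean + n * speaker_mean) / (tau + n)
```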
47.8.3 Keyword Spotting [26] and Utterance Verification [27]
An automatic speech recognition system needs to have both high accuracy and a user-friendly interface in order to be acceptable to the users. One major component in a friendly user interface is to allow the user to speak naturally and spontaneously without imposing a rigid speaking format. In a typical spontaneously spoken utterance, however, we usually observe various kinds of disfluency, such as hesitation, extraneous sounds (such as um and ah), and false starts, as well as unanticipated ambient noise, such as mouth clicks and lip smacks. In the conventional paradigm, which formulates speech recognition as decoding of an unknown utterance into a contiguous sequence of phonetic units, the task is equivalent to designing an unlimited vocabulary continuous speech recognition and understanding system which is, unfortunately, beyond reach with today's technology.
One alternative to the above approach, particularly when implementing domain-specific services, is to focus on a finite set of vocabulary words most relevant to the intended task and design the system using the technology of keyword spotting and, more generally, utterance verification (UV). With UV incorporated into the speech recognition system, the user is allowed to speak spontaneously so long as the keywords appear somewhere in the spoken utterance. The system then detects and identifies the in-vocabulary words (i.e., keywords), while rejecting all other superfluous acoustic events in the utterance (which include out-of-vocabulary words, invalid inputs — any form of disfluency as well as lack of keywords — and ambient sounds). In such cases, no critical constraints are imposed on the users' speaking format, making the user interface natural and effective.
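Utterance verification is often cast as a likelihood ratio test between a keyword model and a filler (background) model. The sketch below shows that test for one hypothesized keyword segment; the per-frame normalization and the threshold value are illustrative assumptions rather than the specific procedures of [26] or [27].

```python
def verify_keyword(keyword_loglik, filler_loglik, n_frames, threshold=0.5):
    """Accept a hypothesized keyword if its per-frame log-likelihood ratio
    against a filler (background) model exceeds a tuned threshold."""
    llr_per_frame = (keyword_loglik - filler_loglik) / n_frames
    return llr_per_frame > threshold

# The log-likelihoods would come from scoring the segment with the keyword HMM
# and with a general filler HMM, e.g., via the forward algorithm shown earlier.
print(verify_keyword(keyword_loglik=-850.0, filler_loglik=-940.0, n_frames=120))
```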
47.8.4 Barge-In
In human-human conversation, talkers often interrupt each other during speaking. This is called "barge-in". For human-machine interactions, in which machine prompts are often routine messages or instructions, the capability of allowing talkers to "barge in" becomes an important enabling technology for a natural human-machine interface.
Two key technologies are integrated in the implementation of "barge-in", namely, an echo canceler (to remove the machine's spoken message from the input to the recognizer) and a partial rejection mechanism. With "barge-in", the recognizer needs to be activated and listen starting from the beginning of the system prompt. An echo canceler, with a proper double-talk detector, is used to cancel the system prompt while attempting to detect if the near-end signal from the talker (i.e., the speech to be recognized) is present. The tentatively detected signal is then passed through the recognizer with rejection thresholds to produce partial recognition results. The rejection technique is critical because extraneous input is very likely to be present, both from the ambient background and from the talker (breathing, lip smacks, etc.), during the long period when the recognizer is activated.
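One standard way to realize the echo canceler is an adaptive FIR filter driven by the prompt signal, for example the normalized LMS (NLMS) update sketched below. The filter length and step size are illustrative, and a deployed canceler would combine this with the double-talk detector mentioned above so that adaptation is frozen while the talker is speaking.

```python
import numpy as np

def nlms_echo_canceler(prompt, microphone, filter_len=256, mu=0.5, eps=1e-8):
    """Estimate the prompt echo in the microphone signal with an NLMS adaptive
    filter and return the residual (prompt-suppressed) signal.  `prompt` and
    `microphone` are assumed to be time-aligned arrays of equal length."""
    w = np.zeros(filter_len)                 # adaptive echo-path estimate
    x_buf = np.zeros(filter_len)             # most recent prompt samples
    residual = np.zeros(len(microphone))
    for n in range(len(microphone)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = prompt[n]
        echo_estimate = w @ x_buf
        e = microphone[n] - echo_estimate    # residual carries the talker's speech
        w += mu * e * x_buf / (x_buf @ x_buf + eps)
        residual[n] = e
    return residual
```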
47.9 Practical Issues in Speech Recognition
As progress is made in fundamental recognition technologies, we need to examine carefully the key attributes that a recognition machine must possess in order for it to be useful. These include: high recognition performance in terms of speed and accuracy, ease of use, and low cost. A recognizer must be able to deliver high recognition accuracy without excessive delay. A system that does not provide high performance often adds to users' frustration and may even be considered counterproductive. A recognition system must also be easy to use. The more naturally a system interacts with the user (e.g., does not require words in a sentence to be spoken in isolation), the higher the perceived effectiveness. Finally, the recognition system must be low cost to be competitive with alternative technologies such as keyboard or mouse devices in computer interface applications.
47.10 ASR Applications
Speech recognition has been successfully applied in a range of systems. We categorize these applications into five broad classes.
1. Office or business system. Typical applications include data entry onto forms, database management and control, keyboard enhancement, and dictation. Examples of voice-activated dictation machines include the Tangora system [28] and the Dragon Dictate system [29].
2. Manufacturing. ASR is used to provide "eyes-free, hands-free" monitoring of manufacturing processes (e.g., parts inspection) for quality control.
3. Telephone or telecommunications. Applications include automation of operator-assisted services (the Voice Recognition Call Processing system by AT&T to automate operator service routing according to call types), inbound and outbound telemarketing, information services (the ANSER system by NTT for limited home banking services, the stock price quotation system by Bell Northern Research, Universal Card services by Conversant/AT&T for account information retrieval), voice dialing by name/number (AT&T VoiceLine, 800 Voice Calling services, Conversant FlexWord, etc.), directory assistance call completion, catalog ordering, and telephone calling feature enhancements (AT&T VIP — Voice Interactive Phone for easy activation of advanced calling features such as call waiting, call forwarding, etc. by voice rather than by keying in the code sequences).
4. Medical. The application is primarily in voice creation and editing of specialized medical reports (e.g., Kurzweil's system).
5. Other. This category includes voice controlled and operated toys and games, aids for the handicapped, and voice control of non-essential functions in moving vehicles (such as climate control and the audio system).
References
[1] Hemdal, J.F. and Hughes, G.W., A feature based computer recognition program for the modeling of vowel perception, in Models for the Perception of Speech and Visual Form, Wathen-Dunn, W., Ed., MIT Press, Cambridge, MA.
[2] Itakura, F., Minimum prediction residual principle applied to speech recognition, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-23, 57–72, Feb. 1975.
[3] Lesser, V.R., Fennell, R.D., Erman, L.D. and Reddy, D.R., Organization of the Hearsay-II speech understanding system, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-23(1), 11–23, 1975.
[4] Lippmann, R., An introduction to computing with neural networks, IEEE ASSP Magazine, 4(2), 4–22, Apr. 1987.
[5] Rabiner, L.R. and Levinson, S.E., Isolated and connected word recognition — theory and selected applications, IEEE Trans. Commun., COM-29(5), 621–659, May 1981.
[6] Rabiner, L.R., A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE, 77(2), 257–286, Feb. 1989.
[7] Rabiner, L.R. and Juang, B.H., Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, 1993.
[8] Davis, S.B. and Mermelstein, P., Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-28(4), 357–366, Aug. 1980.
[9] Furui, S., Speaker independent isolated word recognition using dynamic features of speech spectrum, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-34(1), 52–59, Feb. 1986.
[10] Baum, L.E., Petrie, T., Soules, G. and Weiss, N., A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, Ann. Math. Stat., 41(1), 164–171, 1970.
[11] Juang, B.H. and Rabiner, L.R., The segmental k-means algorithm for estimating parameters of hidden Markov models, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-38(9), 1639–1641, Sept. 1990.
[12] Sakoe, H. and Chiba, S., Dynamic programming algorithm optimization for spoken word recognition, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-26(1), 43–49, Feb. 1978.
[13] Sakoe, H., Two-level DP matching — a dynamic programming-based pattern matching algorithm for connected word recognition, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-27(6), 588–595, Dec. 1979.
[14] Myers, C.S. and Rabiner, L.R., A level building dynamic time warping algorithm for connected word recognition, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-29(3), 351–363, June 1981.
[15] Bridle, J.S., Brown, M.D. and Chamberlain, R.M., An algorithm for connected word recognition, Proc. ICASSP-82, 899–902, May 1982.
[16] Lee, C.-H., Rabiner, L.R. and Pieraccini, R., Speaker independent continuous speech recognition using continuous density hidden Markov models, in Proc. NATO-ASI, Speech Recognition and Understanding: Recent Advances, Trends and Applications, Laface, P. and De Mori, R., Eds., Springer-Verlag, Cetraro, Italy, 1992, 135–163.
[17] Riley, M.D., A statistical model for generating pronunciation networks, Proc. ICASSP-91, 2, 737–740, 1991.
[18] Juang, B.H., Speech recognition in adverse environments, Computer Speech and Language, 5, 275–294, 1991.
[19] Hermansky, H. et al., RASTA-PLP speech analysis technique, Proc. ICASSP-92, 121–124, 1992.
[20] Mansour, D. and Juang, B.H., The short-time modified coherence representation and noisy speech recognition, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-37(6), 795–804, June 1989.
[21] Ghitza, O., Auditory nerve representation as a front-end for speech recognition in a noisy environment, Comp. Speech Lang., 1(2), 109–130, Dec. 1986.
[22] Itakura, F. and Umezaki, T., Distance measure for speech recognition based on the smoothed group delay spectrum, Proc. ICASSP-87, 1257–1260, Apr. 1987.
[23] Mansour, D. and Juang, B.H., A family of distortion measures based upon projection operation for robust speech recognition, Proc. ICASSP-88, Apr. 1988. Also in IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-37(11), 1659–1671, Nov. 1989.
[24] Rahim, M.G. and Juang, B.H., Signal bias removal for robust telephone speech recognition in adverse environments, Proc. ICASSP-94, Apr. 1994.
[25] Lee, C.-H., Lin, C.-H. and Juang, B.H., A study on speaker adaptation of the parameters of continuous density hidden Markov models, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-39(4), 806–814, Apr. 1991.
[26] Wilpon, J.G., Rabiner, L.R., Lee, C.-H. and Goldman, E., Automatic recognition of keywords in unconstrained speech using hidden Markov models, IEEE Trans. Acoustics, Speech, and Signal Processing, 38(11), 1870–1878, Nov. 1990.
[27] Rahim, M., Lee, C.-H. and Juang, B.H., Robust utterance verification for connected digit recognition, Proc. ICASSP-95, WA02.02, May 1995.
[28] Jelinek, F., The development of an experimental discrete dictation recognizer, Proc. IEEE, 73(11), 1616–1624, Nov. 1985.
[29] Baker, J.M., Large vocabulary speech recognition prototype, Proc. DARPA Speech and Natural Language Workshop, 414–415, June 1990.