Richard V. Cox. "Speech Coding." 2000 CRC Press LLC.

Speech Coding

Richard V. Cox
AT&T Labs - Research
45.1 Introduction
    Examples of Applications • Speech Coder Attributes
45.2 Useful Models for Speech and Hearing
    The LPC Speech Production Model • Models of Human Perception for Speech Coding
45.3 Types of Speech Coders
    Model-Based Speech Coders • Time Domain Waveform-Following Speech Coders • Frequency Domain Waveform-Following Speech Coders
45.4 Current Standards
    Current ITU Waveform Signal Coders • ITU Linear Prediction Analysis-by-Synthesis Speech Coders • Digital Cellular Speech Coding Standards • Secure Voice Standards • Performance
References
45.1 Introduction
Digital speech coding is used in a wide variety of everyday applications that the ordinary person takes for granted, such as network telephony or telephone answering machines. By speech coding we mean a method for reducing the amount of information needed to represent a speech signal for transmission or storage applications. For most applications this means using a lossy compression algorithm because a small amount of perceptible degradation is acceptable. This section reviews some of the applications, the basic attributes of speech coders, methods currently used for coding, and some of the most important speech coding standards.
45.1.1 Examples of Applications
Digital speech transmission is used in network telephony. The speech coding used is just sample-by-sample quantization. The transmission rate for most calls is fixed at 64 kilobits per second (kb/s). The speech is sampled at 8000 Hz (8 kHz) and a logarithmic 8-bit quantizer is used to represent each sample as one of 256 possible output values. International calls over transoceanic cables or satellites are often reduced in bit rate to 32 kb/s in order to boost the capacity of this relatively expensive equipment. Digital wireless transmission has already begun. In North America, Europe, and Japan there are digital cellular phone systems already in operation with bit rates ranging from 6.7 to 13 kb/s for the speech coders. Secure telephony has existed since World War II, based on the first vocoder. (Vocoder is a contraction of the words voice coder.) Secure telephony involves first converting the speech to a digital form, then digitally encrypting it and then transmitting it. At the receiver, it is decrypted, decoded, and reconverted back to analog. Current video telephony is accomplished through digital transmission of both the speech and the video signals. An emerging use of speech coders is for simultaneous voice and data. In these applications, users exchange data (text, images, FAX, or any other form of digital information) while carrying on a conversation.

© 1999 by CRC Press LLC
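The 8-bit logarithmic quantization described at the start of this subsection can be sketched as follows. This is a smooth, continuous approximation of μ-law companding, not the bit-exact segmented law of the G.711 standard, and the rounding scheme is illustrative.

```python
import numpy as np

MU = 255  # companding parameter used for North American log-PCM

def mulaw_compress(x, mu=MU):
    """Logarithmically compress a sample in [-1, 1] to [-1, 1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mulaw_expand(y, mu=MU):
    """Invert the compression."""
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

def quantize_8bit(x):
    """Compress, round to one of 256 levels (sign + 7 magnitude bits), expand."""
    y = np.round(mulaw_compress(x) * 127.0) / 127.0
    return mulaw_expand(y)
```

At 8000 samples/s, this 8-bit representation gives exactly the 64 kb/s rate used in the network.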

All of the above examples involve real-time conversations. Today we use speech coders for many storage applications that make our lives easier. For example, voice mail systems and telephone answering machines allow us to leave messages for others. The called party can retrieve the message when they wish, even from halfway around the world. The same storage technology can be used to broadcast announcements to many different individuals. Another emerging use of speech coding is multimedia. Most forms of multimedia involve only one-way communications, so we include them with storage applications. Multimedia documents on computers can have snippets of speech as an integral part. Capabilities currently exist to allow users to make voice annotations onto documents stored on a personal computer (PC) or workstation.
45.1.2 Speech Coder Attributes
Speech coders have attributes that can be placed in four groups: bit rate, quality, complexity, and delay. For a given application, some of these attributes are pre-determined while tradeoffs can be made among the others. For example, the communications channel may set a limit on bit rate, or cost considerations may limit complexity. Quality can usually be improved by increasing bit rate or complexity, and sometimes by increasing delay. In the following sections, we discuss these attributes.
Primarily we will be discussing telephone bandwidth speech. This is a slightly nebulous term. In the telephone network, speech is first bandpass filtered from roughly 200 to 3200 Hz. This is often referred to as 3 kHz speech. Speech is sampled at 8 kHz in the telephone network. The usual telephone bandwidth filter rolls off to about 35 dB by 4 kHz in order to eliminate the aliasing artifacts caused by sampling.
There is a second bandwidth of interest. It is referred to as wideband speech. The sampling rate is doubled to 16 kHz. The lowpass filter is assumed to begin rolling off at 7 kHz. At the low end, the speech is assumed to be uncontaminated by line noise and only the DC component needs to be filtered out. Thus, the highpass filter cutoff frequency is 50 Hz. When we refer to wideband speech, we mean speech with a bandwidth of 50 to 7000 Hz and a sampling rate of 16 kHz. This is also referred to as 7 kHz speech.
Bit Rate
Bit rate tells us the degree of compression that the coder achieves. Telephone bandwidth speech is sampled at 8 kHz and digitized with an 8-bit logarithmic quantizer, resulting in a bit rate of 64 kb/s. For telephone bandwidth speech coders, we measure the degree of compression by how much the bit rate is lowered from 64 kb/s. International telephone network standards currently exist for coders operating from 64 kb/s down to 5.3 kb/s. The speech coders for regional cellular standards span the range from 13 to 3.45 kb/s and those for secure telephony span the range from 16 kb/s to 800 b/s. Finally, there are proprietary speech coders that are in common use which span the entire range.
Speech coders need not have a constant bit rate. Considerable compression can be gained by not transmitting speech during the silence intervals of a conversation. Nor is it necessary to keep the bit rate fixed during the talkspurts of a conversation.
Delay
The communication delay of the coder is more important for transmission than for storage
applications. In real-time conversations, a large communication delay can impose an awkward
protocol on talkers. Large communication delays of 300 ms or greater are particularly objectionable
to users even if there are no echoes.
Most low bit rate speech coders are block coders. They encode a block of speech, also known as a frame, at a time. Speech coding delay can be allocated as follows. First, there is algorithmic delay. Some coders have an amount of look-ahead or other inherent delays in addition to their frame size. The sum of frame size and other inherent delays constitutes algorithmic delay. The coder requires computation. The amount of time required for this is called processing delay. It is dependent on the speed of the processor used. Other delays in a complete system are the multiplexing delay and the transmission delay.
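The delay components listed above simply add up to the one-way delay. The numbers below are purely illustrative and are not taken from any particular coder or standard.

```python
# Illustrative delay budget for a hypothetical block coder (values invented).
frame_size_ms = 20.0        # speech buffered before a frame can be encoded
lookahead_ms = 5.0          # inherent delay beyond the frame itself
algorithmic_delay_ms = frame_size_ms + lookahead_ms

processing_delay_ms = 10.0  # depends on the speed of the processor used
multiplexing_delay_ms = 2.0
transmission_delay_ms = 30.0

one_way_delay_ms = (algorithmic_delay_ms + processing_delay_ms +
                    multiplexing_delay_ms + transmission_delay_ms)
round_trip_ms = 2.0 * one_way_delay_ms  # compare against the 300 ms threshold
```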
Complexity
The degree of complexity is a determining factor in both the cost and power consumption of a speech coder. Cost is almost always a factor in the selection of a speech coder for a given application. With the advent of wireless and portable communications, power consumption has also become an important factor. Simple scalar quantizers, such as linear or logarithmic PCM, are necessary in any coding system and have the lowest possible complexity.
More complex speech coders are first simulated on host processors, then implemented on DSP chips and may later be implemented on special purpose VLSI devices. Speed and random access memory (RAM) are the two most important contributing factors of complexity. The faster the chip or the greater the chip size, the greater the cost. Generally 1 word of RAM takes up as much on-chip area as 4 to 6 words of read-only memory (ROM). Most speech coders are implemented on fixed point DSP chips, so one way to compare the complexity of coders is to measure their speed and memory requirements when efficiently implemented on commercially available fixed point DSP chips.
DSP chips are available in both 16-bit fixed point and 32-bit floating point. 16-bit DSP chips are generally preferred for dedicated speech coder implementations because the chips are usually less expensive and consume less power than implementations based on floating point DSPs. A disadvantage of fixed-point DSP chips is that the speech coding algorithm must be implemented using 16-bit arithmetic. As part of the implementation process, a representation must be selected for each and every variable. Some can be represented in a fixed format, some in block floating point, and still others may require double precision. As VLSI technology has advanced, fixed point DSP chips contain a richer set of instructions to handle the data manipulations required to implement representations such as block floating point. The advantage of floating point DSP chips is that implementing speech coders is much quicker. Their arithmetic precision is about the same as that of a high-level language simulation, so the steps of determining the representation of each and every variable and how these representations affect performance can be omitted.
Quality
The attribute of quality has many dimensions. Ultimately quality is determined by how the speech sounds to a listener. Some of the factors that affect the performance of a coder are whether the input speech is clean or noisy, whether the bit stream has been corrupted by errors, and whether multiple encodings have taken place.
Speech coder quality ratings are determined by means of subjective listening tests. The listening is done in a quiet booth and may use specified telephone handsets, headphones, or loudspeakers. The speech material is presented to the listeners at specified levels and is originally prepared to have particular frequency characteristics. The most often used test is the absolute category rating (ACR) test. Subjects hear pairs of sentences and are asked to give one of the following ratings: excellent, good, fair, poor, or bad. A typical test contains a variety of different talkers and a number of different coders or reference conditions. The data resulting from this test can be analyzed in many ways. The simplest way is to assign a numerical ranking to each response, giving a 5 to the best possible rating, 4 to the next best, down to a 1 for the worst rating, then computing the mean rating for each of the
conditions under test. This is referred to as a mean opinion score (MOS) and the ACR test is often referred to as a MOS test.
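The scoring step of an ACR test can be sketched directly. The ratings below are invented purely to illustrate the computation.

```python
# Map the ACR labels onto the 5..1 scale and average to get a MOS.
ACR_SCALE = {"excellent": 5, "good": 4, "fair": 3, "poor": 2, "bad": 1}

def mean_opinion_score(ratings):
    """Mean of the numerical rankings for one condition under test."""
    scores = [ACR_SCALE[r] for r in ratings]
    return sum(scores) / len(scores)

# Hypothetical responses for a single coder/condition:
mos = mean_opinion_score(["good", "excellent", "fair", "good", "poor"])  # 3.6
```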
There are many other dimensions to quality besides those pertaining to noiseless channels. Bit error sensitivity is another aspect of quality. For some low bit rate applications such as secure telephones over 2.4 or 4.8 kb/s modems, it might be reasonable to expect the distribution of bit errors to be random, and coders should be made robust for low random bit error rates up to 1 to 2%. For radio channels, such as in digital cellular telephony, provision is made for additional bits to be used for channel coding to protect the information bearing bits. Errors are more likely to occur in bursts and the speech coder requires a mechanism to recover from an entire lost frame. This is referred to as frame erasure concealment, another aspect of quality for cellular speech coders.
For the purposes of conserving bandwidth, voice activity detectors are sometimes used with speech coders. During non-speech intervals, the speech coder bit stream is discontinued. At the receiver "comfort noise" is injected to simulate the background acoustic noise at the encoder. This method is used for some cellular systems and also in digital speech interpolation (DSI) systems to increase the effective number of channels or circuits. Most international phone calls carried on undersea cables or satellites use DSI systems. There is some impact on quality when these techniques are used. Subjective testing can determine the degree of degradation.
45.2 Useful Models for Speech and Hearing
45.2.1 The LPC Speech Production Model
Human speech is produced in the vocal tract by a combination of the vocal cords in the glottis interacting with the articulators of the vocal tract. The vocal tract can be approximated as a tube of varying diameter. The shape of the tube gives rise to resonant frequencies called formants. Over the years, the most successful speech coding techniques have been based on linear prediction coding (LPC). The LPC model is derived from a mathematical approximation to the vocal tract representation as a variable diameter tube. The essential element of LPC is the linear prediction filter. This is an all-pole filter which predicts the value of the next sample based on a linear combination of previous samples.
Let $x_n$ be the speech sample value at sampling instant $n$. The object is to find a set of prediction coefficients $\{a_i\}$ such that the prediction error for a frame of size $M$ is minimized:

$$\varepsilon = \sum_{m=0}^{M-1} \left( \sum_{i=1}^{I} a_i\, x_{n+m-i} + x_{n+m} \right)^{2} \qquad (45.1)$$
where $I$ is the order of the linear prediction model. The prediction value for $x_n$ is given by

$$\tilde{x}_n = -\sum_{i=1}^{I} a_i\, x_{n-i} \qquad (45.2)$$
The prediction error signal $\{e_n\}$ is also referred to as the residual signal. In z-transform notation we can write

$$A(z) = 1 + \sum_{i=1}^{I} a_i\, z^{-i} \qquad (45.3)$$

$1/A(z)$ is referred to as the LPC synthesis filter and (ironically) $A(z)$ is referred to as the LPC inverse filter.
LPC analysis is carried out as a block process on a frame of speech. The most often used techniques are referred to as the autocorrelation and the autocovariance methods [1]–[3]. Both methods involve inverting matrices containing correlation statistics of the speech signal. If the poles of the LPC filter are close to the unit circle, then these matrices become more ill-conditioned, which means that the techniques used for inversion are more sensitive to errors caused by finite numerical precision. Various techniques for dealing with this aspect of LPC analysis include windows for the data [1, 2], windows for the correlation statistics [4], and bandwidth expansion of the LPC coefficients.
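A minimal sketch of the autocorrelation method follows, using the Levinson-Durbin recursion to solve the normal equations without an explicit matrix inversion. The data windowing step mentioned above is left to the caller; all frame sizes here are illustrative.

```python
import numpy as np

def frame_autocorr(x, order):
    """Autocorrelations r[0..order] of one (optionally pre-windowed) frame."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations recursively.

    Returns the A(z) coefficients [1, a_1, ..., a_I] in the sign convention
    of Eq. (45.3), the reflection coefficients, and the final error energy.
    """
    alpha = np.zeros(order)      # predictor taps: x_hat[n] = sum alpha_i x[n-i]
    refl = np.zeros(order)
    err = r[0]
    for i in range(order):
        k = (r[i + 1] - np.dot(alpha[:i], r[i:0:-1])) / err
        refl[i] = k
        alpha[:i] = alpha[:i] - k * alpha[:i][::-1]
        alpha[i] = k
        err *= (1.0 - k * k)
    return np.concatenate(([1.0], -alpha)), refl, err
```

Because the frame autocorrelation sequence is positive semidefinite, the reflection coefficients produced this way have magnitudes below 1 and the resulting synthesis filter is stable.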
For forward adaptive coders, the LPC information must also be quantized and transmitted or stored. Direct quantization of LPC coefficients is not efficient. A small quantization error in a single coefficient can render the entire LPC filter unstable. Even when the filter remains stable, too many bits are needed to achieve sufficient precision. Instead, it is better to transform the LPC coefficients to another domain in which stability is more easily determined and fewer bits are required for representing the quantization levels.
The first such domain to be considered is the reflection coefficient [5]. Reflection coefficients are computed as a byproduct of LPC analysis. One of their properties is that all reflection coefficients must have magnitudes less than 1, making stability easily verified. Direct quantization of reflection coefficients is still not efficient because the sensitivity of the LPC filter to errors is much greater when reflection coefficients are nearly 1 or −1. More efficient quantizers have been designed by transforming the individual reflection coefficients with a nonlinearity that makes the error sensitivity more uniform. Two such nonlinear functions are the inverse sine function, $\arcsin(k_i)$, and the logarithm of the area ratio, $\log\frac{1+k_i}{1-k_i}$.
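Both nonlinearities and their inverses are straightforward; a sketch:

```python
import numpy as np

def arcsine(k):
    """Inverse sine transform of a reflection coefficient, |k| < 1."""
    return np.arcsin(k)

def log_area_ratio(k):
    """Log area ratio log((1+k)/(1-k)); stretches the scale near +/-1."""
    return np.log((1.0 + k) / (1.0 - k))

def inv_log_area_ratio(g):
    """Map a (quantized) log area ratio back to a reflection coefficient."""
    return (np.exp(g) - 1.0) / (np.exp(g) + 1.0)
```

A uniform quantizer applied in the transformed domain then spends relatively more resolution near ±1, where the LPC filter is most sensitive to errors.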
A second domain that has attracted even greater interest recently is the line spectral frequency (LSF) domain [6]. The transformation is given as follows. We first use $A(z)$ to define two polynomials:

$$P(z) = A(z) + z^{-(I+1)} A\!\left(z^{-1}\right) \qquad (45.4a)$$

$$Q(z) = A(z) - z^{-(I+1)} A\!\left(z^{-1}\right) \qquad (45.4b)$$
These polynomials can be shown to have two useful properties: all zeroes of $P(z)$ and $Q(z)$ lie on the unit circle and they are interlaced with each other. Thus, stability is easily checked by assuring both the interlaced property and that no two zeroes are too close together. A second property is that the frequencies tend to be clustered near the formant frequencies; the closer together two LSFs are, the sharper the formant. LSFs have attracted more interest recently because they typically result in quantizers having either better representations or using fewer bits than reflection coefficient quantizers.
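Forming P(z) and Q(z) from the coefficient vector of A(z) is just a reverse-and-add operation, since z^-(I+1) A(z^-1) corresponds to reversing the coefficient sequence. A sketch, with the trivial roots at z = ±1 excluded when collecting the frequencies:

```python
import numpy as np

def lsf_polynomials(a):
    """P(z) and Q(z) of Eqs. (45.4a)-(45.4b) from a = [1, a_1, ..., a_I]."""
    a = np.concatenate((np.asarray(a, dtype=float), [0.0]))  # degree I+1
    return a + a[::-1], a - a[::-1]

def line_spectral_frequencies(a):
    """Sorted angles in (0, pi) of the unit-circle roots of P and Q."""
    p, q = lsf_polynomials(a)
    ang = np.angle(np.concatenate((np.roots(p), np.roots(q))))
    return np.sort(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])
```

For a stable A(z) the returned frequencies alternate between roots of P and roots of Q, which is the interlacing property used for the stability check.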
The simplest quantizers are scalar quantizers [8]. Each of the values (in whatever domain is
being used to represent the LPC coefficients) is represented by one of the possible quantizer levels.
The individual values are quantized independently of each other. There may also be additional
redundancy between successive frames, especially during stationary speech. In such cases, values
may be quantized differentially between frames.
A more efficient, but also more complex, method of quantization is called vector quantization [9]. In this technique, the complete set of values is quantized jointly. The actual set of values is compared against all sets in the codebook using a distance metric. The set that is nearest is selected. In practice, an exhaustive codebook search is too complex. For example, a 10-bit codebook has 1024 entries. This seems like a practical limit for most codebooks, but does not give sufficient performance for typical 10th order LPC. A 20-bit codebook would give increased performance, but would contain over 1 million vectors. This is both too much storage and too much computational complexity to be practical. Instead of using large codebooks, product codes are used. In one technique, an initial codebook is used, then the remaining error vector is quantized by a second stage codebook. In the second technique, the vector is sub-divided and each sub-vector is quantized using its own codebook. Both of these techniques lose efficiency compared to a full-search vector quantizer, but represent a good means for reducing computational complexity and codebook size at a given bit rate or quality.
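A sketch of the two product-code variants, with small invented codebooks; a real coder would train the codebooks on speech data.

```python
import numpy as np

def nearest(x, codebook):
    """Full search: index and entry minimizing squared-error distance."""
    i = int(np.argmin(np.sum((codebook - x) ** 2, axis=1)))
    return i, codebook[i]

def two_stage_vq(x, cb1, cb2):
    """Quantize x with cb1, then quantize the remaining error with cb2."""
    i1, y1 = nearest(x, cb1)
    i2, y2 = nearest(x - y1, cb2)
    return (i1, i2), y1 + y2

def split_vq(x, cb_lo, cb_hi):
    """Quantize the two halves of x with their own codebooks."""
    h = len(x) // 2
    i1, y1 = nearest(x[:h], cb_lo)
    i2, y2 = nearest(x[h:], cb_hi)
    return (i1, i2), np.concatenate((y1, y2))
```

Two 10-bit stages transmit 20 bits of indices but search and store only 2 × 1024 vectors, versus the 2^20 entries a full-search 20-bit codebook would need.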
45.2.2 Models of Human Perception for Speech Coding
Our ears have a limited dynamic range that depends on both the level and the frequency content of the input signal. The typical bandpass telephone filter has a stopband of only about 35 dB. Also, the logarithmic quantizer characteristics specified by CCITT Rec. G.711 result in a signal-to-quantization noise ratio of about 35 dB. Is this a coincidence? Of course not! If a signal maintains an SNR of about 35 dB or greater for telephone bandwidth, then most humans will perceive little or no noise.
Conceptually, the masking property tells us that we can permit greater amounts of noise in and near the formant regions and that noise will be most audible in the spectral valleys. If we use a coder that produces a white noise characteristic, then the noise spectrum is flat. The white noise would probably be audible in all but the formant regions.
In modern speech coders, an additional linear filter is added to weight the difference between the original speech signal and the synthesized signal. The object is to minimize the error in a space whose metric is like that of the human auditory system. If the LPC filter information is available, it constitutes the best available estimate of the speech spectrum. It can be used to form the basis for this "perceptual weighting filter" [10]. The perceptual weighting filter is given by

$$W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)}, \qquad 0 < \gamma_2 < \gamma_1 < 1 \qquad (45.5)$$

The perceptual weighting filter de-emphasizes the importance of noise in the formant region and emphasizes its importance in spectral valleys. The quantization noise will have a spectral shape that is similar to that of the LPC spectral estimate, making it easier to mask.
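Scaling the argument of A(z) by γ simply scales each coefficient: A(z/γ) has coefficients a_i γ^i, which moves the roots of A(z) inward and broadens its formant peaks. A sketch of forming the numerator and denominator of W(z); the particular γ values are illustrative.

```python
import numpy as np

def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): a_i -> a_i * gamma**i."""
    a = np.asarray(a, dtype=float)
    return a * gamma ** np.arange(len(a))

def perceptual_weighting(a, gamma1=0.9, gamma2=0.6):
    """Numerator and denominator coefficient vectors of W(z), Eq. (45.5)."""
    return bandwidth_expand(a, gamma1), bandwidth_expand(a, gamma2)
```

Because the denominator is expanded more than the numerator, W(z) dips near the formants, so error energy there counts for less in the search.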
The adaptive postfilter is an additional linear filter that is combined with the synthesis filter to reduce noise in the spectral valleys [11]. Once again the LPC synthesis filter is available as the estimate of the speech spectrum. As in the perceptual weighting filter, the synthesis filter is modified. This idea was later further extended to include a long-term (pitch) filter. A tilt-compensation filter was added to correct for the lowpass characteristic that causes a muffled sound. A gain control strategy helped prevent any segments from being either too loud or too soft. Adaptive postfilters are now included as a part of many standards.
45.3 Types of Speech Coders
This part of the section describes a variety of speech coders that are widely used. They are divided into two categories: waveform-following coders and model-based coders. Waveform-following coders have the property that if there were no quantization error, the original speech signal would be exactly reproduced. Model-based coders are based on parametric models of speech production. Only the values of the parameters are quantized. Even if there were no quantization error, the reproduced signal would not be the original speech.
45.3.1 Model-Based Speech Coders
LPC Vocoders
A block diagram of the LPC vocoder is shown in Fig. 45.1. LPC analysis is performed on a frame of speech and the LPC information is quantized and transmitted. A voiced/unvoiced determination is made. The decision may be based on either the original speech or the LPC residual signal, but it will always be based on the degree of periodicity of the signal. If the frame is classified as unvoiced, the excitation signal is white noise. If the frame is voiced, the pitch period is transmitted and the excitation signal is a periodic pulse train. In either case, the amplitude of the output signal is selected such that its power matches that of the original speech. For more information on the LPC vocoder, the reader is referred to [12].
FIGURE 45.1: Block diagram of LPC vocoder.
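The decoder side of Fig. 45.1 can be sketched as an excitation generator driving the all-pole synthesis filter 1/A(z). The frame length, pitch value, and gain handling here are simplified for illustration; a real vocoder matches the output power to the transmitted energy.

```python
import numpy as np

def excitation(voiced, pitch_period, n, rng=None):
    """Unit-power excitation: periodic pulse train if voiced, else white noise."""
    if voiced:
        e = np.zeros(n)
        e[::pitch_period] = 1.0
    else:
        e = (rng or np.random.default_rng()).standard_normal(n)
    return e / np.sqrt(np.mean(e ** 2))

def synthesize(e, a, gain):
    """y[n] = gain*e[n] - sum_i a_i y[n-i], with a = [1, a_1, ..., a_I]."""
    I = len(a) - 1
    y = np.zeros(I + len(e))
    for n in range(len(e)):
        y[I + n] = gain * e[n] - np.dot(a[1:], y[I + n - 1::-1][:I])
    return y[I:]
```

The recursion implements the sign convention of Eqs. (45.2)-(45.3): the residual is reconstructed by subtracting the weighted past outputs.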
Multiband Excitation (MBE) Coders
Figure 45.2 is a block diagram of a multiband sinusoidal excitation coder. The basic premise of these coders is that the speech waveform can be modeled as a combination of harmonically related sinusoidal waveforms and narrowband noise. Within a given bandwidth, the speech is classified as periodic or aperiodic. Harmonically related sinusoids are used to generate the periodic components and white noise is used to generate the aperiodic components. Rather than transmitting a single voiced/unvoiced decision, a frame consists of a number of voiced/unvoiced decisions corresponding to the different bands. In addition, the spectral shape and gain must be transmitted to the receiver. LPC may or may not be used to quantize the spectral shape. Most often the analysis of the encoder is performed via fast Fourier transform (FFT). Synthesis at the decoder is usually performed by a number of parallel sinusoid and white noise generators. MBE coders are model-based because they do not transmit the phase of the sinusoids, nor do they attempt to capture anything more than the energy of the aperiodic components. For more information the reader is referred to [13]–[16].
FIGURE 45.2: Block diagram of multiband excitation coder.
Waveform Interpolation Coders
Figure 45.3 is a block diagram of a waveform interpolation coder. In this coder, the speech is assumed to be composed of a slowly evolving periodic waveform (SEW) and a rapidly evolving noise-like waveform (REW). A frame is analyzed first to extract a "characteristic waveform". The evolution of these waveforms is filtered to separate the REW from the SEW. REW updates are made several times more often than SEW updates. The LPC, the pitch, the spectra of the SEW and REW, and the overall energy are all transmitted independently. At the receiver a parametric representation of the SEW and REW information is constructed, summed, and passed through the LPC synthesis filter to produce output speech. For more information the reader is referred to [17, 18].
FIGURE 45.3: Block diagram of waveform interpolation coder.
45.3.2 Time Domain Waveform-Following Speech Coders
All of the time domain waveform coders described in this section include a prediction filter. We begin with the simplest.
Adaptive Differential Pulse Code Modulation (ADPCM)
Adaptive differential pulse code modulation (ADPCM) [19] is based on sample-by-sample quantization of the prediction error. A simple block diagram is shown in Fig. 45.4. Two parts of the coder may be adaptive: the quantizer step-size and/or the prediction filter. ITU Recommendations G.726 and G.727 adapt both. The adaptation may be either forward or backward adaptive. In a backward adaptive system, the adaptation is based only on the previously quantized sample values and the quantizer codewords. At the receiver, the backward adaptive parameter values must be recomputed. An important feature of such adaptation schemes is that they must use predictors that include a leakage factor that allows the effects of erroneous values caused by channel errors to die out over time. In a forward adaptive system, the adapted values are quantized and transmitted. This additional "side information" uses bit rate, but can improve quality. Additionally, it does not require recomputation at the decoder.
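A toy backward-adaptive ADPCM loop follows. The one-tap predictor, leakage constant, and step-size rules are invented for illustration and are far simpler than the multiplier tables of G.726/G.727; the point is that encoder and decoder adapt identically from the codewords alone.

```python
import numpy as np

def adpcm(x, levels=16, pred=0.85, leak=0.999):
    """Encode and decode in one loop; both sides adapt from codewords only."""
    step = 0.02
    xhat_enc = xhat_dec = 0.0        # identical predictor states
    y = np.zeros(len(x))
    for n, s in enumerate(x):
        d = s - xhat_enc             # prediction error
        code = int(np.clip(np.rint(d / step), -levels // 2, levels // 2 - 1))
        dq = code * step             # quantized difference
        y[n] = xhat_dec + dq         # decoder reconstruction
        xhat_enc = leak * pred * (xhat_enc + dq)  # leaky one-tap prediction
        xhat_dec = leak * pred * (xhat_dec + dq)
        # backward step-size adaptation: grow on large codes, shrink on small
        step = float(np.clip(step * (1.1 if abs(code) > levels // 4 else 0.98),
                             1e-4, 1.0))
    return y
```

The leakage factor (0.999 here) is what lets encoder and decoder states reconverge after a channel error corrupts a codeword.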
Delta Modulation Coders
In delta modulation coders [20], the quantizer is just the sign bit. The quantization step size is adaptive. Not all the adaptation schemes used for ADPCM will work for delta modulation because the quantization is so coarse. The quality of delta modulation coders tends to be proportional to their sampling clock: the greater the sampling clock, the greater the correlation between successive samples, and the finer the quantization step size that can be used. The block diagram for delta modulation is the same as that of ADPCM.
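A toy adaptive delta modulator in the same spirit: one sign bit per sample drives an up/down staircase. Practical designs such as CVSD adapt on runs of three or four equal bits; this sketch adapts on pairs, and all constants are invented.

```python
import numpy as np

def delta_mod(x, step0=0.02, grow=1.5, shrink=0.66):
    """One bit per sample: track the input with an adaptive staircase."""
    step, y, prev = step0, 0.0, 0
    bits = np.zeros(len(x), dtype=int)
    out = np.zeros(len(x))
    for n, s in enumerate(x):
        b = 1 if s >= y else -1              # the sign bit is the codeword
        step = float(np.clip(step * (grow if b == prev else shrink),
                             1e-4, 0.5))     # slope overload vs granular noise
        y += b * step
        bits[n], out[n], prev = b, y, b
    return bits, out
```

Growing the step on consecutive equal bits fights slope overload; shrinking it on alternating bits reduces granular noise around a slowly varying input.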

FIGURE 45.4: ADPCM encoder and decoder block diagrams.
Adaptive Predictive Coding
The better the performance of the prediction filter, the lower the bit rate needed to encode a
speech signal. This is the basis of the adaptive predictive coder [21] shown in Fig. 45.5. A forward
adaptive higher order linear prediction filter is used. The speech is quantized on a frame-by-frame
basis. In this way the bit rate for the excitation can be reduced compared to an equivalent quality
ADPCM coder.
FIGURE 45.5: Adaptive predictive coding encoder and decoder.
Linear Prediction Analysis-by-Synthesis Speech Coders
Figure 45.6 shows a typical linear prediction analysis-by-synthesis speech coder [22]. Like APC, these are frame-by-frame coders. They begin with an LPC analysis. Typically the LPC information is forward adaptive, but there are exceptions. LPAS coders borrow the concept from ADPCM of having a locally available decoder. The difference between the quantized output signal and the original signal is passed through a perceptual weighting filter. Possible excitation signals are considered and the best (minimum mean square error in the perceptual domain) is selected. The long-term prediction filter removes long-term correlation (the pitch structure) in the signal. If pitch structure is present in the coder, the parameters for the long-term predictor are determined first. The most commonly used system is the adaptive codebook, where samples from previous excitation sequences are stored. The pitch period and gain that result in the greatest reduction of perceptual error are selected, quantized, and transmitted. The fixed codebook excitation is next considered and, again, the excitation vector
that most reduces the perceptual error energy is selected and its index and gain are transmitted. A variety of different possible fixed excitation codebooks and their corresponding names have been created for coders that fall into this class. Our enumeration touches only the highlights.
FIGURE 45.6: Linear prediction analysis-by-synthesis coder.
Multipulse Linear Predictive Coding (MPLPC) assumes that the speech frame is sub-divided into smaller sub-frames. After determining the adaptive codebook contribution, the fixed codebook consists of a number of pulses. Typically the number of pulses is about one-tenth the number of samples in a sub-frame. The pulse that makes the greatest contribution to reducing the error is selected first, then the pulse making the next largest contribution, etc. Once the requisite number of pulses have been selected, determination of the pulses is complete. For each pulse, its location and amplitude must be transmitted.
Codebook Excited Linear Predictive Coding (CELP) assumes that the fixed codebook is composed of vectors. This is similar in nature to the Vector Excitation Coder (VXC). In the first CELP coder, the codebooks were composed of Gaussian random numbers. It was subsequently discovered that center-clipping these random number codebooks resulted in better quality speech. This had the effect of making the codebook look more like a collection of multipulse LPC excitation vectors. One means for reducing the fixed codebook search is if the codebook consists of overlapping vectors.
Vector Sum Excitation Linear Predictive Coding (VSELP) assumes that the fixed codebook is composed of a weighted sum of a set of basis vectors. The basis vectors are orthogonal to each other. The weights on any basis vector are always either −1 or +1. A fast search technique is possible based on using a pseudo-Gray code method of exploration. VSELP was used for several first or second generation digital cellular phone standards [23].
45.3.3 Frequency Domain Waveform-Following Speech Coders
Sub-Band Coders
Figure 45.7 shows the structures of a typical sub-band encoder and decoder [19, 24]. The concept behind sub-band coding is quite simple: divide the speech signal into a number of frequency bands and quantize each band separately. In this way the quantization noise is kept within the band. Typically quadrature mirror or wavelet filterbanks are used. These have the properties that (1) in the absence of quantization error all aliasing caused by decimation in the analysis filterbank is canceled in the synthesis filterbank and (2) the bands can be critically sampled, i.e., the number of frequency domain samples is the same as the number of time domain samples. The effectiveness of these coders depends largely on the sophistication of the quantization algorithm. Generally, algorithms that dynamically allocate the bits according to the current spectral characteristics of the speech give the best performance.

FIGURE 45.7: Sub-band coder.
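The two properties above are easy to demonstrate with the simplest (2-tap, Haar) QMF pair; real coders use longer filters for sharper band splits, but the aliasing cancellation and critical sampling work the same way. The sketch assumes an even-length input.

```python
import numpy as np

def analysis(x):
    """Split into low and high bands, each decimated by 2 (critically sampled)."""
    x = np.asarray(x, dtype=float)   # assumes even length
    low = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    high = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return low, high

def synthesis(low, high):
    """Recombine the bands; the aliasing from decimation cancels exactly."""
    y = np.zeros(2 * len(low))
    y[0::2] = (low + high) / np.sqrt(2.0)
    y[1::2] = (low - high) / np.sqrt(2.0)
    return y
```

Quantizing `low` and `high` with different bit allocations keeps each band's quantization noise inside that band.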
Adaptive Transform Coders
Adaptive transform coding (ATC) can be viewed as a further extension to sub-band coding [19, 24]. The filterbank structure of SBC is replaced with a transform such as the FFT, the discrete cosine transform (DCT), wavelet transform or other transform-filterbank. They provide a higher resolution analysis than the sub-band filterbanks. This allows the coder to exploit the pitch harmonic structure of the spectrum. As in the case of SBC, the ATC coders that use sophisticated quantization techniques that dynamically allocate the bits usually give the best performance. Most recently, work has combined transform coding with LPC and time-domain pitch analysis [25]. The residual signal is coded using ATC.
45.4 Current Standards
This part of the section is divided into descriptions of current speech coder standards and activities. The subsections contain information on speech coders that have been or will soon be standardized. We begin first by briefly describing the standards organizations that formulate speech coding standards and the processes they follow in making these standards.
The International Telecommunications Union (ITU) is a specialized agency of the United Nations charged with all aspects of standardization in telecommunications and radio networks. Its headquarters are in Geneva, Switzerland. The ITU Telecommunications Standardization Sector (ITU-T) formulates standards related to both wireline and wireless telecommunications. The ITU Radio Standardization Sector (ITU-R) handles standardization related to radio issues. There is also a third branch, the ITU Telecommunications Standards Bureau (ITU-B), the bureaucracy that handles all of the paperwork. Speech coding standards are handled jointly by Study Groups 16 and 12 within the ITU-T. Other Study Groups may originate requests for speech coders for specific applications. The speech coding experts are found in SG16. The experts on speech performance are found in SG12. When a new standard is being formulated, SG16 draws up a list of requirements based on the intended applications. SG12 and other interested bodies may review the requirements before they are finalized. SG12 then creates a test plan and enlists the help of subjective testing laboratories to measure the quality of the speech coders under the various test conditions. The process of standardization can be time consuming and take between 2 to 6 years.
Three different standards bodies make regional cellular standards, including those for the speech
coders.

© 1999 by CRC Press LLC

In Europe, the parent body is the European Telecommunications Standards Institute (ETSI).
ETSI is an organization composed mainly of telecommunications equipment manufacturers.
In North America, the parent body is the American National Standards Institute (ANSI). The body
charged with making digital cellular standards is the Telecommunications Industry Association (TIA).
In Japan, the body charged with making digital cellular standards is the Research and Development
Center for Radio Systems (RCR).
There are also speech coding standards for satellite, emergency, and secure telephony. Some of
these standards were promulgated by government bodies, while others were promulgated by private
organizations.
Each of these standards organizations works according to its own rules and regulations. However,
there is a set of common threads among all of the organizations; together they constitute the
standards-making process. Creating a standard is a long process, not to be undertaken lightly.
First, a consensus must be reached that a standard is needed. In most cases this is obvious.
Second, the terms of reference need to be created. This becomes the governing document for the
entire effort. It defines the intended applications. Based on these applications, requirements
can be set on the attributes of the speech coder: quality, complexity, bit rate, and delay. The
requirements will later determine the test program that is needed to ascertain whether any
candidates are suitable.
Finally, the members of the group need to define a schedule for doing the work. There needs to
be an initial period to allow proponents to design coders that are likely to meet the requirements.
A deadline is set for submissions. The services of one or more subjective test labs need to be
secured and a test plan needs to be defined. A host lab is also needed to process all of the data
that will be used in the Selection Test. Some criteria are needed for determining how to make the
selection. Based on the selection, a draft standard needs to be written. Only after the standard
is fully specified can manufacturers begin to produce implementations of the standard.
45.4.1 Current ITU Waveform Signal Coders
Table 45.1 describes current ITU speech coding recommendations that are based on sample-by-sample
scalar quantization. Three of these coders operate in the time domain on the original sampled
signal while the fourth is based on a two-band sub-band coder for wideband speech.
TABLE 45.1 ITU Waveform Speech Coders

  Standard body        ITU             ITU          ITU          ITU
  Number               G.711           G.726        G.727        G.722
  Year                 1972            1990         1990         1988
  Type of coder        Companded PCM   ADPCM        ADPCM        SBC/ADPCM
  Bit rate             64 kb/s         16–40 kb/s   16–40 kb/s   48, 56, 64 kb/s
  Quality              Toll            ≤ Toll       ≤ Toll       Commentary
  Complexity
    MIPS               1               1            1            10
    RAM                1 byte          < 50 bytes   < bytes      1 K words
  Delay
    Frame size         0.125 ms        0.125 ms     0.125 ms     1.5 ms
  Specification type
    Fixed point        Bit exact       Bit exact    Bit exact    Bit exact
The CCITT standardized two 64 kb/s companded PCM coders in 1972. North America and Japan
use µ-law PCM. The rest of the world uses A-law PCM. Both coders use 8 bits to represent the
signal. Their effective signal-to-noise ratio is about 35 dB. The tables for both of the G.711 quantizer
characteristics are contained in [19]. Both coders are considered equivalent in overall quality. A
tandem encoding with either coder is considered equivalent to dropping the least significant bit
(which is equivalent to reducing the bit rate to 56 kb/s). Both coders are extremely sensitive to bit
errors in the most significant bits. Their complexity is very low.
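The logarithmic quantization G.711 uses can be illustrated with the continuous µ-law companding curve (µ = 255). This is a sketch of the principle only: real G.711 implements a piecewise-linear segmented approximation of this curve with its own code tables, not the direct formula below.

```python
import math

MU = 255.0

def mulaw_compress(x):
    """Map a sample in [-1, 1] onto [-1, 1] on a logarithmic scale."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_expand(y):
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def mulaw_encode(x, bits=8):
    # Uniform quantization of the compressed value gives logarithmic
    # quantization of the original sample: fine steps near zero, coarse
    # steps near full scale.
    levels = 1 << bits
    y = mulaw_compress(x)
    code = int(round((y + 1.0) / 2.0 * (levels - 1)))
    return max(0, min(levels - 1, code))

def mulaw_decode(code, bits=8):
    levels = 1 << bits
    y = 2.0 * code / (levels - 1) - 1.0
    return mulaw_expand(y)
```

Because the step size grows with the sample magnitude, the signal-to-noise ratio stays roughly constant over a wide dynamic range, which is why 8 bits suffice for an effective SNR of about 35 dB.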
32 kb/s ADPCM was first standardized by the ITU in 1984 [26]–[28]. Its primary application was
intended to be digital circuit multiplication equipment (DCME). In combination with digital speech
interpolation, a 5:1 increase in the capacity of undersea cables and satellite links was realized for
voice conversations. An additional reason for its creation was that such links often encountered the
problem of having µ-law PCM at one end and A-law at the other. G.726 can accept either µ-law or
A-law PCM as inputs or outputs. Perhaps its most distinctive feature is a property called synchronous
tandeming. If a circuit involves two ADPCM codings with a µ-law or A-law encoding in-between,
no additional degradation occurs because of the second encoding. The second bit stream will be
identical to the first! In 1986 the Recommendation was revised to eliminate the all-zeroes codeword
and to ensure that certain low rate modem signals would be passed satisfactorily. In 1988 extensions
for 24 and 40 kb/s were added and in 1990 the 16 kb/s rate was added. All of these additional rates
were added for use in digital circuit multiplication equipment applications.
G.727 includes the same rates as G.726, but all of the quantizers have an even number of levels. The
2-bit quantizer is embedded in the 3-bit quantizer, which is embedded in the 4-bit quantizer, which
is embedded in the 5-bit quantizer. This is needed for Packet Circuit Multiplex Equipment (PCME),
where the least significant bits in the packet can be discarded when there is an overload condition.
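The embedding property — each coarser quantizer's codewords are prefixes of the finer one's, so the network can simply discard least significant bits — can be illustrated with a fixed uniform quantizer. This is only a sketch of the principle; G.727 embeds adaptive ADPCM quantizers (2 bits inside 3 inside 4 inside 5), not the static uniform quantizer below.

```python
def embedded_encode(x, bits=5):
    """Uniform embedded quantizer on [-1, 1): the b-bit code is simply
    the top b bits of the finest (5-bit) code."""
    levels = 1 << bits
    x = max(-1.0, min(x, 1.0 - 1e-9))
    return int((x + 1.0) / 2.0 * levels)

def embedded_decode(code, bits):
    """Reconstruct at the midpoint of the b-bit quantization cell."""
    levels = 1 << bits
    return (code + 0.5) * 2.0 / levels - 1.0

def drop_bits(code, from_bits, to_bits):
    # Discarding least significant bits (as PCME does under overload)
    # leaves a valid coarser codeword -- no re-encoding is needed.
    return code >> (from_bits - to_bits)
```

The point of the embedding is that the decoder needs no signaling beyond the number of bits that survived: any truncated codeword is still a legal output of the corresponding coarser quantizer.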
Recommendation G.722 is a wideband speech coding standard. Its principal applications are
teleconferences and video teleconferences [29]. The wider bandwidth (50–7000 Hz) is more natural
sounding and less fatiguing than telephone bandwidth (200–3200 Hz). The wider bandwidth
increases the intelligibility of the speech, especially for fricative sounds such as /f/ and /s/, which
are difficult to distinguish for telephone bandwidth. The G.722 coder is a two-band sub-band
coder with ADPCM coding in both bands. The ADPCM is similar in structure to that of the G.727
recommendation. The upper band uses an ADPCM coder with a 2-bit adaptive quantizer. The lower
band uses an ADPCM coder with an embedded 4-5-6 bit adaptive quantizer. This makes the rates of
48, 56, and 64 kb/s all possible. A 24-tap quadrature mirror filter is used to efficiently split the signal.
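The two-band split itself can be sketched with the classic QMF construction H1(z) = H0(−z), under which the synthesis pair G0 = H0, G1 = −H1 cancels the aliasing introduced by the 2:1 decimation. A 2-tap Haar prototype is used below purely for brevity; it gives perfect reconstruction with one sample of delay. G.722's actual 24-tap filter is a longer design of the same family with much sharper band separation.

```python
import math

A = 1.0 / math.sqrt(2.0)
H0 = [A, A]                                     # lowpass prototype (Haar, for brevity)
H1 = [h * (-1) ** n for n, h in enumerate(H0)]  # highpass: H1(z) = H0(-z)

def convolve(x, h):
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def upsample(s):
    u = [0.0] * (2 * len(s))
    u[::2] = s
    return u

def qmf_analysis(x):
    # Filter into two bands, then decimate each by 2. This is the point
    # at which a sub-band coder would quantize lo and hi separately.
    return convolve(x, H0)[::2], convolve(x, H1)[::2]

def qmf_synthesis(lo, hi):
    # Alias-cancelling synthesis pair: G0 = H0, G1 = -H1.
    g1 = [-h for h in H1]
    a = convolve(upsample(lo), H0)
    b = convolve(upsample(hi), g1)
    return [p + q for p, q in zip(a, b)]
```

Each band carries half the sample rate, so the total number of sub-band samples equals the number of input samples, and the bit budget can then be divided unevenly between the perceptually more important lower band and the upper band.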
45.4.2 ITU Linear Prediction Analysis-by-Synthesis Speech Coders
Table 45.2 describes three current analysis-by-synthesis speech coder recommendations of the ITU.
All three are block coders based on extensions of the original multipulse LPC speech coder.
TABLE 45.2 ITU Linear Prediction Analysis-By-Synthesis Speech Coders

  Standard body        ITU              ITU           ITU
  Number               G.728            G.729         G.723.1
  Year                 1992 and 1994    1995          1995
  Type of coder        LD-CELP          CS-ACELP      MPC-MLQ and ACELP
  Bit rate             16 kb/s          8 kb/s        6.3 & 5.3 kb/s
  Quality              Toll             Toll          ≤ Toll
  Complexity
    MIPS               30               ≤ 22          ≤ 16
    RAM                2 K              < 2.5 K       2.2 K
  Delay
    Frame size         0.625 ms         10 ms         30 ms
    Look ahead         0                5 ms          7.5 ms
  Specification type
    Floating point     Algorithm exact  None          None
    Fixed point        Bit exact        Bit exact C   Bit exact C
G.728 Low-Delay CELP (LD-CELP) [30] is a backward adaptive CELP coder whose quality is
equivalent to that of 32 kb/s ADPCM. It was initially specified as a floating point CELP coder that
required implementers to follow exactly the algorithm specified in the recommendation. A set of
test vectors for verifying correct implementation was created. Subsequently, a bit exact fixed point
specification was requested and completed in 1994. The performance of G.728 has been extensively
tested by SG12. It gives robust performance for signals with background noise or music. It is very
robust to random bit errors, more so than previous ITU standards G.711, G.726, G.727, and the
newer standards described below. In addition to passing low bit rate modem signals as high as 2400
bps, it passes all network signaling tones.
In response to a request from CCIR Task Group 8/1 for a speech coder for wireless networks
as envisioned in the Future Public Land Mobile Telecommunication Service (FPLMTS), the ITU
initiated a work program for a toll quality 8 kb/s speech coder, which resulted in G.729. It is a
forward adaptive CELP coder with a 10-ms frame size that uses algebraic CELP (ACELP) excitation.
The work program for G.723.1 was initiated in 1993 by the ITU as part of a group of standards
to specify a low bit rate videophone for use on the public switched telephone network (PSTN) carried
over a high speed modem. Other standards in this group include the video coder, modem, and
data multiplexing scheme. A dual rate coder was selected. The two rates differ primarily in their
excitation scheme. The higher rate uses Multipulse LPC with Maximum Likelihood Quantization
(MPC-MLQ) while the lower rate uses ACELP. G.723.1 and G.729 are the first ITU coders to be
specified by a bit exact fixed point ANSI C code simulation of the encoder and decoder.
45.4.3 Digital Cellular Speech Coding Standards
Table 45.3 describes the first and second generation of speech coders to be standardized for digital
cellular telephony. The first generation coders provided adequate quality. Two of the second
generation coders are so-called half-rate coders that have been introduced in order to double the
capacity of the rapidly growing digital cellular industry. Another generation of coders will soon
follow them in order to bring the voice quality of digital cellular service up to that of current
wireline network telephony.
TABLE 45.3 Digital Cellular Telephony Speech Coders

  Standard body    CEPT       ETSI          TIA         TIA              RCR         RCR
  Standard name    GSM        GSM 1/2 Rate  IS-54       IS-96            PDC         PDC 1/2 Rate
  Type of coder    RPE-LTP    VSELP         VSELP       CELP             VSELP       PSI-CELP
  Date             1987       1994          1989        1993             1990        1993
  Bit rate         13 kb/s    5.6 kb/s      7.95 kb/s   0.8 to 8.5 kb/s  6.7 kb/s    3.45 kb/s
  Quality          < toll     = GSM         = GSM       < GSM            < GSM       = PDC
  Est. complexity
    MIPS           4.5        30            20          20               20          50
    RAM            1 K        4 K           2 K         2 K              2 K         4 K
  Delay
    Frame size     20 ms      20 ms         20 ms       20 ms            20 ms       40 ms
    Look ahead     0          5 ms          5 ms        5 ms             5 ms        10 ms
  Specification
    (fixed point)  Bit exact  Bit exact C   Bit stream  Bit stream       Bit stream  Bit stream
The RPE-LTP coder [33] was standardized by the Group Special Mobile (GSM) of CEPT in 1987
for pan-European digital cellular telephony. RPE-LTP stands for Regular Pulse Excitation with
Long-Term Predictor. The GSM full-rate channel supports 22.8 kb/s. The additional 9.8 kb/s is used for
channel coding to protect the coder from bit errors in the radio channel. Voice activity detection
and discontinuous transmission are included as part of this standard. In addition to digital cellular
telephony, this coder has since been used for other applications, such as messaging, because of its
low complexity.
The GSM half-rate coder was standardized by ETSI (an off-shoot of CEPT) in order to double the
capacity of the GSM cellular system. The coder is a 5.6 kb/s VSELP coder [23]. A greater percentage
of the channel bits are used for error protection because the half-rate channel has less frequency
diversity than the full-rate system. The overall performance was measured to be similar to that of
RPE-LTP, except for certain signals with background noise.
Vector Sum Excited Linear Prediction coding (VSELP) was standardized by the Telecommunications
Industry Association (TIA) for time division multiple access (TDMA) digital cellular telephony
in North America as a part of Interim Standard 54 (IS-54). It was selected on the basis of subjective
listening tests in 1989. The quality of this coder and RPE-LTP are somewhat different in the character
of their distortion, but they usually receive about the same MOS in subjective listening tests. IS-54
does not have a bit exact specification. Implementations need only conform to the bit stream
specification. The TIA does have a qualification procedure, IS-85, to verify whether the performance
of an implementation is good enough to be used for digital cellular [34]. In addition, Motorola
provided a floating point C program for their version of the coder, which implementers may use
as a guideline.
The IS-96 coder [35] was standardized by the TIA for code division multiple access (CDMA) digital
cellular telephony in North America. It is a part of IS-96 and is used in the system specified by IS-95.
CDMA system capacity is its most attractive feature. When there is no speech, the rate of the channels
is reduced. IS-96 is a variable rate CELP coder which uses digital speech interpolation to achieve
this rate reduction. It runs at 8.5 kb/s during most of a talk spurt. When there is no speech on the
channel, it drops down to just 0.8 kb/s. At this rate, it is just supplying statistics about the background
noise. These two rates are the ones most often used during operation of IS-96, although the coder
does transition through the two intermediate rates of 2 and 4 kb/s. The validation procedure for this
coder is similar to that of IS-85.
The Personal Digital Cellular (PDC) full-rate speech coder was standardized by the Research and
Development Center for Radio Systems (RCR) for TDMA digital cellular telephone service in Japan
as RCR STD-27B. The coder is very similar to IS-54 VSELP. The principal difference is that instead
of two vector sum excitation codebooks, there is only one.
The PDC half-rate coder [37] was standardized by RCR to double the capacity of the Japanese
TDMA PDC system. Pitch synchronous innovation CELP (PSI-CELP) uses fixed codebooks that are
modified as a function of the pitch in order to improve the speech quality for such a low rate coder.
If the pitch period is less than the frame size, then all vectors in the fixed codebook for that frame are
made periodic. It has a background noise pre-processor as part of the standard. When it senses that
the background noise exceeds a certain threshold, the pre-processor attempts to improve the quality
of the speech. To date, this coder appears to be the most complex yet standardized.
45.4.4 Secure Voice Standards
Table 45.4 presents information about three secure voice standards. Two are existing standards,
while the third describes a standard that the U.S. government hopes to promulgate in 1996.
FS1015 [12] is a U.S. Federal Standard 2.4 kb/s LPC vocoder that was created over a long period of
time beginning in the late 1970s. It was standardized by the U.S. Department of Defense (DoD) and
later the North Atlantic Treaty Organization (NATO) before becoming a U.S. Federal Standard in
1984. It was always intended for secure voice terminals. It does not produce natural sounding speech,
but over the years its intelligibility has been greatly improved through a series of changes to both its
encoder and decoder. Remarkably, these changes never required changes to the bit stream. Presently
the intelligibility of FS1015 for clean input speech having telephone bandwidth is almost equivalent

TABLE 45.4 Secure Telephony Speech Coding Standards

  Standard body     U.S. Dept. of Defense  U.S. Dept. of Defense  U.S. Dept. of Defense
  Standard number   FS-1015                FS-1016                ?
  Type of coder     LPC vocoder            CELP                   Model-based
  Year              1984                   1991                   1996
  Bit rate          2.4 kb/s               4.8 kb/s               2.4 kb/s
  Quality           high DRT               < IS-54                = FS-1016
  Complexity
    MIPS            20                     19                     41 (a)
    RAM             2 K                    1.5 K                  Unknown
  Delay
    Frame size      22.5 ms                30 ms                  22.5 ms
    Look ahead      90 ms                  7.5 ms                 23 ms
  Specification     Bit stream             Bit stream             Bit stream

  (a) Actual goal is 40 MIPS floating point or 80 MIPS fixed point.
to that of the source material as measured by the diagnostic rhyme test (DRT). Most recently an 800
bps vector quantized version of FS1015 has been standardized by NATO [39].
FS1016 [40] is the result of a project undertaken by DoD to increase the naturalness of the secure
telephone unit III (STU-3) by the introduction of 4.8 kb/s modem technology. DoD surveyed
available 4.8 kb/s speech coder technology in 1988 and 1989. It selected a CELP-based coder having
a so-called ternary codebook, meaning that all excitation amplitudes are +1, −1, or 0 before scaling
by the gain for that sub-frame. This allows an easier codebook search. FS1016 definitely preserves
far more of the naturalness of the original speech than FS1015, but the speech still contains many
artifacts and the quality is substantially below that of the cellular coders such as GSM or IS-54. Both
FS-1015 and FS-1016 have bit stream specifications, but there are C code simulations of them
available from the government.
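The computational appeal of a ternary codebook can be seen in a toy search: with codevector entries restricted to +1, −1, and 0, the correlation against the target reduces to additions and subtractions, and the optimal gain and error follow in closed form. The target and codebook below are made-up illustrations, and a real CELP search would operate on the perceptually weighted, synthesis-filtered target rather than the raw one as this sketch does.

```python
def ternary_search(target, codebook):
    """Pick the ternary codevector (entries in {-1, 0, +1}) and gain that
    best match the target in the least-squares sense."""
    best = None
    for idx, c in enumerate(codebook):
        # Correlation <target, c> needs only adds and subtracts.
        corr = sum(t for t, s in zip(target, c) if s == 1) \
             - sum(t for t, s in zip(target, c) if s == -1)
        # For ternary entries, the codevector energy is just the count
        # of nonzero positions.
        energy = sum(1 for s in c if s != 0)
        if energy == 0:
            continue
        gain = corr / energy
        # Residual error after scaling by the optimal gain.
        err = sum(t * t for t in target) - corr * corr / energy
        if best is None or err < best[0]:
            best = (err, idx, gain)
    return best[1], best[2]
```

Avoiding multiplications in the inner correlation loop was a significant saving on the fixed point DSPs of the era, which is the practical motivation for the ternary restriction.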
The next coder to be standardized by DoD is a new 2.4 kb/s coder to replace both FS1015 and
FS1016. A 3-year project was initiated in 1993 which should culminate in a new standard in 1997.
Subjective testing was done in 1993 and 1994 on software versions of potential coders and a realtime
hardware evaluation took place in 1995 and 1996 to select a best candidate. The Mixed Excitation
Linear Prediction (MELP) coder was selected [41]–[43]. The need for this coder is due to the lack
of a sufficient number of satellite channels at 4.8 kb/s. The quality target for this coder is to match
or exceed the quality and intelligibility of FS1016 for most scenarios. Many of the scenarios include
severe background noise and noisy channel conditions. At 2.4 kb/s, there is not enough bit rate
available for explicit channel coding, so the speech coder itself must be designed to be robust for
the channel conditions. The noisy background conditions have proven to be difficult for vocoders
making voiced/unvoiced classification decisions, whether the decisions are made for all bands or for
individual bands.
45.4.5 Performance
Figure 45.8 is included to give an impression of the relative performance for clean speech of most of
the standard coders that were included above. There has never been a single subjective test that
included all of the above coders. Figure 45.8 is based on the relative performances of these coders
across a number of tests that have been reported. In the case of coders that are not yet standards,
their performance is projected and shown as a circle. The vertical axis of Fig. 45.8 gives the
approximate single encoding quality for clean input speech. The horizontal axis is a logarithmic
scale of bit rate. Figure 45.8 only includes telephone bandwidth speech coders. The 7-kHz speech
coders have been omitted. Figure 45.9 compares the complexity as measured in MIPS and RAM for
a fixed point DSP implementation for most of the same standard coders. The horizontal axis is in
RAM and the vertical axis is in MIPS.

FIGURE 45.8: Approximate speech quality of speech coding standards.

FIGURE 45.9: Approximate complexity of speech coding standards.

References
[1] Markel, J.D. and Gray, Jr., A.H., Linear Prediction of Speech, Springer-Verlag, Berlin, 1976.
[2] Rabiner, L.R. and Schafer, R.W., Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, NJ, 1978.
[3] LeRoux, J. and Gueguen, C., A fixed point computation of partial correlation coefficients, IEEE Trans. ASSP, ASSP-27, 257–259, 1979.
[4] Tohkura, Y., Itakura, F., and Hashimoto, S., Spectral smoothing technique in PARCOR speech analysis/synthesis, IEEE Trans. ASSP, 27, 257–259, 1978.
[5] Viswanathan, R. and Makhoul, J., Quantization properties of transmission parameters in linear predictive systems, IEEE Trans. ASSP, 23, 309–321, 1975.
[6] Sugamura, N. and Itakura, F., Speech analysis and synthesis methods developed at ECL in NTT — from LPC to LSP, Speech Commun., 5, 199–215, 1986.
[7] Soong, F. and Juang, B.-H., Optimal quantization of LSP parameters, IEEE Trans. Speech and Audio Processing, 1, 15–24, 1993.
[8] Lloyd, S.P., Least squares quantization in PCM, IEEE Trans. Inform. Theory, 28, 129–137, 1982.
[9] Gersho, A. and Gray, R.M., Vector Quantization and Signal Compression, Kluwer Academic Publishers, Dordrecht, Holland, 1991.
[10] Schroeder, M.R., Atal, B.S., and Hall, J.L., Optimizing digital speech coders by exploiting masking properties of the human ear, J. Acoustical Soc. Am., 66, 1647–1652, Dec. 1979.
[11] Chen, J.-H. and Gersho, A., Adaptive postfiltering for quality enhancement of coded speech, IEEE Trans. Speech and Audio Processing, 3, 59–71, 1995.
[12] Tremain, T., The Government Standard Linear Predictive Coding Algorithm: LPC-10, Speech Technol., 40–49, Apr. 1982. Federal Standard 1015 is available from the U.S. government, as is C source code.
[13] McAulay, R.J. and Quatieri, T.F., Speech analysis/synthesis based on a sinusoidal representation, IEEE Trans. ASSP, 34, 744–754, 1986.
[14] McAulay, R.J. and Quatieri, T.F., Low-rate speech coding based on the sinusoidal model, in Advances in Acoustics and Speech Processing, Sondhi, M. and Furui, S., Eds., Marcel-Dekker, New York, 1992, 165–207.
[15] Griffin, D.W. and Lim, J.S., Multiband excitation vocoder, IEEE Trans. ASSP, 36, 1223–1235, 1988.
[16] Hardwick, J.C. and Lim, J.S., The application of the IMBE speech coder to mobile communications, Proc. ICASSP '91, 249–252, 1991.
[17] Kleijn, W.B. and Haagen, J., Transformation and decomposition of the speech signal for coding, IEEE Signal Processing Lett., 136–138, 1994.
[18] Kleijn, W.B. and Haagen, J., A general waveform interpolation structure for speech coding, in Signal Processing VII, Holt, M.J.J., Grant, P.M. and Sandham, W.A., Eds., Kluwer Academic Publishers, Dordrecht, Holland, 1994.
[19] Jayant, N.S. and Noll, P., Digital Coding of Waveforms, Prentice-Hall, Englewood Cliffs, NJ, 1984, 232–233.
[20] Steele, R., Delta Modulation Systems, Halsted Press, New York, 1975.
[21] Atal, B.S., Predictive coding of speech at low bit rates, IEEE Trans. Comm., 30, 600–614, 1982.
[22] Gersho, A., Advances in speech and audio compression, Proc. IEEE, 82, 900–918, 1994.
[23] Gerson, I.A. and Jasiuk, M.A., Techniques for improving the performance of CELP-type speech coders, IEEE JSAC, 10, 858–865, 1992.
[24] Crochiere, R.E. and Tribolet, J., Frequency domain coding of speech, IEEE Trans. ASSP, 1979.
[25] Lefebvre, R., Salami, R., Laflamme, C., and Adoul, J.-P., High quality coding of wideband audio signals using transform coded excitation (TCX), Proc. ICASSP '94, I-193–196, Apr. 1994.
[26] Petr, D.W., 32 kb/s ADPCM-DLQ coding for network applications, Proc. IEEE GLOBECOM '82, A8.3-1–A8.3-5, 1982.
[27] Daumer, W.R., Maitre, X., Mermelstein, P., and Tokizawa, I., Overview of the 32 kb/s ADPCM algorithm, Proc. IEEE GLOBECOM '84, 774–777, 1984.
[28] Taka, M., Maruta, R., and Le Guyader, A., Synchronous tandem algorithm for 32 kb/s ADPCM, Proc. IEEE GLOBECOM '84, 791–795, 1984.
[29] Taka, M. and Maitre, X., CCITT standardizing activities in speech coding, Proc. ICASSP '86, 817–820, 1986.
[30] Chen, J.-H., Cox, R.V., Lin, Y.-C., Jayant, N., and Melchner, M.J., A low-delay CELP coder for the CCITT 16 kb/s speech coding standard, IEEE JSAC, 10, 830–849, 1992.
[31] Johansen, F.T., A non bit-exact approach for implementation verification of the CCITT LD-CELP speech coder, Speech Commun., 12, 103–112, 1993.
[32] South, C.R., Rugelbak, J., Usai, P., Kitawaki, N., Irii, H., Rosenberger, J., Cavanaugh, J.R., Adesanya, C.A., Pascal, D., Gleiss, N., and Barnes, G.J., Subjective performance assessment of CCITT's 16 kbit/s speech coding algorithm, Speech Commun., 12, 113–134, 1993.
[33] Vary, P., Hellwig, K., Hofmann, R., Sluyter, R.J., Galand, C., and Russo, M., Speech codec for the European mobile radio system, Proc. ICASSP '88, 227–230, 1988.
[34] TIA/EIA Interim Standard 85, Recommended minimum performance standards for full rate speech codecs, May 1992.
[35] DeJaco, A., Gardner, W., Jacobs, P., and Lee, C., QCELP: the North American CDMA digital cellular variable rate speech coding standard, Proc. IEEE Workshop on Speech Coding for Telecommunications, 5–6, 1993.
[36] TIA/EIA Interim Standard 125, Recommended minimum performance for digital cellular wideband spread spectrum speech service option 1, Aug. 1994.
[37] Miki, T., 5.6 kb/s PSI-CELP for digital cellular mobile radio, Proc. First International Workshop on Mobile Multimedia Communications, Tokyo, Japan, Dec. 7-10, 1993.
[38] Ohya, T., Suda, H., and Miki, T., 5.6 kb/s PSI-CELP of the half-rate PDC speech coding standard, Proc. IEEE Vehicular Technol. Conf., 1680–1684, June 1994.
[39] Nouy, B., de la Noue, P., and Goudezeune, G., NATO STANAG 4479, a standard for an 800 bps vocoder and redundancy protection in HF-ECCM system, Proc. ICASSP '95, 480–483, May 1995.
[40] Campbell, J.P., Welch, V.C., and Tremain, T.E., The new 4800 bps voice coding standard, Proc. Military Speech Tech. 89, 64–70, Nov. 1989. Copies of Federal Standard 1016 are available from the U.S. Government, as is C source code.
[41] McCree, A., Truong, K., George, E., Barnwell, T., and Viswanathan, V., A 2.4 kbit/s MELP coder candidate for the new U.S. federal standard, Proc. ICASSP '96, May 1996.
[42] Kohler, M., A comparison of the new 2400 bps MELP federal standard with other standard coders, Proc. ICASSP '97, 1587–1590, April 1997.
[43] Supplee, L., Cohn, R., Collura, J., and McCree, A., MELP: the new federal standard at 2400 bps, Proc. ICASSP '97, 1591–1594, April 1997.
