Mammone, R.J. & Zhang, X. “Robust Speech Processing as an Inverse Problem”
Digital Signal Processing Handbook
Ed. Vijay K. Madisetti and Douglas B. Williams
Boca Raton: CRC Press LLC, 1999
© 1999 by CRC Press LLC
27
Robust Speech Processing as an
Inverse Problem
Richard J. Mammone
Rutgers University
Xiaoyu Zhang
Rutgers University
27.1 Introduction
27.2 Speech Production and Spectrum-Related Parameterization
27.3 Template-Based Speech Processing
27.4 Robust Speech Processing
27.5 Affine Transform
27.6 Transformation of Predictor Coefficients
Deterministic Convolutional Channel as a Linear Transform • Additive Noise as a Linear Transform
27.7 Affine Transform of Cepstral Coefficients
27.8 Parameters of Affine Transform
27.9 Correspondence of Cepstral Vectors
References
27.1 Introduction
This section addresses the inverse problem in robust speech processing. A problem that speaker and speech recognition systems regularly encounter in commercial applications is the dramatic degradation of performance due to a mismatch between the training and operating environments. The mismatch generally results from the diversity of the operating environments. For applications over the telephone network, the operating environments may vary from offices and laboratories to homes and airports. The problem becomes worse when speech is transmitted over a wireless network. Here the system experiences cross-channel interference in addition to the channel and noise degradations that exist in the regular telephone network. The key issue in robust speech processing is to obtain good performance regardless of the mismatch in environmental conditions. The inverse problem in this sense refers to the process of modeling the mismatch in the form of a transformation and resolving it via an inverse transformation. In this section, we introduce the method of modeling the mismatch as an affine transformation.
Before getting into the details of the inverse problem in robust speech processing, we give a brief review of the mechanism of speech production, as well as the retrieval of useful information from the speech signal for recognition purposes.
27.2 Speech Production and Spectrum-Related Parameterization
The speech signal consists of time-varying acoustic waveforms produced as a result of acoustical
excitation of the vocal tract. It is nonstationary in that the vocal tract configuration changes over
time. A time-varying digital filter is generally used to describe the vocal tract characteristics. The
steady-state system function of the filter is of the form [1, 2]:
    S(z) = \frac{G}{1 - \sum_{i=1}^{p} a_i z^{-i}}
         = \frac{G}{\prod_{i=1}^{p} \left(1 - z_i z^{-1}\right)} ,        (27.1)
where p is the order of the system and the z_i denote the poles of the transfer function. The time-domain representation of this filter is
    s(n) = \sum_{i=1}^{p} a_i\, s(n-i) + G\, u(n) .        (27.2)
The speech sample s(n) is predicted as a linear combination of the previous p samples plus the excitation G u(n), where G is the gain factor. The factor G is generally ignored in recognition tasks to allow for robustness to variations in the energy of speech signals. This speech production model is often referred to as the linear prediction (LP) model, or the autoregressive model, and the coefficients a_i are called the predictor coefficients.
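As a quick illustration (a sketch added here, not part of the original chapter), the LP model of Eq. (27.2) can be simulated directly; the order, coefficients, gain, and white-noise excitation below are arbitrary choices:

    import numpy as np

    # Sketch: synthesize a signal from the LP model of Eq. (27.2),
    # s(n) = sum_{i=1}^{p} a_i s(n-i) + G u(n), with made-up parameters.
    p = 2
    a = np.array([1.3, -0.8])       # example predictor coefficients (stable AR(2))
    G = 0.1                         # gain factor
    N = 1000
    rng = np.random.default_rng(0)
    u = rng.standard_normal(N)      # white-noise excitation (unvoiced-like)

    s = np.zeros(N)
    for n in range(N):
        for i in range(1, p + 1):
            if n - i >= 0:
                s[n] += a[i - 1] * s[n - i]
        s[n] += G * u[n]

A periodic pulse train in place of u would mimic voiced excitation; the coefficients a_i shape the spectral envelope discussed below.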
The cepstrum of the speech signal s(n) is defined as
    c(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log \left| S\left(e^{j\omega}\right) \right| e^{j\omega n} \, d\omega .        (27.3)
It is simply the inverse Fourier transform of the logarithm of the magnitude of the Fourier transform S(e^{j\omega}) of the signal s(n).
From the definition of cepstrum in Eq. (27.3), we have

    \sum_{n=-\infty}^{\infty} c(n) e^{-j\omega n}
        = \log \left| S\left(e^{j\omega}\right) \right|
        = \log \left| \frac{G}{1 - \sum_{n=1}^{p} a_n e^{-j\omega n}} \right| .        (27.4)
If we differentiate both sides of the equation with respect to \omega and equate the coefficients of like powers of e^{j\omega}, the following recursion is obtained:
    c(n) = \begin{cases} \log G & n = 0 \\[4pt] a(n) + \dfrac{1}{n} \sum_{i=1}^{n-1} i\, c(i)\, a(n-i) & n > 0 \end{cases}        (27.5)
The cepstral coefficients can be calculated using this recursion once the predictor coefficients have been determined. The zeroth-order cepstral coefficient is generally ignored in speech and speaker recognition due to its sensitivity to the gain factor, G.
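A minimal sketch of this recursion, assuming the predictor coefficients are already available (the function name and test values are ours, not the chapter's):

    import numpy as np

    def lpc_to_cepstrum(a, G, n_ceps):
        """Cepstral coefficients from predictor coefficients via Eq. (27.5).
        a[i-1] holds a_i; a(n) is taken as zero for n > p."""
        p = len(a)
        c = np.zeros(n_ceps + 1)
        c[0] = np.log(G)                         # c(0) = log G, usually discarded
        for n in range(1, n_ceps + 1):
            a_n = a[n - 1] if n <= p else 0.0
            acc = sum(i * c[i] * a[n - i - 1]    # sum_{i=1}^{n-1} i c(i) a(n-i)
                      for i in range(1, n) if 1 <= n - i <= p)
            c[n] = a_n + acc / n
        return c

    print(lpc_to_cepstrum(np.array([1.3, -0.8]), G=1.0, n_ceps=5))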

An alternative solution for the cepstral coefficients is given by
    c(n) = \frac{1}{n} \sum_{i=1}^{p} z_i^{n} .        (27.6)
It is obtained by equating the terms of like powers of z^{-1} in the following equation:

    \sum_{n=-\infty}^{\infty} c(n) z^{-n}
        = \log \frac{1}{\prod_{i=1}^{p} \left(1 - z_i z^{-1}\right)}
        = -\sum_{i=1}^{p} \log\left(1 - z_i z^{-1}\right) ,        (27.7)
where the logarithm terms can be written as a power series expansion given by

    \log\left(1 - z_i z^{-1}\right) = -\sum_{k=1}^{\infty} \frac{1}{k}\, z_i^{k} z^{-k} .        (27.8)
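As a numerical sanity check (our sketch, with an arbitrary second-order filter), Eq. (27.6) can be evaluated directly from the poles and compared against the recursion of Eq. (27.5):

    import numpy as np

    # Poles of 1 - 1.3 z^{-1} + 0.8 z^{-2}, i.e., roots of z^2 - 1.3 z + 0.8.
    z_poles = np.roots([1.0, -1.3, 0.8])        # complex-conjugate pair, |z| < 1

    # Eq. (27.6): c(n) = (1/n) sum_i z_i^n for n > 0.
    for n in range(1, 6):
        c_n = np.sum(z_poles ** n).real / n     # imaginary parts cancel for real s(n)
        print(n, c_n)

For these coefficients the output begins c(1) = 1.3 and c(2) = 0.045, matching the recursion sketch above.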
There are two standard methods of solving for the predictor coefficients a_i, namely, the autocorrelation method and the covariance method [3, 4, 5, 6]. Both approaches are based on minimizing the mean square value of the estimation error e(n) as given by

    e(n) = s(n) - \sum_{i=1}^{p} a_i\, s(n-i) .        (27.9)
The two methods differ with respect to the details of numerical implementation. The autocorrelation method assumes that the speech samples are zero outside the processing interval of N samples. This results in a nonzero prediction error, e(n), outside the interval. The covariance method fixes the interval over which the prediction error is computed and places no constraints on the sample values outside the interval. The autocorrelation method is computationally simpler than the covariance approach and assures a stable system in which all poles of the transfer function lie within the unit circle. A brief description of the autocorrelation method follows.
The autocorrelation of the signal s(n) is defined as
    r_s(k) = \sum_{n=0}^{N-1-k} s(n)\, s(n+k) = s(n) \otimes s(-n) ,        (27.10)
where N is the number of samples in the sequence s(n) and the sign \otimes denotes the convolution operation. The definition of autocorrelation implies that r_s(k) is an even function. The predictor coefficients a_i can therefore be obtained by solving the following set of equations:

    \begin{bmatrix}
      r_s(0)   & r_s(1)   & \cdots & r_s(p-1) \\
      r_s(1)   & r_s(0)   & \cdots & r_s(p-2) \\
      \vdots   & \vdots   & \ddots & \vdots   \\
      r_s(p-1) & r_s(p-2) & \cdots & r_s(0)
    \end{bmatrix}
    \begin{bmatrix} a_1 \\ \vdots \\ a_p \end{bmatrix}
    =
    \begin{bmatrix} r_s(1) \\ \vdots \\ r_s(p) \end{bmatrix} .
Denoting the p \times p Toeplitz autocorrelation matrix on the left-hand side by R_s, the predictor coefficient vector by a, and the vector of autocorrelation coefficients by r_s, we have

    R_s\, a = r_s .        (27.11)
The predictor coefficient vector a is then given by the inverse relation

    a = R_s^{-1} r_s .

This equation will be used throughout the analysis in the rest of this section. Since the matrix R_s is Toeplitz, a computationally efficient algorithm known as the Levinson-Durbin recursion can be used to solve for a [3].
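The autocorrelation method can be sketched end to end as follows; this is a standard textbook form of the Levinson-Durbin recursion written out for illustration, and the AR(2) test signal is our own contrivance rather than anything from the chapter:

    import numpy as np

    def autocorrelation(s, p):
        """r_s(k) for k = 0..p per Eq. (27.10); samples outside the frame are zero."""
        N = len(s)
        return np.array([np.dot(s[:N - k], s[k:]) for k in range(p + 1)])

    def levinson_durbin(r, p):
        """Solve the Toeplitz system R_s a = r_s of Eq. (27.11) in O(p^2)."""
        a = np.zeros(p)
        E = r[0]                                  # prediction error energy
        for m in range(1, p + 1):
            k = (r[m] - np.dot(a[:m - 1], r[m - 1:0:-1])) / E  # reflection coefficient
            a_prev = a[:m - 1].copy()
            a[:m - 1] = a_prev - k * a_prev[::-1]
            a[m - 1] = k
            E *= 1.0 - k * k
        return a

    # Recover the coefficients of a synthetic AR(2) signal.
    rng = np.random.default_rng(0)
    s = rng.standard_normal(4000)
    for n in range(2, len(s)):
        s[n] += 1.3 * s[n - 1] - 0.8 * s[n - 2]   # color the noise with an AR(2) filter
    print(levinson_durbin(autocorrelation(s, 2), 2))  # approximately [1.3, -0.8]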
27.3 Template-Based Speech Processing
The template-based matching algorithms for speech processing are generally based on the similarity of the vocal tract characteristics embedded in the spectrum of a particular speech sound.
There are two types of speech sounds, namely, voiced and unvoiced sounds. Figure 27.1 shows the speech waveforms, the spectra, and the spectral envelopes of voiced and unvoiced sounds. Voiced sounds such as the vowel /a/ and the nasal sound /n/ are produced by the passage of a quasi-periodic air wave through the vocal tract, which creates resonances in the speech waveforms known as formants. The quasi-periodic air wave is generated by the vibration of the vocal cords. The fundamental frequency of the vibration is known as the pitch. In the case of fricative sounds such as /sh/, the vocal tract is excited by random noise, resulting in speech waveforms exhibiting no periodicity, as can be seen in Fig. 27.1. Therefore, the spectral envelopes of voiced sounds consistently exhibit the pitch as well as three to five formants when the sampling rate is 8 kHz, whereas the spectral envelopes of unvoiced sounds reveal no pitch or formant characteristics. In addition, the formants of different voiced sounds differ with respect to their shape and the location of their center frequencies. This is due to the unique shape of the vocal tract formed to produce a particular sound. Thus, different sounds can be distinguished based on attributes of the spectral envelope.
The cepstral distance given by

    d = \sum_{n=-\infty}^{\infty} \left[ c(n) - c'(n) \right]^2        (27.12)
is one of the metrics for measuring the similarity of two spectral envelopes. The reason is as follows.
From the definition of cepstrum, we have

    \sum_{n=-\infty}^{\infty} \left[ c(n) - c'(n) \right] e^{j\omega n}
        = \log \left| S\left(e^{j\omega}\right) \right| - \log \left| S'\left(e^{j\omega}\right) \right|
        = \log \frac{\left| S\left(e^{j\omega}\right) \right|}{\left| S'\left(e^{j\omega}\right) \right|} .        (27.13)
The Fourier transform of the difference between a pair of cepstra is equal to the difference between the corresponding log spectra. By applying Parseval's theorem, the cepstral distance can be related to the log spectral distance as
    d = \sum_{n=-\infty}^{\infty} \left[ c(n) - c'(n) \right]^2
      = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ \log \left| S\left(e^{j\omega}\right) \right| - \log \left| S'\left(e^{j\omega}\right) \right| \right]^2 d\omega .        (27.14)
The cepstral distance is usually approximated by the distance between the first few lower-order cepstral coefficients, the reason being that the magnitudes of the higher-order cepstral coefficients are small and make a negligible contribution to the cepstral distance.
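A truncated form of Eq. (27.12) is then a one-liner; the truncation order of 12 below is a typical but arbitrary choice, and c(0) is skipped for the gain-sensitivity reason given in Section 27.2:

    import numpy as np

    def cepstral_distance(c, c_prime, order=12):
        """Truncated cepstral distance of Eq. (27.12), ignoring c(0)."""
        d = np.asarray(c[1:order + 1]) - np.asarray(c_prime[1:order + 1])
        return float(np.dot(d, d))

Fed with cepstra such as those produced by the lpc_to_cepstrum sketch in Section 27.2, this gives the template-matching score discussed above.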
27.4 Robust Speech Processing
Robust speech processing attempts to maintain the performance of speaker and speech recognition systems when variations in the operating environment are encountered. This can be accomplished if the similarity in vocal tract structures of the same sound can be recovered under adverse conditions.
Figure 27.2 illustrates how the deterministic channel and random noise contaminate a speech
signal during the recording and transmission of the signal.
First of all, at the front end of the speech acquisition system, additive background noise N_1(\omega) from the speaking environment distorts the speech waveform. Adverse background conditions are also found to put stress on the speech production system and change the characteristics of the vocal tract, which is equivalent to performing a linear filtering of the speech. This problem is addressed in another chapter and will not be discussed here.
FIGURE 27.1: Illustration of voiced/unvoiced speech.
FIGURE 27.2: The speech acquisition system.
After being sampled and quantized, the speech samples corrupted by the background noise N_1(\omega) are then passed through a transmission channel, such as the telephone network, to reach the receiver's site. The transmission channel generally involves two types of degradation sources: a deterministic convolutional filter with transfer function H(\omega), and additive noise, denoted by N_2(\omega) in Fig. 27.2.
The signal observed at the output of the system is, therefore,

    Y(\omega) = H(\omega) \left[ X(\omega) + N_1(\omega) \right] + N_2(\omega) .        (27.15)
The spectrum of the output signal is corrupted by both additive and multiplicative interferences. The multiplicative interference due to the linear channel H(\omega) is sometimes referred to as multiplicative noise.
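In the time domain, Eq. (27.15) corresponds to convolving the noisy speech with the channel impulse response and adding channel noise; the impulse response and noise levels in this sketch are invented for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(8000)             # stand-in for a clean speech signal
    n1 = 0.1 * rng.standard_normal(len(x))    # background noise N_1 at the microphone
    h = np.array([1.0, 0.5, 0.25])            # hypothetical channel impulse response
    n2 = 0.05 * rng.standard_normal(len(x))   # additive channel noise N_2
    # y(n) = h(n) * [x(n) + n1(n)] + n2(n): the time-domain form of Eq. (27.15)
    y = np.convolve(x + n1, h)[:len(x)] + n2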
The various sources of degradation cause distortions of the predictor coefficients and the cepstral coefficients. Figure 27.4 shows the change in the spatial clustering of the cepstral coefficients due to interference from a linear channel, from white noise, and from the composite effect of both; a numerical sketch of the channel-induced translation follows the list below.

• When the speech is filtered by a linear bandpass channel, whose frequency response is shown in Fig. 27.3, a translation of the cepstral clusters is observed, as shown in Fig. 27.4(b).
• When the speech is corrupted by Gaussian white noise at 15 dB SNR, a shrinkage of the cepstral vectors results. This is shown in Fig. 27.4(c), where it can be seen that the cepstral clusters move toward the origin.
• When the speech is degraded by both the linear channel and Gaussian white noise, the cepstral vectors are translated and scaled simultaneously.
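The translation in Fig. 27.4(b) can be reproduced numerically: since convolution becomes addition of log spectra, a fixed channel adds a fixed vector to every cepstral frame. The following sketch (our own, with an invented channel and white-noise "frames" standing in for speech) shows the average cepstral shift matching the channel's own cepstrum:

    import numpy as np

    def real_cepstrum(frame, n_fft=512, n_ceps=12):
        """FFT-based real cepstrum: Eq. (27.3) evaluated on a DFT grid."""
        log_mag = np.log(np.maximum(np.abs(np.fft.rfft(frame, n_fft)), 1e-12))
        return np.fft.irfft(log_mag, n_fft)[:n_ceps]

    rng = np.random.default_rng(1)
    h = np.array([1.0, 0.5, 0.25])                  # hypothetical linear channel
    shifts = []
    for _ in range(20):                             # 20 random "speech frames"
        x = rng.standard_normal(512)
        y = np.convolve(x, h)[:512]                 # channel-filtered frame
        shifts.append(real_cepstrum(y) - real_cepstrum(x))
    print(np.mean(shifts, axis=0))                  # average per-frame cepstral shift
    print(real_cepstrum(h))                         # approximately the same vector

This constant offset is precisely what cepstral mean subtraction, mentioned below, is designed to remove.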
There are three underlying ideas behind the various solutions to robust speech processing. The first is to recover the speech signal from the noisy observation by removing an estimate of the noise from the signal. This is also known as the speech enhancement approach. Methods that operate in the speech sample domain include noise suppression [7] and noise masking [8]. Other speech enhancement methods are carried out in the feature domain, for example, cepstral mean subtraction (CMS) and pole-filtered cepstral mean subtraction (PFCMS). In this category,