Mammone, R.J. & Zhang, X. “Robust Speech Processing as an Inverse Problem”
Digital Signal Processing Handbook
Ed. Vijay K. Madisetti and Douglas B. Williams
Boca Raton: CRC Press LLC, 1999
© 1999 by CRC Press LLC
27
Robust Speech Processing as an
Inverse Problem
Richard J. Mammone
Rutgers University
Xiaoyu Zhang
Rutgers University
27.1 Introduction
27.2 Speech Production and Spectrum-Related Parameterization
27.3 Template-Based Speech Processing
27.4 Robust Speech Processing
27.5 Affine Transform
27.6 Transformation of Predictor Coefficients
Deterministic Convolutional Channel as a Linear Transform • Additive Noise as a Linear Transform
27.7 Affine Transform of Cepstral Coefficients
27.8 Parameters of Affine Transform
27.9 Correspondence of Cepstral Vectors
References
27.1 Introduction
This section addresses the inverse problem in robust speech processing. A problem that speaker and speech recognition systems regularly encounter in commercial applications is the dramatic degradation of performance due to a mismatch between the training and operating environments. The mismatch generally results from the diversity of the operating environments. For applications over the telephone network, the operating environments may vary from offices and laboratories to homes and airports. The problem becomes worse when speech is transmitted over a wireless network. Here the system experiences cross-channel interference in addition to the channel and noise degradations that exist in the regular telephone network. The key issue in robust speech processing is to obtain good performance regardless of the mismatch in environmental conditions. The inverse problem in this sense refers to the process of modeling the mismatch in the form of a transformation and resolving it via an inverse transformation. In this section, we introduce the method of modeling the mismatch as an affine transformation.
Before getting into the details of the inverse problem in robust speech processing, we give a brief review of the mechanism of speech production, as well as the retrieval of useful information from the speech signal for recognition purposes.
27.2 Speech Production and Spectrum-Related Parameterization
The speech signal consists of time-varying acoustic waveforms produced as a result of acoustical
excitation of the vocal tract. It is nonstationary in that the vocal tract configuration changes over
time. A time-varying digital filter is generally used to describe the vocal tract characteristics. The
steady-state system function of the filter is of the form [1, 2]:
    S(z) = \frac{G}{1 - \sum_{i=1}^{p} a_i z^{-i}}
         = \frac{G}{\prod_{i=1}^{p} \left(1 - z_i z^{-1}\right)} ,        (27.1)
where p is the order of the system and the z_i denote the poles of the transfer function. The time-domain representation of this filter is
    s(n) = \sum_{i=1}^{p} a_i\, s(n-i) + G\, u(n) .        (27.2)
The speech sample s(n) is predicted as a linear combination of the previous p samples plus the excitation G u(n), where G is the gain factor. The factor G is generally ignored in recognition tasks to allow for robustness to variations in the energy of speech signals. This speech production model is often referred to as the linear prediction (LP) model, or the autoregressive model, and the coefficients a_i are called the predictor coefficients.
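As a quick illustration (a sketch added here, not part of the original chapter), the LP model of Eq. (27.2) can be simulated directly; the order, coefficients, gain, and white-noise excitation below are arbitrary choices:

    import numpy as np

    # Sketch: synthesize a signal from the LP model of Eq. (27.2),
    # s(n) = sum_{i=1}^{p} a_i s(n-i) + G u(n), with made-up parameters.
    p = 2
    a = np.array([1.3, -0.8])       # example predictor coefficients (stable AR(2))
    G = 0.1                         # gain factor
    N = 1000
    rng = np.random.default_rng(0)
    u = rng.standard_normal(N)      # white-noise excitation (unvoiced-like)

    s = np.zeros(N)
    for n in range(N):
        for i in range(1, p + 1):
            if n - i >= 0:
                s[n] += a[i - 1] * s[n - i]
        s[n] += G * u[n]

A periodic pulse train in place of u would mimic voiced excitation; the coefficients a_i shape the spectral envelope discussed below.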
The cepstrum of the speech signal s(n) is defined as
    c(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log \left| S\left(e^{j\omega}\right) \right| e^{j\omega n} \, d\omega .        (27.3)
It is simply the inverse Fourier transform of the logarithm of the magnitude of the Fourier transform S(e^{j\omega}) of the signal s(n).
From the definition of cepstrum in Eq. (27.3), we have

    \sum_{n=-\infty}^{\infty} c(n) e^{-j\omega n}
        = \log \left| S\left(e^{j\omega}\right) \right|
        = \log \left| \frac{G}{1 - \sum_{n=1}^{p} a_n e^{-j\omega n}} \right| .        (27.4)
If we differentiate both sides of the equation with respect to \omega and equate the coefficients of like powers of e^{j\omega}, the following recursion is obtained:
    c(n) = \begin{cases} \log G & n = 0 \\[4pt] a(n) + \dfrac{1}{n} \sum_{i=1}^{n-1} i\, c(i)\, a(n-i) & n > 0 \end{cases}        (27.5)
The cepstral coefficients can be calculated using this recursion once the predictor coefficients have been determined. The zeroth-order cepstral coefficient is generally ignored in speech and speaker recognition due to its sensitivity to the gain factor, G.
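A minimal sketch of this recursion, assuming the predictor coefficients are already available (the function name and test values are ours, not the chapter's):

    import numpy as np

    def lpc_to_cepstrum(a, G, n_ceps):
        """Cepstral coefficients from predictor coefficients via Eq. (27.5).
        a[i-1] holds a_i; a(n) is taken as zero for n > p."""
        p = len(a)
        c = np.zeros(n_ceps + 1)
        c[0] = np.log(G)                         # c(0) = log G, usually discarded
        for n in range(1, n_ceps + 1):
            a_n = a[n - 1] if n <= p else 0.0
            acc = sum(i * c[i] * a[n - i - 1]    # sum_{i=1}^{n-1} i c(i) a(n-i)
                      for i in range(1, n) if 1 <= n - i <= p)
            c[n] = a_n + acc / n
        return c

    print(lpc_to_cepstrum(np.array([1.3, -0.8]), G=1.0, n_ceps=5))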

An alternative solution for the cepstral coefficients is given by
    c(n) = \frac{1}{n} \sum_{i=1}^{p} z_i^{n} .        (27.6)
It is obtained by equating the terms of like powers of z^{-1} in the following equation:

    \sum_{n=-\infty}^{\infty} c(n) z^{-n}
        = \log \frac{1}{\prod_{i=1}^{p} \left(1 - z_i z^{-1}\right)}
        = -\sum_{i=1}^{p} \log\left(1 - z_i z^{-1}\right) ,        (27.7)
where the logarithm terms can be written as a power series expansion given by

    \log\left(1 - z_i z^{-1}\right) = -\sum_{k=1}^{\infty} \frac{1}{k}\, z_i^{k} z^{-k} .        (27.8)
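As a numerical sanity check (our sketch, with an arbitrary second-order filter), Eq. (27.6) can be evaluated directly from the poles and compared against the recursion of Eq. (27.5):

    import numpy as np

    # Poles of 1 - 1.3 z^{-1} + 0.8 z^{-2}, i.e., roots of z^2 - 1.3 z + 0.8.
    z_poles = np.roots([1.0, -1.3, 0.8])        # complex-conjugate pair, |z| < 1

    # Eq. (27.6): c(n) = (1/n) sum_i z_i^n for n > 0.
    for n in range(1, 6):
        c_n = np.sum(z_poles ** n).real / n     # imaginary parts cancel for real s(n)
        print(n, c_n)

For these coefficients the output begins c(1) = 1.3 and c(2) = 0.045, matching the recursion sketch above.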
There are two standard methods of solving for the predictor coefficients a_i, namely, the autocorrelation method and the covariance method [3, 4, 5, 6]. Both approaches are based on minimizing the mean square value of the estimation error e(n) as given by

    e(n) = s(n) - \sum_{i=1}^{p} a_i\, s(n-i) .        (27.9)
The two methods differ with respect to the details of numerical implementation. The autocorrelation method assumes that the speech samples are zero outside the processing interval of N samples. This results in a nonzero prediction error, e(n), outside the interval. The covariance method fixes the interval over which the prediction error is computed and places no constraints on the sample values outside the interval. The autocorrelation method is computationally simpler than the covariance approach and assures a stable system in which all poles of the transfer function lie within the unit circle. A brief description of the autocorrelation method follows.
The autocorrelation of the signal s(n) is defined as
    r_s(k) = \sum_{n=0}^{N-1-k} s(n)\, s(n+k) = s(n) \otimes s(-n) ,        (27.10)
where N is the number of samples in the sequence s(n) and the sign \otimes denotes the convolution operation. The definition of autocorrelation implies that r_s(k) is an even function. The predictor coefficients a_i can therefore be obtained by solving the following set of equations:

    \begin{bmatrix}
      r_s(0)   & r_s(1)   & \cdots & r_s(p-1) \\
      r_s(1)   & r_s(0)   & \cdots & r_s(p-2) \\
      \vdots   & \vdots   & \ddots & \vdots   \\
      r_s(p-1) & r_s(p-2) & \cdots & r_s(0)
    \end{bmatrix}
    \begin{bmatrix} a_1 \\ \vdots \\ a_p \end{bmatrix}
    =
    \begin{bmatrix} r_s(1) \\ \vdots \\ r_s(p) \end{bmatrix} .
Denoting the p \times p Toeplitz autocorrelation matrix on the left-hand side by R_s, the predictor coefficient vector by a, and the vector of autocorrelation coefficients by r_s, we have

    R_s\, a = r_s .        (27.11)
The predictor coefficient vector a is then given by the inverse relation

    a = R_s^{-1} r_s .

This equation will be used throughout the analysis in the rest of this section. Since the matrix R_s is Toeplitz, a computationally efficient algorithm known as the Levinson-Durbin recursion can be used to solve for a [3].
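The autocorrelation method can be sketched end to end as follows; this is a standard textbook form of the Levinson-Durbin recursion written out for illustration, and the AR(2) test signal is our own contrivance rather than anything from the chapter:

    import numpy as np

    def autocorrelation(s, p):
        """r_s(k) for k = 0..p per Eq. (27.10); samples outside the frame are zero."""
        N = len(s)
        return np.array([np.dot(s[:N - k], s[k:]) for k in range(p + 1)])

    def levinson_durbin(r, p):
        """Solve the Toeplitz system R_s a = r_s of Eq. (27.11) in O(p^2)."""
        a = np.zeros(p)
        E = r[0]                                  # prediction error energy
        for m in range(1, p + 1):
            k = (r[m] - np.dot(a[:m - 1], r[m - 1:0:-1])) / E  # reflection coefficient
            a_prev = a[:m - 1].copy()
            a[:m - 1] = a_prev - k * a_prev[::-1]
            a[m - 1] = k
            E *= 1.0 - k * k
        return a

    # Recover the coefficients of a synthetic AR(2) signal.
    rng = np.random.default_rng(0)
    s = rng.standard_normal(4000)
    for n in range(2, len(s)):
        s[n] += 1.3 * s[n - 1] - 0.8 * s[n - 2]   # color the noise with an AR(2) filter
    print(levinson_durbin(autocorrelation(s, 2), 2))  # approximately [1.3, -0.8]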
27.3 Template-Based Speech Processing
The template-based matching algorithms for speech processing are generally based on the similarity of the vocal tract characteristics embedded in the spectrum of a particular speech sound.
There are two types of speech sounds, namely, voiced and unvoiced sounds. Figure 27.1 shows the speech waveforms, the spectra, and the spectral envelopes of voiced and unvoiced sounds. Voiced sounds such as the vowel /a/ and the nasal sound /n/ are produced by the passage of a quasi-periodic air wave through the vocal tract, which creates resonances in the speech waveforms known as formants. The quasi-periodic air wave is generated by the vibration of the vocal cords. The fundamental frequency of the vibration is known as the pitch. In the case of fricative sounds such as /sh/, the vocal tract is excited by random noise, resulting in speech waveforms exhibiting no periodicity, as can be seen in Fig. 27.1. Therefore, the spectral envelopes of voiced sounds consistently exhibit the pitch as well as three to five formants when the sampling rate is 8 kHz, whereas the spectral envelopes of unvoiced sounds reveal no pitch or formant characteristics. In addition, the formants of different voiced sounds differ with respect to their shape and the location of their center frequencies. This is due to the unique shape of the vocal tract formed to produce a particular sound. Thus, different sounds can be distinguished based on attributes of the spectral envelope.
The cepstral distance given by

    d = \sum_{n=-\infty}^{\infty} \left[ c(n) - c'(n) \right]^2        (27.12)
is one of the metrics for measuring the similarity of two spectral envelopes. The reason is as follows.
From the definition of cepstrum, we have

    \sum_{n=-\infty}^{\infty} \left[ c(n) - c'(n) \right] e^{j\omega n}
        = \log \left| S\left(e^{j\omega}\right) \right| - \log \left| S'\left(e^{j\omega}\right) \right|
        = \log \frac{\left| S\left(e^{j\omega}\right) \right|}{\left| S'\left(e^{j\omega}\right) \right|} .        (27.13)
The Fourier transform of the difference between a pair of cepstra is equal to the difference between the corresponding log spectra. By applying Parseval's theorem, the cepstral distance can be related to the log spectral distance as
    d = \sum_{n=-\infty}^{\infty} \left[ c(n) - c'(n) \right]^2
      = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ \log \left| S\left(e^{j\omega}\right) \right| - \log \left| S'\left(e^{j\omega}\right) \right| \right]^2 d\omega .        (27.14)
The cepstral distance is usually approximated by the distance between the first few lower-order cepstral coefficients, the reason being that the magnitudes of the higher-order cepstral coefficients are small and make a negligible contribution to the cepstral distance.
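A truncated form of Eq. (27.12) is then a one-liner; the truncation order of 12 below is a typical but arbitrary choice, and c(0) is skipped for the gain-sensitivity reason given in Section 27.2:

    import numpy as np

    def cepstral_distance(c, c_prime, order=12):
        """Truncated cepstral distance of Eq. (27.12), ignoring c(0)."""
        d = np.asarray(c[1:order + 1]) - np.asarray(c_prime[1:order + 1])
        return float(np.dot(d, d))

Fed with cepstra such as those produced by the lpc_to_cepstrum sketch in Section 27.2, this gives the template-matching score discussed above.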
27.4 Robust Speech Processing
Robust speech processing attempts to maintain the performance of speaker and speech recognition systems when variations in the operating environment are encountered. This can be accomplished if the similarity in vocal tract structures of the same sound can be recovered under adverse conditions.
Figure 27.2 illustrates how the deterministic channel and random noise contaminate a speech
signal during the recording and transmission of the signal.
First of all, at the front end of the speech acquisition system, additive background noise N_1(\omega) from the speaking environment distorts the speech waveform. Adverse background conditions are also found to put stress on the speech production system and change the characteristics of the vocal tract, which is equivalent to performing a linear filtering of the speech. This problem is addressed in another chapter and will not be discussed here.
FIGURE 27.1: Illustration of voiced/unvoiced speech.
FIGURE 27.2: The speech acquisition system.
After being sampled and quantized, the speech samples corrupted by the background noise N_1(\omega) are then passed through a transmission channel, such as the telephone network, to reach the receiver's site. The transmission channel generally involves two types of degradation sources: a deterministic convolutional filter with transfer function H(\omega), and additive noise, denoted by N_2(\omega) in Fig. 27.2.
The signal observed at the output of the system is, therefore,

    Y(\omega) = H(\omega) \left[ X(\omega) + N_1(\omega) \right] + N_2(\omega) .        (27.15)
The spectrum of the output signal is corrupted by both additive and multiplicative interferences. The multiplicative interference due to the linear channel H(\omega) is sometimes referred to as multiplicative noise.
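In the time domain, Eq. (27.15) corresponds to convolving the noisy speech with the channel impulse response and adding channel noise; the impulse response and noise levels in this sketch are invented for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(8000)             # stand-in for a clean speech signal
    n1 = 0.1 * rng.standard_normal(len(x))    # background noise N_1 at the microphone
    h = np.array([1.0, 0.5, 0.25])            # hypothetical channel impulse response
    n2 = 0.05 * rng.standard_normal(len(x))   # additive channel noise N_2
    # y(n) = h(n) * [x(n) + n1(n)] + n2(n): the time-domain form of Eq. (27.15)
    y = np.convolve(x + n1, h)[:len(x)] + n2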
The various sources of degradation cause distortions of the predictor coefficients and the cepstral coefficients. Figure 27.4 shows the change in the spatial clustering of the cepstral coefficients due to interference from a linear channel, from white noise, and from the composite effect of both; a numerical sketch of the channel-induced translation follows the list below.

• When the speech is filtered by a linear bandpass channel, whose frequency response is shown in Fig. 27.3, a translation of the cepstral clusters is observed, as shown in Fig. 27.4(b).
• When the speech is corrupted by Gaussian white noise at 15 dB SNR, a shrinkage of the cepstral vectors results. This is shown in Fig. 27.4(c), where it can be seen that the cepstral clusters move toward the origin.
• When the speech is degraded by both the linear channel and Gaussian white noise, the cepstral vectors are translated and scaled simultaneously.
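The translation in Fig. 27.4(b) can be reproduced numerically: since convolution becomes addition of log spectra, a fixed channel adds a fixed vector to every cepstral frame. The following sketch (our own, with an invented channel and white-noise "frames" standing in for speech) shows the average cepstral shift matching the channel's own cepstrum:

    import numpy as np

    def real_cepstrum(frame, n_fft=512, n_ceps=12):
        """FFT-based real cepstrum: Eq. (27.3) evaluated on a DFT grid."""
        log_mag = np.log(np.maximum(np.abs(np.fft.rfft(frame, n_fft)), 1e-12))
        return np.fft.irfft(log_mag, n_fft)[:n_ceps]

    rng = np.random.default_rng(1)
    h = np.array([1.0, 0.5, 0.25])                  # hypothetical linear channel
    shifts = []
    for _ in range(20):                             # 20 random "speech frames"
        x = rng.standard_normal(512)
        y = np.convolve(x, h)[:512]                 # channel-filtered frame
        shifts.append(real_cepstrum(y) - real_cepstrum(x))
    print(np.mean(shifts, axis=0))                  # average per-frame cepstral shift
    print(real_cepstrum(h))                         # approximately the same vector

This constant offset is precisely what cepstral mean subtraction, mentioned below, is designed to remove.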
There are three underlying ideas behind the various solutions to robust speech processing. The first is to recover the speech signal from the noisy observation by removing an estimate of the noise from the signal. This is also known as the speech enhancement approach. Methods that operate in the speech sample domain include noise suppression [7] and noise masking [8]. Other speech enhancement methods are carried out in the feature domain, for example, cepstral mean subtraction (CMS) and pole-filtered cepstral mean subtraction (PFCMS). In this category,