A. C. Surendran. “Inverse Problems in Microphone Arrays.”
2000 CRC Press LLC.
Inverse Problems in Microphone Arrays
A. C. Surendran
Bell Laboratories
Lucent Technologies
32.1 Introduction: Dereverberation Using Microphone Arrays
32.2 Simple Delay-and-Sum Beamformers
A Brief Look at Adaptive Arrays • Constrained Adaptive Beamforming Formulated as an Inverse Problem • Multiple Beamforming
32.3 Matched Filtering
32.4 Diophantine Inverse Filtering Using the Multiple Input-Output (MINT) Model
32.5 Results
Speaker Identification
32.6 Summary
References
32.1 Introduction: Dereverberation Using Microphone Arrays
An acoustic enclosure usually reduces the intelligibility of the speech transmitted through it because
the transmission path is not ideal. Apart from the direct signal from the source, the sound is also
reflected off one or more surfaces (usually walls) before reaching the receiver. The resulting signal can
be viewed as the output of a convolution in the time domain of the speech signal and the room impulse
response. This phenomenon affects the quality of the transmitted sound in important applications
such as teleconferencing, cellular telephony, and automatic voice activated systems (speaker and
speech recognizers). Room reverberation can be perceptually separated into two broad classes. Early
room echoes are manifested as irregularities or “ripples” in the amplitude spectrum. This effect
dominates in small rooms, typically offices. Long-term reverberation is typically exhibited as an
echo “tail” following the direct sound [1].
If the transfer function G(z) of the system is known, it might be possible to remove the deleterious
multi-path effects by inverse filtering the output using a filter H(z), where
$$H(z) = \frac{1}{G(z)}. \tag{32.1}$$
Typically, G(z) is the z-transform of the impulse response of the room, g(n). In general, the transfer
function of a reverberant environment is a non-minimum phase function, i.e., not all the zeros of the
function necessarily lie inside |z| = 1. A minimum phase function has a stable causal inverse,
while the stable inverse of a non-minimum phase function is acausal and, in general, infinite in length.
© 1999 by CRC Press LLC
In general, G(z) can be expressed as a product of a minimum-phase function and a maximum-phase
(non-minimum phase) function:
$$G(z) = G_{\min}(z) \cdot G_{\max}(z). \tag{32.2}$$
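The difference between the two factors can be illustrated with a first-order system. The following is a minimal sketch, assuming the toy transfer function G(z) = 1 − a z⁻¹ (not taken from the chapter), whose single zero lies at z = a. The function names and the sample values are hypothetical.

```python
def reverberate(x, a):
    """Apply G(z) = 1 - a*z^-1, i.e. y(n) = x(n) - a*x(n-1)."""
    return [x[n] - a * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]


def apply_causal_inverse(y, a):
    """Undo G(z) with the causal recursion x(n) = y(n) + a*x(n-1)."""
    x, prev = [], 0.0
    for sample in y:
        prev = sample + a * prev
        x.append(prev)
    return x


impulse = [1.0] + [0.0] * 29

# Zero at z = 0.5 (inside |z| = 1): the causal inverse response 0.5**n decays.
h_min = apply_causal_inverse(impulse, 0.5)
# Zero at z = 2.0 (outside |z| = 1): the causal inverse response 2**n grows.
h_max = apply_causal_inverse(impulse, 2.0)

# For the minimum-phase case the causal inverse recovers the input signal.
speech = [0.3, -1.2, 0.7, 0.0, 2.5, -0.4]
recovered = apply_causal_inverse(reverberate(speech, 0.5), 0.5)
```

With the zero inside the unit circle the causal inverse is stable and restores the input; with the zero outside, the same recursion diverges, which is why a maximum-phase factor needs an acausal (or delayed and truncated) inverse.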
Many approaches have been proposed for dereverberating signals. The aim of all the compensation
schemes is to bring the impulse response of the system after dereverberation as close as possible to an
impulse function. Homomorphic filtering techniques were used to estimate the minimum phase part
of G(z) [2, 3]. In [2], the minimum phase component was estimated by zeroing out the cepstrum for
negative quefrencies (negative time indices). Then the output signal was filtered by the inverse of the
minimum phase transfer function. But this technique still did not remove the reverberation contributed
by the maximum-phase part of the room response. In [3], the inverse of the maximum-phase part was
also estimated from the delayed and truncated version of the acausal inverse. But the delay can be
inordinate, and care must be taken to avoid temporal aliasing.
An alternate approach to dereverberation is to calculate, in some form, the least squares estimate
of the inverse of the transmission path, i.e., calculate the least squares solution of the equation
$$h(n) * g(n) = d(n), \tag{32.3}$$
where d(n) is the impulse function and ∗ denotes convolution. Assuming that the system can be
modeled by an FIR filter, Eq. (32.3) can be expressed in matrix form as:
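$$
\begin{bmatrix}
g(0) & 0 & \cdots & 0 \\
g(1) & g(0) & \cdots & 0 \\
\vdots & g(1) & \ddots & \vdots \\
g(m) & \vdots & \ddots & g(0) \\
0 & g(m) & \cdots & g(1) \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & g(m)
\end{bmatrix}
\begin{bmatrix} h(0) \\ h(1) \\ \vdots \\ h(i) \end{bmatrix}
=
\begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix},
\tag{32.4}
$$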
or,
$$GH = D, \tag{32.5}$$
where D is the unit impulse vector, and G, H, and D are matrices of appropriate dimensions as shown in
Eq. (32.4). The least squares method finds an approximate solution given by
$$\hat{H} = (G^T G)^{-1} G^T D. \tag{32.6}$$
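The least squares inverse of Eqs. (32.4) to (32.6) can be sketched numerically as follows. This is an illustration under stated assumptions, not the chapter's implementation: the three-tap response `g_room` is a hypothetical minimum-phase example, the inverse length of 20 taps is arbitrary, and NumPy's `lstsq` is used in place of forming (GᵀG)⁻¹ explicitly.

```python
import numpy as np


def convolution_matrix(g, inverse_len):
    """Build the (len(g) + inverse_len - 1) x inverse_len matrix G of Eq. (32.4)."""
    g = np.asarray(g, dtype=float)
    G = np.zeros((len(g) + inverse_len - 1, inverse_len))
    for col in range(inverse_len):
        G[col:col + len(g), col] = g  # each column is g(n) shifted down one sample
    return G


def least_squares_inverse(g, inverse_len):
    """Return (h_hat, error) where h_hat solves G h = D in the least squares sense."""
    G = convolution_matrix(g, inverse_len)
    D = np.zeros(G.shape[0])
    D[0] = 1.0                        # unit impulse d(n)
    h_hat, *_ = np.linalg.lstsq(G, D, rcond=None)
    return h_hat, D - G @ h_hat


g_room = [1.0, 0.6, 0.2]              # hypothetical minimum-phase "room" response
h_hat, err = least_squares_inverse(g_room, inverse_len=20)
equalized = np.convolve(g_room, h_hat)  # should be close to an impulse
```

For this well-behaved response the equalized output is very close to an impulse. A response with zeros near or outside the unit circle would need a longer inverse or a delayed target impulse to reach a comparable error.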
Thus, the error vector can be written as
$$\epsilon = [D - G\hat{H}] = [I - G(G^T G)^{-1} G^T]\,D = ED,$$
where $E = [I - G(G^T G)^{-1} G^T]$. The mean square error or the energy in the error vector is
$$\|\epsilon\|^2 = \|ED\|^2 \le |E|\,\|D\|^2 \le \frac{\lambda_{\max}}{\lambda_{\min}}\,\|D\|^2, \tag{32.7}$$
where |E| is the norm of E and $\lambda_{\max}$ and $\lambda_{\min}$ are the maximum and minimum eigenvalues of
E. The ratio between the maximum and minimum eigenvalues is called the condition number of a
matrix and it specifies the noise amplification of the inversion process [4].
FIGURE 32.1: Modeling a room with a microphone array as a multiple output FIR system.
Typically, the operation is done on the full-band signal. Sub-band approaches have been proposed
in [5, 7, 8]. All these approaches use a single microphone.
The amplitude spectrum of the room response has “ripples” which produce pronounced notches
in the signal output spectrum. As the location of the microphone in the room changes, the room
response for the same source changes and, as a result, the position of the notches in the amplitude
spectrum varies. This property was used to advantage in [1]. In this method, multiple microphones
were located in the room. Then, the output of each microphone was divided into multiple bands
of equal bandwidth. For each band, by choosing the microphone whose output has the maximum
energy, the ripples were reduced. In [9], the signals from all the microphones in each band were
first co-phased, and then weighted by a gain derived from a normalized cross-correlation function
computed from the outputs of different microphones. Since the reverberation tails are uncorrelated,
the cross-correlation-based gain turned off the tail of the signal. These techniques have had
modest success in combating reverberation.
In recent years, great progress has been made in the quality, availability, and cost of high-performance
microphones. Fast digital signal processors that permit complex algorithms to operate in real
time have been developed. These advances have enabled the use of large microphone arrays that
deploy more sophisticated algorithms for dereverberation. Figure 32.1 shows a generic microphone
array system which can “invert” the room acoustics. Different choices of $H_i(z)$ lead to different
algorithms, each with its own advantages and disadvantages. In this report, we shall discuss single
and multiple beamforming, matched filtering, and Diophantine inverse filtering through multiple
input-output (MINT) modeling. In all cases we assume that the source location and the room
configuration or, alternatively, the $G_i(z)$, are known.
32.2 Simple Delay-and-Sum Beamformers
Arrays that form a single beam directed towards the source of the sound have been designed and
built [11]. In these simple delay-and-sum beamformers, the processing filter has the impulse response
$$h_i(n) = \delta(n - n_i), \tag{32.8}$$
where $n_i = d_i/c$, $d_i$ is the distance of the ith microphone from the source, and c is the speed of
sound in air. Sound propagation in the room can be modeled by a set of successive reflections
off the surfaces (typically the walls) [10]. Figure 32.2 illustrates the impulse response of a single
beamformer. The delay at the output of each microphone coheres the sound that arrives at the
microphone directly from the source. It can be seen from Fig. 32.2 that in the resulting response,
for N microphones that each receive K images of the source (the direct path and K − 1 reflections),
the strength of the coherent pulse is N and there are N(K − 1) distributed pulses. So, ideally, the
signal-to-reverberant noise ratio (measured as the ratio of undistorted signal power to reverberant
noise power) is $N^2/[N(K - 1)]$ [13]. In a highly reverberant room, as the number of images K
increases towards infinity, the SNR improvement, $N/(K - 1)$, falls to zero.
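The pulse-counting argument above can be checked with a small simulation. This is a sketch under stated assumptions: every image is modeled as a unit pulse, the delays are hypothetical integers chosen so that no two reflected pulses coincide after alignment, and the compensating delay is taken as `latest - n_i` so that the filter stays causal (the chapter writes the filter simply as a delay of $n_i$).

```python
def delay_and_sum(direct_delays, image_delays):
    """Sum unit-pulse microphone responses after aligning the direct-path pulses.

    direct_delays[i] is the direct-path delay n_i (in samples) at microphone i;
    image_delays[i] lists the delays of the K - 1 reflected images at microphone i.
    """
    latest = max(direct_delays)
    length = latest + max(max(d) for d in image_delays) + 1
    response = [0.0] * length
    for n_i, images in zip(direct_delays, image_delays):
        shift = latest - n_i              # causal compensating delay (assumption)
        response[n_i + shift] += 1.0      # direct pulse always lands at `latest`
        for n_img in images:
            response[n_img + shift] += 1.0
    return response, latest


# Hypothetical geometry: N = 4 microphones, K = 4 images each (1 direct + 3 reflected).
direct = [10, 12, 15, 11]
images = [[20, 31, 47], [25, 40, 55], [30, 44, 61], [23, 37, 50]]
response, latest = delay_and_sum(direct, images)

signal_power = response[latest] ** 2
reverb_power = sum(v * v for n, v in enumerate(response) if n != latest)
snr = signal_power / reverb_power         # N^2 / [N (K - 1)] = N / (K - 1)
```

Here the coherent pulse has height N = 4 and the twelve reflected pulses stay at unit height, so the ratio equals N/(K − 1) = 4/3. Adding more images per microphone lowers it, in line with the limit described above.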
FIGURE 32.2: A single beamformer. (Source: Flanagan, J.L., Surendran, A.C., and Jan, E.-E.,
Spatially selective sound capture for speech and audio processing, Speech Commun., 13: 207–222,
1993. With kind permission of Elsevier Science - NL, Sara Burgerhartstraat 25, 1055 KV Amsterdam,
The Netherlands).
The single-beamforming system reported in [11] can automatically determine the direction of the
source and rapidly steer the array. But, as the beam is steered away from the broadside, the system
exhibits a reduction in spatial discrimination because the beam pattern broadens [12]. Further,
beamwidth varies with frequency, so an array has an approximate “useful bandwidth” given by the
upper and lower frequencies [12]:
$$f_{\text{upper}} = \frac{c}{d\,|\cos\phi - \cos\phi'|_{\max}}, \tag{32.9}$$
and
$$f_{\text{lower}} = \frac{f_{\text{upper}}}{N}, \tag{32.10}$$
where c is the speed of sound in air, N is the number of sensors in the array, d is the sensor spacing,
$\phi'$ is the steering angle measured with respect to the axis of the array, and $\phi$ is the direction of the
source.
For example, consider an array with seven microphones and a sensor spacing of 6.5 cm. Further,
suppose the desired range of steering is ±30° from broadside. Then, $|\cos\phi - \cos\phi'|_{\max} = 1.5$ and
hence $f_{\text{upper}} \approx 3500$ Hz and $f_{\text{lower}} \approx 500$ Hz. So, to cover the bandwidth of speech, say from 250
Hz to 7 kHz, three harmonically nested arrays of spacing 3.25, 6.5, and 13 cm can be used. Further,
the beamwidth also depends on the frequency of the signal as well as the steering direction. If the
beam is steered to an angle $\phi'$, then the direction of the source for which the beam response falls to
half its power is [12]
$$\phi_{3\text{dB}} = \cos^{-1}\left[\cos\phi' \pm \frac{2.8}{N\omega d}\right], \tag{32.11}$$
where ω = 2πf and f is the frequency of the signal.
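The bandwidth figures in the seven-microphone example can be reproduced from Eqs. 32.9 and 32.10 with a few lines of code. This sketch assumes a speed of sound of 343 m/s (the chapter does not state a value) and assumes the source direction may lie anywhere from 0° to 180°, so that the largest cosine difference occurs at the extreme steering angle. The function name and parameters are hypothetical.

```python
import math


def useful_bandwidth(num_sensors, spacing_m, max_steer_from_broadside_deg, c=343.0):
    """Return (f_lower, f_upper) in Hz following Eqs. (32.9) and (32.10)."""
    # The steering angle is measured from the array axis, so broadside is 90 degrees.
    phi_steer = math.radians(90.0 - max_steer_from_broadside_deg)
    # With the source direction free in [0, 180] degrees, the largest value of
    # |cos(phi) - cos(phi')| is 1 + |cos(phi')| at the steering extreme.
    max_cos_diff = 1.0 + abs(math.cos(phi_steer))
    f_upper = c / (spacing_m * max_cos_diff)
    return f_upper / num_sensors, f_upper


for spacing_cm in (3.25, 6.5, 13.0):
    f_lo, f_hi = useful_bandwidth(7, spacing_cm / 100.0, 30.0)
    print(f"d = {spacing_cm:5.2f} cm: {f_lo:6.0f} Hz to {f_hi:6.0f} Hz")
```

Under these assumptions the 6.5 cm array spans roughly 500 Hz to 3500 Hz, and the three nested spacings together cover approximately 250 Hz to 7 kHz.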
Equation 32.11 shows that the smaller the array, the wider the beam. Since most of the energy of
a typical room interfering noise lies at lower frequencies, it would be advantageous to build arrays
that have higher directivity (smaller beamwidth) at lower frequencies. This, combined with the fact
that the array spacing is larger for lower frequency bands, gives yet another reason to harmonically
nest arrays (see Fig. 32.3).
FIGURE 32.3: Harmonically nested array that covers three frequency ranges.
Just as linear one-dimensional arrays display significant fattening of the beams when steered towards
the axis of the array, two-dimensional arrays exhibit widening of the beams when steered at
angles acute to the plane of the array. Three-dimensional microphone arrays can be constructed [13]
that have essentially a constant beamwidth over 4π steradians. Multiple beamforming using
three-dimensional arrays of sensors not only provides selectivity in azimuth and elevation but also
selectivity in the direction of the beam, i.e., it provides range selectivity.
The performance of single beamformers can degrade severely in the presence of other interfering
noise sources, especially if they fall in the direction of the sidelobes. This problem can be mitigated
using adaptive arrays. Adaptive arrays are briefly discussed in the next section.
32.2.1 A Brief Look at Adaptive Arrays
Adaptive signal processing techniques can be used to form a beam at the desired source while simultaneously
forming a null in the direction of the interfering noise source. Such arrays are called
“adaptive arrays”. Though adaptive arrays are not effective under conditions of severe reverberation,
they are included here because problems in adaptive arrays can be formulated as inverse problems.
Hence, we shall discuss adaptive arrays briefly without providing a quantitative analysis of them.
Broadband arrays have been analyzed in [14, 15, 16, 17, 18, 19]. In all these methods, the direction
of arrival of the signal is assumed to be known.
FIGURE 32.4: General form of an adaptive filter.
Let the array have N sensors and M delay taps per sensor. If
$X(k) = [x_1(k) \ldots x_i(k) \ldots x_{NM}(k)]^T$ (see Fig. 32.4) is the set of signals observed at the
tap points, then $X(k) = S(k) + N(k)$, where S(k) is the contribution of the desired signal at the tap
points and N(k) is the contribution of the unknown interfering noise. The inputs to the sensors,
$x_{(jM+1)}(k)$, $j = 0, \ldots, (N - 1)$, are the noisy versions of g(k), the actual signal at the source.
Now, the filter output is $y(k) = W^T X(k)$, where
$W^T = [w_{11}, \ldots, w_{1M}, w_{21}, \ldots, w_{2M}, \ldots, w_{N1}, \ldots, w_{NM}]$ is the set of weights
at the tap points. The goal of the system is to make the output y(k) as close as possible to the source
g(k). One way of doing this is to minimize the error $E\{(g(k) - y(k))^2\}$. The weight $W^*$ that achieves
this least mean square (LMS) error is also called the Wiener filter, and is given by
$$W^* = R_{XX}^{-1} C_{gX}, \tag{32.12}$$
where $R_{XX}$ is the autocorrelation of X(k) and $C_{gX}$ is the set of cross-correlations between g(k) and
each element of X(k). If g(k) and N(k) are uncorrelated, then
$$C_{gX} = E\{g(k)X(k)\} = E\{g(k)S(k)\} + E\{g(k)N(k)\} = E\{g(k)S(k)\}$$
and, since the cross terms between S(k) and N(k) then vanish,
$$R_{XX} = E\{X(k)X^T(k)\} = E\{(S(k) + N(k))(S(k) + N(k))^T\} = R_{SS} + R_{NN},$$
where $R_{SS}$ and $R_{NN}$ are the autocorrelation matrices for the signal and noise.
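When samples of both the source and the tap-point signals are available, Eq. 32.12 can be evaluated from sample averages. The sketch below is an illustration under assumptions, not the chapter's procedure: the data are synthetic, each tap is assumed to see the source scaled by a hypothetical gain plus independent noise, and the expectations are replaced by averages over 20,000 samples.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 20000

# Hypothetical data: each tap observes the source g(k) scaled by a gain, plus noise.
g = rng.standard_normal(num_samples)
gains = np.array([1.0, 0.8, 0.5, 0.2])
S = np.outer(g, gains)                        # rows are S(k)^T
X = S + 0.5 * rng.standard_normal(S.shape)    # rows are X(k)^T = (S(k) + N(k))^T

R_XX = X.T @ X / num_samples                  # sample estimate of E{X X^T}
C_gX = X.T @ g / num_samples                  # sample estimate of E{g X}
W_star = np.linalg.solve(R_XX, C_gX)          # Eq. (32.12) without an explicit inverse

y = X @ W_star                                # filter output y(k) = W^T X(k)
mse_wiener = np.mean((g - y) ** 2)
mse_first_tap = np.mean((g - X[:, 0]) ** 2)   # baseline: use the first tap unweighted
```

Solving the linear system with `np.linalg.solve` rather than inverting `R_XX` is a numerical convenience. The resulting error is orthogonal to every tap signal, which is the defining property of the Wiener solution.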
Usually $R_{NN}$ is not known. In such cases, the exact inverse cannot be calculated and an iterative
approach to update the weights is needed. In Widrow’s approach [15], a known pilot signal g(k)
is injected into the array. Then, the weights are updated using the Widrow-Hoff algorithm that
increments the weight vector in the direction of the negative gradient of the error:
$$W_{k+1} = W_k + \mu[g(k) - y(k)]X(k),$$
where $W_{k+1}$ is the weight vector after the kth update and µ is the step size. Griffiths’ method also
uses the LMS approach, but minimizes the mean square error based on the autocorrelation and the
cross-correlation values between the input and the output, rather than the signals themselves. Since
the mean square error can be written as
$$E\{(g(k) - y(k))^2\} = R_{gg} - 2C_{gS}^T W + W^T R_{XX} W,$$
where $R_{gg}$ is the autocorrelation of g(k) and $C_{gS}$ is the vector of cross-correlations between
g(k) and each element of S(k), the weight update can also be done by
\begin{align}
W_{k+1} &= W_k + \mu[C_{gS} - R_{XX} W_k] \tag{32.13} \\
        &\approx W_k + \mu[C_{gS} - X(k)X^T(k)W_k] \tag{32.14} \\
        &= W_k + \mu[C_{gS} - y(k)X(k)], \tag{32.15}
\end{align}
where Eq. 32.14 replaces $R_{XX}$ by its instantaneous estimate $X(k)X^T(k)$ and Eq. 32.15 uses
$y(k) = X^T(k)W_k$.
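The pilot-signal update can be sketched on a toy problem. The example below is a minimal illustration under assumptions: a single sensor with two taps, a white Gaussian input, and a noise-free pilot g(k) generated from hypothetical "true" weights, so that the weights the iteration should approach are known in advance. The step size and iteration count are arbitrary choices.

```python
import random


def lms_identify(num_iterations=3000, mu=0.05, seed=1):
    """Run W_{k+1} = W_k + mu * [g(k) - y(k)] * X(k) on a toy two-tap problem."""
    rng = random.Random(seed)
    true_weights = [0.5, -0.3]      # hypothetical weights that generate the pilot
    W = [0.0, 0.0]
    previous = 0.0
    for _ in range(num_iterations):
        current = rng.gauss(0.0, 1.0)
        X = [current, previous]                            # tap-point signals X(k)
        g = sum(t * x for t, x in zip(true_weights, X))    # known pilot signal g(k)
        y = sum(w * x for w, x in zip(W, X))               # output y(k) = W^T X(k)
        error = g - y
        W = [w + mu * error * x for w, x in zip(W, X)]     # Widrow-Hoff update
        previous = current
    return W


weights = lms_identify()
```

Because the pilot here is noise free, the weights settle on the generating values. With interfering noise present they would instead fluctuate around the Wiener solution, with a spread that grows with the step size µ.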
In the above methods, significant distortion is observed in the primary beam due to null-steering.
Constrained LMS techniques which place constraints on the performance of the main lobe can
be used to reduce distortion [18, 19]. By specifying the broad-band response and the array beam
characteristics as constraints, more robust beams can be formed. The problem now can be formulated
as an optimization problem that minimizes the output power of the system. Given that the output
power is
$$E\{y^2(k)\} = E\{W^T X(k) X^T(k) W\} = W^T R_{XX} W = W^T R_{SS} W + W^T R_{NN} W,$$
if W can be chosen such that $W^T R_{NN} W = 0$, the noise can be eliminated. It was proposed [18]
that once the array is steered towards the source with appropriate delays, minimizing the output
power is equivalent to removing directional interference, since in-phase signals add coherently. In
an accurately steered array, the wavefronts arriving from the direction of steering generate identical
signals at each sensor. Hence, the array may be collapsed to a single sensor implementation which
is equivalent to an FIR filter [18], i.e., the columns of the broadband array sum to an FIR filter.
Additional constraints can be placed on this FIR filter. If the weights of the filters can be written as a
matrix:
$$
\hat{W} =
\begin{bmatrix}
w_{11} & w_{12} & \cdots & w_{1M} \\
\vdots & \vdots & \ddots & \vdots \\
w_{N1} & w_{N2} & \cdots & w_{NM}
\end{bmatrix},
$$
then it can be specified that $\sum_{i=1}^{N} w_{ij} = f_j$, $j = 1, \ldots, M$, where $f_j$, $j = 1, \ldots, M$, are the
taps of an FIR filter that provides the desired filter response. Hence, using this method, directional
interference can be suppressed by minimizing the output power and spectral interference can be
suppressed by constraining the columns of the weight coefficients.
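Thus, the problem can be formulated as
$$\text{Minimize: } W^T R_{XX} W \tag{32.16}$$
$$\text{subject to: } C^T W = F, \tag{32.17}$$
where F is the desired FIR filter and
$$
C^T =
\left[
\begin{array}{ccccc|ccccc|c|ccccc}
1 & 0 & 0 & \cdots & 0 & 1 & 0 & 0 & \cdots & 0 & \cdots & 1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 & 0 & 1 & 0 & \cdots & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\
\vdots & & & & \vdots & \vdots & & & & \vdots & & \vdots & & & & \vdots \\
0 & 0 & 0 & \cdots & 1 & 0 & 0 & 0 & \cdots & 1 & \cdots & 0 & 0 & 0 & \cdots & 1
\end{array}
\right].
\tag{32.18}
$$
$C^T$ has M rows with NM entries on each row, so that its product with the NM-element weight vector
W in Eq. 32.17 has M elements. The first row of $C^T$ in Eq. 32.18 has ones in positions
1, (M + 1), ..., (N − 1)M + 1; the second row has ones in positions 2, (M + 2), ..., (N − 1)M + 2,
etc. The constrained problem of Eqs. 32.16 and 32.17 can be solved using Lagrange multipliers [18].
This optimization problem can alternatively be posed as an inverse problem.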
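For an invertible $R_{XX}$, applying Lagrange multipliers to this linearly constrained quadratic problem gives the standard closed form $W = R_{XX}^{-1} C\,(C^T R_{XX}^{-1} C)^{-1} F$. The sketch below evaluates that expression. It is an illustration under assumptions rather than the adaptive algorithm of [18]: the array sizes, the random tap-point data, and the desired taps are hypothetical, and `C` is stored with NM rows and M columns so that `C.T @ W` matches the form of Eq. 32.17.

```python
import numpy as np


def constraint_matrix(num_sensors, num_taps):
    """Return C (NM x M); C^T is N copies of the M x M identity side by side."""
    return np.tile(np.eye(num_taps), (num_sensors, 1))


def constrained_min_power_weights(R_xx, C, F):
    """Minimize W^T R_xx W subject to C^T W = F (closed form, R_xx invertible)."""
    R_inv_C = np.linalg.solve(R_xx, C)                 # R_xx^-1 C
    multipliers = np.linalg.solve(C.T @ R_inv_C, F)    # (C^T R_xx^-1 C)^-1 F
    return R_inv_C @ multipliers


N, M = 3, 4                                    # sensors and taps (illustrative sizes)
rng = np.random.default_rng(0)
snapshots = rng.standard_normal((500, N * M))  # hypothetical tap-point data X(k)^T
R_xx = snapshots.T @ snapshots / len(snapshots)
C = constraint_matrix(N, M)
F = np.array([1.0, 0.5, 0.25, 0.125])          # hypothetical desired FIR taps f_j

W = constrained_min_power_weights(R_xx, C, F)
column_sums = W.reshape(N, M).sum(axis=0)      # sum over sensors of w_ij, one per tap
```

Reshaping `W` into an N × M array recovers the weight matrix of the text, and its column sums reproduce the desired taps, which is exactly the constraint $\sum_i w_{ij} = f_j$.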
32.2.2 Constrained Adaptive Beamforming Formulated as an Inverse Problem
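Using a similar cost function and the same constraint, the system can be formulated as an inverse
problem [19]. The function to be optimized, $W^T R_{XX} W = 0$, can be approximated by $X^T W = 0$.
This, combined with the constraint in Eq. 32.17, is written as:
$$
\begin{bmatrix}
x_1 & \cdots & x_M & \cdots & x_{(N-1)M+1} & \cdots & x_{NM} \\
1 & \cdots & 0 & \cdots & 1 & \cdots & 0 \\
\vdots & \ddots & \vdots & & \vdots & \ddots & \vdots \\
0 & \cdots & 1 & \cdots & 0 & \cdots & 1
\end{bmatrix}
\begin{bmatrix}
w_{11} \\ \vdots \\ w_{1M} \\ \vdots \\ w_{N1} \\ \vdots \\ w_{NM}
\end{bmatrix}
=
\begin{bmatrix}
0 \\ f_1 \\ \vdots \\ f_M
\end{bmatrix},
\tag{32.19}
$$
or, with A and F denoting the augmented matrix and right-hand side of Eq. 32.19,
$$AW = F. \tag{32.20}$$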
This equation can be solved with any technique that can invert a matrix. There are several problems
in solving Eq. 32.20. In general, the equation can be inconsistent. In addition, the system is rank
deficient. Further, traditional methods used to solve Eq. 32.20 are not robust to errors such as round-off
errors in digital computers, measurement inaccuracies, and noise corruption. In the least squares
solution (Eq. 32.6), the noise amplification is dictated by the condition number of the error matrix,
i.e., the ratio of the highest and the lowest eigenvalues of E. In the extreme case when $\lambda_{\min} = 0$, the
system is rank-deficient. In such cases, the pseudo-inverse solution can be used.
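A pseudo-inverse solution of Eq. 32.20 can be sketched as follows. The example is illustrative only: it uses a single hypothetical tap-point snapshot for the first row of A, arbitrary array sizes and FIR taps, and NumPy's `pinv`, which returns the minimum-norm least squares solution. The helper name `augmented_system` is an assumption of this sketch.

```python
import numpy as np


def augmented_system(x_snapshot, num_sensors, num_taps, fir_taps):
    """Stack the row X^T on top of the constraint rows to form A and F."""
    constraints = np.tile(np.eye(num_taps), (1, num_sensors))  # M x NM constraint rows
    A = np.vstack([x_snapshot, constraints])
    F = np.concatenate([[0.0], fir_taps])
    return A, F


N, M = 3, 4                                    # sensors and taps (illustrative sizes)
rng = np.random.default_rng(0)
x = rng.standard_normal(N * M)                 # hypothetical tap-point snapshot X(k)
A, F = augmented_system(x, N, M, np.array([1.0, 0.5, 0.25, 0.125]))

W = np.linalg.pinv(A) @ F                      # minimum-norm least squares solution
residual = np.linalg.norm(A @ W - F)
condition_number = np.linalg.cond(A)           # ratio of extreme singular values
```

In this small case A has more columns than rows and full row rank, so the pseudo-inverse satisfies every row of the system. When the rows are inconsistent it returns the least squares compromise instead, and a large `condition_number` signals that noise in the data will be strongly amplified.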