17 Nonlinear ICA
This chapter deals with independent component analysis (ICA) for nonlinear mixing
models. A fundamental difficulty in the nonlinear ICA problem is that it is highly
nonunique without some extra constraints, which are often realized by using a suitable
regularization. We also address the nonlinear blind source separation (BSS) problem,
which, contrary to the linear case, we treat as distinct from the respective nonlinear ICA
problem. After considering these matters, some methods introduced for solving the
nonlinear ICA or BSS problems are discussed in more detail. Special emphasis is
given to a Bayesian approach that applies ensemble learning to a flexible multilayer
perceptron model for finding the sources and nonlinear mixing mapping that have
most probably given rise to the observed mixed data. The efficiency of this method is
demonstrated using both artificial and real-world data. At the end of the chapter, other
techniques proposed for solving the nonlinear ICA and BSS problems are reviewed.
17.1 NONLINEAR ICA AND BSS
17.1.1 The nonlinear ICA and BSS problems
In many situations, the basic linear ICA or BSS model

x = A s = \sum_{j=1}^{n} s_j a_j \qquad (17.1)

is too simple for describing the observed data x adequately. Hence, it is natural to
consider extension of the linear model to nonlinear mixing models. For instantaneous
mixtures, the nonlinear mixing model has the general form

x = f(s) \qquad (17.2)

where x is the observed m-dimensional data (mixture) vector, f is an unknown
real-valued m-component mixing function, and s is an n-vector whose elements are
the n unknown independent components.
Assume now for simplicity that the number of independent components n equals
the number of mixtures m. The general nonlinear ICA problem then consists of
finding a mapping h: \mathbb{R}^n \to \mathbb{R}^n that gives components

y = h(x) \qquad (17.3)

that are statistically independent. A fundamental characteristic of the nonlinear
ICA problem is that in the general case, solutions always exist, and they are highly
nonunique. One reason for this is that if x and y are two independent random
variables, any of their functions f(x) and g(y) are also independent. An even more
serious problem is that in the nonlinear case, x and y can be mixed and still be
statistically independent, as will be shown below. This is not unlike the case of
gaussian independent components in a linear mixing.
In this chapter, we define BSS in a special way to clarify the distinction between
finding independent components, and finding the original sources. Thus, in the
respective nonlinear BSS problem, one should find the original source signals
s
that have generated the observed data. This is usually a clearly more meaningful
and unique problem than nonlinear ICA defined above, provided that suitable prior
information is available on the sources and/or the mixing mapping. It is worth
emphasizing that if some arbitrary independent components are found for the data
generated by (17.2), they may be quite different from the true source signals. Hence
the situation differs greatly from the basic linear data model (17.1), for which the
ICA or BSS problems have the same solution. Generally, solving the nonlinear BSS
problem is not easy, and requires additional prior information or suitable regularizing
constraints.
An important special case of the general nonlinear mixing model (17.2) consists
of so-called post-nonlinear mixtures. There each mixture has the form

x_i = f_i\left( \sum_{j=1}^{n} a_{ij} s_j \right), \quad i = 1, \ldots, n \qquad (17.4)

Thus the sources s_j, j = 1, \ldots, n, are first mixed linearly according to the basic
ICA/BSS model (17.1), but after that a nonlinear function f_i is applied to them to get
the final observations x_i. It can be shown [418] that for the post-nonlinear mixtures,
the indeterminacies are usually the same as for the basic linear instantaneous mixing
model (17.1). That is, the sources can be separated or the independent components
estimated up to the scaling, permutation, and sign indeterminacies under weak
conditions on the mixing matrix A and source distributions. The post-nonlinearity
assumption is useful and reasonable in many signal processing applications, because
it can be thought of as a model for a nonlinear sensor distortion. In more general
situations, it is a restrictive and somewhat arbitrary constraint. This model will be
treated in more detail below.
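The generative structure of (17.4) is easy to emulate numerically. The following minimal sketch (in Python) generates data from a post-nonlinear model: a linear mixing stage followed by a separate invertible nonlinearity on each channel. The sources, the mixing matrix, and the channel nonlinearities are arbitrary illustrative choices, not taken from any particular study.

```python
import numpy as np

# Post-nonlinear model (17.4): linear mixing, then a per-channel nonlinearity.
rng = np.random.default_rng(0)
n, T = 3, 1000
s = rng.laplace(size=(n, T))                     # independent sources s_j
A = rng.uniform(-1.0, 1.0, size=(n, n))          # linear mixing matrix

f = [np.tanh,                                    # f_1
     lambda u: u + u ** 3,                       # f_2
     lambda u: np.sign(u) * np.abs(u) ** 1.5]    # f_3 (signed power)

linear_part = A @ s                              # sum_j a_ij s_j
x = np.vstack([f[i](linear_part[i]) for i in range(n)])   # x_i = f_i( . )
```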
Another difficulty in the general nonlinear BSS (or ICA) methods proposed thus
far is that they tend to be computationally rather demanding. Moreover, the
computational load usually increases very rapidly with the dimensionality of the
problem, preventing in practice the application of nonlinear BSS methods to
high-dimensional data sets.

The nonlinear BSS and ICA methods presented in the literature could be divided
into two broad classes: generative approaches and signal transformation approaches
[438]. In the generative approaches, the goal is to find a specific model that explains
how the observations were generated. In our case, this amounts to estimating both
the source signals s and the unknown mixing mapping f(\cdot) that have generated the
observed data x through the general mapping (17.2). In the signal transformation
methods, one tries to estimate the sources directly using the inverse transformation
(17.3). In these methods, the number of estimated sources is the same as the number
of observed mixtures [438].
17.1.2 Existence and uniqueness of nonlinear ICA
The question of existence and uniqueness of solutions for nonlinear independent
component analysis has been addressed in [213]. The authors show that there always
exists an infinity of solutions if the space of the nonlinear mixing functions f is
not limited. They also present a method for constructing parameterized families
of nonlinear ICA solutions. A unique solution (up to a rotation) can be obtained
in the two-dimensional special case if the mixing mapping f is constrained to be a
conformal mapping together with some other assumptions; see [213] for details.
In the following, we present in more detail the constructive method introduced in
[213] that always yields at least one solution to the nonlinear ICA problem. This
procedure might be considered as a generalization of the well-known Gram-Schmidt
orthogonalization method. Given m independent variables y = (y_1, \ldots, y_m) and a
variable x, a new variable y_{m+1} = g(y, x) is constructed so that the set
y_1, \ldots, y_{m+1} is mutually independent.
The construction is defined recursively as follows. Assume that we have already m
independent random variables y_1, \ldots, y_m which are jointly uniformly distributed
in [0, 1]^m. Here it is not a restriction to assume that the distributions of the y_i are
uniform, since this follows directly from the recursion, as will be seen below; for a
single variable, uniformity can be attained by the probability integral transformation;
see (2.85). Denote by x any random variable, and by a_1, \ldots, a_m, b some nonrandom
scalars. Define

g(a_1, \ldots, a_m, b;\, p_{y,x}) = P(x \le b \mid y_1 = a_1, \ldots, y_m = a_m)
= \frac{\int_{-\infty}^{b} p_{y,x}(a_1, \ldots, a_m, \xi)\, d\xi}{p_y(a_1, \ldots, a_m)} \qquad (17.5)
where p_y(\cdot) and p_{y,x}(\cdot) are the marginal probability densities of y and (y, x),
respectively (it is assumed here implicitly that such densities exist), and P(\cdot \mid \cdot)
denotes the conditional probability. The p_{y,x} in the argument of g is to remind that
g depends on the joint probability distribution of y and x. For m = 0, g is simply the
cumulative distribution function of x. Now, g as defined above gives a nonlinear
decomposition, as stated in the following theorem.
Theorem 17.1 Assume that y_1, \ldots, y_m are independent scalar random variables
that have a joint uniform distribution in the unit cube [0, 1]^m. Let x be any scalar
random variable. Define g as in (17.5), and set

y_{m+1} = g(y_1, \ldots, y_m, x;\, p_{y,x}) \qquad (17.6)

Then y_{m+1} is independent from the y_1, \ldots, y_m, and the variables y_1, \ldots, y_{m+1} are
jointly uniformly distributed in the unit cube [0, 1]^{m+1}.
The theorem is proved in [213]. The constructive method given above can be used
to decompose n variables x_1, \ldots, x_n into n independent components y_1, \ldots, y_n,
giving a solution for the nonlinear ICA problem.
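To make the construction concrete, the following sketch implements a crude empirical version of (17.5)-(17.6): conditioning on the already-constructed components is approximated by joint binning, and the conditional cumulative distribution function by within-bin ranks. The function name, the binning scheme, and the final correlation check (which tests only decorrelation, not full independence) are illustrative assumptions, not part of [213].

```python
import numpy as np

def empirical_nonlinear_ica(x, n_bins=20):
    """Crude empirical version of the recursion (17.5)-(17.6).

    x : array of shape (T, n) holding T samples of n mixtures.
    Returns y of the same shape, whose columns are approximately
    independent and approximately uniform on [0, 1].
    """
    T, n = x.shape
    y = np.empty_like(x, dtype=float)

    # m = 0: probability integral transform (empirical CDF) of the first mixture.
    y[:, 0] = (np.argsort(np.argsort(x[:, 0])) + 0.5) / T

    for k in range(1, n):
        # Approximate the conditioning "y_1 = a_1, ..., y_m = a_m" by joint
        # binning of the already-constructed components ...
        bins = np.floor(y[:, :k] * n_bins).astype(int).clip(0, n_bins - 1)
        cell = np.ravel_multi_index(tuple(bins.T), (n_bins,) * k)
        out = np.empty(T)
        for c in np.unique(cell):
            idx = np.where(cell == c)[0]
            # ... and the conditional CDF of the next mixture by within-bin ranks.
            ranks = np.argsort(np.argsort(x[idx, k]))
            out[idx] = (ranks + 0.5) / len(idx)
        y[:, k] = out
    return y

# Two dependent nonlinear mixtures of independent sources:
rng = np.random.default_rng(0)
s = rng.uniform(-1, 1, size=(20000, 2))
x = np.column_stack([np.tanh(s[:, 0] + 0.5 * s[:, 1]),
                     (s[:, 1] + 0.3 * s[:, 0]) ** 3])
y = empirical_nonlinear_ica(x)
print(np.corrcoef(y.T))   # close to the identity (decorrelation only)
```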
This construction also clearly shows that the decomposition into independent
components is by no means unique. For example, we could first apply a linear
transformation on x to obtain another random vector x' = Lx, and then compute
y' = g'(x') with g' being defined using the above procedure, where x is replaced by
x'. Thus we obtain another decomposition of x into independent components. The
resulting decomposition y' = g'(Lx) is in general different from y, and cannot be
reduced to y by any simple transformations. A more rigorous justification of the
nonuniqueness property has been given in [213].
Lin [278] has recently derived some interesting theoretical results on ICA that
are useful in describing the nonuniqueness of the general nonlinear ICA problem.
Let the matrices H_s and H_x denote the Hessians of the logarithmic probability
densities \log p_s(s) and \log p_x(x) of the source vector s and mixture (data) vector x,
respectively. Then for the basic linear ICA model (17.1) it holds that

H_s = A^T H_x A \qquad (17.7)

where A is the mixing matrix. If the components of s are truly independent, H_s
should be a diagonal matrix. Due to the symmetry of the Hessian matrices H_s and
H_x, Eq. (17.7) imposes n(n-1)/2 constraints for the elements of the n \times n matrix
A. Thus a constant mixing matrix A can be solved by estimating H_x at two different
points, and assuming some values for the diagonal elements of H_s.
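The relation (17.7) is easy to verify numerically for a linear mixture. The sketch below assumes independent sources with logistic-type densities p_i(s) = 1/(\pi \cosh s) and an arbitrary 2 x 2 mixing matrix chosen only for illustration; the Hessian H_x is obtained by finite differences of \log p_x.

```python
import numpy as np

# Independent sources with log p_i(s) = -log(pi) - log cosh(s), so the
# Hessian of log p_s is diagonal with entries -(1 - tanh^2(s_i)).
def log_ps(s):
    return np.sum(-np.log(np.pi) - np.log(np.cosh(s)))

def hess_log_ps(s):
    return np.diag(-(1.0 - np.tanh(s) ** 2))

# Linear model x = A s: the change of variables gives
# log p_x(x) = log p_s(A^{-1} x) - log |det A|.
def log_px(x, A):
    return log_ps(np.linalg.solve(A, x)) - np.log(abs(np.linalg.det(A)))

def numerical_hessian(f, x, eps=1e-4):
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.eye(n)[i] * eps
            ej = np.eye(n)[j] * eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

A = np.array([[0.7, 0.3],
              [0.3, 0.7]])
s0 = np.array([0.4, -1.2])                 # arbitrary evaluation point
x0 = A @ s0

H_s = hess_log_ps(s0)                      # diagonal, since the sources are independent
H_x = numerical_hessian(lambda x: log_px(x, A), x0)
print(np.round(A.T @ H_x @ A - H_s, 4))    # approximately the zero matrix, as in (17.7)
```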
If the nonlinear mapping (17.2) is twice differentiable, we can approximate it
locally at any point by the linear mixing model (17.1). There A is defined by the
first-order term \partial f(s) / \partial s of the Taylor series expansion of f(s) at the desired point.
But now A generally changes from point to point, so that the constraint conditions
(17.7) still leave n(n-1)/2 degrees of freedom for determining the mixing matrix
A (omitting the diagonal elements). This also shows that the nonlinear ICA problem
is highly nonunique.
Taleb and Jutten have considered separability of nonlinear mixtures in [418, 227].
Their general conclusion is the same as earlier: Separation is impossible without
additional prior knowledge on the model, since the independence assumption alone
is not strong enough in the general nonlinear case.
17.2 SEPARATION OF POST-NONLINEAR MIXTURES
Before discussing approaches applicable to general nonlinear mixtures, let us briefly
consider blind separation methods proposed for the simpler case of post-nonlinear
mixtures (17.4). Especially Taleb and Jutten have developed BSS methods for this
case. Their main results have been presented in [418], and a short overview of their
studies on this problem can be found in [227]. In the following, we present the
main points of their method.

A separation method for the post-nonlinear mixtures (17.4) should generally consist
of two subsequent parts or stages:

1. A nonlinear stage, which should cancel the nonlinear distortions f_i, i = 1, \ldots, n.
This part consists of nonlinear functions g_i(\theta_i, u). The parameters \theta_i of each
nonlinearity g_i are adjusted so that cancellation is achieved (at least roughly).

2. A linear stage that separates the approximately linear mixtures v obtained after
the nonlinear stage. This is done as usual by learning an n \times n separating matrix
B for which the components of the output vector y = Bv of the separating
system are statistically independent (or as independent as possible).
Taleb and Jutten [418] use the mutual information I(y) between the components
y_1, \ldots, y_n of the output vector (see Chapter 10) as the cost function and independence
criterion in both stages. For the linear part, minimization of the mutual information
leads to the familiar Bell-Sejnowski algorithm (see Chapters 10 and 9)

\frac{\partial I(y)}{\partial B} = -E\{\psi\, x^T\} - (B^T)^{-1} \qquad (17.8)

where the components \psi_i of the vector \psi are score functions of the components y_i of
the output vector y:

\psi_i(u) = \frac{d}{du} \log p_i(u) = \frac{p_i'(u)}{p_i(u)} \qquad (17.9)
Here p_i(u) is the probability density function of y_i and p_i'(u) its derivative. In practice,
the natural gradient algorithm is used instead of the Bell-Sejnowski algorithm (17.8);
see Chapter 9.
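As an illustration of the linear stage, the following sketch performs natural-gradient descent on the mutual information with the score functions fixed to -tanh (a common choice for supergaussian outputs) instead of the adaptive score estimation used in [418]; the function name, step size, and toy data are illustrative assumptions.

```python
import numpy as np

def natural_gradient_step(B, v, mu=0.01):
    """One natural-gradient descent step on I(y) for the linear stage y = B v.

    v : (n, T) batch of outputs of the nonlinear stage.
    The score functions are fixed here to psi(u) = -tanh(u) (an assumption,
    suitable for supergaussian outputs), not estimated as in [418].
    """
    n, T = v.shape
    y = B @ v
    psi_y = -np.tanh(y)
    # B <- B + mu * (I + E{psi(y) y^T}) B
    return B + mu * (np.eye(n) + psi_y @ y.T / T) @ B

# Toy usage on a purely linear mixture (the nonlinear stage is omitted):
rng = np.random.default_rng(1)
s = rng.laplace(size=(2, 5000))
A = np.array([[0.7, 0.3], [0.3, 0.7]])
v = A @ s
B = np.eye(2)
for _ in range(3000):
    batch = v[:, rng.integers(0, 5000, size=100)]
    B = natural_gradient_step(B, batch)
print(B @ A)   # should approach a scaled permutation matrix
```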
For the nonlinear stage, one can derive the gradient learning rule [418]

\frac{\partial I(y)}{\partial \theta_k} = -E\left\{ \frac{\partial \log |g_k'(\theta_k, x_k)|}{\partial \theta_k} \right\} - E\left\{ \sum_{i=1}^{n} \psi_i(y_i)\, b_{ik}\, \frac{\partial g_k(\theta_k, x_k)}{\partial \theta_k} \right\}

Here x_k is the kth component of the input vector, b_{ik} is the element ik of the matrix
B, and g_k' is the derivative of the kth nonlinear function g_k. The exact computation
algorithm depends naturally on the specific parametric form of the chosen nonlinear
mapping g_k(\theta_k, x_k). In [418], a multilayer perceptron network is used for modeling
the functions g_k(\theta_k, x_k), k = 1, \ldots, n.
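For concreteness, the sketch below applies the gradient rule above to a deliberately simple one-parameter nonlinearity g_k(\theta_k, u) = u + \theta_k u^3, rather than the multilayer perceptron of [418], and again uses fixed -tanh scores; it is meant only to show how the two expectation terms are assembled, not as a faithful implementation of the published method.

```python
import numpy as np

def nonlinear_stage_step(theta, x, B, mu=0.01):
    """One gradient descent step on I(y) with respect to the parameters theta.

    Assumptions of this sketch: each compensating nonlinearity has a single
    parameter, g_k(theta_k, u) = u + theta_k * u**3 (not the MLP of [418]),
    and the scores psi_i are fixed to -tanh.
    x : (n, T) batch of post-nonlinear observations, theta : (n,) parameters.
    """
    n, T = x.shape
    g = x + theta[:, None] * x ** 3                    # g_k(theta_k, x_k)
    dg_dtheta = x ** 3                                 # d g_k / d theta_k
    g_prime = 1.0 + 3.0 * theta[:, None] * x ** 2      # g_k'(theta_k, x_k)
    dlog_gprime = 3.0 * x ** 2 / g_prime               # d log|g_k'| / d theta_k

    y = B @ g
    psi_y = -np.tanh(y)                                # assumed score functions

    # dI/dtheta_k = -E{ dlog|g_k'|/dtheta_k }
    #               -E{ sum_i psi_i(y_i) b_ik dg_k/dtheta_k }
    grad = (-dlog_gprime.mean(axis=1)
            - np.einsum('it,ik,kt->k', psi_y, B, dg_dtheta) / T)
    return theta - mu * grad                           # descend on I(y)
```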
In linear BSS, it suffices that the score functions (17.9) are of the right type for
achieving separation. However, their appropriate estimation is critical for the good
performance of the proposed nonlinear separation method. The score functions (17.9)
must be estimated adaptively from the output vector y. Several alternative ways to
do this are considered in [418]. An estimation method based on the Gram-Charlier
expansion performs appropriately only for mild post-nonlinear distortions. However,
another method, which estimates the score functions directly, also provides very good
results for hard nonlinearities. Experimental results are presented in [418]. A well-performing
batch-type method for estimating the score functions has been introduced
in a later paper [417].
Before proceeding, we mention that separation of post-nonlinear mixtures also
has been studied in [271, 267, 469] using mainly extensions of the natural gradient
algorithm.
17.3 NONLINEAR BSS USING SELF-ORGANIZING MAPS
One of the earliest ideas for achieving nonlinear BSS (or ICA) is to use Kohonen’s
self-organizing map (SOM) to that end. This method was originally introduced by
Pajunen et al. [345]. The SOM [247, 172] is a well-known mapping and visualization
method that in an unsupervised manner learns a nonlinear mapping from the data to
a usually two-dimensional grid. The learned mapping from often high-dimensional
data space to the grid is such that it tries to preserve the structure of the data as well
as possible. Another goal in the SOM method is to map the data so that it would be
uniformly distributed on the rectangular (or hexagonal) grid. This can be roughly
achieved with suitable choices [345].
If the joint probability density of two random variables is uniformly distributed
inside a rectangle, then clearly the marginal densities along the sides of the rectangle
are statistically independent. This observation gives the justification for applying
self-organizing map to nonlinear BSS or ICA. The SOM mapping provides the
regularization needed in nonlinear BSS, because it tries to preserve the structure
of the data. This implies that the mapping should be as simple as possible while
achieving the desired goals.
Fig. 17.1 Original source signals.

Fig. 17.2 Nonlinear mixtures.
The following experiment [345] illustrates the use of the self-organizing map in
nonlinear blind source separation. There were two subgaussian source signals s_i,
shown in Fig. 17.1, consisting of a sinusoid and uniformly distributed white noise.
Each source vector s was first mixed linearly using the mixing matrix

A = \begin{bmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{bmatrix} \qquad (17.10)

After this, the data vectors x were obtained as post-nonlinear mixtures of the sources
by applying the formula (17.4), where the nonlinearity f_i(t) = t^3 + t, i = 1, 2. These
mixtures x_i are depicted in Fig. 17.2.
Fig. 17.3 Signals separated by SOM.

Fig. 17.4 Converged SOM map.

The sources separated by the SOM method are shown in Fig. 17.3, and the
converged SOM map is illustrated in Fig. 17.4. The estimates of the source signals
in Fig. 17.3 are obtained by mapping each data vector x onto the map of Fig. 17.4,
and reading the coordinates of the mapped data vector. Even though the preceding
experiment was carried out with post-nonlinear mixtures, the use of the SOM method
is not limited to them.
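A rough version of this experiment can be sketched with the third-party minisom package (assumed available; the grid size, training length, and random seeds are illustrative choices, not those of [345]). The grid coordinates of the winning node serve as the source estimates, up to scale, sign, ordering, and discretization.

```python
import numpy as np
from minisom import MiniSom   # third-party package, assumed installed

# Post-nonlinear mixtures as in the experiment: mixing matrix (17.10)
# and f_i(t) = t**3 + t applied to each channel.
rng = np.random.default_rng(0)
T = 2000
s = np.vstack([np.sin(2 * np.pi * 0.017 * np.arange(T)),   # sinusoid
               rng.uniform(-1, 1, T)])                     # uniform noise
A = np.array([[0.7, 0.3], [0.3, 0.7]])
u = A @ s
x = u ** 3 + u

# Fit a two-dimensional SOM to the mixture vectors (illustrative settings).
som = MiniSom(20, 20, 2, sigma=2.0, learning_rate=0.5, random_seed=0)
som.train_random(x.T, 20000)

# Source estimates: the two grid coordinates of each winning node.
y = np.array([som.winner(v) for v in x.T], dtype=float).T
print(np.corrcoef(np.vstack([y, s]))[:2, 2:])   # rough correspondence with s
```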
Generally speaking, there are several difficulties in applying self-organizing maps
to nonlinear blind source separation. If the sources are uniformly distributed, then
it can be heuristically justified that the regularization of the nonlinear separating
mapping provided by the SOM approximately separates the sources. But if the true
sources are not uniformly distributed, the separating mapping providing uniform
densities inevitably causes distortions, which are in general the more serious the
farther the true source densities are from the uniform ones. Of course, the SOM
method still provides an approximate solution to the nonlinear ICA problem, but this
solution may have little to do with the true source signals.
Another difficulty in using SOM for nonlinear BSS or ICA is that computational
complexity increases very rapidly with the number of the sources (dimensionality of
the map), limiting the potential application of this method to small-scale problems.
Furthermore, the mapping provided by the SOM is discrete, where the discretization
is determined by the number of grid points.
17.4 A GENERATIVE TOPOGRAPHIC MAPPING APPROACH TO NONLINEAR BSS *
17.4.1 Background
The self-organizing map discussed briefly in the previous section is a nonlinear
mapping method that is inspired by neurobiological modeling arguments. Bishop,
Svensen and Williams introduced the generative topographic mapping (GTM) method
as a statistically more principled alternative to SOM. Their method is presented in
detail in [49].
In the basic GTM method, mutually similar impulse (delta) functions that are
equispaced on a rectangular grid are used to model the discrete uniform density in the
space of latent variables, or the joint density of the sources in our case. The mapping
from the sources to the observed data, corresponding in our nonlinear BSS problem
to the nonlinear mixing mapping (17.2), is modeled using a mixture-of-gaussians
model. The parameters of the mixture-of-gaussians model, defining the mixing
mapping, are then estimated using a maximum likelihood (ML) method (see Section
4.5) realized by the expectation-maximization (EM) algorithm [48, 172]. After this,
the inverse (separating) mapping from the data to the latent variables (sources) can
be determined.
It is well-known that any continuous smooth enough mapping can be approximated
with arbitrary accuracy using a mixture-of-gaussians model with sufficiently many
gaussian basis functions [172, 48]. Roughly stated, this provides the theoretical
basis of the GTM method. A fundamental difference of the GTM method compared
with SOM is that GTM is based on a generative approach that starts by assuming
a model for the latent variables, in our case the sources. On the other hand, SOM
tries to separate the sources directly by starting from the data and constructing a
suitable separating signal transformation. A key benefit of GTM is its firm theoretical
foundation which helps to overcome some of the limitations of SOM. This also
provides the basis of generalizing the GTM approach to arbitrary source densities.
Using the basic GTM method instead of SOM for nonlinear blind source separation
does not yet bring out any notable improvement, because the densities of the sources
are still assumed to be uniform. However, it is straightforward to generalize the GTM
method to arbitrary known source densities. The advantage of this approach is that
one can directly regularize the inverse of the mixing mapping by using the known
source densities. This modified GTM method is then used for finding a noncomplex
mixing mapping. This approach is described in the following.
17.4.2 The modified GTM method
The modified GTM method introduced in [346] differs from the standard GTM [49]
only in that the required joint density of the latent variables (sources) is defined as
a weighted sum of delta functions instead of plain delta functions. The weighting
coefficients are determined by discretizing the known source densities. Only the main
points of the GTM method are presented here, with emphasis on the modifications
made for applying it to nonlinear blind source separation. Readers wishing to gain a
deeper understanding of the GTM method should look at the original paper [49].
The GTM method closely resembles SOM in that it uses a discrete grid of points
forming a regular array in the m-dimensional latent space. As in SOM, the dimension
of the latent space is usually m = 2. Vectors lying in the latent space are denoted by
s(t); in our application they will be source vectors. The GTM method uses a set of L
fixed nonlinear basis functions \{\varphi_j(s)\}, j = 1, \ldots, L, which form a nonorthogonal
basis set. These basis functions typically consist of a regular array of spherical
gaussian functions, but the basis functions can at least in principle be of other types.

The mapping from the m-dimensional latent space to the n-dimensional data
space, which is in our case the mixing mapping of Eq. (17.2), is in GTM modeled as
a linear combination of basis functions \varphi_j:

x = f(s) = M \varphi(s), \quad \varphi = [\varphi_1, \varphi_2, \ldots, \varphi_L]^T \qquad (17.11)

Here M is an n \times L matrix of weight parameters.
Denote the node locations in the latent space by \xi_i. Eq. (17.11) then defines a
corresponding set of reference vectors

m_i = M \varphi(\xi_i) \qquad (17.12)

in data space. Each of these reference vectors then forms the center of an isotropic
gaussian distribution in data space. Denoting the common variance of these gaussians
by \beta^{-1}, we get

p_x(x \mid i) = \left( \frac{\beta}{2\pi} \right)^{n/2} \exp\left( -\frac{\beta}{2} \| m_i - x \|^2 \right) \qquad (17.13)
The probability density function for the GTM model is obtained by summing over
all of the gaussian components, yielding

p_x(x(t) \mid M) = \sum_{i=1}^{K} P(i)\, p_x(x(t) \mid i) = \sum_{i=1}^{K} \frac{1}{K} \left( \frac{\beta}{2\pi} \right)^{n/2} \exp\left( -\frac{\beta}{2} \| m_i - x(t) \|^2 \right) \qquad (17.14)

Here K is the total number of gaussian components, which is equal to the number of
grid points in latent space, and the prior probabilities P(i) of the gaussian components
are all equal to 1/K.
GTM tries to represent the distribution of the observed data x in the n-dimensional
data space in terms of a smaller m-dimensional nonlinear manifold [49]. The gaussian
distribution in (17.13) represents a noise or error model, which is needed because the
data usually does not lie exactly in such a lower dimensional manifold. It is important
to realize that the K gaussian distributions defined in (17.13) have nothing to do with
the basis functions \varphi_i, i = 1, \ldots, L. Usually it is advisable that the number L of
the basis functions is clearly smaller than the number K of node locations and their
respective noise distributions (17.13). In this way, one can avoid overfitting and
prevent the mixing mapping (17.11) from becoming overly complicated.
The unknown parameters in this model are the weight matrix M and the inverse
variance \beta. These parameters are estimated by fitting the model (17.14) to the
observed data vectors x(1), x(2), \ldots, x(T) using the maximum likelihood method
discussed earlier in Section 4.5. The log likelihood function of the observed data is
given by

L(M) = \sum_{t=1}^{T} \log p_x(x(t) \mid M) = \sum_{t=1}^{T} \log \int p_x(x(t) \mid s, M)\, p_s(s)\, ds \qquad (17.15)

where \beta^{-1} is the variance of x given s and M, and T is the total number of data
vectors x(t).
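For the discrete grid of latent points, the likelihood (17.15) reduces to the finite mixture (17.14). The following sketch evaluates it for fixed parameters, assuming gaussian radial basis functions of a common width; the function and argument names are illustrative, and no EM step is shown.

```python
import numpy as np

def gtm_log_likelihood(X, M, nodes, centres, beta, rbf_width=1.0):
    """Evaluate the GTM log likelihood (17.15) for fixed parameters.

    X       : (T, n) observed data vectors x(t)
    M       : (n, L) weight matrix
    nodes   : (K, m) latent grid node locations
    centres : (L, m) centres of the gaussian basis functions phi_j
    beta    : inverse noise variance
    """
    # Basis matrix Phi with Phi[i, j] = phi_j(node_i)  (gaussian RBFs assumed)
    d2 = ((nodes[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    Phi = np.exp(-d2 / (2.0 * rbf_width ** 2))                  # (K, L)

    ref = Phi @ M.T                                             # reference vectors m_i, (K, n)
    K, n = ref.shape

    # ||m_i - x(t)||^2 for all node/data pairs
    dist2 = ((ref[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # (K, T)
    log_comp = 0.5 * n * np.log(beta / (2 * np.pi)) - 0.5 * beta * dist2

    # log p_x(x(t)) = log( (1/K) * sum_i exp(log_comp[i, t]) ), computed stably
    log_px = -np.log(K) + np.logaddexp.reduce(log_comp, axis=0)
    return log_px.sum()

# Example with a 10 x 10 latent grid (K = 100), 16 basis functions, 3-D data:
rng = np.random.default_rng(0)
g = np.linspace(0.0, 1.0, 10)
nodes = np.array([[a, b] for a in g for b in g])
centres = np.array([[a, b] for a in g[::3] for b in g[::3]])
M = rng.normal(size=(3, centres.shape[0]))
X = rng.normal(size=(200, 3))
print(gtm_log_likelihood(X, M, nodes, centres, beta=4.0))
```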
For applying the modified GTM method, the probability density function p_s(s)
of the source vectors s should be known. Assuming that the sources s_1, s_2, \ldots, s_m
are statistically independent, this joint density can be evaluated as the product of the
marginal densities of the individual sources:

p_s(s) = \prod_{i=1}^{m} p_i(s_i) \qquad (17.16)

Each marginal density is here a discrete density defined at the sampling points
corresponding to the locations of the node vectors.
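A minimal sketch of this weighting, under the assumption that the marginal source densities are available as functions that can be evaluated at the grid nodes (the example marginals below are placeholders):

```python
import numpy as np

def node_prior_weights(nodes, marginal_pdfs):
    """Node priors P(i) for the modified GTM: the standard uniform 1/K
    weights are replaced by the known source density (17.16), discretized
    at the latent grid nodes and renormalized."""
    K, m = nodes.shape
    w = np.ones(K)
    for j in range(m):
        w *= marginal_pdfs[j](nodes[:, j])     # product of marginals, Eq. (17.16)
    return w / w.sum()

# Placeholder marginals: one uniform and one triangular density on [0, 1].
g = np.linspace(0.0, 1.0, 10)
nodes = np.array([[a, b] for a in g for b in g])
pdfs = [lambda u: np.ones_like(u),
        lambda u: 2.0 * (1.0 - np.abs(2.0 * u - 1.0))]
print(node_prior_weights(nodes, pdfs).sum())   # 1.0
```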
The latent space in the GTM method usually has a small dimension, typically
m = 2. The method can be applied in principle for m > 2, but its computational
load then increases quite rapidly just like in the SOM method. For this reason, only