12

ICA by Nonlinear Decorrelation and Nonlinear PCA
This chapter starts by reviewing some of the early research efforts in independent component analysis (ICA), especially the technique based on nonlinear decorrelation that was successfully used by Jutten, Hérault, and Ans to solve the first ICA problems. Today, this work is mainly of historical interest, because there exist several more efficient algorithms for ICA.
Nonlinear decorrelation can be seen as an extension of second-order methods
such as whitening and principal component analysis (PCA). These methods give
components that are uncorrelated linear combinations of input variables, as explained
in Chapter 6. We will show that independent components can in some cases be found
as nonlinearly uncorrelated linear combinations. The nonlinear functions used in
this approach introduce higher order statistics into the solution method, making ICA
possible.
We then show how the work on nonlinear decorrelation eventually led to the Cichocki-Unbehauen algorithm, which is essentially the same as the algorithm that we derived in Chapter 9 using the natural gradient. Next, the criterion of nonlinear decorrelation is extended and formalized to the theory of estimating functions, and the closely related EASI algorithm is reviewed.
Another approach to ICA that is related to PCA is the so-called nonlinear PCA.
A nonlinear representation is sought for the input data that minimizes a least mean-
square error criterion. For the linear case, it was shown in Chapter 6 that principal
components are obtained. It turns out that in some cases the nonlinear PCA approach
gives independent components instead. We review the nonlinear PCA criterion and
show its equivalence to other criteria like maximum likelihood (ML). Then, two
typical learning rules introduced by the authors are reviewed, of which the first one
is a stochastic gradient algorithm and the other one a recursive least mean-square algorithm.

[Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja. Copyright 2001 John Wiley & Sons, Inc. ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)]
12.1 NONLINEAR CORRELATIONS AND INDEPENDENCE
The correlation between two random variables $y_1$ and $y_2$ was discussed in detail in Chapter 2. Here we consider zero-mean variables only, so correlation and covariance are equal. Correlation is related to independence in such a way that independent variables are always uncorrelated. The opposite is not true, however: the variables can be uncorrelated, yet dependent. An example is a uniform density in a rotated square centered at the origin of the $(y_1, y_2)$ space; see e.g. Fig. 8.3. Both $y_1$ and $y_2$ are zero mean and uncorrelated, no matter what the orientation of the square, but they are independent only if the square is aligned with the coordinate axes. In some cases uncorrelatedness does imply independence, though; the best example is the case when the density of $(y_1, y_2)$ is constrained to be jointly gaussian.
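This counterexample is easy to verify numerically. The sketch below (an illustration added here, not part of the original text; all names and parameter values are arbitrary choices) samples a uniform density on a square rotated by 45 degrees: the linear correlation is essentially zero, while a simple nonlinear correlation of the squared variables reveals the dependence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform density on a square rotated by 45 degrees: start from
# independent u, v ~ Uniform(-1, 1) (an axis-aligned square) and
# rotate the coordinates.
u = rng.uniform(-1.0, 1.0, 200_000)
v = rng.uniform(-1.0, 1.0, 200_000)
y1 = (u + v) / np.sqrt(2)
y2 = (u - v) / np.sqrt(2)

# Linear correlation E{y1 y2}: vanishes for any orientation of the square.
lin_corr = np.mean(y1 * y2)

# A nonlinear correlation of the squares: for independent variables,
# E{y1^2 y2^2} would factorize into E{y1^2} E{y2^2}.
nonlin_gap = np.mean(y1**2 * y2**2) - np.mean(y1**2) * np.mean(y2**2)

print(lin_corr)    # close to 0: uncorrelated
print(nonlin_gap)  # clearly nonzero: dependent
```

For this density the gap can be computed exactly as $-1/15 \approx -0.067$, so the dependence is easily detected even from moderate samples.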
Extending the concept of correlation, we here define the nonlinear correlation of the random variables $y_1$ and $y_2$ as $\mathrm{E}\{f(y_1)g(y_2)\}$. Here, $f(y_1)$ and $g(y_2)$ are two functions, of which at least one is nonlinear. Typical examples might be polynomials of degree higher than 1, or more complex functions like the hyperbolic tangent. This means that one or both of the random variables are first transformed nonlinearly to new variables $f(y_1), g(y_2)$, and then the usual linear correlation between these new variables is considered.
The question now is: Assuming that $y_1$ and $y_2$ are nonlinearly decorrelated in the sense

$$\mathrm{E}\{f(y_1)g(y_2)\} = 0 \qquad (12.1)$$

can we say something about their independence? We would hope that by making this kind of nonlinear correlation zero, independence would be obtained under some additional conditions to be specified.
There is a general theorem (see, e.g., [129]) stating that $y_1$ and $y_2$ are independent if and only if

$$\mathrm{E}\{f(y_1)g(y_2)\} = \mathrm{E}\{f(y_1)\}\,\mathrm{E}\{g(y_2)\} \qquad (12.2)$$

for all continuous functions $f$ and $g$ that are zero outside a finite interval. Based on this, it seems very difficult to approach independence rigorously, because the functions $f$ and $g$ are almost arbitrary. Some kind of approximations are needed.
This problem was considered by Jutten and Hérault [228]. Let us assume that $f(y_1)$ and $g(y_2)$ are smooth functions that have derivatives of all orders in a neighborhood of the origin. They can be expanded in Taylor series:

$$f(y_1) = f(0) + f'(0)\,y_1 + \tfrac{1}{2}f''(0)\,y_1^2 + \ldots = \sum_{i=0}^{\infty} f_i\, y_1^i$$

$$g(y_2) = g(0) + g'(0)\,y_2 + \tfrac{1}{2}g''(0)\,y_2^2 + \ldots = \sum_{i=0}^{\infty} g_i\, y_2^i$$

where $f_i, g_i$ are shorthand for the coefficients of the $i$th powers in the series.
The product of the functions is then

$$f(y_1)g(y_2) = \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} f_i g_j\, y_1^i y_2^j \qquad (12.3)$$

where we assume, for simplicity, that the constant terms $f_0$ and $g_0$ are zero, so that the sums start from $i, j = 1$,
and condition (12.1) is equivalent to

$$\mathrm{E}\{f(y_1)g(y_2)\} = \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} f_i g_j\, \mathrm{E}\{y_1^i y_2^j\} = 0 \qquad (12.4)$$
Obviously, a sufficient condition for this equation to hold is

$$\mathrm{E}\{y_1^i y_2^j\} = 0 \qquad (12.5)$$

for all indices $i, j$ appearing in the series expansion (12.4). There may be other solutions in which the higher-order correlations are not zero, but the coefficients $f_i g_j$ happen to be just suitable to cancel the terms and make the sum in (12.4) exactly equal to zero. For nonpolynomial functions that have infinite Taylor expansions, such spurious solutions can be considered unlikely (we will see later that such spurious solutions do exist, but they can be avoided by the theory of ML estimation).
Again, a sufficient condition for (12.5) to hold is that the variables $y_1$ and $y_2$ are independent and one of $\mathrm{E}\{y_1^i\}$, $\mathrm{E}\{y_2^j\}$ is zero. Let us require that $\mathrm{E}\{y_1^i\} = 0$ for all powers $i$ appearing in its series expansion. But this is only possible if $f(y_1)$ is an odd function; then the Taylor series contains only odd powers $1, 3, 5, \ldots$, and the powers $i$ in Eq. (12.5) will also be odd. Otherwise, we have the case that even moments of $y_1$, like the variance, are zero, which is impossible unless $y_1$ is constant.
To conclude, a sufficient (but not necessary) condition for the nonlinear uncorrelatedness (12.1) to hold is that $y_1$ and $y_2$ are independent, and for one of them, say $y_1$, the nonlinearity is an odd function such that $f(y_1)$ has zero mean.
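As a quick numerical sanity check of this sufficiency condition (an added sketch, not part of the original text; the particular densities and functions are arbitrary), take two independent zero-mean variables with symmetric densities and an odd nonlinearity: the nonlinear correlation then vanishes up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two independent zero-mean variables with symmetric densities.
y1 = rng.standard_normal(200_000)
y2 = rng.uniform(-1.0, 1.0, 200_000)

# f is odd, so E{f(y1)} = 0 by symmetry; g is another odd function.
f = np.tanh
g = lambda y: y**3

# Under independence, E{f(y1) g(y2)} = E{f(y1)} E{g(y2)} = 0.
nonlin_corr = np.mean(f(y1) * g(y2))
print(nonlin_corr)  # close to 0
```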
The preceding discussion is informal but should make it credible that nonlinear correlations are useful as a possible general criterion for independence. Several things have to be decided in practice: the first one is how to actually choose the functions $f$ and $g$. Is there some natural optimality criterion that can tell us that some functions
Fig. 12.1 The basic feedback circuit for the Hérault-Jutten algorithm. The element marked with $+$ is a summation. (The circuit has inputs $x_1, x_2$, outputs $y_1, y_2$, and cross-feedback weights $-m_{12}, -m_{21}$.)
are better than some other ones? This will be answered in Sections 12.3 and 12.4.
The second problem is how we could solve Eq. (12.1), or nonlinearly decorrelate two variables $y_1, y_2$. This is the topic of the next section.
12.2 THE HÉRAULT-JUTTEN ALGORITHM
Consider the ICA model $\mathbf{x} = \mathbf{A}\mathbf{s}$. Let us first look at a $2 \times 2$ case, which was considered by Hérault, Jutten, and Ans [178, 179, 226] in connection with the blind separation of two signals from two linear mixtures. The model is then

$$x_1 = a_{11}s_1 + a_{12}s_2$$
$$x_2 = a_{21}s_1 + a_{22}s_2$$

Hérault and Jutten proposed the feedback circuit shown in Fig. 12.1 to solve the problem. The initial outputs are fed back to the system, and the outputs are recomputed until an equilibrium is reached.
From Fig. 12.1 we have directly

$$y_1 = x_1 - m_{12}y_2 \qquad (12.6)$$
$$y_2 = x_2 - m_{21}y_1 \qquad (12.7)$$

Before inputting the mixture signals $x_1, x_2$ to the network, they were normalized to zero mean, which means that the outputs $y_1, y_2$ also will have zero means. Defining a matrix $\mathbf{M}$ with off-diagonal elements $m_{12}, m_{21}$ and diagonal elements equal to zero, these equations can be compactly written as $\mathbf{y} = \mathbf{x} - \mathbf{M}\mathbf{y}$. Thus the input-output mapping of the network is

$$\mathbf{y} = (\mathbf{I} + \mathbf{M})^{-1}\mathbf{x} \qquad (12.8)$$
Note that from the original ICA model we have $\mathbf{s} = \mathbf{A}^{-1}\mathbf{x}$, provided that $\mathbf{A}$ is invertible. If $\mathbf{I} + \mathbf{M} = \mathbf{A}$, then $\mathbf{y}$ becomes equal to $\mathbf{s}$. However, the problem in blind separation is that the matrix $\mathbf{A}$ is unknown.
The solution that Jutten and Hérault introduced was to adapt the two feedback coefficients $m_{12}, m_{21}$ so that the outputs of the network $y_1, y_2$ become independent. Then the matrix $\mathbf{A}$ has been implicitly inverted and the original sources have been found. For independence, they used the criterion of nonlinear correlations. They proposed the following learning rules:

$$\Delta m_{12} = \mu f(y_1)g(y_2) \qquad (12.9)$$
$$\Delta m_{21} = \mu f(y_2)g(y_1) \qquad (12.10)$$

with $\mu$ the learning rate. Both functions $f(\cdot), g(\cdot)$ are odd functions; typically, the functions

$$f(y) = y^3, \quad g(y) = \arctan(y)$$

were used, although the method also seems to work for $g(y) = y$ or $g(y) = \mathrm{sign}(y)$.
Now, if the learning converges, then the right-hand sides must be zero on average, implying

$$\mathrm{E}\{f(y_1)g(y_2)\} = \mathrm{E}\{f(y_2)g(y_1)\} = 0$$

Thus independence has hopefully been attained for the outputs $y_1, y_2$. A stability analysis for the Hérault-Jutten algorithm was presented in [408].
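The learning rules (12.9)-(12.10) are simple enough to sketch in a few lines. The toy simulation below is an added illustration, not from the original text: the mixing matrix, learning rate, and source densities are arbitrary choices, and it uses $f(y) = y^3$ with the simpler $g(y) = y$ mentioned above. For the $2 \times 2$ case, the feedback equilibrium (12.6)-(12.7) can be solved in closed form at each step.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two independent zero-mean sub-gaussian sources and a 2x2 mixture.
T = 20_000
S = rng.uniform(-1.0, 1.0, (2, T))        # sources s1, s2
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                # unknown mixing matrix
X = A @ S                                 # observed mixtures x1, x2

f = lambda y: y**3                        # odd nonlinearity
g = lambda y: y                           # the method also works for g(y) = y
mu = 0.01                                 # learning rate
m12 = m21 = 0.0

for t in range(T):
    x1, x2 = X[:, t]
    # Feedback equilibrium (12.6)-(12.7), solved in closed form for 2x2:
    # y1 = (x1 - m12*x2) / (1 - m12*m21), and symmetrically for y2.
    d = 1.0 - m12 * m21
    y1 = (x1 - m12 * x2) / d
    y2 = (x2 - m21 * x1) / d
    # Herault-Jutten updates (12.9)-(12.10).
    m12 += mu * f(y1) * g(y2)
    m21 += mu * f(y2) * g(y1)

print(m12, m21)  # ideally near a12 = 0.6 and a21 = 0.4, so I + M = A
```

As the text notes, convergence is not guaranteed in general; with badly scaled signals or an ill-conditioned mixing matrix, this adaptation can behave poorly.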
In the numerical computation of the matrix $\mathbf{M}$ according to algorithm (12.9), (12.10), the outputs $y_1, y_2$ on the right-hand side must also be updated at each step of the iteration. By Eq. (12.8), they too depend on $\mathbf{M}$, and solving them requires the inversion of the matrix $\mathbf{I} + \mathbf{M}$. As noted by Cichocki and Unbehauen [84], this matrix inversion may be computationally heavy, especially if this approach is extended to more than two sources and mixtures. One way to circumvent this problem is to make a rough approximation

$$\mathbf{y} = (\mathbf{I} + \mathbf{M})^{-1}\mathbf{x} \approx (\mathbf{I} - \mathbf{M})\mathbf{x}$$

that seems to work in practice.
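The exact equilibrium (12.8) and the rough approximation can be compared directly. In the sketch below (an added illustration; the numbers in M are arbitrary small feedback weights), the first-order approximation is accurate because $(\mathbf{I}+\mathbf{M})^{-1} = \mathbf{I} - \mathbf{M} + \mathbf{M}^2 - \ldots$, so the error is of second order in $\mathbf{M}$.

```python
import numpy as np

# Feedback matrix with zero diagonal and small off-diagonal weights.
M = np.array([[0.0, 0.1],
              [0.2, 0.0]])
I = np.eye(2)
x = np.array([1.0, -0.5])

y_exact = np.linalg.solve(I + M, x)   # y = (I + M)^{-1} x, Eq. (12.8)
y_approx = (I - M) @ x                # rough approximation (I - M) x

# The discrepancy is of order ||M||^2.
err = np.linalg.norm(y_exact - y_approx)
print(err)
```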
Although the Hérault-Jutten algorithm was a very elegant pioneering solution to
the ICA problem, we know now that it has some drawbacks in practice. The algorithm
may work poorly or even fail to separate the sources altogether if the signals are badly
scaled or the mixing matrix is ill-conditioned. The number of sources that the method
can separate is severely limited. Also, although the local stability was shown in [408],
good global convergence behavior is not guaranteed.
12.3 THE CICHOCKI-UNBEHAUEN ALGORITHM

Starting from the Hérault-Jutten algorithm, Cichocki, Unbehauen, and coworkers [82, 85, 84] derived an extension that has much enhanced performance and reliability. Instead of a feedback circuit like the Hérault-Jutten network in Fig. 12.1, Cichocki
and Unbehauen proposed a feedforward network with weight matrix $\mathbf{B}$, with the mixture vector $\mathbf{x}$ for input and with output $\mathbf{y} = \mathbf{B}\mathbf{x}$. Now the dimensionality of the problem can be higher than 2. The goal is to adapt the $m \times m$ matrix $\mathbf{B}$ so that the elements of $\mathbf{y}$ become independent. The learning algorithm for $\mathbf{B}$ is as follows:

$$\Delta\mathbf{B} = \mu[\mathbf{\Lambda} - \mathbf{f}(\mathbf{y})\mathbf{g}(\mathbf{y}^T)]\mathbf{B} \qquad (12.11)$$

where $\mu$ is the learning rate, $\mathbf{\Lambda}$ is a diagonal matrix whose elements determine the amplitude scaling for the elements of $\mathbf{y}$ (typically, $\mathbf{\Lambda}$ could be chosen as the unit matrix $\mathbf{I}$), and $f$ and $g$ are two nonlinear scalar functions; the authors proposed a polynomial and a hyperbolic tangent. The notation $\mathbf{f}(\mathbf{y})$ means a column vector with elements $f(y_1), \ldots, f(y_n)$.
The argumentation showing that this algorithm, too, will give independent components is based on nonlinear decorrelations. Consider the stationary solution of this learning rule, defined as the matrix for which $\mathrm{E}\{\Delta\mathbf{B}\} = \mathbf{0}$, with the expectation taken over the density of the mixtures $\mathbf{x}$. For this matrix, the update is on average zero. Because this is a stochastic-approximation-type algorithm (see Chapter 3), such stationarity is a necessary condition for convergence. Excluding the trivial solution $\mathbf{B} = \mathbf{0}$, we must have

$$\mathrm{E}\{\mathbf{\Lambda} - \mathbf{f}(\mathbf{y})\mathbf{g}(\mathbf{y}^T)\} = \mathbf{0}$$
In particular, for the off-diagonal elements this implies

$$\mathrm{E}\{f(y_i)g(y_j)\} = 0 \qquad (12.12)$$

which is exactly our definition of nonlinear decorrelation in Eq. (12.1), extended to $n$ output signals $y_1, \ldots, y_n$. The diagonal elements satisfy

$$\mathrm{E}\{f(y_i)g(y_i)\} = \lambda_{ii}$$

showing that the diagonal elements $\lambda_{ii}$ of matrix $\mathbf{\Lambda}$ only control the amplitude scaling of the outputs.
The conclusion is that if the learning rule converges to a nonzero matrix $\mathbf{B}$, then the outputs of the network must become nonlinearly decorrelated, and hopefully independent. The convergence analysis has been performed in [84]; for general principles of analyzing stochastic iteration algorithms like (12.11), see Chapter 3.

The justification for the Cichocki-Unbehauen algorithm (12.11) in the original articles was based on nonlinear decorrelations, not on any rigorous cost function that would be minimized by the algorithm. However, it is interesting to note that this algorithm, first appearing in the early 1990s, is in fact the same as the popular natural gradient algorithm introduced later by Amari, Cichocki, and Yang [12] as an extension of the original Bell-Sejnowski algorithm [36]. All we have to do is choose $\mathbf{\Lambda}$ as the unit matrix, the function $\mathbf{g}(\mathbf{y})$ as the linear function $\mathbf{g}(\mathbf{y}) = \mathbf{y}$, and the function $\mathbf{f}(\mathbf{y})$ as a sigmoid related to the true density of the sources. The Amari-Cichocki-Yang algorithm and the Bell-Sejnowski algorithm were reviewed in Chapter 9, where it was shown how the algorithms are derived from the rigorous maximum likelihood criterion. The maximum likelihood approach also tells us what kind of nonlinearities should be used, as discussed in Chapter 9.
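A minimal sketch of the update (12.11) follows (an added illustration, not from the original text; the sources, mixing matrix, and step size are arbitrary choices). It takes $\mathbf{\Lambda} = \mathbf{I}$ and $\mathbf{g}(\mathbf{y}) = \mathbf{y}$, which, as noted above, makes the rule coincide with the natural gradient algorithm; $f = \tanh$ is a common choice for super-gaussian sources.

```python
import numpy as np

rng = np.random.default_rng(3)

n, T = 3, 5000
S = rng.laplace(size=(n, T))                 # super-gaussian sources
A = np.array([[1.0, 0.3, 0.2],               # well-conditioned mixing matrix
              [0.2, 1.0, 0.3],
              [0.3, 0.2, 1.0]])
X = A @ S

f = np.tanh                                  # nonlinearity for super-gaussian sources
mu = 0.01
Lam = np.eye(n)                              # Lambda chosen as the unit matrix
B = np.eye(n)

for t in range(T):
    y = B @ X[:, t:t+1]                      # column vector y = Bx
    # Cichocki-Unbehauen update: Delta B = mu [Lambda - f(y) g(y)^T] B, with g(y) = y.
    B += mu * (Lam - f(y) @ y.T) @ B

print(B)  # after adaptation, E{f(y) y^T} should be roughly diagonal
```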
12.4 THE ESTIMATING FUNCTIONS APPROACH *
Consider the criterion of nonlinear decorrelations being zero, generalized to $n$ random variables $y_1, \ldots, y_n$, as shown in Eq. (12.12). Among the possible roots $y_1, \ldots, y_n$ of these equations are the source signals $s_1, \ldots, s_n$. When solving these in an algorithm like the Hérault-Jutten algorithm or the Cichocki-Unbehauen algorithm, one in fact solves for the separating matrix $\mathbf{B}$.
This notion was generalized and formalized by Amari and Cardoso [8] to the case of estimating functions. Again, consider the basic ICA model $\mathbf{x} = \mathbf{A}\mathbf{s}$, $\mathbf{s} = \mathbf{B}_*\mathbf{x}$, where $\mathbf{B}_*$ is a true separating matrix (we use this special notation here to avoid any confusion). An estimating function is a matrix-valued function $\mathbf{F}(\mathbf{x}, \mathbf{B})$ such that

$$\mathrm{E}\{\mathbf{F}(\mathbf{x}, \mathbf{B}_*)\} = \mathbf{0} \qquad (12.13)$$

This means that, taking the expectation with respect to the density of $\mathbf{x}$, the true separating matrices are roots of the equation. Once these are solved from Eq. (12.13), the independent components are directly obtained.
Example 12.1 Given a set of nonlinear functions $f_1(y_1), \ldots, f_n(y_n)$, with $\mathbf{y} = \mathbf{B}\mathbf{x}$, and defining a vector function $\mathbf{f}(\mathbf{y}) = [f_1(y_1), \ldots, f_n(y_n)]^T$, a suitable estimating function for ICA is

$$\mathbf{F}(\mathbf{x}, \mathbf{B}) = \mathbf{f}(\mathbf{y})\mathbf{y}^T - \mathbf{\Lambda} = \mathbf{f}(\mathbf{B}\mathbf{x})(\mathbf{B}\mathbf{x})^T - \mathbf{\Lambda} \qquad (12.14)$$

because obviously $\mathrm{E}\{\mathbf{f}(\mathbf{y})\mathbf{y}^T\}$ becomes diagonal when $\mathbf{B}$ is a true separating matrix $\mathbf{B}_*$ and $y_1, \ldots, y_n$ are independent and zero mean. Then the off-diagonal elements become $\mathrm{E}\{f_i(y_i)y_j\} = \mathrm{E}\{f_i(y_i)\}\,\mathrm{E}\{y_j\} = 0$. The diagonal matrix $\mathbf{\Lambda}$ determines the scales of the separated sources. Another estimating function is the right-hand side of the learning rule (12.11),

$$\mathbf{F}(\mathbf{x}, \mathbf{B}) = [\mathbf{\Lambda} - \mathbf{f}(\mathbf{y})\mathbf{g}(\mathbf{y}^T)]\mathbf{B}$$
There is a fundamental difference between the estimating function approach and most of the other approaches to ICA: the usual starting point in ICA is a cost function that somehow measures how independent or nongaussian the outputs $y_i$ are, and the independent components are solved by minimizing the cost function. In contrast, there is no such cost function here. The estimating function need not be the gradient of any other function. In this sense, the theory of estimating functions is very general and potentially useful for finding ICA algorithms. For a discussion of this approach in connection with neural networks, see [328].
It is not a trivial question how to design in practice an estimating function so that we can solve the ICA model. Even if we have two estimating functions that both have been shaped in such a way that separating matrices are their roots, what is a relevant measure to compare them? Statistical considerations are helpful here. Note that in practice, the densities of the sources $s_i$ and the mixtures $x_j$ are unknown in the ICA model. It is impossible in practice to solve Eq. (12.13) as such, because the expectation cannot be formed. Instead, it has to be estimated using a finite sample of $\mathbf{x}$. Denoting this sample by $\mathbf{x}(1), \ldots, \mathbf{x}(T)$, we use the sample function

$$\mathrm{E}\{\mathbf{F}(\mathbf{x}, \mathbf{B})\} \approx \frac{1}{T} \sum_{t=1}^{T} \mathbf{F}(\mathbf{x}(t), \mathbf{B})$$
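The sample version is straightforward to compute. The sketch below (an added illustration; the sources, mixing matrix, and nonlinearity are arbitrary choices) evaluates the sample average of the estimating function (12.14) at a true separating matrix: its off-diagonal elements are close to zero, as Eq. (12.13) requires.

```python
import numpy as np

rng = np.random.default_rng(4)

n, T = 2, 100_000
S = rng.uniform(-1.0, 1.0, (n, T))       # independent zero-mean sources
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
X = A @ S

B_true = np.linalg.inv(A)                # a true separating matrix
Y = B_true @ X                           # y(t) = B* x(t) = s(t)

f = np.tanh
# Sample average (1/T) sum_t f(y(t)) y(t)^T; Lambda is its diagonal part.
Fhat = (f(Y) @ Y.T) / T
Lam = np.diag(np.diag(Fhat))
print(Fhat - Lam)                        # off-diagonal elements near zero
```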
Its root $\hat{\mathbf{B}}$ is then an estimator for the true separating matrix. Obviously (see Chapter 4), the root $\hat{\mathbf{B}} = \hat{\mathbf{B}}[\mathbf{x}(1), \ldots, \mathbf{x}(T)]$ is a function of the training sample, and it is meaningful to consider its statistical properties like bias and variance. This gives a measure of goodness for the comparison of different estimating functions. The best estimating function is one that gives the smallest error between the true separating matrix $\mathbf{B}_*$ and the estimate $\hat{\mathbf{B}}$.
A particularly relevant measure is (Fisher) efficiency or asymptotic variance, as the size $T$ of the sample $\mathbf{x}(1), \ldots, \mathbf{x}(T)$ grows large (see Chapter 4). The goal is to design an estimating function that gives the smallest variance, given the set of observations $\mathbf{x}(t)$. Then the optimal amount of information is extracted from the training set.
The general result provided by Amari and Cardoso [8] is that estimating functions of the form (12.14) are optimal in the sense that, given any estimating function $\mathbf{F}$, one can always find a better or at least equally good estimating function (in the sense of efficiency) having the form

$$\mathbf{F}(\mathbf{x}, \mathbf{B}) = \mathbf{f}(\mathbf{y})\mathbf{y}^T - \mathbf{\Lambda} \qquad (12.15)$$
$$= \mathbf{f}(\mathbf{B}\mathbf{x})(\mathbf{B}\mathbf{x})^T - \mathbf{\Lambda} \qquad (12.16)$$

where $\mathbf{\Lambda}$ is a diagonal matrix. Actually, the diagonal matrix $\mathbf{\Lambda}$ has no effect on the off-diagonal elements of $\mathbf{F}(\mathbf{x}, \mathbf{B})$, which are the ones determining the independence between $y_i, y_j$; the diagonal elements are simply scaling factors.
The result shows that it is unnecessary to use a nonlinear function $g(y)$ instead of $y$ as the other one of the two functions in nonlinear decorrelation. Only one nonlinear function $f(y)$, combined with $y$, is sufficient. It is interesting that functions of exactly the type $\mathbf{f}(\mathbf{y})\mathbf{y}^T$ naturally emerge as gradients of cost functions such as the likelihood; the question of how to choose the nonlinearity $f(y)$ is also answered in that case. A further example is given in the following section.
The preceding analysis is not related in any way to the practical methods for finding the roots of estimating functions. Due to the nonlinearities, closed-form solutions do not exist and numerical algorithms have to be used. The simplest iterative stochastic approximation algorithm for solving the roots of $\mathbf{F}(\mathbf{x}, \mathbf{B})$ has the form

$$\Delta\mathbf{B} = \mu\,\mathbf{F}(\mathbf{x}, \mathbf{B}) \qquad (12.17)$$

with $\mu$ an appropriate learning rate. In fact, we now discover that the learning rules (12.9), (12.10), and (12.11) are examples of this more general framework.
12.5 EQUIVARIANT ADAPTIVE SEPARATION VIA INDEPENDENCE
In most of the proposed approaches to ICA, the learning rules are gradient descent algorithms of cost (or contrast) functions. Many cases have been covered in previous chapters. Typically, the cost function has the form $J(\mathbf{B}) = \mathrm{E}\{G(\mathbf{y})\}$, with $G$ some scalar function, and usually some additional constraints are used. Here again $\mathbf{y} = \mathbf{B}\mathbf{x}$, and the form of the function $G$ and the probability density of $\mathbf{x}$ determine the shape of the contrast function $J(\mathbf{B})$.
It is easy to show (see the definition of matrix and vector gradients in Chapter 3) that

$$\frac{\partial J(\mathbf{B})}{\partial \mathbf{B}} = \mathrm{E}\left\{\left(\frac{\partial G(\mathbf{y})}{\partial \mathbf{y}}\right)\mathbf{x}^T\right\} = \mathrm{E}\{\mathbf{g}(\mathbf{y})\mathbf{x}^T\} \qquad (12.18)$$

where $\mathbf{g}(\mathbf{y})$ is the gradient of $G(\mathbf{y})$. If $\mathbf{B}$ is square and invertible, then $\mathbf{x} = \mathbf{B}^{-1}\mathbf{y}$ and we have

$$\frac{\partial J(\mathbf{B})}{\partial \mathbf{B}} = \mathrm{E}\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}(\mathbf{B}^T)^{-1} \qquad (12.19)$$
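The identities (12.18)-(12.19) can be checked numerically for a finite sample, where the expectation becomes a sample mean. The sketch below is an added illustration; the choice $G(\mathbf{y}) = \sum_i \log\cosh y_i$, so that $g = \tanh$, is just one convenient example. It compares the formula (12.19) against a finite-difference gradient of the sample cost.

```python
import numpy as np

rng = np.random.default_rng(5)

n, T = 2, 500
X = rng.standard_normal((n, T))
B = np.array([[1.2, 0.3],
              [-0.2, 0.9]])

# Sample cost J(B) = mean_t sum_i log cosh((Bx(t))_i), so g(y) = tanh(y).
def J(B):
    return np.mean(np.sum(np.log(np.cosh(B @ X)), axis=0))

Y = B @ X
# Eq. (12.19): E{g(y) y^T} (B^T)^{-1}, with E replaced by the sample mean.
grad_formula = (np.tanh(Y) @ Y.T) / T @ np.linalg.inv(B.T)

# Central finite-difference gradient for comparison.
eps = 1e-6
grad_fd = np.zeros_like(B)
for i in range(n):
    for j in range(n):
        Bp = B.copy(); Bp[i, j] += eps
        Bm = B.copy(); Bm[i, j] -= eps
        grad_fd[i, j] = (J(Bp) - J(Bm)) / (2 * eps)

print(np.max(np.abs(grad_formula - grad_fd)))  # essentially zero
```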
For appropriate nonlinearities $G(\mathbf{y})$, these gradients are estimating functions in the sense that the elements of $\mathbf{y}$ must be statistically independent when the gradient becomes zero. Note also that in the form $\mathrm{E}\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}(\mathbf{B}^T)^{-1}$, the first factor $\mathbf{g}(\mathbf{y})\mathbf{y}^T$ has the shape of an optimal estimating function (except for the diagonal elements); see Eq. (12.15). Now we also know how the nonlinear function $\mathbf{g}(\mathbf{y})$ can be determined: it is directly the gradient of the function $G(\mathbf{y})$ appearing in the original cost function.
Unfortunately, the matrix inversion $(\mathbf{B}^T)^{-1}$ in (12.19) is cumbersome. Matrix inversion can be avoided by using the so-called natural gradient introduced by Amari [4]. This is covered in Chapter 3. The natural gradient is obtained in this case by multiplying the usual matrix gradient (12.19) from the right by the matrix $\mathbf{B}^T\mathbf{B}$, which gives $\mathrm{E}\{\mathbf{g}(\mathbf{y})\mathbf{y}^T\}\mathbf{B}$. The ensuing stochastic gradient algorithm to minimize the cost function $J(\mathbf{B})$ is then

$$\Delta\mathbf{B} = -\mu\,\mathbf{g}(\mathbf{y})\mathbf{y}^T\mathbf{B} \qquad (12.20)$$
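Numerically, the natural gradient direction is just the ordinary gradient (12.19) multiplied from the right by $\mathbf{B}^T\mathbf{B}$, and the matrix inversion then cancels exactly. A small added check (the sample and nonlinearity are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((2, 1000))
B = np.array([[1.1, 0.2],
              [-0.3, 0.8]])
Y = B @ X
T = X.shape[1]

g = np.tanh
grad = (g(Y) @ Y.T) / T @ np.linalg.inv(B.T)  # ordinary gradient, Eq. (12.19)
nat_grad = grad @ (B.T @ B)                   # multiply by B^T B from the right
direct = (g(Y) @ Y.T) / T @ B                 # E{g(y) y^T} B, no inversion needed

print(np.max(np.abs(nat_grad - direct)))      # essentially zero
```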
This learning rule again has the form of nonlinear decorrelations. Omitting the diagonal elements in the matrix $\mathbf{g}(\mathbf{y})\mathbf{y}^T$, the off-diagonal elements have the same form as in the Cichocki-Unbehauen algorithm (12.11), with the two functions now given by the linear function $\mathbf{y}$ and the gradient $\mathbf{g}(\mathbf{y})$.

This gradient algorithm can also be derived using the relative gradient introduced by Cardoso and Hvam Laheld [71]. This approach is also reviewed in Chapter 3. Based on this, the authors developed their equivariant adaptive separation via independence (EASI) learning algorithm. To proceed from (12.20) to the EASI learning rule, an extra step must be taken. In EASI, as in many other learning rules for ICA, a whitening preprocessing is considered for the mixture vectors $\mathbf{x}$ (see Chapter 6). We first transform $\mathbf{x}$ linearly to $\mathbf{z} = \mathbf{V}\mathbf{x}$, whose elements $z_i$ have
have