11
ICA by Tensorial Methods
One approach for estimation of independent component analysis (ICA) consists of
using the higher-order cumulant tensor. Tensors can be considered as generalizations
of matrices, or linear operators. Cumulant tensors are then generalizations of the
covariance matrix. The covariance matrix is the second-order cumulant tensor, and
the fourth-order tensor is defined by the fourth-order cumulants
$\operatorname{cum}(x_i, x_j, x_k, x_l)$.
For an introduction to cumulants, see Section 2.7.
As explained in Chapter 6, we can use the eigenvalue decomposition of the
covariance matrix to whiten the data. This means that we transform the data so that
second-order correlations are zero. As a generalization of this principle, we can use
the fourth-order cumulant tensor to make the fourth-order cumulants zero, or at least
as small as possible. This kind of (approximative) higher-order decorrelation gives
one class of methods for ICA estimation.
11.1 DEFINITION OF CUMULANT TENSOR
We shall here consider only the fourth-order cumulant tensor, which we call for
simplicity the cumulant tensor. The cumulant tensor is a four-dimensional array whose
entries are given by the fourth-order cross-cumulants of the data:
$\operatorname{cum}(x_i, x_j, x_k, x_l)$, where the indices $i, j, k, l$ are from $1$ to $n$.
This can be considered as a “four-dimensional matrix”, since it has four different
indices instead of the usual two. For a definition of cross-cumulants, see Eq. (2.106).
Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja.
Copyright 2001 John Wiley & Sons, Inc.
ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)

In fact, all fourth-order cumulants of linear combinations of $x_i$ can be obtained
as linear combinations of the cumulants of $x_i$. This can be seen using the additive
properties of the cumulants as discussed in Section 2.7. The kurtosis of a linear
combination is given by
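$$
\begin{aligned}
\operatorname{kurt}\Big(\sum_i w_i x_i\Big)
&= \operatorname{cum}\Big(\sum_i w_i x_i, \sum_j w_j x_j, \sum_k w_k x_k, \sum_l w_l x_l\Big) \\
&= \sum_{ijkl} w_i w_j w_k w_l \operatorname{cum}(x_i, x_j, x_k, x_l)
\end{aligned}
\tag{11.1}
$$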
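The relation (11.1) can be checked numerically. The following is a minimal sketch, assuming NumPy, zero-mean data, and sample averages in place of expectations; it compares the kurtosis of $\sum_i w_i x_i$ with the contraction $\sum_{ijkl} w_i w_j w_k w_l \operatorname{cum}(x_i, x_j, x_k, x_l)$. The helper names `cumulant_tensor` and `kurt` and the toy data are illustrative choices, not part of the text.

```python
import numpy as np

def cumulant_tensor(x):
    """Sample fourth-order cumulant tensor of data x with shape (n, T).

    Rows are variables, columns are observations. The data are centered
    first, so the zero-mean cumulant formula applies.
    """
    x = x - x.mean(axis=1, keepdims=True)
    T = x.shape[1]
    m4 = np.einsum('it,jt,kt,lt->ijkl', x, x, x, x) / T  # E{x_i x_j x_k x_l}
    r = x @ x.T / T                                      # E{x_i x_j}
    return (m4
            - np.einsum('ij,kl->ijkl', r, r)
            - np.einsum('ik,jl->ijkl', r, r)
            - np.einsum('il,jk->ijkl', r, r))

def kurt(y):
    """Sample kurtosis E{y^4} - 3 (E{y^2})^2 of a one-dimensional signal."""
    y = y - y.mean()
    return np.mean(y**4) - 3.0 * np.mean(y**2)**2

rng = np.random.default_rng(0)
x = rng.laplace(size=(3, 5000))   # toy data, chosen only for illustration
x[1] += 0.5 * x[0]                # make the variables dependent
w = np.array([0.3, -1.0, 0.7])    # arbitrary weight vector

C = cumulant_tensor(x)
lhs = kurt(w @ x)                                  # left-hand side of (11.1)
rhs = np.einsum('i,j,k,l,ijkl->', w, w, w, w, C)   # right-hand side of (11.1)
print(lhs, rhs)
```

Because both sides are computed from the same sample moments, they agree up to floating-point rounding, not merely up to estimation error.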
Thus the (fourth-order) cumulants contain all the fourth-order information of the data,
just as the covariance matrix gives all the second-order information on the data. Note
that if the $x_i$ are independent, all the cumulants with at least two different indices are
zero, and therefore we have the formula that was already widely used in Chapter 8:
$\operatorname{kurt}\big(\sum_i q_i s_i\big) = \sum_i q_i^4 \operatorname{kurt}(s_i)$.
The cumulant tensor is a linear operator defined by the fourth-order cumulants
$\operatorname{cum}(x_i, x_j, x_k, x_l)$. This is analogous to the case of the covariance matrix with
elements $\operatorname{cov}(x_i, x_j)$, which defines a linear operator just as any matrix defines one.
In the case of the tensor we have a linear transformation in the space of $n \times n$ matrices,
instead of the space of $n$-dimensional vectors. The space of such matrices is a linear
space of dimension $n \times n$, so there is nothing extraordinary in defining the linear
transformation. The $(i, j)$th element of the matrix given by the transformation, say
$F_{ij}$, is defined as
$$
F_{ij}(\mathbf{M}) = \sum_{kl} m_{kl} \operatorname{cum}(x_i, x_j, x_k, x_l)
\tag{11.2}
$$
where $m_{kl}$ are the elements in the matrix $\mathbf{M}$ that is transformed.
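In code, the transformation (11.2) is a single tensor contraction. The sketch below assumes NumPy and a cumulant tensor stored as an array `C` of shape `(n, n, n, n)`; the function name `apply_cumulant_tensor` is hypothetical, and the tensor used here is a synthetic, fully symmetric array built only to exercise the function.

```python
import numpy as np

def apply_cumulant_tensor(C, M):
    """Return the matrix F(M) with entries sum_kl M[k, l] * C[i, j, k, l]."""
    return np.einsum('ijkl,kl->ij', C, M)

# A small synthetic tensor with the index symmetries of a cumulant tensor:
# C[i,j,k,l] = sum_q kappa[q] * a[q,i] * a[q,j] * a[q,k] * a[q,l].
rng = np.random.default_rng(1)
n = 3
a = rng.normal(size=(n, n))
kappa = np.array([2.0, -1.0, 0.5])
C = np.einsum('q,qi,qj,qk,ql->ijkl', kappa, a, a, a, a)

M1 = rng.normal(size=(n, n))
M2 = rng.normal(size=(n, n))
F1 = apply_cumulant_tensor(C, M1)
F2 = apply_cumulant_tensor(C, M2)
F12 = apply_cumulant_tensor(C, 2.0 * M1 - 3.0 * M2)   # linearity check
```

The result is linear in `M`, and it is a symmetric matrix even when `M` is not, because the cumulant is unchanged when $i$ and $j$ are swapped.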
11.2 TENSOR EIGENVALUES GIVE INDEPENDENT COMPONENTS
Like any symmetric linear operator, the cumulant tensor has an eigenvalue
decomposition (EVD). An eigenmatrix of the tensor is, by definition, a matrix
$\mathbf{M}$ such that
$$
\mathbf{F}(\mathbf{M}) = \lambda \mathbf{M}
\tag{11.3}
$$
i.e., $F_{ij}(\mathbf{M}) = \lambda M_{ij}$, where $\lambda$ is a scalar eigenvalue.
The cumulant tensor is a symmetric linear operator, since in the expression
$\operatorname{cum}(x_i, x_j, x_k, x_l)$, the order of the variables makes no difference. Therefore, the
tensor has an eigenvalue decomposition.
Let us consider the case where the data follows the ICA model, with whitened
data:
$$
\mathbf{z} = \mathbf{V}\mathbf{A}\mathbf{s} = \mathbf{W}^T \mathbf{s}
\tag{11.4}
$$
where we denote the whitened mixing matrix by $\mathbf{W}^T$. This is because it is orthogonal,
and thus it is the transpose of the separating matrix $\mathbf{W}$ for whitened data.
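The orthogonality of the whitened mixing matrix in (11.4) can be illustrated with a few lines of NumPy. This is a sketch under stated assumptions: the sources have unit variance, so the covariance of $\mathbf{x} = \mathbf{A}\mathbf{s}$ is $\mathbf{A}\mathbf{A}^T$; the mixing matrix `A` is an arbitrary example; and whitening is done by EVD of the covariance matrix, as described in Chapter 6.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
A = rng.normal(size=(n, n)) + 3.0 * np.eye(n)   # hypothetical mixing matrix
cov_x = A @ A.T                      # covariance of x = A s when E{s s^T} = I

d, E = np.linalg.eigh(cov_x)         # EVD of the covariance matrix
V = E @ np.diag(d ** -0.5) @ E.T     # whitening matrix, inverse square root
Wt = V @ A                           # whitened mixing matrix, W^T in (11.4)

print(np.round(Wt @ Wt.T, 6))        # close to the identity matrix
```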
The cumulant tensor of $\mathbf{z}$ has a special structure that can be seen in the eigenvalue
decomposition. In fact, every matrix of the form
$$
\mathbf{M} = \mathbf{w}_m \mathbf{w}_m^T
\tag{11.5}
$$
for $m = 1, \ldots, n$ is an eigenmatrix. The vector $\mathbf{w}_m$ is here one of the rows of the
matrix $\mathbf{W}$, and thus one of the columns of the whitened mixing matrix $\mathbf{W}^T$. To see
this, we calculate by the linearity properties of cumulants
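$$
\begin{aligned}
F_{ij}(\mathbf{w}_m \mathbf{w}_m^T)
&= \sum_{kl} w_{mk} w_{ml} \operatorname{cum}(z_i, z_j, z_k, z_l) \\
&= \sum_{kl} w_{mk} w_{ml} \operatorname{cum}\Big(\sum_q w_{qi} s_q, \sum_{q'} w_{q'j} s_{q'}, \sum_r w_{rk} s_r, \sum_{r'} w_{r'l} s_{r'}\Big) \\
&= \sum_{klqq'rr'} w_{mk} w_{ml} w_{qi} w_{q'j} w_{rk} w_{r'l} \operatorname{cum}(s_q, s_{q'}, s_r, s_{r'})
\end{aligned}
\tag{11.6}
$$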
Now, due to the independence of the $s_i$, only those cumulants where
$q = q' = r = r'$ are nonzero. Thus we have
$$
F_{ij}(\mathbf{w}_m \mathbf{w}_m^T) = \sum_{klq} w_{mk} w_{ml} w_{qi} w_{qj} w_{qk} w_{ql} \operatorname{kurt}(s_q)
\tag{11.7}
$$
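Due to the orthogonality of the rows of $\mathbf{W}$, we have $\sum_k w_{mk} w_{qk} = \delta_{mq}$, and
similarly for index $l$. Thus we can take the sum first with respect to $k$, and then with
respect to $l$, which gives
$$
\begin{aligned}
F_{ij}(\mathbf{w}_m \mathbf{w}_m^T)
&= \sum_{lq} w_{ml} w_{qi} w_{qj} \delta_{mq} w_{ql} \operatorname{kurt}(s_q) \\
&= \sum_{q} w_{qi} w_{qj} \delta_{mq} \delta_{mq} \operatorname{kurt}(s_q)
= w_{mi} w_{mj} \operatorname{kurt}(s_m)
\end{aligned}
\tag{11.8}
$$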
This proves that matrices of the form in (11.5) are eigenmatrices of the tensor. The
corresponding eigenvalues are given by the kurtoses of the independent components.
Moreover, it can be proven that all other eigenvalues of the tensor are zero.
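The eigenmatrix property can also be verified numerically. In the sketch below (NumPy assumed), the population cumulant tensor of $\mathbf{z} = \mathbf{W}^T\mathbf{s}$ is built directly from its structure, $\operatorname{cum}(z_i, z_j, z_k, z_l) = \sum_q w_{qi} w_{qj} w_{qk} w_{ql} \operatorname{kurt}(s_q)$, which follows from (11.6) and the independence of the $s_q$. The orthogonal matrix `W` and the kurtosis values are arbitrary illustrative choices, not estimates from data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
# Random orthogonal W (rows w_m), standing in for the separating matrix.
W, _ = np.linalg.qr(rng.normal(size=(n, n)))
kurtoses = np.array([3.0, -1.2, 1.5, -0.6])   # assumed kurt(s_q), all distinct

# Population cumulant tensor of z = W^T s for independent s.
C = np.einsum('q,qi,qj,qk,ql->ijkl', kurtoses, W, W, W, W)

def F(M):
    """Apply the cumulant tensor to a matrix, as in (11.2)."""
    return np.einsum('ijkl,kl->ij', C, M)

m = 1
M = np.outer(W[m], W[m])             # candidate eigenmatrix w_m w_m^T
FM = F(M)                            # equals kurtoses[m] * M, as in (11.8)

cross = F(np.outer(W[0], W[2]))      # w_0 w_2^T is mapped to the zero matrix
```

The second check illustrates the remark about zero eigenvalues: a matrix $\mathbf{w}_a\mathbf{w}_b^T$ with $a \neq b$ is mapped to zero.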
Thus we see that if we knew the eigenmatrices of the cumulant tensor, we could
easily obtain the independent components. If the eigenvalues of the tensor, i.e., the
kurtoses of the independent components, are distinct, every eigenmatrix that corresponds
to a nonzero eigenvalue is of the form $\mathbf{w}_m \mathbf{w}_m^T$, giving one of the columns of the
whitened mixing matrix.
If the eigenvalues are not distinct, the situation is more problematic: The
eigenmatrices are no longer uniquely defined, since any linear combinations of the matrices
$\mathbf{w}_m \mathbf{w}_m^T$ corresponding to the same eigenvalue are eigenmatrices of the tensor as
well. Thus, every $k$-fold eigenvalue corresponds to $k$ matrices $\mathbf{M}_i$, $i = 1, \ldots, k$, that
are different linear combinations of the matrices $\mathbf{w}_{i(j)} \mathbf{w}_{i(j)}^T$ corresponding to the
$k$ ICs whose indices are denoted by $i(j)$. The matrices $\mathbf{M}_i$ can be thus expressed as:
$$
\mathbf{M}_i = \sum_{j=1}^{k} \alpha_j \mathbf{w}_{i(j)} \mathbf{w}_{i(j)}^T
\tag{11.9}
$$
Now, vectors that can be used to construct the matrix in this way can be computed
by the eigenvalue decomposition of the matrix: The $\mathbf{w}_{i(j)}$ are the (dominant)
eigenvectors of $\mathbf{M}_i$.
Thus, after finding the eigenmatrices $\mathbf{M}_i$ of the cumulant tensor, we can
decompose them by ordinary EVD, and the eigenvectors give the columns $\mathbf{w}_i$ of the
whitened mixing matrix. Of course, it could turn out that the eigenvalues in this latter
EVD are equal as well, in which case we have to figure out something else. In the algorithms
given below, this problem will be solved in different ways.
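The second-stage EVD can be sketched as follows, assuming NumPy. An eigenmatrix of the form (11.9) is constructed by hand from two rows of a random orthogonal `W`; the weights in `alpha` are hypothetical and deliberately distinct, which is the case in which the ordinary EVD identifies the vectors uniquely (up to sign).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4
W, _ = np.linalg.qr(rng.normal(size=(n, n)))   # orthogonal, rows are w_m

# Suppose ICs 0 and 2 share the same kurtosis, so the tensor EVD returns
# some combination M = alpha_1 w_0 w_0^T + alpha_2 w_2 w_2^T (alphas assumed).
alpha = (0.8, 0.3)
M = alpha[0] * np.outer(W[0], W[0]) + alpha[1] * np.outer(W[2], W[2])

vals, vecs = np.linalg.eigh(M)       # eigenvalues in ascending order
w_a = vecs[:, -1]                    # eigenvector for eigenvalue 0.8
w_b = vecs[:, -2]                    # eigenvector for eigenvalue 0.3

print(abs(w_a @ W[0]), abs(w_b @ W[2]))   # both close to 1
```

If the two weights were equal, any rotation of `w_a` and `w_b` within their plane would be an equally valid pair of eigenvectors, which is the degenerate case mentioned in the text.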
This result leaves the problem of how to compute the eigenvalue decomposition
of the tensor in practice. This will be treated in the next section.
11.3 COMPUTING THE TENSOR DECOMPOSITION BY A POWER METHOD
In principle, using tensorial methods is simple. One could take any method for
computing the EVD of a symmetric matrix, and apply it on the cumulant tensor.
To do this, we must first consider the tensor as a matrix in the space of $n \times n$
matrices. Let $q$ be an index that goes through all the $n \times n$ couples $(i, j)$. Then we
can consider the elements of an $n \times n$ matrix $\mathbf{M}$ as a vector. This means that we
are simply vectorizing the matrices. Then the tensor can be considered as an $n^2 \times n^2$
symmetric matrix $\mathbf{F}$ with elements $f_{qq'} = \operatorname{cum}(z_i, z_j, z_{i'}, z_{j'})$, where the indices
$(i, j)$ correspond to $q$, and similarly for $(i', j')$ and $q'$. It is on this matrix that we
could apply ordinary EVD algorithms, for example the well-known QR methods. The
special symmetry properties of the tensor could be used to reduce the complexity.
Such algorithms are out of the scope of this book; see e.g. [62].
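The vectorization can be sketched in NumPy, where a reshape maps the pair $(i, j)$ to a single index (here $q = i \cdot n + j$ with zero-based indices, one possible convention). The tensor is again the population tensor built from an arbitrary orthogonal `W` and assumed distinct kurtoses, and `np.linalg.eigh` stands in for whichever symmetric EVD routine one prefers.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3
W, _ = np.linalg.qr(rng.normal(size=(n, n)))   # orthogonal, rows are w_m
kurtoses = np.array([3.0, -1.2, 1.5])          # assumed distinct kurtoses
C = np.einsum('q,qi,qj,qk,ql->ijkl', kurtoses, W, W, W, W)

# Vectorize: the pair (i, j) becomes the single index q = i * n + j.
F_mat = C.reshape(n * n, n * n)                # symmetric n^2 x n^2 matrix

vals, vecs = np.linalg.eigh(F_mat)
order = np.argsort(-np.abs(vals))              # largest |eigenvalue| first
top_vals = vals[order[:n]]                     # the n nonzero eigenvalues
E = vecs[:, order[0]].reshape(n, n)            # eigenmatrix for the largest one

# E is +/- w w^T; its dominant eigenvector recovers w up to sign.
e_vals, e_vecs = np.linalg.eigh(E)
w_est = e_vecs[:, np.argmax(np.abs(e_vals))]
```

In this example the nonzero eigenvalues are the kurtoses, the remaining $n^2 - n$ eigenvalues are zero, and `w_est` matches the row of `W` with the largest absolute kurtosis up to sign.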
The problem with the algorithms in this category, however, is that the memory
requirements may be prohibitive, because often the coefficients of the fourth-order
tensor must be stored in memory, which requires $O(n^4)$ units of memory. The
computational load also grows quite fast. Thus these algorithms cannot be used in
high-dimensional spaces. In addition, equal eigenvalues may give problems.
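The $O(n^4)$ storage arises when the tensor is formed explicitly. As a side note, the product $\mathbf{F}(\mathbf{M})$ in (11.2) can be estimated from zero-mean data without forming the tensor, by expanding the fourth-order cumulant into moments: $\mathbf{F}(\mathbf{M}) = E\{(\mathbf{z}^T\mathbf{M}\mathbf{z})\,\mathbf{z}\mathbf{z}^T\} - \mathbf{R}\operatorname{tr}(\mathbf{M}\mathbf{R}) - \mathbf{R}\mathbf{M}\mathbf{R} - \mathbf{R}\mathbf{M}^T\mathbf{R}$ with $\mathbf{R} = E\{\mathbf{z}\mathbf{z}^T\}$. The sketch below assumes NumPy; the function name `apply_tensor_from_data` is hypothetical, and the explicit tensor is built only as a reference for a small $n$.

```python
import numpy as np

def apply_tensor_from_data(z, M):
    """Estimate F(M) from zero-mean data z of shape (n, T), tensor-free."""
    T = z.shape[1]
    r = z @ z.T / T                          # sample covariance R
    q = np.einsum('it,ij,jt->t', z, M, z)    # z(t)^T M z(t) for every sample
    m4_term = (z * q) @ z.T / T              # E{(z^T M z) z z^T}
    return m4_term - r * np.trace(M @ r) - r @ M @ r - r @ M.T @ r

rng = np.random.default_rng(8)
z = rng.laplace(size=(3, 2000))              # toy data for illustration
z = z - z.mean(axis=1, keepdims=True)
M = rng.normal(size=(3, 3))

# Reference: build the full tensor (feasible only for small n) and contract it.
T = z.shape[1]
r = z @ z.T / T
C = (np.einsum('it,jt,kt,lt->ijkl', z, z, z, z) / T
     - np.einsum('ij,kl->ijkl', r, r)
     - np.einsum('ik,jl->ijkl', r, r)
     - np.einsum('il,jk->ijkl', r, r))
F_full = np.einsum('ijkl,kl->ij', C, M)
F_direct = apply_tensor_from_data(z, M)
```

The direct version needs only $n \times n$ matrices besides the data itself, which is the same idea that the simplified iteration later in this section exploits for $\mathbf{M} = \mathbf{w}\mathbf{w}^T$.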
In the following we discuss a simple modification of the power method that
circumvents the computational problems with the tensor EVD. In general, the power
method is a simple way of computing the eigenvector corresponding to the largest
eigenvalue of a matrix. This algorithm consists of multiplying the matrix with the
running estimate of the eigenvector, and taking the product as the new value of the
vector. The vector is then normalized to unit length, and the iteration is continued
until convergence. The vector then gives the desired eigenvector.
We can apply the power method quite simply to the case of the cumulant tensor.
Starting from a random matrix $\mathbf{M}$, we compute $\mathbf{F}(\mathbf{M})$ and take this as the new value
of $\mathbf{M}$. Then we normalize $\mathbf{M}$ and go back to the iteration step. After convergence,
$\mathbf{M}$ will be of the form $\sum_k \alpha_k \mathbf{w}_{i(k)} \mathbf{w}_{i(k)}^T$. Computing its eigenvectors gives one or
more of the independent components. (In practice, though, the eigenvectors will
not be exactly of this form due to estimation errors.) To find several independent
components, we could simply project the matrix after every step on the space of
matrices that are orthogonal to the previously found ones.
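The basic power iteration on the tensor can be sketched as follows, assuming NumPy. To keep the example deterministic, the tensor is the population tensor of whitened ICA data built from an arbitrary orthogonal `W` and assumed kurtoses; with real data the tensor would be estimated and the result would hold only approximately. The iteration count is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
W, _ = np.linalg.qr(rng.normal(size=(n, n)))   # orthogonal, rows are w_m
kurtoses = np.array([3.0, -1.2, 1.5, -0.6])    # assumed, distinct in magnitude
C = np.einsum('q,qi,qj,qk,ql->ijkl', kurtoses, W, W, W, W)

def F(M):
    """Apply the cumulant tensor to a matrix, as in (11.2)."""
    return np.einsum('ijkl,kl->ij', C, M)

M = rng.normal(size=(n, n))          # random starting matrix
for _ in range(100):
    M = F(M)                         # multiply by the tensor
    M /= np.linalg.norm(M)           # normalize (Frobenius norm)

# At convergence M is (up to sign) w w^T for the IC with the largest |kurtosis|.
vals, vecs = np.linalg.eigh(M)
w_est = vecs[:, np.argmax(np.abs(vals))]
print(abs(w_est @ W[0]))             # close to 1
```

Since the kurtoses here differ in absolute value, the iterate converges to a single eigenmatrix; with ties in absolute value it would converge only to a combination of the tied eigenmatrices.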
In fact, in the case of ICA, such an algorithm can be considerably simplified.
Since we know that the matrices $\mathbf{w}_i \mathbf{w}_i^T$ are eigenmatrices of the cumulant tensor, we
can apply the power method inside that set of matrices $\mathbf{M} = \mathbf{w}\mathbf{w}^T$ only. After every
computation of the product with the tensor, we must then project the obtained matrix
back to the set of matrices of the form $\mathbf{w}\mathbf{w}^T$. A very simple way of doing this is to
multiply the new matrix $\mathbf{M}$ by the old vector to obtain the new vector, $\mathbf{w} \leftarrow \mathbf{M}\mathbf{w}$
(which will be normalized as necessary). This can be interpreted as another power
method, this time applied on the eigenmatrix to compute its eigenvectors. Since the
best way of approximating the matrix $\mathbf{M}$ in the space of matrices of the form $\mathbf{w}\mathbf{w}^T$
is by using the dominant eigenvector, a single step of this ordinary power method
will at least take us closer to the dominant eigenvector, and thus to the optimal vector.
Thus we obtain an iteration of the form
$$
\mathbf{w} \leftarrow \mathbf{w}^T \mathbf{F}(\mathbf{w}\mathbf{w}^T)
\tag{11.10}
$$
or
$$
w_i \leftarrow \sum_j w_j \sum_{kl} w_k w_l \operatorname{cum}(z_i, z_j, z_k, z_l)
\tag{11.11}
$$
In fact, this can be manipulated algebraically to give much simpler forms. We have
equivalently
$$
w_i \leftarrow \operatorname{cum}\Big(z_i, \sum_j w_j z_j, \sum_k w_k z_k, \sum_l w_l z_l\Big) = \operatorname{cum}(z_i, y, y, y)
\tag{11.12}
$$
where we denote by $y = \sum_i w_i z_i$ the estimate of an independent component. By
definition of the cumulants, we have
$$
\operatorname{cum}(z_i, y, y, y) = E\{z_i y^3\} - 3 E\{z_i y\} E\{y^2\}
\tag{11.13}
$$
We can constrain $y$ to have unit variance, as usual. Moreover, we have $E\{z_i y\} = w_i$.
Thus we have
$$
\mathbf{w} \leftarrow E\{\mathbf{z} y^3\} - 3\mathbf{w}
\tag{11.14}
$$
where $\mathbf{w}$ is normalized to unit norm after every iteration. To find several
independent components, we can actually just constrain the $\mathbf{w}$ corresponding to different
independent components to be orthogonal, as is usual for whitened data.
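The step from (11.13) to (11.14) rests on two consequences of whitening. The short derivation below spells them out, assuming as in the text that $E\{z_i z_j\} = \delta_{ij}$ and $\|\mathbf{w}\| = 1$:

```latex
\begin{align*}
E\{z_i y\} &= \sum_j w_j E\{z_i z_j\} = \sum_j w_j \delta_{ij} = w_i, \\
E\{y^2\}   &= \sum_{ij} w_i w_j E\{z_i z_j\} = \sum_i w_i^2 = \|\mathbf{w}\|^2 = 1, \\
\operatorname{cum}(z_i, y, y, y) &= E\{z_i y^3\} - 3 E\{z_i y\} E\{y^2\} = E\{z_i y^3\} - 3 w_i .
\end{align*}
```

Collecting the components $i = 1, \ldots, n$ into a vector gives the right-hand side $E\{\mathbf{z} y^3\} - 3\mathbf{w}$.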
Somewhat surprisingly, (11.14) is exactly the FastICA algorithm that was derived
as a fixed-point iteration for finding the maxima of the absolute value of kurtosis in
Chapter 8, see (8.20). We see that these two methods lead to the same algorithm.
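A minimal sketch of the iteration (11.14) on simulated data is given below, assuming NumPy. The two sources, the mixing matrix `A`, the sample size, and the fixed number of iterations are all arbitrary illustrative choices; whitening is done by EVD of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 20000
# Two independent, zero-mean, unit-variance sources (an assumed toy setup):
s = np.vstack([
    rng.uniform(-np.sqrt(3), np.sqrt(3), T),     # sub-Gaussian, kurtosis < 0
    rng.laplace(scale=1 / np.sqrt(2), size=T),   # super-Gaussian, kurtosis > 0
])
A = np.array([[1.0, 0.6], [0.4, 1.0]])           # arbitrary mixing matrix
x = A @ s
x = x - x.mean(axis=1, keepdims=True)

# Whitening by EVD of the covariance matrix: z = V x with E{z z^T} = I.
d, E = np.linalg.eigh(x @ x.T / T)
V = E @ np.diag(d ** -0.5) @ E.T
z = V @ x

w = rng.normal(size=2)
w /= np.linalg.norm(w)
for _ in range(50):
    y = w @ z
    w_new = (z * y**3).mean(axis=1) - 3.0 * w    # iteration (11.14)
    w = w_new / np.linalg.norm(w_new)            # renormalize to unit norm

y = w @ z
corr = [abs(np.corrcoef(y, s_i)[0, 1]) for s_i in s]
print(corr)   # one entry close to 1: y matches one source up to sign
```

Which source is found depends on the random starting vector. For a source with negative kurtosis the sign of `w` flips at every step, so convergence should be judged up to sign.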