10 ICA by Minimization of Mutual Information
An important approach for independent component analysis (ICA) estimation, in-
spired by information theory, is minimization of mutual information.
The motivation of this approach is that it may not be very realistic in many cases
to assume that the data follows the ICA model. Therefore, we would like to develop
an approach that does not assume anything about the data. What we want to have
is a general-purpose measure of the dependence of the components of a random
vector. Using such a measure, we could define ICA as a linear decomposition that
minimizes that dependence measure. Such an approach can be developed using
mutual information, which is a well-motivated information-theoretic measure of
statistical dependence.
One of the main utilities of mutual information is that it serves as a unifying
framework for many estimation principles, in particular maximum likelihood (ML)
estimation and maximization of nongaussianity. Moreover, this approach gives a
rigorous justification for the heuristic principle of nongaussianity.
10.1 DEFINING ICA BY MUTUAL INFORMATION
10.1.1 Information-theoretic concepts
The information-theoretic concepts needed in this chapter were explained in Chap-
ter 5. Readers not familiar with information theory are advised to read that chapter
before this one.
We recall here very briefly the basic definitions of information theory. The
differential entropy H of a random vector y with density p(y) is defined as:

H(\mathbf{y}) = -\int p(\mathbf{y}) \log p(\mathbf{y}) \, d\mathbf{y}    (10.1)
Entropy is closely related to the code length of the random vector. A normalized
version of entropy is given by negentropy J, which is defined as follows:

J(\mathbf{y}) = H(\mathbf{y}_{\mathrm{gauss}}) - H(\mathbf{y})    (10.2)

where y_gauss is a gaussian random vector of the same covariance (or correlation)
matrix as y. Negentropy is always nonnegative, and zero only for gaussian random
vectors. Mutual information I between m (scalar) random variables y_i, i = 1, ..., m,
is defined as follows:

I(y_1, y_2, \ldots, y_m) = \sum_{i=1}^{m} H(y_i) - H(\mathbf{y})    (10.3)
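To make these definitions concrete, here is a minimal sketch of how H and I could be
estimated from samples with simple histogram-based estimators. The helper names, bin
counts, and test data are our own choices for illustration; histogram estimators are
crude and carry a small positive bias in the mutual information.

```python
import numpy as np

def entropy_hist(samples, bins=30):
    """Crude estimate of the differential entropy H(y) = -E{log p(y)}, cf. (10.1)."""
    counts, edges = np.histogram(samples, bins=bins, density=True)
    widths = np.diff(edges)
    keep = counts > 0
    # -sum p(y) log p(y) dy over the histogram bins
    return -np.sum(counts[keep] * np.log(counts[keep]) * widths[keep])

def mutual_information_hist(Y, bins=30):
    """Estimate I(y_1, ..., y_m) = sum_i H(y_i) - H(y) for rows of Y (m x T), cf. (10.3).
    The joint entropy uses an m-dimensional histogram, so this is only practical
    for small m (here m = 2)."""
    m, T = Y.shape
    marginals = sum(entropy_hist(Y[i], bins) for i in range(m))
    counts, edges = np.histogramdd(Y.T, bins=bins, density=True)
    cell = np.prod([np.diff(e)[0] for e in edges])  # volume of one (equal-sized) cell
    p = counts[counts > 0]
    joint = -np.sum(p * np.log(p) * cell)
    return marginals - joint

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 50000))  # independent components
    x = np.array([[1.0, 0.6], [0.4, 1.0]]) @ s                 # their mixtures
    print("independent components:", mutual_information_hist(s))  # close to zero
    print("mixtures:              ", mutual_information_hist(x))  # clearly positive
```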
10.1.2 Mutual information as measure of dependence
We have seen earlier (Chapter 5) that mutual information is a natural measure of the
dependence between random variables. It is always nonnegative, and zero if and only
if the variables are statistically independent. Mutual information takes into account
the whole dependence structure of the variables, and not just the covariance, like
principal component analysis (PCA) and related methods.
Therefore, we can use mutual information as the criterion for finding the ICA
representation. This approach is an alternative to the model estimation approach. We
define the ICA of a random vector x as an invertible transformation:

\mathbf{s} = \mathbf{B}\mathbf{x}    (10.4)

where the matrix B is determined so that the mutual information of the transformed
components s_i is minimized. If the data follows the ICA model, this allows estimation
of the data model. On the other hand, in this definition, we do not need to assume
that the data follows the model. In any case, minimization of mutual information can
be interpreted as giving the maximally independent components.
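As a naive but concrete illustration of this definition, the sketch below restricts the
search to rotations of whitened two-dimensional data (a restriction that will be
justified in Section 10.2) and simply picks, by grid search, the rotation that
minimizes a histogram-based estimate of the mutual information. The data, the
estimator, and the grid search are our own choices; practical algorithms are discussed
in Section 10.4.

```python
import numpy as np

def pairwise_mi(y1, y2, bins=40):
    """Histogram estimate of I(y1, y2) = sum p(y1,y2) log[p(y1,y2) / (p(y1)p(y2))]."""
    pxy, xe, ye = np.histogram2d(y1, y2, bins=bins, density=True)
    dx, dy = np.diff(xe)[0], np.diff(ye)[0]
    px, py = pxy.sum(axis=1) * dy, pxy.sum(axis=0) * dx   # marginal densities
    keep = pxy > 0
    return np.sum(pxy[keep] * np.log(pxy[keep] / np.outer(px, py)[keep]) * dx * dy)

def whiten(x):
    """Center x (2 x T) and transform it to have identity covariance."""
    x = x - x.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(x))
    return E @ np.diag(d ** -0.5) @ E.T @ x

rng = np.random.default_rng(1)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 20000))  # independent sources
z = whiten(np.array([[1.0, 0.5], [0.3, 1.0]]) @ s)         # whitened mixtures

# For whitened data we restrict the search to rotations for simplicity.
thetas = np.linspace(0.0, np.pi / 2, 91)
mi_values = []
for th in thetas:
    B = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
    y = B @ z
    mi_values.append(pairwise_mi(y[0], y[1]))
best = thetas[int(np.argmin(mi_values))]
print(f"rotation angle minimizing the estimated mutual information: {best:.3f} rad")
```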
10.2 MUTUAL INFORMATION AND NONGAUSSIANITY
Using the formula for the differential entropy of a transformation as given in (5.13)
of Chapter 5, we obtain a corresponding result for mutual information. We have
for an invertible linear transformation y = Bx:

I(y_1, y_2, \ldots, y_n) = \sum_i H(y_i) - H(\mathbf{x}) - \log|\det \mathbf{B}|    (10.5)
Now, let us consider what happens if we constrain the y_i to be uncorrelated and of
unit variance. This means E{yy^T} = B E{xx^T} B^T = I, which implies

\det \mathbf{I} = 1 = \det(\mathbf{B} E\{\mathbf{x}\mathbf{x}^T\} \mathbf{B}^T) = (\det \mathbf{B})(\det E\{\mathbf{x}\mathbf{x}^T\})(\det \mathbf{B}^T)    (10.6)
and this implies that det B must be constant since det E{xx^T} does not depend
on B. Moreover, for y_i of unit variance, entropy and negentropy differ only by a
constant and the sign, as can be seen in (10.2). Thus we obtain

I(y_1, y_2, \ldots, y_n) = \mathrm{const.} - \sum_i J(y_i)    (10.7)
where the constant term does not depend on B. This shows the fundamental relation
between negentropy and mutual information.

We see in (10.7) that finding an invertible linear transformation B that minimizes
the mutual information is roughly equivalent to finding directions in which the ne-
gentropy is maximized. We have seen previously that negentropy is a measure of
nongaussianity. Thus, (10.7) shows that ICA estimation by minimization of mutual in-
formation is equivalent to maximizing the sum of nongaussianities of the estimates of
the independent components, when the estimates are constrained to be uncorrelated.
Thus, we see that the formulation of ICA as minimization of mutual information
gives another rigorous justification of our more heuristically introduced idea of finding
maximally nongaussian directions, as used in Chapter 8.
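A quick numerical sanity check of (10.7): take two independent, unit-variance uniform
variables (already white), rotate them by an orthogonal B, and compare a histogram
estimate of the mutual information with the sum of negentropies, the latter computed
with a log cosh approximation of the kind discussed in Section 5.6 (proportionality
constant dropped, which does not affect the comparison). The helper functions and
numbers below are our own illustration, not part of the text.

```python
import numpy as np

def negentropy_logcosh(y):
    """Negentropy approximation of the form J(y) ~ (E{G(y)} - E{G(v)})^2, G(u) = log cosh u."""
    E_G_gauss = 0.3746  # E{log cosh v} for v ~ N(0, 1) (numerical constant)
    return (np.mean(np.log(np.cosh(y))) - E_G_gauss) ** 2

def pairwise_mi(y1, y2, bins=40):
    """Histogram estimate of the mutual information of two scalar variables."""
    pxy, xe, ye = np.histogram2d(y1, y2, bins=bins, density=True)
    dx, dy = np.diff(xe)[0], np.diff(ye)[0]
    px, py = pxy.sum(axis=1) * dy, pxy.sum(axis=0) * dx
    keep = pxy > 0
    return np.sum(pxy[keep] * np.log(pxy[keep] / np.outer(px, py)[keep]) * dx * dy)

# Unit-variance uniform variables are already white, so every orthogonal B
# keeps E{yy^T} = I and relation (10.7) applies.
rng = np.random.default_rng(2)
z = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 50000))

for th in np.linspace(0.0, np.pi / 4, 5):
    B = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
    y = B @ z
    J_sum = negentropy_logcosh(y[0]) + negentropy_logcosh(y[1])
    mi = pairwise_mi(y[0], y[1])
    print(f"theta={th:4.2f}   sum of J(y_i)={J_sum:8.5f}   estimated I={mi:8.5f}")
# As theta moves away from 0 the estimated mutual information grows while the
# sum of negentropies shrinks, as (10.7) predicts (up to estimator bias).
```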
In practice, however, there are also some important differences between these two
criteria.
1. Negentropy, and other measures of nongaussianity, enable the deflationary, i.e.,
one-by-one, estimation of the independent components, since we can look for
the maxima of nongaussianity of a single projection b^T x. This is not possible
with mutual information or most other criteria, like the likelihood.
2. A smaller difference is that in using nongaussianity, we force the estimates of
the independent components to be uncorrelated. This is not necessary when
using mutual information, because we could use the form in (10.5) directly,
as will be seen in the next section. Thus the optimization space is slightly
reduced.
10.3 MUTUAL INFORMATION AND LIKELIHOOD
Mutual information and likelihood are intimately connected. To see the connection
between likelihood and mutual information, consider the expectation of the
log-likelihood in (9.5):

\frac{1}{T} E\{\log L(\mathbf{B})\} = \sum_{i=1}^{n} E\{\log p_i(\mathbf{b}_i^T \mathbf{x})\} + \log|\det \mathbf{B}|    (10.8)
If the p_i were equal to the actual pdf's of b_i^T x, the first term would be equal to
-\sum_i H(b_i^T x). Thus the likelihood would be equal, up to an additive constant given
by the total entropy of x, to the negative of mutual information as given in Eq. (10.5).
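The following sketch evaluates the empirical counterpart of (10.8) for two candidate
matrices B, plugging in the true Laplacian log-densities of artificially generated
sources for the p_i; the data, the mixing matrix, and the function names are our own
choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 2, 50000
s = rng.laplace(scale=1 / np.sqrt(2), size=(n, T))   # unit-variance Laplacian sources
A = np.array([[1.0, 0.6], [0.3, 1.0]])               # mixing matrix
x = A @ s                                            # observed mixtures

def avg_loglik(B, x):
    """Empirical version of (1/T) E{log L(B)} in (10.8), with the Laplacian
    log-density log p_i(u) = -sqrt(2)|u| - log(sqrt(2)) used for every p_i."""
    y = B @ x
    density_term = np.mean(np.sum(-np.sqrt(2) * np.abs(y) - 0.5 * np.log(2.0), axis=0))
    return density_term + np.log(abs(np.linalg.det(B)))

B_separating = np.linalg.inv(A)   # recovers the independent components
B_identity = np.eye(n)            # leaves the mixtures untouched
print("average log-likelihood, separating B:", avg_loglik(B_separating, x))
print("average log-likelihood, identity B:  ", avg_loglik(B_identity, x))
# The separating matrix gives the clearly higher value, in line with the
# connection between likelihood and mutual information discussed here.
```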
In practice, the connection may be just as strong, or even stronger. This is because
in practice we do not know the distributions of the independent components that are
needed in ML estimation. A reasonable approach would be to estimate the density
of b_i^T x as part of the ML estimation method, and use this as an approximation of the
density of s_i. This is what we did in Chapter 9. Then, the p_i in this approximation
of the likelihood are indeed equal to the actual pdf's of b_i^T x. Thus, the equivalence
would really hold.
Conversely, to approximate mutual information, we could take a fixed approximation
of the densities of the y_i, and plug this into the definition of entropy. Denote the
logarithms of these approximating pdf's by G_i(y_i) = log p_i(y_i). Then we could
approximate (10.5) as

I(y_1, y_2, \ldots, y_n) = -\sum_i E\{G_i(y_i)\} - \log|\det \mathbf{B}| - H(\mathbf{x})    (10.9)

Now we see that this approximation is equal to the approximation of the likelihood
used in Chapter 9 (except, again, for the global sign and the additive constant given by
H(x)). This also gives an alternative method of approximating mutual information
that is different from the approximation that uses the negentropy approximations.
10.4 ALGORITHMS FOR MINIMIZATION OF MUTUAL INFORMATION
To use mutual information in practice, we need some method of estimating or ap-
proximating it from real data. Earlier, we saw two methods for approximating mutual
information. The first one was based on the negentropy approximations introduced in
Section 5.6. The second one was based on using more or less fixed approximations
for the densities of the ICs in Chapter 9.
Thus, using mutual information leads essentially to the same algorithms as used for
maximization of nongaussianity in Chapter 8, or for maximum likelihood estimation
in Chapter 9. In the case of maximization of nongaussianity, the corresponding
algorithms are those that use symmetric orthogonalization, since we are maximizing
the sum of nongaussianities, so that no order exists between the components. Thus,
we do not present any new algorithms in this chapter; the reader is referred to the two
preceding chapters.
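For reference, here is a minimal sketch of a fixed-point iteration with symmetric
orthogonalization of the kind referred to above, written for whitened data and using
the tanh nonlinearity (the derivative of the log cosh contrast). It follows the general
form of the FastICA update of Chapter 8, but the code organization, parameter values,
and stopping rule are our own.

```python
import numpy as np

def sym_decorrelation(W):
    """Symmetric orthogonalization: W <- (W W^T)^(-1/2) W."""
    d, E = np.linalg.eigh(W @ W.T)
    return E @ np.diag(d ** -0.5) @ E.T @ W

def fastica_symmetric(Z, n_iter=100, tol=1e-6, seed=0):
    """Estimate all components in parallel from whitened data Z (n x T)."""
    n, T = Z.shape
    rng = np.random.default_rng(seed)
    W = sym_decorrelation(rng.standard_normal((n, n)))
    for _ in range(n_iter):
        Y = W @ Z
        g, g_prime = np.tanh(Y), 1.0 - np.tanh(Y) ** 2
        # Fixed-point update for each row: w <- E{z g(w^T z)} - E{g'(w^T z)} w
        W_new = sym_decorrelation((g @ Z.T) / T - np.diag(g_prime.mean(axis=1)) @ W)
        if np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1.0)) < tol:
            return W_new
        W = W_new
    return W
```

Applied to whitened mixtures Z, the rows of the returned matrix give estimates of the
independent components as Y = W Z, up to sign and permutation.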
[Fig. 10.1 The convergence of FastICA for ICs with uniform distributions. The value of
mutual information (vertical axis, scaled by 10^-4) is shown as a function of the iteration count.]
10.5 EXAMPLES
Here we show the results of applying minimization of mutual information to the
two mixtures introduced in Chapter 7. We use here the whitened mixtures, and the
FastICA algorithm (which is essentially identical whichever approximation of mutual
information is used). For illustration purposes, the algorithm was always initialized
so that W was the identity matrix. The function G was chosen as G_1 in (8.26).
First, we used the data consisting of two mixtures of two subgaussian (uniformly
distributed) independent components. To demonstrate the convergence of the al-
gorithm, the mutual information of the components at each iteration step is plotted
in Fig. 10.1. This was obtained by the negentropy-based approximation. At con-
vergence, after two iterations, mutual information was practically equal to zero.
The corresponding results for two supergaussian independent components are shown
in Fig. 10.2. Convergence was obtained after three iterations, after which mutual
information was practically zero.
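A rough way to reproduce this kind of experiment is sketched below, with our own data
generation and with a histogram-based estimate of the mutual information in place of
the negentropy-based approximation used for the figures; the histogram estimate has a
small positive bias, so the values level off near zero rather than reaching it exactly.

```python
import numpy as np

def pairwise_mi(y1, y2, bins=30):
    """Histogram estimate of the mutual information of the two components."""
    pxy, xe, ye = np.histogram2d(y1, y2, bins=bins, density=True)
    dx, dy = np.diff(xe)[0], np.diff(ye)[0]
    px, py = pxy.sum(axis=1) * dy, pxy.sum(axis=0) * dx
    keep = pxy > 0
    return np.sum(pxy[keep] * np.log(pxy[keep] / np.outer(px, py)[keep]) * dx * dy)

rng = np.random.default_rng(5)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 20000))  # two uniform (subgaussian) ICs
x = np.array([[1.0, 0.8], [0.4, 1.0]]) @ s                 # two mixtures
x -= x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(x))
Z = E @ np.diag(d ** -0.5) @ E.T @ x                       # whitened mixtures

W = np.eye(2)                                              # initialize with the identity
for it in range(1, 6):
    Y = W @ Z
    g, g_prime = np.tanh(Y), 1.0 - np.tanh(Y) ** 2
    W = (g @ Z.T) / Z.shape[1] - np.diag(g_prime.mean(axis=1)) @ W
    d2, E2 = np.linalg.eigh(W @ W.T)
    W = E2 @ np.diag(d2 ** -0.5) @ E2.T @ W                # symmetric orthogonalization
    y = W @ Z
    print(f"iteration {it}: estimated mutual information = {pairwise_mi(y[0], y[1]):.5f}")
```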
10.6 CONCLUDING REMARKS AND REFERENCES
A rigorous approach to ICA that is different from the maximum likelihood approach
is given by minimization of mutual information. Mutual information is a natural
information-theoretic measure of dependence, and therefore it is natural to estimate
the independent components by minimizing the mutual information of their estimates.
Mutual information gives a rigorous justification of the principle of searching for
maximally nongaussian directions, and in the end turns out to be very similar to the
likelihood as well.
Mutual information can be approximated by the same methods by which negentropy is
approximated. Alternatively, it can be approximated in the same way as the likelihood.