14
Overview and Comparison
of Basic ICA Methods
In the preceding chapters, we introduced several different estimation principles and
algorithms for independent component analysis (ICA). In this chapter, we provide
an overview of these methods. First, we show that all these estimation principles
are intimately connected, and the main choices are between cumulant-based vs.
negentropy/likelihood-based estimation methods, and between one-unit vs. multi-
unit methods. In other words, one must choose the nonlinearity and the decorrelation
method. We discuss the choice of the nonlinearity from the viewpoint of statistical
theory. In practice, one must also choose the optimization method. We compare the
algorithms experimentally, and show that the main choice here is between on-line
(adaptive) gradient algorithms vs. fast batch fixed-point algorithms.
At the end of this chapter, we provide a short summary of the whole of Part II,
that is, of basic ICA estimation.
14.1 OBJECTIVE FUNCTIONS VS. ALGORITHMS
A distinction that has been used throughout this book is between the formulation of
the objective function, and the algorithm used to optimize it. One might express this
in the following “equation”:
ICA method = objective function + optimization algorithm.
In the case of explicitly formulated objective functions, one can use any of the
classic optimization methods, for example, (stochastic) gradient methods and Newton
Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja
Copyright © 2001 John Wiley & Sons, Inc.
ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)
methods. In some cases, however, the algorithm and the estimation principle may be
difficult to separate.
The properties of the ICA method depend on both the objective function and
the optimization algorithm. In particular:
• the statistical properties (e.g., consistency, asymptotic variance, robustness) of
  the ICA method depend on the choice of the objective function,
• the algorithmic properties (e.g., convergence speed, memory requirements,
  numerical stability) depend on the optimization algorithm.
Ideally, these two classes of properties are independent in the sense that different
optimization methods can be used to optimize a single objective function, and a
single optimization method can be used to optimize different objective functions. In
this section, we shall first treat the choice of the objective function, and then consider
optimization of the objective function.
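The separation between objective function and optimization algorithm can be illustrated with a minimal sketch. It assumes whitened two-dimensional data, takes kurtosis as the objective, and optimizes it once with a projected gradient rule and once with a fixed-point rule of the kind discussed in the chapter on nongaussianity. The mixing matrix, sample size, step size, and all function names are illustrative assumptions, not part of the text.

```python
import numpy as np

def whiten(x):
    """Center and whiten data of shape (n_signals, n_samples)."""
    x = x - x.mean(axis=1, keepdims=True)
    d, e = np.linalg.eigh(x @ x.T / x.shape[1])
    return (e / np.sqrt(d)) @ e.T @ x

def kurtosis(y):
    """Objective function: kurtosis of a zero-mean, unit-variance signal."""
    return np.mean(y ** 4) - 3.0

def gradient_ascent(z, w, step=0.1, n_iter=300):
    """Optimizer 1: gradient ascent on |kurt(w^T z)| with projection on the unit sphere."""
    for _ in range(n_iter):
        y = w @ z
        grad = (z * y ** 3).mean(axis=1)      # E{z (w^T z)^3}, constants absorbed in step
        w = w + step * np.sign(kurtosis(y)) * grad
        w /= np.linalg.norm(w)
    return w

def fixed_point(z, w, n_iter=30):
    """Optimizer 2: fixed-point iteration for the same kurtosis objective."""
    for _ in range(n_iter):
        y = w @ z
        w = (z * y ** 3).mean(axis=1) - 3.0 * w
        w /= np.linalg.norm(w)
    return w

rng = np.random.default_rng(0)
s = np.vstack([rng.laplace(size=10000), rng.uniform(-1, 1, size=10000)])
a = np.array([[1.0, 0.5], [0.3, 1.0]])        # hypothetical mixing matrix
z = whiten(a @ s)

w0 = rng.standard_normal(2)
w0 /= np.linalg.norm(w0)
w_grad = gradient_ascent(z, w0.copy())
w_fp = fixed_point(z, w0.copy())
```

Both optimizers target the same objective, so each should end at a direction that recovers one of the sources up to sign; they differ only in algorithmic properties such as the number of iterations needed and the presence of a step-size parameter.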
14.2 CONNECTIONS BETWEEN ICA ESTIMATION PRINCIPLES
Earlier, we introduced several different statistical criteria for estimation of the ICA
model, including mutual information, likelihood, nongaussianity measures, cumu-
lants, and nonlinear principal component analysis (PCA) criteria. Each of these
criteria gave an objective function whose optimization enables ICA estimation. We
have already seen that some of them are closely connected; the purpose of this section
is to recapitulate these results. In fact, almost all of these estimation principles can be
considered as different versions of the same general criterion. After this, we discuss
the differences between the principles.
14.2.1 Similarities between estimation principles
Mutual information gives a convenient starting point for showing the similarity be-
tween different estimation principles. We have for an invertible linear transformation
$\mathbf{y} = \mathbf{B}\mathbf{x}$:
$$I(y_1, y_2, \ldots, y_n) = \sum_i H(y_i) - H(\mathbf{x}) - \log|\det \mathbf{B}| \qquad (14.1)$$
If we constrain the $y_i$ to be uncorrelated and of unit variance, the last term on the
right-hand side is constant; the second term does not depend on $\mathbf{B}$ anyway (see
Chapter 10). Recall that entropy is maximized by a gaussian distribution, when
variance is kept constant (Section 5.3). Thus we see that minimization of mutual
information means maximizing the sum of the nongaussianities of the estimated
components. If these entropies (or the corresponding negentropies) are approximated
by the approximations used in Chapter 8, we obtain the same algorithms as in that
chapter.
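The step from (14.1) to the sum of nongaussianities can be written out as a short derivation. It is a sketch that assumes the definition of negentropy as $J(y) = H(\nu) - H(y)$, with $\nu$ a gaussian variable of the same (unit) variance as $y$, and writes $\mathbf{C}_x$ for the covariance matrix of $\mathbf{x}$; the symbols $\nu$, $\mathbf{C}_x$ and $C$ are notation introduced here for illustration.

```latex
% Unit-variance, uncorrelated y = Bx fixes |det B|:
%   E{y y^T} = B C_x B^T = I   =>   |det B| = (det C_x)^{-1/2}.
\begin{align*}
I(y_1,\ldots,y_n)
  &= \sum_i H(y_i) - H(\mathbf{x}) - \log|\det\mathbf{B}| \\
  &= \sum_i \bigl[ H(\nu) - J(y_i) \bigr] - H(\mathbf{x}) - \log|\det\mathbf{B}| \\
  &= C - \sum_i J(y_i),
  \qquad C = n\,H(\nu) - H(\mathbf{x}) + \tfrac{1}{2}\log\det\mathbf{C}_x .
\end{align*}
```

Since $C$ does not depend on $\mathbf{B}$ under the constraint, minimizing $I$ amounts to maximizing $\sum_i J(y_i)$.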
Alternatively, we could approximate mutual information by approximating the
densities of the estimated ICs by some parametric family, and using the obtained
log-density approximations in the definition of entropy. Thus we obtain a method
that is essentially equivalent to maximum likelihood (ML) estimation.
The connections to other estimation principles can easily be seen using likelihood.
First of all, to see the connection to nonlinear decorrelation, it is enough to compare
the natural gradient methods for ML estimation shown in (9.17) with the nonlinear
decorrelation algorithm (12.11): they are of the same form. Thus, ML estimation
gives a principled method for choosing the nonlinearities in nonlinear decorrelation.
The nonlinearities used are determined as certain functions of the probability density
functions (pdf’s) of the independent components. Mutual information does the same
thing, of course, due to the equivalency discussed earlier. Likewise, the nonlin-
ear PCA methods were shown to be essentially equivalent to ML estimation (and,
therefore, most other methods) in Section 12.7.
The connection of the preceding principles to cumulant-based criteria can be seen
by considering the approximation of negentropy by cumulants as in Eq. (5.35):
$$J(y) \approx \frac{1}{12} E\{y^3\}^2 + \frac{1}{48} \operatorname{kurt}(y)^2 \qquad (14.2)$$
where the first term could be omitted, leaving just the term containing kurtosis.
Likewise, cumulants could be used to approximate mutual information, since mutual
information is based on entropy. More explicitly, we could consider the following
approximation of mutual information:
$$I(\mathbf{y}) \approx c_1 - c_2 \sum_i \operatorname{kurt}(y_i)^2 \qquad (14.3)$$
where $c_1$ and $c_2$ are some constants. This shows clearly the connection between
cumulants and minimization of mutual information. Moreover, the tensorial methods
in Chapter 11 were seen to lead to the same fixed-point algorithm as the maximization
of nongaussianity as measured by kurtosis, which shows that they are doing very much
the same thing as the other kurtosis-based methods.
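As an illustration of the cumulant-based approximation (14.2), the following sketch evaluates it from a finite sample. It assumes the signal is standardized to zero mean and unit variance before the moments are taken; the function name and the test distributions are illustrative choices, not from the text.

```python
import numpy as np

def negentropy_cumulant(y):
    """Sample version of J(y) ~ E{y^3}^2 / 12 + kurt(y)^2 / 48 for standardized y."""
    y = np.asarray(y, dtype=float)
    y = (y - y.mean()) / y.std()              # enforce zero mean, unit variance
    skew_term = np.mean(y ** 3) ** 2 / 12.0
    kurt = np.mean(y ** 4) - 3.0              # kurtosis of a unit-variance variable
    return skew_term + kurt ** 2 / 48.0

rng = np.random.default_rng(0)
j_gauss = negentropy_cumulant(rng.standard_normal(100000))    # close to zero
j_laplace = negentropy_cumulant(rng.laplace(size=100000))     # clearly positive
```

For a symmetric density the first term is negligible, which is why it can be omitted and the kurtosis term alone kept.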
14.2.2 Differences between estimation principles
There are, however, some differences between the estimation principles as well.
1. Some principles (especially maximum nongaussianity) are able to estimate
single independent components, whereas others need to estimate all the com-
ponents at the same time.
2. Some objective functions use nonpolynomial functions based on the (assumed)
probability density functions of the independent components, whereas others
use polynomial functions related to cumulants. This leads to different non-
quadratic functions in the objective functions.
3. In many estimation principles, the estimates of the ICs are constrained to be
uncorrelated. This reduces somewhat the space in which the estimation is
performed. Considering, for example, mutual information, there is no reason
why mutual information would be exactly minimized by a decomposition that
gives uncorrelated components. Thus, this decorrelation constraint slightly
reduces the theoretical performance of the estimation methods. In practice,
this may be negligible.
4. One important difference in practice is that often in ML estimation, the densities
of the ICs are fixed in advance, using prior knowledge on the independent
components. This is possible because the pdf’s of the ICs need not be known
with any great precision: in fact, it is enough to estimate whether they are sub-
or supergaussian. Nevertheless, if the prior information on the nature of the
independent components is not correct, ML estimation will give completely
wrong results, as was shown in Chapter 9. Some care must be taken with ML
estimation, therefore. In contrast, using approximations of negentropy, this
problem does not usually arise, since the approximations we have used in this
book do not depend on reasonable approximations of the densities. Therefore,
these approximations are less problematic to use.
14.3 STATISTICALLY OPTIMAL NONLINEARITIES
Thus, from a statistical viewpoint, the choice of estimation method is more or less
reduced to the choice of the nonquadratic function $G$ that gives information on the
higher-order statistics in the form of the expectation $E\{G(\mathbf{b}_i^T\mathbf{x})\}$. In the algorithms,
this choice corresponds to the choice of the nonlinearity $g$ that is the derivative of $G$.
In this section, we analyze the statistical properties of different nonlinearities. This
is based on the family of approximations of negentropy given in (8.25). This family
includes kurtosis as well. For simplicity, we consider here the estimation of just one
IC, given by maximizing this nongaussianity measure. This is essentially equivalent
to the problem
$$\max_{E\{(\mathbf{b}^T\mathbf{x})^2\} = 1} E\{G(\mathbf{b}^T\mathbf{x})\} \qquad (14.4)$$
where the sign of $G$ depends on the estimate of the sub- or supergaussianity of $\mathbf{b}^T\mathbf{x}$.
The obtained vector is denoted by $\hat{\mathbf{b}}$. The two fundamental statistical properties of
$\hat{\mathbf{b}}$ that we analyze are asymptotic variance and robustness.
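The role of the sign in problem (14.4) can be seen in a small sketch. It assumes whitened two-dimensional data, so that the unit-variance constraint becomes a unit-norm constraint on the vector, and it takes $G(y) = \log\cosh y$ as an example; the sources, the mixing matrix, and the grid search are illustrative assumptions rather than a recommended algorithm.

```python
import numpy as np

def whiten(x):
    """Center and whiten data of shape (n_signals, n_samples)."""
    x = x - x.mean(axis=1, keepdims=True)
    d, e = np.linalg.eigh(x @ x.T / x.shape[1])
    return (e / np.sqrt(d)) @ e.T @ x

rng = np.random.default_rng(1)
n = 10000
# One supergaussian (Laplacian) and one subgaussian (uniform) source.
s = np.vstack([rng.laplace(size=n), rng.uniform(-1, 1, size=n)])
a = np.array([[2.0, 1.0], [1.0, 1.5]])          # hypothetical mixing matrix
z = whiten(a @ s)

# All unit vectors w(theta) on a half circle; w and -w give the same objective value.
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
w_grid = np.stack([np.cos(angles), np.sin(angles)], axis=1)
objective = np.log(np.cosh(w_grid @ z)).mean(axis=1)   # sample E{G(w^T z)} per direction

w_super = w_grid[np.argmin(objective)]   # sign -1: minimizing finds the supergaussian IC
w_sub = w_grid[np.argmax(objective)]     # sign +1: maximizing finds the subgaussian IC
```

With this $G$, the supergaussian component sits at the minimum of the sample expectation and the subgaussian one at the maximum, which is why the sign has to follow the estimated type of the component.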
14.3.1 Comparison of asymptotic variance *
In practice, one usually has only a finite sample of $T$ observations of the vector $\mathbf{x}$.
Therefore, the expectations in the theoretical definition of the objective function are in
fact replaced by sample averages. This results in certain errors in the estimator $\hat{\mathbf{b}}$, and
it is desired to make these errors as small as possible. A classic measure of this error
is asymptotic (co)variance, which means the limit of the covariance matrix of $\hat{\mathbf{b}}\sqrt{T}$
as $T \to \infty$. This gives an approximation of the mean-square error of $\hat{\mathbf{b}}$, as was already
discussed in Chapter 4. Comparison of, say, the traces of the asymptotic variances of
two estimators enables direct comparison of the accuracy of two estimators. One can
solve analytically for the asymptotic variance of $\hat{\mathbf{b}}$, obtaining the following theorem
[193]:
Theorem 14.1 The trace of the asymptotic variance of $\hat{\mathbf{b}}$ as defined above for the
estimation of the independent component $s_i$ equals
$$V_G = C(\mathbf{A}) \frac{E\{g^2(s_i)\} - (E\{s_i g(s_i)\})^2}{(E\{s_i g(s_i) - g'(s_i)\})^2} \qquad (14.5)$$
where $g$ is the derivative of $G$, and $C(\mathbf{A})$ is a constant that depends only on $\mathbf{A}$.
The theorem is proven in the appendix of this chapter.
Thus the comparison of the asymptotic variances of two estimators for two different
nonquadratic functions $G$ boils down to a comparison of the $V_G$. In particular, one
can use variational calculus to find a $G$ that minimizes $V_G$. Thus one obtains the
following theorem [193]:
Theorem 14.2 The trace of the asymptotic variance of $\hat{\mathbf{b}}$ is minimized when $G$ is of
the form
$$G_{\mathrm{opt}}(y) = c_1 \log p_i(y) + c_2 y^2 + c_3 \qquad (14.6)$$
where $p_i$ is the density function of $s_i$, and $c_1, c_2, c_3$ are arbitrary constants.
For simplicity, one can choose $G_{\mathrm{opt}}(y) = \log p_i(y)$. Thus, we see that the optimal
nonlinearity is in fact the one used in the definition of negentropy. This shows that
negentropy is the optimal measure of nongaussianity, at least inside those measures
that lead to estimators of the form considered here.¹ Also, one sees that the optimal
function is the same as the one obtained for several units by the maximum likelihood
approach.
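The ratio in Theorem 14.1 can be evaluated numerically for a given source density, which gives a concrete way of comparing nonlinearities. The sketch below does this for a unit-variance Laplacian source, leaving out the factor $C(\mathbf{A})$, which is common to all choices of $g$. The grid-based integration, the choice of density, and the two nonlinearities compared (the cubic one corresponding to kurtosis, and tanh) are assumptions made for illustration.

```python
import numpy as np

# Integration grid; the Laplacian tails are negligible beyond |s| = 40.
s, ds = np.linspace(-40.0, 40.0, 800001, retstep=True)
p = np.exp(-np.sqrt(2.0) * np.abs(s)) / np.sqrt(2.0)   # unit-variance Laplacian density

def expect(values):
    """Expectation under p, computed as a Riemann sum on the grid."""
    return np.sum(values * p) * ds

def v_g(g, g_prime):
    """V_G / C(A): (E{g^2} - E{s g}^2) / E{s g - g'}^2."""
    numerator = expect(g(s) ** 2) - expect(s * g(s)) ** 2
    denominator = expect(s * g(s) - g_prime(s)) ** 2
    return numerator / denominator

v_kurt = v_g(lambda y: y ** 3, lambda y: 3.0 * y ** 2)       # g from kurtosis
v_tanh = v_g(np.tanh, lambda y: 1.0 - np.tanh(y) ** 2)       # g from log cosh
```

For this density the moments are available in closed form ($E\{s^4\} = 6$, $E\{s^6\} = 90$), so `v_kurt` should agree with $(90 - 36)/(6 - 3)^2 = 6$; the value for tanh comes out smaller, in line with its slower growth being closer to that of $\log p_i$.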
14.3.2 Comparison of robustness *
Another very desirable property of an estimator is robustness against outliers. This
means that single, highly erroneous observations do not have much influence on the
estimator. In this section, we shall treat the question: How does the robustness of
the estimator $\hat{\mathbf{b}}$ depend on the choice of the function $G$? The main result is that the
function $G(y)$ should not grow fast as a function of $|y|$ if we want robust estimators.
In particular, this means that kurtosis gives nonrobust estimators, which may be very
disadvantageous in some situations.
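The effect of a single outlier on a fast-growing versus a slowly growing function can be checked directly. The sketch below replaces one observation in a gaussian sample by an outlying value and compares how much the sample kurtosis and the sample mean of $\log\cosh y$ change. The sample size, the outlier value, and the absence of any re-standardization after contamination are simplifying assumptions.

```python
import numpy as np

def kurt(y):
    """Sample kurtosis E{y^4} - 3 (E{y^2})^2 of a zero-mean signal."""
    return np.mean(y ** 4) - 3.0 * np.mean(y ** 2) ** 2

def logcosh_stat(y):
    """Sample mean of G(y) = log cosh(y), which grows only linearly in |y|."""
    return np.mean(np.log(np.cosh(y)))

rng = np.random.default_rng(0)
y_clean = rng.standard_normal(1000)
y_outlier = y_clean.copy()
y_outlier[0] = 10.0          # hypothetical single erroneous observation

shift_kurt = abs(kurt(y_outlier) - kurt(y_clean))
shift_logcosh = abs(logcosh_stat(y_outlier) - logcosh_stat(y_clean))
```

The outlier enters the fourth moment as $10^4/1000$, so the kurtosis estimate moves by several units, whereas the log cosh statistic moves by roughly $10/1000$.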
¹ One has to take into account, however, that in the definition of negentropy, the nonquadratic function is
not fixed in advance, whereas in our nongaussianity measures, $G$ is fixed. Thus, the statistical properties
of negentropy can be only approximately derived from our analysis.
First, note that the robustness of $\hat{\mathbf{b}}$ depends also on the method of estimation used
in constraining the variance of $\hat{\mathbf{b}}^T\mathbf{x}$ to equal unity, or, equivalently, the whitening
method. This is a problem independent of the choice of $G$. In the following, we
assume that this constraint is implemented in a robust way. In particular, we assume
that the data is sphered (whitened) in a robust manner, in which case the constraint
reduces to $\|\hat{\mathbf{w}}\| = 1$, where $\mathbf{w}$ is the value of $\mathbf{b}$ for whitened data. Several robust
estimators of the variance of $\hat{\mathbf{w}}^T\mathbf{z}$ or of the covariance matrix of $\mathbf{x}$ are presented in
the literature; see reference [163].
The robustness of the estimator $\hat{\mathbf{w}}$ can be analyzed using the theory of M-estimators.
Without going into technical details, the definition of an M-estimator
can be formulated as follows: an estimator is called an M-estimator if it is defined as
the solution $\hat{\boldsymbol{\theta}}$ for $\boldsymbol{\theta}$ of
$$E\{\psi(\mathbf{z}, \boldsymbol{\theta})\} = 0 \qquad (14.7)$$
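where $\mathbf{z}$ is a random vector and $\psi$ is some function defining the estimator. Now, the
point is that the estimator $\hat{\mathbf{w}}$ is an M-estimator. To see this, define $\boldsymbol{\theta} = (\mathbf{w}, \lambda)$, where
$\lambda$ is the Lagrangian multiplier associated with the constraint. Using the Lagrange
conditions, the estimator $\hat{\mathbf{w}}$ can then be formulated as the solution of Eq. (14.7) where
$\psi$ is defined as follows (for sphered data):
$$\psi(\mathbf{z}, \boldsymbol{\theta}) = \begin{pmatrix} \mathbf{z}\, g(\mathbf{w}^T\mathbf{z}) + c\lambda\mathbf{w} \\ \|\mathbf{w}\|^2 - 1 \end{pmatrix} \qquad (14.8)$$
where $c = (E_{\mathbf{z}}\{G(\hat{\mathbf{w}}^T\mathbf{z})\} - E_{\nu}\{G(\nu)\})^{-1}$ is an irrelevant constant.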
The analysis of robustness of an M-estimator is based on the concept of an
influence function, $IF(\mathbf{z}, \hat{\boldsymbol{\theta}})$. Intuitively speaking, the influence function measures
the influence of single observations on the estimator. It would be desirable to have
an influence function that is bounded as a function of $\mathbf{z}$, as this implies that even
the influence of a far-away outlier is “bounded”, and cannot change the estimate
too much. This requirement leads to one definition of robustness, which is called
B-robustness. An estimator is called B-robust, if its influence function is bounded
as a function of $\mathbf{z}$, i.e., $\sup_{\mathbf{z}} \|IF(\mathbf{z}, \hat{\boldsymbol{\theta}})\|$ is finite for every $\hat{\boldsymbol{\theta}}$. Even if the influence
function is not bounded, it should grow as slowly as possible when $\|\mathbf{z}\|$ grows, to
reduce the distorting effect of outliers.
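It can be shown that the influence function of an M-estimator equals
$$IF(\mathbf{z}, \hat{\boldsymbol{\theta}}) = \mathbf{B}\,\psi(\mathbf{z}, \hat{\boldsymbol{\theta}}) \qquad (14.9)$$
where $\mathbf{B}$ is an irrelevant invertible matrix that does not depend on $\mathbf{z}$. On the other
hand, using our definition of $\psi$, and denoting by $\gamma = \mathbf{w}^T\mathbf{z}/\|\mathbf{z}\|$ the cosine of the
angle between $\mathbf{z}$ and $\mathbf{w}$, one obtains easily
$$\|\psi(\mathbf{z}, (\mathbf{w}, \lambda))\|^2 = C_1 \frac{1}{\gamma^2} h^2(\mathbf{w}^T\mathbf{z}) + C_2 h(\mathbf{w}^T\mathbf{z}) + C_3 \qquad (14.10)$$
where $C_1, C_2, C_3$ are constants that do not depend on $\mathbf{z}$, and $h(y) = y g(y)$. Thus we
see that the robustness of $\hat{\mathbf{w}}$ essentially depends on the behavior of the function $h(u)$.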
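To make the dependence on $h(u) = u g(u)$ concrete, the sketch below tabulates $h$ for three nonlinearities at increasingly large arguments. The three choices of $g$ (cubic, tanh, and $u \exp(-u^2/2)$) and the evaluation points are illustrative assumptions; the dictionary keys are labels made up for this example.

```python
import numpy as np

def h(g, u):
    """h(u) = u g(u), the function that governs the growth of the influence function."""
    return u * g(u)

nonlinearities = {
    "cube": lambda u: u ** 3,                        # g for kurtosis, G(u) ~ u^4
    "tanh": np.tanh,                                 # g for G(u) = log cosh(u)
    "gauss": lambda u: u * np.exp(-u ** 2 / 2.0),    # g for G(u) = -exp(-u^2 / 2)
}

u = np.array([1.0, 10.0, 100.0])
growth = {name: h(g, u) for name, g in nonlinearities.items()}
```

At $u = 100$ the cubic nonlinearity gives $h = 10^8$, tanh gives about $100$, and the gaussian-type nonlinearity gives a value that is numerically zero, which orders the three choices from least to most robust in the sense discussed above.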