13
Practical Considerations
In the preceding chapters, we presented several approaches for the estimation of
the independent component analysis (ICA) model. In particular, several algorithms
were proposed for the estimation of the basic version of the model, which has a
square mixing matrix and no noise. Now we are, in principle, ready to apply those
algorithms on real data sets. Many such applications will be discussed in Part IV.
However, when applying the ICA algorithms to real data, some practical considerations arise and need to be taken into account. In this chapter, we discuss
different problems that may arise, in particular, overlearning and noise in the data.
We also propose some preprocessing techniques (dimension reduction by principal
component analysis, time filtering) that may be useful and even necessary before the
application of the ICA algorithms in practice.
13.1 PREPROCESSING BY TIME FILTERING
The success of ICA for a given data set may depend crucially on performing some
application-dependent preprocessing steps. In the basic methods discussed in the
previous chapters, we always used centering in preprocessing, and often whitening
was done as well. Here we discuss further preprocessing methods that are not
necessary in theory, but are often very useful in practice.
13.1.1 Why time filtering is possible
In many cases, the observed random variables are, in fact, time signals or time series,
which means that they describe the time course of some phenomenon or system.
Thus the sample index $t$ in $x_i(t)$ is a time index. In such a case, it may be very useful to filter the signals. In other words, this means taking moving averages of the time series. Of course, in the ICA model no time structure is assumed, so filtering is not always possible: if the sample points $\mathbf{x}(t)$ cannot be ordered in any meaningful way with respect to $t$, filtering is not meaningful, either.
For time series, any linear filtering of the signals is allowed, since it does not
change the ICA model. In fact, if we filter the observed signals $x_i(t)$ linearly to obtain new signals, say $x_i^*(t)$, the ICA model still holds for $x_i^*(t)$, with the same mixing matrix. This can be seen as follows. Denote by $\mathbf{X}$ the matrix that contains the observations $\mathbf{x}(1), \ldots, \mathbf{x}(T)$ as its columns, and similarly for $\mathbf{S}$. Then the ICA model can be expressed as
$$\mathbf{X} = \mathbf{A}\mathbf{S} \tag{13.1}$$
Now, time filtering of $\mathbf{X}$ corresponds to multiplying $\mathbf{X}$ from the right by a matrix, let us call it $\mathbf{M}$. This gives
$$\mathbf{X}^* = \mathbf{X}\mathbf{M} = \mathbf{A}\mathbf{S}\mathbf{M} = \mathbf{A}\mathbf{S}^* \tag{13.2}$$
which shows that the ICA model still remains valid: the independent components are filtered by the same filtering that was applied to the mixtures. They are not mixed with each other in $\mathbf{S}^* = \mathbf{S}\mathbf{M}$, because multiplication by $\mathbf{M}$ from the right is by definition a component-wise filtering: it operates on each row of $\mathbf{S}$ (each component's time course) separately.
Since the mixing matrix remains unchanged, we can use the filtered data in the ICA estimation method only. After estimating the mixing matrix from the filtered data, we can apply the corresponding unmixing (inverse) matrix to the original data to obtain the independent components.
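To make this concrete, here is a minimal sketch of the procedure, assuming NumPy and scikit-learn's FastICA implementation are available (the toy data, filter length, and variable names are ours, not from the text): the mixing matrix is estimated from low-pass filtered mixtures, and the resulting unmixing matrix is then applied to the original, unfiltered data.

```python
import numpy as np
from sklearn.decomposition import FastICA   # assumption: scikit-learn is installed

rng = np.random.default_rng(0)

# Toy data: two nongaussian sources mixed by a random square matrix.
T = 5000
S = np.vstack([np.sign(rng.standard_normal(T)) * rng.exponential(1.0, T),
               rng.laplace(size=T)])                     # shape (n_components, T)
A_true = rng.standard_normal((2, 2))
X = A_true @ S + 0.1 * rng.standard_normal((2, T))       # noisy mixtures, shape (n, T)

# Preprocess by low-pass filtering: 3-point moving average along the time axis.
kernel = np.ones(3) / 3.0
X_filt = np.array([np.convolve(x, kernel, mode="same") for x in X])

# Estimate the mixing/unmixing matrices from the *filtered* data ...
ica = FastICA(n_components=2, random_state=0)
ica.fit(X_filt.T)                      # scikit-learn expects (samples, features)
W = np.linalg.pinv(ica.mixing_)        # unmixing matrix estimated on filtered data

# ... and apply them to the *original* data to recover unfiltered components.
S_est = W @ (X - X.mean(axis=1, keepdims=True))
```

Any of the ICA estimation methods of the previous chapters could be used in place of FastICA here; the only point illustrated is that estimation uses the filtered data, while the components are finally computed from the original data.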
The question then arises of what kind of filtering could be useful. In the following, we consider three different kinds of filtering: low-pass and high-pass filtering, as well as a compromise between the two.
13.1.2 Low-pass filtering
Basically, low-pass filtering means that every sample point is replaced by a weighted average of that point and the points immediately before it. (To have a causal filter, points after the current point may be left out of the averaging.) This is a form of smoothing the data. Then the matrix $\mathbf{M}$ in (13.2) would be something like
$$
\mathbf{M} = \frac{1}{3}
\begin{pmatrix}
\ddots & & & & & & & & & \\
\cdots & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & \cdots \\
\cdots & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & \cdots \\
\cdots & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & \cdots \\
\cdots & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & \cdots \\
\cdots & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & \cdots \\
\cdots & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & \cdots \\
 & & & & & & & & & \ddots
\end{pmatrix}
\tag{13.3}
$$
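As a rough sketch of how a moving-average matrix like (13.3) can be formed and applied from the right as in (13.2), the following NumPy fragment may help; in practice one would use a convolution routine rather than building the full $T \times T$ matrix, and the boundary handling here is an arbitrary choice of ours.

```python
import numpy as np

def moving_average_matrix(T, width=3):
    """T x T banded matrix in the spirit of (13.3): sample t of the filtered
    signal is the average of `width` consecutive original samples ending at t.
    Boundary columns simply average whatever samples are available."""
    M = np.zeros((T, T))
    for t in range(T):
        idx = np.arange(max(0, t - width + 1), t + 1)   # causal window ending at t
        M[idx, t] = 1.0 / len(idx)
    return M

# One mixture per row, one time point per column, so filtering is X @ M.
T = 8
X = np.arange(2 * T, dtype=float).reshape(2, T)
X_filt = X @ moving_average_matrix(T)
```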
Low-pass filtering is often used because it tends to reduce noise. This is a well-
known property in signal processing that is explained in most basic signal processing
textbooks.
In the basic ICA model, the effect of noise is more or less neglected; see Chapter 15 for a detailed discussion. Basic ICA methods therefore work much better with data that does not contain much noise, so reducing noise is useful and sometimes even necessary.
A possible problem with low-pass filtering is that it reduces the information in the
data, since the fast-changing, high-frequency features of the data are lost. It often
happens that this leads to a reduction of independence as well (see next section).
13.1.3 High-pass filtering and innovations
High-pass filtering is the opposite of low-pass filtering. The point is to remove slowly
changing trends from the data. Thus a low-pass filtered version is subtracted from
the signal. A classic way of doing high-pass filtering is differencing, which means
replacing every sample point by the difference between the value at that point and
the value at the preceding point. Thus, the matrix $\mathbf{M}$ in (13.2) would be
$$
\mathbf{M} =
\begin{pmatrix}
\ddots & & & & & & & & \\
\cdots & 1 & -1 & 0 & 0 & 0 & 0 & 0 & \cdots \\
\cdots & 0 & 1 & -1 & 0 & 0 & 0 & 0 & \cdots \\
\cdots & 0 & 0 & 1 & -1 & 0 & 0 & 0 & \cdots \\
\cdots & 0 & 0 & 0 & 1 & -1 & 0 & 0 & \cdots \\
\cdots & 0 & 0 & 0 & 0 & 1 & -1 & 0 & \cdots \\
\cdots & 0 & 0 & 0 & 0 & 0 & 1 & -1 & \cdots \\
 & & & & & & & & \ddots
\end{pmatrix}
\tag{13.4}
$$
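In code, the same differencing can be done directly along the time axis instead of forming the matrix in (13.4); a minimal NumPy sketch with toy data of our own:

```python
import numpy as np

# High-pass filtering by differencing: x*(t) = x(t) - x(t-1) for each mixture.
# np.diff drops the first sample; prepending x(0) keeps the length unchanged.
X = np.cumsum(np.random.default_rng(0).standard_normal((2, 1000)), axis=1)  # slow "trends"
X_hp = np.diff(X, axis=1, prepend=X[:, :1])
```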
High-pass filtering may be useful in ICA because in certain cases it increases the independence of the components. It often happens in practice that the components have slowly changing trends or fluctuations, in which case they are not very independent. If these slow fluctuations are removed by high-pass filtering, the filtered components are often much more independent. A more principled approach to high-pass filtering is to consider it in the light of innovation processes.
Innovation processes
Given a stochastic process $s(t)$, we define its innovation process $\tilde{s}(t)$ as the error of the best prediction of $s(t)$, given its past. Such a best prediction is given by the conditional expectation of $s(t)$ given its past, because it is the expected value of the conditional distribution of $s(t)$ given its past. Thus the innovation process $\tilde{s}(t)$ is defined by
$$\tilde{s}(t) = s(t) - E\{s(t) \mid s(t-1), s(t-2), \ldots\} \tag{13.5}$$
The expression "innovation" describes the fact that $\tilde{s}(t)$ contains all the new information about the process that can be obtained at time $t$ by observing $s(t)$.
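The conditional expectation in (13.5) is generally not available in closed form. A common practical surrogate, used here purely as an illustration and not prescribed by the text, is a linear autoregressive predictor fitted by least squares; its residual then approximates the innovation process:

```python
import numpy as np

def ar_innovation(s, order=5):
    """Approximate the innovation (13.5) by the residual of a linear AR(order)
    predictor of s(t) from s(t-1), ..., s(t-order), fitted by least squares."""
    s = np.asarray(s, dtype=float)
    # Regression matrix of lagged values: column k holds s(t-k-1) for each target t.
    lags = np.column_stack([s[order - k - 1 : len(s) - k - 1] for k in range(order)])
    target = s[order:]
    coef, *_ = np.linalg.lstsq(lags, target, rcond=None)
    return target - lags @ coef        # innovation estimate, length len(s) - order
```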
The concept of innovations can be utilized in the estimation of the ICA model due
to the following property:
Theorem 13.1  If $\mathbf{x}(t)$ and $\mathbf{s}(t)$ follow the basic ICA model, then the innovation processes $\tilde{\mathbf{x}}(t)$ and $\tilde{\mathbf{s}}(t)$ follow the ICA model as well. In particular, the components $\tilde{s}_i(t)$ are independent from each other.
On the other hand, independence of the innovations does not imply independence of the $s_i(t)$. Thus, the innovations are more often independent from each other than the original processes. Moreover, one could argue that the innovations are usually more nongaussian than the original processes. This is because each $s_i(t)$ is a kind of moving average of its innovation process, and sums tend to be more gaussian than the original variables. Together, these observations mean that the innovation processes are more likely to be independent and nongaussian, and thus to fulfill the basic assumptions of ICA.
Innovation processes were discussed in more detail in [194], where it was also
shown that using innovations, it is possible to separate signals (images of faces) that
are otherwise strongly correlated and very difficult to separate.
The connection between innovations and ordinary filtering techniques is that the
computation of the innovation process is often rather similar to high-pass filtering.
Thus, the arguments in favor of using innovation processes apply at least partly in
favor of high-pass filtering.
A possible problem with high-pass filtering, however, is that it may increase noise
for the same reasons that low-pass filtering decreases noise.
13.1.4 Optimal filtering
Both of the preceding types of filtering have their pros and cons. The optimum would
be to find a filter that increases the independence of the components while reducing
PREPROCESSING BY PCA
267
noise. To achieve this, some compromise between high- and low-pass filtering may
be the best solution. This leads to band-pass filtering, in which the highest and the
lowest frequencies are filtered out, leaving a suitable frequency band in between.
What this band should be depends on the data, and no general answer can be given.
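As one possible realization of such a compromise, a zero-phase Butterworth band-pass filter from SciPy could be applied to each mixture; the sampling rate and cut-off frequencies below are placeholders of ours that would have to be chosen for the data at hand.

```python
import numpy as np
from scipy.signal import butter, filtfilt   # assumption: SciPy is available

def bandpass(X, low_hz, high_hz, fs, order=4):
    """Zero-phase band-pass filtering of each row of X (mixtures x time)."""
    b, a = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs)
    return filtfilt(b, a, X, axis=1)

# Example: keep roughly 1-40 Hz of signals sampled at 200 Hz (placeholder values).
X = np.random.default_rng(0).standard_normal((3, 2000))
X_bp = bandpass(X, low_hz=1.0, high_hz=40.0, fs=200.0)
```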
In addition to simple low-pass/high-pass filtering, one might also use more sophisticated techniques. For example, one might take the (1-D) wavelet transforms of the data [102, 290, 17]. Other time-frequency decompositions could be used as well.
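For instance, if the PyWavelets package is available, a 1-D multilevel wavelet decomposition of each mixture could be computed and crudely denoised by suppressing the finest-scale coefficients; the wavelet, level, and thresholding choice below are arbitrary assumptions of ours, not recommendations from the text.

```python
import numpy as np
import pywt   # assumption: the PyWavelets package is installed

x = np.random.default_rng(0).standard_normal(1024)      # one mixture as a toy signal
coeffs = pywt.wavedec(x, wavelet="db4", level=3)         # multilevel 1-D wavelet transform
coeffs[-1] = np.zeros_like(coeffs[-1])                   # crude denoising: drop finest details
x_den = pywt.waverec(coeffs, wavelet="db4")              # back to the time domain
```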
13.2 PREPROCESSING BY PCA
A common preprocessing technique for multidimensional data is to reduce its dimension by principal component analysis (PCA). PCA was explained in more detail in
Chapter 6. Basically, the data is projected linearly onto a subspace
$$\tilde{\mathbf{x}} = \mathbf{E}_n \mathbf{x} \tag{13.6}$$
so that the maximum amount of information (in the least-squares sense) is preserved.
Reducing dimension in this way has several benefits which we discuss in the next
subsections.
13.2.1 Making the mixing matrix square
First, let us consider the case where the number of independent components $n$ is smaller than the number of mixtures, say $m$. Performing ICA directly on the mixtures can cause big problems in such a case, since the basic ICA model does not hold anymore. Using PCA, we can reduce the dimension of the data to $n$. After such a reduction, the numbers of mixtures and ICs are equal, the mixing matrix is square, and the basic ICA model holds.
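A minimal NumPy sketch of this dimension reduction, under the assumption that the number of ICs $n$ is known (the toy dimensions and data are ours): the $m$-dimensional mixtures are projected onto the $n$ dominant principal components, after which the effective mixing matrix is square.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 3, 10, 5000                       # 3 ICs observed through 10 mixtures

S = rng.laplace(size=(n, T))                # nongaussian sources
A = rng.standard_normal((m, n))             # tall (m x n) mixing matrix
X = A @ S                                   # m-dimensional data lying in an n-dim subspace

Xc = X - X.mean(axis=1, keepdims=True)
eigval, eigvec = np.linalg.eigh(np.cov(Xc))         # PCA via the covariance matrix
# In this noise-free example only n eigenvalues are (essentially) nonzero.
E_n = eigvec[:, np.argsort(eigval)[::-1][:n]].T     # n dominant eigenvectors as rows

X_red = E_n @ Xc    # n-dimensional data; the effective mixing matrix E_n A is square
```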
The question is whether PCA is able to find the subspace correctly, so that the $n$ ICs can be estimated from the reduced mixtures. This does not hold in general, but it does in an important special case. If the data consists of $n$ ICs only, with no noise added, the whole data is contained in an $n$-dimensional subspace. Using PCA for dimension reduction clearly finds this $n$-dimensional subspace, since the eigenvalues corresponding to that subspace, and only those eigenvalues, are nonzero. Thus reducing the dimension with PCA works correctly. In practice, the data is usually not exactly contained in the subspace, due to noise and other factors, but if the noise level is low, PCA still finds approximately the right subspace; see Section 6.1.3. In the general case, some "weak" ICs may be lost in the dimension reduction, but PCA may still be a good idea for optimal estimation of the "strong" ICs [313].
Performing first PCA and then ICA has an interesting interpretation in terms of
factor analysis. In factor analysis, it is conventional that after finding the factor
subspace, the actual basis vectors for that subspace are determined by some criteria