
13
Practical Considerations
In the preceding chapters, we presented several approaches for the estimation of
the independent component analysis (ICA) model. In particular, several algorithms
were proposed for the estimation of the basic version of the model, which has a
square mixing matrix and no noise. Now we are, in principle, ready to apply those
algorithms on real data sets. Many such applications will be discussed in Part IV.
However, when applying the ICA algorithms to real data, some practical con-
siderations arise and need to be taken into account. In this chapter, we discuss
different problems that may arise, in particular, overlearning and noise in the data.
We also propose some preprocessing techniques (dimension reduction by principal
component analysis, time filtering) that may be useful and even necessary before the
application of the ICA algorithms in practice.
13.1 PREPROCESSING BY TIME FILTERING
The success of ICA for a given data set may depend crucially on performing some
application-dependent preprocessing steps. In the basic methods discussed in the
previous chapters, we always used centering in preprocessing, and often whitening
was done as well. Here we discuss further preprocessing methods that are not
necessary in theory, but are often very useful in practice.
Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja
Copyright © 2001 John Wiley & Sons, Inc.
ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)
13.1.1 Why time filtering is possible
In many cases, the observed random variables are, in fact, time signals or time series, which means that they describe the time course of some phenomenon or system. Thus the sample index $t$ in $x_i(t)$ is a time index. In such a case, it may be very useful to filter the signals; in other words, to take moving averages of the time series. Of course, in the ICA model no time structure is assumed, so filtering is not always possible: if the sample points cannot be ordered in any meaningful way with respect to $t$, filtering is not meaningful either.
For time series, any linear filtering of the signals is allowed, since it does not change the ICA model. In fact, if we filter the observed signals $x_i(t)$ linearly to obtain new signals, say $x_i^*(t)$, the ICA model still holds for $x_i^*(t)$, with the same mixing matrix. This can be seen as follows. Denote by $\mathbf{X}$ the matrix that contains the observations $\mathbf{x}(1), \dots, \mathbf{x}(T)$ as its columns, and similarly define $\mathbf{S}$ for the source signals. Then the ICA model can be expressed as
$$\mathbf{X} = \mathbf{A}\mathbf{S} \qquad (13.1)$$
Now, time filtering of $\mathbf{X}$ corresponds to multiplying $\mathbf{X}$ from the right by a matrix, let us call it $\mathbf{M}$. This gives
$$\mathbf{X}^* = \mathbf{X}\mathbf{M} = \mathbf{A}\mathbf{S}\mathbf{M} = \mathbf{A}\mathbf{S}^* \qquad (13.2)$$
which shows that the ICA model still remains valid. The independent components are filtered by the same filtering that was applied to the mixtures. They are not mixed with each other in $\mathbf{S}^*$ because the matrix $\mathbf{M}$ is by definition a component-wise filtering matrix.
Since the mixing matrix remains unchanged, we can use the filtered data for the ICA estimation only. After estimating the mixing matrix from the filtered data, we can apply its inverse, the separating matrix, to the original data to obtain the independent components.
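The invariance argument above is easy to check numerically. The following is an illustrative sketch (the variable names and the 3-point causal moving-average filter are our own choices, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)

n, T = 2, 500
S = rng.laplace(size=(n, T))          # source signals as rows of S
A = rng.normal(size=(n, n))           # mixing matrix
X = A @ S                             # ICA model: X = A S

# A causal 3-point moving average, written as a T x T matrix M that
# multiplies X from the right, as in Eq. (13.2): column t of M picks
# out samples t, t-1, t-2.
M = (np.eye(T) + np.eye(T, k=1) + np.eye(T, k=2)) / 3.0

X_filt = X @ M            # filtered mixtures
S_filt = S @ M            # the same filter applied to the sources

# The filtered data obey the ICA model with the *same* mixing matrix:
assert np.allclose(X_filt, A @ S_filt)
```

Since the filter acts on each row (each time series) separately, the sources are filtered but never mixed with each other, which is exactly why (13.2) holds.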
The question then arises as to what kind of filtering could be useful. In the following, we consider three different kinds of filtering: high-pass and low-pass filtering, as well as a compromise between the two.
13.1.2 Low-pass filtering
Basically, low-pass filtering means that every sample point is replaced by a weighted average of that point and the points immediately before it.¹ This is a form of smoothing the data. Then the matrix $\mathbf{M}$ in (13.2) would be something like
$$\mathbf{M} = \frac{1}{3}\begin{pmatrix} \ddots & & & & & \\ \cdots & 1 & 1 & 1 & 0 & \cdots \\ \cdots & 0 & 1 & 1 & 1 & \cdots \\ & & & & & \ddots \end{pmatrix} \qquad (13.3)$$
Low-pass filtering is often used because it tends to reduce noise. This is a well-
known property in signal processing that is explained in most basic signal processing
textbooks.
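The noise-reducing effect of a moving average is easy to verify on synthetic data. This is a generic signal-processing illustration (the window length and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

T = 10000
clean = np.sin(2 * np.pi * np.arange(T) / 200.0)   # slow, low-frequency signal
noisy = clean + rng.normal(scale=0.5, size=T)      # add white gaussian noise

# 5-point moving average (low-pass filter) implemented by convolution
kernel = np.ones(5) / 5.0
smoothed = np.convolve(noisy, kernel, mode="same")

mse_before = np.mean((noisy - clean) ** 2)
mse_after = np.mean((smoothed - clean) ** 2)
assert mse_after < mse_before   # filtering reduced the noise level
```

Averaging 5 independent noise samples cuts the noise variance roughly by a factor of 5, while a slowly varying signal is almost unchanged; this is the standard argument for low-pass filtering.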
In the basic ICA model, the effect of noise is more or less neglected; see Chapter 15
for a detailed discussion. Thus basic ICA methods work much better with data that
does not have much noise, and reducing noise is thus useful and sometimes even
necessary.
A possible problem with low-pass filtering is that it reduces the information in the
data, since the fast-changing, high-frequency features of the data are lost. It often
happens that this leads to a reduction of independence as well (see next section).
13.1.3 High-pass filtering and innovations
High-pass filtering is the opposite of low-pass filtering. The point is to remove slowly changing trends from the data; thus a low-pass filtered version is subtracted from the signal. A classic way of doing high-pass filtering is differencing, which means replacing every sample point by the difference between the value at that point and the value at the preceding point. Thus, the matrix $\mathbf{M}$ in (13.2) would be
$$\mathbf{M} = \begin{pmatrix} \ddots & & & & & \\ \cdots & -1 & 1 & 0 & 0 & \cdots \\ \cdots & 0 & -1 & 1 & 0 & \cdots \\ & & & & & \ddots \end{pmatrix} \qquad (13.4)$$
¹To have a causal filter, points after the current point may be left out of the averaging.
High-pass filtering may be useful in ICA because in certain cases it increases the independence of the components. It often happens in practice that the components have slowly changing trends or fluctuations, in which case they are not very independent. If these slow fluctuations are removed by high-pass filtering, the filtered components are often much more independent. A more principled approach to high-pass filtering is to consider it in the light of innovation processes.
Innovation processes
Given a stochastic process $\mathbf{x}(t)$, we define its innovation process $\tilde{\mathbf{x}}(t)$ as the error of the best prediction of $\mathbf{x}(t)$, given its past. Such a best prediction is given by the conditional expectation of $\mathbf{x}(t)$ given its past, because it is the expected value of the conditional distribution of $\mathbf{x}(t)$ given its past. Thus the innovation process of $\mathbf{x}(t)$ is defined by
$$\tilde{\mathbf{x}}(t) = \mathbf{x}(t) - E\{\mathbf{x}(t) \mid \mathbf{x}(t-1), \mathbf{x}(t-2), \dots\} \qquad (13.5)$$
The expression "innovation" describes the fact that $\tilde{\mathbf{x}}(t)$ contains all the new information about the process that can be obtained at time $t$ by observing $\mathbf{x}(t)$.
The concept of innovations can be utilized in the estimation of the ICA model due
to the following property:
Theorem 13.1 If $\mathbf{x}(t)$ and $\mathbf{s}(t)$ follow the basic ICA model, then the innovation processes $\tilde{\mathbf{x}}(t)$ and $\tilde{\mathbf{s}}(t)$ follow the ICA model as well. In particular, the components $\tilde{s}_i(t)$ are independent of each other.
On the other hand, independence of the innovations does not imply independence of the $s_i(t)$. Thus, the innovations are more often independent of each other than the original processes. Moreover, one could argue that the innovations are usually more nongaussian than the original processes. This is because $s_i(t)$ is a kind of moving average of the innovation process, and sums tend to be more gaussian than the original variables. Together these observations mean that the innovation process is more likely to be independent and nongaussian, and thus to fulfill the basic assumptions of ICA.
Innovation processes were discussed in more detail in [194], where it was also
shown that using innovations, it is possible to separate signals (images of faces) that
are otherwise strongly correlated and very difficult to separate.
The connection between innovations and ordinary filtering techniques is that the
computation of the innovation process is often rather similar to high-pass filtering.
Thus, the arguments in favor of using innovation processes apply at least partly in
favor of high-pass filtering.
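The connection can be illustrated with a toy experiment: a sparse (supergaussian) innovation sequence is smoothed into a slowly varying signal, and simple differencing, the crudest high-pass filter, recovers much of the spiky, nongaussian structure. This sketch (smoothing length and sample size are arbitrary) is ours, not from the chapter:

```python
import numpy as np

rng = np.random.default_rng(3)

def kurtosis(x):
    """Sample excess kurtosis; zero for a gaussian."""
    x = (x - x.mean()) / x.std()
    return np.mean(x ** 4) - 3.0

T = 20000
innovations = rng.laplace(size=T)           # supergaussian "news" sequence

# moving average of the innovations: a slowly varying, more gaussian signal
signal = np.convolve(innovations, np.ones(20) / 20.0, mode="full")[:T]

differenced = np.diff(signal)               # high-pass: x(t) - x(t-1)

# the smoothed signal is closer to gaussian (kurtosis near 0);
# differencing restores part of the supergaussianity
assert kurtosis(signal) < kurtosis(differenced)
```

Averaging 20 innovation samples shrinks the excess kurtosis by roughly a factor of 20, while the differenced signal involves only two innovation samples, so its kurtosis stays much higher, in line with the argument above.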
A possible problem with high-pass filtering, however, is that it may increase noise
for the same reasons that low-pass filtering decreases noise.
13.1.4 Optimal filtering
Both of the preceding types of filtering have their pros and cons. The optimum would
be to find a filter that increases the independence of the components while reducing
noise. To achieve this, some compromise between high- and low-pass filtering may be the best solution. This leads to band-pass filtering, in which the highest and the lowest frequencies are filtered out, leaving a suitable frequency band in between. What this band should be depends on the data, and general answers are impossible to give.
In addition to simple low-pass/high-pass filtering, one might also use more sophisticated techniques. For example, one might take the (1-D) wavelet transforms of the data [102, 290, 17]. Other time-frequency decompositions could be used as well.
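A very simple band-pass filter can be built as the difference of two moving averages: the narrow window removes the highest frequencies, and subtracting the wide-window average removes the slow trend. The window lengths below are purely illustrative:

```python
import numpy as np

def moving_average(x, width):
    return np.convolve(x, np.ones(width) / width, mode="same")

def band_pass(x, narrow=5, wide=50):
    # narrow-window average: low-pass (removes fast noise);
    # wide-window average: the slow trend, which we subtract out
    return moving_average(x, narrow) - moving_average(x, wide)

rng = np.random.default_rng(2)
t = np.arange(5000)
slow_trend = np.sin(2 * np.pi * t / 2000.0)      # very low frequency
mid_band   = np.sin(2 * np.pi * t / 40.0)        # the band we want to keep
noise      = rng.normal(scale=0.1, size=t.size)  # high-frequency noise

y = band_pass(slow_trend + mid_band + noise)

# the mid-band component should dominate the output
corr = np.corrcoef(y, mid_band)[0, 1]
assert corr > 0.9
```

Real applications would use a properly designed band-pass filter (or a wavelet decomposition as mentioned above); this sketch only shows the high-pass/low-pass compromise in its simplest form.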
13.2 PREPROCESSING BY PCA
A common preprocessing technique for multidimensional data is to reduce its dimension by principal component analysis (PCA). PCA was explained in more detail in Chapter 6. Basically, the data is projected linearly onto a lower-dimensional subspace,
$$\mathbf{z} = \mathbf{E}_n^T \mathbf{x} \qquad (13.6)$$
where the columns of $\mathbf{E}_n$ are the $n$ dominant eigenvectors of the covariance matrix, so that the maximum amount of information (in the least-squares sense) is preserved.
Reducing dimension in this way has several benefits which we discuss in the next
subsections.
13.2.1 Making the mixing matrix square
First, let us consider the case where the number of independent components, $n$, is smaller than the number of mixtures, $m$. Performing ICA on the $m$ mixtures directly can cause big problems in such a case, since the basic ICA model does not hold anymore. Using PCA we can reduce the dimension of the data to $n$. After such a reduction, the number of mixtures and ICs are equal, the mixing matrix is square, and the basic ICA model holds.
The question is whether PCA is able to find the subspace correctly, so that the ICs can be estimated from the reduced mixtures. This is not true in general, but in a special case it is. If the data consists of $n$ ICs only, with no noise added, the whole data is contained in an $n$-dimensional subspace. Using PCA for dimension reduction clearly finds this $n$-dimensional subspace, since the $n$ eigenvalues corresponding to that subspace, and only those eigenvalues, are nonzero. Thus reducing dimension with PCA works correctly. In practice, the data is usually not exactly contained in the subspace, due to noise and other factors, but if the noise level is low, PCA still finds approximately the right subspace; see Section 6.1.3. In the general case, some "weak" ICs may be lost in the dimension reduction process, but PCA may still be a good idea for optimal estimation of the "strong" ICs [313].
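The noise-free case can be checked directly: with $n$ sources mixed into $m > n$ channels, exactly $n$ eigenvalues of the data covariance are nonzero. A minimal sketch (dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(4)

n, m, T = 3, 10, 2000
S = rng.laplace(size=(n, T))          # n independent components
A = rng.normal(size=(m, n))           # tall mixing matrix: m mixtures
X = A @ S                             # noise-free mixtures

C = np.cov(X)                         # m x m covariance matrix
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]

# exactly n eigenvalues are nonzero (the rest vanish up to round-off),
# so PCA identifies the n-dimensional signal subspace
assert np.all(eigvals[:n] > 1e-6)
assert np.all(np.abs(eigvals[n:]) < 1e-6)
```

With noise added, the trailing $m - n$ eigenvalues would be small but nonzero, which is why a low noise level still lets PCA find approximately the right subspace.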
Performing first PCA and then ICA has an interesting interpretation in terms of
factor analysis. In factor analysis, it is conventional that after finding the factor
subspace, the actual basis vectors for that subspace are determined by some criteria
that make the mixing matrix as simple as possible [166]. This is called factor rotation.

Now, ICA can be interpreted as one method for determining this factor rotation, based on higher-order statistics instead of the structure of the mixing matrix.
13.2.2 Reducing noise and preventing overlearning
A well-known benefit of reducing the dimension of the data is that it reduces noise,
as was already discussed in Chapter 6. Often, the dimensions that have been omitted
consist mainly of noise. This is especially true in the case where the number of ICs
is smaller than the number of mixtures.
Another benefit of reducing dimensions is that it prevents overlearning, to which
the rest of this subsection is devoted. Overlearning means that if the number of
parameters in a statistical model is too large when compared to the number of
available data points, the estimation of the parameters becomes difficult, maybe
impossible. The estimation of the parameters is then too much determined by the
available sample points, instead of the actual process that generated the data, which
is what we are really interested in.
Overlearning in ICA [214] typically produces estimates of the ICs that have a
single spike or bump, and are practically zero everywhere else. This is because in the
space of source signals of unit variance, nongaussianity is more or less maximized
by such spike/bump signals. This becomes easily comprehensible if we consider the extreme case where the sample size $T$ equals the dimension of the data $m$, and these are both equal to the number of independent components $n$. Let us collect the realizations of $\mathbf{x}(t)$ as the columns of the matrix $\mathbf{X}$, and denote by $\mathbf{S}$ the corresponding matrix of the realizations of $\mathbf{s}(t)$, as in (13.1). Note that now all the matrices in (13.1) are square. This means that by changing the values of $\mathbf{A}$ (and keeping $\mathbf{X}$ fixed), we can give any values whatsoever to the elements of $\mathbf{S}$. This is a case of serious overlearning, not unlike the classic case of regression with equal numbers of data points and parameters.
Thus it is clear that in this case, the estimate of $\mathbf{S}$ that is obtained by ICA estimation depends little on the observed data. Let us assume that the densities of the source signals are known to be supergaussian (i.e., positively kurtotic). Then the ICA estimation basically consists of finding a separating matrix that maximizes a measure of the supergaussianities (or sparsities) of the estimates of the source signals. Intuitively, it is easy to see that sparsity is maximized when the source signals each have only one nonzero point. Thus we see that ICA estimation with an insufficient sample size leads to a form of overlearning that gives artifactual (spurious) source signals. Such source signals are characterized by large spikes.
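The core of the argument — that with a square data matrix a separating matrix can produce any output at all, including a maximally sparse spike — can be sketched in a few lines (the dimension and the choice of target are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

T = 50                                 # sample size == data dimension
X = rng.normal(size=(T, T))            # square, (almost surely) invertible data

# target "source": a single spike, the maximally sparse unit-variance signal
spike = np.zeros(T)
spike[T // 2] = np.sqrt(T)

w = spike @ np.linalg.inv(X)           # one row of a "separating" matrix
recovered = w @ X

# the estimate reproduces the spike exactly, regardless of what the data are
assert np.allclose(recovered, spike)
```

Since a kurtosis-maximizing algorithm is free to choose any such row, it will gravitate toward these spurious spikes whenever the sample size is close to the dimension, which is exactly the overlearning described above.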
An important fact shown experimentally [214] is that a similar phenomenon is
much more likely to occur if the source signals are not independently and identically
distributed (i.i.d.) in time, but have strong time-dependencies. In such cases the
sample size needed to get rid of overlearning is much larger, and the source signals
are better characterized by bumps, i.e., low-pass filtered versions of spikes. An
intuitive way of explaining this phenomenon is to consider such a signal as being constant on blocks of, say, $b$ consecutive sample points. This means that the data can be considered as really having only $T/b$ sample points; each sample point has simply been repeated $b$ times. Thus, in the case of overlearning, the estimation procedure gives "spikes" that have a width of $b$ time points, i.e., bumps.
Here we illustrate the phenomenon by separation of artificial source signals.
Three positively kurtotic signals, with 500 sample points each, were used in these
simulations, and are depicted in Fig. 13.1 a. Five hundred mixtures were produced,
and a very small amount of gaussian noise was added to each mixture separately.
As an example of a successful ICA estimation, Fig. 13.1b shows the result of applying the FastICA and maximum likelihood (ML) gradient ascent algorithms (denoted by "Bell-Sejnowski") to the mixed signals. In both approaches, the preprocessing (whitening) stage included a dimension reduction of the data to the first three principal components. It is evident that both algorithms are able to extract all the initial signals.
In contrast, when the whitening is made with very small dimension reduction (we took 400 dimensions), we see the emergence of spiky solutions (like Dirac functions), which is an extreme case of kurtosis maximization (Fig. 13.1c). The algorithm used in FastICA was of a deflationary type, from which we plot the first five components extracted. As for the ML gradient ascent, which was of a symmetric type, we show five representative solutions of the 400 extracted.
Thus, we see here that without dimension reduction, we are not able to estimate
the source signals.
Fig. 13.1 d presents an intermediate stage of dimension reduction (from the original
500 mixtures we took 50 whitened vectors). We see that the actual source signals are
revealed by both methods, even though each resulting vector is more noisy than the
ones shown in Fig. 13.1 b.
For the final example, in Fig. 13.1e, we low-pass filtered the mixed signals, prior to the independent component analysis, using a 10-delay moving average filter. Taking the same number of principal components as in (d), we can see that we lose all the original source signals: the decompositions show a bumpy structure corresponding to the low-pass filtering of the spiky outputs presented in (c). Through low-pass filtering, we have reduced the information contained in the data, and so the estimation is rendered impossible even with this not-so-weak dimension reduction. Thus, we see that with this low-pass filtered data, a much stronger dimension reduction by PCA is necessary to prevent overlearning.
In addition to PCA, some kind of prior information on the mixing matrix could be
useful in preventing overlearning. This is considered in detail in Section 20.1.3.
13.3 HOW MANY COMPONENTS SHOULD BE ESTIMATED?
Another problem that often arises in practice is to decide the number of ICs to be
estimated. This problem does not arise if one simply estimates the same number
of components as the dimension of the data. This may not always be a good idea,
however.
Fig. 13.1 (From [214]) Illustration of the importance of the degree of dimension reduction and filtering in artificially generated data, using FastICA and a gradient algorithm for ML estimation ("Bell-Sejnowski"). (a) Original positively kurtotic signals. (b) ICA decomposition in which the preprocessing includes a dimension reduction to the first 3 principal components. (c) Poor, i.e., too weak dimension reduction. (d) Decomposition using an intermediate dimension reduction (50 components retained). (e) Same results as in (d) but using low-pass filtered mixtures.
First, since dimension reduction by PCA is often necessary, one must choose the number of principal components to be retained. This is a classic problem; see Chapter 6. It is usually solved by choosing the minimum number of principal components that explain the data well enough, containing, for example, a given percentage of the variance. Often, the dimension is actually chosen by trial and error with no theoretical guidelines.
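The explained-variance rule can be sketched as follows; the 0.9 threshold and the helper name `n_components_for_variance` are our own illustrative choices:

```python
import numpy as np

def n_components_for_variance(X, fraction=0.9):
    """Smallest number of principal components whose eigenvalues
    sum to at least `fraction` of the total variance."""
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(X)))[::-1]
    cumulative = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(cumulative, fraction) + 1)

rng = np.random.default_rng(6)
# synthetic data: three strong directions plus seven weak (noise) ones
stds = np.array([10.0, 8.0, 6.0] + [0.1] * 7)
X = stds[:, None] * rng.normal(size=(10, 2000))

assert n_components_for_variance(X, 0.9) == 3
```

In practice the threshold itself is a judgment call, which is why the text notes that the dimension is often chosen by trial and error.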
Second, for computational reasons we may prefer to estimate only a smaller
number of ICs than the dimension of the data (after PCA preprocessing). This is the
case when the dimension of the data is very large, and we do not want to reduce the
dimension by PCA too much, since PCA always contains the risk of not including the
ICs in the reduced data. Using FastICA and other algorithms that allow estimation of
a smaller number of components, we can thus perform a kind of dimension reduction
by ICA. In fact, this is an idea somewhat similar to projection pursuit. Here, it is
even more difficult to give any guidelines as to how many components should be
estimated. Trial and error may be the only method applicable.
Information-theoretic, Bayesian, and other criteria for determining the number of
ICs are discussed in more detail in [231, 81, 385].

13.4 CHOICE OF ALGORITHM
Now we shall briefly discuss the choice of ICA algorithm from a practical viewpoint.
As will be discussed in detail in Chapter 14, most estimation principles and objective functions for ICA are equivalent, at least in theory. So the main choice reduces to a couple of points:
One choice is between estimating all the independent components in parallel,
or just estimating a few of them (possibly one-by-one). This corresponds to
choosing between symmetric and hierarchical decorrelation. In most cases,
symmetric decorrelation is recommended. Deflation is mainly useful in the
case where we want to estimate only a very limited number of ICs, and other
special cases. The disadvantage with deflationary orthogonalization is that the
estimation errors in the components that are estimated first accumulate and
increase the errors in the later components.
One must also choose the nonlinearity used in the algorithms. It seems that the robust, nonpolynomial nonlinearities are to be preferred in most applications. The simplest thing to do is to just use the tanh function as the nonlinearity $g$. This is sufficient when using FastICA. (When using gradient algorithms, especially in the ML framework, a second function needs to be used as well; see Chapter 9.)
Finally, there is the choice between on-line and batch algorithms. In most
cases, the whole data set is available before the estimation, which is called
in different contexts batch, block, or off-line estimation. This is the case
where FastICA can be used, and it is the algorithm that we recommend. On-line or adaptive algorithms are needed in signal-processing applications where the mixing matrix may change on-line, and fast tracking is needed.
on-line case, the recommended algorithms are those obtained by stochastic
gradient methods. It should also be noted that in some cases, the FastICA algorithm may not converge well, as Newton-type algorithms sometimes exhibit oscillatory behavior. This problem can be alleviated by using gradient methods, or combinations of the two (see [197]).
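To make the choices above concrete, here is a minimal one-unit FastICA iteration with the tanh nonlinearity, applied to whitened data. This is a bare-bones sketch, not a full implementation (no deflation, fixed iteration cap, toy data):

```python
import numpy as np

rng = np.random.default_rng(7)

# toy ICA problem: two Laplace sources, random square mixing
n, T = 2, 5000
S = rng.laplace(size=(n, T))
A = rng.normal(size=(n, n))
X = A @ S

# whiten the data: Z has (approximately) identity covariance
d, E = np.linalg.eigh(np.cov(X))
Z = E @ np.diag(d ** -0.5) @ E.T @ (X - X.mean(axis=1, keepdims=True))

# one-unit FastICA fixed-point iteration, g(u) = tanh(u)
w = rng.normal(size=n)
w /= np.linalg.norm(w)
for _ in range(100):
    wz = w @ Z
    g, g_prime = np.tanh(wz), 1.0 - np.tanh(wz) ** 2
    w_new = (Z * g).mean(axis=1) - g_prime.mean() * w   # w+ = E{z g(w'z)} - E{g'(w'z)} w
    w_new /= np.linalg.norm(w_new)
    converged = abs(abs(w_new @ w) - 1.0) < 1e-9        # w defined up to sign
    w = w_new
    if converged:
        break

# the projection w'z should recover (up to sign) one of the sources
y = w @ Z
corrs = [abs(np.corrcoef(y, s)[0, 1]) for s in S]
assert max(corrs) > 0.95
```

Note that tanh appears alone here; as the text says, the gradient/ML algorithms of Chapter 9 would need a second function in addition.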
13.5 CONCLUDING REMARKS AND REFERENCES
In this chapter, we considered some practical problems in ICA. When dealing with
time signals, low-pass filtering of the data is useful to reduce noise. On the other
hand, high-pass filtering, or computing innovation processes is useful to increase the
independence and nongaussianity of the components. One of these, or their combination, may be very useful in practice. Another very useful thing to do is to reduce the
dimension of the data by PCA. This reduces noise and prevents overlearning. It may
also solve the problems with data that has a smaller number of ICs than mixtures.
Problems
13.1 Take a Fourier transform of every observed signal $x_i(t)$. Does the ICA model still hold, and in what way?
13.2 Prove the theorem on innovations.
Computer assignments
13.1 Take a gaussian white noise sequence. Low-pass filter it by a low-pass filter with coefficients $(\dots, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, \dots)$. What does the signal look like?
13.2 High-pass filter the gaussian white noise sequence. What does the signal look
like?
13.3 Generate 100 samples of 100 independent components. Run FastICA on this data without any mixing. What do the estimated ICs look like? Is the estimate of the mixing matrix close to identity?
