16 ICA with Overcomplete Bases
A difficult problem in independent component analysis (ICA) is encountered if the number of mixtures is smaller than the number of independent components. This means that the mixing system is not invertible: we cannot obtain the independent components (ICs) by simply inverting the mixing matrix. Therefore, even if we knew the mixing matrix exactly, we could not recover the exact values of the independent components. This is because information is lost in the mixing process.
This situation is often called ICA with overcomplete bases. This is because we have in the ICA model
$$\mathbf{x} = \mathbf{A}\mathbf{s} = \sum_{i=1}^{n} \mathbf{a}_i s_i \qquad (16.1)$$
where the number of "basis vectors" $\mathbf{a}_i$, equal to $n$, is larger than the dimension of the space of $\mathbf{x}$: thus this basis is "too large", or overcomplete. Such a situation sometimes occurs in feature extraction of images, for example.
As with noisy ICA, we actually have two different problems. First, how to estimate
the mixing matrix, and second, how to estimate the realizations of the independent
components. This is in stark contrast to ordinary ICA, where these two problems are
solved at the same time. This problem is similar to the noisy ICA in another respect
as well: It is much more difficult than the basic ICA problem, and the estimation
methods are less developed.
16.1 ESTIMATION OF THE INDEPENDENT COMPONENTS
16.1.1 Maximum likelihood estimation
Many methods for estimating the mixing matrix use as subroutines methods that
estimate the independent components for a known mixing matrix. Therefore, we
shall first treat methods for reconstructing the independent components, assuming
that we know the mixing matrix. Let us denote by $m$ the number of mixtures and by $n$ the number of independent components. Thus, the mixing matrix $\mathbf{A}$ has size $m \times n$, with $n > m$, and therefore it is not invertible.
The simplest method of estimating the independent components would be to use the pseudoinverse of the mixing matrix. This yields
$$\hat{\mathbf{s}} = \mathbf{A}^{+}\mathbf{x} = \mathbf{A}^{T}(\mathbf{A}\mathbf{A}^{T})^{-1}\mathbf{x} \qquad (16.2)$$
In some situations, such a simple pseudoinverse gives a satisfactory solution, but in
many cases we need a more sophisticated estimate.
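For illustration, the pseudoinverse estimate (16.2) can be computed directly. The following sketch, which is not part of the original text, uses NumPy and a small hypothetical 2 × 3 mixing matrix; it only shows that the pseudoinverse reproduces the mixtures exactly while the recovered sources generally differ from the true ones.

```python
import numpy as np

# Hypothetical overcomplete mixing: m = 2 mixtures, n = 3 independent components
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.8]])          # m x n mixing matrix, n > m
s_true = np.array([0.0, 2.0, -1.0])      # true (sparse) source values
x = A @ s_true                           # observed mixtures, Eq. (16.1)

# Minimum-norm reconstruction via the pseudoinverse, Eq. (16.2)
s_pinv = np.linalg.pinv(A) @ x           # equals A.T @ inv(A @ A.T) @ x here

print(s_pinv)      # generally differs from s_true: information was lost in mixing
print(A @ s_pinv)  # but reproduces the observed mixtures exactly
```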
A more sophisticated estimator of $\mathbf{s}$ can be obtained by maximum likelihood (ML) estimation [337, 275, 195], in a manner similar to the derivation of the ML or maximum a posteriori (MAP) estimator of the noise-free independent components in Chapter 15. We can write the posterior probability of $\mathbf{s}$ as follows:
$$p(\mathbf{s} \mid \mathbf{x}, \mathbf{A}) \propto \mathbf{1}(\mathbf{x} = \mathbf{A}\mathbf{s}) \prod_{i=1}^{n} p_i(s_i) \qquad (16.3)$$
where $\mathbf{1}(\mathbf{x} = \mathbf{A}\mathbf{s})$ is an indicator function that is 1 if $\mathbf{x} = \mathbf{A}\mathbf{s}$ and 0 otherwise. The (prior) probability densities of the independent components are given by the $p_i$. Thus, we obtain the maximum likelihood estimator of $\mathbf{s}$ as
$$\hat{\mathbf{s}} = \arg\max_{\mathbf{s}:\,\mathbf{A}\mathbf{s}=\mathbf{x}} \sum_{i=1}^{n} \log p_i(s_i) \qquad (16.4)$$
Alternatively, we could assume that there is noise present as well. In this case, we
get a likelihood that is formally the same as with ordinary noisy mixtures in (15.16).
The only difference is in the number of components in the formula.
The problem with the maximum likelihood estimator is that it is not easy to compute: the solution of this optimization problem cannot be expressed in analytic form in any interesting case. It can be obtained in closed form if the $s_i$ have a gaussian distribution; in this case the optimum is given by the pseudoinverse in (16.2). However, since ICA with gaussian variables is of little interest, the pseudoinverse is not a very satisfactory solution in many cases.
In general, therefore, the estimator given by (16.4) can only be obtained by numerical optimization. A gradient ascent method can be easily derived. One case where the optimization is easier than usual is when the $s_i$ have a Laplacian distribution:
$$p_i(s_i) = \frac{1}{\sqrt{2}} \exp\left(-\sqrt{2}\,|s_i|\right) \qquad (16.5)$$
Ignoring uninteresting constants, we have
$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s}:\,\mathbf{A}\mathbf{s}=\mathbf{x}} \sum_{i=1}^{n} |s_i| \qquad (16.6)$$
which can be formulated as a linear program and solved by classic methods for linear programming [275].
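To make the linear-programming formulation concrete, here is a hedged sketch (not from the original text) using SciPy's linprog. The standard trick is to write each $s_i = u_i - v_i$ with $u_i, v_i \ge 0$, so that minimizing $\sum_i |s_i|$ subject to $\mathbf{A}\mathbf{s} = \mathbf{x}$ becomes an ordinary linear program; the matrix and data below are made up for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def laplacian_ml_estimate(A, x):
    """Solve min sum(|s_i|) subject to A s = x, cf. Eq. (16.6),
    by splitting s = u - v with u, v >= 0."""
    m, n = A.shape
    c = np.ones(2 * n)                    # objective: sum(u) + sum(v) = sum(|s|)
    A_eq = np.hstack([A, -A])             # equality constraint A u - A v = x
    res = linprog(c, A_eq=A_eq, b_eq=x, bounds=[(0, None)] * (2 * n))
    u, v = res.x[:n], res.x[n:]
    return u - v

A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.8]])
x = np.array([0.8, 1.2])
print(laplacian_ml_estimate(A, x))   # at most m = 2 components are (essentially) nonzero
```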
16.1.2 The case of supergaussian components
Using a supergaussian distribution, such as the Laplacian distribution, is well justified in feature extraction, where the components are supergaussian. Using the Laplacian density also leads to an interesting phenomenon: the ML estimator gives coefficients of which only $m$ (the number of mixtures) are nonzero. Thus, only the minimum number of components are activated, and we obtain a sparse decomposition in the sense that the components are quite often equal to zero.
It may seem at first glance that it is useless to try to estimate the ICs by ML
estimation, because they cannot be estimated exactly in any case. This is not so,
however; due to this phenomenon of sparsity, the ML estimation is very useful. In
the case where the independent components are very supergaussian, most of them
are very close to zero because of the large peak of the pdf at zero. (This is related to
the principle of sparse coding that will be treated in more detail in Section 21.2.)
Thus, the number of clearly nonzero components may be small enough that the mixing system, restricted to those components, is invertible. If we first determine which components are likely to be clearly nonzero, and then invert that part of the linear system, we may be able to get quite accurate reconstructions of the ICs. This is done implicitly in
the ML estimation method. For example, assume that there are three speech signals
mixed into two mixtures. Since speech signals are practically zero most of the time
(which is reflected in their strong supergaussianity), we could assume that only two
of the signals are nonzero at the same time, and successfully reconstruct those two
signals [272].
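The following sketch (an illustration only; the selection rule and the numbers are assumptions, not the algorithm of [272]) makes the idea explicit: judge from a rough pseudoinverse reconstruction which $m$ components are the most active, and then invert exactly that part of the mixing system.

```python
import numpy as np

def reconstruct_active_subset(A, x):
    """Assume only m of the n components are nonzero and invert that subsystem."""
    m, n = A.shape
    s_rough = np.linalg.pinv(A) @ x                 # rough estimate, Eq. (16.2)
    active = np.argsort(np.abs(s_rough))[-m:]       # the m seemingly most active components
    s_hat = np.zeros(n)
    s_hat[active] = np.linalg.solve(A[:, active], x)  # exact inversion on the active part
    return s_hat

A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.8]])
s_true = np.array([0.0, 1.5, -0.7])                 # one of the three sources is silent
x = A @ s_true
print(reconstruct_active_subset(A, x))              # recovers s_true when the right subset is picked
```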
16.2 ESTIMATION OF THE MIXING MATRIX
16.2.1 Maximizing joint likelihood
To estimate the mixing matrix, one can use maximum likelihood estimation. In the simplest case of ML estimation, we formulate the joint likelihood of $\mathbf{A}$ and the realizations of the $s_i$, and maximize it with respect to all these variables. It is slightly simpler to use a noisy version of the joint likelihood. This is of the same form as the one in Eq. (15.16):
$$\log L(\mathbf{A}, s_1, \ldots, s_n) = \sum_{t=1}^{T}\left[ -\frac{1}{2\sigma^2}\,\|\mathbf{A}\mathbf{s}(t) - \mathbf{x}(t)\|^2 + \sum_{i=1}^{n} G_i(s_i(t)) \right] + C \qquad (16.7)$$
where $\sigma^2$ is the noise variance, here assumed to be infinitely small, the $s_i(t)$ are the realizations of the independent components, and $C$ is an irrelevant constant. The functions $G_i$ are the log-densities of the independent components.
Maximization of (16.7) with respect to $\mathbf{A}$ and the $s_i(t)$ could be accomplished by a global gradient ascent with respect to all the variables [337]. Another approach to maximization of the likelihood is to use an alternating variables technique [195], in which we first compute the ML estimates of the $s_i(t)$ for a fixed $\mathbf{A}$, and then, using these new values, we compute the ML estimate of $\mathbf{A}$, and so on. The ML estimate of the $s_i(t)$ for a given $\mathbf{A}$ is given by the methods of the preceding section, considering the noise to be infinitely small. The ML estimate of $\mathbf{A}$ for given $s_i(t)$ can be computed as
$$\hat{\mathbf{A}} = \left(\sum_{t} \mathbf{x}(t)\mathbf{s}(t)^{T}\right)\left(\sum_{t} \mathbf{s}(t)\mathbf{s}(t)^{T}\right)^{-1} \qquad (16.8)$$
This algorithm needs some extra stabilization, however. For example, normalizing the estimates of the basis vectors $\mathbf{a}_i$ to unit norm is necessary. Further stabilization can be obtained by first whitening the data. Then we have (considering infinitely small noise)
$$E\{\mathbf{x}\mathbf{x}^{T}\} = \mathbf{A}\, E\{\mathbf{s}\mathbf{s}^{T}\}\, \mathbf{A}^{T} = \mathbf{A}\mathbf{A}^{T} = \mathbf{I} \qquad (16.9)$$
which means that the rows of $\mathbf{A}$ form an orthonormal system. This orthonormality could be enforced after every step of (16.8), for further stabilization.
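A minimal sketch of the alternating scheme is given below, under the simplifying assumptions made in the text (whitened data and vanishing noise). The function estimate_sources is a hypothetical hook for whichever estimator of the preceding section is used (for instance the linear-programming estimator sketched earlier); the row-orthonormalization via an SVD is one simple way of enforcing (16.9) and is an assumption, not necessarily the authors' implementation.

```python
import numpy as np

def update_mixing_matrix(X, S):
    """ML estimate of A for fixed sources, cf. Eq. (16.8):
    least-squares regression of x(t) on s(t), followed by a stabilizing
    projection onto matrices with orthonormal rows, cf. Eq. (16.9)."""
    A = (X @ S.T) @ np.linalg.pinv(S @ S.T)
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt                                   # nearest matrix with A A^T = I

def alternating_ml(X, n, estimate_sources, n_iter=50, seed=0):
    """Alternate between ML estimation of the sources and of the mixing matrix."""
    m, T = X.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))
    A /= np.linalg.norm(A, axis=0, keepdims=True)   # start from unit-norm basis vectors
    for _ in range(n_iter):
        # estimate_sources(A, x) returns a length-n source estimate for one data point
        S = np.column_stack([estimate_sources(A, X[:, t]) for t in range(T)])
        A = update_mixing_matrix(X, S)
    return A
```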
16.2.2 Maximizing likelihood approximations
Maximization of the joint likelihood is a rather crude method of estimation. From
a Bayesian viewpoint, what we really want to maximize is the marginal posterior
probability of the mixing matrix. (For basic concepts of Bayesian estimation, see
Section 4.6.)
A more sophisticated form of maximum likelihood estimation can be obtained by using a Laplace approximation of the posterior distribution of $\mathbf{A}$. This improves the stability of the algorithm, and has been successfully used for estimation of overcomplete bases from image data [274], as well as for separation of audio signals [272]. For details on the Laplace approximation, see [275]. An alternative to the Laplace approximation is provided by ensemble learning; see Section 17.5.1.
A promising direction of research is given by Monte Carlo methods. These are
a class of methods often used in Bayesian estimation, and are based on numerical
integration using stochastic algorithms. One method in this class, Gibbs sampling,
has been used in [338] for overcomplete basis estimation. Monte Carlo methods
typically give estimators with good statistical properties; the drawback is that they
are computationally very demanding.
Also, one could use an expectation-maximization (EM) algorithm [310, 19]. Using gaussian mixtures as models for the distributions of the independent components, the algorithm can be derived in analytical form. The problem is, however, that its complexity grows exponentially with the dimension of $\mathbf{s}$ (i.e., the number of independent components), and thus it can only be used
in small dimensions. Suitable approximations of the algorithm might alleviate this
limitation [19].
A very different approximation of the likelihood method was derived in [195],
in which a form of competitive neural learning was used to estimate overcomplete
bases with supergaussian data. This is a computationally efficient approximation that seems to work well for certain data sets. The idea is that the extreme case of
sparsity or supergaussianity is encountered when at most one of the ICs is nonzero
at any one time. Thus we could simply assume that only one of the components is
nonzero for a given data point, for example, the one with the highest value in the
pseudoinverse reconstruction. This is not a realistic assumption in itself, but it may
give an interesting approximation of the real situation in some cases.
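One possible reading of this winner-take-all idea is sketched below: each whitened data point is assigned to the single basis vector that responds most strongly, and only that vector is moved toward the point. The update rule and the learning rate are assumptions made for illustration; this is not the algorithm of [195].

```python
import numpy as np

def competitive_sweep(W, X, lr=0.05):
    """One sweep of winner-take-all learning over whitened data X (m x T).
    W (m x n) holds the basis vector estimates as unit-norm columns."""
    for t in range(X.shape[1]):
        x = X[:, t]
        responses = W.T @ x                        # dot-products with all basis vectors
        k = int(np.argmax(np.abs(responses)))      # assume only this component is nonzero
        W[:, k] += lr * np.sign(responses[k]) * x  # move the winning vector toward the point
        W[:, k] /= np.linalg.norm(W[:, k])         # keep unit norm
    return W
```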
16.2.3 Approximate estimation using quasiorthogonality
The maximum likelihood methods discussed in the preceding sections give a well-
justified approach to ICA estimation with overcomplete bases. The problem with
most of the methods in the preceding section is that they are computationally quite
expensive. A typical application of ICA with overcomplete bases is, however, feature
extraction. In feature extraction, we usually have spaces of very high dimensions.
Therefore, we show here a method [203] that is more heuristically justified, but has the advantage of being no more computationally expensive than methods for basic
ICA estimation. This method is based on the FastICA algorithm, combined with the
concept of quasiorthogonality.
Sparse approximately uncorrelated decompositions
Our heuristic ap-
proach is justified by the fact that in feature extraction for many kinds of natural
data, the ICA model is only a rather coarse approximation. In particular, the number
of potential “independent components” seems to be infinite: The set of such com-
ponents is closer to a continuous manifold than to a discrete set. One piece of evidence for
this is that classic ICA estimation methods give different basis vectors when started
with different initial values, and the number of components thus produced does not
seem to be limited. Any classic ICA estimation method gives a rather arbitrary col-
lection of components which are somewhat independent, and have sparse marginal
distributions.
We can also assume, for simplicity, that the data is prewhitened as a preprocessing step, as in most ICA methods in Part II. Then the independent components are simply given by the dot-products of the whitened data vector $\mathbf{z}$ with the basis vectors $\mathbf{a}_i$.
Due to the preceding considerations, we assume in our approach that what is usually needed is a collection of basis vectors that has the following two properties:
1. The dot-products $\mathbf{a}_i^{T}\mathbf{z}$ of the observed data with the basis vectors have sparse (supergaussian) marginal distributions.
2. The $\mathbf{a}_i^{T}\mathbf{z}$ should be approximately uncorrelated ("quasiuncorrelated"). Equivalently, the vectors $\mathbf{a}_i$ should be approximately orthogonal ("quasiorthogonal").
A decomposition with these two properties seems to capture the essential properties
of the decomposition obtained by estimation of the ICA model. Such decompositions
could be called sparse approximately uncorrelated decompositions.
It is clear that it is possible to find highly overcomplete basis sets that have the
first property of these two. Classic ICA estimation is usually based on maximizing
the sparseness (or, in general, nongaussianity) of the dot-products, so the existence
of several different classic ICA decompositions for a given image data set shows the
existence of decompositions with the first property.
What is not obvious, however, is that it is possible to find strongly overcomplete
decompositions such that the dot-products are approximately uncorrelated. The main
point here is that this is possible because of the phenomenon of quasiorthogonality.
Quasiorthogonality in high-dimensional spaces    Quasiorthogonality [247] is a somewhat counterintuitive phenomenon encountered in very high-dimensional spaces. In a certain sense, there is much more room for vectors in high-dimensional spaces. The point is that in an $n$-dimensional space, where $n$ is large, it is possible to have (say) $2n$ vectors that are practically orthogonal, i.e., their angles are close to 90 degrees. In fact, when $n$ grows, the angles can be made arbitrarily close to 90 degrees. This must be contrasted with small-dimensional spaces: if, for example, $n = 2$, even $2n = 4$ maximally separated vectors exhibit angles of only 45 degrees.
For example, in image decomposition, we are usually dealing with spaces whose
dimensions are of the order of 100. Therefore, we can easily find decompositions of,
say, 400 basis vectors, such that the vectors are quite orthogonal, with practically all
the angles between basis vectors staying above 80 degrees.
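The phenomenon is easy to check numerically. The sketch below (the dimensions are chosen to match the example in the text, everything else is arbitrary) draws 400 random unit vectors in a 100-dimensional space and computes the pairwise angles; even purely random vectors, with no optimization at all, are already far from the 45-degree situation of two dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_vectors = 100, 400

# Random directions already give quasiorthogonal vectors in high dimensions
V = rng.standard_normal((dim, n_vectors))
V /= np.linalg.norm(V, axis=0, keepdims=True)

# Angles (in degrees) between all distinct pairs of vectors
cosines = np.abs(V.T @ V)[np.triu_indices(n_vectors, k=1)]
angles = np.degrees(np.arccos(np.clip(cosines, 0.0, 1.0)))
print(angles.min(), np.median(angles))   # the smallest angle is typically well above 45 degrees
```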
FastICA with quasiorthogonalization
To obtain a quasiuncorrelated sparse
decomposition as defined above, we need two things. First, a method for finding
vectors that have maximally sparse dot-products, and second, a method of qua-
siorthogonalization of such vectors. Actually, most classic ICA algorithms can be
considered as maximizing the nongaussianity of the dot-products with the basis vec-
tors, provided that the data is prewhitened. (This was shown in Chapter 8.) Thus the
main problem here is constructing a proper method for quasidecorrelation.
We have developed two methods for quasidecorrelation: one of them is symmet-
ric and the other one is deflationary. This dichotomy is the same as in ordinary
decorrelation methods used in ICA. As above, it is here assumed that the data is
whitened.
A simple way of achieving quasiorthogonalization is to modify the ordinary deflation scheme based on a Gram-Schmidt-like orthogonalization. This means that we estimate the basis vectors one by one. When we have estimated $k$ basis vectors $\mathbf{w}_1, \ldots, \mathbf{w}_k$, we run the one-unit fixed-point algorithm for $\mathbf{w}_{k+1}$, and after every iteration step subtract from $\mathbf{w}_{k+1}$ a certain proportion $\alpha$ of the "projections" of the previously estimated vectors, and then renormalize $\mathbf{w}_{k+1}$:
1. $\mathbf{w}_{k+1} \leftarrow \mathbf{w}_{k+1} - \alpha \sum_{j=1}^{k} (\mathbf{w}_{k+1}^{T}\mathbf{w}_j)\,\mathbf{w}_j$
2. $\mathbf{w}_{k+1} \leftarrow \mathbf{w}_{k+1} / \|\mathbf{w}_{k+1}\|$
(16.10)
where $\alpha$ is a constant determining the force of quasiorthogonalization. If $\alpha = 1$, we have ordinary, perfect orthogonalization. We have found in our experiments that a value of $\alpha$ smaller than 1 is sufficient in spaces where the dimension is 64.
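A minimal sketch of the quasiorthogonalization step (16.10) is given below, assuming that the previously estimated vectors are stored as columns of an array; the one-unit FastICA update that it would be interleaved with is omitted, and the default value of alpha is illustrative only.

```python
import numpy as np

def quasi_deflation(w_new, W_prev, alpha=0.8):
    """Deflationary quasiorthogonalization, cf. Eq. (16.10).

    w_new  : current estimate of the next basis vector
    W_prev : previously estimated basis vectors as columns (may have 0 columns)
    alpha  : strength of quasiorthogonalization; alpha = 1 gives ordinary,
             perfect Gram-Schmidt-style orthogonalization
    """
    if W_prev.shape[1] > 0:
        w_new = w_new - alpha * W_prev @ (W_prev.T @ w_new)  # subtract the projections
    return w_new / np.linalg.norm(w_new)                     # renormalize
```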
In certain applications it may be desirable to use a symmetric version of quasi-
orthogonalization, in which no vectors are “privileged” over others [210, 197]. This
can be accomplished, for example, by the following algorithm:
1. $\mathbf{W} \leftarrow \frac{3}{2}\mathbf{W} - \frac{1}{2}\mathbf{W}\mathbf{W}^{T}\mathbf{W}$
2. Normalize each column of $\mathbf{W}$ to unit norm
(16.11)
where $\mathbf{W} = (\mathbf{w}_1, \ldots, \mathbf{w}_n)$ is the matrix containing the basis vector estimates as its columns. This update is closely related to the iterative symmetric orthogonalization method used for basic ICA in Section 8.4.3; the present algorithm simply performs one iteration of that method. In some cases, it may be necessary to do two or more iterations, although in the experiments below, just one iteration was sufficient.
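A corresponding sketch of the symmetric variant (16.11): one step of the iterative symmetric orthogonalization applied to the whole matrix of basis vectors, followed by column normalization. Again this is only an illustration of the update written above.

```python
import numpy as np

def quasi_symmetric(W):
    """Symmetric quasiorthogonalization, cf. Eq. (16.11).
    W holds the basis vector estimates as columns (m x n, possibly n > m)."""
    W = 1.5 * W - 0.5 * W @ W.T @ W      # one step of iterative symmetric orthogonalization
    return W / np.linalg.norm(W, axis=0, keepdims=True)   # normalize each column to unit norm
```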
Thus, the algorithm that we propose is similar to the FastICA algorithm as described, e.g., in Section 8.3.5, in all respects other than the orthogonalization, which is replaced by one of the preceding quasiorthogonalization methods.
Experiments with overcomplete image bases    We applied our algorithm to image windows (patches) of 8 × 8 pixels taken from natural images. Thus, we used
ICA for feature extraction as explained in detail in Chapter 21.
The mean of the image window (DC component) was removed as a preprocess-
ing step, so the dimension of the data was 63. Both deflationary and symmetric
quasiorthogonalization were used. The nonlinearity used in the FastICA algorithm
was the hyperbolic tangent. Fig. 16.1 shows an estimated approximately 4 times
overcomplete basis (with 240 components). The sample size was 14000. The results
shown here were obtained using the symmetric approach; the deflationary approach, with the parameter $\alpha$ fixed at a constant value, yielded similar results.
The results show that the estimated basis vectors are qualitatively quite similar
to those obtained by other, computationally more expensive methods [274]; they
are also similar to those obtained by basic ICA (see Chapter 21). Moreover, by
computing the dot-products between different basis vectors, we see that the basis is,
indeed, quasiorthogonal. This validates our heuristic approach.
16.2.4 Other approaches
We mention here some other algorithms for estimation of overcomplete bases. First,
in [341], independent components with binary values were considered, and a geo-
metrically motivated method was proposed. Second, a tensorial algorithm for the
overcomplete estimation problem was proposed in [63]. Related theoretical results
were derived in [58]. Third, a natural gradient approach was developed in [5].
Fig. 16.1
The basis vectors of a 4 times overcomplete basis. The dimension of the data is
63 (excluding the DC component, i.e., the local mean) and the number of basis vectors is 240.
The results are shown in the original space, i.e., the inverse of the preprocessing (whitening)
was performed. The symmetric approach was used. The basis vectors are very similar to
Gabor functions or wavelets, as is typical with image data (see Chapter 21).
Further developments on estimation of overcomplete bases using methods similar to the
preceding quasiorthogonalization algorithm can be found in [208].
16.3 CONCLUDING REMARKS
The ICA problem becomes much more complicated if there are more independent
components than observed mixtures. Basic ICA methods cannot be used as such. In
most practical applications, it may be more useful to use the basic ICA model as an
approximation of the overcomplete basis model, because the estimation of the basic
model can be performed with reliable and efficient algorithms.
When the basis is overcomplete, the formulation of the likelihood is difficult,
since the problem belongs to the class of missing data problems. Methods based
on maximum likelihood estimation are therefore computationally rather inefficient.
To obtain computationally efficient algorithms, strong approximations are necessary.
For example, one can use a modification of the FastICA algorithm that is based on
finding a quasidecorrelating sparse decomposition. This algorithm is computationally
very efficient, reducing the complexity of overcomplete basis estimation to that of
classic ICA estimation.