
Part III
EXTENSIONS AND
RELATED METHODS
Independent Component Analysis. Aapo Hyvärinen, Juha Karhunen, Erkki Oja
Copyright © 2001 John Wiley & Sons, Inc.
ISBNs: 0-471-40540-X (Hardback); 0-471-22131-7 (Electronic)
15
Noisy ICA
In real life, there is always some kind of noise present in the observations. Noise
can correspond to actual physical noise in the measuring devices, or to inaccuracies
of the model used. Therefore, it has been proposed that the independent component
analysis (ICA) model should include a noise term as well. In this chapter, we consider
different methods for estimating the ICA model when noise is present.
However, estimation of the mixing matrix seems to be quite difficult when noise
is present. It could be argued that in practice, a better approach could often be to
reduce noise in the data before performing ICA. For example, simple filtering of
time-signals is often very useful in this respect, and so is dimension reduction by
principal component analysis (PCA); see Sections 13.1.2 and 13.2.2.
In noisy ICA, we also encounter a new problem: estimation of the noise-free
realizations of the independent components (ICs). The noisy model is not invertible,
and therefore estimation of the noise-free components requires new methods. This
problem leads to some interesting forms of denoising.
15.1 DEFINITION
Here we extend the basic ICA model to the situation where noise is present. The
noise is assumed to be additive. This is a rather realistic assumption, standard in
factor analysis and signal processing, and allows for a simple formulation of the noisy
model. Thus, the noisy ICA model can be expressed as
x = As + n                                                    (15.1)
where n is the noise vector. Some further assumptions on the noise
are usually made. In particular, it is assumed that

1. The noise is independent from the independent components.
2. The noise is gaussian.

The covariance matrix of the noise, say Σ, is often assumed to be of the form σ²I, but
this may be too restrictive in some cases. In any case, the noise covariance is assumed
to be known. Little work on estimation of an unknown noise covariance has been
conducted; see [310, 215, 19].
The identifiability of the mixing matrix A in the noisy ICA model is guaranteed
under the same restrictions that are sufficient in the basic case,¹ basically meaning
independence and nongaussianity. In contrast, the realizations of the independent
components can no longer be identified, because they cannot be completely separated
from noise.

¹ This seems to be admitted by the vast majority of ICA researchers. We are not aware of any rigorous
proofs of this property, though.
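To make the definition concrete, the following minimal sketch (Python/NumPy) generates data from the noisy model (15.1) with two supergaussian sources and sensor noise of covariance σ²I. The mixing matrix, the noise level, and the Laplacian sources are arbitrary choices of this illustration, not values taken from the text.

    import numpy as np

    rng = np.random.default_rng(0)

    T = 10000                                   # number of observations
    S = rng.laplace(size=(2, T))                # two supergaussian (Laplacian) independent components
    S /= S.std(axis=1, keepdims=True)           # normalize the ICs to unit variance

    A = np.array([[1.0, 0.5],                   # mixing matrix chosen for illustration
                  [0.3, 1.0]])
    sigma = 0.3                                 # noise standard deviation; noise covariance is sigma^2 * I

    N = sigma * rng.normal(size=(2, T))         # gaussian sensor noise
    X = A @ S + N                               # noisy ICA model, Eq. (15.1): x = As + n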
15.2 SENSOR NOISE VS. SOURCE NOISE
In the typical case where the noise covariance is assumed to be of the form σ²I, the
noise in Eq. (15.1) could be considered as "sensor" noise. This is because the noise
variables n_i are separately added on each sensor, i.e., on each observed variable x_i.
This is in contrast to "source" noise, in which the noise is added to the independent
components (sources). Source noise can be modeled with an equation slightly different
from the preceding, given by

x = A(s + n)                                                  (15.2)

where again the covariance of the noise is diagonal. In fact, we could consider the
noisy independent components, given by s̃ = s + n, and rewrite the model as

x = As̃                                                        (15.3)
We see that this is just the basic ICA model, with modified independent components.
What is important is that the assumptions of the basic ICA model are still valid: the
components of s̃ are nongaussian and independent. Thus we can estimate the model
in (15.3) by any method for basic ICA. This gives us a perfectly suitable estimator
for the noisy ICA model. This way we can estimate the mixing matrix and the noisy
independent components. The estimation of the original independent components
from the noisy ones is an additional problem, though; see below.
This idea is, in fact, more general. Assume that the noise covariance Σ has the form

Σ = A Λ Aᵀ                                                    (15.4)

where Λ is a diagonal matrix.
Then the noise vector n can be transformed into another one n' = A⁻¹n, which can be
called equivalent source noise. Then the equation (15.1) becomes

x = A(s + n')                                                 (15.5)

The point is that the covariance of n' is Λ, and thus the transformed components in
s + n' are independent. Thus, we see again that the mixing matrix A can be estimated
by basic ICA methods.
To recapitulate: if the noise is added to the independent components and not to the
observed mixtures, or has a particular covariance structure, the mixing matrix can be
estimated by ordinary ICA methods. The denoising of the independent components
is another problem, though; it will be treated in Section 15.5 below.
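As a rough numerical illustration of this point, the sketch below generates data with source noise as in (15.2) and estimates the mixing matrix with an ordinary (noise-free) ICA algorithm. It assumes scikit-learn is available and uses its FastICA purely for convenience; all numerical values are arbitrary choices of this example.

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(1)
    T = 20000

    S = rng.laplace(size=(T, 2))                # nongaussian independent components
    S /= S.std(axis=0, keepdims=True)
    N = 0.2 * rng.normal(size=(T, 2))           # source noise with diagonal covariance

    A = np.array([[1.0, 0.6],
                  [0.4, 1.0]])
    X = (S + N) @ A.T                           # source-noise model, Eq. (15.2): x = A(s + n)

    ica = FastICA(n_components=2, random_state=0)
    S_noisy_est = ica.fit_transform(X)          # estimates of the noisy ICs s + n (up to order, sign, scale)
    A_est = ica.mixing_                         # estimate of the mixing matrix, with the same indeterminacies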

15.3 FEW NOISE SOURCES
Another special case that reduces to the basic ICA model can be found when the
number of noise components and independent components is not very large. In
particular, if their total number is not larger than the number of mixtures, we again
have an ordinary ICA model, in which some of the components are gaussian noise and
others are the real independent components. Such a model could still be estimated
by basic ICA methods, using one-unit algorithms with fewer units than the dimension
of the data.
In other words, we could define the vector of the independent components as
s̃ = (s_1, ..., s_k, n_1, ..., n_m)ᵀ, where the s_i are the "real" independent
components and the n_j are the noise variables. Assume that the number
of mixtures equals k + m, that is, the number of real ICs plus the number of noise
variables. In this case, the ordinary ICA model holds with x = Ãs̃, where Ã is
a matrix that incorporates the mixing of the real ICs and the covariance structure
of the noise, and the number of independent components in s̃ is equal to the
number of observed mixtures. Therefore, finding the most nongaussian directions,
we can estimate the real independent components. We cannot estimate the remaining
dummy independent components that are actually noise variables, but we did not
want to estimate them in the first place.
The applicability of this idea is quite limited, though, since in most cases we want
to assume that the noise is added on each mixture, in which case k + m, the number
of real ICs plus the number of noise variables, is necessarily larger than the number
of mixtures, and the basic ICA model does not hold for s̃.
15.4 ESTIMATION OF THE MIXING MATRIX
Not many methods for noisy ICA estimation exist in the general case. The estimation
of the noiseless model seems to be a challenging task in itself, and thus the noise is
usually neglected in order to obtain tractable and simple results. Moreover, it may
be unrealistic in many cases to assume that the data could be divided into signals and
noise in any meaningful way.
Here we treat first the problem of estimating the mixing matrix. Estimation of the
independent components will be treated below.
15.4.1 Bias removal techniques
Perhaps the most promising approach to noisy ICA is given by bias removal tech-
niques. This means that ordinary (noise-free) ICA methods are modified so that the
bias due to noise is removed, or at least reduced.
Let us denote the noise-free data in the following by

y = As                                                        (15.6)

We can now use the basic idea of finding projections, say wᵀy, in which nongaussianity
is locally maximized for whitened data, with the constraint ||w|| = 1. As shown
in Chapter 8, projections in such directions give consistent estimates of the independent
components, if the measure of nongaussianity is well chosen. This approach
could be used for noisy ICA as well, if only we had measures of nongaussianity
which are immune to gaussian noise, or at least, whose values for the original data
can be easily estimated from noisy observations. We have wᵀx = wᵀy + wᵀn,
and thus the point is to measure the nongaussianity of wᵀy from the observed wᵀx
so that the measure is not affected by the noise wᵀn.
Bias removal for kurtosis   If the measure of nongaussianity is kurtosis (the
fourth-order cumulant), it is almost trivial to construct one-unit methods for noisy
ICA, because kurtosis is immune to gaussian noise. This is because the kurtosis of
wᵀx equals the kurtosis of wᵀy, as can be easily proven by the basic properties of
kurtosis.
It must be noted, however, that in the preliminary whitening, the effect of noise
must be taken into account; this is quite simple if the noise covariance matrix is
known. Denoting by C = E{xxᵀ} the covariance matrix of the observed noisy
data, the ordinary whitening should be replaced by the operation

z = (C − Σ)^(−1/2) x                                          (15.7)

In other words, the covariance matrix of the noise-free data should be used in
whitening instead of the covariance matrix of the noisy data. In the following, we
call this operation "quasiwhitening". After this operation, the quasiwhitened data z
follows a noisy ICA model as well:

z = Bs + ñ                                                    (15.8)

where B = (C − Σ)^(−1/2) A is orthogonal, and ñ = (C − Σ)^(−1/2) n is a linear transform
of the original noise in (15.1). Thus, the theorem in Chapter 8 is valid for z, and
finding local maxima of the absolute value of kurtosis is a valid method for estimating
the independent components.
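The quasiwhitening operation (15.7) is easy to write out explicitly. The sketch below is one way to do it, assuming the noise covariance Sigma is known, that C − Σ is positive definite, and that the data array has one observation per column; it also returns the transformed noise covariance, which is needed later in (15.13)-(15.14).

    import numpy as np

    def quasiwhiten(X, Sigma):
        """Quasiwhitening as in Eq. (15.7): z = (C - Sigma)^(-1/2) x,
        where C is the covariance matrix of the noisy observations."""
        X = X - X.mean(axis=1, keepdims=True)
        C = np.cov(X)                           # covariance of the noisy data
        Cy = C - Sigma                          # covariance of the noise-free data (assumed positive definite)
        d, E = np.linalg.eigh(Cy)
        V = E @ np.diag(d ** -0.5) @ E.T        # (C - Sigma)^(-1/2)
        Z = V @ X                               # quasiwhitened data
        Sigma_tilde = V @ Sigma @ V.T           # covariance of the transformed noise
        return Z, Sigma_tilde

After this step, kurtosis-based one-unit ICA can be run on Z directly, since kurtosis is unaffected by the remaining gaussian noise.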
Bias removal for general nongaussianity measures   As was argued in
Chapter 8, it is important in many applications to use measures of nongaussianity
that have better statistical properties than kurtosis. We introduced the following
measure:

J_G(y) = [E{G(y)} − E{G(ν)}]²                                 (15.9)

where the function G is a sufficiently regular nonquadratic function, and ν is a
standardized gaussian variable.
Such a measure could be used for noisy data as well, if only we were able to
estimate E{G(wᵀy)} of the noise-free data from the noisy observations x. Denoting
by ξ a nongaussian random variable, and by ν a gaussian noise variable of variance
σ², we should be able to express the relation between E{G(ξ + ν)} and E{G(ξ)}
in simple algebraic terms. In general, this relation seems quite complicated, and can
be computed only using numerical integration.
However, it was shown in [199] that for certain choices of G, such a relation
becomes very simple. The basic idea is to choose G to be the density function of
a zero-mean gaussian random variable, or a related function. These nonpolynomial
moments are called gaussian moments.
Denote by

φ_c(ξ) = 1/(√(2π) c) exp(−ξ²/(2c²))                           (15.10)

the gaussian density function with variance c², and by φ_c^(k) the kth (k > 0)
derivative of φ_c. Denote further by φ_c^(−k) the kth integral function of φ_c,
obtained by φ_c^(−k)(ξ) = ∫₀^ξ φ_c^(−k+1)(u) du, where we define φ_c^(0) = φ_c. (The
lower integration limit 0 is here quite arbitrary, but has to be fixed.) Then we have
the following theorem [199]:
Theorem 15.1 Let ξ be any nongaussian random variable, and ν an independent
gaussian noise variable of variance σ². Define the gaussian function φ_c as in (15.10).
Then for any constant c > σ, we have

E{φ_d(ξ + ν)} = E{φ_c(ξ)}                                     (15.11)

with d = √(c² − σ²). Moreover, (15.11) still holds when φ is replaced by φ^(k) for any
integer index k.
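The identity (15.11), as reconstructed here, is easy to check numerically by Monte Carlo. The following sketch does this for a unit-variance Laplacian variable; all numerical constants are arbitrary choices of this example.

    import numpy as np

    rng = np.random.default_rng(2)
    T = 2_000_000

    def gauss_density(xi, c):
        """phi_c as in Eq. (15.10): gaussian density with variance c^2."""
        return np.exp(-xi ** 2 / (2 * c ** 2)) / (np.sqrt(2 * np.pi) * c)

    x = rng.laplace(size=T) / np.sqrt(2)        # nongaussian variable with unit variance
    sigma = 0.5
    nu = sigma * rng.normal(size=T)             # independent gaussian noise of variance sigma^2

    c = 1.0
    d = np.sqrt(c ** 2 - sigma ** 2)

    lhs = gauss_density(x + nu, d).mean()       # E{phi_d(x + nu)}, computed from the noisy variable
    rhs = gauss_density(x, c).mean()            # E{phi_c(x)}, the noise-free gaussian moment
    print(lhs, rhs)                             # the two estimates agree up to sampling error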
The theorem means that we can estimate the independent components from noisy
observations by maximizing a general contrast function of the form (15.9), where
the direct estimation of the statistics E{φ_c^(k)(wᵀy)} of the noise-free data is made
possible by using the theorem. We call the statistics of the form E{φ_c^(k)(wᵀx)}
the gaussian moments of the data. Thus, for quasiwhitened data z, we maximize the
following contrast function:

J(w) = [E{φ_d^(k)(wᵀz)} − E{φ_c^(k)(ν)}]²                      (15.12)
with d = √(c² − wᵀΣ̃w), where Σ̃ is the covariance of the noise in the quasiwhitened
data. This gives a consistent (i.e., convergent) method of estimating the noisy ICA
model, as was shown in Chapter 8.
To use these results in practice, we need to choose some values for c. In fact, c
disappears from the final algorithm, so a value for this parameter need not be chosen.
Two indices k for the gaussian moments seem to be of particular interest: k = 0 and
k = −2. The first corresponds to the gaussian density function; its use was proposed
in Chapter 8. The case k = −2 is interesting because the contrast function is then
of the form of a (negative) log-density of a supergaussian variable. In fact, φ^(−2)
can be very accurately approximated by G(ξ) = log cosh ξ, which was also used
in Chapter 8.
FastICA for noisy data   Using the unbiased measures of nongaussianity given in
this section, we can derive a variant of the FastICA algorithm [198]. Using kurtosis
or gaussian moments gives algorithms of a similar form, just like in the noise-free
case.
The algorithm takes the form [199, 198]:

w⁺ = E{z g(wᵀz)} − (I + Σ̃) w E{g′(wᵀz)}                       (15.13)

where w⁺, the new value of w, is normalized to unit norm after every iteration, and
Σ̃, the covariance of the noise in the quasiwhitened data, is given by

Σ̃ = (C − Σ)^(−1/2) Σ (C − Σ)^(−1/2)                           (15.14)

The function g is here the derivative of G, and can thus be chosen among the following:

g(u) = u³,    g(u) = tanh(u),    g(u) = u exp(−u²/2)          (15.15)

where tanh(u) is an approximation of Φ(u), which is the gaussian cumulative distribution
function (these relations hold up to some irrelevant constants). These functions cover
essentially the nonlinearities ordinarily used in the FastICA algorithm.
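A one-unit update of the form (15.13) can be sketched as follows, assuming quasiwhitened data Z and the transformed noise covariance Sigma_tilde (e.g., as returned by the quasiwhiten sketch above). The choice g = tanh, the stopping rule, and all other details are assumptions of this sketch, not a reference implementation of FastICA for noisy data.

    import numpy as np

    def noisy_fastica_one_unit(Z, Sigma_tilde, n_iter=200, tol=1e-8, seed=None):
        """One-unit, bias-removed fixed-point iteration of the form (15.13),
        using g(u) = tanh(u) and g'(u) = 1 - tanh(u)^2."""
        rng = np.random.default_rng(seed)
        n = Z.shape[0]
        I = np.eye(n)
        w = rng.normal(size=n)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            u = w @ Z                                       # projections w^T z
            gu = np.tanh(u)
            gpu = 1.0 - gu ** 2
            w_new = (Z * gu).mean(axis=1) - (I + Sigma_tilde) @ w * gpu.mean()
            w_new /= np.linalg.norm(w_new)                  # normalize to unit norm after every iteration
            if abs(abs(w_new @ w) - 1.0) < tol:             # converged (up to sign)
                w = w_new
                break
            w = w_new
        return w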
15.4.2 Higher-order cumulant methods
A different approach to estimation of the mixing matrix is given by methods using
higher-order cumulants only. Higher-order cumulants are unaffected by gaussian
noise (see Section 2.7), and therefore any such estimation method would be immune
to gaussian noise. Such methods can be found in [63, 263, 471]. The problem is,
however, that such methods often use cumulants of order 6. Higher-order cumulants
are sensitive to outliers, and therefore methods using cumulants of orders higher
than 4 are unlikely to be very useful in practice. A nice feature of this approach is,
however, that we do not need to know the noise covariance matrix.
Note that the cumulant-based methods in Part II used both second- and fourth-
order cumulants. Second-order cumulants are not immune to gaussian noise, and
therefore the cumulant-based method introduced in the previous chapters would not
be immune either. Most of the cumulant-based methods could probably be modified
to work in the noisy case, as we did in this chapter for methods maximizing the
absolute value of kurtosis.
15.4.3 Maximum likelihood methods
Another approach for estimation of the mixing matrix with noisy data is given by
maximum likelihood (ML) estimation. First, one could maximize the joint likelihood
of the mixing matrix and the realizations of the independent components, as in
[335, 195, 80]. This is given by

log L = ∑_t { −(1/2) ‖x(t) − As(t)‖²_Σ + ∑_i log pᵢ(sᵢ(t)) } + C    (15.16)

where ‖v‖²_Σ is defined as vᵀΣ⁻¹v, the s(t) are the realizations of the independent
components, and C is an irrelevant constant. The log pᵢ are the logarithms of the
probability density functions (pdf's) of the independent components. Maximization
of this joint likelihood is, however, computationally very expensive.
A more principled method would be to maximize the (marginal) likelihood of
the mixing matrix, and possibly that of the noise covariance, which was done in
[310]. This was based on the idea of approximating the densities of the independent
components as gaussian mixture densities; the application of the EM algorithm
then becomes feasible. In [42], the simpler case of discrete-valued independent
components was treated. A problem with the EM algorithm is, however, that the
computational complexity grows exponentially with the dimension of the data.
A more promising approach might be to use bias removal techniques so as to
modify existing ML algorithms to be consistent with noisy data. Actually, the bias
removal techniques given here can be interpreted as such methods; a related method
was given in [119].
Finally, let us mention a method based on the geometric interpretation of the
maximum likelihood estimator, introduced in [33], and a rather different approach
for narrow-band sources, introduced in [76].
15.5 ESTIMATION OF THE NOISE-FREE INDEPENDENT
COMPONENTS
15.5.1 Maximum a posteriori estimation

In noisy ICA, it is not enough to estimate the mixing matrix. Inverting the mixing
matrix in (15.1), we obtain

A⁻¹x = s + A⁻¹n                                               (15.17)
In other words, we only get noisy estimates of the independent components. There-
fore, we would like to obtain estimates of the original independent components
that are somehow optimal, i.e., contain minimum noise.
A simple approach to this problem would be to use the maximum a posteriori
(MAP) estimates. See Section 4.6.3 for the definition. Basically, this means that we
take the values of s(t) that have maximum probability, given the observations x(t).
Equivalently, we take as estimates those values that maximize the joint likelihood in
(15.16), so this could also be called a maximum likelihood (ML) estimator.
To compute the MAP estimator, let us take the gradient of the log-likelihood
(15.16) with respect to the s(t) and equate this to 0. Thus we obtain the equation

Aᵀ Σ⁻¹ (x(t) − A ŝ(t)) + ∇ log p(ŝ(t)) = 0                    (15.18)

where the derivative of the log-density, denoted by ∇ log p, is applied separately on
each component of the vector ŝ(t); that is, its ith component is (log pᵢ)′(ŝᵢ(t)).
In fact, this method gives a nonlinear generalization of classic Wiener filtering
presented in Section 4.6.2. An alternative approach would be to use the time structure
of the ICs (see Chapter 18) for denoising. This results in a method resembling the
Kalman filter; see [250, 249].
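In the general case, (15.18) has to be solved numerically. One straightforward possibility is gradient ascent on the log-posterior of s for a fixed observation x, sketched below under the assumption of unit-variance Laplacian priors for all components; the step size and iteration count are arbitrary choices of this sketch.

    import numpy as np

    def map_estimate(x, A, Sigma, n_iter=500, step=0.01):
        """Solve Eq. (15.18) approximately for one observation x by gradient ascent on
        sum_i log p_i(s_i) - (1/2) (x - As)^T Sigma^{-1} (x - As),
        with unit-variance Laplacian priors log p_i(s_i) = -sqrt(2)|s_i| + const."""
        Sigma_inv = np.linalg.inv(Sigma)
        s = np.linalg.lstsq(A, x, rcond=None)[0]    # initialize at the (noisy) pseudoinverse estimate
        for _ in range(n_iter):
            grad = A.T @ Sigma_inv @ (x - A @ s) - np.sqrt(2) * np.sign(s)   # gradient of the log-posterior
            s = s + step * grad
        return s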
15.5.2 Special case of shrinkage estimation
Solving for the s(t) is not easy, however. In general, we must use numerical optimization.
A simple special case is obtained if the noise covariance is assumed to be of the same
form as in (15.4) [200, 207]. This corresponds to the case of (equivalent) source
noise. Then (15.18) gives

ŝ = g(A⁻¹x)                                                   (15.19)

where the scalar component-wise function g is obtained by inverting the relation

g⁻¹(u) = u + σ² f(u)                                          (15.20)

in which σ² is the noise variance of the component in question. Thus, the MAP
estimator is obtained by inverting a certain function involving f, the score function
[395] f = −(log p)′ of the density p of s. For nongaussian variables, the score
function is nonlinear, and so is g.
In general, the inversion required in (15.20) may be impossible analytically. Here
we show three examples in which the inversion can be done easily; these examples
will be shown to have great practical value in Chapter 21.
Example 15.1 Assume that s has a Laplacian (or double exponential) distribution of
unit variance. Then p(s) = (1/√2) exp(−√2 |s|), f(s) = √2 sign(s), and g takes the
form

g(u) = sign(u) max(0, |u| − √2 σ²)                            (15.21)
(Rigorously speaking, the function in (15.20) is not invertible in this case, but
approximating it by a sequence of invertible functions, (15.21) is obtained as the limit.)
The function in (15.21) is a shrinkage function that reduces the absolute value of its
argument by a fixed amount, as depicted in Fig. 15.1. Intuitively, the utility of such a
function can be seen as follows. Since the density of a supergaussian random variable
(e.g., a Laplacian random variable) has a sharp peak at zero, it can be assumed that
small values of the noisy variable correspond to pure noise, i.e., to s = 0. Thresholding
such values to zero should thus reduce noise, and the shrinkage function can
indeed be considered a soft thresholding operator.
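The shrinkage function (15.21) is just a soft threshold, and a one-line sketch suffices; here sigma2 denotes the (source) noise variance of the component in question.

    import numpy as np

    def shrink_laplace(y, sigma2):
        """Shrinkage for a unit-variance Laplacian prior, Eq. (15.21):
        soft thresholding with threshold sqrt(2) * sigma^2."""
        return np.sign(y) * np.maximum(0.0, np.abs(y) - np.sqrt(2) * sigma2)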
Example 15.2 More generally, assume that the score function is approximated as a
linear combination of the score functions of the gaussian and the Laplacian
distributions:

f(u) = a u + b sign(u)                                        (15.22)

with a, b > 0. This corresponds to assuming the following density model for s:

p(s) = c exp(−a s²/2 − b |s|)                                 (15.23)

where c is an irrelevant scaling constant. This is depicted in Fig. 15.2. Then we
obtain

g(u) = (1/(1 + σ²a)) sign(u) max(0, |u| − b σ²)               (15.24)

This function is a shrinkage with additional scaling, as depicted in Fig. 15.1.
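Under the same reading of (15.24), the shrinkage with additional scaling can be sketched as below; the parameters a and b are those of the density model (15.23), and in practice they would be fitted to each sparse component.

    import numpy as np

    def shrink_mod_laplace(y, sigma2, a, b):
        """Shrinkage for the moderately supergaussian model (15.23), Eq. (15.24):
        soft thresholding with threshold b * sigma^2, then scaling by 1 / (1 + sigma^2 * a)."""
        return np.sign(y) * np.maximum(0.0, np.abs(y) - b * sigma2) / (1.0 + sigma2 * a)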
Example 15.3 Yet another possibility is to use the following strongly supergaussian
probability density:

p(s) = (1/(2d)) (α+2) [α(α+1)/2]^(α/2+1) / [√(α(α+1)/2) + |s/d|]^(α+3)    (15.25)

with parameters α, d > 0; see Fig. 15.2. When α → ∞, the Laplacian density is
obtained as the limit. The strong sparsity of the densities given by this model can be
seen, e.g., from the fact that the kurtosis [131, 210] of these densities is always larger
than the kurtosis of the Laplacian density, and reaches infinity for α ≤ 2. Similarly,
p(0) reaches infinity as α goes to zero. The resulting shrinkage function given by
(15.20) can be obtained after some straightforward algebraic manipulations as:

g(u) = sign(u) max(0, (|u| − ad)/2 + (1/2) √((|u| + ad)² − 4σ²(α+3)))    (15.26)

where a = √(α(α+1)/2), and g(u) is set to zero in case the square root in (15.26)
is imaginary. This is a shrinkage function that has a stronger thresholding flavor, as
depicted in Fig. 15.1.
Fig. 15.1   Plots of the shrinkage functions. The effect of the functions is to reduce the
absolute value of the argument by a certain amount which depends on the noise level. Small
arguments are set to zero. This reduces gaussian noise for sparse random variables. Solid line:
shrinkage corresponding to the Laplacian density as in (15.21). Dashed line: typical shrinkage
function obtained from (15.24). Dash-dotted line: typical shrinkage function obtained from
(15.26). For comparison, the identity line g(u) = u is given by the dotted line. All the densities
were normalized to unit variance, and the noise variance was fixed to the same value.
Fig. 15.2   Plots of densities corresponding to models (15.23) and (15.25) of the sparse
components. Solid line: Laplacian density. Dashed line: a typical moderately supergaussian
density given by (15.23). Dash-dotted line: a typical strongly supergaussian density given by
(15.25). For comparison, the gaussian density is given by the dotted line.
15.6 DENOISING BY SPARSE CODE SHRINKAGE
Although the basic purpose of noisy ICA estimation is to estimate the ICs, the model
can be used to develop an interesting denoising method as well.
Assume that we observe a noisy version,

x̃ = x + n                                                    (15.27)

of the data x, which has previously been modeled by ICA:

x = As                                                        (15.28)

To denoise x̃, we can compute estimates ŝ of the independent components by the
above MAP estimation procedure. Then we can reconstruct the data as

x̂ = Aŝ                                                       (15.29)

The point is that if the mixing matrix is orthogonal and the noise covariance is of
the form σ²I, the condition in (15.4) is fulfilled. This assumption on the noise is a
common one. Thus we could approximate the mixing matrix by an orthogonal one,
for example the one obtained by orthogonalization of the mixing matrix as in (8.48).
This method is called sparse code shrinkage [200, 207], since it means that we
transform the data into a sparse, i.e., supergaussian, code, and then apply shrinkage
on that code. To summarize, the method is as follows (a code sketch is given after
the list).

1. First, using a noise-free training set of x, estimate ICA and orthogonalize the
   mixing matrix. Denote the orthogonal mixing matrix by Wᵀ. Estimate a
   density model pᵢ for each sparse component sᵢ, using the models in (15.23)
   and (15.25).

2. Compute for each noisy observation x̃(t) the corresponding noisy sparse
   components y(t) = Wx̃(t). Apply the shrinkage nonlinearity gᵢ, as defined in
   (15.24) or in (15.26), on each component yᵢ(t), for every observation index t.
   Denote the obtained components by ŝᵢ(t).

3. Invert the transform to obtain estimates of the noise-free data, given by
   x̂(t) = Wᵀŝ(t).
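A minimal end-to-end sketch of these three steps is given below. It assumes that an orthogonal transform W has already been estimated from noise-free training data (step 1), that the noise covariance is σ²I, and that the Laplacian shrinkage (15.21) is applied to every component; a practical implementation would instead fit a density model per component as described above.

    import numpy as np

    def sparse_code_shrinkage(X_noisy, W, sigma2):
        """Denoise data by sparse code shrinkage, using the Laplacian shrinkage (15.21).

        X_noisy : array of shape (dim, T), noisy observations, one column per observation
        W       : orthogonal sparsifying transform estimated from noise-free training data (step 1)
        sigma2  : noise variance (noise covariance sigma^2 * I assumed); because W is
                  orthogonal, each sparse component carries noise of the same variance
        """
        Y = W @ X_noisy                                               # step 2: noisy sparse components
        S_hat = np.sign(Y) * np.maximum(0.0, np.abs(Y) - np.sqrt(2) * sigma2)   # shrink each component
        return W.T @ S_hat                                            # step 3: transform back to the data space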
For experiments using sparse code shrinkage on image denoising, see Chapter 21.
In that case, the method is closely related to wavelet shrinkage and “coring” methods
[116, 403].
15.7 CONCLUDING REMARKS
In this chapter, we treated the estimation of the ICA model when additive sensor
noise is present. First of all, it was shown that in some cases, the mixing matrix can
be estimated with basic ICA methods without any further complications. In cases
where this is not possible, we discussed bias removal techniques for estimation of
the mixing matrix, and introduced a bias-free version of the FastICA algorithm.
Next, we considered how to estimate the noise-free independent components, i.e.,
how to denoise the initial estimates of the independent components. In the case of
supergaussian data, it was shown that this led to so-called shrinkage estimation. In
fact, we found an interesting denoising procedure called sparse code shrinkage.
Note that in contrast to Part II where we considered the estimation of the basic
ICA model, the material in this chapter is somewhat speculative in character. The
utility of many of the methods in this chapter has not been demonstrated in practice.
We would like to warn the reader not to use the noisy ICA methods lightheartedly:
It is always advisable to first attempt to denoise the data so that basic ICA methods
can be used, as discussed in Chapter 13.
