18 Methods using Time Structure
The model of independent component analysis (ICA) that we have considered so
far consists of mixing independent random variables, usually linearly. In many
applications, however, what is mixed is not random variables but time signals, or
time series. This is in contrast to the basic ICA model in which the samples of $x$ have no particular order: We could shuffle them in any way we like, and this would
have no effect on the validity of the model, nor on the estimation methods we have
discussed. If the independent components (ICs) are time signals, the situation is quite
different.
In fact, if the ICs are time signals, they may contain much more structure than simple random variables. For example, the autocovariances (covariances over different
time lags) of the ICs are then well-defined statistics. One can then use such additional
statistics to improve the estimation of the model. This additional information can
actually make the estimation of the model possible in cases where the basic ICA
methods cannot estimate it, for example, if the ICs are gaussian but correlated over
time.
In this chapter, we consider the estimation of the ICA model when the ICs are time signals $s_i(t)$, $t = 1, \ldots, T$, where $t$ is the time index. In the previous chapters, we denoted by $t$ the sample index, but here $t$ has a more precise meaning, since it defines an order for the samples of the ICs. The model is then expressed by

$$ x(t) = A s(t) \qquad (18.1) $$

where $A$ is assumed to be square as usual, and the ICs are of course independent. In contrast, the ICs need not be nongaussian.
In the following, we shall make some assumptions on the time structure of the ICs
that allow for the estimation of the model. These assumptions are alternatives to the
assumption of nongaussianity made in other chapters of this book. First, we shall
assume that the ICs have different autocovariances (in particular, they are all different
from zero). Second, we shall consider the case where the variances of the ICs are
nonstationary. Finally, we discuss Kolmogoroff complexity as a general framework
for ICA with time-correlated mixtures.
We do not here consider the case where it is the mixing matrix that changes in
time; see [354].
18.1 SEPARATION BY AUTOCOVARIANCES
18.1.1 Autocovariances as an alternative to nongaussianity
The simplest form of time structure is given by (linear) autocovariances. This means covariances between the values of the signal at different time points: $\mathrm{cov}(x_i(t), x_i(t-\tau))$, where $\tau$ is some lag constant, $\tau = 1, 2, 3, \ldots$ If the data has time-dependencies, the autocovariances are often different from zero.
In addition to the autocovariances of one signal, we also need covariances between two signals: $\mathrm{cov}(x_i(t), x_j(t-\tau))$, where $i \neq j$. All these statistics for a given time lag can be grouped together in the time-lagged covariance matrix

$$ C^x_\tau = E\{ x(t)\, x(t-\tau)^T \} \qquad (18.2) $$
The theory of time-dependent signals was briefly discussed in Section 2.8.
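As a concrete illustration of (18.2), here is a minimal NumPy sketch of how the time-lagged covariance matrix might be estimated from a finite, zero-mean sample. The function name lagged_cov and the averaging over the $T-\tau$ available sample pairs are conventions of this sketch, not of the book.

```python
import numpy as np

def lagged_cov(x, lag):
    """Sample estimate of C_tau = E{x(t) x(t-tau)^T}.

    x   : array of shape (n_signals, n_samples), assumed zero-mean
    lag : the time lag tau (a nonnegative integer)
    """
    T = x.shape[1]
    # Pair each sample x(t) with the sample lag steps earlier and
    # average the outer products over the available time points.
    return x[:, lag:] @ x[:, :T - lag].T / (T - lag)

# Hypothetical test data: two time-structured sources mixed linearly.
rng = np.random.default_rng(0)
t = np.arange(1000)
s = np.vstack([np.sin(0.05 * t), np.sign(np.sin(0.11 * t))])
x = rng.standard_normal((2, 2)) @ s
x = x - x.mean(axis=1, keepdims=True)

C1 = lagged_cov(x, 1)        # time-lagged covariance matrix for tau = 1
```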
As we saw in Chapter 7, the problem in ICA is that the simple zero-lagged covariance (or correlation) matrix $C$ does not contain enough parameters to allow the estimation of $A$. This means that simply finding a matrix $V$ so that the components of the vector

$$ z(t) = V x(t) \qquad (18.3) $$

are white is not enough to estimate the independent components. This is because there is an infinity of different matrices $V$ that give decorrelated components. This is why in basic ICA, we have to use the nongaussian structure of the independent components, for example, by minimizing the higher-order dependencies as measured by mutual information.
The key point here is that the information in a time-lagged covariance matrix $C^x_\tau$ could be used instead of the higher-order information [424, 303]. What we do is to find a matrix $B$ so that, in addition to making the instantaneous covariances of $y(t) = B x(t)$ go to zero, the lagged covariances are made zero as well:

$$ E\{ y_i(t)\, y_j(t-\tau) \} = 0 \quad \text{for all } i \neq j,\ \tau \qquad (18.4) $$

The motivation for this is that for the ICs $s_i(t)$, the lagged covariances are all zero due to independence. Using these lagged covariances, we get enough extra information to estimate the model, under certain conditions specified below. No higher-order information is then needed.
18.1.2 Using one time lag
In the simplest case, we can use just one time lag. Denote by $\tau$ such a time lag, which is very often taken equal to 1. A very simple algorithm can now be formulated to find a matrix that cancels both the instantaneous covariances and the ones corresponding to lag $\tau$.

Consider whitened data (see Chapter 6), denoted by $z$. Then we have for the orthogonal separating matrix $W$:

$$ W z(t) = s(t) \qquad (18.5) $$
$$ W z(t-\tau) = s(t-\tau) \qquad (18.6) $$

Let us consider a slightly modified version of the lagged covariance matrix as defined in (18.2), given by

$$ \bar{C}^z_\tau = \frac{1}{2}\left[ C^z_\tau + (C^z_\tau)^T \right] \qquad (18.7) $$
We have by linearity and orthogonality the relation

$$ \bar{C}^z_\tau = \frac{1}{2} W^T \left[ E\{ s(t)\, s(t-\tau)^T \} + E\{ s(t-\tau)\, s(t)^T \} \right] W = W^T \bar{C}^s_\tau W \qquad (18.8) $$

Due to the independence of the $s_i(t)$, the time-lagged covariance matrix $C^s_\tau = E\{ s(t)\, s(t-\tau)^T \}$ is diagonal; let us denote it by $D$. Clearly, the matrix $\bar{C}^s_\tau$ equals this same matrix. Thus we have

$$ \bar{C}^z_\tau = W^T D W \qquad (18.9) $$
What this equation shows is that the matrix $W$ is part of the eigenvalue decomposition of $\bar{C}^z_\tau$. The eigenvalue decomposition of this symmetric matrix is simple to compute. In fact, the reason why we considered this matrix instead of the simple time-lagged covariance matrix (as in [303]) was precisely that we wanted to have a symmetric matrix, because then the eigenvalue decomposition is well defined and simple to compute. (It is actually true that the lagged covariance matrix is symmetric if the data exactly follows the ICA model, but estimates of such matrices are not symmetric.)
The AMUSE algorithm   Thus we have a simple algorithm, called AMUSE [424], for estimating the separating matrix $W$ for whitened data:

1. Whiten the (zero-mean) data $x$ to obtain $z(t)$.

2. Compute the eigenvalue decomposition of $\bar{C}^z_\tau = \frac{1}{2}[C_\tau + C_\tau^T]$, where $C_\tau = E\{ z(t)\, z(t-\tau)^T \}$ is the time-lagged covariance matrix, for some lag $\tau$.

3. The rows of the separating matrix $W$ are given by the eigenvectors.

An essentially similar algorithm was proposed in [303].
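To make these steps concrete, here is a minimal NumPy sketch of the AMUSE procedure as just described. The function name amuse, the whitening by an eigendecomposition of the sample covariance, and the default lag of 1 are illustrative choices of this sketch rather than prescriptions from the text.

```python
import numpy as np

def amuse(x, lag=1):
    """Estimate the sources from mixtures x (shape: n_signals x n_samples)
    using a single time lag, following the AMUSE steps above."""
    x = x - x.mean(axis=1, keepdims=True)

    # Step 1: whiten the data, z = V x, so that E{z z^T} = I.
    d, E = np.linalg.eigh(np.cov(x))
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T
    z = V @ x

    # Step 2: symmetrized lagged covariance of the whitened data, eq. (18.7).
    T = z.shape[1]
    C = z[:, lag:] @ z[:, :T - lag].T / (T - lag)
    C_bar = 0.5 * (C + C.T)

    # Step 3: the eigenvectors of C_bar give the rows of the separating matrix.
    _, E_lag = np.linalg.eigh(C_bar)
    W = E_lag.T

    # Estimated ICs (up to order, sign and scale) and the total unmixing matrix.
    return W @ z, W @ V

# Note: this works only if the lag-`lag` autocovariances of the true
# sources are distinct, as discussed in the text below.
```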
This algorithm is very simple and fast to compute. The problem is, however, that it only works when the eigenvectors of the matrix $\bar{C}_\tau$ are uniquely defined. This is the case if the eigenvalues are all distinct (not equal to each other). If some of the eigenvalues are equal, then the corresponding eigenvectors are not uniquely defined, and the corresponding ICs cannot be estimated. This restricts the applicability of this method considerably. These eigenvalues are given by $\mathrm{cov}(s_i(t), s_i(t-\tau))$, and thus the eigenvalues are distinct if and only if the lagged covariances are different for all the ICs.
As a remedy to this restriction, one can search for a suitable time lag $\tau$ so that the eigenvalues are distinct, but this is not always possible: If the signals $s_i(t)$ have identical power spectra, that is, identical autocovariances, then no value of $\tau$ makes estimation possible.
18.1.3 Extension to several time lags
An extension of the AMUSE method that improves its performance is to consider several time lags $\tau$ instead of a single one. Then, it is enough that the covariances for one of these time lags are different. Thus the choice of $\tau$ is a somewhat less serious problem.
In principle, using several time lags, we want to simultaneously diagonalize all the
corresponding lagged covariance matrices. It must be noted that the diagonalization
is not possible exactly, since the eigenvectors of the different covariance matrices
are unlikely to be identical, except in the theoretical case where the data is exactly
generated by the ICA model. So here we formulate functions that express the degree
of diagonalization obtained and find their maxima.
One simple way of measuring the diagonality of a matrix $M$ is to use the operator

$$ \mathrm{off}(M) = \sum_{i \neq j} m_{ij}^2 \qquad (18.10) $$
which gives the sum of squares of the off-diagonal elements of $M$. What we now want to do is to minimize the sum of the squared off-diagonal elements of several lagged covariance matrices of $y = Wz$. As before, we use the symmetric version $\bar{C}^y_\tau$ of the lagged covariance matrix. Denote by $S$ the set of the chosen lags $\tau$. Then we can write this as an objective function $J_1(W)$:
$$ J_1(W) = \sum_{\tau \in S} \mathrm{off}\left( W \bar{C}^z_\tau W^T \right) \qquad (18.11) $$
Minimizing $J_1$ under the constraint that $W$ is orthogonal gives us the estimation method. This minimization could be performed by (projected) gradient descent. Another alternative is to adapt the existing methods for eigenvalue decomposition to this simultaneous approximate diagonalization of several matrices. The algorithm called SOBI (second-order blind identification) [43] is based on these principles, and so is TDSEP [481].
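As an illustration of the projected gradient option mentioned above (SOBI and TDSEP themselves rely on different, Jacobi-rotation-style joint diagonalization routines), here is a NumPy sketch that minimizes $J_1$ over orthogonal matrices. The helper names, the gradient expression, and the fixed step size are my own constructions, and $z$ is assumed to be already whitened data.

```python
import numpy as np

def sym_lagged_covs(z, lags):
    """Symmetrized lagged covariance matrices, eq. (18.7), of whitened data z."""
    T = z.shape[1]
    covs = []
    for lag in lags:
        C = z[:, lag:] @ z[:, :T - lag].T / (T - lag)
        covs.append(0.5 * (C + C.T))
    return covs

def off(M):
    """Sum of squared off-diagonal elements, eq. (18.10)."""
    return np.sum(M**2) - np.sum(np.diag(M)**2)

def joint_diag(z, lags, n_iter=500, step=0.01):
    """Minimize J_1(W) = sum_tau off(W C_tau W^T) over orthogonal W
    by projected gradient descent."""
    covs = sym_lagged_covs(z, lags)
    n = z.shape[0]
    W = np.eye(n)
    for _ in range(n_iter):
        G = np.zeros((n, n))
        for C in covs:
            M = W @ C @ W.T
            # Gradient of off(W C W^T) with respect to W (C symmetric):
            # 4 * (M with its diagonal zeroed) @ W @ C.
            G += 4.0 * (M - np.diag(np.diag(M))) @ W @ C
        W = W - step * G                     # step size may need tuning
        U, _, Vt = np.linalg.svd(W)          # project back onto orthogonal matrices
        W = U @ Vt
    return W
```

The returned $W$ plays the same role as the orthogonal separating matrix of the single-lag case, now approximately diagonalizing all the chosen lagged covariances at once.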
The criterion $J_1$ can be simplified. For an orthogonal transformation $W$, the sum of the squares of the elements of $W M W^T$ is constant.$^1$ Thus, the “off” criterion can be expressed as the total sum of squares minus the sum of the squares on the diagonal, and we can formulate
$$ J_2(W) = -\sum_{\tau \in S} \sum_i \left( w_i^T \bar{C}^z_\tau w_i \right)^2 \qquad (18.12) $$
where the $w_i^T$ are the rows of $W$. Thus, minimizing $J_2$ is equivalent to minimizing $J_1$.
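A quick numerical sanity check of this equivalence (my own illustration, with a random symmetric matrix standing in for $\bar{C}^z_\tau$): the total sum of squares of $W C W^T$ is invariant under orthogonal $W$, so per lag the off criterion is just a constant minus the sum of the squared diagonal entries.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
C = rng.standard_normal((n, n))
C = 0.5 * (C + C.T)                               # symmetric stand-in for a lagged covariance
W, _ = np.linalg.qr(rng.standard_normal((n, n)))  # a random orthogonal matrix
M = W @ C @ W.T

# The total sum of squares is invariant under the orthogonal transformation...
print(np.isclose(np.sum(M**2), np.sum(C**2)))     # True

# ...so the off(.) criterion equals this constant minus the sum of squared
# diagonal entries: minimizing J_1 amounts to maximizing the squared diagonal,
# i.e., to minimizing J_2 as defined in (18.12).
print(np.sum(M**2) - np.sum(np.diag(M)**2))       # off(W C W^T)
print(np.sum(C**2) - np.sum(np.diag(M)**2))       # same value
```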
An alternative method for measuring the diagonality can be obtained using the approach in [240]. For any positive-definite matrix $M$, we have

$$ \sum_i \log m_{ii} \geq \log |\det M| \qquad (18.13) $$

and the equality holds only for diagonal $M$. Thus, we could measure the nondiagonality of $M$ by

$$ F(M) = \sum_i \log m_{ii} - \log |\det M| \qquad (18.14) $$
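The following small sketch (the helper name F and the test matrix are mine) checks (18.13) and (18.14) numerically: the measure is nonnegative for a positive-definite matrix and vanishes when the matrix is diagonal.

```python
import numpy as np

def F(M):
    """Nondiagonality measure of a positive-definite matrix, eq. (18.14):
    sum of the logs of the diagonal entries minus the log-determinant.
    Nonnegative, and zero exactly when M is diagonal."""
    sign, logdet = np.linalg.slogdet(M)
    assert sign > 0, "M must be positive-definite"
    return np.sum(np.log(np.diag(M))) - logdet

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
M = A @ A.T + 4 * np.eye(4)                       # a positive-definite test matrix

print(F(M) >= 0)                                  # inequality (18.13)
print(np.isclose(F(np.diag(np.diag(M))), 0.0))    # equality for a diagonal matrix
```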
Again, the total nondiagonality of the $\bar{C}^y_\tau$ at different time lags can be measured by the sum of these measures for different time lags. This gives us the following objective function to minimize:
$$ J_3(W) = \frac{1}{2} \sum_{\tau \in S} F(\bar{C}^y_\tau) = \frac{1}{2} \sum_{\tau \in S} F(W \bar{C}^z_\tau W^T) \qquad (18.15) $$
Just as in maximum likelihood (ML) estimation, $W$ decouples from the term involving the logarithm of the determinant. We obtain

$$ J_3(W) = \sum_{\tau \in S} \left[ \sum_i \frac{1}{2} \log\left( w_i^T \bar{C}^z_\tau w_i \right) - \log|\det W| - \frac{1}{2} \log|\det \bar{C}^z_\tau| \right] \qquad (18.16) $$
Considering whitened data, in which case $W$ can be constrained orthogonal, we see that the terms involving the determinants are constant, and we finally have

$$ J_3(W) = \sum_{\tau \in S} \sum_i \frac{1}{2} \log\left( w_i^T \bar{C}^z_\tau w_i \right) + \text{const.} \qquad (18.17) $$
This is in fact rather similar to the function $J_2$ in (18.12). The only difference is that the function $u^2$ has been replaced by $-\frac{1}{2}\log(u)$. What these functions have
$^1$ This is because it equals $\mathrm{trace}(W M W^T (W M W^T)^T) = \mathrm{trace}(W M M^T W^T) = \mathrm{trace}(W^T W M M^T) = \mathrm{trace}(M M^T)$.