
6 Principal Component Analysis and Whitening
Principal component analysis (PCA) and the closely related Karhunen-Loève transform, or the Hotelling transform, are classic techniques in statistical data analysis,
feature extraction, and data compression, stemming from the early work of Pearson
[364]. Given a set of multivariate measurements, the purpose is to find a smaller set of
variables with less redundancy that would give as good a representation as possible.
This goal is related to the goal of independent component analysis (ICA). However,
in PCA the redundancy is measured by correlations between data elements, while
in ICA the much richer concept of independence is used, and in ICA the reduction
of the number of variables is given less emphasis. Using only the correlations as in
PCA has the advantage that the analysis can be based on second-order statistics only.
In connection with ICA, PCA is a useful preprocessing step.
The basic PCA problem is outlined in this chapter. Both the closed-form solution
and on-line learning algorithms for PCA are reviewed. Next, the related linear
statistical technique of factor analysis is discussed. The chapter is concluded by
presenting how data can be preprocessed by whitening, removing the effect of first-
and second-order statistics, which is very helpful as the first step in ICA.
6.1 PRINCIPAL COMPONENTS
The starting point for PCA is a random vector $\mathbf{x}$ with $n$ elements. There is available a sample $\mathbf{x}(1), \ldots, \mathbf{x}(T)$ from this random vector. No explicit assumptions on the probability density of the vectors are made in PCA, as long as the first- and second-order statistics are known or can be estimated from the sample. Also, no generative
model is assumed for vector $\mathbf{x}$. Typically the elements of $\mathbf{x}$ are measurements like pixel gray levels or values of a signal at different time instants. It is essential in PCA that the elements are mutually correlated, and there is thus some redundancy in $\mathbf{x}$, making compression possible. If the elements are independent, nothing can be achieved by PCA.
In the PCA transform, the vector $\mathbf{x}$ is first centered by subtracting its mean:
$$\mathbf{x} \leftarrow \mathbf{x} - E\{\mathbf{x}\}$$
The mean is in practice estimated from the available sample $\mathbf{x}(1), \ldots, \mathbf{x}(T)$ (see Chapter 4). Let us assume in the following that the centering has been done and thus $E\{\mathbf{x}\} = \mathbf{0}$. Next, $\mathbf{x}$ is linearly transformed to another vector $\mathbf{y}$ with $m$ elements, $m < n$, so that the redundancy induced by the correlations is removed. This is done by finding a rotated orthogonal coordinate system such that the elements of $\mathbf{x}$ in the new coordinates become uncorrelated. At the same time, the variances of the projections of $\mathbf{x}$ on the new coordinate axes are maximized so that the first axis corresponds to the maximal variance, the second axis corresponds to the maximal variance in the direction orthogonal to the first axis, and so on.
For instance, if $\mathbf{x}$ has a gaussian density that is constant over ellipsoidal surfaces in the $n$-dimensional space, then the rotated coordinate system coincides with the principal axes of the ellipsoid. A two-dimensional example is shown in Fig. 2.7 in Chapter 2. The principal components are now the projections of the data points on the two principal axes, $\mathbf{e}_1$ and $\mathbf{e}_2$. In addition to achieving uncorrelated components, the variances of the components (projections) also will be very different in most applications, with a considerable number of the variances so small that the corresponding components can be discarded altogether. Those components that are left constitute the vector $\mathbf{y}$.
As an example, take a set of $8 \times 8$ pixel windows from a digital image, an application that is considered in detail in Chapter 21. They are first transformed, e.g., using row-by-row scanning, into vectors $\mathbf{x}$ whose elements are the gray levels of the 64 pixels in the window. In real-time digital video transmission, it is essential to reduce this data as much as possible without losing too much of the visual quality, because the total amount of data is very large. Using PCA, a compressed representation vector $\mathbf{y}$ can be obtained from $\mathbf{x}$, which can be stored or transmitted. Typically, $\mathbf{y}$ can have as few as 10 elements, and a good replica of the original $8 \times 8$ image window can still be reconstructed from it. This kind of compression is possible because neighboring elements of $\mathbf{x}$, which are the gray levels of neighboring pixels in the digital image, are heavily correlated. These correlations are utilized by PCA, allowing almost the same information to be represented by a much smaller vector $\mathbf{y}$. PCA is a linear technique, so computing $\mathbf{y}$ from $\mathbf{x}$ is not heavy, which makes real-time processing possible.
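To make the procedure concrete, the following sketch (not from the book; it assumes NumPy and a grayscale image stored in a 2-D array img with side lengths divisible by 8) carries out exactly these steps: row-by-row scanning of the $8 \times 8$ windows into 64-dimensional vectors, centering, projection onto the first 10 eigenvectors of the estimated covariance matrix, and reconstruction.

import numpy as np

def pca_compress_blocks(img, m=10):
    """Compress 8x8 image blocks to m principal components, then reconstruct."""
    h, w = img.shape
    # row-by-row scan of each 8x8 window into a 64-dimensional vector x
    blocks = (img.reshape(h // 8, 8, w // 8, 8)
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, 64))
    mean = blocks.mean(axis=0)
    X = blocks - mean                       # centering: x <- x - E{x}
    C_x = X.T @ X / X.shape[0]              # 64 x 64 sample covariance matrix
    d, E = np.linalg.eigh(C_x)              # eigenvalues in ascending order
    E_m = E[:, ::-1][:, :m]                 # first m eigenvectors e_1, ..., e_m
    Y = X @ E_m                             # compressed representation y (m numbers per block)
    X_hat = Y @ E_m.T + mean                # reconstruction from y
    rec = (X_hat.reshape(h // 8, w // 8, 8, 8)
                .transpose(0, 2, 1, 3)
                .reshape(h, w))
    return Y, rec

# usage with random data standing in for an image:
# Y, rec = pca_compress_blocks(np.random.rand(64, 64), m=10)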
6.1.1 PCA by variance maximization
In mathematical terms, consider a linear combination
$$y_1 = \sum_{k=1}^{n} w_{k1} x_k = \mathbf{w}_1^T \mathbf{x}$$
of the elements $x_1, \ldots, x_n$ of the vector $\mathbf{x}$. The $w_{11}, \ldots, w_{n1}$ are scalar coefficients or weights, elements of an $n$-dimensional vector $\mathbf{w}_1$, and $\mathbf{w}_1^T$ denotes the transpose of $\mathbf{w}_1$.
The factor $y_1$ is called the first principal component of $\mathbf{x}$, if the variance of $y_1$ is maximally large. Because the variance depends on both the norm and orientation of the weight vector $\mathbf{w}_1$ and grows without limits as the norm grows, we impose the constraint that the norm of $\mathbf{w}_1$ is constant, in practice equal to 1. Thus we look for a weight vector $\mathbf{w}_1$ maximizing the PCA criterion
$$J_1^{\mathrm{PCA}}(\mathbf{w}_1) = E\{y_1^2\} = E\{(\mathbf{w}_1^T \mathbf{x})^2\} = \mathbf{w}_1^T E\{\mathbf{x}\mathbf{x}^T\}\mathbf{w}_1 = \mathbf{w}_1^T \mathbf{C}_x \mathbf{w}_1 \qquad (6.1)$$
so that
$$\|\mathbf{w}_1\| = 1 \qquad (6.2)$$
There $E\{\cdot\}$ is the expectation over the (unknown) density of the input vector $\mathbf{x}$, and the norm of $\mathbf{w}_1$ is the usual Euclidean norm defined as
$$\|\mathbf{w}_1\| = (\mathbf{w}_1^T \mathbf{w}_1)^{1/2} = \left[\sum_{k=1}^{n} w_{k1}^2\right]^{1/2}$$
The matrix $\mathbf{C}_x$ in Eq. (6.1) is the $n \times n$ covariance matrix of $\mathbf{x}$ (see Chapter 4), given for the zero-mean vector $\mathbf{x}$ by the correlation matrix
$$\mathbf{C}_x = E\{\mathbf{x}\mathbf{x}^T\} \qquad (6.3)$$
It is well known from basic linear algebra (see, e.g., [324, 112]) that the solution to the PCA problem is given in terms of the unit-length eigenvectors $\mathbf{e}_1, \ldots, \mathbf{e}_n$ of the matrix $\mathbf{C}_x$. The ordering of the eigenvectors is such that the corresponding eigenvalues $d_1, \ldots, d_n$ satisfy $d_1 \geq d_2 \geq \ldots \geq d_n$. The solution maximizing (6.1) is given by
$$\mathbf{w}_1 = \mathbf{e}_1$$
Thus the first principal component of $\mathbf{x}$ is $y_1 = \mathbf{e}_1^T \mathbf{x}$.
The criterion $J_1^{\mathrm{PCA}}$ in Eq. (6.1) can be generalized to $m$ principal components, with $m$ any number between 1 and $n$. Denoting the $m$-th ($1 \leq m \leq n$) principal component by $y_m = \mathbf{w}_m^T \mathbf{x}$, with $\mathbf{w}_m$ the corresponding unit-norm weight vector, the variance of $y_m$ is now maximized under the constraint that $y_m$ is uncorrelated with all the previously found principal components:
$$E\{y_m y_k\} = 0, \quad k < m \qquad (6.4)$$
Note that the principal components $y_m$ have zero means because
$$E\{y_m\} = \mathbf{w}_m^T E\{\mathbf{x}\} = 0$$
The condition (6.4) yields
$$E\{y_m y_k\} = E\{(\mathbf{w}_m^T \mathbf{x})(\mathbf{w}_k^T \mathbf{x})\} = \mathbf{w}_m^T \mathbf{C}_x \mathbf{w}_k = 0 \qquad (6.5)$$
For the second principal component, we have the condition that
$$\mathbf{w}_2^T \mathbf{C}_x \mathbf{w}_1 = d_1 \mathbf{w}_2^T \mathbf{e}_1 = 0 \qquad (6.6)$$
because we already know that $\mathbf{w}_1 = \mathbf{e}_1$. We are thus looking for maximal variance $E\{y_2^2\} = E\{(\mathbf{w}_2^T \mathbf{x})^2\}$ in the subspace orthogonal to the first eigenvector of $\mathbf{C}_x$. The solution is given by
$$\mathbf{w}_2 = \mathbf{e}_2$$
Likewise, recursively it follows that
$$\mathbf{w}_k = \mathbf{e}_k$$
Thus the $k$th principal component is $y_k = \mathbf{e}_k^T \mathbf{x}$.
Exactly the same result for the $\mathbf{w}_i$ is obtained if the variances of $y_i$ are maximized under the constraint that the principal component vectors are orthonormal, or $\mathbf{w}_i^T \mathbf{w}_j = \delta_{ij}$. This is left as an exercise.
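As a quick numerical illustration of this result (a sketch under the assumption that NumPy is available and that the data matrix X holds one sample $\mathbf{x}(t)$ per row), the eigenvectors of the estimated covariance matrix yield components that are mutually uncorrelated and whose variances equal the eigenvalues $d_k$:

import numpy as np

rng = np.random.default_rng(0)
n, T = 5, 10000
A = rng.normal(size=(n, n))
X = rng.normal(size=(T, n)) @ A.T          # correlated zero-mean data, one sample per row
X -= X.mean(axis=0)                        # centering

C_x = X.T @ X / T                          # sample covariance matrix C_x
d, E = np.linalg.eigh(C_x)                 # eigenvalues in ascending order
d, E = d[::-1], E[:, ::-1]                 # reorder so that d_1 >= d_2 >= ... >= d_n

Y = X @ E                                  # principal components y_k = e_k^T x
print(np.allclose(np.cov(Y.T, bias=True), np.diag(d), atol=1e-8))
# True: the components are uncorrelated and Var{y_k} = d_k, so y_1 carries
# the maximal variance among all unit-norm projections of x.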
6.1.2 PCA by minimum mean-square error compression
In the preceding subsection, the principal components were defined as weighted sums of the elements of $\mathbf{x}$ with maximal variance, under the constraints that the weights are normalized and the principal components are uncorrelated with each other. It turns out that this is strongly related to minimum mean-square error compression of $\mathbf{x}$, which is another way to pose the PCA problem. Let us search for a set of $m$ orthonormal basis vectors, spanning an $m$-dimensional subspace, such that the mean-square error between $\mathbf{x}$ and its projection on the subspace is minimal. Denoting again the basis vectors by $\mathbf{w}_1, \ldots, \mathbf{w}_m$, for which we assume
$$\mathbf{w}_i^T \mathbf{w}_j = \delta_{ij}$$
the projection of $\mathbf{x}$ on the subspace spanned by them is $\sum_{i=1}^{m} (\mathbf{w}_i^T \mathbf{x})\mathbf{w}_i$. The mean-square error (MSE) criterion, to be minimized by the orthonormal basis $\mathbf{w}_1, \ldots, \mathbf{w}_m$, becomes
$$J_{\mathrm{MSE}}^{\mathrm{PCA}} = E\{\|\mathbf{x} - \sum_{i=1}^{m} (\mathbf{w}_i^T \mathbf{x})\mathbf{w}_i\|^2\} \qquad (6.7)$$
It is easy to show (see exercises) that due to the orthogonality of the vectors $\mathbf{w}_i$, this criterion can be further written as
$$J_{\mathrm{MSE}}^{\mathrm{PCA}} = E\{\|\mathbf{x}\|^2\} - E\{\sum_{j=1}^{m} (\mathbf{w}_j^T \mathbf{x})^2\} \qquad (6.8)$$
$$= \mathrm{trace}(\mathbf{C}_x) - \sum_{j=1}^{m} \mathbf{w}_j^T \mathbf{C}_x \mathbf{w}_j \qquad (6.9)$$
It can be shown (see, e.g., [112]) that the minimum of (6.9) under the orthonormality condition on the $\mathbf{w}_i$ is given by any orthonormal basis of the PCA subspace spanned by the $m$ first eigenvectors $\mathbf{e}_1, \ldots, \mathbf{e}_m$. However, the criterion does not specify the basis of this subspace at all. Any orthonormal basis of the subspace will give the same optimal compression. While this ambiguity can be seen as a disadvantage, it should be noted that there may be some other criteria by which a certain basis in the PCA subspace is to be preferred over others. Independent component analysis is a prime example of methods in which PCA is a useful preprocessing step, but once the vector $\mathbf{x}$ has been expressed in terms of the first $m$ eigenvectors, a further rotation brings out the much more useful independent components.
It can also be shown [112] that the value of the minimum mean-square error of (6.7) is
$$J_{\mathrm{MSE}}^{\mathrm{PCA}} = \sum_{i=m+1}^{n} d_i \qquad (6.10)$$
the sum of the eigenvalues corresponding to the discarded eigenvectors $\mathbf{e}_{m+1}, \ldots, \mathbf{e}_n$.
If the orthonormality constraint is simply changed to
$$\mathbf{w}_j^T \mathbf{w}_k = \omega_k \delta_{jk} \qquad (6.11)$$
where all the numbers $\omega_k$ are positive and different, then the mean-square error problem will have a unique solution given by scaled eigenvectors [333].
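The following small numerical check (a sketch assuming NumPy, with the data matrix X again holding one centered sample per row) confirms Eq. (6.10): the mean-square reconstruction error over the sample equals the sum of the discarded eigenvalues.

import numpy as np

rng = np.random.default_rng(1)
n, m, T = 6, 3, 20000
X = rng.normal(size=(T, n)) @ rng.normal(size=(n, n))
X -= X.mean(axis=0)                        # centered data, one sample per row

C_x = X.T @ X / T
d, E = np.linalg.eigh(C_x)
d, E = d[::-1], E[:, ::-1]                 # d_1 >= d_2 >= ... >= d_n

W = E[:, :m]                               # an orthonormal basis of the PCA subspace
X_hat = (X @ W) @ W.T                      # projection of each sample onto the subspace
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, d[m:].sum())                    # the two numbers coincide, as in Eq. (6.10)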
6.1.3 Choosing the number of principal components
From the result that the principal component basis vectors $\mathbf{w}_i$ are eigenvectors $\mathbf{e}_i$ of $\mathbf{C}_x$, it follows that
$$E\{y_m^2\} = E\{\mathbf{e}_m^T \mathbf{x}\mathbf{x}^T \mathbf{e}_m\} = \mathbf{e}_m^T \mathbf{C}_x \mathbf{e}_m = d_m \qquad (6.12)$$
The variances of the principal components are thus directly given by the eigenvalues of $\mathbf{C}_x$. Note that, because the principal components have zero means, a small eigenvalue (a small variance) $d_m$ indicates that the value of the corresponding principal component $y_m$ is mostly close to zero.
An important application of PCA is data compression. The vectors $\mathbf{x}$ in the original data set (that have first been centered by subtracting the mean) are approximated by the truncated PCA expansion
$$\hat{\mathbf{x}} = \sum_{i=1}^{m} y_i \mathbf{e}_i \qquad (6.13)$$
Then we know from (6.10) that the mean-square error $E\{\|\mathbf{x} - \hat{\mathbf{x}}\|^2\}$ is equal to $\sum_{i=m+1}^{n} d_i$. As the eigenvalues are all positive, the error decreases when more and more terms are included in (6.13), until the error becomes zero when $m = n$ or all the principal components are included. A very important practical problem is how to
choose $m$ in (6.13); this is a trade-off between error and the amount of data needed for the expansion. Sometimes a rather small number of principal components are sufficient.
Fig. 6.1 Leftmost column: some digital images in a $32 \times 32$ grid. Second column: means of the samples. Remaining columns: reconstructions by PCA when 1, 2, 5, 16, 32, and 64 principal components were used in the expansion.
Example 6.1 In digital image processing, the amount of data is typically very large, and data compression is necessary for storage, transmission, and feature extraction. PCA is a simple and efficient method. Fig. 6.1 shows 10 handwritten characters that were represented as binary $32 \times 32$ matrices (left column) [183]. Such images, when scanned row by row, can be represented as 1024-dimensional vectors. For each of the 10 character classes, about 1700 handwritten samples were collected, and the sample means and covariance matrices were computed by standard estimation methods. The covariance matrices were $1024 \times 1024$ matrices. For each class, the first 64 principal component vectors or eigenvectors of the covariance matrix were computed. The second column in Fig. 6.1 shows the sample means, and the other columns show the reconstructions (6.13) for various values of $m$. In the reconstructions, the sample means have been added again to scale the images for visual display. Note how a relatively small percentage of the 1024 principal components produces reasonable reconstructions.
The condition (6.12) can often be used in advance to determine the number of principal components $m$, if the eigenvalues are known. The eigenvalue sequence $d_1, d_2, \ldots, d_n$ of a covariance matrix for real-world measurement data is usually sharply decreasing, and it is possible to set a limit below which the eigenvalues, hence principal components, are insignificantly small. This limit determines how many principal components are used.
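A minimal sketch of this thresholding rule (an illustration, not the book's procedure; it assumes the eigenvalues are already sorted in decreasing order) keeps the smallest $m$ for which the discarded eigenvalues, i.e. the mean-square error of Eq. (6.10), fall below a chosen fraction of the total variance:

import numpy as np

def choose_m(eigenvalues, discard_fraction=0.05):
    """Smallest m such that sum_{i>m} d_i / sum_i d_i < discard_fraction."""
    d = np.asarray(eigenvalues, dtype=float)
    retained = np.cumsum(d) / d.sum()      # variance retained by the first m components
    return int(np.searchsorted(retained, 1.0 - discard_fraction) + 1)

# e.g. choose_m([5.1, 2.3, 0.8, 0.05, 0.02, 0.01]) returns 3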
Sometimes the threshold can be determined from some prior information on the vectors $\mathbf{x}$. For instance, assume that $\mathbf{x}$ obeys a signal-noise model
$$\mathbf{x} = \sum_{i=1}^{m} \mathbf{a}_i s_i + \mathbf{n} \qquad (6.14)$$
where $m < n$. There the $\mathbf{a}_i$ are some fixed vectors and the coefficients $s_i$ are random numbers that are zero mean and uncorrelated. We can assume that their variances have been absorbed in the vectors $\mathbf{a}_i$ so that they have unit variances. The term $\mathbf{n}$ is white noise, for which $E\{\mathbf{n}\mathbf{n}^T\} = \sigma^2 \mathbf{I}$. Then the vectors $\mathbf{a}_i$ span a subspace, called the signal subspace, that has lower dimensionality than the whole space of vectors $\mathbf{x}$. The subspace orthogonal to the signal subspace is spanned by pure noise and it is called the noise subspace.
It is easy to show (see exercises) that in this case the covariance matrix of $\mathbf{x}$ has a special form:
$$\mathbf{C}_x = \sum_{i=1}^{m} \mathbf{a}_i \mathbf{a}_i^T + \sigma^2 \mathbf{I} \qquad (6.15)$$
The eigenvalues are now the eigenvalues of $\sum_{i=1}^{m} \mathbf{a}_i \mathbf{a}_i^T$, each increased by the constant $\sigma^2$. But the matrix $\sum_{i=1}^{m} \mathbf{a}_i \mathbf{a}_i^T$ has at most $m$ nonzero eigenvalues, and these correspond to eigenvectors that span the signal subspace. When the eigenvalues of $\mathbf{C}_x$ are computed, the first $m$ form a decreasing sequence and the rest are small constants, equal to $\sigma^2$:
$$d_1 > d_2 > \ldots > d_m > d_{m+1} = d_{m+2} = \ldots = d_n = \sigma^2$$
It is usually possible to detect where the eigenvalues become constants, and putting a threshold at this index, $m$, cuts off the eigenvalues and eigenvectors corresponding to pure noise. Then only the signal part remains.
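A brief numerical illustration of this eigenvalue structure (a sketch assuming NumPy; the vectors $\mathbf{a}_i$ are drawn at random here) is as follows:

import numpy as np

rng = np.random.default_rng(2)
n, m, sigma2 = 8, 3, 0.1
A = rng.normal(size=(n, m))                # columns are the vectors a_1, ..., a_m
C_x = A @ A.T + sigma2 * np.eye(n)         # covariance matrix of Eq. (6.15)

d = np.linalg.eigvalsh(C_x)[::-1]          # eigenvalues d_1 >= ... >= d_n
print(np.round(d, 3))
# the first m eigenvalues exceed sigma^2, the remaining n - m equal sigma^2,
# so the knee in the sequence reveals the signal subspace dimension m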
A more disciplined approach to this problem was given by [453]; see also [231]. They give formulas for two well-known information-theoretic modeling criteria, Akaike's information criterion (AIC) and the minimum description length (MDL) criterion, as functions of the signal subspace dimension $m$. The criteria depend on the length $T$ of the sample $\mathbf{x}(1), \ldots, \mathbf{x}(T)$ and on the eigenvalues $d_1, \ldots, d_n$ of the matrix $\mathbf{C}_x$. Finding the minimum point gives a good value for $m$.
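As an illustration of how such criteria are evaluated, the sketch below uses the widely cited Wax-Kailath form of AIC and MDL; the exact expressions are an assumption here and should be checked against reference [453] before use.

import numpy as np

def aic_mdl(d, T):
    """Evaluate AIC and MDL over candidate signal-subspace dimensions m = 0..n-1."""
    d = np.asarray(d, dtype=float)         # eigenvalues d_1 >= ... >= d_n of C_x
    n = d.size
    aic, mdl = [], []
    for m in range(n):
        tail = d[m:]                       # the n - m smallest eigenvalues
        k = n - m
        # log of (geometric mean / arithmetic mean) of the tail eigenvalues
        log_ratio = np.log(tail).mean() - np.log(tail.mean())
        free = m * (2 * n - m)             # number of free parameters
        aic.append(-2.0 * T * k * log_ratio + 2.0 * free)
        mdl.append(-T * k * log_ratio + 0.5 * free * np.log(T))
    return int(np.argmin(aic)), int(np.argmin(mdl))

# m_aic, m_mdl = aic_mdl(eigenvalues, T)   # the minimizing m for each criterion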
6.1.4 Closed-form computation of PCA
To use the closed-form solution $\mathbf{w}_i = \mathbf{e}_i$ given earlier for the PCA basis vectors, the eigenvectors of the covariance matrix $\mathbf{C}_x$ must be known. In the conventional use of PCA, there is a sufficiently large sample of vectors $\mathbf{x}$ available, from which the mean and the covariance matrix $\mathbf{C}_x$ can be estimated by standard methods (see Chapter 4). Solving the eigenvector-eigenvalue problem for $\mathbf{C}_x$ gives the estimates of the eigenvectors $\mathbf{e}_i$. There are several efficient numerical methods available for solving the eigenvectors, e.g., the QR algorithm with its variants [112, 153, 320].
However, it is not always feasible to solve the eigenvectors by standard numerical methods. In an on-line data compression application like image or speech coding, the data samples $\mathbf{x}(t)$ arrive at high speed, and it may not be possible to estimate the covariance matrix and solve the eigenvector-eigenvalue problem once and for all. One reason is computational: the eigenvector problem is numerically too demanding if the dimensionality $n$ is large and the sampling rate is high. Another reason is that the covariance matrix $\mathbf{C}_x$ may not be stationary, due to fluctuating statistics in the sample sequence $\mathbf{x}(t)$, so the estimate would have to be incrementally updated. Therefore, the PCA solution is often replaced by suboptimal nonadaptive transformations like the discrete cosine transform [154].
6.2 PCA BY ON-LINE LEARNING
Another alternative is to derive gradient ascent algorithms or other on-line methods for the preceding maximization problems. The algorithms will then converge to the solutions of the problems, that is, to the eigenvectors. The advantage of this approach is that such algorithms work on-line, using each input vector $\mathbf{x}(t)$ once as it becomes available and making an incremental change to the eigenvector estimates, without computing the covariance matrix at all. This approach is the basis of the PCA neural network learning rules.
Neural networks provide a novel way for parallel on-line computation of the PCA expansion. The PCA network [326] is a layer of parallel linear artificial neurons shown in Fig. 6.2. The output of the $i$th unit ($i = 1, \ldots, m$) is $y_i = \mathbf{w}_i^T \mathbf{x}$, with $\mathbf{x}$ denoting the $n$-dimensional input vector of the network and $\mathbf{w}_i$ denoting the weight vector of the $i$th unit. The number of units, $m$, will determine how many principal components the network will compute. Sometimes this can be determined in advance for typical inputs, or $m$ can be equal to $n$ if all principal components are required.
The PCA network learns the principal components by unsupervised learning rules, by which the weight vectors are gradually updated until they become orthonormal and tend to the theoretically correct eigenvectors. The network also has the ability to track slowly varying statistics in the input data, maintaining its optimality when the statistical properties of the inputs do not stay constant. Due to their parallelism and adaptivity to input data, such learning algorithms and their implementations in neural networks are potentially useful in feature detection and data compression tasks. In ICA, where decorrelating the mixture variables is a useful preprocessing step, these learning rules can be used in connection with on-line ICA.
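As a simple illustration of this idea (a sketch, not the network of Fig. 6.2; it assumes NumPy and an arbitrary decreasing learning-rate schedule), Oja's classical one-unit rule updates a single weight vector from one sample at a time and converges to the first eigenvector $\mathbf{e}_1$ without ever forming the covariance matrix:

import numpy as np

rng = np.random.default_rng(3)
n, T = 5, 50000
A = rng.normal(size=(n, n)) / np.sqrt(n)   # mixing matrix defining C_x = A A^T

w = rng.normal(size=n)
w /= np.linalg.norm(w)                     # initial unit-norm weight vector
for t in range(1, T + 1):
    x = A @ rng.normal(size=n)             # new zero-mean input sample x(t)
    y = w @ x                              # unit output y = w^T x
    w += (1.0 / (100 + t)) * y * (x - y * w)   # Oja's rule: Hebbian term plus normalization

d, E = np.linalg.eigh(A @ A.T)             # true covariance and its eigenvectors
print(abs(w @ E[:, -1]))                   # typically close to 1: w ~ +/- e_1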
