
5 Information Theory
Estimation theory gives one approach to characterizing random variables. This was
based on building parametric models and describing the data by the parameters.
An alternative approach is given by information theory. Here the emphasis is on
coding. We want to code the observations. The observations can then be stored
in the memory of a computer, or transmitted by a communications channel, for
example. Finding a suitable code depends on the statistical properties of the data.
In independent component analysis (ICA), estimation theory and information theory
offer the two principal theoretical approaches.
In this chapter, the basic concepts of information theory are introduced. The latter
half of the chapter deals with a more specialized topic: approximation of entropy.
These concepts are needed in the ICA methods of Part II.
5.1 ENTROPY
5.1.1 Definition of entropy
Entropy is the basic concept of information theory. Entropy H is defined for a discrete-valued random variable X as

H(X) = -\sum_i P(X = a_i) \log P(X = a_i)    (5.1)
where the a_i are the possible values of X. Depending on what the base of the logarithm is, different units of entropy are obtained. Usually, the logarithm with base 2 is used, in which case the unit is called a bit. In the following, the base is not important since it only changes the measurement scale, so it is not explicitly mentioned.
Fig. 5.1 The function f in (5.2), plotted on the interval [0, 1].
Let us define the function f as

f(p) = -p \log p, \quad 0 \le p \le 1    (5.2)

This is a nonnegative function that is zero for p = 0 and for p = 1, and positive for values in between; it is plotted in Fig. 5.1. Using this function, entropy can be written as

H(X) = \sum_i f(P(X = a_i))    (5.3)
Considering the shape of f, we see that the entropy is small if the probabilities P(X = a_i) are close to 0 or 1, and large if the probabilities are in between.
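To make (5.1)–(5.3) concrete, the following sketch (plain NumPy; the function names are our own, not from the book) evaluates f and H for an arbitrary probability vector, using base-2 logarithms so the result is in bits.

```python
import numpy as np

def f(p):
    # The function f(p) = -p log2(p) of Eq. (5.2), with f(0) = f(1) = 0.
    p = np.asarray(p, dtype=float)
    return np.where(p > 0, -p * np.log2(np.maximum(p, 1e-300)), 0.0)

def entropy(probs):
    # Discrete entropy H(X) = sum_i f(P(X = a_i)) of Eq. (5.3), in bits.
    probs = np.asarray(probs, dtype=float)
    assert np.isclose(probs.sum(), 1.0), "probabilities must sum to one"
    return float(f(probs).sum())

# A nearly deterministic variable has small entropy ...
print(entropy([0.99, 0.005, 0.005]))   # ~0.09 bits
# ... while a uniform one has the largest entropy possible for 3 outcomes.
print(entropy([1/3, 1/3, 1/3]))        # ~1.58 bits = log2(3)
```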
In fact, the entropy of a random variable can be interpreted as the degree of information that the observation of the variable gives. The more "random", i.e., unpredictable and unstructured, the variable is, the larger its entropy. Assume that the probabilities are all close to 0, except for one that is close to 1 (the probabilities must sum up to one). Then there is little randomness in the variable, since it almost always takes the same value. This is reflected in its small entropy. On the other hand, if all the probabilities are equal, then they are relatively far from 0 and 1, and f takes large values. This means that the entropy is large, which reflects the fact that the variable is really random: we cannot predict which value it takes.
Example 5.1 Let us consider a random variable X that can have only two values, a and b. Denote by p the probability that it has the value a; then the probability that it is b is equal to 1 - p. The entropy of this random variable can be computed as

H(X) = f(p) + f(1 - p)    (5.4)
Thus, entropy is a simple function of p. (It does not depend on the values a and b.) Clearly, this function has the same properties as f: it is a nonnegative function that is zero for p = 0 and for p = 1, and positive for values in between. In fact, it is maximized for p = 1/2 (this is left as an exercise). Thus, the entropy is largest when both values are obtained with a probability of 50%. In contrast, if one of these values is obtained almost always (say, with a probability of 99.9%), the entropy of X is small, since there is little randomness in the variable.
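The maximization left as an exercise is also easy to check numerically. A minimal sketch (using the closed form H(X) = f(p) + f(1 - p) of Eq. (5.4) directly) sweeps p over the unit interval and locates the maximum:

```python
import numpy as np

p = np.linspace(0.001, 0.999, 999)
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)   # H(X) = f(p) + f(1-p), Eq. (5.4)

print(p[np.argmax(H)])   # ~0.5: the entropy is maximized for p = 1/2
print(H.max())           # 1.0 bit, the largest entropy of a binary variable
```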
5.1.2 Entropy and coding length
The connection between entropy and randomness can be made more rigorous by considering coding length. Assume that we want to find a binary code for a large number of observations of X, so that the code uses the minimum number of bits possible. According to the fundamental results of information theory, entropy is very closely related to the length of the code required. Under some simplifying assumptions, the length of the shortest code is bounded below by the entropy, and this bound can be approached arbitrarily closely; see, e.g., [97]. So, entropy gives roughly the average minimum code length of the random variable. Since this topic is outside the scope of this book, we will just illustrate it with two examples.
Example 5.2 Consider again the case of a random variable with two possible values, a and b. If the variable almost always takes the same value, its entropy is small. This is reflected in the fact that the variable is easy to code. In fact, assume the value a is almost always obtained. Then one efficient code might be obtained simply by counting how many a's are found between two subsequent observations of b, and writing down these numbers. If we need to code only a few numbers, we are able to code the data very efficiently. In the extreme case where the probability of a is 1, there is actually nothing left to code and the coding length is zero. On the other hand, if both values have the same probability, this trick cannot be used to obtain an efficient coding mechanism, and every value must be coded separately by one bit.
Example 5.3 Consider a random variable X that can have eight different values with probabilities (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64). The entropy of X is 2 bits (this computation is left as an exercise to the reader). If we just coded the data in the ordinary way, we would need 3 bits for every observation. But a more intelligent way is to code frequent values with short binary strings and infrequent values with longer strings. Here, we could use the following strings for the outcomes: 0, 10, 110, 1110, 111100, 111101, 111110, 111111. (Note that the strings can be written one after another with no spaces, since they are designed so that one always knows when a string ends.) With this encoding, the average number of bits needed for each outcome is only 2, which is in fact equal to the entropy. So we have gained a 33% reduction in coding length.
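Both claims of Example 5.3, the entropy of 2 bits and the average code length of 2 bits, can be verified in a few lines (the script below is our own illustration, not part of the book):

```python
import numpy as np

probs = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

entropy = -(probs * np.log2(probs)).sum()
avg_len = sum(p * len(c) for p, c in zip(probs, codes))

print(entropy)   # 2.0 bits
print(avg_len)   # 2.0 bits per outcome, matching the entropy
```

Each code word here has length equal to -log2 of its probability, which is exactly the situation in which a prefix code meets the entropy bound.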
5.1.3 Differential entropy
The definition of entropy for a discrete-valued random variable can be generalized for continuous-valued random variables and vectors, in which case it is often called differential entropy. The differential entropy H of a random variable x with density p_x(·) is defined as

H(x) = -\int p_x(\xi) \log p_x(\xi) \, d\xi = \int f(p_x(\xi)) \, d\xi    (5.5)
Differential entropy can be interpreted as a measure of randomness in the same way
as entropy. If the random variable is concentrated on certain small intervals, its
differential entropy is small.
Note that differential entropy can be negative. Ordinary entropy cannot be negative, because the function f in (5.2) is nonnegative on the interval [0, 1], and discrete probabilities necessarily stay in this interval. But probability densities can be larger than 1, in which case f takes negative values. So, when we speak of a "small differential entropy", it may be negative and have a large absolute value.
It is now easy to see what kind of random variables have small entropies. They are the ones whose probability densities take large values, since these give strong negative contributions to the integral in (5.5). This means that certain intervals are quite probable. Thus we again find that entropy is small when the variable is not very random, that is, when it is contained in some limited intervals with high probabilities.
Example 5.4 Consider a random variable x that has a uniform probability distribution on the interval [0, a]. Its density is given by

p_x(\xi) = \begin{cases} 1/a & \text{for } 0 \le \xi \le a \\ 0 & \text{otherwise} \end{cases}    (5.6)

The differential entropy can be evaluated as

H(x) = -\int_0^a \frac{1}{a} \log \frac{1}{a} \, d\xi = \log a    (5.7)

Thus we see that the entropy is large if a is large, and small if a is small. This is natural, because the smaller a is, the less randomness there is in x. In the limit where a goes to 0, the differential entropy goes to -\infty, because in the limit, x is no longer random at all: it is always 0.
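The integral in (5.7) can also be checked numerically. The sketch below (our own, using scipy.integrate.quad and the natural logarithm, so the entropy is in nats) evaluates the differential entropy of the uniform density for a few values of a:

```python
import numpy as np
from scipy.integrate import quad

def uniform_diff_entropy(a):
    # H(x) = -integral over [0, a] of (1/a) log(1/a), evaluated numerically.
    integrand = lambda xi: -(1.0 / a) * np.log(1.0 / a)
    value, _ = quad(integrand, 0.0, a)
    return value

for a in [0.1, 1.0, 10.0]:
    print(a, uniform_diff_entropy(a), np.log(a))   # the last two numbers agree
```

For a < 1 the result is negative, illustrating that differential entropy, unlike ordinary entropy, can take negative values.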
The interpretation of entropy as coding length is more or less valid with differential entropy. The situation is more complicated, however, since the coding length interpretation requires that we discretize (quantize) the values of x. In this case, the coding length depends on the discretization, i.e., on the accuracy with which we want to represent the random variable. Thus the actual coding length is given by the sum of the entropy and a function of the accuracy of representation. We will not go into the details here; see [97] for more information.
The definition of differential entropy can be straightforwardly generalized to the multidimensional case. Let x be a random vector with density p_x(·). The differential entropy is then defined as

H(x) = -\int p_x(\xi) \log p_x(\xi) \, d\xi = \int f(p_x(\xi)) \, d\xi    (5.8)
5.1.4 Entropy of a transformation
Consider an invertible transformation of the random vector x, say

y = f(x)    (5.9)

In this section, we show the connection between the entropy of y and that of x.
A short, if somewhat sloppy, derivation is as follows. (A more rigorous derivation is given in the Appendix.) Denote by Jf(\xi) the Jacobian matrix of the function f, i.e., the matrix of the partial derivatives of f at the point \xi. The classic relation between the density p_y of y and the density p_x of x, as given in Eq. (2.82), can then be formulated as

p_y(\xi) = p_x(f^{-1}(\xi)) \, |\det Jf(f^{-1}(\xi))|^{-1}    (5.10)
Now, expressing the entropy as an expectation

H(y) = -E\{\log p_y(y)\}    (5.11)

we get

E\{\log p_y(y)\} = E\{\log[\, p_x(f^{-1}(y)) \, |\det Jf(f^{-1}(y))|^{-1} \,]\}
= E\{\log[\, p_x(x) \, |\det Jf(x)|^{-1} \,]\} = E\{\log p_x(x)\} - E\{\log |\det Jf(x)|\}    (5.12)
Thus we obtain the relation between the entropies as

H(y) = H(x) + E\{\log |\det Jf(x)|\}    (5.13)

In other words, the entropy is increased in the transformation by E\{\log |\det Jf(x)|\}.
An important special case is the linear transformation

y = Mx    (5.14)

in which case we obtain

H(y) = H(x) + \log |\det M|    (5.15)
This also shows that differential entropy is not scale-invariant. Consider a random variable x. If we multiply it by a scalar constant \alpha, the differential entropy changes as

H(\alpha x) = H(x) + \log |\alpha|    (5.16)

Thus, just by changing the scale, we can change the differential entropy. This is why the scale of x is often fixed before measuring its differential entropy.
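For gaussian vectors the differential entropy is known in closed form, H(x) = (1/2) log det(2πeC) in nats, so the transformation rule (5.15) is easy to verify. The following sketch (our own check, not from the book) compares the entropy of y = Mx with H(x) + log|det M|:

```python
import numpy as np

def gaussian_entropy(C):
    # Differential entropy (in nats) of a gaussian vector with covariance C.
    return 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * C))

rng = np.random.default_rng(0)
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])               # covariance of x
M = rng.normal(size=(2, 2))              # an (almost surely) invertible matrix

H_x = gaussian_entropy(C)
H_y = gaussian_entropy(M @ C @ M.T)      # y = Mx is gaussian with covariance M C M^T

print(H_y)                                   # these two numbers agree,
print(H_x + np.log(abs(np.linalg.det(M))))   # illustrating Eq. (5.15)
```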
5.2 MUTUAL INFORMATION
5.2.1 Definition using entropy
Mutual information is a measure of the information that members of a set of random variables have on the other random variables in the set. Using entropy, we can define the mutual information I between n (scalar) random variables x_i, i = 1, ..., n, as follows:

I(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} H(x_i) - H(x)    (5.17)

where x is the vector containing all the x_i.
Mutual information can be interpreted by using the interpretation of entropy as code length. The terms H(x_i) give the lengths of codes for the x_i when these are coded separately, and H(x) gives the code length when x is coded as a random vector, i.e., all the components are coded in the same code. Mutual information thus shows what code length reduction is obtained by coding the whole vector instead of the separate components. In general, better codes can be obtained by coding the whole vector. However, if the x_i are independent, they give no information on each other, and one could just as well code the variables separately without increasing code length.
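For discrete variables, (5.17) can be evaluated directly from a joint probability table. The sketch below (the table is a made-up example of ours) computes the marginal entropies, the joint entropy, and their difference:

```python
import numpy as np

def H(probs):
    # Discrete entropy in bits, ignoring zero-probability entries.
    probs = np.asarray(probs, dtype=float).ravel()
    probs = probs[probs > 0]
    return -(probs * np.log2(probs)).sum()

# Joint distribution P(x1 = i, x2 = j) of two dependent binary variables.
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])

I = H(P.sum(axis=1)) + H(P.sum(axis=0)) - H(P)   # Eq. (5.17) with n = 2
print(I)   # ~0.28 bits > 0, since x1 and x2 are not independent
```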
5.2.2 Definition using Kullback-Leibler divergence
Alternatively, mutual information can be interpreted as a distance, using what is called the Kullback-Leibler divergence. This is defined between two n-dimensional probability density functions (pdf's) p_1 and p_2 as

\delta(p_1, p_2) = \int p_1(\xi) \log \frac{p_1(\xi)}{p_2(\xi)} \, d\xi    (5.18)
The Kullback-Leibler divergence can be considered as a kind of distance between the two probability densities, because it is always nonnegative, and zero if and only if the two distributions are equal. This is a direct consequence of the (strict) convexity of the negative logarithm and the application of the classic Jensen's inequality. Jensen's inequality (see [97]) says that for any strictly convex function f and any random variable y, we have

E\{f(y)\} \ge f(E\{y\})    (5.19)
Take f(y) = -\log(y), and assume that y = p_2(x)/p_1(x), where x has the distribution given by p_1. Then we have

\delta(p_1, p_2) = E\{f(y)\} = E\Big\{-\log \frac{p_2(x)}{p_1(x)}\Big\} = \int p_1(\xi) \Big[-\log \frac{p_2(\xi)}{p_1(\xi)}\Big] \, d\xi
\ge f(E\{y\}) = -\log \int p_1(\xi) \frac{p_2(\xi)}{p_1(\xi)} \, d\xi = -\log \int p_2(\xi) \, d\xi = 0    (5.20)
Moreover, we have equality in Jensen's inequality if and only if y is constant. In our case, it is constant if and only if the two distributions are equal, so we have proven the announced property of the Kullback-Leibler divergence.

The Kullback-Leibler divergence is not a proper distance measure, though, because it is not symmetric.
To apply the Kullback-Leibler divergence here, let us begin by considering that if the random variables x_i were independent, their joint probability density could be factorized according to the definition of independence. Thus one might measure the independence of the x_i as the Kullback-Leibler divergence between the real density p_1 = p_x(\xi) and the factorized density p_2 = p_1(\xi_1) p_2(\xi_2) \cdots p_n(\xi_n), where the p_i(·) are the marginal densities of the x_i. In fact, simple algebraic manipulations show that this quantity equals the mutual information that we defined using entropy in (5.17); this is left as an exercise.
The interpretation as Kullback-Leibler divergence implies the following important
property: Mutual information is always nonnegative, and it is zero if and only if the
variables are independent. This is a direct consequence of the properties of the
Kullback-Leibler divergence.
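In the discrete case, the equality between the mutual information (5.17) and the Kullback-Leibler divergence from the joint distribution to the product of the marginals is easy to check numerically; the sketch below reuses the kind of joint table used above (again a made-up example of ours) and also illustrates the asymmetry of the divergence:

```python
import numpy as np

def kl(p1, p2):
    # Kullback-Leibler divergence of Eq. (5.18) for discrete distributions (in bits).
    p1 = np.asarray(p1, dtype=float).ravel()
    p2 = np.asarray(p2, dtype=float).ravel()
    mask = p1 > 0
    return (p1[mask] * np.log2(p1[mask] / p2[mask])).sum()

P = np.array([[0.4, 0.1],
              [0.1, 0.4]])                               # joint distribution of (x1, x2)
P_factorized = np.outer(P.sum(axis=1), P.sum(axis=0))    # product of the marginals

print(kl(P, P_factorized))   # ~0.28 bits: equals the mutual information computed earlier
print(kl(P_factorized, P))   # a different value: the divergence is not symmetric
```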
5.3 MAXIMUM ENTROPY
5.3.1 Maximum entropy distributions
An important class of methods that have applications in many domains is given by the maximum entropy methods. These methods apply the concept of entropy to the task of regularization.

Assume that the information available on the density p_x(·) of the scalar random variable x is of the form

\int p(\xi) F^i(\xi) \, d\xi = c_i, \quad \text{for } i = 1, \ldots, n    (5.21)

which means in practice that we have estimated the expectations E\{F^i(x)\} of n different functions F^i of x. (Note that i is here an index, not an exponent.)
The question is now: What is the probability density function p_0 that satisfies the constraints in (5.21), and has maximum entropy among such densities? (Earlier, we defined the entropy of a random variable, but the definition can be used with pdf's as well.) This question can be motivated by noting that a finite number of observations cannot tell us exactly what p is like. So we might use some kind of regularization to obtain the most useful p compatible with these measurements. Entropy can here be considered as a regularization measure that helps us find the least structured density compatible with the measurements. In other words, the maximum entropy density can be interpreted as the density that is compatible with the measurements and makes the minimum number of assumptions on the data. This is because entropy can be interpreted as a measure of randomness, and therefore the maximum entropy density
is the most random of all the pdf’s that satisfy the constraints. For further details on
why entropy can be used as a measure of regularity, see [97, 353].
The basic result of the maximum entropy method (see, e.g., [97, 353]) tells us that, under some regularity conditions, the density p_0(\xi) which satisfies the constraints (5.21) and has maximum entropy among all such densities is of the form

p_0(\xi) = A \exp\Big(\sum_i a_i F^i(\xi)\Big)    (5.22)

Here, A and the a_i are constants that are determined from the c_i, using the constraints in (5.21) (i.e., by substituting the right-hand side of (5.22) for p in (5.21)), and the constraint \int p_0(\xi) \, d\xi = 1. This leads in general to a system of n + 1 nonlinear equations that may be difficult to solve, and in general, numerical methods must be used.
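As a deliberately simple illustration of such a numerical solution, the sketch below (our own; the integration limits and starting point are ad hoc choices) solves for the constants in (5.22) when the constraints are E{x} = 0 and E{x^2} = 1, using numerical integration and a root finder. As the next subsection explains, the solution should be the standard gaussian density:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import fsolve

def residuals(params):
    # Normalization, mean, and second-moment constraints for p0 of Eq. (5.22)
    # with F^1(xi) = xi and F^2(xi) = xi^2; A is parameterized through its logarithm.
    logA, a1, a2 = params
    p0 = lambda xi: np.exp(logA + a1 * xi + a2 * xi**2)
    norm = quad(lambda xi: p0(xi), -20, 20)[0] - 1.0           # integral of p0 = 1
    mean = quad(lambda xi: xi * p0(xi), -20, 20)[0]            # E{x} = 0
    var  = quad(lambda xi: xi**2 * p0(xi), -20, 20)[0] - 1.0   # E{x^2} = 1
    return [norm, mean, var]

logA, a1, a2 = fsolve(residuals, x0=[-1.0, 0.0, -0.4])
print(np.exp(logA), a1, a2)   # ~0.3989, ~0, ~-0.5: the standard gaussian N(0, 1)
```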
5.3.2 Maximality property of the gaussian distribution
Now, consider the set of random variables that can take all values on the real line, and have zero mean and a fixed variance, say 1 (thus, we have two constraints). The maximum entropy distribution for such variables is the gaussian distribution. This is because, by (5.22), the distribution has the form

p_0(\xi) = A \exp(a_1 \xi^2 + a_2 \xi)    (5.23)

and all probability densities of this form are gaussian by definition (see Section 2.5).
Thus we have the fundamental result that a gaussian variable has the largest
entropy among all random variables of unit variance. This means that entropy
could be used as a measure of nongaussianity. In fact, this shows that the gaussian
distribution is the “most random” or the least structured of all distributions. Entropy
is small for distributions that are clearly concentrated on certain values, i.e., when the
variable is clearly clustered, or has a pdf that is very "spiky". This property can be generalized to arbitrary variances and, what is more important, to multidimensional spaces: the gaussian distribution has maximum entropy among all distributions with a given covariance matrix.
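The maximality property can be illustrated by comparing the closed-form differential entropies (in nats) of a few common zero-mean, unit-variance densities; the formulas below are standard results, not derived in this chapter:

```python
import numpy as np

# Differential entropies of zero-mean, unit-variance densities, in nats.
H_gaussian = 0.5 * np.log(2 * np.pi * np.e)   # ~1.419
H_uniform  = np.log(np.sqrt(12.0))            # uniform on [-sqrt(3), sqrt(3)]: ~1.242
H_laplace  = 1.0 + np.log(np.sqrt(2.0))       # Laplace with unit variance: ~1.347

print(H_gaussian, H_uniform, H_laplace)   # the gaussian entropy is the largest
```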
5.4 NEGENTROPY
The maximality property given in Section 5.3.2 shows that entropy could be used to define a measure of nongaussianity. A measure that is zero for a gaussian variable and always nonnegative can be simply obtained from differential entropy, and is called negentropy. Negentropy J is defined as follows:

J(x) = H(x_{gauss}) - H(x)    (5.24)
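For variables whose differential entropy is known in closed form, (5.24) can be evaluated directly. The following small numeric illustration of ours (in nats) takes x to be a unit-variance Laplace variable and x_gauss a gaussian of the same variance:

```python
import numpy as np

# Negentropy J(x) = H(x_gauss) - H(x), with x_gauss gaussian of the same variance as x.
H_gauss_unit_var   = 0.5 * np.log(2 * np.pi * np.e)   # gaussian, variance 1
H_laplace_unit_var = 1.0 + np.log(np.sqrt(2.0))       # Laplace, variance 1

J = H_gauss_unit_var - H_laplace_unit_var
print(J)   # ~0.072 > 0; J would be exactly 0 if x itself were gaussian
```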
