Part I
MATHEMATICAL
PRELIMINARIES
2 Random Vectors and Independence
In this chapter, we review central concepts of probability theory, statistics, and random processes. The emphasis is on multivariate statistics and random vectors. Matters that will be needed later in this book are discussed in more detail, including, for example, statistical independence and higher-order statistics. The reader is assumed to have basic knowledge of single-variable probability theory, so that fundamental definitions such as probability, elementary events, and random variables are familiar.
Readers who already have a good knowledge of multivariate statistics can skip most
of this chapter. For those who need a more extensive review or more information on
advanced matters, many good textbooks ranging from elementary ones to advanced
treatments exist. A widely used textbook covering probability, random variables, and
stochastic processes is [353].
2.1 PROBABILITY DISTRIBUTIONS AND DENSITIES
2.1.1 Distribution of a random variable
In this book, we assume that random variables are continuous-valued unless stated otherwise. The cumulative distribution function (cdf) $F_x$ of a random variable $x$ at the point $x = x_0$ is defined as the probability that $x \le x_0$:
$$F_x(x_0) = P(x \le x_0) \qquad (2.1)$$
Allowing $x_0$ to range from $-\infty$ to $+\infty$ defines the whole cdf for all values of $x$. Clearly, for continuous random variables the cdf is a nonnegative, nondecreasing (often monotonically increasing) continuous function whose values lie in the interval $0 \le F_x(x) \le 1$. From the definition, it also follows directly that $F_x(-\infty) = 0$ and $F_x(+\infty) = 1$.

Fig. 2.1 A gaussian probability density function with mean $m$ and standard deviation $\sigma$.
Usually a probability distribution is characterized in terms of its density function rather than the cdf. Formally, the probability density function (pdf) $p_x(x)$ of a continuous random variable $x$ is obtained as the derivative of its cumulative distribution function:
$$p_x(x_0) = \left. \frac{dF_x(x)}{dx} \right|_{x = x_0} \qquad (2.2)$$
In practice, the cdf is computed from the known pdf by using the inverse relationship
$$F_x(x_0) = \int_{-\infty}^{x_0} p_x(\xi)\, d\xi \qquad (2.3)$$
For simplicity, $F_x(x)$ is often denoted by $F(x)$ and $p_x(x)$ by $p(x)$, respectively. The subscript referring to the random variable in question must be used when confusion is possible.
Example 2.1 The gaussian (or normal) probability distribution is used in numerous
models and applications, for example to describe additive noise. Its density function
is given by
$$p_x(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - m)^2}{2\sigma^2} \right) \qquad (2.4)$$
Here the parameter $m$ (mean) determines the peak point of the symmetric density function, and $\sigma$ (standard deviation) its effective width (flatness or sharpness of the peak). See Figure 2.1 for an illustration.

Generally, the cdf of the gaussian density cannot be evaluated in closed form using (2.3). The term $1/\sqrt{2\pi\sigma^2}$ in front of the density (2.4) is a normalizing factor that guarantees that the cdf becomes unity when $x_0 \to \infty$. However, the values of the cdf can be computed numerically using, for example, tabulated values of the error function
$$\operatorname{erf}(x) = \frac{1}{\sqrt{2\pi}} \int_0^x \exp\left( -\frac{\xi^2}{2} \right) d\xi \qquad (2.5)$$
The error function is closely related to the cdf of a normalized gaussian density, for which the mean $m = 0$ and the variance $\sigma^2 = 1$. See [353] for details.
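As a practical illustration (our addition, not part of the text), the gaussian cdf can be evaluated with the error function available in Python's standard library. Note that `math.erf` uses the more common convention $\operatorname{erf}(x) = \tfrac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\, dt$, which differs from the scaled integral in (2.5) only by a change of variable:

```python
import math

def gaussian_cdf(x, m=0.0, sigma=1.0):
    """Cdf of a gaussian with mean m and standard deviation sigma.

    Uses the standard-library error function; the identity
    F(x) = 0.5 * (1 + erf((x - m) / (sigma * sqrt(2)))) follows from a
    change of variable in the integral of the density (2.4).
    """
    return 0.5 * (1.0 + math.erf((x - m) / (sigma * math.sqrt(2.0))))

# Sanity checks: F(m) = 0.5, and F grows towards 1.
print(gaussian_cdf(0.0))    # 0.5
print(gaussian_cdf(1.96))   # approximately 0.975 for m = 0, sigma = 1
```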
2.1.2 Distribution of a random vector
Assume now that $\mathbf{x}$ is an $n$-dimensional random vector
$$\mathbf{x} = (x_1, x_2, \ldots, x_n)^T \qquad (2.6)$$
where $^T$ denotes the transpose. (We take the transpose because all vectors in this book are column vectors. Note that vectors are denoted by boldface lowercase letters.) The components $x_1, x_2, \ldots, x_n$ of the column vector $\mathbf{x}$ are continuous random variables.
The concept of probability distribution generalizes easily to such a random vector.
In particular, the cumulative distribution function of $\mathbf{x}$ is defined by
$$F_{\mathbf{x}}(\mathbf{x}_0) = P(\mathbf{x} \le \mathbf{x}_0) \qquad (2.7)$$
where $P(\cdot)$ again denotes the probability of the event in parentheses, and $\mathbf{x}_0$ is some constant value of the random vector $\mathbf{x}$. The notation $\mathbf{x} \le \mathbf{x}_0$ means that each component of the vector $\mathbf{x}$ is less than or equal to the respective component of the vector $\mathbf{x}_0$. The multivariate cdf in Eq. (2.7) has similar properties to that of a single random variable. It is a nondecreasing function of each component, with values lying in the interval $0 \le F_{\mathbf{x}}(\mathbf{x}) \le 1$. When all the components of $\mathbf{x}$ approach infinity, $F_{\mathbf{x}}(\mathbf{x})$ achieves its upper limit $1$; when any component $x_i \to -\infty$, $F_{\mathbf{x}}(\mathbf{x}) = 0$.
The multivariate probability density function $p_{\mathbf{x}}(\mathbf{x})$ of $\mathbf{x}$ is defined as the derivative of the cumulative distribution function $F_{\mathbf{x}}(\mathbf{x})$ with respect to all components of the random vector $\mathbf{x}$:
$$p_{\mathbf{x}}(\mathbf{x}_0) = \left. \frac{\partial}{\partial x_1} \frac{\partial}{\partial x_2} \cdots \frac{\partial}{\partial x_n} F_{\mathbf{x}}(\mathbf{x}) \right|_{\mathbf{x} = \mathbf{x}_0} \qquad (2.8)$$
Hence
$$F_{\mathbf{x}}(\mathbf{x}_0) = \int_{-\infty}^{\mathbf{x}_0} p_{\mathbf{x}}(\mathbf{x})\, d\mathbf{x} = \int_{-\infty}^{x_{01}} \int_{-\infty}^{x_{02}} \cdots \int_{-\infty}^{x_{0n}} p_{\mathbf{x}}(\mathbf{x})\, dx_n \cdots dx_2\, dx_1 \qquad (2.9)$$
where $x_{0i}$ is the $i$th component of the vector $\mathbf{x}_0$. Clearly,
$$\int_{-\infty}^{+\infty} p_{\mathbf{x}}(\mathbf{x})\, d\mathbf{x} = 1 \qquad (2.10)$$
This provides the appropriate normalization condition that a true multivariate probability density $p_{\mathbf{x}}(\mathbf{x})$ must satisfy.
In many cases, random variables have nonzero probability density functions only
on certain finite intervals. An illustrative example of such a case is presented below.
Example 2.2 Assume that the probability density function of a two-dimensional random vector $\mathbf{z} = (x, y)^T$ is
$$p_{\mathbf{z}}(\mathbf{z}) = p_{xy}(x, y) = \begin{cases} \frac{3}{7}(2 - x)(x + y), & x \in [0, 2],\ y \in [0, 1] \\ 0, & \text{elsewhere} \end{cases}$$
Let us now compute the cumulative distribution function of $\mathbf{z}$. It is obtained by integrating over both $x$ and $y$, taking into account the limits of the regions where the density is nonzero. When either $x \le 0$ or $y \le 0$, the density $p_{\mathbf{z}}(\mathbf{z})$ and consequently also the cdf is zero. In the region where $0 < x \le 2$ and $0 < y \le 1$, the cdf is given by
$$F_{\mathbf{z}}(\mathbf{z}) = F_{xy}(x, y) = \int_0^y \int_0^x \frac{3}{7}(2 - \xi)(\xi + \eta)\, d\xi\, d\eta = \frac{3}{7}xy\left( x + y - \frac{1}{3}x^2 - \frac{1}{4}xy \right)$$
In the region where $0 < x \le 2$ and $y > 1$, the upper limit in integrating over $y$ becomes equal to 1, and the cdf is obtained by inserting $y = 1$ into the preceding expression. Similarly, in the region $x > 2$ and $0 < y \le 1$, the cdf is obtained by inserting $x = 2$ into the preceding formula. Finally, if both $x > 2$ and $y > 1$, the cdf becomes unity, showing that the probability density $p_{\mathbf{z}}(\mathbf{z})$ has been normalized correctly. Collecting these results yields
$$F_{\mathbf{z}}(\mathbf{z}) = \begin{cases} 0, & x \le 0 \ \text{or}\ y \le 0 \\ \frac{3}{7}xy\left( x + y - \frac{1}{3}x^2 - \frac{1}{4}xy \right), & 0 < x \le 2,\ 0 < y \le 1 \\ \frac{3}{7}x\left( 1 + \frac{3}{4}x - \frac{1}{3}x^2 \right), & 0 < x \le 2,\ y > 1 \\ \frac{6}{7}y\left( \frac{2}{3} + \frac{1}{2}y \right), & x > 2,\ 0 < y \le 1 \\ 1, & x > 2 \ \text{and}\ y > 1 \end{cases}$$
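As a quick cross-check of Example 2.2 (our addition, not from the book), the closed-form cdf can be compared with a simple midpoint Riemann-sum approximation of the double integral; the grid resolution is an arbitrary choice:

```python
import numpy as np

def pdf(x, y):
    """Joint density of Example 2.2 (zero outside [0,2] x [0,1])."""
    inside = (0.0 <= x) & (x <= 2.0) & (0.0 <= y) & (y <= 1.0)
    return np.where(inside, 3.0 / 7.0 * (2.0 - x) * (x + y), 0.0)

def cdf_numeric(x0, y0, n=1000):
    """Approximate F(x0, y0) by a midpoint Riemann sum over [0, x0] x [0, y0]."""
    if x0 <= 0.0 or y0 <= 0.0:
        return 0.0
    xs = (np.arange(n) + 0.5) * x0 / n
    ys = (np.arange(n) + 0.5) * y0 / n
    X, Y = np.meshgrid(xs, ys)
    return pdf(X, Y).sum() * (x0 / n) * (y0 / n)

def cdf_closed_form(x, y):
    """Closed-form cdf derived in Example 2.2, valid for 0 < x <= 2, 0 < y <= 1."""
    return 3.0 / 7.0 * x * y * (x + y - x**2 / 3.0 - x * y / 4.0)

print(cdf_numeric(1.0, 0.5), cdf_closed_form(1.0, 0.5))  # should agree closely
print(cdf_numeric(2.0, 1.0))                             # should be close to 1
```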
2.1.3 Joint and marginal distributions
The joint distribution of two different random vectors can be handled in a similar
manner. In particular, let
$\mathbf{y}$ be another random vector having in general a dimension $m$ different from the dimension $n$ of $\mathbf{x}$. The vectors $\mathbf{x}$ and $\mathbf{y}$ can be concatenated to
a "supervector" $\mathbf{z}^T = (\mathbf{x}^T, \mathbf{y}^T)$, and the preceding formulas used directly. The cdf that arises is called the joint distribution function of $\mathbf{x}$ and $\mathbf{y}$, and is given by
$$F_{\mathbf{x}\mathbf{y}}(\mathbf{x}_0, \mathbf{y}_0) = P(\mathbf{x} \le \mathbf{x}_0,\ \mathbf{y} \le \mathbf{y}_0) \qquad (2.11)$$
Here $\mathbf{x}_0$ and $\mathbf{y}_0$ are some constant vectors having the same dimensions as $\mathbf{x}$ and $\mathbf{y}$, respectively, and Eq. (2.11) defines the joint probability of the event $\mathbf{x} \le \mathbf{x}_0$ and $\mathbf{y} \le \mathbf{y}_0$.
The joint density function $p_{\mathbf{x}\mathbf{y}}(\mathbf{x}, \mathbf{y})$ of $\mathbf{x}$ and $\mathbf{y}$ is again defined formally by differentiating the joint distribution function $F_{\mathbf{x}\mathbf{y}}(\mathbf{x}, \mathbf{y})$ with respect to all components of the random vectors $\mathbf{x}$ and $\mathbf{y}$. Hence, the relationship
$$F_{\mathbf{x}\mathbf{y}}(\mathbf{x}_0, \mathbf{y}_0) = \int_{-\infty}^{\mathbf{x}_0} \int_{-\infty}^{\mathbf{y}_0} p_{\mathbf{x}\mathbf{y}}(\boldsymbol{\xi}, \boldsymbol{\eta})\, d\boldsymbol{\eta}\, d\boldsymbol{\xi} \qquad (2.12)$$
holds, and the value of this integral equals unity when both $\mathbf{x}_0 \to \infty$ and $\mathbf{y}_0 \to \infty$.
The marginal densities $p_{\mathbf{x}}(\mathbf{x})$ of $\mathbf{x}$ and $p_{\mathbf{y}}(\mathbf{y})$ of $\mathbf{y}$ are obtained by integrating over the other random vector in their joint density $p_{\mathbf{x}\mathbf{y}}(\mathbf{x}, \mathbf{y})$:
$$p_{\mathbf{x}}(\mathbf{x}) = \int_{-\infty}^{\infty} p_{\mathbf{x}\mathbf{y}}(\mathbf{x}, \boldsymbol{\eta})\, d\boldsymbol{\eta} \qquad (2.13)$$
$$p_{\mathbf{y}}(\mathbf{y}) = \int_{-\infty}^{\infty} p_{\mathbf{x}\mathbf{y}}(\boldsymbol{\xi}, \mathbf{y})\, d\boldsymbol{\xi} \qquad (2.14)$$
Example 2.3 Consider the joint density given in Example 2.2. The marginal densities of the random variables $x$ and $y$ are
$$p_x(x) = \int_0^1 \frac{3}{7}(2 - x)(x + y)\, dy = \begin{cases} \frac{3}{7}\left( 1 + \frac{3}{2}x - x^2 \right), & x \in [0, 2] \\ 0, & \text{elsewhere} \end{cases}$$
$$p_y(y) = \int_0^2 \frac{3}{7}(2 - x)(x + y)\, dx = \begin{cases} \frac{2}{7}(2 + 3y), & y \in [0, 1] \\ 0, & \text{elsewhere} \end{cases}$$
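Similarly, the marginal formulas of Example 2.3 can be cross-checked numerically (our addition, not from the book); here only $p_x$ is checked at one arbitrary test point:

```python
import numpy as np

def pdf(x, y):
    """Joint density of Example 2.2 (zero outside [0,2] x [0,1])."""
    inside = (0.0 <= x) & (x <= 2.0) & (0.0 <= y) & (y <= 1.0)
    return np.where(inside, 3.0 / 7.0 * (2.0 - x) * (x + y), 0.0)

# Marginal of x at a test point, by a midpoint Riemann sum over y in [0, 1].
x0, n = 0.8, 10000
ys = (np.arange(n) + 0.5) / n
marginal_numeric = pdf(x0, ys).sum() / n
marginal_closed = 3.0 / 7.0 * (1.0 + 1.5 * x0 - x0**2)  # formula from Example 2.3
print(marginal_numeric, marginal_closed)                 # should agree closely
```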
2.2 EXPECTATIONS AND MOMENTS
2.2.1 Definition and general properties
In practice, the exact probability density function of a vector or scalar valued random
variable is usually unknown. However, one can use instead expectations of some
functions of that random variable for performing useful analyses and processing. A
great advantage of expectations is that they can be estimated directly from the data,
even though they are formally defined in terms of the density function.
Let $\mathbf{g}(\mathbf{x})$ denote any quantity derived from the random vector $\mathbf{x}$. The quantity $\mathbf{g}(\mathbf{x})$ may be either a scalar, vector, or even a matrix. The expectation of $\mathbf{g}(\mathbf{x})$ is denoted by $\mathrm{E}\{\mathbf{g}(\mathbf{x})\}$, and is defined by
$$\mathrm{E}\{\mathbf{g}(\mathbf{x})\} = \int_{-\infty}^{\infty} \mathbf{g}(\mathbf{x})\, p_{\mathbf{x}}(\mathbf{x})\, d\mathbf{x} \qquad (2.15)$$
Here the integral is computed over all the components of $\mathbf{x}$. The integration operation is applied separately to every component of the vector or element of the matrix, yielding as a result another vector or matrix of the same size. If $\mathbf{g}(\mathbf{x}) = \mathbf{x}$, we get the expectation $\mathrm{E}\{\mathbf{x}\}$ of $\mathbf{x}$; this is discussed in more detail in the next subsection.
Expectations have some important fundamental properties.
1. Linearity. Let $\mathbf{x}_i$, $i = 1, \ldots, m$, be a set of different random vectors, and $a_i$, $i = 1, \ldots, m$, some nonrandom scalar coefficients. Then
$$\mathrm{E}\left\{ \sum_{i=1}^{m} a_i \mathbf{x}_i \right\} = \sum_{i=1}^{m} a_i\, \mathrm{E}\{\mathbf{x}_i\} \qquad (2.16)$$
2. Linear transformation. Let $\mathbf{x}$ be an $m$-dimensional random vector, and $\mathbf{A}$ and $\mathbf{B}$ some nonrandom $k \times m$ and $m \times l$ matrices, respectively. Then
$$\mathrm{E}\{\mathbf{A}\mathbf{x}\} = \mathbf{A}\,\mathrm{E}\{\mathbf{x}\}, \qquad \mathrm{E}\{\mathbf{x}^T\mathbf{B}\} = \mathrm{E}\{\mathbf{x}^T\}\mathbf{B} \qquad (2.17)$$
3. Transformation invariance. Let $\mathbf{y} = \mathbf{g}(\mathbf{x})$ be a vector-valued function of the random vector $\mathbf{x}$. Then
$$\int_{-\infty}^{\infty} \mathbf{y}\, p_{\mathbf{y}}(\mathbf{y})\, d\mathbf{y} = \int_{-\infty}^{\infty} \mathbf{g}(\mathbf{x})\, p_{\mathbf{x}}(\mathbf{x})\, d\mathbf{x} \qquad (2.18)$$
Thus $\mathrm{E}\{\mathbf{y}\} = \mathrm{E}\{\mathbf{g}(\mathbf{x})\}$, even though the integrations are carried out over different probability density functions.
These properties can be proved using the definition of the expectation operator
and properties of probability density functions. They are important and very helpful
in practice, allowing expressions containing expectations to be simplified without
actually needing to compute any integrals (except possibly in the last phase).
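As a quick sanity check of these properties (our addition, not part of the text), the following sketch replaces exact expectations by sample averages; the distribution, sample size, and matrix are arbitrary choices for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw K samples of an m-dimensional random vector x (any distribution will do).
K, m, k = 100000, 3, 2
x = rng.exponential(scale=2.0, size=(K, m))   # samples stored as rows

A = rng.normal(size=(k, m))                   # nonrandom k x m matrix

E_x = x.mean(axis=0)                          # sample estimate of E{x}
E_Ax = (x @ A.T).mean(axis=0)                 # sample estimate of E{Ax}

# Property 2: E{Ax} = A E{x}; the two estimates agree up to sampling error.
print(E_Ax)
print(A @ E_x)
```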
2.2.2 Mean vector and correlation matrix
Moments of a random vector $\mathbf{x}$ are typical expectations used to characterize it. They are obtained when $\mathbf{g}(\mathbf{x})$ consists of products of components of $\mathbf{x}$. In particular, the
first moment of a random vector $\mathbf{x}$ is called the mean vector $\mathbf{m}_{\mathbf{x}}$ of $\mathbf{x}$. It is defined as the expectation of $\mathbf{x}$:
$$\mathbf{m}_{\mathbf{x}} = \mathrm{E}\{\mathbf{x}\} = \int_{-\infty}^{\infty} \mathbf{x}\, p_{\mathbf{x}}(\mathbf{x})\, d\mathbf{x} \qquad (2.19)$$
Each component $m_{x_i}$ of the $n$-vector $\mathbf{m}_{\mathbf{x}}$ is given by
$$m_{x_i} = \mathrm{E}\{x_i\} = \int_{-\infty}^{\infty} x_i\, p_{\mathbf{x}}(\mathbf{x})\, d\mathbf{x} = \int_{-\infty}^{\infty} x_i\, p_{x_i}(x_i)\, dx_i \qquad (2.20)$$
where $p_{x_i}(x_i)$ is the marginal density of the $i$th component $x_i$ of $\mathbf{x}$. This is because integrals over all the other components of $\mathbf{x}$ reduce to unity due to the definition of the marginal density.
Another important set of moments consists of correlations between pairs of components of $\mathbf{x}$. The correlation $r_{ij}$ between the $i$th and $j$th components of $\mathbf{x}$ is given by the second moment
$$r_{ij} = \mathrm{E}\{x_i x_j\} = \int_{-\infty}^{\infty} x_i x_j\, p_{\mathbf{x}}(\mathbf{x})\, d\mathbf{x} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x_i x_j\, p_{x_i x_j}(x_i, x_j)\, dx_j\, dx_i \qquad (2.21)$$
Note that correlation can be negative or positive.
The $n \times n$ correlation matrix
$$\mathbf{R}_{\mathbf{x}} = \mathrm{E}\{\mathbf{x}\mathbf{x}^T\} \qquad (2.22)$$
of the vector $\mathbf{x}$ represents in a convenient form all its correlations, $r_{ij}$ being the element in row $i$ and column $j$ of $\mathbf{R}_{\mathbf{x}}$.
The correlation matrix $\mathbf{R}_{\mathbf{x}}$ has some important properties:

1. It is a symmetric matrix: $\mathbf{R}_{\mathbf{x}} = \mathbf{R}_{\mathbf{x}}^T$.

2. It is positive semidefinite:
$$\mathbf{a}^T \mathbf{R}_{\mathbf{x}} \mathbf{a} \ge 0 \qquad (2.23)$$
for all $n$-vectors $\mathbf{a}$. Usually in practice $\mathbf{R}_{\mathbf{x}}$ is positive definite, meaning that for any vector $\mathbf{a} \ne \mathbf{0}$, (2.23) holds as a strict inequality.

3. All the eigenvalues of $\mathbf{R}_{\mathbf{x}}$ are real and nonnegative (positive if $\mathbf{R}_{\mathbf{x}}$ is positive definite). Furthermore, all the eigenvectors of $\mathbf{R}_{\mathbf{x}}$ are real, and can always be chosen so that they are mutually orthonormal.
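The following short numerical illustration (an addition, not from the book) estimates a correlation matrix from samples and checks the three properties above; the data-generating choices are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

# K samples of a 4-dimensional random vector, stored as rows of X.
K, n = 50000, 4
X = rng.standard_t(df=5, size=(K, n)) @ rng.normal(size=(n, n))

R = (X.T @ X) / K                  # sample estimate of R_x = E{x x^T}

print(np.allclose(R, R.T))         # property 1: symmetry
print(np.linalg.eigvalsh(R))       # property 3: real and nonnegative eigenvalues
a = rng.normal(size=n)
print(a @ R @ a >= 0)              # property 2: positive semidefinite
```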
Higher-order moments can be defined analogously, but their discussion is postponed to Section 2.7. Instead, we shall first consider the corresponding central and second-order moments for two different random vectors.
2.2.3 Covariances and joint moments
Central moments are defined in a similar fashion to usual moments, but the mean vectors of the random vectors involved are subtracted prior to computing the expectation. Clearly, central moments are only meaningful above the first order. The quantity corresponding to the correlation matrix $\mathbf{R}_{\mathbf{x}}$ is called the covariance matrix $\mathbf{C}_{\mathbf{x}}$ of $\mathbf{x}$, and is given by
$$\mathbf{C}_{\mathbf{x}} = \mathrm{E}\{(\mathbf{x} - \mathbf{m}_{\mathbf{x}})(\mathbf{x} - \mathbf{m}_{\mathbf{x}})^T\} \qquad (2.24)$$
The elements
$$c_{ij} = \mathrm{E}\{(x_i - m_i)(x_j - m_j)\} \qquad (2.25)$$
of the $n \times n$ matrix $\mathbf{C}_{\mathbf{x}}$ are called covariances, and they are the central moments corresponding to the correlations$^1$ $r_{ij}$ defined in Eq. (2.21).
The covariance matrix $\mathbf{C}_{\mathbf{x}}$ satisfies the same properties as the correlation matrix $\mathbf{R}_{\mathbf{x}}$. Using the properties of the expectation operator, it is easy to see that
$$\mathbf{R}_{\mathbf{x}} = \mathbf{C}_{\mathbf{x}} + \mathbf{m}_{\mathbf{x}} \mathbf{m}_{\mathbf{x}}^T \qquad (2.26)$$
If the mean vector $\mathbf{m}_{\mathbf{x}} = \mathbf{0}$, the correlation and covariance matrices become the same. If necessary, the data can easily be made zero mean by subtracting the (estimated) mean vector from the data vectors as a preprocessing step. This is a usual practice in independent component analysis, and thus in later chapters, we simply denote by $\mathbf{C}_{\mathbf{x}}$ the correlation/covariance matrix, often even dropping the subscript $\mathbf{x}$ for simplicity.
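A minimal numerical check of relation (2.26) and of the effect of removing the mean, added here for illustration; the distribution parameters are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

K, n = 100000, 3
X = rng.normal(loc=[1.0, -2.0, 0.5], scale=[1.0, 2.0, 0.3], size=(K, n))

m_hat = X.mean(axis=0)                      # estimated mean vector
R_hat = (X.T @ X) / K                       # estimated correlation matrix E{x x^T}
C_hat = np.cov(X, rowvar=False, bias=True)  # estimated covariance matrix

# Relation (2.26) holds for the sample moments as well: R = C + m m^T.
print(np.allclose(R_hat, C_hat + np.outer(m_hat, m_hat)))

# After centering the data, correlation and covariance matrices coincide.
Xc = X - m_hat
print(np.allclose((Xc.T @ Xc) / K, C_hat))
```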
For a single random variable $x$, the mean vector reduces to its mean value $m_x = \mathrm{E}\{x\}$, the correlation matrix to the second moment $\mathrm{E}\{x^2\}$, and the covariance matrix to the variance of $x$
$$\sigma_x^2 = \mathrm{E}\{(x - m_x)^2\} \qquad (2.27)$$
The relationship (2.26) then takes the simple form $\mathrm{E}\{x^2\} = \sigma_x^2 + m_x^2$.
The expectation operation can be extended to functions $\mathbf{g}(\mathbf{x}, \mathbf{y})$ of two different random vectors $\mathbf{x}$ and $\mathbf{y}$ in terms of their joint density:
$$\mathrm{E}\{\mathbf{g}(\mathbf{x}, \mathbf{y})\} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \mathbf{g}(\mathbf{x}, \mathbf{y})\, p_{\mathbf{x}\mathbf{y}}(\mathbf{x}, \mathbf{y})\, d\mathbf{y}\, d\mathbf{x} \qquad (2.28)$$
The integrals are computed over all the components of $\mathbf{x}$ and $\mathbf{y}$.
Of the joint expectations, the most widely used are the cross-correlation matrix
$$\mathbf{R}_{\mathbf{x}\mathbf{y}} = \mathrm{E}\{\mathbf{x}\mathbf{y}^T\} \qquad (2.29)$$
and the cross-covariance matrix
$$\mathbf{C}_{\mathbf{x}\mathbf{y}} = \mathrm{E}\{(\mathbf{x} - \mathbf{m}_{\mathbf{x}})(\mathbf{y} - \mathbf{m}_{\mathbf{y}})^T\} \qquad (2.30)$$

$^1$ In classic statistics, the correlation coefficients $\rho_{ij} = c_{ij}(c_{ii}c_{jj})^{-1/2}$ are used, and the matrix consisting of them is called the correlation matrix. In this book, the correlation matrix is defined by the formula (2.22), which is a common practice in signal processing, neural networks, and engineering.

Fig. 2.2 An example of negative covariance between the random variables $x$ and $y$.

Fig. 2.3 An example of zero covariance between the random variables $x$ and $y$.
Note that the dimensions of the vectors $\mathbf{x}$ and $\mathbf{y}$ can be different. Hence, the cross-correlation and -covariance matrices are not necessarily square matrices, and they are not symmetric in general. However, from their definitions it follows easily that
$$\mathbf{R}_{\mathbf{x}\mathbf{y}} = \mathbf{R}_{\mathbf{y}\mathbf{x}}^T, \qquad \mathbf{C}_{\mathbf{x}\mathbf{y}} = \mathbf{C}_{\mathbf{y}\mathbf{x}}^T \qquad (2.31)$$
If the mean vectors of $\mathbf{x}$ and $\mathbf{y}$ are zero, the cross-correlation and cross-covariance matrices become the same. The covariance matrix $\mathbf{C}_{\mathbf{x}+\mathbf{y}}$ of the sum of two random vectors $\mathbf{x}$ and $\mathbf{y}$ having the same dimension is often needed in practice. It is easy to see that
$$\mathbf{C}_{\mathbf{x}+\mathbf{y}} = \mathbf{C}_{\mathbf{x}} + \mathbf{C}_{\mathbf{x}\mathbf{y}} + \mathbf{C}_{\mathbf{y}\mathbf{x}} + \mathbf{C}_{\mathbf{y}} \qquad (2.32)$$
Correlations and covariances measure the dependence between the random variables using their second-order statistics. This is illustrated by the following example.

Example 2.4 Consider the two different joint distributions $p_{xy}(x, y)$ of the zero-mean scalar random variables $x$ and $y$ shown in Figs. 2.2 and 2.3. In Fig. 2.2, $x$ and $y$ have a clear negative covariance (or correlation). A positive value of $x$ mostly implies that $y$ is negative, and vice versa. On the other hand, in the case of Fig. 2.3, it is not possible to infer anything about the value of $y$ by observing $x$. Hence, their covariance $c_{xy} \approx 0$.
2.2.4 Estimation of expectations
Usually the probability density of a random vector $\mathbf{x}$ is not known, but there is often available a set of $K$ samples $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_K$ from $\mathbf{x}$. Using them, the expectation (2.15) can be estimated by averaging over the sample using the formula [419]
$$\mathrm{E}\{\mathbf{g}(\mathbf{x})\} \approx \frac{1}{K} \sum_{j=1}^{K} \mathbf{g}(\mathbf{x}_j) \qquad (2.33)$$
For example, applying (2.33), we get for the mean vector $\mathbf{m}_{\mathbf{x}}$ of $\mathbf{x}$ its standard estimator, the sample mean
$$\hat{\mathbf{m}}_{\mathbf{x}} = \frac{1}{K} \sum_{j=1}^{K} \mathbf{x}_j \qquad (2.34)$$
where the hat over $\mathbf{m}$ is a standard notation for an estimator of a quantity.
Similarly, if instead of the joint density $p_{\mathbf{x}\mathbf{y}}(\mathbf{x}, \mathbf{y})$ of the random vectors $\mathbf{x}$ and $\mathbf{y}$, we know $K$ sample pairs $(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_K, \mathbf{y}_K)$, we can estimate the expectation (2.28) by
$$\mathrm{E}\{\mathbf{g}(\mathbf{x}, \mathbf{y})\} \approx \frac{1}{K} \sum_{j=1}^{K} \mathbf{g}(\mathbf{x}_j, \mathbf{y}_j) \qquad (2.35)$$
For example, for the cross-correlation matrix, this yields the estimation formula
$$\hat{\mathbf{R}}_{\mathbf{x}\mathbf{y}} = \frac{1}{K} \sum_{j=1}^{K} \mathbf{x}_j \mathbf{y}_j^T \qquad (2.36)$$
Similar formulas are readily obtained for the other correlation-type matrices $\mathbf{R}_{\mathbf{x}\mathbf{x}}$, $\mathbf{C}_{\mathbf{x}\mathbf{x}}$, and $\mathbf{C}_{\mathbf{x}\mathbf{y}}$.
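The estimators (2.34) and (2.36) translate directly into code. The sketch below (not from the book) stores samples as rows of arrays; the way $\mathbf{y}$ is generated is an arbitrary assumption for the illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# K paired samples of x (3-dimensional) and y (2-dimensional), stored as rows.
K = 20000
x = rng.normal(size=(K, 3))
y = x[:, :2] + 0.1 * rng.normal(size=(K, 2))   # y is correlated with x

m_hat = x.mean(axis=0)                 # sample mean, Eq. (2.34)
R_xy_hat = (x.T @ y) / K               # cross-correlation estimate, Eq. (2.36)

print(m_hat)
print(R_xy_hat)
```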
2.3 UNCORRELATEDNESS AND INDEPENDENCE
2.3.1 Uncorrelatedness and whiteness
Two random vectors $\mathbf{x}$ and $\mathbf{y}$ are uncorrelated if their cross-covariance matrix $\mathbf{C}_{\mathbf{x}\mathbf{y}}$ is a zero matrix:
$$\mathbf{C}_{\mathbf{x}\mathbf{y}} = \mathrm{E}\{(\mathbf{x} - \mathbf{m}_{\mathbf{x}})(\mathbf{y} - \mathbf{m}_{\mathbf{y}})^T\} = \mathbf{0} \qquad (2.37)$$
This is equivalent to the condition
$$\mathbf{R}_{\mathbf{x}\mathbf{y}} = \mathrm{E}\{\mathbf{x}\mathbf{y}^T\} = \mathrm{E}\{\mathbf{x}\}\mathrm{E}\{\mathbf{y}^T\} = \mathbf{m}_{\mathbf{x}} \mathbf{m}_{\mathbf{y}}^T \qquad (2.38)$$
In the special case of two different scalar random variables $x$ and $y$ (for example, two components of a random vector $\mathbf{z}$), $x$ and $y$ are uncorrelated if their covariance $c_{xy}$ is zero:
$$c_{xy} = \mathrm{E}\{(x - m_x)(y - m_y)\} = 0 \qquad (2.39)$$
or equivalently
$$r_{xy} = \mathrm{E}\{xy\} = \mathrm{E}\{x\}\mathrm{E}\{y\} = m_x m_y \qquad (2.40)$$
Again, in the case of zero-mean variables, zero covariance is equivalent to zero correlation.
Another important special case concerns the correlations between the components of a single random vector $\mathbf{x}$, given by the covariance matrix $\mathbf{C}_{\mathbf{x}}$ defined in (2.24). In this case a condition equivalent to (2.37) can never be met, because each component of $\mathbf{x}$ is perfectly correlated with itself. The best that we can achieve is that different components of $\mathbf{x}$ are mutually uncorrelated, leading to the uncorrelatedness condition
$$\mathbf{C}_{\mathbf{x}} = \mathrm{E}\{(\mathbf{x} - \mathbf{m}_{\mathbf{x}})(\mathbf{x} - \mathbf{m}_{\mathbf{x}})^T\} = \mathbf{D} \qquad (2.41)$$
Here $\mathbf{D}$ is an $n \times n$ diagonal matrix
$$\mathbf{D} = \operatorname{diag}(c_{11}, c_{22}, \ldots, c_{nn}) = \operatorname{diag}(\sigma_{x_1}^2, \sigma_{x_2}^2, \ldots, \sigma_{x_n}^2) \qquad (2.42)$$
(2.42)
whose
n
diagonal elements are the variances
2
x
i
=E
f(x
i
m
x
i
)
2
g
=
c
ii
of the
components
x
i
of
x
.
In particular, random vectors having zero mean and unit covariance (and hence correlation) matrix, possibly multiplied by a constant variance $\sigma^2$, are said to be white. Thus white random vectors satisfy the conditions
$$\mathbf{m}_{\mathbf{x}} = \mathbf{0}, \qquad \mathbf{R}_{\mathbf{x}} = \mathbf{C}_{\mathbf{x}} = \mathbf{I} \qquad (2.43)$$
where $\mathbf{I}$ is the $n \times n$ identity matrix.
Assume now that an orthogonal transformation defined by an $n \times n$ matrix $\mathbf{T}$ is applied to the random vector $\mathbf{x}$. Mathematically, this can be expressed as
$$\mathbf{y} = \mathbf{T}\mathbf{x}, \qquad \text{where } \mathbf{T}^T\mathbf{T} = \mathbf{T}\mathbf{T}^T = \mathbf{I} \qquad (2.44)$$
An orthogonal matrix $\mathbf{T}$ defines a rotation (change of coordinate axes) in the $n$-dimensional space, preserving norms and distances. Assuming that $\mathbf{x}$ is white, we get
$$\mathbf{m}_{\mathbf{y}} = \mathrm{E}\{\mathbf{T}\mathbf{x}\} = \mathbf{T}\mathrm{E}\{\mathbf{x}\} = \mathbf{T}\mathbf{m}_{\mathbf{x}} = \mathbf{0} \qquad (2.45)$$
and
$$\mathbf{C}_{\mathbf{y}} = \mathbf{R}_{\mathbf{y}} = \mathrm{E}\{\mathbf{T}\mathbf{x}(\mathbf{T}\mathbf{x})^T\} = \mathbf{T}\mathrm{E}\{\mathbf{x}\mathbf{x}^T\}\mathbf{T}^T = \mathbf{T}\mathbf{R}_{\mathbf{x}}\mathbf{T}^T = \mathbf{T}\mathbf{T}^T = \mathbf{I} \qquad (2.46)$$
showing that $\mathbf{y}$ is white, too. Hence we can conclude that the whiteness property is preserved under orthogonal transformations. In fact, whitening of the original data can be done in infinitely many ways. Whitening will be discussed in more detail in Chapter 6, because it is a highly useful and widely used preprocessing step in independent component analysis.

It is clear that there also exist infinitely many ways to decorrelate the original data, because whiteness is a special case of the uncorrelatedness property.
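As an illustration (our addition, not from the book), the sketch below whitens correlated data using the eigendecomposition of its covariance matrix, one of the infinitely many possible whitening transforms, and then checks that a random orthogonal rotation leaves the data white, as in (2.45)-(2.46). The mixing matrix and sample size are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Correlated data: K samples of a 3-dimensional vector, stored as rows.
K = 100000
X = rng.normal(size=(K, 3)) @ np.array([[2.0, 0.0, 0.0],
                                        [1.0, 1.0, 0.0],
                                        [0.5, 0.3, 0.5]])

# Whitening via the eigendecomposition C = E D E^T: use V = D^{-1/2} E^T.
C = np.cov(X, rowvar=False, bias=True)
d, E = np.linalg.eigh(C)
V = np.diag(1.0 / np.sqrt(d)) @ E.T
Z = X @ V.T                                   # whitened data

print(np.round(np.cov(Z, rowvar=False, bias=True), 2))  # approximately I

# Any orthogonal rotation of white data stays white, cf. Eq. (2.46).
T, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
Y = Z @ T.T
print(np.round(np.cov(Y, rowvar=False, bias=True), 2))  # still approximately I
```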
Example 2.5 Consider the linear signal model
$$\mathbf{x} = \mathbf{A}\mathbf{s} + \mathbf{n} \qquad (2.47)$$
where $\mathbf{x}$ is an $n$-dimensional random or data vector, $\mathbf{A}$ an $n \times m$ constant matrix, $\mathbf{s}$ an $m$-dimensional random signal vector, and $\mathbf{n}$ an $n$-dimensional random vector that usually describes additive noise. The correlation matrix of $\mathbf{x}$ then becomes
$$\begin{aligned}
\mathbf{R}_{\mathbf{x}} &= \mathrm{E}\{\mathbf{x}\mathbf{x}^T\} = \mathrm{E}\{(\mathbf{A}\mathbf{s} + \mathbf{n})(\mathbf{A}\mathbf{s} + \mathbf{n})^T\} \\
&= \mathrm{E}\{\mathbf{A}\mathbf{s}\mathbf{s}^T\mathbf{A}^T\} + \mathrm{E}\{\mathbf{A}\mathbf{s}\mathbf{n}^T\} + \mathrm{E}\{\mathbf{n}\mathbf{s}^T\mathbf{A}^T\} + \mathrm{E}\{\mathbf{n}\mathbf{n}^T\} \\
&= \mathbf{A}\mathrm{E}\{\mathbf{s}\mathbf{s}^T\}\mathbf{A}^T + \mathbf{A}\mathrm{E}\{\mathbf{s}\mathbf{n}^T\} + \mathrm{E}\{\mathbf{n}\mathbf{s}^T\}\mathbf{A}^T + \mathrm{E}\{\mathbf{n}\mathbf{n}^T\} \\
&= \mathbf{A}\mathbf{R}_{\mathbf{s}}\mathbf{A}^T + \mathbf{A}\mathbf{R}_{\mathbf{s}\mathbf{n}} + \mathbf{R}_{\mathbf{n}\mathbf{s}}\mathbf{A}^T + \mathbf{R}_{\mathbf{n}}
\end{aligned} \qquad (2.48)$$
Usually the noise vector $\mathbf{n}$ is assumed to have zero mean, and to be uncorrelated with the signal vector $\mathbf{s}$. Then the cross-correlation matrix between the signal and noise vectors vanishes:
$$\mathbf{R}_{\mathbf{s}\mathbf{n}} = \mathrm{E}\{\mathbf{s}\mathbf{n}^T\} = \mathrm{E}\{\mathbf{s}\}\mathrm{E}\{\mathbf{n}^T\} = \mathbf{0} \qquad (2.49)$$
Similarly, $\mathbf{R}_{\mathbf{n}\mathbf{s}} = \mathbf{0}$, and the correlation matrix of $\mathbf{x}$ simplifies to
$$\mathbf{R}_{\mathbf{x}} = \mathbf{A}\mathbf{R}_{\mathbf{s}}\mathbf{A}^T + \mathbf{R}_{\mathbf{n}} \qquad (2.50)$$
Another often-made assumption is that the noise is white, which means here that the components of the noise vector $\mathbf{n}$ are all uncorrelated and have equal variance $\sigma^2$, so that in (2.50)
$$\mathbf{R}_{\mathbf{n}} = \sigma^2 \mathbf{I} \qquad (2.51)$$
Sometimes, for example in a noisy version of the ICA model (Chapter 15), the components of the signal vector $\mathbf{s}$ are also mutually uncorrelated, so that the signal correlation matrix becomes the diagonal matrix
$$\mathbf{D}_{\mathbf{s}} = \operatorname{diag}(\mathrm{E}\{s_1^2\}, \mathrm{E}\{s_2^2\}, \ldots, \mathrm{E}\{s_m^2\}) \qquad (2.52)$$
where $s_1, s_2, \ldots, s_m$ are the components of the signal vector $\mathbf{s}$. Then (2.50) can be written in the form
$$\mathbf{R}_{\mathbf{x}} = \mathbf{A}\mathbf{D}_{\mathbf{s}}\mathbf{A}^T + \sigma^2\mathbf{I} = \sum_{i=1}^{m} \mathrm{E}\{s_i^2\}\, \mathbf{a}_i \mathbf{a}_i^T + \sigma^2\mathbf{I} \qquad (2.53)$$
where $\mathbf{a}_i$ is the $i$th column vector of the matrix $\mathbf{A}$.
The noisy linear signal or data model (2.47) is encountered frequently in signal processing and other areas, and the assumptions made on $\mathbf{s}$ and $\mathbf{n}$ vary depending on the problem at hand. It is straightforward to see that the results derived in this example hold for the respective covariance matrices as well.
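A small simulation of the model (2.47), added for illustration (not from the book); the dimensions, mixing matrix, signal variances, and noise level are arbitrary assumptions. It compares the estimated correlation matrix of $\mathbf{x}$ with the theoretical expression (2.53):

```python
import numpy as np

rng = np.random.default_rng(5)

# Model parameters (arbitrary choices for the illustration).
n, m, K, sigma = 5, 3, 200000, 0.3
A = rng.normal(size=(n, m))
signal_vars = np.array([1.0, 2.0, 0.5])            # E{s_i^2} for zero-mean s_i

# Zero-mean, mutually uncorrelated signals and white gaussian noise.
S = rng.normal(size=(K, m)) * np.sqrt(signal_vars)
N = sigma * rng.normal(size=(K, n))
X = S @ A.T + N                                    # x = A s + n, samples as rows

R_x_hat = (X.T @ X) / K                            # estimated correlation matrix
R_x_theory = A @ np.diag(signal_vars) @ A.T + sigma**2 * np.eye(n)  # Eq. (2.53)

print(np.max(np.abs(R_x_hat - R_x_theory)))        # small, up to sampling error
```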
2.3.2 Statistical independence
A key concept that constitutes the foundation of independent component analysis is
statistical independence. For simplicity, consider first the case of two different scalar random variables $x$ and $y$. The random variable $x$ is independent of $y$ if knowing the value of $y$ does not give any information on the value of $x$. For example, $x$ and $y$ can be outcomes of two events that have nothing to do with each other, or random signals originating from two quite different physical processes that are in no way related to each other. Examples of such independent random variables are the value of a die thrown and of a coin tossed, or a speech signal and background noise originating from a ventilation system at a certain time instant.
Mathematically, statistical independence is defined in terms of probability densities. The random variables $x$ and $y$ are said to be independent if and only if
$$p_{xy}(x, y) = p_x(x) p_y(y) \qquad (2.54)$$
In words, the joint density $p_{xy}(x, y)$ of $x$ and $y$ must factorize into the product of their marginal densities $p_x(x)$ and $p_y(y)$. Equivalently, independence could be defined by replacing the probability density functions in the definition (2.54) by the respective cumulative distribution functions, which must also be factorizable.
Independent random variables satisfy the basic property
$$\mathrm{E}\{g(x)h(y)\} = \mathrm{E}\{g(x)\}\mathrm{E}\{h(y)\} \qquad (2.55)$$
where $g(x)$ and $h(y)$ are any absolutely integrable functions of $x$ and $y$, respectively. This is because
$$\mathrm{E}\{g(x)h(y)\} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x)h(y)\, p_{xy}(x, y)\, dy\, dx \qquad (2.56)$$
$$= \int_{-\infty}^{\infty} g(x)p_x(x)\, dx \int_{-\infty}^{\infty} h(y)p_y(y)\, dy = \mathrm{E}\{g(x)\}\mathrm{E}\{h(y)\}$$
Equation (2.55) reveals that statistical independence is a much stronger property than
uncorrelatedness. Equation (2.40), defining uncorrelatedness, is obtained from the
independence property (2.55) as a special case where both $g(x)$ and $h(y)$ are linear functions, and takes into account second-order statistics (correlations or covariances) only. However, if the random variables have gaussian distributions, independence and uncorrelatedness become the same thing. This very special property of gaussian distributions will be discussed in more detail in Section 2.5.
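A classic illustration of this difference (our addition, not from the text): if $x$ is zero-mean gaussian and $y = x^2 - 1$, then $x$ and $y$ are uncorrelated but clearly dependent, and the factorization (2.55) fails for nonlinear functions:

```python
import numpy as np

rng = np.random.default_rng(6)

x = rng.normal(size=1000000)
y = x**2 - 1.0          # zero mean, completely determined by x

# Second-order statistics: x and y are uncorrelated, cf. Eq. (2.40).
print(np.mean(x * y))                            # close to 0

# But they are not independent: Eq. (2.55) fails for nonlinear g and h.
g, h = x**2, y
print(np.mean(g * h), np.mean(g) * np.mean(h))   # roughly 2 versus 0
```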
Definition (2.54) of independence generalizes in a natural way for more than two random variables, and for random vectors. Let $\mathbf{x}, \mathbf{y}, \mathbf{z}, \ldots$ be random vectors
which may in general have different dimensions. The independence condition for $\mathbf{x}, \mathbf{y}, \mathbf{z}, \ldots$ is then
$$p_{\mathbf{x}\mathbf{y}\mathbf{z}\ldots}(\mathbf{x}, \mathbf{y}, \mathbf{z}, \ldots) = p_{\mathbf{x}}(\mathbf{x}) p_{\mathbf{y}}(\mathbf{y}) p_{\mathbf{z}}(\mathbf{z}) \cdots \qquad (2.57)$$
and the basic property (2.55) generalizes to
$$\mathrm{E}\{g_{\mathbf{x}}(\mathbf{x}) g_{\mathbf{y}}(\mathbf{y}) g_{\mathbf{z}}(\mathbf{z}) \cdots\} = \mathrm{E}\{g_{\mathbf{x}}(\mathbf{x})\} \mathrm{E}\{g_{\mathbf{y}}(\mathbf{y})\} \mathrm{E}\{g_{\mathbf{z}}(\mathbf{z})\} \cdots \qquad (2.58)$$
where $g_{\mathbf{x}}(\mathbf{x})$, $g_{\mathbf{y}}(\mathbf{y})$, and $g_{\mathbf{z}}(\mathbf{z})$ are arbitrary functions of the random variables $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$ for which the expectations in (2.58) exist.
The general definition (2.57) gives rise to a generalization of the standard notion of statistical independence. The components of the random vector $\mathbf{x}$ are themselves scalar random variables, and the same holds for $\mathbf{y}$ and $\mathbf{z}$. Clearly, the components of $\mathbf{x}$ can be mutually dependent, while they are independent with respect to the components of the other random vectors $\mathbf{y}$ and $\mathbf{z}$, and (2.57) still holds. A similar argument applies to the random vectors $\mathbf{y}$ and $\mathbf{z}$.
Example 2.6 First consider the random variables $x$ and $y$ discussed in Examples 2.2 and 2.3. The joint density of $x$ and $y$, reproduced here for convenience,
$$p_{xy}(x, y) = \begin{cases} \frac{3}{7}(2 - x)(x + y), & x \in [0, 2],\ y \in [0, 1] \\ 0, & \text{elsewhere} \end{cases}$$
is not equal to the product of their marginal densities $p_x(x)$ and $p_y(y)$ computed in Example 2.3. Hence, Eq. (2.54) is not satisfied, and we conclude that $x$ and $y$ are not independent. Actually, this can be seen directly by observing that the joint density $p_{xy}(x, y)$ given above is not factorizable, since it cannot be written as a product of two functions $g(x)$ and $h(y)$ depending only on $x$ and $y$, respectively.

Consider then the joint density of a two-dimensional random vector $\mathbf{x} = (x_1, x_2)^T$ and a one-dimensional random vector $\mathbf{y} = y$ given by [419]
$$p_{\mathbf{x}y}(\mathbf{x}, y) = \begin{cases} (x_1 + 3x_2)y, & x_1, x_2 \in [0, 1],\ y \in [0, 1] \\ 0, & \text{elsewhere} \end{cases}$$
Using the above argument, we see that the random vectors $\mathbf{x}$ and $\mathbf{y}$ are statistically independent, but the components $x_1$ and $x_2$ of $\mathbf{x}$ are not independent. The exact verification of these results is left as an exercise.
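The verification left as an exercise can also be sketched numerically (our addition, not from the book): discretize the unit cube, compute the marginals by Riemann sums, and test whether the joint density factorizes. The grid resolution is an arbitrary choice:

```python
import numpy as np

# Joint density of Example 2.6: p(x1, x2, y) = (x1 + 3*x2) * y on the unit cube.
n = 100
grid = (np.arange(n) + 0.5) / n                   # midpoints of [0, 1]
x1, x2, y = np.meshgrid(grid, grid, grid, indexing="ij")
p = (x1 + 3.0 * x2) * y

# Marginals on the grid (Riemann sums over the remaining variables).
p_x = p.sum(axis=2) * (1.0 / n)                   # joint marginal of (x1, x2)
p_y = p.sum(axis=(0, 1)) * (1.0 / n) ** 2         # marginal of y

# x = (x1, x2) and y are independent: p factorizes as p_x * p_y.
print(np.allclose(p, p_x[:, :, None] * p_y[None, None, :]))   # True

# x1 and x2 are not independent: their joint marginal does not factorize.
p_x1 = p_x.sum(axis=1) * (1.0 / n)
p_x2 = p_x.sum(axis=0) * (1.0 / n)
print(np.allclose(p_x, np.outer(p_x1, p_x2)))                  # False
```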
2.4 CONDITIONAL DENSITIES AND BAYES’ RULE
Thus far, we have dealt with the usual probability densities, joint densities, and marginal densities. Still another class of probability density functions consists of conditional densities. They are especially important in estimation theory, which will
be studied in Chapter 4. Conditional densities arise when answering the following question: "What is the probability density of a random vector $\mathbf{x}$ given that another random vector $\mathbf{y}$ has the fixed value $\mathbf{y}_0$?" Here $\mathbf{y}_0$ is typically a specific realization of a measurement vector $\mathbf{y}$.
Assuming that the joint density $p_{\mathbf{x}\mathbf{y}}(\mathbf{x}, \mathbf{y})$ of $\mathbf{x}$ and $\mathbf{y}$ and their marginal densities exist, the conditional probability density of $\mathbf{x}$ given $\mathbf{y}$ is defined as
$$p_{\mathbf{x}|\mathbf{y}}(\mathbf{x}|\mathbf{y}) = \frac{p_{\mathbf{x}\mathbf{y}}(\mathbf{x}, \mathbf{y})}{p_{\mathbf{y}}(\mathbf{y})} \qquad (2.59)$$
This can be interpreted as follows: assuming that the random vector $\mathbf{y}$ lies in the region $\mathbf{y}_0 < \mathbf{y} \le \mathbf{y}_0 + \Delta\mathbf{y}$, the probability that $\mathbf{x}$ lies in the region $\mathbf{x}_0 < \mathbf{x} \le \mathbf{x}_0 + \Delta\mathbf{x}$ is $p_{\mathbf{x}|\mathbf{y}}(\mathbf{x}_0|\mathbf{y}_0)\Delta\mathbf{x}$. Here $\mathbf{x}_0$ and $\mathbf{y}_0$ are some constant vectors, and both $\Delta\mathbf{x}$ and $\Delta\mathbf{y}$ are small. Similarly,
$$p_{\mathbf{y}|\mathbf{x}}(\mathbf{y}|\mathbf{x}) = \frac{p_{\mathbf{x}\mathbf{y}}(\mathbf{x}, \mathbf{y})}{p_{\mathbf{x}}(\mathbf{x})} \qquad (2.60)$$
In conditional densities, the conditioning quantity, $\mathbf{y}$ in (2.59) and $\mathbf{x}$ in (2.60), is thought to be like a nonrandom parameter vector, even though it is actually a random vector itself.
Example 2.7 Consider the two-dimensional joint density $p_{xy}(x, y)$ depicted in Fig. 2.4. For a given constant value $x_0$, the conditional distribution is
$$p_{y|x}(y|x_0) = \frac{p_{xy}(x_0, y)}{p_x(x_0)}$$
Hence, it is a one-dimensional distribution obtained by "slicing" the joint distribution $p(x, y)$ parallel to the $y$-axis at the point $x = x_0$. Note that the denominator $p_x(x_0)$ is merely a scaling constant that does not affect the shape of the conditional distribution $p_{y|x}(y|x_0)$ as a function of $y$.

Similarly, the conditional distribution $p_{x|y}(x|y_0)$ can be obtained geometrically by slicing the joint distribution of Fig. 2.4 parallel to the $x$-axis at the point $y = y_0$. The resulting conditional distributions are shown in Fig. 2.5 for the value $x_0 = 1.27$, and in Fig. 2.6 for $y_0 = -0.37$.
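Since the joint density of Fig. 2.4 is not given in closed form, the sketch below (our addition) illustrates the slicing interpretation with an assumed correlated two-dimensional gaussian density; the correlation coefficient 0.6, the grid, and the slice point are arbitrary choices:

```python
import numpy as np

# Assumed joint density: a correlated standard bivariate gaussian.
def joint_pdf(x, y, rho=0.6):
    norm = 1.0 / (2.0 * np.pi * np.sqrt(1.0 - rho**2))
    return norm * np.exp(-(x**2 - 2.0 * rho * x * y + y**2) / (2.0 * (1.0 - rho**2)))

# Discretize the y-axis and slice the joint density at x = x0.
y_grid = np.linspace(-4.0, 4.0, 801)
dy = y_grid[1] - y_grid[0]
x0 = 1.27

slice_at_x0 = joint_pdf(x0, y_grid)      # p_xy(x0, y) as a function of y
p_x_at_x0 = slice_at_x0.sum() * dy       # marginal p_x(x0) by numerical integration
cond = slice_at_x0 / p_x_at_x0           # conditional density p_{y|x}(y | x0)

print(cond.sum() * dy)                   # integrates to 1, cf. Eq. (2.61)
print(y_grid[np.argmax(cond)])           # peak near rho * x0 for this gaussian
```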
From the definitions of the marginal densities $p_{\mathbf{x}}(\mathbf{x})$ of $\mathbf{x}$ and $p_{\mathbf{y}}(\mathbf{y})$ of $\mathbf{y}$ given in Eqs. (2.13) and (2.14), we see that the denominators in (2.59) and (2.60) are obtained by integrating the joint density $p_{\mathbf{x}\mathbf{y}}(\mathbf{x}, \mathbf{y})$ over the unconditional random vector. This also shows immediately that the conditional densities are true probability densities satisfying
$$\int_{-\infty}^{\infty} p_{\mathbf{x}|\mathbf{y}}(\boldsymbol{\xi}|\mathbf{y})\, d\boldsymbol{\xi} = 1, \qquad \int_{-\infty}^{\infty} p_{\mathbf{y}|\mathbf{x}}(\boldsymbol{\eta}|\mathbf{x})\, d\boldsymbol{\eta} = 1 \qquad (2.61)$$
If the random vectors $\mathbf{x}$ and $\mathbf{y}$ are statistically independent, the conditional density $p_{\mathbf{x}|\mathbf{y}}(\mathbf{x}|\mathbf{y})$ equals the unconditional density $p_{\mathbf{x}}(\mathbf{x})$ of $\mathbf{x}$, since $\mathbf{x}$ does not depend in any way on $\mathbf{y}$; similarly, $p_{\mathbf{y}|\mathbf{x}}(\mathbf{y}|\mathbf{x}) = p_{\mathbf{y}}(\mathbf{y})$, and both Eqs. (2.59) and (2.60) can be written in the form
$$p_{\mathbf{x}\mathbf{y}}(\mathbf{x}, \mathbf{y}) = p_{\mathbf{x}}(\mathbf{x}) p_{\mathbf{y}}(\mathbf{y}) \qquad (2.62)$$
which is exactly the definition of independence of the random vectors $\mathbf{x}$ and $\mathbf{y}$.

Fig. 2.4 A two-dimensional joint density of the random variables $x$ and $y$.
In the general case, we get from Eqs. (2.59) and (2.60) two different expressions for the joint density of $\mathbf{x}$ and $\mathbf{y}$:
$$p_{\mathbf{x}\mathbf{y}}(\mathbf{x}, \mathbf{y}) = p_{\mathbf{y}|\mathbf{x}}(\mathbf{y}|\mathbf{x}) p_{\mathbf{x}}(\mathbf{x}) = p_{\mathbf{x}|\mathbf{y}}(\mathbf{x}|\mathbf{y}) p_{\mathbf{y}}(\mathbf{y}) \qquad (2.63)$$
From this, for example, a solution can be found for the density of $\mathbf{y}$ conditioned on $\mathbf{x}$:
$$p_{\mathbf{y}|\mathbf{x}}(\mathbf{y}|\mathbf{x}) = \frac{p_{\mathbf{x}|\mathbf{y}}(\mathbf{x}|\mathbf{y}) p_{\mathbf{y}}(\mathbf{y})}{p_{\mathbf{x}}(\mathbf{x})} \qquad (2.64)$$
where the denominator can be computed by integrating the numerator if necessary:
$$p_{\mathbf{x}}(\mathbf{x}) = \int_{-\infty}^{\infty} p_{\mathbf{x}|\mathbf{y}}(\mathbf{x}|\boldsymbol{\eta}) p_{\mathbf{y}}(\boldsymbol{\eta})\, d\boldsymbol{\eta} \qquad (2.65)$$
(2.65)