
4
Estimation Theory
An important issue encountered in various branches of science is how to estimate the
quantities of interest from a given finite set of uncertain (noisy) measurements. This
is studied in estimation theory, which we shall discuss in this chapter.
There exist many estimation techniques developed for various situations; the
quantities to be estimated may be nonrandom or have some probability distributions
themselves, and they may be constant or time-varying. Certain estimation methods
are computationally less demanding but they are statistically suboptimal in many
situations, while statistically optimal estimation methods can have a very high com-
putational load, or they cannot be realized in many practical situations. The choice
of a suitable estimation method also depends on the assumed data model, which may
be either linear or nonlinear, dynamic or static, random or deterministic.
In this chapter, we concentrate mainly on linear data models, studying the esti-
mation of their parameters. The two cases of deterministic and random parameters
are covered, but the parameters are always assumed to be time-invariant. The meth-
ods that are widely used in context with independent component analysis (ICA) are
emphasized in this chapter. More information on estimation theory can be found in
books devoted entirely or partly to the topic, for example [299, 242, 407, 353, 419].
Prior to applying any estimation method, one must select a suitable model that describes the data well, as well as measurements containing relevant information on the quantities of interest. These important but problem-specific issues will not be
discussed in this chapter. Of course, ICA is one of the models that can be used. Some
topics related to the selection and preprocessing of measurements are treated later in
Chapter 13.
4.1 BASIC CONCEPTS
Assume there are $T$ scalar measurements $x(1), x(2), \ldots, x(T)$ containing information about the $m$ quantities $\theta_1, \theta_2, \ldots, \theta_m$ that we wish to estimate. The quantities $\theta_i$ are called parameters hereafter. They can be compactly represented as the parameter vector

$$\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_m)^T \tag{4.1}$$
Hence, the parameter vector $\boldsymbol{\theta}$ is an $m$-dimensional column vector having as its elements the individual parameters. Similarly, the measurements can be represented as the $T$-dimensional measurement or data vector$^1$

$$\mathbf{x}_T = [x(1), x(2), \ldots, x(T)]^T \tag{4.2}$$
Quite generally, an estimator $\hat{\boldsymbol{\theta}}$ of the parameter vector $\boldsymbol{\theta}$ is the mathematical expression or function by which the parameters can be estimated from the measurements:

$$\hat{\boldsymbol{\theta}} = \mathbf{h}(\mathbf{x}_T) = \mathbf{h}(x(1), x(2), \ldots, x(T)) \tag{4.3}$$

For individual parameters, this becomes

$$\hat{\theta}_i = h_i(\mathbf{x}_T), \quad i = 1, \ldots, m \tag{4.4}$$
If the parameters $\theta_i$ are of different types, the estimation formula (4.4) can be quite different for different $i$. In other words, the components $h_i$ of the vector-valued function $\mathbf{h}$ can have different functional forms. The numerical value of an estimator $\hat{\theta}_i$, obtained by inserting some specific given measurements into formula (4.4), is called the estimate of the parameter $\theta_i$.
Example 4.1 Two parameters that are often needed are the mean $\mu$ and variance $\sigma^2$ of a random variable $x$. Given the measurement vector (4.2), they can be estimated from the well-known formulas, which will be derived later in this chapter:

$$\hat{\mu} = \frac{1}{T} \sum_{j=1}^{T} x(j) \tag{4.5}$$

$$\hat{\sigma}^2 = \frac{1}{T-1} \sum_{j=1}^{T} \left[ x(j) - \hat{\mu} \right]^2 \tag{4.6}$$
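As a minimal numerical sketch of formulas (4.5) and (4.6) (NumPy is assumed to be available; the variable names and simulated data are illustrative only):

```python
import numpy as np

def sample_mean_and_variance(x):
    """Estimate the mean (4.5) and the variance (4.6) from measurements x(1), ..., x(T)."""
    x = np.asarray(x, dtype=float)
    T = x.size
    mu_hat = x.sum() / T                                 # equation (4.5)
    sigma2_hat = ((x - mu_hat) ** 2).sum() / (T - 1)     # equation (4.6)
    return mu_hat, sigma2_hat

# Usage with simulated gaussian measurements of mean 2 and variance 9
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1000)
print(sample_mean_and_variance(x))   # approximately (2.0, 9.0)
```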

$^1$ The data vector consisting of $T$ subsequent scalar samples is denoted in this chapter by $\mathbf{x}_T$ to distinguish it from the ICA mixture vector $\mathbf{x}$, whose components consist of different mixtures.
Example 4.2 Another example of an estimation problem is a sinusoidal signal in noise. Assume that the measurements obey the measurement (data) model

$$x(j) = A \sin(\omega t(j) + \phi) + v(j), \quad j = 1, \ldots, T \tag{4.7}$$

Here $A$ is the amplitude, $\omega$ the angular frequency, and $\phi$ the phase of the sinusoid, respectively. The measurements are made at different time instants $t(j)$, which are often equispaced. They are corrupted by additive noise $v(j)$, which is often assumed to be zero-mean white gaussian noise. Depending on the situation, we may wish to estimate some of the parameters $A$, $\omega$, and $\phi$, or all of them. In the latter case, the parameter vector becomes $\boldsymbol{\theta} = (A, \omega, \phi)^T$. Clearly, different formulas must be used for estimating $A$, $\omega$, and $\phi$. The amplitude $A$ depends linearly on the measurements $x(j)$, while the angular frequency $\omega$ and the phase $\phi$ depend nonlinearly on the $x(j)$. Various estimation methods for this problem are discussed, for example, in [242].
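To illustrate the linear dependence on $A$: if the angular frequency $\omega$ and the phase $\phi$ were known, the amplitude could be estimated by a simple least-squares fit, anticipating Section 4.4. The following Python sketch assumes NumPy and uses purely illustrative parameter values; it is not one of the estimators discussed in [242].

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200
t = np.arange(T) * 0.01                             # equispaced time instants t(j)
A_true, omega, phi = 1.5, 2 * np.pi * 5.0, 0.3      # omega and phi assumed known here
x = A_true * np.sin(omega * t + phi) + 0.2 * rng.standard_normal(T)   # model (4.7)

s = np.sin(omega * t + phi)      # known regressor
A_hat = (s @ x) / (s @ s)        # least-squares estimate of the amplitude
print(A_hat)                     # close to 1.5
```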
Estimation methods can be divided into two broad classes depending on whether the parameters $\boldsymbol{\theta}$ are assumed to be deterministic constants, or random. In the latter case, it is usually assumed that the parameter vector $\boldsymbol{\theta}$ has an associated probability density function (pdf) $p_{\theta}(\boldsymbol{\theta})$. This pdf, called the a priori density, is in principle assumed to be completely known. In practice, such exact information is seldom available. Rather, the probabilistic formalism allows incorporation of useful but often somewhat vague prior information on the parameters into the estimation procedure for improving the accuracy. This is done by assuming a suitable prior distribution reflecting knowledge about the parameters. Estimation methods using the a priori distribution $p_{\theta}(\boldsymbol{\theta})$ are often called Bayesian ones, because they utilize Bayes' rule, discussed in Section 4.6.
Another distinction between estimators can be made depending on whether they are of batch type or on-line. In batch type estimation (also called off-line estimation), all the measurements must first be available, and the estimates are then computed directly from formula (4.3). In on-line estimation methods (also called adaptive or recursive estimation), the estimates are updated using new incoming samples. Thus the estimates are computed from the recursive formula

$$\hat{\boldsymbol{\theta}}(j+1) = \mathbf{h}_1(\hat{\boldsymbol{\theta}}(j)) + \mathbf{h}_2(x(j+1), \hat{\boldsymbol{\theta}}(j)) \tag{4.8}$$

where $\hat{\boldsymbol{\theta}}(j)$ denotes the estimate based on the $j$ first measurements $x(1), x(2), \ldots, x(j)$. The correction or update term $\mathbf{h}_2(x(j+1), \hat{\boldsymbol{\theta}}(j))$ depends only on the new incoming $(j+1)$th sample $x(j+1)$ and the current estimate $\hat{\boldsymbol{\theta}}(j)$. For example, the estimate $\hat{\mu}$ of the mean in (4.5) can be computed on-line as follows:

$$\hat{\mu}(j) = \frac{j-1}{j}\,\hat{\mu}(j-1) + \frac{1}{j}\,x(j) \tag{4.9}$$
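A minimal sketch of this recursion (the loop updates the running mean exactly as in (4.9); NumPy is assumed only for generating the test data):

```python
import numpy as np

def online_mean(samples):
    """Recursive estimate of the mean, updated one sample at a time as in (4.9)."""
    mu_hat = 0.0
    for j, x_j in enumerate(samples, start=1):
        mu_hat = (j - 1) / j * mu_hat + x_j / j
    return mu_hat

rng = np.random.default_rng(2)
data = rng.normal(loc=-1.0, scale=2.0, size=5000)
print(online_mean(data), data.mean())   # the recursive and batch estimates agree
```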
4.2 PROPERTIES OF ESTIMATORS
Now briefly consider properties that a good estimator should satisfy.
Generally, assessing the quality of an estimate is based on the estimation error, which is defined by

$$\tilde{\boldsymbol{\theta}} = \boldsymbol{\theta} - \hat{\boldsymbol{\theta}} = \boldsymbol{\theta} - \mathbf{h}(\mathbf{x}_T) \tag{4.10}$$
Ideally, the estimation error $\tilde{\boldsymbol{\theta}}$ should be zero, or at least zero with probability one. But it is impossible to meet these extremely stringent requirements for a finite data set. Therefore, one must consider less demanding criteria for the estimation error.
Unbiasedness and consistency
The first requirement is that the mean value of the error $\mathrm{E}\{\tilde{\boldsymbol{\theta}}\}$ should be zero. Taking expectations of both sides of Eq. (4.10) leads to the condition

$$\mathrm{E}\{\hat{\boldsymbol{\theta}}\} = \mathrm{E}\{\boldsymbol{\theta}\} \tag{4.11}$$
Estimators that satisfy the requirement (4.11) are called unbiased. The preceding definition is applicable to random parameters. For nonrandom parameters, the respective definition is

$$\mathrm{E}\{\hat{\boldsymbol{\theta}} \mid \boldsymbol{\theta}\} = \boldsymbol{\theta} \tag{4.12}$$
Generally, conditional probability densities and expectations, conditioned by the parameter vector $\boldsymbol{\theta}$, are used throughout in dealing with nonrandom parameters to indicate that the parameters $\boldsymbol{\theta}$ are assumed to be deterministic constants. In this case, the expectations are computed over the random data only.
If an estimator does not meet the unbiasedness condition (4.11) or (4.12), it is said to be biased. In particular, the bias $\mathbf{b}$ is defined as the mean value of the estimation error:

$$\mathbf{b} = \mathrm{E}\{\tilde{\boldsymbol{\theta}}\}, \quad \text{or} \quad \mathbf{b} = \mathrm{E}\{\tilde{\boldsymbol{\theta}} \mid \boldsymbol{\theta}\} \tag{4.13}$$
If the bias approaches zero as the number of measurements grows infinitely large, the estimator is called asymptotically unbiased.

Another reasonable requirement for a good estimator $\hat{\boldsymbol{\theta}}$ is that it should converge to the true value of the parameter vector $\boldsymbol{\theta}$, at least in probability,$^2$ when the number of measurements grows infinitely large. Estimators satisfying this asymptotic property are called consistent. Consistent estimators need not be unbiased; see [407].
Example 4.3 Assume that the observations $x(1), x(2), \ldots, x(T)$ are independent. The expected value of the sample mean (4.5) is

$$\mathrm{E}\{\hat{\mu}\} = \frac{1}{T}\sum_{j=1}^{T} \mathrm{E}\{x(j)\} = \frac{1}{T}\, T\mu = \mu \tag{4.14}$$
$^2$ See for example [299, 407] for various definitions of stochastic convergence.
Thus the sample mean is an unbiased estimator of the true mean $\mu$. It is also consistent, which can be seen by computing its variance

$$\mathrm{E}\{(\hat{\mu} - \mu)^2\} = \frac{1}{T^2}\sum_{j=1}^{T} \mathrm{E}\{[x(j) - \mu]^2\} = \frac{1}{T^2}\, T\sigma^2 = \frac{\sigma^2}{T} \tag{4.15}$$

The variance approaches zero when the number of samples $T \to \infty$, implying together with unbiasedness that the sample mean (4.5) converges in probability to the true mean $\mu$.
Mean-square error
It is useful to introduce a scalar-valued loss function $L(\tilde{\boldsymbol{\theta}})$ for describing the relative importance of specific estimation errors $\tilde{\boldsymbol{\theta}}$. A popular loss function is the squared estimation error $L(\tilde{\boldsymbol{\theta}}) = \|\tilde{\boldsymbol{\theta}}\|^2 = \|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\|^2$ because of its mathematical tractability. More generally, typical properties required from a valid loss function are that it is symmetric: $L(\tilde{\boldsymbol{\theta}}) = L(-\tilde{\boldsymbol{\theta}})$; convex, or alternatively at least nondecreasing; and (for convenience) that the loss corresponding to zero error is zero: $L(\mathbf{0}) = 0$. The convexity property guarantees that the loss function decreases as the estimation error decreases. See [407] for details.
The estimation error $\tilde{\boldsymbol{\theta}}$ is a random vector depending on the (random) measurement vector $\mathbf{x}_T$. Hence, the value of the loss function $L(\tilde{\boldsymbol{\theta}})$ is also a random variable. To obtain a nonrandom error measure, it is useful to define the performance index or error criterion $\mathcal{E}$ as the expectation of the respective loss function. Hence,

$$\mathcal{E} = \mathrm{E}\{L(\tilde{\boldsymbol{\theta}})\} \quad \text{or} \quad \mathcal{E} = \mathrm{E}\{L(\tilde{\boldsymbol{\theta}}) \mid \boldsymbol{\theta}\} \tag{4.16}$$
where the first definition is used for random parameters $\boldsymbol{\theta}$ and the second one for deterministic ones.

A widely used error criterion is the mean-square error (MSE)

$$\mathcal{E}_{\mathrm{MSE}} = \mathrm{E}\{\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\|^2\} \tag{4.17}$$
If the mean-square error tends asymptotically to zero with increasing number of measurements, the respective estimator is consistent. Another important property of the mean-square error criterion is that it can be decomposed as (see (4.13))

$$\mathcal{E}_{\mathrm{MSE}} = \mathrm{E}\{\|\tilde{\boldsymbol{\theta}} - \mathbf{b}\|^2\} + \|\mathbf{b}\|^2 \tag{4.18}$$

The first term $\mathrm{E}\{\|\tilde{\boldsymbol{\theta}} - \mathbf{b}\|^2\}$ on the right-hand side is clearly the variance of the estimation error $\tilde{\boldsymbol{\theta}}$. Thus the mean-square error $\mathcal{E}_{\mathrm{MSE}}$ measures both the variance and the bias of an estimator $\hat{\boldsymbol{\theta}}$. If the estimator is unbiased, the mean-square error coincides with the variance of the estimator. Similar definitions hold for deterministic parameters when the expectations in (4.17) and (4.18) are replaced by conditional ones.
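The decomposition (4.18) can be checked numerically. The hedged Python sketch below (NumPy assumed; the choice of estimator and parameter values is illustrative) uses the variance estimator that divides by $T$ instead of $T-1$, which is slightly biased, and compares its MSE with the error variance plus the squared bias:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2_true, T, runs = 4.0, 20, 200000
x = rng.normal(0.0, np.sqrt(sigma2_true), size=(runs, T))

# Biased variance estimator (divides by T rather than T - 1)
est = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / T

err = sigma2_true - est               # estimation error
bias = err.mean()                     # b in (4.13)
mse = (err ** 2).mean()               # E_MSE in (4.17)
print(mse, err.var(), bias ** 2)      # MSE equals error variance plus squared bias, cf. (4.18)
```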
Figure 4.1 illustrates the bias $b$ and standard deviation $\sigma$ (square root of the variance $\sigma^2$) for an estimator $\hat{\theta}$ of a single scalar parameter $\theta$. In a Bayesian interpretation (see Section 4.6), the bias and variance of the estimator $\hat{\theta}$ are, respectively, the mean and variance of the posterior distribution $p_{\hat{\theta} \mid \mathbf{x}_T}(\hat{\theta} \mid \mathbf{x})$ of the estimator $\hat{\theta}$ given the observed data $\mathbf{x}_T$.

Fig. 4.1  Bias $b$ and standard deviation $\sigma$ of an estimator $\hat{\theta}$.
Still another useful measure of the quality of an estimator is given by the covariance matrix of the estimation error

$$\mathbf{C}_{\tilde{\theta}} = \mathrm{E}\{\tilde{\boldsymbol{\theta}}\tilde{\boldsymbol{\theta}}^T\} = \mathrm{E}\{(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^T\} \tag{4.19}$$

It measures the errors of individual parameter estimates, while the mean-square error is an overall scalar error measure for all the parameter estimates. In fact, the mean-square error (4.17) can be obtained by summing up the diagonal elements of the error covariance matrix (4.19), that is, the mean-square errors of the individual parameters.
Efficiency
An estimator that provides the smallest error covariance matrix among all unbiased estimators is the best one with respect to this quality criterion. Such an estimator is called an efficient one, because it optimally uses the information contained in the measurements. A symmetric matrix $\mathbf{A}$ is said to be smaller than another symmetric matrix $\mathbf{B}$, or $\mathbf{A} < \mathbf{B}$, if the matrix $\mathbf{B} - \mathbf{A}$ is positive definite.

A very important theoretical result in estimation theory is that there exists a lower bound for the error covariance matrix (4.19) of any estimator based on the available measurements. This is provided by the Cramer-Rao lower bound. In the following theorem, we formulate the Cramer-Rao lower bound for unknown deterministic parameters.
Theorem 4.1 [407] If $\hat{\boldsymbol{\theta}}$ is any unbiased estimator of $\boldsymbol{\theta}$ based on the measurement data $\mathbf{x}$, then the covariance matrix of the error in the estimator is bounded below by the inverse of the Fisher information matrix $\mathbf{J}$:

$$\mathrm{E}\{(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^T \mid \boldsymbol{\theta}\} \geq \mathbf{J}^{-1} \tag{4.20}$$

where

$$\mathbf{J} = \mathrm{E}\left\{ \left[\frac{\partial}{\partial \boldsymbol{\theta}} \ln p(\mathbf{x}_T \mid \boldsymbol{\theta})\right] \left[\frac{\partial}{\partial \boldsymbol{\theta}} \ln p(\mathbf{x}_T \mid \boldsymbol{\theta})\right]^T \,\middle|\, \boldsymbol{\theta} \right\} \tag{4.21}$$
Here it is assumed that the inverse $\mathbf{J}^{-1}$ exists. The term $\frac{\partial}{\partial \boldsymbol{\theta}} \ln p(\mathbf{x}_T \mid \boldsymbol{\theta})$ is recognized to be the gradient vector of the natural logarithm of the joint distribution$^3$ $p(\mathbf{x}_T \mid \boldsymbol{\theta})$ of the measurements $\mathbf{x}_T$ for nonrandom parameters $\boldsymbol{\theta}$. The partial derivatives must exist and be absolutely integrable.
It should be noted that the estimator $\hat{\boldsymbol{\theta}}$ must be unbiased; otherwise the preceding theorem does not hold. The theorem cannot be applied to all distributions (for example, to the uniform one) because of the requirement of absolute integrability of the derivatives. It may also happen that there does not exist any estimator achieving the lower bound. Nevertheless, the Cramer-Rao lower bound can be computed for many problems, providing a useful measure for testing the efficiency of specific estimation methods designed for those problems. A more thorough discussion of the Cramer-Rao lower bound, with proofs and results for various types of parameters, can be found, for example, in [299, 242, 407, 419]. An example of computing the Cramer-Rao lower bound will be given in Section 4.5.
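As a brief preview of such a computation, consider the standard special case (sketched here only for illustration) of estimating the mean $\mu$ of $T$ independent gaussian samples with known variance $\sigma^2$. Then

$$\frac{\partial}{\partial \mu} \ln p(\mathbf{x}_T \mid \mu) = \frac{1}{\sigma^2}\sum_{j=1}^{T}\left[x(j) - \mu\right], \qquad J = \mathrm{E}\left\{\left[\frac{\partial}{\partial \mu} \ln p(\mathbf{x}_T \mid \mu)\right]^2 \,\middle|\, \mu\right\} = \frac{T}{\sigma^2}$$

so the bound (4.20) reads $\mathrm{E}\{(\mu - \hat{\mu})^2 \mid \mu\} \geq J^{-1} = \sigma^2/T$, which coincides with the variance (4.15) of the sample mean; the sample mean is thus an efficient estimator in this case.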
$^3$ We have here omitted the subscript $\mathbf{x} \mid \boldsymbol{\theta}$ of the density function $p(\mathbf{x} \mid \boldsymbol{\theta})$ for notational simplicity. This practice is followed in this chapter unless confusion is possible.

Robustness
In practice, an important characteristic of an estimator is its robustness [163, 188]. Roughly speaking, robustness means insensitivity to gross measurement errors and to errors in the specification of parametric models. A typical problem with many estimators is that they may be quite sensitive to outliers, that is, observations that are very far from the main bulk of the data. For example, consider the estimation of the mean from $100$ measurements. Assume that all the measurements but one are distributed between $-1$ and $1$, while one of the measurements has the value $1000$. Using the simple estimator of the mean given by the sample average in (4.5), the estimator gives a value that is not far from the value $10$. Thus the single, probably erroneous, measurement of $1000$ has a very strong influence on the estimator. The problem here is that the average corresponds to minimizing the squared distance of the measurements from the estimate [163, 188]. The square function implies that measurements far away dominate.

Robust estimators can be obtained, for example, by considering, instead of the squared error, other optimization criteria that grow more slowly than quadratically with the error. Examples of such criteria are the absolute value criterion and criteria that saturate as the error grows large enough [83, 163, 188]. Optimization criteria growing faster than quadratically generally have poor robustness, because a few large individual errors corresponding to the outliers in the data may almost solely determine the value of the error criterion. In the case of estimating the mean, for example, one can use the median of the measurements instead of the average. This corresponds to using the absolute value in the optimization function, and gives a very robust estimator: the single outlier has no influence at all.
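The effect is easy to reproduce; this small Python sketch (NumPy assumed, values illustrative) repeats the 100-measurement example above and compares the sample average with the median:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1.0, 1.0, size=100)
x[0] = 1000.0                 # a single gross outlier

print(np.mean(x))             # close to 10: the outlier dominates the average
print(np.median(x))           # close to 0: the outlier has essentially no influence
```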
4.3 METHOD OF MOMENTS
One of the simplest and oldest estimation methods is the method of moments. It is intuitively satisfying and often leads to computationally simple estimators, but on the other hand, it has some theoretical weaknesses. We shall briefly discuss the moment method because of its close relationship to higher-order statistics.
Assume now that there are $T$ statistically independent scalar measurements or data samples $x(1), x(2), \ldots, x(T)$ that have a common probability distribution $p(x \mid \boldsymbol{\theta})$ characterized by the parameter vector $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_m)^T$ in (4.1). Recall from Section 2.7 that the $j$th moment $\alpha_j$ of $x$ is defined by

$$\alpha_j = \mathrm{E}\{x^j \mid \boldsymbol{\theta}\} = \int_{-\infty}^{\infty} x^j\, p(x \mid \boldsymbol{\theta})\, dx, \quad j = 1, 2, \ldots \tag{4.22}$$
Here the conditional expectations are used to indicate that the parameters $\boldsymbol{\theta}$ are (unknown) constants. Clearly, the moments $\alpha_j$ are functions of the parameters $\boldsymbol{\theta}$.

On the other hand, we can estimate the respective moments directly from the measurements. Let us denote by $d_j$ the $j$th estimated moment, called the $j$th sample moment. It is obtained from the formula (see Section 2.2)

$$d_j = \frac{1}{T}\sum_{i=1}^{T} [x(i)]^j \tag{4.23}$$
The simple basic idea behind the method of moments is to equate the theoretical moments $\alpha_j$ with the estimated ones $d_j$:

$$\alpha_j(\boldsymbol{\theta}) = \alpha_j(\theta_1, \theta_2, \ldots, \theta_m) = d_j \tag{4.24}$$

Usually, $m$ equations for the $m$ first moments $j = 1, \ldots, m$ are sufficient for solving the $m$ unknown parameters $\theta_1, \theta_2, \ldots, \theta_m$. If Eqs. (4.24) have an acceptable solution, the respective estimator is called the moment estimator, and it is denoted in the following by $\hat{\boldsymbol{\theta}}_{\mathrm{MM}}$.
Alternatively, one can use the theoretical central moments

$$\mu_j = \mathrm{E}\{(x - \alpha_1)^j \mid \boldsymbol{\theta}\} \tag{4.25}$$

and the respective estimated sample central moments

$$s_j = \frac{1}{T-1}\sum_{i=1}^{T} [x(i) - d_1]^j \tag{4.26}$$
to form the $m$ equations

$$\mu_j(\theta_1, \theta_2, \ldots, \theta_m) = s_j, \quad j = 1, 2, \ldots, m \tag{4.27}$$

for solving the unknown parameters $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_m)^T$.
Example 4.4 Assume now that $x(1), x(2), \ldots, x(T)$ are independent and identically distributed samples from a random variable $x$ having the pdf

$$p(x \mid \boldsymbol{\theta}) = \frac{1}{\theta_2} \exp\left(-\frac{x - \theta_1}{\theta_2}\right) \tag{4.28}$$

where $\theta_1 < x < \infty$ and $\theta_2 > 0$. We wish to estimate the parameter vector $\boldsymbol{\theta} = (\theta_1, \theta_2)^T$ using the method of moments. For doing this, let us first compute the theoretical moments $\alpha_1$ and $\alpha_2$:

$$\alpha_1 = \mathrm{E}\{x \mid \boldsymbol{\theta}\} = \int_{\theta_1}^{\infty} \frac{x}{\theta_2} \exp\left(-\frac{x - \theta_1}{\theta_2}\right) dx = \theta_1 + \theta_2 \tag{4.29}$$

$$\alpha_2 = \mathrm{E}\{x^2 \mid \boldsymbol{\theta}\} = \int_{\theta_1}^{\infty} \frac{x^2}{\theta_2} \exp\left(-\frac{x - \theta_1}{\theta_2}\right) dx = (\theta_1 + \theta_2)^2 + \theta_2^2 \tag{4.30}$$
The moment estimators are obtained by equating these expressions with the first two sample moments $d_1$ and $d_2$, respectively, which yields

$$\theta_1 + \theta_2 = d_1 \tag{4.31}$$

$$(\theta_1 + \theta_2)^2 + \theta_2^2 = d_2 \tag{4.32}$$
Solving these two equations leads to the moment estimates

$$\hat{\theta}_{1,\mathrm{MM}} = d_1 - (d_2 - d_1^2)^{1/2} \tag{4.33}$$

$$\hat{\theta}_{2,\mathrm{MM}} = (d_2 - d_1^2)^{1/2} \tag{4.34}$$
The other possible solution $\hat{\theta}_{2,\mathrm{MM}} = -(d_2 - d_1^2)^{1/2}$ must be rejected because the parameter $\theta_2$ must be positive. In fact, it can be observed that $\hat{\theta}_{2,\mathrm{MM}}$ equals the sample estimate of the standard deviation, and $\hat{\theta}_{1,\mathrm{MM}}$ can be interpreted as the mean minus the standard deviation of the distribution, both estimated from the available samples.
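A hedged numerical sketch of this example (NumPy assumed; the shifted-exponential sampler and the parameter values are illustrative choices): simulate data from the pdf (4.28) and apply (4.33) and (4.34).

```python
import numpy as np

rng = np.random.default_rng(6)
theta1_true, theta2_true, T = 2.0, 0.5, 100000

# Samples from the pdf (4.28): an exponential distribution shifted to start at theta1
x = theta1_true + rng.exponential(scale=theta2_true, size=T)

d1 = np.mean(x)        # first sample moment (4.23)
d2 = np.mean(x ** 2)   # second sample moment (4.23)

theta2_mm = np.sqrt(d2 - d1 ** 2)   # equation (4.34)
theta1_mm = d1 - theta2_mm          # equation (4.33)
print(theta1_mm, theta2_mm)         # close to (2.0, 0.5)
```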
The theoretical justification for the method of moments is that the sample moments $d_j$ are consistent estimators of the respective theoretical moments $\alpha_j$ [407]. Similarly, the sample central moments $s_j$ are consistent estimators of the true central moments $\mu_j$. A drawback of the moment method is that it is often inefficient. Therefore, it is usually not applied if other, better estimators can be constructed. In general, no claims can be made on the unbiasedness and consistency of estimates given by the method of moments. Sometimes the moment method does not even lead to an acceptable estimator.
These negative remarks have implications in independent component analysis. Algebraic, cumulant-based methods proposed for ICA are typically based on estimating fourth-order moments and cross-moments of the components of the observation (data) vectors. Hence, one could claim that cumulant-based ICA methods, in general, utilize the information contained in the data vectors inefficiently. On the other hand, these methods have some advantages. They will be discussed in more detail in Chapter 11, and related methods can be found in Chapter 8 as well.
4.4 LEAST-SQUARES ESTIMATION
4.4.1 Linear least-squares method
The least-squares method can be regarded as a deterministic approach to the estimation problem where no assumptions on the probability distributions, etc., are necessary. However, statistical arguments can be used to justify the least-squares method, and they give further insight into its properties. Least-squares estimation is discussed in numerous books, in a more thorough fashion from the estimation point of view, for example, in [407, 299].
In the basic linear least-squares method, the $T$-dimensional data vectors $\mathbf{x}_T$ are assumed to obey the following model:

$$\mathbf{x}_T = \mathbf{H}\boldsymbol{\theta} + \mathbf{v}_T \tag{4.35}$$
Here $\boldsymbol{\theta}$ is again the $m$-dimensional parameter vector, and $\mathbf{v}_T$ is a $T$-vector whose components are the unknown measurement errors $v(j)$, $j = 1, \ldots, T$. The $T \times m$ observation matrix $\mathbf{H}$ is assumed to be completely known. Furthermore, the number of measurements is assumed to be at least as large as the number of unknown parameters, so that $T \geq m$. In addition, the matrix $\mathbf{H}$ has the maximum rank $m$.
First, it can be noted that if $m = T$, we can set $\mathbf{v}_T = \mathbf{0}$, and get a unique solution $\boldsymbol{\theta} = \mathbf{H}^{-1}\mathbf{x}_T$. If there were more unknown parameters than measurements ($m > T$), infinitely many solutions would exist for Eqs. (4.35) satisfying the condition $\mathbf{v}_T = \mathbf{0}$. However, if the measurements are noisy or contain errors, it is generally highly desirable to have many more measurements than there are parameters to be estimated, in order to obtain more reliable estimates. So, in the following we shall concentrate on the case $T > m$.
When $T > m$, equation (4.35) has no solution for which $\mathbf{v}_T = \mathbf{0}$. Because the measurement errors $\mathbf{v}_T$ are unknown, the best that we can then do is to choose an estimator $\hat{\boldsymbol{\theta}}$ that minimizes in some sense the effect of the errors. For mathematical convenience, a natural choice is to consider the least-squares criterion

$$\mathcal{E}_{\mathrm{LS}} = \frac{1}{2}\|\mathbf{v}_T\|^2 = \frac{1}{2}(\mathbf{x}_T - \mathbf{H}\boldsymbol{\theta})^T(\mathbf{x}_T - \mathbf{H}\boldsymbol{\theta}) \tag{4.36}$$
Note that this differs from the error criteria in Section 4.2 in that no expectation is involved, and the criterion $\mathcal{E}_{\mathrm{LS}}$ tries to minimize the measurement errors $\mathbf{v}_T$, and not directly the estimation error $\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}$.
Minimization of the criterion (4.36) with respect to the unknown parameters $\boldsymbol{\theta}$ leads to the so-called normal equations [407, 320, 299]

$$(\mathbf{H}^T\mathbf{H})\,\hat{\boldsymbol{\theta}}_{\mathrm{LS}} = \mathbf{H}^T\mathbf{x}_T \tag{4.37}$$

for determining the least-squares estimate $\hat{\boldsymbol{\theta}}_{\mathrm{LS}}$ of $\boldsymbol{\theta}$. It is often most convenient to solve $\hat{\boldsymbol{\theta}}_{\mathrm{LS}}$ from these linear equations. However, because we assumed that the matrix $\mathbf{H}$ has full rank, we can explicitly solve the normal equations, getting

$$\hat{\boldsymbol{\theta}}_{\mathrm{LS}} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}_T = \mathbf{H}^{+}\mathbf{x}_T \tag{4.38}$$
where $\mathbf{H}^{+} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T$ is the pseudoinverse of $\mathbf{H}$ (assuming that $\mathbf{H}$ has maximal rank $m$ and more rows than columns: $T > m$) [169, 320, 299].
The least-squares estimator can be analyzed statistically by assuming that the measurement errors have zero mean: $\mathrm{E}\{\mathbf{v}_T\} = \mathbf{0}$. It is easy to see that the least-squares estimator is unbiased: $\mathrm{E}\{\hat{\boldsymbol{\theta}}_{\mathrm{LS}} \mid \boldsymbol{\theta}\} = \boldsymbol{\theta}$. Furthermore, if the covariance matrix of the measurement errors $\mathbf{C}_v = \mathrm{E}\{\mathbf{v}_T\mathbf{v}_T^T\}$ is known, one can compute the covariance matrix (4.19) of the estimation error. These simple analyses are left as an exercise to the reader.
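For orientation, a sketch of the kind of computation intended, under the stated assumptions of zero-mean errors with known covariance $\mathbf{C}_v$ and a full-rank $\mathbf{H}$: substituting the model (4.35) into (4.38) gives

$$\hat{\boldsymbol{\theta}}_{\mathrm{LS}} = \boldsymbol{\theta} + (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{v}_T$$

so that the error covariance (4.19) becomes

$$\mathbf{C}_{\tilde{\theta}} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\,\mathbf{C}_v\,\mathbf{H}\,(\mathbf{H}^T\mathbf{H})^{-1}$$

which reduces to $\sigma^2(\mathbf{H}^T\mathbf{H})^{-1}$ when the errors are white with common variance $\sigma^2$, that is, $\mathbf{C}_v = \sigma^2\mathbf{I}$.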
Example 4.5 The least-squares method is commonly applied in various branches of science to linear curve fitting. The general setting here is as follows. We try to fit to the measurements the linear model

$$y(t) = \sum_{i=1}^{m} a_i \varphi_i(t) + v(t) \tag{4.39}$$
Here $\varphi_i(t)$, $i = 1, 2, \ldots, m$, are $m$ basis functions that can in general be nonlinear functions of the argument $t$; it suffices that the model (4.39) be linear with respect to the unknown parameters $a_i$. Assume now that there are available measurements $y(t_1), y(t_2), \ldots, y(t_T)$ at argument values $t_1, t_2, \ldots, t_T$, respectively. The linear model (4.39) can easily be written in the vector form (4.35), where now the parameter vector is given by

$$\boldsymbol{\theta} = [a_1, a_2, \ldots, a_m]^T \tag{4.40}$$
and the data vector by

$$\mathbf{x}_T = [y(t_1), y(t_2), \ldots, y(t_T)]^T \tag{4.41}$$

Similarly, the vector $\mathbf{v}_T = [v(t_1), v(t_2), \ldots, v(t_T)]^T$ contains the error terms $v(t_i)$.
The observation matrix becomes

$$\mathbf{H} = \begin{bmatrix} \varphi_1(t_1) & \varphi_2(t_1) & \cdots & \varphi_m(t_1) \\ \varphi_1(t_2) & \varphi_2(t_2) & \cdots & \varphi_m(t_2) \\ \vdots & \vdots & \ddots & \vdots \\ \varphi_1(t_T) & \varphi_2(t_T) & \cdots & \varphi_m(t_T) \end{bmatrix} \tag{4.42}$$
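A compact illustration of this example in Python (NumPy assumed; the polynomial basis functions, noise level, and parameter values are illustrative choices): fit $y(t) = a_1 + a_2 t + a_3 t^2$ by forming $\mathbf{H}$ as in (4.42) and solving the normal equations (4.37).

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.linspace(0.0, 1.0, 50)                        # argument values t_1, ..., t_T
a_true = np.array([1.0, -2.0, 3.0])                  # parameters a_1, a_2, a_3
H = np.column_stack([np.ones_like(t), t, t**2])      # basis functions 1, t, t^2, as in (4.42)
y = H @ a_true + 0.1 * rng.standard_normal(t.size)   # model (4.39) with additive noise

# Least-squares estimate: solve the normal equations (4.37); equivalent to (4.38)
theta_ls = np.linalg.solve(H.T @ H, H.T @ y)
print(theta_ls)                                      # close to [1, -2, 3]
```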
