Mendel, J.M. “Estimation Theory and Algorithms: From Gauss to Wiener to Kalman”
Digital Signal Processing Handbook
Ed. Vijay K. Madisetti and Douglas B. Williams
Boca Raton: CRC Press LLC, 1999
© 1999 by CRC Press LLC
15 Estimation Theory and Algorithms: From Gauss to Wiener to Kalman
Jerry M. Mendel
University of Southern California
15.1 Introduction
15.2 Least-Squares Estimation
15.3 Properties of Estimators
15.4 Best Linear Unbiased Estimation
15.5 Maximum-Likelihood Estimation
15.6 Mean-Squared Estimation of Random Parameters
15.7 Maximum A Posteriori Estimation of Random Parameters
15.8 The Basic State-Variable Model
15.9 State Estimation for the Basic State-Variable Model
Prediction • Filtering (the Kalman Filter) • Smoothing
15.10 Digital Wiener Filtering
15.11 Linear Prediction in DSP, and Kalman Filtering
15.12 Iterated Least Squares
15.13 Extended Kalman Filter
Acknowledgment
References
Further Information
15.1 Introduction
Estimation is one of four modeling problems. The other three are representation (how something
should be modeled), measurement (which physical quantities should be measured and how they
should be measured), and validation (demonstrating confidence in the model). Estimation, which
fits in between the problems of measurement and validation, deals with the determination of those
physical quantities that cannot be measured from those that can be measured. We shall cover a wide
range of estimation techniques including weighted least squares, best linear unbiased, maximum-
likelihood, mean-squared, and maximum-a posteriori. These techniques are for parameter or state
estimation or a combination of the two, as applied to either linear or nonlinear models.
The discrete-time viewpoint is emphasized in this chapter because: (1) much real data is collected in a digitized manner, so it is in a form ready to be processed by discrete-time estimation algorithms; and (2) the mathematics associated with discrete-time estimation theory is simpler than with continuous-time estimation theory. We view (discrete-time) estimation theory as the extension of classical signal processing to the design of discrete-time (digital) filters that process uncertain data in an optimal manner. Estimation theory can, therefore, be viewed as a natural adjunct to digital signal processing theory. Mendel [12] is the primary reference for all the material in this chapter.
Estimation algorithms process data and, as such, must be implemented on a digital computer.
Our computation philosophy is, whenever possible, leave it to the experts. Many of our chapter’s
algorithms can be used with MATLAB™ and appropriate toolboxes (MATLAB is a registered trademark of The MathWorks, Inc.). See [12] for specific connections between MATLAB™ and toolbox M-files and the algorithms of this chapter.
The main model that we shall direct our attention to is linear in the unknown parameters, namely
Z(k) = H(k)θ + V(k)     (15.1)

In this model, which we refer to as a “generic linear model,” Z(k) = col (z(k), z(k − 1), ..., z(k − N + 1)), which is N × 1, is called the measurement vector. Its elements are z(j) = h′(j)θ + v(j); θ, which is n × 1, is called the parameter vector, and contains the unknown deterministic or random parameters that will be estimated using one or more of this chapter’s techniques; H(k), which is N × n, is called the observation matrix; and, V(k), which is N × 1, is called the measurement noise vector. By convention, the argument “k” of Z(k), H(k), and V(k) denotes the fact that the last measurement used to construct (15.1) is the kth.
Examples of problems that can be cast into the form of the generic linear model are: identifying
the impulse response coefficients in the convolutional summation model for a linear time-invariant
system from noisy output measurements; identifying the coefficients of a linear time-invariant finite-
difference equation model for a dynamical system from noisy output measurements; function ap-
proximation; state estimation; estimating parameters of a nonlinear model using a linearized version
of that model; deconvolution; and identifying the coefficients in a discretized Volterra series repre-
sentation of a nonlinear system.
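To make the first of these examples concrete, the short sketch below (not from the chapter; the function name and the zero-initial-condition convention are illustrative assumptions) builds H(k) and Z(k) for the convolutional-summation case: each row of H(k) holds past input samples and θ holds the unknown impulse-response coefficients.

```python
import numpy as np

def convolution_observation_matrix(u, N, n):
    """Row j is h'(j) = [u(j), u(j-1), ..., u(j-n+1)], so that the noisy output
    z(j) = h'(j) theta + v(j), where theta holds the n unknown impulse-response
    coefficients. Inputs before time 0 are taken to be zero (illustrative)."""
    H = np.zeros((N, n))
    for j in range(N):
        for i in range(n):
            if j - i >= 0:
                H[j, i] = u[j - i]
    return H

# Illustrative use: stack N = 20 measurements into Z(k) = H(k) theta + V(k).
rng = np.random.default_rng(0)
u = rng.standard_normal(20)
theta = np.array([[1.0], [0.5], [-0.2]])              # n = 3 unknown coefficients
H = convolution_observation_matrix(u, N=20, n=3)
Z = H @ theta + 0.1 * rng.standard_normal((20, 1))    # V(k): measurement noise
```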
The following estimation notation is used throughout this chapter: θ̂(k) denotes an estimate of θ and θ̃(k) denotes the error in estimation, i.e., θ̃(k) = θ − θ̂(k). The generic linear model is the starting point for the derivation of many classical parameter estimation techniques, and the estimation model for Z(k) is Ẑ(k) = H(k)θ̂(k). In the rest of this chapter we develop specific structures for θ̂(k). These structures are referred to as estimators. Estimates are obtained whenever data are processed by an estimator.
15.2 Least-Squares Estimation
The method of least squares dates back to Karl Gauss around 1795 and is the cornerstone for most
estimation theory. The weighted least-squares estimator (WLSE), θ̂_WLS(k), is obtained by minimizing the objective function J[θ̂(k)] = Z̃′(k)W(k)Z̃(k), where [using (15.1)] Z̃(k) = Z(k) − Ẑ(k) = H(k)θ̃(k) + V(k), and weighting matrix W(k) must be symmetric and positive definite. This weighting matrix can be used to weight recent measurements more (or less) heavily than past measurements. If W(k) = cI, so that all measurements are weighted the same, then weighted least-squares reduces to least squares, in which case, we obtain θ̂_LS(k). Setting dJ[θ̂(k)]/dθ̂(k) = 0, we find that:
θ̂_WLS(k) = [H′(k)W(k)H(k)]⁻¹ H′(k)W(k)Z(k)     (15.2)

and, consequently,

θ̂_LS(k) = [H′(k)H(k)]⁻¹ H′(k)Z(k)     (15.3)
Note, also, that J[θ̂_WLS(k)] = Z′(k)W(k)Z(k) − θ̂′_WLS(k)H′(k)W(k)H(k)θ̂_WLS(k).
Matrix H′(k)W(k)H(k) must be nonsingular for its inverse in (15.2) to exist. This is true if W(k) is positive definite, as assumed, and H(k) is of maximum rank. We know that θ̂_WLS(k) minimizes J[θ̂_WLS(k)] because d²J[θ̂(k)]/dθ̂²(k) = 2H′(k)W(k)H(k) > 0, since H′(k)W(k)H(k) is invertible. Estimator θ̂_WLS(k) processes the measurements Z(k) linearly; hence, it is referred to as a linear
estimator. In practice, we do not compute θ̂_WLS(k) using (15.2), because computing the inverse of H′(k)W(k)H(k) is fraught with numerical difficulties. Instead, the so-called normal equations [H′(k)W(k)H(k)]θ̂_WLS(k) = H′(k)W(k)Z(k) are solved using stable algorithms from numerical linear algebra (e.g., [3] indicates that one approach to solving the normal equations is to convert the original least squares problem into an equivalent, easy-to-solve problem using orthogonal transformations such as Householder or Givens transformations). Note, also, that (15.2) and (15.3) apply to the estimation of either deterministic or random parameters, because nowhere in the derivation of θ̂_WLS(k) did we have to assume that θ was or was not random. Finally, note that WLSEs may not be invariant under changes of scale. One way to circumvent this difficulty is to use normalized data.
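The sketch below illustrates that computational advice, under the stated assumptions that W(k) is symmetric positive definite and H(k) has full rank; the function name and the use of NumPy are our own illustrative choices, not the chapter’s. It factors W(k) and solves an equivalent ordinary least-squares problem with a stable, orthogonal-transformation-based solver instead of forming the inverse in (15.2).

```python
import numpy as np

def wlse(H, Z, W):
    """Weighted least-squares estimate without explicitly inverting H'WH."""
    L = np.linalg.cholesky(W)          # W = L L', W symmetric positive definite
    Hw = L.T @ H                       # "whitened" observation matrix
    Zw = L.T @ Z                       # "whitened" measurements
    # Minimizing ||Hw theta - Zw||^2 is equivalent to minimizing Z~'(k)W(k)Z~(k).
    theta_hat, *_ = np.linalg.lstsq(Hw, Zw, rcond=None)
    return theta_hat                   # equals [H'WH]^{-1} H'WZ when H has full rank
```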
Least-squares estimates can also be computed using the singular-value decomposition (SVD) of matrix H(k). This computation is valid for both the overdetermined (N > n) and underdetermined (N < n) situations and for the situation when H(k) may or may not be of full rank. The SVD of K × M matrix A is:

U′AV = [Σ 0; 0 0]     (15.4)
where U and V are unitary matrices, Σ = diag (σ_1, σ_2, ..., σ_r), and σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0. The σ_i’s are the singular values of A, and r is the rank of A. Let the SVD of H(k) be given by (15.4). Even if H(k) is not of maximum rank, then

θ̂_LS(k) = V [Σ⁻¹ 0; 0 0] U′Z(k)     (15.5)
where Σ⁻¹ = diag (σ_1⁻¹, σ_2⁻¹, ..., σ_r⁻¹) and r is the rank of H(k). Additionally, in the overdetermined case,
θ̂_LS(k) = ∑_{i=1}^{r} [v_i(k)v_i′(k)/σ_i²(k)] H′(k)Z(k)     (15.6)
Similar formulas exist for computing θ̂_WLS(k).
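Here is a minimal sketch of the SVD route of (15.5), assuming NumPy and a simple numerical-rank tolerance of our own choosing; it handles the rank-deficient case by inverting only the r nonzero singular values.

```python
import numpy as np

def lse_via_svd(H, Z, tol=1e-12):
    """LS estimate via the SVD, Eq. (15.5); valid even if H is rank deficient."""
    Z = np.asarray(Z, dtype=float).reshape(-1, 1)      # column measurement vector
    U, s, Vt = np.linalg.svd(H, full_matrices=False)   # H = U diag(s) V'
    r = int(np.sum(s > tol * s[0]))                    # numerical rank of H(k)
    s_inv = np.zeros_like(s)
    s_inv[:r] = 1.0 / s[:r]                            # invert only nonzero sigmas
    # theta_hat = V [Sigma^{-1} 0; 0 0] U' Z(k)
    return Vt.T @ (s_inv[:, None] * (U.T @ Z))
```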
Equations (15.2) and (15.3) are batch equations, because they process all of the measurements
at one time. These formulas can be made recursive in time by using simple vector and matrix
partitioning techniques. The information form of the recursive WLSE is:
θ̂_WLS(k + 1) = θ̂_WLS(k) + K_w(k + 1)[z(k + 1) − h′(k + 1)θ̂_WLS(k)]     (15.7)

K_w(k + 1) = P(k + 1)h(k + 1)w(k + 1)     (15.8)

P⁻¹(k + 1) = P⁻¹(k) + h(k + 1)w(k + 1)h′(k + 1)     (15.9)
Equations (15.8) and (15.9) require the inversion of the n × n matrix P. If n is large, then this will be a costly computation. Applying a matrix inversion lemma to (15.9), one obtains the following alternative covariance form of the recursive WLSE: Equation (15.7), and
K_w(k + 1) = P(k)h(k + 1)[h′(k + 1)P(k)h(k + 1) + 1/w(k + 1)]⁻¹     (15.10)

P(k + 1) = [I − K_w(k + 1)h′(k + 1)]P(k)     (15.11)
Equations (15.7)–(15.9), or (15.7), (15.10), and (15.11), are initialized by θ̂_WLS(n) and P⁻¹(n), where P(n) = [H′(n)W(n)H(n)]⁻¹, and are used for k = n, n + 1, ..., N − 1.
Equation (15.7) can be expressed as

θ̂_WLS(k + 1) = [I − K_w(k + 1)h′(k + 1)]θ̂_WLS(k) + K_w(k + 1)z(k + 1)     (15.12)
which demonstrates that the recursive WLSE is a time-varying digital filter that is excited by random inputs (i.e., the measurements), one whose plant matrix [I − K_w(k + 1)h′(k + 1)] may itself be random because K_w(k + 1) and h(k + 1) may be random, depending upon the specific application. The random natures of these matrices make the analysis of this filter exceedingly difficult.
Two recursions are present in the recursive WLSEs. The first is the vector recursion for θ̂_WLS given by (15.7). Clearly, θ̂_WLS(k + 1) cannot be computed from this expression until measurement z(k + 1) is available. The second is the matrix recursion for either P⁻¹ given by (15.9) or P given by (15.11). Observe that values for these matrices can be precomputed before measurements are made. A digital computer implementation of (15.7)–(15.9) is P⁻¹(k + 1) → P(k + 1) → K_w(k + 1) → θ̂_WLS(k + 1), whereas for (15.7), (15.10), and (15.11), it is P(k) → K_w(k + 1) → θ̂_WLS(k + 1) → P(k + 1).
Finally, the recursive WLSEs can even be used for k = 0, 1,...,N − 1. Often z(0) = 0, or there
is no measurement made at k = 0, so that we can set z(0) = 0. In this case we can set w(0) = 0,
and the recursive WLSEs can be initialized by setting θ̂_WLS(0) = 0 and P(0) to a diagonal matrix of very large numbers. This is very commonly done in practice. Fast fixed-order recursive least-squares algorithms that are based on the Givens rotation [3] and can be implemented using systolic arrays are described in [5] and the references therein.
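As a concrete illustration of the covariance form (15.7), (15.10), and (15.11), here is a minimal sketch using the θ̂_WLS(0) = 0, large-diagonal P(0) initialization just described; the function name and argument layout are illustrative assumptions.

```python
import numpy as np

def recursive_wlse(h_rows, z, w, n, p0=1e6):
    """Covariance-form recursive WLSE, Eqs. (15.7), (15.10), (15.11)."""
    theta = np.zeros((n, 1))               # theta_hat(0) = 0
    P = p0 * np.eye(n)                     # P(0): diagonal of very large numbers
    for k in range(len(z)):
        h = np.asarray(h_rows[k], dtype=float).reshape(n, 1)    # h(k+1)
        denom = float(h.T @ P @ h) + 1.0 / w[k]                 # scalar in (15.10)
        K = (P @ h) / denom                                     # gain K_w(k+1)
        theta = theta + K * (z[k] - float(h.T @ theta))         # update (15.7)
        P = (np.eye(n) - K @ h.T) @ P                           # update (15.11)
    return theta, P
```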
15.3 Properties of Estimators
How do we know whether or not the results obtained from the WLSE, or for that matter any estimator, are good? To answer this question, we must make use of the fact that all estimators represent transformations of random data; hence, θ̂(k) is itself random, so that its properties must be studied from a statistical viewpoint. This fact, and its consequences, which seem so obvious to us today, are due to the eminent statistician R.A. Fisher.
It is common to distinguish between small-sample and large-sample properties of estimators. The
term “sample” refers to the number of measurements used to obtain θ̂, i.e., the dimension of Z. The phrase “small sample” means any number of measurements (e.g., 1, 2, 100, 10^4, or even an infinite number), whereas the phrase “large sample” means “an infinite number of measurements.” Large-sample properties are also referred to as asymptotic properties. If an estimator possesses a small-sample property, it also possesses the associated large-sample property; but the converse is not always true. Although large sample means an infinite number of measurements, estimators begin to enjoy large-sample properties for much fewer than an infinite number of measurements. How few usually depends on the dimension of θ, n, the memory of the estimators, and in general on the underlying, albeit unknown, probability density function.
A thorough study into θ̂ would mean determining its probability density function p(θ̂). Usually, it is too difficult to obtain p(θ̂) for most estimators (unless θ̂ is multivariate Gaussian); thus, it is customary to emphasize the first- and second-order statistics of θ̂ (or its associated error θ̃ = θ − θ̂), the mean and the covariance.
Small-sample properties of an estimator are unbiasedness and efficiency. An estimator is unbiased
if its mean value is tracking the unknown parameter at every value of time, i.e., the mean value of
the estimation error is zero at every value of time. Dispersion about the mean is measured by error
variance. Efficiency is related to how small the error variance will be. Associatedwith efficiency is the
very famous Cramer-Rao inequality (Fisher information matrix, in the case of a vector of parameters)
which places a lower bound on the error variance, a bound that does not depend on a particular
estimator.
Large-sample properties of an estimator are asymptotic unbiasedness, consistency, asymptotic
normality, and asymptotic efficiency. Asymptotic unbiasedness and efficiency are limiting forms of
their small sample counterparts, unbiasedness and efficiency. The importance of an estimator being
asymptotically normal (Gaussian) is that its entire probabilistic description is then known, and it
c
1999 by CRC Press LLC
can be entirely characterized just by its asymptotic first- and second-order statistics. Consistency is
a form of convergence of θ̂(k) to θ; it is synonymous with convergence in probability. One of the reasons for the importance of consistency in estimation theory is that any continuous function of a consistent estimator is itself a consistent estimator, i.e., “consistency carries over.” It is also possible to examine other types of stochastic convergence for estimators, such as mean-squared convergence and convergence with probability 1. A general carry-over property does not exist for these two types of convergence; it must be established case-by-case (e.g., [11]).
Generally speaking, it is very difficult to establish small sample or large sample properties for least-
squares estimators, except in the very special case when H(k) and V(k) are statistically independent.
While this condition is satisfied in the application of identifying an impulse response, it is violated
in the important application of identifying the coefficients in a finite difference equation, as well
as in many other important engineering applications. Many large sample properties of LSEs are
determined by establishing that the LSE is equivalent to another estimator for which it is known that
the large sample property holds true. We pursue this below.
Least-squares estimators require no assumptions about the statistical nature of the generic model.
Consequently, the formula for the WLSE is easy to derive. The price paid for not making assumptions
about the statistical nature of the generic linear model is great difficulty in establishing small or large
sample properties of the resulting estimator.
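Although such properties are established analytically, a quick Monte Carlo sketch can make the unbiasedness discussion above tangible for the special case in which H(k) is deterministic and V(k) is zero mean; all numerical values below are illustrative assumptions.

```python
import numpy as np

# With H fixed (deterministic) and zero-mean noise, the sample mean of the
# LS estimation error over many trials should be near zero (unbiasedness).
rng = np.random.default_rng(0)
N, n, trials = 50, 3, 2000
theta_true = np.array([1.0, -2.0, 0.5])
H = rng.standard_normal((N, n))            # drawn once, then held fixed

errors = []
for _ in range(trials):
    Z = H @ theta_true + 0.3 * rng.standard_normal(N)   # zero-mean noise V(k)
    theta_hat, *_ = np.linalg.lstsq(H, Z, rcond=None)
    errors.append(theta_true - theta_hat)

print("mean estimation error per component:", np.mean(errors, axis=0))
```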
15.4 Best Linear Unbiased Estimation
Our second estimator is both unbiased and efficient by design, and is a linear function of measure-
ments Z(k). It is called a best linear unbiased estimator (BLUE), θ̂_BLU(k). As in the derivation of
the WLSE, we begin with our generic linear model; but, now we make two assumptions about this
model, namely: (1) H(k) must be deterministic, and (2) V(k) must be zero mean with positive
definite known covariance matrix R(k). The derivation of the BLUE is more complicated than the
derivation of the WLSE because of the design constraints; however, its performance analysis is much
easier because we build good performance into its design.
We begin by assuming the following linear structure for θ̂_BLU(k): θ̂_BLU(k) = F(k)Z(k). Matrix F(k) is designed such that: (1) θ̂_BLU(k) is an unbiased estimator of θ, and (2) the error variance for each of the n parameters is minimized. In this way, θ̂_BLU(k) will be unbiased and efficient (within the class of linear estimators) by design. The resulting BLUE estimator is:

θ̂_BLU(k) = [H′(k)R⁻¹(k)H(k)]⁻¹ H′(k)R⁻¹(k)Z(k)     (15.13)
A very remarkable connection exists between the BLUE and WLSE, namely, the BLUE of θ is the special case of the WLSE of θ when W(k) = R⁻¹(k). Consequently, all results obtained in our section above for θ̂_WLS(k) can be applied to θ̂_BLU(k) by setting W(k) = R⁻¹(k). Matrix R⁻¹(k) weights the contributions of precise measurements heavily and deemphasizes the contributions of imprecise measurements. The best linear unbiased estimation design technique has led to a weighting matrix that is quite sensible.
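Here is a minimal sketch of (15.13), treating the BLUE as the WLSE with W(k) = R⁻¹(k); whitening by a Cholesky factor of the assumed-known noise covariance R(k) avoids forming R⁻¹(k) explicitly. The function name is an illustrative choice; the returned matrix P = [H′R⁻¹H]⁻¹ is the BLUE error covariance discussed later in this section.

```python
import numpy as np

def blue(H, Z, R):
    """BLUE of theta: the WLSE with W = R^{-1}, Eq. (15.13)."""
    L = np.linalg.cholesky(R)                    # R = L L'
    Hw = np.linalg.solve(L, H)                   # L^{-1} H
    Zw = np.linalg.solve(L, Z)                   # L^{-1} Z
    theta_hat, *_ = np.linalg.lstsq(Hw, Zw, rcond=None)
    P = np.linalg.inv(Hw.T @ Hw)                 # error covariance [H' R^{-1} H]^{-1}
    return theta_hat, P
```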
If H(k) is deterministic and R(k) = σ_v²I, then θ̂_BLU(k) = θ̂_LS(k). This result, known as the Gauss-Markov theorem, is important because we have connected two seemingly different estimators, one of which, θ̂_BLU(k), has the properties of unbiasedness and minimum variance by design; hence, in this case θ̂_LS(k) inherits these properties.
In a recursive WLSE, matrix P(k) has no special meaning. In a recursive BLUE [which is obtained by substituting W(k) = R⁻¹(k) into (15.7)–(15.9), or (15.7), (15.10), and (15.11)], matrix P(k) is the covariance matrix for the error between θ and θ̂_BLU(k), i.e., P(k) = [H′(k)R⁻¹(k)H(k)]⁻¹ = cov[θ̃_BLU(k)]. Hence, every time P(k) is calculated in the recursive BLUE, we obtain a quantitative measure of how well we are estimating θ.
Recall that we stated that WLSEs may change in numerical value under changes in scale. BLUEs
are invariant under changes in scale. This is accomplished automatically by setting W(k) = R⁻¹(k) in the WLSE.
The fact that H(k) must be deterministic severely limits the applicability of BLUEs in engineering
applications.
15.5 Maximum-Likelihood Estimation
Probability is associated with a forward experiment in which the probability model, p(Z(k)|θ), is specified, including values for the parameters, θ, in that model (e.g., mean and variance in a Gaussian density function), and data (i.e., realizations) are generated using this model. Likelihood, l(θ|Z(k)), is proportional to probability. In likelihood, the data is given as well as the nature of the probability model; but the parameters of the probability model are not specified. They must be determined from the given data. Likelihood is, therefore, associated with an inverse experiment.
The maximum-likelihood method is based on the relatively simple idea that different (statistical)
populations generate different samples and that any given sample (i.e., set of data) is more likely to
have come from some populations than from others.
In order to determine the maximum-likelihood estimate (MLE) of deterministic θ, θ̂_ML, we need to determine a formula for the likelihood function and then maximize that function. Because likelihood is proportional to probability, we need to know the entire joint probability density function of the measurements in order to determine a formula for the likelihood function. This, of course, is much more information about Z(k) than was required in the derivation of the BLUE. In fact, it is the most information that we can ever expect to know about the measurements. The price we pay for knowing so much information about Z(k) is complexity in maximizing the likelihood function. Generally, mathematical programming must be used in order to determine θ̂_ML.
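As a small illustration of that last point, the sketch below forms a Gaussian negative log-likelihood and hands it to a general-purpose numerical optimizer; the data, starting point, and use of SciPy are illustrative assumptions rather than anything prescribed by the chapter.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
z = 2.0 + 0.5 * rng.standard_normal(200)        # data drawn from N(2, 0.25)

def neg_log_likelihood(params):
    """Negative log-likelihood of i.i.d. Gaussian samples."""
    mu, log_sigma = params
    sigma2 = np.exp(2.0 * log_sigma)            # parameterize sigma > 0
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + (z - mu) ** 2 / sigma2)

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])   # mathematical programming step
mu_ml, sigma_ml = result.x[0], np.exp(result.x[1])
print(mu_ml, sigma_ml)                          # should be near 2.0 and 0.5
```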
Maximum-likelihood estimates are very popular and widely used because they enjoy very good large sample properties. They are consistent, asymptotically Gaussian with mean θ and covariance matrix (1/N)J⁻¹, in which J is the Fisher information matrix, and are asymptotically efficient. Functions of maximum-likelihood estimates are themselves maximum-likelihood estimates, i.e., if g(θ) is a vector function mapping θ into an interval in r-dimensional Euclidean space, then g(θ̂_ML) is a MLE of g(θ). This “invariance” property is usually not enjoyed by WLSEs or BLUEs.
In one special case it is very easy to compute θ̂_ML, i.e., for our generic linear model in which H(k) is deterministic and V(k) is Gaussian. In this case θ̂_ML = θ̂_BLU. These estimators are: unbiased, because θ̂_BLU is unbiased; efficient (within the class of linear estimators), because θ̂_BLU is efficient; consistent, because θ̂_ML is consistent; and, Gaussian, because they depend linearly on Z(k), which is Gaussian. If, in addition, R(k) = σ_v²I, then θ̂_ML(k) = θ̂_BLU(k) = θ̂_LS(k), and these estimators are unbiased, efficient (within the class of linear estimators), consistent, and Gaussian.
The method of maximum-likelihood is limited to deterministic parameters. In the case of random
parameters, we can still use the WLSE or the BLUE, or, if additional information is available, we can
use either a mean-squared or maximum-a posteriori estimator, as described below. The former does
not use statistical information about the random parameters, whereas the latter does.
15.6 Mean-Squared Estimation of Random Parameters
Given measurements z(1), z(2), ..., z(k), the mean-squared estimator (MSE) of random θ, θ̂_MS(k) = φ[z(i), i = 1, 2, ..., k], minimizes the mean-squared error J[θ̃_MS(k)] = E{θ̃′_MS(k)θ̃_MS(k)} [where θ̃_MS(k) = θ − θ̂_MS(k)]. The function φ[z(i), i = 1, 2, ..., k] may be nonlinear or linear. Its exact structure is determined by minimizing J[θ̃_MS(k)].
The solution to this mean-squared estimation problem, which is known as the fundamental theorem of estimation theory, is:

θ̂_MS(k) = E{θ|Z(k)}     (15.14)

As it stands, (15.14) is not terribly useful for computing θ̂_MS(k). In general, we must first compute p[θ|Z(k)] and then perform the requisite number of integrations of θp[θ|Z(k)] to obtain θ̂_MS(k).
It is useful to separate this computation into two major cases; (1) θ and Z(k) are jointly Gaussian —
the Gaussian case, and (2) θ and Z(k) are not jointly Gaussian — the non-Gaussian case.
When θ and Z(k) are jointly Gaussian, the estimator that minimizes the mean-squared error is

θ̂_MS(k) = m_θ + P_θz(k)P_z⁻¹(k)[Z(k) − m_z(k)]     (15.15)
where m_θ is the mean of θ, m_z(k) is the mean of Z(k), P_z(k) is the covariance matrix of Z(k), and P_θz(k) is the cross-covariance between θ and Z(k). Of course, to compute θ̂_MS(k) using (15.15), we must somehow know all of these statistics, and we must be sure that θ and Z(k) are jointly Gaussian. For the generic linear model, Z(k) = H(k)θ + V(k), in which H(k) is deterministic, V(k) is Gaussian noise with known invertible covariance matrix R(k), θ is Gaussian with mean m_θ and covariance matrix P_θ, and, θ and V(k) are statistically independent, then θ and Z(k) are jointly Gaussian, and, (15.15) becomes
θ̂_MS(k) = m_θ + P_θH′(k)[H(k)P_θH′(k) + R(k)]⁻¹[Z(k) − H(k)m_θ]     (15.16)
where error-covariance matrix P_MS(k), which is associated with θ̂_MS(k), is

P_MS(k) = P_θ − P_θH′(k)[H(k)P_θH′(k) + R(k)]⁻¹H(k)P_θ = [P_θ⁻¹ + H′(k)R⁻¹(k)H(k)]⁻¹     (15.17)
Using (15.17) in (15.16), θ̂_MS(k) can be reexpressed as

θ̂_MS(k) = m_θ + P_MS(k)H′(k)R⁻¹(k)[Z(k) − H(k)m_θ]     (15.18)
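Here is a minimal sketch of (15.16) and (15.17) for this jointly Gaussian case, assuming the prior statistics m_θ, P_θ and the noise covariance R(k) are known; the function name and the direct matrix inverse are illustrative simplifications.

```python
import numpy as np

def ms_estimate(H, Z, m_theta, P_theta, R):
    """Mean-squared estimate and error covariance, Eqs. (15.16)-(15.17)."""
    S = H @ P_theta @ H.T + R                          # H P_theta H' + R
    gain = P_theta @ H.T @ np.linalg.inv(S)
    theta_hat = m_theta + gain @ (Z - H @ m_theta)     # Eq. (15.16)
    P_ms = P_theta - gain @ H @ P_theta                # Eq. (15.17)
    return theta_hat, P_ms
```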
Suppose θ and Z(k) are not jointly Gaussian and that we know m_θ, m_z(k), P_z(k), and P_θz(k). In this case, the estimator that is constrained to be an affine transformation of Z(k) and that minimizes the mean-squared error is also given by (15.15).
We now know the answer to the following important question: When is the linear (affine) mean-squared estimator the same as the mean-squared estimator? The answer is when θ and Z(k) are jointly Gaussian. If θ and Z(k) are not jointly Gaussian, then θ̂_MS(k) = E{θ|Z(k)}, which, in general, is a nonlinear function of measurements Z(k), i.e., it is a nonlinear estimator.
Associated with mean-squared estimation theory is the orthogonality principle: Suppose f[Z(k)] is any function of the data Z(k); then the error in the mean-squared estimator is orthogonal to f[Z(k)] in the sense that E{[θ − θ̂_MS(k)]f′[Z(k)]} = 0. A frequently encountered special case of this occurs when f[Z(k)] = θ̂_MS(k), in which case E{θ̃_MS(k)θ̂′_MS(k)} = 0.
When θ and Z(k) are jointly Gaussian, θ̂_MS(k) in (15.15) has the following properties: (1) it is unbiased; (2) each of its components has the smallest error variance; (3) it is a “linear” (affine) estimator; (4) it is unique; and, (5) both θ̂_MS(k) and θ̃_MS(k) are multivariate Gaussian, which means that these quantities are completely characterized by their first- and second-order statistics. Tremendous simplifications occur when θ and Z(k) are jointly Gaussian!
Many of the results presented in this section are applicable to objective functions other than the
mean-squared objective function. See the supplementary material at the end of Lesson 13 in [12] for
discussions on a wide number of objective functions that lead to E{θ |Z(k)} as the optimal estimator
of θ, as well as discussions on a full-blown nonlinear estimator of θ.