
CHAPTER 37
OLS With Random Constraint
A Bayesian considers the posterior density the full representation of the informa-
tion provided by sample and prior information. Frequentists have discovered that
one can interpret the parameters of this density as estimators of the key unknown
parameters, and that these estimators have good sampling properties. Therefore
they have tried to re-derive the Bayesian formulas from frequentist principles.
If β satisfies the constraint Rβ = u only approximately or with uncertainty, it
has therefore become customary to specify
(37.0.55)    Rβ = u + η,    η ∼ (o, τ²Φ),    η and ε uncorrelated.

Here it is assumed τ² > 0 and Φ positive definite.
Both interpretations are possible here: either u is a constant, which means nec-
essarily that β is random, or β is as usual a constant and u is random, coming from
whoever happened to do the research (this is why it is called “mixed estimation”).
It is the correct procedure in this situation to do GLS on the model
(37.0.56)
\begin{bmatrix} y \\ u \end{bmatrix}
  = \begin{bmatrix} X \\ R \end{bmatrix} \beta
  + \begin{bmatrix} \varepsilon \\ -\eta \end{bmatrix},
\qquad \text{with} \qquad
\begin{bmatrix} \varepsilon \\ -\eta \end{bmatrix}
  \sim \left( \begin{bmatrix} o \\ o \end{bmatrix},\;
  \sigma^2 \begin{bmatrix} I & O \\ O & \tfrac{1}{\kappa^2} I \end{bmatrix} \right).
Therefore
(37.0.57)    β̂̂ = (X⊤X + κ²R⊤R)⁻¹(X⊤y + κ²R⊤u),    where κ² = σ²/τ².
This β̂̂ is the BLUE if in repeated samples β and u are drawn from such
distributions that Rβ − u has mean o and variance τ²I, but E[β] can be anything.
If one considers both β and u fixed, then β̂̂ is a biased estimator whose
properties depend on how close the true value of Rβ is to u.
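The following numerical sketch (in Python with numpy; the dimensions, variances, and
the choice R = I are made up for illustration) computes the mixed estimator from the
closed form (37.0.57) and checks that it coincides with least squares on the stacked
system (37.0.56), once the second block has been weighted by κ so that both error
blocks have variance σ²:

import numpy as np

rng = np.random.default_rng(0)

# made-up dimensions and variances
n, k = 50, 4
sigma2, tau2 = 1.0, 0.25
kappa2 = sigma2 / tau2

X = rng.normal(size=(n, k))
beta = np.array([1.0, -2.0, 0.5, 3.0])
R = np.eye(k)                                   # constraint matrix, here simply I
eta = rng.normal(scale=np.sqrt(tau2), size=k)
u = R @ beta - eta                              # R beta = u + eta, as in (37.0.55)
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

# closed form (37.0.57)
beta_mixed = np.linalg.solve(X.T @ X + kappa2 * R.T @ R,
                             X.T @ y + kappa2 * R.T @ u)

# GLS on the stacked model (37.0.56): weighting the second block by kappa
# makes both error blocks homoskedastic with variance sigma^2
X_aug = np.vstack([X, np.sqrt(kappa2) * R])
y_aug = np.concatenate([y, np.sqrt(kappa2) * u])
beta_gls = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

print(np.allclose(beta_mixed, beta_gls))        # True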
Under the assumption of constant β and u, the MSE matrix of the mixed estimator
β̂̂ is smaller than that of the OLS β̂ if and only if the true parameter values β, u,
and σ² satisfy

(37.0.58)    (Rβ − u)⊤ ((2/κ²)I + R(X⊤X)⁻¹R⊤)⁻¹ (Rβ − u) ≤ σ².
This condition is a simple extension of (29.6.6).
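This statement can be checked numerically. The following sketch (all numbers are
arbitrary choices of mine) writes down the exact MSE matrices of both estimators for
fixed β and u and verifies that their difference is nonnegative definite exactly when
(37.0.58) holds:

import numpy as np

rng = np.random.default_rng(1)

n, k = 30, 3
sigma2, tau2 = 1.0, 0.5
kappa2 = sigma2 / tau2

X = rng.normal(size=(n, k))
R = np.eye(k)
beta = np.array([0.3, -0.2, 0.1])     # true coefficients, held fixed
u = np.array([0.5, 0.0, 0.0])         # value used in the random constraint, held fixed

XtX = X.T @ X
A_inv = np.linalg.inv(XtX + kappa2 * R.T @ R)

# exact MSE matrices for fixed beta and u
bias = -kappa2 * A_inv @ R.T @ (R @ beta - u)
mse_mixed = sigma2 * A_inv @ XtX @ A_inv + np.outer(bias, bias)
mse_ols = sigma2 * np.linalg.inv(XtX)

# left-hand side of (37.0.58)
d = R @ beta - u
W = (2.0 / kappa2) * np.eye(R.shape[0]) + R @ np.linalg.inv(XtX) @ R.T
condition_holds = d @ np.linalg.solve(W, d) <= sigma2

# the difference of the MSE matrices is nonnegative definite iff (37.0.58) holds
difference_nnd = np.all(np.linalg.eigvalsh(mse_ols - mse_mixed) >= -1e-10)
print(condition_holds, difference_nnd)          # the two booleans agree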
An estimator of the form β̂̂ = (X⊤X + κ²I)⁻¹X⊤y, where κ² is a constant, is called
“ordinary ridge regression.” Ridge regression can be considered the imposition of a
random constraint even though it does not hold, again in an effort to trade bias for
variance; this is similar to the imposition of an exact constraint which does not hold.
An explanation of the term “ridge” given by [VU81, p. 170] is that the ridge solutions
are near a ridge in the likelihood surface (at a point where the ridge is close to the
origin). This ridge is drawn in [VU81, Figures 1.4a and 1.4b].
Problem 402. Derive from (37.0.58) the well-known formula that the MSE of
ordinary ridge regression is smaller than that of the OLS estimator if and only if the
true parameter vector satisfies

(37.0.59)    β⊤ ((2/κ²)I + (X⊤X)⁻¹)⁻¹ β ≤ σ².
Answer. In (37.0.58) set u = o and R = I. □
Whatever the true values of β and σ², there is always a κ² > 0 for which (37.0.59)
or (37.0.58) holds. The corresponding statement for the trace of the MSE-matrix
has been one of the main justifications for ridge regression in [HK70b] and [HK70a],
and much of the literature about ridge regression has been inspired by the hope that
one can estimate κ² in such a way that the MSE is better everywhere. This is indeed
done by the Stein rule.
Ridge regression is reputed to be a good estimator when there is multicollinearity.
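A small simulation can illustrate this (the design, the true β, and the ridge constant
κ² below are arbitrary choices of mine): two regressors are made nearly collinear,
which inflates the variance of OLS, and the simulated total MSEs of the two
estimators are compared.

import numpy as np

rng = np.random.default_rng(2)

# two nearly collinear regressors plus an unrelated one
n, k = 40, 3
z = rng.normal(size=n)
X = np.column_stack([z, z + 0.01 * rng.normal(size=n), rng.normal(size=n)])
beta = np.array([1.0, 1.0, -1.0])
sigma2 = 1.0
kappa2 = 5.0                          # ridge constant, chosen by hand here

reps = 2000
err_ols = np.zeros(reps)
err_ridge = np.zeros(reps)
XtX = X.T @ X
for r in range(reps):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    b_ols = np.linalg.solve(XtX, X.T @ y)
    b_ridge = np.linalg.solve(XtX + kappa2 * np.eye(k), X.T @ y)
    err_ols[r] = np.sum((b_ols - beta) ** 2)
    err_ridge[r] = np.sum((b_ridge - beta) ** 2)

# with nearly collinear columns the OLS variance explodes; the ridge estimator
# typically has a much smaller simulated total MSE despite its bias
print(err_ols.mean(), err_ridge.mean())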
Problem 403. (Not eligible for in-class exams) Assume E[y] = µ, var[y] = σ²,
and you make n independent observations yᵢ. Then the best linear unbiased estimator
of µ on the basis of these observations is the sample mean ȳ. For which range of
values of α is MSE[αȳ; µ] < MSE[ȳ; µ]? Unfortunately, this range depends on µ and
can therefore not be used to improve the estimate.
Answer.

(37.0.60)    MSE[αȳ; µ] = E[(αȳ − µ)²] = E[(αȳ − αµ + αµ − µ)²] < MSE[ȳ; µ] = var[ȳ]

Since E[αȳ − αµ] = 0, the cross term vanishes and the left-hand side equals
α²σ²/n + (1 − α)²µ², so the requirement is

(37.0.61)    α²σ²/n + (1 − α)²µ² < σ²/n

Now simplify it:

(37.0.62)    (1 − α)²µ² < (1 − α²)σ²/n = (1 − α)(1 + α)σ²/n

This cannot be true for α ≥ 1, because for α = 1 one has equality, and for α > 1 the
right-hand side is negative. Therefore we are allowed to assume α < 1, and can divide
by 1 − α without disturbing the inequality:

(37.0.63)    (1 − α)µ² < (1 + α)σ²/n

(37.0.64)    µ² − σ²/n < α(µ² + σ²/n)

The answer is therefore

(37.0.65)    (µ² − σ²/n)/(µ² + σ²/n) < α < 1.    □
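A quick numerical check of this range (the values of µ, σ², and n are arbitrary):

import numpy as np

mu, sigma2, n = 2.0, 4.0, 10

lower = (mu**2 - sigma2 / n) / (mu**2 + sigma2 / n)   # lower bound from (37.0.65)

def mse(alpha):
    # MSE[alpha*ybar; mu] = alpha^2 sigma^2/n + (1 - alpha)^2 mu^2
    return alpha**2 * sigma2 / n + (1 - alpha)**2 * mu**2

alphas = np.linspace(-0.5, 1.5, 2001)
better = alphas[mse(alphas) < mse(1.0)]
print(lower, better.min(), better.max())   # better.min() is near lower, better.max() near 1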
Problem 404. (Not eligible for in-class exams) Assume y = Xβ + ε with ε ∼
(o, σ²I). If prior knowledge is available that Pβ lies in an ellipsoid centered around
p, i.e., (Pβ − p)⊤Φ⁻¹(Pβ − p) ≤ h for some known positive definite symmetric
matrix Φ and scalar h, then one might argue that the SSE should be minimized only
for those β inside this ellipsoid. Show that this inequality constrained minimization
gives the same formula as OLS with a random constraint of the form κ²(Rβ − u) ∼
(o, σ²I) (where R and u are appropriately chosen constants, while κ² depends on
y. You don’t have to compute the precise values, simply indicate how R, u, and κ²
should be determined.)
Answer. Decompose Φ⁻¹ = C⊤C where C is square, and define R = CP and u = Cp.
The mixed estimator β = β̂̂ minimizes

(37.0.66)    (y − Xβ)⊤(y − Xβ) + κ⁴(Rβ − u)⊤(Rβ − u)
(37.0.67)    = (y − Xβ)⊤(y − Xβ) + κ⁴(Pβ − p)⊤Φ⁻¹(Pβ − p)

Choose κ² such that β̂̂ = (X⊤X + κ⁴P⊤Φ⁻¹P)⁻¹(X⊤y + κ⁴P⊤Φ⁻¹p) satisfies the
inequality constraint with equality, i.e., (Pβ̂̂ − p)⊤Φ⁻¹(Pβ̂̂ − p) = h. □
Answer. Now take any β that satisfies (Pβ − p)⊤Φ⁻¹(Pβ − p) ≤ h. Then

(37.0.68)    (y − Xβ̂̂)⊤(y − Xβ̂̂) = (y − Xβ̂̂)⊤(y − Xβ̂̂) + κ⁴(Pβ̂̂ − p)⊤Φ⁻¹(Pβ̂̂ − p) − κ⁴h

(because β̂̂ satisfies the inequality constraint with equality)

(37.0.69)    ≤ (y − Xβ)⊤(y − Xβ) + κ⁴(Pβ − p)⊤Φ⁻¹(Pβ − p) − κ⁴h

(because β̂̂ minimizes (37.0.67))

(37.0.70)    ≤ (y − Xβ)⊤(y − Xβ)

(because β satisfies the inequality constraint). Therefore β = β̂̂ minimizes the
inequality constrained problem. □
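The following sketch carries this recipe out numerically under assumptions of my own
choosing (P = I, Φ = I, and made-up data): if the OLS estimate already lies inside the
ellipsoid it is returned unchanged, otherwise the penalty weight (κ⁴ in the notation
above) is increased by bisection until the constraint holds with equality.

import numpy as np

rng = np.random.default_rng(3)

n, k = 40, 3
X = rng.normal(size=(n, k))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

P = np.eye(k)              # the ellipsoid constrains beta itself here
p = np.zeros(k)            # ellipsoid centered at the origin
Phi_inv = np.eye(k)
h = 0.5                    # (P beta - p)' Phi^{-1} (P beta - p) <= h

def beta_w(w):
    # penalized estimator for penalty weight w (w plays the role of kappa^4)
    return np.linalg.solve(X.T @ X + w * P.T @ Phi_inv @ P,
                           X.T @ y + w * P.T @ Phi_inv @ p)

def gap(w):
    d = P @ beta_w(w) - p
    return d @ Phi_inv @ d - h

if gap(0.0) <= 0:
    b_constrained = beta_w(0.0)    # OLS already satisfies the constraint
else:
    lo, hi = 0.0, 1.0
    while gap(hi) > 0:             # increase the penalty until the ellipsoid is reached
        hi *= 2.0
    for _ in range(60):            # bisection on the penalty weight
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if gap(mid) > 0 else (lo, mid)
    b_constrained = beta_w(hi)

d = P @ b_constrained - p
print(b_constrained, d @ Phi_inv @ d)   # the quadratic form is close to h when the constraint binds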
CHAPTER 38
Stein Rule Estimators
Problem 405. We will work with the regression model y = Xβ + ε with ε ∼
N(o, σ²I), which in addition is “orthonormal,” i.e., the X-matrix satisfies X⊤X = I.

• a. 0 points Write down the simple formula for the OLS estimator β̂ in this
model. Can you think of situations in which such an “orthonormal” model is
appropriate?
Answer. β̂ = X⊤y. Sclove [Scl68] gives as examples: if one regresses on orthonormal
polynomials, or on principal components. I guess also if one simply needs the means
of a random vector. It seems the important fact here is that one can order the
regressors; if this is the case then one can always make the Gram-Schmidt
orthonormalization, which has the advantage that the jth orthonormalized regressor
is a linear combination of the first j ordered regressors. □
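As a small illustration (made-up data; numpy's QR factorization stands in for the
Gram-Schmidt procedure, which it reproduces up to the signs of the columns), the
orthonormalized regressors give β̂ = X⊤y directly:

import numpy as np

rng = np.random.default_rng(4)

# Z holds the original (ordered) regressors
n, k = 30, 3
Z = rng.normal(size=(n, k))
y = Z @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

# the j-th column of Q is a linear combination of the first j columns of Z, and Q'Q = I
Q, _ = np.linalg.qr(Z)

beta_hat = Q.T @ y                                  # OLS in the orthonormal model
beta_lstsq = np.linalg.lstsq(Q, y, rcond=None)[0]   # the same estimate via least squares
print(np.allclose(beta_hat, beta_lstsq))            # True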
• b. 0 points Assume one has Bayesian prior knowledge that β ∼ N(o, τ²I), and
β independent of ε. In the general case, if prior information is β ∼ N(ν, τ²A⁻¹),
the Bayesian posterior mean is β̂_M = (X⊤X + κ²A)⁻¹(X⊤y + κ²Aν) where κ² =
σ²/τ². Show that in the present case β̂_M is proportional to the OLS estimate β̂
with proportionality factor (1 − σ²/(τ² + σ²)), i.e.,

(38.0.71)    β̂_M = β̂ (1 − σ²/(τ² + σ²)).
Answer. The formula given is (36.0.36), and in the present case A⁻¹ = I. One can
also view it as a regression with a random constraint Rβ ∼ (o, τ²I) where R = I,
which is mathematically the same as considering the known mean vector, i.e., the
null vector, as additional observations. In either case one gets

(38.0.72)    β̂_M = (X⊤X + κ²A)⁻¹X⊤y = (X⊤X + κ²R⊤R)⁻¹X⊤y = (I + (σ²/τ²)I)⁻¹X⊤y = β̂ (1 − σ²/(τ² + σ²)),

i.e., it shrinks the OLS β̂ = X⊤y. □
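A numerical confirmation of this shrinkage formula on simulated data (the orthonormal
design is produced by a QR factorization; σ² and τ² are arbitrary):

import numpy as np

rng = np.random.default_rng(5)

n, k = 50, 4
X, _ = np.linalg.qr(rng.normal(size=(n, k)))            # X'X = I
sigma2, tau2 = 1.0, 3.0
kappa2 = sigma2 / tau2

beta = rng.normal(scale=np.sqrt(tau2), size=k)          # prior draw beta ~ N(o, tau2 I)
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

beta_ols = X.T @ y
beta_M = np.linalg.solve(X.T @ X + kappa2 * np.eye(k), X.T @ y)
shrunk = beta_ols * (1 - sigma2 / (tau2 + sigma2))      # formula (38.0.71)

print(np.allclose(beta_M, shrunk))                      # True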
• c. 0 points Formula (38.0.71) can only be used for estimation if the ratio
σ²/(τ² + σ²) is known. This is usually not the case, but it is possible to estimate
both σ² and τ² + σ² from the data. The use of such estimates instead of the actual
values of σ² and τ² in the Bayesian formulas is sometimes called “empirical Bayes.”
Show that E[β̂⊤β̂] = k(τ² + σ²), and that E[y⊤y − β̂⊤β̂] = (n − k)σ², where n is
the number of observations and k is the number of regressors.
Answer. Since y = Xβ + ε ∼ N(o, τ²XX⊤ + σ²I), it follows β̂ = X⊤y ∼ N(o, (σ² + τ²)I)
(where we now have a k-dimensional identity matrix), therefore E[β̂⊤β̂] = k(σ² + τ²).
Furthermore, since My = Mε regardless of whether β is random or not, σ² can be
estimated in the usual manner from the SSE: (n − k)σ² = E[ε̂⊤ε̂] = E[y⊤My] =
E[y⊤y − β̂⊤β̂] because M = I − XX⊤. □
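These two expectations are easy to check by simulation. A rough Monte Carlo sketch
(with arbitrary n, k, σ², and τ²; the averages only approximate the expectations):

import numpy as np

rng = np.random.default_rng(6)

n, k = 50, 4
X, _ = np.linalg.qr(rng.normal(size=(n, k)))   # orthonormal design, X'X = I
sigma2, tau2 = 1.0, 2.0

reps = 20000
bb = np.zeros(reps)     # beta_hat' beta_hat
sse = np.zeros(reps)    # y'y - beta_hat' beta_hat
for r in range(reps):
    beta = rng.normal(scale=np.sqrt(tau2), size=k)
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    b = X.T @ y
    bb[r] = b @ b
    sse[r] = y @ y - b @ b

print(bb.mean(), k * (tau2 + sigma2))    # both should be close to 12
print(sse.mean(), (n - k) * sigma2)      # both should be close to 46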
• d. 0 points If one plugs the unbiased estimates of σ² and τ² + σ² from part (c)
into (38.0.71), one obtains a version of the so-called “James and Stein” estimator

(38.0.73)    β̂_JS = β̂ (1 − c (y⊤y − β̂⊤β̂)/(β̂⊤β̂)).

What is the value of the constant c if one follows the above instructions? (This
estimator has become famous because for k ≥ 3 and c any number between 0 and
2(k − 2)/(n − k + 2) the estimator (38.0.73) has a uniformly lower MSE than the
OLS β̂, where the MSE is measured as the trace of the MSE-matrix.)

Answer. c = k/(n − k). I would need a proof that this is in the bounds. □
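The following Monte Carlo sketch (with a fixed β of my own choosing, fairly close to
the origin, and arbitrary n and k) compares the simulated trace MSE of the OLS
estimator with that of (38.0.73) using the empirical Bayes value c = k/(n − k):

import numpy as np

rng = np.random.default_rng(7)

n, k = 30, 6
X, _ = np.linalg.qr(rng.normal(size=(n, k)))   # X'X = I
sigma2 = 1.0
beta = np.full(k, 0.3)        # fixed, nonrandom beta, fairly close to the origin
c = k / (n - k)               # the empirical Bayes choice derived above

reps = 20000
err_ols = np.zeros(reps)
err_js = np.zeros(reps)
for r in range(reps):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    b = X.T @ y
    b_js = (1 - c * (y @ y - b @ b) / (b @ b)) * b   # formula (38.0.73)
    err_ols[r] = np.sum((b - beta) ** 2)
    err_js[r] = np.sum((b_js - beta) ** 2)

# the simulated trace MSE of OLS is about k*sigma2 = 6; the James-Stein value
# should be noticeably smaller when beta is this close to the origin
print(err_ols.mean(), err_js.mean())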
• e. 0 points The existence of the James and Stein estimator proves that the
OLS estimator is “inadmissible.” What does this mean? Can you explain why the
OLS estimator turns out to be deficient exactly where it ostensibly tries to be strong?
What are the practical implications of this?
The properties of this estimator were first discussed in James and Stein [JS61],
extending the work of Stein [Ste56].
Stein himself did not introduce the estimator as an “empirical Bayes” estimator,
and it is not certain that this is indeed the right way to look at it. In particular, this
approach does not explain why the OLS cannot be uniformly improved upon if k ≤ 2.
But it is a possible and interesting way to look at it. If one pretends one has prior
information that one does not really have but instead “steals” from the data, this
“fraud” can still be successful.
Another interpretation is that these estimators are shrunk versions of unbiased
estimators, and unbiased estimators always get better if one shrinks them a little.
The only problem is that one cannot shrink them too much, and in the case of the
normal distribution, the amount by which one has to shrink them depends on the
unknown parameters. If one estimates the shrinkage factor, one usually does not
know if the noise introduced by this estimated factor is greater or smaller than the
savings. But in the case of the Stein rule, the noise is smaller than the savings.
Problem 406. 0 points Return to the “orthonormal” model y = Xβ + ε with
ε ∼ N(o, σ²I) and X⊤X = I. With the usual assumption of nonrandom β (and
no prior information about β), show that the F-statistic for the hypothesis β = o is

F = (β̂⊤β̂/k) / ((y⊤y − β̂⊤β̂)/(n − k)).
Answer. SSE_r = y⊤y, SSE_u = y⊤y − β̂⊤β̂ as shown above, and the number of
constraints is k. Use equation . . . for the test statistic. □
• a. 0 points Now look at the following “pre-test estimator”: Your estimate of
β is the null vector o if the value of the F-statistic for the test β = o is equal to or
smaller than 1, and your estimate of β is the OLS estimate β̂ if the test statistic has
a value bigger than 1. Mathematically, this estimator can be written in the form

(38.0.74)    β̂_PT = I(F) β̂,

where F is the F-statistic derived in part (1) of this question, and I(F) is the
“indicator function” for F > 1, i.e., I(F) = 0 if F ≤ 1 and I(F) = 1 if F > 1. Now
modify this pre-test estimator by using the following function I(F) instead: I(F) = 0
if F ≤ 1 and I(F) = 1 − 1/F if F > 1. This is no longer an indicator function,
but can be considered a continuous approximation to one. Since the discontinuity is
removed, one can expect that it has, under certain circumstances, better properties
than the indicator function itself. Write down the formula for this modified pre-test
estimator. How does it differ from the Stein rule estimator (38.0.73) (with the value
for c coming from the empirical Bayes approach)? Which estimator would you expect
to be better, and why?
Answer. This modified pre-test estimator has the form

(38.0.75)    β̂_JS+ = o  if  1 − c (y⊤y − β̂⊤β̂)/(β̂⊤β̂) < 0,
             β̂_JS+ = β̂ (1 − c (y⊤y − β̂⊤β̂)/(β̂⊤β̂))  otherwise.

It is equal to the Stein-rule estimator (38.0.73) when the estimated shrinkage factor
1 − c (y⊤y − β̂⊤β̂)/(β̂⊤β̂) is positive, but the shrinkage factor is set to 0 instead of
turning negative. This is why it is commonly called the “positive part” Stein-rule
estimator. Stein conjectured early on, and Baranchik [Bar64] showed, that it is
uniformly better than the Stein rule estimator. □
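A minimal sketch of the two rules as functions (the inputs at the bottom are made-up
numbers, chosen so that the estimated shrinkage factor is negative, which is exactly
the case in which the two rules differ):

import numpy as np

def stein(beta_hat, yty, c):
    # Stein-rule estimator (38.0.73); yty stands for y'y from the orthonormal regression
    factor = 1 - c * (yty - beta_hat @ beta_hat) / (beta_hat @ beta_hat)
    return factor * beta_hat

def stein_positive_part(beta_hat, yty, c):
    # positive-part estimator (38.0.75): a negative shrinkage factor is replaced by 0
    factor = 1 - c * (yty - beta_hat @ beta_hat) / (beta_hat @ beta_hat)
    return max(factor, 0.0) * beta_hat

beta_hat = np.array([0.1, -0.1, 0.05])
yty, n, k = 30.0, 20, 3
c = k / (n - k)
print(stein(beta_hat, yty, c))                # the negative factor flips the sign of beta_hat
print(stein_positive_part(beta_hat, yty, c))  # the zero vector instead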
• b. 0 points Which lessons can one draw about pre-test estimators in general
from this exercise?
Stein rule estimators have not been used very much: they are not equivariant,
and the shrinkage seems arbitrary. Discussing them here brings out two things: the
formulas for random constraints etc. are a pattern according to which one can build
good operational estimators, and some widely used but seemingly ad-hoc procedures
like pre-testing may have deeper foundations and better properties than the
halfway-sophisticated researcher may think.
Problem 407. 6 points Why was it somewhat of a sensation when Charles Stein
came up with an estimator which is uniformly better than the OLS? Discuss the Stein
rule estimator as an empirical Bayes estimator and as a shrinkage estimator, and
discuss the “positive part” Stein rule estimator as a modified pre-test estimator.