CHAPTER 31
Residuals: Standardized, Predictive, “Studentized”
31.1. Three Decisions about Plotting Residuals
After running a regression it is always advisable to look at the residuals. Here
one has to make three decisions.
The first decision is whether to look at the ordinary residuals
(31.1.1)    ε̂_i = y_i − x_i^⊤ β̂

(x_i^⊤ is the ith row of X), or the “predictive” residuals, which are the residuals
computed using the OLS estimate of β gained from all the other data except the
data point where the residual is taken. If one writes β̂(i) for the OLS estimate
without the ith observation, the defining equation for the ith predictive residual,
which we call ε̂_i(i), is

(31.1.2)    ε̂_i(i) = y_i − x_i^⊤ β̂(i).
The second decision is whether to standardize the residuals or not, i.e., whether
to divide them by their estimated standard deviations or not. Since ˆε = My, the
variance of the ith ordinary residual is
(31.1.3)    var[ε̂_i] = σ² m_ii = σ² (1 − h_ii),
and regarding the predictive residuals it will be shown below, see (31.2.9), that
(31.1.4)    var[ε̂_i(i)] = σ²/m_ii = σ²/(1 − h_ii).
Here
(31.1.5)    h_ii = x_i^⊤ (X^⊤X)⁻¹ x_i.
(Note that x_i is the ith row of X written as a column vector.) h_ii is the ith diagonal
element of the “hat matrix” H = X(X^⊤X)⁻¹X^⊤, the projector on the column
space of X. This projector is called “hat matrix” because ŷ = Hy, i.e., H puts the
“hat” on y.
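The following short computation is a minimal numerical sketch (Python with numpy, on a simulated design matrix with hypothetical dimensions) of the quantities just defined: the hat-matrix diagonal h_ii and the residual variances (31.1.3) and (31.1.4) it implies.

import numpy as np

# Simulated design: n observations, k columns including a constant (assumed sizes).
rng = np.random.default_rng(0)
n, k = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix: puts the "hat" on y
h = np.diag(H)                            # leverages h_ii

sigma2 = 1.0                              # disturbance variance, taken as known here
var_ordinary = sigma2 * (1 - h)           # (31.1.3): variance of the ordinary residuals
var_predictive = sigma2 / (1 - h)         # (31.1.4): variance of the predictive residuals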
Problem 362. 2 points Show that the ith diagonal element of the “hat matrix”
H = X(X^⊤X)⁻¹X^⊤ is x_i^⊤(X^⊤X)⁻¹x_i where x_i is the ith row of X written as a
column vector.
Answer. In terms of e_i, the n-vector with 1 on the ith place and 0 everywhere else, x_i =
X^⊤e_i, and the ith diagonal element of the hat matrix is e_i^⊤He_i = e_i^⊤X(X^⊤X)⁻¹X^⊤e_i =
x_i^⊤(X^⊤X)⁻¹x_i. □
Problem 363. 2 points The variance of the ith disturbance is σ². Is the variance
of the ith residual bigger than σ², smaller than σ², or equal to σ²? (Before doing the
math, first argue in words what you would expect it to be.) What about the variance
of the predictive residual? Prove your answers mathematically. You are allowed to
use (31.2.9) without proof.
Answer. Here is only the math part of the answer: ε̂ = My. Since M = I − H is idempotent
and symmetric, we get V[My] = σ²M; in particular this means var[ε̂_i] = σ² m_ii where m_ii is the
ith diagonal element of M. Then m_ii = 1 − h_ii. Since all diagonal elements of projection matrices
are between 0 and 1, the answer is: the variances of the ordinary residuals cannot be bigger than
σ². Regarding predictive residuals, if we plug m_ii = 1 − h_ii into (31.2.9) it becomes

(31.1.6)    ε̂_i(i) = (1/m_ii) ε̂_i,    therefore    var[ε̂_i(i)] = (1/m_ii²) σ² m_ii = σ²/m_ii,

which is bigger than σ². □
Problem 364. Decide in the following situations whether you want predictive
residuals or ordinary residuals, and whether you want them standardized or not.
• a. 1 point You are looking at the residuals in order to check whether the associated
data points are outliers and perhaps do not belong in the model.
Answer. Here one should use the predictive residuals. If the ith observation is an outlier
which should not be in the regression, then one should not use it when running the regression. Its
inclusion may have a strong influence on the regression result, and therefore the ordinary residual
may not be as conspicuous. One should standardize them. □
• b. 1 point You are looking at the residuals in order to assess whether there is
heteroskedasticity.
Answer. Here you want them standardized, but there is no reason to use the predictive
residuals. Ordinary residuals are a little more precise than predictive residuals because they are
based on more observations. 
• c. 1 point You are looking at the residuals in order to assess whether the
disturbances are autocorrelated.
Answer. Same answer as for b. 
• d. 1 point You are looking at the residuals in order to assess whether the
disturbances are normally distributed.

Answer. In my view, one should make a normal QQ-plot of standardized residuals, but one
should not use the predictive residuals. To see why, let us first look at the distribution of the
standardized residuals before division by s. Each ε̂_i/√(1 − h_ii) is normally distributed with mean
zero and standard deviation σ. (But different such residuals are not independent.) If one takes a
QQ-plot of those residuals against the normal distribution, one will get in the limit a straight line
with slope σ. If one divides every residual by s, the slope will be close to 1, but one will again get
something approximating a straight line. The fact that s is random does not affect the relation
of the residuals to each other, and this relation is what determines whether or not the QQ-plot
approximates a straight line.
But Belsley, Kuh, and Welsch on [BKW80, p. 43] draw a normal probability plot of the
studentized, not the standardized, residuals. They give no justification for their choice. I think it
is the wrong choice.

• e. 1 point Is there any situation in which you do not want to standardize the
residuals?
Answer. Standardization is a mathematical procedure which is justified when certain con-
ditions hold. But there is no guarantee that these conditions actually hold, and in order to get
a more immediate impression of the fit of the curve one may want to look at the unstandardized
residuals. □
The third decision is how to plot the residuals. Never do it against y. Either
do it against the predicted ˆy, or make several plots against all the columns of the
X-matrix.
In time series, a plot of the residuals against time is also called for.

Another option is the partial residual plot, see about this also (30.0.2). Say β̂[h]
is the estimated parameter vector, estimated with the full model but with the hth
parameter dropped after estimation, X[h] is the X-matrix without the hth column,
and x_h is the hth column of the X-matrix. Then by (30.0.4), the estimate of the
hth slope parameter is the same as that in the simple regression of y − X[h]β̂[h]
on x_h. The plot of y − X[h]β̂[h] against x_h is called the hth partial residual plot.
To understand this better, start out with a regression y_i = α + βx_i + γz_i + ε_i,
which gives you the decomposition y_i = α̂ + β̂x_i + γ̂z_i + ε̂_i. Now if you regress
y_i − α̂ − β̂x_i on x_i and z_i, then the intercept will be zero, the estimated coefficient
of x_i will be zero, the estimated coefficient of z_i will be γ̂, and the residuals will
be ε̂_i. The plot of y_i − α̂ − β̂x_i versus z_i is the partial residual plot for z.
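Here is a minimal numerical sketch of this two-regressor example (Python with numpy, simulated data, hypothetical coefficient values); it checks that regressing y_i − α̂ − β̂x_i on x_i and z_i returns a zero intercept, a zero coefficient on x, and γ̂ on z.

import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 2.0 + 0.5 * x + 1.5 * z + rng.normal(scale=0.3, size=n)   # assumed true model

# Full regression y_i = alpha + beta*x_i + gamma*z_i + eps_i
W = np.column_stack([np.ones(n), x, z])
alpha_hat, beta_hat, gamma_hat = np.linalg.lstsq(W, y, rcond=None)[0]

# Regress the partial residuals y - alpha_hat - beta_hat*x on a constant, x, and z:
# intercept and coefficient of x come out (numerically) zero, coefficient of z is gamma_hat.
coef = np.linalg.lstsq(W, y - alpha_hat - beta_hat * x, rcond=None)[0]
print(coef, gamma_hat)

# The partial residual plot for z would plot y - alpha_hat - beta_hat*x against z.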
31.2. Relationship between Ordinary and Predictive Residuals
In equation (31.1.2), the ith predictive residual was defined in terms of β̂(i),
the parameter estimate from the regression of y on X with the ith observation left
out. We will show now that there is a very simple mathematical relationship between
the ith predictive residual and the ith ordinary residual, namely, equation (31.2.9).
(It is therefore not necessary to run n different regressions to get the n predictive
residuals.)
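Before going through the algebra, here is a quick numerical check of this claim (Python with numpy, simulated data with hypothetical dimensions): the predictive residuals obtained by brute force from n leave-one-out regressions coincide with ε̂_i/(1 − h_ii), as equation (31.2.9) below asserts.

import numpy as np

rng = np.random.default_rng(2)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)   # assumed true parameters

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

loo = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]   # beta_hat(i)
    loo[i] = y[i] - X[i] @ beta_i                               # ith predictive residual

print(np.allclose(loo, resid / (1 - h)))   # True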
We will write y(i) for the y vector with the ith element deleted, and X(i) is the
matrix X with the ith row deleted.
Problem 365. 2 points Show that

(31.2.1)    X(i)^⊤X(i) = X^⊤X − x_i x_i^⊤
(31.2.2)    X(i)^⊤y(i) = X^⊤y − x_i y_i.

Answer. Write (31.2.2) as X^⊤y = X(i)^⊤y(i) + x_i y_i, and observe that with our definition of
x_i as column vectors representing the rows of X, X^⊤ = [x_1 ⋯ x_n]. Therefore

(31.2.3)    X^⊤y = [x_1 ⋯ x_n] [y_1; … ; y_n] = x_1 y_1 + ⋯ + x_n y_n.

□
An important stepping stone towards the proof of (31.2.9) is equation (31.2.8),
which gives a relationship between h_ii and

(31.2.4)    h_ii(i) = x_i^⊤ (X(i)^⊤X(i))⁻¹ x_i.

ŷ_i(i) = x_i^⊤β̂(i) has variance σ² h_ii(i). The following problems give the steps neces-
sary to prove (31.2.8). We begin with a simplified version of theorem A.8.2 in the
Mathematical Appendix:

Theorem 31.2.1. Let A be a nonsingular k × k matrix, δ ≠ 0 a scalar, and b a
k × 1 vector with b^⊤A⁻¹b + δ ≠ 0. Then

(31.2.5)    (A + bb^⊤/δ)⁻¹ = A⁻¹ − A⁻¹bb^⊤A⁻¹ / (δ + b^⊤A⁻¹b).
Problem 366. Prove (31.2.5) by showing that the product of the matrix with its
alleged inverse is the unit matrix.
Problem 367. As an application of (31.2.5) show that

(31.2.6)    (X^⊤X)⁻¹ + (X^⊤X)⁻¹ x_i x_i^⊤ (X^⊤X)⁻¹ / (1 − h_ii)

is the inverse of X(i)^⊤X(i).

Answer. This is (31.2.5), or (A.8.20), with A = X^⊤X, b = x_i, and δ = −1. □
Problem 368. Using (31.2.6) show that

(31.2.7)    (X(i)^⊤X(i))⁻¹ x_i = (1/(1 − h_ii)) (X^⊤X)⁻¹ x_i,

and using (31.2.7) show that h_ii(i) is related to h_ii by the equation

(31.2.8)    1 + h_ii(i) = 1/(1 − h_ii).

[Gre97, (9-37) on p. 445] was apparently not aware of this relationship.

Problem 369. Prove the following mathematical relationship between predictive
residuals and ordinary residuals:

(31.2.9)    ε̂_i(i) = (1/(1 − h_ii)) ε̂_i

which is the same as (28.0.29), only in a different notation.
Answer. For this we have to apply the above mathematical tools. With the help of (31.2.7)
(transpose it!) and (31.2.2), (31.1.2) becomes

ε̂_i(i) = y_i − x_i^⊤ (X(i)^⊤X(i))⁻¹ X(i)^⊤ y(i)
       = y_i − (1/(1 − h_ii)) x_i^⊤ (X^⊤X)⁻¹ (X^⊤y − x_i y_i)
       = y_i − (1/(1 − h_ii)) x_i^⊤ β̂ + (1/(1 − h_ii)) x_i^⊤ (X^⊤X)⁻¹ x_i y_i
       = y_i (1 + h_ii/(1 − h_ii)) − (1/(1 − h_ii)) x_i^⊤ β̂
       = (1/(1 − h_ii)) (y_i − x_i^⊤ β̂)

This is a little tedious but simplifies extremely nicely at the end. □
The relationship (31.2.9) is so simple because the estimation of η_i = x_i^⊤β can be
done in two steps. First collect the information which the n − 1 observations other
than the ith contribute to the estimation of η_i = x_i^⊤β; it is contained in ŷ_i(i). The
information from all observations except the ith can be written as

(31.2.10)    ŷ_i(i) = η_i + δ_i,    δ_i ∼ (0, σ² h_ii(i))

Here δ_i is the “sampling error” or “estimation error” ŷ_i(i) − η_i from the regression of
y(i) on X(i). If we combine this compound “observation” with the ith observation
y_i, we get
(31.2.11)    [ŷ_i(i); y_i] = [1; 1] η_i + [δ_i; ε_i],    where [δ_i; ε_i] ∼ ( [0; 0], σ² [h_ii(i), 0; 0, 1] )
This is a regression model similar to model (18.1.1), but this time with a nonspherical
covariance matrix.
Problem 370. Show that the BLUE of η_i in model (31.2.11) is

(31.2.12)    ŷ_i = (1 − h_ii) ŷ_i(i) + h_ii y_i = ŷ_i(i) + h_ii ε̂_i(i)

Hint: apply (31.2.8). Use this to prove (31.2.9).

Answer. As shown in problem 206, the BLUE in this situation is the weighted average of the
observations with the weights proportional to the inverses of the variances. I.e., the first observation
has weight

(31.2.13)    (1/h_ii(i)) / (1/h_ii(i) + 1) = 1/(1 + h_ii(i)) = 1 − h_ii.

Since the sum of the weights must be 1, the weight of the second observation is h_ii.
Here is an alternative solution, using formula (26.0.2) for the BLUE, which reads here

ŷ_i = ( [1  1] [h_ii/(1 − h_ii), 0; 0, 1]⁻¹ [1; 1] )⁻¹ [1  1] [h_ii/(1 − h_ii), 0; 0, 1]⁻¹ [ŷ_i(i); y_i]
    = h_ii [1  1] [(1 − h_ii)/h_ii, 0; 0, 1] [ŷ_i(i); y_i]
    = (1 − h_ii) ŷ_i(i) + h_ii y_i.

Now subtract this last formula from y_i to get y_i − ŷ_i = (1 − h_ii)(y_i − ŷ_i(i)), which is (31.2.9). □
31.3. Standardization
In this section we will show that the standardized predictive residual is what is
sometimes called the “studentized” residual. It is recommended not to use the term
“studentized residual” but say “standardized predictive residual” instead.
The standardization of the ordinary residuals has two steps: every ε̂_i is divided
by its “relative” standard deviation √(1 − h_ii), and then by s, an estimate of σ, the
standard deviation of the true disturbances. In formulas,

(31.3.1)    the ith standardized ordinary residual = ε̂_i / (s √(1 − h_ii)).
Standardization of the ith predictive residual has the same two steps: first divide
the predictive residual (31.2.9) by the relative standard deviation, and then divide by
s(i). But a look at formula (31.2.9) shows that the ordinary and the predictive resid-
ual differ only by a nonrandom factor. Therefore the first step of the standardization
yields exactly the same result whether one starts with an ordinary or a predictive
residual. Standardized predictive residuals differ therefore from standardized ordi-
nary residuals only in the second step:
(31.3.2)    the ith standardized predictive residual = ε̂_i / (s(i) √(1 − h_ii)).
Note that equation (31.3.2) writes the standardized predictive residual as a function
of the ordinary residual, not the predictive residual. The standardized predictive
residual is sometimes called the “studentized” residual.
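The following is a small numerical sketch of the two standardizations (Python with numpy, simulated data, hypothetical dimensions): s comes from the full regression, while s(i) is computed here by brute-force leave-one-out fits rather than by any shortcut formula.

import numpy as np

rng = np.random.default_rng(3)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)   # assumed true parameters

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
s = np.sqrt(resid @ resid / (n - k))             # estimate of sigma from the full regression

standardized = resid / (s * np.sqrt(1 - h))      # (31.3.1)

# s(i): sigma estimated without the ith observation (brute force, for clarity)
s_i = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    b_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    r_i = y[mask] - X[mask] @ b_i
    s_i[i] = np.sqrt(r_i @ r_i / (n - 1 - k))

studentized = resid / (s_i * np.sqrt(1 - h))     # (31.3.2), the standardized predictive residual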
Problem 371. 3 points The ith predictive residual has the formula

(31.3.3)    ε̂_i(i) = (1/(1 − h_ii)) ε̂_i

You do not have to prove this formula, but you are asked to derive the standard
deviation of ε̂_i(i), and to derive from it a formula for the standardized ith predictive
residual.
This similarity between these two formulas has led to widespread confusion.
Even [BKW80] seem to have been unaware of the significance of “studentization”;
they do not work with the concept of predictive residuals at all.
The standardized predictive residuals have a t-distribution, because they are
a normally distributed variable divided by the square root of an independent χ² over its degrees of
freedom. (But note that the joint distribution of all standardized predictive residuals
is not a multivariate t.) Therefore one can use the quantiles of the t-distribution to
judge, from the size of these residuals, whether one has an extreme observation or
not.
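As a sketch of how one might use this in practice (assuming the studentized residuals, n, and k from the sketch after (31.3.2) are available; scipy supplies the t quantiles):

import numpy as np
from scipy import stats

def flag_extreme(studentized, n, k, level=0.05):
    # Two-sided comparison with t quantiles; the degrees of freedom n - k - 1
    # match the leave-one-out estimate s(i) used in the standardization.
    crit = stats.t.ppf(1 - level / 2, df=n - k - 1)
    return np.nonzero(np.abs(studentized) > crit)[0]   # indices of suspiciously large residuals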
Problem 372. Following [DM93, p. 34], we will use (30.0.3) and the other
formulas regarding additional regressors to prove the following: If you add a dummy
variable which has the value 1 for the ith observation and the value 0 for all other
observations to your regression, then the coefficient estimate of this dummy is the ith
predictive residual, and the coefficient estimate of the other parameters after inclusion
of this dummy is equal to β̂(i). To fix notation (and without loss of generality),
assume the ith observation is the last observation, i.e., i = n, and put the dummy
variable first in the regression:

(31.3.4)    [y(n); y_n] = [o, X(n); 1, x_n^⊤] [α; β] + [ε̂(i); ε̂_n]    or    y = [e_n, X] [α; β] + ε
• a. 2 points With the definition X_1 = e_n = [o; 1], write M_1 = I − X_1(X_1^⊤X_1)⁻¹X_1^⊤
as a 2 × 2 partitioned matrix.

Answer.

(31.3.5)    M_1 = [I, o; o^⊤, 1] − [o; 1][o^⊤, 1] = [I, o; o^⊤, 0];    [I, o; o^⊤, 0][z(i); z_i] = [z(i); 0]

i.e., M_1 simply annuls the last element. □
• b. 2 points Either show mathematically, perhaps by evaluating (X_2^⊤M_1X_2)⁻¹X_2^⊤M_1y,
or give a good heuristic argument (as [DM93] do), that regressing M_1y on M_1X
gives the same parameter estimate as regressing y on X with the nth observation
dropped.

Answer. (30.0.2) reads here

(31.3.6)    [y(n); 0] = [X(n); o^⊤] β̂(i) + [ε̂(i); 0]

in other words, the estimate of β is indeed β̂(i), and the first n − 1 elements of the residual are
indeed the residuals one gets in the regression without the ith observation. This is so ugly because
the singularity shows here in the zeros of the last row, usually it does not show so much. But this
way one also sees that it gives zero as the last residual, and this is what one needs to know!
To have a mathematical proof that the last row with zeros does not affect the estimate, evaluate
(30.0.3):

β̂_2 = (X_2^⊤M_1X_2)⁻¹ X_2^⊤M_1y
    = ( [X(n)^⊤, x_n] [I, o; o^⊤, 0] [X(n); x_n^⊤] )⁻¹ [X(n)^⊤, x_n] [I, o; o^⊤, 0] [y(n); y_n]
    = (X(n)^⊤X(n))⁻¹ X(n)^⊤ y(n) = β̂(n)
• c. 2 points Use the fact that the residuals in the regression of M_1y on M_1X
are the same as the residuals in the full regression (31.3.4) to show that α̂ is the nth
predictive residual.

Answer. α̂ is obtained from that last row, which reads y_n = α̂ + x_n^⊤β̂(i), i.e., α̂ is the predictive
residual. □
• d. 2 points Use (30.0.3) with X_1 and X_2 interchanged to get a formula for α̂.

Answer. α̂ = (X_1^⊤MX_1)⁻¹X_1^⊤My = (1/m_nn) ε̂_n = (1/(1 − h_nn)) ε̂_n, where M = I − X(X^⊤X)⁻¹X^⊤. □
• e. 2 points From (30.0.4) it also follows that β̂_2 = (X_2^⊤X_2)⁻¹X_2^⊤(y − X_1β̂_1).
Use this to prove

(31.3.7)    β̂ − β̂(i) = (X^⊤X)⁻¹ x_i ε̂_i / (1 − h_ii)

which is [DM93, equation (1.40) on p. 33].

Answer. For this we also need to show that one gets the right β̂(i) if one regresses y − e_n α̂,
or, in other words, y − e_n ε̂_n(n), on X. In other words, β̂(n) = (X^⊤X)⁻¹X^⊤(y − e_n ε̂_n(n)), which
is exactly (32.4.1). □
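A numerical check of this dummy-variable construction (Python with numpy, simulated data, hypothetical dimensions): the coefficient of the dummy equals the nth predictive residual, and the remaining coefficients equal β̂(n).

import numpy as np

rng = np.random.default_rng(4)
n, k = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)    # assumed true parameters

e_n = np.zeros(n)
e_n[-1] = 1.0                                              # dummy for the last observation
coef = np.linalg.lstsq(np.column_stack([e_n, X]), y, rcond=None)[0]
alpha_hat, beta_with_dummy = coef[0], coef[1:]

beta_loo = np.linalg.lstsq(X[:-1], y[:-1], rcond=None)[0]  # beta_hat(n)
pred_resid_n = y[-1] - X[-1] @ beta_loo                    # nth predictive residual

print(np.isclose(alpha_hat, pred_resid_n))                 # True
print(np.allclose(beta_with_dummy, beta_loo))              # True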

CHAPTER 32
Regression Diagnostics
“Regression Diagnostics” can either concentrate on observations or on variables.
Regarding observations, it looks for outliers or influential data in the dataset. Re-
garding variables, it checks whether there are highly collinear variables, or it keeps
track of how much each variable contributes to the MSE of the regression. Collinear-
ity is discussed in [DM93, 6.3] and [Gre97, 9.2]. Regression diagnostics needs five
to ten times more computer resources than the regression itself, and often relies on
graphics, therefore it has only recently become part of the standard procedures.
Problem 373. 1 point Define multicollinearity.
• a. 2 points What are the symptoms of multicollinearity?
• b. 2 points How can one detect multicollinearity?
• c. 2 points How can one remedy multicollinearity?
32.1. Missing Observations
First case: data on y are missing. Filling in the missing y with their least squares
predictions does not change the estimates, and although the computer will report a
gain in efficiency, there is none.
What other schemes are there? Filling in the missing y by the arithmetic mean
of the observed y does not give an unbiased estimator.
General conclusion: in a single-equation context, filling in missing y is not a good
idea.
Now missing values in the X-matrix.
If there is only one regressor and a constant term, then the zero-order filling in
of x̄ “results in no changes and is equivalent with dropping the incomplete data.”
The alternative, filling it with zeros and adding a dummy for the data with a
missing observation, amounts to exactly the same thing.
The only case where filling in missing data makes sense is: if you have multiple
regression and you can predict the missing data in the X matrix from the other data
in the X matrix.
32.2. Grouped Data
If single observations are replaced by arithmetic means of groups of observations,
then the error variances vary with the size of the group. If one takes this into
consideration, GLS still has good properties, although having the original data is of
course more efficient.
32.3. Influential Observations and Outliers
The following discussion focuses on diagnostics regarding observations. To be
more precise, we will investigate how each single observation affects the fit established
by the other data. (One may also ask how the addition of any two observations affects
the fit, etc.)
32.3.1. The “Leverage”. The ith diagonal element h_ii of the “hat matrix”
is called the “leverage” of the ith observation. The leverage satisfies the following
identity

(32.3.1)    ŷ_i = (1 − h_ii) ŷ_i(i) + h_ii y_i

h_ii is therefore the weight which y_i has in the least squares estimate ŷ_i of η_i = x_i^⊤β,
compared with all other observations, which contribute to ŷ_i through ŷ_i(i). The
larger this weight, the more strongly this one observation will influence the estimate
of η_i (and if the estimate of η_i is affected, then other parameter estimates may be
affected too).
Problem 374. 3 points Explain the meanings of all the terms in equation (32.3.1)
and use that equation to explain why h_ii is called the “leverage” of the ith observa-
tion. Is every observation with high leverage also “influential” (in the sense that its
removal would greatly change the regression estimates)?

Answer. ŷ_i is the fitted value for the ith observation, i.e., it is the BLUE of η_i, the expected
value of the ith observation. It is a weighted average of two quantities: the actual observation y_i
(which has η_i as expected value), and ŷ_i(i), which is the BLUE of η_i based on all the other
observations except the ith. The weight of the ith observation in this weighted average is called the
“leverage” of the ith observation. The sum of all leverages is always k, the number of parameters
in the regression. If the leverage of one individual point is much greater than k/n, then this point
has much more influence on its own fitted value than one should expect just based on the number
of observations.
Leverage is not the same as influence; if an observation has high leverage, but by accident
the observed value y_i is very close to ŷ_i(i), then removal of this observation will not change the
regression results much. Leverage is potential influence. Leverage does not depend on any of the
observations; one only needs the X matrix to compute it. □
Those observations whose x-values are far away from the other observations have
“leverage” and can therefore potentially influence the regression results more than the

others. h_ii serves as a measure of this distance. Note that h_ii only depends on the X-
matrix, not on y, i.e., points may have a high leverage but not be influential, because
the associated y_i blends well into the fit established by the other data. However,
regardless of the observed value of y, observations with high leverage always affect
the covariance matrix of β̂. The leverage can also be written in terms of determinants:

(32.3.2)    h_ii = ( det(X^⊤X) − det(X(i)^⊤X(i)) ) / det(X^⊤X),

where X(i) is the X-matrix without the ith observation.
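A quick numerical check of (32.3.2) (Python with numpy, simulated design with hypothetical dimensions): the leverage equals the relative drop in det(X^⊤X) when the ith observation is removed.

import numpy as np

rng = np.random.default_rng(5)
n, k = 15, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

d = np.linalg.det(X.T @ X)
h_via_det = np.array([
    (d - np.linalg.det(np.delete(X, i, axis=0).T @ np.delete(X, i, axis=0))) / d
    for i in range(n)
])
print(np.allclose(h, h_via_det))   # True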
Problem 375. Prove equation (32.3.2).
Answer. Since X(i)^⊤X(i) = X^⊤X − x_i x_i^⊤, use theorem A.7.3 with W = X^⊤X, α = −1,
and d = x_i. □
Problem 376. Prove the following facts about the diagonal elements of the so-
called “hat matrix” H = X(X^⊤X)⁻¹X^⊤, which has its name because Hy = ŷ,
i.e., it puts the hat on y.
• a. 1 point H is a projection matrix, i.e., it is symmetric and idempotent.
Answer. Symmetry follows from the laws for the transposes of products: H^⊤ = (ABC)^⊤ =
C^⊤B^⊤A^⊤ = H where A = X, B = (X^⊤X)⁻¹ which is symmetric, and C = X^⊤. Idempotency:
X(X^⊤X)⁻¹X^⊤X(X^⊤X)⁻¹X^⊤ = X(X^⊤X)⁻¹X^⊤. □

• b. 1 point Prove that a symmetric idempotent matrix is nonnegative definite.
Answer. If H is symmetric and idempotent, then for arbitrary g, g^⊤Hg = g^⊤H^⊤Hg =
‖Hg‖² ≥ 0. But g^⊤Hg ≥ 0 for all g is the criterion which makes H nonnegative definite. □
• c. 2 points Show that

(32.3.3)    0 ≤ h_ii ≤ 1

Answer. If e_i is the vector with a 1 on the ith place and zeros everywhere else, then e_i^⊤He_i =
h_ii. From H nonnegative definite follows therefore that h_ii ≥ 0. h_ii ≤ 1 follows because I − H is
symmetric and idempotent (and therefore nonnegative definite) as well: it is the projection on the
orthogonal complement. □
• d. 2 points Show: the average value of the h_ii is Σ h_ii/n = k/n, where k is
the number of columns of X. (Hint: for this you must compute the trace tr H.)

Answer. The average can be written as
(1/n) tr(H) = (1/n) tr(X(X^⊤X)⁻¹X^⊤) = (1/n) tr(X^⊤X(X^⊤X)⁻¹) = (1/n) tr(I_k) = k/n.
Here we used tr BC = tr CB (Theorem A.1.2). 
• e. 1 point Show that (1/n)ιι^⊤ is a projection matrix. Here ι is the n-vector of
ones.
• f. 2 points Show: If the regression has a constant term, then H − (1/n)ιι^⊤ is a
projection matrix.
Answer. If ι, the vector of ones, is one of the columns of X (or a linear combination
of these columns), this means there is a vector a with ι = Xa. From this follows Hιι^⊤ =
X(X^⊤X)⁻¹X^⊤Xaι^⊤ = Xaι^⊤ = ιι^⊤. One can use this to show that H − (1/n)ιι^⊤ is idempotent:

(H − (1/n)ιι^⊤)(H − (1/n)ιι^⊤) = HH − H(1/n)ιι^⊤ − (1/n)ιι^⊤H + (1/n)ιι^⊤(1/n)ιι^⊤
                               = H − (1/n)ιι^⊤ − (1/n)ιι^⊤ + (1/n)ιι^⊤ = H − (1/n)ιι^⊤. □
• g. 1 point Show: If the regression has a constant term, then one can sharpen
inequality (32.3.3) to 1/n ≤ h_ii ≤ 1.

Answer. H − ιι^⊤/n is a projection matrix, therefore nonnegative definite, therefore its diag-
onal elements h_ii − 1/n are nonnegative. □
• h. 3 points Why is h_ii called the “leverage” of the ith observation? To get full
points, you must give a really good verbal explanation.

Answer. Use equation (31.2.12). The effect on any other linear combination of β̂ is less than the
effect on ŷ_i. Distinguish from influence. Leverage depends only on the X matrix, not on y. □
h_ii is closely related to the test statistic testing whether the x_i comes from the
same multivariate normal distribution as the other rows of the X-matrix. Belsley,
Kuh, and Welsch [BKW80, p. 17] say those observations i with h_ii > 2k/n, i.e.,
more than twice the average, should be considered as “leverage points” which might
deserve some attention.
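Here is a short sketch of that rule of thumb (Python with numpy, simulated design with hypothetical dimensions, one row deliberately pushed away from the others):

import numpy as np

rng = np.random.default_rng(6)
n, k = 40, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
X[0, 1:] += 5.0                          # make observation 0 a leverage point

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
print(h.sum())                           # the leverages always sum to k
leverage_points = np.nonzero(h > 2 * k / n)[0]
print(leverage_points)                   # observation 0 should be flagged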
