Econometric Analysis of Cross Section and Panel Data, by Wooldridge: Chapter 4

II LINEAR MODELS
In this part we begin our econometric analysis of linear models for cross section and panel data. In Chapter 4 we review the single-equation linear model and discuss ordinary least squares estimation. Although this material is, in principle, review, the approach is likely to be different from an introductory linear models course. In addition, we cover several topics that are not traditionally covered in texts but that have proven useful in empirical work. Chapter 5 discusses instrumental variables estimation of the linear model, and Chapter 6 covers some remaining topics to round out our treatment of the single-equation model.

Chapter 7 begins our analysis of systems of equations. The general setup is that the number of population equations is small relative to the (cross section) sample size. This allows us to cover seemingly unrelated regression models for cross section data as well as begin our analysis of panel data. Chapter 8 builds on the framework from Chapter 7 but considers the case where some explanatory variables may be correlated with the error terms. Generalized method of moments estimation is the unifying theme. Chapter 9 applies the methods of Chapter 8 to the estimation of simultaneous equations models, with an emphasis on the conceptual issues that arise in applying such models.

Chapter 10 explicitly introduces unobserved-effects linear panel data models. Under the assumption that the explanatory variables are strictly exogenous conditional on the unobserved effect, we study several estimation methods, including fixed effects, first differencing, and random effects. The last method assumes, at a minimum, that the unobserved effect is uncorrelated with the explanatory variables in all time periods. Chapter 11 considers extensions of the basic panel data model, including failure of the strict exogeneity assumption.
4 The Single-Equation Linear Model and OLS Estimation
4.1 Overview of the Single-Equation Linear Model
This and the next couple of chapters cover what is still the workhorse in empirical economics: the single-equation linear model. Though you are assumed to be comfortable with ordinary least squares (OLS) estimation, we begin with OLS for a couple of reasons. First, it provides a bridge between more traditional approaches to econometrics, which treat explanatory variables as fixed, and the current approach, which is based on random sampling with stochastic explanatory variables. Second, we cover some topics that receive at best cursory treatment in first-semester texts. These topics, such as proxy variable solutions to the omitted variable problem, arise often in applied work.
The population model we study is linear in its parameters,

    y = β_0 + β_1 x_1 + β_2 x_2 + ... + β_K x_K + u    (4.1)

where y, x_1, x_2, ..., x_K are observable random scalars (that is, we can observe them in a random sample of the population), u is the unobservable random disturbance or error, and β_0, β_1, β_2, ..., β_K are the parameters (constants) we would like to estimate.
The error form of the model in equation (4.1) is useful for presenting a unified treatment of the statistical properties of various econometric procedures. Nevertheless, the steps one uses for getting to equation (4.1) are just as important. Goldberger (1972) defines a structural model as one representing a causal relationship, as opposed to a relationship that simply captures statistical associations. A structural equation can be obtained from an economic model, or it can be obtained through informal reasoning. Sometimes the structural model is directly estimable. Other times we must combine auxiliary assumptions about other variables with algebraic manipulations to arrive at an estimable model. In addition, we will often have reasons to estimate nonstructural equations, sometimes as a precursor to estimating a structural equation.

The error term u can consist of a variety of things, including omitted variables and measurement error (we will see some examples shortly). The parameters β_j hopefully correspond to the parameters of interest, that is, the parameters in an underlying structural model. Whether this is the case depends on the application and the assumptions made.
As we will see in Section 4.2, the key condition needed for OLS to consistently estimate the β_j (assuming we have available a random sample from the population) is that the error (in the population) has mean zero and is uncorrelated with each of the regressors:

    E(u) = 0,  Cov(x_j, u) = 0,  j = 1, 2, ..., K    (4.2)

The zero-mean assumption is for free when an intercept is included, and we will restrict attention to that case in what follows. It is the zero covariance of u with each x_j that is important. From Chapter 2 we know that equation (4.1) and assumption (4.2) are equivalent to defining the linear projection of y onto (1, x_1, x_2, ..., x_K) as β_0 + β_1 x_1 + β_2 x_2 + ... + β_K x_K.
Sufficient for assumption (4.2) is the zero conditional mean assumption

    E(u | x_1, x_2, ..., x_K) = E(u | x) = 0    (4.3)

Under equation (4.1) and assumption (4.3) we have the population regression function

    E(y | x_1, x_2, ..., x_K) = β_0 + β_1 x_1 + β_2 x_2 + ... + β_K x_K    (4.4)
As we saw in Chapter 2, equation (4.4) includes the case where the x_j are nonlinear functions of underlying explanatory variables, such as

    E(savings | income, size, age, college) = β_0 + β_1 log(income) + β_2 size + β_3 age
        + β_4 college + β_5 college·age
We will study the asymptotic properties of OLS primarily under assumption (4.2), since it is weaker than assumption (4.3). As we discussed in Chapter 2, assumption (4.3) is natural when a structural model is directly estimable because it ensures that no additional functions of the explanatory variables help to explain y.

An explanatory variable x_j is said to be endogenous in equation (4.1) if it is correlated with u. You should not rely too much on the meaning of "endogenous" from other branches of economics. In traditional usage, a variable is endogenous if it is determined within the context of a model. The usage in econometrics, while related to traditional definitions, is used broadly to describe any situation where an explanatory variable is correlated with the disturbance. If x_j is uncorrelated with u, then x_j is said to be exogenous in equation (4.1). If assumption (4.3) holds, then each explanatory variable is necessarily exogenous.
In applied econometrics, endogeneity usually arises in one of three ways:
Omitted Variables  Omitted variables appear when we would like to control for one or more additional variables but, usually because of data unavailability, we cannot include them in a regression model. Specifically, suppose that E(y | x, q) is the conditional expectation of interest, which can be written as a function linear in parameters and additive in q. If q is unobserved, we can always estimate E(y | x), but this need have no particular relationship to E(y | x, q) when q and x are allowed to be correlated. One way to represent this situation is to write equation (4.1) where q is part of the error term u. If q and x_j are correlated, then x_j is endogenous. The correlation of explanatory variables with unobservables is often due to self-selection: if agents choose the value of x_j, this might depend on factors (q) that are unobservable to the analyst. A good example is omitted ability in a wage equation, where an individual's years of schooling are likely to be correlated with unobserved ability. We discuss the omitted variables problem in detail in Section 4.3.

Measurement Error  In this case we would like to measure the (partial) effect of a variable, say x*_K, but we can observe only an imperfect measure of it, say x_K. When we plug x_K in for x*_K, thereby arriving at the estimable equation (4.1), we necessarily put a measurement error into u. Depending on assumptions about how x*_K and x_K are related, u and x_K may or may not be correlated. For example, x*_K might denote a marginal tax rate, but we can only obtain data on the average tax rate. We will study the measurement error problem in Section 4.4.
Simultaneity  Simultaneity arises when at least one of the explanatory variables is determined simultaneously along with y. If, say, x_K is determined partly as a function of y, then x_K and u are generally correlated. For example, if y is city murder rate and x_K is size of the police force, size of the police force is partly determined by the murder rate. Conceptually, this is a more difficult situation to analyze, because we must be able to think of a situation where we could vary x_K exogenously, even though in the data that we collect y and x_K are generated simultaneously. Chapter 9 treats simultaneous equations models in detail.
The distinctions among the three possible forms of endogeneity are not always sharp. In fact, an equation can have more than one source of endogeneity. For example, in looking at the effect of alcohol consumption on worker productivity (as typically measured by wages), we would worry that alcohol usage is correlated with unobserved factors, possibly related to family background, that also affect wage; this is an omitted variables problem. In addition, alcohol demand would generally depend on income, which is largely determined by wage; this is a simultaneity problem. And measurement error in alcohol usage is always a possibility. For an illuminating discussion of the three kinds of endogeneity as they arise in a particular field, see Deaton's (1995) survey chapter on econometric issues in development economics.
4.2 Asymptotic Properties of OLS

We now briefly review the asymptotic properties of OLS for random samples from a population, focusing on inference. It is convenient to write the population equation of interest in vector form as

    y = xβ + u    (4.5)

where x is a 1 × K vector of regressors and β ≡ (β_1, β_2, ..., β_K)′ is a K × 1 vector. Since most equations contain an intercept, we will just assume that x_1 ≡ 1, as this assumption makes interpreting the conditions easier.

We assume that we can obtain a random sample of size N from the population in order to estimate β; thus, {(x_i, y_i): i = 1, 2, ..., N} are treated as independent, identically distributed random variables, where x_i is 1 × K and y_i is a scalar. For each observation i we have

    y_i = x_i β + u_i    (4.6)

which is convenient for deriving statistical properties of estimators. As for stating and interpreting assumptions, it is easiest to focus on the population model (4.5).
4.2.1 Consistency

As discussed in Section 4.1, the key assumption for OLS to consistently estimate β is the population orthogonality condition:

Assumption OLS.1: E(x′u) = 0.

Because x contains a constant, Assumption OLS.1 is equivalent to saying that u has mean zero and is uncorrelated with each regressor, which is how we will refer to Assumption OLS.1. Sufficient for Assumption OLS.1 is the zero conditional mean assumption (4.3).

The other assumption needed for consistency of OLS is that the expected outer product matrix of x has full rank, so that there are no exact linear relationships among the regressors in the population. This is stated succinctly as follows:

Assumption OLS.2: rank E(x′x) = K.
As with Assumption OLS.1, Assumption OLS.2 is an assumption about the population. Since E(x′x) is a symmetric K × K matrix, Assumption OLS.2 is equivalent to assuming that E(x′x) is positive definite. Since x_1 = 1, Assumption OLS.2 is also equivalent to saying that the (population) variance matrix of the K − 1 nonconstant elements in x is nonsingular. This is a standard assumption, which fails if and only if at least one of the regressors can be written as a linear function of the other regressors (in the population). Usually Assumption OLS.2 holds, but it can fail if the population model is improperly specified [for example, if we include too many dummy variables in x or mistakenly use something like log(age) and log(age²) in the same equation].
Under Assumptions OLS.1 and OLS.2, the parameter vector β is identified. In the context of models that are linear in the parameters under random sampling, identification of β simply means that β can be written in terms of population moments in observable variables. (Later, when we consider nonlinear models, the notion of identification will have to be more general. Also, special issues arise if we cannot obtain a random sample from the population, something we treat in Chapter 17.) To see that β is identified under Assumptions OLS.1 and OLS.2, premultiply equation (4.5) by x′, take expectations, and solve to get

    β = [E(x′x)]⁻¹ E(x′y)

Because (x, y) is observed, β is identified. The analogy principle for choosing an estimator says to turn the population problem into its sample counterpart (see Goldberger, 1968; Manski, 1988). In the current application this step leads to the method of moments: replace the population moments E(x′x) and E(x′y) with the corresponding sample averages. Doing so leads to the OLS estimator:
    β̂ = (N⁻¹ Σ_{i=1}^N x_i′x_i)⁻¹ (N⁻¹ Σ_{i=1}^N x_i′y_i) = β + (N⁻¹ Σ_{i=1}^N x_i′x_i)⁻¹ (N⁻¹ Σ_{i=1}^N x_i′u_i)

which can be written in full matrix form as (X′X)⁻¹X′Y, where X is the N × K data matrix of regressors with ith row x_i and Y is the N × 1 data vector with ith element y_i. Under Assumption OLS.2, X′X is nonsingular with probability approaching one and plim[(N⁻¹ Σ_{i=1}^N x_i′x_i)⁻¹] = A⁻¹, where A ≡ E(x′x) (see Corollary 3.1). Further, under Assumption OLS.1, plim(N⁻¹ Σ_{i=1}^N x_i′u_i) = E(x′u) = 0. Therefore, by Slutsky's theorem (Lemma 3.4), plim β̂ = β + A⁻¹·0 = β. We summarize with a theorem:
Theorem 4.1 (Consistency of OLS): Under Assumptions OLS.1 and OLS.2, the OLS estimator β̂ obtained from a random sample following the population model (4.5) is consistent for β.
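Theorem 4.1 is easy to see in a simulation. The following sketch (with made-up parameter values) draws random samples from a population satisfying Assumptions OLS.1 and OLS.2 and computes the method-of-moments estimator; the estimation error shrinks as N grows:

```python
import numpy as np

rng = np.random.default_rng(42)
beta = np.array([1.0, 0.5, -2.0])   # hypothetical (beta_0, beta_1, beta_2)

def ols(N):
    # Draw a random sample from a population satisfying OLS.1 and OLS.2.
    x = rng.normal(size=(N, 2))
    X = np.column_stack([np.ones(N), x])     # x_1 = 1 (intercept)
    u = rng.normal(size=N)                   # E(u) = 0, Cov(x_j, u) = 0
    y = X @ beta + u
    # Method of moments: replace E(x'x) and E(x'y) with sample averages.
    return np.linalg.solve(X.T @ X / N, X.T @ y / N)

for N in (100, 10_000, 1_000_000):
    betahat = ols(N)
    print(N, np.max(np.abs(betahat - beta)))   # error shrinks as N grows
```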
The simplicity of the proof of Theorem 4.1 should not undermine its usefulness. Whenever an equation can be put into the form (4.5) and Assumptions OLS.1 and OLS.2 hold, OLS using a random sample consistently estimates β. It does not matter where this equation comes from, or what the β_j actually represent. As we will see in Sections 4.3 and 4.4, often an estimable equation is obtained only after manipulating an underlying structural equation. An important point to remember is that, once the linear (in parameters) equation has been specified with an additive error and Assumptions OLS.1 and OLS.2 are verified, there is no need to reprove Theorem 4.1.

Under the assumptions of Theorem 4.1, xβ is the linear projection of y on x. Thus, Theorem 4.1 shows that OLS consistently estimates the parameters in a linear projection, subject to the rank condition in Assumption OLS.2. This is very general, as it places no restrictions on the nature of y; for example, y could be a binary variable or some other variable with discrete characteristics. Since a conditional expectation that is linear in parameters is also the linear projection, Theorem 4.1 also shows that OLS consistently estimates conditional expectations that are linear in parameters. We will use this fact often in later sections.
There are a few final points worth emphasizing. First, if either Assumption OLS.1 or OLS.2 fails, then β is not identified (unless we make other assumptions, as in Chapter 5). Usually it is correlation between u and one or more elements of x that causes lack of identification. Second, the OLS estimator is not necessarily unbiased even under Assumptions OLS.1 and OLS.2. However, if we impose the zero conditional mean assumption (4.3), then it can be shown that E(β̂ | X) = β if X′X is nonsingular; see Problem 4.2. By iterated expectations, β̂ is then also unconditionally unbiased, provided the expected value E(β̂) exists.

Finally, we have not made the much more restrictive assumption that u and x are independent. If E(u) = 0 and u is independent of x, then assumption (4.3) holds, but not vice versa. For example, Var(u | x) is entirely unrestricted under assumption (4.3), but Var(u | x) is necessarily constant if u and x are independent.
4.2.2 Asymptotic Inference Using OLS

The asymptotic distribution of the OLS estimator is derived by writing

    √N(β̂ − β) = (N⁻¹ Σ_{i=1}^N x_i′x_i)⁻¹ (N^{−1/2} Σ_{i=1}^N x_i′u_i)

As we saw in Theorem 4.1, (N⁻¹ Σ_{i=1}^N x_i′x_i)⁻¹ − A⁻¹ = o_p(1). Also, {(x_i′u_i): i = 1, 2, ...} is an i.i.d. sequence with zero mean, and we assume that each element has finite variance. Then the central limit theorem (Theorem 3.2) implies that N^{−1/2} Σ_{i=1}^N x_i′u_i →d Normal(0, B), where B is the K × K matrix

    B ≡ E(u²x′x)    (4.7)
This implies N^{−1/2} Σ_{i=1}^N x_i′u_i = O_p(1), and so we can write

    √N(β̂ − β) = A⁻¹ (N^{−1/2} Σ_{i=1}^N x_i′u_i) + o_p(1)    (4.8)

since o_p(1)·O_p(1) = o_p(1). We can use equation (4.8) to immediately obtain the asymptotic distribution of √N(β̂ − β). A homoskedasticity assumption simplifies the form of the OLS asymptotic variance:

Assumption OLS.3: E(u²x′x) = σ²E(x′x), where σ² ≡ E(u²).
Because E(u) = 0, σ² is also equal to Var(u). Assumption OLS.3 is the weakest form of the homoskedasticity assumption. If we write out the K × K matrices in Assumption OLS.3 element by element, we see that Assumption OLS.3 is equivalent to assuming that the squared error, u², is uncorrelated with each x_j, x_j², and all cross products of the form x_j x_k. By the law of iterated expectations, sufficient for Assumption OLS.3 is E(u² | x) = σ², which is the same as Var(u | x) = σ² when E(u | x) = 0. The constant conditional variance assumption for u given x is the easiest to interpret, but it is stronger than needed.
Theorem 4.2 (Asymptotic Normality of OLS): Under Assumptions OLS.1–OLS.3,

    √N(β̂ − β) ~ᵃ Normal(0, σ²A⁻¹)    (4.9)

Proof: From equation (4.8) and the definition of B, it follows from Lemma 3.7 and Corollary 3.2 that

    √N(β̂ − β) ~ᵃ Normal(0, A⁻¹BA⁻¹)

Under Assumption OLS.3, B = σ²A, which proves the result.
Practically speaking, equation (4.9) allows us to treat β̂ as approximately normal with mean β and variance σ²[E(x′x)]⁻¹/N. The usual estimator of σ², σ̂² ≡ SSR/(N − K), where SSR = Σ_{i=1}^N û_i² is the OLS sum of squared residuals, is easily shown to be consistent. (Using N or N − K in the denominator does not affect consistency.) When we also replace E(x′x) with the sample average N⁻¹ Σ_{i=1}^N x_i′x_i = (X′X/N), we get

    Avâr(β̂) = σ̂²(X′X)⁻¹    (4.10)

The right-hand side of equation (4.10) should be familiar: it is the usual OLS variance matrix estimator under the classical linear model assumptions. The bottom line of Theorem 4.2 is that, under Assumptions OLS.1–OLS.3, the usual OLS standard errors, t statistics, and F statistics are asymptotically valid. Showing that the F statistic is approximately valid is done by deriving the Wald test for linear restrictions of the form Rβ = r (see Chapter 3). Then the F statistic is simply a degrees-of-freedom-adjusted Wald statistic, which is where the F distribution (as opposed to the chi-square distribution) arises.
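Equation (4.10) is straightforward to compute directly. A minimal sketch on simulated homoskedastic data (coefficients made up for illustration) forms σ̂² and the usual OLS standard errors as the square roots of the diagonal of (4.10):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=N)   # homoskedastic u

betahat = np.linalg.solve(X.T @ X, X.T @ y)
uhat = y - X @ betahat
sigma2hat = uhat @ uhat / (N - K)                 # sigma^2-hat = SSR/(N - K)
avar = sigma2hat * np.linalg.inv(X.T @ X)         # equation (4.10)
se = np.sqrt(np.diag(avar))                       # usual OLS standard errors
print(se)
```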
4.2.3 Heteroskedasticity-Robust Inference

If Assumption OLS.1 fails, we are in potentially serious trouble, as OLS is not even consistent. In the next chapter we discuss the important method of instrumental variables that can be used to obtain consistent estimators of β when Assumption OLS.1 fails. Assumption OLS.2 is also needed for consistency, but there is rarely any reason to examine its failure.

Failure of Assumption OLS.3 has less serious consequences than failure of Assumption OLS.1. As we have already seen, Assumption OLS.3 has nothing to do with consistency of β̂. Further, the proof of asymptotic normality based on equation (4.8) is still valid without Assumption OLS.3, but the final asymptotic variance is different. We have assumed OLS.3 for deriving the limiting distribution because it implies the asymptotic validity of the usual OLS standard errors and test statistics. All regression packages assume OLS.3 as the default in reporting statistics.
Often there are reasons to believe that Assumption OLS.3 might fail, in which case equation (4.10) is no longer a valid estimate of even the asymptotic variance matrix. If we make the zero conditional mean assumption (4.3), one solution to violation of Assumption OLS.3 is to specify a model for Var(y | x), estimate this model, and apply weighted least squares (WLS): for observation i, y_i and every element of x_i (including unity) are divided by an estimate of the conditional standard deviation [Var(y_i | x_i)]^{1/2}, and OLS is applied to the weighted data (see Wooldridge, 2000a, Chapter 8, for details). This procedure leads to a different estimator of β. We discuss WLS in the more general context of nonlinear regression in Chapter 12. Lately, it has become more popular to estimate β by OLS even when heteroskedasticity is suspected but to adjust the standard errors and test statistics so that they are valid in the presence of arbitrary heteroskedasticity. Since these standard errors are valid whether or not Assumption OLS.3 holds, this method is much easier than a weighted least squares procedure. What we sacrifice is potential efficiency gains from weighted least squares (see Chapter 14). But efficiency gains from WLS are guaranteed only if the model for Var(y | x) is correct. Further, WLS is generally inconsistent if E(u | x) ≠ 0 but Assumption OLS.1 holds, so WLS is inappropriate for estimating linear projections. Especially with large sample sizes, the presence of heteroskedasticity need not affect one's ability to perform accurate inference using OLS. But we need to compute standard errors and test statistics appropriately.
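The WLS procedure just described can be sketched as follows. The exponential variance model and all coefficients here are assumptions made up for illustration (and the variance model is, by construction, correctly specified, which is what guarantees the efficiency gain):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 2000
x = rng.uniform(0, 2, size=N)
X = np.column_stack([np.ones(N), x])
# Assumed variance model: Var(u|x) = exp(0.5 + 1.0*x)
u = rng.normal(size=N) * np.exp(0.5 * (0.5 + 1.0 * x))
y = X @ np.array([1.0, 2.0]) + u

# Step 1: pooled OLS to get residuals.
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
uhat = y - X @ b_ols

# Step 2: estimate the variance model by regressing log(uhat^2) on (1, x).
d = np.linalg.solve(X.T @ X, X.T @ np.log(uhat**2))
h = np.exp(X @ d)            # fitted conditional variances (up to scale)

# Step 3: WLS = OLS after dividing y_i and every element of x_i
# (including unity) by the estimated conditional standard deviation.
w = 1.0 / np.sqrt(h)
Xw, yw = X * w[:, None], y * w
b_wls = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)
print(b_ols, b_wls)          # both consistent; WLS more efficient here
```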
The adjustment needed to the asymptotic variance follows from the proof of Theorem 4.2: without OLS.3, the asymptotic variance of β̂ is Avar(β̂) = A⁻¹BA⁻¹/N, where the K × K matrices A and B were defined earlier. We already know how to consistently estimate A. Estimation of B is also straightforward. First, by the law of large numbers, N⁻¹ Σ_{i=1}^N u_i² x_i′x_i →p E(u²x′x) = B. Now, since the u_i are not observed, we replace u_i with the OLS residual û_i = y_i − x_i β̂. This leads to the consistent estimator B̂ ≡ N⁻¹ Σ_{i=1}^N û_i² x_i′x_i. See White (1984) and Problem 4.5.

The heteroskedasticity-robust variance matrix estimator of β̂ is Â⁻¹B̂Â⁻¹/N or, after cancellations,
    Avâr(β̂) = (X′X)⁻¹ (Σ_{i=1}^N û_i² x_i′x_i) (X′X)⁻¹    (4.11)
This matrix was introduced in econometrics by White (1980b), although some attribute it to either Eicker (1967) or Huber (1967), statisticians who discovered robust variance matrices. The square roots of the diagonal elements of equation (4.11) are often called the White standard errors or Huber standard errors, or some hyphenated combination of the names Eicker, Huber, and White. It is probably best to just call them heteroskedasticity-robust standard errors, since this term describes their purpose. Remember, these standard errors are asymptotically valid in the presence of any kind of heteroskedasticity, including homoskedasticity.

Robust standard errors are often reported in applied cross-sectional work, especially when the sample size is large. Sometimes they are reported along with the usual OLS standard errors; sometimes they are presented in place of them. Several regression packages now report these standard errors as an option, so it is easy to obtain heteroskedasticity-robust standard errors.
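The sandwich calculation in equation (4.11) is a few lines of linear algebra. The sketch below uses simulated data with Var(u | x) = x², so Assumption OLS.3 fails; in this design the asymptotic ratio of the robust to the usual slope standard error works out to √3:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 5000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
u = rng.normal(size=N) * np.abs(x)          # Var(u|x) = x^2: OLS.3 fails
y = X @ np.array([1.0, 1.0]) + u

XtX_inv = np.linalg.inv(X.T @ X)
betahat = XtX_inv @ X.T @ y
uhat = y - X @ betahat

# Usual variance estimator, equation (4.10): invalid here.
se_usual = np.sqrt(np.diag(uhat @ uhat / (N - 2) * XtX_inv))

# Heteroskedasticity-robust sandwich, equation (4.11).
meat = (X * uhat[:, None]**2).T @ X         # sum of uhat_i^2 * x_i' x_i
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
print(se_usual, se_robust)
```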
Sometimes, as a degrees-of-freedom correction, the matrix in equation (4.11) is multiplied by N/(N − K). This procedure guarantees that, if the û_i² were constant across i (an unlikely event in practice, but the strongest evidence of homoskedasticity possible), then the usual OLS standard errors would be obtained. There is some evidence that the degrees-of-freedom adjustment improves finite sample performance. There are other ways to adjust equation (4.11) to improve its small-sample properties (see, for example, MacKinnon and White, 1985), but if N is large relative to K, these adjustments typically make little difference.

Once standard errors are obtained, t statistics are computed in the usual way. These are robust to heteroskedasticity of unknown form, and can be used to test single restrictions. The t statistics computed from heteroskedasticity-robust standard errors are heteroskedasticity-robust t statistics. Confidence intervals are also obtained in the usual way.
When Assumption OLS.3 fails, the usual F statistic is not valid for testing multiple linear restrictions, even asymptotically. Some packages allow robust testing with a simple command, while others do not. If the hypotheses are written as

    H_0: Rβ = r    (4.12)

where R is Q × K with rank Q ≤ K, and r is Q × 1, then the heteroskedasticity-robust Wald statistic for testing equation (4.12) is

    W = (Rβ̂ − r)′(RV̂R′)⁻¹(Rβ̂ − r)    (4.13)

where V̂ is given in equation (4.11). Under H_0, W ~ᵃ χ²_Q. The Wald statistic can be turned into an approximate F_{Q, N−K} random variable by dividing it by Q (and usually making the degrees-of-freedom adjustment to V̂). But there is nothing wrong with using equation (4.13) directly.
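Equation (4.13) is also easy to compute directly. The sketch below (simulated data, null hypothesis true by construction) tests two exclusion restrictions using the robust variance matrix from (4.11):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 4000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
u = rng.normal(size=N) * (1 + 0.5 * np.abs(X[:, 1]))    # heteroskedastic
y = X @ np.array([1.0, 0.0, 0.0]) + u                   # H0: beta_2 = beta_3 = 0 is true

XtX_inv = np.linalg.inv(X.T @ X)
betahat = XtX_inv @ X.T @ y
uhat = y - X @ betahat
Vhat = XtX_inv @ ((X * uhat[:, None]**2).T @ X) @ XtX_inv   # equation (4.11)

R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])              # Q = 2 restrictions
r = np.zeros(2)
diff = R @ betahat - r
W = diff @ np.linalg.solve(R @ Vhat @ R.T, diff)   # equation (4.13)
print(W)   # compare with the chi-square(2) 5% critical value, 5.99
```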
4.2.4 Lagrange Multiplier (Score) Tests

In the partitioned model

    y = x_1 β_1 + x_2 β_2 + u    (4.14)

under Assumptions OLS.1–OLS.3, where x_1 is 1 × K_1 and x_2 is 1 × K_2, we know that the hypothesis H_0: β_2 = 0 is easily tested (asymptotically) using a standard F test. There is another approach to testing such hypotheses that is sometimes useful, especially for computing heteroskedasticity-robust tests and for nonlinear models.

Let β̃_1 be the estimator of β_1 under the null hypothesis H_0: β_2 = 0; this is called the estimator from the restricted model. Define the restricted OLS residuals as ũ_i = y_i − x_{i1}β̃_1, i = 1, 2, ..., N. Under H_0, x_{i2} should be, up to sample variation, uncorrelated with ũ_i in the sample. The Lagrange multiplier or score principle is based on this observation. It turns out that a valid test statistic is obtained as follows: Run the OLS regression

    ũ on x_1, x_2    (4.15)

(where the observation index i has been suppressed). Assuming that x_1 contains a constant (that is, the null model contains a constant), let R²_u denote the usual R-squared from the regression (4.15). Then the Lagrange multiplier (LM) or score statistic is LM ≡ NR²_u. These names come from different features of the constrained optimization problem; see Rao (1948), Aitchison and Silvey (1958), and Chapter 12. Because of its form, LM is also referred to as an N-R-squared test. Under H_0, LM ~ᵃ χ²_{K_2}, where K_2 is the number of restrictions being tested. If NR²_u is sufficiently large, then ũ is significantly correlated with x_2, and the null hypothesis will be rejected.
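A sketch of the LM computation on simulated data (the null β_2 = 0 is imposed in the data-generating process, so LM should look like a typical χ²_{K_2} draw):

```python
import numpy as np

def r_squared(v, Z):
    """Usual (centered) R-squared from regressing v on Z."""
    b = np.linalg.lstsq(Z, v, rcond=None)[0]
    e = v - Z @ b
    tss = np.sum((v - v.mean())**2)
    return 1 - e @ e / tss

rng = np.random.default_rng(5)
N = 3000
x1 = np.column_stack([np.ones(N), rng.normal(size=N)])   # includes a constant
x2 = rng.normal(size=(N, 2))                              # K_2 = 2 excluded variables
y = x1 @ np.array([1.0, 0.5]) + rng.normal(size=N)        # H0: beta_2 = 0 is true

# Restricted model: regress y on x1 only, keep the residuals u-tilde.
b1 = np.linalg.lstsq(x1, y, rcond=None)[0]
utilde = y - x1 @ b1

# Regression (4.15): u-tilde on (x1, x2); LM = N * R^2_u ~a chi2(K_2) under H0.
LM = N * r_squared(utilde, np.column_stack([x1, x2]))
print(LM)   # compare with chi-square(2) critical values
```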

It is important to inclu de x
1
along with x
2
in regression (4.15). In other words, the
OLS residuals from the null model should be regressed on all explanatory variables,
even though
~
uu is orthogonal to x
1
in the sample. If x
1
is excluded, then the resulting
statistic generally does not have a chi-square distri bution when x
2
and x
1
are corre-
lated. If Eðx
0
1
x
2
Þ¼0, then we can exclude x
1
from regression (4.15), but this ortho-
gonality rarely holds in applications. If x
1
does not include a constant, R
2

u
should be
the uncentered R-squared: the total sum of squares in the denominator is obtained
Chapter 458
without demeaning the dependent variable,
~
uu. When x
1
includes a constant, the usual
centered R-squared and uncentered R-squared are identical because
P
N
i¼1
~
uu
i
¼ 0.
Example 4.1 (Wage Equation for Married, Working Women): Consider a wage equation for married, working women:

    log(wage) = β_0 + β_1 exper + β_2 exper² + β_3 educ + β_4 age + β_5 kidslt6 + β_6 kidsge6 + u    (4.16)

where the last three variables are the woman's age, number of children less than six, and number of children at least six years of age, respectively. We can test whether, after the productivity variables experience and education are controlled for, women are paid differently depending on their age and number of children. The F statistic for the hypothesis H_0: β_4 = 0, β_5 = 0, β_6 = 0 is F = [(R²_ur − R²_r)/(1 − R²_ur)]·[(N − 7)/3], where R²_ur and R²_r are the unrestricted and restricted R-squareds; under H_0 (and homoskedasticity), F ~ F_{3, N−7}. To obtain the LM statistic, we estimate the equation without age, kidslt6, and kidsge6; let ũ denote the OLS residuals. Then, the LM statistic is NR²_u from the regression ũ on 1, exper, exper², educ, age, kidslt6, and kidsge6, where the 1 denotes that we include an intercept. Under H_0 and homoskedasticity, NR²_u ~ᵃ χ²_3.
Using the data on the 428 working, married women in MROZ.RAW (from Mroz, 1987), we obtain the following estimated equation:

    log(wage)-hat = −.421 + .040 exper − .00078 exper² + .108 educ
                   (.317)  (.013)       (.00040)         (.014)
                   [.316]  [.015]       [.00041]         [.014]
                  − .0015 age − .061 kidslt6 − .015 kidsge6,    R² = .158
                   (.0053)     (.089)         (.028)
                   [.0059]     [.105]         [.029]

where the quantities in brackets are the heteroskedasticity-robust standard errors. The F statistic for joint significance of age, kidslt6, and kidsge6 turns out to be about .24, which gives p-value ≈ .87. Regressing the residuals ũ from the restricted model on all exogenous variables gives an R-squared of .0017, so LM = 428(.0017) = .728, and p-value ≈ .87. Thus, the F and LM tests give virtually identical results.
The test from regression (4.15) maintains Assumption OLS.3 under H_0, just like the usual F test. It turns out to be easy to obtain a heteroskedasticity-robust LM statistic. To see how to do so, let us look at the formula for the LM statistic from regression (4.15) in more detail. After some algebra we can write

    LM = (N^{−1/2} Σ_{i=1}^N r̂_i′ũ_i)′ [σ̃²(N⁻¹ Σ_{i=1}^N r̂_i′r̂_i)]⁻¹ (N^{−1/2} Σ_{i=1}^N r̂_i′ũ_i)

where σ̃² ≡ N⁻¹ Σ_{i=1}^N ũ_i² and each r̂_i is a 1 × K_2 vector of OLS residuals from the (multivariate) regression of x_{i2} on x_{i1}, i = 1, 2, ..., N. This statistic is not robust to heteroskedasticity because the matrix in the middle is not a consistent estimator of the asymptotic variance of (N^{−1/2} Σ_{i=1}^N r̂_i′ũ_i) under heteroskedasticity. Following the reasoning in Section 4.2.3, a heteroskedasticity-robust statistic is
reasoning in Section 4.2.3, a heteroskedasticity-robust statistic is
LM ¼ N
À1=2
X
N
i¼1
^
rr
0
i
~
uu
i
!
0
N
À1
X

N
i¼1
~
uu
2
i
^
rr
0
i
^
rr
i
!
À1
N
À1=2
X
N
i¼1
^
rr
0
i
~
uu
i
!
¼
X

N
i¼1
^
rr
0
i
~
uu
i
!
0
X
N
i¼1
~
uu
2
i
^
rr
0
i
^
rr
i
!
À1
X
N
i¼1

^
rr
0
i
~
uu
i
!
Dropping the $i$ subscript, this is easily obtained, as $N - \mathrm{SSR}_0$, from the OLS regression (without an intercept)

$$1 \;\text{on}\; \tilde{u}\cdot\hat{r} \qquad (4.17)$$

where $\tilde{u}\cdot\hat{r} = (\tilde{u}\cdot\hat{r}_1, \tilde{u}\cdot\hat{r}_2, \ldots, \tilde{u}\cdot\hat{r}_{K_2})$ is the $1 \times K_2$ vector obtained by multiplying $\tilde{u}$ by each element of $\hat{r}$, and $\mathrm{SSR}_0$ is just the usual sum of squared residuals from regression (4.17). Thus, we first regress each element of $x_2$ onto all of $x_1$ and collect the residuals in $\hat{r}$. Then we form $\tilde{u}\cdot\hat{r}$ (observation by observation) and run the regression in (4.17); $N - \mathrm{SSR}_0$ from this regression is distributed asymptotically as $\chi^2_{K_2}$. (Do not be thrown off by the fact that the dependent variable in regression (4.17) is unity for each observation; a nonzero sum of squared residuals is reported when you run OLS without an intercept.) For more details, see Davidson and MacKinnon (1985, 1993) or Wooldridge (1991a, 1995b).
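The recipe just described (partial out $x_1$, form $\tilde{u}\cdot\hat{r}$, then compute $N - \mathrm{SSR}_0$) can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the text; the function name `robust_lm` and the array layout are our own choices:

```python
import numpy as np

def robust_lm(u_tilde, X1, X2):
    """Heteroskedasticity-robust LM statistic for H0: coefficients on X2 are zero.

    Computed as N - SSR0 from the auxiliary regression of 1 on u~ * r^
    (no intercept), as in regression (4.17).

    u_tilde : (N,)     residuals from the restricted model (y on X1)
    X1      : (N, K1)  regressors in the restricted model (constant included)
    X2      : (N, K2)  excluded regressors being tested
    """
    N = u_tilde.shape[0]
    # r^: residuals from the multivariate regression of X2 on X1
    r_hat = X2 - X1 @ np.linalg.lstsq(X1, X2, rcond=None)[0]
    # Form u~ * r^ observation by observation
    W = r_hat * u_tilde[:, None]
    # Regress a vector of ones on W without an intercept; keep SSR0
    ones = np.ones(N)
    ssr0 = np.sum((ones - W @ np.linalg.lstsq(W, ones, rcond=None)[0]) ** 2)
    # Asymptotically chi-squared with K2 degrees of freedom under H0
    return N - ssr0
```

Algebraically, the value returned equals the robust quadratic form displayed above, since the auxiliary regression has $N - \mathrm{SSR}_0 = (\sum_i \hat{r}_i'\tilde{u}_i)'(\sum_i \tilde{u}_i^2\hat{r}_i'\hat{r}_i)^{-1}(\sum_i \hat{r}_i'\tilde{u}_i)$.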
Example 4.1 (continued): To obtain the heteroskedasticity-robust LM statistic for $H_0\colon \beta_4 = 0, \beta_5 = 0, \beta_6 = 0$ in equation (4.16), we estimate the restricted model as before and obtain $\tilde{u}$. Then, we run the regressions (1) age on 1, exper, exper$^2$, educ; (2) kidslt6 on 1, exper, exper$^2$, educ; (3) kidsge6 on 1, exper, exper$^2$, educ; and obtain the residuals $\hat{r}_1$, $\hat{r}_2$, and $\hat{r}_3$, respectively. The LM statistic is $N - \mathrm{SSR}_0$ from the regression 1 on $\tilde{u}\cdot\hat{r}_1$, $\tilde{u}\cdot\hat{r}_2$, $\tilde{u}\cdot\hat{r}_3$, and $N - \mathrm{SSR}_0 \overset{a}{\sim} \chi^2_3$.
When we apply this result to the data in MROZ.RAW we get LM = .51, which is very small for a $\chi^2_3$ random variable: p-value ≈ .92. For comparison, the heteroskedasticity-robust Wald statistic (scaled by Stata to have an approximate F distribution) also yields p-value ≈ .92.
4.3 OLS Solutions to the Omitted Variables Problem
4.3.1 OLS Ignoring the Omitted Variables
Because it is so prevalent in applied work, we now consider the omitted variables problem in more detail. A model that assumes an additive effect of the omitted variable is

$$\mathrm{E}(y \mid x_1, x_2, \ldots, x_K, q) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K + \gamma q \qquad (4.18)$$

where $q$ is the omitted factor. In particular, we are interested in the $\beta_j$, which are the partial effects of the observed explanatory variables holding the other explanatory variables constant, including the unobservable $q$. In the context of this additive model, there is no point in allowing for more than one unobservable; any omitted factors are lumped into $q$. Henceforth we simply refer to $q$ as the omitted variable.
A good example of equation (4.18) is seen when $y$ is log(wage) and $q$ includes ability. If $x_K$ denotes a measure of education, $\beta_K$ in equation (4.18) measures the partial effect of education on wages controlling for—or holding fixed—the level of ability (as well as other observed characteristics). This effect is most interesting from a policy perspective because it provides a causal interpretation of the return to education: $\beta_K$ is the expected proportionate increase in wage if someone from the working population is exogenously given another year of education.
Viewing equation (4.18) as a structural model, we can always write it in error form as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K + \gamma q + v \qquad (4.19)$$

$$\mathrm{E}(v \mid x_1, x_2, \ldots, x_K, q) = 0 \qquad (4.20)$$

where $v$ is the structural error. One way to handle the nonobservability of $q$ is to put it into the error term. In doing so, nothing is lost by assuming $\mathrm{E}(q) = 0$ because an intercept is included in equation (4.19). Putting $q$ into the error term means we rewrite equation (4.19) as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K + u \qquad (4.21)$$
$$u \equiv \gamma q + v \qquad (4.22)$$
The error $u$ in equation (4.21) consists of two parts. Under equation (4.20), $v$ has zero mean and is uncorrelated with $x_1, x_2, \ldots, x_K$ (and $q$). By normalization, $q$ also has zero mean. Thus, $\mathrm{E}(u) = 0$. However, $u$ is uncorrelated with $x_1, x_2, \ldots, x_K$ if and only if $q$ is uncorrelated with each of the observable regressors. If $q$ is correlated with any of the regressors, then so is $u$, and we have an endogeneity problem. We cannot expect OLS to consistently estimate any $\beta_j$. Although $\mathrm{E}(u \mid x) \neq \mathrm{E}(u)$ in equation (4.21), the $\beta_j$ do have a structural interpretation because they appear in equation (4.19).
It is easy to characterize the plims of the OLS estimators when the omitted variable is ignored; we will call this the OLS omitted variables inconsistency or OLS omitted variables bias (even though the latter term is not always precise). Write the linear projection of $q$ onto the observable explanatory variables as

$$q = \delta_0 + \delta_1 x_1 + \cdots + \delta_K x_K + r \qquad (4.23)$$

where, by definition of a linear projection, $\mathrm{E}(r) = 0$ and $\mathrm{Cov}(x_j, r) = 0$, $j = 1, 2, \ldots, K$. Then we can easily infer the plim of the OLS estimators from regressing $y$ onto $1, x_1, \ldots, x_K$ by finding an equation that does satisfy Assumptions OLS.1 and OLS.2. Plugging equation (4.23) into equation (4.19) and doing simple algebra gives
$$y = (\beta_0 + \gamma\delta_0) + (\beta_1 + \gamma\delta_1)x_1 + (\beta_2 + \gamma\delta_2)x_2 + \cdots + (\beta_K + \gamma\delta_K)x_K + v + \gamma r$$
Now, the error $v + \gamma r$ has zero mean and is uncorrelated with each regressor. It follows that we can just read off the plim of the OLS estimators from the regression of $y$ on $1, x_1, \ldots, x_K$: $\mathrm{plim}\ \hat{\beta}_j = \beta_j + \gamma\delta_j$. Sometimes it is assumed that most of the $\delta_j$ are zero. When the correlation between $q$ and a particular variable, say $x_K$, is the focus, a common (usually implicit) assumption is that all $\delta_j$ in equation (4.23) except the intercept and coefficient on $x_K$ are zero. Then $\mathrm{plim}\ \hat{\beta}_j = \beta_j$, $j = 1, \ldots, K-1$, and

$$\mathrm{plim}\ \hat{\beta}_K = \beta_K + \gamma[\mathrm{Cov}(x_K, q)/\mathrm{Var}(x_K)] \qquad (4.24)$$
[since $\delta_K = \mathrm{Cov}(x_K, q)/\mathrm{Var}(x_K)$ in this case]. This formula gives us a simple way to determine the sign, and perhaps the magnitude, of the inconsistency in $\hat{\beta}_K$. If $\gamma > 0$ and $x_K$ and $q$ are positively correlated, the asymptotic bias is positive. The other combinations are easily worked out. If $x_K$ has substantial variation in the population relative to the covariance between $x_K$ and $q$, then the bias can be small. In the general case of equation (4.23), it is difficult to sign $\delta_K$ because it measures a partial correlation. It is for this reason that $\delta_j = 0$, $j = 1, \ldots, K-1$, is often maintained for determining the likely asymptotic bias in $\hat{\beta}_K$ when only $x_K$ is endogenous.
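Formula (4.24) is easy to illustrate by simulation. The following sketch is not from the text; the data-generating values (for instance $\mathrm{Cov}(x_K, q) = 0.4$) are arbitrary choices made for the illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 200_000                 # large N, so the OLS estimate is near its plim
beta_K, gamma = 0.5, 1.0

q = rng.normal(size=N)                                    # omitted variable, Var(q) = 1
x_K = 0.4 * q + np.sqrt(1 - 0.16) * rng.normal(size=N)    # Cov(x_K, q) = .4, Var(x_K) = 1
y = 1.0 + beta_K * x_K + gamma * q + rng.normal(size=N)

# OLS of y on (1, x_K), ignoring q
X = np.column_stack([np.ones(N), x_K])
b = np.linalg.lstsq(X, y, rcond=None)[0]

# beta_K + gamma * Cov(x_K, q) / Var(x_K) = .9
plim = beta_K + gamma * 0.4 / 1.0
print(b[1], plim)  # the two numbers nearly coincide
```

Since $\gamma > 0$ and $\mathrm{Cov}(x_K, q) > 0$ here, the slope estimate settles well above the true $\beta_K = 0.5$, exactly as (4.24) predicts.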
Example 4.2 (Wage Equation with Unobserved Ability): Write a structural wage equation explicitly as

$$\log(wage) = \beta_0 + \beta_1 exper + \beta_2 exper^2 + \beta_3 educ + \gamma\,abil + v$$

where $v$ has the structural error property $\mathrm{E}(v \mid exper, educ, abil) = 0$. If abil is uncorrelated with exper and exper$^2$ once educ has been partialed out—that is, $abil = \delta_0 + \delta_3 educ + r$ with $r$ uncorrelated with exper and exper$^2$—then $\mathrm{plim}\ \hat{\beta}_3 = \beta_3 + \gamma\delta_3$. Under these assumptions the coefficients on exper and exper$^2$ are consistently estimated by the OLS regression that omits ability. If $\delta_3 > 0$ then $\mathrm{plim}\ \hat{\beta}_3 > \beta_3$ (because $\gamma > 0$ by definition), and the return to education is likely to be overestimated in large samples.
4.3.2 The Proxy Variable–OLS Solution
Omitted variables bias can be eliminated, or at least mitigated, if a proxy variable is available for the unobserved variable $q$. There are two formal requirements for a proxy variable for $q$. The first is that the proxy variable should be redundant (sometimes called ignorable) in the structural equation. If $z$ is a proxy variable for $q$, then the most natural statement of redundancy of $z$ in equation (4.18) is

$$\mathrm{E}(y \mid x, q, z) = \mathrm{E}(y \mid x, q) \qquad (4.25)$$

Condition (4.25) is easy to interpret: $z$ is irrelevant for explaining $y$, in a conditional mean sense, once $x$ and $q$ have been controlled for. This assumption on a proxy variable is virtually always made (sometimes only implicitly), and it is rarely controversial: the only reason we bother with $z$ in the first place is that we cannot get data on $q$. Anyway, we cannot get very far without condition (4.25). In the wage-education example, let $q$ be ability and $z$ be IQ score. By definition it is ability that affects wage: IQ would not matter if true ability were known.

Condition (4.25) is somewhat stronger than needed when unobservables appear additively as in equation (4.18); it suffices to assume that $v$ in equation (4.19) is simply uncorrelated with $z$. But we will focus on condition (4.25) because it is natural, and because we need it to cover models where $q$ interacts with some observed covariates.
The second requirement of a good proxy variable is more complicated. We require that the correlation between the omitted variable $q$ and each $x_j$ be zero once we partial out $z$. This is easily stated in terms of a linear projection:

$$\mathrm{L}(q \mid 1, x_1, \ldots, x_K, z) = \mathrm{L}(q \mid 1, z) \qquad (4.26)$$

It is also helpful to see this relationship in terms of an equation with an unobserved error. Write $q$ as a linear function of $z$ and an error term as
$$q = \theta_0 + \theta_1 z + r \qquad (4.27)$$

where, by definition, $\mathrm{E}(r) = 0$ and $\mathrm{Cov}(z, r) = 0$ because $\theta_0 + \theta_1 z$ is the linear projection of $q$ on $1, z$. If $z$ is a reasonable proxy for $q$, $\theta_1 \neq 0$ (and we usually think in terms of $\theta_1 > 0$). But condition (4.26) assumes much more: it is equivalent to

$$\mathrm{Cov}(x_j, r) = 0, \qquad j = 1, 2, \ldots, K$$

This condition requires $z$ to be closely enough related to $q$ so that once it is included in equation (4.27), the $x_j$ are not partially correlated with $q$.
Before showing why these two proxy variable requirements do the trick, we should head off some possible confusion. The definition of proxy variable here is not universal. While a proxy variable is always assumed to satisfy the redundancy condition (4.25), it is not always assumed to have the second property. In Chapter 5 we will use the notion of an indicator of $q$, which satisfies condition (4.25) but not the second proxy variable assumption.
To obtain an estimable equation, replace $q$ in equation (4.19) with equation (4.27) to get

$$y = (\beta_0 + \gamma\theta_0) + \beta_1 x_1 + \cdots + \beta_K x_K + \gamma\theta_1 z + (\gamma r + v) \qquad (4.28)$$

Under the assumptions made, the composite error term $u \equiv \gamma r + v$ is uncorrelated with $x_j$ for all $j$; redundancy of $z$ in equation (4.18) means that $z$ is uncorrelated with $v$ and, by definition, $z$ is uncorrelated with $r$. It follows immediately from Theorem 4.1 that the OLS regression $y$ on $1, x_1, x_2, \ldots, x_K, z$ produces consistent estimators of $(\beta_0 + \gamma\theta_0), \beta_1, \beta_2, \ldots, \beta_K$, and $\gamma\theta_1$. Thus, we can estimate the partial effect of each of the $x_j$ in equation (4.18) under the proxy variable assumptions.
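The consistency claim is easy to check in a small simulation. This sketch is not from the text; the parameter values ($\theta_1 = 0.8$ and so on) are invented for the illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
beta1, gamma, theta1 = 0.5, 1.0, 0.8

z = rng.normal(size=N)              # proxy variable
r = rng.normal(size=N)              # projection error, uncorrelated with x1
q = theta1 * z + r                  # q = theta0 + theta1*z + r, with theta0 = 0
x1 = 0.6 * z + rng.normal(size=N)   # x1 related to q only through z (condition 4.26)
y = 1.0 + beta1 * x1 + gamma * q + rng.normal(size=N)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_omit = ols(np.column_stack([np.ones(N), x1]), y)      # q pushed into the error
b_proxy = ols(np.column_stack([np.ones(N), x1, z]), y)  # regression (4.28)
print(b_omit[1], b_proxy[1])  # first is biased upward, second is near 0.5
```

Omitting $q$ entirely produces the inconsistency $\gamma\,\mathrm{Cov}(x_1, q)/\mathrm{Var}(x_1)$, while adding the proxy $z$ recovers $\beta_1$, because $r$ is uncorrelated with $x_1$ by construction.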
When $z$ is an imperfect proxy, then $r$ in equation (4.27) is correlated with one or more of the $x_j$. Generally, when we do not impose condition (4.26) and write the linear projection as

$$q = \theta_0 + \rho_1 x_1 + \cdots + \rho_K x_K + \theta_1 z + r$$

the proxy variable regression gives $\mathrm{plim}\ \hat{\beta}_j = \beta_j + \gamma\rho_j$. Thus, OLS with an imperfect proxy is inconsistent. The hope is that the $\rho_j$ are smaller in magnitude than if $z$ were omitted from the linear projection, and this can usually be argued if $z$ is a reasonable proxy for $q$.
If including $z$ induces substantial collinearity, it might be better to use OLS without the proxy variable. However, in making these decisions we must recognize that including $z$ reduces the error variance if $\theta_1 \neq 0$: $\mathrm{Var}(\gamma r + v) < \mathrm{Var}(\gamma q + v)$ because $\mathrm{Var}(r) < \mathrm{Var}(q)$, and $v$ is uncorrelated with both $r$ and $q$. Including a proxy variable can actually reduce asymptotic variances as well as mitigate bias.
Example 4.3 (Using IQ as a Proxy for Ability): We apply the proxy variable method to the data on working men in NLS80.RAW, which was used by Blackburn and Neumark (1992), to estimate the structural model

$$\log(wage) = \beta_0 + \beta_1 exper + \beta_2 tenure + \beta_3 married + \beta_4 south + \beta_5 urban + \beta_6 black + \beta_7 educ + \gamma\,abil + v \qquad (4.29)$$

where exper is labor market experience, married is a dummy variable equal to unity if married, south is a dummy variable for the southern region, urban is a dummy variable for living in an SMSA, black is a race indicator, and educ is years of schooling. We assume that IQ satisfies the proxy variable assumptions: in the linear projection $abil = \theta_0 + \theta_1 IQ + r$, where $r$ has zero mean and is uncorrelated with IQ, we also assume that $r$ is uncorrelated with experience, tenure, education, and other factors appearing in equation (4.29). The estimated equations without and with IQ are
$$\widehat{\log(wage)} = \underset{(0.11)}{5.40} + \underset{(.003)}{.014}\,exper + \underset{(.002)}{.012}\,tenure + \underset{(.039)}{.199}\,married - \underset{(.026)}{.091}\,south + \underset{(.027)}{.184}\,urban - \underset{(.038)}{.188}\,black + \underset{(.006)}{.065}\,educ$$
$$N = 935, \qquad R^2 = .253$$

$$\widehat{\log(wage)} = \underset{(0.13)}{5.18} + \underset{(.003)}{.014}\,exper + \underset{(.002)}{.011}\,tenure + \underset{(.039)}{.200}\,married - \underset{(.026)}{.080}\,south + \underset{(.027)}{.182}\,urban - \underset{(.039)}{.143}\,black + \underset{(.007)}{.054}\,educ + \underset{(.0010)}{.0036}\,IQ$$
$$N = 935, \qquad R^2 = .263$$
Notice how the return to schooling has fallen from about 6.5 percent to about 5.4 percent when IQ is added to the regression. This is what we expect to happen if ability and schooling are (partially) positively correlated. Of course, these are just the findings from one sample. Adding IQ explains only one percentage point more of the variation in log(wage), and the equation predicts that 15 more IQ points (one standard deviation) increases wage by about 5.4 percent. The standard error on the return to education has increased, but the 95 percent confidence interval is still fairly tight.
Often the outcome of the dependent variable from an earlier time period can be a
useful proxy variable.
Example 4.4 (Effects of Job Training Grants on Worker Productivity): The data in JTRAIN1.RAW are for 157 Michigan manufacturing firms for the years 1987, 1988, and 1989. These data are from Holzer, Block, Cheatham, and Knott (1993). The goal is to determine the effectiveness of job training grants on firm productivity. For this exercise, we use only the 54 firms in 1988 which reported nonmissing values of the scrap rate (number of items out of 100 that must be scrapped). No firms were awarded grants in 1987; in 1988, 19 of the 54 firms were awarded grants. If the training grant has the intended effect, the average scrap rate should be lower among firms receiving a grant. The problem is that the grants were not randomly assigned: whether or not a firm received a grant could be related to other factors unobservable to the econometrician that affect productivity. In the simplest case, we can write (for the 1988 cross section)

$$\log(scrap) = \beta_0 + \beta_1 grant + \gamma q + v$$

where $v$ is orthogonal to grant but $q$ contains unobserved productivity factors that might be correlated with grant, a binary variable equal to unity if the firm received a job training grant. Since we have the scrap rate in the previous year, we can use $\log(scrap_{-1})$ as a proxy variable for $q$:

$$q = \theta_0 + \theta_1 \log(scrap_{-1}) + r$$

where $r$ has zero mean and, by definition, is uncorrelated with $\log(scrap_{-1})$. We hope that $r$ has no or little correlation with grant. Plugging in for $q$ gives the estimable model

$$\log(scrap) = \delta_0 + \beta_1 grant + \gamma\theta_1 \log(scrap_{-1}) + r + v$$
From this equation, we see that $\beta_1$ measures the proportionate difference in scrap rates for two firms having the same scrap rates in the previous year, but where one firm received a grant and the other did not. This is intuitively appealing. The estimated equations are

$$\widehat{\log(scrap)} = \underset{(.240)}{.409} + \underset{(.406)}{.057}\,grant$$
$$N = 54, \qquad R^2 = .0004$$

$$\widehat{\log(scrap)} = \underset{(.089)}{.021} - \underset{(.147)}{.254}\,grant + \underset{(.044)}{.831}\,\log(scrap_{-1})$$
$$N = 54, \qquad R^2 = .873$$
Without the lagged scrap rate, we see that the grant appears, if anything, to reduce productivity (by increasing the scrap rate), although the coefficient is statistically insignificant. When the lagged dependent variable is included, the coefficient on grant changes signs, becomes economically large—firms awarded grants have scrap rates about 25.4 percent less than those not given grants—and the effect is significant at the 5 percent level against a one-sided alternative. [The more accurate estimate of the percentage effect is $100 \cdot [\exp(-.254) - 1] = -22.4\%$; see Problem 4.1(a).]
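The bracketed conversion from a log-point coefficient to an exact percentage change is a one-line check (an illustration, not code from the text):

```python
import math

# Exact percentage change implied by a coefficient of -.254 on a dummy
# variable in a log-level equation: 100 * [exp(b) - 1]
print(round(100 * (math.exp(-0.254) - 1), 1))  # -22.4
```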
We can always use more than one proxy for $q$. For example, it might be that $\mathrm{E}(q \mid x, z_1, z_2) = \mathrm{E}(q \mid z_1, z_2) = \theta_0 + \theta_1 z_1 + \theta_2 z_2$, in which case including both $z_1$ and $z_2$ as regressors along with $x_1, \ldots, x_K$ solves the omitted variable problem. The weaker condition that the error $r$ in the equation $q = \theta_0 + \theta_1 z_1 + \theta_2 z_2 + r$ is uncorrelated with $x_1, \ldots, x_K$ also suffices.
The data set NLS80.RAW also contains each man's score on the knowledge of the world of work (KWW) test. Problem 4.11 asks you to reestimate equation (4.29) when KWW and IQ are both used as proxies for ability.
4.3.3 Models with Interactions in Unobservables
In some cases we might be concerned about interactions between unobservables and observable explanatory variables. Obtaining consistent estimators is more difficult in this case, but a good proxy variable can again solve the problem.
Write the structural model with unobservable $q$ as

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K + \gamma_1 q + \gamma_2 x_K q + v \qquad (4.30)$$

where we make a zero conditional mean assumption on the structural error $v$:

$$\mathrm{E}(v \mid x, q) = 0 \qquad (4.31)$$

For simplicity we have interacted $q$ with only one explanatory variable, $x_K$.
Before discussing estimation of equation (4.30), we should have an interpretation for the parameters in this equation, as the interaction $x_K q$ is unobservable. (We discussed this topic more generally in Section 2.2.5.) If $x_K$ is an essentially continuous variable, the partial effect of $x_K$ on $\mathrm{E}(y \mid x, q)$ is

$$\frac{\partial\,\mathrm{E}(y \mid x, q)}{\partial x_K} = \beta_K + \gamma_2 q \qquad (4.32)$$

Thus, the partial effect of $x_K$ actually depends on the level of $q$. Because $q$ is not observed for anyone in the population, equation (4.32) can never be estimated, even if we could estimate $\gamma_2$ (which we cannot, in general). But we can average equation
(4.32) across the population distribution of $q$. Assuming $\mathrm{E}(q) = 0$, the average partial effect (APE) of $x_K$ is

$$\mathrm{E}(\beta_K + \gamma_2 q) = \beta_K \qquad (4.33)$$
A similar interpretation holds for discrete $x_K$. For example, if $x_K$ is binary, then $\mathrm{E}(y \mid x_1, \ldots, x_{K-1}, 1, q) - \mathrm{E}(y \mid x_1, \ldots, x_{K-1}, 0, q) = \beta_K + \gamma_2 q$, and $\beta_K$ is the average of this difference over the distribution of $q$. In this case, $\beta_K$ is called the average treatment effect (ATE). This name derives from the case where $x_K$ represents receiving some "treatment," such as participation in a job training program or participation in an income maintenance program. We will consider the binary treatment case further in Chapter 18, where we introduce a counterfactual framework for estimating average treatment effects.
It turns out that the assumption $\mathrm{E}(q) = 0$ is without loss of generality. Using simple algebra we can show that, if $\mu_q \equiv \mathrm{E}(q) \neq 0$, then we can consistently estimate $\beta_K + \gamma_2\mu_q$, which is the average partial effect.
If the elements of $x$ are exogenous in the sense that $\mathrm{E}(q \mid x) = 0$, then we can consistently estimate each of the $\beta_j$ by an OLS regression, where $q$ and $x_K q$ are just part of the error term. This result follows from iterated expectations applied to equation (4.30), which shows that $\mathrm{E}(y \mid x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K$ if $\mathrm{E}(q \mid x) = 0$. The resulting equation probably has heteroskedasticity, but this is easily dealt with. Incidentally, this is a case where only assuming that $q$ and $x$ are uncorrelated would not be enough to ensure consistency of OLS: $x_K q$ and $x$ can be correlated even if $q$ and $x$ are uncorrelated.
If $q$ and $x$ are correlated, we can consistently estimate the $\beta_j$ by OLS if we have a suitable proxy variable for $q$. We still assume that the proxy variable, $z$, satisfies the redundancy condition (4.25). In the current model we must make a stronger proxy variable assumption than we did in Section 4.3.2:

$$\mathrm{E}(q \mid x, z) = \mathrm{E}(q \mid z) = \theta_1 z \qquad (4.34)$$

where now we assume $z$ has a zero mean in the population. Under these two proxy variable assumptions, iterated expectations gives

$$\mathrm{E}(y \mid x, z) = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K + \gamma_1\theta_1 z + \gamma_2\theta_1 x_K z \qquad (4.35)$$

and the parameters are consistently estimated by OLS.

If we do not define our proxy to have zero mean in the population, then estimating
equation (4.35) by OLS does not consistently estimate b
K
.IfEðzÞ0 0, then we would
have to write Eðq jzÞ¼y
0
þ y
1
z, in which case the coe‰cient on x
K
in equation
(4.35) would be b
K
þ y
0
g
2
. In practice, we may not know the population mean of the
proxy variable, in which case the proxy variable should be demeaned in the sample before interacting it with $x_K$.
If we maintain homoskedasticity in the structural model—that is, $\mathrm{Var}(y \mid x, q, z) = \mathrm{Var}(y \mid x, q) = \sigma^2$—then there must be heteroskedasticity in $\mathrm{Var}(y \mid x, z)$. Using Property CV.3 in Appendix 2A, it can be shown that

$$\mathrm{Var}(y \mid x, z) = \sigma^2 + (\gamma_1 + \gamma_2 x_K)^2\,\mathrm{Var}(q \mid x, z)$$

Even if $\mathrm{Var}(q \mid x, z)$ is constant, $\mathrm{Var}(y \mid x, z)$ depends on $x_K$. This situation is most easily dealt with by computing heteroskedasticity-robust statistics, which allows for heteroskedasticity of arbitrary form.
Example 4.5 (Return to Education Depends on Ability): Consider an extension of the wage equation (4.29):

$$\log(wage) = \beta_0 + \beta_1 exper + \beta_2 tenure + \beta_3 married + \beta_4 south + \beta_5 urban + \beta_6 black + \beta_7 educ + \gamma_1\,abil + \gamma_2\,educ\cdot abil + v \qquad (4.36)$$

so that educ and abil have separate effects but also have an interactive effect. In this model the return to a year of schooling depends on abil: $\beta_7 + \gamma_2\,abil$. Normalizing abil to have zero population mean, we see that the average of the return to education is simply $\beta_7$. We estimate this equation under the assumption that IQ is redundant in equation (4.36) and $\mathrm{E}(abil \mid x, IQ) = \mathrm{E}(abil \mid IQ) = \theta_1(IQ - 100) \equiv \theta_1 IQ^0$, where $IQ^0$ is the population-demeaned IQ (IQ is constructed to have mean 100 in the population). We can estimate the $\beta_j$ in equation (4.36) by replacing abil with $IQ^0$ and $educ\cdot abil$ with $educ\cdot IQ^0$ and doing OLS.
Using the sample of men in NLS80.RAW gives the following:

$$\widehat{\log(wage)} = \cdots + \underset{(.007)}{.052}\,educ - \underset{(.00516)}{.00094}\,IQ^0 + \underset{(.00038)}{.00034}\,educ\cdot IQ^0$$
$$N = 935, \qquad R^2 = .263$$
where the usual OLS standard errors are reported (if $\gamma_2 = 0$, homoskedasticity may be reasonable). The interaction term $educ\cdot IQ^0$ is not statistically significant, and the return to education at the average IQ, 5.2 percent, is similar to the estimate when the return to education is assumed to be constant. Thus there is little evidence for an interaction between education and ability. Incidentally, the F test for joint significance of $IQ^0$ and $educ\cdot IQ^0$ yields a p-value of about .0011, but the interaction term is not needed.
In this case, we happen to know the population mean of IQ, but in most cases we will not know the population mean of a proxy variable. Then, we should use the sample average to demean the proxy before interacting it with $x_K$; see Problem 4.8. Technically, using the sample average to estimate the population average should be reflected in the OLS standard errors. But, as you are asked to show in Problem 6.10 in Chapter 6, the adjustments generally have very small impacts on the standard errors and can safely be ignored.
In his study on the effects of computer usage on the wage structure in the United States, Krueger (1993) uses computer usage at home as a proxy for unobservables that might be correlated with computer usage at work; he also includes an interaction between the two computer usage dummies. Krueger does not demean the "uses computer at home" dummy before constructing the interaction, so his estimate on "uses a computer at work" does not have an average treatment effect interpretation. However, just as in Example 4.5, Krueger found that the interaction term is insignificant.
4.4 Properties of OLS under Measurement Error
As we saw in Section 4.1, another way that endogenous explanatory variables can arise in economic applications occurs when one or more of the variables in our model contains measurement error. In this section, we derive the consequences of measurement error for ordinary least squares estimation.

The measurement error problem has a statistical structure similar to the omitted variable–proxy variable problem discussed in the previous section. However, they are conceptually very different. In the proxy variable case, we are looking for a variable that is somehow associated with the unobserved variable. In the measurement error case, the variable that we do not observe has a well-defined, quantitative meaning (such as a marginal tax rate or annual income), but our measures of it may contain error. For example, reported annual income is a measure of actual annual income, whereas IQ score is a proxy for ability.

Another important difference between the proxy variable and measurement error problems is that, in the latter case, often the mismeasured explanatory variable is the one whose effect is of primary interest. In the proxy variable case, we cannot estimate the effect of the omitted variable.

Before we turn to the analysis, it is important to remember that measurement error is an issue only when the variables on which we can collect data differ from the variables that influence decisions by individuals, families, firms, and so on. For example,
suppose we are estimating the effect of peer group behavior on teenage drug usage, where the behavior of one's peer group is self-reported. Self-reporting may be a mismeasure of actual peer group behavior, but so what? We are probably more interested in the effects of how a teenager perceives his or her peer group.
4.4.1 Measurement Error in the Dependent Variable
We begin with the case where the dependent variable is the only variable measured with error. Let $y^*$ denote the variable (in the population, as always) that we would like to explain. For example, $y^*$ could be annual family saving. The regression model has the usual linear form

$$y^* = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K + v \qquad (4.37)$$

and we assume that it satisfies at least Assumptions OLS.1 and OLS.2. Typically, we are interested in $\mathrm{E}(y^* \mid x_1, \ldots, x_K)$. We let $y$ represent the observable measure of $y^*$, where $y \neq y^*$.
The population measurement error is defined as the difference between the observed value and the actual value:

$$e_0 = y - y^* \qquad (4.38)$$

For a random draw $i$ from the population, we can write $e_{i0} = y_i - y_i^*$, but what is important is how the measurement error in the population is related to other factors. To obtain an estimable model, we write $y^* = y - e_0$, plug this into equation (4.37), and rearrange:

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K + v + e_0 \qquad (4.39)$$
Since $y, x_1, x_2, \ldots, x_K$ are observed, we can estimate this model by OLS. In effect, we just ignore the fact that $y$ is an imperfect measure of $y^*$ and proceed as usual.

When does OLS with $y$ in place of $y^*$ produce consistent estimators of the $\beta_j$? Since the original model (4.37) satisfies Assumption OLS.1, $v$ has zero mean and is uncorrelated with each $x_j$. It is only natural to assume that the measurement error has zero mean; if it does not, this fact only affects estimation of the intercept, $\beta_0$. Much more important is what we assume about the relationship between the measurement error $e_0$ and the explanatory variables $x_j$. The usual assumption is that the measurement error in $y$ is statistically independent of each explanatory variable, which implies that $e_0$ is uncorrelated with $x$. Then, the OLS estimators from equation (4.39) are consistent (and possibly unbiased as well). Further, the usual OLS inference procedures ($t$ statistics, $F$ statistics, LM statistics) are asymptotically valid under appropriate homoskedasticity assumptions.
If $e_0$ and $v$ are uncorrelated, as is usually assumed, then $\mathrm{Var}(v + e_0) = \sigma_v^2 + \sigma_0^2 > \sigma_v^2$. Therefore, measurement error in the dependent variable results in a larger error variance than when the dependent variable is not measured with error. This result is hardly surprising and translates into larger asymptotic variances for the OLS estimators than if we could observe $y^*$. But the larger error variance violates none of the assumptions needed for OLS estimation to have its desirable large-sample properties.
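A Monte Carlo sketch of both claims: the slope estimator stays centered on its true value, but its sampling variance grows by the factor $(\sigma_v^2 + \sigma_0^2)/\sigma_v^2$. All numbers below are invented for the illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
N, reps, beta1 = 500, 2000, 1.0
slopes_clean, slopes_noisy = [], []
for _ in range(reps):
    x = rng.normal(size=N)
    y_star = 2.0 + beta1 * x + rng.normal(size=N)   # structural error: sigma_v^2 = 1
    y = y_star + rng.normal(scale=2.0, size=N)      # DV measurement error: sigma_0^2 = 4
    X = np.column_stack([np.ones(N), x])
    slopes_clean.append(np.linalg.lstsq(X, y_star, rcond=None)[0][1])
    slopes_noisy.append(np.linalg.lstsq(X, y, rcond=None)[0][1])

print(np.mean(slopes_noisy))                        # still centered near beta1 = 1
print(np.std(slopes_noisy) / np.std(slopes_clean))  # near sqrt(5), about 2.24
```

Because the measurement error is independent of $x$, there is no bias; the only cost is the roughly $\sqrt{5}$-fold increase in the standard deviation of the slope estimates.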
Example 4.6 (Saving Function with Measurement Error): Consider a saving function

$$\mathrm{E}(sav^* \mid inc, size, educ, age) = \beta_0 + \beta_1 inc + \beta_2 size + \beta_3 educ + \beta_4 age$$

but where actual saving ($sav^*$) may deviate from reported saving (sav). The question is whether the size of the measurement error in sav is systematically related to the other variables. It may be reasonable to assume that the measurement error is not correlated with inc, size, educ, and age, but we might expect that families with higher incomes, or more education, report their saving more accurately. Unfortunately, without more information, we cannot know whether the measurement error is correlated with inc or educ.
When the dependent variable is in logarithmic form, so that $\log(y^*)$ is the dependent variable, a natural measurement error equation is

$$\log(y) = \log(y^*) + e_0 \qquad (4.40)$$

This follows from a multiplicative measurement error for $y$: $y = y^* a_0$, where $a_0 > 0$ and $e_0 = \log(a_0)$.
Example 4.7 (Measurement Error in Firm Scrap Rates): In Example 4.4, we might think that the firm scrap rate is mismeasured, leading us to postulate the model $\log(scrap^*) = \beta_0 + \beta_1 grant + v$, where $scrap^*$ is the true scrap rate. The measurement error equation is $\log(scrap) = \log(scrap^*) + e_0$. Is the measurement error $e_0$ independent of whether the firm receives a grant? Not if a firm receiving a grant is more likely to underreport its scrap rate in order to make it look as if the grant had the intended effect. If underreporting occurs, then, in the estimable equation $\log(scrap) = \beta_0 + \beta_1 grant + v + e_0$, the error $u = v + e_0$ is negatively correlated with grant. This result would produce a downward bias in $\beta_1$, tending to make the training program look more effective than it actually was.
These examples show that measurement error in the dependent variable can cause
biases in OLS if the measurement error is systematically related to one or more of the
explanatory variables. If the measurement error is uncorrelated with the explanatory
variables, OLS is perfectly appropriate.