
13 Maximum Likelihood Methods
13.1 Introduction
This chapter contains a general treatment of maximum likelihood estimation (MLE) under random sampling. All the models we considered in Part I could be estimated without making full distributional assumptions about the endogenous variables conditional on the exogenous variables: maximum likelihood methods were not needed. Instead, we focused primarily on zero-covariance and zero-conditional-mean assumptions, and secondarily on assumptions about conditional variances and covariances. These assumptions were sufficient for obtaining consistent, asymptotically normal estimators, some of which were shown to be efficient within certain classes of estimators.

Some texts on advanced econometrics take maximum likelihood estimation as the unifying theme, and then most models are estimated by maximum likelihood. In addition to providing a unified approach to estimation, MLE has some desirable efficiency properties: it is generally the most efficient estimation procedure in the class of estimators that use information on the distribution of the endogenous variables given the exogenous variables. (We formalize the efficiency of MLE in Section 14.5.) So why not always use MLE?
As we saw in Part I, efficiency usually comes at the price of nonrobustness, and this is certainly the case for maximum likelihood. Maximum likelihood estimators are generally inconsistent if some part of the specified distribution is misspecified. As an example, consider from Section 9.5 a simultaneous equations model that is linear in its parameters but nonlinear in some endogenous variables. There, we discussed estimation by instrumental variables methods. We could estimate SEMs nonlinear in endogenous variables by maximum likelihood if we assumed independence between the structural errors and the exogenous variables and if we assumed a particular distribution for the structural errors, say, multivariate normal. The MLE would be asymptotically more efficient than the best GMM estimator, but failure of normality generally results in inconsistent estimators of all parameters.

As a second example, suppose we wish to estimate E(y|x), where y is bounded between zero and one. The logistic function, exp(xβ)/[1 + exp(xβ)], is a reasonable


model for E(y|x), and, as we discussed in Section 12.2, nonlinear least squares provides consistent, √N-asymptotically normal estimators under weak regularity conditions. We can easily make inference robust to arbitrary heteroskedasticity in Var(y|x). An alternative approach is to model the density of y given x—which, of course, implies a particular model for E(y|x)—and use maximum likelihood estimation. As we will see, the strength of MLE is that, under correct specification of the density, we would have the asymptotically efficient estimators, and we would be able to estimate any feature of the conditional distribution, such as P(y = 1|x). The drawback is that, except in special cases, if we have misspecified the density in any way, we will not be able to consistently estimate the conditional mean.
In most applications, specifying the distribution of the endogenous variables conditional on exogenous variables must have a component of arbitrariness, as economic theory rarely provides guidance. Our perspective is that, for robustness reasons, it is desirable to make as few assumptions as possible—at least until relaxing them becomes practically difficult. There are cases in which MLE turns out to be robust to failure of certain assumptions, but these must be examined on a case-by-case basis, a process that detracts from the unifying theme provided by the MLE approach. (One such example is nonlinear regression under a homoskedastic normal assumption; the MLE of the parameters β_o is identical to the NLS estimator, and we know the latter is consistent and asymptotically normal quite generally. We will cover some other leading cases in Chapter 19.)
Maximum likelihood plays an important role in modern econometric analysis, for
good reason. There are many problems for which it is indispensable. For example, in
Chapters 15 and 16 we study various limited dependent variable models, and MLE plays a central role.
13.2 Preliminaries and Examples
Traditional maximum likelihood theory for independent, identically distributed observations {y_i ∈ ℝ^G: i = 1, 2, ...} starts by specifying a family of densities for y_i. This is the framework used in introductory statistics courses, where y_i is a scalar with a normal or Poisson distribution. But in almost all economic applications, we are interested in estimating parameters in conditional distributions. Therefore, we assume that each random draw is partitioned as (x_i, y_i), where x_i ∈ ℝ^K and y_i ∈ ℝ^G, and we are interested in estimating a model for the conditional distribution of y_i given x_i. We are not interested in the distribution of x_i, so we will not specify a model for it. Consequently, the method of this chapter is properly called conditional maximum likelihood estimation (CMLE). By taking x_i to be null we cover unconditional MLE as a special case.
An alternative to viewing (x_i, y_i) as a random draw from the population is to treat the conditioning variables x_i as nonrandom vectors that are set ahead of time and that appear in the unconditional distribution of y_i. (This is analogous to the fixed regressor assumption in classical regression analysis.) Then, the y_i cannot be identically distributed, and this fact complicates the asymptotic analysis. More importantly, treating the x_i as nonrandom is much too restrictive for all uses of maximum likelihood. In fact, later on we will cover methods where x_i contains what are endogenous variables in a structural model, but where it is convenient to obtain the distribution of one set of endogenous variables conditional on another set. Once we know how to analyze the general CMLE case, applications follow fairly directly.
It is important to understand that the subsequent results apply any time we have random sampling in the cross section dimension. Thus, the general theory applies to system estimation, as in Chapters 7 and 9, provided we are willing to assume a distribution for y_i given x_i. In addition, panel data settings with large cross sections and relatively small time periods are encompassed, since the appropriate asymptotic analysis is with the time dimension fixed and the cross section dimension tending to infinity.
In order to perform maximum likelihood analysis we need to specify, or derive from an underlying (structural) model, the density of y_i given x_i. We assume this density is known up to a finite number of unknown parameters, with the result that we have a parametric model of a conditional density. The vector y_i can be continuous or discrete, or it can have both discrete and continuous characteristics. In many of our applications, y_i is a scalar, but this fact does not simplify the general treatment.

We will carry along two examples in this chapter to illustrate the general theory of conditional maximum likelihood. The first example is a binary response model, specifically the probit model. We postpone the uses and interpretation of binary response models until Chapter 15.
Example 13.1 (Probit): Suppose that the latent variable y_i* follows

$$y_i^* = x_i\theta + e_i \qquad (13.1)$$

where e_i is independent of x_i (which is a 1 × K vector with first element equal to unity for all i), θ is a K × 1 vector of parameters, and e_i ~ Normal(0, 1). Instead of observing y_i* we observe only a binary variable indicating the sign of y_i*:

$$y_i = 1 \quad \text{if } y_i^* > 0 \qquad (13.2)$$

$$y_i = 0 \quad \text{if } y_i^* \le 0 \qquad (13.3)$$

To be succinct, it is useful to write equations (13.2) and (13.3) in terms of the indicator function, denoted 1[·]. This function is unity whenever the statement in brackets is true, and zero otherwise. Thus, equations (13.2) and (13.3) are equivalently written as y_i = 1[y_i* > 0]. Because e_i is normally distributed, it is irrelevant whether the strict inequality is in equation (13.2) or (13.3).
We can easily obtain the distribution of y_i given x_i:

$$P(y_i = 1 \mid x_i) = P(y_i^* > 0 \mid x_i) = P(x_i\theta + e_i > 0 \mid x_i) = P(e_i > -x_i\theta \mid x_i) = 1 - \Phi(-x_i\theta) = \Phi(x_i\theta) \qquad (13.4)$$

where Φ(·) denotes the standard normal cumulative distribution function (cdf). We have used Property CD.4 in the chapter appendix along with the symmetry of the normal distribution. Therefore,

$$P(y_i = 0 \mid x_i) = 1 - \Phi(x_i\theta) \qquad (13.5)$$

We can combine equations (13.4) and (13.5) into the density of y_i given x_i:

$$f(y \mid x_i) = [\Phi(x_i\theta)]^{y}\,[1 - \Phi(x_i\theta)]^{1-y}, \qquad y = 0, 1 \qquad (13.6)$$

The fact that f(y|x_i) is zero when y ∉ {0, 1} is obvious, so we will not be explicit about this in the future.
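The density in equation (13.6) is simple to evaluate numerically. The following sketch is illustrative only; the function name and the use of NumPy and SciPy are choices made for this example rather than anything in the text.

```python
# A minimal sketch of the probit conditional density (13.6), assuming x is a
# 1 x K array (first element unity) and theta is a K-vector.
import numpy as np
from scipy.stats import norm

def probit_density(y, x, theta):
    """f(y | x; theta) = Phi(x theta)^y [1 - Phi(x theta)]^(1 - y) for y in {0, 1}."""
    p = norm.cdf(x @ theta)      # P(y = 1 | x) = Phi(x theta), equation (13.4)
    return p**y * (1.0 - p)**(1 - y)
```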
Our second example is useful when the variable to be explained takes on nonnegative integer values. Such a variable is called a count variable. We will discuss the use and interpretation of count data models in Chapter 19. For now, it suffices to note that a linear model for E(y|x) when y takes on nonnegative integer values is not ideal because it can lead to negative predicted values. Further, since y can take on the value zero with positive probability, the transformation log(y) cannot be used to obtain a model with constant elasticities or constant semielasticities. A functional form well suited for E(y|x) is exp(xθ). We could estimate θ by using nonlinear least squares, but all of the standard distributions for count variables imply heteroskedasticity (see Chapter 19). Thus, we can hope to do better. A traditional approach to regression models with count data is to assume that y_i given x_i has a Poisson distribution.
Example 13.2 (Poisson Regression): Let y_i be a nonnegative count variable; that is, y_i can take on integer values 0, 1, 2, .... Denote the conditional mean of y_i given the vector x_i as E(y_i|x_i) = m(x_i). A natural distribution for y_i given x_i is the Poisson distribution:

$$f(y \mid x_i) = \exp[-m(x_i)]\{m(x_i)\}^{y}/y!, \qquad y = 0, 1, 2, \ldots \qquad (13.7)$$

(We use y as the dummy argument in the density, not to be confused with the random variable y_i.) Once we choose a form for the conditional mean function, we have completely determined the distribution of y_i given x_i. For example, from equation (13.7), P(y_i = 0|x_i) = exp[−m(x_i)]. An important feature of the Poisson distribution is that the variance equals the mean: Var(y_i|x_i) = E(y_i|x_i) = m(x_i). The usual choice for m(·) is m(x) = exp(xθ), where θ is K × 1 and x is 1 × K with first element unity.
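As with the probit density, equation (13.7) is easy to code. The sketch below, with the exponential mean m(x) = exp(xθ), is a hypothetical helper written for this discussion; the names and libraries are not from the text.

```python
# A minimal sketch of the Poisson conditional density (13.7) with m(x) = exp(x theta).
import numpy as np
from scipy.special import gammaln   # gammaln(y + 1) = log(y!)

def poisson_density(y, x, theta):
    m = np.exp(x @ theta)
    return np.exp(-m + y * np.log(m) - gammaln(y + 1))
```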
13.3 General Framework for Conditional MLE
Let p_o(y|x) denote the conditional density of y_i given x_i = x, where y and x are dummy arguments. We index this density by "o" to emphasize that it is the true density of y_i given x_i, and not just one of many candidates. It will be useful to let X ⊂ ℝ^K denote the possible values for x_i and Y denote the possible values of y_i; X and Y are called the supports of the random vectors x_i and y_i, respectively.

For a general treatment, we assume that, for all x ∈ X, p_o(·|x) is a density with respect to a σ-finite measure, denoted ν(dy). Defining a σ-finite measure would take us too far afield. We will say little more about the measure ν(dy) because it does not play a crucial role in applications. It suffices to know that ν(dy) can be chosen to allow y_i to be discrete, continuous, or some mixture of the two. When y_i is discrete, the measure ν(dy) simply turns all integrals into sums; when y_i is purely continuous, we obtain the usual Riemann integrals. Even in more complicated cases—where, say, y_i has both discrete and continuous characteristics—we can get by with tools from basic probability without ever explicitly defining ν(dy). For more on measures and general integrals, you are referred to Billingsley (1979) and Davidson (1994, Chapters 3 and 4).
In Chapter 12 we saw how nonlinear least squares can be motivated by the fact that m_o(x) ≡ E(y|x) minimizes E{[y − m(x)]²} over all other functions m(x) with E{[m(x)]²} < ∞. Conditional maximum likelihood has a similar motivation. The result from probability that is crucial for applying the analogy principle is the conditional Kullback-Leibler information inequality. Although there are more general statements of this inequality, the following suffices for our purpose: for any nonnegative function f(·|x) such that

$$\int_{\mathsf{Y}} f(y \mid x)\,\nu(dy) = 1, \qquad \text{all } x \in \mathsf{X} \qquad (13.8)$$

Property CD.1 in the chapter appendix implies that

$$K(f, x) \equiv \int_{\mathsf{Y}} \log[p_o(y \mid x)/f(y \mid x)]\,p_o(y \mid x)\,\nu(dy) \ge 0, \qquad \text{all } x \in \mathsf{X} \qquad (13.9)$$

Because the integral is identically zero for f = p_o, expression (13.9) says that, for each x, K(f, x) is minimized at f = p_o.
We can apply inequality (13.9) to a parametric model for p_o(·|x),

$$\{f(\cdot \mid x; \theta),\ \theta \in \Theta,\ \Theta \subset \mathbb{R}^P\} \qquad (13.10)$$

which we assume satisfies condition (13.8) for each x ∈ X and each θ ∈ Θ; if it does not, then f(·|x; θ) does not integrate to unity (with respect to the measure ν), and as a result it is a very poor candidate for p_o(y|x). Model (13.10) is a correctly specified model of the conditional density, p_o(·|·), if, for some θ_o ∈ Θ,

$$f(\cdot \mid x; \theta_o) = p_o(\cdot \mid x), \qquad \text{all } x \in \mathsf{X} \qquad (13.11)$$

As we discussed in Chapter 12, it is useful to use θ_o to distinguish the true value of the parameter from a generic element of Θ. In particular examples, we will not bother making this distinction unless it is needed to make a point.
For each x ∈ X, K(f, x) can be written as E{log[p_o(y_i|x_i)] | x_i = x} − E{log[f(y_i|x_i)] | x_i = x}. Therefore, if the parametric model is correctly specified, then E{log[f(y_i|x_i; θ_o)] | x_i} ≥ E{log[f(y_i|x_i; θ)] | x_i}, or

$$E[\ell_i(\theta_o) \mid x_i] \ge E[\ell_i(\theta) \mid x_i], \qquad \theta \in \Theta \qquad (13.12)$$

where

$$\ell_i(\theta) \equiv \ell(y_i, x_i, \theta) \equiv \log f(y_i \mid x_i; \theta) \qquad (13.13)$$
is the conditional log likelihood for observation i. Note that ℓ_i(θ) is a random function of θ, since it depends on the random vector (x_i, y_i). By taking the expected value of expression (13.12) and using iterated expectations, we see that θ_o solves

$$\max_{\theta \in \Theta} E[\ell_i(\theta)] \qquad (13.14)$$

where the expectation is with respect to the joint distribution of (x_i, y_i). The sample analogue of expression (13.14) is

$$\max_{\theta \in \Theta} N^{-1}\sum_{i=1}^N \log f(y_i \mid x_i; \theta) \qquad (13.15)$$
A solution to problem (13.15), assuming that one exists, is the conditional maximum likelihood estimator (CMLE) of θ_o, which we denote as θ̂. We will sometimes drop "conditional" when it is not needed for clarity.

The CMLE is clearly an M-estimator, since a maximization problem is easily turned into a minimization problem: in the notation of Chapter 12, take w_i ≡ (x_i, y_i) and q(w_i, θ) ≡ −log f(y_i|x_i; θ). As long as we keep track of the minus sign in front of the log likelihood, we can apply the results in Chapter 12 directly.
The motivation for the conditional MLE as a solution to problem (13.15) may appear backward if you learned about maximum likelihood estimation in an introductory statistics course. In a traditional framework, we would treat the x_i as constants appearing in the distribution of y_i, and we would define θ̂ as the solution to

$$\max_{\theta \in \Theta} \prod_{i=1}^N f(y_i \mid x_i; \theta) \qquad (13.16)$$

Under independence, the product in expression (13.16) is the model for the joint density of (y_1, ..., y_N), evaluated at the data. Because maximizing the function in (13.16) is the same as maximizing its natural log, we are led to problem (13.15). However, the arguments explaining why solving (13.16) should lead to a good estimator of θ_o are necessarily heuristic. By contrast, the analogy principle applies directly to problem (13.15), and we need not assume that the x_i are fixed.

In our two examples, the conditional log likelihoods are fairly simple.

Example 13.1 (continued): In the probit example, the log likelihood for observation i is ℓ_i(θ) = y_i log Φ(x_iθ) + (1 − y_i) log[1 − Φ(x_iθ)].

Example 13.2 (continued): In the Poisson example, ℓ_i(θ) = −exp(x_iθ) + y_i x_iθ − log(y_i!). Normally, we would drop the last term in defining ℓ_i(θ) because it does not affect the maximization problem.
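Problem (13.15) can be handed to any numerical optimizer. The sketch below maximizes the probit sample log likelihood on simulated data; the data-generating step, the clipping of the cdf for numerical stability, and the choice of optimizer are assumptions of this illustration, not part of the text.

```python
# A minimal sketch of the CMLE as the solution to problem (13.15), using the probit
# log likelihood of Example 13.1 on simulated data.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
N, K = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])   # first element unity
theta_true = np.array([0.5, 1.0, -1.0])
y = (X @ theta_true + rng.normal(size=N) > 0).astype(float)      # equations (13.1)-(13.3)

def neg_loglik(theta):
    # minus the sample log likelihood; minimizing it solves (13.15)
    p = np.clip(norm.cdf(X @ theta), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

theta_hat = minimize(neg_loglik, x0=np.zeros(K), method="BFGS").x
```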
13.4 Consistency of Conditional MLE
In this section we state a formal consistency result for the CMLE, which is a special case of the M-estimator consistency result Theorem 12.2.

theorem 13.1 (Consistency of CMLE): Let {(x_i, y_i): i = 1, 2, ...} be a random sample with x_i ∈ X ⊂ ℝ^K, y_i ∈ Y ⊂ ℝ^G. Let Θ ⊂ ℝ^P be the parameter set and denote the parametric model of the conditional density as {f(·|x; θ): x ∈ X, θ ∈ Θ}. Assume that (a) f(·|x; θ) is a true density with respect to the measure ν(dy) for all x and θ, so that condition (13.8) holds; (b) for some θ_o ∈ Θ, p_o(·|x) = f(·|x; θ_o), all x ∈ X, and θ_o is the unique solution to problem (13.14); (c) Θ is a compact set; (d) for each θ ∈ Θ, ℓ(·, θ) is a Borel measurable function on Y × X; (e) for each (y, x) ∈ Y × X, ℓ(y, x, ·) is a continuous function on Θ; and (f) |ℓ(w, θ)| ≤ b(w), all θ ∈ Θ, and E[b(w)] < ∞. Then there exists a solution to problem (13.15), the CMLE θ̂, and plim θ̂ = θ_o.
As we discussed in Chapter 12, the measurability assumption in part d is purely technical and does not need to be checked in practice. Compactness of Θ can be relaxed, but doing so usually requires considerable work. The continuity assumption holds in most econometric applications, but there are cases where it fails, such as when estimating certain models of auctions—see Donald and Paarsch (1996). The moment assumption in part f typically restricts the distribution of x_i in some way, but such restrictions are rarely a serious concern. For the most part, the key assumptions are that the parametric model is correctly specified, that θ_o is identified, and that the log-likelihood function is continuous in θ.

For the probit and Poisson examples, the log likelihoods are clearly continuous in θ. We can verify the moment condition (f) if we bound certain moments of x_i and make the parameter space compact. But our primary concern is that the densities are correctly specified. For example, in the probit case, the density for y_i given x_i will be incorrect if the latent error e_i is not independent of x_i and normally distributed, or if the latent variable model is not linear to begin with. For identification we must rule out perfect collinearity in x_i. The Poisson CMLE turns out to have desirable properties even if the Poisson distributional assumption does not hold, but we postpone a discussion of the robustness of the Poisson CMLE until Chapter 19.
13.5 Asymptotic Normality and Asymptotic Variance Estimation
Under the differentiability and moment assumptions that allow us to apply the theorems in Chapter 12, we can show that the MLE is generally asymptotically normal. Naturally, the computational methods discussed in Section 12.7, including concentrating parameters out of the log likelihood, apply directly.
13.5.1 Asymptotic Normality
We can derive the limiting distribution of the MLE by applying Theorem 12.3. We will have to assume the regularity conditions there; in particular, we assume that θ_o is in the interior of Θ, and ℓ_i(θ) is twice continuously differentiable on the interior of Θ. The score of the log likelihood for observation i is simply

$$s_i(\theta) \equiv \nabla_\theta \ell_i(\theta)' = \left[\frac{\partial \ell_i}{\partial \theta_1}(\theta),\ \frac{\partial \ell_i}{\partial \theta_2}(\theta),\ \ldots,\ \frac{\partial \ell_i}{\partial \theta_P}(\theta)\right]' \qquad (13.17)$$

a P × 1 vector as in Chapter 12.
Example 13.1 (continued): For the probit case, θ is K × 1 and

$$\nabla_\theta \ell_i(\theta) = y_i\,\frac{\phi(x_i\theta)x_i}{\Phi(x_i\theta)} - (1 - y_i)\,\frac{\phi(x_i\theta)x_i}{[1 - \Phi(x_i\theta)]}$$

Transposing this equation, and using a little algebra, gives

$$s_i(\theta) = \frac{\phi(x_i\theta)x_i'[y_i - \Phi(x_i\theta)]}{\Phi(x_i\theta)[1 - \Phi(x_i\theta)]} \qquad (13.18)$$

Recall that x_i' is a K × 1 vector.
Example 13.2 (continued): The score for the Poisson case, where θ is again K × 1, is

$$s_i(\theta) = -\exp(x_i\theta)x_i' + y_i x_i' = x_i'[y_i - \exp(x_i\theta)] \qquad (13.19)$$
In the vast majority of cases, the score of the log-likelihood function has an important zero conditional mean property:

$$E[s_i(\theta_o) \mid x_i] = 0 \qquad (13.20)$$

In other words, when we evaluate the P × 1 score at θ_o, and take its expectation with respect to f(·|x_i; θ_o), the expectation is zero. Under condition (13.20), E[s_i(θ_o)] = 0, which was a key condition in deriving the asymptotic normality of the M-estimator in Chapter 12.
To show condition (13.20) generally, let E_θ[·|x_i] denote conditional expectation with respect to the density f(·|x_i; θ) for any θ ∈ Θ. Then, by definition,

$$E_\theta[s_i(\theta) \mid x_i] = \int_{\mathsf{Y}} s(y, x_i, \theta)\,f(y \mid x_i; \theta)\,\nu(dy)$$

If integration and differentiation can be interchanged on int(Θ)—that is, if

$$\nabla_\theta\left[\int_{\mathsf{Y}} f(y \mid x_i; \theta)\,\nu(dy)\right] = \int_{\mathsf{Y}} \nabla_\theta f(y \mid x_i; \theta)\,\nu(dy) \qquad (13.21)$$

for all x_i ∈ X, θ ∈ int(Θ)—then

$$0 = \int_{\mathsf{Y}} \nabla_\theta f(y \mid x_i; \theta)\,\nu(dy) \qquad (13.22)$$

since ∫_Y f(y|x_i; θ)ν(dy) is unity for all θ, and therefore the partial derivatives with respect to θ must be identically zero. But the right-hand side of equation (13.22) can be written as ∫_Y [∇_θ ℓ(y, x_i, θ)] f(y|x_i; θ)ν(dy). Putting in θ_o for θ and transposing yields condition (13.20).
Example 13.1 (continued): Define u_i ≡ y_i − Φ(x_iθ_o) = y_i − E(y_i|x_i). Then

$$s_i(\theta_o) = \frac{\phi(x_i\theta_o)x_i' u_i}{\Phi(x_i\theta_o)[1 - \Phi(x_i\theta_o)]}$$

and, since E(u_i|x_i) = 0, it follows that E[s_i(θ_o)|x_i] = 0.

Example 13.2 (continued): Define u_i ≡ y_i − exp(x_iθ_o). Then s_i(θ_o) = x_i' u_i and so E[s_i(θ_o)|x_i] = 0.
Assuming that ℓ_i(θ) is twice continuously differentiable on the interior of Θ, let the Hessian for observation i be the P × P matrix of second partial derivatives of ℓ_i(θ):

$$H_i(\theta) \equiv \nabla_\theta s_i(\theta) = \nabla_\theta^2 \ell_i(\theta) \qquad (13.23)$$

The Hessian is a symmetric matrix that generally depends on (x_i, y_i). Since MLE is a maximization problem, the expected value of H_i(θ_o) is negative definite. Thus, to apply the theory in Chapter 12, we define

$$A_o \equiv -E[H_i(\theta_o)] \qquad (13.24)$$

which is generally a positive definite matrix when θ_o is identified. Under standard regularity conditions, the asymptotic normality of the CMLE follows from Theorem 12.3: √N(θ̂ − θ_o) is asymptotically Normal(0, A_o⁻¹B_oA_o⁻¹), where B_o ≡ Var[s_i(θ_o)] ≡ E[s_i(θ_o)s_i(θ_o)']. It turns out that this general form of the asymptotic variance matrix is too complicated. We now show that B_o = A_o.
We must assume enough smoothness such that the following interchange of integral and derivative is valid (see Newey and McFadden, 1994, Section 5.1, for the case of unconditional MLE):

$$\nabla_\theta\left[\int_{\mathsf{Y}} s_i(\theta)\,f(y \mid x_i; \theta)\,\nu(dy)\right] = \int_{\mathsf{Y}} \nabla_\theta[s_i(\theta)\,f(y \mid x_i; \theta)]\,\nu(dy) \qquad (13.25)$$

Then, taking the derivative of the identity

$$\int_{\mathsf{Y}} s_i(\theta)\,f(y \mid x_i; \theta)\,\nu(dy) \equiv E_\theta[s_i(\theta) \mid x_i] = 0, \qquad \theta \in \mathrm{int}(\Theta)$$

and using equation (13.25), gives, for all θ ∈ int(Θ),

$$-E_\theta[H_i(\theta) \mid x_i] = \mathrm{Var}_\theta[s_i(\theta) \mid x_i]$$

where the indexing by θ denotes expectation and variance when f(·|x_i; θ) is the density of y_i given x_i. When evaluated at θ = θ_o we get a very important equality:

$$-E[H_i(\theta_o) \mid x_i] = E[s_i(\theta_o)s_i(\theta_o)' \mid x_i] \qquad (13.26)$$

where the expectation and variance are with respect to the true conditional distribution of y_i given x_i. Equation (13.26) is called the conditional information matrix equality (CIME). Taking the expectation of equation (13.26) (with respect to the distribution of x_i) and using the law of iterated expectations gives

$$-E[H_i(\theta_o)] = E[s_i(\theta_o)s_i(\theta_o)'] \qquad (13.27)$$

or A_o = B_o. This relationship is best thought of as the unconditional information matrix equality (UIME).
theorem 13.2 (Asymptotic Normality of CMLE): Let the conditions of Theorem 13.1 hold. In addition, assume that (a) θ_o ∈ int(Θ); (b) for each (y, x) ∈ Y × X, ℓ(y, x, ·) is twice continuously differentiable on int(Θ); (c) the interchanges of derivative and integral in equations (13.21) and (13.25) hold for all θ ∈ int(Θ); (d) the elements of ∇²_θ ℓ(y, x, θ) are bounded in absolute value by a function b(y, x) with finite expectation; and (e) A_o defined by expression (13.24) is positive definite. Then

$$\sqrt{N}(\hat{\theta} - \theta_o) \xrightarrow{d} \mathrm{Normal}(0, A_o^{-1}) \qquad (13.28)$$

and therefore

$$\mathrm{Avar}(\hat{\theta}) = A_o^{-1}/N \qquad (13.29)$$

In standard applications, the log likelihood has many continuous partial derivatives, although there are examples where it does not. Some examples also violate the interchange of the integral and derivative in equation (13.21) or (13.25), such as when the conditional support of y_i depends on the parameters θ_o. In such cases we cannot expect the CMLE to have a limiting normal distribution; it may not even converge at the rate √N. Some progress has been made for specific models when the support of the distribution depends on unknown parameters; see, for example, Donald and Paarsch (1996).
13.5.2 Estimating the Asymptotic Variance
Estimating Avar(θ̂) requires estimating A_o. From the equalities derived previously, there are at least three possible estimators of A_o in the CMLE context. In fact, under slight extensions of the regularity conditions in Theorem 13.2, each of the matrices

$$N^{-1}\sum_{i=1}^N -H_i(\hat{\theta}), \qquad N^{-1}\sum_{i=1}^N s_i(\hat{\theta})s_i(\hat{\theta})', \qquad \text{and} \qquad N^{-1}\sum_{i=1}^N A(x_i, \hat{\theta}) \qquad (13.30)$$

converges to A_o = B_o, where

$$A(x_i, \theta_o) \equiv -E[H(y_i, x_i, \theta_o) \mid x_i] \qquad (13.31)$$
Thus, the estimate of Avar(θ̂) can be taken to be any of the three matrices

$$\left[-\sum_{i=1}^N H_i(\hat{\theta})\right]^{-1}, \qquad \left[\sum_{i=1}^N s_i(\hat{\theta})s_i(\hat{\theta})'\right]^{-1}, \qquad \text{or} \qquad \left[\sum_{i=1}^N A(x_i, \hat{\theta})\right]^{-1} \qquad (13.32)$$

and the asymptotic standard errors are the square roots of the diagonal elements of any of the matrices. We discussed each of these estimators in the general M-estimator case in Chapter 12, but a brief review is in order. The first estimator, based on the Hessian of the log likelihood, requires computing second derivatives and is not guaranteed to be positive definite. If the estimator is not positive definite, standard errors of some linear combinations of the parameters will not be well defined.

The second estimator in equation (13.32), based on the outer product of the score, is always positive definite (whenever the inverse exists). This simple estimator was proposed by Berndt, Hall, Hall, and Hausman (1974). Its primary drawback is that it can be poorly behaved in even moderate sample sizes, as we discussed in Section 12.6.2.

If the conditional expectation A(x_i, θ_o) is in closed form (as it is in some leading cases) or can be simulated—as discussed in Porter (1999)—then the estimator based on A(x_i, θ̂) has some attractive features. First, it often depends only on first derivatives of a conditional mean or conditional variance function. Second, it is positive definite when it exists because of the conditional information matrix equality (13.26). Third, this estimator has been found to have significantly better finite sample properties than the outer product of the score estimator in some situations where A(x_i, θ_o) can be obtained in closed form.
Example 13.1 (continued): The Hessian for the probit log likelihood is a mess. Fortunately, E[H_i(θ_o)|x_i] has a fairly simple form. Taking the derivative of equation (13.18) and using the product rule gives

$$H_i(\theta) = -\frac{\{\phi(x_i\theta)\}^2\,x_i'x_i}{\Phi(x_i\theta)[1 - \Phi(x_i\theta)]} + [y_i - \Phi(x_i\theta)]\,L(x_i\theta)$$

where L(x_iθ) is a K × K complicated function of x_iθ that we need not find explicitly. Now, when we evaluate this expression at θ_o and note that E{[y_i − Φ(x_iθ_o)]L(x_iθ_o) | x_i} = [E(y_i|x_i) − Φ(x_iθ_o)]L(x_iθ_o) = 0, we have

$$-E[H_i(\theta_o) \mid x_i] = A_i(\theta_o) = \frac{\{\phi(x_i\theta_o)\}^2\,x_i'x_i}{\Phi(x_i\theta_o)[1 - \Phi(x_i\theta_o)]}$$

Thus, the estimate of Avar(θ̂) in probit analysis is

$$\left[\sum_{i=1}^N \frac{\{\phi(x_i\hat{\theta})\}^2\,x_i'x_i}{\Phi(x_i\hat{\theta})[1 - \Phi(x_i\hat{\theta})]}\right]^{-1} \qquad (13.33)$$

which is always positive definite when the inverse exists. Note that x_i'x_i is a K × K matrix for each i.
Example 13.2 (continued): For the Poisson model with exponential conditional mean, H_i(θ) = −exp(x_iθ)x_i'x_i. In this example, the Hessian does not depend on y_i, so there is no distinction between H_i(θ_o) and E[H_i(θ_o)|x_i]. The positive definite estimate of Avar(θ̂) is simply

$$\left[\sum_{i=1}^N \exp(x_i\hat{\theta})\,x_i'x_i\right]^{-1} \qquad (13.34)$$
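Both (13.33) and (13.34) are one-line computations once the fitted index is available. The helper functions below are a sketch under the assumption that X is an N × K regressor matrix; standard errors are the square roots of the diagonal elements.

```python
# Sketch: expected-Hessian variance estimators, (13.33) for probit and (13.34) for Poisson.
import numpy as np
from scipy.stats import norm

def avar_probit(X, theta_hat):
    xt = X @ theta_hat
    w = norm.pdf(xt)**2 / (norm.cdf(xt) * (1 - norm.cdf(xt)))   # weights in (13.33)
    return np.linalg.inv((X * w[:, None]).T @ X)

def avar_poisson(X, theta_hat):
    w = np.exp(X @ theta_hat)                                    # weights in (13.34)
    return np.linalg.inv((X * w[:, None]).T @ X)

# asymptotic standard errors: np.sqrt(np.diag(avar_probit(X, theta_hat)))
```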
13.6 Hypothesis Testing
Given the asymptotic standard errors, it is easy to form asymptotic t statistics for
testing single hypotheses. These t statistics are asymptotically distributed as standard
normal.
The three tests covered in Chapter 12 are immediately applicable to the MLE case.
Since the information matrix equality holds when the density is correctly specified, we
need only consider the simplest forms of the test statistics. The Wald statistic is given
in equation (12.63), and the conditions sufficient for it to have a limiting chi-square distribution are discussed in Section 12.6.1.

Define the log-likelihood function for the entire sample by L(θ) ≡ Σ_{i=1}^N ℓ_i(θ). Let θ̂ be the unrestricted estimator, and let θ̃ be the estimator with the Q nonredundant constraints imposed. Then, under the regularity conditions discussed in Section 12.6.3, the likelihood ratio (LR) statistic,

$$LR \equiv 2[\mathcal{L}(\hat{\theta}) - \mathcal{L}(\tilde{\theta})] \qquad (13.35)$$

is distributed asymptotically as χ²_Q under H_0. As with the Wald statistic, we cannot use LR as approximately χ²_Q when θ_o is on the boundary of the parameter set. The LR statistic is very easy to compute once the restricted and unrestricted models have been estimated, and the LR statistic is invariant to reparameterizing the conditional density.
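Computing (13.35) requires nothing beyond the two maximized log likelihoods. The helper below is a sketch; loglik, theta_hat, theta_tilde, and Q are whatever the application supplies.

```python
# Sketch: the likelihood ratio statistic (13.35) and its asymptotic chi-square p-value.
from scipy.stats import chi2

def lr_test(loglik, theta_hat, theta_tilde, Q):
    LR = 2.0 * (loglik(theta_hat) - loglik(theta_tilde))
    return LR, chi2.sf(LR, df=Q)
```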
The score or LM test is based on the restricted estimation only. Let s_i(θ̃) be the P × 1 score of ℓ_i(θ) evaluated at the restricted estimates θ̃. That is, we compute the partial derivatives of ℓ_i(θ) with respect to each of the P parameters, but then we evaluate this vector of partials at the restricted estimates. Then, from Section 12.6.2 and the information matrix equality, the statistics

$$\left(\sum_{i=1}^N \tilde{s}_i\right)'\left(-\sum_{i=1}^N \tilde{H}_i\right)^{-1}\left(\sum_{i=1}^N \tilde{s}_i\right), \qquad \left(\sum_{i=1}^N \tilde{s}_i\right)'\left(\sum_{i=1}^N \tilde{A}_i\right)^{-1}\left(\sum_{i=1}^N \tilde{s}_i\right), \qquad \text{and} \qquad \left(\sum_{i=1}^N \tilde{s}_i\right)'\left(\sum_{i=1}^N \tilde{s}_i\tilde{s}_i'\right)^{-1}\left(\sum_{i=1}^N \tilde{s}_i\right) \qquad (13.36)$$

have limiting χ²_Q distributions under H_0. As we know from Section 12.6.2, the first statistic is not invariant to reparameterizations, but the outer product statistic is. In addition, using the conditional information matrix equality, it can be shown that the LM statistic based on Ã_i is invariant to reparameterization. Davidson and MacKinnon (1993, Section 13.6) show invariance in the case of unconditional maximum likelihood. Invariance holds in the more general conditional ML setup, with x_i containing any conditioning variables; see Problem 13.5. We have already used the expected Hessian form of the LM statistic for nonlinear regression in Section 12.6.2. We will use it in several applications in Part IV, including binary response models and Poisson regression models. In these examples, the statistic can be computed conveniently using auxiliary regressions based on weighted residuals.
Because the unconditional information matrix equality holds, we know from Section 12.6.4 that the three classical statistics have the same limiting distribution under local alternatives. Therefore, either small-sample considerations, invariance, or computational issues must be used to choose among the statistics.
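Of the three LM forms in (13.36), the outer-product version requires only the restricted scores. The sketch below assumes scores_tilde is an N × P array whose i-th row is s_i(θ̃); it is an illustration, not a recommendation to prefer this form, given the finite-sample drawbacks noted above.

```python
# Sketch: the outer-product form of the LM statistic in (13.36).
import numpy as np
from scipy.stats import chi2

def lm_outer_product(scores_tilde, Q):
    s_sum = scores_tilde.sum(axis=0)              # sum of restricted scores
    B = scores_tilde.T @ scores_tilde             # sum of s_i s_i'
    LM = s_sum @ np.linalg.solve(B, s_sum)
    return LM, chi2.sf(LM, df=Q)
```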
13.7 Specification Testing
Since MLE generally relies on its distributional assumptions, it is useful to have available a general class of specification tests that are simple to compute. One general approach is to nest the model of interest within a more general model (which may be much harder to estimate) and obtain the score test against the more general alternative. RESET in a linear model and its extension to exponential regression models in Section 12.6.2 are examples of this approach, albeit in a non-maximum-likelihood setting.

In the context of MLE, it makes sense to test moment conditions implied by the conditional density specification. Let w_i = (x_i, y_i) and suppose that, when f(·|x; θ) is correctly specified,

$$H_0:\ E[g(w_i, \theta_o)] = 0 \qquad (13.37)$$
where g(w, θ) is a Q × 1 vector. Any application implies innumerable choices for the function g. Since the MLE θ̂ sets the sum of the scores to zero, g(w, θ) cannot contain elements of s(w, θ). Generally, g should be chosen to test features of a model that are of primary interest, such as first and second conditional moments, or various conditional probabilities.

A test of hypothesis (13.37) is based on how far the sample average of g(w_i, θ̂) is from zero. To derive the asymptotic distribution, note that

$$N^{-1/2}\sum_{i=1}^N g_i(\hat{\theta}) = N^{-1/2}\sum_{i=1}^N [g_i(\hat{\theta}) - s_i(\hat{\theta})\Pi_o]$$

holds trivially because Σ_{i=1}^N s_i(θ̂) = 0, where

$$\Pi_o \equiv \{E[s_i(\theta_o)s_i(\theta_o)']\}^{-1}\{E[s_i(\theta_o)g_i(\theta_o)']\}$$

is the P × Q matrix of population regression coefficients from regressing g_i(θ_o)' on s_i(θ_o)'. Using a mean-value expansion about θ_o and algebra similar to that in Chapter 12, we can write

$$N^{-1/2}\sum_{i=1}^N [g_i(\hat{\theta}) - s_i(\hat{\theta})\Pi_o] = N^{-1/2}\sum_{i=1}^N [g_i(\theta_o) - s_i(\theta_o)\Pi_o] + E[\nabla_\theta g_i(\theta_o) - \nabla_\theta s_i(\theta_o)\Pi_o]\,\sqrt{N}(\hat{\theta} - \theta_o) + o_p(1) \qquad (13.38)$$
The key is that, when the density is correctly specified, the second term on the right-hand side of equation (13.38) is identically zero. Here is the reason: First, equation (13.27) implies that [E∇_θ s_i(θ_o)]{E[s_i(θ_o)s_i(θ_o)']}⁻¹ = −I_P. Second, an extension of the conditional information matrix equality (Newey, 1985; Tauchen, 1985) implies that

$$-E[\nabla_\theta g_i(\theta_o) \mid x_i] = E[g_i(\theta_o)s_i(\theta_o)' \mid x_i] \qquad (13.39)$$

To show equation (13.39), write

$$E_\theta[g_i(\theta) \mid x_i] = \int_{\mathsf{Y}} g(y, x_i, \theta)\,f(y \mid x_i; \theta)\,\nu(dy) = 0 \qquad (13.40)$$

for all θ. Now, if we take the derivative with respect to θ and assume that the integrals and derivative can be interchanged, equation (13.40) implies that

$$\int_{\mathsf{Y}} \nabla_\theta g(y, x_i, \theta)\,f(y \mid x_i; \theta)\,\nu(dy) + \int_{\mathsf{Y}} g(y, x_i, \theta)\,\nabla_\theta f(y \mid x_i; \theta)\,\nu(dy) = 0$$

or E_θ[∇_θ g_i(θ)|x_i] + E_θ[g_i(θ)s_i(θ)'|x_i] = 0, where we use the fact that ∇_θ f(y|x; θ) = s(y, x; θ)' f(y|x; θ). Plugging in θ = θ_o and rearranging gives equation (13.39).
What we have shown is that

$$N^{-1/2}\sum_{i=1}^N [g_i(\hat{\theta}) - s_i(\hat{\theta})\Pi_o] = N^{-1/2}\sum_{i=1}^N [g_i(\theta_o) - s_i(\theta_o)\Pi_o] + o_p(1)$$

which means these standardized partial sums have the same asymptotic distribution. Letting

$$\hat{\Pi} \equiv \left(\sum_{i=1}^N \hat{s}_i\hat{s}_i'\right)^{-1}\left(\sum_{i=1}^N \hat{s}_i\hat{g}_i'\right)$$

it is easily seen that plim Π̂ = Π_o under standard regularity conditions. Therefore, the asymptotic variance of N^{-1/2}Σ_{i=1}^N [g_i(θ̂) − s_i(θ̂)Π_o] = N^{-1/2}Σ_{i=1}^N g_i(θ̂) is consistently estimated by N^{-1}Σ_{i=1}^N (ĝ_i − ŝ_iΠ̂)(ĝ_i − ŝ_iΠ̂)'. When we construct the quadratic form, we get the Newey-Tauchen-White (NTW) statistic,

$$NTW = \left[\sum_{i=1}^N g_i(\hat{\theta})\right]'\left[\sum_{i=1}^N (\hat{g}_i - \hat{s}_i\hat{\Pi})(\hat{g}_i - \hat{s}_i\hat{\Pi})'\right]^{-1}\left[\sum_{i=1}^N g_i(\hat{\theta})\right] \qquad (13.41)$$

This statistic was proposed independently by Newey (1985) and Tauchen (1985), and is an extension of White's (1982a) information matrix (IM) test statistic.

For computational purposes it is useful to note that equation (13.41) is identical to N − SSR_0 = NR²_0 from the regression

$$1 \text{ on } \hat{s}_i', \hat{g}_i', \qquad i = 1, 2, \ldots, N \qquad (13.42)$$

where SSR_0 is the usual sum of squared residuals. Under the null that the density is correctly specified, NTW is distributed asymptotically as χ²_Q, assuming that g(w, θ) contains Q nonredundant moment conditions. Unfortunately, the outer product form of regression (13.42) means that the statistic can have poor finite sample properties. In particular applications—such as nonlinear least squares, binary response analysis, and Poisson regression, to name a few—it is best to use forms of test statistics based on the expected Hessian. We gave the regression-based test for NLS in equation (12.72), and we will see other examples in later chapters. For the information matrix test statistic, Davidson and MacKinnon (1992) have suggested an alternative form of the IM statistic that appears to have better finite sample properties.
Example 13.2 (continued): To test the specification of the conditional mean for Poisson regression, we might take g(w, θ) = exp(xθ)x'[y − exp(xθ)] = exp(xθ)s(w, θ), where the score is given by equation (13.19). If E(y|x) = exp(xθ_o) then E[g(w, θ_o)|x] = exp(xθ_o)E[s(w, θ_o)|x] = 0. To test the Poisson variance assumption, Var(y|x) = E(y|x) = exp(xθ_o), g can be of the form g(w, θ) = a(x, θ){[y − exp(xθ)]² − exp(xθ)}, where a(x, θ) is a Q × 1 vector. If the Poisson assumption is true, then u = y − exp(xθ_o) has a zero conditional mean and E(u²|x) = Var(y|x) = exp(xθ_o). It follows that E[g(w, θ_o)|x] = 0.
Example 13.2 contains examples of what are known as conditional moment tests. As the name suggests, the idea is to form orthogonality conditions based on some key conditional moments, usually the conditional mean or conditional variance, but sometimes conditional probabilities or higher order moments. The tests for nonlinear regression in Chapter 12 can be viewed as conditional moment tests, and we will see several other examples in Part IV. For reasons discussed earlier, we will avoid computing the tests using regression (13.42) whenever possible. See Newey (1985), Tauchen (1985), and Pagan and Vella (1989) for general treatments and applications of conditional moment tests. White's (1982a) information matrix test can often be viewed as a conditional moment test; see Hall (1987) for the linear regression model and White (1994) for a general treatment.
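When the outer-product regression (13.42) is used anyway, the computation is a single least squares fit. The sketch below uses the Poisson variance moment from the example above with a(x, θ) = x'; the array shapes and the helper name are assumptions of this illustration.

```python
# Sketch: the NTW statistic (13.41) computed as N - SSR0 from regression (13.42).
# shat is N x P (scores at theta_hat); ghat is N x Q (moments at theta_hat), e.g.
# ghat[i] = x_i * ((y_i - np.exp(x_i @ theta_hat))**2 - np.exp(x_i @ theta_hat)).
import numpy as np
from scipy.stats import chi2

def ntw_statistic(shat, ghat):
    N = shat.shape[0]
    Z = np.column_stack([shat, ghat])                      # regressors in (13.42)
    ones = np.ones(N)
    resid = ones - Z @ np.linalg.lstsq(Z, ones, rcond=None)[0]
    NTW = N - resid @ resid                                # N - SSR0 = N * R0^2
    return NTW, chi2.sf(NTW, df=ghat.shape[1])
```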

13.8 Partial Likelihood Methods for Panel Data and Cluster Samples
Up to this point we have assumed that the parametric model for the density of y given x is correctly specified. This assumption is fairly general because x can contain any observable variable. The leading case occurs when x contains variables we view as exogenous in a structural model. In other cases, x will contain variables that are endogenous in a structural model, but putting them in the conditioning set and finding the new conditional density makes estimation of the structural parameters easier.

For studying various panel data models, for estimation using cluster samples, and for various other applications, we need to relax the assumption that the full conditional density of y given x is correctly specified. In some examples, such a model is too complicated. Or, for robustness reasons, we do not wish to fully specify the density of y given x.
13.8.1 Setup for Panel Data
For panel data applications we let y denote a T × 1 vector, with generic element y_t. Thus, y_i is a T × 1 random draw vector from the cross section, with tth element y_it. As always, we are thinking of T small relative to the cross section sample size. With a slight notational change we can replace y_it with, say, a G-vector for each t, an extension that allows us to cover general systems of equations with panel data.

For some vector x_t containing any set of observable variables, let D(y_t|x_t) denote the distribution of y_t given x_t. The key assumption is that we have a correctly specified model for the density of y_t given x_t; call it f_t(y_t|x_t; θ), t = 1, 2, ..., T. The vector x_t can contain anything, including conditioning variables z_t, lags of these, and lagged values of y. The vector θ consists of all parameters appearing in f_t for any t; some or all of these may appear in the density for every t, and some may appear only in the density for a single time period.

What distinguishes partial likelihood from maximum likelihood is that we do not assume that

$$\prod_{t=1}^T D(y_{it} \mid x_{it}) \qquad (13.43)$$

is a conditional distribution of the vector y_i given some set of conditioning variables. In other words, even though f_t(y_t|x_t; θ_o) is the correct density for y_it given x_it for each t, the product of these is not (necessarily) the density of y_i given some conditioning variables. Usually, we specify f_t(y_t|x_t; θ) because it is the density of interest for each t.

We define the partial log likelihood for each observation i as

$$\ell_i(\theta) \equiv \sum_{t=1}^T \log f_t(y_{it} \mid x_{it}; \theta) \qquad (13.44)$$

which is the sum of the log likelihoods across t. What makes partial likelihood methods work is that θ_o maximizes the expected value of equation (13.44) provided we have the densities f_t(y_t|x_t; θ) correctly specified.
By the Kullback-Leibler information inequality, θ_o maximizes E[log f_t(y_it|x_it; θ)] over Θ for each t, so θ_o also maximizes the sum of these over t. As usual, identification requires that θ_o be the unique maximizer of the expected value of equation (13.44). It is sufficient that θ_o uniquely maximizes E[log f_t(y_it|x_it; θ)] for each t, but this assumption is not necessary.
The partial maximum likelihood estimator (PMLE) θ̂ solves

$$\max_{\theta \in \Theta} \sum_{i=1}^N \sum_{t=1}^T \log f_t(y_{it} \mid x_{it}; \theta) \qquad (13.45)$$

and this problem is clearly an M-estimator problem (where the asymptotics are with fixed T and N → ∞). Therefore, from Theorem 12.2, the partial MLE is generally consistent provided θ_o is identified.
It is also clear that the partial MLE will be asymptotically normal by Theorem 12.3 in Section 12.3. However, unless

$$p_o(y \mid z) = \prod_{t=1}^T f_t(y_t \mid x_t; \theta_o) \qquad (13.46)$$

for some subvector z of x, we cannot apply the conditional information matrix equality. A more general asymptotic variance estimator of the type covered in Section 12.5.1 is needed, and we provide such estimators in the next two subsections.
It is useful to discuss at a general level why equation (13.46) does not necessarily hold in a panel data setting. First, suppose x_t contains only contemporaneous conditioning variables, z_t; in particular, x_t contains no lagged dependent variables. Then we can always write

$$p_o(y \mid z) = p_1^o(y_1 \mid z)\cdot p_2^o(y_2 \mid y_1, z)\cdots p_t^o(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, z)\cdots p_T^o(y_T \mid y_{T-1}, y_{T-2}, \ldots, y_1, z)$$

where p_t^o(y_t|y_{t−1}, y_{t−2}, ..., y_1, z) is the true conditional density of y_t given y_{t−1}, y_{t−2}, ..., y_1 and z ≡ (z_1, ..., z_T). (For t = 1, p_1^o is the density of y_1 given z.) For equation (13.46) to hold, we should have

$$p_t^o(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, z) = f_t(y_t \mid z_t; \theta_o), \qquad t = 1, \ldots, T$$

which requires that, once z_t is conditioned on, neither past lags of y_t nor elements of z from any other time period—past or future—appear in the conditional density p_t^o(y_t|y_{t−1}, y_{t−2}, ..., y_1, z). Generally, this requirement is very strong, as it requires a combination of strict exogeneity of z_t and the absence of dynamics in p_t^o.
Equation (13.46) is more likely to hold when x_t contains lagged dependent variables. In fact, if x_t contains only lagged values of y_t, then

$$p_o(y) = \prod_{t=1}^T f_t(y_t \mid x_t; \theta_o)$$

holds if f_t(y_t|x_t; θ_o) = p_t^o(y_t|y_{t−1}, y_{t−2}, ..., y_1) for all t (where p_1^o is the unconditional density of y_1), so that all dynamics are captured by f_t. When x_t contains some variables z_t in addition to lagged y_t, equation (13.46) requires that the parametric density captures all of the dynamics—that is, that all lags of y_t and z_t have been properly accounted for in f(y_t|x_t; θ_o)—and strict exogeneity of z_t.

In most treatments of maximum likelihood estimation of dynamic models containing additional exogenous variables, the strict exogeneity assumption is maintained, often implicitly by taking z_t to be nonrandom. In Chapter 7 we saw that strict exogeneity played no role in getting consistent, asymptotically normal estimators in linear panel data models by pooled OLS, and the same is true here. We also allow models where the dynamics have been incompletely specified.
Example 13.3 (Probit with Panel Data): To illustrate the previous discussion, we consider estimation of a panel data binary choice model. The idea is that, for each unit i in the population (individual, firm, and so on) we have a binary outcome, y_it, for each of T time periods. For example, if t represents a year, then y_it might indicate whether a person was arrested for a crime during year t.

Consider the model in latent variable form:

$$y_{it}^* = x_{it}\theta_o + e_{it}, \qquad y_{it} = 1[y_{it}^* > 0], \qquad e_{it} \mid x_{it} \sim \mathrm{Normal}(0, 1) \qquad (13.47)$$

The vector x_it might contain exogenous variables z_it, lags of these, and even lagged y_it (not lagged y_it*). Under the assumptions in model (13.47), we have, for each t, P(y_it = 1|x_it) = Φ(x_itθ_o), and the density of y_it given x_it = x_t is f(y_t|x_t) = [Φ(x_tθ_o)]^{y_t}[1 − Φ(x_tθ_o)]^{1−y_t}.

The partial log likelihood for a cross section observation i is

$$\ell_i(\theta) = \sum_{t=1}^T \{y_{it}\log\Phi(x_{it}\theta) + (1 - y_{it})\log[1 - \Phi(x_{it}\theta)]\} \qquad (13.48)$$

and the partial MLE in this case—which simply maximizes ℓ_i(θ) summed across all i—is the pooled probit estimator. With T fixed and N → ∞, this estimator is consistent and √N-asymptotically normal without any assumptions other than identification and standard regularity conditions.
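Because the partial log likelihood (13.48) is just the cross section probit log likelihood applied to the stacked (i, t) observations, the pooled estimator can be computed with the same optimization code used earlier. The sketch below assumes the data have already been stacked into X_pool (NT × K) and y_pool (length NT).

```python
# Sketch: the pooled probit partial MLE, i.e., the maximizer of (13.45) for Example 13.3.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def pooled_probit(y_pool, X_pool):
    def neg_partial_loglik(theta):
        p = np.clip(norm.cdf(X_pool @ theta), 1e-10, 1 - 1e-10)
        return -np.sum(y_pool * np.log(p) + (1 - y_pool) * np.log(1 - p))
    return minimize(neg_partial_loglik, x0=np.zeros(X_pool.shape[1]), method="BFGS").x
```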
It is very important to know that the pooled probit estimator works without imposing additional assumptions on e_i = (e_i1, ..., e_iT)'. When x_it contains only exogenous variables z_it, it would be standard to assume that

$$e_{it} \text{ is independent of } z_i \equiv (z_{i1}, z_{i2}, \ldots, z_{iT}), \qquad t = 1, \ldots, T \qquad (13.49)$$

This is the natural strict exogeneity assumption (and is much stronger than simply assuming that e_it and z_it are independent for each t). The crime example can illustrate how strict exogeneity might fail. For example, suppose that z_it measures the amount of time the person has spent in prison prior to the current year. An arrest this year (y_it = 1) certainly has an effect on expected future values of z_it, so that assumption (13.49) is almost certainly false. Fortunately, we do not need assumption (13.49) to apply partial likelihood methods.
A second standard assumption is that the e_it, t = 1, 2, ..., T are serially independent. This is especially restrictive in a static model. If we maintain this assumption in addition to assumption (13.49), then equation (13.46) holds (because the y_it are then independent conditional on z_i) and the partial MLE is a conditional MLE.

To relax the assumption that the y_it are conditionally independent, we can allow the e_it to be correlated across t (still assuming that no lagged dependent variables appear). A common assumption is that e_i has a multivariate normal distribution with a general correlation matrix. Under this assumption, we can write down the joint distribution of y_i given z_i, but it is complicated, and estimation is very computationally intensive (for recent discussions, see Keane, 1993, and Hajivassiliou and Ruud, 1994). We will cover a special case, the random effects probit model, in Chapter 15.

A nice feature of the partial MLE is that θ̂ will be consistent and asymptotically normal even if the e_it are arbitrarily serially correlated. This result is entirely analogous to using pooled OLS in linear panel data models when the errors have arbitrary serial correlation.
When x_it contains lagged dependent variables, model (13.47) provides a way of examining dynamic behavior. Or, perhaps y_{i,t−1} is included in x_it as a proxy for unobserved factors, and our focus is on policy variables in z_it. For example, if y_it is a binary indicator of employment, y_{i,t−1} might be included as a control when studying the effect of a job training program (which may be a binary element of z_it) on the employment probability; this method controls for the fact that participation in job training this year might depend on employment last year, and it captures the fact that employment status is persistent. In any case, provided P(y_it = 1|x_it) follows a probit, the pooled probit estimator is consistent and asymptotically normal. The dynamics may or may not be correctly specified (more on this topic later), and the z_it need not be strictly exogenous (so that whether someone participates in job training in year t can depend on the past employment history).
13.8.2 Asymptotic Inference
The most important practical difference between conditional MLE and partial MLE is in the computation of asymptotic standard errors and test statistics. In many cases, including the pooled probit estimator, the pooled Poisson estimator (see Problem 13.6), and many other pooled procedures, standard econometrics packages can be used to compute the partial MLEs. However, except under certain assumptions, the usual standard errors and test statistics reported from a pooled analysis are not valid. This situation is entirely analogous to the linear model case in Section 7.8 when the errors are serially correlated.

Estimation of the asymptotic variance of the partial MLE is not difficult. In fact, we can combine the M-estimation results from Section 12.5.1 and the results of Section 13.5 to obtain valid estimators.
From Theorem 12.3, we have Avar √N(θ̂ − θ_o) = A_o⁻¹B_oA_o⁻¹, where

$$A_o = -E[\nabla_\theta^2 \ell_i(\theta_o)] = -\sum_{t=1}^T E[\nabla_\theta^2 \ell_{it}(\theta_o)] = \sum_{t=1}^T E[A_{it}(\theta_o)]$$

$$B_o = E[s_i(\theta_o)s_i(\theta_o)'] = E\left\{\left[\sum_{t=1}^T s_{it}(\theta_o)\right]\left[\sum_{t=1}^T s_{it}(\theta_o)\right]'\right\}$$

$$A_{it}(\theta_o) = -E[\nabla_\theta^2 \ell_{it}(\theta_o) \mid x_{it}], \qquad s_{it}(\theta) = \nabla_\theta \ell_{it}(\theta)'$$
There are several important features of these formulas. First, the matrix A_o is just the sum across t of minus the expected Hessian. Second, the matrix B_o generally depends on the correlation between the scores at different time periods: E[s_it(θ_o)s_ir(θ_o)'], t ≠ r. Third, for each t, the conditional information matrix equality holds:

$$A_{it}(\theta_o) = E[s_{it}(\theta_o)s_{it}(\theta_o)' \mid x_{it}]$$

However, in general, −E[H_i(θ_o)|x_i] ≠ E[s_i(θ_o)s_i(θ_o)'|x_i] and, more importantly, B_o ≠ A_o. Thus, to perform inference in the context of partial MLE, we generally need separate estimates of A_o and B_o. Given the structure of the partial MLE, these are easy to obtain. Three possibilities for A_o are

$$N^{-1}\sum_{i=1}^N\sum_{t=1}^T -\nabla_\theta^2 \ell_{it}(\hat{\theta}), \qquad N^{-1}\sum_{i=1}^N\sum_{t=1}^T A_{it}(\hat{\theta}), \qquad \text{and} \qquad N^{-1}\sum_{i=1}^N\sum_{t=1}^T s_{it}(\hat{\theta})s_{it}(\hat{\theta})' \qquad (13.50)$$

The validity of the second of these follows from a standard iterated expectations argument, and the last of these follows from the conditional information matrix equality for each t. In most cases, the second estimator is preferred when it is easy to compute.
Since B_o depends on E[s_it(θ_o)s_it(θ_o)'] as well as cross product terms, there are also at least three estimators available for B_o. The simplest is

$$N^{-1}\sum_{i=1}^N \hat{s}_i\hat{s}_i' = N^{-1}\sum_{i=1}^N\sum_{t=1}^T \hat{s}_{it}\hat{s}_{it}' + N^{-1}\sum_{i=1}^N\sum_{t=1}^T\sum_{r \ne t} \hat{s}_{ir}\hat{s}_{it}' \qquad (13.51)$$

where the second term on the right-hand side accounts for possible serial correlation in the score. The first term on the right-hand side of equation (13.51) can be replaced by one of the other two estimators in equation (13.50). The asymptotic variance of θ̂ is estimated, as usual, by $\hat{A}^{-1}\hat{B}\hat{A}^{-1}/N$ for the chosen estimators $\hat{A}$ and $\hat{B}$. The asymptotic standard errors come directly from this matrix, and Wald tests for linear and nonlinear hypotheses can be obtained directly. The robust score statistic discussed in Section 12.6.2 can also be used. When B_o ≠ A_o, the likelihood ratio statistic computed after pooled estimation is not valid.
Because the CIME holds for each t, B_o = A_o when the scores evaluated at θ_o are serially uncorrelated, that is, when

$$E[s_{it}(\theta_o)s_{ir}(\theta_o)'] = 0, \qquad t \ne r \qquad (13.52)$$

When the score is serially uncorrelated, inference is very easy: the usual MLE statistics computed from the pooled estimation, including likelihood ratio statistics, are asymptotically valid. Effectively, we can ignore the fact that a time dimension is present. The estimator of Avar(θ̂) is just $\hat{A}^{-1}/N$, where $\hat{A}$ is one of the matrices in equation (13.50).
Example 13.3 (continued): For the pooled probit example, a simple, general estimator of the asymptotic variance is

$$\left[\sum_{i=1}^N\sum_{t=1}^T A_{it}(\hat{\theta})\right]^{-1}\left[\sum_{i=1}^N s_i(\hat{\theta})s_i(\hat{\theta})'\right]\left[\sum_{i=1}^N\sum_{t=1}^T A_{it}(\hat{\theta})\right]^{-1} \qquad (13.53)$$

where

$$A_{it}(\hat{\theta}) = \frac{\{\phi(x_{it}\hat{\theta})\}^2\,x_{it}'x_{it}}{\Phi(x_{it}\hat{\theta})[1 - \Phi(x_{it}\hat{\theta})]}$$

and

$$s_i(\theta) = \sum_{t=1}^T s_{it}(\theta) = \sum_{t=1}^T \frac{\phi(x_{it}\theta)x_{it}'[y_{it} - \Phi(x_{it}\theta)]}{\Phi(x_{it}\theta)[1 - \Phi(x_{it}\theta)]}$$

The estimator (13.53) contains cross product terms of the form s_it(θ̂)s_ir(θ̂)', t ≠ r, and so it is fully robust. If the score is serially uncorrelated, then the usual probit standard errors and test statistics from the pooled estimation are valid. We will discuss a sufficient condition for the scores to be serially uncorrelated in the next subsection.
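The estimator (13.53) is straightforward to compute by looping over cross section units. The sketch below assumes a balanced panel stored as X with shape (N, T, K) and y with shape (N, T); with unbalanced panels the inner computation would simply run over the observed periods for each i.

```python
# Sketch: the fully robust variance estimator (13.53) for pooled probit.
import numpy as np
from scipy.stats import norm

def robust_avar_pooled_probit(y, X, theta_hat):
    N, T, K = X.shape
    A = np.zeros((K, K))
    B = np.zeros((K, K))
    for i in range(N):
        xt = X[i] @ theta_hat
        pdf, cdf = norm.pdf(xt), norm.cdf(xt)
        w = pdf**2 / (cdf * (1 - cdf))
        A += (X[i] * w[:, None]).T @ X[i]                        # sum_t A_it(theta_hat)
        s_i = (X[i] * ((pdf * (y[i] - cdf)) / (cdf * (1 - cdf)))[:, None]).sum(axis=0)
        B += np.outer(s_i, s_i)                                  # s_i s_i' keeps the t != r terms
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv                                     # estimator (13.53)
```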
13.8.3 Inference with Dynamically Complete Models
There is a very important case where condition (13.52) holds, in which case all statistics obtained by treating ℓ_i(θ) as a standard log likelihood are valid. For any definition of x_t, we say that {f_t(y_t|x_t; θ_o): t = 1, ..., T} is a dynamically complete conditional density if

$$f_t(y_t \mid x_t; \theta_o) = p_t^o(y_t \mid x_t, y_{t-1}, x_{t-1}, y_{t-2}, \ldots, y_1, x_1), \qquad t = 1, \ldots, T \qquad (13.54)$$

In other words, f_t(y_t|x_t; θ_o) must be the conditional density of y_t given x_t and the entire past of (x_t, y_t).
When x_t = z_t for contemporaneous exogenous variables, equation (13.54) is very strong: it means that, once z_t is controlled for, no past values of z_t or y_t appear in the conditional density p_t^o(y_t|z_t, y_{t−1}, z_{t−1}, y_{t−2}, ..., y_1, z_1). When x_t contains z_t and some lags—similar to a finite distributed lag model—then equation (13.54) is perhaps more reasonable, but it still assumes that lagged y_t has no effect on y_t once current and lagged z_t are controlled for. That assumption (13.54) can be false is analogous to the omnipresence of serial correlation in static and finite distributed lag regression models. One important feature of dynamic completeness is that it does not require strict exogeneity of z_t [since only current and lagged x_t appear in equation (13.54)].

Dynamic completeness is more likely to hold when x_t contains lagged dependent variables. The issue, then, is whether enough lags of y_t (and z_t) have been included in x_t to fully capture the dynamics. For example, if x_t ≡ (z_t, y_{t−1}), then equation (13.54) means that, along with z_t, only one lag of y_t is needed to capture all of the dynamics.

Showing that condition (13.52) holds under dynamic completeness is easy. First,
for each t,E½s
it
ðy
o
Þjx
it
¼0, since f
t
ðy
t
jx
t
; y
o
Þ is a correctly specified conditional
density. But then, under assumption (13.54),
E½s
it
ðy
o
Þjx
it
; y
i; tÀ1
; ; y
i1
; x
i1
¼0 ð13:55Þ

Now consider the expected value in condition (13.52) for r < t.Sinces
ir
ðy
o
Þ is a
function of ðx
ir
; y
ir
Þ, which is in the conditioning set (13.55), the usual iterated
expectations argument shows that condition (13.52) holds. It follows that, under dy-
namic completeness, the usual maximum likelihood statistics from the pooled esti-
mation are asymptotically valid. This result is completely analogous to pooled OLS
Chapter 13408
under dynamic completeness of the conditional mean and homoskedasticity (see
Section 7.8).
If the panel data probit model is dynamically complete, any software package that does standard probit can be used to obtain valid standard errors and test statistics, provided the response probability satisfies P(y_it = 1|x_it) = P(y_it = 1|x_it, y_{i,t−1}, x_{i,t−1}, ...). Without dynamic completeness the standard errors and test statistics generally need to be adjusted for serial dependence.
Since dynamic completeness affords nontrivial simplifications, does this fact mean that we should always include lagged values of exogenous and dependent variables until equation (13.54) appears to be satisfied? Not necessarily. Static models are sometimes desirable even if they neglect dynamics. For example, suppose that we have panel data on individuals in an occupation where pay is determined partly by cumulative productivity. (Professional athletes and college professors are two examples.) An equation relating salary to the productivity measures, and possibly demographic variables, is appropriate. Nothing implies that the equation would be dynamically complete; in fact, past salary could help predict current salary, even after controlling for observed productivity. But it does not make much sense to include past salary in the regression equation. As we know from Chapter 10, a reasonable approach is to include an unobserved effect in the equation, and this does not lead to a model with complete dynamics. See also Section 13.9.
We may wish to test the null hypothesis that the density is dynamically complete. White (1994) shows how to test whether the score is serially correlated in a pure time series setting. A similar approach can be used with panel data. A general test for dynamic misspecification can be based on the limiting distribution of (the vectorization of)

$$N^{-1/2}\sum_{i=1}^N\sum_{t=2}^T \hat{s}_{it}\hat{s}_{i,t-1}'$$

where the scores are evaluated at the partial MLE. Rather than derive a general statistic here, we will study tests of dynamic completeness in particular applications later (see particularly Chapters 15, 16, and 19).
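Although the formal test statistics are deferred to later chapters, the raw ingredient is easy to assemble. The sketch below only forms the scaled sum of adjacent-period score cross products displayed above; turning it into a test statistic would require an estimate of its asymptotic variance, which is not derived here.

```python
# Sketch: the matrix N^{-1/2} sum_i sum_{t>=2} s_it s_{i,t-1}', with scores of shape (N, T, P).
import numpy as np

def adjacent_score_crossproducts(scores):
    N, T, P = scores.shape
    C = np.zeros((P, P))
    for t in range(1, T):
        C += scores[:, t, :].T @ scores[:, t - 1, :]
    return C / np.sqrt(N)
```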
13.8.4 Inference under Cluster Sampling
Partial MLE methods are also useful when using cluster samples. Suppose that, for each group or cluster g, f(y_g|x_g; θ) is a correctly specified conditional density of y_g given x_g. Here, i indexes the cluster, and as before we assume a large number of clusters N and relatively small group sizes, G_i. The primary issue is that the y_ig might