
Chapter 6
NON-LINEAR REGRESSION MODELS
TAKESHI AMEMIYA*
Stanford University
Contents

1. Introduction
2. Single equation - i.i.d. case
   2.1. Model
   2.2. Asymptotic properties
   2.3. Computation
   2.4. Tests of hypotheses
   2.5. Confidence regions
3. Single equation - non-i.i.d. case
   3.1. Autocorrelated errors
   3.2. Heteroscedastic errors
4. Multivariate models
5. Simultaneous equations models
   5.1. Non-linear two-stage least squares estimator
   5.2. Other single equation estimators
   5.3. Non-linear simultaneous equations
   5.4. Non-linear three-stage least squares estimator
   5.5. Non-linear full information maximum likelihood estimator
References
*This work was supported by National Science Foundation Grant SES79-12965 at the Institute for
Mathematical Studies in the Social Sciences, Stanford University. The author is indebted to the
following people for valuable comments: R. C. Fair, A. R. Gallant, Z. Griliches, M. D. Intriligator,
T. E. MaCurdy, J. L. Powell, R. E. Quandt, N. E. Savin, and H. White.
Handbook of Econometrics, Volume I, Edited by Z. Griliches and M.D. Intriligator
© North-Holland Publishing Company, 1983
1. Introduction
This is a survey of non-linear regression models, with an emphasis on the theory
of estimation and hypothesis testing rather than computation and applications,
although there will be some discussion of the last two topics. For a general
discussion of computation the reader is referred to Chapter 12 of this Handbook
by Quandt. My aim is to present the gist of major results; therefore, I will
sometimes omit proofs and less significant assumptions. For those, the reader
must consult the original sources.
The advent of advanced computer technology has made it possible for the

econometrician to estimate an increasing number of non-linear regression models
in recent years. Non-linearity arises in many diverse ways in econometric applica-
tions. Perhaps the simplest and best known case of non-linearity in econometrics
is that which arises as the observed variables in a linear regression model are
transformed to take account of the first-order autoregression of the error terms.
Another well-known case is the distributed-lag model in which the coefficients on
the lagged exogenous variables are specified to decrease with lags in a certain
non-linear fashion, such as geometrically declining coefficients. In both of these
cases, non-linearity appears only in parameters but not in variables.
More general non-linear models are used in the estimation of production
functions and demand functions. Even a simple Cobb-Douglas production
function cannot be transformed into linearity if the error term is added rather
than multiplied [see Bodkin and Klein (1967)]. CES [Arrow, Chenery, Minhas and
Solow (1961)] and VES [Revankar (1971)] production functions are more highly
non-linear. In the estimation of expenditure functions, a number of highly
non-linear functions have been proposed (some of these are used in the supply
side as well)-Translog [Christensen, Jorgenson and Lau (1975)], Generalized
Leontief [Diewert (1974)], S-Branch [Brown and Heien (1972)], and Quadratic
[Howe, Pollak and Wales (1979)], to name a few. Some of these and other papers
with applications will be mentioned in various relevant parts of this chapter.
The non-linear regression models I will consider in this chapter can be written
in their most general form as

$$f_i(y_t, x_t, \alpha_i) = u_{it}, \qquad i = 1,2,\ldots,n; \quad t = 1,2,\ldots,T, \qquad (1.1)$$

where $y_t$, $x_t$, and $\alpha_i$ are vectors of endogenous variables, exogenous variables, and
parameters, respectively, and $u_{it}$ are unobservable error terms with zero mean.
Eqs. (1.1), with all generality, constitute the non-linear simultaneous equations
model, which is analyzed in Section 5. I devote most of the discussion in the
chapter to this section because this area has been only recently developed and
therefore there is little account of it in general references.
Many simpler models arising as special cases of (1.1) are considered in other
sections. In Section 2 I take up the simplest of these, which I will call the standard
non-linear regression model, defined by
$$y_t = f(x_t, \beta_0) + u_t, \qquad t = 1,2,\ldots,T, \qquad (1.2)$$
where $\{u_t\}$ are scalar i.i.d. (independent and identically distributed) random
variables with zero mean and constant variance. Since this is the model which has
been most extensively analyzed in the literature, I will also devote a lot of space to
the analysis of this model. Section 3 considers the non-i.i.d. case of the above
model, and Section 4 treats its multivariate generalization.
Now, I should mention what will not be discussed. I will not discuss the
maximum likelihood estimation of non-linear models unless the model is written
in the regression form (1.1). Many non-linear models are discussed elsewhere in
this Handbook; see, for example, the chapters by Dhrymes, McFadden, and
Maddala. The reader is advised to recognize a close connection between the
non-linear least squares estimator analyzed in this chapter and the maximum
likelihood estimator studied in the other chapters; essentially the same techniques
are used to derive the asymptotic properties of the two estimators and analogous
computer algorithms can be used to compute both.
I will not discuss splines and other methods of function approximation, since
space is limited and these techniques have not been as frequently used in
econometrics as they have in engineering applications. A good introduction to the
econometric applications of spline functions can be found in Poirier (1976).
Above I mentioned the linear model with the transformation to reduce the
autocorrelation of the error terms and the distributed-lag model. I will not
specifically study these models because they are very large topics by themselves
and are best dealt with separately. (See the chapter by Hendry, Pagan, and Sargan
in this Handbook.) There are a few other important topics which, although
non-linearity is involved, would best be studied within another context, e.g.
non-linear error-in-variable models and non-linear time-series models. Regarding
these two topics, I recommend Wolter and Fuller (1978) and Priestley (1978).
Finally, I conclude this introduction by citing general references on non-linear
regression models. Malinvaud (1970b) devotes one long chapter to non-linear
regression models in which he discusses the asymptotic properties of the non-
linear least squares estimator in a multivariate model. There are three references
which are especially good in the discussion of computation algorithms, confidence
regions, and worked out examples: Draper and Smith (1966), Bard (1974), and
Judge, Griffiths, Hill and Lee (1980). Several chapters in Goldfeld and Quandt
(1972) are devoted to the discussion of non-linear regression models. Their
Chapter 1 presents an excellent review of optimization techniques which can be
used in the computation of both the non-linear least squares and the maximum
likelihood estimators. Chapter 2 discusses the construction of confidence regions
in the non-linear regression model and the asymptotic properties of the maximum
likelihood estimator (but not of the non-linear least squares estimator). Chapter 5
considers the Cobb-Douglas production function with both multiplicative and
additive errors, and Chapter 8 considers non-linear (only in variables) simulta-
neous equations models. There are two noteworthy survey articles: Gallant
(1975a), with emphasis on testing and computation, and Bunke, Henschke,
Strüby and Wisotzki (1977), which is more theoretically oriented. None of the
above-mentioned references, however, discusses the estimation of simultaneous
equations models non-linear both in variables and parameters.
2. Single equation - i.i.d. case

2.1. Model

In this section I consider the standard non-linear regression model
$$y_t = f(x_t, \beta_0) + u_t, \qquad t = 1,2,\ldots,T, \qquad (2.1)$$

where $y_t$ is a scalar endogenous variable, $x_t$ is a vector of exogenous variables, $\beta_0$
is a $K$-vector of unknown parameters, and $\{u_t\}$ are unobservable scalar i.i.d.
random variables with $Eu_t = 0$ and $Vu_t = \sigma_0^2$, another unknown parameter. Note
that, unlike the linear model where $f(x_t, \beta_0) = x_t'\beta_0$, the dimensions of the vectors
$x_t$ and $\beta_0$ are not necessarily the same. We will assume that $f$ is twice continuously
differentiable. As for the other assumptions on $f$, I will mention them as they are
required for obtaining various results in the course of the subsequent discussion.
Econometric examples of (2.1) include the Cobb-Douglas production function
with an additive error,
$$Q_t = \beta_1 K_t^{\beta_2} L_t^{\beta_3} + u_t, \qquad (2.2)$$

and the CES (constant elasticity of substitution) production function:

$$Q_t = \beta_1\bigl[\beta_2 K_t^{-\beta_3} + (1-\beta_2)L_t^{-\beta_3}\bigr]^{-\beta_4/\beta_3} + u_t. \qquad (2.3)$$

Sometimes I will write (2.1) in vector notation as

$$y = f(\beta_0) + u, \qquad (2.4)$$

where $y$, $f(\beta_0)$, and $u$ are $T$-vectors whose $t$-th element is equal to $y_t$, $f(x_t, \beta_0)$, and
$u_t$, respectively. I will also use the symbol $f_t(\beta_0)$ to denote $f(x_t, \beta_0)$.

The non-linear least squares (NLLS) estimator, denoted $\hat\beta$, is defined as the
value of $\beta$ that minimizes the sum of squared residuals

$$S_T(\beta) = \sum_{t=1}^T \bigl[y_t - f(x_t, \beta)\bigr]^2. \qquad (2.5)$$

It is important to distinguish between the $\beta$ that appears in (2.5), which is the
argument of the function $f(x_t, \cdot)$, and $\beta_0$, which is a fixed true value. In what
follows, I will discuss the properties of $\hat\beta$, the method of computation, and
statistical inference based on $\hat\beta$.
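As a concrete illustration, here is a minimal numerical sketch of ours (not from the original text; the data, seed, and parameter values are all hypothetical) that sets up the model (2.1) in the Cobb-Douglas form (2.2) and evaluates the minimand (2.5):

```python
import numpy as np

# Hypothetical data for the Cobb-Douglas model (2.2): Q = b1*K^b2*L^b3 + u.
rng = np.random.default_rng(0)
T = 200
K_t = rng.uniform(1.0, 10.0, T)            # capital input
L_t = rng.uniform(1.0, 10.0, T)            # labor input
beta0 = np.array([2.0, 0.3, 0.6])          # "true" parameter vector (made up)

def f(beta):
    return beta[0] * K_t**beta[1] * L_t**beta[2]

y = f(beta0) + rng.normal(0.0, 0.5, T)     # additive i.i.d. errors u_t

def S_T(beta):
    # Sum of squared residuals (2.5): the NLLS estimator minimizes this.
    return np.sum((y - f(beta))**2)

print(S_T(beta0), S_T(beta0 + 0.05))       # S_T is smaller near beta0
```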
2.2. Asymptotic properties
2.2.1. Consistency
The consistency of the NLLS estimator is rigorously proved in Jennrich (1969)
and Malinvaud (1970a). The former proves strong consistency ($\hat\beta$ converging to
$\beta_0$ almost surely) and the latter weak consistency ($\hat\beta$ converging to $\beta_0$ in
probability). Weak consistency is more common in the econometric literature and
is often called by the simpler name of consistency. The main reason why strong
consistency, rather than weak consistency, is proved is that the former implies the
latter and is often easier to prove. I will mainly follow Jennrich's proof but
translate his result into weak consistency.
The consistency of $\hat\beta$ is proved by proving that plim $T^{-1}S_T(\beta)$ is minimized at
the true value $\beta_0$. Strong consistency is proved by showing the same holds for the
almost sure limit of $T^{-1}S_T(\beta)$ instead. This method of proof can be used to prove
the consistency of any other type of estimator which is obtained by either
minimizing or maximizing a random function over the parameter space. For
example, I used the same method to prove the strong consistency of the maximum
likelihood estimator (MLE) of the Tobit model in Amemiya (1973b).

This method of proof is intuitively appealing because it seems obvious that if
$T^{-1}S_T(\beta)$ is close to plim $T^{-1}S_T(\beta)$ and if the latter is minimized at $\beta_0$, then $\hat\beta$,
which minimizes the former, should be close to $\beta_0$. However, we need the
following three assumptions in order for the proof to work:

The parameter space $B$ is compact (closed and bounded) and $\beta_0$ is its interior point. $\qquad$ (2.6)

$S_T(\beta)$ is continuous in $\beta$. $\qquad$ (2.7)

plim $T^{-1}S_T(\beta)$ exists, is non-stochastic, and its convergence is uniform in $\beta$. $\qquad$ (2.8)
(2.8)
The meaning of (2.8) is as follows. Define $S(\beta) = \text{plim}\,T^{-1}S_T(\beta)$. Then, given
$\varepsilon, \delta > 0$, there exists $T_0$, independent of $\beta$, such that for all $T \ge T_0$ and for all
$\beta$, $P\bigl[|T^{-1}S_T(\beta) - S(\beta)| > \varepsilon\bigr] < \delta$.
It is easy to construct examples in which the violation of any single assumption
above leads to the inconsistency of $\hat\beta$. [See Amemiya (1980).]
I will now give a sketch of the proof of the consistency and indicate what
additional assumptions are needed as I go along. From (2.1) and (2.5), we get
$$\frac{1}{T}S_T(\beta) = \frac{1}{T}\sum u_t^2 + \frac{2}{T}\sum u_t\bigl[f_t(\beta_0) - f_t(\beta)\bigr] + \frac{1}{T}\sum\bigl[f_t(\beta_0) - f_t(\beta)\bigr]^2 \equiv A_1 + A_2 + A_3, \qquad (2.9)$$

where $\sum$ means $\sum_{t=1}^T$ unless otherwise noted. First, plim $A_1 = \sigma_0^2$ by a law of large
numbers [see, for example, Kolmogorov Theorem 2, p. 115, in Rao (1973)].
Secondly, for fixed $\beta_0$ and $\beta$, plim $A_2 = 0$ follows from the convergence of
$T^{-1}\sum[f_t(\beta_0) - f_t(\beta)]^2$ by Chebyshev's inequality:

$$P\bigl[|A_2| > \varepsilon\bigr] \le \frac{4\sigma_0^2}{\varepsilon^2}\cdot\frac{1}{T}\cdot\frac{1}{T}\sum\bigl[f_t(\beta_0) - f_t(\beta)\bigr]^2. \qquad (2.10)$$

Since the uniform convergence of $A_2$ follows from the uniform convergence of the
right-hand side of (2.10), it suffices to assume

$$\frac{1}{T}\sum\bigl[f_t(\beta_1) - f_t(\beta_2)\bigr]^2 \text{ converges uniformly in } \beta_1, \beta_2 \in B. \qquad (2.11)$$
Having thus disposed of $A_1$ and $A_2$, we need only to assume that lim $A_3$ is
uniquely minimized at $\beta_0$; namely,

$$\lim\frac{1}{T}\sum\bigl[f_t(\beta_0) - f_t(\beta)\bigr]^2 \neq 0 \quad \text{if } \beta \neq \beta_0. \qquad (2.12)$$

To sum up, the non-linear least squares estimator $\hat\beta$ of the model (2.1) is
consistent if (2.6), (2.11), and (2.12) are satisfied. I will comment on the significance
and the plausibility of these three assumptions.
The assumption of a compact parameter space (2.6) is convenient but can be
rather easily removed. The trick is to dispose of the region outside a certain
compact subset of the parameter space by assuming that in that region
$T^{-1}\sum\bigl[f_t(\beta_0) - f_t(\beta)\bigr]^2$ is sufficiently large. This is done by Malinvaud (1970a).
An essentially similar argument appears also in Wald (1949) in the proof of the
consistency of the maximum likelihood estimator.
It would be nice if assumption (2.11) could be paraphrased into separate
assumptions on the functional form of $f$ and on the properties of the exogenous
sequence $\{x_t\}$, which are easily verifiable. Several authors have attempted to
obtain such assumptions. Jennrich (1969) observes that if $f$ is bounded and
continuous, (2.11) is implied by the assumption that the empirical distribution
function of $\{x_t\}$ converges to a distribution function. He also notes that another
way to satisfy (2.11) is to assume that $\{x_t\}$ are i.i.d. with a distribution function
$F$, and $f$ is bounded uniformly in $\beta$ by a function which is square integrable with
respect to $F$. Malinvaud (1970a) generalizes the first idea of Jennrich by introducing
the concept of weak convergence of measure, whereas Gallant (1977) generalizes
the second idea of Jennrich by considering the notion of Cesaro summability.
However, it seems to me that the best procedure is to leave (2.11) as it is and
try to verify it directly.
The assumption (2.12) is comparable to the familiar assumption in the linear
model that $\lim T^{-1}X'X$ exists and is positive definite. It can be easily proved that
in the linear model the above assumption is not necessary for the consistency of
least squares and it is sufficient to assume $(X'X)^{-1} \to 0$. This observation
suggests that assumption (2.12) can be relaxed in an analogous way. One such
result can be found in Wu (1981).
2.2.2. Asymptotic normality
The asymptotic normality of the NLLS estimator $\hat\beta$ is rigorously proved in
Jennrich (1969). Again, I will give a sketch of the proof, explaining the required
assumptions as I go along, rather than reproducing Jennrich's result in a
theorem-proof format.

The asymptotic normality of the NLLS estimator, as in the case of the MLE,
can be derived from the following Taylor expansion:

$$\left.\frac{\partial S_T}{\partial\beta}\right|_{\hat\beta} = \left.\frac{\partial S_T}{\partial\beta}\right|_{\beta_0} + \left.\frac{\partial^2 S_T}{\partial\beta\,\partial\beta'}\right|_{\beta^*}(\hat\beta - \beta_0), \qquad (2.13)$$

where $\partial^2 S_T/\partial\beta\,\partial\beta'$ is a $K \times K$ matrix of second-order derivatives and $\beta^*$ lies
between $\hat\beta$ and $\beta_0$. To be able to write down (2.13), we must assume that $f_t$ is
twice continuously differentiable with respect to $\beta$. Since the left-hand side of
(2.13) is zero (because $\hat\beta$ minimizes $S_T$), from (2.13) we obtain:

$$\sqrt{T}(\hat\beta - \beta_0) = -\left[\frac{1}{T}\left.\frac{\partial^2 S_T}{\partial\beta\,\partial\beta'}\right|_{\beta^*}\right]^{-1}\frac{1}{\sqrt{T}}\left.\frac{\partial S_T}{\partial\beta}\right|_{\beta_0}. \qquad (2.14)$$

Thus, we are done if we can show that (i) the limit distribution of
$T^{-1/2}(\partial S_T/\partial\beta)_{\beta_0}$ is normal and (ii) $T^{-1}(\partial^2 S_T/\partial\beta\,\partial\beta')_{\beta^*}$ converges in probability
to a non-singular matrix. We will consider these two statements in turn.
The proof of statement (i) is straightforward. Differentiating (2.5) with respect
to $\beta$, we obtain:

$$\frac{\partial S_T}{\partial\beta} = -2\sum_t\bigl[y_t - f_t(\beta)\bigr]\frac{\partial f_t}{\partial\beta}. \qquad (2.15)$$

Evaluating (2.15) at $\beta_0$ and dividing it by $\sqrt{T}$, we have:

$$\frac{1}{\sqrt{T}}\left.\frac{\partial S_T}{\partial\beta}\right|_{\beta_0} = -\frac{2}{\sqrt{T}}\sum_t u_t\left.\frac{\partial f_t}{\partial\beta}\right|_{\beta_0}. \qquad (2.16)$$

But it is easy to find the conditions for the asymptotic normality of (2.16) because
the summand in the right-hand side is a weighted average of an i.i.d. sequence - the
kind encountered in the least squares estimation of a linear model. Therefore, if
we assume

$$\lim\frac{1}{T}\sum_t\left.\frac{\partial f_t}{\partial\beta}\right|_{\beta_0}\left.\frac{\partial f_t}{\partial\beta'}\right|_{\beta_0} \;(\equiv C) \quad \text{exists and is non-singular,} \qquad (2.17)$$

then

$$\frac{1}{\sqrt{T}}\left.\frac{\partial S_T}{\partial\beta}\right|_{\beta_0} \to N(0, 4\sigma_0^2 C). \qquad (2.18)$$

This result can be straightforwardly obtained from the Lindeberg-Feller central
limit theorem [Rao (1973, p. 128)] or, more directly, from Anderson (1971,
Theorem 2.6.1, p. 23).
Proving (ii) poses a more difficult problem. Write an element of the matrix
$T^{-1}(\partial^2 S_T/\partial\beta\,\partial\beta')_{\beta^*}$ as $h_T(\beta^*)$. One might think that plim $h_T(\beta^*) = $ plim $h_T(\beta_0)$
follows from the well-known theorem which says that the probability limit of a
continuous function is the function of the probability limit, but the theorem does
not apply because $h_T$ is in general a function of an increasing number of random
variables $y_1, y_2, \ldots, y_T$. But, by a slight modification of Lemma 4, p. 1003, of
Amemiya (1973b), we can show that if $h_T(\beta)$ converges almost surely to a certain
non-stochastic function $h(\beta)$ uniformly in $\beta$, then plim $h_T(\beta^*) = h(\text{plim}\,\beta^*) =
h(\beta_0)$. Differentiating (2.15) again with respect to $\beta$ and dividing by $T$ yields

$$\frac{1}{T}\frac{\partial^2 S_T}{\partial\beta\,\partial\beta'} = \frac{2}{T}\sum_t\frac{\partial f_t}{\partial\beta}\frac{\partial f_t}{\partial\beta'} - \frac{2}{T}\sum_t u_t\frac{\partial^2 f_t}{\partial\beta\,\partial\beta'} - \frac{2}{T}\sum_t\bigl[f_t(\beta_0) - f_t(\beta)\bigr]\frac{\partial^2 f_t}{\partial\beta\,\partial\beta'}. \qquad (2.19)$$
We must show that each of the three terms in the right-hand side of (2.19)
converges almost surely to a non-stochastic function uniformly in $\beta$. For this
purpose the following assumptions will suffice:

$$\frac{1}{T}\sum_t\frac{\partial f_t}{\partial\beta}\frac{\partial f_t}{\partial\beta'} \text{ converges uniformly in } \beta \text{ in an open neighborhood of } \beta_0, \qquad (2.20)$$

and

$$\frac{1}{T}\sum_t\left(\frac{\partial^2 f_t}{\partial\beta_i\,\partial\beta_j}\right)^2 \text{ converges uniformly in } \beta \text{ in an open neighborhood of } \beta_0. \qquad (2.21)$$
Then, we obtain:

$$\text{plim}\,\frac{1}{T}\left.\frac{\partial^2 S_T}{\partial\beta\,\partial\beta'}\right|_{\beta^*} = 2C. \qquad (2.22)$$

Finally, from (2.14), (2.18), and (2.22) we obtain:

$$\sqrt{T}(\hat\beta - \beta_0) \to N(0, \sigma_0^2 C^{-1}). \qquad (2.23)$$
The assumptions we needed in proving (2.23) were (2.17), (2.20), and (2.21), as
well as the assumption that $\hat\beta$ is consistent.

It is worth pointing out that in the process of proving (2.23) we have in effect
shown that we have, asymptotically,

$$\hat\beta - \beta_0 \cong (G'G)^{-1}G'u, \qquad (2.24)$$

where I have put $G = (\partial f/\partial\beta')_{\beta_0}$, a $T \times K$ matrix. Note that (2.24) exactly holds
in the linear case. The practical consequence of the approximation (2.24) is that
all the results for the linear regression model are asymptotically valid for the
non-linear regression model if we treat $G$ as the regressor matrix. In particular, we
can use the usual $t$ and $F$ statistics with an approximate precision, as I will
explain more fully in Sections 2.4 and 2.5 below. Since the matrix $G$ depends on
the unknown parameters, we must in practice evaluate it at $\hat\beta$.
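To make the practical consequence of (2.24) concrete, here is a minimal sketch of ours showing how approximate standard errors follow by treating $\hat G$ as the regressor matrix; the Jacobian is differentiated numerically, and the model, data, and `beta_hat` below are hypothetical stand-ins for a fitted model:

```python
import numpy as np

def jacobian(model, beta, eps=1e-6):
    # Forward-difference approximation to G = (∂f/∂β') evaluated at beta.
    base = model(beta)
    G = np.empty((base.size, beta.size))
    for k in range(beta.size):
        b = beta.copy()
        b[k] += eps
        G[:, k] = (model(b) - base) / eps
    return G

def nlls_cov(model, y, beta_hat):
    # Treat G-hat as the regressor matrix: Var(beta_hat) ~ s2 * (G'G)^{-1}.
    G = jacobian(model, beta_hat)
    resid = y - model(beta_hat)
    s2 = resid @ resid / (y.size - beta_hat.size)  # d.f.-corrected variance
    return s2 * np.linalg.inv(G.T @ G)

# Usage with a hypothetical exponential model and a pretend NLLS estimate:
x = np.linspace(0.0, 2.0, 100)
y = 1.5 * np.exp(0.8 * x) + np.random.default_rng(1).normal(0.0, 0.1, 100)
beta_hat = np.array([1.5, 0.8])
se = np.sqrt(np.diag(nlls_cov(lambda b: b[0] * np.exp(b[1] * x), y, beta_hat)))
print(se)
```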
2.3. Computation
Since there is in general no explicit formula for the NLLS estimator $\hat\beta$, the
minimization of (2.5) must usually be carried out by some iterative method. There
are two general types of iteration methods: general optimization methods applied
to the non-linear least squares problem in particular, and procedures which are
specifically designed to cope with the present problem. In this chapter I will
discuss two representative methods - the Newton-Raphson iteration which be-
longs to the first type and the Gauss-Newton iteration which belongs to the
second type - and a few major variants of each method. These cover a majority of
the iterative methods currently used in econometric applications. Although not
discussed here, I should mention another method sometimes used in econometric
applications, namely the so-called conjugate gradient method of Powell (1964)
which does not require the calculation of derivatives and is based on a different
principle from the Newton methods. Much more detailed discussion of these and
other methods can be found in Chapter 12 of this Handbook and in Goldfeld and
Quandt (1972, ch. 1).

2.3.1. Newton-Raphson iteration
The Newton-Raphson method is based on the following quadratic approximation
of a minimand (it also works for a maximand):

$$S_T(\beta) \cong S_T(\hat\beta_1) + \left.\frac{\partial S_T}{\partial\beta'}\right|_{\hat\beta_1}(\beta - \hat\beta_1) + \frac{1}{2}(\beta - \hat\beta_1)'\left.\frac{\partial^2 S_T}{\partial\beta\,\partial\beta'}\right|_{\hat\beta_1}(\beta - \hat\beta_1), \qquad (2.25)$$

where $\hat\beta_1$ is the initial estimate [obtained by a pure guess or by a method such as
the one proposed by Hartley and Booker (1965) described below]. The second-round
estimator $\hat\beta_2$ of the iteration is obtained by minimizing the right-hand side
of (2.25). Therefore,

$$\hat\beta_2 = \hat\beta_1 - \left[\left.\frac{\partial^2 S_T}{\partial\beta\,\partial\beta'}\right|_{\hat\beta_1}\right]^{-1}\left.\frac{\partial S_T}{\partial\beta}\right|_{\hat\beta_1}. \qquad (2.26)$$

The iteration is to be repeated until the sequence $\{\hat\beta_n\}$ thus obtained converges to
the desired degree of accuracy.

Inserting (2.26) into (2.25) and writing $n+1$ and $n$ for 2 and 1, we obtain:

$$S_T(\hat\beta_{n+1}) \cong S_T(\hat\beta_n) - \frac{1}{2}\left.\frac{\partial S_T}{\partial\beta'}\right|_{\hat\beta_n}\left[\left.\frac{\partial^2 S_T}{\partial\beta\,\partial\beta'}\right|_{\hat\beta_n}\right]^{-1}\left.\frac{\partial S_T}{\partial\beta}\right|_{\hat\beta_n}. \qquad (2.27)$$
The above equation shows two weaknesses of the Newton-Raphson iteration. (i)
Even if (2.27) holds exactly, $S_T(\hat\beta_{n+1}) < S_T(\hat\beta_n)$ is not guaranteed unless
$(\partial^2 S_T/\partial\beta\,\partial\beta')_{\hat\beta_n}$ is a positive definite matrix. (ii) Even if the matrix is positive
definite, $\hat\beta_{n+1} - \hat\beta_n$ may be too large or too small - if it is too large, it overshoots
the target, and if it is too small, the speed of convergence is slow.
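The following sketch of ours implements the pure Newton-Raphson step (2.26) for the hypothetical model $y = \beta_1 e^{\beta_2 x} + u$, with analytic first and second derivatives of $S_T$; in keeping with the weaknesses just noted, it is deliberately started close to the truth, since the unmodified iteration can fail from a poor start:

```python
import numpy as np

# Hypothetical model y = b1*exp(b2*x) + u; data are made up.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0, 100)
y = 1.5 * np.exp(0.8 * x) + rng.normal(0.0, 0.1, x.size)

def grad_hess(b):
    # Analytic gradient and Hessian of S_T for this model.
    e = np.exp(b[1] * x)
    r = y - b[0] * e
    g = np.array([-2.0 * np.sum(r * e),
                  -2.0 * np.sum(r * b[0] * x * e)])
    h11 = 2.0 * np.sum(e * e)
    h12 = 2.0 * np.sum(b[0] * x * e * e) - 2.0 * np.sum(r * x * e)
    h22 = 2.0 * np.sum((b[0] * x * e)**2) - 2.0 * np.sum(r * b[0] * x * x * e)
    return g, np.array([[h11, h12], [h12, h22]])

b = np.array([1.4, 0.9])             # initial estimate, deliberately near truth
for _ in range(10):
    g, H = grad_hess(b)
    b = b - np.linalg.solve(H, g)    # the Newton-Raphson step (2.26)
print(b)                             # converges to a point near (1.5, 0.8)
```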
The first weakness may be alleviated if we modify (2.26) as

$$\hat\beta_2 = \hat\beta_1 - \left[\left.\frac{\partial^2 S_T}{\partial\beta\,\partial\beta'}\right|_{\hat\beta_1} + \alpha_1 I\right]^{-1}\left.\frac{\partial S_T}{\partial\beta}\right|_{\hat\beta_1}, \qquad (2.28)$$

where $I$ is the identity matrix and $\alpha_1$ is a scalar to be appropriately chosen by the
researcher subject to the condition that $(\partial^2 S_T/\partial\beta\,\partial\beta')_{\hat\beta_1} + \alpha_1 I$ is positive definite.
This modification was proposed by Goldfeld, Quandt and Trotter (1966) and is
called quadratic hill-climbing (since they were considering maximization). See the
same article or Goldfeld and Quandt (1972, ch. 1) for a discussion of how to
choose $\alpha_1$ and the convergence properties of the method.
The second weakness may be remedied by the modification:

$$\hat\beta_2 = \hat\beta_1 - \lambda_1\left[\left.\frac{\partial^2 S_T}{\partial\beta\,\partial\beta'}\right|_{\hat\beta_1}\right]^{-1}\left.\frac{\partial S_T}{\partial\beta}\right|_{\hat\beta_1}, \qquad (2.29)$$

where the scalar $\lambda_1$ is to be appropriately determined. See Fletcher and Powell
(1963) for a method to determine $\lambda_1$ by a cubic interpolation of $S_T(\beta)$ along the
current search direction. [This method is called the DFP iteration since Fletcher
and Powell refined the method originally proposed by Davidon (1959).] Also, see
Berndt, Hall, Hall and Hausman (1974) for another method to choose $\lambda_1$.
Ordinarily, the iteration (2.26) is to be repeated until convergence takes place.
However, if $\hat\beta_1$ is a consistent estimator of $\beta_0$ such that $\sqrt{T}(\hat\beta_1 - \beta_0)$ has a proper
limit distribution, the second-round estimator $\hat\beta_2$ has the same asymptotic
distribution as $\hat\beta$. In this case, a further iteration does not bring any improvement so
far as the asymptotic distribution is concerned. This is shown below.

By a Taylor expansion of $(\partial S_T/\partial\beta)_{\hat\beta_1}$ around $\beta_0$, we obtain:

$$\left.\frac{\partial S_T}{\partial\beta}\right|_{\hat\beta_1} = \left.\frac{\partial S_T}{\partial\beta}\right|_{\beta_0} + \left.\frac{\partial^2 S_T}{\partial\beta\,\partial\beta'}\right|_{\beta^*}(\hat\beta_1 - \beta_0), \qquad (2.30)$$

where $\beta^*$ lies between $\hat\beta_1$ and $\beta_0$. Inserting (2.30) into (2.26) yields

$$\sqrt{T}(\hat\beta_2 - \beta_0) = \left\{I - \left[\left.\frac{\partial^2 S_T}{\partial\beta\,\partial\beta'}\right|_{\hat\beta_1}\right]^{-1}\left.\frac{\partial^2 S_T}{\partial\beta\,\partial\beta'}\right|_{\beta^*}\right\}\sqrt{T}(\hat\beta_1 - \beta_0) - \left[\frac{1}{T}\left.\frac{\partial^2 S_T}{\partial\beta\,\partial\beta'}\right|_{\hat\beta_1}\right]^{-1}\frac{1}{\sqrt{T}}\left.\frac{\partial S_T}{\partial\beta}\right|_{\beta_0}. \qquad (2.31)$$
But, under the assumptions of Section 2.2 from which we proved the asymptotic
normality of $\hat\beta$, we have

$$\text{plim}\,\frac{1}{T}\left.\frac{\partial^2 S_T}{\partial\beta\,\partial\beta'}\right|_{\hat\beta_1} = \text{plim}\,\frac{1}{T}\left.\frac{\partial^2 S_T}{\partial\beta\,\partial\beta'}\right|_{\beta^*} = 2C. \qquad (2.32)$$

Therefore,

$$\sqrt{T}(\hat\beta_2 - \beta_0) \overset{\mathrm{LD}}{=} \sqrt{T}(\hat\beta - \beta_0), \qquad (2.33)$$

where $\overset{\mathrm{LD}}{=}$ means that both sides of the equation have the same non-degenerate
limit distribution.
To start an iteration, we need an initial estimate. Since there may be more than
one local minimum of $S_T$, it is helpful to use a starting value as close to the true
value as possible. Thus, it would be desirable to have available an easily
computable good estimator to serve as $\hat\beta_1$; all the better if it is consistent, so that we
can take advantage of the result of the preceding paragraph. Surprisingly, I know of
only one such estimator - the one proposed by Hartley and Booker (1965). Their
initial estimator is obtained as follows. Let us assume for simplicity $mK = T$ for
some integer $m$ and partition the set of integers $(1, 2, \ldots, T)$ into $K$ non-overlapping
consecutive subsets $\Psi_1, \Psi_2, \ldots, \Psi_K$, each of which contains $m$ elements. If we
define $\bar y_{(i)} = m^{-1}\sum_{t \in \Psi_i} y_t$ and $\bar f_{(i)}(\beta) = m^{-1}\sum_{t \in \Psi_i} f_t(\beta)$, $i = 1, 2, \ldots, K$, the
Hartley-Booker estimator is defined as the value of $\beta$ that satisfies the $K$ equations:

$$\bar y_{(i)} = \bar f_{(i)}(\beta), \qquad i = 1, 2, \ldots, K. \qquad (2.34)$$
Since (2.34) cannot generally be solved explicitly for $\beta$, one still needs an
iteration to solve it. Hartley and Booker propose the minimization of $\sum_{i=1}^K\bigl[\bar y_{(i)} -
\bar f_{(i)}(\beta)\bigr]^2$ by an iterative method, such as one of the methods being discussed in
this section. This minimization is at least simpler than the original minimization
of (2.5) because the knowledge that the minimand is zero at the solution is useful.
However, if there are multiple solutions to (2.34), an iteration may lead to the
wrong solution.

Hartley and Booker proved the consistency of their estimator. Jennrich (1969)
gave a counterexample to their consistency proof; however, their proof can easily
be modified to take account of Jennrich's counterexample. A more serious
weakness of the Hartley-Booker proof is that their assumptions are too restrictive:
one can easily construct a benign example for which their assumptions are
violated and yet their estimator is consistent.
Gallant (1975a) suggested a simpler variation of the Hartley-Booker idea: just
select $K$ observations appropriately and solve them for $\beta$. This estimator is
simpler to compute, but inconsistent. Nevertheless, one may obtain a good
starting value by this method, as Gallant's example shows.
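A sketch of the grouping idea for a hypothetical two-parameter model $y = \beta_1 e^{\beta_2 x} + u$ (so $K = 2$ and two groups). For simplicity we take the crude shortcut of evaluating $f$ at the group mean of $x$ rather than averaging $f_t$, which departs from the exact Hartley-Booker proposal but still yields a serviceable starting value:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0.0, 2.0, 100)
y = 1.5 * np.exp(0.8 * x) + rng.normal(0.0, 0.1, x.size)

# Two consecutive groups (m = 50, K = 2), averaged in the spirit of (2.34).
y1, y2 = y[:50].mean(), y[50:].mean()
x1, x2 = x[:50].mean(), x[50:].mean()

# Solving ybar_i = b1*exp(b2*xbar_i), i = 1, 2, in closed form:
b2 = np.log(y2 / y1) / (x2 - x1)
b1 = y1 / np.exp(b2 * x1)
print(b1, b2)    # a cheap starting value in the vicinity of (1.5, 0.8)
```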
2.3.2. Gauss-Newton iteration
This is the method specifically designed to calculate the NLLS estimator. Expanding
$f_t(\beta)$ in a Taylor series around the initial estimate $\hat\beta_1$, we get:

$$f_t(\beta) \cong f_t(\hat\beta_1) + \left.\frac{\partial f_t}{\partial\beta'}\right|_{\hat\beta_1}(\beta - \hat\beta_1). \qquad (2.35)$$

Substituting the right-hand side of (2.35) for $f_t(\beta)$ in (2.5) yields

$$S_T(\beta) \cong \sum_t\left[y_t - f_t(\hat\beta_1) - \left.\frac{\partial f_t}{\partial\beta'}\right|_{\hat\beta_1}(\beta - \hat\beta_1)\right]^2. \qquad (2.36)$$

The second-round estimator $\hat\beta_2$ of the Gauss-Newton iteration is obtained by
minimizing the right-hand side of (2.36) with respect to $\beta$. Thus,

$$\hat\beta_2 = \hat\beta_1 + \left[\sum_t\left.\frac{\partial f_t}{\partial\beta}\right|_{\hat\beta_1}\left.\frac{\partial f_t}{\partial\beta'}\right|_{\hat\beta_1}\right]^{-1}\sum_t\left.\frac{\partial f_t}{\partial\beta}\right|_{\hat\beta_1}\bigl[y_t - f_t(\hat\beta_1)\bigr]. \qquad (2.37)$$

The iteration is to be repeated until convergence is obtained. By an argument
similar to the one I used in proving (2.33), we can prove that the asymptotic
distribution of $\hat\beta_2$ defined in (2.37) is the same as that of $\hat\beta$ if we use a consistent
estimator (such as the Hartley-Booker estimator) to start this iteration. An
advantage of the Gauss-Newton iteration over the Newton-Raphson iteration is
that the former requires only the first derivatives of $f_t$.

The Gauss-Newton iteration may be alternatively motivated as follows.
Evaluating the approximation (2.35) at $\beta_0$ and inserting it into eq. (2.1) yields

$$y_t - f_t(\hat\beta_1) + \left.\frac{\partial f_t}{\partial\beta'}\right|_{\hat\beta_1}\hat\beta_1 \cong \left.\frac{\partial f_t}{\partial\beta'}\right|_{\hat\beta_1}\beta_0 + u_t. \qquad (2.38)$$

Then, the second-round estimator $\hat\beta_2$ can be obtained as the least squares
estimator of $\beta_0$ applied to the linear regression equation (2.38), where the whole
left-hand side is treated as the dependent variable and $(\partial f_t/\partial\beta')_{\hat\beta_1}$ as the vector
of independent variables. Eq. (2.38) reminds us of the point raised above: namely,
the non-linear regression model asymptotically behaves like the linear regression
model if we treat $(\partial f/\partial\beta')_{\hat\beta}$ as the regressor matrix.
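A Gauss-Newton sketch of ours for the same hypothetical exponential model; note that each step is exactly a least squares regression of the current residual on $G_n = (\partial f/\partial\beta')_{\hat\beta_n}$, as eq. (2.38) suggests:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 2.0, 100)
y = 1.5 * np.exp(0.8 * x) + rng.normal(0.0, 0.1, x.size)

def f(b):
    return b[0] * np.exp(b[1] * x)

def G(b):
    # (∂f/∂β') for b1*exp(b2*x): columns are ∂f/∂b1 and ∂f/∂b2.
    e = np.exp(b[1] * x)
    return np.column_stack([e, b[0] * x * e])

b = np.array([1.0, 1.0])
for _ in range(20):
    step, *_ = np.linalg.lstsq(G(b), y - f(b), rcond=None)  # (G'G)^{-1}G'(y-f)
    b = b + step                                            # the update (2.37)
print(b)   # near (1.5, 0.8); only first derivatives of f are needed
```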
The Gauss-Newton iteration suffers from weaknesses similar to those of the
Newton-Raphson iteration: namely, the possibility of a total or near singularity
of the matrix to be inverted in (2.37), and the possibility of too much or too little
change from $\hat\beta_n$ to $\hat\beta_{n+1}$.

In order to deal with the first weakness, Marquardt (1963) proposed a modification:

$$\hat\beta_2 = \hat\beta_1 + \left[\sum_t\left.\frac{\partial f_t}{\partial\beta}\right|_{\hat\beta_1}\left.\frac{\partial f_t}{\partial\beta'}\right|_{\hat\beta_1} + \alpha_1 I\right]^{-1}\sum_t\left.\frac{\partial f_t}{\partial\beta}\right|_{\hat\beta_1}\bigl[y_t - f_t(\hat\beta_1)\bigr], \qquad (2.39)$$

where $\alpha_1$ is a positive scalar to be appropriately determined by a rule based on the
past behavior of the algorithm.

In order to deal with the second weakness, Hartley (1961) proposed the
following modification. First, calculate

$$\Delta_1 = \left[\sum_t\left.\frac{\partial f_t}{\partial\beta}\right|_{\hat\beta_1}\left.\frac{\partial f_t}{\partial\beta'}\right|_{\hat\beta_1}\right]^{-1}\sum_t\left.\frac{\partial f_t}{\partial\beta}\right|_{\hat\beta_1}\bigl[y_t - f_t(\hat\beta_1)\bigr], \qquad (2.40)$$

and, secondly, choose $\lambda_1$ so as to minimize

$$S_T(\hat\beta_1 + \lambda_1\Delta_1), \qquad 0 \le \lambda_1 \le 1. \qquad (2.41)$$

Hartley proves that under general conditions his iteration converges to a stationary
point: that is, a root of the normal equation $\partial S_T/\partial\beta = 0$. He also proves
(not so surprisingly) that if the iteration is started at a point sufficiently close to
$\hat\beta$, it converges to $\hat\beta$. See Tornheim (1963) for an alternative proof of the
convergence of the Hartley iteration. Some useful comments on Marquardt's and
Hartley's algorithms can be found in Gallant (1975a). The methods of determining
$\lambda_1$ in the Newton-Raphson iteration (2.29) mentioned above can also be
applied to the determination of $\lambda_1$ in (2.41).
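Continuing the Gauss-Newton sketch above, Hartley's modification (2.40)-(2.41) can be mimicked with a crude grid search over $\lambda$ in $[0, 1]$ in place of an exact line minimization (the grid is our simplification, not Hartley's proposal):

```python
# Reuses f, G, y, and the current iterate b from the Gauss-Newton sketch.
step, *_ = np.linalg.lstsq(G(b), y - f(b), rcond=None)   # Delta_1 of (2.40)
S = lambda bb: np.sum((y - f(bb))**2)
lam = min(np.linspace(0.0, 1.0, 21), key=lambda l: S(b + l * step))
b = b + lam * step                                        # damped update (2.41)
```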
Jennrich (1969) proves that if the Gauss-Newton iteration is started at a point
sufficiently close to the true value $\beta_0$, and if the sample size $T$ is sufficiently large,
the iteration converges to $\hat\beta$. This is called the asymptotic stability of the
iteration. The following is a brief sketch of Jennrich's proof. Rewrite the Gauss-Newton
iteration (2.37) as (I have also changed 1 to $n$ and 2 to $n+1$ in the
subscript)

$$\hat\beta_{n+1} = h(\hat\beta_n), \qquad (2.42)$$

where $h$ is a vector-valued function implicitly defined by (2.37). By a Taylor
expansion:

$$\hat\beta_{n+1} - \hat\beta_n = \left.\frac{\partial h}{\partial\beta'}\right|_{\beta^*_{n-1}}(\hat\beta_n - \hat\beta_{n-1}), \qquad (2.43)$$

where $\beta^*_{n-1}$ lies between $\hat\beta_n$ and $\hat\beta_{n-1}$. If we define $A_n = (\partial h/\partial\beta')_{\beta^*_n}$ and denote
the largest characteristic root of $A_n'A_n$ by $\lambda_n$, we can show that $A_n \to 0$ almost
surely for all $n$ as $T \to \infty$ and hence

$$\lambda_n \to 0 \text{ almost surely for all } n \text{ as } T \to \infty. \qquad (2.44)$$

But (2.44) implies two facts. First, the iteration converges to a stationary point,
and secondly, this stationary point must lie sufficiently close to the starting value
$\hat\beta_1$ since

$$(\hat\beta_{n+1} - \hat\beta_1)'(\hat\beta_{n+1} - \hat\beta_1) \le \delta'\delta\bigl(1 + \lambda_1 + \lambda_1\lambda_2 + \cdots + \lambda_1\lambda_2\cdots\lambda_n\bigr), \qquad (2.45)$$

where $\delta = \hat\beta_2 - \hat\beta_1$. Therefore, this stationary point must be $\hat\beta$ if $\hat\beta_1$ is within a
neighborhood of $\beta_0$ and if $\hat\beta$ is the unique stationary point in the same neighborhood.
In closing this section I will mention several empirical papers in which the
above-mentioned and related iterative methods are used. Bodkin and Klein (1967)
estimated the Cobb-Douglas (2.2) and the CES (2.3) production functions by the
Newton-Raphson method. Charatsis (1971) estimated the CES production func-
tion by a modification of the Gauss-Newton method similar to that of Hartley
(1961) and showed that in 64 samples out of 74, it converged in six iterations.
Mizon (1977), in a paper the major aim of which was to choose among nine
production functions, including the Cobb-Douglas and CES, used the conjugate
gradient method of Powell (1964). Mizon's article is a useful compendium on the
econometric application of various statistical techniques such as sequential test-
ing, Cox’s test of separate families of hypotheses [Cox (1961, 1962)], the Akaike
Information Criterion [Akaike (1973)], the Box-Cox transformation [Box and
Cox (1964)], and comparison of the likelihood ratio, Wald, and Lagrange multi-
plier tests (see the end of Section 2.4 below). Sargent (1978) estimates a rational
expectations model (which gives rise to non-linear constraints among parameters)
by the DFP algorithm mentioned above.
2.4. Tests of hypotheses
In this section I consider tests of hypotheses on the regression parameters $\beta$. It is

useful to classify situations into four cases depending on the nature of the
Table 2.1
Four cases of hypotheses tests

                Normal    Non-normal
Linear          I         II
Non-linear      III       IV
hypotheses and the distribution of the error term as depicted in Table 2.1. I will
discuss the $t$ and $F$ tests in Case I and the likelihood ratio, Wald, and Rao tests in
Case IV. I will not discuss Cases II and III because the results in Case IV are a
fortiori valid in Cases II and III.
2.4.1. Linear hypotheses under normality
Partition the parameter vector as $\beta' = (\beta_{(1)}', \beta_{(2)}')$, where $\beta_{(1)}$ is a $K_1$-vector and $\beta_{(2)}$
is a $K_2$-vector. By a linear hypothesis I mean a hypothesis which specifies that $\beta_{(2)}$
is equal to a certain known value $\bar\beta_{(2)}$. Student's $t$ test is applicable if $K_2 = 1$ and
the $F$ test if $K_2 > 1$.

The hypothesis of the form $Q\beta = c$, where $Q$ is a known $K_2 \times K$ matrix and $c$ is
a known $K_2$-vector, can be transformed into a hypothesis of the form described
above and therefore need not be separately considered. Assuming $Q$ is full rank,
we can find a $K_1 \times K$ matrix $R$ such that $(R', Q') = A'$ is non-singular. If we
define $\alpha = A\beta$ and partition $\alpha' = (\alpha_{(1)}', \alpha_{(2)}')$, the hypothesis $Q\beta = c$ is equivalent
to the hypothesis $\alpha_{(2)} = c$.
As noted after eq. (2.24), all the results of the linear regression model can be
extended to the non-linear model by treating $G = (\partial f/\partial\beta')_{\beta_0}$ as the regressor
matrix if the assumptions of Section 2.2 are satisfied. Since $\beta_0$ is unknown, we
must use $\hat G = (\partial f/\partial\beta')_{\hat\beta}$ in practice. We will generalize the $t$ and $F$ statistics of
the linear model by this principle. If $K_2 = 1$, we have approximately

$$\frac{\sqrt{T-K}\,\bigl(\hat\beta_{(2)} - \bar\beta_{(2)}\bigr)}{\sqrt{S_T(\hat\beta)\,\hat a}} \sim t(T-K), \qquad (2.46)$$

where $\hat a$ is the last diagonal element (if $\beta_{(2)}$ is the $i$-th element of $\beta$, the $i$-th diagonal
element) of $(\hat G'\hat G)^{-1}$ and $t(T-K)$ denotes Student's $t$ distribution with $T-K$
degrees of freedom. For the case $K_2 \ge 1$, we have asymptotically under the null
hypothesis:

$$\frac{(T-K)\,\bigl(\hat\beta_{(2)} - \bar\beta_{(2)}\bigr)'\bigl[J'(\hat G'\hat G)^{-1}J\bigr]^{-1}\bigl(\hat\beta_{(2)} - \bar\beta_{(2)}\bigr)}{K_2\,S_T(\hat\beta)} \sim F(K_2, T-K), \qquad (2.47)$$
where $J' = (0, I)$, $0$ being the $K_2 \times K_1$ matrix of zeros and $I$ being the identity
matrix of size $K_2$, and $F(K_2, T-K)$ denotes the $F$ distribution with $K_2$ and $T-K$
degrees of freedom.
Gallant (1975a) examined the accuracy of the approximation (2.46) by a Monte
Carlo experiment using the model

$$f(x_t, \beta) = \beta_1 x_{1t} + \beta_2 x_{2t} + \beta_4 e^{\beta_3 x_{3t}}. \qquad (2.48)$$

For each of the four parameters, the empirical distribution of the left-hand side of
(2.46) matched the distribution of $t(T-K)$ reasonably well, although, as we
would suspect, the performance was the poorest for $\beta_3$.
In testing $\beta_{(2)} = \bar\beta_{(2)}$ when $K_2 \ge 1$, we may alternatively use the asymptotic
approximation (under the null hypothesis):

$$\frac{(T-K)\bigl[S_T(\tilde\beta) - S_T(\hat\beta)\bigr]}{K_2\,S_T(\hat\beta)} \sim F(K_2, T-K), \qquad (2.49)$$

where $\tilde\beta$ is the constrained non-linear least squares estimator obtained by minimizing
$S_T(\beta)$ subject to $\beta_{(2)} = \bar\beta_{(2)}$. Although, as is well known, the statistics (2.47)
and (2.49) are identical in the linear model, they are different in the non-linear
model.
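The contrast between (2.47) and (2.49) is easy to sketch numerically. In the hypothetical model $y = \beta_1 e^{\beta_2 x} + u$ below (ours, not Gallant's design) we test $\beta_2 = 1$, so the constrained model is linear in $\beta_1$ and has a closed-form fit; the two statistics come out close but not identical:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 2.0, 200)
y = 1.5 * np.exp(1.0 * x) + rng.normal(0.0, 0.2, x.size)
T, K, K2 = x.size, 2, 1

def f(b): return b[0] * np.exp(b[1] * x)
def S(b): return np.sum((y - f(b))**2)
def G(b):
    e = np.exp(b[1] * x)
    return np.column_stack([e, b[0] * x * e])

# Unconstrained NLLS by Gauss-Newton (as in Section 2.3.2).
b = np.array([1.2, 0.9])
for _ in range(50):
    step, *_ = np.linalg.lstsq(G(b), y - f(b), rcond=None)
    b = b + step

# Constrained fit under H0: b2 = 1 (the model is then linear in b1).
e0 = np.exp(x)
S_c = np.sum((y - (e0 @ y) / (e0 @ e0) * e0)**2)
S_u = S(b)

F_49 = (T - K) * (S_c - S_u) / (K2 * S_u)            # the statistic (2.49)

v22 = np.linalg.inv(G(b).T @ G(b))[1, 1]             # last diag. of (G'G)^{-1}
F_47 = (T - K) * (b[1] - 1.0)**2 / (K2 * v22 * S_u)  # the statistic (2.47)
print(F_47, F_49)
```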
The study of Gallant (1975c) sheds some light on the choice between (2.47) and
(2.49). He obtained the asymptotic distribution of the statistics (2.47) and (2.49)
under the alternative hypothesis as follows. Regarding $S_T(\hat\beta)$, which appears in
both formulae, we have asymptotically:

$$S_T(\hat\beta) \cong u'\bigl[I - G(G'G)^{-1}G'\bigr]u, \qquad (2.50)$$

where $G = (\partial f/\partial\beta')_{\beta_0}$ as before. Define $G_1 = (\partial f/\partial\beta_{(1)}')_{\beta_0}$. Then, Gallant shows
(asymptotically) that

$$S_T(\tilde\beta) \cong (u + \delta)'\bigl[I - G_1(G_1'G_1)^{-1}G_1'\bigr](u + \delta), \qquad (2.51)$$

where $\delta = f(\beta_0) - f(\beta_{(1)}^*, \bar\beta_{(2)})$, in which $\beta_{(1)}^*$ is the value of $\beta_{(1)}$ that minimizes
$\|f(\beta_0) - f(\beta_{(1)}, \bar\beta_{(2)})\|^2$. ($\|x\|^2 = x'x$ for any vector $x$.) He also shows

$$\hat\beta_{(2)} - \bar\beta_{(2)} \cong \beta_{(2)0} - \bar\beta_{(2)} + J'(G'G)^{-1}G'u, \qquad (2.52)$$
where $\beta_{(2)0}$ is the true value of $\beta_{(2)}$.¹ The asymptotic distribution of the statistic
(2.47) under the alternative hypothesis can now be derived from (2.50) and (2.52)
and, similarly, that of (2.49) from (2.50) and (2.51).
Gallant (1975c) conducted a Monte Carlo study using the model (2.48) to
compare the above two tests in testing $\beta_1 = 0$ against $\beta_1 \neq 0$ and $\beta_3 = -1$ against
$\beta_3 \neq -1$. His results show that (i) the asymptotic approximation under the
alternative hypothesis matches the empirical distribution reasonably well for both
statistics but works a little better for the statistic (2.49), and (ii) the power of (2.49)
tends to be higher than that of (2.47).² Gallant (1975a) observes that (2.49) is
easier to calculate than (2.47) except when $K_2 = 1$. All these observations indicate
a preference for (2.49) over (2.47). See Gallant (1975b) for a tabulation of the
power function of the test based on $S_T(\tilde\beta)/S_T(\hat\beta)$, which is equivalent to the
test based on (2.49).
2.4.2. Non-linear hypotheses under non-normality

Now I consider the test of a non-linear hypothesis

$$h(\beta) = 0, \qquad (2.53)$$

where $h$ is a $q$-vector valued non-linear function such that $q < K$.

If $\beta$ are the parameters that characterize a concentrated likelihood function
$L(\beta)$, where $L$ may or may not be derived from the normal distribution, we can
test the hypothesis (2.53) using one of the following well-known test statistics: the
likelihood ratio test (LRT), Wald's test [Wald (1943)], or Rao's test [Rao (1947)]:

$$\text{LRT} = 2\bigl[\log L(\hat\beta) - \log L(\tilde\beta)\bigr], \qquad (2.54)$$

$$\text{Wald} = h(\hat\beta)'\left[\hat H\left(-\left.\frac{\partial^2\log L}{\partial\beta\,\partial\beta'}\right|_{\hat\beta}\right)^{-1}\hat H'\right]^{-1}h(\hat\beta), \qquad (2.55)$$

and

$$\text{Rao} = \left.\frac{\partial\log L}{\partial\beta'}\right|_{\tilde\beta}\left(-\left.\frac{\partial^2\log L}{\partial\beta\,\partial\beta'}\right|_{\tilde\beta}\right)^{-1}\left.\frac{\partial\log L}{\partial\beta}\right|_{\tilde\beta}, \qquad (2.56)$$
¹In deriving the asymptotic approximations (2.51) and (2.52), Gallant assumes that the "distance"
between the null and alternative hypotheses is sufficiently small. More precisely, he assumes that there
exists a sequence of hypothesized values $\{\bar\beta_{(2)}^T\}$, and hence a sequence $\{\beta_{(1)}^{*T}\}$, such that $\sqrt{T}(\bar\beta_{(2)}^T - \beta_{(2)0})$
and $\sqrt{T}(\beta_{(1)}^{*T} - \beta_{(1)0})$ converge to constant vectors as $T$ goes to infinity.

²Actually, the powers of the two tests calculated either from the approximation or from the
empirical distribution are identical in testing $\beta_1 = 0$. They differ only in the test of $\beta_3 = -1$.
where $\hat\beta$ is the unconstrained maximum likelihood estimator, $\tilde\beta$ is the
constrained maximum likelihood estimator obtained by maximizing $L(\beta)$ subject to
(2.53), and $\hat H = (\partial h/\partial\beta')_{\hat\beta}$.³ By a slight modification of the proof of Rao (1973) (a modification is
necessary since Rao deals with a likelihood function rather than a concentrated
likelihood function), it can be shown that all three test statistics have the same
limit distribution - $\chi^2(q)$, chi-square with $q$ degrees of freedom. For more discussion
of these tests, see Chapter 13 of this Handbook by Engle.
Gallant and Holly (1980) obtained the asymptotic distribution of the three
statistics under an alternative hypothesis in a non-linear simultaneous equations
model. Translated into the present simpler model, their results can be stated as
follows. As in Gallant (1975c) (see footnote 1), they assume that the "distance"
between the null hypothesis and the alternative hypothesis is small: or, more
precisely, that there exists a sequence of true values $\{\beta_0^T\}$ such that $\delta = \lim\sqrt{T}
(\beta_0^T - \beta_0)$ is finite and $h(\beta_0) = 0$. Then, statistics (2.54), (2.55), and (2.56)
converge to $\chi^2(q, \lambda)$, chi-square with $q$ degrees of freedom and the noncentrality
parameter $\lambda$,⁴ where

$$\lambda = \delta'H'\bigl[HA^{-1}H'\bigr]^{-1}H\delta, \qquad (2.57)$$

in which $H = (\partial h/\partial\beta')_{\beta_0}$ and $A = -\lim T^{-1}E(\partial^2\log L/\partial\beta\,\partial\beta')_{\beta_0}$.
If we assume the normality of $u$ in the non-linear regression model (2.1), we can
write (2.54), (2.55), and (2.56) as⁵

$$\text{LRT} = T\bigl[\log T^{-1}S_T(\tilde\beta) - \log T^{-1}S_T(\hat\beta)\bigr], \qquad (2.58)$$

$$\text{Wald} = \frac{h(\hat\beta)'\bigl[\hat H(\hat G'\hat G)^{-1}\hat H'\bigr]^{-1}h(\hat\beta)}{T^{-1}S_T(\hat\beta)}, \qquad (2.59)$$

and

$$\text{Rao} = \frac{\bigl[y - f(\tilde\beta)\bigr]'\tilde G(\tilde G'\tilde G)^{-1}\tilde G'\bigl[y - f(\tilde\beta)\bigr]}{T^{-1}S_T(\tilde\beta)}, \qquad (2.60)$$

where $\hat G = (\partial f/\partial\beta')_{\hat\beta}$ and $\tilde G = (\partial f/\partial\beta')_{\tilde\beta}$. Since (2.58)-(2.60) are special cases
of (2.54)-(2.56), all these statistics are asymptotically distributed as $\chi^2(q)$ if $u$ are normal. However,

³See Silvey (1959) for an interpretation of Rao's test as a test on Lagrange multipliers.
⁴If $\xi$ is distributed as a $q$-vector $N(0, V)$, then $(\xi + \mu)'V^{-1}(\xi + \mu) \sim \chi^2(q, \mu'V^{-1}\mu)$.
⁵In the following derivation I have omitted some terms whose probability limit is zero in evaluating
$T^{-1/2}(\partial\log L/\partial\beta')$ and $T^{-1}(\partial^2\log L/\partial\beta\,\partial\beta')$.
using a proof similar to Rao's, we can show that the statistics (2.58), (2.59), and
(2.60) are asymptotically distributed as $\chi^2(q)$ even if $u$ are not normal. Thus,
these statistics can be used to test a non-linear hypothesis under a non-normal
situation.

In the linear regression model we can show Wald $\ge$ LRT $\ge$ Rao [see Berndt
and Savin (1977)]. Although the inequalities do not exactly hold for the non-linear
model, Mizon (1977) found Wald $\ge$ LRT most of the time in his samples.
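A numerical sketch of ours computing the trinity (2.58)-(2.60) in the hypothetical model $y = \beta_1 e^{\beta_2 x} + u$ with the scalar non-linear hypothesis $h(\beta) = \beta_1\beta_2 - 1.5 = 0$; the constrained fit substitutes the restriction and minimizes over a grid, which is crude but adequate for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 2.0, 200)
y = 1.5 * np.exp(1.0 * x) + rng.normal(0.0, 0.2, x.size)
T = x.size

def f(b): return b[0] * np.exp(b[1] * x)
def S(b): return np.sum((y - f(b))**2)
def G(b):
    e = np.exp(b[1] * x)
    return np.column_stack([e, b[0] * x * e])

# Unconstrained NLLS by Gauss-Newton.
b = np.array([1.2, 0.9])
for _ in range(50):
    step, *_ = np.linalg.lstsq(G(b), y - f(b), rcond=None)
    b = b + step

# Constrained NLLS under h(b) = b1*b2 - 1.5 = 0 (true values satisfy it).
grid = np.linspace(0.5, 2.0, 3001)
b2c = min(grid, key=lambda g2: S(np.array([1.5 / g2, g2])))
bc = np.array([1.5 / b2c, b2c])

LRT = T * (np.log(S(bc) / T) - np.log(S(b) / T))               # (2.58)

H = np.array([[b[1], b[0]]])                                    # ∂h/∂β' at b
h = np.array([b[0] * b[1] - 1.5])
Wald = h @ np.linalg.solve(H @ np.linalg.inv(G(b).T @ G(b)) @ H.T, h) \
       / (S(b) / T)                                             # (2.59)

Gc, rc = G(bc), y - f(bc)
Rao = rc @ Gc @ np.linalg.solve(Gc.T @ Gc, Gc.T @ rc) / (S(bc) / T)  # (2.60)
print(LRT, Wald, Rao)      # each approximately chi-square(1) under H0
```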
2.5. Confidence regions
Confidence regions on the parameter vector $\beta$ or its subset can be constructed
using any of the test statistics considered in the preceding section. In this section I
discuss some of these as well as other methods of constructing confidence regions.

A $100 \times (1-\alpha)$ percent confidence interval on an element of $\beta$ can be obtained
from (2.46) as

$$\hat\beta_{(2)} - t_{\alpha/2}(T-K)\sqrt{\frac{S_T(\hat\beta)\,\hat a}{T-K}} \;\le\; \beta_{(2)} \;\le\; \hat\beta_{(2)} + t_{\alpha/2}(T-K)\sqrt{\frac{S_T(\hat\beta)\,\hat a}{T-K}}, \qquad (2.61)$$

where $t_{\alpha/2}(T-K)$ is the $\alpha/2$ critical value of $t(T-K)$.
A confidence region - $100 \times (1-\alpha)$ percent throughout this section - on the
whole vector $\beta$ can be constructed using either (2.47) or (2.49). If we use (2.47) we
obtain:

$$\frac{(T-K)(\hat\beta - \beta)'\hat G'\hat G(\hat\beta - \beta)}{K\,S_T(\hat\beta)} \le F_\alpha(K, T-K), \qquad (2.62)$$

and if we use (2.49) we obtain:

$$\frac{(T-K)\bigl[S_T(\beta) - S_T(\hat\beta)\bigr]}{K\,S_T(\hat\beta)} \le F_\alpha(K, T-K). \qquad (2.63)$$

Goldfeld and Quandt (1972, p. 53) give a striking example in which the two
regions defined by (2.62) and (2.63) differ markedly, even though both statistics
have the same asymptotic distribution - $F(K, T-K)$. I have not come across any
reference discussing the comparative merits of the two methods.
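Since the region (2.63) only requires evaluating $S_T(\beta)$ at each candidate point, it can be traced by brute force on a grid. A self-contained sketch of ours (hypothetical model and data; scipy is used only for the $F$ critical value):

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(5)
x = np.linspace(0.0, 2.0, 100)
y = 1.5 * np.exp(0.8 * x) + rng.normal(0.0, 0.2, x.size)
T, K = x.size, 2

def S(b1, b2):
    return np.sum((y - b1 * np.exp(b2 * x))**2)

b1s = np.linspace(1.0, 2.0, 101)
b2s = np.linspace(0.5, 1.1, 101)
SS = np.array([[S(b1, b2) for b2 in b2s] for b1 in b1s])
S_min = SS.min()            # S_T at (approximately) the NLLS estimate

F_crit = f_dist.ppf(0.95, K, T - K)
region = (T - K) * (SS - S_min) / (K * S_min) <= F_crit   # the region (2.63)
print(region.sum(), "of", region.size, "grid points in the 95% region")
```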
Beale (1960) shows that the confidence region based on (2.63) gives an accurate
result - that is, the distribution of the left-hand side of (2.63) is close to
$F(K, T-K)$ - if the "non-linearity" of the model is small. He defines a measure of
non-linearity as
$$\hat N = K(T-K)^{-1}S_T(\hat\beta)\,\frac{\displaystyle\sum_{i=1}^m\sum_{t=1}^T\left[f_t(b_i) - f_t(\hat\beta) - \left.\frac{\partial f_t}{\partial\beta'}\right|_{\hat\beta}(b_i - \hat\beta)\right]^2}{\displaystyle\left\{\sum_{i=1}^m\left[\sum_{t=1}^T\bigl(f_t(b_i) - f_t(\hat\beta)\bigr)^2\right]^2\right\}}, \qquad (2.64)$$
where $b_1, b_2, \ldots, b_m$ are $m$ arbitrarily chosen $K$-vectors of constants, and states
that (2.63) gives a good result if $\hat N F_\alpha(K, T-K) \le 0.01$, but an unsatisfactory one if
$\hat N F_\alpha(K, T-K) > 1$. Guttman and Meeter (1965), on the basis of their experience
in applying Beale's measure of non-linearity to real data, observe that $\hat N$ is a
useful measure if the degree of "true non-linearity" (which can be measured by
the population counterpart of Beale's measure) is small. Also, see Bates and Watts
(1980) for a further development.
The standard confidence ellipsoid in the linear regression model can be written
as
$$\frac{(T-K)(y - X\beta)'X(X'X)^{-1}X'(y - X\beta)}{K(y - X\beta)'\bigl[I - X(X'X)^{-1}X'\bigr](y - X\beta)} \le F_\alpha(K, T-K). \qquad (2.65)$$
Note that $\beta$ actually drops out of the denominator of (2.65), which makes the
computation of the confidence region simple in this case. In analogy to (2.65),
Hartley (1964) proposed the following confidence region:

$$\frac{(T-K)(y - f)'Z(Z'Z)^{-1}Z'(y - f)}{K(y - f)'\bigl[I - Z(Z'Z)^{-1}Z'\bigr](y - f)} \le F_\alpha(K, T-K), \qquad (2.66)$$
where $Z$ is an appropriately chosen $T \times K$ matrix of constants with rank $K$. The
computation of (2.66) is more difficult than that of (2.65) because $\beta$ appears in
both the numerator and denominator of (2.66). In a simple model where $f_t(\beta) =
\beta_1 + \beta_2 e^{\beta_3 x_t}$, Hartley suggests choosing $Z$ such that its $t$-th row is equal to
$(1, x_t, x_t^2)$. This suggestion may be extended to a general recommendation that we
should choose the column vectors of $Z$ to be those independent variables which
we believe best approximate $G$. Although the distribution of the left-hand side of
(2.66) is exactly $F(K, T-K)$ for any $Z$ under the null hypothesis, its power
depends crucially on the choice of $Z$.
3. Single equation - non-i.i.d. case

3.1. Autocorrelated errors
In this section we consider the non-linear regression model (2.1) where $\{u_t\}$ follow
a general stationary process

$$u_t = \sum_{j=0}^\infty \gamma_j\varepsilon_{t-j}, \qquad (3.1)$$

where $\{\varepsilon_t\}$ are i.i.d. with $E\varepsilon_t = 0$ and $V\varepsilon_t = \sigma^2$, and the $\gamma$'s satisfy the condition

$$\sum_{j=0}^\infty \gamma_j^2 < \infty, \qquad (3.2)$$

and where

the spectral density $g(\omega)$ of $\{u_t\}$ is continuous. $\qquad$ (3.3)

I will add whatever assumptions are needed in the course of the subsequent
discussion. The variance-covariance matrix $Euu'$ will be denoted by $\Sigma$.
I will indicate how to prove the consistency and the asymptotic normality of
the non-linear least squares estimator $\hat\beta$ in the present model, given the above
assumptions as well as the assumptions of Section 2.2. Changing the assumption
of independence to autocorrelation poses no more difficulties in the non-linear
model than in the linear model.

To prove consistency, we consider (2.9) as before. Since $A_1$ does not depend on
$\beta$ and $A_3$ does not depend on $u_t$, we need to be concerned with only $A_2$. Since
$A_2$ involves the vector product $f'u$ and since $E(f'u)^2 = f'\Sigma f \le f'f\,\lambda_1(\Sigma)$, where
$\lambda_1(\Sigma)$ is the largest characteristic root of $\Sigma$, assumption (2.11) implies plim $A_2 = 0$
by Chebyshev's inequality, provided that the characteristic roots of $\Sigma$ are bounded
from above. But this last condition is implied by assumption (3.3).
To prove the asymptotic normality in the present case, we need only prove the
asymptotic normality of (2.16), which, just as in the linear model, follows from
Theorem 10.2.11, page 585, of Anderson (1971) if we assume

$$\sum_{j=0}^\infty |\gamma_j| < \infty \qquad (3.4)$$

in addition to all the other assumptions. Thus,

$$\sqrt{T}(\hat\beta - \beta_0) \to N\Bigl[0, \lim T(G'G)^{-1}G'\Sigma G(G'G)^{-1}\Bigr], \qquad (3.5)$$
which indicates that the linear approximation (2.24) works for the autocorrelated
model as well. Again it is safe to say that all the results of the linear model are
asymptotically valid in the non-linear model. This suggests, for example, that the
Durbin-Watson test will be approximately valid in the non-linear model, though
this has not been rigorously demonstrated.

Now, let us consider the non-linear analogue of the generalized least squares
estimator, which I will call the non-linear generalized least squares (NLGLS)
estimator.
Hannan (1971) investigated the asymptotic properties of the class of estimators,
denoted by $\hat\beta(A)$, obtained by minimizing $(y - f)'A^{-1}(y - f)$ for some $A$, which
is the variance-covariance matrix of a stationary process with bounded (both
from above and from below) characteristic roots. This class contains the NLLS
estimator, $\hat\beta = \hat\beta(I)$, and the NLGLS estimator, $\hat\beta(\Sigma)$.

Hannan actually minimized an approximation of $(y - f)'A^{-1}(y - f)$ expressed
in the frequency domain; therefore, his estimator is analogous to his
spectral estimator proposed for the linear regression model [Hannan (1963)]. If we
define the periodograms

$$I_{ab}(\omega) = \frac{1}{2\pi T}\sum_{t=1}^T a_t e^{it\omega}\sum_{s=1}^T b_s e^{-is\omega}, \qquad \omega = 0, \frac{2\pi}{T}, \frac{4\pi}{T}, \ldots, \frac{2\pi(T-1)}{T}, \qquad (3.6)$$

we have approximately:

$$(y - f)'A^{-1}(y - f) \cong \sum_\omega \phi(\omega)^{-1}\bigl[I_{yy}(\omega) - I_{yf}(\omega) - I_{fy}(\omega) + I_{ff}(\omega)\bigr], \qquad (3.7)$$
where $\phi(\omega)$ is the spectral density associated with $A$. This approximation is based

on an approximation of A by a circular matrix. [See Amemiya and Fuller (1967,
p. 527).]
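The quality of the circular approximation behind (3.7) is easy to check numerically. The following self-contained sketch of ours (with an AR(1) covariance matrix standing in for $A$, and a white-noise vector standing in for $y - f$; all values are hypothetical) compares the exact quadratic form with its periodogram approximation:

```python
import numpy as np

rng = np.random.default_rng(8)
T = 256
rho, s2 = 0.5, 1.0

# AR(1) covariance matrix A and its spectral density phi(w).
idx = np.arange(T)
A = (s2 / (1.0 - rho**2)) * rho**np.abs(idx[:, None] - idx[None, :])
def phi(w):
    return (s2 / (2.0 * np.pi)) / (1.0 - 2.0 * rho * np.cos(w) + rho**2)

r = rng.normal(0.0, 1.0, T)            # stands in for the residual y - f
exact = r @ np.linalg.solve(A, r)      # (y-f)'A^{-1}(y-f)

w = 2.0 * np.pi * np.arange(T) / T
I_r = np.abs(np.fft.fft(r))**2 / (2.0 * np.pi * T)   # periodogram of y - f
approx = np.sum(I_r / phi(w))          # the frequency-domain form (3.7)
print(exact, approx)                   # close for large T
```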
Hannan proves the strong consistency of his non-linear spectral estimator
obtained by minimizing the right-hand side of (3.7) under the assumptions (2.6),
(2.12), and the new assumption

$$\frac{1}{T}\sum_t f_t(c_1)f_{t+s}(c_2) \text{ converges uniformly in } c_1, c_2 \in B \text{ for every integer } s. \qquad (3.8)$$

Note that this is a generalization of the assumption (2.11). However, the assumption
(3.8) is merely sufficient and not necessary. Hannan shows that in the model

$$y_t = \alpha_1 + \alpha_2\cos\beta_0 t + \alpha_3\sin\beta_0 t + u_t, \qquad (3.9)$$

assumption (3.8) does not hold and yet $\hat\beta$ is strongly consistent if we assume (3.4)
and $0 < \beta_0 < \pi$. In fact, $T(\hat\beta - \beta_0)$ converges to zero almost surely in this case.
In proving the asymptotic normality of his estimator, Hannan needs to generalize
(2.20) and (2.21) as follows:

$$\frac{1}{T}\sum_t\left.\frac{\partial f_t}{\partial\beta}\right|_{c_1}\left.\frac{\partial f_{t+s}}{\partial\beta'}\right|_{c_2} \text{ converges uniformly in } c_1 \text{ and } c_2 \text{ in an open neighborhood of } \beta_0, \qquad (3.10)$$

and

$$\frac{1}{T}\sum_t\left.\frac{\partial^2 f_t}{\partial\beta_i\,\partial\beta_j}\right|_{c_1}\left.\frac{\partial^2 f_{t+s}}{\partial\beta_k\,\partial\beta_l}\right|_{c_2} \text{ converges uniformly in } c_1 \text{ and } c_2 \text{ in an open neighborhood of } \beta_0. \qquad (3.11)$$

He also needs an assumption comparable to (2.17), namely

$$\lim\frac{1}{T}G'A^{-1}G \;(\equiv \bar A) \quad \text{exists and is non-singular.} \qquad (3.12)$$
Using (3.10), (3.11), and (3.12), as well as the assumptions needed for consistency,
Hannan proves

$$\sqrt{T}\bigl[\hat\beta(A) - \beta_0\bigr] \to N(0, \bar A^{-1}B\bar A^{-1}), \qquad (3.13)$$

where $B = \lim T^{-1}G'A^{-1}\Sigma A^{-1}G$. If we define a matrix function $F$ by

$$\lim\frac{1}{T}\sum_t\left.\frac{\partial f_t}{\partial\beta}\right|_{\beta_0}\left.\frac{\partial f_{t+s}}{\partial\beta'}\right|_{\beta_0} = \int_{-\pi}^\pi e^{is\omega}\,\mathrm{d}F(\omega), \qquad (3.14)$$

we can write $\bar A = (2\pi)^{-1}\int_{-\pi}^\pi\phi(\omega)^{-1}\,\mathrm{d}F(\omega)$ and $B = (2\pi)^{-1}\int_{-\pi}^\pi g(\omega)\phi(\omega)^{-2}\,\mathrm{d}F(\omega)$.
In the model (3.9), assumptions (3.10) and (3.11) are not satisfied; nevertheless,
Hannan shows that the asymptotic normality holds if one assumes (3.4) and
$0 < \beta_0 < \pi$. In fact, $T^{3/2}(\hat\beta - \beta_0)$ is asymptotically normal in this case.

An interesting practical case is where $\phi(\omega) = \hat g(\omega)$, where $\hat g(\omega)$ is a consistent
estimator of $g(\omega)$. I will denote this estimator by $\hat\beta(\hat\Sigma)$. Hannan proves
that $\hat\beta(\hat\Sigma)$ and $\hat\beta(\Sigma)$ have the same asymptotic distribution if $g(\omega)$ is a rational
spectral density.
Gallant and Goebel (1976) propose a NLGLS estimator of the autocorrelated
model which is constructed in the time domain, unlike Hannan's spectral estimator.
In their method, they try to take account of the autocorrelation of $\{u_t\}$ by
fitting the least squares residuals $\hat u_t$ to an autoregressive model of a finite order.
Thus, their estimator is a non-linear analogue of the generalized least squares
estimator analyzed in Amemiya (1973a).

The Gallant-Goebel estimator is calculated in the following steps. (1) Obtain
the NLLS estimator $\hat\beta$. (2) Calculate $\hat u = y - f(\hat\beta)$. (3) Assume that $\{u_t\}$ follow an
autoregressive model of a finite order and estimate the coefficients by the least
squares regression of $\hat u_t$ on $\hat u_{t-1}, \hat u_{t-2}, \ldots$. (4) Let $\hat\Sigma$ be the variance-covariance
matrix of $u$ obtained under the assumption of an autoregressive model. Then we
can find a lower triangular matrix $R$ such that $\hat\Sigma^{-1} = R'R$, where $R$ depends on
the coefficients of the autoregressive model.⁶ Calculate $\hat R$ using the estimates of
the coefficients obtained in Step (3) above. (5) Finally, minimize $[\hat R(y - f)]'
[\hat R(y - f)]$ to obtain the Gallant-Goebel estimator.
Gallant and Goebel conducted a Monte Carlo study of the model $y_t = \beta_1 e^{\beta_2 x_t}
+ u_t$ to compare the performance of the four estimators - the NLLS, the
Gallant-Goebel AR1 (based on the assumption of a first-order autoregressive
model), the Gallant-Goebel AR2, and Hannan's $\hat\beta(\hat\Sigma)$ - when the true distribution
of $\{u_t\}$ is i.i.d., AR1, AR2, or MA4 (a fourth-order moving average process).
Their major findings were as follows. (1) The Gallant-Goebel AR2 was not much
better than the AR1 version. (2) The Gallant-Goebel estimators performed far
better than the NLLS estimator and a little better than Hannan's $\hat\beta(\hat\Sigma)$, even
when the true model was MA4 - the situation most favorable to Hannan. They
think the reason for this is that in many situations an autoregressive model
produces a better approximation of the true autocovariance function than the
circular approximation upon which Hannan's spectral estimator is based. They
⁶If we assume a first-order autoregressive model, for example, we obtain (up to the scalar factor
$\sigma_\varepsilon^{-1}$, which is immaterial in the minimization of Step (5)):

$$R = \begin{pmatrix} \sqrt{1-\rho^2} & 0 & 0 & \cdots & 0 \\ -\rho & 1 & 0 & \cdots & 0 \\ 0 & -\rho & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & -\rho & 1 \end{pmatrix},$$

where $\rho$ is the autoregressive coefficient.