CHAPTER 69
Binary Choice Models
69.1. Fisher’s Scoring and Iteratively Reweighted Least Squares
This section draws on chapter 55 about Numerical Minimization. Another important "natural" choice for the positive definite matrix R_i in the gradient method is available if one maximizes a likelihood function: then R_i can be the inverse of the information matrix for the parameter values β_i. This is called Fisher's Scoring method. It is closely related to the Newton-Raphson method. The Newton-Raphson method uses the Hessian matrix, and the information matrix is minus the expected value of the Hessian. Apparently Fisher first used the information matrix as a computational simplification in the Newton-Raphson method. Today IRLS is used in the GLIM program for generalized linear models.
As in chapter 56 discussing nonlinear least squares, β is the vector of parameters of interest, and we will work with an intermediate vector η(β) of predictors whose dimension is comparable to that of the observations. Therefore the likelihood function has the form L = L(y, η(β)). By the chain rule (C.1.23) one can write the Jacobian of the likelihood function as ∂L/∂β⊤(β) = u⊤X, where u⊤ = ∂L/∂η⊤(η(β)) is the Jacobian of L as a function of η, evaluated at η(β), and X = ∂η/∂β⊤(β) is the Jacobian of η. This is the same notation as in the discussion of the Gauss-Newton regression.
Define A = E[uu⊤]. Since X does not depend on the random variables, the information matrix of y with respect to β is then E[X⊤uu⊤X] = X⊤AX. If one uses the inverse of this information matrix as the R-matrix in the gradient algorithm, one gets

(69.1.1)   β_{i+1} = β_i + α_i (X⊤AX)^{−1} X⊤u
The Iteratively Reweighted Least Squares interpretation of this comes from rewriting (69.1.1) as

(69.1.2)   β_{i+1} = β_i + (X⊤AX)^{−1} X⊤A A^{−1}u,

i.e., one obtains the step by regressing A^{−1}u on X with weighting matrix A.
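The following R fragment is a minimal sketch of one such step, not code from these notes; the helper functions score_u() and info_A() are hypothetical placeholders which must return u = ∂L/∂η and the weight matrix A = E[uu⊤] for the model at hand.

## one Fisher scoring / IRLS step as in (69.1.1)-(69.1.2); score_u() and
## info_A() are assumed to be supplied by the user for the specific model
irls_step <- function(beta, X, score_u, info_A, alpha = 1) {
  eta <- X %*% beta                    # current predictor values eta(beta)
  u   <- score_u(eta)                  # gradient of L with respect to eta
  A   <- info_A(eta)                   # E[u u'], diagonal for independent observations
  ## the step is the weighted regression of A^{-1} u on X with weight matrix A,
  ## i.e. (X'AX)^{-1} X'u:
  beta + alpha * solve(t(X) %*% A %*% X, t(X) %*% u)
}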
The justifications of IRLS are: the information matrix is usually analytically simpler than the Hessian of the likelihood function, so it is a convenient approximation; and one needs the information matrix anyway at the end, for the covariance matrix of the M.L. estimators.
69.2. Binary Dependent Variable
Assume each individual in the sample makes an independent random choice between two alternatives, which can conveniently be coded as y_i = 0 or 1. The probability distribution of y_i is fully determined by the probability π_i = Pr[y_i = 1] of the event which has y_i as its indicator function. Then E[y_i] = π_i and var[y_i] = E[y_i²] − (E[y_i])² = E[y_i] − (E[y_i])² = π_i(1 − π_i).
It is usually assumed that the individual choices are stochastically independent of each other, i.e., the distribution of the data is fully characterized by the π_i. Each π_i is assumed to depend on a vector of explanatory variables x_i. There are different approaches to modelling this dependence.
The regression model y_i = x_i⊤β + ε_i with E[ε_i] = 0 is inappropriate because x_i⊤β can take any value, whereas 0 ≤ E[y_i] ≤ 1. Nevertheless, people have been tinkering with it. The obvious first tinker is based on the observation that the ε_i are no longer homoskedastic, but their variance, which is a function of π_i, can be estimated, therefore one can correct for this heteroskedasticity. But things get complicated very quickly and then the main appeal of OLS, its simplicity, is lost. This is a wrong-headed approach, and any smart ideas which one may get when going down this road are simply wasted.
The right way to do this is to set π_i = E[y_i] = Pr[y_i = 1] = h(x_i⊤β) where h is some (necessarily nonlinear) function with values between 0 and 1.
69.2.1. Logit Specification (Logistic Regression). The logit or logistic specification is π_i = e^{x_i⊤β}/(1 + e^{x_i⊤β}). Invert to get log(π_i/(1 − π_i)) = x_i⊤β, i.e., the logarithm of the odds depends linearly on the predictors. The log odds are a natural re-scaling of probabilities to a scale which goes from −∞ to +∞, and which is symmetric in that the log odds of the complement of an event is just the negative of the log odds of the event itself. (See my remarks about the odds ratio in Question 222.)
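As an aside not in the original notes: in R the logit and logistic transformations are available directly as the quantile and distribution functions of the standard logistic distribution, qlogis() and plogis().

p <- 0.8
eta <- qlogis(p)      # logit: log(p/(1-p)) = log(4), about 1.386
plogis(eta)           # logistic: exp(eta)/(1+exp(eta)), recovers 0.8
qlogis(1 - p)         # about -1.386, the negative of qlogis(p), as claimed above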
Problem 560. 1 point If y = log(p/(1 − p)) (logit function), show that p = exp(y)/(1 + exp(y)) (logistic function).

Answer. exp(y) = p/(1 − p); now multiply by 1 − p to get exp(y) − p exp(y) = p, collect terms exp(y) = p(1 + exp(y)), now divide by 1 + exp(y). □
Problem 561. Sometimes one finds the following alternative specification of the logit model: π_i = 1/(1 + e^{−x_i⊤β}). What is the difference between it and our formulation of the logit model? Are these two formulations equivalent?

Answer. It is simply a different parametrization of the same model; this form arises because it comes from an index number problem. □
The logit function is also the canonical link function for the binomial distribution,
see Problem 113.
69.2.2. Probit Model. An important class of functions with values between 0
and 1 is the class of cumulative probability distribution functions. If h is a cumulative
distribution function, then one can give this specification an interesting interpretation
in terms of an unobserved “index variable.”
The index variable model specifies: there is a variable z_i with the property that y_i = 1 if and only if z_i > 0. For instance, the decision y_i whether or not individual i moves to a different location can be modeled by the calculation whether the net benefit of moving, i.e., the wage differential minus the cost of relocation and finding a new job, is positive or not. This moving example is worked out, with references, in [Gre93, pp. 642/3].
The value of the variable z_i is not observed, one only observes y_i, i.e., the only thing one knows about the value of z_i is whether it is positive or not. But it is assumed that z_i is the sum of a deterministic part which is specific to the individual and a random part which has the same distribution for all individuals and is stochastically independent between different individuals. The deterministic part specific to the individual is assumed to depend linearly on individual i's values of the covariates, with coefficients which are common to all individuals. In other words, z_i = x_i⊤β + ε_i, where the ε_i are i.i.d. with cumulative distribution function F_ε. Then it follows

π_i = Pr[y_i = 1] = Pr[z_i > 0] = Pr[ε_i > −x_i⊤β] = 1 − Pr[ε_i ≤ −x_i⊤β] = 1 − F_ε(−x_i⊤β).

I.e., in this case, h(η) = 1 − F_ε(−η). If the distribution of ε_i is symmetric and has a density, then one gets the simpler formula h(η) = F_ε(η).
Which cumulative distribution function should be chosen?
• In practice, the probit model, in which z_i is normal, is the only one used.
• The linear model, in which h is the line segment from (a, 0) to (b, 1), can also be considered generated by an index function z_i which is here uniformly distributed.
• An alternative possible specification with the Cauchy distribution is proposed in [DM93, p. 516]. They say that curiously only logit and probit are being used.
In practice, the probit model is very similar to the logit model, once one has rescaled
the variables to make the variances equal, but the logit model is easier to handle
mathematically.
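A quick numerical illustration, not part of the original notes: matching the variances means comparing the logistic c.d.f. with a normal c.d.f. whose standard deviation is π/√3, since the standard logistic distribution has variance π²/3 (the exact scaling constant one uses is a matter of convention).

eta <- seq(-4, 4, by = 1)
cbind(logit  = plogis(eta),                      # logistic c.d.f.
      probit = pnorm(eta, sd = pi / sqrt(3)))    # normal c.d.f. with matched variance
## the two columns stay close over the whole range, which is why the two
## models give very similar fitted probabilities in practice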
69.2.3. Replicated Data. Before discussing estimation methods I want to briefly address the issue whether or not to write the data in replicated form [MN89, p. 99–101]. If there are several observations for every individual, or if there are several individuals for the same values of the covariates (which can happen if all covariates are categorical), then one can write the data more compactly if one groups the data into so-called "covariate classes," i.e., groups of observations which share the same values of x_i, and defines y_i to be the number of times the decision came out positive in this group. Then one needs a second variable, m_i, which is assumed nonrandom, indicating how many individual decisions are combined in the respective group. This is an equivalent formulation of the data, the only thing one loses is the order in which the observations were made (which may be relevant if there are training or warm-up effects). The original representation of the data is a special case of the grouped form: in the non-grouped form, all m_i = 1. We will from now on write our formulas for the grouped form.
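For concreteness, here is a small R illustration with made-up data (not from the notes) of passing from the individual to the grouped form:

dat <- data.frame(x   = c(1, 1, 2, 2, 2, 3),      # covariate value of each individual
                  yes = c(1, 0, 1, 1, 0, 1))      # individual binary decisions
y <- tapply(dat$yes, dat$x, sum)     # y_i: positive decisions per covariate class
m <- tapply(dat$yes, dat$x, length)  # m_i: number of individuals per covariate class
cbind(y, m)                          # grouped form of the same data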
69.2.4. Estimation. Maximum likelihood is the preferred estimation method. The likelihood function has the form L = ∏ π_i^{y_i}(1 − π_i)^{m_i − y_i}. This likelihood function is not derived from a density, but from a probability mass function. For instance, in the case with non-replicated data, all m_i = 1: if you have n binary measurements, then you can have only 2^n different outcomes, and the probability of the sequence y_1, . . . , y_n = 0, 1, 0, 0, . . . , 1 is as given above.
This is a highly nonlinear maximization and must be done numerically. Let us
go through the method of scoring in the example of a logit distribution.
(69.2.1)   L = Σ_i [ y_i log π_i + (m_i − y_i) log(1 − π_i) ]

(69.2.2)   ∂L/∂π_i = y_i/π_i − (m_i − y_i)/(1 − π_i)

(69.2.3)   ∂²L/∂π_i² = −( y_i/π_i² + (m_i − y_i)/(1 − π_i)² )
Defining η = Xβ, the logit specification can be written as π_i = e^{η_i}/(1 + e^{η_i}). Differentiation gives ∂π_i/∂η_i = π_i(1 − π_i). Combine this with (69.2.2) to get

(69.2.4)   u_i = ∂L/∂η_i = ( y_i/π_i − (m_i − y_i)/(1 − π_i) ) π_i(1 − π_i) = y_i − m_i π_i.
These are the elements of u in (69.1.1), and they have a very simple meaning: they are just the observations minus their expected values. Therefore one obtains immediately that A = E[uu⊤] is a diagonal matrix with m_i π_i(1 − π_i) in the diagonal.
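The following R sketch (made-up data, not an example from the notes) carries out this scoring iteration for a small grouped logit problem and compares the result with R's built-in IRLS in glm():

x <- c(-2, -1, 0, 1, 2)
X <- cbind(1, x)                    # design matrix with an intercept
m <- c(10, 12, 15, 12, 10)          # group sizes m_i
y <- c(1, 3, 8, 9, 9)               # positive decisions y_i per group

beta <- c(0, 0)
for (it in 1:25) {
  prob <- plogis(X %*% beta)                      # pi_i = e^{eta_i}/(1 + e^{eta_i})
  u    <- y - m * prob                            # score (69.2.4)
  A    <- diag(as.vector(m * prob * (1 - prob)))  # E[uu'] on the diagonal
  beta <- beta + solve(t(X) %*% A %*% X, t(X) %*% u)   # step (69.1.1)
}
drop(beta)
coef(glm(cbind(y, m - y) ~ x, family = binomial))     # same estimates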
Problem 562. 6 points Show that for the maximization of the likelihood func-
tion of the logit model, Fisher’s scoring method is equivalent to the Newton-Raphson
algorithm.
Problem 563. Show that in the logistic model, Σ_i m_i π̂_i = Σ_i y_i.
69.3. The Generalized Linear Model
The binary choice models show how the linear model can be generalized. [MN89,
p. 27–32] develop a unified theory of many different interesting models, called the
“generalized linear model.” The following few paragraphs are indebted to the elabo-
rate and useful web site about Generalized Linear Models maintained by Gordon K.
Smyth at www.maths.uq.oz.au/~gks/research/glm
In which cases is it necessary to go beyond linear models? The most important and common situation is one in which y_i and µ_i = E[y_i] are bounded:
• If y represents the amount of some physical substance then we may have y ≥ 0 and µ ≥ 0.
• If y is binary, i.e., y = 1 if an animal survives and y = 0 if it does not, then 0 ≤ µ ≤ 1.
The linear model is inadequate here because complicated and unnatural constraints
on β would be required to make sure that µ stays in the feasible range. Generalized
linear models instead assume a link linear relationship
(69.3.1) g(µ) = Xβ
where g() is some known monotonic function which acts pointwise on µ. Typically g() is used to transform the µ_i to a scale on which they are unconstrained. For example we might use g(µ) = log(µ) if µ_i > 0 or g(µ) = log(µ/(1 − µ)) if 0 < µ_i < 1.
The same reasons which force us to abandon the linear model also force us to abandon the assumption of normality. If y is bounded then the variance of y must depend on its mean. Specifically if µ is close to a boundary for y then var(y) must be small. For example, if y > 0, then we must have var(y) → 0 as µ → 0. For this reason strictly positive data almost always shows increasing variability with increased size. If 0 < y < 1, then var(y) → 0 as µ → 0 or µ → 1. For this reason, generalized linear models assume that

(69.3.2)   var(y_i) = φ · V(µ_i)

where φ is an unknown scale factor and V() is some known variance function appropriate for the data at hand.
We therefore estimate the nonlinear regression equation (69.3.1) weighting the observations inversely according to the variance functions V(µ_i). This weighting procedure turns out to be exactly equivalent to maximum likelihood estimation when the observations actually come from an exponential family distribution.
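In R this machinery is packaged in glm(), whose family argument bundles the link g() and the variance function V(). The following small simulation is my own illustration, not from the notes: it fits a log-link model with gamma variance V(µ) = µ² to made-up positive data.

set.seed(1)
x  <- runif(40)
mu <- exp(1 + 2 * x)                          # true mean on the log-link scale
y  <- rgamma(40, shape = 5, rate = 5 / mu)    # E[y] = mu, var[y] = mu^2 / 5
fit <- glm(y ~ x, family = Gamma(link = "log"))
coef(fit)                   # roughly recovers the coefficients 1 and 2
summary(fit)$dispersion     # estimate of the scale factor phi (about 1/5 here)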
Problem 564. Describe estimation situations in which a linear model and Nor-
mal distribution are not appropriate.
The generalized linear model has the following components:
• Random component: Instead of being normally distributed, the components of y have a distribution in the exponential family.
• Introduce a new symbol η = Xβ.
• A monotonic univariate link function g so that η_i = g(µ_i) where µ = E[y].
The generalized linear model allows for a nonlinear link function g specifying that transformation of the expected value of the response variable which depends linearly on the predictors:

(69.3.3)   g(E[y_i]) = x_i⊤β.

Its random specification is such that var[y] depends on E[y] through a variance function φ · V (where φ is a constant taking the place of σ² in the regression model):

(69.3.4)   var[y] = φ · V(E[y])
We have seen earlier that these mean- and variance functions are not an artificial
construct, but that the distributions from the “exponential dispersion family,” see
Section 6.2, naturally give rise to such mean and variance functions. But just as
much of the theory of the linear model can be derived without the assumption that
the residuals are normally distributed, many of the results about generalized linear
models do not require us to specify the whole distribution but can be derived on the
basis of the mean and variance functions alone.
CHAPTER 70
Multiple Choice Models
Discrete choice between three or more alternatives; these models came from the analysis of transportation choices.
The outcomes of these choices should no longer be represented by a vector y, but one needs a matrix Y with y_ij = 1 if the ith individual chooses the jth alternative, and 0 otherwise. Consider only three alternatives j = 1, 2, 3, and define Pr(y_ij = 1) = π_ij.
The Conditional Logit model is a model which makes all π_ij dependent on x_i. It is a very simple extension of binary choice. In binary choice we had log(π_i/(1 − π_i)) = x_i⊤β, the log of the odds ratio. Here this is generalized to log(π_i2/π_i1) = x_i⊤β_2 and log(π_i3/π_i1) = x_i⊤β_3. From this we obtain

(70.0.5)   π_i1 = 1 − π_i2 − π_i3 = 1 − π_i1 e^{x_i⊤β_2} − π_i1 e^{x_i⊤β_3},

or

(70.0.6)   π_i1 = 1/(1 + e^{x_i⊤β_2} + e^{x_i⊤β_3}),

(70.0.7)   π_i2 = e^{x_i⊤β_2}/(1 + e^{x_i⊤β_2} + e^{x_i⊤β_3}),

(70.0.8)   π_i3 = e^{x_i⊤β_3}/(1 + e^{x_i⊤β_2} + e^{x_i⊤β_3}).
One can write this as π_ij = e^{α_j⊤x_i} / Σ_k e^{α_k⊤x_i}, with α_2 = β_2 and α_3 = β_3, if one defines α_1 = β_1 = 0. The only estimation method used is MLE.
(70.0.9)   L = ∏_i π_i1^{y_i1} π_i2^{y_i2} π_i3^{y_i3} = ∏_i (e^{x_i⊤β_2})^{y_i2} (e^{x_i⊤β_3})^{y_i3} / (1 + e^{x_i⊤β_2} + e^{x_i⊤β_3}).
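A small R sketch (my own illustration, with made-up coefficient vectors) of the probabilities (70.0.6)–(70.0.8), with alternative 1 as the baseline:

multi_logit_probs <- function(xi, beta2, beta3) {
  e2 <- exp(sum(xi * beta2))          # exp(x_i' beta_2)
  e3 <- exp(sum(xi * beta3))          # exp(x_i' beta_3)
  c(pi1 = 1, pi2 = e2, pi3 = e3) / (1 + e2 + e3)
}
p <- multi_logit_probs(xi = c(1, 0.5), beta2 = c(0.2, 1.0), beta3 = c(-0.4, 0.8))
p
sum(p)                                # the three probabilities sum to one
log(p["pi2"] / p["pi1"])              # equals x_i' beta_2 = 0.7, as in the text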
Note: the odds are independent of all other alternatives. Therefore the alternatives must be chosen such that this independence is a good assumption. The choice between walking, car, red buses, and blue buses does not satisfy this. See [Cra91, p. 47] for the best explanation of this that I have found so far.
APPENDIX A
Matrix Formulas
In this Appendix, efforts are made to give some of the familiar matrix lemmas in their most general form. The reader should be warned: the concept of a deficiency matrix, and the notation which uses a thick fraction line for multiplication by a scalar g-inverse, are my own.
A.1. A Fundamental Matrix Decomposition
Theorem A.1.1. Every matrix B which is not the null matrix can be written
as a product of two matrices B = CD, where C has a left inverse L and D a right
inverse R, i.e., LC = DR = I. This identity matrix is r × r, where r is the rank
of B.
A proof is in [Rao73, p. 19]. This is the algebraic fact that every homomorphism can be written as a product of an epimorphism and a monomorphism, together with the fact that all epimorphisms and monomorphisms split, i.e., have one-sided inverses.
One such factorization is given by the singular value theorem: If B = P⊤ΛQ is the SVD as in Theorem A.9.2, then one might set e.g. C = P⊤Λ and D = Q, consequently L = Λ^{−1}P and R = Q⊤. In this decomposition, the first row/column carries the largest weight and gives the best approximation in a least squares sense, etc.
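A numerical check of this factorization in R (my own illustration; R's svd() returns B = UDV⊤, so U, diag(d) and t(V) play the roles of P⊤, Λ and Q):

B <- matrix(c(1, 2, 3,
              2, 4, 6,
              1, 1, 1), nrow = 3, byrow = TRUE)   # made-up matrix of rank 2
s <- svd(B)
r <- sum(s$d > 1e-10)                              # numerical rank
C <- s$u[, 1:r] %*% diag(s$d[1:r])                 # C = P' Lambda (rank-r part)
D <- t(s$v[, 1:r])                                 # D = Q
max(abs(B - C %*% D))                              # ~ 0:  B = CD
L <- diag(1 / s$d[1:r]) %*% t(s$u[, 1:r])          # left inverse of C
R <- s$v[, 1:r]                                    # right inverse of D
max(abs(L %*% C - diag(r))); max(abs(D %*% R - diag(r)))   # both ~ 0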
The trace of a square matrix is defined as the sum of its diagonal elements. The
rank of a matrix is defined as the number of its linearly independent rows, which is
equal to the number of its linearly independent columns (row rank = column rank).
Theorem A.1.2. tr BC = tr CB.
Problem 565. Prove theorem A.1.2.
Problem 566. Use theorem A.1.1 to prove that if BB = B, then rank B =
tr B.
Answer. Premultiply the equation CD = CDCD by L and postmultiply it by R to get DC = I_r. This is useful for the trace: tr B = tr CD = tr DC = tr I_r = r. I have this proof from [Rao73, p. 28]. □
Theorem A.1.3. B = O if and only if B⊤B = O.
A.2. The Spectral Norm of a Matrix
The spectral norm of a matrix extends the Euclidean norm ‖z‖ from vectors to matrices. Its definition is ‖A‖ = max_{‖z‖=1} ‖Az‖. This spectral norm is the maximum singular value µ_max, and if A is square and nonsingular, then ‖A^{−1}‖ = 1/µ_min. It is a true norm, i.e., ‖A‖ = 0 if and only if A = O; furthermore ‖λA‖ = |λ|·‖A‖, and the triangle inequality ‖A + B‖ ≤ ‖A‖ + ‖B‖ holds. In addition, it obeys ‖AB‖ ≤ ‖A‖·‖B‖.
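In R the spectral norm is available as norm(A, type = "2"), which lets one verify these statements numerically (my own illustration):

A <- matrix(c(2, 1, 0, 3), 2, 2)     # made-up nonsingular matrix
norm(A, type = "2")                  # spectral norm
max(svd(A)$d)                        # the maximum singular value: same number
norm(solve(A), type = "2")           # equals 1 / (minimum singular value):
1 / min(svd(A)$d)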
Problem 567. Show that the spectral norm is the maximum singular value.
Answer. Use the definition

(A.2.1)   ‖A‖² = max_z (z⊤A⊤Az)/(z⊤z).

Write A = P⊤ΛQ as in (A.9.1). Then z⊤A⊤Az = z⊤Q⊤Λ²Qz. Therefore we can first show: there is a z in the form z = Q⊤x which attains this maximum. Proof: for every z which has a nonzero value in the numerator of (A.2.1), set x = Qz. Then x ≠ o, and Q⊤x attains the same value as z in the numerator of (A.2.1), and a smaller or equal value in the denominator. Therefore one can restrict the search for the maximum argument to vectors of the form Q⊤x. But for them the objective function becomes (x⊤Λ²x)/(x⊤x), which is maximized by x = i_1, the first unit vector (or column vector of the unit matrix). Therefore the squared spectral norm is λ²_11, and therefore the spectral norm itself is λ_11. □
A.3. Inverses and g-Inverses of Matrices
A g-inverse of a matrix A is any matrix A⁻ satisfying

(A.3.1)   A = AA⁻A.

It always exists but is not always unique. If A is square and nonsingular, then A^{−1} is its only g-inverse.
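A numerical sketch in R (my own illustration): a g-inverse of a singular matrix can be built from its SVD by inverting only the nonzero singular values; the result here happens to be the Moore-Penrose inverse, which is one particular g-inverse.

A <- matrix(c(1, 2,
              2, 4), 2, 2, byrow = TRUE)            # made-up singular matrix, rank 1
s   <- svd(A)
pos <- s$d > 1e-10
Ag  <- s$v[, pos, drop = FALSE] %*% diag(1 / s$d[pos], sum(pos)) %*%
       t(s$u[, pos, drop = FALSE])
max(abs(A %*% Ag %*% A - A))                        # ~ 0, so (A.3.1) holds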
Problem 568. Show that a symmetric matrix Ω has a g-inverse which is also symmetric.

Answer. If Ω⁻ is any g-inverse of Ω, use Ω⁻Ω(Ω⁻)⊤. □
The definition of a g-inverse is apparently due to [Rao62]. It is sometimes called the "conditional inverse" [Gra83, p. 129]. This g-inverse, and not the Moore-Penrose generalized inverse or pseudoinverse A⁺, is needed for the linear model. The Moore-Penrose generalized inverse is a g-inverse that in addition satisfies A⁺AA⁺ = A⁺, and AA⁺ as well as A⁺A symmetric. It always exists and is also unique, but the additional requirements are burdensome ballast. [Gre97, pp. 44–5] also advocates the Moore-Penrose inverse, but he does not really use it. If he were to try to use it, he would probably soon discover that it is not appropriate. The book [Alb72] does the linear model with the Moore-Penrose inverse. It is a good demonstration of how complicated everything gets if one uses an inappropriate mathematical tool.
Problem 569. Use theorem A.1.1 to prove that every matrix has a g-inverse.

Answer. Simple: a null matrix has its transpose as g-inverse, and if A ≠ O then RL is such a g-inverse. □
The g-inverse of a number is its inverse if the number is nonzero, and is arbitrary otherwise. Scalar expressions written as fractions are in many cases the multiplication by a g-inverse. We will use a fraction with a thick horizontal rule to indicate where this is the case. In other words, by definition,

(A.3.2)   a/b (with the thick fraction rule) = b⁻a.

Compare that with the ordinary fraction a/b. This idiosyncratic notation allows one to write certain theorems in a more concise form, but it requires more work in the proofs, because one has to consider the additional case that the denominator is zero. Theorems A.5.8 and A.8.2 are examples.
Theorem A.3.1. If B = AA⁻B holds for one g-inverse A⁻ of A, then it holds for all g-inverses. If A is symmetric and B = AA⁻B, then also B⊤ = B⊤A⁻A. If B = BA⁻A and C = AA⁻C, then BA⁻C is independent of the choice of g-inverses.
Proof. Assume the identity B = AA⁺B holds for some fixed g-inverse A⁺ (which may be, as the notation suggests, the Moore-Penrose g-inverse, but this is not necessary), and let A⁻ be a different g-inverse. Then AA⁻B = AA⁻AA⁺B = AA⁺B = B. For the second statement one merely has to take transposes and note that a matrix is a g-inverse of a symmetric A if and only if its transpose is. For the third statement: BA⁺C = BA⁻AA⁺AA⁻C = BA⁻AA⁻C = BA⁻C. Here A⁺ signifies a different g-inverse; again, it is not necessarily the Moore-Penrose one. □
Problem 570. Show that x satisfies x = Ba for some a if and only if x = BB⁻x.
Theorem A.3.2. Both A⊤(AA⊤)⁻ and (A⊤A)⁻A⊤ are g-inverses of A.

Proof. We have to show

(A.3.3)   A = AA⊤(AA⊤)⁻A,

which is [Rao73, (1b.5.5) on p. 26]. Define D = A − AA⊤(AA⊤)⁻A and show, by multiplying out, that DD⊤ = O. □
A.4. Deficiency Matrices
Here is again some idiosyncratic terminology and notation. It gives an explicit algebraic formulation for something that is often done implicitly or in a geometric paradigm. A matrix G will be called a "left deficiency matrix" of S, in symbols G ⊥ S, if GS = O, and for all Q with QS = O there is an X with Q = XG. This factorization property is an algebraic formulation of the geometric concept of a null space. It is symmetric in the sense that G ⊥ S is also equivalent with: GS = O, and for all R with GR = O there is a Y with R = SY. In other words, G ⊥ S and S⊤ ⊥ G⊤ are equivalent.

This symmetry follows from the following characterization of a deficiency matrix, which is symmetric:

Theorem A.4.1. T ⊥ U iff TU = O and T⊤T + UU⊤ is nonsingular.
Proof. This proof here seems terribly complicated. There must be a simpler way. Proof of "⇒": Assume T ⊥ U. Take any γ with γ⊤T⊤Tγ + γ⊤UU⊤γ = 0, i.e., Tγ = o and γ⊤U = o⊤. From this one can show that γ = o: since Tγ = o, there is a ξ with γ = Uξ, therefore γ⊤γ = γ⊤Uξ = 0. To prove "⇐" assume TU = O and T⊤T + UU⊤ is nonsingular. To show that T ⊥ U take any B with BU = O. Then B = B(T⊤T + UU⊤)(T⊤T + UU⊤)^{−1} = BT⊤T(T⊤T + UU⊤)^{−1}. In the same way one gets T = TT⊤T(T⊤T + UU⊤)^{−1}. Premultiply this last equation by T⊤T(T⊤TT⊤T)⁻T⊤ and use theorem A.3.2 to get T⊤T(T⊤TT⊤T)⁻T⊤T = T⊤T(T⊤T + UU⊤)^{−1}. Inserting this into the equation for B gives B = BT⊤T(T⊤TT⊤T)⁻T⊤T, i.e., B factors over T. □
The R/Splus-function Null gives the transpose of a deficiency matrix.
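This can be checked numerically (my own illustration with a made-up matrix; the MASS package must be attached for Null):

library(MASS)
S <- matrix(c(1, 0,
              0, 1,
              1, 1), nrow = 3, byrow = TRUE)   # 3 x 2 matrix of rank 2
G <- t(Null(S))                                # transpose of Null(S): a left deficiency matrix
max(abs(G %*% S))                              # ~ 0, i.e. G S = O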
Theorem A.4.2. If for all Y, BY = O implies AY = O, then an X exists with A = XB.
Problem 571. Prove theorem A.4.2.

Answer. Let B ⊥ C. Choosing Y = C, it follows that AC = O, hence such an X exists. □
Problem 572. Show that I − SS⁻ ⊥ S.

Answer. Clearly, (I − SS⁻)S = O. Now if QS = O, then Q = Q(I − SS⁻), i.e., the X whose existence is postulated in the definition of a deficiency matrix is Q itself. □
Problem 573. Show that S ⊥ U if and only if S is a matrix with maximal rank which satisfies SU = O. In other words, one cannot add linearly independent rows to S in such a way that the new matrix T still satisfies TU = O.

Answer. First assume S ⊥ U and take any additional row t⊤, so that the stacked matrix [S; t⊤] (S with the row t⊤ appended below) satisfies [S; t⊤]U = [O; o⊤]. Then there exists a stacked matrix [Q; r⊤] such that [S; t⊤] = [Q; r⊤]S, i.e., S = QS and t⊤ = r⊤S. But this last equation means that t⊤ is a linear combination of the rows of S with the r_i as coefficients. Now conversely, assume S is such that one cannot add a linearly independent row t⊤ such that [S; t⊤]U = [O; o⊤], and let PU = O. Then all rows of P must be linear combinations of rows of S (otherwise one could add such a row to S and get the result which was just ruled out), therefore P = AS where A is the matrix of coefficients of these linear combinations. □
The deficiency matrix is not unique, but we will use the concept of a deficiency matrix in a formula only when this formula remains correct for every deficiency matrix. One can make deficiency matrices unique if one requires them to be projection matrices.
Problem 574. Given X and a symmetric nonnegative definite Ω such that X = ΩW for some W. Show that X ⊥ U if and only if X⊤Ω⁻X ⊥ U.

Answer. One has to show that XY = O is equivalent to X⊤Ω⁻XY = O. ⇒ is clear; for ⇐ note that X⊤Ω⁻X = W⊤ΩW, therefore XY = ΩWY = ΩW(W⊤ΩW)⁻W⊤ΩWY = ΩW(W⊤ΩW)⁻X⊤Ω⁻XY = O. □
A matrix is said to have full column rank if all its columns are linearly independent, and full row rank if its rows are linearly independent. The deficiency matrix provides a "holistic" definition for which it is not necessary to look at single rows and columns. X has full column rank if and only if X ⊥ O, and full row rank if and only if O ⊥ X.
Problem 575. Show that the following three statements are equivalent: (1) X has full column rank, (2) X⊤X is nonsingular, and (3) X has a left inverse.
Answer. Here use X ⊥ O as the definition of "full column rank." Then (1) ⇔ (2) is theorem A.4.1. Now (1) ⇒ (3): Since IO = O, a P exists with I = PX. And (3) ⇒ (1): if a P exists with I = PX, then any Q with QO = O can be factored over X, simply say Q = QPX. □
Note that the usual solution of linear matrix equations with g-inverses involves
a deficiency matrix:
Theorem A.4.3. The solution of the consistent matrix equation TX = A is

(A.4.1)   X = T⁻A + UW

where T ⊥ U and W is arbitrary.
Proof. Given consistency, i.e., the existence of at least one Z with TZ = A, (A.4.1) indeed defines a solution, since TX = TT⁻A + TUW = TT⁻TZ = TZ = A. Conversely, if Y satisfies TY = A, then T(Y − T⁻A) = O, therefore Y − T⁻A = UW for some W. □
Theorem A.4.4. Let L ⊥ T ⊥ U and J ⊥ HU ⊥ R; then (writing [A B; C D] for the partitioned matrix with block rows [A B] and [C D], and [T; H] for T stacked on top of H)

[L  O; −JHT⁻  J]  ⊥  [T; H]  ⊥  UR.

Proof. First deficiency relation: Since I − T⁻T = UW for some W, −JHT⁻T + JH = JH(I − T⁻T) = JHUW = O, therefore the matrix product is zero. Now assume [A  B][T; H] = O. Then BHU = O, i.e., B = DJ for some D. Then AT = −DJH, which has as general solution A = −DJHT⁻ + CL for some C. This together gives [A  B] = [C  D][L  O; −JHT⁻  J]. Now the second deficiency relation: clearly, the product of the matrices is zero. If M satisfies TM = O, then M = UN for some N. If M furthermore satisfies HM = O, then HUN = O, therefore N = RP for some P, therefore M = URP. □
Theorem A.4.5. Assume Ω is nonnegative definite symmetric and K is such that KΩ is defined. Then the matrix

(A.4.2)   Ξ = Ω − ΩK⊤(KΩK⊤)⁻KΩ

has the following properties:
(1) Ξ does not depend on the choice of g-inverse of KΩK⊤ used in (A.4.2).
(2) Any g-inverse of Ω is also a g-inverse of Ξ, i.e. ΞΩ⁻Ξ = Ξ.
(3) Ξ is nonnegative definite and symmetric.
(4) For every P ⊥ Ω it follows that [K; P] ⊥ Ξ