CHAPTER 69
Binary Choice Models
69.1. Fisher’s Scoring and Iteratively Reweighted Least Squares
This section draws on chapter 55 about Numerical Minimization. Another important "natural" choice for the positive definite matrix R_i in the gradient method is available if one maximizes a likelihood function: then R_i can be the inverse of the information matrix for the parameter values β_i. This is called Fisher's Scoring method. It is closely related to the Newton-Raphson method. The Newton-Raphson method uses the Hessian matrix, and the information matrix is minus the expected value of the Hessian. Apparently Fisher first used the information matrix as a computational simplification in the Newton-Raphson method. Today IRLS is used in the GLIM program for generalized linear models.
As in chapter 56 discussing nonlinear least squares, β is the vector of parameters of interest, and we will work with an intermediate vector η(β) of predictors whose dimension is comparable to that of the observations. Therefore the likelihood function has the form L = L(y, η(β)). By the chain rule (C.1.23) one can write the Jacobian of the likelihood function as ∂L/∂β⊤(β) = u⊤X, where u⊤ = ∂L/∂η⊤(η(β)) is the Jacobian of L as a function of η, evaluated at η(β), and X = ∂η/∂β⊤(β) is the Jacobian of η. This is the same notation as in the discussion of the Gauss-Newton regression.
Define A = E[uu⊤]. Since X does not depend on the random variables, the information matrix of y with respect to β is then E[X⊤uu⊤X] = X⊤AX. If one uses the inverse of this information matrix as the R-matrix in the gradient algorithm, one gets

(69.1.1)   β_{i+1} = β_i + α_i (X⊤AX)^{−1} X⊤u
The Iteratively Reweighted Least Squares interpretation of this comes from rewriting (69.1.1) as

(69.1.2)   β_{i+1} = β_i + (X⊤AX)^{−1} X⊤A A^{−1}u,

i.e., one obtains the step by regressing A^{−1}u on X with weighting matrix A.
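The following R fragment is a minimal sketch of one such step, not code from these notes; the helper functions score_u() and info_A() are hypothetical placeholders which must return u = ∂L/∂η and the weight matrix A = E[uu⊤] for the model at hand.

## one Fisher scoring / IRLS step as in (69.1.1)-(69.1.2); score_u() and
## info_A() are assumed to be supplied by the user for the specific model
irls_step <- function(beta, X, score_u, info_A, alpha = 1) {
  eta <- X %*% beta                    # current predictor values eta(beta)
  u   <- score_u(eta)                  # gradient of L with respect to eta
  A   <- info_A(eta)                   # E[u u'], diagonal for independent observations
  ## the step is the weighted regression of A^{-1} u on X with weight matrix A,
  ## i.e. (X'AX)^{-1} X'u:
  beta + alpha * solve(t(X) %*% A %*% X, t(X) %*% u)
}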
The justifications of IRLS are: the information matrix is usually analytically simpler than the Hessian of the likelihood function, so it is a convenient approximation; and one needs the information matrix anyway at the end, for the covariance matrix of the M.L. estimators.
69.2. Binary Dependent Variable
Assume each individual in the sample makes an independent random choice between two alternatives, which can conveniently be coded as y_i = 0 or 1. The probability distribution of y_i is fully determined by the probability π_i = Pr[y_i = 1] of the event which has y_i as its indicator function. Then E[y_i] = π_i and var[y_i] = E[y_i²] − (E[y_i])² = E[y_i] − (E[y_i])² = π_i(1 − π_i).
It is usually assumed that the individual choices are stochastically independent of each other, i.e., the distribution of the data is fully characterized by the π_i. Each π_i is assumed to depend on a vector of explanatory variables x_i. There are different approaches to modelling this dependence.
The regression model y_i = x_i⊤β + ε_i with E[ε_i] = 0 is inappropriate because x_i⊤β can take any value, whereas 0 ≤ E[y_i] ≤ 1. Nevertheless, people have been tinkering with it. The obvious first tinker is based on the observation that the ε_i are no longer homoskedastic, but their variance, which is a function of π_i, can be estimated, therefore one can correct for this heteroskedasticity. But things get complicated very quickly and then the main appeal of OLS, its simplicity, is lost. This is a wrong-headed approach, and any smart ideas which one may get when going down this road are simply wasted.
The right way to do this is to set π_i = E[y_i] = Pr[y_i = 1] = h(x_i⊤β) where h is some (necessarily nonlinear) function with values between 0 and 1.
69.2.1. Logit Specification (Logistic Regression). The logit or logistic specification is π_i = e^{x_i⊤β}/(1 + e^{x_i⊤β}). Invert to get log(π_i/(1 − π_i)) = x_i⊤β, i.e., the logarithm of the odds depends linearly on the predictors. The log odds are a natural re-scaling of probabilities to a scale which goes from −∞ to +∞, and which is symmetric in that the log odds of the complement of an event is just the negative of the log odds of the event itself. (See my remarks about the odds ratio in Question 222.)
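As an aside not in the original notes: in R the logit and logistic transformations are available directly as the quantile and distribution functions of the standard logistic distribution, qlogis() and plogis().

p <- 0.8
eta <- qlogis(p)      # logit: log(p/(1-p)) = log(4), about 1.386
plogis(eta)           # logistic: exp(eta)/(1+exp(eta)), recovers 0.8
qlogis(1 - p)         # about -1.386, the negative of qlogis(p), as claimed above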
Problem 560. 1 point If y = log(p/(1 − p)) (logit function), show that p = exp(y)/(1 + exp(y)) (logistic function).

Answer. exp(y) = p/(1 − p); now multiply by 1 − p to get exp(y) − p exp(y) = p, collect terms exp(y) = p(1 + exp(y)), now divide by 1 + exp(y). □
Problem 561. Sometimes one finds the following alternative specification of the logit model: π_i = 1/(1 + e^{−x_i⊤β}). What is the difference between it and our formulation of the logit model? Are these two formulations equivalent?

Answer. It is simply a different parametrization of the same model; this form arises because it comes from an index number problem. □
The logit function is also the canonical link function for the binomial distribution,
see Problem 113.
69.2.2. Probit Model. An important class of functions with values between 0
and 1 is the class of cumulative probability distribution functions. If h is a cumulative
distribution function, then one can give this specification an interesting interpretation
in terms of an unobserved “index variable.”
The index variable model specifies: there is a variable z_i with the property that y_i = 1 if and only if z_i > 0. For instance, the decision y_i whether or not individual i moves to a different location can be modeled by the calculation whether the net benefit of moving, i.e., the wage differential minus the cost of relocation and finding a new job, is positive or not. This moving example is worked out, with references, in [Gre93, pp. 642/3].
The value of the variable z_i is not observed, one only observes y_i, i.e., the only thing one knows about the value of z_i is whether it is positive or not. But it is assumed that z_i is the sum of a deterministic part which is specific to the individual and a random part which has the same distribution for all individuals and is stochastically independent between different individuals. The deterministic part specific to the individual is assumed to depend linearly on individual i's values of the covariates, with coefficients which are common to all individuals. In other words, z_i = x_i⊤β + ε_i, where the ε_i are i.i.d. with cumulative distribution function F_ε. Then it follows

π_i = Pr[y_i = 1] = Pr[z_i > 0] = Pr[ε_i > −x_i⊤β] = 1 − Pr[ε_i ≤ −x_i⊤β] = 1 − F_ε(−x_i⊤β).

I.e., in this case, h(η) = 1 − F_ε(−η). If the distribution of ε_i is symmetric and has a density, then one gets the simpler formula h(η) = F_ε(η).
Which cumulative distribution function should be chosen?
• In practice, the probit model, in which z_i is normal, is the only one used.
• The linear model, in which h is the line segment from (a, 0) to (b, 1), can also be considered generated by an index function z_i which is here uniformly distributed.
• An alternative possible specification with the Cauchy distribution is proposed in [DM93, p. 516]. They say that curiously only logit and probit are being used.
In practice, the probit model is very similar to the logit model, once one has rescaled
the variables to make the variances equal, but the logit model is easier to handle
mathematically.
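A quick numerical illustration, not part of the original notes: matching the variances means comparing the logistic c.d.f. with a normal c.d.f. whose standard deviation is π/√3, since the standard logistic distribution has variance π²/3 (the exact scaling constant one uses is a matter of convention).

eta <- seq(-4, 4, by = 1)
cbind(logit  = plogis(eta),                      # logistic c.d.f.
      probit = pnorm(eta, sd = pi / sqrt(3)))    # normal c.d.f. with matched variance
## the two columns stay close over the whole range, which is why the two
## models give very similar fitted probabilities in practice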
69.2.3. Replicated Data. Before discussing estimation methods I want to briefly address the issue whether or not to write the data in replicated form [MN89, p. 99–101]. If there are several observations for every individual, or if there are several individuals for the same values of the covariates (which can happen if all covariates are categorical), then one can write the data more compactly if one groups the data into so-called "covariate classes," i.e., groups of observations which share the same values of x_i, and defines y_i to be the number of times the decision came out positive in this group. Then one needs a second variable, m_i, which is assumed nonrandom, indicating how many individual decisions are combined in the respective group. This is an equivalent formulation of the data, the only thing one loses is the order in which the observations were made (which may be relevant if there are training or warm-up effects). The original representation of the data is a special case of the grouped form: in the non-grouped form, all m_i = 1. We will from now on write our formulas for the grouped form.
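For concreteness, here is a small R illustration with made-up data (not from the notes) of passing from the individual to the grouped form:

dat <- data.frame(x   = c(1, 1, 2, 2, 2, 3),      # covariate value of each individual
                  yes = c(1, 0, 1, 1, 0, 1))      # individual binary decisions
y <- tapply(dat$yes, dat$x, sum)     # y_i: positive decisions per covariate class
m <- tapply(dat$yes, dat$x, length)  # m_i: number of individuals per covariate class
cbind(y, m)                          # grouped form of the same data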
69.2.4. Estimation. Maximum likelihood is the preferred estimation method. The likelihood function has the form L = ∏ π_i^{y_i}(1 − π_i)^{m_i − y_i}. This likelihood function is not derived from a density, but from a probability mass function. For instance, in the case with non-replicated data, all m_i = 1: if you have n binary measurements, then you can have only 2^n different outcomes, and the probability of the sequence y_1, . . . , y_n = 0, 1, 0, 0, . . . , 1 is as given above.
This is a highly nonlinear maximization and must be done numerically. Let us
go through the method of scoring in the example of a logit distribution.
(69.2.1)   L = Σ_i [ y_i log π_i + (m_i − y_i) log(1 − π_i) ]

(69.2.2)   ∂L/∂π_i = y_i/π_i − (m_i − y_i)/(1 − π_i)

(69.2.3)   ∂²L/∂π_i² = −( y_i/π_i² + (m_i − y_i)/(1 − π_i)² )
Defining η = Xβ, the logit specification can be written as π_i = e^{η_i}/(1 + e^{η_i}). Differentiation gives ∂π_i/∂η_i = π_i(1 − π_i). Combine this with (69.2.2) to get

(69.2.4)   u_i = ∂L/∂η_i = ( y_i/π_i − (m_i − y_i)/(1 − π_i) ) π_i(1 − π_i) = y_i − m_i π_i.
These are the elements of u in (69.1.1), and they have a very simple meaning: they are just the observations minus their expected values. Therefore one obtains immediately that A = E[uu⊤] is a diagonal matrix with m_i π_i(1 − π_i) in the diagonal.
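The following R sketch (made-up data, not an example from the notes) carries out this scoring iteration for a small grouped logit problem and compares the result with R's built-in IRLS in glm():

x <- c(-2, -1, 0, 1, 2)
X <- cbind(1, x)                    # design matrix with an intercept
m <- c(10, 12, 15, 12, 10)          # group sizes m_i
y <- c(1, 3, 8, 9, 9)               # positive decisions y_i per group

beta <- c(0, 0)
for (it in 1:25) {
  prob <- plogis(X %*% beta)                      # pi_i = e^{eta_i}/(1 + e^{eta_i})
  u    <- y - m * prob                            # score (69.2.4)
  A    <- diag(as.vector(m * prob * (1 - prob)))  # E[uu'] on the diagonal
  beta <- beta + solve(t(X) %*% A %*% X, t(X) %*% u)   # step (69.1.1)
}
drop(beta)
coef(glm(cbind(y, m - y) ~ x, family = binomial))     # same estimates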
Problem 562. 6 points Show that for the maximization of the likelihood func-
tion of the logit model, Fisher’s scoring method is equivalent to the Newton-Raphson
algorithm.
Problem 563. Show that in the logistic model, Σ_i m_i π̂_i = Σ_i y_i.
69.3. The Generalized Linear Model
The binary choice models show how the linear model can be generalized. [MN89,
p. 27–32] develop a unified theory of many different interesting models, called the
“generalized linear model.” The following few paragraphs are indebted to the elabo-
rate and useful web site about Generalized Linear Models maintained by Gordon K.
Smyth at www.maths.uq.oz.au/~gks/research/glm
In which cases is it necessary to go beyond linear models? The most important and common situation is one in which y_i and µ_i = E[y_i] are bounded:
• If y represents the amount of some physical substance then we may have y ≥ 0 and µ ≥ 0.
• If y is binary, i.e., y = 1 if an animal survives and y = 0 if it does not, then 0 ≤ µ ≤ 1.
The linear model is inadequate here because complicated and unnatural constraints
on β would be required to make sure that µ stays in the feasible range. Generalized
linear models instead assume a link linear relationship
(69.3.1) g(µ) = Xβ
where g() is some known monotonic function which acts pointwise on µ. Typically g() is used to transform the µ_i to a scale on which they are unconstrained. For example we might use g(µ) = log(µ) if µ_i > 0 or g(µ) = log(µ/(1 − µ)) if 0 < µ_i < 1.
The same reasons which force us to abandon the linear model also force us to abandon the assumption of normality. If y is bounded then the variance of y must depend on its mean. Specifically if µ is close to a boundary for y then var(y) must be small. For example, if y > 0, then we must have var(y) → 0 as µ → 0. For this reason strictly positive data almost always shows increasing variability with increased size. If 0 < y < 1, then var(y) → 0 as µ → 0 or µ → 1. For this reason, generalized linear models assume that

(69.3.2)   var(y_i) = φ · V(µ_i)

where φ is an unknown scale factor and V() is some known variance function appropriate for the data at hand.
We therefore estimate the nonlinear regression equation (69.3.1) weighting the observations inversely according to the variance functions V(µ_i). This weighting procedure turns out to be exactly equivalent to maximum likelihood estimation when the observations actually come from an exponential family distribution.
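In R this machinery is packaged in glm(), whose family argument bundles the link g() and the variance function V(). The following small simulation is my own illustration, not from the notes: it fits a log-link model with gamma variance V(µ) = µ² to made-up positive data.

set.seed(1)
x  <- runif(40)
mu <- exp(1 + 2 * x)                          # true mean on the log-link scale
y  <- rgamma(40, shape = 5, rate = 5 / mu)    # E[y] = mu, var[y] = mu^2 / 5
fit <- glm(y ~ x, family = Gamma(link = "log"))
coef(fit)                   # roughly recovers the coefficients 1 and 2
summary(fit)$dispersion     # estimate of the scale factor phi (about 1/5 here)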
Problem 564. Describe estimation situations in which a linear model and Nor-
mal distribution are not appropriate.
The generalized linear model has the following components:
• Random component: Instead of being normally distributed, the components of y have a distribution in the exponential family.
• Introduce a new symbol η = Xβ.
• A monotonic univariate link function g so that η_i = g(µ_i) where µ = E[y].
The generalized linear model allows for a nonlinear link function g specifying that transformation of the expected value of the response variable which depends linearly on the predictors:

(69.3.3)   g(E[y_i]) = x_i⊤β.

Its random specification is such that var[y] depends on E[y] through a variance function φ · V (where φ is a constant taking the place of σ² in the regression model):

(69.3.4)   var[y] = φ · V(E[y])
We have seen earlier that these mean- and variance functions are not an artificial
construct, but that the distributions from the “exponential dispersion family,” see
Section 6.2, naturally give rise to such mean and variance functions. But just as
much of the theory of the linear model can be derived without the assumption that
the residuals are normally distributed, many of the results about generalized linear
models do not require us to specify the whole distribution but can be derived on the
basis of the mean and variance functions alone.
CHAPTER 70
Multiple Choice Models
Discrete choice between three or more alternatives; these models came from the analysis of transportation choices.
The outcomes of these choices should no longer be represented by a vector y, but one needs a matrix Y with y_ij = 1 if the ith individual chooses the jth alternative, and 0 otherwise. Consider only three alternatives j = 1, 2, 3, and define Pr(y_ij = 1) = π_ij.
The Conditional Logit model is a model which makes all π_ij dependent on x_i. It is a very simple extension of binary choice. In binary choice we had log(π_i/(1 − π_i)) = x_i⊤β, the log of the odds ratio. Here this is generalized to log(π_i2/π_i1) = x_i⊤β_2 and log(π_i3/π_i1) = x_i⊤β_3. From this we obtain

(70.0.5)   π_i1 = 1 − π_i2 − π_i3 = 1 − π_i1 e^{x_i⊤β_2} − π_i1 e^{x_i⊤β_3},

or

(70.0.6)   π_i1 = 1/(1 + e^{x_i⊤β_2} + e^{x_i⊤β_3}),

(70.0.7)   π_i2 = e^{x_i⊤β_2}/(1 + e^{x_i⊤β_2} + e^{x_i⊤β_3}),

(70.0.8)   π_i3 = e^{x_i⊤β_3}/(1 + e^{x_i⊤β_2} + e^{x_i⊤β_3}).
One can write this as π_ij = e^{α_j⊤x_i} / Σ_k e^{α_k⊤x_i}, with α_2 = β_2 and α_3 = β_3, if one defines α_1 = β_1 = 0. The only estimation method used is MLE.
(70.0.9)   L = ∏_i π_i1^{y_i1} π_i2^{y_i2} π_i3^{y_i3} = ∏_i (e^{x_i⊤β_2})^{y_i2} (e^{x_i⊤β_3})^{y_i3} / (1 + e^{x_i⊤β_2} + e^{x_i⊤β_3}).
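A small R sketch (my own illustration, with made-up coefficient vectors) of the probabilities (70.0.6)–(70.0.8), with alternative 1 as the baseline:

multi_logit_probs <- function(xi, beta2, beta3) {
  e2 <- exp(sum(xi * beta2))          # exp(x_i' beta_2)
  e3 <- exp(sum(xi * beta3))          # exp(x_i' beta_3)
  c(pi1 = 1, pi2 = e2, pi3 = e3) / (1 + e2 + e3)
}
p <- multi_logit_probs(xi = c(1, 0.5), beta2 = c(0.2, 1.0), beta3 = c(-0.4, 0.8))
p
sum(p)                                # the three probabilities sum to one
log(p["pi2"] / p["pi1"])              # equals x_i' beta_2 = 0.7, as in the text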
Note: the odds are independent of all other alternatives. Therefore the alternatives must be chosen such that this independence is a good assumption. The choice between walking, car, red buses, and blue buses does not satisfy this. See [Cra91, p. 47] for the best explanation of this that I have found so far.
APPENDIX A
Matrix Formulas
In this Appendix, efforts are made to give some of the familiar matrix lemmas in their most general form. The reader should be warned: the concept of a deficiency matrix, and the notation which uses a thick fraction line for multiplication by a scalar g-inverse, are my own.
A.1. A Fundamental Matrix Decomposition
Theorem A.1.1. Every matrix B which is not the null matrix can be written
as a product of two matrices B = CD, where C has a left inverse L and D a right
inverse R, i.e., LC = DR = I. This identity matrix is r × r, where r is the rank
of B.
A proof is in [Rao73, p. 19]. This is the algebraic fact that every homomorphism can be written as a product of an epimorphism and a monomorphism, together with the fact that all epimorphisms and monomorphisms split, i.e., have one-sided inverses.
One such factorization is given by the singular value theorem: If B = P⊤ΛQ is the SVD as in Theorem A.9.2, then one might set e.g. C = P⊤Λ and D = Q, consequently L = Λ^{−1}P and R = Q⊤. In this decomposition, the first row/column carries the largest weight and gives the best approximation in a least squares sense, etc.
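A numerical check of this factorization in R (my own illustration; R's svd() returns B = UDV⊤, so U, diag(d) and t(V) play the roles of P⊤, Λ and Q):

B <- matrix(c(1, 2, 3,
              2, 4, 6,
              1, 1, 1), nrow = 3, byrow = TRUE)   # made-up matrix of rank 2
s <- svd(B)
r <- sum(s$d > 1e-10)                              # numerical rank
C <- s$u[, 1:r] %*% diag(s$d[1:r])                 # C = P' Lambda (rank-r part)
D <- t(s$v[, 1:r])                                 # D = Q
max(abs(B - C %*% D))                              # ~ 0:  B = CD
L <- diag(1 / s$d[1:r]) %*% t(s$u[, 1:r])          # left inverse of C
R <- s$v[, 1:r]                                    # right inverse of D
max(abs(L %*% C - diag(r))); max(abs(D %*% R - diag(r)))   # both ~ 0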
The trace of a square matrix is defined as the sum of its diagonal elements. The
rank of a matrix is defined as the number of its linearly independent rows, which is
equal to the number of its linearly independent columns (row rank = column rank).
Theorem A.1.2. tr BC = tr CB.
Problem 565. Prove theorem A.1.2.
Problem 566. Use theorem A.1.1 to prove that if BB = B, then rank B =
tr B.
Answer. Premultiply the equation CD = CDCD by L and postmultiply it by R to get DC = I_r. This is useful for the trace: tr B = tr CD = tr DC = tr I_r = r. I have this proof from [Rao73, p. 28]. □
Theorem A.1.3. B = O if and only if B⊤B = O.
A.2. The Spectral Norm of a Matrix
The spectral norm of a matrix extends the Euclidean norm ‖z‖ from vectors to matrices. Its definition is ‖A‖ = max_{‖z‖=1} ‖Az‖. This spectral norm is the maximum singular value µ_max, and if A is square and nonsingular, then ‖A^{−1}‖ = 1/µ_min. It is a true norm, i.e., ‖A‖ = 0 if and only if A = O; furthermore ‖λA‖ = |λ|·‖A‖, and the triangle inequality ‖A + B‖ ≤ ‖A‖ + ‖B‖ holds. In addition, it obeys ‖AB‖ ≤ ‖A‖·‖B‖.
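In R the spectral norm is available as norm(A, type = "2"), which lets one verify these statements numerically (my own illustration):

A <- matrix(c(2, 1, 0, 3), 2, 2)     # made-up nonsingular matrix
norm(A, type = "2")                  # spectral norm
max(svd(A)$d)                        # the maximum singular value: same number
norm(solve(A), type = "2")           # equals 1 / (minimum singular value):
1 / min(svd(A)$d)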
Problem 567. Show that the spectral norm is the maximum singular value.
Answer. Use the definition

(A.2.1)   ‖A‖² = max_z (z⊤A⊤Az)/(z⊤z).

Write A = P⊤ΛQ as in (A.9.1). Then z⊤A⊤Az = z⊤Q⊤Λ²Qz. Therefore we can first show: there is a z in the form z = Q⊤x which attains this maximum. Proof: for every z which has a nonzero value in the numerator of (A.2.1), set x = Qz. Then x ≠ o, and Q⊤x attains the same value as z in the numerator of (A.2.1), and a smaller or equal value in the denominator. Therefore one can restrict the search for the maximum argument to vectors of the form Q⊤x. But for them the objective function becomes (x⊤Λ²x)/(x⊤x), which is maximized by x = i_1, the first unit vector (or column vector of the unit matrix). Therefore the squared spectral norm is λ²_11, and therefore the spectral norm itself is λ_11. □
A.3. Inverses and g-Inverses of Matrices
A g-inverse of a matrix A is any matrix A⁻ satisfying

(A.3.1)   A = AA⁻A.

It always exists but is not always unique. If A is square and nonsingular, then A^{−1} is its only g-inverse.
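A numerical sketch in R (my own illustration): a g-inverse of a singular matrix can be built from its SVD by inverting only the nonzero singular values; the result here happens to be the Moore-Penrose inverse, which is one particular g-inverse.

A <- matrix(c(1, 2,
              2, 4), 2, 2, byrow = TRUE)            # made-up singular matrix, rank 1
s   <- svd(A)
pos <- s$d > 1e-10
Ag  <- s$v[, pos, drop = FALSE] %*% diag(1 / s$d[pos], sum(pos)) %*%
       t(s$u[, pos, drop = FALSE])
max(abs(A %*% Ag %*% A - A))                        # ~ 0, so (A.3.1) holds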
Problem 568. Show that a symmetric matrix Ω has a g-inverse which is also symmetric.

Answer. If Ω⁻ is any g-inverse of Ω, use Ω⁻Ω(Ω⁻)⊤. □
The definition of a g-inverse is apparently due to [Rao62]. It is sometimes called the "conditional inverse" [Gra83, p. 129]. This g-inverse, and not the Moore-Penrose generalized inverse or pseudoinverse A⁺, is needed for the linear model. The Moore-Penrose generalized inverse is a g-inverse that in addition satisfies A⁺AA⁺ = A⁺, and AA⁺ as well as A⁺A symmetric. It always exists and is also unique, but the additional requirements are burdensome ballast. [Gre97, pp. 44–5] also advocates the Moore-Penrose inverse, but he does not really use it. If he were to try to use it, he would probably soon discover that it is not appropriate. The book [Alb72] does the linear model with the Moore-Penrose inverse. It is a good demonstration of how complicated everything gets if one uses an inappropriate mathematical tool.
Problem 569. Use theorem A.1.1 to prove that every matrix has a g-inverse.

Answer. Simple: a null matrix has its transpose as g-inverse, and if A ≠ O then RL is such a g-inverse. □
The g-inverse of a number is its inverse if the number is nonzero, and is arbitrary otherwise. Scalar expressions written as fractions are in many cases the multiplication by a g-inverse. We will use a fraction with a thick horizontal rule to indicate where this is the case. In other words, by definition,

(A.3.2)   a/b (with the thick fraction rule) = b⁻a.

Compare that with the ordinary fraction a/b. This idiosyncratic notation allows one to write certain theorems in a more concise form, but it requires more work in the proofs, because one has to consider the additional case that the denominator is zero. Theorems A.5.8 and A.8.2 are examples.
Theorem A.3.1. If B = AA⁻B holds for one g-inverse A⁻ of A, then it holds for all g-inverses. If A is symmetric and B = AA⁻B, then also B⊤ = B⊤A⁻A. If B = BA⁻A and C = AA⁻C, then BA⁻C is independent of the choice of g-inverses.
Proof. Assume the identity B = AA⁺B holds for some fixed g-inverse A⁺ (which may be, as the notation suggests, the Moore-Penrose g-inverse, but this is not necessary), and let A⁻ be a different g-inverse. Then AA⁻B = AA⁻AA⁺B = AA⁺B = B. For the second statement one merely has to take transposes and note that a matrix is a g-inverse of a symmetric A if and only if its transpose is. For the third statement: BA⁺C = BA⁻AA⁺AA⁻C = BA⁻AA⁻C = BA⁻C. Here A⁺ signifies a different g-inverse; again, it is not necessarily the Moore-Penrose one. □
Problem 570. Show that x satisfies x = Ba for some a if and only if x = BB⁻x.
Theorem A.3.2. Both A⊤(AA⊤)⁻ and (A⊤A)⁻A⊤ are g-inverses of A.

Proof. We have to show

(A.3.3)   A = AA⊤(AA⊤)⁻A,

which is [Rao73, (1b.5.5) on p. 26]. Define D = A − AA⊤(AA⊤)⁻A and show, by multiplying out, that DD⊤ = O. □
A.4. Deficiency Matrices
Here is again some idiosyncratic terminology and notation. It gives an explicit algebraic formulation for something that is often done implicitly or in a geometric paradigm. A matrix G will be called a "left deficiency matrix" of S, in symbols G ⊥ S, if GS = O, and for all Q with QS = O there is an X with Q = XG. This factorization property is an algebraic formulation of the geometric concept of a null space. It is symmetric in the sense that G ⊥ S is also equivalent with: GS = O, and for all R with GR = O there is a Y with R = SY. In other words, G ⊥ S and S⊤ ⊥ G⊤ are equivalent.

This symmetry follows from the following characterization of a deficiency matrix, which is symmetric:

Theorem A.4.1. T ⊥ U iff TU = O and T⊤T + UU⊤ is nonsingular.
Proof. This proof here seems terribly complicated. There must be a simpler way. Proof of "⇒": Assume T ⊥ U. Take any γ with γ⊤T⊤Tγ + γ⊤UU⊤γ = 0, i.e., Tγ = o and γ⊤U = o⊤. From this one can show that γ = o: since Tγ = o, there is a ξ with γ = Uξ, therefore γ⊤γ = γ⊤Uξ = 0. To prove "⇐" assume TU = O and T⊤T + UU⊤ is nonsingular. To show that T ⊥ U take any B with BU = O. Then B = B(T⊤T + UU⊤)(T⊤T + UU⊤)^{−1} = BT⊤T(T⊤T + UU⊤)^{−1}. In the same way one gets T = TT⊤T(T⊤T + UU⊤)^{−1}. Premultiply this last equation by T⊤T(T⊤TT⊤T)⁻T⊤ and use theorem A.3.2 to get T⊤T(T⊤TT⊤T)⁻T⊤T = T⊤T(T⊤T + UU⊤)^{−1}. Inserting this into the equation for B gives B = BT⊤T(T⊤TT⊤T)⁻T⊤T, i.e., B factors over T. □
The R/Splus-function Null gives the transpose of a deficiency matrix.
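This can be checked numerically (my own illustration with a made-up matrix; the MASS package must be attached for Null):

library(MASS)
S <- matrix(c(1, 0,
              0, 1,
              1, 1), nrow = 3, byrow = TRUE)   # 3 x 2 matrix of rank 2
G <- t(Null(S))                                # transpose of Null(S): a left deficiency matrix
max(abs(G %*% S))                              # ~ 0, i.e. G S = O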
Theorem A.4.2. If for all Y, BY = O implies AY = O, then an X exists with A = XB.
Problem 571. Prove theorem A.4.2.

Answer. Let B ⊥ C. Choosing Y = C, it follows that AC = O, hence such an X exists. □
Problem 572. Show that I − SS⁻ ⊥ S.

Answer. Clearly, (I − SS⁻)S = O. Now if QS = O, then Q = Q(I − SS⁻), i.e., the X whose existence is postulated in the definition of a deficiency matrix is Q itself. □
Problem 573. Show that S ⊥ U if and only if S is a matrix with maximal rank which satisfies SU = O. In other words, one cannot add linearly independent rows to S in such a way that the new matrix T still satisfies TU = O.

Answer. First assume S ⊥ U and take any additional row t⊤, so that the stacked matrix [S; t⊤] (S with the row t⊤ appended below) satisfies [S; t⊤]U = [O; o⊤]. Then there exists a stacked matrix [Q; r⊤] such that [S; t⊤] = [Q; r⊤]S, i.e., S = QS and t⊤ = r⊤S. But this last equation means that t⊤ is a linear combination of the rows of S with the r_i as coefficients. Now conversely, assume S is such that one cannot add a linearly independent row t⊤ such that [S; t⊤]U = [O; o⊤], and let PU = O. Then all rows of P must be linear combinations of rows of S (otherwise one could add such a row to S and get the result which was just ruled out), therefore P = AS where A is the matrix of coefficients of these linear combinations. □
The deficiency matrix is not unique, but we will use the concept of a deficiency matrix in a formula only when this formula remains correct for every deficiency matrix. One can make deficiency matrices unique if one requires them to be projection matrices.
Problem 574. Given X and a symmetric nonnegative definite Ω such that X = ΩW for some W. Show that X ⊥ U if and only if X⊤Ω⁻X ⊥ U.

Answer. One has to show that XY = O is equivalent to X⊤Ω⁻XY = O. ⇒ is clear; for ⇐ note that X⊤Ω⁻X = W⊤ΩW, therefore XY = ΩWY = ΩW(W⊤ΩW)⁻W⊤ΩWY = ΩW(W⊤ΩW)⁻X⊤Ω⁻XY = O. □
A matrix is said to have full column rank if all its columns are linearly independent, and full row rank if its rows are linearly independent. The deficiency matrix provides a "holistic" definition for which it is not necessary to look at single rows and columns. X has full column rank if and only if X ⊥ O, and full row rank if and only if O ⊥ X.
Problem 575. Show that the following three statements are equivalent: (1) X has full column rank, (2) X⊤X is nonsingular, and (3) X has a left inverse.
Answer. Here use X ⊥ O as the definition of "full column rank." Then (1) ⇔ (2) is theorem A.4.1. Now (1) ⇒ (3): Since IO = O, a P exists with I = PX. And (3) ⇒ (1): if a P exists with I = PX, then any Q with QO = O can be factored over X, simply say Q = QPX. □
Note that the usual solution of linear matrix equations with g-inverses involves
a deficiency matrix:
Theorem A.4.3. The solution of the consistent matrix equation TX = A is

(A.4.1)   X = T⁻A + UW

where T ⊥ U and W is arbitrary.
Proof. Given consistency, i.e., the existence of at least one Z with TZ = A, (A.4.1) indeed defines a solution, since TX = TT⁻A + TUW = TT⁻TZ = TZ = A. Conversely, if Y satisfies TY = A, then T(Y − T⁻A) = O, therefore Y − T⁻A = UW for some W. □
Theorem A.4.4. Let L ⊥ T ⊥ U and J ⊥ HU ⊥ R; then (writing [A B; C D] for the partitioned matrix with block rows [A B] and [C D], and [T; H] for T stacked on top of H)

[L  O; −JHT⁻  J]  ⊥  [T; H]  ⊥  UR.

Proof. First deficiency relation: Since I − T⁻T = UW for some W, −JHT⁻T + JH = JH(I − T⁻T) = JHUW = O, therefore the matrix product is zero. Now assume [A  B][T; H] = O. Then BHU = O, i.e., B = DJ for some D. Then AT = −DJH, which has as general solution A = −DJHT⁻ + CL for some C. This together gives [A  B] = [C  D][L  O; −JHT⁻  J]. Now the second deficiency relation: clearly, the product of the matrices is zero. If M satisfies TM = O, then M = UN for some N. If M furthermore satisfies HM = O, then HUN = O, therefore N = RP for some P, therefore M = URP. □
Theorem A.4.5. Assume Ω is nonnegative definite symmetric and K is such that KΩ is defined. Then the matrix

(A.4.2)   Ξ = Ω − ΩK⊤(KΩK⊤)⁻KΩ

has the following properties:
(1) Ξ does not depend on the choice of g-inverse of KΩK⊤ used in (A.4.2).
(2) Any g-inverse of Ω is also a g-inverse of Ξ, i.e. ΞΩ⁻Ξ = Ξ.
(3) Ξ is nonnegative definite and symmetric.
(4) For every P ⊥ Ω it follows that [K; P] ⊥ Ξ