CHAPTER 19
Digression about Correlation Coefficients
19.1. A Unified Definition of Correlation Coefficients
Correlation coefficients measure linear association. The usual definition of the simple correlation coefficient ρ_xy between two variables (sometimes we also use the notation corr[x, y]) is their standardized covariance

(19.1.1) \rho_{xy} = \frac{\operatorname{cov}[x,y]}{\sqrt{\operatorname{var}[x]}\,\sqrt{\operatorname{var}[y]}}.

Because of the Cauchy-Schwarz inequality, its value lies between −1 and 1.
Problem 254. Given constant scalars a ≠ 0 and c ≠ 0 and arbitrary b and d. Show that corr[x, y] = ± corr[ax + b, cy + d], with the + sign being valid if a and c have the same sign, and the − sign otherwise.
Answer. Start with cov[ax + b, cy + d] = ac cov[x, y] and go from there. 
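As a quick numerical illustration of this answer (an addition, not part of the original notes), the following fragment checks the sign behaviour on simulated data; the sample size and the values of a, b, c, d are arbitrary. Python with NumPy is used here and in the later sketches, rather than the Splus mentioned at the end of chapter 20.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.6 * x + rng.normal(size=1000)        # some correlated data

def corr(u, v):
    """Simple correlation coefficient, cov[u,v]/sqrt(var[u] var[v])."""
    return np.cov(u, v)[0, 1] / np.sqrt(np.var(u, ddof=1) * np.var(v, ddof=1))

a, b, c, d = 2.0, 5.0, -3.0, 1.0           # a and c nonzero, with opposite signs
print(corr(x, y))                          # rho_xy
print(corr(a * x + b, c * y + d))          # equals -rho_xy, since a*c < 0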
Besides the simple correlation coefficient ρ_xy between two scalar variables y and x, one can also define the squared multiple correlation coefficient ρ²_{y(x)} between one scalar variable y and a whole vector of variables x, and the partial correlation coefficient ρ_{12.x} between two scalar variables y_1 and y_2, with a vector of other variables x "partialled out." The multiple correlation coefficient measures the strength of a linear association between y and all components of x together, and the partial correlation coefficient measures the strength of that part of the linear association between y_1 and y_2 which cannot be attributed to their joint association with x. One can also define partial multiple correlation coefficients. If one wants to measure the linear association between two vectors, then one number is no longer enough, but one needs several numbers, the "canonical correlations."
The multiple or partial correlation coefficients are usually defined as simple correlation coefficients involving the best linear predictor or its residual. But all these correlation coefficients share the property that they indicate a proportionate reduction in the MSE. See e.g. [Rao73, pp. 268–70]. Problem 255 makes this point for the simple correlation coefficient:
Problem 255. 4 points Show that the proportionate reduction in the MSE of the best predictor of y, if one goes from predictors of the form y* = a to predictors of the form y* = a + bx, is equal to the squared correlation coefficient between y and x. You are allowed to use the results of Problems 229 and 240. To set notation, call the minimum MSE in the first prediction (Problem 229) MSE[constant term; y], and the minimum MSE in the second prediction (Problem 240) MSE[constant term and x; y]. Show that

(19.1.2) \frac{\text{MSE}[\text{constant term};\,y] - \text{MSE}[\text{constant term and }x;\,y]}{\text{MSE}[\text{constant term};\,y]} = \frac{(\operatorname{cov}[y,x])^2}{\operatorname{var}[y]\operatorname{var}[x]} = \rho^2_{yx}.
Answer. The minimum MSE with only a constant is var[y], and (18.2.32) says that MSE[constant term and x; y] = var[y] − (cov[x, y])²/var[x]. Therefore the difference in MSE's is (cov[x, y])²/var[x], and if one divides by var[y] to get the relative difference, one gets exactly the squared correlation coefficient.
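The following small NumPy computation (added here for illustration; the simulated data are arbitrary) verifies the sample analogue of (19.1.2): the relative reduction in the mean squared residual when one goes from the sample mean to the fitted regression line equals the squared sample correlation.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 1.0 + 2.0 * x + rng.normal(size=500)

# Best predictor of the form y* = a: the sample mean; its MSE is var[y].
mse_const = np.mean((y - y.mean()) ** 2)

# Best predictor of the form y* = a + b x: ordinary least squares.
b = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
a = y.mean() - b * x.mean()
mse_const_x = np.mean((y - (a + b * x)) ** 2)

rho2 = np.cov(x, y, ddof=0)[0, 1] ** 2 / (np.var(x) * np.var(y))
print((mse_const - mse_const_x) / mse_const, rho2)   # the two numbers agree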
Multiple Correlation Coefficients. Now assume x is a vector while y remains a scalar. Their joint mean vector and dispersion matrix are

(19.1.3) \begin{bmatrix} x \\ y \end{bmatrix} \sim \begin{bmatrix} \mu \\ \nu \end{bmatrix}, \quad \sigma^2 \begin{bmatrix} \Omega_{xx} & \omega_{xy} \\ \omega_{xy}^{\top} & \omega_{yy} \end{bmatrix}.
By theorem ??, the best linear predictor of y based on x has the formula

(19.1.4) y^{*} = \nu + \omega_{xy}^{\top}\Omega_{xx}^{-}(x - \mu).
y* has the following additional extremal value property: no linear combination b⊤x has a higher squared correlation with y than y*. This maximal value of the squared correlation is called the squared multiple correlation coefficient

(19.1.5) \rho^2_{y(x)} = \frac{\omega_{xy}^{\top}\Omega_{xx}^{-}\omega_{xy}}{\omega_{yy}}.
The multiple correlation coefficient itself is the positive square root, i.e., it is always
nonnegative, while some other correlation coefficients may take on negative values.
The squared multiple correlation coefficient can also be defined in terms of the proportionate reduction in MSE. It is equal to the proportionate reduction in the MSE of the best predictor of y if one goes from predictors of the form y* = a to predictors of the form y* = a + b⊤x, i.e.,

(19.1.6) \rho^2_{y(x)} = \frac{\text{MSE}[\text{constant term};\,y] - \text{MSE}[\text{constant term and }x;\,y]}{\text{MSE}[\text{constant term};\,y]}.

There are therefore two natural definitions of the multiple correlation coefficient. These two definitions correspond to the two formulas for R² in (18.3.6).
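Here is a small NumPy sketch (an addition to the notes; the simulated data are arbitrary) which computes the squared multiple correlation coefficient both ways: from the sample moments as in (19.1.5) and as the proportionate reduction in MSE as in (19.1.6). The two numbers coincide.

import numpy as np

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 2))                       # the vector x has two components
y = 1.0 + X @ np.array([1.5, -0.5]) + rng.normal(size=n)

# (19.1.5) with sample moments: omega_xy' Omega_xx^{-} omega_xy / omega_yy
S = np.cov(np.column_stack([X, y]), rowvar=False)
Oxx, oxy, oyy = S[:2, :2], S[:2, 2], S[2, 2]
rho2_moment = oxy @ np.linalg.solve(Oxx, oxy) / oyy

# (19.1.6): proportionate reduction in MSE, constant only vs. constant and x
mse_const = np.mean((y - y.mean()) ** 2)
Z = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
mse_full = np.mean((y - Z @ beta) ** 2)
print(rho2_moment, (mse_const - mse_full) / mse_const)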
Partial Correlation Coefficients. Now assume y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} is a vector with two elements and write

(19.1.7) \begin{bmatrix} x \\ y_1 \\ y_2 \end{bmatrix} \sim \begin{bmatrix} \mu \\ \nu_1 \\ \nu_2 \end{bmatrix}, \quad \sigma^2 \begin{bmatrix} \Omega_{xx} & \omega_{y1} & \omega_{y2} \\ \omega_{y1}^{\top} & \omega_{11} & \omega_{12} \\ \omega_{y2}^{\top} & \omega_{21} & \omega_{22} \end{bmatrix}.
Let y* be the best linear predictor of y based on x. The partial correlation coefficient ρ_{12.x} is defined to be the simple correlation between the residuals, corr[(y_1 − y*_1), (y_2 − y*_2)]. This measures the correlation between y_1 and y_2 which is "local," i.e., which does not follow from their association with x. Assume for instance that both y_1 and y_2 are highly correlated with x. Then they will also have a high correlation with each other. Subtracting y*_i from y_i eliminates this dependency on x, therefore any remaining correlation is "local." Compare [Krz88, p. 475].
The partial correlation coefficient can be defined as the relative reduction in the MSE if one adds y_1 to x as a predictor of y_2:

(19.1.8) \rho^2_{12.x} = \frac{\text{MSE}[\text{constant term and }x;\,y_2] - \text{MSE}[\text{constant term, }x\text{, and }y_1;\,y_2]}{\text{MSE}[\text{constant term and }x;\,y_2]}.
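The following NumPy sketch (added for illustration, with arbitrary simulated data) computes the squared partial correlation both as the squared correlation of the two residual vectors and as the relative reduction in MSE from (19.1.8); the sample versions of the two definitions agree exactly.

import numpy as np

rng = np.random.default_rng(3)
n = 3000
x = rng.normal(size=(n, 2))
common = x @ np.array([1.0, 1.0])
e = rng.multivariate_normal([0, 0], [[1.0, 0.4], [0.4, 1.0]], size=n)
y1 = common + e[:, 0]
y2 = 2 * common + e[:, 1]

def ols_resid(y, Z):
    """Least squares residuals of y on the columns of Z."""
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ beta

Zc = np.column_stack([np.ones(n), x])             # constant and x
r1, r2 = ols_resid(y1, Zc), ols_resid(y2, Zc)
rho_resid = np.corrcoef(r1, r2)[0, 1]             # corr of the two residuals

# (19.1.8): relative reduction in MSE when y1 is added as a regressor for y2
mse_x  = np.mean(ols_resid(y2, Zc) ** 2)
mse_x1 = np.mean(ols_resid(y2, np.column_stack([Zc, y1])) ** 2)
print(rho_resid ** 2, (mse_x - mse_x1) / mse_x)   # both are rho^2_{12.x}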
Problem 256. Using the definitions in terms of MSE's, show that the following relationship holds between the squares of multiple and partial correlation coefficients:

(19.1.9) 1 - \rho^2_{2(x,1)} = (1 - \rho^2_{21.x})(1 - \rho^2_{2(x)})

Answer. In terms of the MSE, (19.1.9) reads

(19.1.10) \frac{\text{MSE}[\text{constant term, }x\text{, and }y_1;\,y_2]}{\text{MSE}[\text{constant term};\,y_2]} = \frac{\text{MSE}[\text{constant term, }x\text{, and }y_1;\,y_2]}{\text{MSE}[\text{constant term and }x;\,y_2]} \cdot \frac{\text{MSE}[\text{constant term and }x;\,y_2]}{\text{MSE}[\text{constant term};\,y_2]}.

From (19.1.9) follows the following weighted average formula:

(19.1.11) \rho^2_{2(x,1)} = \rho^2_{2(x)} + (1 - \rho^2_{2(x)})\,\rho^2_{21.x}

An alternative proof of (19.1.11) is given in [Gra76, pp. 116/17].
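One can check (19.1.9) and (19.1.11) directly on a concrete dispersion matrix. The following sketch (an addition to the notes; the matrix entries are arbitrary, and x is taken to be scalar for brevity) computes all three squared correlation coefficients from population moments.

import numpy as np

# Joint covariance of (x, y1, y2), with x scalar here; any positive definite
# matrix would do, these numbers are just an illustration.
S = np.array([[1.0, 0.8, 0.6],
              [0.8, 2.0, 1.1],
              [0.6, 1.1, 1.5]])

def rho2(S, yidx, xidx):
    """Squared multiple correlation of variable yidx on the variables xidx."""
    Oxx = S[np.ix_(xidx, xidx)]
    oxy = S[np.ix_(xidx, [yidx])].ravel()
    return oxy @ np.linalg.solve(Oxx, oxy) / S[yidx, yidx]

# Partial covariance matrix of (y1, y2) given x, then the partial correlation
C = S[1:, 1:] - np.outer(S[1:, 0], S[0, 1:]) / S[0, 0]
rho2_21_x = C[0, 1] ** 2 / (C[0, 0] * C[1, 1])

lhs = 1 - rho2(S, 2, [0, 1])                     # 1 - rho^2_{2(x,1)}
rhs = (1 - rho2_21_x) * (1 - rho2(S, 2, [0]))    # (1 - rho^2_{21.x})(1 - rho^2_{2(x)})
print(lhs, rhs)                                  # (19.1.9): they agree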
Mixed cases: One can also form multiple correlation coefficients with some of the variables partialled out. The dot notation used here is due to Yule, [Yul07]. The notation, definition, and formula for the squared correlation coefficient is

(19.1.12) \rho^2_{y(x).z} = \frac{\text{MSE}[\text{constant term and }z;\,y] - \text{MSE}[\text{constant term, }z\text{, and }x;\,y]}{\text{MSE}[\text{constant term and }z;\,y]}

(19.1.13) = \frac{\omega_{xy.z}^{\top}\Omega_{xx.z}^{-}\omega_{xy.z}}{\omega_{yy.z}}
19.2. Correlation Coefficients and the Associated Least Squares Problem

One can define the correlation coefficients also as proportionate reductions in the objective functions of the associated GLS problems. However one must reverse predictor and predictand, i.e., one must look at predictions of a vector x by linear functions of a scalar y.

Here it is done for multiple correlation coefficients: The value of the GLS objective function if one predicts x by the best linear predictor x*, which is the minimum attainable when the scalar observation y is given and the vector x can be chosen freely, as long as it satisfies the constraint x = µ + Ω_xx q for some q, is
(19.2.1) \text{SSE}[y;\,\text{best }x] = \min_{x = \mu + \Omega_{xx}q} \begin{bmatrix} (x-\mu)^{\top} & (y-\nu) \end{bmatrix} \begin{bmatrix} \Omega_{xx} & \omega_{xy} \\ \omega_{xy}^{\top} & \omega_{yy} \end{bmatrix}^{-} \begin{bmatrix} x-\mu \\ y-\nu \end{bmatrix} = (y-\nu)\,\omega_{yy}^{-1}\,(y-\nu).
On the other hand, the value of the GLS objective function when one predicts x by the best constant x = µ is

(19.2.2) \text{SSE}[y;\,x=\mu] = \begin{bmatrix} o^{\top} & (y-\nu) \end{bmatrix} \begin{bmatrix} \Omega_{xx}^{-} + \Omega_{xx}^{-}\omega_{xy}\,\omega_{yy.x}^{-1}\,\omega_{xy}^{\top}\Omega_{xx}^{-} & -\Omega_{xx}^{-}\omega_{xy}\,\omega_{yy.x}^{-1} \\ -\omega_{yy.x}^{-1}\,\omega_{xy}^{\top}\Omega_{xx}^{-} & \omega_{yy.x}^{-1} \end{bmatrix} \begin{bmatrix} o \\ y-\nu \end{bmatrix} =

(19.2.3) = (y-\nu)\,\omega_{yy.x}^{-1}\,(y-\nu).
The proportionate reduction in the objective function is

(19.2.4) \frac{\text{SSE}[y;\,x=\mu] - \text{SSE}[y;\,\text{best }x]}{\text{SSE}[y;\,x=\mu]} = \frac{(y-\nu)^2\,\omega_{yy.x}^{-1} - (y-\nu)^2\,\omega_{yy}^{-1}}{(y-\nu)^2\,\omega_{yy.x}^{-1}} =

(19.2.5) = \frac{\omega_{yy} - \omega_{yy.x}}{\omega_{yy}} = 1 - \frac{\omega_{yy.x}}{\omega_{yy}} = \rho^2_{y(x)}.
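The following NumPy check (an addition to the notes; the dispersion matrix is an arbitrary positive definite example) confirms that the matrix displayed in (19.2.2) is the inverse of the partitioned dispersion matrix, and that (19.2.5) reproduces the squared multiple correlation coefficient.

import numpy as np

# An arbitrary positive definite dispersion matrix for (x, y), x of dimension 2
Om = np.array([[2.0, 0.5, 0.7],
               [0.5, 1.0, 0.3],
               [0.7, 0.3, 1.2]])
Oxx, oxy, oyy = Om[:2, :2], Om[:2, 2], Om[2, 2]
oyy_x = oyy - oxy @ np.linalg.solve(Oxx, oxy)        # omega_{yy.x}

# Assemble the matrix displayed in (19.2.2) block by block ...
Oxx_inv = np.linalg.inv(Oxx)
top_left  = Oxx_inv + np.outer(Oxx_inv @ oxy, oxy @ Oxx_inv) / oyy_x
top_right = -(Oxx_inv @ oxy) / oyy_x
big = np.block([[top_left, top_right[:, None]],
                [top_right[None, :], np.array([[1.0 / oyy_x]])]])
print(np.allclose(big, np.linalg.inv(Om)))           # ... it is the inverse of Om

# (19.2.5): 1 - omega_{yy.x}/omega_{yy} equals the squared multiple correlation
print(1 - oyy_x / oyy, oxy @ np.linalg.solve(Oxx, oxy) / oyy)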
19.3. Canonical Correlations
Now what happens with the correlation coefficients if both predictor and predic-
tand are vectors? In this case one has more than one correlation coefficient. One first
finds those two linear combinations of the two vectors which have highest correlation,
then those which are uncorrelated with the first and have second highest correlation,
and so on. Here is the mathematical construction needed:
Let x and y be two column vectors consisting of p and q scalar random variables, respectively, and let

(19.3.1) \mathcal{V}\begin{bmatrix} x \\ y \end{bmatrix} = \sigma^2 \begin{bmatrix} \Omega_{xx} & \Omega_{xy} \\ \Omega_{yx} & \Omega_{yy} \end{bmatrix},
where Ω_xx and Ω_yy are nonsingular, and let r be the rank of Ω_xy. Then there exist two separate transformations

(19.3.2) u = Lx, \qquad v = My

such that

(19.3.3) \mathcal{V}\begin{bmatrix} u \\ v \end{bmatrix} = \sigma^2 \begin{bmatrix} I_p & \Lambda \\ \Lambda^{\top} & I_q \end{bmatrix}
where Λ is a (usually rectangular) diagonal matrix with only r diagonal elements
positive, and the others zero, and where these diagonal elements are sorted in de-
scending order.
Proof: One obtains the matrix Λ by a singular value decomposition of Ω_xx^{-1/2} Ω_xy Ω_yy^{-1/2} = A, say. Let A = P⊤ΛQ be its singular value decomposition with fully orthogonal matrices, as in equation (A.9.8). Define L = P Ω_xx^{-1/2} and M = Q Ω_yy^{-1/2}. Therefore L Ω_xx L⊤ = I, M Ω_yy M⊤ = I, and L Ω_xy M⊤ = P Ω_xx^{-1/2} Ω_xy Ω_yy^{-1/2} Q⊤ = P A Q⊤ = Λ.
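The construction in this proof can be carried out numerically. The following sketch (an addition to the notes; the simulated data and the split into x and y are arbitrary) obtains the canonical correlations as the singular values of Ω_xx^{-1/2} Ω_xy Ω_yy^{-1/2}, with sample covariance matrices standing in for the population Ω's.

import numpy as np

rng = np.random.default_rng(5)
Z = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # some correlated columns
S = np.cov(Z, rowvar=False)
p = 3                                                     # first 3 columns play x, last 2 play y
Sxx, Sxy, Syy = S[:p, :p], S[:p, p:], S[p:, p:]

def inv_sqrt(M):
    """Inverse of the symmetric square root of a symmetric p.d. matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

# Lambda consists of the singular values of A = Sxx^{-1/2} Sxy Syy^{-1/2};
# L and M are recovered from the orthogonal factors as in the proof.
A = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
P_t, lam, Q = np.linalg.svd(A)                # A = P' Lambda Q, with P' = P_t
L = P_t.T @ inv_sqrt(Sxx)
M = Q @ inv_sqrt(Syy)
print(lam)                                    # canonical correlations, descending
print(np.allclose(L @ Sxx @ L.T, np.eye(p)))  # L Sxx L' = I as claimed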
The next problems show how one gets from this the maximization property of
the canonical correlation coefficients:
Problem 257. Show that for every p-vector l and q-vector m,

(19.3.4) \left|\operatorname{corr}(l^{\top}x,\, m^{\top}y)\right| \le \lambda_1

where λ_1 is the first (and therefore biggest) diagonal element of Λ. Equality in (19.3.4) holds if l = l_1, the first row in L, and m = m_1, the first row in M.
Answer: If l or m is the null vector, then there is nothing to prove. If neither of them is a null vector, then one can, without loss of generality, multiply them with appropriate scalars so that p = (L^{-1})⊤ l and q = (M^{-1})⊤ m satisfy p⊤p = 1 and q⊤q = 1. Then

(19.3.5) \mathcal{V}\begin{bmatrix} l^{\top}x \\ m^{\top}y \end{bmatrix} = \mathcal{V}\begin{bmatrix} p^{\top}Lx \\ q^{\top}My \end{bmatrix} = \mathcal{V}\left(\begin{bmatrix} p^{\top} & o^{\top} \\ o^{\top} & q^{\top} \end{bmatrix}\begin{bmatrix} u \\ v \end{bmatrix}\right) = \sigma^2\begin{bmatrix} p^{\top} & o^{\top} \\ o^{\top} & q^{\top} \end{bmatrix}\begin{bmatrix} I_p & \Lambda \\ \Lambda^{\top} & I_q \end{bmatrix}\begin{bmatrix} p & o \\ o & q \end{bmatrix} = \sigma^2\begin{bmatrix} p^{\top}p & p^{\top}\Lambda q \\ q^{\top}\Lambda^{\top}p & q^{\top}q \end{bmatrix}

Since the matrix at the right-hand side has ones in the diagonal, it is the correlation matrix, i.e., p⊤Λq = corr(l⊤x, m⊤y). Therefore (19.3.4) follows from Problem 258.
Problem 258. If ∑ p_i² = ∑ q_i² = 1, and λ_i ≥ 0, show that |∑ p_i λ_i q_i| ≤ max λ_i. Hint: first get an upper bound for |∑ p_i λ_i q_i| through a Cauchy-Schwarz-type argument.

Answer. (∑ p_i λ_i q_i)² ≤ (∑ p_i² λ_i)(∑ q_i² λ_i) ≤ (max λ_i)².
Problem 259. Show that for every p-vector l and q-vector m such that l⊤x is uncorrelated with l_1⊤x, and m⊤y is uncorrelated with m_1⊤y,

(19.3.6) \left|\operatorname{corr}(l^{\top}x,\, m^{\top}y)\right| \le \lambda_2

where λ_2 is the second diagonal element of Λ. Equality in (19.3.6) holds if l = l_2, the second row in L, and m = m_2, the second row in M.

Answer. If l or m is the null vector, then there is nothing to prove. If neither of them is a null vector, then one can, without loss of generality, multiply them with appropriate scalars so that
p = (L^{-1})⊤ l and q = (M^{-1})⊤ m satisfy p⊤p = 1 and q⊤q = 1. Now write e_1 for the first unit vector, which has a 1 as first component and zeros everywhere else:

(19.3.7) \operatorname{cov}[l^{\top}x,\, l_1^{\top}x] = \operatorname{cov}[p^{\top}Lx,\, e_1^{\top}Lx] = \sigma^2\, p^{\top}e_1 = \sigma^2 p_1,

since V[Lx] = σ²I_p (because L Ω_xx L⊤ = I). This covariance is zero iff p_1 = 0, and in the same way m⊤y uncorrelated with m_1⊤y means q_1 = 0. Furthermore one also needs the following, directly from the proof of Problem 257:

(19.3.8) \mathcal{V}\begin{bmatrix} l^{\top}x \\ m^{\top}y \end{bmatrix} = \mathcal{V}\begin{bmatrix} p^{\top}Lx \\ q^{\top}My \end{bmatrix} = \mathcal{V}\left(\begin{bmatrix} p^{\top} & o^{\top} \\ o^{\top} & q^{\top} \end{bmatrix}\begin{bmatrix} u \\ v \end{bmatrix}\right) = \sigma^2\begin{bmatrix} p^{\top} & o^{\top} \\ o^{\top} & q^{\top} \end{bmatrix}\begin{bmatrix} I_p & \Lambda \\ \Lambda^{\top} & I_q \end{bmatrix}\begin{bmatrix} p & o \\ o & q \end{bmatrix} = \sigma^2\begin{bmatrix} p^{\top}p & p^{\top}\Lambda q \\ q^{\top}\Lambda^{\top}p & q^{\top}q \end{bmatrix}

Since the matrix at the right-hand side has ones in the diagonal, it is the correlation matrix, i.e., p⊤Λq = corr(l⊤x, m⊤y). Equation (19.3.6) follows from Problem 258 if one lets the subscript i start at 2 instead of 1.
Problem 260. (Not eligible for in-class exams) Extra credit question for good mathematicians: Reformulate the above treatment of canonical correlations without the assumption that Ω_xx and Ω_yy are nonsingular.
19.4. Some Remarks about the Sample Partial Correlation Coefficients
The definition of the partial sample correlation coefficients is analogous to that of the partial population correlation coefficients: Given two data vectors y and z and a matrix X (which includes a constant term), let M = I − X(X⊤X)^{-1}X⊤ be the "residual maker" with respect to X. Then the squared partial sample correlation is the squared simple correlation between the least squares residuals:

(19.4.1) r^2_{zy.X} = \frac{(z^{\top}My)^2}{(z^{\top}Mz)(y^{\top}My)}
Alternatively, one can define it as the proportionate reduction in the SSE. Although X is assumed to incorporate a constant term, I am giving it here separately, in order to show the analogy with (19.1.8):

(19.4.2) r^2_{zy.X} = \frac{\text{SSE}[\text{constant term and }X;\,y] - \text{SSE}[\text{constant term, }X\text{, and }z;\,y]}{\text{SSE}[\text{constant term and }X;\,y]}.
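A small NumPy check of these two sample definitions (an addition to the notes; the simulated data are arbitrary):

import numpy as np

rng = np.random.default_rng(7)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # includes the constant
z = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)
y = X @ np.array([2.0, 1.0, 0.3]) + 0.4 * z + rng.normal(size=n)

# Residual maker M = I - X (X'X)^{-1} X'
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

# (19.4.1): squared correlation of the least squares residuals
r2 = (z @ M @ y) ** 2 / ((z @ M @ z) * (y @ M @ y))

# (19.4.2): proportionate reduction in SSE when z is added to the regression
sse_X = y @ M @ y
Xz = np.column_stack([X, z])
Mz = np.eye(n) - Xz @ np.linalg.solve(Xz.T @ Xz, Xz.T)
sse_Xz = y @ Mz @ y
print(r2, (sse_X - sse_Xz) / sse_X)          # identical up to rounding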
[Gre97, p. 248] considers it unintuitive that this can be computed using t-statistics. Our approach explains why this is so. First of all, note that the square of the t-statistic is the F-statistic. Secondly, the formula for the F-statistic for the inclusion of z into the regression is

(19.4.3) t^2 = F = \frac{\text{SSE}[\text{constant term and }X;\,y] - \text{SSE}[\text{constant term, }X\text{, and }z;\,y]}{\text{SSE}[\text{constant term, }X\text{, and }z;\,y]/(n-k-1)}

This is very similar to the formula for the squared partial correlation coefficient. From (19.4.3) follows

(19.4.4) F + n - k - 1 = \frac{\text{SSE}[\text{constant term and }X;\,y]\,(n-k-1)}{\text{SSE}[\text{constant term, }X\text{, and }z;\,y]}

and therefore

(19.4.5) r^2_{zy.X} = \frac{F}{F + n - k - 1}

which is [Gre97, (6-29) on p. 248].
It should also be noted here that [Gre97, (6-36) on p. 254] is the sample equiv-
alent of (19.1.11).
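The relationship (19.4.5) can also be checked numerically. The following sketch (an addition; data simulated with arbitrary coefficients) computes the t-statistic on z by hand and compares t²/(t² + n − k − 1) with the squared partial correlation.

import numpy as np

rng = np.random.default_rng(8)
n = 120
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
z = rng.normal(size=n)
y = X @ np.array([1.0, 0.5, -0.2, 0.3]) + 0.8 * z + rng.normal(size=n)

W = np.column_stack([X, z])                  # regression including z
k = X.shape[1]                               # k regressors in X, constant included
beta = np.linalg.solve(W.T @ W, W.T @ y)
resid = y - W @ beta
dof = n - k - 1                              # residual degrees of freedom
s2 = resid @ resid / dof
se_z = np.sqrt(s2 * np.linalg.inv(W.T @ W)[-1, -1])
t = beta[-1] / se_z                          # t-statistic on z

# Squared partial correlation of z and y given X, via the residual maker
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
r2 = (z @ M @ y) ** 2 / ((z @ M @ z) * (y @ M @ y))
print(r2, t**2 / (t**2 + dof))               # (19.4.5): r^2 = F/(F + n - k - 1)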
CHAPTER 20
Numerical Methods for computing OLS Estimates
20.1. QR Decomposition
One precise and fairly efficient method to compute the Least Squares estimates is the QR decomposition. It amounts to going over to an orthonormal basis in R[X]. It uses the following mathematical fact:

Every matrix X which has full column rank can be decomposed into the product of two matrices QR, where Q has the same number of rows and columns as X and is "suborthogonal" or "incomplete orthogonal," i.e., it satisfies Q⊤Q = I. The other factor R is upper triangular and nonsingular.
To construct the least squares estimates, make a QR decomposition of the matrix of explanatory variables X (which is assumed to have full column rank). With X = QR, the normal equations read

(20.1.1) X^{\top}X\hat{\beta} = X^{\top}y
(20.1.2) R^{\top}Q^{\top}QR\hat{\beta} = R^{\top}Q^{\top}y
(20.1.3) R^{\top}R\hat{\beta} = R^{\top}Q^{\top}y
(20.1.4) R\hat{\beta} = Q^{\top}y
This last step can be made because R is nonsingular. (20.1.4) is a triangular system of
equations, which can be solved easily. Note that it is not necessary for this procedure
to compute the matrix X⊤X, which is a big advantage, since this computation is numerically quite unstable.
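As an illustration (not part of the original notes), here is how the procedure (20.1.1)–(20.1.4) looks in Python with NumPy's built-in QR factorization; the back substitution is written out to emphasize that only a triangular system has to be solved. The data are arbitrary.

import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

Q, R = np.linalg.qr(X)            # "economy" QR: Q is 50x3 with Q'Q = I, R is 3x3 upper triangular

def backsolve(R, b):
    """Solve R beta = b for upper triangular R by back substitution."""
    beta = np.zeros_like(b, dtype=float)
    for i in range(len(b) - 1, -1, -1):
        beta[i] = (b[i] - R[i, i + 1:] @ beta[i + 1:]) / R[i, i]
    return beta

beta_qr = backsolve(R, Q.T @ y)   # (20.1.4): R beta = Q'y
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_qr, beta_ls))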
Problem 261. 2 points You have a QR-decomposition X = QR, where Q⊤Q = I, and R is upper triangular and nonsingular. For an estimate of V[β̂] you need (X⊤X)^{-1}. How can this be computed without computing X⊤X? And why would you want to avoid computing X⊤X?

Answer. X⊤X = R⊤Q⊤QR = R⊤R, its inverse is therefore R^{-1}(R^{-1})⊤.
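A short numerical check of this answer (an addition to the notes; arbitrary data):

import numpy as np

rng = np.random.default_rng(10)
X = rng.normal(size=(30, 4))
Q, R = np.linalg.qr(X)

Rinv = np.linalg.inv(R)                       # small triangular matrix, cheap to invert
XtX_inv = Rinv @ Rinv.T                       # (X'X)^{-1} = R^{-1} (R^{-1})'
print(np.allclose(XtX_inv, np.linalg.inv(X.T @ X)))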
Problem 262. Compute the QR decomposition of

(20.1.5) X = \begin{bmatrix} 1 & 1 & 2 \\ 1 & 5 & -2 \\ 1 & 1 & 0 \\ 1 & 5 & -4 \end{bmatrix}

Answer.

(20.1.6) Q = \frac{1}{2}\begin{bmatrix} 1 & -1 & 1 \\ 1 & 1 & 1 \\ 1 & -1 & -1 \\ 1 & 1 & -1 \end{bmatrix} \qquad R = 2\begin{bmatrix} 1 & 3 & -1 \\ 0 & 2 & -2 \\ 0 & 0 & 1 \end{bmatrix}
How to get it? We need a decomposition

(20.1.7) \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix} = \begin{bmatrix} q_1 & q_2 & q_3 \end{bmatrix}\begin{bmatrix} r_{11} & r_{12} & r_{13} \\ 0 & r_{22} & r_{23} \\ 0 & 0 & r_{33} \end{bmatrix}

where q_1⊤q_1 = q_2⊤q_2 = q_3⊤q_3 = 1 and q_1⊤q_2 = q_1⊤q_3 = q_2⊤q_3 = 0. First column: x_1 = q_1 r_11 and q_1 must have unit length. This gives q_1⊤ = [1/2  1/2  1/2  1/2] and r_11 = 2. Second column:

(20.1.8) x_2 = q_1 r_{12} + q_2 r_{22}

and q_1⊤q_2 = 0 and q_2⊤q_2 = 1. Premultiply (20.1.8) by q_1⊤ to get q_1⊤x_2 = r_12, i.e., r_12 = 6. Thus we know q_2 r_22 = x_2 − q_1 · 6 = [−2  2  −2  2]⊤. Now we have to normalize it, to get q_2⊤ = [−1/2  1/2  −1/2  1/2] and r_22 = 4. The rest remains a homework problem. But I am not sure if my numbers are right.
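A quick check (an addition to the notes) confirms that the stated Q and R do reproduce X and satisfy Q⊤Q = I:

import numpy as np

X = np.array([[1, 1,  2],
              [1, 5, -2],
              [1, 1,  0],
              [1, 5, -4]], dtype=float)
Q = 0.5 * np.array([[1, -1,  1],
                    [1,  1,  1],
                    [1, -1, -1],
                    [1,  1, -1]])
R = 2.0 * np.array([[1, 3, -1],
                    [0, 2, -2],
                    [0, 0,  1]])
print(np.allclose(Q @ R, X))                  # reproduces X
print(np.allclose(Q.T @ Q, np.eye(3)))        # Q is "suborthogonal"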
Problem 263. 2 points Compute trace and determinant of \begin{bmatrix} 1 & 3 & -1 \\ 0 & 2 & -2 \\ 0 & 0 & 1 \end{bmatrix}. Is this matrix symmetric and, if so, is it nonnegative definite? Are its column vectors linearly dependent? Compute the matrix product

(20.1.9) \begin{bmatrix} 1 & -1 & 1 \\ 1 & 1 & 1 \\ 1 & -1 & -1 \\ 1 & 1 & -1 \end{bmatrix}\begin{bmatrix} 1 & 3 & -1 \\ 0 & 2 & -2 \\ 0 & 0 & 1 \end{bmatrix}
20.2. The LINPACK Implementation of the QR Decomposition

This is all we need, but numerically it is possible to construct, without much additional computing time, all the information which adds the missing orthogonal columns to Q. In this way Q is square and R is conformable with X. This is sometimes called the "complete" QR-decomposition. In terms of the decomposition above, we have now

(20.2.1) X = \begin{bmatrix} Q & S \end{bmatrix}\begin{bmatrix} R \\ O \end{bmatrix}

For every matrix X one can find an orthogonal matrix Q such that Q⊤X has zeros below the diagonal, call that matrix R. Alternatively one may say: every matrix X can be written as the product of two matrices QR, where R is conformable with X and has zeros below the diagonal, and Q is orthogonal.
To prove this, and also for the numerical procedure, we will build Q⊤ as the product of several orthogonal matrices, each converting one column of X into one with zeros below the diagonal.

First note that for every vector v, the matrix I − (2/v⊤v)vv⊤ is orthogonal. Given X, let x be the first column of X. If x = o, then go on to the next column. Otherwise choose

v = \begin{bmatrix} x_{11} + \sigma\sqrt{x^{\top}x} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix},

where σ = 1 if x_11 ≥ 0 and σ = −1 otherwise. (Mathematically, either σ = +1 or σ = −1 would do; but if one gives σ the same sign as x_11, then the first element of v gets the largest possible absolute value, which improves numerical accuracy.) Then
(20.2.2) v^{\top}v = \left(x_{11}^2 + 2\sigma x_{11}\sqrt{x^{\top}x} + x^{\top}x\right) + x_{21}^2 + \cdots + x_{n1}^2
(20.2.3) = 2\left(x^{\top}x + \sigma x_{11}\sqrt{x^{\top}x}\right)
(20.2.4) v^{\top}x = x^{\top}x + \sigma x_{11}\sqrt{x^{\top}x}

therefore 2v⊤x/v⊤v = 1, and

(20.2.5) \left(I - \frac{2}{v^{\top}v}vv^{\top}\right)x = x - v = \begin{bmatrix} -\sigma\sqrt{x^{\top}x} \\ 0 \\ \vdots \\ 0 \end{bmatrix}.
Premultiplication of X by I − (2/v⊤v)vv⊤ therefore gets the first column into the desired shape. By the same principle one can construct a second vector w, which has a zero in the first place, and which annihilates all elements below the second element in the second column of X, etc. These successive orthogonal transformations will convert X into a matrix which has zeros below the diagonal; their product is therefore Q⊤.
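The following Python function (an addition to the notes; a bare-bones sketch, not the LINPACK storage scheme described next) carries out exactly this sequence of reflections, accumulating their product as Q⊤ and overwriting a copy of X with R. It is applied to the matrix of Problem 262.

import numpy as np

def householder_qr(X):
    """Return Qt (= Q') and R with zeros below the diagonal, as described above."""
    R = X.astype(float)
    n, k = R.shape
    Qt = np.eye(n)
    for j in range(k):
        x = R[j:, j]
        if np.allclose(x, 0):
            continue                              # nothing to annihilate in this column
        sigma = 1.0 if x[0] >= 0 else -1.0        # sign chosen as in the text
        v = x.copy()
        v[0] += sigma * np.linalg.norm(x)
        H = np.eye(n)
        H[j:, j:] -= 2.0 * np.outer(v, v) / (v @ v)   # I - 2 v v' / v'v, padded with I
        R = H @ R
        Qt = H @ Qt
    return Qt, R

X = np.array([[1., 1., 2.], [1., 5., -2.], [1., 1., 0.], [1., 5., -4.]])
Qt, R = householder_qr(X)
print(np.round(R, 10))                            # zeros below the diagonal; signs may differ from Problem 262
print(np.allclose(Qt.T @ R, X))                   # Q R reproduces X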
The LINPACK implementation of this starts with X and modifies its elements in place. For each column it generates the corresponding v vector and premultiplies the matrix by I − (2/v⊤v)vv⊤. This generates zeros below the diagonal. Instead of writing the zeros into that matrix, it uses the "free" space to store the vector v. There is almost enough room; the first nonzero element of v must be stored elsewhere. This is why the QR decomposition in Splus has two main components: qr is a matrix like a, and qraux is a vector of length ncols(a).

LINPACK does not use or store exactly the same v as given here, but uses
u = v/(σ

x

x) instead. The normalization does not affect the res ulting orthogonal
transformation; its advantage is that the leading element of each vector, that which
is stored in qraux, is at the same time equal u

u/2. In other words, qraux doubles
up as the divisor in the construction of the orthogonal matrices.
In Splus type help(qr). At the end of the help file a program is given which
shows how the Q might be constructed from the fragments qr and qraux.
