CHAPTER 63
Independent Observations from the Same
Multivariate Population
This chapter discusses a model that is a special case of the model in Section 62.2, but it goes into more depth towards the end.
63.1. Notation and Basic Statistics
Notational conventions are not uniform among the different books about multivariate statistics. Johnson and Wichern arrange the data in an r × n matrix X. Each column is a separate independent observation of an r-vector with mean µ and dispersion matrix Σ. There are n observations.
We will choose an alternative notation, which is also found in the literature, and write the matrix as an n × r matrix Y. As before, each column represents a variable, and each row a (usually independent) observation.
Decompose Y into its row vectors as follows:
(63.1.1)  Y = \begin{pmatrix} y_1^\top \\ \vdots \\ y_n^\top \end{pmatrix}.
Each row (written as a column vector) y_i has mean µ and dispersion matrix Σ, and different rows are independent of each other. In other words, E[Y] = ιµ^⊤. V[Y] is an array of rank 4, not a matrix. In terms of Kronecker products one can write V[vec Y] = Σ ⊗ I.
One can form the following descriptive statistics: ȳ = (1/n) ∑_i y_i is the vector of sample means, W = ∑_i (y_i − ȳ)(y_i − ȳ)^⊤ is the matrix of (corrected) squares and cross products, the sample covariance matrix is S^(n) = (1/n)W with divisor n, and R is the matrix of sample correlation coefficients.
Notation: the ith sample variance is called s_ii (not s_i^2, as one might perhaps expect).
The sample means indicate location, the sample standard deviations dispersion, and the sample correlation coefficients linear relationship.
How do we get these descriptive statistics from the data Y through a matrix manipulation? ȳ^⊤ = (1/n)ι^⊤Y; now Y − ιȳ^⊤ = (I − ιι^⊤/n)Y is the matrix of observations with the appropriate sample mean taken out of each element, therefore
(63.1.2)  W = \begin{pmatrix} y_1 - \bar y & \cdots & y_n - \bar y \end{pmatrix}
\begin{pmatrix} (y_1 - \bar y)^\top \\ \vdots \\ (y_n - \bar y)^\top \end{pmatrix}
= Y^\top \Bigl(I - \frac{\iota\iota^\top}{n}\Bigr)^{\!\top}\Bigl(I - \frac{\iota\iota^\top}{n}\Bigr) Y
= Y^\top \Bigl(I - \frac{\iota\iota^\top}{n}\Bigr) Y.
Then S^(n) = (1/n)W, and in order to get the sample correlation matrix R, use

(63.1.3)  D^{(n)} = \operatorname{diag}(S^{(n)}) =
\begin{pmatrix} s_{11} & 0 & \cdots & 0 \\ 0 & s_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s_{rr} \end{pmatrix}

and then R = (D^{(n)})^{-1/2} S^{(n)} (D^{(n)})^{-1/2}.
In analogy to the formulas for variances and covariances of linear transformations of a vector, one has the following formula for sample variances and covariances of linear combinations Y a and Y b: est.cov[Y a, Y b] = a^⊤ S^(n) b.
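To make the matrix formulas above concrete, here is a minimal NumPy sketch (the data are made up and the variable names Y, W, S, R are my own) that computes ȳ, W, S^(n), and R exactly as in (63.1.2) and (63.1.3):

import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 3
Y = rng.normal(size=(n, r))          # n observations (rows) of an r-vector

iota = np.ones((n, 1))
ybar = Y.T @ iota / n                # vector of sample means, (1/n) Y' iota
M = np.eye(n) - iota @ iota.T / n    # centering matrix I - iota iota'/n
W = Y.T @ M @ Y                      # matrix of corrected squares and cross products
S = W / n                            # sample covariance matrix S^(n), divisor n
D_inv_sqrt = np.diag(1 / np.sqrt(np.diag(S)))
R = D_inv_sqrt @ S @ D_inv_sqrt      # sample correlation matrix

# sample covariance of the linear combinations Y a and Y b: a' S^(n) b
a, b = np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, -1.0])
est_cov = a @ S @ b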
Problem 517. Show that E[ȳ] = µ and V[ȳ] = (1/n)Σ. (The latter identity can be shown in two ways: once using the Kronecker product of matrices, and once by partitioning Y into its rows.)
Answer. E[ȳ] = E[(1/n)Y^⊤ι] = (1/n)(E[Y])^⊤ι = (1/n)µι^⊤ι = µ. Using Kronecker products, one obtains from ȳ^⊤ = (1/n)ι^⊤Y that

(63.1.4)  \bar y = \operatorname{vec}(\bar y^\top) = \frac{1}{n}(I \otimes \iota^\top)\operatorname{vec} Y;

therefore

(63.1.5)  V[\bar y] = \frac{1}{n^2}(I \otimes \iota^\top)(\Sigma \otimes I)(I \otimes \iota) = \frac{1}{n^2}(\Sigma \otimes \iota^\top\iota) = \frac{1}{n}\Sigma.
The alternative way to do it is
\begin{align}
V[\bar y] &= E[(\bar y - \mu)(\bar y - \mu)^\top] \tag{63.1.6}\\
&= E\Bigl[\Bigl(\tfrac{1}{n}\sum_i (y_i - \mu)\Bigr)\Bigl(\tfrac{1}{n}\sum_j (y_j - \mu)\Bigr)^{\!\top}\Bigr] \tag{63.1.7}\\
&= \frac{1}{n^2}\sum_{i,j} E[(y_i - \mu)(y_j - \mu)^\top] \tag{63.1.8}\\
&= \frac{1}{n^2}\sum_{i} E[(y_i - \mu)(y_i - \mu)^\top] \tag{63.1.9}\\
&= \frac{n}{n^2} E[(y_i - \mu)(y_i - \mu)^\top] = \frac{1}{n}\Sigma. \tag{63.1.10}
\end{align}
Problem 518. Show that E[S^(n)] = ((n−1)/n)Σ, so that the unbiased S = (1/(n−1)) ∑_i (y_i − ȳ)(y_i − ȳ)^⊤ has Σ as its expected value.
63.2. Two Geometries
One can distinguish two geometries, according to whether one takes the rows
or the columns of Y as the points. Rows as points gives n points in r-dimensional
space, the “scatterplot geometry.” If r = 2, this is the scatter plot of the two variables
against each other.
In this geometry, the sample mean is the center of balance or center of gravity.
The dispersion of the observations around their mean defines a distance measure in
this geometry.
The book introduces this distance by suggesting with its illustrations that the
data are clustered in hyperellipsoids. The right way to introduce this distance would
be to say: we are not only interested in the r coordinates separately but also in any
linear combinations, then use our treatment of the Mahalanobis distance for a given
population, and then transfer it to the empirical distribution given by the sample.
In the other geometry, all observations of a given random variable form one point, here called a "vector." I.e., the basic entities are the columns of Y. In this so-called "vector geometry," the sample mean corresponds to the projection of the observation vector on the diagonal vector ι, and the correlation coefficient is the cosine of the angle between the deviation vectors.
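A quick numerical illustration of this vector-geometry fact (a sketch with made-up data; the variable names are mine): the cosine of the angle between the centered columns equals the sample correlation coefficient.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=20)
y = 0.5 * x + rng.normal(size=20)

dx, dy = x - x.mean(), y - y.mean()          # deviation vectors
cos_angle = dx @ dy / (np.linalg.norm(dx) * np.linalg.norm(dy))
r = np.corrcoef(x, y)[0, 1]                  # sample correlation coefficient
assert np.isclose(cos_angle, r)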
Generalized sample variance is defined as the determinant of S. Its geometric intuition: in the scatter plot geometry it is proportional to the square of the volume of the hyperellipsoids (see J&W, p. 103), and in the geometry in which the observations of each variable form a vector it is

(63.2.1)  \det S = (n-1)^{-r}(\text{volume})^2
where the volume is that spanned by the deviation vectors.
63.3. Assumption of Normality
A more general version of this section is Section 62.2.3.
Assume that the y_i, the row vectors of Y, are independent, and each is ∼ N(µ, Σ) with Σ positive definite. Then the density function of Y is
\begin{align}
f_Y(Y) &= \prod_{j=1}^{n}\Bigl((2\pi)^{-r/2}(\det\Sigma)^{-1/2}\exp\bigl(-\tfrac{1}{2}(y_j-\mu)^\top\Sigma^{-1}(y_j-\mu)\bigr)\Bigr) \tag{63.3.1}\\
&= (2\pi)^{-nr/2}(\det\Sigma)^{-n/2}\exp\Bigl(-\tfrac{1}{2}\sum_j (y_j-\mu)^\top\Sigma^{-1}(y_j-\mu)\Bigr). \tag{63.3.2}
\end{align}
The quadratic form in the exponent can be rewritten as follows:
\begin{multline}
\sum_{j=1}^{n}(y_j-\mu)^\top\Sigma^{-1}(y_j-\mu) = \sum_{j=1}^{n}(y_j-\bar y+\bar y-\mu)^\top\Sigma^{-1}(y_j-\bar y+\bar y-\mu)\\
= \sum_{j=1}^{n}(y_j-\bar y)^\top\Sigma^{-1}(y_j-\bar y) + n(\bar y-\mu)^\top\Sigma^{-1}(\bar y-\mu) \tag{63.3.3}
\end{multline}
The first term can be simplified as follows:
\begin{align*}
\sum_j (y_j-\bar y)^\top\Sigma^{-1}(y_j-\bar y) &= \sum_j \operatorname{tr}\,(y_j-\bar y)^\top\Sigma^{-1}(y_j-\bar y)\\
&= \sum_j \operatorname{tr}\,\Sigma^{-1}(y_j-\bar y)(y_j-\bar y)^\top\\
&= \operatorname{tr}\,\Sigma^{-1}\sum_j (y_j-\bar y)(y_j-\bar y)^\top = n\operatorname{tr}\,\Sigma^{-1}S^{(n)}.
\end{align*}
Using this one can write the density function as

(63.3.4)  f_Y(Y) = (2\pi)^{-nr/2}(\det\Sigma)^{-n/2}\exp\bigl(-\tfrac{n}{2}\operatorname{tr}(\Sigma^{-1}S^{(n)})\bigr)\exp\bigl(-\tfrac{n}{2}(\bar y-\mu)^\top\Sigma^{-1}(\bar y-\mu)\bigr).
One sees, therefore, that the density function depends on the observations only through ȳ and S^(n), which means that ȳ and S^(n) are sufficient statistics.
Now we compute the maximum likelihood estimators: taking the maximum for µ is simply µ̂ = ȳ. This leaves the concentrated likelihood function

(63.3.5)  \max_{\mu} f_Y(Y) = (2\pi)^{-nr/2}(\det\Sigma)^{-n/2}\exp\bigl(-\tfrac{n}{2}\operatorname{tr}(\Sigma^{-1}S^{(n)})\bigr).
To obtain the maximum likelihood estimate of Σ one needs equation (A.8.21) in Theorem A.8.3 in the Appendix and (62.2.15). If one sets A = (S^(n))^{1/2} Σ^{-1} (S^(n))^{1/2}, then tr A = tr(Σ^{-1}S^(n)) and det A = (det Σ)^{-1} det S^(n); substituting this into (62.2.15) gives for the concentrated likelihood function

(63.3.6)  (2\pi)^{-nr/2}(\det\Sigma)^{-n/2}\exp\bigl(-\tfrac{n}{2}\operatorname{tr}(\Sigma^{-1}S^{(n)})\bigr) \le (2\pi e)^{-rn/2}(\det S^{(n)})^{-n/2}

with equality holding if Σ̂ = S^(n). Note that the maximum value depends on the data only through the estimated generalized variance det S^(n).
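As a sanity check on (63.3.6), here is a small simulation sketch (my own construction, not part of the notes): it evaluates the concentrated log-likelihood at Σ = S^(n) and at a number of random positive definite alternatives, and the value at S^(n) is never exceeded.

import numpy as np

def concentrated_loglik(Sigma, S_n, n, r):
    # log of (2 pi)^(-nr/2) (det Sigma)^(-n/2) exp(-(n/2) tr(Sigma^{-1} S_n))
    _, logdet = np.linalg.slogdet(Sigma)
    return (-n * r / 2 * np.log(2 * np.pi)
            - n / 2 * logdet
            - n / 2 * np.trace(np.linalg.solve(Sigma, S_n)))

rng = np.random.default_rng(2)
n, r = 40, 3
Y = rng.normal(size=(n, r))
Yc = Y - Y.mean(axis=0)
S_n = Yc.T @ Yc / n                          # candidate MLE S^(n)

best = concentrated_loglik(S_n, S_n, n, r)
for _ in range(100):
    B = np.linalg.cholesky(S_n) + 0.1 * rng.normal(size=(r, r))
    Sigma = B @ B.T + 1e-6 * np.eye(r)       # random positive definite alternative
    assert concentrated_loglik(Sigma, S_n, n, r) <= best + 1e-9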
63.4. EM-Algorithm for Missing Observations
The maximization of the likelihood function is far more difficult if some observations are missing. (Here assume they are missing randomly, i.e., the fact that they are missing is not related to the values of these entries. Otherwise one has sample selection bias!) In this case, a good iterative procedure to obtain the maximum likelihood estimate is the EM-algorithm (expectation-maximization algorithm). It alternates between a prediction step and an estimation step.
Let’s follow Johnson and Wichern’s example on their p. 199. The matrix is

(63.4.1)  Y = \begin{pmatrix} - & 0 & 3 \\ 7 & 2 & 6 \\ 5 & 1 & 2 \\ - & - & 5 \end{pmatrix}
It is not so important how one gets the initial estimates of µ and Σ: say µ̃^⊤ = (6  1  4), and to get Σ̃ take deviations from the mean, putting zeros in for the missing values (which will of course underestimate the variances), and divide by the number of observations. (Since we are talking maximum likelihood, there is no adjustment for degrees of freedom.)
(63.4.2)  \tilde\Sigma = \frac{1}{4} D^\top D \quad\text{where } D = \begin{pmatrix} 0 & -1 & -1 \\ 1 & 1 & 2 \\ -1 & 0 & -2 \\ 0 & 0 & 1 \end{pmatrix} \text{ is the matrix of these deviations, i.e.,}\quad
\tilde\Sigma = \begin{pmatrix} 1/2 & 1/4 & 1 \\ 1/4 & 1/2 & 3/4 \\ 1 & 3/4 & 5/2 \end{pmatrix}.
Given these estimates, the prediction step is next. The likelihood function depends on sample mean and sample dispersion matrix only. These, in turn, are simple functions of the vector of column sums Y^⊤ι and the matrix of (uncentered) sums of squares and crossproducts Y^⊤Y, which are complete sufficient statistics. To predict those we need predictions of the missing elements of Y, of their squares, and of their products with each other and with the observed elements of Y. Our method of predicting is to take conditional expectations, assuming µ̃ and Σ̃ are the true mean and dispersion matrix.
For the prediction of the upper lefthand corner element of Y, only the first row of Y is relevant. Partitioning this row into the observed and unobserved elements gives

(63.4.3)  \begin{pmatrix} y_{11} \\ 0 \\ 3 \end{pmatrix} \sim N\left(\begin{pmatrix} 6 \\ 1 \\ 4 \end{pmatrix},\ \begin{pmatrix} 1/2 & 1/4 & 1 \\ 1/4 & 1/2 & 3/4 \\ 1 & 3/4 & 5/2 \end{pmatrix}\right)
\quad\text{or}\quad
\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \sim N\left(\begin{pmatrix} \tilde\mu_1 \\ \tilde\mu_2 \end{pmatrix},\ \begin{pmatrix} \tilde\Sigma_{11} & \tilde\Sigma_{12} \\ \tilde\Sigma_{21} & \tilde\Sigma_{22} \end{pmatrix}\right).
The conditional mean of y_1 is the best linear predictor

(63.4.4)  E[y_1 \mid y_2;\ \tilde\mu, \tilde\Sigma] = y_1^* = \tilde\mu_1 + \tilde\Sigma_{12}\tilde\Sigma_{22}^{-1}(y_2 - \tilde\mu_2)
or in our numerical example

(63.4.5)  E[y_{11} \mid \cdots] = y_{11}^* = 6 + \begin{pmatrix} 1/4 & 1 \end{pmatrix}\begin{pmatrix} 1/2 & 3/4 \\ 3/4 & 5/2 \end{pmatrix}^{-1}\begin{pmatrix} 0-1 \\ 3-4 \end{pmatrix} = 5.73
Furthermore,

(63.4.6)  E[(y_1 - y_1^*)(y_1 - y_1^*)^\top \mid y_2;\ \tilde\mu, \tilde\Sigma] = E[(y_1 - y_1^*)(y_1 - y_1^*)^\top] = \operatorname{MSE}[y_1^*; y_1] = \tilde\Sigma_{11} - \tilde\Sigma_{12}\tilde\Sigma_{22}^{-1}\tilde\Sigma_{21}.
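For the numerical example, the two quantities in (63.4.5) and (63.4.6) can be checked with a few lines of NumPy (a sketch; the variable names are mine):

import numpy as np

mu = np.array([6.0, 1.0, 4.0])
Sigma = np.array([[0.5, 0.25, 1.0],
                  [0.25, 0.5, 0.75],
                  [1.0, 0.75, 2.5]])

# partition: index 0 is missing (y_1), indices 1 and 2 are observed (y_2)
y_obs = np.array([0.0, 3.0])
S12 = Sigma[0, 1:]
S22 = Sigma[1:, 1:]

y_star = mu[0] + S12 @ np.linalg.solve(S22, y_obs - mu[1:])   # approx. 5.73
mse = Sigma[0, 0] - S12 @ np.linalg.solve(S22, S12)           # conditional MSE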
These two results suffice to compute E[y_1 y_1^⊤ | y_2; µ̃, Σ̃]. From y_1 = (y_1 − y_1^*) + y_1^* follows

(63.4.7)  y_1 y_1^\top = (y_1 - y_1^*)(y_1 - y_1^*)^\top + (y_1 - y_1^*)(y_1^*)^\top + (y_1^*)(y_1 - y_1^*)^\top + (y_1^*)(y_1^*)^\top.
Now take conditional expectations:

(63.4.8)  E[y_1 y_1^\top \mid y_2;\ \tilde\mu, \tilde\Sigma] = \tilde\Sigma_{11} - \tilde\Sigma_{12}\tilde\Sigma_{22}^{-1}\tilde\Sigma_{21} + O + O + (y_1^*)(y_1^*)^\top.
For the cross products with the observed values one can apply the linearity of the (conditional) expectations operator:

(63.4.9)  E[y_1 y_2^\top \mid y_2;\ \tilde\mu, \tilde\Sigma] = (y_1^*)\, y_2^\top.
Therefore one obtains

(63.4.10)  E\left[\begin{pmatrix} y_1 y_1^\top & y_1 y_2^\top \\ y_2 y_1^\top & y_2 y_2^\top \end{pmatrix} \,\middle|\, y_2;\ \tilde\mu, \tilde\Sigma\right]
= \begin{pmatrix} \tilde\Sigma_{11} - \tilde\Sigma_{12}\tilde\Sigma_{22}^{-1}\tilde\Sigma_{21} + (y_1^*)(y_1^*)^\top & y_1^*\, y_2^\top \\ y_2\,(y_1^*)^\top & y_2 y_2^\top \end{pmatrix}
In our numerical example this gives
\begin{align}
E[y_{11}^2 \mid \cdots] &= 1/2 - \begin{pmatrix} 1/4 & 1 \end{pmatrix}\begin{pmatrix} 1/2 & 3/4 \\ 3/4 & 5/2 \end{pmatrix}^{-1}\begin{pmatrix} 1/4 \\ 1 \end{pmatrix} + (5.73)^2 = 32.99 \tag{63.4.11}\\
E\bigl[\begin{pmatrix} y_{11}y_{12} & y_{11}y_{13} \end{pmatrix} \mid \cdots\bigr] &= 5.73\begin{pmatrix} 0 & 3 \end{pmatrix} = \begin{pmatrix} 0 & 17.18 \end{pmatrix} \tag{63.4.12}
\end{align}
Problem 519. Compute in the same way for the last row of Y:
\begin{align}
E\bigl[\begin{pmatrix} y_{41} & y_{42} \end{pmatrix} \mid \cdots\bigr] &= \begin{pmatrix} y_{41}^* & y_{42}^* \end{pmatrix} = \begin{pmatrix} 6.4 & 1.3 \end{pmatrix} \tag{63.4.13}\\
E\left[\begin{pmatrix} y_{41}^2 & y_{41}y_{42} \\ y_{42}y_{41} & y_{42}^2 \end{pmatrix} \,\middle|\, \cdots\right] &= \begin{pmatrix} 41.06 & 8.27 \\ 8.27 & 1.97 \end{pmatrix} \tag{63.4.14}\\
E\left[\begin{pmatrix} y_{41}y_{43} \\ y_{42}y_{43} \end{pmatrix} \,\middle|\, \cdots\right] &= \begin{pmatrix} 32.0 \\ 6.5 \end{pmatrix} \tag{63.4.15}
\end{align}
Answer. This is in Johnson and Wichern, p. 200. 
Now switch back to the more usual notation, in which y_i is the ith row vector of Y and ȳ the vector of column means. Since S^(n) = (1/n) ∑_i y_i y_i^⊤ − ȳȳ^⊤, one can obtain from the above the value of

(63.4.16)  E[S^{(n)} \mid \text{all observed values in } Y;\ \tilde\mu, \tilde\Sigma].
Of course, in a similar, much simpler fashion one obtains

(63.4.17)  E[\bar y \mid \text{all observed values in } Y;\ \tilde\mu, \tilde\Sigma].
In our numerical example, therefore, we obtain
\begin{align}
E[Y^\top\iota \mid \cdots] &= E\left[\begin{pmatrix} y_{11} & 7 & 5 & y_{41} \\ 0 & 2 & 1 & y_{42} \\ 3 & 6 & 2 & 5 \end{pmatrix}\begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} \,\middle|\, \cdots\right]
= \begin{pmatrix} 5.73 & 7 & 5 & 6.4 \\ 0 & 2 & 1 & 1.3 \\ 3 & 6 & 2 & 5 \end{pmatrix}\begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}
= \begin{pmatrix} 24.13 \\ 4.30 \\ 16.00 \end{pmatrix} \tag{63.4.18}\\
E[Y^\top Y \mid \cdots] &= E\left[\begin{pmatrix} y_{11} & 7 & 5 & y_{41} \\ 0 & 2 & 1 & y_{42} \\ 3 & 6 & 2 & 5 \end{pmatrix}\begin{pmatrix} y_{11} & 0 & 3 \\ 7 & 2 & 6 \\ 5 & 1 & 2 \\ y_{41} & y_{42} & 5 \end{pmatrix}\right]
= \begin{pmatrix} 148.05 & 27.27 & 101.18 \\ 27.27 & 6.97 & 20.50 \\ 101.18 & 20.50 & 74.00 \end{pmatrix} \tag{63.4.19}
\end{align}
The next step is to plug those estimated values of Y^⊤ι and Y^⊤Y into the likelihood function and get the maximum likelihood estimates of µ and Σ, in other words, set mean and dispersion matrix equal to the sample mean vector and sample dispersion matrix computed from these complete sufficient statistics:
\begin{align}
\tilde{\tilde\mu} &= \frac{1}{n}Y^\top\iota = \begin{pmatrix} 6.03 \\ 1.08 \\ 4.00 \end{pmatrix} \tag{63.4.20}\\
\tilde{\tilde\Sigma} &= \frac{1}{n}Y^\top\Bigl(I - \frac{1}{n}\iota\iota^\top\Bigr)Y = \frac{1}{n}Y^\top Y - \tilde{\tilde\mu}\tilde{\tilde\mu}^\top = \begin{pmatrix} .65 & .31 & 1.18 \\ .31 & .58 & .81 \\ 1.18 & .81 & 2.50 \end{pmatrix}, \tag{63.4.21}
\end{align}
then predict the missing observations anew.
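The whole procedure is easy to code. Below is a minimal sketch of one EM pass for this kind of problem (my own illustration, not taken from the notes; the function name em_step and the use of NaN markers for missing entries are my choices): the prediction step fills in the conditional expectations of the missing entries and of their squares and cross products, and the estimation step recomputes µ̃ and Σ̃ from the completed sufficient statistics. In practice one repeats this until the estimates stop changing.

import numpy as np

def em_step(Y, mu, Sigma):
    # one EM pass for a multivariate normal sample with missing entries (NaN)
    n, r = Y.shape
    sum_y = np.zeros(r)                  # predicted Y' iota
    sum_yy = np.zeros((r, r))            # predicted Y'Y
    for row in Y:
        obs = ~np.isnan(row)
        mis = ~obs
        y = row.copy()
        C = np.zeros((r, r))             # conditional covariance of the missing part
        if mis.any():
            dev = np.linalg.solve(Sigma[np.ix_(obs, obs)], row[obs] - mu[obs])
            y[mis] = mu[mis] + Sigma[np.ix_(mis, obs)] @ dev       # as in (63.4.4)
            C[np.ix_(mis, mis)] = (Sigma[np.ix_(mis, mis)]
                                   - Sigma[np.ix_(mis, obs)]
                                   @ np.linalg.solve(Sigma[np.ix_(obs, obs)],
                                                     Sigma[np.ix_(obs, mis)]))  # (63.4.6)
        sum_y += y
        sum_yy += np.outer(y, y) + C                               # as in (63.4.10)
    mu_new = sum_y / n                                             # (63.4.20)
    Sigma_new = sum_yy / n - np.outer(mu_new, mu_new)              # (63.4.21)
    return mu_new, Sigma_new

# Johnson and Wichern's example (63.4.1), with NaN for the missing entries
Y = np.array([[np.nan, 0, 3], [7, 2, 6], [5, 1, 2], [np.nan, np.nan, 5]])
mu = np.array([6.0, 1.0, 4.0])
Sigma = np.array([[0.5, 0.25, 1.0], [0.25, 0.5, 0.75], [1.0, 0.75, 2.5]])
for _ in range(20):                      # iterate to (approximate) convergence
    mu, Sigma = em_step(Y, mu, Sigma)

On this example, the first prediction step reproduces y*_11 ≈ 5.73 and (y*_41, y*_42) ≈ (6.4, 1.3) from above.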
63.5. Wishart Distribution
The Wishart distribution is a multivariate generalization of the σ²χ². The non-central Wishart is the distribution of Y^⊤Y if Y is normally distributed as above. But we will be mainly interested in the central Wishart distribution.
Let Z = \begin{pmatrix} z_1^\top \\ \vdots \\ z_r^\top \end{pmatrix} where z_j ∼ NID(o, Σ). Then the joint distribution of Z^⊤Z = ∑_{j=1}^{r} z_j z_j^⊤ is called a (central) Wishart distribution, notation Z^⊤Z ∼ W(r, Σ). r is the number of degrees of freedom. The following theorem is exactly parallel to theorem 10.4.3.
Theorem 63.5.1. Let Z = \begin{pmatrix} z_1^\top \\ \vdots \\ z_n^\top \end{pmatrix} where z_j ∼ NID(o, Σ), and let P be symmetric and of rank r. A necessary and sufficient condition for Z^⊤PZ to have a Wishart distribution with covariance matrix Σ is P² = P. In this case, this Wishart distribution has r degrees of freedom.
Proof of sufficiency: If P² = P with rank r, an r × n matrix T exists with P = T^⊤T and TT^⊤ = I. Therefore Z^⊤PZ = Z^⊤T^⊤TZ. Define X = TZ. Writing x_i for the column vectors of X, we know C[x_i, x_j] = σ_ij TT^⊤ = σ_ij I. For the rows of X this means they are independent of each other and each of them ∼ N(o, Σ). Since there are r rows, the result follows.
Necessity: Take a vector c with c^⊤Σc = 1. Then c^⊤z_j ∼ N(0, 1) for each j, and c^⊤z_j is independent of c^⊤z_k for j ≠ k. Therefore Zc ∼ N(o, I). It follows also that TZc = Xc ∼ N(o, I) (the first vector having n and the second r components). Therefore c^⊤Z^⊤PZc is distributed as a χ², so we can use the necessity condition in theorem 10.4.3 to show that P is idempotent.
As an application it follows from (63.1.2) that W = nS^(n) ∼ W(n − 1, Σ).
One can also show the following generalization of Craig’s theorem: If Z is as above, then Z^⊤PZ is independent of Z^⊤QZ if and only if PQ = O.
63.6. Sample Correlation Coefficients
What is the distribution of the sample correlation coefficients, and also of the
various multiple and partial correlation coefficients in the above model? Suffice it to
remark at this point that this is a notoriously difficult question. We will only look at
one special case, which also illustrates the use of random orthogonal transformations.
Look at the following scenario: our matrix Y has two columns only; write it as

(63.6.1)  Y = \begin{pmatrix} u & v \end{pmatrix} = \begin{pmatrix} u_1 & v_1 \\ \vdots & \vdots \\ u_n & v_n \end{pmatrix}

and we assume each row y_j^⊤ = (u_j  v_j) to be an independent sample of the same bivariate normal distribution, characterized by the means µ_u, µ_v, the variances σ_uu, σ_vv, and the correlation coefficient ρ (but none of these five parameters are known). The goal
is to compute the distribution of the sample correlation coefficient

(63.6.2)  r = \frac{\sum (u_i - \bar u)(v_i - \bar v)}{\sqrt{\sum (u_i - \bar u)^2}\,\sqrt{\sum (v_i - \bar v)^2}}

if the true ρ is zero.
We know that u ∼ N(o, σ_uu I). Under the null hypothesis, u is independent of v, therefore its distribution conditionally on v is the same as its unconditional distribution. Furthermore look at the matrix consisting of random elements

(63.6.3)  P = \begin{pmatrix} 1/\sqrt{n} & \cdots & 1/\sqrt{n} \\ (v_1 - \bar v)/\sqrt{s_{vv}} & \cdots & (v_n - \bar v)/\sqrt{s_{vv}} \end{pmatrix}

(where s_vv here denotes the sum of squared deviations ∑(v_i − v̄)², and similarly s_uu below).
It satisfies PP^⊤ = I, i.e., it is an incomplete orthogonal matrix. The use of random orthogonal transformations is an important trick which simplifies many proofs in multivariate statistics. Conditionally on v, the matrix P is of course constant, and therefore, by theorem 10.4.2, conditionally on v the vector w = Pu is normal with uncorrelated components that have the same variance σ_uu, and q = u^⊤u − w^⊤w is an independent σ_uu χ²_{n−2}. In other words, conditionally on v, the following three variables are mutually independent and have the following distributions:
\begin{align}
w_1 &= \sqrt{n}\,\bar u \sim N(0, \sigma_{uu}) \tag{63.6.4}\\
w_2 &= \frac{\sum (u_i - \bar u)(v_i - \bar v)}{\sqrt{\sum (v_i - \bar v)^2}} = r\sqrt{s_{uu}} \sim N(0, \sigma_{uu}) \tag{63.6.5}\\
q &= \sum u_i^2 - n\bar u^2 - w_2^2 = (1 - r^2)\,s_{uu} \sim \sigma_{uu}\chi^2_{n-2} \tag{63.6.6}
\end{align}
Since the values of v do not enter any of these distributions, these are also the unconditional distributions. Therefore we can form a simple function of r which has a t-distribution:

(63.6.7)  \frac{w_2}{\sqrt{q/(n-2)}} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}

This can be used to test whether ρ = 0.
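A minimal sketch of the resulting test (my own illustration; scipy's t distribution is used only for the p-value):

import numpy as np
from scipy import stats

def corr_test(u, v):
    # two-sided test of rho = 0 based on (63.6.7)
    n = len(u)
    r = np.corrcoef(u, v)[0, 1]
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    p_value = 2 * stats.t.sf(abs(t), df=n - 2)
    return r, t, p_value

rng = np.random.default_rng(4)
u, v = rng.normal(size=30), rng.normal(size=30)   # independent, so rho = 0
print(corr_test(u, v))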
CHAPTER 64
Pooling of Cross Section and Time Series Data
We are given m cross-sectional units, each of which has been observed for t time periods. The dependent variable for cross-sectional unit i at time s is y_si. There are also k independent variables, and the value of the jth independent variable for cross-sectional unit i at time s is x_sij. I.e., instead of a vector, the dependent variable is a matrix, and instead of a matrix, the independent variables form a 3-way array. We will discuss three different models here which assign equal slope parameters to the different cross-sectional units but which differ in their treatment of the intercept.
64.1. OLS Model
The most restrictive model of the three assumes that all cross-sectional units have the same intercept µ. I.e.,

(64.1.1)  y_{si} = \mu + \sum_{j=1}^{k} x_{sij}\beta_j + \varepsilon_{si}, \qquad s = 1,\dots,t,\; i = 1,\dots,m,

where the error terms are uncorrelated and have equal variance σ²_ε.
In tile notation:

(64.1.2)  [tile diagram: the t × m array Y equals ι µ ι plus the t × m × k array X contracted with β over the k dimension, plus the t × m disturbance array E]
In matrix notation this model can be written as

(64.1.3)  Y = \iota\mu\iota^\top + \begin{pmatrix} X_1\beta & \cdots & X_m\beta \end{pmatrix} + E

where Y = (y_1  ···  y_m) is t × m, each of the X_i is t × k, the first ι is the t-vector of ones and the second ι the m-vector of ones, µ is the intercept and β the k-vector of slope coefficients, and E = (ε_1  ···  ε_m) the matrix of disturbances. The notation (X_1β  ···  X_mβ) represents a matrix obtained by the multiplication of a 3-way array with a vector. We assume vec E ∼ (o, σ²I).
If one vectorizes this one gets

(64.1.4)  \operatorname{vec}(Y) = \begin{pmatrix} \iota & X_1 \\ \iota & X_2 \\ \vdots & \vdots \\ \iota & X_m \end{pmatrix}\begin{pmatrix} \mu \\ \beta \end{pmatrix} + \operatorname{vec}(E)
\quad\text{or}\quad
\operatorname{vec}(Y) = \begin{pmatrix} \iota \\ \iota \\ \vdots \\ \iota \end{pmatrix}\mu + \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix}\beta + \operatorname{vec}(E)
Using the abbreviation

(64.1.5)  Z = \begin{pmatrix} X_1 \\ \vdots \\ X_m \end{pmatrix}
this can also be written

(64.1.6)  \operatorname{vec}(Y) = \iota\mu + Z\beta + \operatorname{vec}(E) = \begin{pmatrix} \iota & Z \end{pmatrix}\begin{pmatrix} \mu \\ \beta \end{pmatrix} + \operatorname{vec}(E).
Problem 520. 1 point Show that vec(X_1β  ···  X_mβ) = Zβ with Z as just defined.
Answer.

(64.1.7)  \operatorname{vec}\begin{pmatrix} X_1\beta & \cdots & X_m\beta \end{pmatrix} = \begin{pmatrix} X_1\beta \\ \vdots \\ X_m\beta \end{pmatrix} = \begin{pmatrix} X_1 \\ \vdots \\ X_m \end{pmatrix}\beta = Z\beta
One gets the parameter estimates by running OLS on (64.1.4), i.e., regressing vec Y on Z with an intercept.
64.2. The Between-Estimator
By premultiplying (64.1.3) by (1/t)ι^⊤ one obtains the so-called “between”-regression. Defining ȳ^⊤ = (1/t)ι^⊤Y, i.e., ȳ^⊤ is the row vector consisting of the column means, and in the same way x̄_i^⊤ = (1/t)ι^⊤X_i and ε̄^⊤ = (1/t)ι^⊤E, one obtains

(64.2.1)  \bar y^\top = \mu\iota^\top + \begin{pmatrix} \bar x_1^\top\beta & \cdots & \bar x_m^\top\beta \end{pmatrix} + \bar\varepsilon^\top = \mu\iota^\top + (\bar X\beta)^\top + \bar\varepsilon^\top
\quad\text{where } \bar X = \begin{pmatrix} \bar x_1^\top \\ \vdots \\ \bar x_m^\top \end{pmatrix}.
If one transposes this one obtains ȳ = ιµ + X̄β + ε̄.
In tiles, the between model is obtained from (64.1.2) by attaching the tile ι^⊤/t along the t dimension:

(64.2.2)  [tile diagram: (1/t)ι^⊤Y = µι^⊤ + (1/t)ι^⊤(Xβ) + (1/t)ι^⊤E, each term a 1 × m array]
If one runs this regression one will get estimates of µ and β which are less efficient than those from the full regression. But these regressions are consistent even if the error terms in the same column are correlated (as they are in the Random Effects model).
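A sketch of the between-estimator in the same style as the pooled-OLS sketch above (again my own illustration with simulated data): average each unit’s observations over time and regress the m column means ȳ on the rows of X̄ with an intercept.

import numpy as np

rng = np.random.default_rng(6)
t_, m_, k_ = 10, 6, 2
X = rng.normal(size=(t_, m_, k_))
Y = 2.0 + X @ np.array([1.5, -0.5]) + rng.normal(scale=0.1, size=(t_, m_))

ybar = Y.mean(axis=0)                              # (1/t) iota' Y : the m column means
Xbar = X.mean(axis=0)                              # m x k matrix whose rows are xbar_i'
Wb = np.column_stack([np.ones(m_), Xbar])          # add the intercept column
mu_b, *beta_b = np.linalg.lstsq(Wb, ybar, rcond=None)[0]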
64.3. Dummy Variable Model (Fixed Effects)
While maintaining the assumption that the cross-sectional units have the same slope parameters, we are now allowing a different intercept for each unit. I.e., the