
Applied Econometrics
Lecture 7: Multicollinearity

Doubt whom you will, but never yourself.

1) Introduction

The multiple regression model can be written as follows:

Y_i = b_0 + b_1 X_1 + b_2 X_2 + ... + b_k X_k


Collinearity refers to a linear relationship between two X variables. Multicollinearity encompasses linear relationships among more than two X variables. Multiple regression is impossible in the presence of perfect collinearity or multicollinearity. If X_1 and X_2 have no independent variation, we cannot estimate the effect of X_1 adjusting for X_2, or vice versa; one of the variables must be dropped. This is no loss, since a perfect relationship implies perfect redundancy. Perfect multicollinearity is, however, rarely a problem in practice. Strong (but not perfect) multicollinearity, which permits estimation but makes it less precise, is far more common. When multicollinearity is present, the interpretation of the coefficients becomes quite difficult.

2) Practical consequences of multicollinearity

Standard errors of coefficients
The easiest way to tell whether multicollinearity is causing problems is to examine the standard errors of the coefficients. If several coefficients have high standard errors, and dropping one or more variables from the equation lowers the standard errors of the remaining variables, multicollinearity is likely to be the source of the problem.

A more sophisticated analysis would take into account the fact that the covariance between estimated parameters may be sensitive to multicollinearity (a high degree of multicollinearity will be associated with a relatively high covariance between estimated parameters). This suggests that if one estimated parameter b_i overestimates the true parameter β_i, a second estimate b_j is likely to underestimate β_j, and vice versa.

Because of the large standard errors, the confidence intervals for the relevant population parameters tend to be wide.
Sensitive coefficients
One consequence of high correlation between explanatory variables is that the parameter estimates become very sensitive to the addition or deletion of observations.

A high R² but few significant t-ratios
Few of the coefficients are statistically significantly different from zero, even though the coefficient of determination is high.
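The following minimal Python sketch illustrates these symptoms on simulated data (the data, seeds and variable names are illustrative assumptions, not part of the lecture): two nearly identical regressors produce a high R², large standard errors, small t-ratios, and a strongly negative covariance between the two slope estimates.

```python
# Minimal simulation of the symptoms above: two highly correlated regressors
# give a high R-squared, large standard errors, and small t-ratios.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)     # x2 is almost a copy of x1
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)    # OLS coefficients
resid = y - X @ b
k = X.shape[1]
sigma2 = resid @ resid / (n - k)             # error variance estimate
cov_b = sigma2 * np.linalg.inv(X.T @ X)      # variance-covariance of coefficients
se = np.sqrt(np.diag(cov_b))
r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

print("R-squared  :", round(r2, 3))          # high
print("coefficients:", b.round(2))
print("std. errors :", se.round(2))          # large for x1 and x2
print("t-ratios    :", (b / se).round(2))    # small for x1 and x2
print("cov(b1, b2) :", round(cov_b[1, 2], 2))  # strongly negative
```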

3) Detection of multicollinearity

3.1) There is a high R² but few significant t-ratios. The F test rejects the hypothesis that the partial slope coefficients are simultaneously equal to zero, but the individual t tests show that none or very few of the partial slope coefficients are statistically different from zero.

3.2) Multicollinearity can be considered a serious problem only if R²_y < R²_i (Klein, 1962), where R²_y is the squared multiple correlation coefficient between Y and the explanatory variables and R²_i is the squared multiple correlation coefficient between X_i and the other explanatory variables. Two caveats apply:
- Even if R²_y < R²_i, the t-values for the regression coefficients may still be statistically significant.
- Even if R²_i is very high, the simple correlations among the regressors may be comparatively low.
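A rough sketch of how Klein's rule might be checked in practice, using simulated data and one auxiliary regression per regressor (the r_squared helper, the data and the seeds are illustrative assumptions):

```python
# Sketch of Klein's rule: compare the R^2 of Y on all X's with the auxiliary
# R^2 of each X_i on the remaining X's; multicollinearity is flagged when an
# auxiliary R^2 exceeds the overall R^2.
import numpy as np

def r_squared(y, X):
    """R^2 from an OLS regression of y on X (X should include a constant)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + rng.normal(scale=0.3, size=n)   # strongly related to x1
x3 = rng.normal(size=n)                          # unrelated regressor
y = 1 + x1 + x2 + x3 + rng.normal(size=n)

ones = np.ones(n)
r2_y = r_squared(y, np.column_stack([ones, x1, x2, x3]))

regressors = {"x1": x1, "x2": x2, "x3": x3}
for name, xi in regressors.items():
    others = [v for key, v in regressors.items() if key != name]
    r2_i = r_squared(xi, np.column_stack([ones] + others))
    flag = "  <-- exceeds R^2_y" if r2_i > r2_y else ""
    print(f"auxiliary R^2 for {name}: {r2_i:.3f}{flag}")
print(f"overall R^2_y: {r2_y:.3f}")
```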

3.3) In the regression of Y on X_2, X_3, and X_4, if one finds that R²_1.234 is very high but r²_12.34, r²_13.24, and r²_14.23 are comparatively low, this may suggest that the variables X_2, X_3, and X_4 are highly intercorrelated and that at least one of them is superfluous.

3.4) We may use the overall F test to check whether there is a relationship between any one explanatory variable and the remaining explanatory variables.
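One possible implementation of this check, sketched on simulated data: regress each X_i on the remaining regressors and compute the overall F statistic of that auxiliary regression, F = (R²_i/(k-1)) / ((1-R²_i)/(n-k)), where k is the number of regressors. The data are hypothetical and scipy is assumed to be available for the p-value.

```python
# Sketch: overall F test of each auxiliary regression (X_i on the remaining
# regressors); a large F suggests X_i is well explained by the others.
import numpy as np
from scipy import stats

def r_squared(y, X):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(10)
n, k = 100, 3
X = rng.normal(size=(n, k))
X[:, 1] = 0.9 * X[:, 0] + 0.2 * X[:, 1]          # induce collinearity

ones = np.ones((n, 1))
for i in range(k):
    r2_i = r_squared(X[:, i], np.hstack([ones, np.delete(X, i, axis=1)]))
    F = (r2_i / (k - 1)) / ((1 - r2_i) / (n - k))
    p = stats.f.sf(F, k - 1, n - k)               # upper-tail p-value
    print(f"X{i + 1}: auxiliary R^2 = {r2_i:.3f}, F = {F:.1f}, p = {p:.3f}")
```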

3.5) In the regression of Y on X_1 and X_2, we may calculate λ from the following equation:

(S_11 - λ)(S_22 - λ) - S_12² = 0

where

S_11 = Σ_{i=1..n} (X_1i - X̄_1)²
S_22 = Σ_{i=1..n} (X_2i - X̄_2)²
S_12 = Σ_{i=1..n} (X_1i - X̄_1)(X_2i - X̄_2)
The condition number (Raduchel, 1971; Belsley, Kuh and Welsch, 1980) is defined as:

CN = λ_1 / λ_2,   where λ_1 > λ_2

If CN is between 10 and 30, there is moderate to strong multicollinearity.
If CN is greater than 30, there is severe multicollinearity.
The closer the condition number is to one, the better conditioned the data are.
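A short numpy sketch of the calculation on simulated data, building S from centered regressors as defined above and taking the ratio of its eigenvalues (the data are hypothetical):

```python
# Sketch: eigenvalues of the centered cross-product matrix of two regressors
# and the resulting condition number lambda_1 / lambda_2.
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.4, size=n)   # strongly related to x1

d1, d2 = x1 - x1.mean(), x2 - x2.mean()
S = np.array([[d1 @ d1, d1 @ d2],
              [d2 @ d1, d2 @ d2]])              # S_11, S_12, S_22 as in the text

lam = np.sort(np.linalg.eigvalsh(S))[::-1]      # lambda_1 >= lambda_2
cn = lam[0] / lam[1]
print("eigenvalues     :", lam.round(1))
print("condition number:", round(cn, 1))
```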

3.6) Theil's test (Theil, H. (1971), Principles of Econometrics, New York: Wiley, p. 179)
Calculate m, which is defined as:

m = R² - Σ_{i=1..k} (R² - R²_-i)

where
R² is the squared multiple correlation coefficient between Y and the explanatory variables (X_1, X_2, ..., X_i, ..., X_k)
R²_-i is the squared multiple correlation coefficient between Y and the explanatory variables (X_1, X_2, ..., X_{i-1}, X_{i+1}, ..., X_k), that is, with X_i omitted

If (X_1, X_2, ..., X_i, ..., X_k) are mutually uncorrelated, then m will be zero.
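A sketch of the computation on simulated data, dropping one regressor at a time (the r_squared helper, the data and the true coefficients are illustrative assumptions):

```python
# Sketch of Theil's measure m = R^2 - sum_i (R^2 - R^2_{-i}).
import numpy as np

def r_squared(y, X):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(3)
n = 120
X = rng.normal(size=(n, 3))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]          # make columns 1 and 2 collinear
y = 1 + X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)

ones = np.ones((n, 1))
r2_full = r_squared(y, np.hstack([ones, X]))

m = r2_full
for i in range(X.shape[1]):
    X_minus_i = np.delete(X, i, axis=1)          # omit X_i
    r2_minus_i = r_squared(y, np.hstack([ones, X_minus_i]))
    m -= (r2_full - r2_minus_i)                  # incremental contribution of X_i

print("R^2 (all regressors):", round(r2_full, 3))
print("Theil's m           :", round(m, 3))      # near zero if the X's are uncorrelated
```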

3.7) Variance-Inflation Factor (VIF). The VIF is defined as:

VIF(β̂_i) = 1 / (1 - R²_i)

where R²_i is the squared multiple correlation coefficient between X_i and the other explanatory variables. We may calculate it for each explanatory variable separately. The VIF_i measures the degree of multicollinearity among the regressors with reference to the ideal situation in which all explanatory variables are uncorrelated (R²_i = 0 implies VIF_i = 1). The VIF_i can be interpreted as the ratio of the actual variance of β̂_i to what the variance of β̂_i would have been if X_i were uncorrelated with the remaining X's. The VIF_i will be useful for dropping some variables and imposing parameter constraints only in some very extreme cases where R²_i is approximately equal to one.
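A sketch of the VIF computation from the auxiliary regressions on simulated data (the vif helper is an illustrative implementation of the formula above, not a library routine):

```python
# Sketch: variance-inflation factors VIF_i = 1 / (1 - R^2_i) computed from
# the auxiliary regressions of each regressor on the others.
import numpy as np

def r_squared(y, X):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def vif(X):
    """VIF for each column of X (X passed without a constant column)."""
    n, k = X.shape
    ones = np.ones((n, 1))
    out = []
    for i in range(k):
        others = np.hstack([ones, np.delete(X, i, axis=1)])
        r2_i = r_squared(X[:, i], others)
        out.append(1.0 / (1.0 - r2_i))
    return np.array(out)

rng = np.random.default_rng(4)
n = 150
X = rng.normal(size=(n, 3))
X[:, 2] = 0.8 * X[:, 0] + 0.6 * X[:, 1] + rng.normal(scale=0.2, size=n)

# All three VIFs are inflated because column 3 is nearly a combination of 1 and 2.
print("VIFs:", vif(X).round(2))
```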

4) Remedial measures

4.1) Getting more data: Increasing the size of the sample may reduce the multicollinearity problem.
The variance of the coefficient is defined as follows:



V(β̂_i) = σ² / [S_ii (1 - R²_i)]

where
σ² is the variance of the error term
S_ii = Σ_{t=1..n} (X_it - X̄_i)²
R²_i is the squared multiple correlation coefficient between X_i and the other explanatory variables

As the sample size increases, S_ii will increase. Therefore, for any given R²_i, the variance of the coefficient V(β̂_i) will decrease, thus decreasing the standard error, which enables us to estimate β_i more precisely.
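A small simulation sketch of this point: holding the degree of collinearity between the regressors roughly constant, the standard error of a slope coefficient falls as the sample size grows (the data, seeds and coefficients are arbitrary assumptions):

```python
# Sketch: with the collinearity between x1 and x2 held roughly fixed, the
# standard error of b1 shrinks as the sample size grows.
import numpy as np

def slope_se(n, seed):
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = 0.9 * x1 + rng.normal(scale=0.4, size=n)   # same collinearity at every n
    y = 1 + 2 * x1 + 2 * x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    sigma2 = resid @ resid / (n - X.shape[1])
    cov_b = sigma2 * np.linalg.inv(X.T @ X)
    return np.sqrt(cov_b[1, 1])                     # standard error of b1

for n in (30, 100, 1000):
    print(f"n = {n:4d}   se(b1) = {slope_se(n, seed=5):.3f}")
```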

4.2) Transforming variables (using ratios or first differences): A ratio or first-difference regression model often reduces the severity of multicollinearity. However, the first-difference regression may generate additional problems: (i) the error terms may be serially correlated; (ii) one observation is lost; (iii) the first-differencing procedure may not be appropriate for cross-sectional data, where there is no logical ordering of the observations.
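A short sketch of the first-difference point on simulated trending series (the series are hypothetical): the levels are highly correlated, the differences much less so, and one observation is lost in differencing.

```python
# Sketch: two trending series are highly correlated in levels but much less
# so in first differences.
import numpy as np

rng = np.random.default_rng(6)
n = 80
t = np.arange(n)
x1 = 0.5 * t + rng.normal(scale=2, size=n)     # trending series
x2 = 0.5 * t + rng.normal(scale=2, size=n)     # another trending series

dx1, dx2 = np.diff(x1), np.diff(x2)            # first differences (one observation lost)

print("corr in levels     :", round(np.corrcoef(x1, x2)[0, 1], 2))
print("corr in differences:", round(np.corrcoef(dx1, dx2)[0, 1], 2))
```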

4.3) Dropping variables: As discussed in previous lectures, dropping a variable to alleviate the problem of multicollinearity may lead to specification bias. Hence the remedy may be worse than the disease in some situations: while multicollinearity may prevent precise estimation of the parameters of the model, omitting a variable may seriously mislead us as to the true values of the parameters.


4.4) Using extraneous estimates (Tobin, 1950): The equation to be estimated is:

lnQ = α̂ + β̂_1 lnP + β̂_2 lnI

where Q, P and I represent the quantity of the product, price and income respectively. In the time-series data, income and price were highly collinear.

First, we estimate the income elasticity β̂_2 from cross-sectional data; because the data refer to a single point in time, prices do not vary much. This β̂_2 is known as the extraneous estimate.
Second, we regress (lnQ - β̂_2 lnI) on lnP to get the estimates of α̂ and β̂_1.

The approach implicitly assumes that the income elasticity estimated from the cross section does not change over time, which may be questionable. However, the technique may be worth considering in situations where the cross-sectional estimates do not vary substantially from one cross section to another.
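A sketch of the two-step idea with synthetic data (the true elasticities, sample sizes and variable names are illustrative assumptions, not Tobin's actual data):

```python
# Sketch of the "extraneous estimate" idea: step 1 estimates the income
# elasticity from a cross section (little price variation), step 2 plugs it
# into the collinear time-series regression.
import numpy as np

rng = np.random.default_rng(7)

# Step 1: cross section, prices roughly constant across households.
m = 300
lnI_cs = rng.normal(size=m)
lnQ_cs = 0.5 + 0.8 * lnI_cs + rng.normal(scale=0.3, size=m)    # true beta2 = 0.8
Xcs = np.column_stack([np.ones(m), lnI_cs])
beta2_hat = np.linalg.lstsq(Xcs, lnQ_cs, rcond=None)[0][1]     # extraneous estimate

# Step 2: time series, price and income highly collinear.
n = 40
lnI_ts = 0.05 * np.arange(n) + rng.normal(scale=0.02, size=n)
lnP_ts = 0.04 * np.arange(n) + rng.normal(scale=0.02, size=n)  # moves with income
lnQ_ts = 1.0 - 0.6 * lnP_ts + 0.8 * lnI_ts + rng.normal(scale=0.05, size=n)

y_adj = lnQ_ts - beta2_hat * lnI_ts                            # subtract the income effect
Xts = np.column_stack([np.ones(n), lnP_ts])
alpha_hat, beta1_hat = np.linalg.lstsq(Xts, y_adj, rcond=None)[0]

print("beta2 (income elasticity) from cross section     :", round(beta2_hat, 2))
print("beta1 (price elasticity) from adjusted time series:", round(beta1_hat, 2))
```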

4.5) Using a priori information: Consider the following equation:

Y_1 = β̂_1 X_1 + β̂_2 X_2

We cannot get good estimates of β̂_1 and β̂_2 because of the high correlation between X_1 and X_2. We can get an estimate of β̂_1 from another data set and another equation:

Y_2 = β̂_1 X_1 + α̂_2 Z

X_1 and Z are not highly correlated, so we get a good estimate of β̂_1. Then we regress (Y_1 - β̂_1 X_1) on X_2 to get an estimate of β̂_2.
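A sketch of the same plug-in idea with synthetic data (both data sets, the true coefficients and the absence of intercepts follow the stylized equations above and are illustrative assumptions):

```python
# Sketch: beta1 is taken from a second data set in which X1 and Z are not
# collinear, then (Y1 - beta1_hat * X1) is regressed on X2.
import numpy as np

rng = np.random.default_rng(8)
n = 100

# Data set 2: X1 and Z nearly uncorrelated, so beta1 is well estimated here.
x1b = rng.normal(size=n)
z = rng.normal(size=n)
y2 = 1.5 * x1b + 0.7 * z + rng.normal(scale=0.3, size=n)
beta1_hat = np.linalg.lstsq(np.column_stack([x1b, z]), y2, rcond=None)[0][0]

# Data set 1: X1 and X2 highly collinear.
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=n)
y1 = 1.5 * x1 + 2.0 * x2 + rng.normal(scale=0.3, size=n)

y_adj = y1 - beta1_hat * x1                       # impose the a priori estimate
beta2_hat = np.linalg.lstsq(x2.reshape(-1, 1), y_adj, rcond=None)[0][0]

print("beta1 from data set 2         :", round(beta1_hat, 2))
print("beta2 from adjusted regression:", round(beta2_hat, 2))
```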

5) Fragility analysis: Making sense of slope coefficients

It is a useful exercise to investigate the sensitivity of the regression coefficients across plausible neighboring specifications, to check the fragility of the inferences we make on the basis of any one specification when there is uncertainty as to which variables to include.
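As a sketch of such a fragility check on simulated data (variable names and coefficients are illustrative), the coefficient on X_1 can be re-estimated across every specification that includes it, drawn from the 2^k - 1 possible equations discussed below:

```python
# Sketch of a fragility check: re-estimate the coefficient on x1 across all
# specifications (subsets of the candidate regressors) that contain x1.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(9)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=n)     # collinear with x1
x3 = rng.normal(size=n)
y = 1 + 1.0 * x1 + 1.0 * x2 + 0.5 * x3 + rng.normal(size=n)

candidates = {"x1": x1, "x2": x2, "x3": x3}
names = list(candidates)

for r in range(1, len(names) + 1):
    for subset in combinations(names, r):
        if "x1" not in subset:                    # track only the coefficient on x1
            continue
        X = np.column_stack([np.ones(n)] + [candidates[v] for v in subset])
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        b1 = b[1 + subset.index("x1")]            # position of x1 in this model
        print(f"regressors {subset}: b1 = {b1:.2f}")  # swings widely across models
```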

1. If the different regressors are highly correlated with one another, then there is a problem of collinearity or multicollinearity. This means that the parameters we estimate are very sensitive to the model specification we use and that we may get a high R² but insignificant coefficients (another indication of multicollinearity is that the R²s from the simple regressions do not sum to anything near the R² for the multiple regression).

2. We would much prefer to have robust coefficients, which are not sensitive to small changes in the model specification. Consider the following model:

Y_i = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3

As there are three variables, we have seven possible equations to estimate (the number of equations is 2^k - 1, where k is the number of regressors). In some cases, there may be one or more