Pham Thi Bich Ngoc, Ph.D. (University of Kiel, Germany)
FEC/Hoa Sen University
UNIVERSITY OF ECONOMICS HOCHIMINHCITY, June 2014
June 14 - Dr. Pham Thi Bich Ngoc
Endogeneity refers to the fact that an independent variable (IV) included in the model is a choice variable (not exogenous), so it may be correlated with the error term.
Sources of endogeneity:
◦ Omitted variable bias
◦ Sample selection bias / measurement error
◦ Simultaneity
Omitting a variable ($X_2$) creates a bias only if:
1. $X_2$ is an explanator of Y (so, when omitted, it becomes a component of the error term)
2. $X_2$ is correlated with $X_1$ (so that $X_2$ creates a correlation between $X_1$ and the error term).
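The two conditions can be checked with a small simulation. The sketch below is illustrative only: the coefficients (1.0 on X1, 2.0 on X2, 0.5 linking X2 to X1), the sample size, and the helper name `ols_slope` are all assumptions chosen for the example, not values from the lecture.

```python
import random

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (intercept included)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx

random.seed(0)
n = 20000
x1 = [random.gauss(0, 1) for _ in range(n)]

# Case A: X2 affects Y (condition 1) and is correlated with X1 (condition 2).
x2_corr = [0.5 * a + random.gauss(0, 1) for a in x1]
y_a = [1.0 * a + 2.0 * b + random.gauss(0, 1) for a, b in zip(x1, x2_corr)]

# Case B: X2 affects Y but is independent of X1 (condition 2 fails).
x2_indep = [random.gauss(0, 1) for _ in range(n)]
y_b = [1.0 * a + 2.0 * b + random.gauss(0, 1) for a, b in zip(x1, x2_indep)]

# In both regressions X2 is omitted; the true coefficient on X1 is 1.0.
slope_a = ols_slope(x1, y_a)  # pulled toward 1.0 + 2.0 * 0.5 = 2.0
slope_b = ols_slope(x1, y_b)  # stays near 1.0
print(f"X2 correlated with X1:  {slope_a:.2f}")
print(f"X2 independent of X1:   {slope_b:.2f}")
```

Only case A is biased: omitting X2 is harmless for the X1 coefficient when X2 is unrelated to X1, although it still adds noise to the error term.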
Sample selection bias may occur when the
study subjects are not a random sample of
the population either on the dependent
variable or on an independent variable.
Examples:
◦ Estimating a wage equation for women: wages are observed only for women who are employed.
Measurement error also induces a
correlation between our included
explanator and the error term.
Instead of observing $X_i$, we observe $M_i = X_i + v_i$
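A short simulation can show how such measurement error biases OLS. This is a sketch under assumed values: the true slope (2.0), unit variances for X and for the noise v, and the helper name `ols_slope` are hypothetical choices for illustration.

```python
import random

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (intercept included)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx

random.seed(1)
n = 20000
true_x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * a + random.gauss(0, 1) for a in true_x]  # true slope is 2.0

# We only observe M = X + v, where v is noise unrelated to X and to the error.
measured_m = [a + random.gauss(0, 1) for a in true_x]

slope_true = ols_slope(true_x, y)          # near 2.0
slope_measured = ols_slope(measured_m, y)  # shrunk toward zero
print(f"Using X: {slope_true:.2f}")
print(f"Using M: {slope_measured:.2f}")
```

With these variances the slope on M converges to 2.0 × var(X) / (var(X) + var(v)) = 1.0, half the true effect. The noise v ends up in the error term of the regression on M, and M contains v, which is the correlation described above.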
Sample selection bias:
Union workers may be less able than non-union workers, and would be earning less than non-union workers had they not joined a union.
In that case b (the estimated union wage effect) underestimates the wage gain of joining a union.
[Figure: Sample Selection Bias. Wages by non-union vs. union status, marking the observed earnings of union workers, the observed earnings of non-union workers, the earnings of non-union workers if they joined the union, the earnings of union workers had they not joined the union, the estimate b, and the true union effect.]
Suppose that the sample is selected such that only students with a college GPA higher than B are included.
[Figure: college GPA against SAT score, with the cutoff B marked, comparing the true relationship with the estimated relationship.]
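The effect of selecting on the dependent variable can be sketched numerically. Everything below is assumed for illustration: `sat` and `gpa` are standardized hypothetical scores, the true slope is 1.0, and the cutoff `cutoff_b` is set at 0.

```python
import random

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (intercept included)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx

random.seed(2)
n = 20000
sat = [random.gauss(0, 1) for _ in range(n)]
gpa = [1.0 * s + random.gauss(0, 1) for s in sat]  # true slope is 1.0

# Keep only students whose GPA (the dependent variable) exceeds the cutoff B.
cutoff_b = 0.0
kept = [(s, g) for s, g in zip(sat, gpa) if g > cutoff_b]
sat_kept = [s for s, _ in kept]
gpa_kept = [g for _, g in kept]

slope_full = ols_slope(sat, gpa)
slope_selected = ols_slope(sat_kept, gpa_kept)
print(f"Full sample:     {slope_full:.2f}")
print(f"GPA > B only:    {slope_selected:.2f}")
```

In the selected sample, low-SAT students appear only if they had an unusually large positive shock to GPA, so the error term is correlated with SAT and the estimated line is flatter than the true one.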
An independent variable included in the
model is a choice variable, potentially
affected by the DV.
Examples:
◦ IV = using a tutor or not; DV = grade
◦ IV = education; DV = income
◦ IV = union status; DV = wages
Both X and Y are jointly determined.
The process that generates Y also generates X at the same time.
Because X and Y are determined simultaneously, X can adjust in response to shocks to Y ($e$).
Thus X will be correlated with $e$.
The classic example of simultaneous causality in
economics is supply and demand.
Both prices and quantities adjust until supply and demand
are in equilibrium.
A shock to demand or supply causes BOTH prices and
quantities to move.
Thus, any attempt to estimate the relationship between
prices and quantities (say, to estimate a demand elasticity)
suffers from SIMULTANEITY BIAS.
Econometricians have a frequent interest in estimating
elasticities resulting from such an equilibrium process.
Simultaneity bias is a MAJOR problem.
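To make the supply and demand example concrete, here is a minimal simulation. The market below is hypothetical: a demand curve with slope -1, a supply curve with slope +1, and equally noisy shocks to each. The helper `ols_slope` is likewise an illustrative name.

```python
import random

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (intercept included)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx

random.seed(3)
n = 20000
price, quantity = [], []
for _ in range(n):
    demand_shock = random.gauss(0, 1)
    supply_shock = random.gauss(0, 1)
    # Demand: q = -1.0 * p + demand_shock
    # Supply: q = +1.0 * p + supply_shock
    # Equating the two gives the equilibrium price and quantity.
    p = (demand_shock - supply_shock) / 2.0
    q = -1.0 * p + demand_shock
    price.append(p)
    quantity.append(q)

ols_estimate = ols_slope(price, quantity)
print(f"OLS slope of quantity on price: {ols_estimate:.2f}")
```

In this setup the OLS slope comes out close to zero, which is neither the demand slope (-1) nor the supply slope (+1): the observed points are equilibria traced out by shifts in both curves, and price is correlated with both shocks.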
Consider a simple OLS regression:
◦ $Y_{it} = a_0 + a_1 X_{1it} + u_{it}$
Recall that our estimate of $a_1$ will be unbiased only if we can assume that $X_{1it}$ is uncorrelated with the error term ($u_{it}$).
We have discussed two ways to help ensure that this assumption is true:
◦ First, we should control for any observable variables that affect $Y_{it}$ and which are correlated with $X_{1it}$. For example, we should control for $X_{2it}$ if $X_{2it}$ affects $Y_{it}$ and $X_{2it}$ is correlated with $X_{1it}$:
◦ $Y_{it} = a_0 + a_1 X_{1it} + a_2 X_{2it} + u_{it}$
Second, if we have panel data, we can control for any unobservable firm-specific characteristics ($u_i$) that affect $Y_{it}$ and which are correlated with the X variables.
From Chapter 4:
◦ $Y_{it} = a_0 + a_1 X_{1it} + a_2 X_{2it} + u_i + e_{it}$
We control for the correlations between $u_i$ and the X variables by estimating fixed effects models.
Our estimates of $a_1$ and $a_2$ are unbiased if the X variables are uncorrelated with $e_{it}$. In this case, we say that the X variables are “exogenous”.
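The fixed effects idea can be sketched with the within transformation, which subtracts each firm's own mean from its observations so that the firm-specific term drops out. The panel below is simulated under assumed values (one regressor, true slope 1.0, 2000 firms, 5 years), and `ols_slope` and `demean_by_group` are hypothetical helper names.

```python
import random

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (intercept included)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx

def demean_by_group(values, groups):
    """Subtract each group's mean from its members (within transformation)."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]

random.seed(4)
n_firms, n_years = 2000, 5
firm, x, y = [], [], []
for i in range(n_firms):
    u_i = random.gauss(0, 1)              # unobserved firm effect
    for _ in range(n_years):
        x_it = u_i + random.gauss(0, 1)   # X is correlated with u_i
        y_it = 1.0 * x_it + u_i + random.gauss(0, 1)  # true slope is 1.0
        firm.append(i)
        x.append(x_it)
        y.append(y_it)

pooled = ols_slope(x, y)  # ignores u_i, so it is biased upward
within = ols_slope(demean_by_group(x, firm), demean_by_group(y, firm))
print(f"Pooled OLS:     {pooled:.2f}")
print(f"Fixed effects:  {within:.2f}")
```

The pooled slope converges to 1.5 in this design, while the within estimate is close to the true 1.0. The fix works here only because the omitted characteristic is constant over time within each firm.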
Unfortunately, multiple regression and fixed effects
models do not always ensure that the X variables
are uncorrelated with the error term:
◦ if we do not observe all the variables that affect Y and that
are correlated with X, multiple regression will not solve the
problem.
◦ if we do not have panel data, the fixed effects models
cannot be estimated.
◦ even if we have panel data, the Y and X variables may
display little variation over time in which case the fixed
effects models can be unreliable (Zhou, 2001).
◦ even if we have panel data and the Y and X variables display
sufficient variation over time, the unobservable variables
that are correlated with X may not be constant over time in
which case the fixed effects models will not solve the
problem.
A variable is more likely to be correlated with the
error term if it is “endogenous”
“Endogenous” means that the variable is
determined within the economic model that we are
trying to estimate.
For example, suppose that $Y_{2it}$ is an endogenous explanatory variable:
◦ $Y_{1it} = a_0 + a_1 Y_{2it} + a_2 X_{it} + u_{it}$   (1)
◦ $Y_{2it} = b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it}$   (2)
Equations (1) and (2) have a “triangular” structure since $Y_{2it}$ is assumed to affect $Y_{1it}$, but $Y_{1it}$ is assumed not to affect $Y_{2it}$.
Given this triangular structure, the OLS estimate of $a_1$ in equation (1) is unbiased only if $v_{it}$ is uncorrelated with $u_{it}$.
If $v_{it}$ is correlated with $u_{it}$, then $Y_{2it}$ is correlated with $u_{it}$, which means that the OLS estimate of $a_1$ would be biased.
To avoid this bias, we must estimate equation (1) using “instrumental variables” (IV) regression rather than OLS.
Equations (1) and (2) are called “structural” equations because they describe the economic relationship between $Y_{1it}$ and $Y_{2it}$.
We can obtain a “reduced-form” equation by substituting eq. (2) into eq. (1):
◦ $Y_{1it} = a_0 + a_1 (b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it}) + a_2 X_{it} + u_{it}$
◦ In this “reduced-form” equation, all the explanatory variables ($X_{it}$ and $Z_{it}$) are exogenous
The basic idea underlying IV regression is to remove $v_{it}$ from the $Y_{1it}$ model so that our estimate of $a_1$ is unbiased.
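Collecting terms in the reduced-form equation makes its coefficients explicit. The expansion below is a worked step using only the symbols of eqs. (1) and (2):

```latex
\begin{aligned}
Y_{1it} &= a_0 + a_1 (b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it}) + a_2 X_{it} + u_{it} \\
        &= (a_0 + a_1 b_0) + (a_1 b_1 + a_2) X_{it} + a_1 b_2 Z_{it} + (a_1 v_{it} + u_{it})
\end{aligned}
```

The coefficient on $Z_{it}$ is the product $a_1 b_2$, and the composite error is $a_1 v_{it} + u_{it}$.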
Note that $v_{it}$ is removed from the $Y_{1it}$ model if we use the predicted rather than the actual values of $Y_{2it}$ on the right hand side.
We predict $Y_{2it}$ using all the exogenous variables in the system (in our example, we use the two exogenous variables $X_{it}$ and $Z_{it}$).
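We then use the predicted rather than the actual values of $Y_{2it}$ when estimating the $Y_{1it}$ model:
◦ $Y_{1it} = a_0 + a_1 (\hat{b}_0 + \hat{b}_1 X_{it} + \hat{b}_2 Z_{it} + \hat{v}_{it}) + a_2 X_{it} + u_{it}$   (3)
◦ $Y_{1it} = a_0 + a_1 (\hat{b}_0 + \hat{b}_1 X_{it} + \hat{b}_2 Z_{it}) + a_2 X_{it} + u_{it}$   (4)
The $a_1$ estimate is biased in eq. (3) but it is unbiased in eq. (4) because the $v_{it}$ term has been removed.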
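The two-step procedure can be sketched by hand on simulated data. All numbers below are assumptions chosen for illustration (for example $a_1 = 2$, $b_2 = 1$, and $v$ built to be correlated with $u$), and `ols` is a hypothetical helper that solves the normal equations directly.

```python
import random

def ols(columns, y):
    """OLS coefficients of y on an intercept plus the given regressor columns.

    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    Returns [intercept, coef_1, coef_2, ...].
    """
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for p in range(k):
        pivot = aug[p][p]
        aug[p] = [v / pivot for v in aug[p]]
        for i in range(k):
            if i != p:
                factor = aug[i][p]
                aug[i] = [vi - factor * vp for vi, vp in zip(aug[i], aug[p])]
    return [aug[i][k] for i in range(k)]

random.seed(5)
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]
v = [0.8 * ui + random.gauss(0, 1) for ui in u]  # v correlated with u

# Eq. (2): Y2 = b0 + b1*X + b2*Z + v, with assumed b0=0.5, b1=1, b2=1
y2 = [0.5 + 1.0 * xi + 1.0 * zi + vi for xi, zi, vi in zip(x, z, v)]
# Eq. (1): Y1 = a0 + a1*Y2 + a2*X + u, with assumed a0=1, a1=2, a2=1
y1 = [1.0 + 2.0 * y2i + 1.0 * xi + ui for y2i, xi, ui in zip(y2, x, u)]

# OLS on eq. (1) with the actual Y2: biased because Y2 contains v.
a_ols = ols([y2, x], y1)

# Step 1: predict Y2 from all exogenous variables (X and Z).
b_hat = ols([x, z], y2)
y2_hat = [b_hat[0] + b_hat[1] * xi + b_hat[2] * zi for xi, zi in zip(x, z)]

# Step 2: replace Y2 with its prediction in the Y1 model.
a_2sls = ols([y2_hat, x], y1)

print(f"a1 from OLS:           {a_ols[1]:.2f}")
print(f"a1 from two-step 2SLS: {a_2sls[1]:.2f}")
```

The two-step coefficient is close to the assumed $a_1 = 2$, while plain OLS overstates it. The manual second stage is useful for intuition only: its standard errors would be wrong, which is one reason to rely on a dedicated IV routine in practice.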
In eq. (4) the estimated coefficient for the $Z_{it}$ variable is $a_1 \hat{b}_2$.
We already know the value of $\hat{b}_2$ from eq. (2).
Therefore, $a_1 = (a_1 \hat{b}_2) / \hat{b}_2$.
It is important to note that the $a_1$ coefficient can be estimated only if there is at least one exogenous variable in the structural model for $Y_{2it}$ that is excluded from the structural model for $Y_{1it}$.
◦ This is the $Z_{it}$ variable in eq. (2)
In eq. (4) the $a_1$ coefficient is “just” identified because there is only one exogenous variable ($Z_{it}$) that is in the $Y_{2it}$ model and that is excluded from the $Y_{1it}$ model.
Suppose we had included $Z_{it}$ in both models, with coefficient $a_3$ in the $Y_{1it}$ model.
In this case, the $a_1$ coefficient cannot be identified because we estimate only $(a_3 + a_1 b_2)$ and $b_2$.
◦ In other words, we cannot determine whether the effect of $Z_{it}$ on $Y_{1it}$ is a main effect ($a_3$) or an indirect effect through $Y_{2it}$ ($a_1 b_2$)
Here we say that the system of equations is “under-identified”.
Suppose we had included two exogenous variables in the $Y_{2it}$ model, with coefficients $b_2$ and $b_3$, and we excluded both these variables from the $Y_{1it}$ model.
Now we have estimates of $a_1 b_2$, $a_1 b_3$, $b_2$, and $b_3$.
Therefore $a_1$ can be obtained in two ways: $a_1 = (a_1 b_2) / b_2$ and $a_1 = (a_1 b_3) / b_3$.
Here we say that the system of equations is “over-identified”.
In this example, the system is “triangular” because there are two equations and one endogenous right-hand side variable.
When the models have a triangular structure,
the models can be estimated using the
ivregress command
◦ The models can be estimated using 2SLS or GMM
◦ 2SLS is more commonly used in practice
STATA
◦ xtivreg2 depvar1 [varlist1] (depvar2 = varlist_iv)
depvar1 is the dependent variable for the model which has an endogenous regressor
varlist1 are the exogenous variables in the model that has the endogenous regressor
depvar2 is the endogenous regressor
varlist_iv are the exogenous variables that are believed to affect the endogenous regressor
We should test whether:
◦ our chosen instruments are exogenous (i.e., they
should be uncorrelated with the error term) and
◦ it is valid to exclude some of them from the model
that has the endogenous regressor.
If they are not exogenous or they should not
be excluded, they are not valid instruments.
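In Stata, after ivregress:
◦ estat overid tests the overidentifying restrictions (available only when the model is over-identified)
◦ estat endogenous tests whether the regressor treated as endogenous could in fact be treated as exogenous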