ECONOMETRICS
Bruce E. Hansen
© 2000, 2007
University of Wisconsin
www.ssc.wisc.edu/~bhansen
This Revision: January 18, 2007
Comments Welcome
This manuscript may be printed and reproduced for individual or instructional use, but may not be printed for
commercial purposes.
Contents
1 Introduction
1.1 Economic Data
1.2 Observational Data
1.3 Economic Data
2 Regression and Projection
2.1 Variables
2.2 Conditional Density and Mean
2.3 Regression Equation
2.4 Conditional Variance
2.5 Linear Regression
2.6 Best Linear Predictor
2.7 Technical Proofs
2.8 Exercises
3 Least Squares Estimation
3.1 Random Sample
3.2 Estimation
3.3 Least Squares
3.4 Normal Regression Model
3.5 Model in Matrix Notation
3.6 Projection Matrices
3.7 Residual Regression
3.8 Bias and Variance
3.9 Gauss-Markov Theorem
3.10 Semiparametric Efficiency
3.11 Multicollinearity
3.12 Influential Observations
3.13 Technical Proofs
3.14 Exercises
4 Inference
4.1 Sampling Distribution
4.2 Consistency
4.3 Asymptotic Normality
4.4 Covariance Matrix Estimation
4.5 Alternative Covariance Matrix Estimators
4.6 Functions of Parameters
4.7 t tests
4.8 Confidence Intervals
4.9 Wald Tests
4.10 F Tests
4.11 Normal Regression Model
4.12 Problems with Tests of NonLinear Hypotheses
4.13 Monte Carlo Simulation
4.14 Estimating a Wage Equation
4.15 Technical Proofs
4.16 Exercises
5 Additional Regression Topics
5.1 Generalized Least Squares
5.2 Testing for Heteroskedasticity
5.3 Forecast Intervals
5.4 NonLinear Least Squares
5.5 Least Absolute Deviations
5.6 Quantile Regression
5.7 Testing for Omitted NonLinearity
5.8 Omitted Variables
5.9 Irrelevant Variables
5.10 Model Selection
5.11 Technical Proofs
5.12 Exercises
6 The Bootstrap
6.1 Definition of the Bootstrap
6.2 The Empirical Distribution Function
6.3 Nonparametric Bootstrap
6.4 Bootstrap Estimation of Bias and Variance
6.5 Percentile Intervals
6.6 Percentile-t Equal-Tailed Interval
6.7 Symmetric Percentile-t Intervals
6.8 Asymptotic Expansions
6.9 One-Sided Tests
6.10 Symmetric Two-Sided Tests
6.11 Percentile Confidence Intervals
6.12 Bootstrap Methods for Regression Models
6.13 Exercises
7 Generalized Method of Moments
7.1 Overidentified Linear Model
7.2 GMM Estimator
7.3 Distribution of GMM Estimator
7.4 Estimation of the Efficient Weight Matrix
7.5 GMM: The General Case
7.6 Over-Identification Test
7.7 Hypothesis Testing: The Distance Statistic
7.8 Conditional Moment Restrictions
7.9 Bootstrap GMM Inference
7.10 Exercises
8 Empirical Likelihood
8.1 Non-Parametric Likelihood
8.2 Asymptotic Distribution of EL Estimator
8.3 Overidentifying Restrictions
8.4 Testing
8.5 Numerical Computation
8.6 Technical Proofs
9 Endogeneity
9.1 Instrumental Variables
9.2 Reduced Form
9.3 Identification
9.4 Estimation
9.5 Special Cases: IV and 2SLS
9.6 Bekker Asymptotics
9.7 Identification Failure
9.8 Exercises
10 Univariate Time Series
10.1 Stationarity and Ergodicity
10.2 Autoregressions
10.3 Stationarity of AR(1) Process
10.4 Lag Operator
10.5 Stationarity of AR(k)
10.6 Estimation
10.7 Asymptotic Distribution
10.8 Bootstrap for Autoregressions
10.9 Trend Stationarity
10.10 Testing for Omitted Serial Correlation
10.11 Model Selection
10.12 Autoregressive Unit Roots
10.13 Technical Proofs
11 Multivariate Time Series
11.1 Vector Autoregressions (VARs)
11.2 Estimation
11.3 Restricted VARs
11.4 Single Equation from a VAR
11.5 Testing for Omitted Serial Correlation
11.6 Selection of Lag Length in a VAR
11.7 Granger Causality
11.8 Cointegration
11.9 Cointegrated VARs
12 Limited Dependent Variables
12.1 Binary Choice
12.2 Count Data
12.3 Censored Data
12.4 Sample Selection
13 Panel Data
13.1 Individual-Effects Model
13.2 Fixed Effects
13.3 Dynamic Panel Regression
14 Nonparametrics
14.1 Kernel Density Estimation
14.2 Asymptotic MSE for Kernel Estimates
A Matrix Algebra
A.1 Notation
A.2 Matrix Addition
A.3 Matrix Multiplication
A.4 Trace
A.5 Rank and Inverse
A.6 Determinant
A.7 Eigenvalues
A.8 Positive Definiteness
A.9 Matrix Calculus
A.10 Kronecker Products and the Vec Operator
A.11 Vector and Matrix Norms
B Probability
B.1 Foundations
B.2 Random Variables
B.3 Expectation
B.4 Common Distributions
B.5 Multivariate Random Variables
B.6 Conditional Distributions and Expectation
B.7 Transformations
B.8 Normal and Related Distributions
C Asymptotic Theory
C.1 Inequalities
C.2 Weak Law of Large Numbers
C.3 Convergence in Distribution
C.4 Asymptotic Transformations
D Maximum Likelihood
E Numerical Optimization
E.1 Grid Search
E.2 Gradient Methods
E.3 Derivative-Free Methods
Chapter 1
Introduction
Econometrics is the study of estimation and inference for economic models using economic data.
Econometric theory concerns the study and development of tools and methods for applied econometrics.
Applied econometrics concerns the application of these tools to economic data.
1.1 Economic Data
An econometric study requires data for analysis. The quality of the study will be largely determined
by the data available. There are three major types of economic data sets: cross-sectional, time-series, and
panel. They are distinguished by the dependence structure across observations.
Cross-sectional data sets are characterized by mutually independent observations. Surveys are a typical
source for cross-sectional data. The individuals surveyed may be persons, households, or corporations.
Time-series data is indexed by time. Typical examples include macroeconomic aggregates, prices and
interest rates. This type of data is characterized by serial dependence.
Panel data combines elements of cross-section and time-series. These data sets consist of surveys of a set of
individuals, repeated over time. Each individual (person, household or corporation) is surveyed on multiple
occasions.
1.2 Observational Data
A common econometric question is to quantify the impact of one set of variables on another variable.
For example, a concern in labor economics is the returns to schooling: the change in earnings induced by
increasing a worker's education, holding other variables constant. Another issue of interest is the earnings
gap between men and women.
Ideally, we would use experimental data to answer these questions. To measure the returns to schooling,
an experiment might randomly divide children into groups, mandate different levels of education to the
different groups, and then follow the children's wage path as they mature and enter the labor force. The
differences between the groups could be attributed to the different levels of education. However, experiments
such as this are infeasible, even immoral!
Instead, most economic data is observational. To continue the above example, what we observe (through
data collection) is the level of a person's education and their wage. We can measure the joint distribution
of these variables, and assess the joint dependence. But we cannot infer causality, as we are not able to
manipulate one variable to see the direct effect on the other. For example, a person's level of education is
(at least partially) determined by that person's choices and their achievement in education. These factors
are likely to be affected by their personal abilities and attitudes towards work. The fact that a person is
highly educated suggests a high level of ability. This is an alternative explanation for an observed positive
correlation between educational levels and wages. High ability individuals do better in school, and therefore
choose to attain higher levels of education, and their high ability is the fundamental reason for their high
wages. The point is that multiple explanations are consistent with a positive correlation between schooling
levels and wages. Knowledge of the joint distribution cannot distinguish between these explanations.
This discussion means that causality cannot be inferred from observational data alone. Causal inference
requires identification, and this is based on strong assumptions. We will return to a discussion of some of
these issues in Chapter 9.
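A small simulation can make the confounding argument concrete. The sketch below is not from the text: the
data-generating process (an unobserved `ability` that raises both schooling and log wages), every coefficient
value, and all function names are hypothetical assumptions chosen only for illustration.

```python
import random

def simulate(n=50_000, true_return=0.05, seed=0):
    """Draw (educ, log_wage) pairs in which unobserved ability drives both.

    Hypothetical data-generating process, for illustration only.
    """
    rng = random.Random(seed)
    educ, log_wage = [], []
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)                      # never seen by the analyst
        s = 12.0 + 2.0 * ability + rng.gauss(0.0, 1.0)     # schooling rises with ability
        w = 2.0 + true_return * s + 0.20 * ability + rng.gauss(0.0, 0.3)
        educ.append(s)
        log_wage.append(w)
    return educ, log_wage

def slope(x, y):
    """Slope of the best linear fit of y on x: cov(x, y) / var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var = sum((a - mx) ** 2 for a in x) / n
    return cov / var

educ, log_wage = simulate()
observed = slope(educ, log_wage)
print("causal return built into the simulation: 0.050")
print(f"slope recovered from observational data: {observed:.3f}")
```

Under these assumed parameters the population slope is cov/var = 0.65/5 = 0.13, more than double the causal
value of 0.05 built into the simulation. The joint distribution of schooling and wages alone gives no way to
tell the two apart.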
1.3 Economic Data
Fortunately for economists, the development of the internet has provided a convenient forum for
dissemination of economic data. Many large-scale economic datasets are available without charge from
governmental agencies. An excellent starting point is the Resources for Economists Data Links.
Some other excellent data sources are listed below.
Bureau of Labor Statistics
Federal Reserve Bank of St. Louis
Board of Governors of the Federal Reserve System
National Bureau of Economic Research
US Census
Current Population Survey (CPS)
Survey of Income and Program Participation (SIPP)
Panel Study of Income Dynamics (PSID)
U.S. Bureau of Economic Analysis
CompuStat
International Financial Statistics (IFS)
Chapter 2
Regression and Projection
2.1 Variables
The most commonly applied econometric tool is regression. This is used when the goal is to quantify
the impact of one set of variables (the regressors, conditioning variables, or covariates) on another
variable (the dependent variable). We let $y$ denote the dependent variable and $(x_1, x_2, \ldots, x_k)$
denote the $k$ regressors. It is convenient to write the set of regressors as a vector in $\mathbb{R}^k$:
$$
x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_k \end{pmatrix}. \tag{2.1}
$$
Following mathematical convention, real numbers (elements of the real line $\mathbb{R}$) are written using lower
case italics such as y, and vectors (elements of $\mathbb{R}^k$) by lower case bold italics such as x. Upper case bold
italics such as X will be used for matrices.
The random variables $(y, x)$ have a distribution $F$ which we call the population. This "population"
is infinitely large. This abstraction can be a source of confusion as it does not correspond to a physical
population in the real world. The distribution $F$ is unknown, and the goal of statistical inference is to learn
about features of $F$ from the sample.
At this point in our analysis it is unimportant whether the observations y and x come from continuous
or discrete distributions. For example, many regressors in econometric practice are binary, taking on only
the values 0 and 1, and are typically called dummy variables.
2.2 Conditional Density and Mean
To study how the distribution of $y$ varies with the variables $x$ in the population, we start with
$f(y \mid x)$, the conditional density of $y$ given $x$.
To illustrate, Figure 2.1 displays the density¹ of hourly wages for men and women, from the population
of white non-military wage earners with a college degree and 10-15 years of potential work experience. These
are conditional density functions: the density of hourly wages conditional on race, gender, education and
experience. The two density curves show the effect of gender on the distribution of wages, holding the other
variables constant.
While it is easy to observe that the two densities are unequal, it is useful to have numerical measures of
the difference. An important summary measure is the conditional mean
$$
m(x) = E(y \mid x) = \int_{-\infty}^{\infty} y f(y \mid x)\, dy. \tag{2.2}
$$
In general, $m(x)$ can take any form, and exists so long as $E|y| < \infty$. In the example presented in Figure
2.1, the mean wage for men is $27.22, and that for women is $20.73. These are indicated in Figure 2.1 by
the arrows drawn to the x-axis.
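When the conditioning variable is discrete, as gender is here, the sample analogue of $m(x)$ is simply the
average of $y$ within each cell of $x$. The following is a minimal sketch under stated assumptions: the data are
simulated from a hypothetical lognormal design rather than taken from the Current Population Survey, and the
function name `conditional_means` is illustrative.

```python
import random
from collections import defaultdict

def conditional_means(xs, ys):
    """Estimate m(x) = E(y | x) for discrete x by averaging y within each x cell."""
    total = defaultdict(float)
    count = defaultdict(int)
    for x, y in zip(xs, ys):
        total[x] += y
        count[x] += 1
    return {x: total[x] / count[x] for x in total}

# Hypothetical sample: x is a 0/1 dummy variable, y is a right-skewed "wage"
rng = random.Random(1)
xs = [rng.randint(0, 1) for _ in range(20_000)]
ys = [rng.lognormvariate(3.0 + 0.3 * x, 0.5) for x in xs]

m_hat = conditional_means(xs, ys)
for x in sorted(m_hat):
    print(f"estimated E(y | x = {x}) = {m_hat[x]:.2f}")
```

In this assumed design the population conditional means are $\exp(3.125)$ and $\exp(3.425)$, so the two cell
averages should settle near those values as the sample grows.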
¹ These are nonparametric density estimates using a Gaussian kernel with the bandwidth selected by cross-validation. See
Chapter 14. The data are from the 2004 Current Population Survey.
Figure 2.1: Wage Densities for White College Grads with 10-15 Years Work Experience
Take a closer look at the density functions displayed in Figure 2.1. You can see that the right tail of
the density is much thicker than the left tail. These are asymmetric (skewed) densities, which is a common
feature of wage distributions. When a distribution is skewed, the mean is not necessarily a good summary
of the central tendency. In this context it is often convenient to transform the data by taking the (natural)
logarithm. Figure 2.2 shows the density of log hourly wages for the same population, with mean log hourly
wages drawn in with the arrows. The difference in the log mean wage between men and women is 0.30, which
implies a 30% average wage difference for this population. This is a more robust measure of the typical wage
gap between men and women than the difference in the untransformed wage means. For this reason, wage
regressions typically use log wages as a dependent variable rather than the level of wages.
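Reading a difference in logs as a percentage difference relies on the approximation $\log(1 + r) \approx r$,
which is tight for small gaps and looser for large ones. The short sketch below compares the approximation
with the exact conversion $100(e^{d} - 1)$ for a log gap $d$. The helper name `log_gap_to_percent` is
hypothetical, and the exact conversion describes the ratio of geometric means, which is an interpretive
assumption added here rather than a statement from the text.

```python
import math

def log_gap_to_percent(log_gap):
    """Exact percentage gap implied by a difference in logs: 100 * (exp(d) - 1)."""
    return 100.0 * math.expm1(log_gap)

for d in (0.05, 0.30):
    approx = 100 * d
    exact = log_gap_to_percent(d)
    print(f"log gap {d:.2f}: approximate {approx:.0f}%, exact {exact:.1f}%")
```

For a log gap of 0.30 the exact figure is about 35%, so the 30% reading is best treated as a convenient
approximation.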
The comparison in Figure 2.1 is facilitated by the fact that the control variable (gender) is discrete.
When the distribution of the control variable is continuous, then comparisons become more complicated. To
illustrate, Figure 2.3 displays a scatter plot² of log wages against education levels. Assuming for simplicity
that this is the true joint distribution, the solid line displays the conditional expectation of log wages varying
with education. The conditional expectation function is close to linear; the dashed line is a linear projection
approximation which will be discussed in Section 2.6. The main point to be learned from Figure 2.3 is
that the conditional expectation describes the central tendency of the conditional distribution. Of particular
interest to graduate students may be the observation that the difference between a B.A. and a Ph.D. degree in
mean log hourly wages is 0.36, implying an average 36% difference in wage levels.
2.3 Regression Equation
The regression error $e$ is defined to be the difference between $y$ and its conditional mean (2.2) evaluated
at the observed value of $x$:
$$
e = y - m(x).
$$
By construction, this yields the formula
$$
y = m(x) + e. \tag{2.3}
$$
Theorem 2.3.1 Properties of the regression error $e$
1. $E(e \mid x) = 0$.
2. $E(e) = 0$.
3. $E(h(x)e) = 0$ for any function $h(\cdot)$.
4. $E(xe) = 0$.
² White non-military male wage earners with 10-15 years of potential work experience.
Figure 2.2: Log Wage Densities
To show the first statement, by the definition of $e$ and the linearity of conditional expectations,
$$
\begin{aligned}
E(e \mid x) &= E((y - m(x)) \mid x) \\
&= E(y \mid x) - E(m(x) \mid x) \\
&= m(x) - m(x) \\
&= 0.
\end{aligned}
$$
The remaining parts of the Theorem are left as an exercise.
The equations
$$
\begin{aligned}
y &= m(x) + e \\
E(e \mid x) &= 0
\end{aligned}
$$
are often stated jointly as the regression framework. It is important to understand that this is a framework,
not a model, because no restrictions have been placed on the joint distribution of the data. These equations
hold true by definition. A regression model imposes further restrictions on the joint distribution; most
typically, restrictions on the permissible class of regression functions $m(x)$.
The conditional mean also has the property of being the best predictor of $y$, in the sense of achieving
the lowest mean squared error. To see this, let $g(x)$ be an arbitrary predictor of $y$ given $x$. The expected
squared error using this prediction function is
$$
\begin{aligned}
E(y - g(x))^2 &= E(e + m(x) - g(x))^2 \\
&= Ee^2 + 2E(e(m(x) - g(x))) + E(m(x) - g(x))^2 \\
&= Ee^2 + E(m(x) - g(x))^2 \\
&\ge Ee^2
\end{aligned}
$$
where the cross term vanishes by Theorem 2.3.1.3 applied with $h(x) = m(x) - g(x)$. The right-hand side of
the last equality is minimized by setting $g(x) = m(x)$.
Thus the mean squared error is minimized by the conditional mean.
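The inequality above can be checked numerically. The sketch below assumes a hypothetical design in which
$m(x) = x^2$ with $x$ uniform on $[-1, 1]$ and an error independent of $x$, and compares the sample mean squared
error of the conditional mean with that of a few arbitrary alternative predictors. None of these choices
comes from the text.

```python
import random

rng = random.Random(2)
n = 40_000

def m(x):
    """True conditional mean in this hypothetical design: E(y | x) = x**2."""
    return x * x

xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
ys = [m(x) + rng.gauss(0.0, 0.5) for x in xs]   # y = m(x) + e with E(e | x) = 0

def mse(g):
    """Sample analogue of E(y - g(x))^2 for a predictor g."""
    return sum((y - g(x)) ** 2 for x, y in zip(xs, ys)) / n

predictors = {
    "m(x) = x^2": m,
    "g(x) = 1/3 (constant)": lambda x: 1.0 / 3.0,
    "g(x) = |x|": abs,
    "g(x) = x^2 + 0.1": lambda x: x * x + 0.1,
}
results = {name: mse(g) for name, g in predictors.items()}
for name, value in results.items():
    print(f"{name:24s} MSE = {value:.4f}")
```

In this design $Ee^2 = 0.25$, and each alternative adds its own $E(m(x) - g(x))^2$ on top of that floor, which
is the decomposition derived above.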
Figure 2.3: Conditional Mean of Wages Given Education
2.4 Conditional Variance
While the conditional mean is a good measure of the location of a conditional distribution, it does
not provide information about the spread of the distribution. A common measure of the dispersion is the
conditional variance
$$
\sigma^2(x) = \mathrm{var}(y \mid x) = E(e^2 \mid x).
$$
Generally, $\sigma^2(x)$ is a non-trivial function of $x$, and can take any form, subject to the restriction that it is
non-negative. The conditional standard deviation is its square root $\sigma(x) = \sqrt{\sigma^2(x)}$.
In the special case where $\sigma^2(x)$ is a constant and independent of $x$ so that
$$
E(e^2 \mid x) = \sigma^2 \tag{2.4}
$$
we say that the error $e$ is homoskedastic. In the general case where $\sigma^2(x)$ depends on $x$ we say that the
error $e$ is heteroskedastic.
Some textbooks inappropriately describe heteroskedasticity as the case where "the variance of $e$ varies
across observations". This concept is less helpful than defining heteroskedasticity as the dependence of the
conditional variance on the observables $x$.
As an example, take the conditional wage densities displayed in Figure 2.1. The conditional standard
deviation for men is 12.1 and that for women is 10.5. So while men have higher average wages, they are also
somewhat more dispersed.
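For a discrete regressor, the conditional variance can be estimated cell by cell in the same way as the
conditional mean. The sketch below assumes a hypothetical design in which $\sigma(x) = 1$ when $x = 0$ and
$\sigma(x) = 2$ when $x = 1$, so the error is heteroskedastic by construction. The data and the function name
`conditional_moments` are illustrative and are not the wage data behind Figure 2.1.

```python
import random
from collections import defaultdict

def conditional_moments(xs, ys):
    """For discrete x, return {x: (mean of y, variance of y)} computed within each cell."""
    cells = defaultdict(list)
    for x, y in zip(xs, ys):
        cells[x].append(y)
    out = {}
    for x, vals in cells.items():
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)   # sample analogue of E(e^2 | x)
        out[x] = (mean, var)
    return out

rng = random.Random(3)
xs = [rng.randint(0, 1) for _ in range(20_000)]
# Hypothetical design: the error standard deviation is 1 + x, so it depends on x
ys = [5.0 + 2.0 * x + rng.gauss(0.0, 1.0 + x) for x in xs]

moments = conditional_moments(xs, ys)
for x in sorted(moments):
    mean, var = moments[x]
    print(f"x = {x}: m(x) ~ {mean:.2f}, sigma^2(x) ~ {var:.2f}, sigma(x) ~ {var ** 0.5:.2f}")
```

Under homoskedasticity the two estimated variances would agree up to sampling error. Here they differ by
roughly a factor of four because the assumed $\sigma^2(x)$ depends on $x$.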
2.5 Linear Regression
An important special case of (2.3) is when the conditional mean function $m(x)$ is linear in $x$ (or linear
in functions of $x$). Notationally, it is convenient to augment the regressor vector $x$ by listing the number "1"
as an element. We call this the "constant" or "intercept". Equivalently, we assume that $x_1 = 1$, where $x_1$ is
the first element of the vector $x$ defined in (2.1). Thus (2.1) has been redefined as the $k \times 1$ vector
$$
x = \begin{pmatrix} 1 \\ x_2 \\ \vdots \\ x_k \end{pmatrix}. \tag{2.5}
$$
When $m(x)$ is linear in $x$, we can write it as
$$
m(x) = x'\beta = \beta_1 + x_2\beta_2 + \cdots + x_k\beta_k \tag{2.6}
$$
where
$$
\beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} \tag{2.7}
$$
is a $k \times 1$ parameter vector.
In this case (2.3) can be written as
$$
y = x'\beta + e \tag{2.8}
$$
$$
E(e \mid x) = 0. \tag{2.9}
$$
Equation (2.8) is called the linear regression model.
An important special case is the homoskedastic linear regression model
$$
\begin{aligned}
y &= x'\beta + e \\
E(e \mid x) &= 0 \\
E(e^2 \mid x) &= \sigma^2.
\end{aligned}
$$
2.6 Best Linear Predictor
While the conditional mean $m(x) = E(y \mid x)$ is the best predictor of $y$ among all functions of $x$, its
functional form is typically unknown, and the linear assumption of the previous section is empirically unlikely
to be accurate. Instead, it is more realistic to view the linear specification (2.6) as an approximation. We
derive an appropriate approximation in this section.
In the linear projection model the coefficient $\beta$ is defined so that the function $x'\beta$ is the best linear
predictor of $y$. As before, by "best" we mean the predictor function with lowest mean squared error. For
any $\beta \in \mathbb{R}^k$ a linear predictor for $y$ is $x'\beta$ with expected squared prediction error
$$
S(\beta) = E(y - x'\beta)^2 = Ey^2 - 2\beta' E(xy) + \beta' E(xx')\beta
$$
which is quadratic in $\beta$. The best linear predictor is obtained by selecting $\beta$ to minimize $S(\beta)$. The
first-order condition for minimization (from Appendix A.9) is
$$
0 = \frac{\partial}{\partial \beta} S(\beta) = -2E(xy) + 2E(xx')\beta.
$$
Solving for $\beta$ we find
$$
\beta = (E(xx'))^{-1} E(xy). \tag{2.10}
$$
It is worth taking the time to understand the notation involved in this expression. $E(xx')$ is a $k \times k$ matrix
and $E(xy)$ is a $k \times 1$ column vector. Therefore, alternative expressions such as
$$
\frac{E(xy)}{E(xx')} \quad \text{or} \quad E(xy)(E(xx'))^{-1}
$$
are incoherent and incorrect. Appendix A provides a comprehensive review of matrix notation and operations.
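The dimensions in (2.10) can be traced in a small computation. The sketch below is an illustration under
stated assumptions: it takes $k = 2$ with $x = (1, x_2)'$, replaces the population expectations with sample
averages from a hypothetical simulated design, and uses plain Python lists so that the $k \times k$ and
$k \times 1$ shapes stay visible. The helper names are hypothetical.

```python
import random

rng = random.Random(4)
n = 50_000
x2 = [rng.gauss(1.0, 2.0) for _ in range(n)]
y = [1.0 + 0.5 * v + rng.gauss(0.0, 1.0) for v in x2]   # hypothetical design: beta = (1.0, 0.5)

def mean(values):
    return sum(values) / len(values)

# Q = E(xx') is k x k (here 2 x 2) for x = (1, x2)'
Q = [[1.0, mean(x2)],
     [mean(x2), mean([v * v for v in x2])]]
# E(xy) is k x 1 (here 2 x 1)
Exy = [mean(y), mean([v * w for v, w in zip(x2, y)])]

def inverse_2x2(a):
    """Inverse of a 2 x 2 matrix; fails with ZeroDivisionError if the matrix is singular."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

def matvec(a, v):
    """(k x k) times (k x 1): the conformable ordering used in (2.10)."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

beta = matvec(inverse_2x2(Q), Exy)
print(f"beta_1 (intercept) = {beta[0]:.3f}, beta_2 (slope) = {beta[1]:.3f}")
```

The product is formed as inverse times vector. Reversing the order would pair a $k \times 1$ vector with a
$k \times k$ matrix, which is not conformable for $k > 1$.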
The vector (2.10) exists and is unique as long as the $k \times k$ matrix $Q = E(xx')$ is invertible. The matrix
$Q$ plays an important role in least-squares theory so we will discuss some of its properties in detail. Observe
that for any non-zero $\alpha \in \mathbb{R}^k$,
$$
\alpha' Q \alpha = E(\alpha' x x' \alpha) = E(\alpha' x)^2 \ge 0
$$
so $Q$ is by construction positive semi-definite. It is invertible if and only if it is positive definite, which
requires that for all non-zero $\alpha$, $E(\alpha' x)^2 > 0$. Equivalently, there cannot exist a non-zero vector
$\alpha$ such that $\alpha' x = 0$ identically. Such a vector exists when redundant variables are included in $x$.
In order for $\beta$ to be uniquely defined, this situation must be excluded.
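A familiar instance of redundancy is an intercept combined with dummy variables for every category. The
sketch below uses a tiny hypothetical sample with two categories, so that $x_1 = x_2 + x_3$ for every
observation, and exact rational arithmetic so that the zeros are not blurred by floating-point rounding. The
sample and helper names are illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical regressors: intercept plus dummies for both of two categories,
# so that x1 = x2 + x3 holds for every observation.
sample = [(1, 1, 0), (1, 0, 1), (1, 1, 0), (1, 0, 1), (1, 1, 0)]
n, k = len(sample), 3

# Sample analogue of Q = E(xx'), kept exact with rational arithmetic
Q = [[Fraction(sum(x[i] * x[j] for x in sample), n) for j in range(k)]
     for i in range(k)]

def quad_form(alpha, q):
    """alpha' Q alpha, which equals the sample average of (alpha'x)^2."""
    return sum(alpha[i] * q[i][j] * alpha[j] for i in range(k) for j in range(k))

def det3(q):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (q[0][0] * (q[1][1] * q[2][2] - q[1][2] * q[2][1])
            - q[0][1] * (q[1][0] * q[2][2] - q[1][2] * q[2][0])
            + q[0][2] * (q[1][0] * q[2][1] - q[1][1] * q[2][0]))

alpha = (1, -1, -1)   # alpha'x = x1 - x2 - x3 = 0 for every observation
print("alpha' Q alpha =", quad_form(alpha, Q))
print("det(Q)         =", det3(Q))
```

Both quantities are exactly zero, so this $Q$ is positive semi-definite but not positive definite, and (2.10)
cannot be evaluated. Dropping one of the three regressors removes the redundancy.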
Given the definition of $\beta$ in (2.10), $x'\beta$ is the best linear predictor for $y$. The error is
$$
e = y - x'\beta. \tag{2.11}
$$
Notice that the error $e$ from the linear prediction equation is equal to the error from the regression equation
when (and only when) the conditional mean is linear in $x$; otherwise they are distinct.
Rewriting, we obtain a decomposition of $y$ into linear predictor and error
$$
y = x'\beta + e. \tag{2.12}
$$
This completes the derivation of the model. We call $x'\beta$ alternatively the best linear predictor of $y$ given
$x$, or the linear projection of $y$ onto $x$. In general we will call equation (2.12) the linear projection model.
We now summarize the assumptions necessary for its derivation and list the implications in Theorem
2.6.1.
Assumption 2.6.1
1. $x$ contains an intercept;
2. $Ey^2 < \infty$;
3. $Ex_j^2 < \infty$ for $j = 1, \ldots, k$;
4. $Q = E(xx')$ is invertible.
Theorem 2.6.1 Under Assumption 2.6.1, (2.10) and (2.11) are well defined. Furthermore,
$$Ee^2 < \infty \tag{2.13}$$
$$E(xe) = 0 \tag{2.14}$$
and
$$E(e) = 0. \tag{2.15}$$
A complete proof of Theorem 2.6.1 is presented in Section 2.7.
The two equations (2.12) and (2.14) summarize the linear projection model. Let's compare it with the linear regression model (2.8)-(2.9). Since from Theorem 2.3.1.4 we know that the regression error has the property $E(xe) = 0$, it follows that linear regression is a special case of the projection model. However, the converse is not true as the projection error does not necessarily satisfy $E(e \mid x) = 0$. For example, suppose that for $x \in R$ we have $Ex = 0$, $Ex^3 = 0$, and $e = x^2 - Ex^2$. Then $E(xe) = Ex^3 - Ex\,Ex^2 = 0$ yet $E(e \mid x) = x^2 - Ex^2 \neq 0$.
It is useful to note that the facts $E(xe) = 0$ and $E(e) = 0$ together mean that the variables $x$ and $e$ are uncorrelated.
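The counterexample above can be checked exactly on a small discrete distribution. The sketch below assumes, purely for illustration, that $x$ is uniform on $\{-2,-1,0,1,2\}$ (any distribution with $Ex = 0$ and $Ex^3 = 0$ would do) and uses exact rational arithmetic.

```python
from fractions import Fraction

# Hypothetical discrete distribution: x is uniform on {-2, -1, 0, 1, 2},
# so Ex = 0 and Ex^3 = 0, as the example requires.
support = [-2, -1, 0, 1, 2]
prob = Fraction(1, len(support))

def expect(f):
    """Expectation of f(x) under the uniform distribution on `support`."""
    return sum(prob * f(x) for x in support)

Ex2 = expect(lambda x: x**2)             # second moment of x

def e(x):
    return x**2 - Ex2                    # e = x^2 - E x^2

E_e = expect(e)                          # E(e)
E_xe = expect(lambda x: x * e(x))        # E(xe)
cond_mean = {x: e(x) for x in support}   # E(e | x) = e(x): e is a function of x

print(E_e, E_xe, cond_mean)
```

Both unconditional moments are exactly zero, while the conditional mean $E(e \mid x)$ varies with $x$, so $x$ and $e$ are uncorrelated but $e$ is not mean-independent of $x$.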
The conditions listed in Assumption 2.6.1 are weak. The finite second moment Assumptions 2.6.1.2 and 2.6.1.3 are called regularity conditions. Assumption 2.6.1.4 is required to ensure that $\beta$ is uniquely defined. Assumption 2.6.1.1 is employed to guarantee that (2.15) holds.
We have shown that under mild regularity conditions for any pair $(y, x)$ we can define a linear equation (2.12) with the properties listed in Theorem 2.6.1. No additional assumptions are required. However, it is important not to misinterpret the generality of this statement. The linear equation (2.12) is defined by the definition of the best linear predictor and the associated coefficient definition (2.10). In contrast, in many economic models the parameter $\beta$ may be defined within the model. In this case (2.10) may not hold and the implications of Theorem 2.6.1 may be false. These structural models require alternative estimation methods, and are discussed in Chapter 9.
Returning to the joint distribution displayed in Figure 2.3, the dashed line is the linear projection of log wages onto education. In this example the linear predictor is a close approximation to the conditional mean. In other cases the two may be quite different. Figure 2.4 displays the relationship$^3$ between mean log hourly wages and labor market experience. The solid line is the conditional mean, and the straight dashed line is the linear projection. In this case the linear projection is a poor approximation to the conditional mean. It over-predicts wages for young and old workers, and under-predicts for the rest. Most importantly, it misses the strong downturn in expected wages for those above 35 years of work experience (equivalently, for those over 53 in age).
This defect in the best linear predictor can be partially corrected through a careful selection of regressors. In the example just presented, we can augment the regressor vector $x$ to include both experience and experience$^2$. The best linear predictor of log wages given these two variables can be called a quadratic projection, since the resulting function is quadratic in experience. Other than the redefinition of the regressor vector, there are no changes in our methods or analysis. In Figure 2.4 we display as well the quadratic projection. In this example it is a much better approximation to the conditional mean than the linear projection.
$^3$ In the population of Caucasian non-military male wage earners with 12 years of education.
Figure 2.4: Hourly Wage as a Function of Experience
Another defect of linear projection is that it is sensitive to the marginal distribution of the regressors when the conditional mean is non-linear. We illustrate the issue in Figure 2.5 for a constructed$^4$ joint distribution of $y$ and $x$. The solid line is the non-linear conditional mean of $y$ given $x$. The data are divided in two groups, Group 1 and Group 2, which have different marginal distributions for the regressor $x$; Group 1 has a lower mean value of $x$ than Group 2. The separate linear projections of $y$ on $x$ for these two groups are displayed in the figure by the dashed lines. These two projections are distinct approximations to the conditional mean. A defect with linear projection is that it leads to the incorrect conclusion that the effect of $x$ on $y$ is different for individuals in the two groups. This conclusion is incorrect because in fact there is no difference in the conditional mean function. The apparent difference is a by-product of a linear approximation to a non-linear mean, combined with different marginal distributions for the conditioning variables.
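The sensitivity to the marginal distribution of $x$ can be reproduced by simulation. The sketch below uses the constructed design described in the footnote to this passage ($x \sim N(2,1)$ in Group 1, $x \sim N(4,1)$ in Group 2, and $y \mid x \sim N(m(x), 1)$ with $m(x) = 2x - x^2/6$); the sample size, seed, and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def m(x):
    """Conditional mean of the constructed example: m(x) = 2x - x^2/6."""
    return 2 * x - x**2 / 6

def projection(mean_x, n=200_000):
    """Sample linear projection of y on (1, x) when x ~ N(mean_x, 1)."""
    x = rng.normal(loc=mean_x, scale=1.0, size=n)
    y = m(x) + rng.normal(size=n)          # y | x ~ N(m(x), 1)
    X = np.column_stack([np.ones(n), x])
    return np.linalg.solve(X.T @ X, X.T @ y)

beta_group1 = projection(2.0)   # Group 1: x ~ N(2, 1)
beta_group2 = projection(4.0)   # Group 2: x ~ N(4, 1)
print(beta_group1, beta_group2)
```

Under these assumptions the population projection slope is $cov(x, m(x))/var(x) = 2 - \mu/3$ when $x \sim N(\mu, 1)$, so the two groups have slopes near $4/3$ and $2/3$ even though they share the same conditional mean function.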
2.7 Technical Proofs
Proof of Theorem 2.6.1. We first show that the moments $E(xy)$ and $E(xx')$ are finite and well defined. First, it is useful to note that Assumption 2.6.1.3 implies that
$$E\|x\|^2 = E(x'x) = \sum_{j=1}^k Ex_j^2 < \infty. \tag{2.16}$$
Note that for $j = 1, \ldots, k$, by the Cauchy-Schwarz Inequality (C.3) and Assumptions 2.6.1.2 and 2.6.1.3
$$E|x_j y| \le (Ex_j^2)^{1/2} (Ey^2)^{1/2} < \infty.$$
Thus the elements in the vector $E(xy)$ are well defined and finite. Next, note that the $jl$'th element of $E(xx')$ is $E(x_j x_l)$. Observe that
$$E|x_j x_l| \le (Ex_j^2)^{1/2} (Ex_l^2)^{1/2} < \infty$$
under Assumption 2.6.1.3. Thus all elements of the matrix $E(xx')$ are finite.
Equation (2.10) states that $\beta = (E(xx'))^{-1} E(xy)$, which is well defined since $(E(xx'))^{-1}$ exists under Assumption 2.6.1.4. It follows that $e = y - x'\beta$ as defined in (2.11) is also well defined.
$^4$ The $x_i$ in Group 1 are $N(2, 1)$ and those in Group 2 are $N(4, 1)$, and the conditional distribution of $y$ given $x$ is $N(m(x), 1)$ where $m(x) = 2x - x^2/6$.
Figure 2.5: Conditional Mean and Two Linear Projections
Note the Schwarz Inequality (A.7) implies $(x'\beta)^2 \le \|x\|^2 \|\beta\|^2$ and therefore combined with (2.16) we see that
$$E(x'\beta)^2 \le E\|x\|^2 \|\beta\|^2 < \infty. \tag{2.17}$$
Using Minkowski's Inequality (C.5), Assumption 2.6.1.2 and (2.17) we find
$$(Ee^2)^{1/2} = (E(y - x'\beta)^2)^{1/2} \le (Ey^2)^{1/2} + (E(x'\beta)^2)^{1/2} < \infty$$
establishing (2.13).
An application of the Cauchy-Schwarz Inequality (C.3) shows that for any $j$
$$E|x_j e| \le (Ex_j^2)^{1/2} (Ee^2)^{1/2} < \infty$$
and therefore the elements in the vector $E(xe)$ are well defined and finite.
Using the definitions (2.11) and (2.10), and the matrix properties that $AA^{-1} = I$ and $Ia = a$,
$$\begin{aligned} E(xe) &= E(x(y - x'\beta)) \\ &= E(xy) - E(xx')(E(xx'))^{-1} E(xy) \\ &= 0. \end{aligned}$$
Finally, equation (2.15) follows from (2.14) and Assumption 2.6.1.1.
2.8 Exercises
1. Prove parts 2, 3 and 4 of Theorem 2.3.1.
2. Suppose that the random variables $y$ and $x$ only take the values 0 and 1, and have the following joint probability distribution
          x = 0   x = 1
   y = 0    .1      .2
   y = 1    .4      .3
Find $E(y \mid x)$, $E(y^2 \mid x)$ and $var(y \mid x)$ for $x = 0$ and $x = 1$.
3. Suppose that $y$ is discrete-valued, taking values only on the non-negative integers, and the conditional distribution of $y$ given $x$ is Poisson:
$$P(y = j \mid x) = \frac{\exp(-x'\beta)\,(x'\beta)^j}{j!}, \qquad j = 0, 1, 2, \ldots$$
Compute $E(y \mid x)$ and $var(y \mid x)$. Does this justify a linear regression model of the form $y = x'\beta + e$?
Hint: If $P(y = j) = \frac{\exp(-\lambda)\lambda^j}{j!}$, then $Ey = \lambda$ and $var(y) = \lambda$.
4. Let $x$ and $y$ have the joint density $f(x, y) = \frac{3}{2}(x^2 + y^2)$ on $0 \le x \le 1$, $0 \le y \le 1$. Compute the coefficients of the best linear predictor $y = \beta_1 + \beta_2 x + e$. Compute the conditional mean $m(x) = E(y \mid x)$. Are they different?
5. Take the bivariate linear projection model
$$y = \beta_1 + \beta_2 x + e$$
$$E(e) = 0$$
$$E(xe) = 0$$
Define $\mu_y = Ey$, $\mu_x = Ex$, $\sigma_x^2 = var(x)$, $\sigma_y^2 = var(y)$ and $\sigma_{xy} = cov(x, y)$. Show that $\beta_2 = \sigma_{xy}/\sigma_x^2$ and $\beta_1 = \mu_y - \beta_2 \mu_x$.
6. True or False. If $y = x\beta + e$, $x \in R$, and $E(e \mid x) = 0$, then $E(x^2 e) = 0$.
7. True or False. If $y = x'\beta + e$ and $E(e \mid x) = 0$, then $e$ is independent of $x$.
8. True or False. If $y = x'\beta + e$, $E(e \mid x) = 0$, and $E(e^2 \mid x) = \sigma^2$, a constant, then $e$ is independent of $x$.
9. True or False. If $y = x\beta + e$, $x \in R$, and $E(xe) = 0$, then $E(x^2 e) = 0$.
10. True or False. If $y = x'\beta + e$ and $E(xe) = 0$, then $E(e \mid x) = 0$.
11. Let $x$ be a random variable with $\mu = Ex$ and $\sigma^2 = var(x)$. Define
$$g(x \mid \mu, \sigma^2) = \begin{pmatrix} x - \mu \\ (x - \mu)^2 - \sigma^2 \end{pmatrix}.$$
Show that $Eg(x \mid m, s) = 0$ if and only if $m = \mu$ and $s = \sigma^2$.
Chapter 3
Least Squares Estimation
3.1 Random Sample
In Chapter 2, we discussed the joint distribution of a pair of random variables $(y, x) \in R \times R^k$, describing the regression relationship as the conditional mean of $y$ given $x$, and the approximation given by the best linear predictor of $y$ given $x$. We now discuss estimation of these relationships from economic data.
In a typical application, an econometrician's data is a set of observed measurements on the variables $(y, x)$ for a group of individuals. These individuals may be persons, households, firms or other economic agents. We call this information the data, dataset, or sample, and denote the number of individuals in the dataset by the natural number $n$.
We will use the index $i$ to indicate the $i$'th individual in the dataset. The observation for the $i$'th individual will be written as $(y_i, x_i)$: $y_i$ is the observed value of $y$ for individual $i$ and $x_i$ is the observed value of $x$ for the same individual.
If the data is cross-sectional (each observation is a different individual) it is often reasonable to assume the observations are mutually independent. This means that the pair $(y_i, x_i)$ is independent of $(y_j, x_j)$ for $i \neq j$. (Sometimes the independent label is misconstrued. It is not a statement about the relationship between $y_i$ and $x_i$.) Furthermore, if the data is randomly gathered, it is reasonable to model each observation as a random draw from the same probability distribution. In this case we say that the data are independent and identically distributed, or iid. We call this a random sample.
Assumption 3.1.1 The observations $(y_i, x_i)$, $i = 1, \ldots, n$, are mutually independent across observations $i$ and identically distributed.
This chapter explores estimation and inference in the linear projection model for a random sample:
$$y_i = x_i'\beta + e_i \tag{3.1}$$
$$E(x_i e_i) = 0 \tag{3.2}$$
$$\beta = (E(x_i x_i'))^{-1} E(x_i y_i) \tag{3.3}$$
In Sections 3.8 and 3.9, we narrow the focus to the linear regression model, but for most of the chapter we retain the broader focus on the projection model.
3.2 Estimation
Equation (3.3) writes the projection coefficient $\beta$ as an explicit function of population moments $E(x_i y_i)$ and $E(x_i x_i')$. Their moment estimators are the sample moments
$$\hat{E}(x_i y_i) = \frac{1}{n} \sum_{i=1}^n x_i y_i$$
$$\hat{E}(x_i x_i') = \frac{1}{n} \sum_{i=1}^n x_i x_i'.$$
It follows that the moment estimator of $\beta$ replaces the population moments in (3.3) with the sample moments:
$$\begin{aligned} \hat\beta &= \left(\hat{E}(x_i x_i')\right)^{-1} \hat{E}(x_i y_i) \\ &= \left(\frac{1}{n} \sum_{i=1}^n x_i x_i'\right)^{-1} \left(\frac{1}{n} \sum_{i=1}^n x_i y_i\right) \\ &= \left(\sum_{i=1}^n x_i x_i'\right)^{-1} \left(\sum_{i=1}^n x_i y_i\right). \end{aligned} \tag{3.4}$$
Another way to derive $\hat\beta$ is as follows. Observe that (3.2) can be written in the parametric form $g(\beta) = E(x_i (y_i - x_i'\beta)) = 0$. The function $g(\beta)$ can be estimated by
$$\hat{g}(\beta) = \frac{1}{n} \sum_{i=1}^n x_i (y_i - x_i'\beta).$$
This is a set of $k$ equations which are linear in $\beta$. The estimator $\hat\beta$ is the value which jointly sets these equations equal to zero:
$$\begin{aligned} 0 &= \hat{g}(\hat\beta) \\ &= \frac{1}{n} \sum_{i=1}^n x_i (y_i - x_i'\hat\beta) \\ &= \frac{1}{n} \sum_{i=1}^n x_i y_i - \frac{1}{n} \sum_{i=1}^n x_i x_i' \hat\beta \end{aligned} \tag{3.5}$$
whose solution is (3.4).
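The two derivations can be checked against each other numerically. The following NumPy sketch uses simulated data (the design, `beta_true`, and the seed are assumptions made only for illustration): it forms the sample moments, solves for the moment estimator (3.4), and confirms that the sample moment condition (3.5) holds at the solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta_true = 500, np.array([1.0, 0.5])   # illustrative values only

x = np.column_stack([np.ones(n), rng.normal(size=n)])   # rows are x_i'
y = x @ beta_true + rng.normal(size=n)

# Sample moments: (1/n) sum x_i x_i' and (1/n) sum x_i y_i
Q_hat = x.T @ x / n
xy_hat = x.T @ y / n

# Moment estimator (3.4); solve() avoids forming the inverse explicitly
beta_hat = np.linalg.solve(Q_hat, xy_hat)

# Sample moment condition (3.5): (1/n) sum x_i (y_i - x_i' beta_hat)
g_hat = x.T @ (y - x @ beta_hat) / n
print(beta_hat, g_hat)
```

The factor $1/n$ cancels in (3.4), so solving with the raw sums `x.T @ x` and `x.T @ y` gives the same coefficients.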
To illustrate, consider the data used to generate Figure 2.3. These are white male wage earners from the March 2004 Current Population Survey, excluding military, with 10-15 years of potential work experience. This sample has 988 observations. Let $y_i$ be log wages and $x_i$ be an intercept and years of education. Then
$$\frac{1}{n} \sum_{i=1}^n x_i y_i = \begin{pmatrix} 2.95 \\ 42.40 \end{pmatrix}$$
$$\frac{1}{n} \sum_{i=1}^n x_i x_i' = \begin{pmatrix} 1 & 14.14 \\ 14.14 & 205.83 \end{pmatrix}.$$
Thus
$$\hat\beta = \begin{pmatrix} 1 & 14.14 \\ 14.14 & 205.83 \end{pmatrix}^{-1} \begin{pmatrix} 2.95 \\ 42.40 \end{pmatrix} = \begin{pmatrix} 34.94 & -2.40 \\ -2.40 & 0.170 \end{pmatrix} \begin{pmatrix} 2.95 \\ 42.40 \end{pmatrix} = \begin{pmatrix} 1.313 \\ 0.128 \end{pmatrix}.$$
We often write the estimated equation using the format
$$\widehat{\log(Wage)} = 1.313 + 0.128\ Education.$$
An interpretation of the estimated equation is that each year of education is associated with a 12.8% increase in mean wages.
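The arithmetic in this example can be retraced with a few lines of NumPy. The sketch below uses only the two-decimal sample moments printed above; because those inputs are rounded, the recomputed coefficients should be expected to match the reported values only approximately, not digit for digit.

```python
import numpy as np

# Sample moments as printed in the text (rounded to two decimals)
Q_hat = np.array([[1.0, 14.14],
                  [14.14, 205.83]])
xy_hat = np.array([2.95, 42.40])

Q_inv = np.linalg.inv(Q_hat)        # compare with the printed inverse
beta_hat = Q_inv @ xy_hat           # compare with the reported (1.313, 0.128)

print(np.round(Q_inv, 3))
print(np.round(beta_hat, 3))
```

The inverse agrees with the printed matrix to the digits shown. The coefficients are more sensitive, since they are small differences of large products, which is why estimates are normally computed from unrounded data.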
3.3 Least Squares
Least squares is another classic motivation for the estimator (3.4). Define the sum-of-squared errors (SSE) function
$$S_n(\beta) = \sum_{i=1}^n (y_i - x_i'\beta)^2 = \sum_{i=1}^n y_i^2 - 2\beta' \sum_{i=1}^n x_i y_i + \beta' \sum_{i=1}^n x_i x_i' \beta.$$
This is a quadratic function of $\beta$. To visualize this function, Figure 3.1 displays an example sum-of-squared errors function $S_n(\beta)$ for the case $k = 2$.
Figure 3.1: Sum-of-Squared Errors Function
The Ordinary Least Squares (OLS) estimator is the value of $\beta$ which minimizes $S_n(\beta)$. Matrix calculus (see Appendix A.9) gives the first-order conditions for minimization:
$$0 = \frac{\partial}{\partial \beta} S_n(\hat\beta) = -2 \sum_{i=1}^n x_i y_i + 2 \sum_{i=1}^n x_i x_i' \hat\beta$$
whose solution is (3.4). Following convention we will call $\hat\beta$ the OLS estimator of $\beta$.
As a by-product of OLS estimation, we define the predicted value
$$\hat{y}_i = x_i'\hat\beta$$
and the residual
$$\hat{e}_i = y_i - \hat{y}_i = y_i - x_i'\hat\beta.$$
Note that $y_i = \hat{y}_i + \hat{e}_i$. It is important to understand the distinction between the error $e_i$ and the residual $\hat{e}_i$. The error is unobservable, while the residual is a by-product of estimation. These two variables are frequently mislabeled, which can cause confusion.
Equation (3.5) implies that
$$\frac{1}{n} \sum_{i=1}^n x_i \hat{e}_i = 0.$$
Since $x_i$ contains a constant, one implication is that
$$\frac{1}{n} \sum_{i=1}^n \hat{e}_i = 0.$$
Thus the residuals have a sample mean of zero and the sample correlation between the regressors and the residual is zero. These are algebraic results, and hold true for all linear regression estimates.
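Because these properties are algebraic, they hold even when the linear model is misspecified. The sketch below (simulated data; the non-linear data-generating process is an assumption chosen to stress this point) computes predicted values and residuals and checks both identities.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # includes a constant
y = rng.normal(size=n) + X[:, 1] ** 2   # deliberately non-linear in the regressors

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_fit = X @ beta_hat          # predicted values
e_hat = y - y_fit             # residuals

moment = X.T @ e_hat / n      # (1/n) sum x_i e_hat_i
print(moment, e_hat.mean())
```

Both quantities are zero up to floating-point rounding, regardless of how `y` was generated.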
The error variance $\sigma^2 = Ee_i^2$ is also a parameter of interest. It measures the variation in the "unexplained" part of the regression. Its method of moments estimator is the sample average of the squared residuals
$$\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n \hat{e}_i^2. \tag{3.6}$$
An alternative estimator uses the formula
$$s^2 = \frac{1}{n - k} \sum_{i=1}^n \hat{e}_i^2. \tag{3.7}$$
A justification for the latter choice will be provided in Section 3.8.
A measure of the explained variation relative to the total variation is the coefficient of determination or R-squared,
$$R^2 = \frac{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2} = 1 - \frac{\hat\sigma^2}{\hat\sigma_y^2}$$
where
$$\hat\sigma_y^2 = \frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2$$
is the sample variance of $y_i$. The $R^2$ is frequently mislabeled as a measure of "fit". It is an inappropriate label as the value of $R^2$ does not help interpret the parameter estimates $\hat\beta$ or test statistics concerning $\beta$. Instead, it should be viewed as an estimator of the population parameter
$$\rho^2 = \frac{var(x_i'\beta)}{var(y_i)} = 1 - \frac{\sigma^2}{\sigma_y^2}$$
where $\sigma_y^2 = var(y_i)$. An alternative estimator of $\rho^2$ proposed by Theil called "R-bar-squared" is
$$\bar{R}^2 = 1 - \frac{s^2}{\tilde\sigma_y^2}$$
where
$$\tilde\sigma_y^2 = \frac{1}{n - 1} \sum_{i=1}^n (y_i - \bar{y})^2.$$
Theil's estimator $\bar{R}^2$ is a ratio of adjusted variance estimators, and therefore is expected to be a better estimator of $\rho^2$ than the unadjusted estimator $R^2$.
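The variance estimators and the two R-squared measures are simple functions of the residuals. As a minimal sketch on simulated data (the coefficients, sample size, and variable names are illustrative assumptions), the following computes (3.6), (3.7), $R^2$ in both of its forms, and Theil's $\bar{R}^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e_hat = y - X @ beta_hat
sse = e_hat @ e_hat

sigma2_hat = sse / n            # method of moments estimator (3.6)
s2 = sse / (n - k)              # degrees-of-freedom adjusted estimator (3.7)

dev = y - y.mean()
sigma2_y_hat = dev @ dev / n          # sample variance of y (divisor n)
sigma2_y_tilde = dev @ dev / (n - 1)  # adjusted sample variance (divisor n - 1)

r2 = 1 - sigma2_hat / sigma2_y_hat
r2_explained = ((X @ beta_hat - y.mean()) ** 2).sum() / (dev @ dev)
r2_bar = 1 - s2 / sigma2_y_tilde      # Theil's R-bar-squared
print(r2, r2_explained, r2_bar)
```

The two expressions for $R^2$ coincide because the regression includes an intercept, and $\bar{R}^2$ is smaller than $R^2$ whenever $k > 1$ and $R^2 < 1$.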
3.4 Normal Regression Model
Another motivation for the least-squares estimator can be obtained from the normal regression model. This is the linear regression model with the additional assumption that the error $e_i$ is independent of $x_i$ and has the distribution $N(0, \sigma^2)$. This is a parametric model, where likelihood methods can be used for estimation, testing, and distribution theory.
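The log-likelihood function for the normal regression model is
$$\begin{aligned} \log L(\beta, \sigma^2) &= \sum_{i=1}^n \log\left( \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left( -\frac{1}{2\sigma^2} (y_i - x_i'\beta)^2 \right) \right) \\ &= -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} S_n(\beta). \end{aligned}$$
The MLE $(\hat\beta, \hat\sigma^2)$ maximize $\log L(\beta, \sigma^2)$. Since $\log L(\beta, \sigma^2)$ is a function of $\beta$ only through the sum of squared errors $S_n(\beta)$, maximizing the likelihood is identical to minimizing $S_n(\beta)$. Hence the MLE for $\beta$ equals the OLS estimator.
Plugging $\hat\beta$ into the log-likelihood we obtain
$$\log L(\hat\beta, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n \hat{e}_i^2.$$
Maximization with respect to $\sigma^2$ yields the first-order condition
$$\frac{\partial}{\partial \sigma^2} \log L(\hat\beta, \hat\sigma^2) = -\frac{n}{2\hat\sigma^2} + \frac{1}{2(\hat\sigma^2)^2} \sum_{i=1}^n \hat{e}_i^2 = 0.$$
Solving for $\hat\sigma^2$ yields the method of moments estimator (3.6). Thus the MLE $(\hat\beta, \hat\sigma^2)$ for the normal regression model are identical to the method of moments estimators. Due to this equivalence, the OLS estimator $\hat\beta$ is frequently referred to as the Gaussian MLE.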
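The equivalence can be spot-checked numerically without an optimizer. The sketch below (simulated data; the perturbation sizes are arbitrary illustrative choices) evaluates the Gaussian log-likelihood at the OLS coefficients and the estimator (3.6), and confirms that moving either argument away from those values lowers the log-likelihood. It is a local check at a few points, not a proof.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=1.5, size=n)

def log_lik(beta, sigma2):
    """Gaussian log-likelihood: -(n/2) log(2 pi sigma2) - S_n(beta) / (2 sigma2)."""
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_mm = np.sum((y - X @ beta_ols) ** 2) / n     # estimator (3.6)

ll_max = log_lik(beta_ols, sigma2_mm)
# Perturbing either argument should lower the log-likelihood.
ll_beta = log_lik(beta_ols + np.array([0.05, -0.05]), sigma2_mm)
ll_sig_lo = log_lik(beta_ols, 0.9 * sigma2_mm)
ll_sig_hi = log_lik(beta_ols, 1.1 * sigma2_mm)
print(ll_max, ll_beta, ll_sig_lo, ll_sig_hi)
```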
3.5 Model in Matrix Notation
For many purposes, including computation, it is convenient to write the model and statistics in matrix notation. The linear equation (2.12) is a system of $n$ equations, one for each observation. We can stack these $n$ equations together as
$$\begin{aligned} y_1 &= x_1'\beta + e_1 \\ y_2 &= x_2'\beta + e_2 \\ &\ \ \vdots \\ y_n &= x_n'\beta + e_n. \end{aligned}$$
Now define
$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad X = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix}, \qquad e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}.$$
Observe that $y$ and $e$ are $n \times 1$ vectors, and $X$ is an $n \times k$ matrix. Then the system of $n$ equations can be compactly written in the single equation
$$y = X\beta + e.$$
Sample sums can also be written in matrix notation. For example
$$\sum_{i=1}^n x_i x_i' = X'X$$
$$\sum_{i=1}^n x_i y_i = X'y.$$
Thus the estimator (3.4), residual vector, and sample error variance can be written as
$$\hat\beta = (X'X)^{-1} (X'y)$$
$$\hat{e} = y - X\hat\beta$$
$$\hat\sigma^2 = n^{-1} \hat{e}'\hat{e}.$$
A useful result is obtained by inserting $y = X\beta + e$ into the formula for $\hat\beta$ to obtain
$$\begin{aligned} \hat\beta &= (X'X)^{-1} X'(X\beta + e) \\ &= (X'X)^{-1} X'X\beta + (X'X)^{-1} X'e \\ &= \beta + (X'X)^{-1} X'e. \end{aligned} \tag{3.8}$$
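The matrix expressions map directly onto array code. In the following sketch (simulated data, so that the otherwise unobservable $\beta$ and $e$ are known; all names and values are illustrative assumptions) the matrix products are compared with explicit sums over observations, and the decomposition (3.8) is verified.

```python
import numpy as np

rng = np.random.default_rng(5)
n, beta = 50, np.array([1.0, 2.0, -0.5])   # beta is known only because we simulate
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
e = rng.normal(size=n)
y = X @ beta + e                            # y = X beta + e

XtX = X.T @ X                               # sum of x_i x_i'
Xty = X.T @ y                               # sum of x_i y_i
beta_hat = np.linalg.solve(XtX, Xty)        # (X'X)^{-1} X'y

# Check the matrix products against explicit loops over observations
XtX_loop = sum(np.outer(x_i, x_i) for x_i in X)
Xty_loop = sum(x_i * y_i for x_i, y_i in zip(X, y))

# Decomposition (3.8): beta_hat = beta + (X'X)^{-1} X'e
decomposition = beta + np.linalg.solve(XtX, X.T @ e)
print(beta_hat, decomposition)
```

With real data the second term of (3.8) cannot be computed, since $e$ is unobserved; the decomposition is a tool for analysis rather than for estimation.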
3.6 Projection Matrices
Define the matrices
$$P = X(X'X)^{-1}X'$$
and
$$M = I_n - X(X'X)^{-1}X' = I_n - P$$
where $I_n$ is the $n \times n$ identity matrix. $P$ and $M$ are called projection matrices due to the property that for any matrix $Z$ which can be written as $Z = X\Gamma$ for some matrix $\Gamma$ (we say that $Z$ lies in the range space of $X$),
$$PZ = PX\Gamma = X(X'X)^{-1}X'X\Gamma = X\Gamma = Z$$
and
$$MZ = (I_n - P)Z = Z - PZ = Z - Z = 0.$$
As an important example of this property, partition the matrix $X$ into two matrices $X_1$ and $X_2$, so that
$$X = [X_1 \ \ X_2].$$
Then $PX_1 = X_1$ and $MX_1 = 0$. It follows that $MX = 0$ and $MP = 0$, so $M$ and $P$ are orthogonal.
The matrices $P$ and $M$ are symmetric and idempotent$^1$. To see that $P$ is symmetric,
$$\begin{aligned} P' &= \left(X(X'X)^{-1}X'\right)' \\ &= (X')' \left((X'X)^{-1}\right)' (X)' \\ &= X \left((X'X)'\right)^{-1} X' \\ &= X \left((X)'(X')'\right)^{-1} X' \\ &= P. \end{aligned}$$
To establish that it is idempotent,
$$\begin{aligned} PP &= \left(X(X'X)^{-1}X'\right)\left(X(X'X)^{-1}X'\right) \\ &= X(X'X)^{-1}X'X(X'X)^{-1}X' \\ &= X(X'X)^{-1}X' \\ &= P. \end{aligned}$$
Similarly,
$$M' = (I_n - P)' = I_n - P = M$$
and
$$\begin{aligned} MM &= M(I_n - P) \\ &= M - MP \\ &= M, \end{aligned}$$
since $MP = 0$.
$^1$ A matrix $P$ is symmetric if $P' = P$. A matrix $P$ is idempotent if $PP = P$. See Appendix A.8.
Another useful property is that
$$tr\, P = k \tag{3.9}$$
$$tr\, M = n - k \tag{3.10}$$
(See Appendix A.4 for definition and properties of the trace operator.) To show (3.9) and (3.10),
$$\begin{aligned} tr\, P &= tr\left(X(X'X)^{-1}X'\right) \\ &= tr\left((X'X)^{-1}X'X\right) \\ &= tr(I_k) \\ &= k, \end{aligned}$$
and
$$tr\, M = tr(I_n - P) = tr(I_n) - tr(P) = n - k.$$
Given the definitions of $P$ and $M$, observe that
$$\hat{y} = X\hat\beta = X(X'X)^{-1}X'y = Py$$
and
$$\hat{e} = y - X\hat\beta = y - Py = My. \tag{3.11}$$
Furthermore, since $y = X\beta + e$ and $MX = 0$, then
$$\hat{e} = M(X\beta + e) = Me. \tag{3.12}$$
Another way of writing (3.11) is
$$y = (P + M)y = Py + My = \hat{y} + \hat{e}.$$
This decomposition is orthogonal, that is
$$\hat{y}'\hat{e} = (Py)'(My) = y'PMy = 0.$$
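All of the properties of $P$ and $M$ listed in this section can be verified on a small example. The sketch below uses an arbitrary simulated design matrix (the dimensions and seed are illustrative assumptions) and collects each check in a dictionary.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)   # P = X (X'X)^{-1} X'
M = np.eye(n) - P                        # M = I_n - P

y_fit, e_hat = P @ y, M @ y              # fitted values and residuals
checks = {
    "P symmetric": np.allclose(P, P.T),
    "P idempotent": np.allclose(P @ P, P),
    "M idempotent": np.allclose(M @ M, M),
    "MX = 0": np.allclose(M @ X, 0),
    "MP = 0": np.allclose(M @ P, 0),
    "tr P = k": np.isclose(np.trace(P), k),
    "tr M = n - k": np.isclose(np.trace(M), n - k),
    "y_fit'e_hat = 0": np.isclose(y_fit @ e_hat, 0),
}
print(checks)
```

Forming the $n \times n$ matrices explicitly is convenient for checking identities, but it costs $O(n^2)$ memory; in practice fitted values and residuals are usually computed from $\hat\beta$ directly.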
3.7 Residual Regression
Partition
$$X = [X_1 \ \ X_2]$$
and
$$\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}.$$
Then the regression model can be rewritten as
$$y = X_1\beta_1 + X_2\beta_2 + e. \tag{3.13}$$
Observe that the OLS estimator of $\beta = (\beta_1', \beta_2')'$ can be obtained by regression of $y$ on $X = [X_1 \ \ X_2]$. OLS estimation can be written as
$$y = X_1\hat\beta_1 + X_2\hat\beta_2 + \hat{e}. \tag{3.14}$$
Suppose that we are primarily interested in $\beta_2$, not in $\beta_1$, so we are only interested in obtaining the OLS sub-component $\hat\beta_2$. In this section we derive an alternative expression for $\hat\beta_2$ which does not involve estimation of the full model.
Define
$$M_1 = I_n - X_1(X_1'X_1)^{-1}X_1'.$$
Recalling the definition $M = I - X(X'X)^{-1}X'$, observe that $X_1'M = 0$ and thus
$$M_1 M = M - X_1(X_1'X_1)^{-1}X_1'M = M.$$
It follows that
$$M_1\hat{e} = M_1 M e = Me = \hat{e}.$$
Using this result, if we premultiply (3.14) by $M_1$ we obtain
$$\begin{aligned} M_1 y &= M_1X_1\hat\beta_1 + M_1X_2\hat\beta_2 + M_1\hat{e} \\ &= M_1X_2\hat\beta_2 + \hat{e} \end{aligned} \tag{3.15}$$
the second equality since $M_1X_1 = 0$. Premultiplying by $X_2'$ and recalling that $X_2'\hat{e} = 0$, we obtain
$$X_2'M_1y = X_2'M_1X_2\hat\beta_2 + X_2'\hat{e} = X_2'M_1X_2\hat\beta_2.$$
Solving,
$$\hat\beta_2 = (X_2'M_1X_2)^{-1} (X_2'M_1y),$$
an alternative expression for $\hat\beta_2$.
Now, define
$$\tilde{X}_2 = M_1X_2 \tag{3.16}$$
$$\tilde{y} = M_1y, \tag{3.17}$$
the least-squares residuals from the regression of $X_2$ and $y$, respectively, on the matrix $X_1$ only. Since the matrix $M_1$ is idempotent, $M_1 = M_1M_1$ and thus
$$\begin{aligned} \hat\beta_2 &= (X_2'M_1X_2)^{-1} (X_2'M_1y) \\ &= (X_2'M_1M_1X_2)^{-1} (X_2'M_1M_1y) \\ &= (\tilde{X}_2'\tilde{X}_2)^{-1} (\tilde{X}_2'\tilde{y}). \end{aligned}$$
This shows that $\hat\beta_2$ can be calculated by the OLS regression of $\tilde{y}$ on $\tilde{X}_2$. This technique is called residual regression.
Furthermore, using the definitions (3.16) and (3.17), expression (3.15) can be equivalently written as
$$\tilde{y} = \tilde{X}_2\hat\beta_2 + \hat{e}.$$
Since $\hat\beta_2$ is precisely the OLS coefficient from a regression of $\tilde{y}$ on $\tilde{X}_2$, this shows that the residual vector from this regression is $\hat{e}$, numerically the same residual as from the joint regression (3.14). We have proven the following theorem.
Theorem 3.7.1 (Frisch-Waugh-Lovell). In the model (3.13), the OLS estimator of $\beta_2$ and the OLS residuals $\hat{e}$ may be equivalently computed by either the OLS regression (3.14) or via the following algorithm:
1. Regress $y$ on $X_1$, obtain residuals $\tilde{y}$;
2. Regress $X_2$ on $X_1$, obtain residuals $\tilde{X}_2$;
3. Regress $\tilde{y}$ on $\tilde{X}_2$, obtain OLS estimates $\hat\beta_2$ and residuals $\hat{e}$.
In some contexts, the FWL theorem can be used to speed computation, but in most cases there is little
computational advantage to using the two-step algorithm. Rather, the primary use is theoretical.
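The three-step algorithm of the theorem is easy to compare with the full regression. The following is a minimal sketch on simulated data; the helper `ols`, the design, and the coefficient values are hypothetical and chosen only for illustration.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and residuals from regressing y on X."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    return coef, y - X @ coef

rng = np.random.default_rng(7)
n = 120
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, 2)) + X1[:, [1]]        # correlated with X1
y = X1 @ np.array([1.0, 0.3]) + X2 @ np.array([0.7, -0.2]) + rng.normal(size=n)

# Full regression (3.14) of y on [X1 X2]
beta_full, e_full = ols(np.column_stack([X1, X2]), y)

# Three-step algorithm of Theorem 3.7.1
_, y_tilde = ols(X1, y)                      # step 1: residuals of y on X1
_, X2_tilde = ols(X1, X2)                    # step 2: residuals of X2 on X1
beta2_fwl, e_fwl = ols(X2_tilde, y_tilde)    # step 3

print(beta_full[2:], beta2_fwl)
```

Both the coefficients on $X_2$ and the residual vector agree across the two routes, up to floating-point rounding.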
A common application of the FWL theorem, which you may have seen in an introductory econometrics course, is the demeaning formula for regression. Partition $X = [X_1 \ \ X_2]$ where $X_1 = \iota$ is a vector of ones, and $X_2$ is the vector of observed regressors. In this case,
$$M_1 = I - \iota(\iota'\iota)^{-1}\iota'.$$
Observe that
$$\begin{aligned} \tilde{X}_2 &= M_1X_2 \\ &= X_2 - \iota(\iota'\iota)^{-1}\iota'X_2 \\ &= X_2 - \bar{X}_2 \end{aligned}$$
and
$$\begin{aligned} \tilde{y} &= M_1y \\ &= y - \iota(\iota'\iota)^{-1}\iota'y \\ &= y - \bar{y}, \end{aligned}$$
which are "demeaned". The FWL theorem says that $\hat\beta_2$ is the OLS estimate from a regression of $y_i - \bar{y}$ on $x_{2i} - \bar{x}_2$:
$$\hat\beta_2 = \left( \sum_{i=1}^n (x_{2i} - \bar{x}_2)(x_{2i} - \bar{x}_2)' \right)^{-1} \left( \sum_{i=1}^n (x_{2i} - \bar{x}_2)(y_i - \bar{y}) \right).$$
Thus the OLS estimator for the slope coefficients is a regression with demeaned data.
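The demeaning formula can be confirmed in a few lines. This sketch (simulated data; names and values are illustrative assumptions) compares the slope coefficients from a regression with an intercept against those from a no-intercept regression on demeaned variables.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 80
x2 = rng.normal(loc=3.0, size=(n, 2))                 # observed regressors
y = 1.0 + x2 @ np.array([0.5, -1.0]) + rng.normal(size=n)

# Regression of y on an intercept and x2
X = np.column_stack([np.ones(n), x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Regression of demeaned y on demeaned x2, with no intercept
x2_dm = x2 - x2.mean(axis=0)
y_dm = y - y.mean()
slope_dm = np.linalg.solve(x2_dm.T @ x2_dm, x2_dm.T @ y_dm)

print(beta_hat[1:], slope_dm)
```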
3.8 Bias and Variance
In this and the following section we consider the special case of the linear regression model (2.8)-(2.9).
In this section we derive the small sample conditional mean and variance of the OLS estimator.
By the independence of the observations and (2.9), observe that
$$E(e \mid X) = \begin{pmatrix} \vdots \\ E(e_i \mid X) \\ \vdots \end{pmatrix} = \begin{pmatrix} \vdots \\ E(e_i \mid x_i) \\ \vdots \end{pmatrix} = 0. \tag{3.18}$$
Using (3.8), the properties of conditional expectations, and (3.18), we can calculate
$$\begin{aligned} E(\hat\beta - \beta \mid X) &= E\left((X'X)^{-1}X'e \mid X\right) \\ &= (X'X)^{-1}X'E(e \mid X) \\ &= 0. \end{aligned}$$
We have shown that
$$E(\hat\beta \mid X) = \beta \tag{3.19}$$
which implies
$$E(\hat\beta) = \beta$$
and thus the OLS estimator $\hat\beta$ is unbiased for $\beta$.
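Conditional unbiasedness can be illustrated by a small Monte Carlo experiment in which $X$ is held fixed and only the errors are redrawn. The sketch below is one such experiment under assumed settings (the coefficients, the sample size, the number of replications, and the heteroskedastic error scale are all illustrative choices); the errors satisfy $E(e \mid X) = 0$ but are not homoskedastic.

```python
import numpy as np

rng = np.random.default_rng(9)
n, reps = 25, 20_000
beta = np.array([1.0, -2.0])                      # illustrative true coefficients
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # held fixed: we condition on X
A = np.linalg.solve(X.T @ X, X.T)                 # (X'X)^{-1} X'

# Heteroskedastic errors with E(e | X) = 0: the scale depends on x, the mean does not
scale = 0.5 + np.abs(X[:, 1])
e = rng.normal(size=(reps, n)) * scale
Y = X @ beta + e                                  # each row is one simulated sample
beta_hats = Y @ A.T                               # each row is one OLS estimate

print(beta_hats.mean(axis=0))
```

The average of the estimates across replications is close to `beta`, with the remaining gap attributable to simulation noise.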
Next, for any random vector $Z$ define the covariance matrix
$$var(Z) = E(Z - EZ)(Z - EZ)' = EZZ' - (EZ)(EZ)'.$$
Then given (3.19) we see that
$$\begin{aligned} var(\hat\beta \mid X) &= E\left( (\hat\beta - \beta)(\hat\beta - \beta)' \mid X \right) \\ &= (X'X)^{-1} X'DX (X'X)^{-1} \end{aligned}$$
where
$$D = E(ee' \mid X).$$
The $i$'th diagonal element of $D$ is
$$E(e_i^2 \mid X) = E(e_i^2 \mid x_i) = \sigma_i^2$$
while the $ij$'th off-diagonal element of $D$ is
$$E(e_i e_j \mid X) = E(e_i \mid x_i)\, E(e_j \mid x_j) = 0.$$