Chapter 4
Linear Regression and Correlation Analysis
1. Introduction to regression analysis
Regression analysis is used to:
- Describe a relationship between two variables in mathematical terms.
- Predict the value of a dependent variable based on the value of at least one independent variable.
- Explain the impact of changes in an independent variable on the dependent variable.
- Dependent variable: the variable we wish to explain.
- Independent variable: the variable used to explain the dependent variable.
Names for ys and xs in regression model
Names for y            Names for xs
Dependent variable     Independent variables
Regressand             Regressors
Effect variable        Causal variables
Explained variable     Explanatory variables
Simple Linear Regression Model
- Only one independent variable, x.
- The relationship between x and y is described by a linear function.
- Changes in y are assumed to be caused by changes in x.
Types of Regression Models
[Figure: four scatter-plot panels: positive linear relationship, negative linear relationship, non-linear relationship, no relationship]
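Population Linear Regression
The population regression model:
$$ y = \beta_0 + \beta_1 x + \varepsilon $$
where
- y is the dependent variable
- x is the independent variable
- β0 is the population y intercept
- β1 is the population slope coefficient
- ε is the random error term, or residual
β0 + β1x is the linear component and ε is the random error component.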
Linear Regression Assumptions
- Error values (ε) are statistically independent.
- Error values are normally distributed for any given value of x.
- The probability distribution of the errors has constant variance.
- The underlying relationship between the x variable and the y variable is linear.
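The assumptions above can be made concrete with a small simulation. The following Python sketch draws data from a population model whose errors are independent, normally distributed, and of constant variance. The function name `simulate` and the parameter values (β0 = 2, β1 = 0.5, σ = 1) are illustrative assumptions, not values from this chapter.

```python
import random

def simulate(n, beta0, beta1, sigma, seed=0):
    """Draw n points from y = beta0 + beta1 * x + eps with eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    # Each error is drawn independently from the same normal distribution,
    # so its variance is the same for every value of x.
    eps = [rng.gauss(0, sigma) for _ in range(n)]
    # The underlying relationship between x and y is linear.
    ys = [beta0 + beta1 * x + e for x, e in zip(xs, eps)]
    return xs, ys, eps

# Hypothetical parameter values, chosen only for illustration.
xs, ys, eps = simulate(n=1000, beta0=2.0, beta1=0.5, sigma=1.0)
mean_error = sum(eps) / len(eps)
print(f"mean of simulated errors: {mean_error:.3f}")
```

With a large sample the mean of the simulated errors should be close to zero, although it will not be exactly zero.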
[Figure: the population regression line y = β0 + β1x + ε plotted against x, with intercept β0 and slope β1. For a given xi, the random error εi is the vertical distance between the observed value of y and the predicted value of y on the line.]
Estimated Regression Model
The sample regression line provides an estimate of the population regression line:
$$ \hat{y}_i = b_0 + b_1 x $$
where
- ŷi is the estimated (or predicted) y value
- b0 is the estimate of the regression intercept
- b1 is the estimate of the regression slope
- x is the independent variable
The individual random error terms ei have a mean of zero.
Least Squares Criterion
b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals:
$$ \sum e^2 = \sum (y - \hat{y})^2 = \sum \bigl(y - (b_0 + b_1 x)\bigr)^2 $$
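The criterion can be checked numerically. The sketch below evaluates the sum of squared residuals for several candidate lines on a small made-up dataset (the data and the helper name `sse` are assumptions for illustration only) and shows that the least squares line gives the smallest value among them.

```python
def sse(b0, b1, xs, ys):
    """Sum of squared residuals for the line y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Made-up toy data, not taken from the chapter.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least squares estimates for this toy data: b0 = 2.2, b1 = 0.6.
best = sse(2.2, 0.6, xs, ys)

# Nearby candidate lines all give a larger sum of squared residuals.
candidates = [(2.0, 0.6), (2.4, 0.6), (2.2, 0.5), (2.2, 0.7)]
others = [sse(b0, b1, xs, ys) for b0, b1 in candidates]

print(f"least squares line: SSE = {best:.2f}")
for (b0, b1), value in zip(candidates, others):
    print(f"b0={b0}, b1={b1}: SSE = {value:.2f}")
```

Trying a handful of candidates is only a spot check. The closed-form formulas for b0 and b1 give the minimizing values directly.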
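The Least Squares Equation
The formulas for b1 and b0 are:
$$ b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} $$
or
$$ b_1 = \frac{\overline{xy} - \bar{x} \cdot \bar{y}}{\sigma_x^2}, \qquad \sigma_x^2 = \overline{x^2} - \bar{x}^2 $$
and
$$ b_0 = \bar{y} - b_1 \bar{x} $$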
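The slope can be computed either from raw sums or from means, and the two forms should agree. The following sketch implements both on a small made-up dataset. The function names `fit_from_sums` and `fit_from_means` and the data are assumptions introduced here for illustration.

```python
def fit_from_sums(xs, ys):
    """Slope and intercept from raw sums of x, y, xy and x^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1

def fit_from_means(xs, ys):
    """Same estimates written with means: (mean(xy) - mean(x)*mean(y)) / var(x)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    xy_bar = sum(x * y for x, y in zip(xs, ys)) / n
    var_x = sum(x * x for x in xs) / n - x_bar ** 2  # variance of x, divisor n
    b1 = (xy_bar - x_bar * y_bar) / var_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up toy data, not taken from the chapter.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0_s, b1_s = fit_from_sums(xs, ys)
b0_m, b1_m = fit_from_means(xs, ys)
print(f"from sums:  b0 = {b0_s:.3f}, b1 = {b1_s:.3f}")
print(f"from means: b0 = {b0_m:.3f}, b1 = {b1_m:.3f}")
```

Dividing the numerator and denominator of the sums form by n gives the means form, which is why both functions return the same estimates up to floating-point rounding.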
Interpretation of the Slope and the Intercept
- b0 is the estimated average value of y when the value of x is zero.
- b1 is the estimated change in the average value of y as a result of a one-unit change in x.
Example
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet).
A random sample of 10 houses is selected.
- Dependent variable (y): house price in $1000s
- Independent variable (x): square feet
Sample Data for House Price Model
House Price in $1000s (y)    Square Feet (x)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700
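As a hedged illustration, the least squares line for this sample can be computed directly in Python. The sketch uses the deviation form of the slope, Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², which is algebraically equivalent to the sums form. The variable names and the 2000-square-foot query are assumptions added for illustration.

```python
# Sample data from the house price example.
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]          # y, in $1000s
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(prices) / n

# Deviation form of the least squares slope and intercept.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, prices))
sxx = sum((x - x_bar) ** 2 for x in sqft)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

print(f"price_hat = {b0:.2f} + {b1:.4f} * sqft")

# Hypothetical query: estimated average price of a 2000 square foot house.
predicted = b0 + b1 * 2000
print(f"estimated price at 2000 sq ft: {predicted:.2f} ($1000s)")
```

The fitted line is roughly ŷ = 98.25 + 0.1098x. Read with the interpretation of the slope, each additional square foot is associated with an increase of about 0.1098 thousand dollars, or about $110, in the estimated average price. The intercept has no practical meaning here because no house in the sample has a size near zero.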
Least Squares Regression Properties
- The sum of the residuals from the least squares regression line is 0: $\sum (y - \hat{y}) = 0$
- The sum of the squared residuals, $\sum (y - \hat{y})^2$, is a minimum (minimized).
- The simple regression line always passes through the mean of the y variable and the mean of the x variable: $\bar{y} = b_0 + b_1 \bar{x}$
- The least squares coefficients are unbiased estimates of β0 and β1.
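Two of these properties are easy to verify numerically. The sketch below fits the house price sample and checks that the residuals sum to zero and that the line passes through the point of means. Variable names are assumptions for illustration, and results hold only up to floating-point rounding.

```python
# Sample data from the house price example.
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]          # y, in $1000s
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(prices) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, prices))
      / sum((x - x_bar) ** 2 for x in sqft))
b0 = y_bar - b1 * x_bar

# Property: the residuals y - y_hat sum to zero.
residuals = [y - (b0 + b1 * x) for x, y in zip(sqft, prices)]
residual_sum = sum(residuals)

# Property: the fitted line passes through (x_bar, y_bar).
y_at_x_bar = b0 + b1 * x_bar

print(f"sum of residuals: {residual_sum:.6f}")
print(f"line at x_bar: {y_at_x_bar:.4f}, y_bar: {y_bar:.4f}")
```

Unbiasedness is a statement about repeated sampling and cannot be confirmed from a single sample like this one.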
Coefficient of Determination R²
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable.
The coefficient of determination is also called R-squared and is denoted as R²:
$$ R^2 = \frac{RSS}{TSS} = \frac{\text{sum of squares explained by regression}}{\text{total sum of squares}}, \qquad 0 \le R^2 \le 1 $$
Here RSS denotes the regression (explained) sum of squares, not the residual sum of squares.
Examples of Approximate R² Values
- R² = 1: perfect linear relationship between x and y. 100% of the variation in y is explained by variation in x.
- 0 < R² < 1: weaker linear relationship between x and y. Some but not all of the variation in y is explained by variation in x.
- R² = 0: no linear relationship between x and y. The value of y does not depend on x (none of the variation in y is explained by variation in x).
[Figure: scatter plots of y against x illustrating each case]
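The three cases can be reproduced with small made-up datasets. The sketch below defines a helper `r_squared` (a name introduced here, not from the chapter) that fits the least squares line and returns the explained sum of squares divided by the total sum of squares. All three datasets are assumptions chosen for illustration.

```python
def r_squared(xs, ys):
    """R^2 = (sum of squares explained by regression) / (total sum of squares)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    explained = sum((b0 + b1 * x - y_bar) ** 2 for x in xs)
    total = sum((y - y_bar) ** 2 for y in ys)
    return explained / total

# Points exactly on the line y = 2x + 1.
r2_perfect = r_squared([1, 2, 3, 4], [3, 5, 7, 9])

# Points scattered around an upward trend.
r2_partial = r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])

# Points whose fitted slope is exactly zero.
r2_none = r_squared([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(f"perfect: {r2_perfect:.3f}, partial: {r2_partial:.3f}, none: {r2_none:.3f}")
```

The helper divides by the total sum of squares, so it is not defined when every y value is identical.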
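Sums of squares for the house price data (rows for the remaining four houses are not shown; the totals cover all 10 houses):
y       x        ŷ           (ŷ - ȳ)²     (y - ȳ)²
245     1400     251.8283    1202.127     1722.25
312     1600     273.7683    162.0962     650.25
279     1700     284.7383    3.103587     56.25
308     1875     303.9358    304.0071     462.25
199     1100     218.9183    4567.286     7656.25
219     1550     268.2833    331.8482     4556.25
...     ...      ...         ...          ...
2865    17150    2863.838    18911.71     32600.5    (totals)
Coefficient of determination:
$$ R^2 = \frac{RSS}{TSS} = \frac{18911.71}{32600.5} \approx 0.58 $$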
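The same quantity can be computed from the full sample in a few lines. This is a minimal sketch assuming the 10 observations of the house price example, with variable names that are not from the chapter.

```python
# Sample data from the house price example.
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]          # y, in $1000s
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(prices) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, prices))
      / sum((x - x_bar) ** 2 for x in sqft))
b0 = y_bar - b1 * x_bar

# Sum of squares explained by regression (called RSS in this chapter).
rss = sum((b0 + b1 * x - y_bar) ** 2 for x in sqft)
# Total sum of squares.
tss = sum((y - y_bar) ** 2 for y in prices)
r2 = rss / tss

print(f"RSS = {rss:.2f}, TSS = {tss:.2f}, R^2 = {r2:.4f}")
```

Computed without intermediate rounding, R² is about 0.58, so roughly 58% of the variation in house prices is explained by variation in square feet. The explained sum of squares obtained this way is slightly larger than the 18911.71 reported in the table, which is consistent with the slope having been rounded to 0.1097 before the predicted values were calculated.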
2. Correlation analysis
Correlation is a technique used to measure the strength of the relationship between two variables. The stronger the correlation, the better the regression line fits the data, and vice versa.
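One common way to quantify this strength is the Pearson correlation coefficient r. The chapter does not name a specific measure, so treating r as the intended one is an assumption. The sketch below computes r for the house price sample and compares r² with the coefficient of determination.

```python
import math

# Sample data from the house price example.
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]          # y, in $1000s
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # x

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(prices) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, prices))
sxx = sum((x - x_bar) ** 2 for x in sqft)
syy = sum((y - y_bar) ** 2 for y in prices)

# Pearson correlation coefficient: covariation scaled by both spreads.
r = sxy / math.sqrt(sxx * syy)

print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```

For this sample r is about 0.76. In simple linear regression with one independent variable, r² equals R², so a stronger correlation corresponds directly to a better-fitting regression line.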
Scatter Plot Examples
[Figure: scatter plots of y against x showing a low degree of correlation and a high degree of correlation]