
Chapter 14
Simple linear regression and
correlation
Introduction
Our problem objective is to analyse the relationship between numerical variables; regression analysis is the first tool we will study.
Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).
Dependent variable: denoted Y
Independent variables: denoted X₁, X₂, …, Xₖ
Correlation Analysis…
If we are interested only in determining whether a
relationship exists, we employ correlation
analysis, a technique introduced earlier.
This chapter will examine the relationship between
two variables, sometimes called simple linear
regression.
Mathematical equations describing these
relationships are also called models, and they fall
into two types: deterministic or probabilistic.
Model Types…
Deterministic Model: an equation or set of
equations that allow us to fully determine the
value of the dependent variable from the values of
the independent variables.
Contrast this with…
Probabilistic Model: a method used to capture the
randomness that is part of a real-life process.
E.g. do all houses of the same size (measured in square metres) sell for exactly the same price?
A Model…
To create a probabilistic model, we start with a
deterministic model that approximates the
relationship we want to model and add a random
term that measures the error of the deterministic
component.
Deterministic Model:
The cost of building a new house is about $800 per
square metre and most lots sell for about
$200 000. Hence the approximate selling price (y)
would be:
y = $200 000 + $800(x)
(where x is the size of the house in square metres)
A model of the relationship between house size (independent variable) and house price (dependent variable) would be:

[Figure: a straight line of house price against house size. Most lots sell for $200 000 (the intercept); building a house costs about $800 per square metre (the slope).]

House price = 200 000 + 800(Size)

In this model, the price of the house is completely determined by the size.
A Model…
In real life, however, the house price will vary even among houses of the same size: same house size, but different price points (e.g. décor options, cabinet upgrades, lot location…).

[Figure: scatter of house price against house size around the line House price = 200 000 + 800(Size) + ε, showing lower vs. higher variability of points about the line.]
A Model…
Random Term…
We now represent the price of a house as a function of its size in this probabilistic model:

y = 200 000 + 800x + ε

where ε (the Greek letter epsilon) is the random term (a.k.a. the error variable). It is the difference between the actual selling price and the estimated price based on the size of the house. Its value will vary from house sale to house sale, even if the size of the house (i.e. x) remains the same, due to other factors such as the location, age and décor of the house.
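The probabilistic model above can be sketched in a short simulation. This is an illustrative example, not from the text: the normally distributed error with a standard deviation of 25 000 is an assumed value chosen for demonstration.

```python
import random

# Illustrative sketch of the probabilistic model y = 200 000 + 800x + epsilon.
random.seed(1)

def selling_price(size_m2):
    """Deterministic component plus a random error term."""
    deterministic = 200_000 + 800 * size_m2
    epsilon = random.gauss(0, 25_000)  # assumed error standard deviation
    return deterministic + epsilon

# Three houses of the same size sell for different prices:
prices = [selling_price(250) for _ in range(3)]
print(prices)
```

Running this shows three different selling prices for the same 250 m² house, which is exactly the behaviour the random term ε is meant to capture.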
14.1 Simple Linear Regression Model

A straight-line model with one independent variable is called a first-order linear model or a simple linear regression model. It is written as:

y = β₀ + β₁x + ε

where:
y = dependent variable
x = independent variable
β₀ = y-intercept
β₁ = slope of the line
ε = error variable
Simple Linear Regression Model…

[Figure: the line y = β₀ + β₁x, with y-intercept β₀ and slope β₁ = rise/run.]

β₀ and β₁ are population parameters which are usually unknown, and are therefore estimated from the data.
14.2 Estimating the Coefficients

In much the same way we base estimates of µ on x̄, we estimate β₀ using β̂₀ and β₁ using β̂₁, the y-intercept and slope (respectively) of the least squares or regression line given by:

ŷ = β̂₀ + β̂₁x

(Recall: this is an application of the least squares method and it produces a straight line that minimises the sum of the squared differences between the points and the line.)
Least Squares Method

[Figure: a scatter of data points in the (x, y) plane.]

The question is: which straight line fits best?

The least squares line minimises the sum of squared differences between the points and the line.

Least Squares Method…

The best line is the one that minimises the sum of squared vertical differences between the points and the line.

Let us compare two lines for the four points (1, 2), (2, 4), (3, 1.5) and (4, 3.2). The first line is y = x; the second line is horizontal at y = 2.5.

Line 1: Sum of squared differences = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
Line 2: Sum of squared differences = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99

The smaller the sum of squared differences, the better the fit of the line to the data.
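The comparison above can be checked with a few lines of code. This is an illustrative sketch; the two candidate lines (y = x and the horizontal line y = 2.5) are those described in the example.

```python
# Sums of squared vertical differences for the two candidate lines.
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_sq_diff(points, line):
    """Sum of squared vertical differences between points and a line y = f(x)."""
    return sum((y - line(x)) ** 2 for x, y in points)

line1 = lambda x: x      # the line y = x
line2 = lambda x: 2.5    # the horizontal line y = 2.5

print(round(sum_sq_diff(points, line1), 2))  # 7.89
print(round(sum_sq_diff(points, line2), 2))  # 3.99
```

Line 2 has the smaller sum of squared differences, so of these two candidates it fits the data better.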
Least Squares Estimates

To calculate the estimates of the coefficients that minimise the differences between the data points and the line, use the formulas:

β̂₁ = [Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n] / [Σxᵢ² − (Σxᵢ)²/n]   or   β̂₁ = (Σxᵢyᵢ − n·x̄·ȳ) / (Σxᵢ² − n·x̄²)

β̂₀ = ȳ − β̂₁x̄
Least Squares Estimates…

Now we define:

SSx = Σxᵢ² − (Σxᵢ)²/n = Σxᵢ² − n·x̄²
SSy = Σyᵢ² − (Σyᵢ)²/n = Σyᵢ² − n·ȳ²
SSxy = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n = Σxᵢyᵢ − n·x̄·ȳ
Least Squares Estimates…

Then:

β̂₁ = SSxy / SSx
β̂₀ = ȳ − β̂₁x̄

The estimated simple linear regression equation that estimates the equation of the first-order linear model is:

ŷ = β̂₀ + β̂₁x
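The formulas above translate directly into code. The following sketch applies them to the four points from the earlier least squares illustration (an arbitrary small dataset, used here only for demonstration):

```python
# Least squares estimates via the SSx / SSxy formulas.
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

SSx = sum(x**2 for x in xs) - n * x_bar**2
SSxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar

b1 = SSxy / SSx          # slope estimate (beta-1 hat)
b0 = y_bar - b1 * x_bar  # intercept estimate (beta-0 hat)
print(b0, b1)
```

For these four points the fitted line is ŷ = 2.4 + 0.11x.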
Example 14.3

A car dealer wants to find the relationship between the odometer reading and the selling price of used cars. A random sample of 100 cars is selected and the data are recorded in file XM21-03. Find the regression line.

Independent variable: x = odometer reading
Dependent variable: y = selling price
Example 14.3 Solution

To calculate β̂₀ and β̂₁ we need to calculate several statistics first:

x̄ = 36.01;  ȳ = 16.24

where n = 100.

SSx = Σxᵢ² − (Σxᵢ)²/n = 4 307.378
SSxy = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n = −403.6207

β̂₁ = SSxy / SSx = −403.6207 / 4 307.378 = −0.0937
β̂₀ = ȳ − β̂₁x̄ = 16.24 − (−0.0937)(36.01) = 19.611

The estimated regression line is therefore:

ŷ = 19.611 − 0.0937x
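The arithmetic in this solution can be verified from the reported summary statistics alone. This is an illustrative check, not a computation from the data file; small differences in the last decimal place come from rounding the intermediate values.

```python
# Verify Example 14.3's coefficients from the reported summary statistics.
x_bar, y_bar, n = 36.01, 16.24, 100
SSx = 4307.378
SSxy = -403.6207

b1 = SSxy / SSx          # slope estimate
b0 = y_bar - b1 * x_bar  # intercept estimate
print(round(b1, 4), round(b0, 2))  # roughly -0.0937 and 19.61
```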
Example 14.3 Solution…

Using the computer (see file XM14-03.xls):
Data > Data Analysis > Regression > [Highlight the data y range and x range] > OK

[Figure: scatter plot of Price (y) against Odometer (x) with the fitted line ŷ = 19.611 − 0.0937x.]
ŷ = 19.611 − 0.0937x

The slope of the line is β̂₁ = −0.0937: for each additional mile on the odometer, the price decreases by an average of $0.094.

The intercept is β̂₀ = 19.611. Do not interpret the intercept as the 'price of cars that have not been driven': there are no data points near x = 0, so the line should not be used to predict prices there.

[Figure: the scatter plot of Price (y) against Odometer (x), highlighting that the region near x = 0 contains no data.]
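Within the range of the observed odometer readings, the fitted line can be used for point prediction. A minimal sketch, assuming the same units as the plot (both variables in thousands):

```python
# Point prediction from the estimated regression line of Example 14.3.
b0, b1 = 19.611, -0.0937

def predicted_price(odometer):
    """Predicted selling price for a given odometer reading."""
    return b0 + b1 * odometer

# A car with an odometer reading of 40 (thousand):
print(round(predicted_price(40), 3))  # 15.863
```

Note the caveat from the text: predictions near x = 0 (or far beyond the observed range) are not supported by the data.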
14.3 Error Variable: Required Conditions

• The error ε is a critical part of the regression model.
• Five requirements involving the distribution of ε must be satisfied:
– The mean of ε is zero: E(ε) = 0.
– The standard deviation of ε is a constant (σε) for all values of x.
– The errors are independent.
– The errors are independent of the independent variable x.
– The probability distribution of ε is normal.
From the first three assumptions we have: y is normally distributed with mean E(y) = β₀ + β₁x and a constant standard deviation σε.

[Figure: three normal curves of y centred at E(y|x₁) = β₀ + β₁x₁, E(y|x₂) = β₀ + β₁x₂ and E(y|x₃) = β₀ + β₁x₃. The standard deviation remains constant, but the mean value changes with x.]
14.4 Assessing the Model

• The least squares method will produce a regression line whether or not there is a linear relationship between x and y.
• Consequently, it is important to assess how well the linear model fits the data.
• Several methods are used to assess the model:
– testing and/or estimating the regression model coefficients
– using descriptive measurements such as the sum of squares for errors (SSE).

Sum of Squares for Errors (SSE)

• This is the sum of squared differences between the points and the regression line.
• It can serve as a measure of how well the line fits the data.
• The sum of squares for errors is calculated as:

SSE = Σ(yᵢ − ŷᵢ)²  (summing over i = 1, …, n)   or   SSE = SSy − SSxy²/SSx

• This statistic plays a role in every statistical technique we employ to assess the model.
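The two SSE formulas above should agree for any dataset. An illustrative check on a small made-up dataset (not from the chapter's data file):

```python
# Compute SSE two ways: from the residuals and from the shortcut formula.
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
SSx = sum(x**2 for x in xs) - n * x_bar**2
SSy = sum(y**2 for y in ys) - n * y_bar**2
SSxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar

b1 = SSxy / SSx          # slope estimate
b0 = y_bar - b1 * x_bar  # intercept estimate

# SSE from the residuals y_i - y_hat_i ...
sse_residuals = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
# ... and from the shortcut formula:
sse_shortcut = SSy - SSxy**2 / SSx
print(round(sse_residuals, 6), round(sse_shortcut, 6))  # both 3.807
```

The shortcut form is convenient by hand because SSx, SSy and SSxy are already needed to compute the coefficient estimates.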