Linear Regression
Regression
Given:
– Data X = [x^(1), ..., x^(n)] where x^(i) ∈ R^d
– Corresponding labels y = [y^(1), ..., y^(n)] where y^(i) ∈ R

[Figure: yearly data from 1975–2015 with linear regression and quadratic regression fits; x-axis: Year]
2
Prostate Cancer Dataset
• 97 samples, partitioned into 67 train / 30 test
• Eight predictors (features):
  – 6 continuous (4 log transforms), 1 binary, 1 ordinal
• Continuous outcome variable:
  – lpsa: log(prostate specific antigen level)
Based on slide by Jeff Howbert
Linear Regression
• Hypothesis:
  y = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_d x_d = Σ_{j=0}^{d} θ_j x_j
  Assume x_0 = 1
• Fit model by minimizing sum of squared errors

[Figure: data points with a fitted line]
Figures are courtesy of Greg Shakhnarovich
5
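The hypothesis above can be sketched in a few lines of NumPy; the function name `h` and the numeric values are illustrative, not from the slides.

```python
import numpy as np

def h(theta, x):
    """Hypothesis h_theta(x) = sum_{j=0}^{d} theta_j * x_j, with x_0 = 1 prepended."""
    x_aug = np.concatenate(([1.0], x))  # assume x_0 = 1
    return theta @ x_aug

theta = np.array([1.0, 2.0, 3.0])  # [theta_0, theta_1, theta_2] (made-up values)
x = np.array([0.5, -1.0])          # one example with d = 2 features
print(h(theta, x))                 # 1 + 2*0.5 + 3*(-1) = -1.0
```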
Least Squares Linear Regression
• Cost Function
  J(θ) = (1/(2n)) Σ_{i=1}^{n} ( h_θ(x^(i)) − y^(i) )²
• Fit by solving min_θ J(θ)
6
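The cost function translates directly into NumPy; the helper name `cost` and the toy data are illustrative assumptions.

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = (1/(2n)) * sum_i (h_theta(x^(i)) - y^(i))^2, with x_0 = 1 prepended."""
    n = len(y)
    X_aug = np.column_stack([np.ones(n), X])  # prepend the column x_0 = 1
    r = X_aug @ theta - y                     # residuals h_theta(x^(i)) - y^(i)
    return r @ r / (2 * n)

# Made-up toy data lying exactly on the line y = x
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])
print(cost(np.array([0.0, 1.0]), X, y))  # 0.0 for the perfect fit
```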
Intuition Behind Cost Function
J(θ) = (1/(2n)) Σ_{i=1}^{n} ( h_θ(x^(i)) − y^(i) )²
For insight on J(), let's assume x ∈ R so θ = [θ_0, θ_1]
Based on example by Andrew Ng
7
Intuition Behind Cost Function
J(θ) = (1/(2n)) Σ_{i=1}^{n} ( h_θ(x^(i)) − y^(i) )²
For insight on J(), let's assume x ∈ R so θ = [θ_0, θ_1]

[Figure: left, h_θ(x) vs. x (for fixed θ, this is a function of x); right, J vs. θ_1 (function of the parameter θ_1)]
Based on example by Andrew Ng
8
Intuition Behind Cost Function
J(θ) = (1/(2n)) Σ_{i=1}^{n} ( h_θ(x^(i)) − y^(i) )²
For insight on J(), let's assume x ∈ R so θ = [θ_0, θ_1]

[Figure: left, the training points with the line h_θ(x) = 0.5x (for fixed θ, this is a function of x); right, J vs. θ_1 (function of the parameter θ_1)]
J([0, 0.5]) = (1/(2·3)) [ (0.5 − 1)² + (1 − 2)² + (1.5 − 3)² ] ≈ 0.58
Based on example by Andrew Ng
9
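The arithmetic on this slide can be checked directly; the three training points (1,1), (2,2), (3,3) are the ones implied by the residuals 0.5−1, 1−2, and 1.5−3.

```python
import numpy as np

# Training points implied by the slide's arithmetic: (1,1), (2,2), (3,3)
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def J(theta0, theta1):
    """Cost for the simplified hypothesis h(x) = theta0 + theta1 * x."""
    n = len(x)
    return np.sum((theta0 + theta1 * x - y) ** 2) / (2 * n)

print(round(J(0.0, 0.5), 2))  # 0.58, as on the slide
```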
Intuition Behind Cost Function
J(θ) = (1/(2n)) Σ_{i=1}^{n} ( h_θ(x^(i)) − y^(i) )²
For insight on J(), let's assume x ∈ R so θ = [θ_0, θ_1]

[Figure: left, the training points with the line h_θ(x) = 0 (for fixed θ, this is a function of x); right, J vs. θ_1 (function of the parameter θ_1)]
J([0, 0]) ≈ 2.333
J() is convex
Based on example by Andrew Ng
10
Intuition Behind Cost Function
[Figure: J(θ_0, θ_1) plotted over both parameters]
Slide by Andrew Ng
11
Intuition Behind Cost Function
[Figures, one per slide: left, h_θ(x) for a fixed θ (for fixed θ, this is a function of x); right, J (function of the parameters θ_0, θ_1), shown for different choices of θ]
Slide by Andrew Ng
12–15
Basic Search Procedure
• Choose initial value for θ
• Until we reach a minimum:
  – Choose a new value for θ to reduce J(θ)

[Figure: surface of J(θ_0, θ_1) over the (θ_0, θ_1) plane]
Figure by Andrew Ng
16
Basic Search Procedure
• Choose initial value for θ
• Until we reach a minimum:
  – Choose a new value for θ to reduce J(θ)

[Figure: surface of J(θ_0, θ_1) over the (θ_0, θ_1) plane]
Figure by Andrew Ng
17
Basic Search Procedure
• Choose initial value for θ
• Until we reach a minimum:
  – Choose a new value for θ to reduce J(θ)

Since the least squares objective function is convex, we don't need to worry about local minima

[Figure: surface of J(θ_0, θ_1) over the (θ_0, θ_1) plane]
Figure by Andrew Ng
18
Gradient Descent
• Initialize θ
• Repeat until convergence:
  θ_j ← θ_j − α ∂/∂θ_j J(θ)    (simultaneous update for j = 0 ... d)
  α is the learning rate (small), e.g., α = 0.05

[Figure: J(θ) plotted against θ]
19
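As a minimal sketch of the update rule on a one-parameter cost (the quadratic J(θ) = (θ − 2)² and the starting point are made up for illustration; α = 0.05 follows the slide):

```python
alpha = 0.05   # learning rate, as suggested on the slide
theta = 0.0    # arbitrary initial value
for _ in range(200):
    grad = 2 * (theta - 2)        # dJ/dtheta for J(theta) = (theta - 2)^2
    theta = theta - alpha * grad  # gradient descent update
print(round(theta, 3))  # converges toward the minimizer theta = 2
```

Each step moves θ opposite the sign of the derivative, so the iterates approach the minimum without ever needing to solve for it directly.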
Gradient Descent
• Initialize θ
• Repeat until convergence:
  θ_j ← θ_j − α ∂/∂θ_j J(θ)    (simultaneous update for j = 0 ... d)

For Linear Regression:
  ∂/∂θ_j J(θ) = ∂/∂θ_j (1/(2n)) Σ_{i=1}^{n} ( h_θ(x^(i)) − y^(i) )²
20
Gradient Descent
• Initialize θ
• Repeat until convergence:
  θ_j ← θ_j − α ∂/∂θ_j J(θ)    (simultaneous update for j = 0 ... d)

For Linear Regression:
  ∂/∂θ_j J(θ) = ∂/∂θ_j (1/(2n)) Σ_{i=1}^{n} ( h_θ(x^(i)) − y^(i) )²
              = ∂/∂θ_j (1/(2n)) Σ_{i=1}^{n} ( Σ_{k=0}^{d} θ_k x_k^(i) − y^(i) )²
21
Gradient Descent
• Initialize θ
• Repeat until convergence:
  θ_j ← θ_j − α ∂/∂θ_j J(θ)    (simultaneous update for j = 0 ... d)

For Linear Regression:
  ∂/∂θ_j J(θ) = ∂/∂θ_j (1/(2n)) Σ_{i=1}^{n} ( h_θ(x^(i)) − y^(i) )²
              = ∂/∂θ_j (1/(2n)) Σ_{i=1}^{n} ( Σ_{k=0}^{d} θ_k x_k^(i) − y^(i) )²
              = (1/n) Σ_{i=1}^{n} ( Σ_{k=0}^{d} θ_k x_k^(i) − y^(i) ) · ∂/∂θ_j ( Σ_{k=0}^{d} θ_k x_k^(i) − y^(i) )
22
Gradient Descent
• Initialize θ
• Repeat until convergence:
  θ_j ← θ_j − α ∂/∂θ_j J(θ)    (simultaneous update for j = 0 ... d)

For Linear Regression:
  ∂/∂θ_j J(θ) = ∂/∂θ_j (1/(2n)) Σ_{i=1}^{n} ( h_θ(x^(i)) − y^(i) )²
              = ∂/∂θ_j (1/(2n)) Σ_{i=1}^{n} ( Σ_{k=0}^{d} θ_k x_k^(i) − y^(i) )²
              = (1/n) Σ_{i=1}^{n} ( Σ_{k=0}^{d} θ_k x_k^(i) − y^(i) ) · ∂/∂θ_j ( Σ_{k=0}^{d} θ_k x_k^(i) − y^(i) )
              = (1/n) Σ_{i=1}^{n} ( Σ_{k=0}^{d} θ_k x_k^(i) − y^(i) ) x_j^(i)
23
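The final expression vectorizes naturally over j; the function name `gradient` and the toy data (reused from the earlier worked example) are illustrative.

```python
import numpy as np

def gradient(theta, X, y):
    """dJ/dtheta_j = (1/n) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i), all j at once."""
    n = len(y)
    X_aug = np.column_stack([np.ones(n), X])   # prepend x_0 = 1
    return X_aug.T @ (X_aug @ theta - y) / n   # matrix form of the sum over i

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])
print(gradient(np.array([0.0, 0.5]), X, y))  # [-1, -7/3]: both partials are negative,
                                             # so the update increases theta_0 and theta_1
```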
Gradient Descent for Linear Regression
• Initialize θ
• Repeat until convergence:
  θ_j ← θ_j − α (1/n) Σ_{i=1}^{n} ( h_θ(x^(i)) − y^(i) ) x_j^(i)    (simultaneous update for j = 0 ... d)
• To achieve simultaneous update:
  • At the start of each GD iteration, compute h_θ(x^(i))
  • Use this stored value in the update step loop
• Assume convergence when ‖θ_new − θ_old‖₂ < ε
  L2 norm: ‖v‖₂ = √(Σ_i v_i²) = √(v_1² + v_2² + ... + v_|v|²)
24
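Putting the pieces together (the function name, data, and hyperparameter values are illustrative assumptions; the simultaneous update and L2-norm stopping rule follow the slide):

```python
import numpy as np

def fit_linear_regression(X, y, alpha=0.05, eps=1e-8, max_iters=100_000):
    n = len(y)
    X_aug = np.column_stack([np.ones(n), X])  # x_0 = 1
    theta = np.zeros(X_aug.shape[1])          # initialize theta
    for _ in range(max_iters):
        h = X_aug @ theta                     # compute h_theta once per iteration...
        theta_new = theta - alpha * X_aug.T @ (h - y) / n  # ...then update all theta_j at once
        if np.linalg.norm(theta_new - theta) < eps:        # ||theta_new - theta_old||_2 < eps
            return theta_new
        theta = theta_new
    return theta

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = fit_linear_regression(X, y)  # theta ends up close to [0, 1], i.e., the line y = x
```

Updating every θ_j from the same stored `h` is exactly the simultaneous update the slide calls for; computing `h` inside the per-j loop would mix old and new parameters.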
Gradient Descent
[Figure: left, the hypothesis h(x) = −900 − 0.1x plotted over the data (for fixed θ, this is a function of x); right, J (function of the parameters θ_0, θ_1)]
Slide by Andrew Ng
25
Gradient Descent
[Figure: left, the hypothesis (for fixed θ, this is a function of x); right, J (function of the parameters θ_0, θ_1)]
Slide by Andrew Ng
26