Support Vector Machines & Kernels
Doing really well with linear decision surfaces
Outline
• Prediction
  – Why might predictions be wrong?
• Support vector machines
  – Doing really well with linear models
• Kernels
  – Making the non-linear linear
Why Might Predictions be Wrong?
• True non-determinism
  – Flip a biased coin
  – p(heads) = θ
  – Estimate θ
  – If θ > 0.5 predict ‘heads’, else ‘tails’
• Lots of ML research on problems like this:
  – Learn a model
  – Do the best you can in expectation
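The biased-coin recipe above can be sketched in a few lines. The function names (`estimate_theta`, `predict`) and the true θ = 0.7 are illustrative choices, not part of the original slide:

```python
import random

def estimate_theta(flips):
    """Estimate p(heads) as the empirical frequency of heads (1s)."""
    return sum(flips) / len(flips)

def predict(theta):
    """Predict the majority outcome: 'heads' if theta > 0.5, else 'tails'."""
    return 'heads' if theta > 0.5 else 'tails'

# Simulate a biased coin with true theta = 0.7 (an assumed value for illustration)
random.seed(0)
true_theta = 0.7
flips = [1 if random.random() < true_theta else 0 for _ in range(1000)]

theta_hat = estimate_theta(flips)
print(predict(theta_hat))
```

Even with a perfect estimate of θ, the prediction is sometimes wrong; always predicting the majority outcome is simply the best one can do in expectation.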
Why Might Predictions be Wrong?
• Partial observability
  – Something needed to predict y is missing from observation x
  – N-bit parity problem
    • x contains N−1 bits (hard PO)
    • x contains N bits but learner ignores some of them (soft PO)
• Noise in the observation x
  – Measurement error
  – Instrument limitations
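The hard-PO version of the parity problem can be verified directly: if the label is the parity of N bits but only N−1 bits are observed, every observable pattern is consistent with both labels, so no learner can beat chance. This is a small sketch (with N = 4 chosen arbitrarily):

```python
from itertools import product

N = 4

def parity(bits):
    """Label is the parity (XOR) of all N bits."""
    return sum(bits) % 2

# Hard partial observability: the learner sees only the first N-1 bits.
# Collect the set of labels that co-occur with each observable pattern.
labels_by_observation = {}
for bits in product([0, 1], repeat=N):
    obs = bits[:-1]                      # hidden N-th bit is dropped
    labels_by_observation.setdefault(obs, set()).add(parity(bits))

# Every observation is consistent with BOTH labels:
print(all(labels == {0, 1} for labels in labels_by_observation.values()))  # True
```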
Why Might Predictions be Wrong?
• True non-determinism
• Partial observability
  – hard, soft
• Representational bias
• Algorithmic bias
• Bounded resources
Representational Bias
• Having the right features (x) is crucial
[Figure: 1-D data plotted along x; mapping each point to the feature x² makes the classes linearly separable.]
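The x → x² idea can be sketched with toy numbers. The data set, the `classify` helper, and the threshold 1.0 are my own illustrative choices: no single threshold on x separates the classes, but a threshold on x² does.

```python
# Hypothetical 1-D data: class 0 near the origin, class 1 farther out.
xs = [-2.0, -1.5, -0.4, 0.0, 0.3, 1.6, 2.1]
ys = [1, 1, 0, 0, 0, 1, 1]

# In the original feature x, class 1 sits on BOTH sides of class 0,
# so no linear (threshold) separator exists. In the feature x**2,
# a single threshold separates the classes perfectly:
def classify(x, threshold=1.0):
    return 1 if x**2 > threshold else 0

print(all(classify(x) == y for x, y in zip(xs, ys)))  # True
```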
Support Vector Machines
Doing Really Well with Linear Decision Surfaces
Strengths of SVMs
• Good generalization
  – in theory
  – in practice
• Works well with few training instances
• Finds the globally best model
• Efficient algorithms
• Amenable to the kernel trick
Minor Notation Change
To better match the notation used in SVMs, and to make matrix formulas simpler, we will drop superscripts for the i-th instance. Bold denotes a vector; non-bold denotes a scalar.
• i-th instance:                  x^(i) becomes x_i
• i-th instance label:            y^(i) becomes y_i
• j-th feature of i-th instance:  x_j^(i) becomes x_ij
Linear Separators
• Training instances
  x ∈ ℝ^(d+1), with x_0 = 1
  y ∈ {−1, 1}
• Model parameters
  θ ∈ ℝ^(d+1)
• Recall the inner (dot) product:
  ⟨u, v⟩ = u · v = uᵀv = Σ_{j=1}^{d+1} u_j v_j
• Hyperplane:
  θᵀx = ⟨θ, x⟩ = 0
• Decision function:
  h(x) = sign(θᵀx) = sign(⟨θ, x⟩)
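The decision function above is a one-liner in code. This is a minimal sketch: the function name `h` mirrors the slide, while the parameter values are arbitrary illustrative numbers, and x_0 = 1 is prepended as the bias term per the slide's convention.

```python
def h(theta, x):
    """Decision function h(x) = sign(theta . x), with x[0] = 1 as the bias entry."""
    z = sum(t_j * x_j for t_j, x_j in zip(theta, x))
    return 1 if z >= 0 else -1

# Hypothetical parameters for d = 2 features (theta[0] multiplies the bias x_0 = 1):
theta = [-1.0, 2.0, 0.5]
print(h(theta, [1, 1.0, 0.0]))   # theta . x = -1 + 2 = 1  -> +1
print(h(theta, [1, 0.0, 0.0]))   # theta . x = -1          -> -1
```

Note that sign(0) is a boundary case; this sketch arbitrarily maps it to +1, since points exactly on the hyperplane θᵀx = 0 have no well-defined side.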
Intuitions
[Figure sequence: several candidate linear separators drawn through the same training data.]
A “Good” Separator
Noise in the Observations
Ruling Out Some Separators
Lots of Noise
Only One Separator Remains
Maximizing the Margin
“Fat” Separators
[Figure: a thick separator; the width of the band around the hyperplane is the margin.]
Why Maximize Margin?
• Increasing the margin reduces capacity
  – i.e., fewer possible models
• Lesson from learning theory: if the following holds:
  – H is sufficiently constrained in size,
  – and/or the size of the training data set n is large,
  then low training error is likely to be evidence of low generalization error
Alternative View of Logistic Regression
h_θ(x) = g(z) = 1 / (1 + e^(−θᵀx)),   z = θᵀx

If y = 1, we want h_θ(x) ≈ 1, i.e., θᵀx ≫ 0
If y = 0, we want h_θ(x) ≈ 0, i.e., θᵀx ≪ 0

J(θ) = −Σ_{i=1}^{n} [ y_i log h_θ(x_i) + (1 − y_i) log(1 − h_θ(x_i)) ]

min_θ J(θ)

(In the SVM view, the first term will be replaced by cost_1(θᵀx_i) and the second by cost_0(θᵀx_i).)

Based on slide by Andrew Ng
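The cost J(θ) above is straightforward to compute directly. A minimal sketch, where the function names (`h_theta`, `J`) and the toy data are my own illustrative choices, with x_0 = 1 prepended to each instance per the earlier convention:

```python
import math

def h_theta(theta, x):
    """Logistic model h(x) = 1 / (1 + exp(-theta . x))."""
    z = sum(t_j * x_j for t_j, x_j in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def J(theta, X, y):
    """Negative log-likelihood summed over the n training instances."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = h_theta(theta, x_i)
        total += -(y_i * math.log(p) + (1 - y_i) * math.log(1 - p))
    return total

# Toy data (illustrative values), each instance prefixed with x_0 = 1:
X = [[1, 0.5], [1, -1.0], [1, 2.0]]
y = [1, 0, 1]
print(J([0.0, 1.0], X, y))
```

As a sanity check, at θ = 0 the model predicts 0.5 for every instance, so J equals n · log 2.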
Alternate View of Logistic Regression
Cost of an example:
  −y_i log h_θ(x_i) − (1 − y_i) log(1 − h_θ(x_i))

h_θ(x) = 1 / (1 + e^(−θᵀx)),   z = θᵀx

If y = 1 (want θᵀx ≫ 0): only the −log h_θ(x_i) term contributes.
If y = 0 (want θᵀx ≪ 0): only the −log(1 − h_θ(x_i)) term contributes.

Based on slide by Andrew Ng