Learning Kernels - Tutorial
Part I: Introduction to Kernel Methods.
Outline
Part I: Introduction to kernel methods.
Part II: Learning kernel algorithms.
Part III: Theoretical guarantees.
Part IV: Software tools.
Binary Classification Problem
Training data: sample drawn i.i.d. from set $X \subseteq \mathbb{R}^N$ according to some distribution $D$,
$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in X \times \{-1, +1\}.$$
Problem: find hypothesis $h : X \to \{-1, +1\}$ in $H$ (classifier) with small generalization error $R_D(h)$.
Linear classification:
• Hypotheses based on hyperplanes.
• Linear separation in high-dimensional space.
Linear Separation
Classifiers: $H = \{x \mapsto \mathrm{sgn}(w \cdot x + b) : w \in \mathbb{R}^N,\ b \in \mathbb{R}\}$.
[Figure: candidate separating hyperplanes $w \cdot x + b = 0$ for the same sample.]
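To make the hypothesis class concrete, here is a minimal NumPy sketch of one such classifier $x \mapsto \mathrm{sgn}(w \cdot x + b)$; the weight vector, offset, and test points are arbitrary illustrative values.

```python
# A hypothesis from H: x -> sgn(w . x + b); w, b and the test points are illustrative.
import numpy as np

w = np.array([2.0, -1.0])   # normal vector of the hyperplane
b = 0.5                     # offset

def h(x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(h(np.array([1.0, 0.0])))    # +1: positive side of the hyperplane
print(h(np.array([-1.0, 2.0])))   # -1: negative side of the hyperplane
```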
Optimal Hyperplane: Max. Margin
(Vapnik and Chervonenkis, 1964)
Canonical hyperplane: $w$ and $b$ chosen such that $|w \cdot x + b| = 1$ for the closest points.
Margin: $\rho = \min_{x \in S} \dfrac{|w \cdot x + b|}{\|w\|} = \dfrac{1}{\|w\|}$.
[Figure: maximum-margin hyperplane $w \cdot x + b = 0$ with marginal hyperplanes $w \cdot x + b = +1$ and $w \cdot x + b = -1$; the margin is the distance from the closest points to the separating hyperplane.]
Optimization Problem
Constrained optimization:
$$\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{subject to } y_i(w \cdot x_i + b) \geq 1,\ i \in [1, m].$$
Properties:
• Convex optimization (strictly convex).
• Unique solution for linearly separable sample.
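The primal problem above is a small quadratic program. The sketch below solves it on a toy separable sample, assuming the cvxpy package is available; the data are illustrative and any QP solver could be substituted.

```python
# Hard-margin SVM primal on a toy separable sample (illustrative data), assuming cvxpy.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])  # toy points
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels in {-1, +1}

m, N = X.shape
w = cp.Variable(N)
b = cp.Variable()

# min (1/2) ||w||^2  subject to  y_i (w . x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("margin rho = 1/||w|| =", 1.0 / np.linalg.norm(w.value))
```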
Support Vector Machines
(Cortes and Vapnik, 1995)
Problem: data often not linearly separable in practice. For any hyperplane, there exists $x_i$ such that
$$y_i [w \cdot x_i + b] \not\geq 1.$$
Idea: relax the constraints using slack variables $\xi_i \geq 0$:
$$y_i [w \cdot x_i + b] \geq 1 - \xi_i.$$
Soft-Margin Hyperplanes
Support vectors: points along the margin or outliers.
Soft margin: $\rho = 1/\|w\|$.
[Figure: soft-margin separation with hyperplanes $w \cdot x + b = 0, +1, -1$; slack variables $\xi_i$ and $\xi_j$ measure the margin violations.]
Optimization Problem
(Cortes and Vapnik, 1995)
Constrained optimization:
$$\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{subject to } y_i(w \cdot x_i + b) \geq 1 - \xi_i \ \wedge\ \xi_i \geq 0,\ i \in [1, m].$$
Properties:
• $C \geq 0$ trade-off parameter.
• Convex optimization (strictly convex).
• Unique solution.
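As a sketch of the role of the trade-off parameter $C$, the snippet below fits soft-margin SVMs for several values of $C$ on two overlapping Gaussian classes, assuming scikit-learn is available; the data and the chosen values of $C$ are illustrative.

```python
# Soft-margin SVMs for several values of C on overlapping classes,
# assuming scikit-learn is installed; data and C values are illustrative.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 1.0, size=(50, 2)),
               rng.normal(-1.0, 1.0, size=(50, 2))])   # overlapping classes
y = np.array([+1] * 50 + [-1] * 50)

# Larger C penalizes slack more heavily (closer to a hard margin);
# smaller C tolerates more violations in exchange for a larger margin.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:6.2f}  margin={1.0 / np.linalg.norm(w):.3f}  "
          f"support vectors={len(clf.support_)}")
```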
Dual Optimization Problem
Constrained optimization:
$$\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
$$\text{subject to: } \alpha_i \geq 0 \ \wedge\ \sum_{i=1}^{m} \alpha_i y_i = 0,\ i \in [1, m].$$
Solution:
$$h(x) = \mathrm{sgn}\Big(\sum_{i=1}^{m} \alpha_i y_i (x_i \cdot x) + b\Big),$$
with $b = y_i - \sum_{j=1}^{m} \alpha_j y_j (x_j \cdot x_i)$ for any SV $x_i$.
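The dual above can be solved directly and the hypothesis recovered from the resulting $\alpha$, as in the following sketch on toy separable data, again assuming cvxpy is available; writing the quadratic term as a squared norm is only an implementation convenience.

```python
# Hard-margin SVM dual on toy separable data (illustrative), assuming cvxpy.
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

# Rows of M are y_i x_i, so ||M^T alpha||^2 = sum_{i,j} alpha_i alpha_j y_i y_j (x_i . x_j).
M = y[:, None] * X
alpha = cp.Variable(m)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(M.T @ alpha))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
sv = int(np.argmax(a))                     # index of a support vector (largest alpha_i)
b = y[sv] - np.sum(a * y * (X @ X[sv]))    # b = y_i - sum_j alpha_j y_j (x_j . x_i)

def h(x):
    """Dual-form prediction h(x) = sgn(sum_i alpha_i y_i (x_i . x) + b)."""
    return np.sign(np.sum(a * y * (X @ x)) + b)

print(h(np.array([1.0, 1.0])), h(np.array([-1.0, -2.0])))   # expected: 1.0 -1.0
```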
Kernel Methods
Idea:
• Define $K : X \times X \to \mathbb{R}$, called a kernel, such that $\Phi(x) \cdot \Phi(y) = K(x, y)$.
• $K$ is often interpreted as a similarity measure.
Benefits:
• Efficiency: $K$ is often more efficient to compute than $\Phi$ and the dot product.
• Flexibility: $K$ can be chosen arbitrarily so long as the existence of $\Phi$ is guaranteed (Mercer's condition).
Example - Polynomial Kernels
Definition: $\forall x, y \in \mathbb{R}^N,\ K(x, y) = (x \cdot y + c)^d,\ c > 0$.
Example: for $N = 2$ and $d = 2$,
$$K(x, y) = (x_1 y_1 + x_2 y_2 + c)^2 =
\begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \\ \sqrt{2c}\, x_1 \\ \sqrt{2c}\, x_2 \\ c \end{bmatrix}
\cdot
\begin{bmatrix} y_1^2 \\ y_2^2 \\ \sqrt{2}\, y_1 y_2 \\ \sqrt{2c}\, y_1 \\ \sqrt{2c}\, y_2 \\ c \end{bmatrix}.$$
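A quick numeric sketch of the identity above: the kernel value $(x \cdot y + c)^2$ coincides with the dot product of the explicit six-dimensional feature vectors. The test vectors and the value of $c$ are arbitrary.

```python
# Check that K(x, y) = (x . y + c)^2 equals Phi(x) . Phi(y) for the explicit
# feature map above (N = 2, d = 2); x, y, c are arbitrary test values.
import numpy as np

def K(x, y, c):
    return (np.dot(x, y) + c) ** 2

def phi(x, c):
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

x, y, c = np.array([1.0, 2.0]), np.array([3.0, -1.0]), 0.5
print(K(x, y, c), np.dot(phi(x, c), phi(y, c)))   # both print 2.25
```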
XOR Problem
Use a second-degree polynomial kernel with $c = 1$:
$(1, 1) \mapsto (1, 1, +\sqrt{2}, +\sqrt{2}, +\sqrt{2}, 1)$
$(-1, -1) \mapsto (1, 1, +\sqrt{2}, -\sqrt{2}, -\sqrt{2}, 1)$
$(-1, 1) \mapsto (1, 1, -\sqrt{2}, -\sqrt{2}, +\sqrt{2}, 1)$
$(1, -1) \mapsto (1, 1, -\sqrt{2}, +\sqrt{2}, -\sqrt{2}, 1)$
[Figure: left, the four points $(1, 1), (-1, -1), (-1, 1), (1, -1)$ in the $(x_1, x_2)$ plane, linearly non-separable; right, their images plotted in the $(\sqrt{2}\, x_1 x_2, \sqrt{2}\, x_1)$ coordinates, linearly separable by $x_1 x_2 = 0$.]
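A sketch of the XOR example, assuming scikit-learn is available: with the polynomial kernel of degree 2 and offset $c = 1$ (parameters degree=2, coef0=1, gamma=1 in SVC), the four XOR points become linearly separable in feature space and are all classified correctly.

```python
# XOR with a second-degree polynomial kernel, c = 1, assuming scikit-learn.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [-1, -1], [-1, 1], [1, -1]], dtype=float)
y = np.array([+1, +1, -1, -1])    # XOR-style labels

# (gamma * <x, x'> + coef0)^degree = (x . x' + 1)^2; large C approximates a hard margin.
clf = SVC(kernel="poly", degree=2, coef0=1.0, gamma=1.0, C=1e6).fit(X, y)
print(clf.predict(X))             # [ 1  1 -1 -1]: all four points correct
```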
Other Standard PDS Kernels
Gaussian kernels: $K(x, y) = \exp\!\left(-\dfrac{\|x - y\|^2}{2\sigma^2}\right),\ \sigma \neq 0$.
Sigmoid kernels: $K(x, y) = \tanh(a(x \cdot y) + b),\ a, b \geq 0$.
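A minimal NumPy sketch of the Gaussian kernel: it builds the kernel matrix on a few illustrative points and checks numerically that the matrix is symmetric positive semidefinite; the points and $\sigma$ are arbitrary.

```python
# Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); illustrative data.
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_kernel_matrix(X, sigma=1.0)
print(K)                                   # symmetric, ones on the diagonal
print(np.linalg.eigvalsh(K) >= -1e-12)     # PDS: eigenvalues nonnegative (up to rounding)
```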
Consequence: SVMs with PDS Kernels
(Boser, Guyon, and Vapnik, 1992)
Constrained optimization:
$$\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
$$\text{subject to: } 0 \leq \alpha_i \leq C \ \wedge\ \sum_{i=1}^{m} \alpha_i y_i = 0,\ i \in [1, m].$$
Solution:
$$h(x) = \mathrm{sgn}\Big(\sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b\Big),$$
with $b = y_i - \sum_{j=1}^{m} \alpha_j y_j K(x_j, x_i)$ for any $x_i$ with $0 < \alpha_i < C$.
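As a sketch of an SVM trained with a PDS kernel, the snippet below passes a precomputed Gaussian kernel matrix to scikit-learn's SVC (assuming scikit-learn is available); the data, the kernel width, and $C$ are illustrative, and predictions on new points use the cross-kernel values $K(x, x_i)$.

```python
# SVM with a precomputed PDS (Gaussian) kernel, assuming scikit-learn; illustrative data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.0, 0.7, size=(30, 2)),
               rng.normal(-1.0, 0.7, size=(30, 2))])
y = np.array([+1] * 30 + [-1] * 30)

def gaussian_K(A, B, sigma=1.0):
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

clf = SVC(kernel="precomputed", C=1.0).fit(gaussian_K(X, X), y)

# predict implements h(x) = sgn(sum_i alpha_i y_i K(x_i, x) + b) from the kernel values
print(clf.predict(gaussian_K(X[:5], X)))
```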
SVMs with PDS Kernels
Constrained optimization:
$$\max_{\alpha} \ 2\, \alpha^\top \mathbf{1} - \alpha^\top \mathbf{Y}^\top \mathbf{K} \mathbf{Y} \alpha \quad \text{subject to: } 0 \leq \alpha \leq C \ \wedge\ \alpha^\top \mathbf{y} = 0,$$
where $\mathbf{Y} = \operatorname{diag}(\mathbf{y})$.
Solution:
$$h = \mathrm{sgn}\Big(\sum_{i=1}^{m} \alpha_i y_i K(x_i, \cdot) + b\Big),$$
with $b = y_i - (\alpha \circ \mathbf{y})^\top \mathbf{K} \mathbf{e}_i$ for any $x_i$ with $0 < \alpha_i < C$.
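A minimal NumPy sketch of the matrix-form offset $b = y_i - (\alpha \circ \mathbf{y})^\top \mathbf{K} \mathbf{e}_i$; the kernel matrix, labels, and dual variables below are illustrative values chosen only so that $\alpha^\top \mathbf{y} = 0$ and $0 < \alpha_i < C$.

```python
# Offset in matrix form: b = y_i - (alpha o y)^T K e_i, with illustrative values.
import numpy as np

K = np.array([[2.0, 1.0, 0.5],
              [1.0, 2.0, 0.3],
              [0.5, 0.3, 2.0]])       # a small PDS kernel matrix
y = np.array([+1.0, -1.0, +1.0])
alpha = np.array([0.4, 0.7, 0.3])     # satisfies alpha . y = 0, with 0 < alpha_i < C

i = 1                                 # any index with 0 < alpha_i < C
e_i = np.eye(len(y))[i]               # i-th unit vector
b = y[i] - (alpha * y) @ K @ e_i      # equals y_i - sum_j alpha_j y_j K(x_j, x_i)
print(b)
```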
Regression Problem
Training data: sample drawn i.i.d. from set $X$ according to some distribution $D$,
$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in X \times Y,$$
with $Y \subseteq \mathbb{R}$ a measurable subset.
Loss function: $L : Y \times Y \to \mathbb{R}_+$, a measure of closeness, typically $L(y, y') = (y' - y)^2$ or $L(y, y') = |y' - y|^p$ for some $p \geq 1$.
Problem: find hypothesis $h : X \to \mathbb{R}$ in $H$ with small generalization error with respect to the target $f$:
$$R_D(h) = \mathop{\mathbb{E}}_{x \sim D}\big[L\big(h(x), f(x)\big)\big].$$
Kernel Ridge Regression
(Saunders et al., 1998)
Optimization problem:
$$\max_{\alpha \in \mathbb{R}^m} \ -\lambda\, \alpha^\top \alpha + 2\, \alpha^\top y - \alpha^\top \mathbf{K} \alpha
\quad \text{or} \quad
\max_{\alpha \in \mathbb{R}^m} \ -\alpha^\top (\mathbf{K} + \lambda \mathbf{I}) \alpha + 2\, \alpha^\top y.$$
Solution:
$$h(x) = \sum_{i=1}^{m} \alpha_i K(x_i, x), \quad \text{with } \alpha = (\mathbf{K} + \lambda \mathbf{I})^{-1} y.$$
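A minimal NumPy sketch of kernel ridge regression via the closed form $\alpha = (\mathbf{K} + \lambda \mathbf{I})^{-1} y$, here with a Gaussian kernel on noisy samples of $\sin(x)$; the kernel width and $\lambda$ are illustrative.

```python
# Kernel ridge regression: solve (K + lambda I) alpha = y, then
# h(x) = sum_i alpha_i K(x_i, x); Gaussian kernel, illustrative data.
import numpy as np

def gaussian_K(A, B, sigma=0.5):
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

lam = 0.1
alpha = np.linalg.solve(gaussian_K(X, X) + lam * np.eye(len(y)), y)

X_test = np.array([[0.0], [1.5]])
print(gaussian_K(X_test, X) @ alpha)   # roughly sin(0) = 0 and sin(1.5) ≈ 1.0
```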
Questions
How should the user choose the kernel?
• problem similar to that of selecting features for other learning algorithms.
• poor choice → learning made very difficult.
• good choice → even poor learners could succeed.
The requirement from the user is thus critical.
• can this requirement be lessened?
• is a more automatic selection of features possible?