
Lecture: Introduction to Machine Learning and Data Mining, Lesson 6


Introduction to

Machine Learning and Data Mining
(Học máy và Khai phá dữ liệu)
Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology
2021


Contents

¡ Introduction to Machine Learning & Data Mining
¡ Unsupervised learning
¡ Supervised learning
    ¨ Support Vector Machines
¡ Practical advice


Support Vector Machines (1)

¡ Support Vector Machines (SVM) were proposed by Vapnik and his colleagues in the 1970s, and became famous and popular in the 1990s.
¡ Originally, SVM is a method for linear classification: it finds a hyperplane (also called a linear classifier) that separates the two classes of data.
¡ For non-linear classification, where no hyperplane separates the data well, kernel functions will be used.
    ¨ Kernel functions transform the data into another space, in which the data become linearly separable.
¡ We sometimes say "linear SVM" when no kernel function is used (in fact, linear SVM uses a linear kernel).


Support Vector Machines (2)

¡ SVM has a strong theory that supports its performance.
¡ It can work well with very high-dimensional problems.
¡ It is now one of the most popular and powerful methods.
¡ For text categorization, linear SVM performs very well.


1. SVM: the linearly separable case

¡ Problem representation:
    ¨ Training data D = {(x1, y1), (x2, y2), …, (xr, yr)} with r instances.
    ¨ Each xi is a vector in an n-dimensional space, e.g., xi = (xi1, xi2, …, xin)T. Each dimension represents an attribute.
    ¨ Bold characters denote vectors.
    ¨ yi is a class label in {-1, 1}: '1' is the positive class, '-1' is the negative class.
¡ Linear separability assumption: there exists a hyperplane (of linear form) that separates the two classes well.
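For concreteness, such a training set can be stored as a matrix of instances and a vector of labels. A minimal NumPy sketch with made-up values (the four points and their labels are purely illustrative and are reused in the later sketches):

```python
import numpy as np

# Hypothetical toy data in the format described above:
# r = 4 instances in an n = 2 dimensional space, labels in {-1, +1}.
X = np.array([[2.0, 2.0],
              [3.0, 3.0],
              [-1.0, -1.0],
              [-2.0, -1.5]])    # each row is one instance x_i
y = np.array([1, 1, -1, -1])    # y_i = +1: positive class, y_i = -1: negative class
r, n = X.shape
```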


Linear SVM

¡ SVM finds a hyperplane of the form:
        f(x) = ⟨w · x⟩ + b        [Eq.1]
    ¨ w is the weight vector; b is a real number (the bias).
    ¨ ⟨w · x⟩ (also written ⟨w, x⟩) denotes the inner product of the two vectors.
¡ Such that for each xi:
        yi = 1   if ⟨w · xi⟩ + b ≥ 0
        yi = −1  if ⟨w · xi⟩ + b < 0        [Eq.2]
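As an illustration, the rule in [Eq.2] just takes the sign of f(x). A minimal NumPy sketch, with hypothetical parameters w and b and the toy data X from the sketch above:

```python
import numpy as np

# Hypothetical hyperplane parameters; X is the toy data defined earlier.
w = np.array([1.0, 1.0])
b = -1.0

f = X @ w + b                     # f(x_i) = <w, x_i> + b for every instance
y_pred = np.where(f >= 0, 1, -1)  # the decision rule of [Eq.2]
print(y_pred)                     # predicted labels in {-1, +1}
```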


Separating hyperplane

¡ The hyperplane (H0) which separates the positive class from the negative class is of the form:
        ⟨w · x⟩ + b = 0
¡ It is also known as the decision boundary/surface.
¡ But there might be infinitely many separating hyperplanes. Which one should we choose?

(Figure: [Liu, 2006])


Hyperplane with max margin

¡ SVM selects the hyperplane with the maximum margin.
¡ It is proven that the max-margin hyperplane yields minimal errors among all possible hyperplanes.

(Figure: [Liu, 2006])


Marginal hyperplanes

¡ Assume that the two classes in our data can be separated clearly by a hyperplane.
¡ Denote by (x+, 1) the point in the positive class and by (x−, −1) the point in the negative class that are closest to the separating hyperplane H0 (⟨w · x⟩ + b = 0).
¡ We define two parallel marginal hyperplanes as follows:
    ¨ H+ crosses x+ and is parallel with H0: ⟨w · x+⟩ + b = 1
    ¨ H− crosses x− and is parallel with H0: ⟨w · x−⟩ + b = −1
    ¨ No data point lies between these two marginal hyperplanes, i.e., every training instance satisfies:
        ⟨w · xi⟩ + b ≥ 1,   if yi = 1
        ⟨w · xi⟩ + b ≤ −1,  if yi = −1        [Eq.3]
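The two conditions in [Eq.3] are easy to check numerically. A small sketch, where w and b are hypothetical candidate parameters and X, y are the toy data from before:

```python
import numpy as np

def satisfies_eq3(X, y, w, b):
    # [Eq.3]: positive instances must satisfy <w, x_i> + b >= 1,
    #         negative instances must satisfy <w, x_i> + b <= -1.
    f = X @ w + b
    return bool(np.all(f[y == 1] >= 1.0) and np.all(f[y == -1] <= -1.0))
```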


The margin (1)

¡ The margin is defined as the distance between the two marginal hyperplanes.
    ¨ Denote by d+ the distance from H0 to H+.
    ¨ Denote by d− the distance from H0 to H−.
    ¨ (d+ + d−) is the margin.
¡ Remember that the distance from a point xi to the hyperplane H0 (⟨w · x⟩ + b = 0) is computed as:
        |⟨w · xi⟩ + b| / ||w||        [Eq.4]
    ¨ where ||w|| = √⟨w · w⟩ = √(w1² + w2² + ⋯ + wn²)        [Eq.5]


The margin (2)

¡ So the distance d+ from x+ to H0 is
        d+ = |⟨w · x+⟩ + b| / ||w|| = |1| / ||w|| = 1 / ||w||        [Eq.6]
¡ Similarly, the distance d− from x− to H0 is
        d− = |⟨w · x−⟩ + b| / ||w|| = |−1| / ||w|| = 1 / ||w||        [Eq.7]
¡ As a result, the margin is:
        margin = d+ + d− = 2 / ||w||        [Eq.8]
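A minimal NumPy sketch of [Eq.4] and [Eq.8], with a hypothetical w, b and one point x_i:

```python
import numpy as np

w = np.array([1.0, 1.0])     # hypothetical weight vector
b = -1.0                     # hypothetical bias
x_i = np.array([2.0, 2.0])   # some point

dist = abs(w @ x_i + b) / np.linalg.norm(w)   # distance from x_i to H0, [Eq.4]
margin = 2.0 / np.linalg.norm(w)              # margin of the hyperplane, [Eq.8]
print(dist, margin)
```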


SVM: learning with max margin (1)

¡ SVM learns a classifier H0 with a maximum margin, i.e., the hyperplane that has the greatest margin among all possible hyperplanes.
¡ This learning principle can be formulated as the following quadratic optimization problem:
    ¨ Find w and b that maximize
        margin = 2 / ||w||
    ¨ and satisfy the conditions below for every training instance xi:
        ⟨w · xi⟩ + b ≥ 1,   if yi = 1
        ⟨w · xi⟩ + b ≤ −1,  if yi = −1



SVM: learning with max margin (2)

¡ Learning SVM is equivalent to the following minimization problem:
    ¨ Minimize   ⟨w · w⟩ / 2        [Eq.9]
    ¨ Conditioned on
        ⟨w · xi⟩ + b ≥ 1,   if yi = 1
        ⟨w · xi⟩ + b ≤ −1,  if yi = −1
¡ Note that it can be reformulated as:
    ¨ Minimize   ⟨w · w⟩ / 2
    ¨ Conditioned on   yi(⟨w · xi⟩ + b) ≥ 1, ∀i = 1..r        [Eq.10] (P)
¡ This is a constrained optimization problem.
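To make (P) concrete, here is a rough sketch of feeding [Eq.10] to a generic constrained solver (SciPy's SLSQP), reusing the toy X, y from earlier. This is only an illustration under those assumptions, not an efficient SVM trainer:

```python
import numpy as np
from scipy.optimize import minimize

def objective(theta):
    w = theta[:-1]                # first n entries are w, the last one is b
    return 0.5 * (w @ w)          # <w, w> / 2 from [Eq.10]

def margin_constraints(theta, X, y):
    w, b = theta[:-1], theta[-1]
    return y * (X @ w + b) - 1.0  # every entry must be >= 0

n = X.shape[1]
res = minimize(objective, np.zeros(n + 1), method='SLSQP',
               constraints=[{'type': 'ineq', 'fun': margin_constraints, 'args': (X, y)}])
w_hat, b_hat = res.x[:-1], res.x[-1]
```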



Constrained optimization (1)

¡ Consider the problem:
        Minimize f(x) conditioned on g(x) = 0
¡ Necessary condition: a solution x0 will satisfy
        ∂/∂x [f(x) + λ g(x)] |x=x0 = 0;   g(x0) = 0
    ¨ where λ is a Lagrange multiplier.
¡ In the case of many constraints (gi(x) = 0 for i = 1..r), a solution x0 will satisfy:
        ∂/∂x [f(x) + Σi=1..r λi gi(x)] |x=x0 = 0;   gi(x0) = 0, ∀i = 1..r
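A tiny worked example of this condition, assuming SymPy is available: minimize f(x, y) = x² + y² subject to x + y = 1. Setting the partial derivatives of the Lagrange function to zero gives x = y = 1/2:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x**2 + y**2          # objective
g = x + y - 1            # equality constraint g(x, y) = 0
L = f + lam * g          # Lagrange function

# Stationarity plus the constraint: all partial derivatives of L equal zero.
sol = sp.solve([sp.diff(L, v) for v in (x, y, lam)], (x, y, lam), dict=True)
print(sol)               # [{x: 1/2, y: 1/2, lam: -1}]
```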


Constrained optimization (2)

¡ Consider the problem with inequality constraints:
        Minimize f(x) conditioned on gi(x) ≤ 0, for i = 1..r
¡ Necessary condition: a solution x0 will satisfy
        ∂/∂x [f(x) + Σi=1..r αi gi(x)] |x=x0 = 0;   gi(x0) ≤ 0, ∀i = 1..r
    ¨ where each αi ≥ 0 is a Lagrange multiplier.
¡ L = f(x) + Σi=1..r αi gi(x) is known as the Lagrange function.
    ¨ x is called the primal variable.
    ¨ α is called the dual variable.



SVM: learning with max margin (3)

¡ The Lagrange function for problem [Eq.10] is
        L(w, b, α) = ½⟨w · w⟩ − Σi=1..r αi [yi(⟨w · xi⟩ + b) − 1]        [Eq.11a]
    ¨ where each αi ≥ 0 is a Lagrange multiplier.
¡ Solving [Eq.10] is equivalent to the following minimax problem:
        arg min(w,b) max(α≥0) L(w, b, α)
        = arg min(w,b) max(α≥0) { ½⟨w · w⟩ − Σi=1..r αi [yi(⟨w · xi⟩ + b) − 1] }        [Eq.11b]
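For reference, [Eq.11a] is straightforward to evaluate. A minimal sketch, where w, b and the multiplier vector alpha are hypothetical and X, y are the toy data from earlier:

```python
import numpy as np

def lagrangian(w, b, alpha, X, y):
    slack = y * (X @ w + b) - 1.0           # y_i(<w, x_i> + b) - 1 for every i
    return 0.5 * (w @ w) - alpha @ slack    # L(w, b, alpha) from [Eq.11a]
```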


SVM: learning with max margin (4)

¡ The primal problem [Eq.10] can be derived by solving:
        max(α≥0) L(w, b, α) = max(α≥0) { ½⟨w · w⟩ − Σi=1..r αi [yi(⟨w · xi⟩ + b) − 1] }
¡ Its dual problem can be derived by solving:
        min(w,b) L(w, b, α) = min(w,b) { ½⟨w · w⟩ − Σi=1..r αi [yi(⟨w · xi⟩ + b) − 1] }
¡ It is known that the optimal solution to [Eq.10] satisfies a set of conditions called the Karush-Kuhn-Tucker (KKT) conditions.


SVM: Karush-Kuhn-Tucker

        ∂L/∂w = w − Σi=1..r αi yi xi = 0        [Eq.12]
        ∂L/∂b = −Σi=1..r αi yi = 0        [Eq.13]
        yi(⟨w · xi⟩ + b) − 1 ≥ 0, ∀i = 1..r        [Eq.14]
        αi ≥ 0, ∀i = 1..r        [Eq.15]
        αi [yi(⟨w · xi⟩ + b) − 1] = 0, ∀i = 1..r        [Eq.16]

¡ The last equation [Eq.16] comes from a nice result in duality theory.
    ¨ Note: any αi > 0 implies that the associated point xi lies on a boundary hyperplane (H+ or H−).
    ¨ Such a boundary point is called a support vector.
    ¨ A non-support vector corresponds to αi = 0.
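These conditions can be checked numerically for a candidate solution. A minimal sketch, where w, b and alpha are hypothetical, X, y are the toy data from earlier, and tol absorbs floating-point error:

```python
import numpy as np

def check_kkt(w, b, alpha, X, y, tol=1e-6):
    slack = y * (X @ w + b) - 1.0
    stationarity  = np.allclose(w, (alpha * y) @ X, atol=tol)    # [Eq.12]
    bias_cond     = abs(alpha @ y) <= tol                        # [Eq.13]
    primal_feas   = np.all(slack >= -tol)                        # [Eq.14]
    dual_feas     = np.all(alpha >= -tol)                        # [Eq.15]
    complementary = np.all(np.abs(alpha * slack) <= tol)         # [Eq.16]
    return bool(stationarity and bias_cond and primal_feas and dual_feas and complementary)
```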


SVM: learning with max margin (5)

¡ In general, the KKT conditions do not guarantee the optimality of the solution.
¡ Fortunately, due to the convexity of the primal problem [Eq.10], the KKT conditions are both necessary and sufficient to assure the global optimality of the solution. This means that a vector satisfying all KKT conditions provides the globally optimal classifier.
    ¨ Convex optimization is easy in the sense that we can always find a good solution with a provable guarantee.
    ¨ There are many algorithms in the literature, but most are iterative.
¡ In fact, it is pretty hard to derive an efficient algorithm for problem [Eq.10] directly. Therefore, its dual problem is preferable.


SVM: the dual form (1)

¡ Remember that the dual counterpart of [Eq.10] is
        min(w,b) L(w, b, α) = min(w,b) { ½⟨w · w⟩ − Σi=1..r αi [yi(⟨w · xi⟩ + b) − 1] }
¡ By taking the gradient of L(w, b, α) with respect to the variables (w, b) and setting it to zero, we obtain the following dual function:
        LD(α) = Σi=1..r αi − ½ Σi,j=1..r αi αj yi yj ⟨xi · xj⟩        [Eq.17]


SVM: the dual form (2)

¡ Solving problem [Eq.10] is equivalent to solving its dual problem below:
    ¨ Maximize   LD(α) = Σi=1..r αi − ½ Σi,j=1..r αi αj yi yj ⟨xi · xj⟩
    ¨ Such that
        Σi=1..r αi yi = 0
        αi ≥ 0, ∀i = 1..r        [Eq.18] (D)
¡ The constraints in (D) are much simpler than those of the primal problem. Therefore, deriving an efficient method to solve this problem may be easier.
    ¨ However, existing algorithms for this problem are iterative and complicated, so we will not discuss any of them in detail.
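As with the primal, the dual (D) can be handed to a generic solver just to make the formulation concrete. A rough sketch with SciPy's SLSQP, reusing the toy X, y (an illustration under those assumptions, not an efficient SVM solver); it exploits the fact that Σi,j αi αj yi yj ⟨xi · xj⟩ = ||Σi αi yi xi||²:

```python
import numpy as np
from scipy.optimize import minimize

def neg_dual(alpha, X, y):
    v = (alpha * y) @ X                    # v = sum_i alpha_i y_i x_i
    return -(alpha.sum() - 0.5 * (v @ v))  # -L_D(alpha) from [Eq.17]

r = len(y)
res = minimize(neg_dual, np.zeros(r), args=(X, y), method='SLSQP',
               bounds=[(0.0, None)] * r,                              # alpha_i >= 0
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x
```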


SVM: the optimal classifier

¡ Once the dual problem is solved for α, we can recover the optimal solution to problem [Eq.10] by using the KKT conditions.
¡ Let SV be the set of all support vectors.
    ¨ SV is a subset of the training data.
    ¨ αi > 0 indicates that xi is a support vector.
¡ We can compute w* by using [Eq.12]:
        w* = Σi=1..r αi yi xi = Σ(xi ∈ SV) αi yi xi        (since αj = 0 for any xj not in SV)
¡ To find b*, we take any index k such that αk > 0:
    ¨ It means yk(⟨w* · xk⟩ + b*) − 1 = 0 due to [Eq.16].
    ¨ Hence, b* = yk − ⟨w* · xk⟩.
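A minimal sketch of this recovery step, assuming alpha comes from the previous dual sketch and X, y are the toy data (the threshold 1e-8 just filters out numerically zero multipliers):

```python
import numpy as np

sv = np.flatnonzero(alpha > 1e-8)       # indices of support vectors (alpha_i > 0)
w_star = (alpha[sv] * y[sv]) @ X[sv]    # w* = sum over SV of alpha_i y_i x_i  [Eq.12]
k = sv[0]                               # any index with alpha_k > 0
b_star = y[k] - w_star @ X[k]           # b* = y_k - <w*, x_k>
```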


SVM: classifying new instances

¡ The decision boundary is
        f(x) = ⟨w* · x⟩ + b* = Σ(xi ∈ SV) αi yi ⟨xi · x⟩ + b* = 0        [Eq.19]
¡ For a new instance z, we compute:
        sign(⟨w* · z⟩ + b*) = sign( Σ(xi ∈ SV) αi yi ⟨xi · z⟩ + b* )        [Eq.20]
    ¨ If the result is 1, z is assigned to the positive class; otherwise z is assigned to the negative class.
¡ Note that this classification rule
    ¨ depends only on the support vectors, and
    ¨ only needs to compute some dot products.
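A minimal sketch of [Eq.20], using alpha, sv and b_star from the previous sketches; the query point z is hypothetical:

```python
import numpy as np

z = np.array([0.5, 2.0])                              # a new instance to classify
score = (alpha[sv] * y[sv]) @ (X[sv] @ z) + b_star    # sum over SV of alpha_i y_i <x_i, z> + b*
label = 1 if score >= 0 else -1
print(label)
```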


2. Soft-margin SVM

¡ What if the two classes are not linearly separable?
    ¨ Linear separability is an ideal situation that rarely holds in practice.
    ¨ Data are often noisy or erroneous, which can make the two classes overlap.
¡ In the linearly separable case we solve:
    ¨ Minimize   ⟨w · w⟩ / 2
    ¨ Conditioned on   yi(⟨w · xi⟩ + b) ≥ 1, ∀i = 1..r
¡ In the case of noise or overlapping classes, those constraints may never be met simultaneously.
    ¨ It means we cannot solve for w* and b*.
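In practice one usually relies on a library implementation of the (soft-margin) SVM rather than a hand-written solver. A minimal sketch with scikit-learn, assuming it is installed and reusing the toy X, y; the penalty parameter C=1.0, which trades margin size against constraint violations, is just an illustrative default:

```python
import numpy as np
from sklearn.svm import SVC

clf = SVC(kernel='linear', C=1.0)           # linear soft-margin SVM
clf.fit(X, y)
print(clf.support_vectors_)                 # support vectors found by the solver
print(clf.predict(np.array([[0.5, 2.0]])))  # predicted label for a new point
```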


Example of inseparability

¡ Noisy points xa and xb are misplaced.