Introduction to
Machine Learning and Data Mining
Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology
2021
Contents
¡ Introduction to Machine Learning & Data Mining
¡ Unsupervised learning
¡ Supervised learning
    ¨ Support Vector Machines
¡ Practical advice
Support Vector Machines (1)
¡ Support Vector Machines (SVM) were proposed by Vapnik and his
colleagues in the 1970s, and became famous and popular in the 1990s.
¡ Originally, SVM is a method for linear classification. It finds a
hyperplane (also called a linear classifier) to separate the two
classes of data.
¡ For non-linear classification, for which no hyperplane separates
the data well, kernel functions will be used.
    ¨ Kernel functions transform the data into another space, in
    which the data is linearly separable.
¡ We sometimes say linear SVM when no kernel function is used
(in fact, linear SVM uses a linear kernel).
Support Vector Machines (2)
¡ SVM has a strong theory that supports its performance.
¡ It can work well on very high-dimensional problems.
¡ It is now one of the most popular and strongest methods.
¡ For text categorization, linear SVM performs very well.
1. SVM: the linearly separable case
¡ Problem representation:
    ¨ Training data D = {(x1, y1), (x2, y2), …, (xr, yr)} with r instances.
    ¨ Each xi is a vector in an n-dimensional space, e.g.,
    xi = (xi1, xi2, …, xin)^T. Each dimension represents an attribute.
    ¨ Bold characters denote vectors.
    ¨ yi is a class label in {-1, 1}: '1' is the positive class and
    '-1' is the negative class.
¡ Linear separability assumption: there exists a hyperplane (of
linear form) that separates the two classes well.
Linear SVM
¡ SVM finds a hyperplane of the form:
    f(x) = ⟨w · x⟩ + b    [Eq.1]
    ¨ w is the weight vector; b is a real number (the bias).
    ¨ ⟨w · x⟩ and ⟨w, x⟩ both denote the inner product of two vectors.
¡ Such that for each xi:
    yi = 1  if ⟨w · xi⟩ + b ≥ 0
    yi = -1 if ⟨w · xi⟩ + b < 0    [Eq.2]
Separating hyperplane
¡ The hyperplane (H0) which separates the positive class from the
negative class is of the form:
    ⟨w · x⟩ + b = 0
¡ It is also known as the decision boundary/surface.
¡ But there might be infinitely many separating hyperplanes. Which
one should we choose?
[Liu, 2006]
Hyperplane with max margin
¡ SVM selects the hyperplane with the maximum margin.
¡ It is proven that the max-margin hyperplane has minimal errors
among all possible hyperplanes.
[Liu, 2006]
Marginal hyperplanes
¡ Assume that the two classes in our data can be separated clearly
by a hyperplane.
¡ Denote by (x+, 1) the point in the positive class and by (x-, -1)
the point in the negative class that are closest to the separating
hyperplane H0 (⟨w · x⟩ + b = 0).
¡ We define two parallel marginal hyperplanes as follows:
    ¨ H+ crosses x+ and is parallel to H0: ⟨w · x+⟩ + b = 1
    ¨ H- crosses x- and is parallel to H0: ⟨w · x-⟩ + b = -1
    ¨ No data point lies between these two marginal hyperplanes,
    i.e., every training instance satisfies:
        ⟨w · xi⟩ + b ≥ 1,  if yi = 1
        ⟨w · xi⟩ + b ≤ -1, if yi = -1    [Eq.3]
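As a quick sanity check, the constraints of [Eq.3] can be verified numerically. The sketch below uses a hypothetical hyperplane w = (1, 0), b = 0 and four made-up points; none of these values come from the slides.

```python
# Check the marginal-hyperplane constraints of [Eq.3] on a toy dataset.
# The hyperplane and the data points here are illustrative assumptions.

def dot(u, v):
    """Inner product <u . v> of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

# Hyperplane H0: <w . x> + b = 0 with w = (1, 0), b = 0 (the line x1 = 0).
w, b = (1.0, 0.0), 0.0

# Training data: pairs (xi, yi) with yi in {-1, 1}.
data = [((2.0, 1.0), 1), ((1.0, -1.0), 1), ((-1.5, 0.5), -1), ((-2.0, -2.0), -1)]

# [Eq.3]: <w . xi> + b >= 1 if yi = 1, and <= -1 if yi = -1.
# Equivalently: yi * (<w . xi> + b) >= 1 for every i.
satisfies_eq3 = all(y * (dot(w, x) + b) >= 1 for x, y in data)
print(satisfies_eq3)  # True: no point lies between H+ and H-
```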
The margin (1)
¡ The margin is defined as the distance between the two marginal
hyperplanes.
    ¨ Denote by d+ the distance from H0 to H+.
    ¨ Denote by d- the distance from H0 to H-.
    ¨ (d+ + d-) is the margin.
¡ Remember that the distance from a point xi to the hyperplane H0
(⟨w · x⟩ + b = 0) is computed as:
    |⟨w · xi⟩ + b| / ||w||    [Eq.4]
    ¨ where ||w|| = √⟨w · w⟩ = √(w1² + w2² + … + wn²)    [Eq.5]
The margin (2)
¡ So the distance d+ from x+ to H0 is:
    d+ = |⟨w · x+⟩ + b| / ||w|| = |1| / ||w|| = 1 / ||w||    [Eq.6]
¡ Similarly, the distance d- from x- to H0 is:
    d- = |⟨w · x-⟩ + b| / ||w|| = |-1| / ||w|| = 1 / ||w||    [Eq.7]
¡ As a result, the margin is:
    margin = d+ + d- = 2 / ||w||    [Eq.8]
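The distance formula [Eq.4] and the margin [Eq.8] are easy to compute directly. A minimal sketch, with an illustrative w and b chosen only for this example:

```python
# Distance from a point to the hyperplane H0 ([Eq.4]) and the margin ([Eq.8]).
# The vector w, bias b, and sample point are illustrative assumptions.
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(w):
    # [Eq.5]: ||w|| = sqrt(<w . w>) = sqrt(w1^2 + ... + wn^2)
    return math.sqrt(dot(w, w))

def distance(x, w, b):
    # [Eq.4]: |<w . x> + b| / ||w||
    return abs(dot(w, x) + b) / norm(w)

w, b = (3.0, 4.0), 0.0
print(distance((1.0, 1.0), w, b))  # |3 + 4| / 5 = 1.4

# [Eq.8]: the margin between H+ and H- is 2 / ||w||
print(2.0 / norm(w))  # 2 / 5 = 0.4
```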
SVM: learning with max margin (1)
¡ SVM learns a classifier H0 with a maximum margin, i.e., the
hyperplane that has the greatest margin among all possible
hyperplanes.
¡ This learning principle can be formulated as the following
quadratic optimization problem:
    ¨ Find w and b that maximize
        margin = 2 / ||w||
    ¨ and satisfy the conditions below for every training instance xi:
        ⟨w · xi⟩ + b ≥ 1,  if yi = 1
        ⟨w · xi⟩ + b ≤ -1, if yi = -1
SVM: learning with max margin (2)
¡ Learning SVM is equivalent to the following minimization problem:
    ¨ Minimize ⟨w · w⟩ / 2    [Eq.9]
    ¨ Conditioned on
        ⟨w · xi⟩ + b ≥ 1,  if yi = 1
        ⟨w · xi⟩ + b ≤ -1, if yi = -1
¡ Note, it can be reformulated as:
    ¨ Minimize ⟨w · w⟩ / 2
    ¨ Conditioned on yi(⟨w · xi⟩ + b) ≥ 1, ∀i = 1..r    [Eq.10] (P)
¡ This is a constrained optimization problem.
Constrained optimization (1)
¡ Consider the problem:
    Minimize f(x) conditioned on g(x) = 0
¡ Necessary condition: a solution x0 will satisfy
    ∂/∂x [f(x) + λ·g(x)] = 0 at x = x0;  g(x0) = 0
    ¨ where λ is a Lagrange multiplier.
¡ In the case of many constraints (gi(x) = 0 for i = 1..r), a
solution x0 will satisfy:
    ∂/∂x [f(x) + Σ_{i=1..r} λi·gi(x)] = 0 at x = x0;  gi(x0) = 0, ∀i = 1..r
Constrained optimization (2)
¡ Consider the problem with inequality constraints:
    Minimize f(x) conditioned on gi(x) ≤ 0, i = 1..r
¡ Necessary condition: a solution x0 will satisfy
    ∂/∂x [f(x) + Σ_{i=1..r} αi·gi(x)] = 0 at x = x0;  gi(x0) ≤ 0, ∀i = 1..r
    ¨ where each αi ≥ 0 is a Lagrange multiplier.
¡ L = f(x) + Σ_{i=1..r} αi·gi(x) is known as the Lagrange function.
    ¨ x is called the primal variable
    ¨ α is called the dual variable
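To make the role of the multipliers concrete, here is a small worked example (not from the slides) of the inequality-constrained condition above:

```latex
\text{Minimize } f(x) = x^2 \text{ conditioned on } g(x) = 1 - x \le 0 .
\\ L(x, \alpha) = x^2 + \alpha (1 - x), \qquad \alpha \ge 0 .
\\ \frac{\partial L}{\partial x} = 2x - \alpha = 0 \;\Rightarrow\; x = \frac{\alpha}{2} .
\\ \text{If } \alpha = 0 \text{ then } x = 0 \text{, which violates } g(x) \le 0 \text{; hence } g(x^*) = 0 :
\\ x^* = 1, \qquad \alpha^* = 2 > 0 \quad (\text{the constraint is active}).
```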
SVM: learning with max margin (3)
¡ The Lagrange function for problem [Eq.10] is
    L(w, b, α) = ½⟨w · w⟩ − Σ_{i=1..r} αi·[yi(⟨w · xi⟩ + b) − 1]    [Eq.11a]
    ¨ where each αi ≥ 0 is a Lagrange multiplier.
¡ Solving [Eq.10] is equivalent to the following minimax problem:
    argmin_{w,b} max_{α≥0} L(w, b, α)
    = argmin_{w,b} max_{α≥0} { ½⟨w · w⟩ − Σ_{i=1..r} αi·[yi(⟨w · xi⟩ + b) − 1] }    [Eq.11b]
SVM: learning with max margin (4)
¡ The primal problem [Eq.10] can be derived by solving:
    max_{α≥0} L(w, b, α) = max_{α≥0} { ½⟨w · w⟩ − Σ_{i=1..r} αi·[yi(⟨w · xi⟩ + b) − 1] }
¡ Its dual problem can be derived by solving:
    min_{w,b} L(w, b, α) = min_{w,b} { ½⟨w · w⟩ − Σ_{i=1..r} αi·[yi(⟨w · xi⟩ + b) − 1] }
¡ It is known that the optimal solution to [Eq.10] will satisfy some
conditions, which are called the Karush-Kuhn-Tucker (KKT) conditions.
SVM: Karush-Kuhn-Tucker
¡ The KKT conditions for problem [Eq.10]:
    ∂L/∂w = w − Σ_{i=1..r} αi·yi·xi = 0    [Eq.12]
    ∂L/∂b = −Σ_{i=1..r} αi·yi = 0    [Eq.13]
    yi(⟨w · xi⟩ + b) − 1 ≥ 0, ∀xi (i = 1..r)    [Eq.14]
    αi ≥ 0    [Eq.15]
    αi·(yi(⟨w · xi⟩ + b) − 1) = 0    [Eq.16]
¡ The last equation [Eq.16] comes from a nice result in duality theory.
    ¨ Note: any αi > 0 implies that the associated point xi lies on a
    marginal hyperplane (H+ or H-).
    ¨ Such a boundary point is called a support vector.
    ¨ A non-support vector corresponds to αi = 0.
SVM: learning with max margin (5)
¡ In general, the KKT conditions do not guarantee the optimality of
a solution.
¡ Fortunately, due to the convexity of the primal problem [Eq.10],
the KKT conditions are both necessary and sufficient for the global
optimality of a solution. This means that any vector satisfying all
the KKT conditions provides the globally optimal classifier.
    ¨ Convex optimization is easy in the sense that we can always
    find a good solution with a provable guarantee.
    ¨ There are many algorithms in the literature, but most are
    iterative.
¡ In fact, it is pretty hard to derive an efficient algorithm for
problem [Eq.10] directly. Therefore, its dual problem is preferable.
SVM: the dual form (1)
¡ Remember that the dual counterpart of [Eq.10] is
    min_{w,b} L(w, b, α) = min_{w,b} { ½⟨w · w⟩ − Σ_{i=1..r} αi·[yi(⟨w · xi⟩ + b) − 1] }
¡ By taking the gradient of L(w, b, α) with respect to (w, b) and
setting it to zero, we obtain the following dual function:
    LD(α) = Σ_{i=1..r} αi − ½ Σ_{i,j=1..r} αi·αj·yi·yj·⟨xi · xj⟩    [Eq.17]
SVM: the dual form (2)
¡ Solving problem [Eq.10] is equivalent to solving its dual problem below:
    Maximize LD(α) = Σ_{i=1..r} αi − ½ Σ_{i,j=1..r} αi·αj·yi·yj·⟨xi · xj⟩
    Such that Σ_{i=1..r} αi·yi = 0 and αi ≥ 0, ∀i = 1..r    [Eq.18] (D)
¡ The constraints in (D) are much simpler than those of the primal
problem. Therefore, deriving an efficient method to solve this
problem might be easier.
    ¨ However, existing algorithms for this problem are iterative and
    complicated. Therefore, we will not discuss any algorithm in detail.
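Although practical dual solvers are out of scope here, the dual (D) can still be illustrated on a dataset small enough to solve by hand. The sketch below uses an assumed 2-point dataset and a naive projected gradient-ascent step; this is only a teaching toy, not one of the real algorithms mentioned above.

```python
# A tiny illustration of the dual problem (D) [Eq.18] on a 2-point dataset.
# The dataset and the naive ascent scheme are illustrative assumptions;
# practical solvers are far more sophisticated.

X = [(1.0, 0.0), (-1.0, 0.0)]  # x1 in the positive class, x2 in the negative
y = [1, -1]

# With y = (1, -1), the equality constraint a1*y1 + a2*y2 = 0 forces a1 = a2 = a.
# Since <x1,x1> = <x2,x2> = 1 and <x1,x2> = -1, the dual objective reduces to
#   L_D(a) = 2a - 0.5 * a^2 * sum_{i,j} yi*yj*<xi . xj> = 2a - 2a^2.
a = 0.0
for _ in range(1000):
    grad = 2.0 - 4.0 * a          # dL_D/da
    a = max(0.0, a + 0.1 * grad)  # ascent step, projected onto a >= 0

print(round(a, 6))  # 0.5, the maximizer of 2a - 2a^2
```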
SVM: the optimal classifier
¡ Once the dual problem is solved for α, we can recover the optimal
solution to problem [Eq.10] by using the KKT conditions.
¡ Let SV be the set of all support vectors:
    ¨ SV is a subset of the training data.
    ¨ αi > 0 indicates that xi is a support vector.
¡ We can compute w* by using [Eq.12]:
    w* = Σ_{i=1..r} αi·yi·xi = Σ_{xi ∈ SV} αi·yi·xi
    (since αj = 0 for any xj not in SV)
¡ To find b*, we take any index k such that αk > 0:
    ¨ It means yk(⟨w* · xk⟩ + b*) − 1 = 0, due to [Eq.16].
    ¨ Hence, b* = yk − ⟨w* · xk⟩ (using 1/yk = yk, since yk ∈ {-1, 1}).
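The recovery of w* and b* from the dual solution is a few lines of code. The sketch below assumes a toy 2-point dataset x1 = (1, 0), y1 = 1 and x2 = (-1, 0), y2 = -1, whose dual optimum α1 = α2 = 0.5 can be checked by hand; these values are illustrative, not from the slides.

```python
# Recover w* and b* from a dual solution, per [Eq.12] and [Eq.16].
# The toy dataset and its (hand-checkable) dual optimum are assumptions.

X = [(1.0, 0.0), (-1.0, 0.0)]
y = [1, -1]
alpha = [0.5, 0.5]  # both points have alpha_i > 0, so both are support vectors

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# [Eq.12]: w* = sum_i alpha_i * yi * xi (only support vectors contribute)
n = len(X[0])
w_star = [sum(alpha[i] * y[i] * X[i][d] for i in range(len(X))) for d in range(n)]

# Pick any k with alpha_k > 0; then b* = yk - <w* . xk>, by [Eq.16]
k = 0
b_star = y[k] - dot(w_star, X[k])

print(w_star, b_star)  # [1.0, 0.0] 0.0 -> the separating hyperplane is x1 = 0
```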
SVM: classifying new instances
¡ The decision boundary is
    f(x) = ⟨w* · x⟩ + b* = Σ_{xi ∈ SV} αi·yi·⟨xi · x⟩ + b* = 0    [Eq.19]
¡ For a new instance z, we compute:
    sign(⟨w* · z⟩ + b*) = sign( Σ_{xi ∈ SV} αi·yi·⟨xi · z⟩ + b* )    [Eq.20]
    ¨ If the result is 1, z will be assigned to the positive class;
    otherwise z will be assigned to the negative class.
¡ Note that this classification principle:
    ¨ depends only on the support vectors;
    ¨ only needs to compute some dot products.
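The point of [Eq.20] is that classification touches only the support vectors and dot products, never an explicit w*. A minimal sketch, assuming a toy set of support vectors (xi, yi, αi) and b* whose values can be checked by hand and are not from the slides:

```python
# Classify a new instance z with [Eq.20], using only support vectors and
# dot products. The support-vector set and b* are illustrative assumptions.

SV = [((1.0, 0.0), 1, 0.5), ((-1.0, 0.0), -1, 0.5)]  # triples (xi, yi, alpha_i)
b_star = 0.0

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def classify(z):
    # sign( sum_{xi in SV} alpha_i * yi * <xi . z> + b* )
    score = sum(a * yi * dot(xi, z) for xi, yi, a in SV) + b_star
    return 1 if score >= 0 else -1

print(classify((2.0, 3.0)))   # 1  (positive side of the boundary x1 = 0)
print(classify((-0.5, 7.0)))  # -1 (negative side)
```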
2. Soft-margin SVM
¡ What if the two classes are not linearly separable?
    ¨ Linear separability is an ideal situation; it rarely holds in practice.
    ¨ Data are often noisy or erroneous, which may make the two
    classes overlap.
¡ In the case of linear separability:
    Minimize ⟨w · w⟩ / 2
    Conditioned on yi(⟨w · xi⟩ + b) ≥ 1, ∀i = 1..r
¡ In the case of noise or overlapping, those constraints may never
be met simultaneously.
    ¨ It means we cannot solve for w* and b*.
Example of inseparability
¡ Noisy points xa and xb are misplaced.