Machine Learning
Clustering
Nguyen Thi Thu Ha
Email:
What is clustering
• Clustering can be considered the most important unsupervised learning problem.
• Another definition of clustering could be “the process of organizing objects into groups whose members are similar”.
What is clustering
• A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
What is clustering
• In this case we identify the 4 clusters into which the data can be divided;
• the similarity criterion is distance:
• two or more objects belong to the same cluster if they are “close” according to a given distance (this is called distance-based clustering).
What is clustering
• Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if the cluster defines a concept common to all of those objects.
• In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.
Why?
• determine the intrinsic grouping in a set of unlabeled data;
• what constitutes a good clustering?
Applications
• Marketing: finding groups of customers with similar behavior, given a large database of customer data containing their properties and past buying records;
• Biology: classification of plants and animals given their features;
• Libraries: book ordering;
Applications
• City-planning: identifying groups of houses according to their house type, value and geographical location;
• Earthquake studies: clustering observed earthquakes to identify dangerous zones;
• WWW: document classification; clustering weblog data to discover groups of similar access patterns.
Problems
• dealing with a large number of dimensions and a large number of data items;
• the effectiveness of the method depends on the definition of “distance” (for distance-based clustering).
Classification of clustering algorithms
• Exclusive Clustering
• Overlapping Clustering
• Hierarchical Clustering
• Probabilistic Clustering
Classification of clustering algorithms
• four of the most used clustering algorithms:
– K-means
– Fuzzy C-means
– Hierarchical clustering
– Mixture of Gaussians
K-Means
• K-Means Algorithm Properties:
– There are always K clusters.
– There is always at least one item in each cluster.
– The clusters are non-hierarchical and they do not overlap.
– Every member of a cluster is closer to its own cluster than to any other cluster.
K-Means
• Assumes instances are real-valued vectors.
• Clusters are based on centroids, i.e. the mean of the points in a cluster c:
  \mu(c) = \frac{1}{|c|} \sum_{x \in c} x
• Reassignment of instances to clusters is based on distance to the current cluster centroids (a short sketch follows below).
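A minimal NumPy sketch of the centroid computation above (the function name and the example points are illustrative, not from the slides):

import numpy as np

def centroid(cluster_points):
    # mu(c) = (1/|c|) * sum of all x in c, i.e. the mean of the cluster's points
    cluster_points = np.asarray(cluster_points, dtype=float)
    return cluster_points.mean(axis=0)

# toy example: three 2-D instances assigned to the same cluster
points = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
print(centroid(points))   # [3. 2.]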
Distance Metrics
• Euclidean distance (L2 norm):
  L_2(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}
• L1 norm:
  L_1(x, y) = \sum_{i=1}^{m} |x_i - y_i|
• Cosine similarity (as a distance):
  1 - \frac{x \cdot y}{\|x\| \, \|y\|}
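These three metrics can be written directly in NumPy; this is a small illustrative sketch (the function names are assumptions, not part of the slides):

import numpy as np

def l2_distance(x, y):
    # Euclidean (L2) distance: square root of the sum of squared coordinate differences
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sqrt(np.sum((x - y) ** 2))

def l1_distance(x, y):
    # L1 (Manhattan) distance: sum of absolute coordinate differences
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sum(np.abs(x - y))

def cosine_distance(x, y):
    # cosine similarity turned into a distance: 1 - (x . y) / (|x| |y|)
    x, y = np.asarray(x, float), np.asarray(y, float)
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x, y = [1.0, 0.0, 2.0], [0.0, 1.0, 2.0]
print(l2_distance(x, y), l1_distance(x, y), cosine_distance(x, y))
# approx 1.414, 2.0, 0.2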
K-Means
Let d be the distance measure between instances.
Select k random instances {s_1, s_2, ..., s_k} as seeds.
Until clustering converges or other stopping criterion:
    For each instance x_i:
        Assign x_i to the cluster c_j such that d(x_i, s_j) is minimal.
    (Update the seeds to the centroid of each cluster)
    For each cluster c_j:
        s_j = \mu(c_j)
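A runnable sketch of this pseudocode, assuming Euclidean distance as d; the helper name k_means and the toy data are illustrative, not from the slides:

import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # select k random instances as the initial seeds s_1 .. s_k
    seeds = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # assign each instance x_i to the cluster c_j with the nearest seed
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update each seed to the centroid mu(c_j) of its cluster
        new_seeds = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else seeds[j]
                              for j in range(k)])
        if np.allclose(new_seeds, seeds):   # converged: seeds stopped moving
            break
        seeds = new_seeds
    return labels, seeds

X = [[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]]
labels, centroids = k_means(X, k=2)
print(labels)
print(centroids)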
K-means
K-Means Example (K=2)
[Figure: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged]
K-means
[Figure: four 2-D scatter plots, axes 0–10]
Hierarchical Clustering
[Figure: dendrogram over items a, b, c, d, e. Agglomerative clustering runs Step 0 → Step 4, merging a+b, d+e, then c+(d,e), then (a,b)+(c,d,e); divisive clustering runs in the reverse direction, Step 4 → Step 0.]
Hierarchical Clustering
• Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item.
• Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one cluster less.
• Compute distances (similarities) between the new cluster and each of the old clusters.
• Repeat steps 2 and 3 until all items are clustered into a single cluster of size N (see the sketch below). (*)
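A compact sketch of these steps over a precomputed distance matrix; the function name and the single-linkage rule (cluster distance = minimum pairwise distance) are assumptions chosen for illustration, and other linkage rules are possible:

import numpy as np

def agglomerate(D, names):
    D = np.asarray(D, dtype=float)
    # step 1: start with one cluster per item
    clusters = [[i] for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        # step 2: find the closest pair of clusters (single linkage)
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, [names[i] for i in clusters[a]], [names[i] for i in clusters[b]]))
        # step 3: merge the pair and repeat
        merged = clusters[a] + clusters[b]
        del clusters[b]
        clusters[a] = merged
    return merges

D = [[0, 2, 9],
     [2, 0, 5],
     [9, 5, 0]]
for level, left, right in agglomerate(D, ["a", "b", "c"]):
    print(level, left, right)   # merges a+b at 2, then (a,b)+c at 5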
Hierarchical Clustering
• Input distance matrix:
      BA   FI   MI   NA   RM   TO
BA     0  662  877  255  412  996
FI   662    0  295  468  268  400
MI   877  295    0  754  564  138
NA   255  468  754    0  219  869
RM   412  268  564  219    0  669
TO   996  400  138  869  669    0
Hierarchical Clustering
• The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1.
Hierarchical Clustering
        BA   FI  MI/TO   NA   RM
BA       0  662    877  255  412
FI     662    0    295  468  268
MI/TO  877  295      0  754  564
NA     255  468    754    0  219
RM     412  268    564  219    0
Hierarchical Clustering
• min d(i,j) = d(NA, RM) = 219 => merge NA and RM into a new cluster called NA/RM; L(NA/RM) = 219, m = 2.
Hierarchical Clustering
        BA   FI  MI/TO  NA/RM
BA       0  662    877    255
FI     662    0    295    268
MI/TO  877  295      0    564
NA/RM  255  268    564      0
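Assuming SciPy is available, the merge sequence in this example (MI/TO at level 138, then NA/RM at 219, and so on) can be reproduced with scipy.cluster.hierarchy.linkage; "single" linkage matches the minimum-distance rule used in the updated tables above:

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

names = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0],
], dtype=float)

# linkage expects a condensed distance vector; squareform converts the square matrix
Z = linkage(squareform(D), method="single")
print(Z)   # first two rows merge MI+TO at 138.0 and NA+RM at 219.0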