
Machine Learning Clustering


Machine Learning
Clustering
Nguyen Thi Thu Ha
What is clustering
Clustering can be considered the most important unsupervised learning problem.
Another definition of clustering could be "the process of organizing objects into groups whose members are similar".
What is clustering
A cluster is therefore a collection of objects that are "similar" to one another and "dissimilar" to the objects belonging to other clusters.
What is clustering
In this case we identify the four clusters into which the data can be divided;
the similarity criterion is distance: two or more objects belong to the same cluster if they are "close" according to a given distance (this is called distance-based clustering).
What is clustering

Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if it defines a concept common to all those objects.
In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.
Why?
To determine the intrinsic grouping in a set of unlabeled data.
But what constitutes a good clustering?
Application

Marketing: finding groups of customers with similar behavior, given a large database of customer data containing their properties and past buying records;
Biology: classification of plants and animals given their features;
Libraries: book ordering;
Application


City-planning: identifying groups of houses according to their house type, value, and geographical location;
Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;
WWW: document classification; clustering weblog data to discover groups of similar access patterns.
Problems

Dealing with a large number of dimensions and a large number of data items;
The effectiveness of the method depends on the definition of "distance" (for distance-based clustering).
Classification of clustering algorithms
Exclusive Clustering
Overlapping Clustering
Hierarchical Clustering
Probabilistic Clustering

Classification of clustering algorithms
Four of the most used clustering algorithms:
K-means
Fuzzy C-means
Hierarchical clustering
Mixture of Gaussians
K-Means
K-Means Algorithm Properties
There are always K clusters.
There is always at least one item in each cluster.
The clusters are non-hierarchical and they do not overlap.
Every member of a cluster is closer to its cluster's centroid than to any other cluster's centroid.
K-Means

Assumes instances are real-valued vectors.
Clusters are based on centroids, i.e. the mean of the points in cluster c:

μ(c) = (1 / |c|) Σ_{x ∈ c} x

Reassignment of instances to clusters is based on distance to the current cluster centroids.
Distance Metrics
Euclidean distance (L2 norm):

L_2(x, y) = sqrt( Σ_{i=1}^{m} (x_i - y_i)^2 )

L1 norm:

L_1(x, y) = Σ_{i=1}^{m} |x_i - y_i|

Cosine similarity:

cos(x, y) = (x · y) / (‖x‖ ‖y‖)
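The three measures can be written directly from the formulas above. A minimal sketch in plain Python (the function names are mine, not from the slides):

```python
import math

def euclidean(x, y):
    # L2 norm: square root of the sum of squared coordinate differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def l1_norm(x, y):
    # L1 norm: sum of absolute coordinate differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine_similarity(x, y):
    # dot product divided by the product of the vector lengths
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny)

print(euclidean((0, 0), (3, 4)))   # 5.0
print(l1_norm((0, 0), (3, 4)))     # 7
```

Note that cosine similarity measures the angle between vectors, not their distance: two vectors pointing the same way have similarity 1 regardless of length.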
K-Means
Let d be the distance measure between instances.
Select k random instances {s_1, s_2, ..., s_k} as seeds.
Until clustering converges or another stopping criterion is met:
    For each instance x_i:
        Assign x_i to the cluster c_j such that d(x_i, s_j) is minimal.
    (Update the seeds to the centroid of each cluster.)
    For each cluster c_j:
        s_j = μ(c_j)
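The loop above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's reference implementation; the name `kmeans` and the fixed iteration cap are my assumptions:

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    # Select k random instances as seeds.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iters):
        # Assign each instance to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no instance changed cluster: converged
        assignment = new_assignment
        # Update each seed to the centroid (mean) of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return centroids, assignment

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, assignment = kmeans(points, k=2)
# the two nearby pairs end up in separate clusters
```

The `if members` guard skips empty clusters; in practice K-means implementations differ in how they handle a seed that attracts no points.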
K-Means Example (K = 2)
(Figure: pick seeds; reassign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged.)
K-means
(Figure: four scatter plots on a 0 to 10 grid showing successive K-means iterations.)
Hierarchical Clustering
(Figure: dendrogram over items a, b, c, d, e. Agglomerative clustering runs Step 0 to Step 4, merging a+b, then d+e, then c with d+e, and finally all five into a single cluster; divisive clustering runs the same steps in reverse, Step 4 to Step 0.)
Hierarchical Clustering
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one cluster less.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
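The four steps can be sketched as follows. This is a minimal single-linkage illustration; the linkage choice and the name `agglomerate` are my assumptions, since the steps above do not fix how cluster-to-cluster distance is computed:

```python
import math

def agglomerate(points):
    # Step 1: assign each item to its own cluster.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters
        # (single linkage: minimum pairwise point distance).
        d, i, j = min(
            (min(math.dist(a, b) for a in ci for b in cj), x, y)
            for x, ci in enumerate(clusters)
            for y, cj in enumerate(clusters) if x < y
        )
        merged = clusters[i] + clusters[j]
        merges.append((merged, d))
        # Step 3 is implicit here: distances to the new cluster are
        # recomputed from its members on the next pass.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        # Step 4: loop until a single cluster of size N remains.
    return merges

merges = agglomerate([(0, 0), (0, 1), (5, 5)])
# first merge joins (0, 0) and (0, 1) at distance 1.0
```

Other linkage choices (complete, average) change only the inner `min` over pairwise distances.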
Hierarchical Clustering

Input distance matrix
BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0
Hierarchical Clustering

The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1.
Hierarchical Clustering
       BA   FI  MI/TO  NA   RM
BA      0  662   877  255  412
FI    662    0   295  468  268
MI/TO 877  295     0  754  564
NA    255  468   754    0  219
RM    412  268   564  219    0
Hierarchical Clustering

min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called "NA/RM"
L(NA/RM) = 219
m = 2
Hierarchical Clustering
       BA   FI  MI/TO  NA/RM
BA      0  662   877    255
FI    662    0   295    268
MI/TO 877  295     0    564
NA/RM 255  268   564      0
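The worked example can be checked mechanically. A minimal sketch over the slide's distance matrix, using single linkage (minimum pairwise city distance), which is consistent with how the MI/TO row was updated; the function name `merge_levels` is mine:

```python
cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = [
    [0, 662, 877, 255, 412, 996],
    [662, 0, 295, 468, 268, 400],
    [877, 295, 0, 754, 564, 138],
    [255, 468, 754, 0, 219, 869],
    [412, 268, 564, 219, 0, 669],
    [996, 400, 138, 869, 669, 0],
]

def merge_levels(names, dist):
    # Repeatedly merge the nearest pair of clusters (single linkage)
    # and record each merge as (cluster_name, level).
    clusters = [[i] for i in range(len(names))]
    levels = []
    while len(clusters) > 1:
        d, a, b = min(
            (min(dist[i][j] for i in ci for j in cj), x, y)
            for x, ci in enumerate(clusters)
            for y, cj in enumerate(clusters) if x < y
        )
        merged = clusters[a] + clusters[b]
        levels.append(("/".join(names[i] for i in merged), d))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return levels

levels = merge_levels(cities, D)
# first merge: ("MI/TO", 138); second merge: ("NA/RM", 219)
```

The first two recorded levels reproduce L(MI/TO) = 138 and L(NA/RM) = 219 from the slides.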
