
Data Mining with R
Clustering
Hugh Murrell


reference books
These slides are based on a book by Graham Williams:
Data Mining with Rattle and R,
The Art of Excavating Data for Knowledge Discovery.
For further background on decision trees try Andrew Moore's slides, and as always, Wikipedia is a useful source of information.


clustering
Clustering is one of the core tools that is used by the data
miner.
Clustering gives us the opportunity to group observations in a
generally unguided fashion according to how similar they are.
This is done on the basis of a measure of the distance between
observations.
The aim of clustering is to identify groups of observations that
are close together but as a group are quite separate from other
groups.


k-means clustering
Given a set of observations (x_1, x_2, . . . , x_n), where each
observation is a d-dimensional real vector, k-means clustering
aims to partition the n observations into k sets (S_1, S_2, . . . , S_k)
so as to minimize the within-cluster sum of squares:

    \sum_{i=1}^{k} \sum_{x_j \in S_i} || x_j - \mu_i ||^2

where \mu_i is the mean of the observations in S_i.
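
The objective is easy to compute in R for any given assignment of observations
to clusters. The function below is only a minimal sketch; the names wss, x and
cluster are illustrative and not part of any package.

# within-cluster sum of squares for a given clustering
# x       : numeric matrix, one observation per row
# cluster : vector of cluster labels, one per row of x
wss <- function(x, cluster) {
  total <- 0
  for (i in unique(cluster)) {
    xi <- x[cluster == i, , drop = FALSE]     # observations in cluster i
    mu <- colMeans(xi)                        # cluster mean
    total <- total + sum(sweep(xi, 2, mu)^2)  # add squared distances to the mean
  }
  total
}

For example, wss(as.matrix(iris[, -5]), kmeans(iris[, -5], 3)$cluster) should
agree with the tot.withinss component returned by R's built-in kmeans.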


k-means algorithm
Given an initial set of k means, m1 , . . . , mk , the algorithm
proceeds by alternating between two steps:
Assignment step: Assign each observation to the
cluster whose mean is closest to it.
Update step: Calculate the new means to be the
centroids of the observations in the new clusters.
The algorithm has converged when the assignments no longer
change.
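
For concreteness, the two alternating steps can be written out directly in R.
This is a bare-bones sketch for illustration only: the name kmeans_sketch is made
up, empty clusters are not handled, and R's built-in kmeans should be preferred
in practice. It assumes a numeric data matrix x and an initial k x d matrix of
means m.

kmeans_sketch <- function(x, m, max.iter = 100) {
  x <- as.matrix(x); m <- as.matrix(m)
  cluster <- rep(0, nrow(x))
  for (iter in 1:max.iter) {
    # assignment step: label each observation with its closest mean
    d <- as.matrix(dist(rbind(m, x)))[-(1:nrow(m)), 1:nrow(m)]
    new.cluster <- apply(d, 1, which.min)
    if (all(new.cluster == cluster)) break   # assignments unchanged: converged
    cluster <- new.cluster
    # update step: the new means are the centroids of the new clusters
    for (i in 1:nrow(m))
      m[i, ] <- colMeans(x[cluster == i, , drop = FALSE])
  }
  list(centers = m, cluster = cluster)
}

A call such as kmeans_sketch(iris[, -5], iris[sample(1:150, 3), -5]) starts the
iteration from three randomly chosen observations as the initial means.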


variants of k-means
As it stands, the k-means algorithm gives different results
depending on how the initial means are chosen. Thus there
have been a number of attempts in the literature to address
this problem.
The cluster package in R implements three variants of
k-means.
pam: partitioning around medoids
clara: clustering large applications
fanny: fuzzy analysis clustering

In the next slide, we outline the k-medoids algorithm which is
implemented as the function pam.
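
All three are called in much the same way; as a rough sketch (using the iris
measurements that appear later in these slides, with default settings everywhere):

> library(cluster)
> dat <- iris[, -5]            # numeric measurements only
> pam(dat, 3)                  # partitioning around medoids
> clara(dat, 3)                # PAM applied to subsamples, for large data sets
> fanny(dat, 3)                # fuzzy clustering: degrees of membership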


partitioning around medoids
Initialize by randomly selecting k of the n data points as the medoids.
Associate each data point with the closest medoid.
For each medoid m and each non-medoid data point o:
    swap m and o and compute the total cost of the configuration.
Select the configuration with the lowest cost.
Repeat until there is no change in the medoids.
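
The swap search above translates almost line for line into R. The sketch below
is purely illustrative: the name pam_sketch is made up, the cost of a
configuration is taken to be the sum of distances from each point to its
nearest medoid, and the cluster package's pam is far more efficient.

pam_sketch <- function(d, k) {
  d <- as.matrix(d)                        # full distance matrix
  n <- nrow(d)
  cost <- function(med) sum(apply(d[, med, drop = FALSE], 1, min))
  medoids <- sample(n, k)                  # random initial medoids
  repeat {
    best <- medoids
    for (m in medoids) {                   # for each medoid m ...
      for (o in setdiff(1:n, medoids)) {   # ... and each non-medoid o
        cand <- c(setdiff(medoids, m), o)  # swap m and o
        if (cost(cand) < cost(best)) best <- cand
      }
    }
    if (setequal(best, medoids)) break     # no improving swap: stop
    medoids <- best
  }
  list(medoids = medoids, clustering = apply(d[, medoids], 1, which.min))
}

It would be called as, for example, pam_sketch(dist(iris[, -5]), 3).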


distance measures
There are a number of ways to measure closeness when
implementing the k-medoids algorithm.
Euclidean distance:  d(u, v) = ( \sum_i (u_i - v_i)^2 )^{1/2}
Manhattan distance:  d(u, v) = \sum_i |u_i - v_i|
Minkowski distance:  d(u, v) = ( \sum_i |u_i - v_i|^p )^{1/p}

Note that Minkowski distance is a generalization of the other
two distance measures, with p = 2 giving Euclidean distance
and p = 1 giving Manhattan (or taxi-cab) distance.
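
All three measures are available in R through the built-in dist function; a
quick check on two small vectors (the numbers are arbitrary):

> u <- c(1, 2, 3); v <- c(4, 6, 3)
> dist(rbind(u, v), method = "euclidean")          # (9 + 16 + 0)^(1/2) = 5
> dist(rbind(u, v), method = "manhattan")          # 3 + 4 + 0 = 7
> dist(rbind(u, v), method = "minkowski", p = 3)   # (27 + 64 + 0)^(1/3)
> dist(rbind(u, v), method = "minkowski", p = 2)   # same as the euclidean result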


example data set
For purposes of demonstration we will again make use of the
classic iris data set from R’s datasets collection.

> summary(iris$Species)
    setosa versicolor  virginica
        50         50         50

Can we throw away the Species attribute and recover it
through unsupervised learning?


partitioning the iris dataset
> library(cluster)             # load package
> dat <- iris[, -5]            # drop known Species
> pam.result <- pam(dat, 3)    # perform k-medoids
> pam.result$clustering        # print the clustering

[ Output: the cluster label (1, 2 or 3) assigned to each of the 150 observations. ]

success rate

> # how many does it get wrong?
> sum(pam.result$clustering != as.numeric(iris$Species))
[1] 16
> # plot the clusters and produce a cluster silhouette
> par(mfrow=c(2,1))
> plot(pam.result)

In the silhouette plot, a large s_i (close to 1) suggests that the
corresponding observation is well clustered, a small s_i (around 0)
means that the observation lies between two clusters, and
observations with a negative s_i are probably in the wrong cluster.
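
The individual silhouette widths can also be inspected directly if you want to
list the poorly clustered observations (a sketch, assuming the pam.result object
from the previous slide):

> si <- silhouette(pam.result)     # one row per observation
> summary(si)                      # average silhouette width per cluster
> which(si[, "sil_width"] < 0)     # observations probably in the wrong cluster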


cluster plot

[ Figure: clusplot(pam(x = dat, k = 3)), the observations plotted against the
  first two principal components, which explain 95.81 % of the point variability. ]

[ Figure: silhouette plot of pam(x = dat, k = 3), n = 150, 3 clusters C_j
  with sizes n_j and average silhouette widths ave_{i in C_j} s_i:
      1 : 50 | 0.80
      2 : 62 | 0.42
      3 : 38 | 0.45
  Average silhouette width : 0.55 ]


hierarchical clustering
In (agglomerative) hierarchical clustering, each object starts in its own
cluster and the algorithm proceeds iteratively, at each
stage joining the two most similar clusters, continuing until
there is just a single cluster.
At each stage distances between clusters are recomputed by a
dissimilarity formula according to the particular clustering
method being used.
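
As an aside (using R's built-in hclust routine, which is introduced on the next
slide), the particular dissimilarity formula is selected through the method
argument; a small sketch on the iris measurements:

> d <- dist(iris[, -5])
> hc.single   <- hclust(d, method = "single")     # nearest-neighbour linkage
> hc.complete <- hclust(d, method = "complete")   # furthest-neighbour linkage
> hc.average  <- hclust(d, method = "average")    # average pairwise dissimilarity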


hierarchical clustering of iris dataset
The cluster package in R implements two variants of
hierarchical clustering.
agnes: AGglomerative NESting
diana: DIvisive ANAlysis Clustering
However, R has a built-in hierarchical clustering routine called
hclust (equivalent to agnes) which we will use to cluster the
iris data set.
> dat <- iris[, -5]
> # perform hierarchical clustering
> hc <- hclust(dist(dat), "ave")
> # plot the dendrogram
> plclust(hc, hang = -2)


cluster plot

[ Figure: dendrogram produced by hclust(dist(dat), "average") for the 150 iris
  observations, with observation numbers on the leaves and Height (0 to 4) on
  the vertical axis. ]

Similar to the k-medoids clustering, hclust shows that the
setosa cluster is easily separated from the other two, and that
the versicolor and virginica clusters overlap with each other to
a small degree.


success rate
> # how many does it get wrong?
> clusGroup <- cutree(hc, k=3)
> sum(clusGroup != as.numeric(iris$Species))
[1] 14
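
Note that comparing the labels directly, as above, only works because the
cluster numbering happens to line up with R's integer coding of Species; a
cross-tabulation shows the agreement without relying on that (a sketch,
assuming the hc and pam.result objects from the earlier slides):

> table(cutree(hc, k=3), iris$Species)          # hierarchical clusters vs Species
> table(pam.result$clustering, iris$Species)    # k-medoids clusters vs Species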


exercises
By invitation only:
Revisit the wine dataset from my website. This time discard
the Cultivar variable.
Use the pam routine from the cluster package to derive 3
clusters for the wine dataset. Plot the clusters in a 2D plane,
and compute and report the success rate of your chosen
method.
Also perform a hierarchical clustering of the wine dataset and
measure its performance at the 3-cluster level.
Email your wine clustering script to me by Monday the 9th
May, 06h00.


