
Lesson 8 Slides: Unsupervised Learning with K-Means and Gaussian Mixture Models


Unsupervised Learning:
K-Means & Gaussian Mixture Models


Unsupervised Learning

• Supervised learning uses labeled data pairs (x, y) to learn a function f : X → Y

• But what if we don't have labels?
  – No labels = unsupervised learning
  – Labels may be expensive to obtain, so we only get a few:
    only some points are labeled = semi-supervised learning

• Clustering is the unsupervised grouping of data points. It can be used for knowledge discovery.



K-Means Clustering


Clustering Data


K-Means Clustering

K-Means(k, X)

• Randomly choose k cluster center locations (centroids)

• Loop until convergence:
  – Assign each point to the cluster of the closest centroid
  – Re-estimate the cluster centroids based on the data assigned to each cluster

(A runnable sketch of this loop follows below.)
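The loop above translates almost line for line into NumPy. The sketch below is illustrative rather than code from the slides; the function name, the random seeding, and the assignment-stability convergence test are assumptions.

import numpy as np

def k_means(k, X, max_iters=100, seed=0):
    """Minimal K-Means sketch. X is an (n, d) array; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Randomly choose k cluster center locations (centroids)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignments = None
    for _ in range(max_iters):
        # Assign each point to the cluster of the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # converged: assignments stopped changing
        assignments = new_assignments
        # Re-estimate the cluster centroids based on the data assigned to each cluster
        for i in range(k):
            if np.any(assignments == i):
                centroids[i] = X[assignments == i].mean(axis=0)
    return centroids, assignments

For practical use, scikit-learn's sklearn.cluster.KMeans adds k-means++ seeding and multiple restarts, which helps with the initialization issues discussed later in these slides.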




K-Means Animation

Example generated by Andrew Moore using Dan Pelleg's super-duper fast K-means system:

Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases, 1999.


K-Means Objective Function

• K-means finds a local optimum of the following objective function:
$$\arg\min_{S} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert_2^2$$

where $S = \{S_1, \dots, S_k\}$ is a partitioning over $X = \{x_1, \dots, x_n\}$ such that $X = \bigcup_{i=1}^{k} S_i$, and $\mu_i = \mathrm{mean}(S_i)$.
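For a given partitioning this objective is easy to evaluate directly, which is handy for comparing runs started from different seeds. A small sketch with illustrative names:

import numpy as np

def kmeans_objective(X, assignments, centroids):
    """Sum over clusters of squared distances from points to their assigned centroid."""
    diffs = X - centroids[assignments]   # x - mu_i for each point's assigned cluster
    return float(np.sum(diffs ** 2))     # sum_i sum_{x in S_i} ||x - mu_i||_2^2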


Problems with K-Means

• Very sensitive to the initial points
  – Do many runs of K-Means, each with different initial centroids
  – Seed the centroids using a better method than randomly choosing the centroids
    • e.g., farthest-first sampling (see the sketch below)

• Must manually choose k
  – Learn the optimal k for the clustering
    • Note that this requires a performance measure

Problems with K-Means

• How do you tell it which clustering you want?
  (Figure: the same data clustered with k = 2)

• Constrained clustering techniques (semi-supervised):
  – Same-cluster constraint (must-link)
  – Different-cluster constraint (cannot-link)


Gaussian Mixture Models

• Recall the Gaussian distribution:



$$P(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \, \lvert \Sigma \rvert}} \exp\!\left( -\tfrac{1}{2} (x - \mu)^{\mathsf{T}} \Sigma^{-1} (x - \mu) \right)$$
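Evaluated literally, the density above is a few lines of NumPy; the sketch below is illustrative (scipy.stats.multivariate_normal provides the same density with better numerical handling):

import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density P(x | mu, Sigma) for a d-dimensional point x."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    # (x - mu)^T Sigma^{-1} (x - mu), via a linear solve instead of an explicit inverse
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))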




The GMM assumption

• There are k components. The i'th component is called ωi

• Component ωi has an associated mean vector µi

• Each component generates data from a Gaussian with mean µi and covariance matrix σ²I

Assume that each datapoint is generated according to the following recipe:

1. Pick a component at random. Choose component i with probability P(ωi).

2. Datapoint ~ N(µi, σ²I)


The General GMM assumption

• There are k components. The i'th component is called ωi

• Component ωi has an associated mean vector µi

• Each component generates data from a Gaussian with mean µi and covariance matrix Σi

Assume that each datapoint is generated according to the following recipe:

1. Pick a component at random. Choose component i with probability P(ωi).

2. Datapoint ~ N(µi, Σi)

(A sampler implementing this recipe follows below.)
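The two-step recipe is exactly a sampler. A sketch for the general case; the parameter names are illustrative, and priors, means, and covs are assumed to hold the per-component P(ωi), µi, and Σi:

import numpy as np

def sample_gmm(n, priors, means, covs, seed=0):
    """Draw n points from a GMM: pick component i with probability P(omega_i),
    then draw the point from N(mu_i, Sigma_i)."""
    rng = np.random.default_rng(seed)
    components = rng.choice(len(priors), size=n, p=priors)   # step 1: pick components
    points = np.array([
        rng.multivariate_normal(means[i], covs[i])           # step 2: x ~ N(mu_i, Sigma_i)
        for i in components
    ])
    return points, components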


Fitting a Gaussian Mixture Model
(Optional)


Expectation-Maximization for GMMs

Iterate until convergence. On the t'th iteration let our estimates be

λt = { µ1(t), µ2(t), …, µc(t) }

E-step: Compute the "expected" classes of all datapoints for each class.
(The densities in the numerator and denominator are just Gaussians evaluated at xk.)

$$P(\omega_i \mid x_k, \lambda_t) = \frac{p(x_k \mid \omega_i, \lambda_t)\, P(\omega_i \mid \lambda_t)}{p(x_k \mid \lambda_t)} = \frac{p(x_k \mid \omega_i, \mu_i(t), \sigma^2 I)\, p_i(t)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \mu_j(t), \sigma^2 I)\, p_j(t)}$$

M-step: Estimate µ given our data's class membership distributions:

$$\mu_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(\omega_i \mid x_k, \lambda_t)}$$
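A small sketch of this E-step for the spherical model (every component shares covariance σ²I); mus is assumed to be a (c, d) array of current means, priors a length-c NumPy array of pi(t), and sigma2 the shared variance:

import numpy as np

def e_step_spherical(X, mus, priors, sigma2):
    """Responsibilities P(omega_i | x_k, lambda_t) for a GMM with covariance sigma2 * I."""
    d = X.shape[1]
    sq_dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)          # (n, c)
    densities = np.exp(-0.5 * sq_dists / sigma2) / np.sqrt((2 * np.pi * sigma2) ** d)
    weighted = densities * priors[None, :]                                   # times p_i(t)
    return weighted / weighted.sum(axis=1, keepdims=True)                    # normalize over components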


E.M. for General GMMs

(pi(t) is shorthand for the estimate of P(ωi) on the t'th iteration.)

Iterate. On the t'th iteration let our estimates be

λt = { µ1(t), µ2(t), …, µc(t), Σ1(t), Σ2(t), …, Σc(t), p1(t), p2(t), …, pc(t) }

E-step: Compute the "expected" clusters of all datapoints.
(Again, the densities are just Gaussians evaluated at xk.)

$$P(\omega_i \mid x_k, \lambda_t) = \frac{p(x_k \mid \omega_i, \lambda_t)\, P(\omega_i \mid \lambda_t)}{p(x_k \mid \lambda_t)} = \frac{p(x_k \mid \omega_i, \mu_i(t), \Sigma_i(t))\, p_i(t)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \mu_j(t), \Sigma_j(t))\, p_j(t)}$$

M-step: Estimate µ, Σ given our data's class membership distributions:

$$\mu_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(\omega_i \mid x_k, \lambda_t)}$$

$$\Sigma_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, [x_k - \mu_i(t+1)]\,[x_k - \mu_i(t+1)]^{\mathsf{T}}}{\sum_k P(\omega_i \mid x_k, \lambda_t)}$$

$$p_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)}{R}, \qquad R = \#\text{records}$$
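Putting the E-step and M-step together, here is a compact NumPy sketch of these updates; the initialization, the fixed iteration count, and the small regularization added to each covariance are simplifying assumptions, not part of the slides:

import numpy as np

def em_gmm(X, c, iters=50, seed=0):
    """Fit a c-component GMM with full covariances by EM (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mus = X[rng.choice(n, size=c, replace=False)].astype(float)   # mu_i(0): random datapoints
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * c)       # Sigma_i(0): data covariance
    ps = np.full(c, 1.0 / c)                                      # p_i(0): uniform priors

    for _ in range(iters):
        # E-step: responsibilities P(omega_i | x_k, lambda_t)
        resp = np.empty((n, c))
        for i in range(c):
            diff = X - mus[i]
            inv = np.linalg.inv(Sigmas[i])
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigmas[i]))
            resp[:, i] = ps[i] * np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, inv, diff)) / norm
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate mu_i, Sigma_i, p_i from the responsibilities
        Nk = resp.sum(axis=0)                                     # effective count per component
        mus = (resp.T @ X) / Nk[:, None]
        for i in range(c):
            diff = X - mus[i]
            Sigmas[i] = (resp[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
        ps = Nk / n                                               # R = n records
    return mus, Sigmas, ps

In practice, sklearn.mixture.GaussianMixture implements the same updates with log-space responsibilities and covariance regularization, and is the sensible choice outside of teaching examples.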


(End optional section)


Gaussian Mixture Example: Start

Advance apologies: in black and white this example will be incomprehensible.


After first
iteration



After 2nd
iteration


After 3rd
iteration

