Unsupervised Learning:
K-Means & Gaussian Mixture Models
Unsupervised Learning
• Supervised learning used labeled data pairs (x, y) to learn a function f : X → Y
• But what if we don't have labels?
  – No labels = unsupervised learning
  – Only some points are labeled = semi-supervised learning
    • Labels may be expensive to obtain, so we only get a few
• Clustering is the unsupervised grouping of data points. It can be used for knowledge discovery.
K-Means Clustering
Clustering Data
K-Means Clustering
K-Means(k, X)
• Randomly choose k cluster center locations (centroids)
• Loop until convergence:
  – Assign each point to the cluster of the closest centroid
  – Re-estimate the cluster centroids based on the data assigned to each cluster
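A minimal NumPy sketch of this loop, assuming Euclidean distance and initial centroids drawn at random from the data; the function and variable names are illustrative, not from the slides:

import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """Lloyd's algorithm: X is an (n, d) array of data points."""
    rng = np.random.default_rng(seed)
    # Randomly choose k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: index of the closest centroid for each point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: re-estimate each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return centroids, labels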
K-Means Animation
Example generated by Andrew Moore using Dan Pelleg's super-duper fast K-means system:
Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases, 1999.
K-Means Objective Function
• K-means finds a local optimum of the following objective function:
$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert_2^2$$
where $S = \{S_1, \ldots, S_k\}$ is a partitioning over $X = \{x_1, \ldots, x_n\}$ s.t. $X = \bigcup_{i=1}^{k} S_i$ and $\mu_i = \mathrm{mean}(S_i)$
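As a sketch, this objective can be scored directly for a given clustering (a hypothetical helper, using the same NumPy conventions as the earlier code):

import numpy as np

def kmeans_objective(X, labels, centroids):
    """Within-cluster sum of squared L2 distances: sum_i sum_{x in S_i} ||x - mu_i||^2."""
    return sum(
        np.sum((X[labels == i] - centroids[i]) ** 2)
        for i in range(len(centroids))
    )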
Problems with K-Means
• Very sensitive to the initial points
  – Do many runs of K-Means, each with different initial centroids
  – Seed the centroids using a better method than randomly choosing them
    • e.g., farthest-first sampling (sketched after this list)
• Must manually choose k
  – Learn the optimal k for the clustering
    • Note that this requires a performance measure
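A minimal sketch of farthest-first seeding, under the assumption of Euclidean distance (the function name is illustrative): pick one seed at random, then repeatedly pick the point farthest from its nearest already-chosen seed.

import numpy as np

def farthest_first_seeds(X, k, seed=0):
    """Pick k seed centroids; each new seed maximizes distance to its nearest existing seed."""
    rng = np.random.default_rng(seed)
    seeds = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Distance from every point to its nearest chosen seed
        dists = np.min([np.linalg.norm(X - s, axis=1) for s in seeds], axis=0)
        seeds.append(X[dists.argmax()])
    return np.array(seeds)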
Problems with K-Means
• How do you tell it which clustering you want? (e.g., k = 2)
  – Constrained clustering techniques (semi-supervised):
    • Same-cluster constraint (must-link)
    • Different-cluster constraint (cannot-link)
Gaussian Mixture Models
• Recall the Gaussian distribution:
$$P(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \lvert\Sigma\rvert}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)$$
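As a sanity check, a direct implementation of this density compared against scipy.stats.multivariate_normal (a standard API); the helper name and test values are ours:

import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density at x, straight from the formula above."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# Sanity check against scipy's implementation
x, mu = np.array([1.0, 2.0]), np.zeros(2)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
assert np.isclose(gaussian_pdf(x, mu, Sigma), multivariate_normal.pdf(x, mean=mu, cov=Sigma))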
The GMM assumption
• There are k components. The i'th component is called ωi
• Component ωi has an associated mean vector µi
• Each component generates data from a Gaussian with mean µi and covariance matrix σ²I
• Assume that each datapoint is generated according to the following recipe:
  1. Pick a component at random. Choose component i with probability P(ωi).
  2. Datapoint ~ N(µi, σ²I)
[Figure: a datapoint x drawn near one of the three component means µ1, µ2, µ3]
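A minimal NumPy sketch of this generative recipe, assuming spherical covariance σ²I (names are illustrative; priors are assumed to sum to 1):

import numpy as np

def sample_gmm(n, means, priors, sigma, seed=0):
    """Draw n points: pick component i w.p. priors[i], then draw from N(means[i], sigma^2 I)."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means)
    d = means.shape[1]
    # Step 1: pick a component at random for each datapoint
    comps = rng.choice(len(means), size=n, p=priors)
    # Step 2: datapoint ~ N(mu_i, sigma^2 I)
    return means[comps] + sigma * rng.standard_normal((n, d)), comps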
The General GMM assumption
• There are k components. The i'th component is called ωi
• Component ωi has an associated mean vector µi
• Each component generates data from a Gaussian with mean µi and covariance matrix Σi
• Assume that each datapoint is generated according to the following recipe:
  1. Pick a component at random. Choose component i with probability P(ωi).
  2. Datapoint ~ N(µi, Σi)
[Figure: three component means µ1, µ2, µ3 with full-covariance ellipses]
Fitting a Gaussian Mixture Model
(Optional)
Expectation-Maximization for GMMs
Iterate until convergence. On the t'th iteration, let our estimates be
$$\lambda_t = \{ \mu_1(t), \mu_2(t), \ldots, \mu_c(t) \}$$
E-step: Compute the "expected" classes of all datapoints for each class:
$$P(\omega_i \mid x_k, \lambda_t) = \frac{p(x_k \mid \omega_i, \lambda_t)\, P(\omega_i \mid \lambda_t)}{p(x_k \mid \lambda_t)} = \frac{p(x_k \mid \omega_i, \mu_i(t), \sigma^2 I)\, p_i(t)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \mu_j(t), \sigma^2 I)\, p_j(t)}$$
(The numerator just evaluates a Gaussian at $x_k$; $p_i(t)$ is shorthand for the estimate of $P(\omega_i)$ on the t'th iteration.)
M-step: Estimate µ given our data's class membership distributions:
$$\mu_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(\omega_i \mid x_k, \lambda_t)}$$
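A compact sketch of one such iteration, assuming (as on this slide) that σ is known and fixed and λt holds only the means; all names are illustrative:

import numpy as np

def em_step_spherical(X, mus, priors, sigma):
    """One EM iteration for a GMM with covariances fixed at sigma^2 I.
    X is (n, d); mus is (c, d); priors and sigma stay fixed, as on the slide."""
    # E-step: responsibilities P(w_i | x_k, lambda_t) via Bayes' rule
    sq_dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (n, c)
    likelihoods = np.exp(-0.5 * sq_dists / sigma**2)  # Gaussian at x_k; shared constants cancel below
    resp = likelihoods * priors                       # times p_i(t)
    resp /= resp.sum(axis=1, keepdims=True)           # normalize over the c components
    # M-step: each mean is the responsibility-weighted average of the data
    new_mus = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return new_mus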
E.M. for General GMMs
Iterate. On the t'th iteration, let our estimates be
$$\lambda_t = \{ \mu_1(t), \ldots, \mu_c(t), \Sigma_1(t), \ldots, \Sigma_c(t), p_1(t), \ldots, p_c(t) \}$$
E-step: Compute the "expected" clusters of all datapoints:
$$P(\omega_i \mid x_k, \lambda_t) = \frac{p(x_k \mid \omega_i, \lambda_t)\, P(\omega_i \mid \lambda_t)}{p(x_k \mid \lambda_t)} = \frac{p(x_k \mid \omega_i, \mu_i(t), \Sigma_i(t))\, p_i(t)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \mu_j(t), \Sigma_j(t))\, p_j(t)}$$
M-step: Estimate µ, Σ given our data's class membership distributions:
$$\mu_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(\omega_i \mid x_k, \lambda_t)}$$
$$\Sigma_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, [x_k - \mu_i(t+1)][x_k - \mu_i(t+1)]^\top}{\sum_k P(\omega_i \mid x_k, \lambda_t)}$$
$$p_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)}{R}, \quad \text{where } R = \#\text{records}$$
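Extending the earlier sketch to these general updates, a hedged implementation rather than a definitive one (full covariances; scipy.stats.multivariate_normal supplies the E-step densities):

import numpy as np
from scipy.stats import multivariate_normal

def em_step_general(X, mus, Sigmas, priors):
    """One EM iteration for a GMM with full covariance matrices."""
    n, c = len(X), len(mus)
    # E-step: responsibilities P(w_i | x_k, lambda_t)
    resp = np.stack([
        priors[i] * multivariate_normal.pdf(X, mean=mus[i], cov=Sigmas[i])
        for i in range(c)
    ], axis=1)                                  # (n, c)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: weighted mean, covariance, and prior per component
    weights = resp.sum(axis=0)                  # sum_k P(w_i | x_k, lambda_t)
    new_mus = (resp.T @ X) / weights[:, None]
    new_Sigmas = []
    for i in range(c):
        diff = X - new_mus[i]                   # uses mu_i(t+1), as in the slide formula
        new_Sigmas.append((resp[:, i, None] * diff).T @ diff / weights[i])
    new_priors = weights / n                    # R = #records
    return new_mus, np.array(new_Sigmas), new_priors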
(End optional section)
Gaussian Mixture Example: Start
(Advance apologies: in black and white this example will be incomprehensible.)
[Figure sequence: the fitted mixture after the first, 2nd, and 3rd iterations]