Unsupervised Learning:
K-Means & Gaussian Mixture Models
Unsupervised Learning
• Supervised learning used labeled data pairs (x, y) to learn a function f : X → Y
• But what if we don't have labels?
  – No labels = unsupervised learning
  – Only some points are labeled = semi-supervised learning
    • Labels may be expensive to obtain, so we only get a few
• Clustering is the unsupervised grouping of data points. It can be used for knowledge discovery.
K-Means Clustering
Clustering Data
K-Means Clustering
K-Means(k, X)
• Randomly choose k cluster center locations (centroids)
• Loop until convergence:
  – Assign each point to the cluster of the closest centroid
  – Re-estimate the cluster centroids based on the data assigned to each cluster
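A minimal NumPy sketch of this loop, assuming Euclidean distance and initial centroids drawn at random from the data; the function and variable names are illustrative, not from the slides:

import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """Lloyd's algorithm: X is an (n, d) array of data points."""
    rng = np.random.default_rng(seed)
    # Randomly choose k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: index of the closest centroid for each point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: re-estimate each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return centroids, labels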
K-Means Animation
Example generated by Andrew Moore using Dan Pelleg's super-duper fast K-means system:
Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases, 1999.
K-Means Objective Function
• K-means finds a local optimum of the following objective function:
$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert_2^2$$
where $S = \{S_1, \ldots, S_k\}$ is a partitioning over $X = \{x_1, \ldots, x_n\}$ s.t. $X = \bigcup_{i=1}^{k} S_i$ and $\mu_i = \mathrm{mean}(S_i)$
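As a sketch, this objective can be scored directly for a given clustering (a hypothetical helper, using the same NumPy conventions as the earlier code):

import numpy as np

def kmeans_objective(X, labels, centroids):
    """Within-cluster sum of squared L2 distances: sum_i sum_{x in S_i} ||x - mu_i||^2."""
    return sum(
        np.sum((X[labels == i] - centroids[i]) ** 2)
        for i in range(len(centroids))
    )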
Problems with K-Means
• Very sensitive to the initial points
  – Do many runs of K-Means, each with different initial centroids
  – Seed the centroids using a better method than randomly choosing them
    • e.g., farthest-first sampling (sketched after this list)
• Must manually choose k
  – Learn the optimal k for the clustering
    • Note that this requires a performance measure
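A minimal sketch of farthest-first seeding, under the assumption of Euclidean distance (the function name is illustrative): pick one seed at random, then repeatedly pick the point farthest from its nearest already-chosen seed.

import numpy as np

def farthest_first_seeds(X, k, seed=0):
    """Pick k seed centroids; each new seed maximizes distance to its nearest existing seed."""
    rng = np.random.default_rng(seed)
    seeds = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Distance from every point to its nearest chosen seed
        dists = np.min([np.linalg.norm(X - s, axis=1) for s in seeds], axis=0)
        seeds.append(X[dists.argmax()])
    return np.array(seeds)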
Problems with K-Means
• How do you tell it which clustering you want? (e.g., k = 2)
  – Constrained clustering techniques (semi-supervised):
    • Same-cluster constraint (must-link)
    • Different-cluster constraint (cannot-link)
Gaussian Mixture Models
• Recall the Gaussian distribution:
$$P(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \lvert\Sigma\rvert}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)$$
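As a sanity check, a direct implementation of this density compared against scipy.stats.multivariate_normal (a standard API); the helper name and test values are ours:

import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density at x, straight from the formula above."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# Sanity check against scipy's implementation
x, mu = np.array([1.0, 2.0]), np.zeros(2)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
assert np.isclose(gaussian_pdf(x, mu, Sigma), multivariate_normal.pdf(x, mean=mu, cov=Sigma))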
The GMM assumption
• There are k components. The i'th component is called ωi
• Component ωi has an associated mean vector µi
• Each component generates data from a Gaussian with mean µi and covariance matrix σ²I
• Assume that each datapoint is generated according to the following recipe:
  1. Pick a component at random. Choose component i with probability P(ωi).
  2. Datapoint ~ N(µi, σ²I)
[Figure: a datapoint x drawn near one of the three component means µ1, µ2, µ3]
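A minimal NumPy sketch of this generative recipe, assuming spherical covariance σ²I (names are illustrative; priors are assumed to sum to 1):

import numpy as np

def sample_gmm(n, means, priors, sigma, seed=0):
    """Draw n points: pick component i w.p. priors[i], then draw from N(means[i], sigma^2 I)."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means)
    d = means.shape[1]
    # Step 1: pick a component at random for each datapoint
    comps = rng.choice(len(means), size=n, p=priors)
    # Step 2: datapoint ~ N(mu_i, sigma^2 I)
    return means[comps] + sigma * rng.standard_normal((n, d)), comps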
The General GMM assumption
• There are k components. The i'th component is called ωi
• Component ωi has an associated mean vector µi
• Each component generates data from a Gaussian with mean µi and covariance matrix Σi
• Assume that each datapoint is generated according to the following recipe:
  1. Pick a component at random. Choose component i with probability P(ωi).
  2. Datapoint ~ N(µi, Σi)
[Figure: three component means µ1, µ2, µ3 with full-covariance ellipses]
Fitting a Gaussian Mixture Model
(Optional)
Expectation-Maximization for GMMs
Iterate until convergence. On the t'th iteration, let our estimates be
$$\lambda_t = \{ \mu_1(t), \mu_2(t), \ldots, \mu_c(t) \}$$
E-step: Compute the "expected" classes of all datapoints for each class:
$$P(\omega_i \mid x_k, \lambda_t) = \frac{p(x_k \mid \omega_i, \lambda_t)\, P(\omega_i \mid \lambda_t)}{p(x_k \mid \lambda_t)} = \frac{p(x_k \mid \omega_i, \mu_i(t), \sigma^2 I)\, p_i(t)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \mu_j(t), \sigma^2 I)\, p_j(t)}$$
(The numerator just evaluates a Gaussian at $x_k$; $p_i(t)$ is shorthand for the estimate of $P(\omega_i)$ on the t'th iteration.)
M-step: Estimate µ given our data's class membership distributions:
$$\mu_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(\omega_i \mid x_k, \lambda_t)}$$
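A compact sketch of one such iteration, assuming (as on this slide) that σ is known and fixed and λt holds only the means; all names are illustrative:

import numpy as np

def em_step_spherical(X, mus, priors, sigma):
    """One EM iteration for a GMM with covariances fixed at sigma^2 I.
    X is (n, d); mus is (c, d); priors and sigma stay fixed, as on the slide."""
    # E-step: responsibilities P(w_i | x_k, lambda_t) via Bayes' rule
    sq_dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (n, c)
    likelihoods = np.exp(-0.5 * sq_dists / sigma**2)  # Gaussian at x_k; shared constants cancel below
    resp = likelihoods * priors                       # times p_i(t)
    resp /= resp.sum(axis=1, keepdims=True)           # normalize over the c components
    # M-step: each mean is the responsibility-weighted average of the data
    new_mus = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return new_mus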
E.M. for General GMMs
Iterate. On the t'th iteration, let our estimates be
$$\lambda_t = \{ \mu_1(t), \ldots, \mu_c(t), \Sigma_1(t), \ldots, \Sigma_c(t), p_1(t), \ldots, p_c(t) \}$$
E-step: Compute the "expected" clusters of all datapoints:
$$P(\omega_i \mid x_k, \lambda_t) = \frac{p(x_k \mid \omega_i, \lambda_t)\, P(\omega_i \mid \lambda_t)}{p(x_k \mid \lambda_t)} = \frac{p(x_k \mid \omega_i, \mu_i(t), \Sigma_i(t))\, p_i(t)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \mu_j(t), \Sigma_j(t))\, p_j(t)}$$
M-step: Estimate µ, Σ given our data's class membership distributions:
$$\mu_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(\omega_i \mid x_k, \lambda_t)}$$
$$\Sigma_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, [x_k - \mu_i(t+1)][x_k - \mu_i(t+1)]^\top}{\sum_k P(\omega_i \mid x_k, \lambda_t)}$$
$$p_i(t+1) = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)}{R}, \quad \text{where } R = \#\text{records}$$
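Extending the earlier sketch to these general updates, a hedged implementation rather than a definitive one (full covariances; scipy.stats.multivariate_normal supplies the E-step densities):

import numpy as np
from scipy.stats import multivariate_normal

def em_step_general(X, mus, Sigmas, priors):
    """One EM iteration for a GMM with full covariance matrices."""
    n, c = len(X), len(mus)
    # E-step: responsibilities P(w_i | x_k, lambda_t)
    resp = np.stack([
        priors[i] * multivariate_normal.pdf(X, mean=mus[i], cov=Sigmas[i])
        for i in range(c)
    ], axis=1)                                  # (n, c)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: weighted mean, covariance, and prior per component
    weights = resp.sum(axis=0)                  # sum_k P(w_i | x_k, lambda_t)
    new_mus = (resp.T @ X) / weights[:, None]
    new_Sigmas = []
    for i in range(c):
        diff = X - new_mus[i]                   # uses mu_i(t+1), as in the slide formula
        new_Sigmas.append((resp[:, i, None] * diff).T @ diff / weights[i])
    new_priors = weights / n                    # R = #records
    return new_mus, np.array(new_Sigmas), new_priors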
(End optional section)
Gaussian Mixture Example: Start
(Advance apologies: in black and white this example will be incomprehensible.)
[Figure sequence: the fitted mixture after the first, 2nd, and 3rd iterations]