Clustering
Unsupervised learing introduction
K-means algorithm
Input:
- K (number of clusters)
- Training set
K-means algroithm
Randomly initialize K cluster centroids
Repeat {
for to m
:=index (from 1 to K) of cluster centroid closest to
for to K
:=average (mean) of points assigned to cluster
}
K-means for non-separated clusters
Optimization Objective
K-means optimization objective
= index of cluster (1,2,...,K) to which example is currently assigned.
= cluster centriod ()
= cluster centroid of cluster to which example has been assigned
Distortion:
Random Initalization
- Randomly pick K trianing examples.
- Set equal to these K examples.
Random initialization many times:
For i=1 to 100 {
Randomly initialize K-means.
Run K-means. Get
Compute cost function (distortion)
}
Pick clustering that gave lowest cost J.
, if K is very large, Randomly initilization many times may help less.
Choosing the number of clusters
- by hand (see the data set's figure)
- Elbow method: compute cost J form 1-..., and draw a figure of it. choose the number at the elbow.
- Evaluate K-means based on a metric for how well it performs for that later purpose. Choose a K that serve the purpose.