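Introduction
Clustering algorithms are unsupervised machine learning models: they learn an optimal partition of the data, or a discrete label for each point, directly from the intrinsic properties of the data.
The simplest is k-means clustering, which rests on two assumptions:
- The cluster center is the arithmetic mean of all the points in the cluster
- Each point is closer to its own cluster center than to any other cluster center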
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate 300 two-dimensional points scattered around 4 centers
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50)

from sklearn.cluster import KMeans

# Fit k-means with 4 clusters and label every point
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

# Color the points by cluster and overlay the cluster centers
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
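The two assumptions listed earlier can be checked numerically on a fitted model. The snippet below is a minimal sketch, not part of the original notes. It assumes `tol=0` and `n_init=10` (both chosen here) so that the fit stops only when the labels stop changing, and it rebuilds the same blob data so it can run on its own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin

# Same synthetic data as in the example above
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# tol=0 is an assumption made for this check: the fit then stops only
# once the assignments no longer change between iterations
km = KMeans(n_clusters=4, n_init=10, tol=0, random_state=0).fit(X)

# Assumption 1: each center is the arithmetic mean of its cluster's points
means = np.array([X[km.labels_ == k].mean(axis=0) for k in range(4)])
centers_are_means = np.allclose(means, km.cluster_centers_)

# Assumption 2: each point's nearest center is the one it is assigned to
nearest = pairwise_distances_argmin(X, km.cluster_centers_)
points_are_nearest = np.array_equal(nearest, km.labels_)

print(centers_are_means, points_are_nearest)
```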
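k-means identifies the four clusters automatically. It uses an expectation-maximization (EM) procedure:
- Guess some initial cluster centers
- Repeat until convergence:
  - E-step (expectation): assign each point to its nearest cluster center
  - M-step (maximization): move each cluster center to the mean of the points assigned to it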
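In symbols, the two steps can be read as alternating minimization of the within-cluster sum of squares. The notation is an assumption introduced here, not taken from the original notes: points $x_i$, centers $\mu_k$, and assignments $c_i$.

```latex
\begin{aligned}
J(c, \mu) &= \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert^2 \\
\text{E-step:}\quad c_i &\leftarrow \arg\min_{k} \lVert x_i - \mu_k \rVert^2 \\
\text{M-step:}\quad \mu_k &\leftarrow \frac{1}{\lvert \{ i : c_i = k \} \rvert} \sum_{i \,:\, c_i = k} x_i
\end{aligned}
```

Neither step can increase $J$, so the loop settles down, though possibly only at a local minimum.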
import numpy as np
from sklearn.metrics import pairwise_distances_argmin

def find_clusters(X, n_clusters, rseed=2):
    # 1. Randomly choose n_clusters data points as the initial centers
    rng = np.random.RandomState(rseed)
    i = rng.permutation(X.shape[0])[:n_clusters]
    centers = X[i]
    while True:
        # 2a. E-step: assign each point to its closest center
        labels = pairwise_distances_argmin(X, centers)
        # 2b. M-step: move each center to the mean of its points
        new_centers = np.array([X[labels == k].mean(0)
                                for k in range(n_clusters)])
        # 2c. Stop once the centers no longer move
        if np.all(centers == new_centers):
            break
        centers = new_centers
    return centers, labels

centers, labels = find_clusters(X, 4)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
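Drawbacks of k-means:
- The result is not guaranteed to be the global optimum
- The number of clusters must be specified in advance
- It can only find linear cluster boundaries
- It is slow on large datasets, because every iteration visits every point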
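The first two drawbacks can be seen on the blob data. This is a rough sketch rather than part of the original notes. `init="random"`, the seed range, and the use of the silhouette score as a heuristic for choosing the number of clusters are all assumptions made here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Drawback 1: a single random initialization may stop in a local optimum.
# inertia_ is the within-cluster sum of squared distances (lower is tighter).
inertias = [
    KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X).inertia_
    for seed in range(10)
]
print("single runs: best %.1f, worst %.1f" % (min(inertias), max(inertias)))

# Restarting several times and keeping the best run (n_init > 1)
# lowers the risk of a poor result but does not remove it.
best_of_ten = KMeans(n_clusters=4, init="random", n_init=10, random_state=0).fit(X)
print("best of 10 restarts: %.1f" % best_of_ten.inertia_)

# Drawback 2: k must be chosen up front. One common heuristic is to
# compare the silhouette score (between -1 and 1) across candidate k.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
for k, s in scores.items():
    print("k=%d  silhouette=%.3f" % (k, s))
```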
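Nonlinear boundaries can be handled by projecting the data into a higher-dimensional space with a kernel transformation. Spectral clustering does this by using a nearest-neighbor graph to compute a higher-dimensional representation of the data, then assigning labels with k-means: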
from sklearn.datasets import make_moons

# Plain k-means on two interleaved half-moons
X, y = make_moons(200, noise=.05, random_state=0)
labels = KMeans(2, random_state=0).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')

from sklearn.cluster import SpectralClustering

# Spectral clustering on a nearest-neighbor graph of the same data
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', assign_labels='kmeans')
labels = model.fit_predict(X)
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
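The two plots can also be compared with a number. The sketch below is not part of the original notes. It scores both labelings against the true half-moon membership using the adjusted Rand index, a metric chosen here as an assumption, and fixes `random_state` and `n_init` for repeatability.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(200, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",
    assign_labels="kmeans",
    random_state=0,
).fit_predict(X)

# The adjusted Rand index ignores how clusters are numbered;
# 1.0 means the grouping matches the true half-moons exactly.
ari_kmeans = adjusted_rand_score(y, km_labels)
ari_spectral = adjusted_rand_score(y, sc_labels)
print("k-means ARI:  %.3f" % ari_kmeans)
print("spectral ARI: %.3f" % ari_spectral)
```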
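Example: handwritten digits
Cluster the 1,797 samples, each with 64 dimensions, into 10 groups, then display the cluster centers, the accuracy, and the confusion matrix.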
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.data.shape)

# Cluster the 64-dimensional samples into 10 groups
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
print(kmeans.cluster_centers_.shape)

# Show each cluster center as an 8x8 image
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

# Cluster ids are arbitrary, so relabel each cluster with its most common true digit
from scipy.stats import mode
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]

from sklearn.metrics import accuracy_score
print(accuracy_score(digits.target, labels))

# Plot the confusion matrix of true versus predicted digits
import seaborn as sns
from sklearn.metrics import confusion_matrix
plt.figure()
mat = confusion_matrix(digits.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=digits.target_names,
            yticklabels=digits.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')
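k-means numbers its clusters arbitrarily, which is why the code above relabels each cluster with its most common true digit before computing accuracy. Here is a small sketch of that step on a toy array. The helper name `relabel_by_majority` is hypothetical, and `np.bincount` stands in for `scipy.stats.mode` purely for illustration.

```python
import numpy as np

def relabel_by_majority(clusters, targets):
    """Map each cluster id to the most frequent true label inside it."""
    labels = np.zeros_like(targets)
    for c in np.unique(clusters):
        mask = clusters == c
        # bincount counts occurrences of each non-negative integer label
        labels[mask] = np.bincount(targets[mask]).argmax()
    return labels

# Toy data: cluster 2 holds mostly 0s, cluster 0 holds 1s, cluster 1 holds 2s
clusters = np.array([2, 2, 2, 0, 0, 1, 1, 1])
targets = np.array([0, 0, 1, 1, 1, 2, 2, 2])

labels = relabel_by_majority(clusters, targets)
accuracy = (labels == targets).mean()
print(labels, accuracy)
```

Only the third point, a 1 that landed in the cluster dominated by 0s, ends up mislabeled.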
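Preprocessing with t-distributed stochastic neighbor embedding (t-SNE), which reduces the 64 dimensions to 2, improves the accuracy: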
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, init='pca', random_state=0)
digits_proj = tsne.fit_transform(digits.data)

kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits_proj)

labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]

print(accuracy_score(digits.target, labels))
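Example: image color compression
The image is stored in a three-dimensional array of shape (height, width, RGB), where each element is an integer from 0 to 255 giving the red, green, or blue intensity. Here the shape is (427, 640, 3).
Apply k-means clustering to the pixel space (the feature matrix) to reduce the image's many thousands of colors to 16. MiniBatchKMeans, which works on subsets of the dataset, is used because it is faster.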
from sklearn.datasets import load_sample_image

china = load_sample_image("china.jpg")
print(china.shape)

# Scale to [0, 1] and flatten to one row per pixel
data = china / 255
data = data.reshape(427 * 640, 3)
print(data.shape)

from sklearn.cluster import MiniBatchKMeans

# Cluster the pixel colors, then replace each pixel with its cluster center
kmeans = MiniBatchKMeans(16)
kmeans.fit(data)
new_colors = kmeans.cluster_centers_[kmeans.predict(data)]
china_recolored = new_colors.reshape(china.shape)

fig, ax = plt.subplots(1, 2, figsize=(16, 6),
                       subplot_kw=dict(xticks=[], yticks=[]))
fig.subplots_adjust(wspace=0.05)
ax[0].imshow(china)
ax[0].set_title('Original Image', size=16)
ax[1].imshow(china_recolored)
ax[1].set_title('16-color Image', size=16)
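To confirm that the recolored image uses no more than 16 colors, count its distinct pixel values. The sketch below is not part of the original notes. It uses a small random image instead of china.jpg so it runs without any image file, and the image size, `n_init=3`, and `random_state=0` are assumptions chosen here.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic stand-in for a photo: 32 x 48 pixels with random RGB values
rng = np.random.RandomState(0)
image = rng.randint(0, 256, size=(32, 48, 3), dtype=np.uint8)

# One row per pixel, scaled to [0, 1]
pixels = image.reshape(-1, 3) / 255.0

kmeans = MiniBatchKMeans(n_clusters=16, n_init=3, random_state=0).fit(pixels)

# Replace every pixel with the center of the cluster it belongs to
recolored = kmeans.cluster_centers_[kmeans.predict(pixels)].reshape(image.shape)

# Count distinct colors before and after
n_before = len(np.unique(image.reshape(-1, 3), axis=0))
n_after = len(np.unique(recolored.reshape(-1, 3), axis=0))
print(n_before, "colors before,", n_after, "colors after")
```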
Reference:
[1] VanderPlas, Jake. Python数据科学手册 (Python Data Science Handbook) [M]. 人民邮电出版社 (Posts & Telecom Press), 2018.