Contents
0 Preface
1 K-means
2 Hierarchical Clustering
3 Gaussian Mixture Model
0 Preface
Clustering is a form of unsupervised learning: it groups an unlabeled dataset into clusters.
1 K-means
How it works
- Step 1:
  – Randomly select k centroids (only in the first iteration); every point is then assigned to the cluster of its nearest centroid. (A centroid is the arithmetic mean, i.e., the average position, of all points in its cluster.)
- Step 2:
  – Recompute each centroid as the mean of the coordinates of all points assigned to that cluster.
  – Repeat Step 1 (reassigning each point to its nearest centroid) until the clusters converge (a minimal sketch of this loop follows the list).
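To make the assign/update loop concrete, here is a minimal NumPy sketch (an illustrative toy implementation, not the sklearn estimator used below; X is an assumed feature matrix and k the number of clusters):

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 (first iteration only): pick k random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids

For example, labels, centroids = kmeans_sketch(features_std, 3) would cluster the standardized iris features prepared in the code below.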
Limitations
- The number of clusters must be chosen in advance.
- K-means is limited to linear cluster boundaries.
- K-means can be slow for a large number of samples.
- It may fail to reach the global optimum.
- Labeling the resulting clusters is difficult.
Python code
# Load libraries
import numpy as np
from sklearn import datasets, metrics
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# Load data
iris = datasets.load_iris()
features = iris.data
# Standardize features
scaler = StandardScaler()
features_std = scaler.fit_transform(features)
# Create k-means object
cluster = KMeans(n_clusters=3, random_state=0)
# Train model
model = cluster.fit(features_std)
# Accuracy: remap the cluster labels to the true species encoding before comparing
pred_species = np.choose(model.labels_, [1, 0, 2]).astype(np.int64)
print("Accuracy :", metrics.accuracy_score(iris.target, pred_species))
print("Classification report :", metrics.classification_report(iris.target, pred_species))
# Create new observation
new_observation = [[0.8, 0.8, 0.8, 0.8]]
# Predict observation's cluster
model.predict(new_observation)
Choose the best K using an elbow graph:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

def plot_k(df_X, K):
    wcss = []
    for i in range(1, K):
        kmeans = KMeans(n_clusters=i, init='k-means++')
        kmeans.fit(df_X)
        wcss.append(kmeans.inertia_)  # within-cluster sum of squares
    sns.lineplot(x=list(range(1, K)), y=wcss, marker='o', color='red')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.show()
2 Hierarchical Clustering
Hierarchical clustering removes the need to pre-define the number of clusters K, as required by K-means. It builds a hierarchy of clusters, and the result is usually represented as a dendrogram.
Types
Divisive clustering is good at identifying large clusters, while agglomerative clustering is effective at identifying small clusters.
- Agglomerative ("bottom up")
  A bottom-up algorithm: it starts by treating each data point as a singleton cluster, then successively merges pairs of clusters until all of them have been merged into one cluster containing all the data.
  – Calculate the distance of each point to the rest of the points.
  – The points that are closest to the point in consideration are combined to form a cluster.
  – Repeat the process across all points until a final cluster is obtained.
- Divisive ("top-down")
  A top-down algorithm:
  – Start with a dataset of size N (d1, d2, d3, ..., dN), with all the data in a single cluster at the top;
  – Split it successively using a flat clustering method such as K-means;
  – At each step, pick the best cluster split with that flat clustering method, and repeat until every data point sits in its own singleton cluster (a minimal sketch of this splitting loop follows the list).
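To illustrate the top-down splitting loop, here is a minimal sketch that repeatedly bisects a cluster with flat K-means (a toy helper, not part of sklearn; for simplicity it always splits the largest cluster rather than scoring the "best" split, and X is an assumed feature matrix):

import numpy as np
from sklearn.cluster import KMeans

def divisive_sketch(X, n_clusters):
    # Start with all points in one top-level cluster (label 0)
    labels = np.zeros(len(X), dtype=int)
    while labels.max() + 1 < n_clusters:
        # Pick the largest current cluster and split it in two with flat K-means
        largest = np.bincount(labels).argmax()
        mask = labels == largest
        halves = KMeans(n_clusters=2, random_state=0).fit_predict(X[mask])
        # Points falling in the second half get a new cluster id
        labels[np.where(mask)[0][halves == 1]] = labels.max() + 1
    return labels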
Distance calculation
There are three distance (linkage) methods, selectable as a parameter in sklearn (a short snippet follows the list):
- Ward linkage method: merge clusters if the in-cluster variance, i.e., the sum of squared error, is a minimum.
- Complete linkage method: considers the distance between the farthest elements of two clusters; also known as maximum linkage.
- Average linkage method: considers all pairwise distances of both clusters; it is less affected by outliers.
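As referenced above, the linkage criterion is chosen through the linkage parameter of AgglomerativeClustering; a minimal sketch, assuming data_scaled is a prepared feature matrix as in the code further below:

from sklearn.cluster import AgglomerativeClustering

# 'ward' minimizes in-cluster variance (and uses Euclidean distances),
# 'complete' is maximum linkage, 'average' uses the mean pairwise distance
for linkage in ('ward', 'complete', 'average'):
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(data_scaled)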
Choosing the optimal number of clusters
The optimal number of clusters is the number of vertical lines in the dendrogram cut by a horizontal line drawn through the largest vertical distance that does not intersect any cluster.
Python code
# `data` is assumed to be a pandas DataFrame of numeric features
import pandas as pd
import matplotlib.pyplot as plt

# Scale each sample to unit norm so that (non-negative) values fall between 0 and 1
from sklearn.preprocessing import normalize
data_scaled = normalize(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
data_scaled.head()

# Produce dendrogram
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))

# Apply hierarchical clustering for 2 clusters (Ward linkage uses Euclidean distances)
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')
cluster.fit_predict(data_scaled)
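To turn the dendrogram cut described above into cluster labels, scipy's fcluster can be used with a distance threshold; a minimal sketch assuming data_scaled from the code above, with the cut height being a placeholder that should be read off the dendrogram:

from scipy.cluster.hierarchy import fcluster, linkage

Z = linkage(data_scaled, method='ward')
# Cut the dendrogram at a chosen height (placeholder value); the number of
# resulting labels equals the number of vertical lines crossed by that cut
labels = fcluster(Z, t=6, criterion='distance')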
3 Gaussian Mixture Model
“The Gaussian Mixture Model, or GMM for short, is a mixture model that uses a combination of Gaussian (Normal) probability distributions and requires the estimation of the mean and standard deviation parameters for each.” In other words, GMM assumes that the data are produced by Gaussian generative processes, and the parameters the model has to estimate are the mean and standard deviation of each process. In principle these parameters could be estimated with MLE, but here there is a latent variable, namely which process each point comes from, so the EM algorithm is needed.
EM Algorithm
The Gaussian Mixture Model is a practical example of the Expectation-Maximization (EM) algorithm.
The EM algorithm is an approach for maximum likelihood estimation in the presence of latent variables (unobserved or hidden variables). EM consists of two steps:
- E-Step. Estimate the expected value for each latent variable.
- M-Step. Optimize the parameters of the distribution using maximum likelihood.
In the EM algorithm, the estimation-step would estimate a value for the process latent variable for each data point, and the maximization step would optimize the parameters of the probability distributions in an attempt to best capture the density of the data. The process is repeated until a good set of latent values and a maximum likelihood is achieved that fits the data.
For more detail, see: https://machinelearningmastery.com/expectation-maximization-em-algorithm/
Advantages
K-means is a hard clustering method, meaning it assigns each data point to one and only one cluster. A limitation of this approach is that there is no uncertainty measure or probability telling us how strongly a data point is associated with a particular cluster. GMM's soft clustering fills this gap.
GMM can also serve as a distribution estimator and generate new samples following the distributions estimated from the training set (a short example follows).
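As a sketch of both points, sklearn's GaussianMixture exposes per-cluster membership probabilities via predict_proba and draws new observations via sample (the iris features are reused here purely as an assumed example input):

from sklearn import datasets
from sklearn.mixture import GaussianMixture

X = datasets.load_iris().data
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft clustering: one membership probability per component for each point
probabilities = gmm.predict_proba(X[:5])   # shape (5, 3)

# Distribution estimation: draw new observations from the fitted mixture
new_samples, component_ids = gmm.sample(10)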
How it works
Assume there are K Gaussian generative processes; each process has three parameters:
- A mean μ that defines its centre.
- A covariance Σ that defines its width. This would be equivalent to the dimensions of an ellipsoid in a multivariate scenario.
- A mixing probability π that defines how big or small the Gaussian function will be.
- First, randomly initialize these three parameters for each of the K components.
- Then run the EM algorithm (a minimal sketch follows this list):
  – E-step: for each point, find weights encoding the probability of membership in each cluster.
  – M-step: for each cluster, update its location, normalization, and shape based on all data points, making use of the weights.
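To make the E/M updates concrete, here is a minimal NumPy sketch of EM for a one-dimensional Gaussian mixture (an illustrative toy implementation, not sklearn's; x is an assumed 1-D data array and K the number of components):

import numpy as np
from scipy.stats import norm

def gmm_em_1d(x, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random initialization of mixing probabilities, means and standard deviations
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    sigma = np.full(K, x.std())
    for _ in range(n_iter):
        # E-step: responsibilities, i.e. P(component k | point i), for every point
        dens = np.array([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(K)])
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: re-estimate pi, mu and sigma from the responsibility-weighted points
        Nk = resp.sum(axis=1)
        pi = Nk / len(x)
        mu = (resp @ x) / Nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk)
    return pi, mu, sigma

For example, gmm_em_1d(np.concatenate([np.random.normal(0, 1, 300), np.random.normal(5, 1, 300)]), K=2) should recover means near 0 and 5.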
Parameters
The GaussianMixture() class in sklearn has a hyper-parameter, covariance_type, that controls the degrees of freedom in the shape of each cluster (a short comparison snippet follows the list):
- covariance_type="diag":
  – The size of the cluster along each dimension can be set independently, with the resulting ellipse constrained to align with the axes.
- covariance_type="spherical":
  – Constrains the shape of the cluster such that all dimensions are equal.
  – The resulting clustering will have similar characteristics to that of k-means, though it is not entirely equivalent.
  – Simpler and faster model.
- covariance_type="full" (the sklearn default):
  – Allows each cluster to be modeled as an ellipse with arbitrary orientation.
  – More complicated and computationally expensive.
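As an illustration of the different degrees of freedom, the shape of the fitted covariances_ attribute changes with each setting (a minimal sketch using the iris features as a stand-in dataset):

from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data
for cov_type in ('spherical', 'diag', 'full'):
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=0).fit(X)
    # (3,) for spherical, (3, 4) for diag, (3, 4, 4) for full
    print(cov_type, gmm.covariances_.shape)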
Choosing the optimal number of components
To correct for over-fitting, adjust model likelihoods by using some analytic criterion such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC). The optimal number of clusters is the value that minimizes the AIC or BIC.
- AIC: https://medium.com/@analyttica/akaike-information-criterion-aic-7a4b58bce206
- BIC: https://medium.com/@analyttica/what-is-bayesian-information-criterion-bic-b3396a894be6
Python code
from sklearn.mixture import GaussianMixture
# Use the AIC to get a gauge for the number of GMM components to use
n_components = np.arange(50, 210, 10)
models = [GaussianMixture(n, covariance_type='full', random_state=0).fit(digits.data)
for n in n_components]
aics = [model.fit(digits.data).aic(digits.data) for model in models]
plt.plot(n_components, aics, label='AIC');
plt.legend(loc='best')
plt.xlabel('n_components');
# Fit number of components to 110 and confirm that it has converged
gmm = GaussianMixture(110, covariance_type='full', random_state=0)
gmm.fit(digits.data)
print(gmm.converged_)