Contents
0 Preface
1 K-means
2 Hierarchical Clustering
3 Gaussian Mixture Model
0 Preface
Clustering is a form of unsupervised learning: it groups an unlabeled dataset into clusters.
1 K-means
How it works
- Step 1:
  – Randomly select k centroids (only in the first iteration); every point is then assigned to the cluster of its nearest centroid. (A centroid is the arithmetic mean, i.e., the average position, of all points in its cluster.)
- Step 2:
  – Recompute each centroid as the mean of the coordinates of all points assigned to that cluster.
  – Repeat Step 1 (reassigning each point to its nearest centroid) until the clusters converge (a minimal sketch of this loop follows the list).
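To make the assign/update loop concrete, here is a minimal NumPy sketch (an illustrative toy implementation, not the sklearn estimator used below; X is an assumed feature matrix and k the number of clusters):

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 (first iteration only): pick k random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids

For example, labels, centroids = kmeans_sketch(features_std, 3) would cluster the standardized iris features prepared in the code below.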
Limitations
- The number of clusters must be chosen in advance.
- K-means is limited to linear cluster boundaries.
- K-means can be slow for a large number of samples.
- It may fail to reach the global optimum.
- Labeling the resulting clusters is difficult.
Python code
# Load libraries
import numpy as np
from sklearn import datasets, metrics
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# Load data
iris = datasets.load_iris()
features = iris.data
# Standardize features
scaler = StandardScaler()
features_std = scaler.fit_transform(features)
# Create k-means object
cluster = KMeans(n_clusters=3, random_state=0)
# Train model
model = cluster.fit(features_std)
# Accuracy: remap the cluster labels to the true species encoding before comparing
pred_species = np.choose(model.labels_, [1, 0, 2]).astype(np.int64)
print("Accuracy :", metrics.accuracy_score(iris.target, pred_species))
print("Classification report :", metrics.classification_report(iris.target, pred_species))
# Create new observation
new_observation = [[0.8, 0.8, 0.8, 0.8]]
# Predict observation's cluster
model.predict(new_observation)
Choose the best K using an elbow graph:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

def plot_k(df_X, K):
    wcss = []
    for i in range(1, K):
        kmeans = KMeans(n_clusters=i, init='k-means++')
        kmeans.fit(df_X)
        wcss.append(kmeans.inertia_)  # within-cluster sum of squares
    sns.lineplot(x=list(range(1, K)), y=wcss, marker='o', color='red')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.show()
2 Hierarchical Clustering
Hierarchical clustering removes the need to pre-define the number of clusters K, as required by K-means. It builds a hierarchy of clusters, and the result is usually represented as a dendrogram.
Types
Divisive clustering is good at identifying large clusters, while agglomerative clustering is effective at identifying small clusters.
- Agglomerative ("bottom up")
  A bottom-up algorithm: it starts by treating each data point as a singleton cluster, then successively merges pairs of clusters until all of them have been merged into one cluster containing all the data.
  – Calculate the distance of each point to the rest of the points.
  – The points that are closest to the point in consideration are combined to form a cluster.
  – Repeat the process across all points until a final cluster is obtained.
- Divisive ("top-down")
  A top-down algorithm:
  – Start with a dataset of size N (d1, d2, d3, ..., dN), with all the data in a single cluster at the top;
  – Split it successively using a flat clustering method such as K-means;
  – At each step, pick the best cluster split with that flat clustering method, and repeat until every data point sits in its own singleton cluster (a minimal sketch of this splitting loop follows the list).
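To illustrate the top-down splitting loop, here is a minimal sketch that repeatedly bisects a cluster with flat K-means (a toy helper, not part of sklearn; for simplicity it always splits the largest cluster rather than scoring the "best" split, and X is an assumed feature matrix):

import numpy as np
from sklearn.cluster import KMeans

def divisive_sketch(X, n_clusters):
    # Start with all points in one top-level cluster (label 0)
    labels = np.zeros(len(X), dtype=int)
    while labels.max() + 1 < n_clusters:
        # Pick the largest current cluster and split it in two with flat K-means
        largest = np.bincount(labels).argmax()
        mask = labels == largest
        halves = KMeans(n_clusters=2, random_state=0).fit_predict(X[mask])
        # Points falling in the second half get a new cluster id
        labels[np.where(mask)[0][halves == 1]] = labels.max() + 1
    return labels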
Distance calculation
There are three distance (linkage) methods, selectable as a parameter in sklearn (a short snippet follows the list):
- Ward linkage method: merge clusters if the in-cluster variance, i.e., the sum of squared error, is a minimum.
- Complete linkage method: considers the distance between the farthest elements of two clusters; also known as maximum linkage.
- Average linkage method: considers all pairwise distances of both clusters; it is less affected by outliers.
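As referenced above, the linkage criterion is chosen through the linkage parameter of AgglomerativeClustering; a minimal sketch, assuming data_scaled is a prepared feature matrix as in the code further below:

from sklearn.cluster import AgglomerativeClustering

# 'ward' minimizes in-cluster variance (and uses Euclidean distances),
# 'complete' is maximum linkage, 'average' uses the mean pairwise distance
for linkage in ('ward', 'complete', 'average'):
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(data_scaled)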
Choosing the optimal number of clusters
The optimal number of clusters is the number of vertical lines in the dendrogram cut by a horizontal line drawn through the largest vertical distance that does not intersect any cluster.
Python code
# `data` is assumed to be a pandas DataFrame of numeric features
import pandas as pd
import matplotlib.pyplot as plt

# Scale each sample to unit norm so that (non-negative) values fall between 0 and 1
from sklearn.preprocessing import normalize
data_scaled = normalize(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
data_scaled.head()

# Produce dendrogram
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))

# Apply hierarchical clustering for 2 clusters (Ward linkage uses Euclidean distances)
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')
cluster.fit_predict(data_scaled)
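To turn the dendrogram cut described above into cluster labels, scipy's fcluster can be used with a distance threshold; a minimal sketch assuming data_scaled from the code above, with the cut height being a placeholder that should be read off the dendrogram:

from scipy.cluster.hierarchy import fcluster, linkage

Z = linkage(data_scaled, method='ward')
# Cut the dendrogram at a chosen height (placeholder value); the number of
# resulting labels equals the number of vertical lines crossed by that cut
labels = fcluster(Z, t=6, criterion='distance')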
3 Gaussian Mixture Model
“The Gaussian Mixture Model, or GMM for short, is a mixture model that uses a combination of Gaussian (Normal) probability distributions and requires the estimation of the mean and standard deviation parameters for each.” In other words, GMM assumes that the data are produced by Gaussian generative processes, and the parameters the model has to estimate are the mean and standard deviation of each process. In principle these parameters could be estimated with MLE, but here there is a latent variable, namely which process each point comes from, so the EM algorithm is needed.
EM Algorithm
The Gaussian Mixture Model is a practical example of the Expectation-Maximization (EM) algorithm.
The EM algorithm is an approach for maximum likelihood estimation in the presence of latent variables (unobserved or hidden variables). EM consists of two steps:
- E-Step. Estimate the expected value for each latent variable.
- M-Step. Optimize the parameters of the distribution using maximum likelihood.
In the EM algorithm, the estimation-step would estimate a value for the process latent variable for each data point, and the maximization step would optimize the parameters of the probability distributions in an attempt to best capture the density of the data. The process is repeated until a good set of latent values and a maximum likelihood is achieved that fits the data.
For more detail, see: https://machinelearningmastery.com/expectation-maximization-em-algorithm/
Advantages
K-means is a hard clustering method, meaning it assigns each data point to one and only one cluster. A limitation of this approach is that there is no uncertainty measure or probability telling us how strongly a data point is associated with a particular cluster. GMM's soft clustering fills this gap.
GMM can also serve as a distribution estimator and generate new samples following the distributions estimated from the training set (a short example follows).
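As a sketch of both points, sklearn's GaussianMixture exposes per-cluster membership probabilities via predict_proba and draws new observations via sample (the iris features are reused here purely as an assumed example input):

from sklearn import datasets
from sklearn.mixture import GaussianMixture

X = datasets.load_iris().data
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft clustering: one membership probability per component for each point
probabilities = gmm.predict_proba(X[:5])   # shape (5, 3)

# Distribution estimation: draw new observations from the fitted mixture
new_samples, component_ids = gmm.sample(10)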
How it works
Assume there are K Gaussian generative processes; each process has three parameters:
- A mean μ that defines its centre.
- A covariance Σ that defines its width. This would be equivalent to the dimensions of an ellipsoid in a multivariate scenario.
- A mixing probability π that defines how big or small the Gaussian function will be.
- First, randomly initialize these three parameters for each of the K components.
- Then run the EM algorithm (a minimal sketch follows this list):
  – E-step: for each point, find weights encoding the probability of membership in each cluster.
  – M-step: for each cluster, update its location, normalization, and shape based on all data points, making use of the weights.
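To make the E/M updates concrete, here is a minimal NumPy sketch of EM for a one-dimensional Gaussian mixture (an illustrative toy implementation, not sklearn's; x is an assumed 1-D data array and K the number of components):

import numpy as np
from scipy.stats import norm

def gmm_em_1d(x, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random initialization of mixing probabilities, means and standard deviations
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    sigma = np.full(K, x.std())
    for _ in range(n_iter):
        # E-step: responsibilities, i.e. P(component k | point i), for every point
        dens = np.array([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(K)])
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M-step: re-estimate pi, mu and sigma from the responsibility-weighted points
        Nk = resp.sum(axis=1)
        pi = Nk / len(x)
        mu = (resp @ x) / Nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk)
    return pi, mu, sigma

For example, gmm_em_1d(np.concatenate([np.random.normal(0, 1, 300), np.random.normal(5, 1, 300)]), K=2) should recover means near 0 and 5.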
Parameters
The GaussianMixture() class in sklearn has a hyper-parameter, covariance_type, that controls the degrees of freedom in the shape of each cluster (a short comparison snippet follows the list):
- covariance_type="diag":
  – The size of the cluster along each dimension can be set independently, with the resulting ellipse constrained to align with the axes.
- covariance_type="spherical":
  – Constrains the shape of the cluster such that all dimensions are equal.
  – The resulting clustering will have similar characteristics to that of k-means, though it is not entirely equivalent.
  – Simpler and faster model.
- covariance_type="full" (the sklearn default):
  – Allows each cluster to be modeled as an ellipse with arbitrary orientation.
  – More complicated and computationally expensive.
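As an illustration of the different degrees of freedom, the shape of the fitted covariances_ attribute changes with each setting (a minimal sketch using the iris features as a stand-in dataset):

from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data
for cov_type in ('spherical', 'diag', 'full'):
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=0).fit(X)
    # (3,) for spherical, (3, 4) for diag, (3, 4, 4) for full
    print(cov_type, gmm.covariances_.shape)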
Choosing the optimal number of components
To correct for over-fitting, adjust model likelihoods by using some analytic criterion such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC). The optimal number of clusters is the value that minimizes the AIC or BIC.
- AIC: https://medium.com/@analyttica/akaike-information-criterion-aic-7a4b58bce206
- BIC: https://medium.com/@analyttica/what-is-bayesian-information-criterion-bic-b3396a894be6
Python code
from sklearn.mixture import GaussianMixture
# Use the AIC to get a gauge for the number of GMM components to use
n_components = np.arange(50, 210, 10)
models = [GaussianMixture(n, covariance_type='full', random_state=0).fit(digits.data)
for n in n_components]
aics = [model.fit(digits.data).aic(digits.data) for model in models]
plt.plot(n_components, aics, label='AIC');
plt.legend(loc='best')
plt.xlabel('n_components');
# Fit number of components to 110 and confirm that it has converged
gmm = GaussianMixture(110, covariance_type='full', random_state=0)
gmm.fit(digits.data)
print(gmm.converged_)