Ye Zhu, Deakin University
2021/10/1
There are many kernel methods for improving the performance of existing distance-based clustering algorithms (e.g., kernel k-means and spectral clustering). Recently, the Isolation Kernel [1,2,3,4] was proposed as a more effective data-dependent similarity measure: two points (x, y) in a sparse region are more similar than two points (x', y') in a dense region, even when the distance from x to y equals the distance from x' to y'. The measure adapts to the local data distribution, giving it greater flexibility in capturing local distributional characteristics. It has shown better performance than existing kernel functions on density-based and distance-based classification and clustering problems.
In this document, we use R to explore the effect of the Isolation Kernel [1] on clustering the iris dataset. We compare the results of k-means, k-medoids, and heatmap visualisation using the Euclidean distance against those using the Isolation Kernel. The main packages used are RANN, Matrix, aricode, seriation, kmed, and heatmaply.
Visualising the iris dataset
We first visualise the original iris dataset:
library(heatmaply) # provides normalize(); loads ggplot2 via plotly
df <- iris
df[, 1:4] <- normalize(df[, 1:4]) # scale each feature to the [0, 1] range
ggplot(df, aes(Petal.Length, Petal.Width)) +
  geom_point(aes(col = Species), size = 4)
Clustering based on the Euclidean distance
K-means clustering
- The confusion matrix is
irisCluster <- kmeans(df[, 1:4], centers = 3, nstart = 100) # k-means with 100 random starts
table(irisCluster$cluster, iris$Species)
- The AMI score is (AMI, adjusted mutual information, is a common clustering evaluation metric; the closer it is to 1, the better the clustering result)
library(aricode)
AMI(irisCluster$cluster,iris$Species)
[1] 0.7364193
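As a quick check of the AMI scale (a minimal sketch, not part of the original analysis): a random labelling scores close to 0, while a perfect labelling scores exactly 1.

set.seed(1)
AMI(sample(1:3, nrow(iris), replace = TRUE), iris$Species) # close to 0
AMI(iris$Species, iris$Species) # exactly 1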
K-medoids clustering
- The confusion matrix is
library(kmed)
d <- dist(df[, 1:4]) # Euclidean distance matrix on the normalised features
sfkm <- fastkmed(d, ncluster = 3, iterate = 100)
table(sfkm$cluster, iris$Species)
- The AMI score is
AMI(sfkm$cluster,iris$Species)
[1] 0.7540459
Heatmap
library(seriation)
hmap(d, method = "OLO_single", main = "Heatmap for iris (opt. leaf ordering)",
     col = bluered(100, bias = .5))
Clustering based on the Isolation Kernel
Below is an R implementation of the Isolation Kernel. The IKFeature function returns the feature vectors in the kernel space induced by the Isolation Kernel's feature map. The IKSimilarity function computes the Isolation Kernel similarity measure, i.e., the inner product of the feature vectors. We can therefore use IKFeature for algorithms that take features as input (e.g., k-means), and IKSimilarity for algorithms that take a similarity/distance matrix as input (e.g., k-medoids).
library(RANN)
library(Matrix)
IKFeature <- function(data, Sdata = data, psi = 64, t = 200, Sp = TRUE) {
  # IKFeature returns the finite binary features based on the Isolation Kernel feature map.
  # data:  the data to be mapped into the kernel space
  # Sdata: the data used for generating Voronoi diagrams; it can be the same as the input data
  # psi:   the number of cells in each Voronoi diagram; it should be larger if there are
  #        more clusters or more complex structures in the data
  # t:     the number of Voronoi diagrams; the higher, the more stable the result
  # Sp:    whether to return sparse feature vectors
  sizeS <- nrow(Sdata)
  sizeN <- nrow(data)
  Feature <- matrix(, nrow = sizeN, ncol = 0)
  for (i in 1:t) {
    subIndex <- sample(1:sizeS, psi, replace = FALSE) # sample psi points as cell centres
    tdata <- Sdata[subIndex, ]
    NN <- nn2(tdata, data, k = 1) # find the nearest centre, i.e., each point's Voronoi cell
    OneFeature <- matrix(0, nrow = sizeN, ncol = psi)
    OneFeature <- Matrix(OneFeature, sparse = Sp)
    ind <- cbind(1:sizeN, NN$nn.idx)
    OneFeature[ind] <- 1 # one-hot encoding: mark the cell each point falls into
    Feature <- cbind(Feature, OneFeature)
  }
  if (Sp == TRUE) {
    Feature # sparse binary feature matrix based on the Isolation Kernel
  } else {
    as.matrix(Feature) # return a dense matrix
  }
}
IKSimilarity <- function(data, Sdata = data, psi = 64, t = 200) {
  # IKSimilarity calculates the Isolation Kernel similarity measure,
  # i.e., the inner product of the feature vectors averaged over the t diagrams.
  Feature <- IKFeature(data, Sdata, psi, t)
  Feature %*% t(Feature) / t # the similarity matrix based on the Isolation Kernel
}
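To illustrate the data-dependent property described in the introduction, here is a minimal sketch on assumed toy data (not part of the original experiment): two pairs of points at the same Euclidean distance receive different Isolation Kernel similarities, depending on how dense their region is.

set.seed(42)
dense <- cbind(rnorm(200, 0, 0.5), rnorm(200, 0, 0.5)) # dense cluster around (0, 0)
sparse <- cbind(rnorm(20, 5, 2.0), rnorm(20, 0, 2.0)) # sparse cluster around (5, 0)
X <- rbind(dense, sparse,
           c(0, 0), c(0, 1), # pair in the dense region, Euclidean distance 1
           c(5, 0), c(5, 1)) # pair in the sparse region, Euclidean distance 1
S <- as.matrix(IKSimilarity(X, X, psi = 16, t = 200))
n <- nrow(X)
S[n - 3, n - 2] # similarity of the dense-region pair (expected to be smaller)
S[n - 1, n] # similarity of the sparse-region pair (expected to be larger)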
K-means clustering
- The confusion matrix is
set.seed(136)
ndata <- IKFeature(data = df[, 1:4], psi = 4, t = 200) # map the data into the IK feature space
irisCluster <- kmeans(ndata, centers = 3, nstart = 100)
table(irisCluster$cluster, iris$Species)
- The AMI score is
AMI(irisCluster$cluster,iris$Species)
[1] 0.8166619
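The comment in IKFeature notes that psi should be larger for data with more clusters or more complex structure. A simple way to explore its effect (an assumed tuning sketch, not from the original post) is to sweep a few candidate values and compare AMI scores:

set.seed(136)
for (psi in c(2, 4, 8, 16, 32)) {
  ndata <- IKFeature(data = df[, 1:4], psi = psi, t = 200)
  cl <- kmeans(ndata, centers = 3, nstart = 100)
  cat("psi =", psi, "AMI =", AMI(cl$cluster, iris$Species), "\n")
}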
K-medoids clustering
- The confusion matrix is
library(kmed)
set.seed(136)
Sim <- IKSimilarity(df[, 1:4], df[, 1:4], 4, 200) # IK similarity matrix
d <- 1 - as.matrix(Sim) # convert similarity to a dissimilarity/distance matrix
sfkm <- fastkmed(d, ncluster = 3, iterate = 100)
table(sfkm$cluster, iris$Species)
- The AMI score is
AMI(sfkm$cluster,iris$Species)
[1] 0.8027312
Heatmap
hmap(d, method = "OLO_single", main = "Heatmap for iris (opt. leaf ordering)",
     col = bluered(100, bias = .5))
Based on these results, we can see that the Isolation Kernel clearly improves the clustering quality of both k-means (AMI from 0.74 to 0.82) and k-medoids (AMI from 0.75 to 0.80). The cluster structure is also much clearer in the heatmap when the Isolation Kernel is used.
The official R package (available on CRAN): https://github.com/zhuye88/isokernel
The original English version of this post: https://rpubs.com/zhuye88/IK
A MATLAB implementation of the Isolation Kernel: https://github.com/zhuye88/anne-dbscan-demo
References
[1] Qin, X., Ting, K.M., Zhu, Y. and Lee, V.C., 2019, July. Nearest-neighbour-induced isolation similarity and its impact on density-based clustering. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, pp. 4755-4762).
[2] Ting, K.M., Xu, B.C., Washio, T. and Zhou, Z.H., 2020, August. Isolation Distributional Kernel: A New Tool for Kernel based Anomaly Detection. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 198-206).
[3] Ting, K.M., Wells, J.R. and Washio, T., 2021. Isolation kernel: the X factor in efficient and effective large scale online kernel learning. Data Mining and Knowledge Discovery, pp.1-31.
[4] Ting, K.M., Zhu, Y. and Zhou, Z.H., 2018, July. Isolation kernel and its effect on SVM. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2329-2337).