聚类是一种无监督学习,它将相似的对象归到同一个簇中。他有点像全自动分类。
簇识别给出聚类结果的含义。假定一些数据,现在将相似数据归到一起,簇识别会告诉我们这些簇到底都是些什么。
K-均值聚类
优点:容易实现
缺点:可能收敛到局部最小值,在大规模数据集上收敛较慢
k-均值算法的伪代码如下:
创建k个点作为起始质心(经常是随机选择)
当任意一个点的簇分配结果发生改变时
对数据集中的每个数据点
对每个质心
计算质心与数据点之间的距离
将数据点分配到距其最近的簇
对每一个簇,计算簇中所有点的均值并将均值作为质心
下面给出代码实现:
#kMeans.py
from numpy import *
def loadDataSet(fileName):
dataMat = []
fr = open(fileName)
for line in fr.readlines():
curLine = line.strip().split('\t')
fltLine = map(float,curLine)
dataMat.append(fltLine)
return dataMat
def distEclud(vecA, vecB):
return sqrt(sum(power(vecA - vecB, 2))) #两个向量的欧氏距离
def randCent(dataSet, k):
n = shape(dataSet)[1]
centroids = mat(zeros((k,n)))
for j in range(n):
minJ = min(dataSet[:,j])
rangeJ = float(max(dataSet[:,j]) - minJ)
centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1))#最小值与最大值
return centroids
下面开始测试代码
>>> import kMeans
>>> from numpy import *
>>> datMat = mat(kMeans.loadDataSet('testSet.txt'))#构建矩阵
>>> min(datMat[:,0])
matrix([[-5.379713]])
>>> min(datMat[:,1])
matrix([[-4.232586]])
>>> max(datMat[:,1])
matrix([[ 5.1904]])
>>> max(datMat[:,0])
matrix([[ 4.838138]])
>>> kMeans.randCent(datMat,2)
matrix([[-4.4796995 , -0.70940004],
[ 0.81371123, 2.24200682]])
最后测试一下距离计算方法
>>> kMeans.distEclud(datMat[0],datMat[1])
5.184632816681332
所有支持函数正常运行之后,就可以准备试下完整的k-均值算法了。
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
m = shape(dataSet)[0]
clusterAssment = mat(zeros((m,2)))#创建一个矩阵来存储每一次分配的结果,索引值和误差(点到簇质心的距离)
centroids = createCent(dataSet, k)
clusterChanged = True
while clusterChanged:#直到不再改变为止
clusterChanged = False
for i in range(m):#寻找最近的质心
minDist = inf; minIndex = -1
for j in range(k):
distJI = distMeas(centroids[j,:],dataSet[i,:])
if distJI < minDist:
minDist = distJI; minIndex = j
if clusterAssment[i,0] != minIndex: clusterChanged = True
clusterAssment[i,:] = minIndex,minDist**2
print centroids
for cent in range(k):#recalculate centroids
ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]#获得簇中所有点
centroids[cent,:] = mean(ptsInClust, axis=0) #计算所有值的均值。axis=0表示沿着矩阵的列方向进行均值计算
return centroids, clusterAssment
接下来看看运行效果
In [4]: import kMeans
...: datMat = mat(kMeans.loadDataSet('testSet.txt'))
...: myCentroids, clustAssing = kMeans.kMeans(datMat,4)
...:
[[-1.07551094 -1.37091103]
[ 3.16596464 -2.76092215]
[ 4.21540817 -2.58163381]
[-3.9285101 2.94186313]]
[[-1.93673761 -1.58827286]
[ 2.61992629 -2.31851518]
[ 3.92807891 1.38550982]
[-1.74200079 3.04124525]]
[[-3.38237045 -2.9473363 ]
[ 2.72031426 -2.83200232]
[ 2.91014783 2.71954072]
[-1.94392522 2.96291883]]
[[-3.38237045 -2.9473363 ]
[ 2.80293085 -2.7315146 ]
[ 2.6265299 3.10868015]
[-2.46154315 2.78737555]]
上面给出了四个质心,可以看到,经过三次迭代以后该算法收敛。
SSE,sum of squared error,误差平方和。SSE值越小,数据点越接近他的质心,聚类效果越好。
为了克服K-均值孙发收敛于局部最小值的问题,有人提出了二分K-均值算法(bisecting K-means)。伪代码如下:
将所有点看成一个簇
当簇数目小于k时
对于每一个簇
计算总误差
在给定的簇上面进行K-均值聚类(k=2)
计算将该簇一分为二之后的总误差
选择使得误差最小的那个簇进行划分操作
另一种做法是选择SSE最大的簇进行划分,直到簇数目达到用户指定的数目为止。这种做法不难实现,下面看看实际效果:
def biKmeans(dataSet, k, distMeas=distEclud):
m = shape(dataSet)[0]
clusterAssment = mat(zeros((m,2)))
centroid0 = mean(dataSet, axis=0).tolist()[0] #axis=0计算每一列的均值,axis=1计算每一行的均值
centList =[centroid0]
for j in range(m):#计算初始误差
clusterAssment[j,1] = distMeas(mat(centroid0), dataSet[j,:])**2
while (len(centList) < k):
lowestSSE = inf
for i in range(len(centList)):
ptsInCurrCluster = dataSet[nonzero(clusterAssment[:,0].A==i)[0],:]#.A,转化为数组
centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
sseSplit = sum(splitClustAss[:,1])
sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:,0].A!=i)[0],1])
print "sseSplit, and notSplit: ",sseSplit,sseNotSplit
if (sseSplit + sseNotSplit) < lowestSSE:#比较SSE的值
bestCentToSplit = i
bestNewCents = centroidMat
bestClustAss = splitClustAss.copy()
lowestSSE = sseSplit + sseNotSplit
bestClustAss[nonzero(bestClustAss[:,0].A == 1)[0],0] = len(centList) #更新簇结果
bestClustAss[nonzero(bestClustAss[:,0].A == 0)[0],0] = bestCentToSplit
print 'the bestCentToSplit is: ',bestCentToSplit
print 'the len of bestClustAss is: ', len(bestClustAss)
centList[bestCentToSplit] = bestNewCents[0,:].tolist()[0]
centList.append(bestNewCents[1,:].tolist()[0])#更新质心
clusterAssment[nonzero(clusterAssment[:,0].A == bestCentToSplit)[0],:]= bestClustAss
return mat(centList), clusterAssment
下面看一下实际运行结果:
In [15]: import kMeans
...: datMat3 = mat(kMeans.loadDataSet('testSet2.txt'))
...: centList,myNewAssments = kMeans.biKmeans(datMat3,3)
...:
[[-3.03755889 -1.89572181]
[-3.22628749 -2.41623849]]
[[-0.10841893 1.56828455]
[-0.84797625 -3.576032 ]]
[[-0.03082922 3.12682161]
[-0.43154563 -2.87788837]]
[[-0.00675605 3.22710297]
[-0.45965615 -2.7782156 ]]
sseSplit, and notSplit: 453.033489581 0.0
the bestCentToSplit is: 0
the len of bestClustAss is: 60
[[-4.17486974 4.05506292]
[ 3.04575421 4.02649076]]
[[-2.94737575 3.3263781 ]
[ 2.93386365 3.12782785]]
sseSplit, and notSplit: 77.5922493178 29.1572494441
[[-1.43611292 -1.57953903]
[ 0.46179716 -0.89904296]]
[[-0.93324907 -2.59690843]
[ 0.645394 -3.20126567]]
[[-1.07894467 -2.43015258]
[ 0.46927663 -3.30031012]]
[[-1.12616164 -2.30193564]
[ 0.35496167 -3.36033556]]
sseSplit, and notSplit: 12.7532631369 423.876240137
the bestCentToSplit is: 0
the len of bestClustAss is: 40
现在看看质心结果:
In [17]: centList
Out[17]:
matrix([[-2.94737575, 3.3263781 ],
[-0.45965615, -2.7782156 ],
[ 2.93386365, 3.12782785]])
yahoo开发者那个暂时不能访问。。过几天更新