简介
工作原理:存在一个样本数据集,并且样本集中每个数据都存在标签(样本集中每一数据与所属分类的对应关系)。输入没有标签的新数据后,将新数据的每个特征与样本集中数据对应的特征进行比较,然后算法提取样本集中特征最相似数据(最邻近)的分类标签。一般来说,选择样本数据集中前k个最相似的数据,最后选择k个最相似数据中出现次数最多的分类,作为新数据的分类。
例如:
电影 | 打斗次数 | 接吻次数 | 电影类型 |
---|---|---|---|
California Man | 3 | 104 | Romance |
He's Not Really into Dudes | 2 | 100 | Romance |
Beautiful Woman | 1 | 81 | Romance |
Kevin Longblade | 101 | 10 | Action |
Robo Slayer 3000 | 99 | 5 | Action |
Amped II | 98 | 2 | Action |
未知 | 18 | 90 | Unknown |
已知电影与未知电影的距离:
电影名称 | 与未知电影的距离 |
---|---|
California Man | 20.5 |
He's Not Really into Dudes | 18.7 |
Beautiful Woman | 19.2 |
Kevin Longblade | 115.3 |
Robo Slayer 3000 | 117.4 |
Amped II | 118.9 |
按照距离递增排序可以找到k个距离最近的电影,假设k=3,取最近三部电影的类型可知未知电影也是一部爱情片。
算法步骤
1)计算测试数据与各个训练数据之间的距离;
2)按照距离的递增关系进行排序;
3)选取距离最小的K个点;
4)确定前K个点所在类别的出现频率;
5)返回前K个点中出现频率最高的类别作为测试数据的预测分类
import numpy as np
import operator
##给出训练数据以及对应的类别
def createDataSet():
group = np.array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
labels = ['A','A','B','B']
return group,labels
###通过KNN进行分类
def classify0(inX,dataSet,labels,k):
dataSetSize = dataSet.shape[0]
# print(dataSetSize)
# print(np.tile(inX,(dataSetSize,1)))
###计算距离
diffMat = np.tile(inX,(dataSetSize,1)) - dataSet
# print(diffMat)
sqDiffMat = diffMat**2
# print(sqDiffMat)
sqDistances = sqDiffMat.sum(axis=1)
# print(sqDistances)
distances = sqDistances**0.5
print('distances:',distances)
#根据元素的值从大到小对元素进行排序,返回下标
sortedDistIndicies = distances.argsort()
# print('sortedDistIndicies:',sortedDistIndicies)
classCount = {}
for i in range(k):
voteIlabel = labels[sortedDistIndicies[i]]
classCount[voteIlabel] = classCount.get(voteIlabel,0)+1
#选取出最多的类别
sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
return sortedClassCount[0][0]
if __name__ == '__main__':
group, labels = createDataSet()
input = [0.5,0.5]
output = classify0(input,group,labels,3)
print("测试数据为:",input,"分类结果为:",output)
>>>distances: [ 0.78102497 0.70710678 0.70710678 0.64031242]
>>>测试数据为: [0.5, 0.5] 分类结果为: B
归一化数值
在计算测试样本和样本集中数据距离的时候,有些特征的差值较大,对计算结果有较大影响。在处理这种不同取值范围的特征值时,我们通常采用的方法就是将数值归一化,如将取值范围处理为0到1或者-1到1之间。
newValue = (oldValue-min)/(max-min)
def autoNum(dataSet):
minVals = dataSet.min(0)
maxVals = dataSet.max(0)
ranges = maxVals - minVals
normDataSet = np.zeros(np.shape(dataSet))
m = dataSet.shape[0]
normDataSet = dataSet - np.tile(minVals,(m,1))
normDataSet = normDataSet/np.tile(ranges,(m,1))
return normDataSet,ranges,minVals
测试算法
'''
将文本记录装换为NumPy的解析程序
'''
from collections import Counter
def file2matrix(filename):
fr = open(filename)
arrayOLines = fr.readlines()
numberOfLines = len(arrayOLines)
returnMat = np.zeros((numberOfLines,3))
classLabelVector = []
index = 0
for line in arrayOLines:
line = line.strip()
listFormLine = line.split('\t')
returnMat[index,:] = listFormLine[0:3]
classLabelVector.append(listFormLine[-1])
index += 1
dictClassLabel = Counter(classLabelVector)
classLabel = []
kind = list(dictClassLabel)
for item in classLabelVector:
if item == kind[0]:
item = 1
elif item == kind[1]:
item = 2
else:
item = 3
classLabel.append(item)
return returnMat,classLabel
def datingClassTest():
hoRatio = 0.1 #选取10%测试,剩下的是已知数据集
datingDataMat,datingLabels = file2matrix('datingTestSet.txt')
normMat,ranges,minVals = autoNum(datingDataMat)
m = normMat.shape[0]
numTestVecs = int(m*hoRatio)
errorCount = 0.0
for i in range(numTestVecs):
classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
print("the classifier came back with %d, the real answer is %d"%(classifierResult,datingLabels[i]))
if(classifierResult != datingLabels[i]):errorCount += 1.0
print("the total error rate is %f" %(errorCount/float(numTestVecs)))