Machine Learning in Action ~ k-Nearest Neighbors (Part 2)

In the previous post we built our first classifier. Next, we test how well it works with two demos.

demo_1: using the k-nearest neighbors algorithm to improve the match results of a dating site

Three labels:
didntLike (people she did not like)
smallDoses (people with some charm)
largeDoses (people with a lot of charm)
Three sample features:
frequent flier miles earned per year
percentage of time spent playing video games
liters of ice cream consumed per week
Text sample format

We replace the three labels with the numbers 1, 2, and 3.


Text after processing
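For reference, each record is a single tab-separated line: the three feature values followed by the label. It looks roughly like this (the label shown here is only illustrative):

40920    8.326976    0.953952    largeDoses    <- raw file (datingTestSet.txt)
40920    8.326976    0.953952    3             <- processed file (datingTestSet2.txt)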

Convert the text records into a NumPy array

# Parser that converts the text records into a NumPy feature matrix and a label list
import numpy as np

def file2matrix(filename):
    fr = open(filename)
    arrayOLines = fr.readlines()  # read the file line by line
    numberOfLines = len(arrayOLines)  # number of lines in the file
    returnMat = np.ones((numberOfLines, 3))  # pre-allocate the feature matrix: numberOfLines rows, 3 columns
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip().split('\t')  # strip() removes surrounding whitespace, split('\t') splits on tabs
        returnMat[index, :] = line[0:3]  # the first three fields are the features; index is the row number
        # convert the label (the last field) to a number
        if line[-1] == 'largeDoses':
            label = 3
        elif line[-1] == 'smallDoses':
            label = 2
        elif line[-1] == 'didntLike':
            label = 1
        else:
            label = int(line[-1])  # datingTestSet2.txt already stores the labels as numbers
        classLabelVector.append(label)
        index += 1
    return returnMat, classLabelVector

x, y = file2matrix(r'your_path_XXX\datingTestSet2.txt')
print(x)
print(type(x))  # x is a NumPy array
print(type(y))  # y is a Python list
print(x.shape)
print(np.array(y).shape)  # np.array(y) converts the list into an array
Output:
[[  4.09200000e+04   8.32697600e+00   9.53952000e-01]
 [  1.44880000e+04   7.15346900e+00   1.67390400e+00]
 [  2.60520000e+04   1.44187100e+00   8.05124000e-01]
 ..., 
 [  2.65750000e+04   1.06501020e+01   8.66627000e-01]
 [  4.81110000e+04   9.13452800e+00   7.28045000e-01]
 [  4.37570000e+04   7.88260100e+00   1.33244600e+00]]
<class 'numpy.ndarray'>
<class 'list'>
(1000, 3)
(1000,)

Now that the data has been imported from the text file, we need to understand what it actually means, so we display it graphically.

Analyze the data: create scatter plots with Matplotlib

import matplotlib.pyplot as plt

fig = plt.figure()  # create a figure object (a blank canvas to draw on)
x, y = file2matrix(r'your_path_XXX\datingTestSet2.txt')

ax1 = fig.add_subplot(2, 1, 1)  # add_subplot creates a subplot: a 2x1 grid, first panel selected
ax1.scatter(x[:, 1], x[:, 2], 15.0 * np.array(y), 15.0 * np.array(y))  # marker size and color encode the class of each point
# x[:, 1] and x[:, 2] are the plotted columns; the third argument (15.0 * np.array(y)) sets the marker sizes
# and the fourth sets the colors; x[:, 0] would take column 0 of every row
ax2 = fig.add_subplot(2, 1, 2)  # second panel of the 2x1 grid
ax2.scatter(x[:, 0], x[:, 1], 15.0 * np.array(y), 15.0 * np.array(y))  # marker size and color encode the class of each point
plt.show()
Scatter plots of the data
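To make the two panels easier to read, you can also label the axes. A minimal sketch, using the column meanings assumed above (column 0 = flier miles, column 1 = game time, column 2 = ice cream):

fig = plt.figure()
ax1 = fig.add_subplot(2, 1, 1)
ax1.scatter(x[:, 1], x[:, 2], 15.0 * np.array(y), 15.0 * np.array(y))
ax1.set_xlabel('time spent playing video games (%)')
ax1.set_ylabel('liters of ice cream consumed')
ax2 = fig.add_subplot(2, 1, 2)
ax2.scatter(x[:, 0], x[:, 1], 15.0 * np.array(y), 15.0 * np.array(y))
ax2.set_xlabel('frequent flier miles earned per year')
ax2.set_ylabel('time spent playing video games (%)')
plt.show()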

Prepare the data: normalize the numeric values

The normalization formula: newValue = (oldValue - min) / (max - min)

def autoNorm(dataSet):
    minVals = dataSet.min(0)  # column-wise minimums, stored in minVals
    maxVals = dataSet.max(0)  # column-wise maximums, stored in maxVals
    ranges = maxVals - minVals
    normDataSet = np.zeros(np.shape(dataSet))  # create an array with the same shape as the data set
    m = dataSet.shape[0]  # number of rows in the data set (here 1000)
    normDataSet = dataSet - np.tile(minVals, (m, 1))  # np.tile repeats minVals m times down the rows, once across the columns
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals

x, y = file2matrix(r'your_path_XXX\datingTestSet2.txt')
normMat, ranges, minVals = autoNorm(x)
# print(normMat)  # print these yourself to inspect anything that is unclear
# print(ranges)
print(minVals)
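As a quick sanity check (optional), every column of normMat should now span exactly [0, 1]:

print(normMat.min(axis=0))  # expected: [ 0.  0.  0.]
print(normMat.max(axis=0))  # expected: [ 1.  1.  1.]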

Test the algorithm: verify the classifier as a complete program

# Test code that runs the classifier against the dating-site data
def datingClassTest():
    hoRatio = 0.10
    datingDataMat, datingLabels = file2matrix(r'E:\Program Files\Machine Learning\机器学习实战及配套代码\machinelearninginaction\Ch02\datingTestSet.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)  # normalize the features
    m = normMat.shape[0]  # number of samples
    numTestVecs = int(m * hoRatio)  # how many samples are held out for testing; the rest serve as training data
    errorCount = 0.0
    for i in np.arange(numTestVecs):  # loop over the test samples
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)  # predicted class
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:  # count the errors
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))

datingClassTest()
Output:
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 1
the total error rate is: 0.050000

The error rate is 5%, which is acceptable. You can vary hoRatio and the value of k inside datingClassTest to see how the error rate changes with them.
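For instance, a minimal sketch that re-runs the hold-out test for several values of k. It reuses file2matrix, autoNorm and the classify0 function from the previous post; datingErrorRate is just an illustrative helper name, and the path is a placeholder you should point at your own copy of datingTestSet.txt:

def datingErrorRate(k, hoRatio=0.10, path=r'your_path_XXX\datingTestSet.txt'):
    datingDataMat, datingLabels = file2matrix(path)
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)  # first hoRatio of the samples are held out for testing
    errorCount = 0
    for i in np.arange(numTestVecs):
        result = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                           datingLabels[numTestVecs:m], k)
        if result != datingLabels[i]:
            errorCount += 1
    return errorCount / float(numTestVecs)

for k in (1, 3, 5, 7, 9):
    print('k = %d, error rate = %.3f' % (k, datingErrorRate(k)))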

Use the algorithm: predicting matches for the dating site

# Prediction function for the dating site
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']  # the three possible results
    percentTats = float(input('percentage of time spent playing video games?'))  # read the inputs from the console
    ffMiles = float(input('frequent flier miles earned per year?'))
    iceCream = float(input('liters of ice cream consumed per year?'))
    datingDataMat, datingLabels = file2matrix(r'E:\Program Files\Machine Learning\机器学习实战及配套代码\machinelearninginaction\Ch02\datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)  # normalize the training data
    inArr = np.array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)  # predict the class of the new person
    print("You will probably like this person: ", resultList[classifierResult - 1])

classifyPerson()
Output:
percentage of time spent playing video games?10    # input: time spent playing games
frequent flier miles earned per year?10000          # input: flier miles per year
liters of ice cream consumed per year?0.5           # input: liters of ice cream
You will probably like this person:  not at all     # the prediction

demo_2: a handwriting recognition system using the k-nearest neighbors algorithm

The directory trainingDigits for demo_2 contains roughly 2,000 examples; each example looks like this:

Figure 1

Each digit has roughly 200 samples, and the directory testDigits contains about 900 test examples. We train the classifier on the data in trainingDigits and evaluate it on the data in testDigits.
First, each image is formatted as a vector: the 32×32 binary image matrix is converted into a 1×1024 vector, so the classifier from before can be used directly.

Converting an image to a vector

# Prepare the data: convert an image into a test vector
def img2vector(filename):
    returnVect = np.zeros((1, 1024))
    fr = open(filename)
    for i in np.arange(32):  # read the file line by line (32 lines)
        lineStr = fr.readline()
        for j in np.arange(32):  # 32 characters per line
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect

returnVect = img2vector(r'your_path_XXXX\testDigits\0_13.txt')
Test:
print(returnVect[0, 0:31])
print(returnVect[0, 32:63])
Output:
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  1.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  1.
  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
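As an aside, the same conversion can be written more compactly with NumPy. A sketch, assuming every file really is 32 lines of 32 '0'/'1' characters (img2vector_np is just an illustrative name):

def img2vector_np(filename):
    with open(filename) as fr:
        digits = [int(ch) for line in fr for ch in line.strip()]  # flatten the 32x32 characters
    return np.array(digits, dtype=float).reshape(1, 1024)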

Test the algorithm: recognize handwritten digits with the k-nearest neighbors algorithm

# Test the algorithm: handwritten digit recognition with k-nearest neighbors

from os import listdir  # listdir returns the names of the files in a given directory
def handwritingClassTest():
    hwLabels = []
    # training set
    trainingFileList = listdir(r'E:\Program Files\Machine Learning\机器学习实战及配套代码\machinelearninginaction\Ch02\trainingDigits')
    # print(trainingFileList)  # the list of training file names
    m = len(trainingFileList)  # number of training files
    trainingMat = np.zeros((m, 1024))  # one 1x1024 row per training image
    for i in np.arange(m):
        fileNameStr = trainingFileList[i]  # e.g. 0_0.txt
        fileStr = fileNameStr.split('.')[0]  # strip the extension: 0_0
        classNumStr = int(fileStr.split('_')[0])  # the digit before the underscore is the class: 0
        hwLabels.append(classNumStr)  # record the label
        trainingMat[i, :] = img2vector(r'E:\Program Files\Machine Learning\机器学习实战及配套代码\machinelearninginaction\Ch02\trainingDigits\%s' % fileNameStr)
    # test set
    testFileList = listdir(r'E:\Program Files\Machine Learning\机器学习实战及配套代码\machinelearninginaction\Ch02\testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in np.arange(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])  # the true digit, taken from the file name
        vectorUnderTest = img2vector(r'E:\Program Files\Machine Learning\机器学习实战及配套代码\machinelearninginaction\Ch02\testDigits\%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))

    
handwritingClassTest()
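Each test image requires a 1024-dimensional distance computation against every one of the roughly 2,000 training vectors, so this test takes a while to run. If you want to see how long, a simple sketch using the standard time module:

import time

start = time.time()
handwritingClassTest()
print('handwritingClassTest took %.2f seconds' % (time.time() - start))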

References:
Machine Learning in Action, Chapter 2: k-Nearest Neighbors
Source code
