In the previous post we built our first classifier; now we test it on two demos.
demo_1: improving the matching results of a dating website with the k-nearest-neighbors algorithm
Three labels:
didntLike (people she did not like)
smallDoses (people of modest charm)
largeDoses (people of great charm)
Three sample features:
frequent flier miles earned per year
percentage of time spent playing video games
liters of ice cream consumed per week
We replace the three labels with the numbers 1, 2 and 3.
Converting the text records into NumPy
# Parse the text file into a NumPy matrix
import numpy as np

def file2matrix(filename):
    fr = open(filename)
    arrayOLines = fr.readlines()  # read the file line by line
    numberOfLines = len(arrayOLines)  # number of lines in the file
    returnMat = np.ones((numberOfLines, 3))  # preallocate the feature matrix (overwritten below): numberOfLines rows, 3 columns
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip().split('\t')  # strip() removes surrounding whitespace, split('\t') cuts the line at tabs
        returnMat[index, :] = line[0:3]  # index is the current row
        # convert the label in the last column to a number; datingTestSet.txt uses strings,
        # datingTestSet2.txt already stores 1/2/3
        if line[-1] == 'largeDoses':
            label = 3
        elif line[-1] == 'smallDoses':
            label = 2
        elif line[-1] == 'didntLike':
            label = 1
        else:
            label = int(line[-1])
        classLabelVector.append(label)
        index += 1
    return returnMat, classLabelVector
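As a design note, the if/elif chain above could also be written as a dictionary lookup; a small sketch of my own, assuming the same three label strings:
label2num = {'didntLike': 1, 'smallDoses': 2, 'largeDoses': 3}
# inside the loop, this single line would replace the if/elif chain:
# label = label2num[line[-1]] if line[-1] in label2num else int(line[-1])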
x, y = file2matrix(r'path\to\datingTestSet2.txt')  # placeholder: replace with the actual path
print(x)
print(type(x))  # x is a NumPy array
print(type(y))  # y is a Python list
print(x.shape)
print(np.array(y).shape)  # np.array(y) converts the list into an array so its shape can be inspected
Output:
[[ 4.09200000e+04 8.32697600e+00 9.53952000e-01]
[ 1.44880000e+04 7.15346900e+00 1.67390400e+00]
[ 2.60520000e+04 1.44187100e+00 8.05124000e-01]
...,
[ 2.65750000e+04 1.06501020e+01 8.66627000e-01]
[ 4.81110000e+04 9.13452800e+00 7.28045000e-01]
[ 4.37570000e+04 7.88260100e+00 1.33244600e+00]]
<class 'numpy.ndarray'>
<class 'list'>
(1000, 3)
(1000,)
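As a side note, for datingTestSet2.txt (numeric labels) the same parsing could be done with np.loadtxt; a minimal sketch of my own, with a placeholder path, that would not work for datingTestSet.txt because its last column holds strings:
data = np.loadtxt(r'path\to\datingTestSet2.txt', delimiter='\t')  # four tab-separated numeric columns
x2 = data[:, 0:3]             # the three feature columns
y2 = data[:, 3].astype(int)   # the numeric labels 1/2/3
print(x2.shape, y2.shape)     # expected: (1000, 3) (1000,)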
Now that the data has been imported from the text file, we need to understand what it really means, so we display it graphically.
Analyze the data: creating scatter plots with Matplotlib
import matplotlib.pyplot as plt
fig = plt.figure()  # create a figure object, i.e. a blank canvas
x, y = file2matrix(r'path\to\datingTestSet2.txt')  # placeholder path
ax1 = fig.add_subplot(2, 1, 1)  # add_subplot creates a subplot: the grid is 2 rows by 1 column, and this selects the first cell
ax1.scatter(x[:, 1], x[:, 2], 15.0 * np.array(y), 15.0 * np.array(y))  # marker size and color encode the class of each point
# x[:, 1] and x[:, 2] are the plotted columns; 15.0 * np.array(y) is passed once as the marker size and once as the color
# x[:, 0] would select column 0 of every row
ax2 = fig.add_subplot(2, 1, 2)
ax2.scatter(x[:, 0], x[:, 1], 15.0 * np.array(y), 15.0 * np.array(y))  # marker size and color encode the class of each point
plt.show()
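With the plain scatter call it is hard to tell which color belongs to which class. A small sketch of my own (not the book's code), assuming x and y are the matrix and label list returned by file2matrix above: plotting each class separately makes it possible to label the axes and attach a legend.
labels = np.array(y)
names = {1: 'didntLike', 2: 'smallDoses', 3: 'largeDoses'}  # the 1/2/3 encoding used by file2matrix
fig, ax = plt.subplots()
for cls, name in names.items():
    mask = labels == cls               # boolean mask selecting the rows of this class
    ax.scatter(x[mask, 0], x[mask, 1], s=15, label=name)
ax.set_xlabel('frequent flier miles per year')
ax.set_ylabel('time playing video games (%)')
ax.legend()
plt.show()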
Prepare the data: normalizing the numeric values. Each feature is rescaled to [0, 1] with newValue = (oldValue - min) / (max - min).
def autoNorm(dataSet):
    minVals = dataSet.min(0)  # column-wise minimums
    maxVals = dataSet.max(0)  # column-wise maximums
    ranges = maxVals - minVals
    normDataSet = np.zeros(np.shape(dataSet))  # array of zeros with the same shape as dataSet
    m = dataSet.shape[0]  # number of rows in the data set
    normDataSet = dataSet - np.tile(minVals, (m, 1))  # tile repeats minVals m times along the rows, once along the columns
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
x, y = file2matrix(r'path\to\datingTestSet2.txt')  # placeholder path
normMat, ranges, minVals = autoNorm(x)
#print(normMat)  # uncomment and inspect any intermediate values you are unsure about
#print(ranges)
print(minVals)
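A quick sanity check of my own (assuming normMat comes from autoNorm above): every value should lie in [0, 1] and each column should span the full range.
assert normMat.min() >= 0.0 and normMat.max() <= 1.0
print(normMat.min(0))  # expected: [0. 0. 0.]
print(normMat.max(0))  # expected: [1. 1. 1.]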
Test the algorithm: verifying the classifier as a complete program
# Test code for the classifier on the dating-site data
def datingClassTest():
    hoRatio = 0.10  # hold out 10% of the data for testing
    datingDataMat, datingLabels = file2matrix(r'E:\Program Files\Machine Learning\机器学习实战及配套代码\machinelearninginaction\Ch02\datingTestSet.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)  # normalize
    m = normMat.shape[0]  # number of samples
    numTestVecs = int(m * hoRatio)  # how many samples are used for testing; the rest serve as training data
    errorCount = 0.0
    for i in np.arange(numTestVecs):  # loop over the test samples
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)  # classify the i-th test vector
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:  # count the errors
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
datingClassTest()
Output:
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 1
the total error rate is: 0.050000
The error rate is 5%, which is acceptable. You can vary hoRatio and k inside datingClassTest and check how the error rate changes with them; a quick sweep is sketched below.
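A minimal sketch of such a sweep (my own addition; datingErrorRate is a hypothetical helper, the path is a placeholder, and classify0, file2matrix and autoNorm are assumed to be defined as above):
def datingErrorRate(k=3, hoRatio=0.10, path=r'path\to\datingTestSet.txt'):  # placeholder path
    dataMat, labels = file2matrix(path)
    normMat, ranges, minVals = autoNorm(dataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)      # first numTestVecs rows are the test set
    errors = 0
    for i in range(numTestVecs):
        result = classify0(normMat[i, :], normMat[numTestVecs:m, :], labels[numTestVecs:m], k)
        if result != labels[i]:
            errors += 1
    return errors / float(numTestVecs)

for k in (1, 3, 5, 7, 9):
    print("k=%d  error rate=%.3f" % (k, datingErrorRate(k=k)))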
Use the algorithm: a prediction function for the dating site
# Dating-site prediction function
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']  # the three possible results
    percentTats = float(input('percentage of time spent playing video games?'))  # read the inputs
    ffMiles = float(input('frequent flier miles earned per year?'))
    iceCream = float(input('liters of ice cream consumed per year?'))
    datingDataMat, datingLabels = file2matrix(r'E:\Program Files\Machine Learning\机器学习实战及配套代码\machinelearninginaction\Ch02\datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)  # normalize
    inArr = np.array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)  # normalize the input with the same ranges, then predict
    print("You will probably like this person: ", resultList[classifierResult - 1])
classifyPerson()
Output:
percentage of time spent playing video games?10   # entered time spent playing games
frequent flier miles earned per year?10000         # entered flier miles
liters of ice cream consumed per year?0.5           # entered liters of ice cream
You will probably like this person: not at all      # the prediction
demo_2: a handwriting recognition system with the k-nearest-neighbors algorithm
The directory trainingDigits in demo_2 contains about 2,000 examples; each example looks like the one shown in the figure.
Every digit has about 200 samples, and the directory testDigits holds about 900 test cases. We train the classifier on the data in trainingDigits and evaluate it on the data in testDigits.
First, each image is formatted as a vector: the 32x32 binary image matrix is converted into a 1x1024 vector, so the classifier from before can be reused.
Converting an image into a vector
# Prepare the data: convert an image into a test vector
def img2vector(filename):
    returnVect = np.zeros((1, 1024))
    fr = open(filename)
    for i in np.arange(32):  # read the file line by line (32 rows)
        lineStr = fr.readline()
        for j in np.arange(32):  # 32 characters per row
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect
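The double loop can also be written more compactly; a sketch of my own (img2vector_np is a hypothetical name), assuming each file really contains 32 lines of 32 characters:
def img2vector_np(filename):
    with open(filename) as fr:
        # flatten all 0/1 characters into one list, dropping the newline at the end of each line
        digits = [int(ch) for line in fr for ch in line.strip()]
    return np.array(digits, dtype=float).reshape(1, 1024)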
returnVect = img2vector(r'path\to\testDigits\0_13.txt')  # placeholder path
Test:
print(returnVect[0, 0:31])
print(returnVect[0, 32:63])
Output:
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1.
1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Test the algorithm: recognizing handwritten digits with the k-nearest-neighbors algorithm
# Test the algorithm: handwritten digit recognition with k-NN
from os import listdir  # listdir lists the file names in a given directory
def handwritingClassTest():
    hwLabels = []
    # training set
    trainingFileList = listdir(r'E:\Program Files\Machine Learning\机器学习实战及配套代码\machinelearninginaction\Ch02\trainingDigits')
    #print(trainingFileList)  # the list of file names
    m = len(trainingFileList)  # number of training files
    trainingMat = np.zeros((m, 1024))  # one 1024-element row per training image
    for i in np.arange(m):
        fileNameStr = trainingFileList[i]  # e.g. 0_0.txt
        fileStr = fileNameStr.split('.')[0]  # strip the extension: 0_0
        classNumStr = int(fileStr.split('_')[0])  # the digit before the underscore: 0
        hwLabels.append(classNumStr)  # store the label
        trainingMat[i, :] = img2vector(r'E:\Program Files\Machine Learning\机器学习实战及配套代码\machinelearninginaction\Ch02\trainingDigits\%s' % fileNameStr)
    # test set
    testFileList = listdir(r'E:\Program Files\Machine Learning\机器学习实战及配套代码\machinelearninginaction\Ch02\testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in np.arange(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector(r'E:\Program Files\Machine Learning\机器学习实战及配套代码\machinelearninginaction\Ch02\testDigits\%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d " % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f " % (errorCount / float(mTest)))
handwritingClassTest()
Output: one line per test image comparing the predicted digit with the true digit, followed by the total number of errors and the overall error rate.
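For every test image, classify0 has to compute a distance against each of the roughly 2,000 training vectors of 1,024 values, so this test takes a while to run. A minimal timing sketch (my own addition):
import time
start = time.time()
handwritingClassTest()
print("elapsed: %.1f seconds" % (time.time() - start))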
References:
Machine Learning in Action, Chapter 2: classifying with k-Nearest Neighbors
Source code