Preface
The source code and data used in this book can all be found at: http://guidetodatamining.com/
The theory in this book is fairly simple, the text has few errors, and there is plenty of hands-on practice; if you write out every piece of code yourself, you will learn a great deal. In short: a good book for beginners.
Reposting is welcome; please credit the source. If you spot any mistakes, corrections are welcome.
Full collection: https://www.zybuluo.com/hainingwyx/note/559139
Algorithm Evaluation and kNN
10-fold cross validation: split the dataset into 10 buckets, train on 9 of them, test on the remaining one, and repeat the process 10 times so each bucket is used for testing once. The result depends on the random partition of the data, so stratified sampling (keeping each bucket's class proportions uniform) is important when bucketing.
Leave-one-out: n-fold cross validation, where n is the number of instances. Unlike 10-fold cross validation, its result is deterministic rather than dependent on a random partition. Its drawback is the large computational cost.
Confusion matrix: rows represent the true classes of the test samples, columns represent the classes predicted by the classifier. It reveals where a classifier performs well and where it fails.
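To make this concrete, here is a minimal sketch (not from the book; the labels are invented for illustration) that tallies a confusion matrix from parallel lists of true and predicted labels:

```python
# Minimal sketch: tally a confusion matrix from true/predicted labels.
from collections import Counter

def confusion_matrix(trueLabels, predictedLabels):
    counts = Counter(zip(trueLabels, predictedLabels))
    classes = sorted(set(trueLabels) | set(predictedLabels))
    # rows = true class, columns = predicted class
    return [[counts[(t, p)] for p in classes] for t in classes], classes

matrix, classes = confusion_matrix(['a', 'a', 'b', 'b', 'b'],
                                   ['a', 'b', 'b', 'b', 'a'])
print(classes)   # ['a', 'b']
print(matrix)    # [[1, 1], [1, 2]]
```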
```python
# divide data into 10 buckets
import random

def buckets(filename, bucketName, separator, classColumn):
    """the original data is in the file named filename
    bucketName is the prefix for all the bucket names
    separator is the character that divides the columns
    (for ex., a tab or comma) and classColumn is the column
    that indicates the class"""
    # put the data in 10 buckets
    numberOfBuckets = 10
    data = {}
    # first read in the data and divide by category
    with open(filename) as f:
        lines = f.readlines()
        for line in lines:
            if separator != '\t':
                line = line.replace(separator, '\t')
            # first get the category
            category = line.split()[classColumn]
            data.setdefault(category, [])  # initialize the list for this category
            data[category].append(line)    # keep the whole line
    # initialize the buckets [[], [], ...]
    buckets = []
    for i in range(numberOfBuckets):
        buckets.append([])
    # now for each category put the data into the buckets
    # (this is the stratified sampling: each bucket gets a
    # proportional share of every class)
    for k in data.keys():
        # randomize order of instances for each class
        # data[k] is a list of lines
        random.shuffle(data[k])
        bNum = 0
        # divide into buckets
        for item in data[k]:
            buckets[bNum].append(item)
            bNum = (bNum + 1) % numberOfBuckets
    # write each bucket to its own file (the tmp/ directory must exist)
    for bNum in range(numberOfBuckets):
        with open("%s-%02i" % ('tmp/' + bucketName, bNum + 1), 'w') as f:
            for item in buckets[bNum]:
                f.write(item)

# example of how to use this code
buckets("data/mpgData.txt", 'mpgData', ',', 0)
```
Classifier evaluation: the Kappa statistic, which measures how much better a classifier performs relative to a random classifier.
$$
\kappa =\frac{P(c)-P(r)}{1-P(r)}
$$
$P(c)$ is the accuracy of the actual classifier, and $P(r)$ is the accuracy of a random classifier.
Kappa range | Performance |
---|---|
< 0 | worse than random |
0.01-0.2 | slight agreement |
0.21-0.4 | fair agreement |
0.41-0.6 | moderate agreement |
0.61-0.8 | substantial agreement |
0.81-1 | near-perfect agreement |
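To make the formula concrete, here is a worked example with invented numbers (not from the book):

```python
# Worked kappa example with made-up numbers.
# Confusion matrix for two classes, rows = true, columns = predicted:
#            predicted a   predicted b
#   true a        35            15        (50 instances of a)
#   true b        10            40        (50 instances of b)
total = 100.0
p_c = (35 + 40) / total                  # accuracy of the actual classifier
# A random classifier predicts with the observed column frequencies:
# it predicts a 45% of the time and b 55% of the time.
p_r = (0.45 * 50 + 0.55 * 50) / total    # expected accuracy by chance = 0.5
kappa = (p_c - p_r) / (1 - p_r)
print(round(kappa, 3))                   # 0.5 -> moderate agreement per the table
```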
kNN: with a single nearest neighbor, one unusual sample can cause the points around it to be misclassified. The improvement is to consult k neighbors instead of one. The closer a neighbor is, the greater its influence should be; the influence factor can be expressed as the inverse of the distance.
```python
import heapq
import random

# method of a kNN classifier class; assumes self.k, self.data, and
# self.manhattan (the distance function) are defined on the class
def knn(self, itemVector):
    """returns the predicted class of itemVector using k
    Nearest Neighbors"""
    # changed from min to heapq.nsmallest to get the
    # k closest neighbors
    neighbors = heapq.nsmallest(self.k,
                                [(self.manhattan(itemVector, item[1]), item)
                                 for item in self.data])
    # each neighbor gets a vote
    results = {}
    for neighbor in neighbors:
        theClass = neighbor[1][0]
        results.setdefault(theClass, 0)
        results[theClass] += 1
    # sort by vote count, largest first
    resultList = sorted([(i[1], i[0]) for i in results.items()], reverse=True)
    # get all the classes that have the maximum votes
    maxVotes = resultList[0][0]
    possibleAnswers = [i[1] for i in resultList if i[0] == maxVotes]
    # randomly select one of the classes that received the max votes
    answer = random.choice(possibleAnswers)
    return answer
```
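The method above uses unweighted majority voting. The inverse-distance weighting mentioned earlier would replace the vote counting with something like the following sketch (my own variant, not the book's code; it reuses the `heapq` import and assumes the same `self.k`, `self.data`, and `self.manhattan` attributes):

```python
# Hedged sketch: distance-weighted kNN voting, where each neighbor's
# vote is weighted by 1 / distance, so closer neighbors count for more.
def knn_weighted(self, itemVector):
    neighbors = heapq.nsmallest(self.k,
                                [(self.manhattan(itemVector, item[1]), item)
                                 for item in self.data])
    results = {}
    for distance, item in neighbors:
        theClass = item[0]
        # guard against a zero distance (exact match with a training item)
        weight = 1.0 / max(distance, 1e-9)
        results[theClass] = results.get(theClass, 0) + weight
    # return the class with the largest total weight
    return max(results.items(), key=lambda pair: pair[1])[0]
```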
In engineering practice, algorithms tend to perform better the more data you have. For research papers, the goal is still to devise an algorithm that delivers even a modest performance improvement.