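Following the Coursera course Applied Machine Learning in Python, I worked through a machine learning example: a very simple fruit classifier. After downloading the data, I found that simply copying and pasting the file's contents changes the delimiter in the data file. The original uses the tab character (hex 09) as the delimiter, but after pasting into a new file the delimiter became runs of two to four spaces (hex 20). Encoding quirks like this give me a real headache. It took me quite a while to track down, and I only figured it out once I compared the two files in hex with UltraEdit.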
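One way to sidestep the tab-versus-spaces problem is to split each row on any run of whitespace instead of on one specific delimiter. Below is a minimal sketch using only the standard library. The column names and values are made up for illustration (they are not the course's actual file), and `read_whitespace_table` is a hypothetical helper name.

```python
def read_whitespace_table(text):
    """Parse rows whose fields are separated by tabs OR runs of spaces."""
    # str.split() with no argument splits on any run of whitespace,
    # so "\t" and "   " are treated the same way.
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample data: the same table with tab (0x09) delimiters
# and with the 2-4 space (0x20) delimiters produced by copy-paste.
tab_version = "fruit_label\tfruit_name\tmass\n1\tapple\t192\n4\tlemon\t116\n"
space_version = "fruit_label  fruit_name   mass\n1    apple  192\n4   lemon    116\n"

tab_rows = read_whitespace_table(tab_version)
space_rows = read_whitespace_table(space_version)

# A quick byte-level check shows which delimiter a file really uses.
has_tab = b"\t" in tab_version.encode("ascii")
print(tab_rows == space_rows, has_tab)
```

This only works when no field itself contains a space. If you load the file with pandas, passing `sep=r"\s+"` to `pd.read_csv` should give the same tolerance for both variants.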
The fruit classification example uses the k-nearest neighbors (kNN) algorithm, which is also called instance-based or memory-based supervised learning. What this means is that instance-based learning methods work by memorizing the labeled examples that they see in the training set. They then use those memorized examples to classify new objects later.
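To make the "memorizing" idea concrete, here is a from-scratch sketch in plain Python. It is not how scikit-learn implements kNN. The class name `TinyKNN` and the (mass, width) fruit values are invented for illustration.

```python
import math
from collections import Counter


class TinyKNN:
    """Toy k-nearest-neighbors classifier (illustrative, not optimized)."""

    def __init__(self, k=1):
        self.k = k

    def fit(self, X, y):
        # "Training" is nothing more than storing the labeled examples.
        self.X = list(X)
        self.y = list(y)
        return self

    def predict_one(self, x):
        # Sort the memorized examples by Euclidean distance to the query.
        by_distance = sorted(zip(self.X, self.y),
                             key=lambda pair: math.dist(pair[0], x))
        nearest_labels = [label for _, label in by_distance[:self.k]]
        # Majority vote among the k nearest labels.
        return Counter(nearest_labels).most_common(1)[0][0]


# Made-up (mass in g, width in cm) examples.
X_train = [(180, 8.0), (170, 7.5), (120, 6.0), (115, 5.8)]
y_train = ["apple", "apple", "lemon", "lemon"]

model = TinyKNN(k=1).fit(X_train, y_train)
print(model.predict_one((175, 7.8)), model.predict_one((118, 6.1)))
```

All the work happens at prediction time: `fit` only stores the data, and every query compares against every stored example.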
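Four criteria specify the method: a distance metric, the number of nearest neighbors k, an optional weighting function on the neighbors, and a way to aggregate the neighbors' classes.
scikit-learn by default uses the Minkowski distance with p=2 (i.e. Euclidean distance); k=5 (n_neighbors=5); no special treatment for the weight function (weights='uniform'); and a majority vote.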
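The four choices map onto constructor arguments of scikit-learn's `KNeighborsClassifier`. The sketch below spells them out and reads back the defaults with `get_params()`. To my knowledge the defaults are `n_neighbors=5`, `weights="uniform"`, `metric="minkowski"`, `p=2`. The 1-D toy data is made up for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# The four choices written out explicitly; these match the defaults.
explicit_knn = KNeighborsClassifier(
    n_neighbors=5,       # k: how many neighbors to consult
    weights="uniform",   # every neighbor's vote counts equally
    metric="minkowski",  # distance family
    p=2,                 # p=2 makes Minkowski the Euclidean distance
)

# Read the defaults back from an estimator built with no arguments.
default_params = KNeighborsClassifier().get_params()

# Made-up 1-D toy data: class 0 near the origin, class 1 near 10.
X_train = [[0], [1], [2], [3], [10], [11], [12], [13]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

explicit_knn.fit(X_train, y_train)
predictions = explicit_knn.predict([[1.5], [11.5]]).tolist()
print(default_params["n_neighbors"], predictions)
```

Passing `weights="distance"` instead would make closer neighbors count more, which is the "optional weighting function" from the list of criteria.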
We can see that when k has a small value like 1, the classifier is good at learning the classes of individual points in the training set, but its decision boundary is fragmented, with considerable variation: the feature space gets split into many small pieces. This is because when k = 1, the prediction is sensitive to noise, outliers, mislabeled data, and other sources of variation in individual data points.
Using a larger k suppresses the effects of noisy individual labels, but results in classification boundaries that are less detailed.
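The effect of k on noise can be seen on a tiny example. In this sketch the 1-D data is invented and one training point (x=2) is deliberately mislabeled. A query right next to it is classified with k=1 and with k=5.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up 1-D data: class 0 around 0..4, class 1 around 10..14.
# The point x=2 is deliberately mislabeled as class 1 to act as noise.
X_train = [[0], [1], [2], [3], [4], [10], [11], [12], [13], [14]]
y_train = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]

query = [[2.1]]  # sits right next to the mislabeled point

predictions = {}
for k in (1, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    predictions[k] = int(model.predict(query)[0])

# k=1 copies the noisy label; k=5 lets the four correct neighbors outvote it.
print(predictions)  # {1: 1, 5: 0}
```

The trade-off runs the other way too: with a large enough k, small but genuine pockets of a class get outvoted, which is why the boundary becomes less detailed.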