Chapter 3: Classification

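This is the third installment of my reading notes on Hands-On Machine Learning with Scikit-Learn and TensorFlow. I am writing them up to consolidate what I have learned and, I hope, to help others who are studying the same material. I am a beginner myself, so my understanding and translation may fall short in places; corrections and suggestions are welcome.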

Setup

# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "classification"

def save_fig(fig_id, tight_layout=True):
    path = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id + ".png")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)

The MNIST dataset

This chapter uses the MNIST dataset, which contains 70,000 images of handwritten digits, each 28 x 28 pixels.

Load MNIST with Scikit-Learn's built-in fetch_openml.

Warning: fetch_mldata() is deprecated since Scikit-Learn 0.20. You should use fetch_openml() instead. However, it returns the unsorted MNIST dataset, whereas fetch_mldata() returned the dataset sorted by target (the training set and the test set were sorted separately).

def sort_by_target(mnist):
    reorder_train = np.array(sorted([(target,i) for i,target in enumerate(mnist.target[:60000])]))[:,1]  # training set
    reorder_test = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[60000:])]))[:,1]  # test set
    mnist.data[:60000] = mnist.data[reorder_train]
    mnist.target[:60000] = mnist.target[reorder_train]
    mnist.data[60000:] = mnist.data[reorder_test + 60000]
    mnist.target[60000:] = mnist.target[reorder_test + 60000]

Loading the dataset

try:
    from sklearn.datasets import fetch_openml
    mnist = fetch_openml('mnist_784', version=1, cache=True, as_frame=False)  # as_frame=False (Scikit-Learn 0.22+) returns NumPy arrays
    mnist.target = mnist.target.astype(np.int8) # fetch_openml() returns targets as strings
    sort_by_target(mnist) # fetch_openml() returns an unsorted dataset
except ImportError:
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')
    
X, y = mnist['data'], mnist['target']  # X and y are used throughout the rest of the chapter
X.shape, y.shape

Figure 1: MNIST data shape

some_digit = X[36000]
some_digit_image = some_digit.reshape(28,28)
plt.imshow(some_digit_image, cmap=mpl.cm.binary,
          interpolation='nearest')
plt.axis('off')

save_fig('some_digit_plot')
plt.show()

Figure 2: Some digit plot

Creating the training and test sets

# split training and test set
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Shuffling the training set

import numpy as np
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

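Training a binary classifier

Use Scikit-Learn's SGDClassifier class to build a classifier that tells whether an image is the digit 5.

y_train_5 = (y_train == 5)  # True for every 5, False for every other digit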
y_test_5 = (y_test == 5)

from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(max_iter=5, tol=-np.inf, random_state=42)  # np.infty was removed in NumPy 2.0
sgd_clf.fit(X_train, y_train_5)

Check the cross-validation results using accuracy as the metric.

from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy')

The results show an accuracy of about 95%.

# Split the training set into stratified folds and compute the accuracy on each
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)  # random_state only takes effect with shuffle=True

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = (y_train_5[train_index])
    X_test_fold = X_train[test_index]
    y_test_fold = (y_train_5[test_index])
    
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))

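Both ways of splitting the data give a classification accuracy of about 95%. Is that a good result? Let's build the simplest possible classifier, one that labels every image as "not 5", and see how it does.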

from sklearn.base import BaseEstimator
class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        pass
    def predict(self, X): # always predict False ("not 5")
        return np.zeros((len(X),1), dtype=bool)

never_5_clf = Never5Classifier()
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring='accuracy')

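This classifier scores over 90% accuracy. That is because only about 10% of the images in the dataset are 5s, so always guessing "not 5" is right about 90% of the time. It shows that accuracy can be a misleading performance measure, especially when the dataset is skewed.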
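To make the point concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (10% positives, roughly mirroring the share of 5s in MNIST), and `accuracy` is a hypothetical helper, not a Scikit-Learn function.

```python
# Toy labels: 10 positives ("is a 5") out of 100 samples; illustrative data only.
y_true = [True] * 10 + [False] * 90

# A "classifier" that never predicts the positive class.
y_pred = [False] * len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)

# Number of real positives the classifier actually found.
found_positives = sum(1 for t, p in zip(y_true, y_pred) if t and p)

print(baseline_accuracy, found_positives)
```

The accuracy looks respectable even though the classifier never finds a single positive, which is why the following sections turn to other metrics.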

Confusion matrix

from sklearn.model_selection import cross_val_predict
# Get an out-of-fold prediction for every training instance
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

from sklearn.metrics import confusion_matrix
# Compute the confusion matrix
confusion_matrix(y_train_5, y_train_pred)

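Figure 3: Confusion Matrix

Computing precision, recall, and F1 score

Figure 4: Precision, Recall, F1 Score

The precision tells us that when the classifier claims an image is a 5, it is right only about 77% of the time. The recall tells us that only about 80% of the actual 5s are detected.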
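As a rough illustration of what these three numbers mean, the sketch below computes them by hand from true and predicted labels. The data is a toy example and `precision_recall_f1` is a hypothetical helper; in practice Scikit-Learn's `precision_score`, `recall_score`, and `f1_score` from `sklearn.metrics` do this for you.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 from two lists of booleans."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)      # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)  # false negatives

    # Precision: of everything predicted positive, how much really is positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything really positive, how much was found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy data: 4 real positives, 3 of them found, plus 1 false alarm.
y_true = [True, True, True, True, False, False, False, False, False, False]
y_pred = [True, True, True, False, True, False, False, False, False, False]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Because F1 is a harmonic mean, it is high only when precision and recall are both high.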

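The precision/recall trade-off

SGDClassifier computes a score for each instance with its decision function. If the score exceeds a threshold, the instance is assigned to the positive class; otherwise it is assigned to the negative class. The decision_function() method returns the score the classifier computed for each instance.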

y_scores = sgd_clf.decision_function([some_digit])
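The effect of moving the threshold can be sketched without any library. The scores and labels below are invented for illustration, and `predict_with_threshold` is a hypothetical helper. A threshold of 0 corresponds to what SGDClassifier's predict() does by default.

```python
# Made-up decision scores and true labels for five instances.
scores = [-3.2, -0.5, 0.8, 2.1, 4.7]
y_true = [False, False, True, False, True]

def predict_with_threshold(scores, threshold):
    """Label an instance positive when its score exceeds the threshold."""
    return [s > threshold for s in scores]

def recall(y_true, y_pred):
    """Fraction of the real positives that were predicted positive."""
    found = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    return found / sum(y_true)

low = predict_with_threshold(scores, 0.0)   # permissive threshold
high = predict_with_threshold(scores, 3.0)  # strict threshold

print(low, recall(y_true, low))
print(high, recall(y_true, high))
```

Raising the threshold removes the false positive at score 2.1 but also loses the true positive at score 0.8, so precision goes up while recall goes down.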

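So how do we choose a suitable threshold? We can again use cross_val_predict() to get a score for every instance in the training set, this time asking for decision scores rather than predictions. With those scores, precision_recall_curve() computes the precision and recall for every possible threshold.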

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method='decision_function')

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

Plot precision and recall against the threshold.

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1],'b--', label='Precision')
    plt.plot(thresholds, recalls[:-1],'g-', label='Recall')
    plt.xlabel('Threshold', fontsize=16)
    plt.legend(loc='upper left', fontsize=16)
    plt.ylim([0,1])
    
plt.figure(figsize=(8,4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.xlim([-700000,700000])
save_fig("precision_recall_vs_threshold_plot")
plt.show()

Figure 5: precision_recall_vs_threshold_plot

If we want 90% precision, the plot shows that the threshold should be around 70,000. We can then compute the recall at that threshold.

Figure 6: 90% Precision
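Instead of reading the threshold off the plot, it can also be searched for. The following is a minimal sketch in plain Python on made-up scores; `precision_recall_at` and `lowest_threshold_for_precision` are hypothetical helpers, and an instance is treated as positive when its score is at least the threshold.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    y_pred = [s >= threshold for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 1.0  # no positive predictions
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def lowest_threshold_for_precision(y_true, scores, target):
    """Try each observed score as a threshold, lowest first, and return
    the first one whose precision reaches the target (or None)."""
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall_at(y_true, scores, threshold)
        if precision >= target:
            return threshold, precision, recall
    return None

# Toy data: higher scores are mostly, but not always, positives.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
y_true = [False, False, True, False, False, True, True, True, True, True]

threshold, precision, recall = lowest_threshold_for_precision(y_true, scores, 0.90)
print(threshold, precision, recall)
```

Picking the lowest qualifying threshold keeps recall as high as possible for the requested precision.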

A second approach is to plot precision directly against recall and pick a trade-off point just before the curve starts to drop sharply.

def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, 'b-', linewidth=2)
    plt.xlabel('Recall', fontsize=16)
    plt.ylabel('Precision', fontsize=16)
    plt.axis([0,1,0,1])
    
plt.figure(figsize=(8,6))
plot_precision_vs_recall(precisions, recalls)
save_fig('precision_vs_recall_plot')
plt.show()

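Figure 7: precision_vs_recall_plot

The ROC curve

The ROC curve plots the true positive rate (TPR, another name for recall) against the false positive rate (FPR). The FPR equals 1 - TNR, where the true negative rate (TNR) is the fraction of actual negatives that are correctly classified as negative; it is also called specificity. So the ROC curve plots recall against 1 - specificity.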
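These rates all come from the four cells of the confusion matrix. The sketch below computes them by hand on toy labels; `rates` is a hypothetical helper introduced only for illustration.

```python
def rates(y_true, y_pred):
    """Return (TPR, FPR, TNR) for two lists of booleans."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)

    tpr = tp / (tp + fn)  # recall: share of real positives that were found
    fpr = fp / (fp + tn)  # share of real negatives wrongly flagged positive
    tnr = tn / (fp + tn)  # specificity: share of real negatives kept negative
    return tpr, fpr, tnr

# Toy data: 4 positives (3 found) and 6 negatives (1 false alarm).
y_true = [True, True, True, True, False, False, False, False, False, False]
y_pred = [True, True, True, False, True, False, False, False, False, False]

tpr, fpr, tnr = rates(y_true, y_pred)
print(tpr, fpr, tnr)
```

One point on the ROC curve is the pair (FPR, TPR) at a given threshold; sweeping the threshold traces out the whole curve.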

The roc_curve() function computes the TPR and FPR for different threshold values.

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

Plot the ROC curve.

def plot_roc_curve(fpr,tpr,label=None):
    plt.plot(fpr,tpr,linewidth=2, label=label)
    plt.plot([0,1],[0,1],'k--')
    plt.axis([0,1,0,1])
    plt.xlabel('False Positive Rate',fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    
plt.figure(figsize=(8,6))
plot_roc_curve(fpr, tpr)
save_fig('roc_curve_plot')
plt.show()

Figure 8: ROC Curve

Computing the area under the ROC curve (AUC)

from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_scores)
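One way to read the AUC: it is the probability that a randomly chosen positive instance gets a higher score than a randomly chosen negative one. The brute-force sketch below computes it that way on made-up data; `pairwise_auc` is a hypothetical helper, not how roc_auc_score is implemented.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.
    Ties count as half a correct ranking."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data: one negative (0.8) outranks two of the positives.
y_true = [False, False, True, False, True, True]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]

auc = pairwise_auc(y_true, scores)
perfect = pairwise_auc([False, True], [0.2, 0.9])
print(auc, perfect)
```

A perfect classifier ranks every positive above every negative and gets 1.0, while random scores give about 0.5.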

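As a rule of thumb, prefer the PR curve for choosing the precision/recall trade-off when the positive class is rare or when you care more about false positives than false negatives; otherwise, use the ROC curve.

Let's train a RandomForestClassifier and compare its ROC curve and AUC with those of the SGDClassifier. RandomForestClassifier has a predict_proba() method that returns, for each instance, the probability that it belongs to each class.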

from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(n_estimators=10, random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5,
                                   cv=3, method='predict_proba')

y_scores_forest = y_probas_forest[:,1] # proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)

Comparing the ROC curves

plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, 'b:', linewidth=2, label='SGD')
plot_roc_curve(fpr_forest, tpr_forest, 'Random Forest')
plt.legend(loc='lower right', fontsize=16)
save_fig('roc_curve_comparison_plot')
plt.show()

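Figure 9: roc_curve_comparison_plot

Compute the AUC, precision, and recall.

Figure 10: AUC, Precision, Recall

Multiclass classification

Multiclass classification means distinguishing between more than two classes. Some algorithms, such as random forests and naive Bayes classifiers, handle multiple classes directly. Others, such as support vector machines and linear classifiers, are strictly binary. To build a classifier that sorts digit images into 10 classes (0 to 9), we can train 10 binary classifiers, one per digit: a 0-detector, a 1-detector, and so on up to 9. To classify an image, we get the decision score from every classifier and pick the class whose classifier outputs the highest score. This is called the one-versus-all (OvA) strategy.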
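The final OvA step is just an argmax over the per-class scores. Here is a minimal sketch with invented scores; `predict_ova` is a hypothetical helper and the numbers do not come from a trained model.

```python
def predict_ova(scores_per_class):
    """Return the class whose binary detector produced the highest score."""
    return max(scores_per_class, key=scores_per_class.get)

# Made-up decision scores from ten binary detectors, keyed by digit.
scores_per_class = {
    0: -3.1, 1: -5.0, 2: -2.4, 3: -0.7, 4: -4.2,
    5: 1.8, 6: -3.9, 7: -2.8, 8: -1.1, 9: -2.0,
}

predicted_digit = predict_ova(scores_per_class)
print(predicted_digit)
```

Only the relative order of the scores matters here, so the detectors do not need to output probabilities.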

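Another approach is to train a binary classifier for every pair of classes: one to distinguish 1s from 2s, another to distinguish 2s from 3s, and so on, for a total of N * (N - 1) / 2 classifiers. This is called the one-versus-one (OvO) strategy.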
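The pair count is easy to check by enumerating the pairs. This is a small sketch using the standard library; `ovo_pairs` is a hypothetical helper name.

```python
from itertools import combinations

def ovo_pairs(classes):
    """List every unordered pair of classes; OvO trains one classifier per pair."""
    return list(combinations(classes, 2))

digits = list(range(10))  # the ten MNIST classes, 0 to 9
pairs = ovo_pairs(digits)

n = len(digits)
expected = n * (n - 1) // 2  # the formula N * (N - 1) / 2

print(len(pairs), expected)
```

For the 10 digit classes this comes to 45 classifiers, each trained only on the images of its two classes.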

When you use a binary classification algorithm for a multiclass task, Scikit-Learn automatically applies OvA (for SVM classifiers it uses OvO instead).

sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_digit])

Figure 11: SGD One vs All

To force Scikit-Learn to use OvO or OvA, use the OneVsOneClassifier or OneVsRestClassifier class.

from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(max_iter=5, tol=-np.inf,
                                          random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([some_digit])

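Training a multiclass random forest classifier is easy. Its predict_proba() method returns the probability the classifier assigns to each class for every instance.

Figure 12: Random Forest Classifier

Use cross-validation to score the SGD classifier.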

cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring='accuracy')

Accuracy is around 85%. Let's see whether feature scaling improves it.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring='accuracy')

Accuracy now rises above 90%.
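For intuition, standardization shifts each feature to zero mean and rescales it to unit variance. The sketch below does this for a single made-up column of pixel intensities; `standardize` is a hypothetical helper. It uses the population standard deviation, as StandardScaler does, and maps a constant column to zeros.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale one feature column to zero mean and unit variance."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    if sigma == 0:
        # A constant feature carries no information; centre it only.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Made-up intensities of one pixel across four images.
pixel_column = [0, 0, 128, 255]
scaled = standardize(pixel_column)

print(scaled)
print(mean(scaled), pstdev(scaled))
```

Many MNIST pixels near the image border are always 0, which is why the constant-column case matters in practice.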

Error analysis

Use the confusion matrix to analyze the errors.

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx

Figure 13: Conf_mx

# compute the error rates 
row_sums = conf_mx.sum(axis=1, keepdims=True) 
norm_conf_mx = conf_mx / row_sums

np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()

Figure 14: Conf_mx_error_rate

We can see that classes 3, 5, 8, and 9 are relatively bright, which means they are the classes most easily confused.
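The normalization behind this plot can be sketched on a tiny made-up 3-class confusion matrix. `error_rates` is a hypothetical helper. Each row is divided by its total so that classes with more images do not look worse just because they are larger, and the diagonal is zeroed so that only the errors remain.

```python
def error_rates(conf_mx):
    """Divide each row by its sum and zero the diagonal (correct predictions)."""
    normalized = []
    for i, row in enumerate(conf_mx):
        total = sum(row)
        normalized.append(
            [0.0 if i == j else count / total for j, count in enumerate(row)]
        )
    return normalized

# Made-up counts: rows are actual classes, columns are predicted classes.
conf_mx = [
    [50, 2, 3],
    [4, 40, 6],
    [1, 9, 35],
]

rates = error_rates(conf_mx)

# Find the (actual, predicted) pair with the highest error rate.
size = len(rates)
worst = max(
    ((i, j) for i in range(size) for j in range(size)),
    key=lambda ij: rates[ij[0]][ij[1]],
)
print(rates)
print(worst)
```

In this toy matrix the brightest cell would be actual class 2 predicted as class 1.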

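Multilabel classification

Multilabel classification refers to a system that outputs multiple binary labels for each instance.

from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7) # large digits: 7, 8, or 9
y_train_odd = (y_train % 2 == 1) # odd digits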
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

Compute the F1 score for each label, then average them.

from sklearn.metrics import f1_score

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3, n_jobs=-1)
f1_score(y_multilabel, y_train_knn_pred, average='macro')
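What macro averaging does can be sketched by hand: compute F1 separately for each label column, then take the unweighted mean. The data below is invented, and `f1` and `macro_f1` are hypothetical helpers.

```python
def f1(y_true, y_pred):
    """F1 for one binary label, written as 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def macro_f1(Y_true, Y_pred):
    """Unweighted mean of the per-label F1 scores.
    Each row holds one instance's labels, e.g. (is_large, is_odd)."""
    n_labels = len(Y_true[0])
    per_label = [
        f1([row[k] for row in Y_true], [row[k] for row in Y_pred])
        for k in range(n_labels)
    ]
    return sum(per_label) / n_labels, per_label

# Toy multilabel targets and predictions: (large digit?, odd digit?).
Y_true = [(True, True), (True, False), (False, True), (False, False)]
Y_pred = [(True, True), (False, False), (False, True), (False, True)]

macro, per_label = macro_f1(Y_true, Y_pred)
print(per_label, macro)
```

Macro averaging gives every label the same weight regardless of how many positive instances it has.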

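Multioutput classification

Multioutput classification generalizes multilabel classification: each label can take more than two values. Here we build a system that removes noise from images. Its input is a noisy digit image and its output is a clean one. The classifier's output is multilabel (one label per pixel), and each label can take many values (pixel intensities from 0 to 255).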

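def plot_digit(data):
    # Show one flattened 784-pixel image as a 28x28 picture
    image = data.reshape(28, 28)
    plt.imshow(image, cmap=mpl.cm.binary, interpolation='nearest')
    plt.axis('off')

# Add random noise to the pixels; the clean images become the targets
noise1 = np.random.randint(0,100,(len(X_train), 784))
X_train_mod = X_train + noise1
noise2 = np.random.randint(0,100, (len(X_test), 784))
X_test_mod = X_test + noise2
y_train_mod = X_train
y_test_mod = X_test

some_index = 5500  # which test image to clean; an arbitrary choice, any valid index works
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[some_index]])
plot_digit(clean_digit)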

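Figure 15: Multioutput Classification

Summary

This chapter covered how to choose performance measures for classification tasks, how to balance precision against recall, how to compare classifiers, and how to design classifiers for different kinds of problems.

Code

I ran all the code from the book under Python 3 to make sure it is free of bugs, and added comments to make it easier to follow. The book's original datasets and code are on this website, and the code I ran is on my GitHub.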

Last edited: 2019.05.25 17:20:26
