Notes on the metrics.classification_report function

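In machine learning and deep learning, we often use metrics.classification_report from the sklearn package to print evaluation metrics. This post works through examples to record what the function's common inputs and outputs mean.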

Example 1

>>> from sklearn.metrics import classification_report
>>> y_true = [0, 3, 2, 2, 1, 1, 4, 3, 2, 4, 1, 0, 0]
>>> y_pred = [0, 3, 1, 2, 1, 2, 4, 3, 2, 2, 1, 3, 0]
>>> print(classification_report(y_true, y_pred))
              precision    recall  f1-score   support
           0       1.00      0.67      0.80         3
           1       0.67      0.67      0.67         3
           2       0.50      0.67      0.57         3
           3       0.67      1.00      0.80         2
           4       1.00      0.50      0.67         2
    accuracy                           0.69        13
   macro avg       0.77      0.70      0.70        13
weighted avg       0.76      0.69      0.70        13

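Here:
accuracy is the fraction of samples predicted correctly, i.e. the number of correct predictions divided by the total number of samples: 9/13 = 0.69
macro avg is the macro average: the unweighted mean of each per-class metric, i.e.
      precision = (1.0+0.67+0.5+0.67+1.0)/5 = 0.77
      recall = (0.67+0.67+0.67+1.0+0.5)/5 = 0.70
      f1-score = (0.8+0.67+0.57+0.8+0.67)/5 = 0.70
weighted avg is the weighted average: each per-class metric is multiplied by that class's share of the samples (support / total) and the products are summed, i.e.
      precision = 1.0*3/13 + 0.67*3/13 + 0.5*3/13 + 0.67*2/13 + 1.0*2/13 = 0.76
      recall = 0.67*3/13 + 0.67*3/13 + 0.67*3/13 + 1.0*2/13 + 0.5*2/13 = 0.69
      f1-score = 0.8*3/13 + 0.67*3/13 + 0.57*3/13 + 0.8*2/13 + 0.67*2/13 = 0.70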
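
The hand calculations above can be checked with a short script. This is only a sketch: it uses precision_recall_fscore_support and accuracy_score from sklearn.metrics to get the per-class values, then averages them with NumPy. The variable names (macro, weighted) are my own.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 3, 2, 2, 1, 1, 4, 3, 2, 4, 1, 0, 0]
y_pred = [0, 3, 1, 2, 1, 2, 4, 3, 2, 2, 1, 3, 0]

# average=None returns one value per class, in sorted label order.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average=None
)

# Accuracy: correct predictions / all samples (9 / 13 here).
accuracy = accuracy_score(y_true, y_pred)

# Macro average: plain mean over the classes.
macro = [m.mean() for m in (precision, recall, f1)]

# Weighted average: mean over the classes, weighted by support.
weighted = [np.average(m, weights=support) for m in (precision, recall, f1)]

print(round(accuracy, 2))
print(np.round(macro, 2))
print(np.round(weighted, 2))
```

The rounded values should match the accuracy, macro avg and weighted avg rows of the report in Example 1.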

Example 2

>>> from sklearn.metrics import classification_report
>>> label = {0: '科技', 1: '体育', 2: '社会', 3: '娱乐', 4: '股票'}
>>> y_true = [0, 3, 2, 2, 1, 1, 4, 3, 2, 4, 1, 0, 0]
>>> y_pred = [0, 3, 1, 2, 1, 2, 4, 3, 2, 2, 1, 3, 0]
>>> print(classification_report(y_true, y_pred, target_names=['科技', '体育', '社会', '娱乐', '股票']))
              precision    recall  f1-score   support
          科技       1.00      0.67      0.80         3
          体育       0.67      0.67      0.67         3
          社会       0.50      0.67      0.57         3
          娱乐       0.67      1.00      0.80         2
          股票       1.00      0.50      0.67         2
    accuracy                           0.69        13
   macro avg       0.77      0.70      0.70        13
weighted avg       0.76      0.69      0.70        13

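The only difference between Example 1 and Example 2 is the target_names parameter in Example 2, which replaces the numeric class ids in the report with readable class names. (The label dict only documents the id-to-name mapping; the names themselves are passed as a list.)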
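
Instead of typing the names a second time, the list can be derived from the label dict. The sketch below is one possible way to do that, not something the original example does: it sorts the ids, passes them through the labels parameter, and uses output_dict=True so that a single row can be inspected. The names ids, names and report are my own.

```python
from sklearn.metrics import classification_report

label = {0: '科技', 1: '体育', 2: '社会', 3: '娱乐', 4: '股票'}
y_true = [0, 3, 2, 2, 1, 1, 4, 3, 2, 4, 1, 0, 0]
y_pred = [0, 3, 1, 2, 1, 2, 4, 3, 2, 2, 1, 3, 0]

# Sort the ids so that each name lines up with its own class id.
ids = sorted(label)
names = [label[i] for i in ids]

# output_dict=True returns a nested dict instead of a formatted string.
report = classification_report(
    y_true, y_pred, labels=ids, target_names=names, output_dict=True
)
print(report['科技'])
```

The printed row for '科技' (class 0) should carry the same precision, recall, f1-score and support as the first row of the report in Example 2.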

Example 3

>>> print(classification_report(y_true, y_pred, target_names=['体育', '社会', '娱乐', '股票', '科技']))
              precision    recall  f1-score   support
          体育       1.00      0.67      0.80         3
          社会       0.67      0.67      0.67         3
          娱乐       0.50      0.67      0.57         3
          股票       0.67      1.00      0.80         2
          科技       1.00      0.50      0.67         2
    accuracy                           0.69        13
   macro avg       0.77      0.70      0.70        13
weighted avg       0.76      0.69      0.70        13

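Comparing Example 2 and Example 3, the numbers in each row are unchanged and only the names differ. target_names is matched to the classes purely by position: its i-th element names the i-th class id in ascending order. The names must therefore be listed in the same order as the sorted ids, otherwise rows end up with the wrong name.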
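
To make the positional matching concrete, the sketch below passes the same shuffled name list twice: once on its own, and once together with a labels list in the matching order. It assumes the id-to-name mapping from Example 2 (0 is '科技', 1 is '体育', and so on); the variable names are my own.

```python
from sklearn.metrics import classification_report

y_true = [0, 3, 2, 2, 1, 1, 4, 3, 2, 4, 1, 0, 0]
y_pred = [0, 3, 1, 2, 1, 2, 4, 3, 2, 2, 1, 3, 0]
shuffled = ['体育', '社会', '娱乐', '股票', '科技']

# Names only: '体育' is attached to class 0 simply because it comes first.
by_position = classification_report(
    y_true, y_pred, target_names=shuffled, output_dict=True
)

# Names plus labels in the same order: '体育' is attached to class 1.
by_id = classification_report(
    y_true, y_pred, labels=[1, 2, 3, 4, 0], target_names=shuffled,
    output_dict=True,
)

print(by_position['体育']['precision'])  # precision of class 0
print(by_id['体育']['precision'])        # precision of class 1
```

Passing labels alongside target_names is one way to keep each name attached to the intended class id when the names are not listed in ascending id order.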

Example 4
We also often see classification_report output like the following

>>> import numpy as np
>>> from sklearn.metrics import classification_report
>>> y_true = np.array([[1, 0, 1, 0, 0],
...                    [0, 1, 0, 1, 1],
...                    [1, 1, 1, 0, 1]])
>>> y_pred = np.array([[1, 0, 0, 0, 1],
...                    [0, 1, 1, 1, 0],
...                    [1, 1, 1, 0, 0]])
>>> print(classification_report(y_true, y_pred, digits=3))
              precision    recall  f1-score   support

           0      1.000     1.000     1.000         2
           1      1.000     1.000     1.000         2
           2      0.500     0.500     0.500         2
           3      1.000     1.000     1.000         1
           4      0.000     0.000     0.000         2

   micro avg      0.750     0.667     0.706         9
   macro avg      0.700     0.700     0.700         9
weighted avg      0.667     0.667     0.667         9
 samples avg      0.722     0.639     0.675         9

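Comparing the classification_report outputs of Example 1 and Example 4: Example 1 reports an accuracy row plus two averages, whereas Example 4 reports four averages.
The reason is that Example 4 is a multilabel classification problem, while Example 1 is a single-label (multiclass) one.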
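
A quick way to see how sklearn itself classifies the two inputs is type_of_target from sklearn.utils.multiclass. This is only an illustration of the single-label versus multilabel distinction; the variable names single_label and multi_label are my own.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Example 1: one class id per sample.
single_label = [0, 3, 2, 2, 1, 1, 4, 3, 2, 4, 1, 0, 0]

# Example 4: one row per sample, one 0/1 column per label.
multi_label = np.array([[1, 0, 1, 0, 0],
                        [0, 1, 0, 1, 1],
                        [1, 1, 1, 0, 1]])

print(type_of_target(single_label))  # multiclass
print(type_of_target(multi_label))   # multilabel-indicator
```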

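For the four averages in Example 4, here is the official explanation, from the docstring of the average parameter (as found in, for example, precision_recall_fscore_support):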

average : {'binary', 'micro', 'macro', 'samples', 'weighted'}, default=None
    If None, the scores for each class are returned. Otherwise, this
    determines the type of averaging performed on the data:

    'binary':
        Only report results for the class specified by pos_label.
        This is applicable only if targets (y_{true,pred}) are binary.
    'micro':
        Calculate metrics globally by counting the total true positives,
        false negatives and false positives.
    'macro':
        Calculate metrics for each label, and find their unweighted
        mean. This does not take label imbalance into account.
    'weighted':
        Calculate metrics for each label, and find their average weighted
        by support (the number of true instances for each label). This
        alters 'macro' to account for label imbalance; it can result in an
        F-score that is not between precision and recall.
    'samples':
        Calculate metrics for each instance, and find their average (only
        meaningful for multilabel classification where this differs from
        :func:`accuracy_score`).

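Of these, micro avg, macro avg and weighted avg all operate on labels, whereas samples avg operates on instances.
In Example 4, a label is one column of the 0/1 matrix and an instance is one row, i.e. one of the three samples. The calculations of the four averages below show the difference.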

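Note: the labels 0, 1, 2, 3, 4 in the classification_report output correspond to the five columns of each sample; each column stands for one label. So label 0 refers to the 1s in the first column, label 1 to the 1s in the second column, and so on.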

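micro avg is the micro average: the correct predictions (true positives) are pooled over all labels and divided by the pooled number of predicted positives for precision, or of actual positives for recall, i.e.
      precision = (2+2+1+1+0) / 8 = 0.750
      recall = (2+2+1+1+0) / 9 = 0.667
      f1-score = 2*precision*recall/(precision+recall) = 0.706
      where 8 is the total number of 1s in y_pred and 9 is the total number of 1s in y_true
macro avg is the macro average: the unweighted mean of each per-label metric, i.e.
      precision = (1.0+1.0+0.5+1.0+0.0)/5 = 0.700
      recall = (1.0+1.0+0.5+1.0+0.0)/5 = 0.700
      f1-score = (1.0+1.0+0.5+1.0+0.0)/5 = 0.700
weighted avg is the weighted average: each per-label metric is multiplied by that label's share of the total support and the products are summed, i.e.
      precision = 1.0*2/9 + 1.0*2/9 + 0.5*2/9 + 1.0*1/9 + 0.0*2/9 = 0.667
      recall = 1.0*2/9 + 1.0*2/9 + 0.5*2/9 + 1.0*1/9 + 0.0*2/9 = 0.667
      f1-score = 1.0*2/9 + 1.0*2/9 + 0.5*2/9 + 1.0*1/9 + 0.0*2/9 = 0.667
samples avg is the per-sample average: the metric is computed for each sample (row) separately and then averaged over the samples, i.e.
      precision = (1/2 + 2/3 + 3/3) / 3 = 0.722
      where 1/2 is the precision for row 1 (1 of the 2 labels predicted as 1 is correct), 2/3 is the precision for row 2, and so on
      recall = (1/2 + 2/3 + 3/4) / 3 = 0.639
      f1-score = ((2*(1/2)*(1/2))/(1/2+1/2) + (2*(2/3)*(2/3))/(2/3+2/3) + (2*(3/3)*(3/4))/(3/3+3/4)) / 3 = 0.675
      that is, the f1-score is the mean of the f1 values of the three samples, each computed from that row's precision and recall.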
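
The micro and samples calculations can be reproduced with a few lines of NumPy and compared against sklearn. This is a sketch under the assumption that y_true and y_pred are 0/1 integer arrays as in Example 4; the helper names (tp, row_p, row_r, samples) are my own.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array([[1, 0, 1, 0, 0],
                   [0, 1, 0, 1, 1],
                   [1, 1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0, 1],
                   [0, 1, 1, 1, 0],
                   [1, 1, 1, 0, 0]])

# 1 wherever a label is both present and predicted (a true positive).
tp = y_true & y_pred

# micro: pool the counts over the whole matrix.
micro_p = tp.sum() / y_pred.sum()   # 6 / 8
micro_r = tp.sum() / y_true.sum()   # 6 / 9
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)

# samples: compute per row (axis=1), then average over the rows.
row_p = tp.sum(axis=1) / y_pred.sum(axis=1)
row_r = tp.sum(axis=1) / y_true.sum(axis=1)
row_f1 = 2 * row_p * row_r / (row_p + row_r)
samples = (row_p.mean(), row_r.mean(), row_f1.mean())

print(np.round([micro_p, micro_r, micro_f1], 3))
print(np.round(samples, 3))

# Cross-check all four averages against sklearn.
for avg in ('micro', 'macro', 'weighted', 'samples'):
    p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average=avg)
    print(avg, round(p, 3), round(r, 3), round(f, 3))
```

Summing over axis=0 instead of axis=1 would give the per-label counts that the macro and weighted averages are built from.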

That covers the common outputs of the classification_report function. I am recording it here so that I and others can look it up later.
