《机器学习（周志华）》学习笔记（四）

Q：决策树是什么？

决策树是模拟人类决策过程，将判断一件事情所要做的一系列决策的各种可能的集合，以数的形式展现出来，的一种树形图。

决策树的外观，取自网络

举个例子，我的舍友今天在想要不要去看电影，根据今天的天气，今天他口袋有没有钱，今天她女票有没有空这三个小决策来判断。如果今天天气不是晴天，就不去；如果是晴天——继续判断，今天口袋里没有钱，就不去；如果有钱，继续判断——如果今天女票没空，就不去；如果有钱，就去。

预测舍友行动

如果我掌握了这棵决策树，我就可以预测今天我舍友的行动（去不去看电影）。类似，我们可以用数据来训练出一颗决策树，来帮助我们进行各种预测工作。决策树是一种简单却又有用的学习算法。

Q：决策树的结构是怎样的？

决策树与普通树一样，由节点和边组成。树中每一个节点都是一个属性（特征），或者说是对特征的判断。根据一个节点的判断结果，决策（预测）流程走向不同的子节点，或者直接到达叶节点，即决策（预测）结束，得到结果。

Q：决策树是怎么训练出来的？

典型的决策树的训练过程如下，以根据色泽、根蒂、敲声预测一个西瓜是否好瓜为例——

西瓜数据集

输入一个数据集
生成一个节点node
如果数据集中的样本全部属于同一类，比如西瓜样本全部是“好瓜”，那么node就是叶子节点（好瓜）
如果样本中的数据集不属于同一类，比如西瓜中既有好瓜也有坏瓜，那就选择一个属性，把西瓜根据选好的属性分类。比如按照“纹理”属性，把西瓜分为清晰、模糊、稍糊三类。

一次分支尝试

把第4步得到的几个数据子集作为输入数据，分别执行上面的第1到5步、直到不再执行第4步为止（也就是叶子节点全部构建完成，算法结束）。

Q：整个决策树的训练算法很简洁，但是第4步的“选择一个属性，把西瓜根据选好的属性分类”，怎样来选择合适的属性呢？

假设我们现在只有两种属性选择，一种是“色泽”、另一种是“触感”。

若选择“色泽”，我们可以把西瓜分成三类——“浅白”全是坏瓜；“墨绿”全是好瓜；“青绿”85%是好瓜，15%是坏瓜。
选择“触感”，我们可以把西瓜分为两类——“硬滑”一半是好瓜，一半是坏瓜；“软粘”40%是好瓜、60%是坏瓜。

稍一思考，我们自然会选“色泽”作为本次的属性。因为“色泽”可以一下把很多好瓜和坏瓜区分开来，也就是说我们知道了一个西瓜的色泽后有很大几率正确判断它是好瓜还是坏瓜。而目前来说，知道“触感”却对我们的判断没什么帮助。

因此，我们选择“能给我们带来更多信息”的属性，或者说能够“减少混淆程度”的属性。

假设Pk是指当前集合中第k类样本占的比例。比如10个西瓜中纹理清晰、模糊、稍糊的各有3，3，4个，那么p1=3/10，p2=3/10，p3=4/10。于是根据信息的定义，我们有信息熵：

信息熵

衡量哪一个属性“能给我们带来更多信息”，我们用“信息增益”这个指标：

信息增益

信息增益越大，越能带来信息。因此每一次在上述算法第4步选择属性作为节点时，我们对待选属性都做一次信息增益的计算，选择信息增益最大的属性。

def entropy(branch:list):
    '''
    compute entropy of a branch of a tree split
    
    Parameters
    ----------
    branch: list
        number of all instances for each class in a branch
        
    Returns
    -------
    etp: float
        Entropy

    Examples
    --------
        >>> l_branch1 = [5, 5]; entropy(l_branch1)
        1.0
        >>> l_branch2 = [10, 8]; entropy(l_branch2) 
        0.9910760598382222
        >>> l_branch3 = [10, 0]; entropy(l_branch3)
        0
    '''
    etp = 0
    for datum in branch:
        if datum == 0:
            continue
        prob = datum / sum(branch)
        etp += prob * np.log2(prob)
    return (-1)*etp
    
def info_gain(parent:list, children:list):
    '''
    compute information gain of a tree split

    Parameters
    ----------
    parent: list
        number of all instances for each class in a node (parent node)
    children: list (of lists)
        number of instances for each class in each child node

    Returns
    -------
    ig: float
        information gain of a split

    Examples
    --------
        >>> parent = [10, 10]
        >>> children1 = [[5, 5], [5, 5]]; children2 = [[10, 8], [0, 2]]
        >>> info_gain(parent, children1)
        0.0
        >>> info_gain(parent, children2)
        0.10803154614559995
    '''
    return entropy(parent) - sum([(sum(i)/sum(parent))*entropy(i) for i in children])

除了信息增益以外，我们还可以选择其他指标作为分支标准。

信息增益对于可能取值多的属性有偏好。如果我们把训练样本中每个西瓜编号，然后把编号也作为待选择属性，那么编号肯定能带来最多的信息，最大程度降低混淆程度，所以编号这个属性肯定会被选中（若训练集有17个样本，则会产生17个分支，每个子节点都是只有一个样例的叶子节点）。

然并卵，因为这样训练出来的算法对于新的样本根本不起作用。所以我们不会选择编号作为属性。类似的，如果有些属性的可选项特别多，比如色泽现在有浅白、白、青绿、绿、墨绿五个可选项，那么色泽被选中的几率比其他属性要大。所以可以考虑用增益率代替信息增益：

增益率

def gain_ratio(parent:list, children:list):
    '''
    compute information gain ratio of a tree split

    Examples
    --------
        >>> parent = [10, 10]
        >>> children1 = [[5, 5], [5, 5]]; children2 = [[10, 8], [0, 2]]
        >>> gain_ratio(parent, children1)
        0.0
        >>> gain_ratio(parent, children2)
        0.2303466122545442
    '''
    num_each_attr_value = [sum(branch) for branch in children]
    IV = entropy(num_each_attr_value) # IV is essentially a type of entropy
    return info_gain(parent, children) / IV

基尼系数则是另一项可以考虑的指标：

基尼系数

基尼系数表示从一堆样本中随机抽取两个样本，这两个样本不同类的概率，也就刚是一个是好瓜一个是坏瓜的概率。这样，基尼系数越低，越意味着样本集中某一类比另一类样本多，也就是说，混淆程度越低。若选择某个属性后各个划分子集的基尼系数最低，那么就选择这个属性。

def gini(branch:list):
    '''
    compute Gini index of a branch of a tree split
    
    Examples
    --------
    >>> l_branch1 = [5, 5]; gini(l_branch1)
    0.5
    >>> l_branch2 = [10, 8]; gini(l_branch2)
    0.49382716049382713
    >>> l_branch3 = [10, 0]; gini(l_branch3)
    0.0
    '''
    return 1 - sum([(d/sum(branch))**2 for d in branch])

def gini_index(split):
    '''
    compute Gini index of a tree split
    
    Examples
    --------
    >>> spl1 = [[5, 5], [5, 5]]; gini_index(spl1)
    0.5
    >>> spl2 = [[10, 8], [0, 2]]; gini_index(spl2)
    0.4444444444444444
    >>> spl3 = [[10, 0], [0, 10]]; gini_index(spl3)
    0.0
    '''
    num_D = sum([sum(branch) for branch in split])
    return sum([sum(branch)/num_D * gini(branch) for branch in split])

Q：决策树如何防止过拟合？

决策树的过拟合风险很大。普通的决策树生成过程会不断将数据集划分，以生成更多分支，知道某一分支内部所有样本同属一个类为止。此时，生成的决策树在训练集上的准确率为100%。但此时的过拟合风险最大——把决策树想象成一套分类规则，每一条路径（tree path）就是一条决策链，一组AND规则（if *** AND if *** AND *** => Class1）。这么复杂的规则，既表明了决策树完全地学习到了训练数据地特征，也意味着，对于训练集以外地数据，决策树的规则显得复杂而挑剔，容易过拟合。

我们可以通过简化决策树的结构，裁剪掉一些不太重要的规则（分支），来防止过拟合，这被称为剪枝。剪枝分为预剪枝和后剪枝。

预剪枝过程就是在生成决策树时，用一些新的样本去测试刚刚训练好的节点，如果这个节点的存在并不能让决策树的泛化性能提高，也就是说分类精度或者其他衡量指标不再提高了，就把这个刚训练出来的节点舍弃。考察scikit-learn的决策树API，我们发现有一个max_depth的参数，亦即可以设置让决策树生成到一定深度后就不再分支了，由此可见，简单粗暴地设置一个最大深度，也是防止过拟合的手段之一。当然，这加大了欠拟合的风险。

后剪枝过程是用一些新的样本来测试这课刚刚训练出来的决策树，从下往上开始测试，如果某个节点被换成叶节点后整一棵决策树的泛化性能提高了，也就是说分类精度或者其他衡量指标提高了那么就直接替换掉。

Talk is cheap, show me the code!

下面这段代码只实现了最基本的离散变量决策树分类器，没有处理连续型变量的逻辑，也没有处理缺失值的逻辑，更没有剪枝的逻辑，还要求待测样本中所有的值都包含在训练集中。

"""
Simple Decision Tree Classifier

:file: supervised.py
:author: Richy Zhu
:email: rickyzhu@foxmail.com
"""
import numpy as np

class DTreeNode:
    def __init__(self, attr_id, X, y, children, label):
        self.attr_id = attr_id 
        self.X = X
        self.y = y
        self.children = children
        
        self.label = label # only make sense when node is a leaf
    
    def __str__(self):
        return '<attr_id: {0}, label: {1}, y: {2}>'.format(self.attr_id, self.label, self.y)
    
    def __repr__(self):
        return self.__str__()
    
    def print_all(self):
        print('attr_id:', self.attr_id)
        print('X:', self.X)
        print('y:', self.y)
        print('children:', self.children)
        print('label:', self.label)

class MyDecisionTreeClassifier:
    '''
    Warning: this model a product of code practice,
    it cannot handle continuous value, nor could it handle
    missing value. This tree requires value in test samples 
    all included in training samples.
    '''
    def __init__(self):
        self.X = None
        self.y = None
        self.tree = None
    
    def _gini(self, branch:list):
        '''compute Gini index of a branch of a tree split'''
        return 1 - sum([(d/sum(branch))**2 for d in branch])

    def _gini_index(self, split):
        '''compute Gini index of a tree split'''
        num_D = sum([sum(branch) for branch in split])
        return sum([sum(branch)/num_D * self._gini(branch) for branch in split])
    
    def _count_samples_in_split(self, split):
        '''
        count instances in each branches of a split
        
        Examples
        --------
        >>> spl = [{'Data': np.array([['乌黑', '蜷缩', '沉闷', '清晰', '凹陷', '硬滑'],
        ...          ['乌黑', '蜷缩', '浊响', '清晰', '凹陷', '硬滑'],
        ...          ['乌黑', '稍蜷', '浊响', '稍糊', '稍凹', '软粘'],
        ...          ['乌黑', '稍蜷', '浊响', '清晰', '稍凹', '硬滑'],
        ...          ['乌黑', '稍蜷', '沉闷', '稍糊', '稍凹', '硬滑'],
        ...          ['乌黑', '稍蜷', '浊响', '清晰', '稍凹', '软粘']]),
        ...   'labels': np.array(['好瓜', '好瓜', '好瓜', '好瓜', '坏瓜', '坏瓜'])},
        ...  {'Data': np.array([['浅白', '蜷缩', '浊响', '清晰', '凹陷', '硬滑'],
        ...          ['浅白', '硬挺', '清脆', '模糊', '平坦', '硬滑'],
        ...          ['浅白', '蜷缩', '浊响', '模糊', '平坦', '软粘'],
        ...          ['浅白', '稍蜷', '沉闷', '稍糊', '凹陷', '硬滑'],
        ...          ['浅白', '蜷缩', '浊响', '模糊', '平坦', '硬滑']]),
        ...   'labels': np.array(['好瓜', '坏瓜', '坏瓜', '坏瓜', '坏瓜'])}]
        >>> count_samples_in_split(split)
        [[4, 2], [1, 4]]
        '''
        split_of_numbers = []
        classes = set(self.y)
        for b in split:
            split_of_numbers.append([
                len(b['Data'][b['labels']==c]) for c in classes
            ])
        return split_of_numbers

    def _one_split(self, Data, labels, attr_id):
        '''split a a set of samples by biven attribute'''
        attr_vals = set(Data[:, attr_id])
        split = []
        for v in attr_vals:
            split.append({
                'Data': Data[Data[:, attr_id]==v], 
                'labels': labels[Data[:, attr_id]==v]
                })
        return split

    def _split(self, Data, labels):
        '''
        try all attributes to partition a set of samples 
        and find the best split with lowest impurity
        '''
        partition = []     # all possible splits
        gini_indices = []  # gini indices of each possible split

        for attr_id in range(Data.shape[1]):
            one_split = self._one_split(Data, labels, attr_id)
            partition.append(one_split)
            gini_indices.append(self._gini_index(
                self._count_samples_in_split(one_split)))

        attr_id = np.argmin(gini_indices)  # attribute that produce best split
        return attr_id, partition[attr_id]

    
    def _build_tree(self, Data, labels):
        '''recursively build a decision tree'''
        if len(set(labels)) == 1:
            # all instances belong to one class, make a leaf node
            return DTreeNode(None, Data, labels, None, labels[0])

        attr_id, split = self._split(Data, labels)
        children = {}
        for branch in split:
            attr_val = branch['Data'][:, attr_id][0]
            # build a sub-tree given a subset of data
            children[attr_val] = self._build_tree(branch['Data'], branch['labels'])

        return DTreeNode(attr_id, Data, labels, children, None)

    def _print_tree(self, node, depth):
        '''recursively preint the decision tree'''
        for child_key, child_val in node.children.items():
            print("|      " * depth, child_key , "+---", child_val)
            if child_val.children is not None:
                self._print_tree(child_val, depth+1)
        
    def print_tree(self):
        '''preint the decision tree structure'''
        if self.tree is None:
            raise Exception("Model hasn't been trained")
        print(self.tree)
        self._print_tree(self.tree, 0)

    def fit(self, X, y):
        '''
        Train a decision tree classifier model

        Parameters
        ----------
        X: ndarray of shape (m, n)
            sample data where row represent sample and column represent feature
        y: ndarray of shape (m,)
            labels of sample data

        Returns
        -------
        self
            trained model
        '''
        self.X = X,
        self.y = y
        self.tree = self._build_tree(X, y)
        return self
        
    def _predict(self, x):
        '''recursively traverse the tree and find and label for sample x'''
        node = self.tree
        while node.children is not None:
            node = node.children.get(x[node.attr_id])
        return node.label
    
    def predict(self, X):
        '''
        Make prediction by the trained model.

        Parameters
        ----------
        X: ndarray of shape (m, n)
            data to be predicted, the same shape as trainning data

        Returns
        -------
        C: ndarray of shape (m,)
            Predicted class label per sample.
        '''
        if self.tree is None:
            raise Exception("Model hasn't been trained")
        assert len(X.shape)==2, 'Input X must be a 2d array'
        results = []
        for x in X:
            results.append(self._predict(x))
        return np.array(results)

简单地试用一下

import numpy as np
from sklearn.model_selection import train_test_split

print('\nDecision Tree')
print('---------------------------------------------------------------------')
xigua2 = np.array([
        ['青绿', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', '好瓜'],
        ['乌黑', '蜷缩', '沉闷', '清晰', '凹陷', '硬滑', '好瓜'],
        ['乌黑', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', '好瓜'],
        ['青绿', '蜷缩', '沉闷', '清晰', '凹陷', '硬滑', '好瓜'],
        ['浅白', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', '好瓜'],
        ['青绿', '稍蜷', '浊响', '清晰', '稍凹', '软粘', '好瓜'],
        ['乌黑', '稍蜷', '浊响', '稍糊', '稍凹', '软粘', '好瓜'],
        ['乌黑', '稍蜷', '浊响', '清晰', '稍凹', '硬滑', '好瓜'],
        # ----------------------------------------------------
        ['乌黑', '稍蜷', '沉闷', '稍糊', '稍凹', '硬滑', '坏瓜'],
        ['青绿', '硬挺', '清脆', '清晰', '平坦', '软粘', '坏瓜'],
        ['浅白', '硬挺', '清脆', '模糊', '平坦', '硬滑', '坏瓜'],
        ['浅白', '蜷缩', '浊响', '模糊', '平坦', '软粘', '坏瓜'],
        ['青绿', '稍蜷', '浊响', '稍糊', '凹陷', '硬滑', '坏瓜'],
        ['浅白', '稍蜷', '沉闷', '稍糊', '凹陷', '硬滑', '坏瓜'],
        ['乌黑', '稍蜷', '浊响', '清晰', '稍凹', '软粘', '坏瓜'],
        ['浅白', '蜷缩', '浊响', '模糊', '平坦', '硬滑', '坏瓜'],
        ['青绿', '蜷缩', '沉闷', '稍糊', '稍凹', '硬滑', '坏瓜']
    ])
mytree = MyDecisionTreeClassifier()
X = xigua2[:, :-1]
y = xigua2[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
mytree.fit(X_train, y_train)
print('predicted:', mytree.predict(X_test), 'true:', y_test)
print('')
mytree.print_tree()

结果如下

$ python supervissed_examples.py
Decision Tree
---------------------------------------------------------------------
predicted: ['坏瓜' '坏瓜'] true: ['坏瓜' '坏瓜']

<attr_id: 1, label: None, y: ['坏瓜' '坏瓜' '好瓜' '坏瓜' '好瓜' '坏瓜' '好瓜' '好瓜' '好瓜' '坏瓜' '好瓜' '坏瓜' '坏瓜' '好瓜'
 '好瓜']>
 硬挺 +--- <attr_id: None, label: 坏瓜, y: ['坏瓜']>
 蜷缩 +--- <attr_id: 3, label: None, y: ['坏瓜' '坏瓜' '好瓜' '好瓜' '好瓜' '好瓜' '坏瓜' '好瓜']>
|       稍糊 +--- <attr_id: None, label: 坏瓜, y: ['坏瓜']>
|       模糊 +--- <attr_id: None, label: 坏瓜, y: ['坏瓜' '坏瓜']>
|       清晰 +--- <attr_id: None, label: 好瓜, y: ['好瓜' '好瓜' '好瓜' '好瓜' '好瓜']>
 稍蜷 +--- <attr_id: 4, label: None, y: ['坏瓜' '好瓜' '坏瓜' '好瓜' '坏瓜' '好瓜']>
|       凹陷 +--- <attr_id: None, label: 坏瓜, y: ['坏瓜']>
|       稍凹 +--- <attr_id: 3, label: None, y: ['坏瓜' '好瓜' '好瓜' '坏瓜' '好瓜']>
|      |       稍糊 +--- <attr_id: 2, label: None, y: ['坏瓜' '好瓜']>
|      |      |       浊响 +--- <attr_id: None, label: 好瓜, y: ['好瓜']>
|      |      |       沉闷 +--- <attr_id: None, label: 坏瓜, y: ['坏瓜']>
|      |       清晰 +--- <attr_id: 0, label: None, y: ['好瓜' '坏瓜' '好瓜']>
|      |      |       乌黑 +--- <attr_id: 5, label: None, y: ['坏瓜' '好瓜']>
|      |      |      |       软粘 +--- <attr_id: None, label: 坏瓜, y: ['坏瓜']>
|      |      |      |       硬滑 +--- <attr_id: None, label: 好瓜, y: ['好瓜']>
|      |      |       青绿 +--- <attr_id: None, label: 好瓜, y: ['好瓜']>

本作品首发于简书和博客园平台，采用知识共享署名 4.0 国际许可协议进行许可。