Machine Learning: Decision Tree

Understanding Decision Trees

What are decision trees?

A decision tree is a powerful machine learning algorithm capable of performing both regression and classification tasks. Decision trees are also the fundamental building blocks of a range of ensemble models, such as Random Forest and XGBoost. In a classification tree, we predict that each observation belongs to the most commonly occurring class among the training observations in the region to which it belongs.

How do decision trees work?

We grow trees using recursive binary splitting. Starting at the root node, we calculate a metric, either entropy or gini, for every candidate feature and threshold and select the split with the lowest value. We then split the training set into two subnodes, each of which becomes either a leaf node or a decision node, and repeat the procedure for every decision node until all nodes are pure or a stopping criterion is met.
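To make the splitting step concrete, here is a minimal, illustrative sketch of how one split could be chosen using gini impurity. gini() and best_split() are illustrative helpers, not scikit-learn functions, and real implementations are far more optimized.

import numpy as np

def gini(labels):
    # gini impurity of a set of class labels: 1 - sum_k p_k^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # try every feature and every observed threshold, keep the split
    # with the lowest weighted impurity of the two child nodes
    best_feature, best_threshold, best_score = None, None, np.inf
    n_samples, n_features = X.shape
    for j in range(n_features):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if score < best_score:
                best_feature, best_threshold, best_score = j, t, score
    return best_feature, best_threshold, best_score

Growing a tree then amounts to applying best_split recursively to the two resulting subsets until the nodes are pure or a stopping rule such as max_depth is reached.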

Comparison between decision trees and linear models

Advantage:

  • Scale: Unlike linear models, which are greatly influenced by feature scales, decision trees are not sensitive to input feature scale, because a tree is a logical model built from step functions.
  • More flexible: Decision trees can capture not only simple linear relationships but also more complex non-linear relations, whereas linear models are limited to data showing an approximately linear relationship.
  • Decision trees are able to handle qualitative variables without creating dummy variables.
  • Decision trees are easy to explain to people because they closely mirror human decision-making.

Disadvantage:

  • Overfitting and instability: A small change in the data can cause a large change in the final estimated tree, and an unconstrained tree easily overfits the training set. The problem can be mitigated by regularizing the model parameters and by pruning.
  • Less accurate predictions: Decision trees approximate continuous variables with piecewise-constant splits, so their predictions are often less accurate than those of other regression and classification techniques.

Now I will explain classification trees and regression trees separately, using the iris dataset.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

warnings.filterwarnings("ignore")
warnings.simplefilter("ignore")

# load iris dataset
iris = load_iris()
# create dataframe for iris data
iris_df = pd.DataFrame(iris.data,columns=iris.feature_names)
display(iris_df.head())
[Output: the first five rows of the iris DataFrame.]
from sklearn import tree

# keep only petal length and petal width for building the decision tree
X = iris.data[:,2:]
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X,y)

fig = plt.figure(figsize=(15,10),dpi=100)
_ = tree.plot_tree(tree_clf,
                   feature_names=iris.feature_names[2:],
                   class_names=iris.target_names,
                   filled=True)
plt.show()

[Figure: the fitted classification tree of depth 2, plotted with tree.plot_tree.]

Explanation of the decision tree graph:

  • The samples attribute indicates how many training instances the node applies to. For example, samples = 150 at the root node means the root applies to all 150 training instances, and samples = 100 in its right child means that 100 training instances have a petal length greater than 2.45 cm.
  • The value attribute tells us how many training instances of each class the node applies to. For example, value = [50,50,50] means that 50 training observations belong to the first class, 50 to the second, and 50 to the third.
  • The gini attribute measures the impurity of the node; gini = 0 means the node is pure.

When looking at the graph of the decision tree, we may encounter a series of unfamiliar terms. Let me explain them:

Root node

The root node represents all training observations, and it is further divided into two subnodes based on either gini or entropy. In this case, the root node is the top text box, which asks the question "Is the petal length smaller than or equal to 2.45?" If it is, we move down to the left leaf node; otherwise, we move down to the right node, which is a decision node.

Leaf node

A leaf node is not split any further. In this case, the left leaf node contains samples of only one class, asks no question, and has a gini score of 0.

Decision node

A subnode is a decision node if it is further split into two subnodes. For example, the middle right text box is a decision node, which asks the question "Is the petal width smaller than or equal to 1.75?" If it is, we move down to the left green text box; otherwise, we move down to the right purple text box.

Evaluate Decision Trees

As we can see, growing a tree depends entirely on how we split, and the choice of split greatly impacts the accuracy of the tree. So how can we know which feature we should use for splitting? The answer depends on one of two criteria, the Gini index and entropy. I will explain them one by one.

Gini

Gini impurity, or the Gini index, measures the probability that a randomly selected observation would be misclassified if it were labeled according to the class distribution of its node.

Gini impurity (named after Italian mathematician Corrado Gini) is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. -- Wikipedia

Formula:

G_i = 1 - \sum_{k=1}^nP_{i,k}^2

  • P_{i,k} is the ratio of class k instances to training instances in the i^{th} node.

Taking the decision tree we just built as an example, let's try to calculate the gini score of the root node.

\because P_{i,k} = \frac{1}{3} \text{ for each of the three classes}

\therefore Gini_{root} = 1-\left(\frac{1}{3}\right)^2-\left(\frac{1}{3}\right)^2-\left(\frac{1}{3}\right)^2 \approx 0.667

Now we compute the total gini score of the tree, which is a weighted sum of the gini scores of all nodes.

G = \sum_{i=1}^nP_iG_i

  • P_i is the ratio of instances in one node to the training instances of its parent node.

G = 0.667 + \frac{1}{3}\times 0 + \frac{2}{3}\times 0.5 + \frac{54}{100}\times 0.168 + \frac{46}{100}\times0.043 = 1.111
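As a quick check, here is a small Python sketch of the same calculation. gini() is a hypothetical helper, and the class counts (50/50/50 at the root, 49 versus 5 and 1 versus 45 in the depth-2 leaves) are read off the tree plotted above.

import numpy as np

def gini(proportions):
    # gini impurity from a list of class proportions: 1 - sum_k p_k^2
    p = np.asarray(proportions)
    return 1.0 - np.sum(p ** 2)

g_root = gini([50/150, 50/150, 50/150])            # about 0.667

g_total = (g_root
           + (50/150) * 0.0                        # left leaf, pure
           + (100/150) * gini([50/100, 50/100])    # right decision node
           + (54/100) * gini([49/54, 5/54])        # green leaf
           + (46/100) * gini([1/46, 45/46]))       # purple leaf

print(round(g_root, 3), round(g_total, 3))         # roughly 0.667 and 1.11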

Entropy

What is entropy?

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. --Wikipedia

In fact, entropy measures the amount of information needed to communicate the outcome of an event. Given a random variable X with possible outcomes x_1,\dots,x_n, which occur with probabilities P(x_1),\dots,P(x_n), the entropy is defined as:

H(X) = -\sum_{i=1}^nP(x_i)\log_2{P(x_i)}

Now let me explain the definition given by Wikipedia. What do we mean by the amount of surprise? Suppose we toss a two-headed coin: we already know the result will be heads, so the event provides little information and, in other words, little surprise, because the result is entirely within our expectations. Hence, the more likely or deterministic an event is, the less information or surprise it contains, and thus the lower its entropy.

Let's try to calculate the entropy of flipping a fair coin: -(\frac{1}{2}\log_2{\frac{1}{2}}+\frac{1}{2}\log_2{\frac{1}{2}}) = 1 bit.
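Here is the same formula as a small numpy sketch; entropy() is an illustrative helper, not a library function.

import numpy as np

def entropy(probs):
    # H(X) = -sum_i p_i * log2(p_i); zero-probability outcomes contribute nothing
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([1.0]))       # certain outcome: 0.0 bits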

Let me introduce another example about machine learning. Look at the image:

[Figure: three candidate nodes containing colored dots; the third node contains dots of only one color.]

If we need to decide which node is easiest to classify, the answer must be the third one, because all of its dots have the same color and so less information is needed to describe it. In this case, its entropy is zero. Looking at the other two nodes, we need more information to describe them, so the classification outcome is more uncertain and their entropy is higher.

Conditional Entropy

Conditional entropy measures the amount of information still needed to communicate the outcome of Y once the value of X is known.

The conditional entropy of Y given X is defined as:

H(Y|X) = \sum_{x}P(x)H(Y|X=x) = -\sum_{x}P(x)\sum_{y}P(y|x)\log_2{P(y|x)}

Information Gain

Information gain measures the reduction in entropy after we split a node using a particular feature, so it is usually used to identify which feature should be used for splitting.

\text{Information Gain} = H(Y) - H(Y|X)
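For a single binary split, this is the parent node's entropy minus the weighted average of its two children's entropies, which is exactly how it is computed numerically below:

\text{Information Gain} = H(\text{parent}) - \left(\frac{n_{\text{left}}}{n_{\text{parent}}}H(\text{left}) + \frac{n_{\text{right}}}{n_{\text{parent}}}H(\text{right})\right)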

tree_clf = DecisionTreeClassifier(max_depth=2,criterion="entropy")
tree_clf.fit(X,y)

fig = plt.figure(figsize=(15,10),dpi=100)
_ = tree.plot_tree(tree_clf,
                   feature_names=iris.feature_names[2:],
                   class_names=iris.target_names,
                   filled=True)
plt.show()
[Figure: the classification tree of depth 2 trained with the entropy criterion.]
# information entropy of the root node
H0 = -1/3*np.log2(1/3)*3

# calculate the entropy for orange leaf node with samples of 50
H1 = -1*np.log2(1)

# entropy for the white decision node with the question "Is petal width smaller or equal to 1.75?"
H2 = -(1/2*np.log2(1/2)+1/2*np.log2(1/2))

# entropy for green leaf node
H3 = -(49/54*np.log2(49/54)+5/54*np.log2(5/54))

# entropy for the purple leaf node
H4 = -(1/46*np.log2(1/46)+45/46*np.log2(45/46))

print("Information Entropy:")
print(f"{round(H0,3)} bits")
print(f"{round(H1,3)} bits")
print(f"{round(H2,3)} bits")
print(f"{round(H3,3)} bits")
print(f"{round(H4,3)} bits")

>
Information Entropy:
1.585 bits
-0.0 bits
1.0 bits
0.445 bits
0.151 bits
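We can cross-check these hand calculations against the per-node impurities that scikit-learn stores on the fitted estimator. The sketch below reads them from tree_clf.tree_; note that nodes are listed by internal index, which need not match the reading order of the plot.

# cross-check: the fitted tree stores each node's impurity (entropy, since criterion="entropy")
t = tree_clf.tree_
for node in range(t.node_count):
    print(f"node {node}: samples={t.n_node_samples[node]}, entropy={t.impurity[node]:.3f}")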

Now I will explain how to calculate the conditional entropy. Let's take the split occurring at depth 1 as an example. Specifically, I will compute the conditional entropy for the decision node with the question "Is petal width smaller or equal to 1.75?"

# list proportions of each child node
p1 = 54/100
p2 = 46/100

h1 = 0.445
h2 = 0.151

# implement calculation using formula
cond_entropy = p1*h1+p2*h2
print(f"Conditional Entropy: {round(cond_entropy,3)} bits")

>Conditional Entropy: 0.31 bits

Based on the result, we can obtain the information gain of the split.

ig = H2 - cond_entropy
print(f"Information Gain: {round(ig,3)} bits")

>Information Gain: 0.69 bits

The result indicates that the split reduces the entropy by 0.69 bits using the petal width feature.

Hyperparameter Tuning

As I mentioned before, decision trees easily suffer from overfitting. We usually reduce the risk of overfitting through regularization, which means putting constraints on the growth of the tree. Of course, there are other approaches to mitigating the problem, such as bagging, boosting, and stacking, which I will talk about in upcoming posts.

Now let's look at what kind of hyperparameters are available for tuning:

max_depth

The maximum depth of the tree. If left at the default, the tree is expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. If we set the parameter to a very large number, the tree is very likely to overfit.

max_depth is closely tied to min_samples_split, which specifies the minimum number of samples a node must contain in order to be split. For now, let's just observe how our model changes as we vary max_depth.

import sys

sys.path.append("/Users/ray/Desktop/pythonProject")

# visualize_tree is a local helper that fits the tree and plots its decision regions on the data
from helper_01 import visualize_tree
from sklearn.model_selection import cross_val_score


fig,axes = plt.subplots(1,4,figsize=(16,3),dpi=100)
fig.subplots_adjust(left=0.02, right=0.98, wspace=0.1)

for axi, depth in zip(axes, range(1, 5)):
    model = DecisionTreeClassifier(max_depth=depth,criterion="gini")
    score = cross_val_score(model,X,y,cv=10,scoring="accuracy")
    visualize_tree(model, X, y, ax=axi)
    axi.set_title('depth = {0}, Accuracy= {1}'.format(depth,round(np.mean(score),2)))
    
plt.show()
[Figure: decision regions and cross-validated accuracy for max_depth = 1 to 4.]

As observed from the graph, greater depth can greatly improve the accuracy of our model but also entails a higher risk of overfitting. At a depth of 3, the model performs best, with little misclassification. But when the depth increases to 4, the accuracy begins to decrease slightly. In fact, a depth of 4 suggests overfitting, and such a model is likely to perform worse on new cases that were not part of the training set. My intuition is that a depth of 2 or 3 is good enough for this dataset.

min_samples_split

The parameter specifies the minimum number of samples required to split an internal decision node. It accepts either an integer or a float (see the sketch after this list).

  • If int, a decision node has to contain at least min_samples_split samples to be considered for a split.
  • If float, a decision node has to contain at least ceil(min_samples_split * n_samples) samples to be considered for a split.
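For example, assuming the petal-only X and y defined earlier (the values 10 and 0.05 are just illustrative):

from sklearn.tree import DecisionTreeClassifier

# int: a node needs at least 10 samples before it can be split
clf_int = DecisionTreeClassifier(min_samples_split=10).fit(X, y)

# float: a node needs at least ceil(0.05 * 150) = 8 samples before it can be split
clf_frac = DecisionTreeClassifier(min_samples_split=0.05).fit(X, y)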

min_samples_leaf

The parameter specifies the minimum number of samples required to be at a leaf node. Say we specify min_samples_leaf=2 and a candidate split would produce a leaf node with only one sample; that split is then not allowed. It also accepts an int or a float.

According to An empirical study on hyperparameter tuning of decision trees, the ideal min_samples_split values tend to be between 1 and 40 for the CART algorithm, and min_samples_leaf should be between 1 and 20. A simple grid search over these ranges is sketched below.
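As a rough sketch of searching over these ranges on the iris data (scikit-learn requires min_samples_split to be at least 2, so the grid starts there; the cv and scoring settings are just illustrative):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "min_samples_split": list(range(2, 41)),  # roughly the 1-40 range reported in the study
    "min_samples_leaf": list(range(1, 21)),   # roughly the 1-20 range reported in the study
}
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=10, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))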

max_features

The parameter specifies how many features are considered for each split. There are a number of options available (the short check after this list shows what each amounts to for the iris data):

  • int/float: we specify the number (int) or fraction (float) of features directly
  • sqrt: max_features = sqrt(n_features)
  • log2: max_features = log2(n_features)
  • None: max_features = n_features
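As a quick, illustrative check of what each option amounts to for the four iris features (scikit-learn effectively takes the integer part, with a minimum of one feature):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

n_features = 4                     # the full iris feature set
print(int(np.sqrt(n_features)))    # "sqrt" -> 2 features considered per split
print(int(np.log2(n_features)))    # "log2" -> 2 features considered per split
print(n_features)                  # None   -> all 4 features considered per split

# e.g. consider only log2(n_features) features when searching for each split
clf = DecisionTreeClassifier(max_features="log2").fit(iris.data, iris.target)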

Recall that each time the algorithm splits a node, it looks at a number of features, selects the one with the lowest gini or entropy, and then creates two subnodes based on that feature. So it is computationally expensive to consider all features at each split.

The rule of thumb is that we can set the parameter to log2 if the model is overfitting or the dataset is enormous. Depending on the result, we can then gradually increase it, for example by trying sqrt or a specific number.
