Decision Tree

References
https://www.slideshare.net/jaseelashajahan/decision-trees-91553243
https://blog.clairvoyantsoft.com/entropy-information-gain-and-gini-index-the-crux-of-a-decision-tree-99d0cdc699f4
https://www.geeksforgeeks.org/decision-tree-introduction-example/
https://towardsai.net/p/programming/decision-trees-explained-with-a-practical-example-fe47872d3b53

[Figure: Decision Tree]

The decision tree algorithm is one of the most widely used methods for inductive inference. It approximates discrete-valued target functions, is robust to noisy data, and can learn complex patterns in the data.

The family of decision tree learning algorithms includes ID3, CART, ASSISTANT, and others. They are supervised learning algorithms used for both classification and regression tasks. They classify an instance by sorting it down the tree from the root to a leaf node, which provides the classification of the instance. Each internal node of the tree represents a test of one attribute of the instance, and each branch descending from that node corresponds to one of the possible values of that attribute. Classification of an instance therefore starts at the root node, tests the attribute at that node, and moves down the branch corresponding to the instance's value for that attribute. The process is then repeated for the subtree rooted at the new node.

The main idea of a decision tree is to identify the features that carry the most information about the target feature and then split the dataset along the values of these features so that the target values at the resulting nodes are as pure as possible. The feature that best reduces the uncertainty about the target feature is said to be the most informative feature. This search for the most informative feature continues until we end up with pure leaf nodes.

Building a decision tree means repeatedly asking a question of the data and splitting on the answer. When several features influence the target value of an instance, which feature should be chosen as the root node to start the splitting process, and in which order should the remaining features be chosen at every further split? This is where we need to measure the informativeness of the features and split on the feature with the most information. That informativeness is given by a measure called 'information gain', and to compute it we first need the entropy of the dataset.

What is Entropy

Entropy measures the impurity or randomness of a dataset. Imagine drawing a yellow ball from a box containing only yellow balls (say 100 of them). This box has an entropy of 0, which implies zero impurity, i.e. total purity.

Now suppose some of the balls are replaced with balls of other colors, say 30 by red and 20 by blue. If we draw a ball from the box, the probability of drawing a yellow ball drops from 1.0 to 0.5. Since the impurity has increased, the entropy has increased and the purity has decreased. Shannon's entropy model uses the base-2 logarithm, log2(P(x)), because as the probability P(x) of drawing a yellow ball approaches 1, log2(P(x)) approaches log2(1) = 0 and the outcome contributes no uncertainty, as shown in the graph below.


[Figure: log2(P(x)) plotted against the probability P(x)]

When a target feature can take more than one value (balls of different colors in a box), we sum the entropy contribution of each possible target value, weighted by the probability of drawing that value at random. This leads to the formal definition of Shannon's entropy, which serves as the baseline for the information gain calculation:
Entropy(x) = -\Sigma_k\,P(x=k)\log_2{P(x=k)}
where P(x=k) is the probability that the target feature takes the specific value k.

The logarithm of a fraction is negative, hence the "-" sign in the entropy formula, which turns these contributions positive. The maximum possible entropy is log2 of the number of classes:
2 classes: max entropy is 1
4 classes: max entropy is 2
8 classes: max entropy is 3
16 classes: max entropy is 4
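
To make the formula concrete, here is a minimal Python sketch; the `entropy` helper and the ball counts are illustrative choices matching the box example above, not code from the original sources.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

# A box of 100 yellow balls is pure: entropy 0 (may print as -0.0)
print(entropy(["yellow"] * 100))
# Replace 30 balls with red and 20 with blue: impurity and entropy rise
print(entropy(["yellow"] * 50 + ["red"] * 30 + ["blue"] * 20))  # ~1.49
# Two equally likely classes reach the two-class maximum of 1
print(entropy(["yes", "no"]))  # 1.0
```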

Finding the Information Gain

To find the feature that serves best as the root node in terms of information gain, we split the dataset along the values of each descriptive feature in turn and calculate the weighted entropy of the resulting subsets. This is the entropy that remains after splitting on that feature. We then subtract it from the entropy of the original dataset to see how much the split reduces the original entropy; this reduction is the information gain of the feature and is calculated as
infoGain(X_i) = Ent(Dataset) - Ent(Dataset \mid X_i)
where X_i is the feature being evaluated and Ent(Dataset \mid X_i) is the weighted entropy of the subsets obtained by splitting the dataset on the values of X_i (written out as a sum in the example below).

The feature with the largest information gain should be used as the root node to start building the decision tree.

The ID3 algorithm uses information gain to construct the decision tree.
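
A minimal sketch of this calculation in Python, assuming each row of the dataset is a dict that maps feature names to values (the `information_gain` name and the row layout are my own choices):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    # Same helper as in the entropy sketch above
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, feature, target):
    """Entropy of the whole dataset minus the weighted entropy of the
    subsets obtained by splitting the rows on `feature`."""
    base = entropy([row[target] for row in rows])
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[feature]].append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return base - remainder
```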

Gini Index

It is calculated by subtracting the sum of the squared class probabilities from one. It favors larger partitions and is easy to compute, whereas information gain favors smaller partitions with many distinct values.
Gini = 1 - \Sigma_k\big(P(x=k)\big)^2

A feature with a lower Gini index is chosen for a split.

The classic CART algorithm uses the Gini Index for constructing the decision tree.
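
A corresponding sketch for the Gini index, under the same assumptions (the helper name `gini` is illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: one minus the sum of squared class probabilities."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini(["yes"] * 9 + ["no"] * 5))  # ~0.459 for a 9-vs-5 split
print(gini(["yes", "no"]))             # 0.5, the two-class maximum
```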

Example

A loan status problem: predict from the table below whether a customer is granted a loan (the high/medium/low column is read here as income level).

| id | age   | income | is married | has estate | loan |
|----|-------|--------|------------|------------|------|
| 1  | >30   | high   | no         | yes        | no   |
| 2  | >30   | high   | no         | no         | no   |
| 3  | 20-30 | high   | no         | yes        | yes  |
| 4  | <20   | medium | no         | yes        | yes  |
| 5  | <20   | low    | no         | yes        | yes  |
| 6  | <20   | low    | yes        | no         | no   |
| 7  | 20-30 | low    | yes        | no         | yes  |
| 8  | >30   | medium | no         | yes        | no   |
| 9  | >30   | low    | yes        | yes        | yes  |
| 10 | <20   | medium | no         | yes        | yes  |
| 11 | >30   | medium | yes        | no         | yes  |
| 12 | 20-30 | medium | no         | no         | yes  |
| 13 | 20-30 | high   | yes        | yes        | yes  |
| 14 | <20   | medium | no         | no         | no   |
For markdown users who want to see how the flowchart was built, the following code uses the Mermaid extension available within Typora:
graph TD
    A[age]
    A---|age<20|B[has estate]
        B---|no|E((loan-no))
        B---|yes|F((loan-yes))
    A---|age between 20-30|C((loan-yes));
    A---|age>30|D(is married);
        D---|no|G((loan-no))
        D---|yes|H((loan-yes))
  • Find the Entropy of the total dataset
    Entropy = - \big({p\over{p+n}}log_2{p\over{p+n}}+{{n\over{p+n}}log_2{n\over{p+n}}}\big)

p - number of positive cases
n - number of negative cases

Entropy(dataset) = - \big({9\over14}log_2{9\over14}+{{5\over14}log_2{5\over14}}\big) = .940

  • Compute the entropy of each feature (column)
    Gain(Dataset, Feature)=Ent(Dataset)-\Sigma^n_{i=1}{|Dataset_i|\over|Dataset|}Ent(Dataset_i)
    Since the dataset can be split by age into {<20; 20-30; >30}, we compute the entropy of each of these three age groups and weigh them against the entropy of the entire dataset:
  1. ">30" Ent(D_{age>30}) = - {\Bigg({2\over5}*log_2{2\over5}+{3\over5}*log_2{3\over5}}\Bigg) = .971
  2. "20~30" Ent(D_{age(20-30)}) = - {\Bigg({4\over4}*log_2{2\over5}+0}\Bigg) = 0
  3. "<20" Ent(D_{age>30}) = - {\Bigg({3\over5}*log_2{3\over5}+{2\over5}*log_2{2\over5}}\Bigg) = .971
  • Find the information gain of a particular feature
    e.g. age, which will turn out to be the most informative feature in this case:
    Gain(dataset, age) = Entropy(dataset) - \Sigma^3_{i=1}{|D_{age,i}|\over|Dataset|}Ent(D_{age,i}) = 0.940 - {5\over14}*0.971 - {4\over14}*0 - {5\over14}*0.971 = 0.246

  • Repeat the same computation for the remaining features until the best way to split the dataset is found. In this case age has the largest information gain, i.e. it leaves the least entropy and yields the purest child nodes, so it is chosen for the split; the procedure is then repeated on each child node until the leaves are pure (the sketch below verifies these numbers).
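
The numbers above can be checked with a short script. This is a sketch that assumes the table is encoded as a list of dicts, with the high/medium/low column read as income, and repeats the helpers from the earlier sketches:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, feature, target):
    base = entropy([row[target] for row in rows])
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[feature]].append(row[target])
    return base - sum(len(s) / len(rows) * entropy(s) for s in subsets.values())

# The 14 rows of the loan table above
columns = ["age", "income", "married", "estate", "loan"]
data = [
    (">30", "high", "no", "yes", "no"),    (">30", "high", "no", "no", "no"),
    ("20-30", "high", "no", "yes", "yes"), ("<20", "medium", "no", "yes", "yes"),
    ("<20", "low", "no", "yes", "yes"),    ("<20", "low", "yes", "no", "no"),
    ("20-30", "low", "yes", "no", "yes"),  (">30", "medium", "no", "yes", "no"),
    (">30", "low", "yes", "yes", "yes"),   ("<20", "medium", "no", "yes", "yes"),
    (">30", "medium", "yes", "no", "yes"), ("20-30", "medium", "no", "no", "yes"),
    ("20-30", "high", "yes", "yes", "yes"), ("<20", "medium", "no", "no", "no"),
]
rows = [dict(zip(columns, values)) for values in data]

print(round(entropy([r["loan"] for r in rows]), 3))  # 0.94
for feature in ["age", "income", "married", "estate"]:
    print(feature, round(information_gain(rows, feature, "loan"), 3))
# age has the largest gain (~0.246), so it becomes the root node
```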

The computation process in simple words

  1. Compute the entropy of the total dataset
  2. Compute the weighted entropy of the subsets produced by splitting on each property
  3. Compute the gain for each property; the larger the gain (i.e. the less entropy remaining after the split), the better that property is to split the dataset on, as sketched below
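
Putting the three steps together, a minimal recursive ID3-style builder could look like the sketch below; it reuses `information_gain` and `rows` from the previous sketch and leaves out tie-breaking, pruning, and continuous attributes:

```python
from collections import Counter

def id3(rows, features, target):
    """Recursively build a decision tree as nested dicts.
    Assumes information_gain() and rows from the previous sketch are in scope."""
    labels = [row[target] for row in rows]
    # Pure node, or no features left: return the majority label as a leaf
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Split on the feature with the largest information gain
    best = max(features, key=lambda f: information_gain(rows, f, target))
    tree = {best: {}}
    for value in sorted(set(row[best] for row in rows)):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = id3(subset, [f for f in features if f != best], target)
    return tree

print(id3(rows, ["age", "income", "married", "estate"], "loan"))
# {'age': {'20-30': 'yes', '<20': {'estate': {'no': 'no', 'yes': 'yes'}},
#          '>30': {'married': {'no': 'no', 'yes': 'yes'}}}}
```

Running it on the loan table reproduces the flowchart above: the root splits on age, the under-20 branch on estate ownership, and the over-30 branch on marital status.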

Conclusion

Information is a measure of the reduction in uncertainty: it represents the expected amount of information that would be needed to place a new instance in a particular class. These informativeness measures form the basis of decision tree algorithms. Because information gain is built on entropy, its values can exceed one when there are more than two classes, whereas the Gini index is always capped at one.
