Decision Tree

References
https://www.slideshare.net/jaseelashajahan/decision-trees-91553243
https://blog.clairvoyantsoft.com/entropy-information-gain-and-gini-index-the-crux-of-a-decision-tree-99d0cdc699f4
https://www.geeksforgeeks.org/decision-tree-introduction-example/
https://towardsai.net/p/programming/decision-trees-explained-with-a-practical-example-fe47872d3b53

[Figure: Decision Tree]

The decision tree algorithm is one of the most widely used methods for inductive inference. It approximates discrete-valued target functions, is robust to noisy data, and can learn complex patterns in the data.

The family of decision tree learning algorithms includes ID3, CART, ASSISTANT, and others. They are supervised learning algorithms used for both classification and regression tasks. They classify an instance by sorting it down the tree from the root to a leaf node, which provides the classification of the instance. Each internal node of the tree represents a test of one attribute of the instance, and each branch descending from that node corresponds to one of the possible values of that attribute. Classification of an instance therefore starts at the root node, tests the attribute at that node, and moves down the branch corresponding to the instance's value for that attribute. The process is then repeated for the subtree rooted at the new node.

The main idea of a decision tree is to identify the features that carry the most information about the target feature and then split the dataset along the values of these features so that the target values at the resulting nodes are as pure as possible. The feature that best reduces the uncertainty about the target feature is said to be the most informative feature. This search for the most informative feature continues until we end up with pure leaf nodes.

Building a decision tree means repeatedly asking a question of the data and splitting on the answer. When several features influence the target value of an instance, which feature should be chosen as the root node to start the splitting process, and in which order should the remaining features be chosen at every further split? This is where we need to measure the informativeness of the features and split on the feature with the most information. That informativeness is given by a measure called 'information gain', and to compute it we first need the entropy of the dataset.

What is Entropy

Entropy measures the impurity or randomness of a dataset. Imagine drawing a yellow ball from a box containing only yellow balls (say 100 of them). This box has an entropy of 0, which implies zero impurity, i.e. total purity.

Now suppose some of the balls are replaced with balls of other colors, say 30 by red and 20 by blue. If we draw a ball from the box, the probability of drawing a yellow ball drops from 1.0 to 0.5. Since the impurity has increased, the entropy has increased and the purity has decreased. Shannon's entropy model uses the base-2 logarithm, log2(P(x)), because as the probability P(x) of drawing a yellow ball approaches 1, log2(P(x)) approaches log2(1) = 0 and the outcome contributes no uncertainty, as shown in the graph below.


[Figure: log2(P(x)) plotted against the probability P(x)]

When a target feature can take more than one value (balls of different colors in a box), we sum the entropy contribution of each possible target value, weighted by the probability of drawing that value at random. This leads to the formal definition of Shannon's entropy, which serves as the baseline for the information gain calculation:
Entropy(x) = -\Sigma_k\,P(x=k)\log_2{P(x=k)}
where P(x=k) is the probability that the target feature takes the specific value k.

The logarithm of a fraction is negative, hence the "-" sign in the entropy formula, which turns these contributions positive. The maximum possible entropy is log2 of the number of classes:
2 classes: max entropy is 1
4 classes: max entropy is 2
8 classes: max entropy is 3
16 classes: max entropy is 4
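
To make the formula concrete, here is a minimal Python sketch; the `entropy` helper and the ball counts are illustrative choices matching the box example above, not code from the original sources.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

# A box of 100 yellow balls is pure: entropy 0 (may print as -0.0)
print(entropy(["yellow"] * 100))
# Replace 30 balls with red and 20 with blue: impurity and entropy rise
print(entropy(["yellow"] * 50 + ["red"] * 30 + ["blue"] * 20))  # ~1.49
# Two equally likely classes reach the two-class maximum of 1
print(entropy(["yes", "no"]))  # 1.0
```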

Finding the Information Gain

To find the feature that serves best as the root node in terms of information gain, we split the dataset along the values of each descriptive feature in turn and calculate the weighted entropy of the resulting subsets. This is the entropy that remains after splitting on that feature. We then subtract it from the entropy of the original dataset to see how much the split reduces the original entropy; this reduction is the information gain of the feature and is calculated as
infoGain(X_i) = Ent(Dataset) - Ent(Dataset \mid X_i)
where X_i is the feature being evaluated and Ent(Dataset \mid X_i) is the weighted entropy of the subsets obtained by splitting the dataset on the values of X_i (written out as a sum in the example below).

The feature with the largest information gain should be used as the root node to start building the decision tree.

The ID3 algorithm uses information gain to construct the decision tree.
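
A minimal sketch of this calculation in Python, assuming each row of the dataset is a dict that maps feature names to values (the `information_gain` name and the row layout are my own choices):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    # Same helper as in the entropy sketch above
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, feature, target):
    """Entropy of the whole dataset minus the weighted entropy of the
    subsets obtained by splitting the rows on `feature`."""
    base = entropy([row[target] for row in rows])
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[feature]].append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return base - remainder
```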

Gini Index

It is calculated by subtracting the sum of the squared class probabilities from one. It favors larger partitions and is easy to compute, whereas information gain favors smaller partitions with many distinct values.
Gini = 1 - \Sigma_k\big(P(x=k)\big)^2

A feature with a lower Gini index is chosen for a split.

The classic CART algorithm uses the Gini Index for constructing the decision tree.
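
A corresponding sketch for the Gini index, under the same assumptions (the helper name `gini` is illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: one minus the sum of squared class probabilities."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini(["yes"] * 9 + ["no"] * 5))  # ~0.459 for a 9-vs-5 split
print(gini(["yes", "no"]))             # 0.5, the two-class maximum
```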

Example

A loan status problem: predict from the table below whether a customer is granted a loan (the high/medium/low column is read here as income level).

| id | age   | income | is married | has estate | loan |
|----|-------|--------|------------|------------|------|
| 1  | >30   | high   | no         | yes        | no   |
| 2  | >30   | high   | no         | no         | no   |
| 3  | 20-30 | high   | no         | yes        | yes  |
| 4  | <20   | medium | no         | yes        | yes  |
| 5  | <20   | low    | no         | yes        | yes  |
| 6  | <20   | low    | yes        | no         | no   |
| 7  | 20-30 | low    | yes        | no         | yes  |
| 8  | >30   | medium | no         | yes        | no   |
| 9  | >30   | low    | yes        | yes        | yes  |
| 10 | <20   | medium | no         | yes        | yes  |
| 11 | >30   | medium | yes        | no         | yes  |
| 12 | 20-30 | medium | no         | no         | yes  |
| 13 | 20-30 | high   | yes        | yes        | yes  |
| 14 | <20   | medium | no         | no         | no   |
For markdown users who want to see how the flowchart was built, the following code uses the Mermaid extension available within Typora:
graph TD
    A[age]
    A---|age<20|B[has estate]
        B---|no|E((loan-no))
        B---|yes|F((loan-yes))
    A---|age between 20-30|C((loan-yes));
    A---|age>30|D(is married);
        D---|no|G((loan-no))
        D---|yes|H((loan-yes))
  • Find the Entropy of the total dataset
    Entropy = - \big({p\over{p+n}}log_2{p\over{p+n}}+{{n\over{p+n}}log_2{n\over{p+n}}}\big)

p - number of positive cases
n - number of negative cases

Entropy(dataset) = - \big({9\over14}log_2{9\over14}+{{5\over14}log_2{5\over14}}\big) = .940

  • Compute the entropy of each feature (column)
    Gain(Dataset, Feature)=Ent(Dataset)-\Sigma^n_{i=1}{|Dataset_i|\over|Dataset|}Ent(Dataset_i)
    Since the dataset can be split by age into {<20; 20-30; >30}, we compute the entropy of each of these three age groups and weigh them against the entropy of the entire dataset:
  1. ">30" Ent(D_{age>30}) = - {\Bigg({2\over5}*log_2{2\over5}+{3\over5}*log_2{3\over5}}\Bigg) = .971
  2. "20~30" Ent(D_{age(20-30)}) = - {\Bigg({4\over4}*log_2{2\over5}+0}\Bigg) = 0
  3. "<20" Ent(D_{age>30}) = - {\Bigg({3\over5}*log_2{3\over5}+{2\over5}*log_2{2\over5}}\Bigg) = .971
  • Find the information gain of a particular feature
    e.g. age, which will turn out to be the most informative feature in this case:
    Gain(dataset, age) = Entropy(dataset) - \Sigma^3_{i=1}{|D_{age,i}|\over|Dataset|}Ent(D_{age,i}) = 0.940 - {5\over14}*0.971 - {4\over14}*0 - {5\over14}*0.971 = 0.246

  • Repeat the same computation for the remaining features until the best way to split the dataset is found. In this case age has the largest information gain, i.e. it leaves the least entropy and yields the purest child nodes, so it is chosen for the split; the procedure is then repeated on each child node until the leaves are pure (the sketch below verifies these numbers).
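
The numbers above can be checked with a short script. This is a sketch that assumes the table is encoded as a list of dicts, with the high/medium/low column read as income, and repeats the helpers from the earlier sketches:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, feature, target):
    base = entropy([row[target] for row in rows])
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[feature]].append(row[target])
    return base - sum(len(s) / len(rows) * entropy(s) for s in subsets.values())

# The 14 rows of the loan table above
columns = ["age", "income", "married", "estate", "loan"]
data = [
    (">30", "high", "no", "yes", "no"),    (">30", "high", "no", "no", "no"),
    ("20-30", "high", "no", "yes", "yes"), ("<20", "medium", "no", "yes", "yes"),
    ("<20", "low", "no", "yes", "yes"),    ("<20", "low", "yes", "no", "no"),
    ("20-30", "low", "yes", "no", "yes"),  (">30", "medium", "no", "yes", "no"),
    (">30", "low", "yes", "yes", "yes"),   ("<20", "medium", "no", "yes", "yes"),
    (">30", "medium", "yes", "no", "yes"), ("20-30", "medium", "no", "no", "yes"),
    ("20-30", "high", "yes", "yes", "yes"), ("<20", "medium", "no", "no", "no"),
]
rows = [dict(zip(columns, values)) for values in data]

print(round(entropy([r["loan"] for r in rows]), 3))  # 0.94
for feature in ["age", "income", "married", "estate"]:
    print(feature, round(information_gain(rows, feature, "loan"), 3))
# age has the largest gain (~0.246), so it becomes the root node
```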

The computation process in simple words

  1. Compute the entropy of the total dataset
  2. Compute the weighted entropy of the subsets produced by splitting on each property
  3. Compute the gain for each property; the larger the gain (i.e. the less entropy remaining after the split), the better that property is to split the dataset on, as sketched below
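
Putting the three steps together, a minimal recursive ID3-style builder could look like the sketch below; it reuses `information_gain` and `rows` from the previous sketch and leaves out tie-breaking, pruning, and continuous attributes:

```python
from collections import Counter

def id3(rows, features, target):
    """Recursively build a decision tree as nested dicts.
    Assumes information_gain() and rows from the previous sketch are in scope."""
    labels = [row[target] for row in rows]
    # Pure node, or no features left: return the majority label as a leaf
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Split on the feature with the largest information gain
    best = max(features, key=lambda f: information_gain(rows, f, target))
    tree = {best: {}}
    for value in sorted(set(row[best] for row in rows)):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = id3(subset, [f for f in features if f != best], target)
    return tree

print(id3(rows, ["age", "income", "married", "estate"], "loan"))
# {'age': {'20-30': 'yes', '<20': {'estate': {'no': 'no', 'yes': 'yes'}},
#          '>30': {'married': {'no': 'no', 'yes': 'yes'}}}}
```

Running it on the loan table reproduces the flowchart above: the root splits on age, the under-20 branch on estate ownership, and the over-30 branch on marital status.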

Conclusion

Information is a measure of the reduction in uncertainty: it represents the expected amount of information that would be needed to place a new instance in a particular class. These informativeness measures form the basis of decision tree algorithms. Because information gain is built on entropy, its values can exceed one when there are more than two classes, whereas the Gini index is always capped at one.
