1. What is the approximate depth of a decision tree trained (without restrictions) on a training set with one million instances?
The depth of a well-balanced binary tree containing m leaves is equal to log₂(m), rounded up. log₂ is the binary log; log₂(m) = log(m)/log(2). A binary Decision Tree (one that makes only binary decisions, as is the case with all trees in Scikit-Learn) will end up more or less well balanced at the end of training, with one leaf per training instance if it is trained without restrictions. Thus, if the training set contains one million instances, the Decision Tree will have a depth of log₂(1,000,000) ≈ 20 (actually a bit more since the tree will generally not be perfectly well balanced), as the sketch below illustrates.
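As a rough check, here is a minimal sketch (using a hypothetical synthetic dataset from make_classification; the exact depth you get will vary) that computes log₂ of the training set size and compares it to the depth of an unrestricted tree trained on a smaller set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Depth of a perfectly balanced binary tree with one million leaves
print(np.ceil(np.log2(1_000_000)))          # 20.0

# Sanity check on a smaller synthetic set: an unrestricted tree keeps
# splitting until its leaves are pure, so its depth ends up a bit above
# log2(m) because it is not perfectly balanced
X, y = make_classification(n_samples=10_000, n_features=10, random_state=42)
tree = DecisionTreeClassifier(random_state=42).fit(X, y)
print(np.log2(10_000), tree.get_depth())    # ≈13.3 vs. a somewhat larger depth
```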
2. Is a node’s Gini impurity generally lower or higher than its parent’s? Is it generally lower/higher, or always lower/higher?
A node's Gini impurity is generally lower than its parent's. This is due to the CART training algorithm's cost function, which splits each node in a way that minimizes the weighted sum of its children's Gini impurities. However, it is possible for a node to have a higher Gini impurity than its parent, as long as this increase is more than compensated for by a decrease in the other child's impurity. For example, consider a node containing four instances of class A and one of class B. Its Gini impurity is 1 – (1/5)² – (4/5)² = 0.32. Now suppose the dataset is one-dimensional and the instances are lined up in the following order: A, B, A, A, A. You can verify that the algorithm will split this node after the second instance, producing one child node with instances A, B, and the other child node with instances A, A, A. The first child node's Gini impurity is 1 – (1/2)² – (1/2)² = 0.5, which is higher than its parent's. This is compensated for by the fact that the other node is pure, so the overall weighted Gini impurity is 2/5 × 0.5 + 3/5 × 0 = 0.2, which is lower than the parent's Gini impurity. The short sketch below checks these numbers.
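The following minimal sketch (the gini() helper is just for illustration) verifies these impurity values:

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node, given the number of instances of each class."""
    p = np.array(counts) / np.sum(counts)
    return 1 - np.sum(p ** 2)

parent = gini([4, 1])     # 4 A's, 1 B  -> 0.32
left   = gini([1, 1])     # A, B        -> 0.5 (higher than the parent's)
right  = gini([3, 0])     # A, A, A     -> 0.0 (pure)
print(parent, left, right, 2/5 * left + 3/5 * right)   # 0.32 0.5 0.0 0.2
```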
3. If a decision tree is overfitting the training set, is it a good idea to try decreasing max_depth?
If a Decision Tree is overfitting the training set, it may be a good idea to decrease max_depth, since this will constrain the model, regularizing it. The sketch below illustrates the effect.
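For illustration, here is a minimal sketch on a hypothetical make_moons dataset; the exact scores depend on the data, but the unrestricted tree typically fits the training set almost perfectly while the max_depth-restricted one generalizes better:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1_000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Unrestricted tree vs. a tree regularized by max_depth
deep = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)

for name, model in (("unrestricted", deep), ("max_depth=4", shallow)):
    print(name, model.score(X_train, y_train), model.score(X_test, y_test))
```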
4. If a decision tree is underfitting the training set, is it a good idea to try scaling the input features?
Decision Trees don't care whether or not the training data is scaled or centered; that's one of the nice things about them. So if a Decision Tree underfits the training set, scaling the input features will just be a waste of time. The sketch below illustrates this invariance.
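As a quick illustration (a sketch on a hypothetical synthetic dataset; in rare tie-breaking cases the chosen splits could differ slightly), a tree trained on standardized features makes the same predictions as one trained on the raw features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)

raw = DecisionTreeClassifier(random_state=42).fit(X, y)
scaled = make_pipeline(StandardScaler(),
                       DecisionTreeClassifier(random_state=42)).fit(X, y)

# Splits only compare each feature to a threshold, so rescaling the
# features shifts the thresholds but leaves the resulting partition intact
print(np.array_equal(raw.predict(X), scaled.predict(X)))   # True (in general)
```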
5. If it takes one hour to train a decision tree on a training set containing one million instances, roughly how much time will it take to train another decision tree on a training set containing ten million instances? Hint: consider the CART algorithm’s computational complexity.
The computational complexity of training a Decision Tree is O(n × m log₂(m)). So if you multiply the training set size by 10, the training time will be multiplied by k = (n × 10m × log₂(10m)) / (n × m × log₂(m)) = 10 × log₂(10m) / log₂(m). If m = 1,000,000, then k ≈ 11.7, so you can expect the training time to be roughly 11.7 hours.
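You can reproduce this back-of-the-envelope calculation directly:

```python
import numpy as np

m = 1_000_000
k = 10 * np.log2(10 * m) / np.log2(m)
print(k)   # ≈ 11.7, so roughly 11.7 hours
```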
6. If it takes one hour to train a decision tree on a given training set, roughly how much time will it take if you double the number of features?
If the number of features doubles, then the training time will also roughly double, since the training complexity O(n × m log₂(m)) grows linearly with the number of features n (the CART algorithm evaluates every feature at every node it splits).
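A rough timing sketch (hypothetical synthetic data; the actual ratio depends on the data and hardware) is consistent with this:

```python
import time
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_time(n_features):
    """Time (in seconds) to train an unrestricted tree on a synthetic set."""
    X, y = make_classification(n_samples=20_000, n_features=n_features,
                               n_informative=n_features // 2, random_state=42)
    start = time.perf_counter()
    DecisionTreeClassifier(random_state=42).fit(X, y)
    return time.perf_counter() - start

print(fit_time(20), fit_time(40))   # the second fit takes roughly twice as long
```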