A decision tree is a classification algorithm that divides the data set into subsets based on the explanatory variables. The content presented in this section is mainly based on the Decision Tree YouTube tutorial playlist by Victor Lavrenko.
The high-level idea is to grow the tree, i.e., segment the population into buckets such that each bucket (leaf) contains only one value of the target variable. At each node, the data can only be split into two child nodes. When the explanatory variable is categorical with \(m\) categories, we try all the possible groupings (\(C^{m}_{1}, C^{m}_{2}, \dots, C^{m}_{m/2}\)) to find the one that minimizes the impurity (or, equivalently, has the highest information gain, as explained in the following paragraphs).
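As a rough illustration of this enumeration (the helper name `candidate_splits` and the example levels are made up for this sketch, not taken from the tutorial), a categorical variable with \(m\) levels has \(2^{m-1}-1\) distinct binary groupings, e.g. 7 groupings for \(m=4\):

```python
from itertools import combinations

def candidate_splits(levels):
    """Enumerate all ways to split a set of categorical levels into two
    non-empty groups (each unordered grouping is listed once)."""
    levels = list(levels)
    m = len(levels)
    splits = []
    seen = set()
    for k in range(1, m // 2 + 1):
        for left in combinations(levels, k):
            right = tuple(l for l in levels if l not in left)
            key = frozenset([frozenset(left), frozenset(right)])
            if key not in seen:  # skip the mirror image of a grouping already seen
                seen.add(key)
                splits.append((set(left), set(right)))
    return splits

print(candidate_splits(["red", "green", "blue", "yellow"]))  # 7 groupings
```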
Entropy measures the uncertainty of the outcome. If the outcome is certain (either all positive or all negative), the entropy is 0. The entropy reaches its maximum value of 1 when the probability of each outcome is 50-50.
\[ H(S) = - p_{(+)}log_2p_{(+)} - p_{(-)}log_2p_{(-)} \] An impure set (5 positive, 5 negative): \[ H(S) = -\frac{5}{10}log_2 \frac{5}{10} - \frac{5}{10}log_2 \frac{5}{10} = 1 \] A pure set (5 positive, 0 negative): \[ H(S) = -\frac{5}{5}log_2 \frac{5}{5} - \frac{0}{5}log_2 \frac{0}{5} = 0 \] (Note that \(log_2 \frac{0}{5}\) is mathematically undefined; here \(0log_20\) is taken as 0.)
\(\Rightarrow\) The smaller the entropy the better!
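A minimal Python sketch of the entropy calculation (the function name `entropy` is my own, not from the tutorial), handling the \(0log_20 = 0\) convention noted above:

```python
import math

def entropy(pos, neg):
    """Binary entropy H(S) of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:          # 0 * log2(0) is taken as 0
            h -= p * math.log2(p)
    return h

print(entropy(5, 5))  # impure set -> 1.0
print(entropy(5, 0))  # pure set   -> 0.0
```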
\[ Gain(S,A) = H(S) - \sum_{V} \frac{|S_V|}{|S|}H(S_V) \]
where \(V\) ranges over the possible values of \(A\), \(S\) is the set of examples \(X\), and \(S_V\) is the subset where \(X_A = V\).
e.g. a set \(S_A\) has 10 positive and 6 negative examples and can be divided into a subset \(S_B\) (8 positive, 4 negative) and a subset \(S_C\) (2 positive, 2 negative).
Then, the entropy before the split is \(H(S_A) = -\frac{10}{16}log_2 \frac{10}{16} - \frac{6}{16}log_2 \frac{6}{16} = 0.954434\)
and the information gain after the split is: \[ \begin{aligned} IG &= H(S_A) - \frac{12}{16}H(S_B) - \frac{4}{16}H(S_C) \\ &= 0.954434-\frac{12}{16}\times 0.9182958-\frac{4}{16}\times 1 \\ &= 0.01571215 \end{aligned} \] The larger the IG (information gain), the better!
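The worked example can be checked numerically. The sketch below reuses the `entropy` helper from the earlier snippet; `information_gain` is an illustrative name, not something defined in the tutorial:

```python
def information_gain(parent, children):
    """IG = H(parent) - weighted average of the children's entropies.
    `parent` and each child are (positive, negative) count pairs."""
    n_parent = sum(parent)
    ig = entropy(*parent)
    for child in children:
        ig -= sum(child) / n_parent * entropy(*child)
    return ig

# the worked example: S_A = (10, 6) split into S_B = (8, 4) and S_C = (2, 2)
print(information_gain((10, 6), [(8, 4), (2, 2)]))  # ~0.0157
```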
The algorithm will keep splitting the data until each node is a pure set. (This assumes the data is splittable, that is, there are no two data points with identical x variables but different y.) As a result, the tree can become too deep (with many layers and many nodes) and too specific to the training data.
Moreover, when a new data point comes in that is different from any of the historical data, the algorithm does not know how to classify it.
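To make the overfitting point concrete, here is an illustration using scikit-learn (which the tutorial does not mention): a fully grown tree typically scores perfectly on the training data, while a depth-capped tree is simpler and often generalizes at least as well:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic data, purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (None, 3):  # None = grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
```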
There is one drawback of the information gain: it tends to favor features that split the data into many tiny subsets. The most extreme case is a feature where each data point has a unique ID; the weighted entropy after splitting would then be \[ \sum \frac{|S_V|}{|S|}H(S_V) = \sum \frac{1}{n}\cdot 0 = 0 \] so the information gain is maximized. To penalize choosing a feature with a large number of distinct values, the information gain ratio (\(IGR\)) is preferred to the information gain (\(IG\)).
Information gain ratio: \[ IGR = \frac{IG}{IV} \] where \(IV\) is the intrinsic value: \[ IV(S,A) = -\sum_{V} \frac{|S_V|}{|S|}log_2\frac{|S_V|}{|S|} \]
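Continuing the same sketch (again with made-up function names, reusing `math` and `information_gain` from above), the intrinsic value and the gain ratio for the worked example come out as follows:

```python
def intrinsic_value(children, n_parent):
    """IV(S, A) = -sum(|S_V|/|S| * log2(|S_V|/|S|)) over the subsets created by A."""
    iv = 0.0
    for child in children:
        w = sum(child) / n_parent
        if w > 0:
            iv -= w * math.log2(w)
    return iv

def information_gain_ratio(parent, children):
    """IGR = IG / IV; penalizes attributes that split the data into many small subsets."""
    iv = intrinsic_value(children, sum(parent))
    return information_gain(parent, children) / iv

# same split as above: (10, 6) -> (8, 4) and (2, 2)
print(information_gain_ratio((10, 6), [(8, 4), (2, 2)]))  # ~0.019
```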
When the explanatory variable is continuous (numeric data), it has to be transformed into categorical data by setting some threshold.
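One common way to pick that threshold (assumed here; the tutorial only says a threshold is set) is to try candidate cut points between consecutive sorted values and keep the one with the highest information gain, reusing `information_gain` from above:

```python
def best_threshold(values, labels):
    """Try midpoints between consecutive distinct sorted values as thresholds
    and keep the one with the highest information gain (labels are 1/0)."""
    pairs = sorted(zip(values, labels))
    parent = (sum(labels), len(labels) - sum(labels))
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= thr]
        right = [y for x, y in pairs if x > thr]
        children = [(sum(left), len(left) - sum(left)),
                    (sum(right), len(right) - sum(right))]
        ig = information_gain(parent, children)
        if ig > best[1]:
            best = (thr, ig)
    return best

print(best_threshold([2.0, 3.5, 4.0, 6.0, 7.5], [0, 0, 1, 1, 1]))  # threshold 3.75
```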
In the case where the target variable is not binary but continuous, we can use a regression tree. Similar to the classification tree, we try to segment the population into two child nodes at each node. The predicted value of a child node is the average of the samples in that node. In this case, as opposed to finding the segmentation that minimizes the impurity (as in the classification problem), we try to find the segmentation that minimizes the sum of squared errors.
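A minimal sketch of this split criterion for a single numeric feature (function names are illustrative, not from the tutorial): each candidate threshold is scored by the total sum of squared errors of the two child nodes, where each child predicts the mean of its samples:

```python
def sse(ys):
    """Sum of squared errors around the mean (the leaf's prediction)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_regression_split(values, targets):
    """Pick the threshold whose two child nodes have the smallest total SSE."""
    pairs = sorted(zip(values, targets))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= thr]
        right = [y for x, y in pairs if x > thr]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (thr, total)
    return best

print(best_regression_split([1, 2, 3, 4, 5], [1.1, 0.9, 3.0, 3.2, 2.9]))  # splits at x = 2.5
```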