Decision Trees - Feature Selection for a Split
In the previous two articles, "Decision Trees - How to decide the split?" and "Decision Trees - Homogeneity Measures", I laid the foundations for what we will look at in this post.
There are many algorithms that are used to decide the best feature for splitting at a particular point in the decision tree build-up. To name a few:
CART (Classification and Regression Trees)
C4.5
C5.0
CHAID
ID3
Each of them has its own nuances and is suitable in different scenarios. However, CART is one of the most basic, and it is what we will look at today. CART is mainly used for binary splits, whereas CHAID handles multi-way splits.
CART is also the default implementation provided in many libraries, such as Python's scikit-learn.
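For example, here is a minimal sketch of fitting such a tree with scikit-learn, which implements an optimised version of CART and uses the Gini criterion by default (the bundled iris dataset stands in for real data):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# DecisionTreeClassifier implements an optimised version of CART;
# criterion="gini" is already the default, spelled out here only for clarity.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))  # accuracy on the training data
```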
CART Algorithm
CART calculates the impurity of a node before a split, and then after a split based on each candidate feature. It then looks at the delta, the decrease in impurity. The feature that reduces the impurity the most is selected as the "split feature" at that step.
The change in impurity is also sometimes called the purity gain or information gain.
To put it succinctly, the algorithm iteratively runs through these three steps:
Use the Gini index to calculate the pre-split and post-split impurity measures
Calculate the delta, i.e. the purity gain/information gain, for each candidate feature
Split on the feature with the maximum information gain (a short code sketch of these steps follows below).
This is repeated until we meet a stopping criterion for the decision tree's creation. The stopping criterion could be that the purity of all the leaf nodes is above an expected threshold; for example, if the purity is greater than 70%, you do not split further.
There are also other criteria that can be used to stop growing a decision tree.
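As a rough illustration of steps 1 and 2, here is a minimal Python sketch; the helper names are my own, not from any library:

```python
import numpy as np

def gini(labels):
    # Gini impurity of a node: 1 minus the sum of squared class probabilities.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(labels, left_mask):
    # Purity gain of a binary split: impurity before the split minus
    # the weighted sum of the impurities of the two child nodes.
    labels = np.asarray(labels)
    left_mask = np.asarray(left_mask)
    left, right = labels[left_mask], labels[~left_mask]
    w_left = len(left) / len(labels)
    return gini(labels) - (w_left * gini(left) + (1 - w_left) * gini(right))
```

In step 3, the tree-building loop would call gini_gain once per candidate split and keep the split with the largest value.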
Walkthrough with an Example Dataset
Suppose we have a dataset of a population of 100 people, some with diabetes and some without. We also know their gender and fasting sugar levels. Now we want to decide whether we should first split based on gender or based on fasting sugar levels.
Here is what the original data looks like based on gender and fasting sugar levels:
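Gender          Non-diabetic   Diabetic   Total
Male                 45           20        65
Female               20           15        35
Total                65           35       100

Fasting sugar   Non-diabetic   Diabetic   Total
< 120 mg/dl          60            5        65
> 120 mg/dl           5           30        35
Total                65           35       100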
Just reading through the data: there are 45 male and 20 female non-diabetics, and 20 male and 15 female diabetics. In total, that is a 65:35 male-to-female ratio and, likewise, a 65:35 non-diabetic-to-diabetic ratio.
Along similar lines, among the non-diabetics, 60 have fasting sugars < 120 mg/dl and 5 have > 120 mg/dl; among the diabetics, 5 are < 120 mg/dl and 30 are > 120 mg/dl. While the choice of the feature to split on may seem obvious from these two features, it may not be so when we have many features and large datasets.
Split based on Gender
Let us now try to split based on the feature "Gender". How would it look? The nodes before and after the split can be represented this way:
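Before the split: [100 people: 65 non-diabetic, 35 diabetic]
After the split on Gender:
  Male node:   [65 people: 45 non-diabetic, 20 diabetic]
  Female node: [35 people: 20 non-diabetic, 15 diabetic]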
Let us now calculate the Gini index before and after the split, to see the purity gain or information gain.
Gini Impurity before the split:
We first calculate the probability of a data point being that of a non-diabetic, and then the probability of a data point being that of a diabetic:
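P(non-diabetic) = 65/100 = 0.65
P(diabetic) = 35/100 = 0.35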
Using these probability values, the Gini impurity is calculated as:
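Gini(root) = 1 − (0.65² + 0.35²) = 1 − 0.545 = 0.455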
Gini Impurity after the split:
The probabilities of the non-diabetic and the diabetic classes in the male node turn out to be:
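P(non-diabetic | male) = 45/65 ≈ 0.692
P(diabetic | male) = 20/65 ≈ 0.308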
Therefore the Gini Impurity of the male node becomes:
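Gini(male) = 1 − (0.692² + 0.308²) ≈ 1 − 0.574 = 0.426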
Similarly, let us calculate the probabilities of the non-diabetic and the diabetic patients in the female node and, using these, the Gini impurity itself:
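P(non-diabetic | female) = 20/35 ≈ 0.571
P(diabetic | female) = 15/35 ≈ 0.429

Gini(female) = 1 − (0.571² + 0.429²) ≈ 1 − 0.510 = 0.490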
Overall Impurity after the split:
To get the overall impurity after the split, we need to take the weighted sum of the impurities measured on the two split nodes. How do we take the weighted sum?
We multiply the probability of being male by the Gini impurity of the male node, and similarly multiply the probability of being female by the Gini impurity of the female node, then add the two.
The probabilities of the genders are:
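P(male) = 65/100 = 0.65
P(female) = 35/100 = 0.35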
Therefore, the Gini impurity of the split nodes based on gender, taken as a weighted sum, is:
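Gini(split on gender) = 0.65 × 0.426 + 0.35 × 0.490 = 0.2769 + 0.1715 ≈ 0.4484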
Reduction in Impurity:
The final reduction in impurity, i.e. the information gain, is given by:
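Information gain (gender) = 0.455 − 0.4484 ≈ 0.0066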
Split Based on Fasting Sugars:
If we were to split the root node based on the fasting sugar values, here is what we would get:
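Before the split: [100 people: 65 non-diabetic, 35 diabetic]
After the split on fasting sugar:
  < 120 mg/dl node: [65 people: 60 non-diabetic, 5 diabetic]
  > 120 mg/dl node: [35 people: 5 non-diabetic, 30 diabetic]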
Based on these values, we want to calculate the Gini index of the root node (already done above) and the Gini index of the child nodes, to come up with the change in impurity levels.
The Gini impurity of the node with all the people with fasting sugars < 120 mg/dl turns out to be:
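P(non-diabetic | < 120) = 60/65 ≈ 0.923
P(diabetic | < 120) = 5/65 ≈ 0.077

Gini(< 120) = 1 − (0.923² + 0.077²) ≈ 1 − 0.858 = 0.142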
And the Gini impurity for the node with sugars > 120 mg/dl is:
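P(non-diabetic | > 120) = 5/35 ≈ 0.143
P(diabetic | > 120) = 30/35 ≈ 0.857

Gini(> 120) = 1 − (0.143² + 0.857²) ≈ 1 − 0.755 = 0.245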
Overall Impurity after the split:
We find the weights of the two child nodes and then take the weighted sum of their impurities:
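P(< 120) = 65/100 = 0.65
P(> 120) = 35/100 = 0.35

Gini(split on fasting sugar) = 0.65 × 0.142 + 0.35 × 0.245 = 0.0923 + 0.0858 ≈ 0.178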
Final Reduction in Impurity by splitting based on fasting sugars:
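Information gain (fasting sugar) = 0.455 − 0.178 = 0.277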
Conclusion:
From the above set of Gini delta calculations, you can see that the change in impurity for the split based on gender is only about 0.0066, while that for the split based on fasting sugars is about 0.277, over forty times as much. Hence the CART algorithm would choose to first split based on fasting sugars rather than on gender.
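A quick way to double-check these deltas is to recompute them straight from the class counts. Here is a small Python sketch (the gini helper is mine, not a library function):

```python
import numpy as np

def gini(counts):
    # Gini impurity from raw class counts: 1 minus the sum of squared proportions.
    p = np.asarray(counts) / sum(counts)
    return 1 - np.sum(p ** 2)

root = gini([65, 35])                                    # ≈ 0.455
gender = 0.65 * gini([45, 20]) + 0.35 * gini([20, 15])   # male and female nodes
sugar = 0.65 * gini([60, 5]) + 0.35 * gini([5, 30])      # < 120 and > 120 nodes
print(round(root - gender, 4))  # 0.0066
print(round(root - sugar, 4))   # 0.277
```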
There is another conclusion we can derive from these calculations: fasting sugar is a more important feature for predicting diabetes than gender. This is common knowledge in this example, but the same reasoning is extremely helpful for deciding feature importance in less obvious feature sets.
With this, we have looked at all the basic aspects of decision trees and we are ready for a code-based deep dive into implementing a decision tree.
Hope you gained some knowledge here.