Introduction
In the previous article, How to split a decision tree: the quest to achieve pure nodes, we covered the basics of decision trees, such as splits, the ideal split, and pure nodes. In this article, we will look at one of the most popular algorithms for selecting the best split in a decision tree: Gini impurity.
Note: If you prefer to learn concepts in an audiovisual format, this full article is explained in the video below. If not, you can keep reading.
P.S.: If you have not read the previous article, you may have a hard time following this one.
So far, we have seen that the attribute “Class” is able to predict student behavior, that is, whether a student will play cricket or not, and that it works much better than the two remaining variables, “Height” and “Performance in class”. If you remember, we made a split on every available feature and then compared the splits to decide which one was best. This is how the decision tree algorithm works as well.
A decision tree first splits the nodes on all available variables and then selects the split that results in the most homogeneous sub-nodes.
Homogeneous here means having similar behavior with respect to the problem at hand. If the nodes are completely pure, each node will contain only a single class and will therefore be homogeneous. So you can intuitively see that the higher the purity of the nodes, the greater the homogeneity.
Gini impurity: a decision tree algorithm to select the best split
There are several algorithms that a decision tree uses to decide the best split for the problem. Let's first look at the most common and popular of them all: Gini impurity. It measures the impurity of the nodes and is calculated as:

Gini impurity = 1 – Gini
Let's first understand what Gini is, and then I will show you how to calculate the Gini impurity for a split and decide the best split. Let's say we have a node like this:
So, what Gini says is that if we pick two points from a population at random (the points highlighted in pink here), they must be of the same class. Now let's say we have a completely pure node:
Can you guess the probability that two points chosen at random from this node belong to the same class? Well, it will obviously be 1, since all the points here belong to the same class. So no matter which two points you pick, they will always belong to that class and, thus, the probability will always be 1 if the node is pure. And that is what we want to achieve with Gini.
Gini ranges from zero to one, since it is a probability, and the higher this value, the higher the purity of the node. And, of course, a smaller value implies less pure nodes.
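To make this intuition concrete, here is a minimal Python sketch (my addition, not code from the article) that estimates the “two random points, same class” probability by simulation, assuming the two points are drawn with replacement, which is the convention behind the Gini formula:

```python
import random

def same_class_probability(labels, trials=100_000):
    """Estimate the probability that two randomly drawn points share a class."""
    hits = sum(
        random.choice(labels) == random.choice(labels)  # two draws with replacement
        for _ in range(trials)
    )
    return hits / trials

mixed = ["cricket"] * 10 + ["no cricket"] * 10  # a 50/50 node
pure = ["cricket"] * 20                          # a completely pure node

print(same_class_probability(mixed))  # ~0.5, i.e. 0.5**2 + 0.5**2
print(same_class_probability(pure))   # 1.0, exactly as described above
```

For the 50/50 node the estimate hovers around 0.5, which matches the sum of squared class probabilities; for the pure node it is exactly 1.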
Gini impurity properties
Let's look at its properties before calculating the Gini impurity to decide the best split.
We decide the best split based on the Gini impurity and, as we discussed before, Gini impurity is:

Gini impurity = 1 – Gini
Here, Gini denotes the purity and, therefore, Gini impurity tells us about the impurity of the nodes. The lower the Gini impurity, the higher the purity and, thus, the more likely the nodes are to be homogeneous.
Gini works only in scenarios where we have a categorical target. It does not work with continuous targets.
This is a very important point to keep in mind. For instance, if you want to predict house prices or the number of bikes that will be rented, Gini is not the right criterion. Gini also performs only binary splits, either yes or no, success or failure, and so on, so it will split a node into exactly two sub-nodes. These are the properties of Gini impurity.
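If you want to see these properties in practice, here is a minimal sketch using scikit-learn (again my addition, not part of the original walkthrough). Its DecisionTreeClassifier uses Gini as its default splitting criterion and always builds binary splits; the iris dataset is used here simply because it ships with the library:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="gini" is already the default; spelled out here for clarity
clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
clf.fit(X, y)

# Each node stores the Gini impurity that was computed during splitting
print(clf.tree_.impurity)
```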
Steps to calculate the Gini impurity for a split
Let's now look at the steps to calculate the Gini impurity for a split. First, we calculate the Gini impurity for the sub-nodes, using the formula we have already discussed:
Gini impurity = 1 – Gini

Here, Gini is the sum of the squared probabilities of each class and is given as:

Gini = p₁² + p₂² + … + pₙ²

considering that there are n classes, where pᵢ is the probability of class i at that node.
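As a quick illustration (a sketch of my own, not code from the article), the node-level formula translates directly into a few lines of Python:

```python
def gini_impurity(class_counts):
    """Gini impurity of a node: 1 minus the sum of squared class probabilities."""
    total = sum(class_counts)
    if total == 0:
        return 0.0  # treat an empty node as pure by convention here
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

print(gini_impurity([20, 0]))             # 0.0 -> a completely pure node
print(round(gini_impurity([10, 10]), 2))  # 0.5 -> a 50/50 node
```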
Once we have calculated the Gini impurity of the sub-nodes, we calculate the Gini impurity of the split as the weighted average of the impurities of the sub-nodes, where each weight is determined by the number of samples at that node. Let's walk through these calculations with an example, which will help you understand this even better.
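The weighted step can be sketched the same way; this helper (again, my illustration) reuses the gini_impurity function from the previous snippet:

```python
def split_gini_impurity(groups):
    """Weighted Gini impurity of a split.

    `groups` holds the class counts of each sub-node; each sub-node's
    impurity is weighted by its share of the parent node's samples.
    """
    total = sum(sum(group) for group in groups)
    return sum((sum(group) / total) * gini_impurity(group) for group in groups)
```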
For the split on performance in class, do you remember that this was the split?
Split on performance in class
We have two categories: one is “Above average” and the other is “Below average”. When we focus on Above average, we have 14 students, of which 8 play cricket and 6 do not. The probability of playing cricket is 8 divided by 14, which is around 0.57, and similarly, the probability of not playing cricket is 6 divided by 14, which is around 0.43. For simplicity, I have rounded the numbers instead of carrying the exact values.
In the same way, when we look at Below average, we calculate all the numbers: of the remaining 6 students, 2 play and 4 do not, so the probability of playing is 2/6 ≈ 0.33 and of not playing is 4/6 ≈ 0.67.
Let's now calculate the Gini impurity of the Above average sub-node. Here is the calculation:
It will be one minus the sum of the squared probabilities of each category, which are 0.57 for playing cricket and 0.43 for not playing:

Gini impurity (Above average) = 1 – (0.57² + 0.43²) ≈ 0.49

The Below average node follows the same calculation:

Gini impurity (Below average) = 1 – (0.33² + 0.67²) ≈ 0.44

Just pause and verify these numbers for yourself.
Now, to calculate the Gini impurity of the split, we take the weighted Gini impurities of both nodes, Above average and Below average. In this case, the weight of a node is the number of samples at that node divided by the total number of samples at the parent node. So, for the Above average node, the weight is 14/20, since 14 of the 20 students in total performed above average.
And the weight of the Below average node is 6/20. The weighted Gini impurity is then the weight of each node multiplied by the Gini impurity of that node, summed over both nodes. For the split on performance in class, it comes out to be:

Weighted Gini impurity = (14/20) × 0.49 + (6/20) × 0.44 ≈ 0.48
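Using the two helpers sketched earlier, we can check this number with the class counts from the example (8 of 14 playing above average, 2 of 6 below):

```python
above_average = [8, 6]  # 8 play cricket, 6 do not
below_average = [2, 4]  # 2 play cricket, 4 do not

print(round(split_gini_impurity([above_average, below_average]), 2))  # ~0.48
```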
Similarly, we calculate the Gini impurity for the split on Class, which comes out to be around 0.32.
Now, if we compare the two Gini impurities for each split:
We see that the Gini impurity for the split on Class is lower. Hence, Class will be the first split of this decision tree.
Split on Class
Similarly, for each split, we calculate the Gini impurity, and the split that produces the minimum Gini impurity is selected as the split. This makes sense, because a lower Gini impurity value means the nodes will be purer and more homogeneous.
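To close the loop, here is one more illustrative sketch (reusing split_gini_impurity from above) of how that selection could look in code. The counts for the Class split are hypothetical, chosen only to be consistent with the result of roughly 0.32; the article does not give them explicitly:

```python
candidate_splits = {
    "Performance in class": [[8, 6], [2, 4]],  # counts from the worked example
    "Class": [[8, 2], [2, 8]],                 # hypothetical counts, impurity ~0.32
}

# Pick the split with the minimum weighted Gini impurity
best_split = min(candidate_splits,
                 key=lambda name: split_gini_impurity(candidate_splits[name]))
print(best_split)  # "Class"
```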
Final notes
In this article, we looked at one of the most popular splitting algorithms in decision trees: Gini impurity. It can only be used with categorical target variables. There are other algorithms used for splitting as well; if you would like me to cover them, let me know in the comments section.
If you are looking to start your data science journey and want every topic under one roof, your search ends here. Take a look at DataPeaker's certified AI and ML BlackBelt Plus Program.
If you have any questions, let me know in the comments section!