A simple analogy to explain decision trees versus random forests
Let's start with a thought experiment that illustrates the difference between a decision tree and a random forest model.
Suppose a bank has to decide on a small loan application and needs to make the decision quickly. It checks the customer's credit history and financial situation and discovers that a previous loan has not yet been repaid, so it rejects the application.
But here's the problem: the loan amount was tiny compared to the bank's huge coffers, and they could easily have approved it at very low risk. The bank therefore lost an opportunity to make some money.
A few days later, another loan application arrives, but this time the bank tries a different strategy: multiple decision-making processes. Sometimes it checks the credit history first; other times it starts with the customer's financial condition and the loan amount. The bank then combines the results of these separate processes and decides to grant the loan to the customer.
Even though this process took longer than the previous one, the bank benefited from it. This is a classic example of collective decision-making outperforming a single decision-making process. Now, here is my question for you: do you know what these two processes represent?
A decision tree and a random forest! We will explore this idea in detail, dig into the main differences between the two methods, and answer the key question: which machine learning algorithm should you use?
Table of Contents
- Brief introduction to decision trees
- An overview of random forests
- Clash of Random Forest and Decision Tree (in code!)
- Why did Random Forest outperform a decision tree?
- Decision tree versus random forest: When should you choose which algorithm?
Brief introduction to decision trees
A decision tree is a supervised machine learning algorithm that can be used for both classification and regression problems. A decision tree is simply a series of sequential decisions made to achieve a specific result. Here is an illustration of a decision tree in action (using our example above):
Let's understand how this tree works.
First, it checks whether the customer has a good credit history. Based on that, it splits customers into two groups: those with a good credit history and those with a bad one. Next, it checks the customer's income and again splits them into two groups. Finally, it checks the loan amount requested by the customer. Based on the outcomes of these three checks, the decision tree decides whether the customer's loan should be approved or not.
The features/attributes and conditions may change depending on the data and the complexity of the problem, but the overall idea remains the same: a decision tree makes a series of decisions based on a set of features/attributes present in the data, which in this case were credit history, income, and loan amount.
Now, you may be wondering:
Why did the decision tree check credit history first and not income?
This is known as feature importance, and the order in which the attributes are checked is decided on the basis of criteria such as Gini impurity or information gain. Explaining these concepts is beyond the scope of this article, but you can check out any of the resources below to learn all about decision trees:
Note: The idea behind this article is to compare decision trees and random forests. Therefore, I won't go into the details of the basics, but I will provide the relevant links in case you want to explore further.
An overview of Random Forest
The decision tree algorithm is quite easy to understand and interpret. But often, a single tree is not enough to produce effective results. This is where the random forest algorithm comes in.
Random forest is a tree-based machine learning algorithm that harnesses the power of multiple decision trees to make decisions. As the name suggests, it's a "forest" of trees!
But why do we call the forest "random"? Because it is a forest of randomly created decision trees: each node in each tree works on a random subset of features to compute its output. The random forest then combines the outputs of the individual decision trees to generate the final output.
In simple words:
The random forest algorithm combines the output of multiple decision trees (randomly created) to generate the final output.
This process of combining the outputs of multiple individual models (aka weak learners) is called ensemble learning. If you want to read more about how random forests and other ensemble learning algorithms work, see the following articles:
Now the question is: how do we choose between a decision tree and a random forest? Let's see both in action before jumping to conclusions!
Clash of Random Forest and Decision Tree (in code!)
In this section, we will use Python to solve a binary classification problem with both a decision tree and a random forest. We will then compare their results and see which one suits our problem best.
We will be working with the Loan Prediction dataset from DataPeaker's DataHack platform. This is a binary classification problem in which we have to determine whether a person should receive a loan based on a certain set of features.
Note: you can go to the DataHack platform, compete with other people in various online machine learning competitions, and get a chance to win exciting prizes.
Ready to code?
Step 1: load libraries and dataset
Let's start by importing the required Python libraries and our dataset:
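A minimal sketch of this step might look like the following (it assumes the dataset downloaded from the DataHack platform was saved locally as train.csv; adjust the path to wherever you stored it):

```python
import pandas as pd

# Load the Loan Prediction dataset (assumed file name: train.csv)
df = pd.read_csv('train.csv')

# Quick sanity check on dimensions and contents
print(df.shape)   # expected: (614, 13)
print(df.head())
```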
The dataset consists of 614 rows and 13 features, including credit history, marital status, loan amount, and gender. Here, the target variable is Loan_Status, which indicates whether a person should receive a loan or not.
Step 2: data preprocessing
Now comes the most crucial part of any data science project: data preprocessing and feature engineering. In this section, I will handle the categorical variables in the data and impute the missing values.
I will impute the missing values in the categorical variables with the mode and those in the continuous variables with the mean (of the respective columns). We will also label-encode the categorical values in the data. You can read this article to learn more about label encoding.
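Here is one way this preprocessing could look, assuming the identifier column is named Loan_ID as in the original dataset:

```python
from sklearn.preprocessing import LabelEncoder

# Drop the identifier column (Loan_ID in this dataset): it carries no signal
df = df.drop(columns=['Loan_ID'])

# Impute missing values: mode for categorical columns, mean for continuous ones
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].mean())

# Label-encode every remaining categorical column, including the target Loan_Status
for col in df.select_dtypes(include='object').columns:
    df[col] = LabelEncoder().fit_transform(df[col])
```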
Step 3: creating the train and test sets
Now, let's split the dataset into an 80:20 ratio for training and testing, respectively:
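A sketch of the split, with random_state fixed purely so the comparison is reproducible:

```python
from sklearn.model_selection import train_test_split

# Separate the features from the target variable
X = df.drop(columns=['Loan_Status'])
y = df['Loan_Status']

# 80:20 split, stratified so both sets keep the same class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```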
Let's take a look at the created train shape and test sets:
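Something like the following prints the shapes:

```python
# About 491 training rows and 123 test rows (80:20 of 614)
print('Train:', X_train.shape, y_train.shape)
print('Test: ', X_test.shape, y_test.shape)
```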
Excellent! Now we are ready for the next stage, where we will create the decision tree and random forest models!
Step 4: building and evaluating the models
Now that we have the training and test sets, it's time to train our models and classify the loan applications. First, we will train a decision tree on this dataset:
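A minimal decision tree setup with scikit-learn could look like this (the random_state value is an arbitrary choice):

```python
from sklearn.tree import DecisionTreeClassifier

# Fit a plain decision tree with default settings; only random_state is pinned
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
```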
Next, we will evaluate this model using the F1-score. The F1-score is the harmonic mean of precision and recall, given by the formula: F1 = 2 × (Precision × Recall) / (Precision + Recall).
You can learn more about this and other evaluation metrics here:
Let's evaluate the performance of our model using the F1 score:
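With scikit-learn's f1_score, the in-sample versus out-of-sample comparison might look like this (exact numbers will vary with your split and library version):

```python
from sklearn.metrics import f1_score

# Compare in-sample and out-of-sample F1 for the decision tree
print('Decision tree, train F1:', f1_score(y_train, dt.predict(X_train)))
print('Decision tree, test F1: ', f1_score(y_test, dt.predict(X_test)))
```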
Here, you can see that the decision tree works well in the in-sample evaluation, but its performance drops dramatically in the out-of-sample evaluation. Why do you think that is the case? Unfortunately, our decision tree model is overfitted to the training data. Will the random forest solve this problem?
Building a random forest model
Let's see a random forest model in action:
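A matching random forest sketch, evaluated the same way so the two models are directly comparable (n_estimators=100 is scikit-learn's default, shown explicitly here):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Fit a random forest with the same fixed random_state
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print('Random forest, train F1:', f1_score(y_train, rf.predict(X_train)))
print('Random forest, test F1: ', f1_score(y_test, rf.predict(X_test)))
```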
Here, we can clearly see that the random forest model performed much better than the decision tree in the out-of-sample evaluation. Let's discuss the reasons behind this in the next section.
Why did our random forest model outperform the decision tree?
Random forest harnesses the power of multiple decision trees. It does not depend on the feature importances given by a single decision tree. Let's take a look at the importance each algorithm assigns to the different features:
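One way to produce such a chart, assuming the dt and rf models from the sketches above, is to put both sets of feature_importances_ into a single DataFrame and plot it:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Put both models' feature importances side by side and plot them
importances = pd.DataFrame({
    'decision_tree': dt.feature_importances_,
    'random_forest': rf.feature_importances_,
}, index=X_train.columns)

importances.sort_values('random_forest').plot.barh(figsize=(8, 6))
plt.xlabel('Feature importance')
plt.tight_layout()
plt.show()
```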
As you can clearly see in the above chart, the decision tree model assigns great importance to a particular set of features. The random forest, however, picks features at random during the training process and therefore does not rely heavily on any specific set of features. This is what distinguishes the random forest from plain bagged trees. You can read more about the bagging tree classifier here.
The random forest can therefore generalize over the data better. This randomized feature selection makes the random forest much more accurate than a single decision tree.
Decision tree versus random forest: when should you choose which algorithm?
Random Forest is suitable for situations where we have a large data set and interpretability is not a major concern.
Decision trees are much easier to interpret and understand. Since a random forest combines several decision trees, it becomes more difficult to interpret. The good news: it is not impossible to interpret a random forest. Here is an article about interpreting the results of a random forest model:
Moreover, a random forest takes longer to train than a single decision tree. You must take this into account, because the total training time grows as we increase the number of trees in the forest. That can be crucial when working against a tight deadline on a machine learning project.
But I will say this: despite their instability and dependence on a particular set of features, decision trees are really useful because they are easier to interpret and faster to train. Even someone with very little knowledge of data science can use decision trees to make quick data-driven decisions.
Final notes
That is essentially all you need to know about the decision tree versus random forest debate. It can get tricky when you're new to machine learning, but this article has hopefully clarified the differences and similarities for you.
You can reach me with your queries and thoughts in the comment section below.