- Learn how to build a decision tree model with Weka
- This tutorial is perfect for newcomers to machine learning and decision trees, and for anyone who is not comfortable with coding
“The bigger the obstacle, the more glory in overcoming it.”
Machine learning can be intimidating for non-technical people. Every machine learning job seems to require a healthy understanding of Python (or R).

So how do non-programmers gain machine learning experience? It's no child's play!
Here's the good news: there are many tools that let us perform machine learning tasks without having to code. You can easily build algorithms like decision trees from scratch in a clean graphical interface. Isn't that the dream? These tools, like Weka, mainly help us with two things:
- Quickly build a machine learning model, such as a decision tree, and understand how the algorithm is performing. The model can then be refined based on those results
- It is ideal for showing a client or your leadership team what you are working with
This article will show you how to solve classification and regression problems using decision trees in Weka, without any prior programming knowledge!
But if you are passionate about getting your hands dirty with programming and machine learning, I suggest you take the following wonderfully selected courses:
Table of Contents
- Classification vs. Regression in Machine Learning
- Understanding Decision Trees
- Exploring the dataset in Weka
- Classification using the decision tree in Weka
- Decision tree parameters in Weka
- Viewing a Decision Tree in Weka
- Regression using the decision tree in Weka
Classification vs. Regression in Machine Learning
Let me first quickly summarize what classification and regression are in the context of machine learning. It is important to know these concepts before diving into decision trees.
A classification problem is about teaching your machine learning model how to categorize a data point into one of several classes. It does this by learning the characteristics of each class. For instance, to predict whether an image is of a cat or a dog, the model learns the characteristics of dogs and cats from the training data.

A regression problem is about teaching your machine learning model how to predict the future value of a continuous quantity. It does this by learning how different variables affected that quantity in the past. For instance, a model that attempts to predict the future price of a company's stock is solving a regression problem.
You can find both kinds of problems in abundance on our DataHack platform.
Now, let's learn about an algorithm that solves both problems: Decision trees!
Understanding Decision Trees
Decision trees are also known as Classification and Regression Trees (CART). They work by learning a hierarchy of yes/no questions whose answers lead to a decision. These questions form a tree-like structure, hence the name.
For instance, let's say we want to predict whether a person will order food or not. We can visualize the following decision tree for this:
Each node in the tree represents a question derived from the features present in your dataset. Your dataset is split based on these questions until the maximum depth of the tree is reached. The last node doesn't ask a question; it represents the class the value belongs to.
- The top node of the decision tree is called the Root node
- The lowest nodes are called Leaf nodes
- A node that splits into sub-nodes is called a Parent node. The sub-nodes are called Child nodes
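To make this structure concrete, here is a minimal sketch of the food-ordering tree above as plain Python. The two questions ("is the person hungry?", "do they have ingredients at home?") are hypothetical stand-ins for whatever splits a real tree would learn from data:

```python
# A hand-coded sketch of the food-ordering decision tree described above.
# The two questions are illustrative, not learned from any real dataset.
def will_order_food(hungry: bool, has_ingredients: bool) -> str:
    # Root node: the first question in the hierarchy
    if not hungry:
        return "no order"      # leaf node: not hungry, no order
    # Parent node: hungry people are split on a second question
    if has_ingredients:
        return "no order"      # leaf node: cooks at home instead
    return "order food"        # leaf node: hungry and nothing to cook
```

Every `if` is an internal node, and every `return` is a leaf that assigns a class.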
If you want to understand decision trees in detail, I suggest you check out the following resources:
What is Weka? Why should you use Weka for machine learning?
“Weka is free, open-source software with a range of built-in machine learning algorithms that you can access through a graphical user interface.”

WEKA stands for Waikato Environment for Knowledge Analysis and was developed at the University of Waikato, New Zealand.
Weka has multiple built-in functions for implementing a wide range of machine learning algorithms, from linear regression to neural networks. This allows you to run the most complex algorithms on your dataset at the click of a button! Not only that, Weka provides support for accessing some of the most popular machine learning library algorithms from Python and R!

With Weka you can preprocess data, classify it, cluster it, and even visualize it. It works with different data file formats such as ARFF, CSV, C4.5, and JSON. Weka even allows you to apply filters to your dataset through which you can normalize your data, standardize it, convert features between nominal and numeric values, and much more!
I could go on about the wonder that is Weka, but for the scope of this article, let's explore Weka in a practical way by creating a decision tree. Now go ahead and download Weka from its official website!
Exploring the dataset in Weka
I'll use the breast cancer dataset from the UCI machine learning repository. I recommend that you read about the problem before proceeding further.
Let's first load the dataset into Weka. To do that, follow the steps below:
- Open the Weka GUI
- Select the “Explorer” option.
- Select “Open file” and choose your dataset
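For reference, Weka's native format is ARFF, which declares each attribute before the data rows. The snippet below is only an illustrative sketch; the attribute names and values are simplified stand-ins, not the exact columns of the UCI breast cancer file:

```
@relation breast-cancer

@attribute age {'20-29','30-39','40-49'}
@attribute tumor-size numeric
@attribute class {'no-recurrence-events','recurrence-events'}

@data
'30-39',22,'no-recurrence-events'
'40-49',30,'recurrence-events'
```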
Your Weka window should now look like this:
You can see all the features in your dataset on the left side. Weka automatically creates plots for your features, which you will notice as you browse through them.

You can even view all the plots together if you click on the “Visualize All” button.
Now let's train our classification model!
Classification using the decision tree in Weka
Implementing a decision tree in Weka is quite simple. Just complete the following steps:
- Click on the “Classify” tab at the top
- Click on the “Choose” button
- In the drop-down list, select “trees”, which will open all the tree algorithms
- Finally, select the “REPTree” decision tree
“Reduced-error pruning tree (REPTree) is a fast decision tree learner that builds a decision/regression tree using information gain as the splitting criterion, and prunes it using reduced-error pruning.”

“The decision tree splits the nodes on all available variables and then selects the split that results in the most homogeneous sub-nodes.”

Information gain is used to measure the homogeneity of the sample at a split.
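As a rough illustration of that criterion (a from-scratch sketch, not Weka's actual implementation), information gain can be computed as the parent's entropy minus the weighted entropy of its sub-nodes:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, splits):
    """Entropy of the parent minus the weighted entropy of its sub-nodes."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
perfect = information_gain(parent, [["yes", "yes"], ["no", "no"]])  # 1.0
useless = information_gain(parent, [["yes", "no"], ["yes", "no"]])  # 0.0
```

A split that cleanly separates the classes recovers all of the parent's entropy, while a split that leaves each sub-node as mixed as the parent gains nothing.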
You can select your target feature from the drop-down menu just above the “Start” button. If you don't, Weka automatically selects the last feature as the target for you.
The “Percentage split” option specifies how much of the data you want to keep for training the classifier. The rest of the data is used during the testing phase to calculate the accuracy of the model.
With “Cross-validation folds” you can create multiple samples (or folds) from the training dataset. If you decide to create N folds, the model is trained and evaluated N times. Each time, one of the folds is held out for validation while the remaining N-1 folds are used to train the model. The results of all the folds are averaged to get the cross-validation result.
The more cross-validation folds you use, the more reliable your performance estimate becomes, since every data point is eventually used for both training and validation. The trade-off is a longer training time.
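A minimal sketch of what N-fold cross-validation does under the hood (Weka also shuffles and stratifies the data; this version just assigns rows round-robin):

```python
def cross_validation_folds(data, n_folds):
    """Yield (train, validation) pairs: each fold is held out exactly once."""
    folds = [data[i::n_folds] for i in range(n_folds)]  # round-robin assignment
    for i in range(n_folds):
        validation = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, validation

splits = list(cross_validation_folds(list(range(10)), 5))
```

With 10 rows and 5 folds, each iteration trains on 8 rows and validates on the remaining 2, and every row is used for validation exactly once.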
Finally, press the “Start” button to let the classifier do its magic!
Our classifier has an accuracy of 92.4%. Weka even prints the confusion matrix for you, which gives you additional metrics. You can study the confusion matrix and other metrics in detail here.
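To see how accuracy falls out of a confusion matrix, here is a sketch with made-up counts (not the actual numbers from this run): correct predictions sit on the diagonal, and accuracy is the diagonal sum divided by all instances.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified instances (diagonal) / total instances."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical 2x2 matrix: rows = actual class, columns = predicted class
cm = [[180, 10],
      [12,  84]]
acc = accuracy_from_confusion(cm)  # (180 + 84) / 286
```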
Decision tree parameters in Weka
Decision trees have many parameters. We can adjust them to improve the overall performance of our model. This is where practical knowledge of decision trees really plays a crucial role.
You can access these parameters by clicking on your decision tree algorithm at the top:
Let's talk briefly about the main parameters:
- maxDepth – determines the maximum depth of your decision tree. By default it is -1, which means the algorithm controls the depth automatically. But you can modify this value manually to get the best results on your data.
- noPruning – pruning means automatically collapsing leaf nodes that do not contain much information. It keeps the decision tree simple and easy to interpret; this flag turns pruning off.
- numFolds – the specified number of folds of data will be used for pruning the decision tree. The rest will be used for growing the rules.
- minNum – the minimum number of instances per leaf. If not set, the tree keeps splitting until every leaf node has only one associated class.
You can always experiment with different values for these parameters to get the best accuracy on your dataset.
Viewing your decision tree in Weka
Weka even allows you to easily visualize the decision tree built on your dataset:
- Go to the “Results list” section and right click on your trained algorithm
- Choose the “Visualize tree” option
Your decision tree will look like below:
Interpreting these values can be a bit intimidating, but it's actually pretty easy once you get the hang of it.
- The values on the lines joining the nodes represent the split criteria based on the values of the parent node's feature
- In the leaf node:
- The value before the parentheses denotes the class value
- The first value in the first set of parentheses is the total number of training-set instances at that leaf. The second value is the number of misclassified instances at that leaf
- The first value in the second set of brackets is the total number of pruning-set instances at that leaf. The second value is the number of misclassified instances at that leaf
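As a sketch of how that annotation breaks down (the leaf text below is a made-up example, not taken from this tree), the four counts can be unpacked like this:

```python
import re

def parse_reptree_leaf(label):
    """Split '<class> (train/errors) [prune/errors]' into its parts."""
    m = re.match(r"(\S+)\s*\(([\d.]+)/([\d.]+)\)\s*\[([\d.]+)/([\d.]+)\]", label)
    cls, train_total, train_err, prune_total, prune_err = m.groups()
    return {"class": cls,
            "train_total": float(train_total), "train_errors": float(train_err),
            "prune_total": float(prune_total), "prune_errors": float(prune_err)}

leaf = parse_reptree_leaf("recurrence-events (28/8) [14/5]")
```

Here the leaf predicts the class before the parentheses, covers 28 training instances (8 of them misclassified) and 14 pruning instances (5 misclassified).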
Regression using the decision tree in Weka
As I said before, decision trees are so versatile that they can work on both classification and regression problems. For this, I will use the “Predict the Number of Upvotes” problem from DataPeaker's DataHack platform.
Here, we need to predict the number of upvotes a question asked by a user will receive on a question-and-answer platform.
As usual, we will start by loading the data file. But this time, the data also contains an “ID” column for each user in the dataset. This would not be useful for prediction, so we will remove the column by selecting the “Remove” option below the column names:
We can make predictions on the dataset just as we did for the breast cancer problem. REPTree will automatically detect the regression problem:
The evaluation metric used in the hackathon is the RMSE score. We can see that the model has a very poor RMSE without any feature engineering. This is where you come in: go ahead, experiment, and improve the final model!
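For clarity, RMSE is the square root of the average squared difference between actual and predicted values. A small sketch with hypothetical upvote counts:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over paired actual/predicted values."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

# Hypothetical upvote counts vs. a model's predictions
score = rmse([10, 0, 52, 7], [12, 1, 40, 7])  # sqrt((4 + 1 + 144 + 0) / 4)
```

Because the errors are squared before averaging, a single large miss (like 52 vs. 40 above) dominates the score, which is why RMSE punishes models that are badly wrong on a few rows.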
And just like that, you have created a decision tree model without doing any programming! This will go a long way in your quest to master how machine learning models work.
If you want to learn and explore the programming side of machine learning, I suggest you go through these wonderfully curated courses on the Analytics Vidhya website: