The Simplest Guide to Decision Trees
The Decision Tree Algorithm is one of the best machine learning models that exist, and fortunately, it is also very easy to build in python. The goal of this article is to not only understand how Decision Trees work but also how to create one of your own.
Before you start this tutorial, it is it would be wise to familiarize yourself with some of the terms used in machine learning so that this tutorial is easier to follow. If you don’t have any experience with machine learning you should read this article first.
With that said, let’s begin!
What is a Decision Tree?
A decision tree is essentially a series of yes or no questions that lead to output, here is a simple example to predict whether someone is hungry or not.
Decision Trees are the most logical and questioned-based approach to machine learning and while this may seem extremely simple, the technical part lies in how the questions(also called nodes) about the data are formed.
The node on top is the root node, the middle nodes are the internal nodes, and the bottom nodes are called leaf nodes. The lines connecting the nodes are called branches.
The goal of the decision is to form nodes containing the largest proportion of data points from a single group by finding values in the features that cleanly divide the data into different groups.
They use a layered splitting process, where at each layer they try to split the data into two or more groups so that data that fall into the same group are most similar to each other (homogeneity), and groups are as different as possible from each other (heterogeneity).
The best way to understand this is by coding a very simple decision tree. The problem we are going to start off with is a binary classification problem. Given two data points(between 1–3) it returns an output(either 1 or 0).
This is the data we will be working with, there are 6 data points (called samples) and there are two labels (called inputs).
There are a lot of different algorithms for decision tree splitting, here is a list of them and a simple explanation.
ID3: The Iterative Dichotomiser 3 performs the algorithm iteratively and dichotomizes features into the appropriate groups.
C4.5: Developed as a second level to ID3, and does the same thing just with less need for intense iteration.
CART: The classification and regression trees does not use an iterative method, but instead follow simple rules based on variable values which are used to get the best possible split based on the data.
Chi-square automatic interaction detection: Completes multi-level splits.
MARS: Multivariate adaptive regression spline extends decision trees for numerical data.
We will be using CART since it is the most common, and most efficient for small datasets like the one for this example.
import numpy as np
X = np.array([[2, 2], [2, 1],[2, 3], 1, 2], [1, 1],[3, 3]])
y = np.array([0, 1, 1, 1, 0, 1])
In decision trees, there is something called entropy, which measures the randomness/impurity of the data. For example, say there is a box of 3 apples, the impurity level would be 0. If there was a box of 1 banana, 1 orange, and 1 apple, the impurity level would be higher. The formula for entropy is as follows.
Essentially for every classification in a dataset the classified data points, are put as a fraction over all data points. The negative of the fraction is multiplied by the log base 1, of the fraction. For our dataset, the entropy calculation would look like this,
From here we can then import Sckit-Learn and fit a classifier in just three lines of code,
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=RSEED)
And that is it, from here you can print out the count of nodes by using the line,
and the depth (height of tree) with this line,
Libraries like Sci-kit learn to make machine learning implementation super simple, you barely have to do any programming for it to work, but it is still important to understand how decision trees work for harder problems.
To get a sense of how the decision tree “thinks”, we need to see the entire structure. We can use sklearn to plot out the actual tree for us to see using the following code,
from sklearn.tree import export_graphviz
export_graphviz(tree, 'tree.dot', rounded = True,
feature_names = ['x1', 'x2'],
class_names = ['0', '1'], filled = True)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=400']);
from IPython.display import Image
This code delivers the following image,
The first line of each node is the question based on where the data is, then depending on the answer it goes to another node. The ones at the bottom of the tree with no following branch don’t have a question.
The bottom line is the class, which is just the output and classification.
The Gini Impurity is the probability of the class being incorrect.
The samples line is the minimum number of data points that can be derived under that node
The value line is the number of samples for each classification.
The visual of the decision tree partitions a graph based on the lines created.
Everything is pretty self-explanatory except for the Gini index because as of now you don’t know how to calculate it.
Without Gini, you would not be able to make an accurate decision tree because this is the only practical way to calculate accuracy for this model, and that calculation is as follows.
In other words, the Gini Impurity of this node n is 1 minus the sum of the number of data points for a specific class over all the datapoints squared. So if there are 2 data points for class 0 and 4 data points for class 1 with a total of 6 classes all together the Gini calculation for the root node would look like this,
This formula is repeated in every node so that the nodes at the bottom are always 0, and the samples are 1. That way there is no chance that a point randomly selected from that node would be misclassified.
You now know how decision trees work, and how to build one with python. It’s important to recall that the tree didn’t make any mistakes with the training data. Because we provided the tree the answers and didn’t limit the maximum depth, we anticipate this to be the case. A machine learning model’s goal is to generalize effectively to new data that it hasn’t seen previously.
You could prevent this with pruning to lessen the number of branches so it's not as specific to the dataset. But still is pretty specific and not compatible with newer data. To avoid this we must use the concept of Random Forests, where instead of one tree we use a forest of trees, but that is a topic for another article.