The Simplest Guide to the Random Forest Algorithm

Ryan Rana
7 min read · Aug 10, 2021

In my previous article on Decision Trees, I covered everything about Decision Trees and how to build one with Python. The Random Forest algorithm is a successor to Decision Trees, as it is composed of many trees. I finished that article with the following paragraph:

It’s important to recall that the tree didn’t make any mistakes with the training data. Because we provided the tree the answers and didn’t limit the maximum depth, we anticipate this to be the case. A machine learning model’s goal is to generalize effectively to new data that it hasn’t seen previously.

What this means is that a single tree partitions the data very specifically and cannot handle new data that doesn’t fit neatly into those partitions. For example, the original data is shown on the left, and the image on the right shows the partitions the decision tree algorithm creates.

This is the best possible split (the tree is pruned, so there is a minimal number of branches), and even then new data could still get classified incorrectly. This is called overfitting.

Overfitting occurs when we have a very flexible model that memorizes the training data and fits it too closely. A flexible model is said to have high variance because the learned parameters vary considerably with the training data. On the other hand, an inflexible model is said to have high bias because it makes strong assumptions about the training data. An inflexible model may not even have the capacity to fit the training data, and in both cases the model fails to generalize to new data.

Although this isn’t an image of a decision tree, the concept still applies. An underfit model has low variance and high bias, while an overfit model has too much variance and is hyper-specific to the training data.

So neither a very flexible nor a very inflexible model will do the trick; there needs to be a balance between bias and variance to optimize the model.

A single decision tree is extremely flexible because it can keep splitting until every single data point has its own partition. If we limit this, we increase the bias. So, as an alternative to simply limiting the depth of one tree, we can combine many trees into a random forest.
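
To make this concrete, here is a small, hypothetical sketch (it is not part of the walkthrough below) that trains one unrestricted tree and one depth-limited tree on the same data and compares training versus testing accuracy; the unrestricted tree tends to score perfectly on the data it has seen but worse on data it hasn’t.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# an unrestricted tree (high variance) vs. a depth-limited tree (higher bias)
for name, tree in [("unlimited depth", DecisionTreeClassifier(random_state=0)),
                   ("max_depth=2", DecisionTreeClassifier(max_depth=2, random_state=0))]:
    tree.fit(X_tr, y_tr)
    print(name, "train:", tree.score(X_tr, y_tr), "test:", tree.score(X_te, y_te))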


Forests are a classic example of divide and conquer. The main principle behind this type of method is that a group of “weak learners” can come together to form a “strong learner.” A single tree is a weak learner, and a forest is a strong learner.

We now know why it is called a forest, but what about the random part? It is random for two reasons.

The first is the random choice of training points. Instead of building one decision tree on all the points, each tree is trained on a random subset of them, sampled with replacement. This is done over and over again, and the results are averaged out.

The second is the random choice of features used to predict the output. At each split, only a random subset of the features is considered. Again, this is done over and over, and the results are averaged out.

Random Forests combine hundreds or even thousands of decision trees, and averaging them balances variance and bias and cancels out much of the noise. We rely on numerous sources of information in real life, so not only is a decision tree natural, but the notion of mixing many of them in a random forest is as well.
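
As a rough illustration of those two kinds of randomness (this sketch is mine, not taken from the original sources), here is how one tree in the forest might pick its rows and its features using plain numpy:

import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features = 150, 4

# 1) random choice of training points: a bootstrap sample (rows drawn with replacement)
row_indices = rng.choice(n_rows, size=n_rows, replace=True)

# 2) random choice of features: each split only looks at a subset, e.g. sqrt(n_features) of them
feature_indices = rng.choice(n_features, size=int(np.sqrt(n_features)), replace=False)

print(row_indices[:10], feature_indices)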

So that's all there is to random forests at the end of the day, and now we can actually build one!

Building a Random Forest Classifier in Python

In order to understand how to implement a random forest model in Python, we’ll do a very simple example with the Iris flower classification data set. You can download it from this Kaggle dataset.
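
(If you would rather not download the CSV from Kaggle, the same flowers also ship with scikit-learn; the snippet below is an optional alternative, but note that its column names differ from the Kaggle file, and the rest of this article assumes the Kaggle CSV with its Id and Species columns.)

from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df_alt = iris.frame  # the measurements plus a numeric 'target' column
print(df_alt.head())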

Once you have it downloaded, open a new file in the same directory as the data and import pandas (a tool for handling data) into your program.

import pandas as pd

Then we can read the file with pandas,

df = pd.read_csv("Iris.csv")

Now we create two variables. The first one is X, for our input variables, which is everything except the output column (Species) and the Id column. The second is y, which is just the output column.

X = df.drop(['Species', 'Id'], axis=1)
y = df['Species']

Now we have to split the data into training and testing sets. If you don’t know the purpose of this step, you should read this article. We can split the data into a 75/25 training/testing ratio with scikit-learn. Scikit-learn is a tool for building machine learning models in the easiest way possible.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=66)

Now we can create the random forest model and fit it to our training data.

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

Then we can test how good the model is by making predictions on the testing data.

rfc_predict = rfc.predict(X_test)
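
As a quick sanity check (this line is an addition, not part of the original walkthrough), you can compare those predictions against the held-out labels:

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, rfc_predict))  # fraction of test flowers classified correctly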

We can also calculate predicted probabilities; here we take the probability for one of the classes.

rfc_probs = rfc.predict_proba(X_test)[:, 1]  # probability of the second class (column 1)

To check how good the model is, we can use the ROC curve. A receiver operating characteristic curve is a graph showing the performance of a classification model at all classification thresholds. AUC stands for “Area Under the ROC Curve,” and it measures the two-dimensional area under that curve, from (0,0) to (1,1). A score of 1 is the best, 0 is the worst, and random guessing gets about 0.5. (The ROC curve is defined for binary classification; see the note after the next snippet for what that means for the three-class Iris data.)

from sklearn.metrics import roc_auc_score 
roc_value = roc_auc_score(y_test, rfc_probs)
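
One caveat: roc_auc_score written this way expects a binary target, while the Iris data has three species. If the call above raises an error on your machine, a hedged workaround is the one-vs-rest variant, which takes the full probability matrix (and, similarly, scoring='roc_auc_ovr' in the cross-validation step below):

# one-vs-rest AUC over all three species (multiclass-friendly variant)
roc_value_ovr = roc_auc_score(y_test, rfc.predict_proba(X_test), multi_class='ovr')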

Cross-validation gives us a score for each fold of the data rather than for a single train/test split; we can score the model as follows (cross_val_score also comes from sklearn.model_selection).

from sklearn.model_selection import cross_val_score
rfc_cv_score = cross_val_score(rfc, X, y, cv=10, scoring='roc_auc')

Now, we’ll print out the results.

print(rfc_cv_score) #prints all of them
print(rfc_cv_score.mean()) #prints average of all of them

This is the output.

[0.77703704 0.74407407 0.77814815 0.67962963 0.74481481 0.83777778 0.83148148 0.88851852 0.77461538 0.87]
0.7926096866096867

This is a good result, but we can improve it by tuning certain parameters that change how the classifier works. We can do this with scikit-learn as well, using RandomizedSearchCV.

from sklearn.model_selection import RandomizedSearchCV

The three most influential hyperparameters are n_estimators (the number of trees in the forest), max_features (the number of features considered when looking for the best split), and max_depth (the maximum depth of each tree). There are more, but they would make the program take longer to run, so we only search over these by setting up the variables.

import numpy as np

n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(100, 500, num=11)]
max_depth.append(None)

These parameter lists then need to be put into a grid (a dictionary) so they are properly formatted for the search.

random_grid = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth
}

RandomizedSearchCV can then take in our classifier as its estimator, and the parameter distributions can simply be the random grid we just built.

rfc_random = RandomizedSearchCV(estimator = rfc, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)

We fit rfc_random on the training data and then print the best parameters it found with these lines:

rfc_random.fit(X_train, y_train)
print(rfc_random.best_params_)

My results were: n_estimators = 600, max_features = 'sqrt', max_depth = 300. Now we can plug these back into the model to see if it improves our performance.

rfc = RandomForestClassifier(n_estimators=600, max_depth=300, max_features='sqrt')
rfc.fit(X_train,y_train)
rfc_predict = rfc.predict(X_test)
rfc_cv_score = cross_val_score(rfc, X, y, cv=10, scoring='roc_auc')
print(rfc_cv_score)
print(rfc_cv_score.mean())

This is the output:

[0.79259259 0.82888889 0.83259259 0.73592593 0.81333333 0.85777778 0.86851852 0.91074074 0.79884615 0.85384615]
0.8293062678062677

We can see from the output that there was a slight improvement in the results. Our roc_auc score improved from .793 to .829.

Conclusion

You now know what a Random Forest is, how it works, why it's better than a single decision tree, and how to build one with Python. You also created a classifier for Iris flowers that you can post on GitHub and add to your portfolio. Good luck to you and all your coding endeavors!

Works Cited

“Build a Random Forest Algorithm with Python.” Enlight, enlight.nyc/projects/random-forest.

Huneycutt, Jake. “Implementing a Random Forest Classification Model in Python.” Medium, Medium, 21 May 2018, medium.com/@hjhuney/implementing-a-random-forest-classification-model-in-python-583891c99652.

Koehrsen, Will. “An Implementation and Explanation of the Random Forest in Python.” Medium, Towards Data Science, 31 Aug. 2018, towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76.

amandp13. “Random Forest Classifier Using Scikit-Learn.” GeeksforGeeks, 5 Sept. 2020, www.geeksforgeeks.org/random-forest-classifier-using-scikit-learn/.

“Sklearn.model_selection.RandomizedSearchCV.” Scikit-learn, scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html.
