The Simplest Guide to Naive Bayes Classifiers

Ryan Rana
6 min read · Aug 15, 2021

The Naive Bayes algorithm is one of the most commonly used machine learning algorithms out there. The goal of this article is to not only teach you how Naive Bayes works but also how to build one with Python.

Classification

Before diving into Naive Bayes, we first need to understand the classification part. Classification refers to a type of predictive modeling where the model attempts to predict the output for a set of input data. The simplest example of this is spam filtering in your email inbox. Other examples include classifying images of objects, classifying different types of iris flowers, or classifying handwritten digits.

Classification of Handwritten Numbers

These are all projects that have been done before, and today we are going to create a classifier of our own. We are going to be classifying news articles as real or fake.

Bayes Theorem

Before we can start building our news classifier, we still have to look into what Bayes' Theorem actually is.

This is Bayes' Theorem in its standard form:

P(A|B) = P(B|A) × P(A) / P(B)

In simple terms, here is what is happening. Bayes' Theorem calculates the probability that A is true given event B based on the inverse probability: the probability of B given A. This is called conditional probability.

So essentially: given that B is true, what is the chance that A is also true? This is the simple theorem that Naive Bayes is built upon.
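For a quick worked example with made-up numbers: suppose 20% of all email is spam, the word "free" appears in 60% of spam, and "free" appears in 15% of all email overall. Then P(spam | "free") = (0.6 × 0.2) / 0.15 = 0.8, so under these numbers an email containing "free" has an 80% chance of being spam.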

Naive Bayes

Naive Bayes assumes that every feature is independent of the other features and that each one contributes equally to the output.

For example, let's say we are trying to solve a problem where we classify whether the weather is good enough to go running. The three made-up features are wind, humidity, and temperature.

This is a completely made-up dataset, and Naive Bayes assumes that no pair of features is dependent; for example, the temperature being ‘Hot’ has nothing to do with the humidity or the wind. Each feature is treated as equally important and is given the same weight as the others.
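To make the assumption concrete, here is a toy sketch in Python; every probability in it is invented purely for illustration:

```python
# Invented class priors for the made-up running example
priors = {"run": 0.6, "stay home": 0.4}

# Invented P(feature observed | class), one entry per feature
likelihoods = {
    "run":       {"windy": 0.3, "humid": 0.4, "hot": 0.2},
    "stay home": {"windy": 0.7, "humid": 0.8, "hot": 0.6},
}

def score(label, observed):
    # The "naive" step: multiply per-feature probabilities as if independent
    result = priors[label]
    for feature in observed:
        result *= likelihoods[label][feature]
    return result

# Score both classes for a windy, hot day and pick the larger
for label in priors:
    print(label, score(label, ["windy", "hot"]))
```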

In our news classifier, we assume that the words in an article have no correlation with each other and that every word is equally important.

Coding our Naive Bayes

Now that you understand what Naive Bayes is, it's time to code it for our fake news classifier.

Loading Data

To get started, let's get our dataset from Kaggle. You have to download True.csv and Fake.csv and put them in the same directory as your new file. To load the data into our code, we have to import pandas.

Using pandas, we can now read our CSVs into the code.
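A minimal sketch (the variable names here are my own):

```python
import pandas as pd

# Read both Kaggle CSVs into DataFrames
true_df = pd.read_csv("True.csv")
fake_df = pd.read_csv("Fake.csv")
```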

Cleaning Data

We have our data now, but we also have to clean it up a bit and combine it into one CSV. This will take several steps.

First, we need to convert the actual classifications from “real_news” and “fake_news” to 0 and 1 (“fake_news” also happens to be the name of the label column).
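A minimal sketch of that step, assuming a label column named fake_news with 0 for real articles and 1 for fake ones:

```python
# Label each DataFrame: 0 = real news, 1 = fake news
true_df["fake_news"] = 0
fake_df["fake_news"] = 1
```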

If you look at the True dataset you will notice that all of its articles have the location and the word “(Reuters)” at the top, while the Fake dataset doesn’t. This is going to hinder our model, because it will quickly learn to make predictions based only on the presence of “Reuters”, and when real articles don’t have “Reuters” written in them it will make an incorrect classification. We have to remove this from all our rows before training.
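A sketch of one way to do it, assuming every True article contains the “ - ” separator after its “CITY (Reuters)” prefix:

```python
# Split each article on the first hyphen: column 0 holds the
# "CITY (Reuters)" prefix, column 1 holds the article body
text = true_df["text"].str.split(" - ", n=1, expand=True)

# Keep only the body and write it back into the original column
true_df["text"] = text[1]
```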

This snippet takes the text column as a separate variable, extracts everything after the hyphen, and puts the cleaned text back into the original column.

Combining all Data

The data has been cleaned, and now it has to be combined to run our actual model.

We stack the two DataFrames on top of each other into a single one (axis=0), then remove all the columns that are not the text or the classification (axis=1), and lastly write everything out to one full CSV.
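A sketch of those three steps; the extra columns dropped here (title, subject, date) are what the Kaggle files typically contain, and the output filename is my own:

```python
# Stack the two DataFrames vertically
df = pd.concat([true_df, fake_df], axis=0)

# Keep only the text and the fake_news label
df = df.drop(["title", "subject", "date"], axis=1)

# Write the combined, cleaned dataset to one CSV
df.to_csv("combined_news.csv", index=False)
```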

Now we can load in the clean data and finally start the classifier.
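Reusing the filename from the sketch above:

```python
df = pd.read_csv("combined_news.csv")
```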

Train/Test Split

We now have to split the X and Y columns into separate variables.
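Using the column names from the sketches above:

```python
X = df["text"]       # the article text
y = df["fake_news"]  # the 0/1 label
```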

Scikit-learn is a library that automates much of the work of building machine-learning models. The first step with scikit-learn is to split our data into training and testing sets. If you don’t know why this is necessary, it's worth reading up on train/test splits before continuing.
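A standard split might look like this (the 80/20 ratio and the seed are my assumptions):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the articles for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```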

Build a Model

Our algorithm requires a vocabulary in order to classify news as fake or not. We’ll use CountVectorizer, which builds a vocabulary of every unique word in the training data and counts how often each word appears in each article. In other words, this feature extractor transforms the words into vectors that we can use in our model.
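A minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)  # learn the vocabulary from training text
X_test_vec = vectorizer.transform(X_test)        # reuse that same vocabulary for the test set
```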

We have finished creating our vectorizer from the training data (not the test data), so we can finally put the vectors into our model.
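A minimal sketch, assuming scikit-learn's MultinomialNB (the Naive Bayes variant designed for word-count features like ours):

```python
from sklearn.naive_bayes import MultinomialNB

# Fit Naive Bayes on the vectorized training articles
model = MultinomialNB()
model.fit(X_train_vec, y_train)
```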

We are finally done! We can test our model on our testing data, and then get an accuracy score based on the predictions.
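For example:

```python
from sklearn.metrics import accuracy_score

# Predict on the held-out test articles and score the predictions
predictions = model.predict(X_test_vec)
print(accuracy_score(y_test, predictions))
```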

Our result is approximately 0.93, i.e. about 93% accuracy. This is good, but it could be better. We can tune the hyperparameters of our model to slightly improve the outcome.

Tuning the Hyperparameters

The three most influential hyperparameters here are n_estimators, max_features, and max_depth. (Note that these belong to tree-based models such as a random forest rather than to Naive Bayes itself, whose main knob is the smoothing parameter alpha.) There are more, but every extra parameter makes the search take longer to run, so we only use these, setting up a grid of candidate values.
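A sketch of such a search, assuming scikit-learn's RandomizedSearchCV over a RandomForestClassifier; the candidate values below are my own guesses:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Assumed search space for the three hyperparameters named above
params = {
    "n_estimators": [200, 400, 600, 800],
    "max_features": ["sqrt", "log2"],
    "max_depth": [100, 200, 300, None],
}

# Randomly sample combinations and cross-validate each one
search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions=params,
    n_iter=10,
    cv=3,
    n_jobs=-1,
)
search.fit(X_train_vec, y_train)
print(search.best_params_)  # the winning combination
```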

This search prints out the best hyperparameters it finds. My results were: n_estimators = 600, max_features = ‘sqrt’, and max_depth = 300. Now we can plug these back into the model to see if they improve our performance.
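Plugging those values back in might look like this (again assuming the random-forest estimator from the sketch above):

```python
# Refit with the reported best hyperparameters
tuned_model = RandomForestClassifier(
    n_estimators=600, max_features="sqrt", max_depth=300
)
tuned_model.fit(X_train_vec, y_train)
print(accuracy_score(y_test, tuned_model.predict(X_test_vec)))
```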

Now when we run the code we get an improved result.

Using your model

You can test it out by putting some articles into our model to get a prediction. If you want real articles, you can look for content from the Washington Post, the NYT, or the Guardian; if you want fake articles, you can look at The Onion or The Hard Times. Simply copy and paste the article into your code, assign it to a variable (I’ll call mine “final”), and run the following code.
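A sketch of that last step, reusing the fitted vectorizer and model from earlier:

```python
final = "Paste the full text of an article here..."

# The model expects the same bag-of-words features it was trained on
final_vec = vectorizer.transform([final])
print(model.predict(final_vec))  # 0 = real news, 1 = fake news
```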

Conclusion

You now know how a Naive Bayes classifier works and have built a model to classify fake news. The next step is to build your own Naive Bayes classifiers and add them to your portfolio. All the code and data for this project can be found here. Good luck with all your coding endeavors!

