The Simplest Guide to Logistic Regression

Ryan Rana
7 min read · Aug 5, 2021

Logistic Regression is one of the most useful and easiest machine learning models to learn. In this article, we will cover not only how Logistic Regression (LR) works but also how to code one in Python.

Before you start this tutorial, it would be wise to read this Linear Regression Tutorial. It is a comprehensive guide to Linear Regression, and LR builds on many of the same concepts.

It would also be helpful to know the basics of Python Programming if you want to code an LR for yourself.

With that said, let’s begin!

What is Logistic Regression?

A simple visual for what LR looks like.

That image doesn’t make any sense right now but it will soon.

In the Machine Learning world, LR is a classification model, even though its name suggests a regression model. Classification means taking an input and deciding which category that input falls under. For example, if you feed an LR a picture of a dog, it will classify it as a dog.

This means that logistic regression models have a fixed number of parameters, determined by the number of input features (columns of data), and they output a categorical prediction, for example whether a piece of clothing is a pair of pants or a shirt.

The general concept of LR is very similar to Linear Regression (now is a good time to read the article above). In short, Linear Regression plots all the data onto a graph (of x and y), fits a best-fit line to the data, and then predicts the corresponding y for each input, like this,

A simple visual for what Linear Regression looks like.

Logistic Regression, on the other hand, fits the data to an S-curve, and there are only two possible outputs (two classifications), which are represented as the top and bottom lines.

This is called binary classification: the model can only distinguish between two categories, for example whether a sample of hair belongs to a male or a female. The two classes sit on the y-axis as 0 and 1, which means our data has two kinds of observations. In more involved problems, many more features would feed into the classification.

The S-curve is formally called the sigmoid function (the name itself means S-shaped). The sigmoid fits our goal of classifying samples into two categories very well, because it squashes any input into a value between 0 and 1. The formula for the sigmoid, where z is its input, is

sigmoid(z) = 1 / (1 + e^(-z))

In plain English, the sigmoid turns the weighted sum of the input features into a probability. The formula for the weighted sum is

z = θ1·x1 + θ2·x2 + ... + θn·xn + b

where x1 ... xn are the input features, θ (theta) holds one weight per feature, and b is the bias.
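To make the two formulas concrete, here is a minimal sketch in plain Python. The weights, bias, and feature values are made up purely for illustration, and the helper names `weighted_sum` and `sigmoid` are my own, not part of any library.

```python
import math

def weighted_sum(features, weights, bias):
    """Return z = w1*x1 + w2*x2 + ... + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values chosen only for illustration.
weights = [0.8, -0.4]
bias = 0.1
features = [2.0, 1.0]

z = weighted_sum(features, weights, bias)
probability = sigmoid(z)
print(f"z = {z:.2f}, probability = {probability:.3f}")
```

A weighted sum of exactly 0 gives a probability of 0.5, large positive sums approach 1, and large negative sums approach 0.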

The θ (weights) and b (bias) values need to be calculated by some algorithm. The iterative approach is the Gradient Descent algorithm, and the probabilistic approach is Maximum Likelihood.

In Gradient Descent, the idea is that the weights θ and the bias b start at random values and are slowly tweaked to minimize the error. In the graph, the bottom of the blue line is the point closest to 0, which is the lowest error the model can reach on the data.
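The tweaking loop can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the toy data is made up, the error is measured with log loss, the learning rate and step count are arbitrary, and the parameters start at 0 instead of random values so the run is repeatable.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data (made up): one feature per sample, label 1 when the feature is large.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]

def log_loss(theta, b):
    """Average error between predicted probabilities and true labels."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(theta * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

theta, b = 0.0, 0.0      # fixed starting point (assumption, for repeatability)
learning_rate = 0.1      # arbitrary small step size

loss_before = log_loss(theta, b)
for step in range(2000):
    # Gradients of the average log loss with respect to theta and b.
    grad_theta = sum((sigmoid(theta * x + b) - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(sigmoid(theta * x + b) - y for x, y in zip(xs, ys)) / len(xs)
    # Step a little in the direction that lowers the error.
    theta -= learning_rate * grad_theta
    b -= learning_rate * grad_b
loss_after = log_loss(theta, b)

print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
```

Each pass nudges both parameters slightly downhill on the error curve, which is the walk toward the bottom of the line described above.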

Maximum Likelihood is a method where the parameter values are chosen to maximize the likelihood that the model produces the correct outputs for the data. In the figure, the black columns represent the data and each colored line represents a different choice of parameters; the height of a line's peak corresponds to θ and the position of the peak corresponds to b.

I didn’t explain the detailed mathematics because this is not a mathematical tutorial; it is an overall guide. If you want to learn about these concepts in more depth, I recommend reading academic papers, though those are more advanced than the target audience of this article.

Now that you have your parameters, you can make predictions on your data. You simply input the features, a probability is calculated, and a classification is made based on whether that probability is above or below 50%.
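As a rough illustration of the idea, the sketch below scores a handful of candidate parameter pairs by how likely they make the observed labels and keeps the best one. The toy data and the candidate values are made up, and a real fit would search continuously rather than over a short list.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data (made up): one feature per sample and its true label.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]

def log_likelihood(theta, b):
    """Log of the probability the model assigns to the labels we actually saw."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(theta * x + b)
        total += math.log(p) if y == 1 else math.log(1 - p)
    return total

# Hypothetical candidate (theta, b) pairs to compare.
candidates = [(0.0, 0.0), (1.0, -1.0), (2.0, -4.5), (-1.0, 2.0)]

for theta, b in candidates:
    print(f"theta={theta:5.1f}, b={b:5.1f} -> log-likelihood {log_likelihood(theta, b):.3f}")

best = max(candidates, key=lambda params: log_likelihood(*params))
print("best candidate:", best)
```

Maximizing this log-likelihood is equivalent to minimizing the log loss commonly used with gradient descent, so in practice the two approaches usually work together rather than compete.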
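The prediction step can be sketched as a small function. The parameter values here are hypothetical stand-ins for whatever training produced, and `predict` is my own helper name, not a library call.

```python
import math

def predict(features, weights, bias, threshold=0.5):
    """Return (label, probability) for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = 1.0 / (1.0 + math.exp(-z))
    # Probabilities at or above the threshold are labeled 1, the rest 0.
    label = 1 if probability >= threshold else 0
    return label, probability

# Hypothetical learned parameters for a one-feature model.
weights = [2.0]
bias = -4.5

print(predict([4.0], weights, bias))
print(predict([1.0], weights, bias))
```

The first sample lands well above 50% and is labeled 1, while the second lands well below and is labeled 0.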

How to Code a Logistic Regression

Now that you know how an LR works, we can actually begin to build one in Python.

The problem we are going to tackle is handwritten digit classification. The dataset we are going to use is the digits dataset bundled with scikit-learn, a small MNIST-like collection of 1,797 labeled 8x8 images of digits (0–9). After training a logistic regression model on it, we can use the model to predict the label (0–9) of a given image. Since there are ten labels rather than two, scikit-learn extends the binary idea above to multiple classes for us.

We are going to use the scikit-learn library, a machine learning toolkit with a 4-step modeling pattern that makes it easy to code a machine learning classifier. Let's begin!

If you don’t have scikit-learn installed on your computer you can install it with the line,

pip install -U scikit-learn

If that doesn’t work, try pip3 instead of pip, which some systems use for Python 3.

Once you have done that, you can create a new Python file and import the dataset loader into your program with the following line,

from sklearn.datasets import load_digits

Now you can load the dataset into a variable by simply calling the function.

digits = load_digits()

The cool thing about scikit-learn is that the loaded data comes with handy attributes. For example, you can see how many samples there are, and how many values each one has, with the following line,

print(digits.data.shape)

This shows (1797, 64): there are 1797 images in scikit-learn’s digits dataset, each stored as 64 pixel values. To visualize some of the data, you can plot it using Matplotlib, which is a graphing tool, and NumPy, which reshapes your data into something that can be graphed. Install these libraries with pip if you haven’t already and import them into your program as follows,

import numpy as np 
import matplotlib.pyplot as plt

Then we need to create an empty figure to put our images into. This line does that,

plt.figure(figsize=(20,4))

We are going to plot the first 5 images in our dataset by using a for loop with the following line,

for index, image in enumerate(digits.data[0:5]):

Then the next two lines, indented inside the loop, create a subplot and show a grayscale image of each digit. A final plt.show() after the loop displays the figure,

    plt.subplot(1, 5, index + 1)
    plt.imshow(np.reshape(image, (8,8)), cmap=plt.cm.gray)
plt.show()

That whole plotting step was not essential, but I find it useful and it’s common practice.
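If you would rather check the data without plotting, the loaded object can also be inspected directly. This short sketch prints the shapes of the flattened data, the 8x8 images, and the labels; the variable name `first_as_grid` is my own.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Each row of digits.data is one image flattened to 64 pixel values.
print(digits.data.shape)      # (1797, 64)
# digits.images holds the same pixels already arranged as 8x8 grids.
print(digits.images.shape)    # (1797, 8, 8)
# digits.target holds the true label (0-9) for each image.
print(digits.target.shape)    # (1797,)
print(digits.target[0:5])

# Reshaping a flattened row gives back the matching 8x8 image.
first_as_grid = digits.data[0].reshape(8, 8)
print((first_as_grid == digits.images[0]).all())
```

The labels printed here are the answers for the five images plotted above, which makes it easy to confirm that the pictures and targets line up.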

The next steps are essential though: we have to make training and testing datasets. If you have no idea what I’m talking about, read this article, because it is a more general guide to machine learning projects. In short, the training set is the subset of data used for training and improving your model, and the testing set is used for measuring the accuracy of your model. If you tested on the training data, the accuracy would look misleadingly high, because the model has already seen those exact examples.

Anyway, we split the data 75/25 for both the images and the labels (in scikit-learn a label is called a target value). The splitting function needs to be imported first,

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=0)
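A quick sanity check after splitting is to print the shapes of the four pieces. The sketch below repeats the split from above so it can run on its own.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Images and labels must have the same number of rows in each subset.
print(x_train.shape, y_train.shape)   # (1347, 64) (1347,)
print(x_test.shape, y_test.shape)     # (450, 64) (450,)
```

Setting random_state to a fixed number means the same images land in the same subset every time you run the program, which keeps results comparable between runs.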

Now it's time to create an LR model by importing the class and assigning it to a variable (also called making an instance),

from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression()

Then the model needs to be fitted to the training data. This is the part where the model learns its parameters and the sigmoid takes shape.

logisticRegr.fit(x_train, y_train)

Now, to test the model, we can make predictions with the testing data.

predictions = logisticRegr.predict(x_test)
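It can also help to look at a single prediction and the probabilities behind it. The sketch below is self-contained and repeats the earlier steps; passing max_iter=5000 is my own assumption to give the solver more room to converge on this data, not something the steps above require.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# max_iter=5000 is an assumption; the default may stop before converging.
logisticRegr = LogisticRegression(max_iter=5000)
logisticRegr.fit(x_train, y_train)

# predict expects a 2D array, so one image is reshaped to 1 row x 64 columns.
first_image = x_test[0].reshape(1, -1)
predicted_label = logisticRegr.predict(first_image)[0]

# predict_proba returns one probability per class (digits 0-9).
probabilities = logisticRegr.predict_proba(first_image)[0]

print("predicted:", predicted_label, "actual:", y_test[0])
print("probability of predicted label:", probabilities.max())
```

The predicted label is simply the class with the highest probability, which is the multi-class version of the 50% rule described earlier.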

There are more advanced ways to measure performance, but for this example we will just use accuracy, the number of correct predictions divided by the total number of predictions, and display it to the user,

print(logisticRegr.score(x_test, y_test))
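To see what score is doing, the same number can be computed by hand, and a confusion matrix shows which digits get mixed up. This is a self-contained sketch; max_iter=5000 is again my own assumption, and the confusion matrix is one of the more advanced options rather than part of the steps above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# max_iter=5000 is an assumption to help the solver converge.
logisticRegr = LogisticRegression(max_iter=5000)
logisticRegr.fit(x_train, y_train)
predictions = logisticRegr.predict(x_test)

# Accuracy by hand: correct predictions divided by total predictions.
manual_accuracy = np.mean(predictions == y_test)
print(manual_accuracy, logisticRegr.score(x_test, y_test))

# Rows are the true digits, columns are the predicted digits.
cm = confusion_matrix(y_test, predictions)
print(cm)
```

Numbers on the diagonal of the matrix are correct predictions, and anything off the diagonal is a digit that was mistaken for another.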

Conclusion

Great, you have now finished building a scikit-learn model with Logistic Regression! You now know how Logistic Regression works as well as how to build a simple one in Python. Obviously, this isn’t the best model, but it’s something you’ll come across while working on models. Usually, your initial try isn’t the greatest. There will always be things to improve and work on, so good luck on your coding adventure!

Works Cited

“Build a Logistic Regression Model with Python.” Enlight, enlight.nyc/projects/build-a-logistic-regression-model.

Galarnyk, Michael. “Logistic Regression Using Python (Scikit-Learn).” Medium, Towards Data Science, 4 Feb. 2021, towardsdatascience.com/logistic-regression-using-python-sklearn-numpy-mnist-handwriting-recognition-matplotlib-a6b31e2b166a.

“Installing Scikit-Learn.” Learn, scikit-learn.org/stable/install.html.

Li, Susan. “Building A Logistic Regression in Python, Step by Step.” Medium, Towards Data Science, 27 Feb. 2019, towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8.

Rana, Ryan. “The Simplest Guide to Linear Regression.” Medium, Medium, 26 July 2021, theryanrana.medium.com/the-simplest-guide-to-linear-regression-e8b93e1f76de.

“Reducing Loss: An Iterative Approach | Machine Learning Crash Course.” Google, Google, developers.google.com/machine-learning/crash-course/reducing-loss/an-iterative-approach.

The Ultimate Guide to Logistic Regression for Machine Learning, www.keboola.com/blog/logistic-regression-machine-learning.

z_ai. “Probability Learning III: Maximum Likelihood.” Medium, Towards Data Science, 14 Nov. 2020, towardsdatascience.com/probability-learning-iii-maximum-likelihood-e78d5ebea80c.
