The Simplest Guide to Linear Regression

Ryan Rana
5 min read · Jul 26, 2021

Photo by Sven Mieke on Unsplash

Linear algebra is extremely important in computer science, and especially in machine learning. Linear regression is a simple supervised machine learning model built directly on linear algebra. The goal of this article is not to give you a math lesson, but to walk you through the algebra that makes up a regression model and then show how to build one in Python. Let's begin!

What is Linear Regression?

Linear regression is a method that finds the best relationship between an input variable (x) and an output variable (y). All the (x, y) data points are plotted on a coordinate plane, and the method finds the best-fit line, the line that represents all the points as closely as possible. The linear relationship between the two variables is represented by this straight line, called the regression line.

For example, imagine a scatter plot where each piece of data has an x and a y.

The only way a prediction can be made on this data is if there is a relationship between x and y, so that for any given input x there is a corresponding y. That is exactly what a best-fit line provides.

Now, given any value of x, there is a corresponding point for y on the line. This article explains how to get that line.

How does Linear Regression Work?

The first step is to import the necessary libraries; in this case, we just need NumPy, Pandas, and Matplotlib.

  • NumPy — A mathematical tool for data analysis.
  • Pandas — A tool for handling data.
  • Matplotlib — A tool for graphing data.

You can install these with pip or pip3 from the terminal.
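A typical install command looks like this (scikit-learn is included because it is used later in the article; this assumes pip is on your PATH):

pip install numpy pandas matplotlib scikit-learn

Then, in your program, you import the libraries like so,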

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Now we have to import the data. For this article, we will use regression to predict Boston housing prices based on the number of rooms. The data lives in a CSV file; download boston.csv and move it into the same folder as your program. You can then load the data in your program with a single line,

boston_df = pd.read_csv("boston.csv")
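To sanity-check the load, you can peek at the first few rows; this assumes the CSV has the standard Boston columns, including RM and MEDV:

print(boston_df.head())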

You can scatter the data on a graph using Matplotlib, plotting the room count (RM) against the median home price (MEDV).

plt.scatter(boston_df.RM, boston_df.MEDV)

This produces a scatter plot of room count against price, with the points trending upward: more rooms generally means a higher price.
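If you are running this as a script rather than in a notebook, you will also want axis labels and an explicit plt.show(); a minimal sketch:

plt.scatter(boston_df.RM, boston_df.MEDV)
plt.xlabel("Average rooms per dwelling (RM)")
plt.ylabel("Median home price (MEDV)")
plt.show()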

It will be easier to assign these two columns to x and y variables, as sketched below.
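A minimal sketch of that step, converting the columns to NumPy arrays so the math below works elementwise:

# Pull the room-count and price columns out as NumPy arrays
x = np.array(boston_df.RM)
y = np.array(boston_df.MEDV)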

After you do that, we can start breaking down the actual math. In pre-algebra, you may have learned the slope-intercept form; if you forgot, it looks like this.

y = mx + b

Here m is the slope of the line and b is the y-intercept; if you are not familiar with those terms, any basic introduction to slope-intercept form will cover them. In machine learning, the y value is the predicted label, b is the bias, the slope m is the weight, and the x value is an input (also known as the feature). To find the weight, we use the following formula,

m = (mean(x)·mean(y) − mean(x·y)) / (mean(x)² − mean(x²))

In Python, this can be written as,

def best_fit_line(x, y):
    # Slope from the least-squares formula above
    m = ((x.mean() * y.mean()) - (x * y).mean()) / ((x.mean()) ** 2 - (x ** 2).mean())
    # Intercept: the best-fit line always passes through the point of means
    b = y.mean() - m * x.mean()
    return m, b

Calling the function on the Boston data, m comes out to about 9.10210898 and b to about -34.67062078, so you can print the equation of the line,

m, b = best_fit_line(x, y)
print(f"y = {round(m, 2)}x + {round(b, 2)}")

Now that you have the equation of the best-fit line, you can make a prediction by plugging in an x. For example, if x is 2, plugging it into the equation gives -16.46640281 (a negative price, because x = 2 is far below the range of room counts in the data).
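In code, the same prediction looks like this (x_prediction and y_prediction are reused when plotting below):

# Plug x = 2 into y = mx + b
x_prediction = 2
y_prediction = (m * x_prediction) + b
print(y_prediction)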

You can also plot the graph just like the original scatter plot, except now with the line going through it (the line is just a series of predicted y values, one for each x, saved to the variable regression_line).

regression_line = [(m * x_) + b for x_ in x]   # predicted y for every x in the data
plt.scatter(x, y)                              # original data points
plt.scatter(x_prediction, y_prediction)        # the single prediction from above
plt.plot(x, regression_line)                   # the best-fit line
plt.show()

This is a very mathematical way to code it. It is important that you understand the math, but in future projects you can simply use scikit-learn's built-in model, like so,

from sklearn import linear_model

linreg = linear_model.LinearRegression()
# scikit-learn expects a 2-D feature array, so reshape the 1-D x
linreg.fit(x.reshape(-1, 1), y)
print(linreg.intercept_)  # b
print(linreg.coef_)       # m
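The fitted model reproduces the manual prediction from earlier; note the 2-D input shape scikit-learn expects:

print(linreg.predict([[2]]))  # roughly -16.47, matching y = mx + b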

Although we have a complete linear regression model, it cannot be 100% accurate, because there is a lot of variation in our data. There is a way to measure this, though. The r² value, or coefficient of determination, numerically represents how good our linear model actually is. The closer the value is to one, the better the fit of the regression line; if r² is exactly one, the model fits the data perfectly. The formula for r² is as follows,

r² = 1 − (Squared Error of Regression Line) / (Squared Error of y Mean Line)
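For example, if the regression line's squared error is 20 and the mean line's squared error is 100, then r² = 1 − 20/100 = 0.8.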

The squared error is the sum of the squared distances between each original y value and the line's value at the same x. In code, it looks like this,

def squared_error(ys_orig, ys_line):
    # Sum of squared differences between the line and the original points
    return sum((ys_line - ys_orig) ** 2)

def r_squared_value(ys_orig, ys_line):
    squared_error_regr = squared_error(ys_orig, ys_line)
    # Baseline: a flat line at the mean of the original y values
    y_mean_line = [ys_orig.mean() for _ in ys_orig]
    squared_error_y_mean = squared_error(ys_orig, y_mean_line)
    return 1 - (squared_error_regr / squared_error_y_mean)

r_squared = r_squared_value(y, regression_line)
print(f"r^2 value: {round(r_squared, 2)}")

Or, using scikit-learn's built-in scoring function, it looks like this,

from sklearn.metrics import r2_score

# Score the model on the same data it was fit on (no train/test split here)
y_pred = linreg.predict(x.reshape(-1, 1))
R2 = r2_score(y, y_pred)
print(R2)

This prints the r² score, which falls between 0 and 1 for any reasonable model; the closer to 1, the better the fit.

Conclusion

To conclude, linear regression is a simple yet effective tool for predicting output values from input values. You now know the algebra behind a regression line, how to import libraries, load data with pandas, plot graphs with Matplotlib, create a regression line, and calculate an r² score.

With that said, you can use this knowledge to build your own machine learning applications and go on to learn other models.

Good luck with all your programming endeavors!
