The Simplest Guide to K-Means Clustering

Ryan Rana
5 min read · Jul 25, 2021

K-Means Clustering is one of the best segmentation models in Machine Learning, and it may be the best unsupervised method there is. This article is a guide not only to what K-Means Clustering is but also to how to build one with Python.

What is K-Means Clustering?

K-means clustering is an unsupervised machine learning algorithm whose job is to find clusters of similar data. The model can then figure out which cluster a new input falls into.

A simple scatter plot is a good demonstration of how clusters are made: data points are plotted on a graph, and each one joins a cluster based on how close it is to the surrounding data.
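To make that concrete before digging into the mechanics, here is a minimal sketch (assuming scikit-learn is installed, and using a handful of made-up 2-D points) of fitting a model and asking which cluster a new input falls into:

from sklearn.cluster import KMeans
import numpy as np

# Two loose groups of 2-D points (made-up values purely for illustration)
points = np.array([[1.0, 1.2], [0.8, 1.1], [1.1, 0.9],
                   [5.0, 5.2], [5.3, 4.9], [4.8, 5.1]])

model = KMeans(n_clusters=2, n_init=10).fit(points)
print(model.labels_)                # the cluster each training point joined
print(model.predict([[5.1, 5.0]]))  # the cluster a new input falls into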

How does K-Means work?

To cluster all the data properly, here are the five simple steps needed:

Step 1: The input data is plotted onto a graph.

Step 2: K points, called centroids, are plotted at random positions on the plane (in this example, K = 3).

Step 3: For each centroid, the algorithm finds the data points closest to it (its nearest neighbors) and adds them to that centroid's cluster.

Step 4: Each centroid is moved to the center of its cluster, i.e. the average position of all the data points assigned to it.

Step 5: Steps 3 and 4 are repeated until the centroids stop moving.
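Put together, a bare-bones version of these five steps (a from-scratch sketch in NumPy, not the scikit-learn implementation used later in this article) might look like this:

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k of the plotted points at random as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign every point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the average position of its cluster
        # (ignoring the edge case of a cluster that ends up empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids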

These five steps are simple and easy to understand, but they raise two issues that the basic algorithm does not solve on its own:

  • What if the cluster groupings are incorrect?
  • How do we know what is the best number for K?

These are big issues when it comes to segmenting our clusters properly, but luckily there are techniques to help with both.

What if the cluster groupings are incorrect?

The cluster groupings can be incorrect if the random starting points land in undesirable positions. To prevent this, we repeat the five-step process many times, which makes it likely that at least one of the resulting groupings is accurate. Then a function is used to decide which of the groupings is best.

This function is based on the Euclidean distance between each data point and the centroid of its cluster.
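In standard notation, where μ_i is the centroid (center point) of cluster C_i and K is the number of clusters, this within-cluster sum of squared distances is usually written as:

J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2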

This looks complicated, but the general idea is simple: for each candidate grouping, take the distance from each centroid to the data points in its cluster, square those distances, and add them all up. The grouping with the lowest total is chosen.
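In practice you rarely need to code this repetition yourself: scikit-learn's KMeans (the library used later in this article) reruns the algorithm from several random starting positions and keeps the grouping with the lowest total. A minimal sketch, assuming some random 2-D data:

from sklearn.cluster import KMeans
import numpy as np

X = np.random.rand(50, 2)
# n_init=10 runs K-Means ten times from different random centroids
# and keeps the run with the lowest inertia (the sum described above).
model = KMeans(n_clusters=3, n_init=10).fit(X)
print(model.inertia_)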

How do we know what is the best number for K?

The higher K is, the lower the overall function value becomes; if K is equal to the number of data points (one cluster per row of data), the function value drops all the way to 0.
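You can check this extreme case directly with scikit-learn; a quick sketch, assuming ten random 2-D points:

from sklearn.cluster import KMeans
import numpy as np

X = np.random.rand(10, 2)
# One cluster per data point: every point becomes its own centroid,
# so the summed squared distance (inertia) collapses to zero.
model = KMeans(n_clusters=len(X), n_init=1).fit(X)
print(model.inertia_)  # 0.0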

To decide how many clusters K to include, we plot the function values (also known as the inertia) against the number of clusters K. For many datasets the resulting curve drops sharply at first and then levels off.

To decide on the best number for K, we create what is called an elbow graph: a plot with the values of K on one axis and the corresponding function (inertia) values on the other.

To find the optimal number of clusters, we select the value of K at the elbow of the graph. In this example, the optimal number is 3.

How to code a K-Means Model.

Coding a K-Means Model is far easier than understanding one. For this example, we will use random data and return a visual model with the proper clusters.

The first step is to import the necessary libraries; there are a few tools needed:

  • scikit-learn — a machine learning library that provides the KMeans model
  • SciPy — a scientific computing library (used here for its distance functions)
  • NumPy — a library for numerical arrays and math
  • Matplotlib — a library for plotting data

These can all be installed with pip or pip3 from the terminal.
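The exact command depends on your setup, but with a standard Python install it would look something like this (SciPy is included because the code below uses one of its distance functions):

pip install scikit-learn scipy numpy matplotlib

Then, in your program, you import them like this: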

from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt

Now we can create the data as NumPy arrays. I’ll just use 50 random values for each coordinate so there is enough data to cluster.

x1 = np.random.rand(50)
x2 = np.random.rand(50)

These two arrays are merged into a single array of (x, y) pairs with a couple of simple NumPy calls:

X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)

The arrays can now be plotted visually using a basic Matplotlib scatter plot:

plt.scatter(x1, x2)
plt.show()

Next, the following loop fits a K-Means model for each value of K from 1 to 9 and records how well each one fits the data:

distortions = []
inertias = []
mapping1 = {}
mapping2 = {}
K = range(1, 10)

for k in K:
    # Fit a K-Means model with k clusters
    kmeanModel = KMeans(n_clusters=k).fit(X)
    # Distortion: average distance from each point to its nearest centroid
    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
    # Inertia: sum of squared distances to the nearest centroid
    inertias.append(kmeanModel.inertia_)
    mapping1[k] = distortions[-1]
    mapping2[k] = inertias[-1]

With a result recorded for each value of K, it is now time to decide which K is best using the elbow method, which means plotting the values and judging the graph for yourself:

plt.plot(K, inertias, 'bx-')
plt.show()

A simple graph does the job: look for the elbow in the curve and use that value of K for the final clustering model.
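As a final sketch, assuming the elbow lands at K = 3 (substitute whatever value your graph suggests), you could fit the chosen model and color the points by cluster like this:

final_k = 3  # assumed elbow value; use the K your elbow graph suggests
final_model = KMeans(n_clusters=final_k).fit(X)

# Color each point by its assigned cluster and mark the centroids with an 'x'
plt.scatter(x1, x2, c=final_model.labels_)
plt.scatter(final_model.cluster_centers_[:, 0],
            final_model.cluster_centers_[:, 1], marker='x')
plt.show()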

Conclusion.

In conclusion, you now know how to build a K-Means model. There are many applications of this technology: image segmentation, crime detection, market research, and more. You can access the source code right here.
