A simplified explanation of Computer Vision.

Jul 25, 2021

Computer Vision is by far the most exciting aspect of Artificial Intelligence. The fact that a computer can actually take in an image and figure out what is in it is quite phenomenal. Self-driving cars, object tracking, video surveillance, and image captioning are just a few of the extraordinary applications of this technology. In this article, we are going to break down the different levels of computer vision and how each one actually works.

Different levels to Computer Vision.

There are 3 different levels to Computer Vision. Each one builds on the one before it and is more advanced than the previous.

Image Classification:

Image Classification takes an image of a single object and assigns that object a label.

Object Detection:

Object Detection takes an image containing multiple objects and puts a bounding box around each of the individual “things”, for lack of a better word. Then each object is classified.

Image Segmentation:

Just like Object Detection, Image Segmentation locates the objects and classifies them, but instead of using a box it partitions the image to fit each object's shape.

With that said we are going to tackle classification and how to identify different objects.

What does your camera see?

In order to understand the concept of Computer Vision, you need to understand how a computer actually views an image. Unlike computers, humans have developed over millions of years to be able to view an image without thinking much about it.

Computers on the other hand see an array of values that represent colored pixels on a screen.
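To make this concrete, here is a minimal sketch of what a tiny image looks like to a program. The 3x3 image and its pixel values are made up for illustration; each pixel is assumed to be an (R, G, B) tuple with values from 0 to 255.

```python
# A hypothetical 3x3 color image: each pixel is an (R, G, B) tuple, 0-255.
image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(0, 0, 0), (128, 128, 128), (255, 255, 255)],
    [(255, 255, 0), (0, 255, 255), (255, 0, 255)],
]

height = len(image)      # number of rows
width = len(image[0])    # number of pixels per row
top_left = image[0][0]   # the pixel in row 0, column 0

print(f"{width}x{height} image, top-left pixel = {top_left}")
```

A real photo works the same way, just with millions of these tuples instead of nine.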

Classification

For this example let’s use a picture of Abraham Lincoln as a dataset image (not an input, but rather a picture in the dataset, also called a training value). We can use this image as our example of a face, and after this process we can compare it to other images to decide whether those are faces or not. That would be the classification.

First off, we have to convert the images in the dataset to arrays of values, because that’s the only way a computer can “see” an image. So, first, take your image and convert it to grayscale. In Python this can be done with a single library call, for example Pillow's Image.convert("L").
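To show what a grayscale conversion does under the hood, here is a minimal pure-Python sketch. The weights 0.299, 0.587, and 0.114 are the commonly used luma weights and are an assumption on my part, since this article does not commit to a formula; the function name `to_grayscale` and the sample pixels are hypothetical.

```python
def to_grayscale(image):
    """Convert rows of (R, G, B) tuples into rows of 0-255 brightness values."""
    return [
        # Weighted sum of the channels (assumed luma weights).
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]

# Hypothetical 2x2 image: white, black, pure red, pure blue.
rgb_image = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 0, 255)],
]
gray_image = to_grayscale(rgb_image)
print(gray_image)
```

After this step every pixel is a single number instead of three, which makes the later comparisons much simpler.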

This is a very large, high-quality image with a lot of pixels. To keep your program from crashing or taking too long to process it, the image has to be condensed to a smaller size, so it will look something like this.
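One simple way to condense an image, sketched here as an assumption since no specific method is named, is to average each small block of pixels into a single pixel. The function name `downsample` and the sample values are hypothetical.

```python
def downsample(gray, factor):
    """Shrink a grayscale image by averaging each factor x factor block."""
    height, width = len(gray), len(gray[0])
    small = []
    for top in range(0, height - factor + 1, factor):
        row = []
        for left in range(0, width - factor + 1, factor):
            # Collect every pixel inside the current block.
            block = [
                gray[y][x]
                for y in range(top, top + factor)
                for x in range(left, left + factor)
            ]
            row.append(sum(block) // len(block))  # integer average
        small.append(row)
    return small

gray = [
    [10, 20, 200, 210],
    [30, 40, 220, 230],
    [100, 100, 0, 0],
    [100, 100, 0, 0],
]
small = downsample(gray, 2)
print(small)
```

Here a 4x4 image becomes a 2x2 image, so there are a quarter as many values to compare later on.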

Now each of these pixels can be represented as a value based on its color, because every color can be associated with numbers. This conversion can be done however you want, but it's usually done with RGB values, which describe a color by its red, green, and blue intensities (Python's standard library includes the colorsys module for converting between RGB and other color systems). You can read more about RGB values here.

Alright, now that the training image has been processed and is just a 2D array of numbers, we can take an input image and repeat the same process. For example, let's take my high school picture as an input.

Now that we have two arrays of values, we can compare the two pixel by pixel. If each input value is within a chosen tolerance, say 10, of the matching training value, and this is true for every pixel, then the image would be classified as a face.
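The comparison described above can be sketched in a few lines. This is a minimal illustration, assuming both images are already grayscale arrays of the same size; the name `is_match` and the sample arrays are hypothetical.

```python
def is_match(training, candidate, tolerance=10):
    """Return True if every pixel pair differs by at most `tolerance`."""
    if len(training) != len(candidate):
        return False
    for train_row, cand_row in zip(training, candidate):
        if len(train_row) != len(cand_row):
            return False
        for t, c in zip(train_row, cand_row):
            if abs(t - c) > tolerance:
                return False  # one pixel too far apart is enough to fail
    return True

training_face = [[120, 80], [60, 200]]
similar_input = [[125, 75], [66, 195]]    # every pixel within 10
different_input = [[125, 75], [66, 150]]  # last pixel is 50 away

print(is_match(training_face, similar_input))
print(is_match(training_face, different_input))
```

Note how strict this is: a single pixel outside the tolerance rejects the whole image, which is part of why this simple approach needs so many training images.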

This is why the more training values, the better. There are a lot of different types of faces, and the only reason my face passed is that Mr. Lincoln and I have similar features: black hair, a tan skin tone, and black eyes. If I used a picture of someone with darker skin and blond hair, it wouldn’t work. That’s why all sorts of different faces must be added to the dataset, so that a broader group of people's faces can be correctly classified.
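Here is a small sketch of why a bigger dataset helps, assuming an input counts as a face if it is close enough to any one training image. The function names and pixel values are hypothetical, and all images are assumed to be grayscale arrays of the same size.

```python
def within_tolerance(a, b, tolerance=10):
    """True if two same-sized grayscale arrays differ by at most `tolerance` everywhere."""
    return all(
        abs(x - y) <= tolerance
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )

def classify(candidate, training_set, tolerance=10):
    """Label the candidate a face if it matches any training image."""
    return any(within_tolerance(t, candidate, tolerance) for t in training_set)

small_set = [[[120, 80], [60, 200]]]              # one training face
larger_set = small_set + [[[40, 220], [35, 90]]]  # a second, different face
new_face = [[45, 215], [30, 95]]

print(classify(new_face, small_set))
print(classify(new_face, larger_set))
```

The same input is rejected by the one-image dataset and accepted once a similar face has been added.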

Detection

I have already gone over how one object gets classified, but what about multiple objects? It’s actually pretty simple: if we can classify one object, then a larger, more complex image can be broken down into smaller images, and each of these can then be identified on its own.

This assumes there is a trained dataset for each of the objects in the picture, because that is the only way to classify them.

Let's use this image as a test case. (If you knew anything about Computer Vision before reading this, you would know it is perhaps the most famous image in the field. I have no idea why.)

This image clearly has 3 things in it: a dog, a bike, and a car. The computer doesn’t know that, so it has to create smaller images with just one object in each. To do that, the computer generates some number of candidate bounding boxes.
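One common way to generate those boxes, used here as an assumption because no particular scheme is specified, is a sliding window: a fixed-size box is stepped across the image. The name `sliding_windows` and its parameters are hypothetical.

```python
def sliding_windows(width, height, box_size, stride):
    """Yield (left, top, right, bottom) boxes stepped across the image."""
    for top in range(0, height - box_size + 1, stride):
        for left in range(0, width - box_size + 1, stride):
            yield (left, top, left + box_size, top + box_size)

# A 6x4 image covered by 2x2 boxes that do not overlap.
boxes = list(sliding_windows(width=6, height=4, box_size=2, stride=2))
print(len(boxes), boxes[0], boxes[-1])
```

A smaller stride makes the boxes overlap, which gives more candidates at the cost of more classification work.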

Within each of these smaller boxes it runs a classification function (the same one from the previous section): it converts the contents of each bounding box into an array of values and compares it with different pre-classified objects. If the two are close enough, the contents are classified and the bounding box stays, while the boxes that don’t get classified are removed. The end result looks like this.

And this is proper object detection with both localization and classification.
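Putting the pieces together, the detection procedure described above can be sketched roughly as follows. This is a toy illustration under stated assumptions: grayscale arrays, one hypothetical template per label, fixed-size boxes, and a per-pixel tolerance of 10. Real detectors are far more sophisticated.

```python
def crop(image, box):
    """Cut the pixels inside (left, top, right, bottom) out of the image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]

def matches(patch, template, tolerance=10):
    """True if every pixel of the patch is within `tolerance` of the template."""
    return all(abs(p - t) <= tolerance
               for patch_row, template_row in zip(patch, template)
               for p, t in zip(patch_row, template_row))

def detect(image, templates, box_size, stride, tolerance=10):
    """Return (label, box) for every window that matches a labelled template."""
    detections = []
    height, width = len(image), len(image[0])
    for top in range(0, height - box_size + 1, stride):
        for left in range(0, width - box_size + 1, stride):
            box = (left, top, left + box_size, top + box_size)
            patch = crop(image, box)
            for label, template in templates.items():
                if matches(patch, template, tolerance):
                    detections.append((label, box))  # keep this box
                    break
            # Boxes that match nothing are simply not kept.
    return detections

templates = {
    "dog": [[200, 200], [200, 200]],
    "car": [[50, 50], [50, 50]],
}
scene = [
    [0, 0, 205, 198],
    [0, 0, 195, 202],
    [48, 52, 0, 0],
    [55, 45, 0, 0],
]
found = detect(scene, templates, box_size=2, stride=2)
print(found)
```

Each entry in `found` pairs a label (the classification) with a box (the localization).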

Segmentation

Segmentation is the most difficult of the Computer Vision levels. “Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images” (Wikipedia).

The steps required to perform segmentation are built off object detection. Once bounding boxes have been made for each object, the process can begin. There are multiple ways to approach this process, and here are some of the most common ones.

Region-based Segmentation

Region-based segmentation has two sub-methods within it: the global threshold method and the local threshold method.

Global: “…divides the image into two regions of the target and the background by a single threshold.” (Yuheng and Hao 1)
Local: “…needs to select multiple segmentation thresholds and divides the image into multiple target regions and backgrounds by multiple thresholds.” (Yuheng and Hao 1)

A threshold in this case is a cut-off pixel value: pixels on one side of it are placed in one “region” and pixels on the other side in another. The more thresholds you choose, the more regions there will be.
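Both threshold methods can be sketched briefly. This is a minimal illustration that follows the two quoted definitions, with hypothetical function names and hand-picked threshold values; choosing good thresholds automatically is a separate problem not covered here.

```python
from bisect import bisect_right

def global_threshold(gray, threshold):
    """Label each pixel 1 (target) if above the threshold, else 0 (background)."""
    return [[1 if value > threshold else 0 for value in row] for row in gray]

def multi_threshold(gray, thresholds):
    """Label each pixel by which band between the sorted thresholds it falls in."""
    cuts = sorted(thresholds)
    return [[bisect_right(cuts, value) for value in row] for row in gray]

gray = [
    [10, 40, 200],
    [90, 130, 250],
]
two_regions = global_threshold(gray, 100)
three_regions = multi_threshold(gray, [50, 150])
print(two_regions)
print(three_regions)
```

One threshold splits the image into two regions, while two thresholds split it into three.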

Edge-based Segmentation

Edge-based segmentation looks for the boundaries between regions, where pixel values change abruptly. These discontinuities are frequently recognized using derivative operations, and the derivatives are calculated using differential operators. When a section of an edge fades out due to shadows or other factors, though, these operations often fail, which makes it easier to use region-based segmentation (which happens to be the most common one).
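As a rough sketch of a derivative operation on an image, the code below approximates the derivative with simple differences between neighboring pixels and marks a pixel as an edge when the change is large. The function name `edge_map`, the threshold of 50, and the sample image are all hypothetical; practical edge detectors use more robust operators.

```python
def edge_map(gray, threshold):
    """Mark a pixel 1 where brightness jumps sharply to its right or lower neighbor."""
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Forward differences; treated as 0 at the image border.
            dx = gray[y][x + 1] - gray[y][x] if x + 1 < width else 0
            dy = gray[y + 1][x] - gray[y][x] if y + 1 < height else 0
            if abs(dx) + abs(dy) > threshold:
                edges[y][x] = 1
    return edges

# A dark region on the left and a bright region on the right.
gray = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
edges = edge_map(gray, threshold=50)
print(edges)
```

If a shadow softened the jump between the two halves to less than the threshold, no edge would be marked at all, which is the kind of failure mentioned above.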

Cluster-based Segmentation

By now it’s clear there is no fool-proof method for image segmentation. Perhaps the most difficult but most reliable way is the clustering method. The pixels in the image are mapped to points in a feature space, that space is partitioned according to how the points group together, and the resulting segments are then colored in on the image.
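To give a feel for clustering, here is a minimal sketch that uses k-means on pixel brightness. K-means is my choice for illustration, not a method named in this article, and the starting centers are passed in by hand so the result is repeatable. The function name `kmeans_segment` and the sample image are hypothetical.

```python
def kmeans_segment(gray, centers, iterations=10):
    """Cluster pixel brightness values with 1-D k-means; return labels and centers."""
    centers = list(centers)
    pixels = [value for row in gray for value in row]
    for _ in range(iterations):
        # Assignment step: group each pixel with its nearest center.
        groups = [[] for _ in centers]
        for value in pixels:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            groups[nearest].append(value)
        # Update step: move each center to the mean of its group.
        centers = [
            sum(group) / len(group) if group else center
            for group, center in zip(groups, centers)
        ]
    # Final labelling: the index of the nearest center for every pixel.
    labels = [
        [min(range(len(centers)), key=lambda i: abs(value - centers[i])) for value in row]
        for row in gray
    ]
    return labels, centers

gray = [
    [12, 15, 240],
    [10, 235, 245],
]
labels, centers = kmeans_segment(gray, centers=[0, 255])
print(labels)
```

Each label can then be given its own color to produce the segmented picture. With full-color images the same idea applies, only each point has three coordinates instead of one.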

Conclusion

This article summarizes the three levels of computer vision: classification, detection, and segmentation. Each of these plays an important role in the growing field of not only images and photography but technology as a whole. The next step from here is to learn how to code each of them in whatever programming language you choose. Keep in mind that Python is one of the fastest-growing languages and is compatible with many different technologies. You can apply what you learn to different products in all sorts of industries, like medicine or transportation. Some even try making their own approaches to computer vision by building better models to get even more accurate results.

With that said, best of luck in all your coding endeavors!

Works Cited

Brownlee, Jason. “A Gentle Introduction to Object Recognition With Deep Learning.” Machine Learning Mastery, 26 Jan. 2021, machinelearningmastery.com/object-recognition-with-deep-learning/.

“Derivative and Operations.” Math Tutor — Derivatives — Theory — Derivative, math.fel.cvut.cz/mt/txtc/1/txe3ca1d.htm.

Khandelwal, Renu. “Implementing YOLO on a Custom Dataset.” Medium, Towards Data Science, 23 Nov. 2019, towardsdatascience.com/implementing-yolo-on-a-custom-dataset-20101473ce53.

Parmar, Ravindra. “Detection and Segmentation through ConvNets.” Medium, Towards Data Science, 2 Sept. 2018, towardsdatascience.com/detection-and-segmentation-through-convnets-47aa42de27ea.

Team, Towards AI. “Imbalanced Classification.” Towards AI — The Best of Tech, Science, and Engineering, 18 Jan. 2021, towardsai.net/p/data-science/imbalanced-classification.

Yuheng, Song, and Yan Hao. “Image Segmentation Algorithms Overview.” ArXiv.org, 7 July 2017, arxiv.org/abs/1707.02051.