
Convolutional Neural Networks: From an Everyday Understanding to a More Technical Deep Dive

By Benjamin McCloskey | June 2022



Wait, computers know how to see?!?

I recently had to conduct research using convolutional neural networks (CNNs) and spent a lot of time trying to understand them on my own. While studying the different features was not difficult, the hardest part was explaining CNNs to my friends and family in a way that made sense and was not too technical. Today I want to pair the technical definition of a CNN with a nontechnical one, both to give you a firmer grasp of what a CNN is and to help you explain them to the people you know!

Image from Author

How do computers see? Convolutional Neural Networks (CNNs)! These machine learning algorithms are currently the state of the art for computer vision, interpreting images and videos. This post will pair a technical definition with an “I know absolutely nothing about machine learning” definition to provide a deeper understanding of a CNN and a way for you to explain one to your friends and colleagues in day-to-day conversations. For this example, I am going to use a picture of my dog, Biscotti, as a reference!

Biscotti! (Image from Author)

Now, be aware that since we see in the red, green, and blue (RGB) spectrum, each of those colors will technically have its own layer. For the purposes of today’s discussion, I am setting that factor aside to avoid overcomplicating the explanations.

“I know nothing about machine learning” Explanation

A convolutional layer of a neural network acts as the eye of a computer and scans an image just like the human eye! Over time, the layer helps the network associate different features with a certain class of samples so it can make classifications. For example, if I show a CNN various pictures of dogs and humans, it may learn to associate anything with four legs with dogs and anything with two legs with humans! Essentially, the CNN gathers information from each part of an image, one small section at a time. For example, a kernel might pick up certain parts of my dog Biscotti through the first layer of the network.

Found an ear! (Image from Author)

As the kernel continues to move along the image, it will pick up other attributes of our image. For example, it could find the nose of our dog in one of the nodes.

Found a nose! (Image from Author)

Using different patterns, the layers of the CNN will learn how to pick up features and certain aspects of those features (the outline of the nose and the different colors which make up the nose). You might be wondering, does the computer see colors? Well, sort of. An image has to be transformed into an array of values where each value represents a certain color. One way to explain this is to imagine the computer turning an image into a grid of numbers, where numbers that are closer together represent similar colors.

Images become Pixel Values (Image from Author)
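
To make that concrete, here is a minimal Python sketch (assuming the Pillow and NumPy libraries; the file name biscotti.jpg is hypothetical) that turns a photo into the grid of numbers the computer actually sees:

```python
from PIL import Image
import numpy as np

# Load a photo (the file name is hypothetical) and convert it to greyscale,
# so each pixel is a single brightness value from 0 (black) to 255 (white).
img = Image.open("biscotti.jpg").convert("L")

# The image is now just a 2-D grid of numbers; nearby values mean similar shades.
pixels = np.asarray(img)
print(pixels.shape)    # (height, width)
print(pixels[:3, :3])  # the top-left 3x3 corner of the grid
```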

In the convolutional layer of a CNN, the layers target densely correlated points, which ultimately are the important features within an image (1). The number of neurons placed within a layer depends on the complexity of the dataset, computing power, and memory availability. Convolutional layers are named after the convolution matrix operation they perform on a given tensor from an image. The convolution operation takes a segment of the input image $I$ and conducts element-wise multiplication between that segment and the kernel $k_l^k$, where $k_l^k$ is the $k$-th convolutional kernel of the $l$-th layer. The area of the image where data is being extracted is given by the coordinates $x$ and $y$, and $c$ is the index of the channel (1 for greyscale, 3 for RGB). The resulting feature map $F_l^k$ is

$$F_l^k(x, y) = \sum_{c=1}^{C} \sum_{p=1}^{P} \sum_{q=1}^{Q} I_c(x + p,\, y + q)\; k_l^k(p, q, c)$$

where $P$ is the total number of rows in the feature matrix and $Q$ is the total number of columns in the feature matrix.
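
As a sanity check on the equation, here is a minimal NumPy sketch of the operation for a single kernel. The function name and shapes are my own illustrative choices, and, like most deep learning libraries, the loop does not flip the kernel, so strictly speaking it computes a cross-correlation:

```python
import numpy as np

def conv2d_single(I, k):
    """Slide one kernel k over image I with no padding and stride 1.

    I has shape (H, W, C); k has shape (P, Q, C). Each output value is the
    element-wise product of a P x Q x C image patch with the kernel, summed
    over rows p, columns q, and channels c -- exactly the triple sum above.
    """
    H, W, C = I.shape
    P, Q, _ = k.shape
    out = np.zeros((H - P + 1, W - Q + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            out[x, y] = np.sum(I[x:x + P, y:y + Q, :] * k)
    return out

# A random 5x5 RGB "image" and one 3x3x3 kernel produce a 3x3 feature map.
I = np.random.rand(5, 5, 3)
k = np.random.rand(3, 3, 3)
print(conv2d_single(I, k).shape)  # (3, 3)
```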

“I know nothing about machine learning” Explanation

Essentially, the kernel is the part of the CNN that gets the information from each part of an image. Each kernel will have a different pattern so the computer can find different features of the sample. For example, some filters will find wavy patterns in the image while other filters may find straight, horizontal patterns. It all depends! One filter might find the edge of the dog, while another picks up on the patterns of fur that make up the dog’s coat.

Technical Explanation

The filters in a convolutional layer (also known as “kernels”) break down the input sample fed into the layer by placing different weights on each neuron, producing various feature maps. Neurons ignore all information except the data gathered in their receptive field. For example, a neuron with a kernel representing a horizontal line through the middle of its n×n receptive field will multiply all inputs by zero except for the part of the input crossing the central horizontal line. The movement of the convolutional kernels across an image is dictated by the layer’s stride.
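
Here is a small sketch of that horizontal-line example (the 5×5 size and the weights are my own illustrative choices): every weight in the kernel is zero except the central row, so only input crossing that line affects the neuron’s output.

```python
import numpy as np

# A kernel encoding "a horizontal line through the middle of the field":
# zeros everywhere except the central row.
horizontal_line = np.zeros((5, 5))
horizontal_line[2, :] = 1.0

patch = np.random.rand(5, 5)                # one neuron's 5x5 receptive field
response = np.sum(patch * horizontal_line)  # only the middle row contributes
```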

“I know nothing about machine learning” Explanation

The stride tells us how far the computer should move to pick up its next feature box of the image. Choose a stride too big and you might miss features of the image. If we select a giant stride, we may force the CNN to skip over the face of the dog and leave the computer thinking dogs have no faces and two combined ears. That would be terrible!

This stride is way too big! (Image from Author)

Technical Explanation

The stride is the distance between each receptive field (how far a kernel moves from subsection to subsection). A stride of 2 is often used since it also has the effect of downsampling the image.
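
Extending the earlier convolution sketch with a stride parameter shows the downsampling effect: the receptive field jumps `stride` pixels between steps, so a larger stride yields a smaller feature map (same illustrative assumptions as before):

```python
import numpy as np

def conv2d_strided(I, k, stride=1):
    """Slide kernel k over image I, jumping `stride` pixels between steps."""
    H, W, C = I.shape
    P, Q, _ = k.shape
    out_h = (H - P) // stride + 1  # fewer receptive fields as stride grows
    out_w = (W - Q) // stride + 1
    out = np.zeros((out_h, out_w))
    for x in range(out_h):
        for y in range(out_w):
            i, j = x * stride, y * stride
            out[x, y] = np.sum(I[i:i + P, j:j + Q, :] * k)
    return out

I = np.random.rand(8, 8, 3)
k = np.random.rand(3, 3, 3)
print(conv2d_strided(I, k, stride=1).shape)  # (6, 6)
print(conv2d_strided(I, k, stride=2).shape)  # (3, 3) -- downsampled
```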

“I know nothing about machine learning” Explanation

Imagine at this point that the CNN has scanned a bunch of smaller pieces of my dog’s picture. Maybe some of them will hold important features, like the nose of my dog.

Found a nose! (Image from Author)

And maybe there will be some less important features, like the background noise of the image.

By pooling features of an image, the computer can start to discard the less important features and focus on the features of interest, combining the information gathered from those important features. Over time, the computer wants to get rid of the background information of an image and focus on just the combination of important features depicting the entity (in this case a nose is important to a dog but a rug is not!), leading to a model that classifies, with some probability, the type of object within an image.

Technical Explanation

After the convolutional layer, a pooling layer is implemented to perform dimensionality reduction on the image, reducing the number of parameters and the model’s complexity (2). In the pooling layer, “outputs of several nearby feature detectors are combined into a local or global ‘bag of features’, in a way that preserves task-related information while removing irrelevant details” (3). The goal of the pooling layer is to obtain a sub-sample of the image at each step and unveil the invariant features (4). Two popular methods in image classification are average pooling and maximum pooling. Average pooling takes the average of all elements inside a kernel at each stride step.

Average Pooling (Image from Author)

Maximum pooling takes the maximum number inside a filter at each stride step. For object recognition, maximum pooling has been found to decrease error and provide improvements in image downsampling for feature extraction.

Max Pooling (Image from Author)
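
Both pooling variants can be sketched with the same sliding-window loop as before; only the reduction changes (the function name and the 4×4 example values are my own):

```python
import numpy as np

def pool2d(F, size=2, stride=2, mode="max"):
    """Max or average pooling over a 2-D feature map F."""
    H, W = F.shape
    out = np.zeros(((H - size) // stride + 1, (W - size) // stride + 1))
    reduce_fn = np.max if mode == "max" else np.mean
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            i, j = x * stride, y * stride
            out[x, y] = reduce_fn(F[i:i + size, j:j + size])
    return out

F = np.array([[1., 3., 2., 4.],
              [5., 6., 1., 2.],
              [7., 2., 9., 3.],
              [4., 8., 1., 5.]])
print(pool2d(F, mode="max"))  # [[6. 4.] [8. 9.]]
print(pool2d(F, mode="avg"))  # [[3.75 2.25] [5.25 4.5 ]]
```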

“I know nothing about machine learning” Explanation

The size of the image going in must be the same size as the image going out, and padding is how the network keeps it that way. Enough said!

Technical Explanation

Since the spatial size of a sample shrinks as it moves through the layers of a CNN, padding ensures that the output sample of a layer keeps the same size and shape as its input. Zero padding adds a border of zeros around the input image, compensating for the boundary pixels that would otherwise be discarded as the sample passes down through the layers of the neural network.
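
NumPy’s np.pad makes the idea easy to see: a one-pixel border of zeros around a 3×3 input lets a 3×3 kernel with stride 1 produce a 3×3 output, the same size as the input (the values are purely illustrative):

```python
import numpy as np

image = np.arange(1, 10).reshape(3, 3)  # a tiny 3x3 "image"
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded)
# [[0 0 0 0 0]
#  [0 1 2 3 0]
#  [0 4 5 6 0]
#  [0 7 8 9 0]
#  [0 0 0 0 0]]
```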

“I know nothing about machine learning” Explanation

Imagine now the computer has found all of the important features and needs to figure out what it’s looking at. The fully connected layer is the “brain” of the CNN and will take the information from the features given to it and make a prediction of what all of those features resemble.

Technical Explanation

Once an image has passed through the convolutional layers of a CNN architecture, it must be flattened into a one-dimensional array to go through a series of fully connected layers. The fully connected layers of the CNN mirror a typical artificial neural network (ANN). The information is passed through the layers and finally into an output activation function. Backpropagation, with support from the network’s optimizer, is then used to update the fully connected layers’ weights so the network outputs more accurate results. Over time, continuously feeding the fully connected layers with batches of samples helps the network learn to make correct classification predictions for the images parsed apart by the CNN.
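
Putting the pieces together, here is a minimal Keras sketch of the whole pipeline described in this post. The layer sizes, the 64×64 input, and the two-class (say, dog vs. human) output are my own illustrative choices, not a prescribed architecture:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Convolution: kernels slide over the image, producing feature maps.
    tf.keras.layers.Conv2D(16, kernel_size=3, activation="relu",
                           padding="same", input_shape=(64, 64, 3)),
    # Pooling: downsample, keeping only the strongest responses.
    tf.keras.layers.MaxPooling2D(pool_size=2),
    # Flatten: a one-dimensional array for the fully connected layers.
    tf.keras.layers.Flatten(),
    # Fully connected "brain", ending in a probability per class.
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])

# The optimizer drives backpropagation, updating the weights batch by batch.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```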

And there you have it! Now you have a high-level understanding of how computers see! There is so much research being done around CNNs and I highly recommend you find out more and try implementing them yourself!

If you enjoyed today’s reading, PLEASE give me a follow and let me know if there is another topic you would like me to explore (This really helps me out more than you can imagine)! Additionally, add me on LinkedIn, or feel free to reach out! Thanks for reading!

Sources

  1. Asifullah Khan, Anabia Sohail, Umme Zahoora, and A. Qureshi. A survey of the recent architectures of deep convolutional neural networks. Artificial Intelligence Review, pages 1–62, 2020.
  2. Keiron O’Shea and Ryan Nash. An introduction to convolutional neural networks. CoRR, abs/1511.08458, 2015.
  3. Y-Lan Boureau, Jean Ponce, and Yann LeCun. A theoretical analysis of feature pooling in visual recognition. In Proceedings of the 27th International Conference on Machine Learning, ICML’10, pages 111–118, Madison, WI, USA, 2010. Omnipress.
  4. Aurélien Géron. Hands-On Machine Learning with Scikit-Learn & TensorFlow. O’Reilly Media, Inc., Sebastopol, CA, 2017.

