
Residual Blocks in Deep Learning | by Harsh Yadav | Jul 2022



Residual block, first introduced in the ResNet paper, solves the neural network degradation problem

Figure 0: Real Life Analogy of Degradation in Deep Neural Networks as they go deeper (Image by Author)

Deep neural networks are the powerhouse behind major machine learning algorithms. These networks are collections of stacked layers (each layer has some neurons) which together perform a given task. So as we stack more layers together, i.e. go deeper and increase the depth of the model, we expect the performance to improve.

But these deep neural networks are troubled by a degradation issue. So what exactly is degradation in neural networks, and how do we solve it? These are the questions that the residual block answers.

In this blog, we are going to cover the following fundamental concepts (in sequential order):

  1. Introduction to deep neural networks
  2. Degradation problem in deep NN
  3. Residual Block (approach to solving the problem)
  4. Identity Function
  5. Residual block in different types of NN
  6. Conclusion

Let’s start …

But before starting, if you wish to gain a deeper understanding of convolution (the residual block was developed with Convolutional Neural Networks in mind), please read the following blog: https://towardsdatascience.com/computer-vision-convolution-basics-2d0ae3b79346

Figure 1: Deep Neural Network for Binary Classification (Image by Author)

The above figure represents a deep neural network made up of stacked convolutional layers, performing binary classification (yes or no) on the input image.

So let's consider the input as 'x' and the desired output as 'H(x)'. The convolution operation is linear in nature, and an activation function (say, ReLU) is added to learn abstract, non-linear features. The entire purpose of this network is to find the function F(x) that best fits the underlying mapping, i.e.

F(x) -> H(x)
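To make this concrete, here is a minimal sketch in PyTorch (layer sizes, kernel sizes and the input shape are illustrative, not taken from the article) of a plain stack of convolutional layers that tries to fit H(x) directly:

```python
import torch
import torch.nn as nn

# A plain (non-residual) stack: the layers must learn H(x) directly.
plain_block = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(),
)

x = torch.randn(1, 64, 32, 32)   # dummy input tensor
h = plain_block(x)               # the network's approximation F(x) of H(x)
```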

Now, if we want to improve model performance, does stacking more layers help? Ideally it should, because more neurons become available to learn abstract features. But this doesn't always work. Such a network faces multiple problems, the first being vanishing/exploding gradients. This problem can be effectively addressed by normalised initialisation, intermediate normalisation layers, gradient clipping and similar techniques, which enable the model to start converging under stochastic gradient descent (or any other optimiser) with backpropagation.
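The mitigations mentioned above can be sketched in a few lines; this is only an illustrative PyTorch snippet (the hyper-parameters are made up), not the article's own code:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Normalised (He/Kaiming) initialisation, suited to ReLU activations.
nn.init.kaiming_normal_(conv.weight, nonlinearity="relu")

# Intermediate normalisation layer (batch normalisation) after the convolution.
layer = nn.Sequential(conv, nn.BatchNorm2d(64), nn.ReLU())

# Gradient clipping, normally called after loss.backward() during training.
torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=1.0)
```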

Once deeper models start converging (after accounting for vanishing/exploding gradients), another problem can appear: degradation.

So what exactly is degradation?

As we increase the depth of the network, accuracy saturates. Perhaps the layers have sufficiently learned all the intricacies of our data. But we are not satisfied with the accuracy, so we stack up a few more layers, and now the model starts degrading. This is unexpected, and it is not caused by overfitting: with overfitting the training error would stay low, whereas here the additional layers lead to a higher training error.

Figure 2: Training error (left) and test error (right). Red is a deeper network and has higher training and test error. (Image by Kaiming)

Degradation indirectly hints that not all systems are equally easy to optimise. If a shallow network learns the features and reaches, say, 90% accuracy, then a deeper variant of the same model may end up with less than 90% accuracy, because optimisation becomes much harder as the depth of the network increases.

So, how to tackle this problem?

Let's consider two networks: network "A", a shallow network, and network "B", a deeper counterpart of A. We would like B to perform at least as well as A, if not better. So, given network A, we could construct B by adding identity mappings as the extra layers. By construction, the deeper model B should not produce a higher training error than its shallower counterpart. But experiments show that solvers are unable to find solutions that are comparable to, or better than, this constructed one.

This suggests that it is very difficult for the model to learn the identity mapping during optimisation.
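The thought experiment can be written down directly; below is a minimal sketch with a hypothetical shallow network A, just to illustrate the construction:

```python
import torch.nn as nn

# Hypothetical shallow network "A" (layers are illustrative).
network_a = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
)

# "B": the same network with identity layers appended.
# By construction B computes exactly the same function as A, so its
# training error can be no worse -- yet plain deeper networks trained
# from scratch fail to find such a solution in practice.
network_b = nn.Sequential(network_a, nn.Identity(), nn.Identity())
```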

To solve the problem of degradation, we use skip connections. Also called shortcut connections, they skip some of the layers in the architecture and feed the output of an earlier layer directly to a later position, which helps optimisation.

Figure 3: Residual block with skip connection (Image by Kaiming)

In the above figure, the skip connection skips two layers and feeds the input 'x' directly to the addition at the output. It is called a shortcut/skip connection because it involves no additional parameters: we are just passing the earlier activation forward to a later layer.
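In code, the shortcut is nothing more than an addition. A minimal sketch (the two skipped layers here are placeholder linear layers, not the article's architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

layer1 = nn.Linear(128, 128)
layer2 = nn.Linear(128, 128)

def block_with_skip(x):
    out = layer2(F.relu(layer1(x)))   # F(x): the two stacked layers being skipped
    return F.relu(out + x)            # skip connection: add x back, no extra parameters

y = block_with_skip(torch.randn(4, 128))
```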

A framework called deep residual learning is used to address the problem of degradation.

In the earlier section, the network tried to learn the mapping directly, i.e. F(x) -> H(x), where x is the input, H(x) is the underlying mapping to be fit (the expected output), and F(x) is the function the network actually computes; we fit the network so that F(x) resembles H(x).

Now, if we add a skip connection (which works as an identity function) to the same setup, we have the following equations:

Figure 4: Updated equation of the network with a residual block (Image by Author)

From the above figure, we have a new F(x): the residual, i.e. the difference between the expected output and the input. The stacked layers learn this residual mapping, and that is why it is called a residual block.

In other words, instead of fitting the original mapping F(x) -> H(x), the stacked non-linear layers now fit the residual F(x) = H(x) - x, and the block's output F(x) + x recovers H(x).

Figure 5: Forward Propagation with Residual Block (Image by Author)

Figure 5 shows that the skip connection simply performs the identity mapping: its output is added to the output of the stacked layers. If F(x) tends to zero, the block still passes x through unchanged thanks to the identity path, so the signal is not lost. This removes the degradation.

Consider a deep network where some of the stacked layers add value while others contribute essentially nothing. Because of the residual block, the useful signal is preserved through the identity path and optimisation can keep improving accuracy. In this extreme, it is easier to push the residual F(x) to zero than to fit an identity mapping with a stack of non-linear layers. Skip connections are a blessing for removing degradation in deeper neural networks.

The entire idea of the residual block is derived from the hypothesis that if multiple non-linear layers can approximate complicated functions, then they can equally well approximate the residual function, i.e. H(x) - x.

Now, what if the dimensions of x and F(x) are different?

The shortcut connection (identity mapping) introduces neither extra parameters nor extra computational complexity:

y = F(x, {Wi}) + x, where x is the identity mapping

But if the dimensions of the identity mapping and of the stacked layers' output differ, we simply can't add them. To solve this, we either apply a linear projection on the shortcut (e.g. a 1×1 convolution in a CNN) or pad the identity with extra zero entries to match the dimensions. We then have the following equation:

y = F(x, {Wi}) + W*x, where W is the linear projection

Figure 6: Residual Block with skip connection (left) and residual block with linearly transformed skip connection (right) to match the dimension (Image by Author)
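Both cases can be sketched in PyTorch (channel counts and stride are illustrative): when the shapes match, the shortcut is the identity; when they don't, a 1×1 convolution plays the role of the linear projection W.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # F(x): two stacked convolutional layers.
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Shortcut: identity if shapes match, otherwise a 1x1-conv projection (W).
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))   # y = F(x, {Wi}) + W*x

y = ResidualBlock(64, 128, stride=2)(torch.randn(1, 64, 32, 32))
```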

Generally, the residual function has two or three stacked layers; more layers are possible, but they can dilute the effect of the residual block. On the other hand, if F(x) has only a single layer, the whole block reduces to something very similar to a single linear layer:

y = W1*x + x

This setup doesn't help with the problem of degradation. So a residual block normally has 2–3 stacked layers with a skip connection, and we can stack many such residual blocks to create a much deeper neural network.
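Reusing the ResidualBlock class from the sketch above (sizes again illustrative), a deeper network is then just a stack of such blocks:

```python
deep_net = nn.Sequential(
    ResidualBlock(64, 64),
    ResidualBlock(64, 64),
    ResidualBlock(64, 128, stride=2),   # projection shortcut handles the shape change
    ResidualBlock(128, 128),
)
```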

Finally! We have learned about the residual block, which enables deeper neural networks to optimise and continue learning.

So far we have explored the residual block on a simple neural network. But how do we apply the same block to computer vision tasks, i.e. Convolutional Neural Networks, or to tasks where sequential networks are used?

The residual block can be used with Convolutional Neural Networks without any modification. In a CNN the stacked layers are convolutions, so their outputs look different, but the approach is exactly the same.

For sequential networks, there is a related architecture called the highway network. Highway networks use shortcut connections with additional gates. These gates are data-dependent and have their own parameters, whereas the shortcut connection in a residual network has none. The gates determine whether (and how much) information is passed through, acting as a modulated skip connection that regulates the flow of information. Highway networks are inspired by LSTM-style networks, where multiple gates forget, update and output information.
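A highway layer can be sketched as a gated mix y = T(x) * H(x) + (1 - T(x)) * x, where the transform gate T(x) is learned; this follows the standard highway-network formulation and is not code from the article:

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # H(x), the usual non-linear transform
        self.gate = nn.Linear(dim, dim)        # T(x), the data-dependent gate (has parameters)

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))
        # t == 1 behaves like a plain layer; t == 0 passes x through unchanged,
        # like a parameter-free skip connection.
        return t * h + (1 - t) * x

y = HighwayLayer(128)(torch.randn(4, 128))
```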

Figure 7: LSTNet Architecture using Recurrent skip layer, similar functioning to residual block with skip connections (Image by Guokun)

Alternatively, for sequential networks, take time-series data as an example. Just like the residual block, we can create a skip-RNN connection that skips a few timesteps of the input in an ordered fashion. LSTNet is a state-of-the-art model that implements such a recurrent-skip connection for time-series data; it is worth a look for a deeper understanding of skip-RNNs.
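As a rough sketch of the skip idea only (LSTNet's actual recurrent-skip component runs the RNN over all p interleaved sub-sequences; here we keep just one for brevity, and the skip length p and shapes are made up):

```python
import torch
import torch.nn as nn

p = 24                               # hypothetical skip length, e.g. one day of hourly data
series = torch.randn(8, 168, 1)      # (batch, timesteps, features), dummy data

# Connect timestep t to t - p instead of t - 1 by feeding only every p-th step.
skipped = series[:, ::p, :]          # keep timesteps 0, p, 2p, ...
gru = nn.GRU(input_size=1, hidden_size=32, batch_first=True)
out, h = gru(skipped)
```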

The residual block is the foundational cell of ResNet, a state-of-the-art model for extracting features from an image, and it continues to be used to tackle degradation in deep neural networks. Today, the vast majority of architectures rely on skip-connection-based designs to build feature embeddings. From ResNet to the Transformer to BERT, residual blocks have proved immensely important, and they will continue to be part of many innovations to come.

After reading the entire blog you should be able to answer the following questions:

  1. What is degradation in neural networks?
  2. How is degradation different from overfitting?
  3. How do we solve the degradation issue in deep networks?
  4. What is a skip connection?
  5. What is a residual block?
  6. Why is a residual block called a residual block?
  7. What is a highway network?
  8. How are skip connections and highway networks related?
  9. Can we use residual blocks in sequential networks?
  10. What is the identity (function) in a residual block?
  11. Are the dimensions of the identity and of the stacked layers' output always the same? If not, how do we deal with a mismatch?

These are a few questions this blog answers.


