
Weight Decay is Useless Without Residual Connections

by Guy Dar, Feb 2023



Photo by ThisisEngineering RAEng on Unsplash

The idea, in broad strokes, is fairly simple: we can make the weight decay term arbitrarily small without changing the function the network computes, rendering it practically useless. A quick recap of what weight decay is: weight decay is a regularization technique used to prevent neural networks from converging to solutions that do not generalize to unseen data (overfitting). If we train the network only to minimize the loss on the training data, we might find a solution specifically tailored to this particular data and its idiosyncrasies. To avoid that, we add a term that corresponds to the norm of the weight matrices of the network. This encourages the optimization process to converge to solutions that might not be optimal for the training data but have weight matrices with smaller norms. The thinking is that models with high-norm weights are less natural and might be fitting specific data points just to lower the loss a bit more. In a way, this integrates Occam’s razor (the philosophical idea that simpler solutions are probably the right ones) into the loss, where simplicity is captured by the norm of the weights. We will not discuss deeper justifications for weight decay here.

TL;DR: in this article, we show that in ReLU feedforward networks with LayerNorm that don’t have residual connections, the optimal loss value is not changed by weight decay regularization.

Positive Scaling of ReLU Networks

Both linear and ReLU networks share the following scaling property: let a > 0 be a positive scalar. Then ReLU(ax) = a ReLU(x). Consequently, in any network composed of a stack of matrix multiplications, each followed by a ReLU activation, this property still holds. This is the most vanilla kind of neural network: no normalization layers and no residual connections. Yet it’s rather surprising that such feedforward (FF) networks, which were ubiquitous not so long ago, exhibit such structured behavior: multiply your input by a positive scalar and, lo and behold, the output is scaled by exactly the same factor. This is what we call (positive) scale equivariance (meaning that scaling of the input translates to scaling of the output, unlike invariance, where the output is not affected at all by scaling of the input). But there’s more: if we do this to any of the weight matrices along the way (and the corresponding bias terms), the same effect takes place: the output is multiplied by the same factor. Nice? For sure. But can we use it? Let’s see.
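To make this concrete, here is a minimal NumPy sketch of both claims, using a bias-free two-layer network (the layer sizes, seed, and the scalar a are arbitrary illustration choices of mine, not anything from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# A plain two-layer feedforward net: no normalization, no residuals.
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(4, 16))

def ff(x, W1, W2):
    return W2 @ relu(W1 @ x)

x = rng.normal(size=8)
a = 0.37  # any positive scalar

# Scaling the input scales the output by the same factor (scale equivariance).
assert np.allclose(ff(a * x, W1, W2), a * ff(x, W1, W2))

# Scaling a weight matrix has exactly the same effect.
assert np.allclose(ff(x, a * W1, W2), a * ff(x, W1, W2))
```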

Let’s see what happens when we add LayerNorm. First, what is LayerNorm? Quick recap:

LayerNorm(x) = γ ⊙ (x − µ) / σ + β

where µ is the average of the entries of x, σ is their standard deviation, ⊙ stands for elementwise multiplication, and β, γ are learned vectors.

So, what happens to the scaling property when we add LayerNorm? Of course, nothing changes before the point where we added the LayerNorm, so if we scaled a weight matrix by a > 0 before this point, the input to the LayerNorm is scaled by a, and then:

LayerNorm(ax) = γ ⊙ (ax − aµ) / (aσ) + β = γ ⊙ (x − µ) / σ + β = LayerNorm(x)

So we get a new property; this time, scaling simply leaves the output unchanged: positive scale invariance. And weight decay is about to regret that…
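The cancellation can also be checked numerically. Below is a small NumPy sketch with my own toy implementation of the LayerNorm formula above (omitting the small numerical epsilon that real implementations add in the denominator), showing that scaling the weight matrix feeding into a LayerNorm leaves the output untouched:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def layer_norm(x, gamma, beta):
    # The formula from the recap: normalize to zero mean and unit std,
    # then apply the elementwise parameters gamma and beta.
    return gamma * (x - x.mean()) / x.std() + beta

W = rng.normal(size=(16, 8))
gamma, beta = rng.normal(size=16), rng.normal(size=16)

def block(x, W):
    return layer_norm(relu(W @ x), gamma, beta)

x = rng.normal(size=8)
a = 1e-3  # an arbitrarily small positive scalar

# Scaling the weights no longer scales the output -- it does nothing at all.
assert np.allclose(block(x, a * W), block(x, W))
```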

Note: while we discuss LayerNorm, other forms of normalization, such as BatchNorm, also satisfy the positive scale invariance property and are therefore just as susceptible to the problems discussed here.

How to Disappear Completely

Let’s remind ourselves what we’re trying to minimize:

L(Θ) = Σ_i ℓ(f(x_i; Θ), y_i) + λ Σ_W ||W||²

where the training set is represented as a set of pairs {(x_i, y_i)}, the parameters (weights) of the neural network f are designated by Θ, ℓ is the per-example loss, and λ > 0 controls the strength of the regularization. The expression is made of two parts: the empirical loss to minimize (the loss of the neural network on the training set) and the regularization term, designed to push the model toward “simpler” solutions. In this case, simplicity is quantified as the weights of the network having low norms.
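In practice this term is rarely written out by hand. For instance, in PyTorch the penalty is usually folded into the optimizer via the weight_decay argument (for plain SGD this is equivalent to an L2 penalty on all parameters); the toy model and hyperparameter values below are placeholders, not anything from the original post:

```python
import torch.nn as nn
import torch.optim as optim

# A toy feedforward model with LayerNorm (sizes are arbitrary).
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.LayerNorm(16),
    nn.Linear(16, 4),
)

# weight_decay adds the norm penalty on the parameters to each update,
# playing the role of the regularization term above.
optimizer = optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
```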

But here’s the problem: we have found a way to bypass restrictions on the weight scale. We can scale every weight matrix by an arbitrarily small factor and still get the same output. Put differently, the function f that both networks, the original one and the scaled one, implement is exactly the same! The internals might differ, but the output is the same. This holds for every network with this architecture, regardless of the actual values of the parameters.

Recall that generalization to unseen data is our goal. If the regularization term can be driven to zero for free, the network is free to overfit the training data, and the regularizer becomes useless. As we have seen, for every network with this architecture we can construct an equivalent network (i.e., one computing exactly the same function) with arbitrarily small weight matrix norms, meaning the regularization term can go to zero without affecting the empirical loss term. In other words, we could remove the weight decay term and it would not change the optimal loss value.
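Here is a sketch of that construction on a toy two-block network (again with my own epsilon-free LayerNorm): shrinking every weight matrix by the same factor leaves the function unchanged while the norm penalty all but vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def layer_norm(x, gamma, beta):
    return gamma * (x - x.mean()) / x.std() + beta

# Two blocks of Linear -> ReLU -> LayerNorm (sizes are arbitrary).
params = {
    "W1": rng.normal(size=(16, 8)), "g1": np.ones(16), "b1": np.zeros(16),
    "W2": rng.normal(size=(16, 16)), "g2": np.ones(16), "b2": np.zeros(16),
}

def net(x, p):
    h = layer_norm(relu(p["W1"] @ x), p["g1"], p["b1"])
    return layer_norm(relu(p["W2"] @ h), p["g2"], p["b2"])

def penalty(p):
    # Weight decay term: squared norms of the weight matrices.
    return sum(np.sum(p[k] ** 2) for k in ("W1", "W2"))

eps = 1e-4
shrunk = dict(params, W1=eps * params["W1"], W2=eps * params["W2"])

x = rng.normal(size=8)
assert np.allclose(net(x, shrunk), net(x, params))  # exactly the same function...
print(penalty(shrunk) / penalty(params))            # ...but the penalty shrank by eps**2 (1e-8)
```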

A word of caution is due: while in theory the model should find a solution that overfits the training data, it has been observed that optimization may converge to generalizing solutions even without explicit regularization. This has to do with the optimization algorithm. We use local optimization algorithms such as gradient descent, SGD, Adam, AdaGrad, etc. They are not guaranteed to converge to the global optimum. This sometimes happens to be a blessing. An interesting line of work (e.g., [Neyshabur, 2017]) suggests that these algorithms act as a form of implicit regularization, even when explicit regularization is missing! It’s not bulletproof, but sometimes the model converges to a generalizing solution even without regularization terms.

Let me remind you what residual connections are. A residual connection adds the layer’s input to its output: if the original function the layer computes is f(x) = ReLU(Wx), then the new function is x + f(x).

Now the scaling property on the weights breaks for this new layer. This is because there is no learned coefficient in front of the residual part of the expression: the f(x) part gets scaled by the constant due to the weight scaling, but the x part remains unchanged. When we apply LayerNorm on top of this, the scaling factor can no longer cancel out: LayerNorm(x + a f(x)) ≠ LayerNorm(x + f(x)). Importantly, this is the case only when the residual connection is applied before the LayerNorm. If we apply LayerNorm first and only then the residual connection, we still get the scaling invariance of LayerNorm: x + LayerNorm(a f(x)) = x + LayerNorm(f(x)).
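A quick numerical check of both statements, reusing the epsilon-free toy LayerNorm from the earlier sketches:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def layer_norm(x, gamma=1.0, beta=0.0):
    return gamma * (x - x.mean()) / x.std() + beta

def f(x, W):
    return relu(W @ x)

W = rng.normal(size=(8, 8))
x = rng.normal(size=8)
a = 0.1

# Residual added before LayerNorm: weight scaling changes the output.
print(np.allclose(layer_norm(x + f(x, a * W)), layer_norm(x + f(x, W))))  # False

# LayerNorm applied before adding the residual: weight scaling is invisible again.
print(np.allclose(x + layer_norm(f(x, a * W)), x + layer_norm(f(x, W))))  # True
```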

The first variant is often referred to as the pre-norm variant (more precisely, it is actually x + f(LayerNorm(x)) that is called this way, but we can attribute the LayerNorm to the previous layer and take the next layer’s LayerNorm, yielding the above expression, apart from the edge cases of the first and last layers). The second variant is called the post-norm variant. These terms are most often used for transformer architectures, which are outside the scope of this article. Still, it is interesting to mention that a few works, such as [Xiong et al., 2020], found that pre-norm is easier to optimize (they discuss different reasons for the difficulty). Note, however, that this may not be related to the scale invariance discussed here: transformer pre-training datasets often contain huge amounts of data, so overfitting becomes less of a problem, and we haven’t discussed transformer architectures per se. It is nonetheless something to think about.

In this article, we saw some interesting properties of feedforward neural networks without pre-norm residual connections. Specifically, we saw that if they don’t contain LayerNorm, they propagate input scaling and weight scaling to the output. If they do contain LayerNorm, they are scale-invariant, and weight/input scaling does not affect the output at all. We used this property to show that solutions arbitrarily close to optimal for such networks can avoid essentially any weight norm penalty, so the network can converge to the same solution it would have converged to without the penalty. While this is a statement about optimality, there is still the question of whether these solutions are actually found by gradient descent. We might tackle this in a future post. We also discussed how (pre-norm) residual connections break the scale invariance and thus seem to resolve the theoretical problem above. It is still possible that there are similar properties, ones that residual connections do not fix, that I failed to consider. As always, thank you for reading, and I’ll see you in the next post!

F. Liu, X. Ren, Z. Zhang, X. Sun, and Y. Zou. Rethinking residual connection with layer normalization, 2020.

B. Neyshabur. Implicit regularization in deep learning, 2017. URL https://arxiv.org/abs/1709.01953.

R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T.-Y. Liu. On layer normalization in the transformer architecture, 2020. URL https://arxiv.org/abs/2002.04745.

