
Kaiming He Initialization in Neural Networks — Math Proof
by Ester Hlav, February 2023



Initialization techniques are one of the prerequisites for successfully training a deep learning architecture. Traditionally, weight initialization methods need to be compatible with the choice of an activation function as a mismatch can potentially affect training negatively.

ReLU is one of the most commonly used activation functions in deep learning. Its properties make it a very convenient choice for scaling to large neural networks. On one hand, it is inexpensive to compute during backpropagation because it is a piecewise-linear function whose derivative is a step function. On the other hand, ReLU helps reduce feature correlation because its output is non-negative, i.e. features can only contribute positively to subsequent layers. It is a prevalent choice in convolutional architectures, where the input dimension is large and neural networks tend to be very deep.

In “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”⁽¹⁾ by He et al. (2015), the authors present a methodology to optimally initialize neural network layers that use a ReLU activation function. This technique lets the neural network start training in a regime where the variance of the signal is preserved from layer to layer, in both the forward and backward passes, which empirically yields meaningful improvements in training stability and speed. In the following sections, we provide a detailed and complete derivation of the He initialization technique.

Notation

  • A layer in a neural network, composed of a weight matrix Wₖ and bias vector bₖ, undergoes two consecutive transformations. The first transformation is yₖ = xₖ Wₖ + bₖ, and the second is xₖ ₊ ₁ = f(yₖ)
  • xₖ is the actual layer and yₖ is the pre-activation layer
  • A layer has nₖ units, thus xₖ ∈ ℝ^{nₖ}, Wₖ ∈ ℝ^{nₖ × nₖ₊₁}, bₖ ∈ ℝ^{nₖ₊₁}
  • xₖWₖ + bₖ has dimension (1 × nₖ) × (nₖ × nₖ₊₁) + 1 × nₖ₊₁ = 1 × nₖ₊₁
  • The activation function f is applied element-wise and does not change the shape of a vector. As a result, xₖ₊₁ = f(xₖWₖ + bₖ) ∈ ℝ^{nₖ₊₁} (a short code sketch of these shapes follows this list)
  • For a neural network of depth n, the input layer is represented by x₀ and the output layer by xₙ
  • The loss function of the network is represented by L
  • Δx = ∂L/∂x denotes gradients of the loss function with respect to vector x
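To make the notation concrete, here is a minimal NumPy sketch of a single layer transformation with the shapes listed above; the sizes nₖ and nₖ₊₁ are arbitrary values chosen purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n_k, n_k1 = 4, 3                    # units of layer k and layer k+1

    x_k = rng.normal(size=(1, n_k))     # layer x_k, shape (1, n_k)
    W_k = rng.normal(size=(n_k, n_k1))  # weight matrix W_k, shape (n_k, n_{k+1})
    b_k = np.zeros((1, n_k1))           # bias b_k, initialized at zero

    y_k = x_k @ W_k + b_k               # pre-activation y_k, shape (1, n_{k+1})
    x_k1 = np.maximum(0.0, y_k)         # x_{k+1} = f(y_k) with f = ReLU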

Assumptions

  • Assumption 1:
    We assume for this initialization setup a non-linear activation function ReLU defined as f(x) = ReLU(x) = max(0, x). As a function defined piecewise on two intervals, its derivative has a value of 1 on the strictly positive half of ℝ and 0 on the strictly negative half. Technically, the derivative of ReLU is not defined at 0 because the one-sided limits are not equal, that is f’(0⁻) = 0 ≠ 1 = f’(0⁺). In practice, for backpropagation purposes, ReLU’(0) is taken to be 0.
  • Assumption 2:
    It is assumed that all inputs, weights, and layers in the neural network are mutually independent and identically distributed (iid) at initialization; the same is assumed for the gradients.
  • Assumption 3:
    The inputs are assumed to be normalized with zero mean, and the weights and biases are initialized from a symmetric distribution centered at zero, i.e. 𝔼[x₀] = 𝔼[Wₖ] = 𝔼[bₖ] = 0. This means that every pre-activation layer yₖ has an expectation of zero at initialization and a symmetric distribution, since the weights are drawn from a symmetric, zero-centered distribution and are independent of xₖ (these conventions are restated in symbols right after this list).
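In symbols, the activation function, the convention for its derivative, and the quantities fixed at initialization are:

    f(x) = \mathrm{ReLU}(x) = \max(0, x), \qquad
    f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \le 0 \end{cases}, \qquad
    \mathbb{E}[x_0] = \mathbb{E}[W_k] = \mathbb{E}[b_k] = 0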

Motivation

The aim of this proof is to determine the distribution of the weight matrix by finding Var[W] given two constraints:

  1. ∀k, Var[yₖ] = Var[yₖ₋₁], i.e. constant variance in the forward signal
  2. ∀k, Var[Δxₖ] = Var[Δxₖ₊₁], i.e. constant variance in the backward signal

Ensuring that the variance of both layers and gradients is constant throughout the network at initialization helps prevent exploding and vanishing gradients in neural networks. If the gain is above one, it will result in exploding gradients and optimization divergence, while if the gain is below one, it will result in vanishing gradients and halt learning. The above two equations ensure that the signal gain is precisely one.
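To see why a gain that deviates from one is so damaging, suppose each layer multiplies the signal variance by a constant gain g; this small illustrative calculation is not from the original article:

    \mathrm{Var}[y_n] = g^{\,n}\,\mathrm{Var}[y_0], \qquad 0.9^{50} \approx 5\times10^{-3}, \qquad 1.1^{50} \approx 117

Even a modest per-layer gain of 0.9 or 1.1 therefore shrinks or amplifies the signal by roughly two orders of magnitude over 50 layers, which is why the derivation below targets a gain of exactly one.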

The motivation as well as the derivations in this article follow the Xavier Glorot initialization⁽²⁾ paper published five years prior. While that earlier work uses post-activation layers for constant variance in the forward signal, the He initialization proof uses pre-activation layers. Similarly, for the backward signal, He’s derivation uses post-activation layers instead of the pre-activation layers used in Glorot’s initialization. Given that the two proofs share many similarities, looking at both helps build intuition for why controlling the variance of the weights is so important in any neural network. (See “Xavier Glorot Initialization in Neural Networks — Math Proof” for more details.)

I. Forward Pass

We are looking for Wₖ such that the variance of each subsequent pre-activation layer y is equal, i.e. Var[yₖ] = Var[yₖ₋₁].

We know that yₖ = xₖWₖ + bₖ.

For simplicity, we look at the i-th element of the pre-activation layer yₖ and apply the variance operator on both sides of the previous equation.

  • In the first step, we remove bₖ entirely since, following Assumption 3, it is initialized at zero. Additionally, we leverage the independence of W and x to transform the variance of the sum into a sum of variances, using Var[X + Y] = Var[X] + Var[Y] for X ⊥ Y.
  • In the second step, as the entries of W and x are i.i.d., each term in the sum is equal, hence the sum is simply nₖ times Var[xW].
  • In the third step, we use the identity Var[XY] = E[X²]E[Y²] – E[X]²E[Y]², valid for X ⊥ Y. This allows us to separate the contributions of W and x to the pre-activation layer’s variance.
  • In the fourth step, we leverage Assumption 3 of zero expectation for the weights at initialization, which cancels the second term. This leaves us with a single term, the product of the two second moments E[x²]E[W²].
  • In the fifth step, we transform E[W²] into a variance, since Var[X] = E[(X – E[X])²] = E[X²] when X has zero mean. We can now express the pre-activation layer’s variance as nₖ times the product of E[xₖ²] and Var[Wₖ] (the full chain of equalities is reconstructed right after this list).
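The corresponding chain of equalities, which the original article displays as an image, can be reconstructed as follows for the i-th component of yₖ, with the labels matching the steps above (in steps (2)–(5), x and W denote a single representative entry of the layer and weight matrix):

    \mathrm{Var}[y_k^i]
    = \mathrm{Var}\Big[\sum_{j=1}^{n_k} x_k^j W_k^{j,i} + b_k^i\Big]
    \overset{(1)}{=} \sum_{j=1}^{n_k} \mathrm{Var}\big[x_k^j W_k^{j,i}\big]
    \overset{(2)}{=} n_k\,\mathrm{Var}[x_k W_k]
    \overset{(3)}{=} n_k\big(\mathbb{E}[x_k^2]\,\mathbb{E}[W_k^2] - \mathbb{E}[x_k]^2\,\mathbb{E}[W_k]^2\big)
    \overset{(4)}{=} n_k\,\mathbb{E}[x_k^2]\,\mathbb{E}[W_k^2]
    \overset{(5)}{=} n_k\,\mathbb{E}[x_k^2]\,\mathrm{Var}[W_k]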

Finally, in order to link Var[yₖ] to Var[yₖ₋₁], we express the second moment E[xₖ²] in terms of Var[yₖ₋₁] in the following steps using the Law of the Unconscious Statistician (LOTUS).

The theorem states that the expectation of a function of a random variable can be written as an integral of that function against the variable’s probability density p. As we know that xₖ = max(0, yₖ₋₁), we can rewrite E[xₖ²] as an integral over the density of y.

  • In the sixth step, we simplify the integral using the fact that max(0, y) is zero on ℝ⁻.
  • In the seventh step, we leverage the statistical property of y as a symmetric random variable, which hence has a symmetric density function p, and note that the integrand is an even function. Even functions are symmetric with respect to 0, which means that integrating from 0 to a is the same as integrating from -a to 0. We use this trick to rewrite the integral as one half of the integral over all of ℝ.
  • In the ninth and tenth steps, we recognize this integral as the integral of a function of a random variable. By applying LOTUS — this time from right to left — we can change it back into an expectation of that function of the random variable y. As the expectation of the square of a zero-mean variable, this is exactly a variance (the full chain is reconstructed right after this list).
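In formulas (reconstructing the steps that the original article displays as an image), with p the density of yₖ₋₁:

    \mathbb{E}[x_k^2]
    = \mathbb{E}\big[\max(0, y_{k-1})^2\big]
    = \int_{\mathbb{R}} \max(0, y)^2\,p(y)\,dy
    \overset{(6)}{=} \int_0^{+\infty} y^2\,p(y)\,dy
    \overset{(7)}{=} \tfrac{1}{2}\int_{\mathbb{R}} y^2\,p(y)\,dy
    = \tfrac{1}{2}\,\mathbb{E}[y_{k-1}^2]
    = \tfrac{1}{2}\,\mathrm{Var}[y_{k-1}]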

We finally get to put it all together using the results of steps five and ten — the variance of a pre-activation layer is directly linked to the variance of the previous pre-activation layer as well as the variance of the layer’s weights. Since we require that Var[yₖ] = Var[yₖ₋₁], it follows that a layer’s weight variance Var[Wₖ] should be 2/nₖ.

In summary, here is again the whole derivation of the forward propagation reviewed in this section:
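Putting the forward-pass steps together (reconstructing the summary that the original article shows as an image):

    \mathrm{Var}[y_k]
    = n_k\,\mathbb{E}[x_k^2]\,\mathrm{Var}[W_k]
    = \frac{n_k}{2}\,\mathrm{Var}[W_k]\,\mathrm{Var}[y_{k-1}],
    \qquad
    \mathrm{Var}[y_k] = \mathrm{Var}[y_{k-1}]
    \;\Longrightarrow\;
    \mathrm{Var}[W_k] = \frac{2}{n_k}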

II. Backward Pass

We are looking for Wₖ such that Var[Δxₖ] = Var[Δxₖ₊₁].

Here, xₖ₊₁ = f(yₖ) and yₖ = xₖWₖ + bₖ.

Before applying the variance operator, let us first calculate the partial derivatives of the loss L with respect to x and y: Δxₖ and Δyₖ.

  • First, we use the chain rule and the fact that the derivative of a linear product is its linear coefficient — in this case, Wₖ.
  • Second, we leverage Assumption 2, stating that gradients and weights are independent of each other at initialization. Using independence, the expectation of the product becomes the product of expectations, which is equal to zero since the weights are initialized with zero mean. Hence, the expectation of the gradient of L w.r.t. x is zero.
  • Third, we use the chain rule to link Δyₖ and Δxₖ₊₁, as the partial derivative of x w.r.t. y is ReLU’s derivative evaluated at y.
  • Fourth, recalling the derivative of ReLU, we compute the expectation of Δyₖ using the previous equation. As f’(yₖ) is 0 or 1, each with probability ½, we can write the expectation as a sum of two terms, one over ℝ⁺ and one over ℝ⁻. From the previous calculations, we know that the expectation of Δxₖ₊₁ is zero, and we can thus confirm that both gradients have a mean of 0.
  • Fifth, we use the same rule as before to write the expectation of a square as a variance, here for Δyₖ.
  • Sixth, we leverage Assumption 2, stating that gradients are independent at initialization, to split the variance of the product of Δxₖ₊₁ and f’(yₖ). Further simplification stems from Assumption 3, and we can finally compute the expectation of the squared ReLU derivative, which equals ½ given the even split between the positive and negative intervals (the corresponding equations are reconstructed below).

Finally, using the results gathered above, and reapplying the iid assumption, we conclude that the backward pass yields a result analogous to the forward pass: given Var[Δxₖ] = Var[Δxₖ₊₁], the variance of any layer’s weights Var[Wₖ] must equal 2/nₖ₊₁, where nₖ₊₁ is the number of units of the following layer.

To summarize, here is a reminder of the important step-by-step calculations included within this backward pass section:
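Reconstructed from the steps above (the original article shows these as images); a representative entry is used where convenient and ⊙ denotes element-wise multiplication:

    \Delta x_k = \Delta y_k\,W_k^{\top}
    \;\Longrightarrow\;
    \mathbb{E}[\Delta x_k] = \mathbb{E}[\Delta y_k]\,\mathbb{E}[W_k] = 0

    \Delta y_k = f'(y_k)\odot\Delta x_{k+1}
    \;\Longrightarrow\;
    \mathbb{E}[\Delta y_k] = \mathbb{E}[f'(y_k)]\,\mathbb{E}[\Delta x_{k+1}] = \tfrac{1}{2}\cdot 0 = 0

    \mathrm{Var}[\Delta y_k]
    = \mathbb{E}[\Delta y_k^2]
    = \mathbb{E}\big[f'(y_k)^2\big]\,\mathbb{E}[\Delta x_{k+1}^2]
    = \tfrac{1}{2}\,\mathrm{Var}[\Delta x_{k+1}]

    \mathrm{Var}[\Delta x_k]
    = n_{k+1}\,\mathrm{Var}[W_k]\,\mathrm{Var}[\Delta y_k]
    = \frac{n_{k+1}}{2}\,\mathrm{Var}[W_k]\,\mathrm{Var}[\Delta x_{k+1}]
    \;\Longrightarrow\;
    \mathrm{Var}[W_k] = \frac{2}{n_{k+1}}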

In the two previous sections, we concluded the following for both backward and forward setups:
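That is, the forward and backward constraints respectively yield:

    \text{forward: } \mathrm{Var}[W_k] = \frac{2}{n_k}
    \qquad\qquad
    \text{backward: } \mathrm{Var}[W_k] = \frac{2}{n_{k+1}}

As noted in He et al.⁽¹⁾, satisfying either one of the two conditions alone is enough to keep the product of gains across the network well behaved; the remainder of this article uses the forward (fan-in) version, 2/nₖ.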

It is interesting to note that this result is different from the Glorot initialization⁽²⁾, where the authors essentially have to average the two distinct results obtained in the forward and backward passes. Furthermore, we observe that the variance in the He method is doubled, which, intuitively, is due to the fact that ReLU’s zero negative section reduces variance by a factor of two.

Subsequently, knowing the variance of the distribution, we can now initialize the weights with either a normal distribution N(0, 𝜎²) or a uniform distribution U(-a, a). Empirically, there is no evidence that one distribution is superior to the other, and the performance improvement seems to come down solely to the symmetry and scale properties of the chosen distribution. Furthermore, we do need to keep in mind Assumption 3, restricting the choice to distributions that are symmetric and centered at 0.

  • For Normal distribution N(0, 𝜎²)

If X ~ N(0, 𝜎²), then Var[X] = 𝜎², thus the variance and standard deviation of the weight matrix can be written as:
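In formulas (reconstructing the expression that the original article displays as an image), matching Var[Wₖ] = 2/nₖ from the forward derivation:

    \sigma^2 = \mathrm{Var}[W_k] = \frac{2}{n_k},
    \qquad
    \sigma = \sqrt{\frac{2}{n_k}}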

We can therefore conclude that Wₖ follows a normal distribution with coefficients:
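That is (reconstructing the displayed result):

    W_k \sim \mathcal{N}\!\left(0,\; \frac{2}{n_k}\right)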

As a reminder, nₖ is the number of inputs of the layer k.

  • For Uniform distribution U(-a, a)

If X ~ U(-a, a), then using the formula for the variance of a uniformly distributed random variable, we can find the bound a:
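Reconstructing the displayed calculation:

    \mathrm{Var}[X] = \frac{\big(a-(-a)\big)^2}{12} = \frac{a^2}{3} = \frac{2}{n_k}
    \;\Longrightarrow\;
    a = \sqrt{\frac{6}{n_k}}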

Finally, we can conclude that Wₖ follows a uniform distribution with coefficients:
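That is (reconstructing the displayed result):

    W_k \sim \mathcal{U}\!\left(-\sqrt{\frac{6}{n_k}},\; +\sqrt{\frac{6}{n_k}}\right)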

This article provides a step-by-step derivation of why the He initialization method is optimal for neural networks that use ReLU activation functions, given the constraint that both the forward and backward passes preserve variance.

The methodology of this proof also extends to the broader family of linear rectifiers, like PReLU (discussed in ⁽¹⁾ by He et al.) or Leaky ReLU (which allows a small gradient to flow on the negative interval). Similar optimal variance formulas can be derived for these variants of the ReLU activation function.
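As a closing illustration — not part of the original article — here is a minimal NumPy sketch of He-normal and He-uniform initialization, together with an empirical check that the pre-activation variance stays roughly constant through a stack of ReLU layers. The function names and the network sizes are arbitrary choices made for this example.

    import numpy as np

    def he_normal(fan_in, fan_out, rng):
        # Var[W] = 2 / fan_in  ->  std = sqrt(2 / fan_in)
        return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

    def he_uniform(fan_in, fan_out, rng):
        # Var[U(-a, a)] = a^2 / 3 = 2 / fan_in  ->  a = sqrt(6 / fan_in)
        a = np.sqrt(6.0 / fan_in)
        return rng.uniform(-a, a, size=(fan_in, fan_out))

    rng = np.random.default_rng(0)
    n, depth, batch = 512, 20, 10_000
    x = rng.normal(size=(batch, n))          # normalized inputs (Assumption 3)
    for k in range(depth):
        W = he_normal(n, n, rng)             # or he_uniform(n, n, rng)
        y = x @ W                            # biases initialized at zero
        print(f"layer {k:2d}  Var[y] = {y.var():.3f}")
        x = np.maximum(0.0, y)               # ReLU

With either initializer, the printed Var[y] stays close to a constant across depth, whereas scaling the weights by even a slightly different factor makes the variance grow or shrink geometrically with k.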

