Getting Started with Normalizing Flows: Linear Algebra & Probability
By Saptashwa Bhattacharyya, September 2022


Change of Variables Rule, Bijection & Diffeomorphism

The calming flow (Credit: Author)

The basis of generative modelling is understanding the distribution from which the data samples came. In one of my previous posts, generative modelling was demonstrated by going through the steps of the Expectation-Maximization algorithm, where we assumed that latent variables give rise to the observed data. Other neural-net-based approaches, such as Variational Auto-Encoders (VAE) and Generative Adversarial Networks (GAN), have shown spectacular applications, but they lack one key feature: they do not allow exact evaluation of the probability density of new points. With Normalizing Flows it is possible to perform both sampling and density estimation through tractable distributions.

To understand the fundamentals of Normalizing Flows, we discuss here the undergraduate-level probability and linear algebra needed to grasp, step by step, how distributions are transformed from simple to complex. What can you expect to learn from this post?

  1. What are bijection and diffeomorphism?
  2. Basics of linear algebra and the Jacobian.
  3. Transforming probability distributions with the Jacobian.
  4. Where do Normalizing Flows fit into the previous concepts?
  5. Checking our understanding using the TensorFlow Probability library.

1. Bijection & Diffeomorphism:

A function f: A → B is bijective if the elements of the two sets (A, B) are in perfect one-to-one correspondence. The most important property of a bijective function is the existence of an inverse function, which can undo the action of the function.

Example: The function from set {1,2,3,4} to set {8,9,10,11} defined by the formula f(x)=x+7 is a bijection.

Diffeomorphism: Consider a D-dimensional vector x over which we would like to define a joint distribution. The main idea of flow-based modelling is to express x as a transformation ϕ of a real vector u sampled from p_u(u):

x = ϕ(u), where u ∼ p_u(u) …… (Eq. 1)

Here p_u(u) is called the base distribution of the flow model. The defining property of flow-based models is that the transformation ϕ must be invertible, and both ϕ and its inverse ϕ⁻¹ must be differentiable. Such transformations are called diffeomorphisms, and they require that u be D-dimensional as well [1].

2. Change of Variables & Jacobian:

Here we take help from our undergrad linear algebra class.

Consider a linear transformation T: R² → R² (Rⁿ denotes the real coordinate space of dimension n). Such a linear transformation can be described by a 2×2 matrix A. Writing ordered pairs as column vectors, we have:

Eq. 2.1: Linear Transformation and Jacobian
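In symbols, with the matrix entries named as in the next paragraph, Eq. 2.1 reads:

T\begin{pmatrix} x \\ y \end{pmatrix}
= A\begin{pmatrix} x \\ y \end{pmatrix}
= \begin{pmatrix} a & b \\ c & d \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix}
= \begin{pmatrix} ax + by \\ cx + dy \end{pmatrix}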

The matrix A sends the unit square (the square whose two sides are the standard unit vectors i and j) to a parallelogram whose two sides are the columns of A, namely [a c]ᵀ and [b d]ᵀ. The matrix of partial derivatives of T (a constant matrix when T is linear) is the Jacobian matrix of T, and its determinant is the Jacobian determinant. The area of this parallelogram is |det(A)|, the absolute value of the determinant of A.

Generally, if D is any region in R² and D_1 = T(D) is its image under this linear transformation, then Area(D_1) = |det(A)| Area(D). We can verify this with the familiar change from Cartesian to polar coordinates in R². The transformation rules are x = r cos θ, y = r sin θ, and the Jacobian matrix of the transformation is:

Eq. 2.2: Jacobian for transforming from cartesian to polar coordinate system
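Written out, that Jacobian and its determinant are:

J = \frac{\partial(x, y)}{\partial(r, \theta)}
  = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & \phantom{-}r\cos\theta \end{pmatrix},
\qquad
\det J = r\cos^{2}\theta + r\sin^{2}\theta = r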

Using this Jacobian, the area element becomes dA(x, y) = |det J| dA(r, θ) = r dr dθ. Finally, the change-of-variables formula is very useful in multivariable calculus, and we can summarize everything as follows:

Let T be a diffeomorphism from D_1 to D (if T is r times continuously differentiable, it is called a Cʳ diffeomorphism). For any continuous function f on D:

Eq. 2.3: Transformation rule using Jacobian

This expression generalizes to n dimensions.
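As a quick numerical sanity check (not from the original post), we can map the unit square through an arbitrary invertible matrix and compare the area of the resulting parallelogram with |det(A)|:

import numpy as np

# Illustrative check of Area(T(D)) = |det(A)| * Area(D) for a linear map T(x) = A x.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])                      # arbitrary invertible 2x2 matrix

# Corners of the unit square D = [0, 1]^2 (Area(D) = 1), mapped through T.
corners = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]]) @ A.T

# Shoelace formula for the area of the resulting parallelogram.
x_c, y_c = corners[:, 0], corners[:, 1]
area = 0.5 * abs(np.dot(x_c, np.roll(y_c, -1)) - np.dot(y_c, np.roll(x_c, -1)))

print(area, abs(np.linalg.det(A)))              # both print 6.0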

3.1. Probability & Change of Variables:

We extend the above concepts from single/multi-variable functions to probability distributions. Consider a simple transformation: a random variable U that is uniformly distributed over the unit cube, u ∈ [0,1]³. We can scale U by a factor of n to get a new random variable X (as shown in Fig. 1 below).

Eq. 3.1.1: Scaling a random variable by a factor n

Since the total probability is conserved, we can write:

Eq. 3.1.2: Probability sums up to 1

Since we started from a random variable on a cube of unit volume (V_u), scaling the random variable would also scale the probability density as below:

Eq. 3.1.3: Probability density shrinks when n>1
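Concretely for the cube example: p_U(u) = 1 on [0, 1]³, and the map x = nu has Jacobian determinant n³, so

p_X(x) = p_U(u)\,\left|\det\frac{\partial x}{\partial u}\right|^{-1} = \frac{1}{n^{3}},
\qquad x \in [0, n]^{3}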

The previous discussion (Fig. 1) of the transformation u → x tells us how much du is shrunk or stretched into dx when p(u) → p(x) under the invertible mapping ϕ (Fig. 2).

Why is this important? Because this is the starting point of Normalizing Flow. This is how we started from Eq. 1 in the Diffeomorphism section before. We consider a real vector u sampled from a probability distribution p_u and a transformation function ϕ that changes u→x. For flow-based models, ϕ is a diffeomorphism.

Fig. 1: Transforming probability distributions using a diffeomorphism: because the total probability is always 1, the space is shrunk/stretched after the transformation. (Credit: Author’s slides)

3.2. Transforming Probability Distributions:

To derive some simple formulations for transforming probability distributions, we start from Eq. 3.1.2 in the previous section and consider a single region, so we can drop the integration:

Eq. 3.2.1: Conservation of probability leads to first log probability formula.

Recall our transformation from u to x via the mapping ϕ: x = ϕ(u), u = ϕ⁻¹(x). From the earlier linear-algebra discussion we know that the stretching/shrinking caused by a linear or non-linear transformation is quantified by the determinant of the Jacobian matrix. Very similarly, the transformation rule here becomes:

Eq. 3.2.2: Writing the equation before in terms of Jacobian following linear algebra discussion before.

The R.H.S. of this equation now depends entirely on the base variable u. The Jacobian matrix is defined as usual:

Eq. 3.2.3: Mighty Jacobian: What it represents…

Taking the logarithm of both sides of the previous equation, we can write it as below:

Eq. 3.2.4: Taking the logarithm of Eq. 3.2.2.

Since the transformation function ϕ is invertible, we can write the equation as below too:

Eq. 3.2.5: Due to invertible transformation we can add a few steps and re-write Eq. 3.2.4 in a different form.
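In one common notation, Eqs. 3.2.2–3.2.5 together read (with u = ϕ⁻¹(x) and J_ϕ the Jacobian of ϕ):

p_X(x) = p_U(u)\,\bigl|\det J_{\phi}(u)\bigr|^{-1}
       = p_U\!\bigl(\phi^{-1}(x)\bigr)\,\bigl|\det J_{\phi^{-1}}(x)\bigr|

\log p_X(x) = \log p_U(u) - \log\bigl|\det J_{\phi}(u)\bigr|
            = \log p_U\!\bigl(\phi^{-1}(x)\bigr) + \log\bigl|\det J_{\phi^{-1}}(x)\bigr|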

We will use these transformation rules later for transforming probability distributions.

4. Normalizing Flows:

With all the necessary basics and mathematical formulations in place, we are ready to introduce Normalizing Flows.

So far we have limited our discussion to a single transformation via an invertible function ϕ. A very important property of diffeomorphisms (transformations for which both the function and its inverse are differentiable) is that they are composable, i.e. given two such transformations ϕ₁ and ϕ₂, their composition ϕ₂ ∘ ϕ₁ (apply ϕ₁ first, then ϕ₂) is also invertible and differentiable. The inverse and the Jacobian determinant are given by —

Eq. 4: Combinations of multiple diffeomorphisms.
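In symbols, the composition rules summarized in Eq. 4 are:

(\phi_2 \circ \phi_1)^{-1} = \phi_1^{-1} \circ \phi_2^{-1},
\qquad
\det J_{\phi_2 \circ \phi_1}(u) = \det J_{\phi_2}\bigl(\phi_1(u)\bigr)\cdot\det J_{\phi_1}(u)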

The idea behind Normalizing Flows is to chain together multiple diffeomorphisms ϕ₁, ϕ₂, …, ϕ_k to gradually obtain x (the original data), which comes from a complex distribution p_X(x), starting from u, which comes from a simple base distribution p_U(u), as shown in the figure below —

Fig. 2: What a normalizing flow does: it transforms simple distributions (e.g. N(0, I)) into complex ones via a series of diffeomorphisms. (Credit: Author’s slides)

Simple Examples with the TensorFlow Probability Library:

We will start with some simple examples of forward and inverse transformations using the TensorFlow Probability library, which provides the Bijector module. Let’s load the necessary libraries:

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
tfb = tfp.bijectors

Transforming a Tensor: Let’s use two of the simplest bijection operations (scaling and shifting) and combine them to ‘forward’-transform a tensor, then perform the inverse operation to retrieve the original tensor:
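The original snippet is embedded as a gist; a minimal sketch of what it might look like with TFP bijectors (the shift/scale values match the description below, though the order in which the original applies them isn’t shown) is:

# Sketch: chain a Shift and a Scale bijector (assumed ordering: shift first, then scale).
x = tf.constant([1., 2., 3.])

shift = tfb.Shift(3.)                 # y = x + 3
scale = tfb.Scale(2.)                 # y = 2 * x
chain = tfb.Chain([scale, shift])     # tfb.Chain applies the list right-to-left

y = chain.forward(x)                  # [8., 10., 12.]
x_back = chain.inverse(y)             # recovers [1., 2., 3.]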

We start from a constant tensor [1., 2., 3.] and chain two bijection operations (a shift by 3 and a scale by 2) via tfb.Chain. We can then easily perform the forward and inverse operations and verify that, after the forward transformation, the inverse transformation returns the original tensor.

Transforming Distributions: In the previous example we transformed a tensor; here, let’s transform a distribution. We take samples from a normal distribution and apply the same chained bijection operations used above:
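A sketch of how this might look (the exact sample size and values in the original gist aren’t shown):

# Sketch: push normal samples through the same chained bijector.
normal = tfd.Normal(loc=0., scale=1.)
u = normal.sample(5)                  # a handful of samples, as in the post
x = chain.forward(u)                  # shifted and scaled samples

# TFP can also wrap base distribution + bijector into a single distribution:
transformed = tfd.TransformedDistribution(distribution=normal, bijector=chain)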

Taking only 5 random samples from a normal distribution, we apply the chained bijectors (scale & shift); below is the resulting histogram of the forward transformation.

Fig. 3: Sample from a normal distribution (left) and results after a forward transformation (3 units shift + 2 units scale).

We also check the log-probability rules described in Eqs. 3.2.4 & 3.2.5. These are easy to verify since bijectors in TensorFlow Probability provide the methods forward_log_det_jacobian and inverse_log_det_jacobian.
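For scalar bijectors this check is short; a sketch (event_ndims=0 because each event is a scalar):

# Sketch: verify the two log-probability forms against each other.
log_p_u = normal.log_prob(u)

# Eq. 3.2.4: log p_X(x) = log p_U(u) - log|det J_phi(u)|
log_p_x = log_p_u - chain.forward_log_det_jacobian(u, event_ndims=0)

# Eq. 3.2.5: log p_X(x) = log p_U(phi^{-1}(x)) + log|det J_{phi^{-1}}(x)|
log_p_x_alt = (normal.log_prob(chain.inverse(x))
               + chain.inverse_log_det_jacobian(x, event_ndims=0))

# log_p_x, log_p_x_alt and transformed.log_prob(x) all agree.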

What happens if one considers a 2D distribution instead of a 1D one? Let’s see an example: we start from a 2D uniform distribution and apply a scaling bijection as below:
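A sketch of the 2D example (scale values and sample size taken from the description that follows):

# Sketch: 100 samples from a 2D uniform distribution on [0, 1]^2.
uniform2d = tfd.Uniform(low=[0., 0.], high=[1., 1.])
x = uniform2d.sample(100)

# Diagonal scale matrix: -2 in the x-direction, 0.5 in the y-direction.
scale_bijector = tfb.ScaleMatvecDiag(scale_diag=[-2., 0.5])
y = scale_bijector.forward(x)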

The 2D uniform distribution is scaled by −2 in the x-direction and 0.5 in the y-direction. We can scatter-plot the 100 samples before and after the scale bijection:

Fig. 4: Transforming a 2D distribution via a forward operation (the transformation function is bijective).

A diagonal matrix always scales the two dimensions independently, i.e. a rectangular distribution remains rectangular, as we saw in the last example.

What if we want to change a rectangular distribution into a general quadrilateral one? Here we can use a lower-triangular matrix for the scaling bijection. Let’s see:
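A sketch with a lower-triangular scale matrix (the specific off-diagonal entry is made up for illustration; the original gist’s values aren’t shown):

# Sketch: a lower-triangular matrix couples the dimensions, shearing the rectangle.
scale_bijector2 = tfb.ScaleMatvecTriL(scale_tril=[[-2., 0. ],
                                                  [ 1., 0.5]])
y2 = scale_bijector2.forward(x)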

Starting from a rectangular distribution, we end up with a quadrilateral one, as below:

Fig. 5: Same as Fig. 4, but here we transform a rectangular distribution into a quadrilateral distribution. Samples from those distributions are shown here.

Just as before, we can always revert to the original distribution by applying the inverse transformation:

y2_inv = scale_bijector2.inverse(scale_bijector2(x))
Fig. 6: We can always perform an inverse transformation (as long as the transformation is bijective) and retrieve the original samples (drawn from a 2D uniform distribution).

We went through the core concepts and building blocks of Normalizing Flows, starting from linear algebra, the Jacobian, and the transformation of probability distributions. We have seen how the determinant of the Jacobian from linear algebra can be used seamlessly to transform probability distributions. Finally, we used the Bijector module in the TensorFlow Probability library to verify the concepts and formulas we derived.

In the next post, I will focus more on transforming complex distributions (e.g. from a Normal to a bi-modal distribution) using bijective functions, but the concepts and formulas used here will be useful at any stage of studying Normalizing Flows.

Cheers and stay strong!!

