
Paper Explained — High-Resolution Image Synthesis with Latent Diffusion Models | by Mario Namtao Shianti Larcher | Mar, 2023



Part of Fig. 13. from High-Resolution Image Synthesis with Latent Diffusion Models, generated with the prompt “An oil painting of a latent space”.

As I write this article, OpenAI’s chatbot, ChatGPT, continues to gain traction with its integration into Microsoft products used by over a billion people. While Google has recently launched its own AI assistant, Bard, and other companies are also making advances in the field, OpenAI remains at the forefront with no clear contender in sight. One might assume that OpenAI’s DALL·E, its generative model for images, would be similarly dominant in the field of conditional and unconditional image generation. However, it’s actually an open-source alternative, Stable Diffusion, that’s taking the lead in popularity and innovation.

This article delves deep into the scientific paper behind Stable Diffusion, aiming to provide a clear and comprehensive understanding of the model that’s revolutionizing the world of image generation. While other articles provide high-level explanations of the technology, this piece goes beyond the surface to explore often overlooked details.

Before delving into the methodology presented in the scientific paper High-Resolution Image Synthesis with Latent Diffusion Models, it’s essential to understand the key issues this work addresses.

Over the years, image generation has been tackled mainly through four families of models: Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Autoregressive Models (ARMs), and more recently, Diffusion Probabilistic Models (DMs).

Generative Adversarial Networks (GANs)

Since their first appearance in 2014, Generative Adversarial Networks (GANs) have been one of the dominant approaches to image generation. While GANs show promising results for data with limited variability, they come with several issues: the most well-known is mode collapse, where the generator produces a limited range of outputs instead of a diverse set of images, and, in general, their training is often unstable.

Mode collapse: this phenomenon occurs when the generator learns to produce only a limited number of outputs that reliably fool the discriminator. In general, GANs struggle to capture the full data distribution.

Variational Autoencoders (VAEs)

Variational Autoencoders (VAEs) are an alternative to GANs that offer several advantages. They do not suffer from mode-collapse and can efficiently generate high-resolution images. However, their sample quality is not always comparable to that of GANs.

Autoregressive Models (ARMs)

Autoregressive Models (ARMs) are excellent at density estimation and have achieved remarkable performance in this area. However, their computationally demanding architectures and sequential sampling process limit them to generating low-resolution images.

Diffusion Probabilistic Models (DMs)

Diffusion models have made significant progress in both density estimation and sample quality, but because they operate in pixel space, adding noise to and removing noise from a tensor of the same size as the original image, they suffer from slow inference and high computational cost. For instance, even a relatively small image such as a 512×512 RGB image corresponds to a tensor of roughly 800,000 values, which makes generating larger images computationally demanding both during training (gradients must be propagated through full-resolution tensors) and during inference (the iterative denoising is repeated at full resolution at every step).
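To get a feel for the numbers, here is a quick back-of-the-envelope comparison. This is only an illustrative sketch: the 64×64×4 latent shape assumes the f=8 autoencoder with 4 latent channels discussed later in the article (the configuration also used by Stable Diffusion), not a value stated in this paragraph.

# Rough comparison between pixel space and an assumed f=8, 4-channel latent space.
pixel_values = 512 * 512 * 3    # 786,432 values for a 512x512 RGB image (~800k)
latent_values = 64 * 64 * 4     # 16,384 values for the corresponding latent
print(f"pixel space:  {pixel_values:,} values")
print(f"latent space: {latent_values:,} values")
print(f"ratio:        {pixel_values // latent_values}x fewer values to denoise")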

Conditioning Mechanism

Generating images based on a textual description or on the style of another image is often desirable, but conditioning the result on one or more inputs has been a challenge for previous approaches.

Fig. 3. from High-Resolution Image Synthesis with Latent Diffusion Models.

To summarize the approach proposed by the scientific paper High-Resolution Image Synthesis with Latent Diffusion Models, we can break it down into four main steps:

  1. The first step is to extract a more compact representation of the image using the encoder E located in the upper left corner of the figure above. Unlike other methods, Latent Diffusion works in the latent space defined by the encoder, rather than in pixel space.
  2. Next, Gaussian noise is added to this latent representation in the upper middle part of the figure, as part of the diffusion process that goes from z to zT (assuming T steps of noise addition are applied).
  3. The zT representation is then passed through a U-Net, located in the lower middle part of the figure. The U-Net has the role of predicting zT-1 from zT, and this denoising step is repeated until we arrive back at z, which is then returned from latent space to pixel space via the decoder D.
  4. Finally, the approach allows for arbitrary conditioning on various input modalities such as semantic maps or text. This is achieved by first transforming the input y with a dedicated encoder τθ and then mapping it to the intermediate layers of the U-Net with the same cross-attention mechanism used by the Transformer architecture.

With this general overview, we can now take a closer look at each of these steps in more detail.
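Before zooming in, here is how the four steps fit together at a glance. The following is a minimal sketch, not the authors’ implementation: encoder, decoder, unet and tau are placeholders for E, D, the time-conditional U-Net ϵθ and the conditioning encoder τθ, all assumed to be already trained, and betas is the variance schedule introduced in the diffusion section below.

import torch

@torch.no_grad()
def latent_diffusion_walkthrough(x, y, encoder, decoder, unet, tau, betas):
    """Sketch of the four steps above: encode, diffuse, denoise, decode."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    T = betas.shape[0]

    # 1. Compress the image into latent space with the encoder E.
    z0 = encoder(x)

    # 2. Forward diffusion: jump straight to zT with the closed-form q(zT | z0).
    eps = torch.randn_like(z0)
    z = alphas_cumprod[-1].sqrt() * z0 + (1 - alphas_cumprod[-1]).sqrt() * eps

    # 3. Reverse process: the U-Net iteratively denoises zT back towards z,
    #    conditioned on tau_theta(y) injected via cross-attention.
    cond = tau(y)
    for t in reversed(range(T)):
        t_batch = torch.full((z.shape[0],), t, device=z.device, dtype=torch.long)
        eps_pred = unet(z, t_batch, context=cond)
        mean = (z - (1 - alphas[t]) / (1 - alphas_cumprod[t]).sqrt() * eps_pred) / alphas[t].sqrt()
        noise = torch.randn_like(z) if t > 0 else torch.zeros_like(z)
        z = mean + betas[t].sqrt() * noise  # variance choice: sigma_t^2 = beta_t

    # 4. Decode the denoised latent back to pixel space with D.
    return decoder(z)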

Latent Diffusion explicitly separates the image compression phase, which removes high-frequency details (perceptual compression), from the generation phase, in which the model learns the semantic and conceptual composition of the data (semantic compression).

Objective Function

To train the autoencoder used for image compression, the authors follow the same approach as VQGAN, presented in Taming Transformers for High-Resolution Image Synthesis.

Fig. 2. from Taming Transformers for High-Resolution Image Synthesis.

In particular, the objective function to train the autoencoding model (E, D) is:

Eq. 25. from High-Resolution Image Synthesis with Latent Diffusion Models.

Let’s define x̂ as the reconstructed image D(E(x)). Then Lrec is a reconstruction loss (the squared error between x and x̂); Ladv is the adversarial loss, defined as log(1 − Dψ(x̂)); Dψ is a patch-based discriminator optimized to differentiate original images from reconstructions x̂ (so Dψ(x) tries to output 1 for the real image x and 0 for the reconstructed “fake” image x̂); and Lreg is a regularization loss.
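For readers who cannot see the equation image, the objective can be written out from the terms just defined (my reconstruction of Eq. 25 in LaTeX notation, to be checked against the paper):

L_{Autoencoder} = \min_{E, D} \max_{\psi} \Big( L_{rec}\big(x, D(E(x))\big) - L_{adv}\big(D(E(x))\big) + \log D_{\psi}(x) + L_{reg}\big(x; E, D\big) \Big)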

Regularization

The authors experiment with two different methods of regularization.

The first approach involves a low-weighted Kullback-Leibler term, similar to standard VAEs.

Kullback-Leibler (KL) Penalty: Kullback-Leibler divergence is a type of statistical distance between two distributions. In this case, the idea is to make the distribution of the latent variable z ~ N(Eµ, Eσ²) close to a standard normal distribution N(0, 1). Imposing this constraint regularizes the latent space by concentrating it, so that, for example, if z lies close to two other latents z1 and z2, then D(z) will itself have something in common with both D(z1) and D(z2).
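For a single latent dimension with predicted mean µ and variance σ², this penalty has the standard closed form used by VAEs (shown here for reference; in Latent Diffusion it enters the loss with a small weight, hence “low-weighted”):

D_{KL}\big(N(\mu, \sigma^2) \,\|\, N(0, 1)\big) = \tfrac{1}{2}\big(\mu^2 + \sigma^2 - \log \sigma^2 - 1\big)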

In the second approach, the latent space is regularized with a vector quantization layer.

Vector Quantization (VQ): VQ is the approach used by VQVAE, presented in the paper Neural Discrete Representation Learning, and by the already mentioned VQGAN. As visible in the VQGAN figure above, for each spatial position of the encoder output ẑ, the corresponding vector (whose size equals the number of channels of ẑ) is replaced with the vector closest to it in a learnable “codebook”. This limits the decoder’s possible inputs during inference, which can only be combinations of codebook vectors (a discretization, or quantization, of the latent space).
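As a rough illustration of that lookup step, here is a minimal sketch in PyTorch (not the VQGAN implementation; the straight-through gradient trick and the codebook/commitment losses are omitted):

import torch

def vector_quantize(z_hat, codebook):
    """Replace each spatial vector of z_hat with its nearest codebook entry.

    z_hat:    encoder output of shape (B, C, H, W)
    codebook: learnable embedding matrix of shape (K, C)
    """
    B, C, H, W = z_hat.shape
    flat = z_hat.permute(0, 2, 3, 1).reshape(-1, C)   # (B*H*W, C)
    dists = torch.cdist(flat, codebook)               # distance to every codebook entry, (B*H*W, K)
    indices = dists.argmin(dim=1)                     # index of the closest entry
    quantized = codebook[indices].reshape(B, H, W, C).permute(0, 3, 1, 2)
    return quantized, indices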

In the case of the VQ-regularized latent space, z is extracted before the quantization layer and the quantization operation is absorbed into the decoder, i.e., it can be interpreted as the first layer of D.

Fig. 2. from Denoising Diffusion Probabilistic Models.

Since this article deals with Latent Diffusion and not with diffusion models in general, I will only describe their most important aspects. First, let us distinguish between two processes: forward and reverse.

Forward Process

The forward (or diffusion) process, the one going from right to left in the figure, is a Markov chain: the image at time t depends only on the image at time t−1 and not on all the previous ones. At each step, xt is sampled according to the following transition probability:

Eq. 2. from Denoising Diffusion Probabilistic Models.
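Written out in LaTeX notation, that transition probability is:

q(x_t \mid x_{t-1}) = N\big(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\big)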

In the formula above, the βt define a variance schedule and can either be learned or held constant by treating them as hyperparameters. Another interesting property of the forward process is that xt can be sampled in closed form for an arbitrary timestep t. Using the notation αt := 1 − βt and ᾱt := α1·α2·…·αt, we have

Eq. 4. from Denoising Diffusion Probabilistic Models.
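That is, in LaTeX notation:

q(x_t \mid x_0) = N\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\big)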

So, to recap, in the forward process we can get the image at any time t by sampling from a Gaussian distribution with mean and variance defined by the formula above.

Reverse Process

Given the forward process, the reverse process also follows a Gaussian distribution: pθ(xt−1 | xt) = N(xt−1; µθ(xt, t), σt²·I).

As for the variance, the authors do not learn it but set it to σt²·I, where they note experimentally that both σt² = βt and σt² = β̃t = βt·(1 − ᾱt−1)/(1 − ᾱt) bring equivalent results.

Before seeing how the mean is parameterized, let us re-parameterize eq. 4 of the forward process as xt = √ᾱt·x0 + √(1 − ᾱt)·ϵ, with ϵ ~ N(0, I).

At this point we parameterize the mean as

µθ(xt, t) = (1/√αt)·(xt − (βt/√(1 − ᾱt))·ϵθ(xt, t)),

where ϵθ is an estimator of ϵ from xt; specifically, it is a variant of a time-conditional U-Net.

At this point we have all the elements to sample xt−1 conditioned on xt, since we know all the parameters of the Gaussian distribution introduced at the beginning of this description of the reverse process.

Without going into the mathematical details, the (simplified) objective to be minimized is:

Eq. 1. from High-Resolution Image Synthesis with Latent Diffusion Models.

with t uniformly sampled from {1, . . . , T}.
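Spelled out in LaTeX notation (Eq. 1 of the Latent Diffusion paper, matching DDPM’s simplified objective):

L_{DM} = E_{x,\, \epsilon \sim N(0,1),\, t}\Big[\, \big\| \epsilon - \epsilon_\theta(x_t, t) \big\|_2^2 \,\Big]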

Generative Modeling of Latent Representations

As already noted, Latent Diffusion works like a diffusion model, similar to the one explained earlier. However, it differs in that it starts from the latent representation z of the image obtained through an encoder (latent space), rather than from the image x (pixel space). This detail greatly reduces the computational burden, as the latent space is more compact than pixel space.

Given this, replacing xt with zt in the diffusion model objective, we have the new objective:

Eq. 2. from High-Resolution Image Synthesis with Latent Diffusion Models.
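Explicitly, again in LaTeX notation, with zt now produced from the encoder’s output E(x):

L_{LDM} := E_{E(x),\, \epsilon \sim N(0,1),\, t}\Big[\, \big\| \epsilon - \epsilon_\theta(z_t, t) \big\|_2^2 \,\Big]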

Before this study, there had been limited exploration of how to condition diffusion models on inputs beyond a class label or a blurred version of the input image. The approach proposed by Latent Diffusion is highly versatile: additional information is integrated directly into the intermediate layers of the U-Net using cross-attention, the same mechanism used in the Transformer architecture.

To be more specific, the input information (such as text) is first converted into an intermediate representation through a domain-specific encoder called τθ (an example will be provided later). This representation is then injected into the intermediate layers of the U-Net through a cross-attention layer, Attention(Q, K, V) = softmax(QKᵀ/√d)·V, with Q = W_Q·ϕi(zt), K = W_K·τθ(y), and V = W_V·τθ(y).

In the equation, ϕi(zt) represents the flattened intermediate representation of the U-Net, and the Ws are trainable projection matrices. Although the paper does not elaborate on this, the code implementation reveals that the output of the cross-attention layer is added back to the original U-Net activations (a residual connection). This can be seen in the following code snippet:

x = self.attn2(self.norm2(x), context=context) + x

Here, attn2 denotes the cross-attention layer, while context refers to τθ(y). Although the full implementation of this process is more complex, this is the crucial conceptual element. For a more in-depth understanding of this mechanism, please refer to the BasicTransformerBlock used in the model.
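To make the mechanism more concrete, here is a stripped-down cross-attention module in PyTorch. It is a minimal sketch for illustration, not the repository’s CrossAttention class: single head, no dropout, and the names are mine.

import torch
import torch.nn as nn

class SimpleCrossAttention(nn.Module):
    """Single-head cross-attention between U-Net features and a conditioning sequence."""

    def __init__(self, query_dim, context_dim, inner_dim):
        super().__init__()
        self.scale = inner_dim ** -0.5
        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)    # W_Q
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)  # W_K
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)  # W_V
        self.to_out = nn.Linear(inner_dim, query_dim)

    def forward(self, x, context):
        # x:       flattened U-Net features phi_i(z_t), shape (B, H*W, query_dim)
        # context: conditioning tokens tau_theta(y),    shape (B, L, context_dim)
        q = self.to_q(x)
        k = self.to_k(context)
        v = self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, H*W, L)
        return self.to_out(attn @ v)

The residual addition from the snippet above would then look like x = cross_attn(norm(x), context=cond) + x.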

Fig. 5. from High-Resolution Image Synthesis with Latent Diffusion Models.

The paper conducts numerous experiments to explore various methods of image generation, including unconditional generation, layout-to-image synthesis, spatial conditioning, super-resolution, inpainting, and more. To further highlight two important aspects of Latent Diffusion, we will focus on the well-known task of text-to-image.

The first crucial aspect to consider is how to convert text into a representation that can be passed to the cross-attention layer. The authors use the BERT-tokenizer and implement τθ as a Transformer to achieve this.

The second important aspect is how much the input image should be compressed through the encoder. The authors experiment with different downsampling factors f ∈ {1, 2, 4, 8, 16, 32}, and they conclude that 4 and 8 offer the best conditions for achieving high-quality synthesis results. The results shown above were obtained using LDM-8 (KL).

Latent Diffusion and the successive works inspired by this paper have produced astonishing results that were once considered unimaginable. Today, these models are no longer confined to research labs but are being integrated into popular products such as Adobe Photoshop. This development marks a significant milestone in the field of artificial intelligence and demonstrates its potential to impact various aspects of our lives.

However, despite the remarkable progress made in this area, there are still several challenges that need to be addressed, including copyright issues surrounding the use of images for training AI models and the biases that surface when large datasets are crawled from the internet. Notwithstanding these limitations, the potential of AI to democratize creativity and enable individuals to express themselves in novel and captivating ways is too significant to ignore.

