
Diffusion Probabilistic Models and Text-to-Image Generation
by Cheng | March 2023



Figure 1. Text-to-Image Generation. Image made by author.

If you are an avid follower of the newest CV papers, you may be surprised at the stunning results of generative networks in creating images. Much of the earlier literature was based on the groundbreaking generative adversarial network (GAN) idea, but that is no longer the case for recent papers. In fact, if you look closely at the newest papers such as Imagen and Stable Diffusion, you will constantly see an unfamiliar term: the diffusion probabilistic model.

This article dives into the very basics of this newly trending model, gives a brief overview of how it is trained, and surveys the exciting applications that have quickly followed.

Figure 2. Overview of Denoising Diffusion Probabilistic Models. Image Retrieved from: https://arxiv.org/abs/2006.11239.

Consider an image to which a small amount of Gaussian noise is added. The image may become a little noisy, but the original content can most likely still be recognised. Now repeat the step again and again; eventually the image becomes almost pure Gaussian noise. This is known as the forward process of a diffusion probabilistic model.
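To make this concrete, here is a minimal PyTorch sketch of the forward process (the names `betas`, `alpha_bars`, and `q_sample` are illustrative choices of mine, not from any particular codebase). It uses the standard closed-form result that running the forward chain for t steps is equivalent to a single Gaussian perturbation of the original image:

```python
import torch

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule, as in the DDPM paper
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products ᾱ_t

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) in closed form: x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)  # broadcast over a (B, C, H, W) batch
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```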

The goal is simple: by leveraging the fact that the forward process is a Markov chain (each noisy state depends only on the state immediately before it, not on the rest of the history), we can actually learn a reverse process that slightly denoises the image at the current step.

Given a properly learnt reverse process and a random Gaussian noise image, we can now repeatedly apply the denoising step and ultimately obtain an image that is very similar to the data distribution the process was trained on; in other words, a generative model.
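A minimal sampling loop might look like the sketch below, reusing the schedule variables above and assuming a trained noise-prediction network `eps_model` (a stand-in name; the simplified update follows the DDPM ancestral sampling rule):

```python
@torch.no_grad()
def sample(eps_model, shape=(1, 3, 64, 64)):
    """Start from pure Gaussian noise and iteratively denoise."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch)  # predict the noise present at step t
        alpha, a_bar = alphas[t], alpha_bars[t]
        # mean of the learnt reverse transition p(x_{t-1} | x_t)
        x = (x - (1.0 - alpha) / (1.0 - a_bar).sqrt() * eps) / alpha.sqrt()
        if t > 0:  # add fresh noise on every step except the last
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```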

One advantage of diffusion models is that training can be done by simply picking a random timestep in the middle of the chain for optimisation (instead of having to fully reconstruct the image end-to-end). The training itself is also much more stable than that of GANs, where small hyperparameter differences can easily lead to mode collapse.
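This is what makes each training step so simple. A sketch of one optimisation step, again assuming the hypothetical `eps_model` and reusing `q_sample` from above:

```python
import torch.nn.functional as F

def train_step(eps_model, optimizer, x0):
    """One training step: predict the noise added at a random timestep."""
    t = torch.randint(0, T, (x0.shape[0],))      # a random timestep per image
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)                 # noise the batch in closed form
    loss = F.mse_loss(eps_model(x_t, t), noise)  # the simple ε-prediction objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```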

Note that this is a very high-level overview of what a denoising diffusion probabilistic model looks like. For the mathematical details, please refer to here and here.

Figure 3. Results produced by Imagen. The text prompts are below the images. Image retrieved from: https://arxiv.org/abs/2205.11487.

The idea of denoising diffusion models for image generation was first published in 2020, but it was not until the recent Google paper Imagen that the field truly blew up.

Like GANs, diffusion models can also be conditioned on prompts such as images and text. The Google Research Brain team suggested that large frozen language models are in fact great encoders for providing the text conditioning for photorealistic generation.
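To give a flavour of what conditioning on a frozen language model means in practice, here is a hedged sketch using a small T5 encoder from the `transformers` library (Imagen itself uses a much larger T5 variant, and `cond_eps_model` is a hypothetical conditional denoiser, not Imagen's actual architecture):

```python
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-small")
text_encoder = T5EncoderModel.from_pretrained("t5-small")
text_encoder.requires_grad_(False)  # frozen: only the diffusion model is trained

def encode_prompt(prompt):
    """Turn a text prompt into a sequence of embeddings the denoiser can attend to."""
    tokens = tokenizer(prompt, return_tensors="pt")
    return text_encoder(**tokens).last_hidden_state  # shape (1, seq_len, d_model)

# the denoiser receives the embeddings (e.g. via cross-attention) at every step:
# eps = cond_eps_model(x_t, t, text_emb=encode_prompt("a corgi riding a skateboard"))
```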

Figure 4. Overview of the DreamFusion pipeline. Image retrieved from: https://arxiv.org/abs/2209.14988.

As with numerous computer vision trends, excellent performance in the two-dimensional domain leads to ambitions of extending into 3D, and diffusion models are no exception. Recently, Poole et al. proposed DreamFusion, a text-to-3D model built on the strong foundations of Imagen and NeRF.

For a brief overview of NeRF, please refer here.

Figure 4 shows the pipeline of DreamFusion. The pipeline starts with a randomly initialised NeRF. Based on the generated density, albedo, and normals (with a given light source), the network outputs the shading and subsequently the colour of the NeRF from a particular camera angle. The rendered image is combined with Gaussian noise, and the goal is to utilise a frozen Imagen model to reconstruct the image and subsequently update the NeRF model.
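The update rule the paper uses for this is called Score Distillation Sampling (SDS): instead of backpropagating through the diffusion model's sampler, the difference between the predicted and the injected noise is used directly as a gradient on the rendered pixels. A rough sketch under those assumptions (`render_nerf` and `frozen_eps_model` are hypothetical stand-ins for the differentiable renderer and the frozen text-conditioned Imagen denoiser):

```python
def sds_step(optimizer, camera, text_emb, w=1.0):
    """One Score Distillation Sampling step, roughly following DreamFusion."""
    optimizer.zero_grad()
    x = render_nerf(camera)                       # differentiable render from a random view
    t = torch.randint(20, T - 20, (1,))           # avoid the most extreme timesteps
    noise = torch.randn_like(x)
    x_t = q_sample(x, t, noise)                   # noise the render, as in training
    with torch.no_grad():
        eps = frozen_eps_model(x_t, t, text_emb)  # the frozen denoiser's prediction
    grad = w * (eps - noise)                      # SDS gradient w.r.t. the rendered image
    x.backward(gradient=grad)                     # flows back into the NeRF weights
    optimizer.step()
```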

Figure 5. Results of DreamFusion. Image retrieved from: https://arxiv.org/abs/2209.14988.

Some of the stunning 3D results are presented in the gallery shown in Figure 5, with consistent colours and shapes of each object fully portrayed from a simple text prompt.

Recent work such as Magic3D further improved the pipeline by making the reconstruction faster and much more fine-grained.

And there you have it — an overview of the progression in diffusion models for image generation. When simple words transform into vivid images, it becomes much easier for everyone to imagine and paint their craziest thoughts.

“Writing is the painting of the voice” — Voltaire

Thank you for making it this far 🙏! I regularly write about different areas of computer vision/deep learning, so join and subscribe if you are interested to know more!

