
Understanding the Denoising Diffusion Probabilistic Model, the Socratic Way
by Wei Yi, February 2023



Photo by Chaozzy Lin on Unsplash
  1. Why the denoising diffusion model is designed in terms of the forward process, the forward process posteriors (which I will call the reverse of the forward process, to avoid the word “posteriors” because it confuses me), and the reverse process. And what is the relationship among these processes?
  2. How to derive the mysterious loss function. In the paper, there are many skipped steps in deriving the loss function Lₛᵢₘₚₗₑ. I went through all the derivations to fill in the missing steps. Now I realize the derivation of the analytical formula for Lₛᵢₘₚₗₑ tells a truly beautiful Bayesian story. And with all the steps filled in, the whole story is easy to understand.

Forward process turns a natural image into noise

Reverse process turns noise into a natural image

Image from paper Denoising Diffusion Probabilistic Models, page 2
  1. Given a natural image, say X₀, plugging X₀ into q(x₀=X₀) should return a probability between 0 and 1 indicating how likely this natural image is to occur among all natural images.
  2. Summing up the probabilities from q(x₀) over all natural images gives us 1.
  • x₀: there are observations of the random variable x₀. The observations are the actual images from the training dataset. We call x₀ an observational random variable.
  • x₁ to x_T: there are no observations for them, hence they are latent random variables.

Property 1: Full joint probability density function q(x_0:T)

Hand drawn illustration by me

Property 2: Marginal probability density function q(xₜ|x₀)
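This marginal has the closed form q(xₜ|x₀) = N(xₜ; √ᾱₜ·x₀, (1−ᾱₜ)·I), with αₜ = 1−βₜ and ᾱₜ the product of the αₛ up to step t (Eq. 4 in the paper). Here is a minimal NumPy sketch of sampling xₜ directly from x₀, assuming the paper's linear βₜ schedule (1e-4 to 0.02 over T = 1000 steps):

```python
import numpy as np

# Linear beta schedule from the DDPM paper: 1e-4 to 0.02 over T = 1000 steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)          # alpha-bar_t = product of alpha_s for s <= t

def sample_xt(x0, t, rng=np.random.default_rng()):
    """Sample x_t in one shot from the closed-form marginal
    q(x_t | x_0) = N(sqrt(alpha-bar_t) * x_0, (1 - alpha-bar_t) * I)."""
    eps = rng.standard_normal(x0.shape)  # epsilon ~ N(0, I)
    a_bar = alpha_bars[t - 1]            # t is 1-indexed, the array is 0-indexed
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
```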

Property 3: The reverse of the forward process q(xₜ₋₁|xₜ, x₀)
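For reference, applying the Bayes rule to the forward process gives this distribution in closed form, as in Eq. (6)-(7) of the paper, reconstructed here in the notation above:

```latex
q(x_{t-1} \mid x_t, x_0)
  = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right),
\qquad
\tilde{\mu}_t(x_t, x_0)
  = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0
  + \frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t}\,x_t,
\qquad
\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t
```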

μₚ(xₜ, t) and Σₚ(xₜ, t) contain model parameters

Why do we need the reverse process p(xₜ₋₁|xₜ)? Isn’t the reverse of the forward process q(xₜ₋₁|xₜ, x₀) enough?

Hand drawn illustration by me

The joint probability density function, with data plugged in

The likelihood p(x₀)

Why not integrate x₀ away from the likelihood p(x₀) as well?

The likelihood p(x₀) mentions all model parameters

The negative log likelihood of data as the loss function

The negative log likelihood loss function is not optimizable

Can we use sample-averaging to derive the analytical form for the loss function?

  1. First, draw a sample for x_T from the standard multivariate Gaussian distribution, using the base case to remove the integration with respect to x_T.
  2. With a sample Sₜ for the random variable xₜ at hand, plug Sₜ into p(xₜ₋₁|xₜ=Sₜ), then sample xₜ₋₁, using the inductive case.
  3. As long as we don’t lose model parameters during this process, we will end up with an analytical formula for the loss function L. Losing model parameters during sample-averaging means that sample-averaging may result in a formula that no longer mentions the model parameters. This is bad because a loss function that does not mention model parameters is useless. The reparameterization trick is used to prevent this from happening. But in our case, we don’t need to worry about losing model parameters when applying sample-averaging. The appendix “Why we won’t lose model parameters when applying sample-averaging to derive the analytical formula for the loss function L” explains why.
  4. Once all samples for x₁ to x_T are available, let’s call them a sample trajectory. Plug this trajectory into the joint probability density p(x_0:T) to get the analytical expression for p(x₀) under this trajectory.
  5. Repeat steps 1~4 to get analytical expressions for p(x₀) under different trajectories, and average them to approximate the inner integration. Say there are m sample trajectories, and each trajectory i gives an analytical formula pᵢ(x₀); then the analytical formula for the average is p(x₀) ≈ (1/m) · Σᵢ pᵢ(x₀). (A sketch of this procedure follows the list.)
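Here is a minimal sketch of the five steps, assuming a hypothetical reverse-process mean network mu_p(x_t, t) and a fixed isotropic variance sigma2 (both are stand-ins for illustration, not the paper's actual implementation):

```python
import numpy as np

def gaussian_density(x, mean, var):
    """Density of N(mean, var * I) evaluated at x (isotropic covariance)."""
    d = x.size
    return np.exp(-np.sum((x - mean) ** 2) / (2 * var)) / (2 * np.pi * var) ** (d / 2)

def estimate_p_x0(x0, mu_p, sigma2, T, m, rng=np.random.default_rng()):
    """Sample-averaging estimate of p(x_0): average p(x_0 | x_1) over m
    trajectories drawn from the reverse process itself (steps 1 to 5)."""
    total = 0.0
    for _ in range(m):
        x = rng.standard_normal(x0.shape)                   # step 1: x_T ~ N(0, I)
        for t in range(T, 1, -1):                           # step 2: sample down to x_1
            x = mu_p(x, t) + np.sqrt(sigma2) * rng.standard_normal(x0.shape)
        total += gaussian_density(x0, mu_p(x, 1), sigma2)   # step 4: plug in the trajectory
    return total / m                                        # step 5: average over trajectories
```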

Derivation by importance sampling

New loss Lᵥ to derive an analytical formula for, and to minimize

Rewriting L to get the important Lₜ₋₁ term

  1. probability density functions for the same random variable xₜ₋₁, and
  2. they are both multivariate Gaussian distributions with their analytical probability densities available — previously we defined the analytical form for both p(xₜ₋₁|xₜ) in the reverse process and q(xₜ₋₁|xₜ, x₀) in the reverse of the forward process.
  1. q(x₀), which is a distribution about x₀, and its formula is unknown.
  2. q(xₜ₋₁|x₀), which is a distribution about xₜ₋₁.
  3. q(xₜ|x₀), which is a distribution about xₜ.
  4. q(xₒₜₕₑᵣ), which is a distribution about the latent random variables other than xₜ₋₁ and xₜ.
  • Sample x₀ by randomly picking natural images from the training set.
  • Sample xₜ from the marginal q(xₜ|x₀) after plugging in the sample for x₀.
  • No need to sample xₒₜₕₑᵣ, as line (10) reveals that xₒₜₕₑᵣ is not mentioned in the KL-divergence. The values of the random variables inside xₒₜₕₑᵣ won’t change the computed result of the KL-divergence.

Sample-averaging to solve the integration

  1. First, sample x₀ from our training set. Let’s call x₀’s sample S₀.
  2. Plug S₀ into q(xₜ|x₀) to get q(xₜ|x₀=S₀), which is now a fully specified multivariate Gaussian distribution ready to be sampled. Let’s call xₜ’s sample Sₜ.
  3. Ignore the integration over xₒₜₕₑᵣ: because xₒₜₕₑᵣ does not appear in the KL-divergence, its samples do not change the analytical form of the integration result. (A sketch of these steps follows the list.)
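Here is a minimal sketch of this estimate, reusing the schedule arrays (betas, alphas, alpha_bars) from the earlier snippet. It assumes, as the paper later does, that p's covariance is fixed to the posterior variance β̃ₜ·I, and mu_p(x_t, t) is again a hypothetical stand-in for the reverse-process mean network. The payoff is that the KL between two Gaussians sharing an isotropic covariance collapses to a scaled squared distance between their means:

```python
import numpy as np

def posterior_mean_var(x_t, x0, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0), Eq. (7) of the paper.
    Valid for t >= 2 (t is 1-indexed, the schedule arrays are 0-indexed)."""
    a_bar_t, a_bar_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    coef_x0 = np.sqrt(a_bar_prev) * betas[t - 1] / (1 - a_bar_t)
    coef_xt = np.sqrt(alphas[t - 1]) * (1 - a_bar_prev) / (1 - a_bar_t)
    var = (1 - a_bar_prev) / (1 - a_bar_t) * betas[t - 1]
    return coef_x0 * x0 + coef_xt * x_t, var

def L_t_minus_1_estimate(S0, t, mu_p, rng=np.random.default_rng()):
    """One-sample estimate of L_{t-1}: sample S_t from q(x_t | x_0 = S0),
    then take the closed-form KL between the two Gaussians over x_{t-1}."""
    a_bar = alpha_bars[t - 1]
    S_t = np.sqrt(a_bar) * S0 + np.sqrt(1 - a_bar) * rng.standard_normal(S0.shape)
    mu_q, var = posterior_mean_var(S_t, S0, t)
    # KL( N(mu_q, var*I) || N(mu_p, var*I) ) = ||mu_q - mu_p||^2 / (2 * var)
    return np.sum((mu_q - mu_p(S_t, t)) ** 2) / (2 * var)
```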

KL(q(xₜ₋₁|xₜ, x₀) || p(xₜ₋₁|xₜ)) serves as regularization

  • q(xₜ₋₁|xₜ, x₀) — the reverse of the forward process that we derived from the forward process by using the Bayes rule.
  • p(xₜ₋₁|xₜ) — the reverse process that we use a deep neural network to implement.

Trajectory viewpoint

Hand drawn illustration by me

Optimization forces p to change by fixing q

  1. μₚ(xₜ, t), the neural network that is responsible for predicting the mean vector of the p(xₜ₋₁|xₜ) multivariate Gaussian distribution.
  2. Σₚ(xₜ, t), a second neural network that is responsible for predicting the covariance matrix of the p(xₜ₋₁|xₜ) multivariate Gaussian distribution. (A minimal sketch follows this list.)
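As a purely hypothetical sketch of this two-headed design (the paper actually uses a U-Net trunk and, as discussed next, fixes the covariance rather than learning it):

```python
import torch
import torch.nn as nn

class ReverseProcessNet(nn.Module):
    """Hypothetical sketch: one shared trunk with two heads, predicting the
    mean vector and a diagonal covariance for p(x_{t-1} | x_t)."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(dim + 1, hidden), nn.SiLU())
        self.mean_head = nn.Linear(hidden, dim)     # mu_p(x_t, t)
        self.logvar_head = nn.Linear(hidden, dim)   # diagonal Sigma_p(x_t, t)

    def forward(self, x_t, t):
        # Crude timestep conditioning: append t as one extra input feature.
        h = self.trunk(torch.cat([x_t, t.float().unsqueeze(-1)], dim=-1))
        return self.mean_head(h), self.logvar_head(h).exp()
```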

Simplifying the model by setting the reverse process covariance matrix to constant

Interpreting the meaning of LMₜ₋₁

  1. Originally we want to minimize the distance between q(xₜ₋₁|xₜ, x₀) the reverse of the forward process and p(xₜ₋₁|xₜ), which is our neural network implementation of the reverse process, at every time step t from 2 to T. In other words, we want to find a configuration (model parameter values) for the p(xₜ₋₁|xₜ) distribution such that these two distributions are similar to each other.
  2. These two distributions for the random variable xₜ₋₁ are both multivariate Gaussian. A multivariate Gaussian distribution is fully specified by its mean vector and covariance matrix. If p(xₜ₋₁|xₜ) needs to be similar to q(xₜ₋₁|xₜ, x₀), their mean vectors and covariance matrices must be similar to each other. This is called moment matching, with the mean being the first moment and the covariance being the second. The letter “M” in LMₜ₋₁ stands for moment matching.
  3. After we simplified the covariance matrix from the p(xₜ₋₁|xₜ) distribution to a quantity that is equal to the covariance matrix from the reverse of the forward process, the only thing that we can still change to make these two distributions similar or different is the mean vector. So we want to minimize the distance between the mean vectors from the p(xₜ₋₁|xₜ) and the q(xₜ₋₁|xₜ, x₀) distribution.
  4. Since the mean vector from the p(xₜ₋₁|xₜ) distribution is predicted by our neural network, we can use optimization to move the values of the neural network weights around by minimizing LMₜ₋₁.

Simplifying LMₜ₋₁

  1. xₜ is known via sampling; there is no need to predict it.
  2. Given timestep t, βₜ is constant, and so are all the other quantities derived from βₜ, namely αₜ and ᾱₜ.
  3. The only part that needs predicting is the noise ϵₜ (see the substitution sketched below).
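Concretely, substituting xₜ = √ᾱₜ·x₀ + √(1−ᾱₜ)·ϵₜ into the posterior mean, and parameterizing the model mean the same way (Eq. 11 in the paper), the two means differ only through the noise term:

```latex
\tilde{\mu}_t(x_t, x_0)
  = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_t\right),
\qquad
\mu_p(x_t, t)
  := \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_p(x_t, t)\right)
```

so minimizing the distance between the mean vectors reduces to minimizing the distance between the true noise ϵₜ and the predicted noise ϵₚ(xₜ, t).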

Is this objective function still analytical?

  1. In the final formula for LMₜ₋₁, there is no mention of xₜ anymore; xₜ is expressed via x₀ and the noise ϵₜ. So we don’t need to add the expectation with respect to xₜ. Instead, we need to add the expectation with respect to ϵₜ, which is a standard multivariate Gaussian, that is ϵₜ ~ N(0, I).
  2. There is the mention of timestep t, which represents an integer between 2 and T. We need to add an expectation with respect to t, which comes from a uniform distribution.
  3. There is the mention of x₀, which comes from the unknown data distribution q(x₀).
From paper Denoising Diffusion Probabilistic Models, page 4
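Putting the three expectations together gives Lₛᵢₘₚₗₑ = E over t, x₀, ϵ of ||ϵ − ϵₚ(√ᾱₜ·x₀ + √(1−ᾱₜ)·ϵ, t)||², which is exactly what Algorithm 1 of the paper samples. A minimal PyTorch sketch of one training step, where eps_model is a stand-in for the noise-prediction network and alpha_bars is a tensor of the ᾱₜ values:

```python
import torch

def train_step(eps_model, optimizer, x0, alpha_bars, T):
    """One L_simple update following Algorithm 1 of the paper: draw
    t ~ Uniform{1..T} and eps ~ N(0, I), build x_t in closed form, then
    regress the predicted noise onto the true noise."""
    t = torch.randint(1, T + 1, (x0.shape[0],))
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t - 1].view(-1, *([1] * (x0.dim() - 1)))  # broadcast per sample
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    loss = ((eps - eps_model(x_t, t)) ** 2).mean()   # L_simple: unweighted MSE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```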

Ignoring the L_T term

Approximating the L₀ term

  1. L₀ does not mention any model parameters, so it can be ignored during the optimization. Or,
  2. L₀ mentions model parameters and is analytical, so its gradient can be taken for gradient descent.
  • write down µₜ(xₜ, x₀) for t=1, that is, µ₁(x₁, x₀), and
  • if µ₁(x₁, x₀) is close to the natural image sample X₀
From paper Denoising Diffusion Probabilistic Models, page 4

No concern about high variance in sample-averaging Lₛᵢₘₚₗₑ?

  1. For the random variable x₀, there is no way to compute the expectation with respect to it analytically, because the data distribution q(x₀) is unknown. So sample-averaging is the only option.
  2. For the random variable t, which comes from a uniform distribution, its expectation just takes all possible values of t, computes the formula inside the expectation, and averages them. This is equivalent to sample-averaging in our context of stochastic gradient descent. Even though in stochastic gradient descent, Algorithm 1 only works with a single term, instead of adding all those terms together and dividing the sum by T, the algorithm does this repeatedly until converging. This is equivalent to computing the expectation over t asymptotically. For more details, please see the proof in Can We Use Stochastic Gradient Descent (SGD) on a Linear Regression Model?
  3. For the standard multivariate Gaussian random variable ϵₜ, we can use Gaussian quadrature to approximate the expectation analytically. For more details about Gaussian quadrature, please see Variational Gaussian Process (VGP) — What To Do When Things Are Not Gaussian. But Gaussian quadrature works better in low-dimensional settings. In our case, ϵₜ is a d-dimensional random variable, with d being the number of pixels in the images that we want to generate, so d is a large integer, and applying Gaussian quadrature is not practical. For more details about why it is not practical, please see the Appendix of the above link. (A tiny quadrature demo follows this list.)
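To make the dimensionality point concrete, here is a one-dimensional Gauss-Hermite demo using NumPy; the tensor-product grid a d-dimensional version would need grows as n^d nodes, which is hopeless when d is a pixel count:

```python
import numpy as np

# Gauss-Hermite nodes/weights approximate integrals against exp(-x^2), so
# E_{eps ~ N(0,1)}[f(eps)] ~= (1/sqrt(pi)) * sum_i w_i * f(sqrt(2) * x_i).
nodes, weights = np.polynomial.hermite.hermgauss(20)
approx = np.sum(weights * (np.sqrt(2.0) * nodes) ** 2) / np.sqrt(np.pi)
print(approx)  # ~1.0, matching E[eps^2] = 1 for a standard Gaussian

# In d dimensions a tensor-product rule needs 20**d evaluations -- already
# ~10^26 for a tiny 20-pixel "image" -- hence quadrature is impractical here.
```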

Why we won’t loss model parameters when applying sample-averaging to derive the analytical formula for the loss function L

