[DDPM] Denoising Diffusion Probabilistic Models - Theory to Implementation
Research/Generative Model, 2024. 4. 22. 15:12
https://learnopencv.com/denoising-diffusion-probabilistic-models/
Diffusion probabilistic models are an exciting new area of research showing great promise in image generation. Diffusion-based generative models were first introduced in 2015 and popularized in 2020, when Ho et al. published the paper “Denoising Diffusion Probabilistic Models” (DDPM). DDPMs are responsible for making diffusion models practical. In this article, we will highlight the key concepts and techniques behind DDPMs and train DDPMs from scratch on a “flowers” dataset for unconditional image generation.
In DDPMs, the authors changed the formulation and model training procedures which helped to improve and achieve “image fidelity” rivaling GANs and established the validity of these new generative algorithms.
1. The Need For Generative Models
The job of image-based generative models is to generate new images that are similar, in other words, “representative” of our original set of images.
We need to create and train generative models because the set of all possible images that can be represented by, say, just (256x256x3) images is enormous. An image must have the right pixel value combinations to represent something meaningful (something we can understand).
For example, for an image to represent a “Sunflower”, the pixels in the image need to be in the right configuration (they need to have the right values). And the space where such images exist is just a fraction of the entire set of images that can be represented by a (256x256x3) image space.
Now, if we knew how to get/sample a point from this subspace, we wouldn’t need to build “generative models.” However, at this point in time, we don’t. 😓
The probability distribution function or, more precisely, probability density function (PDF) that captures/models this (data) subspace remains unknown and is most likely too complex to write down directly.
This is why we need “generative models”: to figure out the underlying likelihood function our data satisfies.
PS: A PDF is a function describing the density (relative likelihood) of a continuous random variable. In this case, it assigns a density to every possible image, and integrating it over a region of image space gives the probability that an image falls in that region.
PPS: Every PDF has a set of parameters that determine the shape and probabilities of the distribution. The shape of the distribution changes as the parameter values change. For example, in the case of a normal distribution, we have the mean µ (mu) and the variance σ² (sigma squared) that control the distribution’s center point and spread.
2. What Are Diffusion Probabilistic Models?
2.1. Forward Diffusion Process
In our previous post, “Introduction to Diffusion Models for Image Generation”, we didn’t discuss the math behind these models. We provided only a conceptual overview of how diffusion models work and focused on different well-known models and their applications. In this article, we’ll be focusing heavily on the math.
In this section, we’ll explain diffusion-based generative models from a logical and theoretical perspective. Next, we’ll review all the math required to understand and implement Denoising Diffusion Probabilistic Models from scratch.
Diffusion models are a class of generative models inspired by an idea in Non-Equilibrium Statistical Physics, which states:
“We can gradually convert one distribution into another using a Markov chain”
– Deep Unsupervised Learning using Nonequilibrium Thermodynamics, 2015
Forward Diffusion Process:
“It’s easy to destroy but hard to create”
– Pearl S. Buck
- In the “Forward Diffusion” process, we slowly and iteratively add noise to (corrupt) the images in our training set such that they “move out or move away” from their existing subspace.
- What we are doing here is converting the unknown and complex distribution that our training set belongs to into one that is easy for us to sample a (data) point from and understand.
- At the end of the forward process, the images become entirely unrecognizable. The complex data distribution is wholly transformed into a (chosen) simple distribution. Each image gets mapped to a space outside the data subspace.
2.2. Reverse Diffusion Process
“By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond.”
– Stable Diffusion, 2022
- In the “Reverse Diffusion process,” the idea is to reverse the forward diffusion process.
- We slowly and iteratively try to reverse the corruption performed on images in the forward process.
- The reverse process starts where the forward process ends.
- The benefit of starting from a simple space is that we know how to get/sample a point from this simple distribution (think of it as any point outside the data subspace).
- And our goal here is to figure out how to return to the data subspace.
- However, the problem is that there are infinitely many paths starting from a point in this “simple” space, but only a fraction of them will take us to the “data” subspace.
- In diffusion probabilistic models, this is done by referring to the small iterative steps taken during the forward diffusion process.
- The PDF that describes the corrupted images in the forward process differs slightly at each step.
- Hence, in the reverse process, we use a deep-learning model at each step to predict the PDF parameters of the forward process.
- And once we train the model, we can start from any point in the simple space and use the model to iteratively take steps to lead us back to the data subspace.
- In reverse diffusion, we iteratively perform the “denoising” in small steps, starting from a noisy image.
- Training and sampling with this approach are generally more stable than with GANs, and the results are better than those of earlier approaches such as variational autoencoders (VAEs) and normalizing flows.
Since their introduction in 2020, DDPMs have been the foundation for cutting-edge image generation systems, including DALL-E 2, Imagen, Stable Diffusion, and Midjourney.
With the huge number of AI art generation tools today, it is difficult to find the right one for a particular use case. In our recent article, we explored all the different AI art generation tools so that you can make an informed choice to generate the best art.
3. Mathematical Details Behind Denoising Diffusion Probabilistic Models
3.1. Mathematical Details Of The Forward Diffusion Process
The distribution q in the forward diffusion process is defined as a Markov chain given by:
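In the notation of the DDPM paper (Ho et al., 2020), where β_t is the variance of the noise added at step t, the forward chain and its single-step kernel can be written as:

```latex
\begin{aligned}
q(x_{1:T} \mid x_0) &= \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \\
q(x_t \mid x_{t-1}) &= \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)
\end{aligned}
```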
How do we get image xt from xt-1 and how is noise added at each time step?
This can be easily understood by using the reparameterization trick in variational autoencoders.
Referring to the second equation, we can easily sample image xt from a normal distribution as:
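Written out with the reparameterization trick, and using the DDPM paper's symbols (β_t for the step variance, ε for standard Gaussian noise), one forward step looks like this:

```latex
x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
```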
There’s a problem here, which results in an inefficient forward process 🐢.
Whenever we need a latent sample x at timestep t, we have to perform t-1 steps in the Markov chain.
To fix this, the authors of the DDPM reformulated the kernel to directly go from timestep 0 (i.e., from the original image) to timestep t in the process.
To do so, two additional terms are defined:
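Following the DDPM paper, the two terms are α_t and its cumulative product ᾱ_t. With them, the kernel from x_0 to x_t has a closed form, so any x_t can be sampled in one step:

```latex
\begin{aligned}
\alpha_t &= 1-\beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s \\
q(x_t \mid x_0) &= \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\right) \\
x_t &= \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
\end{aligned}
```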
3.2. Mathematical Details Of The Reverse Diffusion Process
“In the reverse diffusion process, the task is to learn a finite-time (within T timesteps) reversal of the forward diffusion process.”
This basically means that we have to “undo” the forward process i.e., to remove the noise added in the forward process iteratively. It is done using a neural network model.
In the forward process, the transition function q was defined using a Gaussian, so what function should be used for the reverse process p? What should the neural network learn?
- In 1949, W. Feller showed that, for Gaussian (and binomial) distributions, the diffusion process’s reversal has the same functional form as the forward process.
- This means that, like the forward diffusion kernel (FDK), which is defined as a normal distribution, the reverse diffusion kernel can use the same functional form (a Gaussian distribution).
- The reverse process is also a Markov chain where a neural network predicts the parameters for the reverse diffusion kernel at each timestep.
- During training, the learned estimates (of the parameters) should be close to the parameters of the FDK’s posterior at each timestep. We’ll talk more about FDK’s posterior in the next section.
- We want this because if we follow the forward trajectory in reverse, we may return to the original data distribution.
- In doing so, we would also learn how to generate new samples that closely match the underlying data distribution, starting from a pure gaussian noise (we do not have access to the forward process during inference).
1. The Markov chain for the reverse diffusion starts from where the forward process ends, i.e., at timestep T, where the data distribution has been converted into (nearly an) isotropic gaussian distribution.
2. The PDF of the reverse diffusion process is an “integral” over all the possible pathways we can take to arrive at a data sample (in the same distribution as the original) starting from pure noise xT.
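In the DDPM paper's notation, these two statements correspond to the following definitions, where μ_θ and Σ_θ are the mean and covariance predicted by the network with parameters θ:

```latex
\begin{aligned}
p(x_T) &= \mathcal{N}(x_T;\ 0,\ I) \\
p_\theta(x_{0:T}) &= p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t),
\qquad
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right) \\
p_\theta(x_0) &= \int p_\theta(x_{0:T})\, dx_{1:T}
\end{aligned}
```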
Training Objective & Loss Function Used In Denoising Diffusion Probabilistic Models
The training objective of diffusion-based generative models amounts to “maximizing the log-likelihood that the sample generated at the end of the reverse process (x_0) belongs to the original data distribution.”
We have defined the transition functions in diffusion models as “Gaussians”. Maximizing the log-likelihood of a Gaussian distribution means finding the parameters of the distribution (µ, σ²) that maximize the “likelihood” of the (generated) data belonging to the same data distribution as the original data.
To train our neural network, we define the loss function (L) as the objective function’s negative. So a high value for p_θ(x_0) means low loss and vice versa.
Turns out, this is intractable because we need to integrate over a very high dimensional (pixel) space for continuous values over T timesteps.
Instead, the authors take inspiration from VAEs and reformulate the training objective using a variational lower bound (VLB), also known as “Evidence lower bound” (ELBO), which is this scary-looking equation 👻:
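As stated in the DDPM paper, the bound on the negative log-likelihood reads:

```latex
\mathbb{E}\left[-\log p_\theta(x_0)\right]
\;\le\;
\mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]
=: L_{vlb}
```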
After some simplification, the DDPM authors arrive at this final L_vlb (variational lower bound) loss term:
We can break the above L_vlb loss term into individual per-timestep terms as follows:
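In the form given in the DDPM paper, the bound becomes a sum of KL divergences plus one reconstruction term:

```latex
L_{vlb} = \mathbb{E}_q\Big[
D_{KL}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)
+ \sum_{t=2}^{T} D_{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)
- \log p_\theta(x_0 \mid x_1)
\Big]
```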
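Using the names from the DDPM paper, the individual terms are:

```latex
\begin{aligned}
L_T &= D_{KL}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big) \\
L_{t-1} &= D_{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big), \qquad 2 \le t \le T \\
L_0 &= -\log p_\theta(x_0 \mid x_1)
\end{aligned}
```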
You may notice that this loss function is huge! But the authors of DDPM further simplify it by ignoring some of the terms in their simplified loss function.
The terms ignored are:
- L0 – The authors got better results without this.
- LT – This is the “KL divergence” between the distribution of the final latent in the forward process and the first latent in the reverse process. However, there are no neural network parameters involved here, so we can’t do anything about it except define a good variance scheduler and use a large number of timesteps so that both distributions are close to an isotropic Gaussian.
So Lt-1 is the only loss term left which is a KL divergence between the “posterior” of the forward process (conditioned on xt and the initial sample x0), and the parameterized reverse diffusion process. Both terms are gaussian distributions as well.
The term q(xt-1|xt, x0) is referred to as “forward process posterior distribution.”
The job of our deep-learning model during training is to approximate/estimate the parameters of this (Gaussian) posterior such that the KL divergence is as small as possible.
The parameters of the posterior distribution are as follows:
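In the DDPM paper, the posterior is a Gaussian whose mean μ̃_t and variance β̃_t are known in closed form, with α_t = 1 − β_t and ᾱ_t the cumulative product of the α's:

```latex
\begin{aligned}
q(x_{t-1} \mid x_t, x_0) &= \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right) \\
\tilde{\mu}_t(x_t, x_0) &= \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0
+ \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t \\
\tilde{\beta}_t &= \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t
\end{aligned}
```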
To further simplify the task of the model, the authors decided to fix the variance to a time-dependent constant β_t rather than learn it.
Now, the model only needs to learn to predict the posterior mean. And the reverse diffusion kernel gets modified to:
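With the variance fixed to σ_t² = β_t as described, the reverse kernel in the DDPM paper's notation becomes:

```latex
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right),
\qquad \sigma_t^2 = \beta_t
```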
As we have kept the variance constant, minimizing the KL divergence reduces to minimizing the distance between the means (µ) of the two Gaussian distributions q and p, which can be done as follows:
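In the DDPM paper this distance between means appears as a weighted squared error, where μ̃_t is the posterior mean, μ_θ is the network's prediction, and C collects terms that do not depend on θ:

```latex
L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}
\left\lVert \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \right\rVert^2\right] + C
```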
Now, there are three approaches we can take here:
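As discussed in the DDPM paper, the network could (1) predict x_0 directly and plug it into the posterior mean, (2) predict the posterior mean μ̃_t itself, or (3) predict the noise ε that was added. The third option follows from rewriting x_0 in terms of x_t and ε; the sketch below uses the paper's symbols, with ε_θ denoting the network's noise prediction:

```latex
\begin{aligned}
x_0 &= \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon\right) \\
\tilde{\mu}_t &= \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\right),
\qquad
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)
\end{aligned}
```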
This is the final loss function we use to train DDPMs, which is just a “Mean Squared Error” between the noise added in the forward process and the noise predicted by the model. This is the most impactful contribution of the paper Denoising Diffusion Probabilistic Models.
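In the DDPM paper this simplified objective is written as follows, where t is drawn uniformly from 1..T and ε_θ is the network's noise prediction:

```latex
L_{simple} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[
\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right) \right\rVert^2
\right]
```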
It’s awesome because, beginning from those scary-looking ELBO terms, we ended up with one of the simplest loss functions in machine learning.
4. Writing DDPMs From Scratch In PyTorch
First and foremost, we’ll define configuration classes that will hold the hyperparameters for loading the dataset, creating log directories, and training the model.
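A minimal sketch of what such configuration classes might look like, using standard-library dataclasses. Only the dataset names, the 32x32 image size, and the 800 epochs come from this article; the class name TrainingConfig, all field names, the directory names, the batch size, and the learning rate are placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BaseConfig:
    # Dataset to load; the article also mentions "MNIST", "Cifar-10", "Cifar-100".
    dataset: str = "Flowers"
    # Hypothetical directory names for logs and checkpoints.
    log_root: str = "logs"
    checkpoint_root: str = "checkpoints"


@dataclass
class TrainingConfig:
    timesteps: int = 1000          # T, as used in the DDPM paper
    image_size: int = 32           # the article resizes images to 32x32
    num_epochs: int = 800          # the article trains for up to 800 epochs
    batch_size: int = 32           # placeholder, not taken from the article
    learning_rate: float = 2e-4    # placeholder, not taken from the article

    def image_shape(self, dataset: str) -> Tuple[int, int, int]:
        """Return (channels, height, width); MNIST is grayscale, the rest RGB."""
        channels = 1 if dataset == "MNIST" else 3
        return (channels, self.image_size, self.image_size)


base_cfg = BaseConfig()
train_cfg = TrainingConfig()
shape = train_cfg.image_shape(base_cfg.dataset)
```

Keeping the hyperparameters in one place like this makes it easy to switch datasets without touching the training code.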
5. Creating PyTorch Dataset Class Object
This article uses the “Flowers” dataset, which can be downloaded from Kaggle or quickly loaded in the Kaggle kernel environment. But as you may have noticed, in the BaseConfig class, we have also provided the option to load the MNIST, Cifar-10 and Cifar-100 datasets. You can choose whichever one you prefer.
Here, we are creating two functions:
- get_dataset(...): Returns the dataset class object that will be passed to the Dataloader. Three preprocessing transforms and one augmentation are applied to every image in the dataset.
  - Preprocessing:
    - Convert pixel values from the range [0, 255] → [0.0, 1.0]
    - Resize images to shape (32x32).
    - Change pixel values from the range [0.0, 1.0] → [-1.0, 1.0]. This is done by the DDPM authors so that the input image roughly has the same range of values as a standard gaussian.
  - Augmentation:
    - A random horizontal flip, as used in the original implementation. In case you are using the MNIST dataset, be sure to comment out this line.
- inverse_transforms(...): This function is used for inverting the transforms applied during the loading step and reverting the image to the range [0.0, 255.0].
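The range conversions described above can be sketched on scalar pixel values in plain Python; the article's own code would apply the same arithmetic to whole tensors. The function names and the clipping step in the inverse are assumptions made for this illustration.

```python
def to_model_range(pixel: int) -> float:
    """Map an 8-bit pixel value in [0, 255] to [-1.0, 1.0]."""
    unit = pixel / 255.0       # [0, 255] -> [0.0, 1.0]
    return unit * 2.0 - 1.0    # [0.0, 1.0] -> [-1.0, 1.0]


def to_pixel_range(value: float) -> float:
    """Invert to_model_range: map [-1.0, 1.0] back to [0.0, 255.0]."""
    # Clipping is an assumption: generated samples can overshoot slightly.
    clipped = max(-1.0, min(1.0, value))
    return (clipped + 1.0) / 2.0 * 255.0


row = [0, 64, 128, 255]                        # one row of pixel values
scaled = [to_model_range(p) for p in row]      # values fed to the model
restored = [to_pixel_range(v) for v in scaled] # values used for display
```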
6. Creating PyTorch Dataloader Class Object
Next, we define the get_dataloader(...) function that returns a Dataloader object for the chosen dataset.
7. Visualizing Dataset
8. Model Architecture Used In DDPMs
In DDPMs, the authors use a UNet-shaped deep neural network which takes in as input:
- The input image at any stage of the reverse process.
- The timestep of the input image.
From the usual UNet architecture, the authors replaced the original double convolution at each level with “Residual blocks” used in ResNet models.
The architecture comprises 5 components:
- Encoder blocks
- Bottleneck blocks
- Decoder blocks
- Self attention modules
- Sinusoidal time embeddings
Architectural Details:
- There are four levels in the encoder and decoder path with bottleneck blocks between them.
- Each encoder stage comprises two residual blocks with convolutional downsampling except the last level.
- Each corresponding decoder stage comprises three residual blocks and uses 2x nearest-neighbor upsampling followed by a convolution to upsample the input from the previous level.
- Each stage in the encoder path is connected to the decoder path with the help of skip connections.
- The model uses “Self-Attention” modules at a single feature map resolution.
- Every residual block in the model gets the inputs from the previous layer (and others in the decoder path) and the embedding of the current timestep. The timestep embedding informs the model of the input’s current position in the Markov chain.
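To make the timestep embedding concrete, here is a small sketch of a sinusoidal embedding in plain Python. It follows one common variant (sines followed by cosines, with geometrically spaced frequencies); the function name, the default dimension, and the exact frequency spacing are assumptions and may differ from the article's implementation.

```python
import math
from typing import List


def sinusoidal_embedding(t: int, dim: int = 8, max_period: float = 10000.0) -> List[float]:
    """Encode an integer timestep as a vector of sines and cosines."""
    half = dim // 2
    # Frequencies decay geometrically from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


emb_start = sinusoidal_embedding(0)
emb_later = sinusoidal_embedding(10)
```

Each timestep gets a distinct, bounded vector, which the residual blocks can then project and add to their feature maps.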
In this article, we are working on an image size of (32×32). Only two minor changes exist between our model and the original model used in the paper.
- We use 64 base channels instead of 128.
- There are four levels in both encoder and decoder paths. The feature map sizes at each level are kept as follows: 32 → 16 → 8 → 8. We apply self-attention at feature map sizes of both (16x16) and (8x8), as opposed to the original, where it is applied just once at a feature map size of (16x16).
9. Diffusion Class
In this section, we are creating a class called SimpleDiffusion. This class contains:
- Scheduler constants required for performing the forward and reverse diffusion process.
- A method to define the linear variance scheduler used in DDPMs.
- A method that performs a single step using the updated forward diffusion kernel.
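The scheduler constants listed above can be sketched in plain Python as follows. The linear range from 1e-4 to 0.02 over 1000 steps is the one reported in the DDPM paper; the class name DiffusionConstants and its attribute names are hypothetical and stand in for the tensors the article's SimpleDiffusion class would hold.

```python
import math


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1 .. beta_T."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


class DiffusionConstants:
    """Precomputed per-timestep constants (index 0 corresponds to t = 1)."""

    def __init__(self, num_steps=1000):
        self.num_steps = num_steps
        self.betas = linear_beta_schedule(num_steps)
        self.alphas = [1.0 - b for b in self.betas]
        # Cumulative product of alphas (alpha-bar).
        self.alpha_bars = []
        running = 1.0
        for a in self.alphas:
            running *= a
            self.alpha_bars.append(running)
        self.sqrt_alpha_bars = [math.sqrt(a) for a in self.alpha_bars]
        self.sqrt_one_minus_alpha_bars = [math.sqrt(1.0 - a) for a in self.alpha_bars]


consts = DiffusionConstants()
```

By the last timestep the cumulative product is close to zero, which is what makes x_T nearly pure Gaussian noise.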
10. Python Code For Forward Diffusion Process
In this section, we write the Python code to perform the “forward diffusion process” in a single step, following the reformulated kernel from Section 3.1 that goes directly from timestep 0 to timestep t.
The forward_diffusion(...) function takes in a batch of images and corresponding timesteps and adds noise/corrupts the input images using the updated forward diffusion kernel equation.
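A dependency-free sketch of this single-step corruption is shown below. It works on one flat list of pixel values rather than a batch of tensors, assumes the DDPM paper's linear schedule, and uses a helper alpha_bar(...) that is hypothetical and recomputes the cumulative product on every call for brevity.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 0..t under a linear schedule."""
    step = (beta_end - beta_start) / (num_steps - 1)
    prod = 1.0
    for s in range(t + 1):
        prod *= 1.0 - (beta_start + s * step)
    return prod


def forward_diffusion(x0, t, rng):
    """Sample x_t from q(x_t | x_0) in one step; returns (x_t, eps)."""
    a_bar = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]   # the noise that gets added
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps


rng = random.Random(0)
x0 = [-1.0, -0.5, 0.0, 0.5, 1.0]   # a tiny "image" already scaled to [-1, 1]
x_early, eps_early = forward_diffusion(x0, t=0, rng=rng)
x_late, eps_late = forward_diffusion(x0, t=999, rng=rng)
```

At t=0 the result stays close to the original values, while at the final timestep it is almost indistinguishable from the sampled noise.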
Visualizing Forward Diffusion Process On Sample Images
In this section, we’ll visualize the forward diffusion process on some sample images to see how they get corrupted as they pass through the Markov chain for T timesteps.
We perform the forward process for a few specific timesteps and store the noisy versions of the original image.
11. Training & Sampling Algorithms Used in Denoising Diffusion Probabilistic Models
Training code based on Algorithm 1:
The first function defined here is train_one_epoch(...). This function performs “one epoch of training”, i.e., it trains the model by iterating once over the entire dataset, and it will be called in our final training loop.
We also use Mixed-Precision training to train the model faster and save GPU memory. The code is pretty straightforward and almost a one-to-one conversion from the algorithm.
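The loss computation at the heart of Algorithm 1 can be sketched without any framework. This is only the per-sample loss: the gradient step, mixed precision, and the UNet are left to the article's PyTorch code. The function training_loss and the stand-in predictors zero_predictor and oracle are hypothetical names used for illustration.

```python
import math
import random

NUM_STEPS = 1000
# Linear schedule from the DDPM paper and its cumulative products.
BETAS = [1e-4 + i * (0.02 - 1e-4) / (NUM_STEPS - 1) for i in range(NUM_STEPS)]
ALPHA_BARS = []
_running = 1.0
for _b in BETAS:
    _running *= 1.0 - _b
    ALPHA_BARS.append(_running)


def training_loss(x0, predict_noise, rng):
    """One Monte Carlo sample of the simplified loss for a single image."""
    t = rng.randrange(NUM_STEPS)                      # uniform random timestep
    eps = [rng.gauss(0.0, 1.0) for _ in x0]           # target noise
    a_bar = ALPHA_BARS[t]
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
          for x, e in zip(x0, eps)]                   # noisy input x_t
    eps_hat = predict_noise(xt, t)                    # model's noise estimate
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x0)


x0 = [-1.0, 0.0, 1.0, 0.5]


def zero_predictor(xt, t):
    """Untrained stand-in that always predicts zero noise."""
    return [0.0 for _ in xt]


def oracle(xt, t):
    """Stand-in that knows x0 and therefore recovers the noise exactly."""
    a_bar = ALPHA_BARS[t]
    return [(v - math.sqrt(a_bar) * x) / math.sqrt(1.0 - a_bar)
            for v, x in zip(xt, x0)]


loss_zero = training_loss(x0, zero_predictor, random.Random(0))
loss_oracle = training_loss(x0, oracle, random.Random(0))
```

A predictor that recovers the added noise drives the loss to zero, which is the target the real network is trained towards.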
Sampling or Inference code based on Algorithm 2:
The next function we define is reverse_diffusion(...) which is responsible for performing inference i.e., generating images using the reverse diffusion process. The function takes in a trained model and the diffusion class and can either generate a video showcasing the entire diffusion process or just the final generated image.
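The sampling loop of Algorithm 2 can be sketched in plain Python as below, again on a flat list instead of image tensors and with σ_t² = β_t. The noise predictor here is a hypothetical oracle for a data distribution that consists of a single known point, which is only a device for checking the update rule; the article uses the trained UNet instead.

```python
import math
import random

NUM_STEPS = 1000
BETAS = [1e-4 + i * (0.02 - 1e-4) / (NUM_STEPS - 1) for i in range(NUM_STEPS)]
ALPHAS = [1.0 - b for b in BETAS]
ALPHA_BARS = []
_running = 1.0
for _a in ALPHAS:
    _running *= _a
    ALPHA_BARS.append(_running)


def reverse_diffusion(predict_noise, size, rng):
    """Start from Gaussian noise and denoise step by step down to x_0."""
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]       # x_T ~ N(0, I)
    for t in reversed(range(NUM_STEPS)):
        eps_hat = predict_noise(x, t)
        coef = BETAS[t] / math.sqrt(1.0 - ALPHA_BARS[t])
        mean = [(v - coef * e) / math.sqrt(ALPHAS[t]) for v, e in zip(x, eps_hat)]
        sigma = math.sqrt(BETAS[t]) if t > 0 else 0.0    # no noise on the last step
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x


TARGET = [0.5, -0.25, 1.0]   # the only "image" in this toy data distribution


def point_mass_oracle(xt, t):
    """Exact noise estimate when all data equals TARGET (illustration only)."""
    a_bar = ALPHA_BARS[t]
    return [(v - math.sqrt(a_bar) * x) / math.sqrt(1.0 - a_bar)
            for v, x in zip(xt, TARGET)]


sample = reverse_diffusion(point_mass_oracle, size=len(TARGET), rng=random.Random(0))
```

With a perfect noise estimate the chain lands on the data point, which illustrates how the quality of the learned predictor determines the quality of the samples.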
12. Training DDPMs From Scratch
In the previous sections, we have already defined all the necessary classes and functions required for training. All we have to do now is assemble them and start the training process.
Before we begin training:
- We’ll first define all the model-related hyperparameters.
- Then initialize the UNet model, AdamW optimizer, MSE loss function, and other necessary classes.
Then we’ll initialize the logging and checkpoint directories to save intermediate sampling results and model parameters.
Finally, we can write our training loop. As we have divided all our code into simple, easy-to-debug functions and classes, all we have to do now is call them in the epochs training loop. Specifically, we need to call the “training” and “sampling” functions defined in the previous section in a loop.
13. Generating Images Using DDPMs
You can let the training run for the full 800 epochs or interrupt it earlier if you are satisfied with the samples generated every 20 epochs.
To perform the inference, we simply have to reload the saved model, and you can use the same or a different logging directory to save the results. You can re-initialize the SimpleDiffusion class as well, but it’s not necessary.
The inference code is simply a call to the reverse_diffusion(...) function using the trained model.
Some of the results we got:
14. Summary
In conclusion, diffusion models represent a rapidly growing field with a wealth of exciting possibilities for the future. As research in this area continues to evolve, we can expect even more advanced techniques and applications to emerge. I encourage readers to share their thoughts and questions about this topic and to engage in conversations about the future of diffusion models.
To summarise this article📜, we covered a comprehensive list of related topics:
- We began by providing an intuitive answer to the fundamental question of why we need generative models.
- Then we continued the discussion to explain diffusion-based generative models from a logical and theoretical perspective.
- After building the theoretical base, we introduced all the necessary mathematical equations derived for DDPMs one by one while also maintaining the flow so that it’s easy to grasp.
- Finally, we concluded by explaining all the different pieces of code required for training DDPMs from scratch and performing inference. We also demonstrated the results we got from our experiments.
References
- What are Diffusion Models?
- DDPMs from scratch
- Diffusion Models | Paper Explanation | Math Explained
- Paper – Deep Unsupervised Learning using Nonequilibrium Thermodynamics
- Paper – Denoising Diffusion Probabilistic Models
- Paper – Improved Denoising Diffusion Probabilistic Models
- Paper – A Survey on Generative Diffusion Model
- An introduction to Diffusion Probabilistic Models – Ayan Das
- Denoising diffusion probabilistic models – Param Hanji