  • Variational Diffusion Models
    Research/DGM_recap 2025. 1. 23. 00:21

    https://arxiv.org/pdf/2107.00630
    https://github.com/google-research/vdm
    (NeurIPS 2021) Jul 2021

     

     
    ★ ★ ★ ★   Read this one first!!!!! ★ ★ ★ ★

                ↓ ↓ ↓ ↓ ↓ ↓ ↓
    Demystifying Variational Diffusion Models  https://arxiv.org/pdf/2401.06281
              ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑

     

    The equations in 「 Variational Diffusion Models 」 keep carrying over into
    Classifier-free guidance, distillation, Imagen, and so on
    (the same equations and concepts also appear in score-based model papers),
    yet I have never seen this paper mentioned in those works.
     
    It looks enormously important.. so why can't I find any mention of it..
     
    Even Lil'Log does not include this paper in its references.

     

    Only after reading this paper again and again,
    and then reading my savior 「 Demystifying Variational Diffusion Models 」 several times,
    did I finally understand Classifier-free guidance, distillation, and Imagen..
     
    Is it just me...?
    Did everyone else understand them right away on a first read...?
     
    「 Demystifying Variational Diffusion Models 」
    derives every single equation in full detail.
    So, so grateful ㅠㅠ
    Everything becomes clear. Highly recommended.


    Abstract

    Diffusion-based generative models have demonstrated a capacity for perceptually impressive synthesis, but can they also be great likelihood-based models? We answer this in the affirmative, and introduce a family of diffusion-based generative models that obtain state-of-the-art likelihoods on standard image density estimation benchmarks. Unlike other diffusion-based models, our method allows for efficient optimization of the noise schedule jointly with the rest of the model. We show that the variational lower bound (VLB) simplifies to a remarkably short expression in terms of the signal-to-noise ratio of the diffused data, thereby improving our theoretical understanding of this model class. Using this insight, we prove an equivalence between several models proposed in the literature. In addition, we show that the continuous-time VLB is invariant to the noise schedule, except for the signal-to-noise ratio at its endpoints. This enables us to learn a noise schedule that minimizes the variance of the resulting VLB estimator, leading to faster optimization. Combining these advances with architectural improvements, we obtain state-of-the-art likelihoods on image density estimation benchmarks, outperforming autoregressive models that have dominated these benchmarks for many years, with often significantly faster optimization. In addition, we show how to use the model as part of a bits-back compression scheme, and demonstrate lossless compression rates close to the theoretical optimum. Code is available at https://github.com/google-research/vdm


    1. Introduction

    Likelihood-based generative modeling is a central task in machine learning that is the basis for a wide range of applications ranging from speech synthesis [Oord et al., 2016], to translation [Sutskever et al., 2014], to compression [MacKay, 2003], to many others. Autoregressive models have long been the dominant model class on this task due to their tractable likelihood and expressivity, as shown in Figure 1. Diffusion models have recently shown impressive results in image [Ho et al., 2020, Song et al., 2021b, Nichol and Dhariwal, 2021] and audio generation [Kong et al., 2020, Chen et al., 2020] in terms of perceptual quality, but have yet to match autoregressive models on density estimation benchmarks. In this paper we make several technical contributions that allow diffusion models to challenge the dominance of autoregressive models in this domain. Our main contributions are as follows:
     
    • We introduce a flexible family of diffusion-based generative models that achieve new state-of-the-art log-likelihoods on standard image density estimation benchmarks (CIFAR-10 and ImageNet). This is enabled by incorporating Fourier features into the diffusion model and using a learnable specification of the diffusion process, among other modeling innovations.
     
    • We improve our theoretical understanding of density modeling using diffusion models by analyzing their variational lower bound (VLB), deriving a remarkably simple expression in terms of the signal-to-noise ratio of the diffusion process. This result delivers new insight into the model class: for the continuous-time (infinite-depth) setting we prove a novel invariance of the generative model and its VLB to the specification of the diffusion process, and we show that various diffusion models from the literature are equivalent up to a trivial time-dependent rescaling of the data.
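
    For concreteness, the "remarkably simple expression" referred to above looks like the following (my reconstruction of the paper's continuous-time diffusion loss, in my own notation; x̂_θ is the model's denoised reconstruction of x from z_t — check the paper itself for the exact statement):

```latex
% Continuous-time diffusion loss in terms of the signal-to-noise ratio.
% SNR(t) is decreasing in t, so SNR'(t) < 0 and the overall loss is positive.
\mathcal{L}_\infty(\mathbf{x})
  = -\tfrac{1}{2}\,
    \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(0,\mathbf{I})}
    \int_0^1 \mathrm{SNR}'(t)\,
    \bigl\lVert \mathbf{x} - \hat{\mathbf{x}}_\theta(\mathbf{z}_t; t) \bigr\rVert_2^2 \, dt
```

    The invariance claim then follows intuitively: the integral depends on the noise schedule only through SNR at the endpoints, since any reparameterization of t that preserves SNR(0) and SNR(1) is just a change of integration variable.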


    2. Related Work

    Our work builds on diffusion probabilistic models (DPMs) [Sohl-Dickstein et al., 2015], or diffusion models in short. DPMs can be viewed as a type of variational autoencoder (VAE) [Kingma and Welling, 2013, Rezende et al., 2014], whose structure and loss function allows for efficient training of arbitrarily deep models. Interest in diffusion models has recently reignited due to their impressive image generation results [Ho et al., 2020, Song and Ermon, 2020].
     
    Ho et al. [2020] introduced a number of model innovations to the original DPM, with impressive results on image generation quality benchmarks. They showed that the VLB objective, for a diffusion model with discrete time and diffusion variances shared across input dimensions, is equivalent to multiscale denoising score matching, up to particular weightings per noise scale. Further improvements were proposed by Nichol and Dhariwal [2021], resulting in better log-likelihood scores. Gao et al. [2020] show how diffusion can also be used to efficiently optimize energy-based models (EBMs) towards a close approximation of the log-likelihood objective, resulting in high-fidelity samples even after long MCMC chains.
     
    Song and Ermon [2019] first proposed learning generative models through a multi-scale denoising score matching objective, with improved methods in Song and Ermon [2020]. This was later extended to continuous-time diffusion with novel sampling algorithms based on reversing the diffusion process [Song et al., 2021b].
     
    Concurrent to our work, Song et al. [2021a], Huang et al. [2021], and Vahdat et al. [2021] also derived variational lower bounds to the data likelihood under a continuous-time diffusion model. Where we consider the infinitely deep limit of a standard VAE, Song et al. [2021a] and Vahdat et al. [2021] present different derivations based on stochastic differential equations. Huang et al. [2021] considers both perspectives and discusses the similarities between the two approaches. An advantage of our analysis compared to these other works is that we present an intuitive expression of the VLB in terms of the signal-to-noise ratio of the diffused data, leading to much simplified expressions of the discrete-time and continuous-time loss, allowing for simple and numerically stable implementation. This also leads to new results on the invariance of the generative model and its VLB to the specification of the diffusion process. We empirically compare to these works, as well as others, in Table 1.
     
    Previous approaches to diffusion probabilistic models fixed the diffusion process; in contrast, we optimize the diffusion process parameters jointly with the rest of the model. This turns the model into a type of VAE [Kingma and Welling, 2013, Rezende et al., 2014]. This is enabled by directly parameterizing the mean and variance of the marginal q(zt|z0), where previous approaches instead parameterized the individual diffusion steps q(zt+ε|zt). In addition, our denoising models include several architecture changes, the most important of which is the use of Fourier features, which enable us to reach much better likelihoods than previous diffusion probabilistic models.
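
    The "directly parameterizing the marginal" point is worth making concrete: instead of composing many small diffusion steps, you can sample z_t for any t in one shot. A minimal sketch, assuming the variance-preserving parameterization (alpha_t^2 = sigmoid(-gamma_t), sigma_t^2 = sigmoid(gamma_t), with gamma_t = -log SNR(t)); the function names here are mine, not the paper's:

```python
import numpy as np

def sample_z_t(x, gamma_t, rng):
    """Sample from the marginal q(z_t | x) = N(alpha_t * x, sigma_t^2 * I).

    Variance-preserving parameterization: alpha_t^2 + sigma_t^2 = 1, with
    sigma_t^2 = sigmoid(gamma_t) and gamma_t = -log SNR(t).
    """
    sigma2 = 1.0 / (1.0 + np.exp(-gamma_t))   # sigmoid(gamma_t)
    alpha = np.sqrt(1.0 - sigma2)
    sigma = np.sqrt(sigma2)
    eps = rng.standard_normal(x.shape)        # the noise the model will predict
    return alpha * x + sigma * eps, eps

rng = np.random.default_rng(0)
x = rng.standard_normal((4,))
z_t, eps = sample_z_t(x, gamma_t=0.0, rng=rng)  # gamma = 0 means SNR(t) = 1
```

    Because gamma_t can come from any (monotonic) learnable function of t, the schedule can be optimized jointly with the denoising network, which is exactly the flexibility the paragraph above describes.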


    3. Model

    We will focus on the most basic case of generative modeling, where we have a dataset of observations of x, and the task is to estimate the marginal distribution p(x). As with most generative models, the described methods can be extended to the case of multiple observed variables, and/or the task of estimating conditional densities p(x|y). The proposed latent-variable model consists of a diffusion process (Section 3.1) that we invert to obtain a hierarchical generative model (Section 3.3). As we will show, the model choices below result in a surprisingly simple variational lower bound (VLB) of the marginal likelihood, which we use for optimization of the parameters.


    3.1. Forward time diffusion process


    3.2. Noise schedule


    3.3. Reverse time generative model


    3.4. Noise prediction model and Fourier features


    3.5. Variational lower bound


    4. Discrete-time model


    4.1. More steps leads to a lower loss


    5. Continuous-time model: T → ∞


    5.1. Equivalence of diffusion models in continuous time


    5.2. Weighted diffusion loss


    5.3. Variance minimization

    Lowering the variance of the Monte Carlo estimator of the continuous-time loss generally improves the efficiency of optimization. We found that using a low-discrepancy sampler for t, as explained in Appendix I.1, leads to a significant reduction in variance. In addition, due to the invariance shown in Section 5.1 for the continuous-time case, we can optimize the schedule between its endpoints to minimize the variance of our estimator of the loss, as detailed in Appendix I. The endpoints of the noise schedule are simply optimized w.r.t. the VLB.
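
    The low-discrepancy sampler is a one-liner and worth seeing: instead of drawing k independent uniforms for the batch, draw a single offset and stride it across [0, 1), so the timesteps cover the interval evenly. A sketch of how I understand Appendix I.1 (function name is mine):

```python
import numpy as np

def low_discrepancy_t(batch_size, rng):
    """Low-discrepancy timesteps: one shared uniform offset, strided per example.

    Each length-(1/batch_size) subinterval of [0, 1) receives exactly one
    sample, which reduces the variance of the Monte Carlo loss estimate
    compared to i.i.d. uniform sampling.
    """
    u0 = rng.uniform()
    i = np.arange(batch_size)
    return np.mod(u0 + i / batch_size, 1.0)

t = low_discrepancy_t(8, np.random.default_rng(0))
```

    Each t_i is still marginally uniform on [0, 1), so the loss estimate stays unbiased; only the correlation across the batch changes.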


    6. Experiments

    We demonstrate our proposed class of diffusion models, which we call Variational Diffusion Models (VDMs), on the CIFAR-10 [Krizhevsky et al., 2009] dataset, and the downsampled ImageNet [Van Oord et al., 2016, Deng et al., 2009] dataset, where we focus on maximizing likelihood. For our result with data augmentation we used random flips, 90-degree rotations, and color channel swapping. More details of our model specifications are in Appendix B.


    6.1. Likelihood and samples

    Table 1 shows our results on modeling the CIFAR-10 dataset, and the downsampled ImageNet dataset. We establish a new state-of-the-art in terms of test set likelihood on all considered benchmarks, by a significant margin. Our model for CIFAR-10 without data augmentation surpasses the previous best result of 2.80 bits per dimension about 10× faster than it takes the Sparse Transformer to reach this, in wall clock time on equivalent hardware. Our CIFAR-10 model, whose hyper-parameters were tuned for likelihood, results in a FID (perceptual quality) score of 7.41. This would have been state-of-the-art until recently, but is worse than recent diffusion models that specifically target FID scores [Nichol and Dhariwal, 2021, Song et al., 2021b, Ho et al., 2020]. By instead using a weighted diffusion loss, with the weighting function w(SNR) used by Ho et al. [2020] and described in Appendix K, our FID score improves to 4.0. We did not pursue further tuning of the model to improve FID instead of likelihood. A random sample of generated images from our model is provided in Figure 3. We provide additional samples from this model, as well as our other models for the other datasets, in Appendix M.


    6.2. Ablations

    Next, we investigate the relative importance of our contributions. In Table 2 we compare our discrete-time and continuous-time specifications of the diffusion model: When evaluating our model with a small number of steps, our discretely trained models perform better by learning the diffusion schedule to optimize the VLB. However, as argued theoretically in Section 4.1, we find experimentally that more steps T indeed gives better likelihood. When T grows large, our continuously trained model performs best, helped by training its diffusion schedule to minimize variance instead.
     
    Minimizing the variance also helps the continuous time model to train faster, as shown in Figure 5. This effect is further examined in Table 4b, where we find dramatic variance reductions compared to our baselines in continuous time. Figure 4a shows how this effect is achieved: Compared to the other schedules, our learned schedule spends much more time in the high SNR(t) / low σ_t^2 range.
     
    In Figure 5 we further show training curves for our model including and excluding the Fourier features proposed in Appendix C: with Fourier features enabled our model achieves much better likelihood. For comparison we also implemented Fourier features in a PixelCNN++ model [Salimans et al., 2017], where we do not see a benefit. In addition, we find that learning the SNR is necessary to get the most out of including Fourier features: if we fix the SNR schedule to that used by Ho et al. [2020], the maximum log-SNR is fixed to approximately 8 (see figure 7), and test set negative likelihood stays above 4 bits per dim. When learning the SNR endpoints, our maximum log-SNR ends up at 13.3, which, combined with the inclusion of Fourier features, leads to the SOTA test set likelihoods reported in Table 1.
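
    The Fourier feature idea itself is simple to write down: augment the network input z with high-frequency sinusoids so the fine-scale (low-order-bit) structure of z becomes easy for the network to see. A minimal sketch of how I understand Appendix C; the exponent choices (7, 8) and the function name are illustrative, not taken verbatim from the paper:

```python
import numpy as np

def fourier_features(z, freqs=(7, 8)):
    """Append sin(2^n * pi * z) and cos(2^n * pi * z) along the channel axis.

    Large exponents n make the sinusoids oscillate rapidly in z, so tiny
    changes in z (fine-scale detail) produce large changes in the features.
    """
    feats = [z]
    for n in freqs:
        feats.append(np.sin((2.0 ** n) * np.pi * z))
        feats.append(np.cos((2.0 ** n) * np.pi * z))
    return np.concatenate(feats, axis=-1)

z = np.zeros((2, 3))        # e.g. a batch of 2 inputs with 3 channels
out = fourier_features(z)   # 3 channels become 3 * (1 + 2 * len(freqs)) = 15
```

    This also makes the interaction with the learned SNR endpoints plausible: the high-frequency features only carry usable signal when the maximum SNR is high enough that those fine scales are not drowned in noise.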


    6.3. Lossless compression

    For a fixed number of evaluation timesteps T_eval, our diffusion model in discrete time is a hierarchical latent variable model that can be turned into a lossless compression algorithm using bits-back coding [Hinton and Van Camp, 1993]. As a proof of concept of practical lossless compression, Table 2 reports net codelengths on the CIFAR10 test set for various settings of T_eval using BBANS [Townsend et al., 2018], an implementation of bits-back coding based on asymmetric numeral systems [Duda, 2009]. Details of our implementation are given in Appendix N. We achieve state-of-the-art net codelengths, proving our model can be used as the basis of a lossless compression algorithm. However, for large T_eval a gap remains with the theoretically optimal codelength corresponding to the negative VLB, and compression becomes computationally expensive due to the large number of neural network forward passes required. Closing this gap with more efficient implementations of bits-back coding suitable for very deep models is an interesting avenue for future work.


    7. Conclusion

    We presented state-of-the-art results on modeling the density of natural images using a new class of diffusion models that incorporates a learnable diffusion specification, Fourier features for fine-scale modeling, as well as other architectural innovations. In addition, we obtained new theoretical insight into likelihood-based generative modeling with diffusion models, showing a surprising invariance of the VLB to the forward time diffusion process in continuous time, as well as an equivalence between various diffusion processes from the literature previously thought to be different.


    A. Distribution details


    B. Hyperparameters, architecture, and implementation details


    C. Fourier features for improved fine scale prediction


    D. As an SDE


    E. Derivation of the VLB estimators


    F. Influence of the number of steps T on the VLB


    G. Equivalence of diffusion specifications


    H. Implementation of monotonic neural net noise schedule γη(t)


    K. Comparison to DDPM and NCSN objectives


    L. Consistency


     
