Denoising Diffusion Probabilistic Models: Variance Explosion #
In the standard DDPM formulation, $\symbf X _ {n - 1}$ is rescaled at every forward step, which slightly complicates the mathematical form of the equations. In this note we derive DDPM without the scaling factor, which simplifies the expressions.
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. arXiv preprint arXiv:2006.11239.
Markov forward diffusion #
Suppose $\symbf X _ 0 \sim q _ 0(\symbf x)$ is a random variable in $\R^d$ distributed according to the data distribution.
Consider a strictly increasing sequence of noise levels (standard deviations) $(\sigma _ 0, \sigma _ 1, \ldots, \sigma _ T)$ where $\sigma _ 0 = 0$ and $\sigma _ T$ is sufficiently large.
Let $(\symbf Z _ n) _ {n = 1}^{T} \sim _ {\iid} \mathcal N(\symbf 0, \bI _ d)$ be $T$ independent Gaussian random variables.
For $n \in \c{1, \ldots, T}$ define $\beta _ n > 0$ as $\beta _ n = \sigma _ n^2 - \sigma _ {n - 1}^2$. Define $(\symbf X _ {n}) _ {n = 1}^T$ recursively as: $$ \symbf X _ n = \sqrt{\beta _ n} \symbf Z _ n + \symbf X _ {n - 1}, \quad n \in \c{1, \ldots, T} $$ The Markov process $(\symbf X _ n) _ {n = 0}^T$ is called the forward diffusion, which gradually adds noise into $\symbf X _ 0$.
Suppose $q(\symbf x _ 0, \ldots, \symbf x _ T)$ is the joint density function of $(\symbf X _ 0, \ldots, \symbf X _ T)$. The forward diffusion process has conditional density given by: $$ q(\symbf x _ {1..T}|\symbf x _ 0) = \prod _ {n=1}^T q(\symbf x _ {n} | \symbf x _ {n-1}); \quad q(\symbf x _ n | \symbf x _ {n-1}) = \mathcal N\p{\symbf x _ n; \symbf x _ {n-1}, \beta _ n \bI _ d} $$ The marginal conditional density is given by: $$ q(\symbf x _ n | \symbf x _ 0) = \mathcal N(\symbf x _ n ; \symbf x _ 0, \sigma _ n^2\bI _ d) $$ Therefore for some $\symbf E _ n \sim \mathcal N(\symbf 0, \bI _ d)$ depending on $(\symbf Z _ 1, \ldots, \symbf Z _ n)$, we have: $$ \symbf X _ n = \sigma _ n \symbf E _ n + \symbf X _ 0 $$
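As a sanity check, the following NumPy sketch simulates the forward recursion and confirms that after $T$ steps the accumulated noise has standard deviation $\sigma _ T$, i.e. that iterating $\symbf X _ n = \symbf X _ {n - 1} + \sqrt{\beta _ n} \symbf Z _ n$ agrees with sampling $\symbf X _ T = \symbf X _ 0 + \sigma _ T \symbf E _ T$ directly. The geometric noise schedule and all numeric values are illustrative choices, not part of the note.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, batch = 100, 2, 100_000

# Illustrative schedule: geometric spacing between sigma_min and sigma_max, with sigma_0 = 0 prepended.
sigma_min, sigma_max = 0.01, 50.0
sigma = np.concatenate([[0.0], np.geomspace(sigma_min, sigma_max, T)])
beta = sigma[1:] ** 2 - sigma[:-1] ** 2          # beta_n = sigma_n^2 - sigma_{n-1}^2

x0 = rng.standard_normal((batch, d))             # stand-in for data samples
x = x0.copy()
for n in range(1, T + 1):                        # forward recursion X_n = X_{n-1} + sqrt(beta_n) Z_n
    x = x + np.sqrt(beta[n - 1]) * rng.standard_normal((batch, d))

# Direct sample from the marginal q(x_T | x_0) = N(x_0, sigma_T^2 I)
x_direct = x0 + sigma[-1] * rng.standard_normal((batch, d))

print(np.std(x - x0), np.std(x_direct - x0), sigma[-1])  # all approximately sigma_T
```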
A posterior of forward diffusion #
The posterior density can be derived by Bayes' theorem:
$$ q(\symbf{x} _ {n-1}|\symbf{x} _ {n}, \symbf{x} _ 0) = \frac{q(\symbf{x} _ {n}|\symbf{x} _ {n-1}, \symbf{x} _ 0)q(\symbf{x} _ {n-1}|\symbf{x} _ 0)}{q(\symbf{x} _ n|\symbf{x} _ 0)} $$
where we already know:
- $q(\symbf{x} _ n | \symbf{x} _ {n - 1}, \symbf{x} _ 0) = q(\symbf{x} _ n | \symbf{x} _ {n-1}) = \mathcal N(\symbf{x} _ n; \symbf{x} _ {n-1}, \beta _ n \symbf{I})$.
- $q(\symbf{x} _ n | \symbf{x} _ 0) = \mathcal N(\symbf{x} _ n ; \symbf{x} _ 0, \sigma _ n^2 \bI)$ and $q(\symbf{x} _ {n-1} | \symbf{x} _ 0) = \mathcal N(\symbf{x} _ {n - 1}; \symbf{x} _ 0, \sigma _ {n - 1}^2 \symbf{I})$.
Let $\gamma _ n = \sigma _ {n - 1}^2 / \sigma _ n^2$. The conditional density $q(\symbf x _ {n - 1} | \symbf x _ n, \symbf x _ 0), n \in \c{2, \ldots, T}$ has the closed form: $$ \begin{aligned} q(\symbf{x} _ {n - 1} | \symbf{x} _ n, \symbf{x} _ 0) &= \mathcal N\p{\symbf{x} _ {n - 1}; \tilde \mu _ n(\symbf{x} _ n, \symbf{x} _ 0), \tilde \beta _ n \bI _ d}\\ \tilde \mu _ n(\symbf x _ n, \symbf x _ 0) & = \gamma _ n \symbf x _ n + (1 - \gamma _ n) \symbf x _ 0; \quad \tilde \beta _ n = \sigma _ {n - 1}^2 (1 - \gamma _ n) \end{aligned} $$
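The closed form can be checked numerically by brute force: fix $\symbf x _ 0$, simulate the forward step many times, keep only draws where $\symbf x _ n$ lands near a chosen value, and compare the empirical mean and variance of $\symbf x _ {n - 1}$ against $\tilde \mu _ n$ and $\tilde \beta _ n$. The sketch below does this in one dimension; all numeric values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative scalar setup: fixed x0 and two consecutive noise levels.
x0, sigma_prev, sigma_n = 0.7, 1.0, 1.5
beta_n = sigma_n**2 - sigma_prev**2
gamma_n = sigma_prev**2 / sigma_n**2

# Simulate the forward step many times.
x_prev = x0 + sigma_prev * rng.standard_normal(2_000_000)
x_n = x_prev + np.sqrt(beta_n) * rng.standard_normal(x_prev.shape)

# Condition on x_n close to a chosen value by rejection.
target, tol = 1.2, 0.01
sel = np.abs(x_n - target) < tol

# Closed-form posterior q(x_{n-1} | x_n = target, x_0).
mu_tilde = gamma_n * target + (1 - gamma_n) * x0
beta_tilde = sigma_prev**2 * (1 - gamma_n)

print(x_prev[sel].mean(), mu_tilde)   # empirical mean vs closed form
print(x_prev[sel].var(), beta_tilde)  # empirical variance vs closed form
```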
Discretized reverse diffusion #
Consider a reverse-time Markov process with states in $\R^d$, decomposed as: $$ p _ \theta(\symbf x _ 0, \ldots, \symbf x _ T) = p _ \theta(\symbf x _ T) \prod _ {n = 1}^T p _ \theta(\symbf x _ {n - 1} | \symbf x _ {n}),\quad p _ \theta(\symbf x _ T) := \mathcal N(\symbf x _ T;\symbf 0, \sigma _ T^2\bI _ d) $$ This process is also called the denoising process.
ELBO of DDPM(VE) #
As in standard DDPM, the ELBO takes the form: $$ \begin{align} \mathcal L _ \theta = E\s{ \log p _ \theta (\symbf{X} _ 0 | \symbf{X} _ 1) - \sum _ {n=2}^T\d{q(\symbf{x} _ {n-1}|\symbf{X} _ n, \symbf{X} _ 0)}{p _ \theta(\symbf{x} _ {n-1}|\symbf{X} _ n)} - \d{q(\symbf x _ T | \symbf X _ 0)}{p _ \theta(\symbf x _ T)} } + \const \end{align} $$
Consider the following restricted form for the reverse transition $p _ \theta(\symbf x _ {n - 1} | \symbf x _ n)$, $n \ge 2$: $$ p _ \theta(\symbf x _ {n - 1} |\symbf x _ n) = \mathcal N(\symbf x _ {n - 1}; \mu _ \theta(\symbf x _ n, n), \Sigma _ \theta(\symbf x _ n, n)), \quad \Sigma _ \theta(\symbf x _ n, n) = \omega^2 _ n \bI _ d $$
Further, parameterize the mean through the forward posterior mean evaluated at a denoiser $D _ \theta(\symbf x _ n, \sigma _ n)$: $$ \mu _ \theta(\symbf x _ n, n) := \tilde \mu _ n(\symbf x _ n, D _ \theta(\symbf x _ n, \sigma _ n)) $$
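With this parameterization, ancestral sampling from the reverse process follows directly. Below is a minimal NumPy sketch; `denoiser` stands in for a trained $D _ \theta$, the choice $\omega _ n^2 = \tilde \beta _ n$ anticipates the optimal-variance discussion below, and returning the denoiser output at the final step is a simplifying assumption for $p _ \theta(\symbf x _ 0 | \symbf x _ 1)$ rather than anything specified in the note.

```python
import numpy as np

def sample_ddpm_ve(denoiser, sigma, d, rng):
    """Ancestral sampling sketch for the reverse process, with
    mu_theta(x_n, n) = tilde_mu_n(x_n, D_theta(x_n, sigma_n)) and omega_n^2 = beta_tilde_n.
    `denoiser(x, sigma_level)` is a stand-in for D_theta; sigma[0] == 0."""
    T = len(sigma) - 1
    x = sigma[T] * rng.standard_normal(d)            # x_T ~ N(0, sigma_T^2 I)
    for n in range(T, 1, -1):
        gamma_n = sigma[n - 1] ** 2 / sigma[n] ** 2
        beta_tilde = sigma[n - 1] ** 2 * (1 - gamma_n)
        mu = gamma_n * x + (1 - gamma_n) * denoiser(x, sigma[n])
        x = mu + np.sqrt(beta_tilde) * rng.standard_normal(d)
    return denoiser(x, sigma[1])                     # simplifying choice for the last step
```

For unit-Gaussian toy data, `lambda x, s: x / (1 + s ** 2)` is the ideal denoiser and can be used to smoke-test the sampler.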
Consider the KL term $D _ n(\symbf x _ n, \symbf x _ 0) := \d{q(\symbf x _ {n - 1} | \symbf x _ n, \symbf x _ 0)}{p _ \theta(\symbf x _ {n - 1} | \symbf x _ n)}$. By the analytic KL divergence between Gaussians with diagonal covariance, up to an additive constant that does not depend on $\theta$: $$ D _ n(\symbf x _ n, \symbf x _ 0) = \frac{1}{2 \omega _ n^2} \norm{\tilde \mu _ n(\symbf x _ n, \symbf x _ 0) - \tilde \mu _ n (\symbf x _ n, D _ \theta(\symbf x _ n, \sigma _ n))} _ 2^2 = \frac{(1 - \gamma _ n)^2}{2 \omega _ n^2} \norm{\symbf x _ 0 - D _ \theta(\symbf x _ n, \sigma _ n)} _ 2^2 \label{equ:ddpmveloss} $$
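In training, each such term reduces to a weighted denoising loss. The following is a hedged sketch of a one-sample Monte Carlo estimate of the $\theta$-dependent part of $D _ n$, again with `denoiser` as a stand-in for $D _ \theta$ and with the choice $\omega _ n^2 = \tilde \beta _ n$ discussed below; all names are illustrative.

```python
import numpy as np

def ddpm_ve_loss_term(x0, n, sigma, denoiser, rng):
    """One-sample Monte Carlo estimate of the theta-dependent part of D_n,
    with the choice omega_n^2 = beta_tilde_n. `denoiser` stands in for D_theta."""
    gamma_n = sigma[n - 1] ** 2 / sigma[n] ** 2
    beta_tilde = sigma[n - 1] ** 2 * (1 - gamma_n)
    x_n = x0 + sigma[n] * rng.standard_normal(x0.shape)   # x_n ~ q(x_n | x_0)
    weight = (1 - gamma_n) ** 2 / (2 * beta_tilde)
    return weight * np.sum((x0 - denoiser(x_n, sigma[n])) ** 2)
```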
Optimal Variance TODO #
Consider the following decomposition of the ELBO: $$ \begin{aligned} \L _ \theta(\symbf{X} _ 0) & = E\s{ \log p _ \theta(\symbf{X} _ T) + \sum _ {n=1}^T\log \frac{p _ \theta(\symbf{X} _ {n-1}|\symbf{X} _ {n})}{q(\symbf{X} _ n|\symbf{X} _ {n-1})} } \\ & = E\s{ \log p _ \theta(\symbf{X} _ T) + \sum _ {n=1}^T\log \p{\frac{p _ \theta(\symbf{X} _ {n-1}|\symbf{X} _ {n})}{q(\symbf{X} _ {n-1}|\symbf{X} _ {n})} \cdot \frac{q(\symbf{X} _ {n-1})}{q(\symbf{X} _ n)}} }\\ & = E\s{ \log \frac{p _ \theta(\symbf{X} _ T)}{q(\symbf{X} _ T)} + \sum _ {n=1}^T\log \frac{p _ \theta(\symbf{X} _ {n-1}|\symbf{X} _ {n})}{q(\symbf{X} _ {n-1}|\symbf{X} _ {n})} + \log q(\symbf{X} _ 0) } \\ & = -\d{q(\symbf{x} _ T)}{p _ \theta(\symbf{x} _ T)} - \sum _ {n=1}^T E\s{\d{q(\symbf{x} _ {n-1}|\symbf{X} _ n)}{p _ \theta(\symbf{x} _ {n-1}|\symbf{X} _ n)}} + E\s{\log q(\symbf{X} _ 0)} \end{aligned} $$ Notice that we have not yet discussed the choice of the variance $\omega _ n^2$; it is left for further discussion. It is straightforward to derive the optimal variance in some special cases:
When $\symbf X _ 0$ is deterministic, i.e. $q _ 0(\symbf x _ 0) = \delta(\symbf x _ 0 - \symbf c _ 0)$, then: $$ q(\symbf x _ {n - 1} | \symbf x _ {n}) = q(\symbf x _ {n - 1} | \symbf x _ n, \symbf x _ 0) = \mathcal N\p{\symbf{x} _ {n - 1}; \tilde \mu _ n(\symbf{x} _ n, \symbf{x} _ 0), \tilde \beta _ n \bI _ d} $$ So $\omega _ n^2 = \tilde \beta _ n$ is the optimal solution.
When $\symbf X _ 0$ has density $q _ 0(\symbf{x} _ 0) = \mathcal N(\symbf{x} _ 0; \symbf{0}, \sigma _ {\mathrm{data}}^2\symbf{I})$, then
- $q(\symbf{x} _ {n-1} | \symbf{x} _ n) = {q(\symbf{x} _ {n-1}) q(\symbf{x} _ n | \symbf{x} _ {n - 1})} /{q(\symbf{x} _ n)}$.
- $q(\symbf{x} _ n) = \mathcal N(\symbf{x} _ n; \symbf{0}, (\sigma _ {\mathrm{data}}^2 + \sigma _ {n}^2)\symbf{I})$, $q(\symbf{x} _ {n-1}) = \mathcal N(\symbf{x} _ {n-1}; \symbf{0}, (\sigma _ {\mathrm{data}}^2 + \sigma _ {n - 1}^2)\symbf{I})$.
- $q(\symbf{x} _ n | \symbf{x} _ {n - 1}) = \mathcal N(\symbf{x} _ n; \symbf{x} _ {n-1}, \beta _ n \symbf{I})$.
Therefore: $$ q(\symbf x _ {n - 1} | \symbf x _ {n}) = \mathcal N\p{\symbf x _ {n - 1}; \frac{\sigma _ {\mathrm{data}}^2 + \sigma _ {n - 1}^2}{\sigma _ {\mathrm{data}}^2 + \sigma _ n^2}\cdot \symbf x _ n, \frac{\sigma _ {\mathrm{data}}^2 + \sigma _ {n - 1}^2}{\sigma _ {\mathrm{data}}^2 + \sigma _ n^2} \cdot \beta _ n \symbf{I}} $$ So the optimal variance in this case is $\omega _ n^2 = \frac{\sigma _ {\mathrm{data}}^2 + \sigma _ {n - 1}^2}{\sigma _ {\mathrm{data}}^2 + \sigma _ n^2} \beta _ n$.
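A quick numerical check, using illustrative values and the standard fact (not derived in this note) that the ideal denoiser for Gaussian data is $D^\ast(\symbf x _ n) = \frac{\sigma _ {\mathrm{data}}^2}{\sigma _ {\mathrm{data}}^2 + \sigma _ n^2}\symbf x _ n$: plugging $D^\ast$ into $\tilde \mu _ n$ reproduces the posterior mean above.

```python
# Illustrative scalar values; no trained model involved.
sigma_data, sigma_prev, sigma_n = 2.0, 1.0, 1.5
gamma_n = sigma_prev**2 / sigma_n**2
x_n = 0.8

# Ideal (MMSE) denoiser for Gaussian data, then the restricted reverse mean.
d_star = sigma_data**2 / (sigma_data**2 + sigma_n**2) * x_n
mu_via_denoiser = gamma_n * x_n + (1 - gamma_n) * d_star

# Posterior mean from the closed form above.
mu_closed_form = (sigma_data**2 + sigma_prev**2) / (sigma_data**2 + sigma_n**2) * x_n

print(mu_via_denoiser, mu_closed_form)  # equal up to floating point
```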
Continuous Time Likelihood Weighting of DDPM(VE) #
[1] Kingma, Diederik, et al. "Variational diffusion models." Advances in neural information processing systems 34 (2021): 21696-21707.
[2] Song, Yang, et al. "Maximum likelihood training of score-based diffusion models." Advances in Neural Information Processing Systems 34 (2021): 1415-1428.
Inspired by [1], the following derivation demonstrates what happens when we take $T \to \infty$.
We fix the minimal and maximal noise levels $\sigma _ 1$ and $\sigma _ T$; from now on we denote them by $\sigma _ {\min}$ and $\sigma _ {\max}$.
Suppose there is a function $\sigma(t): [0, 1] \to [0, \sigma _ {\max}]$ such that:
- $\sigma(t)$ is strictly increasing and continuously differentiable.
- $\sigma(0) = 0$ and $\sigma(n / T) = \sigma _ n$ for $n \in \c{1,\ldots, T}$.
Consider equation $\ref{equ:ddpmveloss}$ with the choice $\omega _ n^2 = \tilde \beta _ n$; then equivalently, for $2 \le n \le T$: $$ \begin{align} D _ n(\symbf x _ n, \symbf x _ 0) & = \frac{\sigma _ n^2 - \sigma _ {n - 1}^2}{2 \sigma _ n^2 \sigma _ {n - 1}^2} \norm{\symbf x _ 0 - D _ \theta(\symbf x _ n, \sigma _ n)} _ 2^2 \\ & \gtrapprox \frac{\sigma^2(n/T) - \sigma^2((n - 1) / T)}{2 \sigma^4(n/T)} \norm{\symbf x _ 0 - D _ \theta(\symbf x _ {n}, \sigma(n/T))} _ 2^2 \end{align} $$
Now notice that as the partition becomes finer and $T \to \infty$: $$ \begin{align} \lim _ {T \to \infty}\sum _ {n = 2}^T D _ n(\symbf x _ n, \symbf x _ 0) &= \int _ {0}^{1} \frac{D[\sigma^2] (t)}{2 \sigma^4(t)} \norm{\symbf x _ {0} - D _ \theta(\symbf x _ t, \sigma(t))} _ 2^2 \dd t \\ &= \int _ {0}^{1} \frac{D[\sigma^2] (t)}{2} \norm{S _ \theta(\symbf x _ t, \sigma(t)) - \nabla \log q(\symbf x _ t | \symbf x _ 0)} _ 2^2 \dd t \end{align} $$ where $D[\sigma^2](t)$ denotes the time derivative of $\sigma^2(t)$, $\symbf x _ t$ is the noisy sample at noise level $\sigma(t)$, and $S _ \theta(\symbf x _ t, \sigma) := (D _ \theta(\symbf x _ t, \sigma) - \symbf x _ t) / \sigma^2$ is the score estimate induced by the denoiser. This is consistent with the results given in references [1] and [2]. The verification is left to the reader.
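The Riemann-sum limit can be illustrated numerically. The sketch below holds the squared error fixed at $1$ and uses a geometric schedule $\sigma(t) = \sigma _ {\min} (\sigma _ {\max} / \sigma _ {\min})^t$, so $\sigma(0) = \sigma _ {\min}$ rather than $0$, purely so that the limiting integral has a finite closed form; all values are illustrative.

```python
import numpy as np

# Geometric schedule on [0, 1] (illustrative; sigma(0) = sigma_min here, not 0).
sigma_min, sigma_max = 0.01, 50.0
sigma = lambda t: sigma_min * (sigma_max / sigma_min) ** t

# With the squared error held at 1, the integral has the closed form
#   int_0^1 d[sigma^2]/dt / (2 sigma^4) dt = (1/sigma_min^2 - 1/sigma_max^2) / 2.
exact = 0.5 * (1 / sigma_min**2 - 1 / sigma_max**2)

for T in (10, 100, 1000, 10000):
    n = np.arange(1, T + 1)
    s_n, s_prev = sigma(n / T), sigma((n - 1) / T)
    riemann = np.sum((s_n**2 - s_prev**2) / (2 * s_n**4))
    print(T, riemann, exact)   # the sum approaches the integral as T grows
```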