The discrete score matching models we've studied so far can be generalized to continuous-time differential equations. The presentation here follows Song et al. 2021.
In this section, we'll first give the generalized definitions. Then we'll interpret them in the context of what we've seen before.
Results in this section are presented without proof for now, due to constraints on my time. Apologies.
Define. A forward-time diffusion SDE is modeled as \[ \begin{align*} d\mathbf{x} = f(\mathbf{x}, t)dt +g(t)d\mathbf{w} \end{align*} \] where \(\mathbf{w}\) is a Wiener process, \(f(\cdot)\) is a "drift" coefficient, and \(g(\cdot)\) is a "diffusion" coefficient.
As time flows forward, noise is added to the observations.
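To build intuition, forward-time trajectories of such an SDE can be simulated with a simple Euler–Maruyama discretization. The sketch below is illustrative rather than a reference implementation; `f` and `g` are whatever drift and diffusion coefficients one chooses.

```python
import numpy as np

def euler_maruyama(f, g, x0, ts, rng):
    # Simulate dx = f(x, t) dt + g(t) dw forward in time on the grid ts.
    x = np.asarray(x0, dtype=float).copy()
    for t0, t1 in zip(ts[:-1], ts[1:]):
        dt = t1 - t0
        dw = rng.standard_normal(x.shape) * np.sqrt(dt)  # Wiener increment
        x = x + f(x, t0) * dt + g(t0) * dw
    return x

# Example: a pure-diffusion process with no drift
# rng = np.random.default_rng(0)
# x_T = euler_maruyama(lambda x, t: 0.0, lambda t: 1.0, np.zeros(2),
#                      np.linspace(0.0, 1.0, 1000), rng)
```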
Result. The reverse-time diffusion SDE was shown by Anderson (1982) to be \[ \begin{align*} d\mathbf{x} = (f(\mathbf{x}, t) - g(t)^2 \nabla_\mathbf{x} \log p_t(\mathbf{x}))dt + g(t)d\bar{\mathbf{w}}, \end{align*} \] where \(\bar{\mathbf{w}}\) is a backward-flowing Wiener process.
As time flows backward, noise is removed from the observations. From this equation it's clear that reversing the SDE comes down to learning to estimate the score \(\nabla_\mathbf{x}\log p_t(\mathbf{x})\), i.e., score matching.
Define. The probability flow ODE is defined as \[ \begin{equation*} d\mathbf{x} = \left(f(\mathbf{x},t) - \frac{1}{2}g(t)^2\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\right)dt. \end{equation*} \] Result. Every reverse-time diffusion SDE has a corresponding probability flow ODE with the same marginal densities.
Here we'll present a specific instantiation of the SDE framework, the variance-exploding SDE. This framework naturally extends the discrete score matching framework to continuous time.
Define. The variance-exploding SDE and corresponding probability flow ODE are modeled as \[ \begin{align*} d\mathbf{x} & = \underbrace{\sqrt{d\sigma^2}}_{g(t)}d\mathbf{w} \tag{forward-time SDE} \\d\mathbf{x} & = -\frac{1}{2}\underbrace{\nabla_\mathbf{x}\log p_t(\mathbf{x})}_{\approx\ s_\theta(\mathbf{x})}d\sigma^2 \tag{reverse-time ODE} \end{align*} \] Justification. Recall that in a discrete score matching model, at training time we add noise \[ \begin{align*} \mathbf{x}_t = \mathbf{x}_0 + \sigma_t\boldsymbol\epsilon_t,\quad\quad\boldsymbol\epsilon_t \sim N(\mathbf{0}, \mathbf{I}), \end{align*} \] i.e., \(p_t(\mathbf{x}_t|\mathbf{x}_0) = N(\mathbf{x}_0, \sigma_t^2\mathbf{I})\). Because the variances of independent Gaussians add, we can interpret this as a Markov chain (with \(\sigma_t\) increasing in \(t\)) \[ \begin{align*} \mathbf{x}_t & = \mathbf{x}_{t-1}+ \sqrt{\sigma_t^2- \sigma_{t-1}^2}\boldsymbol\epsilon_t. \end{align*} \] Taking the limit as \(T\rightarrow\infty\) recovers the desired forward-time SDE formulation. Our setup has \[ \begin{align*} f(\mathbf{x},t) & = 0 & g(t) & = \sqrt{d\sigma^2}. \end{align*} \] Remark. Just as in the discrete score matching case, training comes down to estimating the score with a neural network, \(s_\theta(\mathbf{x}) \approx \nabla_{\mathbf{x}}\log p_t(\mathbf{x})\). However, instead of using Langevin dynamics to produce samples, at inference time we'll solve the corresponding probability flow ODE.
Result. An Euler step on the probability flow ODE from \(\mathbf{x}_t\) to \(\mathbf{x}_{t-1}\) is given by discretizing \[ \begin{align*} \mathbf{x}_{t-1} & = \mathbf{x}_t + \frac{1}{2}s_\theta(\mathbf{x}_t)\left(\sigma_t^2 - \sigma_{t-1}^2\right)\\ & = \mathbf{x}_t - \frac{1}{2\sigma_t}\hat{\boldsymbol\epsilon}(\mathbf{x}_{t})\left(\sigma_t^2 - \sigma_{t-1}^2\right). \end{align*} \] (Note the signs: the update moves against the predicted noise, toward \(\mathbf{x}_0\).)
The equivalence above comes from the fact, shown on the previous page, that we can specify the neural network to model either the score or the noise that was added to \(\mathbf{x}_0\) at sampling time, \[ \begin{align*} \underbrace{\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t|\mathbf{x}_0)}_{\approx\ s_\theta(\mathbf{x}_t)} & = \frac{-1}{\sigma_t^2}(\mathbf{x}_t - \mathbf{x}_0) = \frac{-1}{\sigma_t} \underbrace{\boldsymbol\epsilon_t}_{\approx\ \hat{\boldsymbol\epsilon}(\mathbf{x}_t)}. \end{align*} \]
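Putting the pieces together, here's a minimal sketch of the resulting sampler, using the \(\hat{\boldsymbol\epsilon}\) form of the Euler step. Here `eps_model(x, sigma)` is a hypothetical trained noise-prediction network and the schedule is purely illustrative.

```python
import numpy as np

def ve_probability_flow_sample(eps_model, x_T, sigmas):
    # Integrate the VE probability flow ODE from sigma_T down to sigma_0
    # with Euler steps in sigma^2 space.
    x = x_T
    for sigma_t, sigma_prev in zip(sigmas[:-1], sigmas[1:]):
        # x_{t-1} = x_t - eps_hat(x_t) / (2 sigma_t) * (sigma_t^2 - sigma_{t-1}^2)
        x = x - eps_model(x, sigma_t) / (2.0 * sigma_t) * (sigma_t**2 - sigma_prev**2)
    return x

# Example usage with a toy geometric schedule (model omitted):
# sigmas = np.geomspace(80.0, 0.01, num=50)
# x_T = np.random.default_rng(0).standard_normal((3, 32, 32)) * sigmas[0]
# sample = ve_probability_flow_sample(eps_model, x_T, sigmas)
```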
Now we can connect what we've learned to the diffusion models we're familiar with. Recall the notation from prior pages.
Big Result. A diffusion model is equivalent to training and inference using a variance-exploding SDE on \(\bar{\mathbf{x}}_t\) with scales \(\lambda_t\), defined \[ \begin{align*} \bar{\mathbf{x}}_t &\triangleq \frac{\mathbf{x}_t}{\alpha_t}&\lambda_t &\triangleq \frac{\sigma_t}{\alpha_t} &d\bar{\mathbf{x}}_t & = \hat{\boldsymbol\epsilon}(\alpha_t \bar{\mathbf{x}}_t)d\lambda_t. \end{align*} \] This is the continuous analogue to the result from our previous page connecting diffusion models to discrete score matching. We'll break this up into two smaller results.
Result 1. At training time, making the DDPM assumption (so that the model is Markovian) recovers the desired forward-time SDE formulation. Recall that the DDPM assumption specifies \[ \begin{align*} \gamma_t^2 = \frac{\sigma_{t-1}^2}{\sigma_t^2}\underbrace{\left(1-\frac{\alpha_t^2}{\alpha_{t-1}^2}\right)}_{\beta_t}. \end{align*} \] Proof. We then have the following model \(q(\mathbf{x}_t|\mathbf{x}_{t-1})\), where \(\beta_t = 1 - \frac{\alpha_t^2}{\alpha_{t-1}^2}\). \[ \begin{align*} \mathbf{x}_t &= \sqrt{1-\beta_t}\mathbf{x}_{t-1} + \sqrt\beta_t\boldsymbol{\epsilon}_t\quad\quad\ \boldsymbol\epsilon_t \sim N(\mathbf{0},\mathbf{I})\\ & = \frac{\alpha_t}{\alpha_{t-1}}\mathbf{x}_{t-1}+ \sqrt{1-\frac{\alpha_{t}^2}{\alpha_{t-1}^2}}\boldsymbol\epsilon_t\\ \frac{\mathbf{x}_t}{\alpha_t} - \frac{\mathbf{x}_{t-1}}{\alpha_{t-1}} & = \frac{\sqrt{\alpha_{t-1}^2-\alpha_t^2}}{\alpha_t\alpha_{t-1}}\boldsymbol{\epsilon}_t\\ \frac{\mathbf{x}_t}{\alpha_t} - \frac{\mathbf{x}_{t-1}}{\alpha_{t-1}} & = \sqrt{\frac{\sigma_t^2}{\alpha_t^2} - \frac{\sigma_{t-1}^2}{\alpha_{t-1}^2}}\boldsymbol{\epsilon}_t\\ \bar{\mathbf{x}}_t - \bar{\mathbf{x}}_{t-1} & = \sqrt{\lambda_t^2 - \lambda_{t-1}^2}\boldsymbol\epsilon_t \end{align*} \] The fourth line uses \(\sigma_t^2 = 1 - \alpha_t^2\) (equivalently, \(\lambda_t^2 + 1 = 1/\alpha_t^2\)). The forward-time SDE interpretation is clear when taking the limit as \(T\rightarrow \infty\). \[ \begin{align*} d\bar{\mathbf{x}}_t & = \underbrace{\sqrt{d\lambda^2_t}}_{g(t)}d\mathbf{w} \end{align*} \]
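As a quick numerical sanity check on the fourth line of the proof (which uses \(\sigma_t^2 = 1 - \alpha_t^2\)), the following sketch verifies that the two expressions for the per-step noise scale agree on an arbitrary schedule:

```python
import numpy as np

alpha = np.linspace(0.99, 0.01, 100)   # any decreasing schedule
sigma = np.sqrt(1.0 - alpha**2)        # variance-preserving: alpha^2 + sigma^2 = 1
lam = sigma / alpha

lhs = np.sqrt(alpha[:-1]**2 - alpha[1:]**2) / (alpha[1:] * alpha[:-1])
rhs = np.sqrt(lam[1:]**2 - lam[:-1]**2)
assert np.allclose(lhs, rhs)           # the Markov-chain noise scales match
```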
Result 2. At inference time, making the assumption \(\gamma_t=0\) means a DDIM step coincides with an Euler step in the probability flow ODE, where we take steps in \(\lambda_t\) instead of \(\lambda_t^2\) [Footnote 1].
Proof. Recall the DDIM update rule: \[ \begin{align*} \mathbf{x}_{t-1} & = \alpha_{t-1}\hat{\mathbf{x}}_0(\mathbf{x}_t) + \sigma_{t-1}\hat{\boldsymbol\epsilon}(\mathbf{x}_t)\\ & = \frac{\alpha_{t-1}}{\alpha_t}(\mathbf{x}_t-\sigma_t\hat{\boldsymbol\epsilon}(\mathbf{x}_t)) + \sigma_{t-1}\hat{\boldsymbol\epsilon}(\mathbf{x}_t)\\ \frac{\mathbf{x}_{t-1}}{\alpha_{t-1}} & = \frac{\mathbf{x}_t}{\alpha_t}+\left(\frac{\sigma_{t-1}}{\alpha_{t-1}}-\frac{\sigma_t}{\alpha_t}\right)\hat{\boldsymbol\epsilon}(\mathbf{x}_t)\\ \frac{\mathbf{x}_{t-1}}{\alpha_{t-1}}-\frac{\mathbf{x}_{t}}{\alpha_{t}} & = \left(\frac{\sigma_{t-1}}{\alpha_{t-1}}-\frac{\sigma_{t}}{\alpha_{t}}\right)\hat{\boldsymbol\epsilon}(\mathbf{x}_t). \end{align*} \] The reverse-time ODE interpretation is clear when taking the limit as \(T\rightarrow\infty\). \[ \begin{align*} d\bar{\mathbf{x}}_{t} & = \hat{\boldsymbol\epsilon}\left(\frac{\bar{\mathbf{x}}_t}{\sqrt{\lambda_t^2+1}}\right)d\lambda_t. \end{align*} \] Note that \(\sqrt{\lambda_t^2+1} = 1/\alpha_t\).
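For concreteness, here's a minimal sketch of one deterministic DDIM step written as an Euler update in \(\lambda\) space. The network `eps_model(x, sigma)` is a hypothetical noise predictor, not an API from any particular library.

```python
def ddim_step(eps_model, x_t, alpha_t, alpha_prev, sigma_t, sigma_prev):
    # One deterministic DDIM step (gamma_t = 0), i.e., an Euler step in lambda space.
    eps_hat = eps_model(x_t, sigma_t)
    lam_t, lam_prev = sigma_t / alpha_t, sigma_prev / alpha_prev
    x_bar = x_t / alpha_t                        # rescaled variable x_bar = x / alpha
    x_bar = x_bar + (lam_prev - lam_t) * eps_hat
    return alpha_prev * x_bar                    # back to the original variable
```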
Remark. Meanwhile, plugging \(f(\cdot)\) and \(g(\cdot)\) into the expression for the probability flow ODE yields \[ \begin{align*} d\bar{\mathbf{x}}_t &= \left(f(\bar{\mathbf{x}},t) - \frac{1}{2}g(t)^2\underbrace{\nabla_{\bar{\mathbf{x}}}\log p_t(\bar{\mathbf{x}})}_{s_\theta(\bar{\mathbf{x}})}\right)dt\\ & = \frac{1}{2}\frac{1}{\lambda_t}\hat{\boldsymbol\epsilon}\left(\frac{\bar{\mathbf{x}}_t}{\sqrt{\lambda_t^2+1}}\right)d\lambda_t^2 \end{align*} \] (To see the above, recall that \(\nabla_{\bar{\mathbf{x}}}\log p_t(\bar{\mathbf{x}}) = -\boldsymbol\epsilon_t/\lambda_t\).)
Because \(d\lambda_t^2 = 2\lambda_td\lambda_t\), this is equivalent to the DDIM step. However, the corresponding Euler steps are not exactly the same, because naively discretizing the probability flow ODE would take steps in \(\lambda_t^2\) space whereas the DDIM update rule takes steps in \(\lambda_t\) space (Song et al. 2021). Empirically, it turns out that taking steps in \(\lambda_t\) space outperforms \(\lambda_t^2\) space most of the time [Footnote 1].
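The difference between the two step spaces is easy to see numerically: for the same pair of \(\lambda\) values, the two Euler rules multiply \(\hat{\boldsymbol\epsilon}\) by different step sizes. The numbers below are purely illustrative.

```python
lam_t, lam_prev = 2.0, 1.0  # one coarse reverse step

# DDIM (lambda-space) Euler step multiplies eps_hat by:
step_lam = lam_prev - lam_t                             # -1.0
# Naive probability-flow (lambda^2-space) Euler step multiplies eps_hat by:
step_lam_sq = (lam_prev**2 - lam_t**2) / (2 * lam_t)    # -0.75
# The two agree only in the limit lam_prev -> lam_t.
```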
There's a second way to interpret diffusion models: as variance-preserving SDEs. This class of SDEs is motivated by directly extending the DDPM framework to continuous time. In subsequent sections we won't spend as much time on this class of models, but we include it for the sake of completeness.
Define. The variance-preserving SDE and corresponding reverse-time SDE are modeled as \[ \begin{align*} d\mathbf{x} & = \underbrace{-\frac{1}{2}\beta_t\mathbf{x}}_{f(\mathbf{x},t)}dt + \underbrace{\sqrt{\beta_t}}_{g(t)}d\mathbf{w} \tag{forward-time SDE}\\ d\mathbf{x} & = \left(-\frac{1}{2}\beta_t\mathbf{x} - \beta_t\underbrace{\nabla_\mathbf{x}\log p_t(\mathbf{x})}_{\approx s_\theta(\mathbf{x})}\right)dt + \sqrt{\beta_t}d\bar{\mathbf{w}}\tag{reverse-time SDE} \end{align*} \] This SDE has a probability flow ODE as well, but we'll omit it for brevity. We really only need this definition to set up the following result.
Big Result. A diffusion model is equivalent to training and inference using a variance-preserving SDE.
Again we'll break this up into two smaller results.
Result 1. At training time, making the DDPM assumption and then using a few approximations valid in the continuous-time limit recovers the desired forward-time SDE formulation.
Proof. Making the Markovian assumption results in the following for \(q(\mathbf{x}_{t}|\mathbf{x}_{t-1})\), where \(\beta_t = 1 - \frac{\alpha_t^2}{\alpha_{t-1}^2}\). \[ \begin{align*} \mathbf{x}_t & = \sqrt{1-\beta_t}\mathbf{x}_{t-1} + \sqrt{\beta_t}\boldsymbol\epsilon_t\quad\quad\ \boldsymbol\epsilon_t \sim N(\mathbf{0},\mathbf{I})\\ & \approx \left(1-\frac{1}{2}\beta_t\right)\mathbf{x}_{t-1} + \sqrt{\beta_t}\boldsymbol\epsilon_t\\ \mathbf{x}_{t}- \mathbf{x}_{t-1}& \approx -\frac{1}{2}\beta_t\mathbf{x}_{t-1}+\sqrt{\beta_t}\boldsymbol\epsilon_t\\ & \approx -\frac{1}{2}\beta_{t-1}\mathbf{x}_{t-1} + \sqrt{\beta_{t-1}}\boldsymbol\epsilon_t \end{align*} \] On the second line we used a first-order Taylor approximation \(\sqrt{1-u} \approx 1-\frac{1}{2}u\), which is valid when \(u \approx 0\). This is reasonable for us because \(\alpha_t \approx \alpha_{t-1}\) as \(T\rightarrow\infty\), and so \(\beta_t\approx 0\). On the last line we used \(\beta_t \approx \beta_{t-1}\), which holds in the same limit.
Again taking the limit as \(T\rightarrow \infty\), from here we find the forward-time SDE formulation \[ \begin{align*} d\mathbf{x} & = \underbrace{-\frac{1}{2}\beta_t\mathbf{x}}_{f(\mathbf{x},t)}dt + \underbrace{\sqrt{\beta_t}}_{g(t)}d\mathbf{w}. \end{align*} \] Result 2. At inference time, a DDPM step (the variant where we add stochastic noise proportional to the "upper bound" on variance) is equivalent to a step in the reverse-time variance-preserving SDE.
Proof. Recall the DDPM update rule, where \(\gamma_t^2 = \frac{\sigma_{t-1}^2}{\sigma_t^2}\beta_t\). Note that here we're sampling from the variant of DDPM that adds an amount of noise set to \(\beta_t\), an upper bound on the actual variance. \[ \begin{align*} \mathbf{x}_{t-1}& = \frac{1}{\sqrt{1-\beta_t}}\left(\mathbf{x}_t + \beta_ts_\theta(\mathbf{x}_t)\right) + \sqrt{\beta_t}\mathbf{z}_t\\ & \approx \left(1+\frac{1}{2}\beta_t\right)(\mathbf{x}_t + \beta_ts_\theta(\mathbf{x}_t))+ \sqrt{\beta_t}\mathbf{z}_t\\ & \approx \left(1+\frac{1}{2}\beta_t\right)\mathbf{x}_t + \beta_ts_\theta(\mathbf{x}_t)+ \sqrt{\beta_t}\mathbf{z}_t\\ \mathbf{x}_{t-1}-\mathbf{x}_t & = \frac{1}{2}\beta_t\mathbf{x}_t + \beta_ts_\theta(\mathbf{x}_t)+ \sqrt{\beta_t}\mathbf{z}_t \end{align*} \] Taking the limit as \(T\rightarrow\infty\) (and remembering that time flows backward here, so \(dt\) is negative and the drift flips sign), this is exactly the reverse-time SDE \[ \begin{align*} d\mathbf{x} & = \left(-\frac{1}{2}\beta_t\mathbf{x} - \beta_ts_\theta(\mathbf{x})\right)dt + \sqrt{\beta_t}d\bar{\mathbf{w}}. \end{align*} \]
On the second line we used a first-order Taylor approximation \(1/\sqrt{1-u}\approx 1+\frac{1}{2}u\), which holds when \(u \approx 0\). On the third line we dropped the \(\beta_t^2 s_\theta(\mathbf{x}_t)\) term, since \(\beta_t^2\approx 0\).
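Assembled into code, one such reverse step might look like the following sketch; `score_model` is a hypothetical trained score network, and the schedule handling is deliberately simplified.

```python
import numpy as np

def ddpm_step(score_model, x_t, beta_t, rng):
    # One reverse DDPM step with the "upper bound" noise level beta_t.
    # As T -> infinity, iterating this discretizes the reverse-time VP SDE above.
    mean = (x_t + beta_t * score_model(x_t)) / np.sqrt(1.0 - beta_t)
    return mean + np.sqrt(beta_t) * rng.standard_normal(x_t.shape)
```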
Because the probability flow ODE is deterministic, we can interpret it as a Neural ODE (Chen et al. 2018). Using the instantaneous change of variables formula, we can then compute exact log-likelihoods.
Our setup for the probability flow ODE is \[ \begin{align*} d\mathbf{x} & = \hat{\boldsymbol\epsilon}_\theta(\mathbf{x},\sigma)d\sigma, \end{align*} \] which is the variance-exploding probability flow ODE from above, rewritten using \(d\sigma^2 = 2\sigma\, d\sigma\).
From this we can write \[ \begin{align*} \underbrace{\log p(\mathbf{x}_0)}_\text{over data} & = \underbrace{\log p(\mathbf{x}_T)}_\text{over prior} -\int_{\sigma_T}^{\sigma_0}\text{tr}\left(\frac{\partial}{\partial\mathbf{x}_\sigma} \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_\sigma,\sigma)\right)d\sigma\\ & = \log p(\mathbf{x}_T) + \int_{\sigma_0}^{\sigma_T}\text{tr}\left(\frac{\partial}{\partial\mathbf{x}_\sigma} \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_\sigma,\sigma)\right)d\sigma \end{align*} \]
We can solve the ODE from \(\sigma_0 \rightarrow \sigma_T\) using the options we'll later discuss on the Advanced Solvers page.
The trace can be approximated using the Skilling-Hutchinson trace estimator, \[ \begin{align*} \text{tr}\left(\frac{\partial}{\partial\mathbf{x}_\sigma} \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_\sigma,\sigma)\right) & =\mathbb{E}_{\mathbf{v}}\left[\mathbf{v}^\top \left(\frac{\partial}{\partial\mathbf{x}_\sigma} \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_\sigma,\sigma) \right) \mathbf{v}\right] \end{align*} \] where \(\mathbf{v}\) is typically drawn from a Rademacher distribution. The vector-Jacobian product \(\mathbf{v}^\top\left(\frac{\partial}{\partial\mathbf{x}_\sigma} \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_\sigma,\sigma)\right)\) can be computed with automatic differentiation packages, without ever materializing the full Jacobian.
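A minimal PyTorch sketch of this estimator; `eps_model(x, sigma)` is again a hypothetical noise-prediction network whose output is differentiable with respect to `x`.

```python
import torch

def hutchinson_trace(eps_model, x, sigma, n_probes=8):
    # Skilling-Hutchinson estimate of tr(d eps_hat / d x):
    # tr(J) = E_v[v^T J v] for Rademacher probe vectors v.
    x = x.detach().requires_grad_(True)
    eps = eps_model(x, sigma)
    total = x.new_zeros(())
    for _ in range(n_probes):
        v = torch.randint_like(x, low=0, high=2) * 2 - 1  # Rademacher +/-1 entries
        # v^T J via a vector-Jacobian product; no full Jacobian is formed.
        vjp, = torch.autograd.grad(eps, x, grad_outputs=v, retain_graph=True)
        total = total + (vjp * v).sum()
    return total / n_probes
```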
Here we provide a result which doesn't fit cleanly elsewhere.
Result. At inference time, sampling a score matching model with a deterministic variant of Langevin dynamics is equivalent to an Euler step in the probability flow ODE, where we take steps in \(\lambda_t^2\).
Proof. Recall that the Langevin dynamics update can be expressed \[ \begin{align*} \bar{\mathbf{x}}_{t-1} &= \bar{\mathbf{x}}_t + \frac{1}{2}\eta_t s_\theta(\bar{\mathbf{x}}_t) + \sqrt{\eta_t}\ \mathbf{z}_t, \quad \quad \mathbf{z}_t\sim N(\mathbf{0},\mathbf{I}) \end{align*} \] Consider a deterministic version of the above where we update \[ \begin{align*} \bar{\mathbf{x}}_{t-1} &= \bar{\mathbf{x}}_t + \frac{1}{2}\eta_ts_\theta(\bar{\mathbf{x}}_t)\\ \bar{\mathbf{x}}_t - \bar{\mathbf{x}}_{t-1} & =-\frac{1}{2}\eta_ts_\theta(\bar{\mathbf{x}}_t) \end{align*} \] If we set \(\eta_t = \lambda_t^2-\lambda_{t-1}^2\), then we would recover an Euler update for the probability flow ODE in \(\lambda_t^2\) space.
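As a sketch, here is the deterministic update with this particular step size; `score_model` is again a hypothetical trained score network.

```python
def deterministic_langevin_step(score_model, x_bar, lam_t, lam_prev):
    # Langevin update without the noise term, with eta_t = lam_t^2 - lam_prev^2.
    # This coincides with an Euler step of the probability flow ODE in lambda^2 space.
    eta_t = lam_t**2 - lam_prev**2
    return x_bar + 0.5 * eta_t * score_model(x_bar)
```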
[Footnote 1] For more details on choice of Euler update space, see the page on Advanced Solvers.