Throughout this section we'll adopt the generic notation of a Variance-Exploding SDE. As previously shown, diffusion models are a special case of such SDEs, but on this page focusing on the general case will simplify the notation.
Specifically, we model the conditional marginal distribution as \[ \begin{align*} p_t(\mathbf{x}_t| \mathbf{x}_0) = N(\mathbf{x}_t;\ \mathbf{x}_0,\ \sigma_t^2\mathbf{I}), \quad\text{i.e.}\quad \mathbf{x}_t = \mathbf{x}_0 + \sigma_t\boldsymbol\epsilon_t,\quad\quad\boldsymbol\epsilon_t \sim N(\mathbf{0}, \mathbf{I}). \end{align*} \]
Recall (from the discrete score matching section) that the corresponding score is \[ \begin{align*} \underbrace{\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t|\mathbf{x}_0)}_{\approx\ s_\theta(\mathbf{x}_t)} & = \frac{-1}{\sigma_t^2}(\mathbf{x}_t - \mathbf{x}_0) = \frac{-1}{\sigma_t} \underbrace{\boldsymbol\epsilon_t}_{\approx\ \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t)}. \end{align*} \] In terms of notation, we'll approximate the score with \(s_\theta(\mathbf{x})\) and the noise with \(\hat{\boldsymbol\epsilon}_\theta(\mathbf{x})\); both are valid modeling choices for the neural network.
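As a quick numerical sanity check of the relation above, here is a minimal sketch (nothing in it depends on a trained model; the vectors and the noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=4)            # a "clean" sample x_0
eps = rng.normal(size=4)           # eps_t ~ N(0, I)
sigma_t = 2.5                      # an arbitrary noise level

xt = x0 + sigma_t * eps            # forward model: x_t = x_0 + sigma_t * eps_t

# Conditional score of the Gaussian p_t(x_t | x_0), written two equivalent ways.
score = -(xt - x0) / sigma_t**2
assert np.allclose(score, -eps / sigma_t)

# Consequently, converting between the two network parameterizations is just a
# factor of -sigma:  s_theta(x, sigma) = -eps_hat(x, sigma) / sigma.
```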
Result. There are three equivalent ways of expressing the probability flow ODE, depending on the space we want to integrate over. \[ \begin{align*} d\mathbf{x} & = \hat{\boldsymbol\epsilon}_\theta(\mathbf{x},\sigma)d\sigma && = -\sigma s_\theta(\mathbf{x},\sigma)d\sigma\\ & = \frac{1}{2}\frac{1}{\sigma}\hat{\boldsymbol{\epsilon}}_\theta(\mathbf{x},\sigma)d\sigma^2&& = -\frac{1}{2}s_\theta(\mathbf{x},\sigma)d\sigma^2\\ & = \sigma\hat{\boldsymbol\epsilon}_\theta(\mathbf{x},\sigma)d\log\sigma && = -\sigma^2s_\theta(\mathbf{x},\sigma)d\log\sigma \end{align*} \] In other words, we can choose to integrate in one of three spaces: \(\{\sigma, \sigma^2, \log\sigma\}\).
Proof. It's easy to see the above because \[ \begin{align*} d\sigma^2 & = 2\sigma d\sigma&&& d\log\sigma & = \frac{1}{\sigma}d\sigma. \end{align*} \]
Remark. Unlike on previous pages, we write \(\hat{\boldsymbol{\epsilon}}_\theta(\mathbf{x}, \sigma)\) instead of \(\hat{\boldsymbol{\epsilon}}_\theta(\mathbf{x})\) to make the function's dependence on \(\sigma\) explicit.
Remark. Throughout the rest of this section we'll focus on the \(\sigma\) integration space with the \(\hat{\boldsymbol\epsilon}\) network parameterization, which coincides with the popular DDIM sampler. That is, solving the following ODE will be the focus for the rest of this page. \[ \begin{align*} d\mathbf{x} & = \hat{\boldsymbol\epsilon}_\theta(\mathbf{x},\sigma)d\sigma \end{align*} \] The methods described below, however, extend easily to other integration spaces and network parameterizations.
We begin with two straightforward single-step methods: the Euler method and Heun's method.
Define. An Euler step on the probability flow ODE is given by discretizing \[ \begin{align*} \mathbf{x}_{t+1} & = \mathbf{x}_{t} + \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t},\sigma_t)\left(\sigma_{t+1} - \sigma_{t}\right). \end{align*} \]
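As a concrete reference, here is a minimal sketch of an Euler sampler for the \(\sigma\)-space ODE. The names `eps_model` (the trained network \(\hat{\boldsymbol\epsilon}_\theta\)) and `sigmas` (a decreasing inference-time noise schedule ending at or near zero) are assumptions of this sketch, not a particular library's API.

```python
def euler_sample(eps_model, x_init, sigmas):
    """Integrate dx = eps_hat(x, sigma) dsigma with Euler steps.

    eps_model : callable (x, sigma) -> predicted noise, same shape as x
    x_init    : prior sample, typically sigmas[0] * standard normal noise
    sigmas    : decreasing sequence of noise levels, e.g. sigma_max, ..., ~0
    """
    x = x_init
    for sigma_t, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x = x + eps_model(x, sigma_t) * (sigma_next - sigma_t)
    return x
```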
Define. A Heun step on the probability flow ODE is given by discretizing \[ \begin{align*} \mathbf{x}_{t+1}' & = \mathbf{x}_t + \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t},\sigma_t)\left(\sigma_{t+1} - \sigma_{t}\right)\\ \mathbf{x}_{t+1} & = \mathbf{x}_t + \frac{1}{2}\left(\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t},\sigma_t) + \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t+1}', \sigma_{t+1})\right)\left(\sigma_{t+1} - \sigma_{t}\right) \end{align*} \] Heun's method is a straightforward application of the trapezoidal rule to solving the ODE. Its use for diffusion sampling was proposed by Karras et al. 2022.
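A corresponding sketch of a Heun sampler follows, under the same assumptions as the Euler sketch above. Skipping the corrector on the final step to \(\sigma = 0\) is a common practical choice (it avoids one extra function evaluation), not part of the definition.

```python
def heun_sample(eps_model, x_init, sigmas):
    """Heun (trapezoidal) sampler for dx = eps_hat(x, sigma) dsigma."""
    x = x_init
    for sigma_t, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        h = sigma_next - sigma_t
        eps_t = eps_model(x, sigma_t)
        x_pred = x + eps_t * h                        # Euler predictor x'_{t+1}
        if sigma_next > 0:
            eps_next = eps_model(x_pred, sigma_next)  # evaluate at the predictor
            x = x + 0.5 * (eps_t + eps_next) * h      # trapezoidal corrector
        else:
            x = x_pred                                # plain Euler step to sigma = 0
    return x
```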
Define. The exponential integrator methods try to solve the ODE by taking exact steps \[ \begin{align*} \mathbf{x}_{t+1} = \mathbf{x}_t + \underbrace{\int_{\sigma_t}^{\sigma_{t+1}}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_\sigma,\sigma)d\sigma}_\text{approximate this integral}. \end{align*} \] The challenge comes down to approximating the integral.
Note that \(\mathbf{x}_\sigma\) here denotes the solution path in \(\mathbf{x}\)-space, viewed as a function of \(\sigma\).
Remark. Our focus will be on multi-step methods which re-use prior function evaluations (saved in a cache). Specifically, we'll assume the following setup for a method of order \(k=3\): \[ \begin{align*} \underbrace{(\mathbf{x}_{t-2},\sigma_{t-2})\quad (\mathbf{x}_{t-1},\sigma_{t-1}) }_\text{history}\quad \underbrace{(\mathbf{x}_{t},\sigma_{t})}_\text{current} \quad \underbrace{(\mathbf{x}_{t+1},\sigma_{t+1}) }_\text{target} \end{align*} \] The typical strategy is to take Euler steps until we've built up enough of a cache to apply a \(k\)-th order method, and then only apply \(k\)-th order methods from that point on.
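The bookkeeping is the same for both methods described below, so a generic driver can be sketched once. Here `step_coefficients` stands in for whichever coefficient rule is used (Taylor or Adams-Bashforth); the cache conventions and function names are assumptions of this sketch.

```python
def multistep_sample(eps_model, x_init, sigmas, step_coefficients, k=3):
    """Generic k-th order multi-step sampler for dx = eps_hat(x, sigma) dsigma.

    step_coefficients(sigma_hist, sigma_next) returns weights c_0, ..., c_{k-1}
    (step size included) for the cached predictions, newest first, so that
    x_{t+1} = x_t + sum_i c_i * eps_hist[i].
    """
    x = x_init
    eps_hist, sigma_hist = [], []                     # newest entries first
    for sigma_t, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        eps_t = eps_model(x, sigma_t)
        eps_hist = [eps_t] + eps_hist[: k - 1]
        sigma_hist = [sigma_t] + sigma_hist[: k - 1]
        if len(eps_hist) < k:
            # Warm-up: plain Euler steps until the cache holds k evaluations.
            x = x + eps_t * (sigma_next - sigma_t)
        else:
            coeffs = step_coefficients(sigma_hist, sigma_next)
            x = x + sum(c * e for c, e in zip(coeffs, eps_hist))
    return x
```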
Multi-step methods are the current state of the art (Lu et al. 2022, Zhang and Chen 2023). The number of function evaluations can be brought down very low (say, 5 or 10).
Taylor Approximation Method
Here we'll approximate the integrand with a Taylor series centered at \(\sigma_t\). The higher order derivatives can be estimated from prior function evaluations and then treated as constants when computing the integral analytically.
Define. The Taylor approximation of order \(k\) centered at \(\sigma_t\) is given by \[ \begin{align*} \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_\sigma,\sigma)& \approx \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t) + \sum_{i=1}^{k-1}\frac{1}{i!}(\sigma-\sigma_t)^i\hat{\boldsymbol\epsilon}^{(i)}_\theta,\\ \hat{\boldsymbol\epsilon}^{(i)}_\theta & \triangleq \frac{\partial^i}{\partial\sigma^{i}}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_\sigma,\sigma)\Big|_{\sigma=\sigma_t} \tag{notation for derivative} \end{align*} \] Result. Assuming the steps in \(\sigma\) are equally spaced, we have the following Taylor Approximation step on the probability flow ODE for \(k=3\). \[ \begin{align*} \mathbf{x}_{t+1} & = \mathbf{x}_t + \left(\frac{5}{3}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t)-\frac{5}{6}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-1},\sigma_{t-1})+\frac{1}{6}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-2},\sigma_{t-2})\right)(\sigma_{t+1}-\sigma_t) \end{align*} \] Proof. The integral is given by the approximation \[ \begin{align*} \int_{\sigma_t}^{\sigma_{t+1}}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_\sigma,\sigma)d\sigma & \approx \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t)\int_{\sigma_t}^{\sigma_{t+1}}d\sigma + \hat{\boldsymbol\epsilon}^{(1)}_\theta\int_{\sigma_t}^{\sigma_{t+1}}(\sigma-\sigma_t)d\sigma + \frac{1}{2}\hat{\boldsymbol\epsilon}^{(2)}_\theta\int_{\sigma_t}^{\sigma_{t+1}}(\sigma-\sigma_t)^2d\sigma\\ & =\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t)(\sigma_{t+1}-\sigma_t) + \frac{1}{2}\hat{\boldsymbol\epsilon}^{(1)}_\theta(\sigma_{t+1}-\sigma_t)^2 +\frac{1}{6} \hat{\boldsymbol\epsilon}^{(2)}_\theta(\sigma_{t+1}-\sigma_t)^3\\ & =\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t)h + \frac{1}{2}\hat{\boldsymbol\epsilon}^{(1)}_\theta h^2 +\frac{1}{6} \hat{\boldsymbol\epsilon}^{(2)}_\theta h^3\\ & \approx \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t)h + \frac{1}{2}\frac{\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t)-\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-1},\sigma_{t-1})}{h}h^2 +\\&\quad \frac{1}{6} \frac{\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t)-2\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-1},\sigma_{t-1})+\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-2},\sigma_{t-2})}{h^2}h^3\\ & =\left(\frac{5}{3}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t)-\frac{5}{6}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-1},\sigma_{t-1})+\frac{1}{6}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-2},\sigma_{t-2})\right)h \end{align*} \] Above we introduced the notation \(h=(\sigma_{t+1}-\sigma_t)\); under the equal spacing assumption we also have \(\sigma_t-\sigma_{t-1} = \sigma_{t-1}-\sigma_{t-2} = h\).
Additionally, we approximated the higher order derivatives with the finite-difference estimates \[ \begin{align*} \hat{\boldsymbol\epsilon}_\theta^{(1)} & =\frac{\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t)-\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-1},\sigma_{t-1})}{h}\\ \hat{\boldsymbol\epsilon}_\theta^{(2)} & = \frac{1}{h}\left(\frac{\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t)-\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-1},\sigma_{t-1})}{h}-\frac{\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-1},\sigma_{t-1})-\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-2},\sigma_{t-2})}{h}\right) \end{align*} \] Remark. In practice, we rarely take steps that are constant in \(\sigma\), or in any of the alternative integration spaces (cf. the previous page on the inference-time noise schedule). Rather than assuming a constant step size, the coefficients above can be computed for arbitrary step sizes by using finite-difference estimates of the higher order derivatives that account for the actual spacings.
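Plugging the constant-step coefficients from the Result into the generic driver sketched earlier might look as follows. This is a sketch that assumes equal spacing in \(\sigma\); the function name is illustrative.

```python
def taylor3_coefficients(sigma_hist, sigma_next):
    """k=3 Taylor-approximation weights, assuming equally spaced sigmas.

    sigma_hist = [sigma_t, sigma_{t-1}, sigma_{t-2}] (newest first).
    Returns step-size-scaled weights for [eps_t, eps_{t-1}, eps_{t-2}].
    """
    h = sigma_next - sigma_hist[0]
    return [(5.0 / 3.0) * h, (-5.0 / 6.0) * h, (1.0 / 6.0) * h]

# Usage with the generic driver sketched earlier:
# x = multistep_sample(eps_model, x_init, sigmas, taylor3_coefficients, k=3)
```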
Remark. DPM solver (Lu et al. 2022) combines this method with the \(\log\sigma\) integration space.
Adams-Bashforth Method
Here we'll approximate the integrand with a Lagrange interpolating polynomial through the cached evaluations. The resulting polynomial can then be integrated analytically.
Define. The Lagrange polynomial estimate of order \(k\) is given by \[ \begin{align*} \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_\sigma,\sigma) \approx \sum_{i=0}^{k-1}\left(\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-i},\sigma_{t-i})\prod_{j\neq i}^{k-1}\frac{\sigma-\sigma_{t-j}}{\sigma_{t-i}-\sigma_{t-j}}\right) \end{align*} \] Example. When \(k=3\), we have \[ \begin{align*} \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_\sigma,\sigma) & \approx \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t)\frac{(\sigma-\sigma_{t-1})(\sigma-\sigma_{t-2})}{(\sigma_{t}-\sigma_{t-1})(\sigma_{t}-\sigma_{t-2})} \\& + \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-1},\sigma_{t-1})\frac{(\sigma-\sigma_{t})(\sigma-\sigma_{t-2})}{(\sigma_{t-1}-\sigma_{t})(\sigma_{t-1}-\sigma_{t-2})} \\ &+ \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-2},\sigma_{t-2})\frac{(\sigma-\sigma_{t})(\sigma-\sigma_{t-1})}{(\sigma_{t-2}-\sigma_{t})(\sigma_{t-2}-\sigma_{t-1})} \end{align*} \] Result. Assuming step sizes in \(\sigma\) are equally spaced, we have the following expression for \(k=3\). \[ \begin{align*} \mathbf{x}_{t+1}& = \mathbf{x}_t+ \left(\frac{23}{12}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t) - \frac{4}{3}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-1},\sigma_{t-1}) + \frac{5}{12}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-2},\sigma_{t-2})\right)(\sigma_{t+1}-\sigma_t) \end{align*} \] Proof. The integral is given by the approximation \[ \begin{align*} \int_{\sigma_t}^{\sigma_{t+1}}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_\sigma,\sigma)d\sigma & \approx \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t)\int_{\sigma_t}^{\sigma_{t+1}}\frac{(\sigma-\sigma_{t-1})(\sigma-\sigma_{t-2})}{(\sigma_{t}-\sigma_{t-1})(\sigma_{t}-\sigma_{t-2})}d\sigma \\ &\quad+ \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-1},\sigma_{t-1})\int_{\sigma_t}^{\sigma_{t+1}}\frac{(\sigma-\sigma_{t})(\sigma-\sigma_{t-2})}{(\sigma_{t-1}-\sigma_{t})(\sigma_{t-1}-\sigma_{t-2})}d\sigma\\ &\quad + \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-2},\sigma_{t-2})\int_{\sigma_t}^{\sigma_{t+1}}\frac{(\sigma-\sigma_{t})(\sigma-\sigma_{t-1})}{(\sigma_{t-2}-\sigma_{t})(\sigma_{t-2}-\sigma_{t-1})} d\sigma\\ & = \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t)\int_{\sigma_t}^{\sigma_{t+1}}\frac{(\sigma-\sigma_{t-1})(\sigma-\sigma_{t-2})}{2h^2}d\sigma \\ &\quad+ \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-1},\sigma_{t-1})\int_{\sigma_t}^{\sigma_{t+1}}\frac{(\sigma-\sigma_{t})(\sigma-\sigma_{t-2})}{-h^2}d\sigma\\ &\quad + \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-2},\sigma_{t-2})\int_{\sigma_t}^{\sigma_{t+1}}\frac{(\sigma-\sigma_{t})(\sigma-\sigma_{t-1})}{2h^2} d\sigma\\ & = \frac{1}{2h^2}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t)\left(\frac{23}{6}h^3\right) - \frac{1}{h^2}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-1},\sigma_{t-1})\left(\frac{4}{3}h^3\right) + \frac{1}{2h^2}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-2},\sigma_{t-2})\left(\frac{5}{6}h^3\right)\\ & = \left(\frac{23}{12}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t) - \frac{4}{3}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-1},\sigma_{t-1}) + \frac{5}{12}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-2},\sigma_{t-2})\right)h \end{align*} \] Remark. Once again, in practice, we rarely take steps that are constant in \(\sigma\), or any of the alternative integration spaces. The coefficients above can instead be computed using programmatic solvers to support arbitrary step sizes.
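Following the remark above, here is a minimal sketch of computing the Adams-Bashforth weights programmatically for an arbitrary \(\sigma\) grid by integrating each Lagrange basis polynomial exactly; with equally spaced \(\sigma\) and \(k=3\) it recovers the \(\frac{23}{12}, -\frac{4}{3}, \frac{5}{12}\) coefficients (times the step) from the Result. The function name is illustrative.

```python
import numpy as np

def adams_bashforth_coefficients(sigma_hist, sigma_next):
    """Adams-Bashforth weights for an arbitrary (not necessarily equal) sigma grid.

    sigma_hist = [sigma_t, sigma_{t-1}, ..., sigma_{t-k+1}] (newest first).
    Returns step-size-scaled weights for [eps_t, eps_{t-1}, ...], obtained by
    integrating each Lagrange basis polynomial from sigma_t to sigma_{t+1}.
    """
    sigma_t = sigma_hist[0]
    coeffs = []
    for i, s_i in enumerate(sigma_hist):
        # Build the i-th Lagrange basis polynomial: prod_{j != i} (x - s_j) / (s_i - s_j)
        basis = np.polynomial.Polynomial([1.0])
        for j, s_j in enumerate(sigma_hist):
            if j != i:
                basis = basis * np.polynomial.Polynomial([-s_j, 1.0]) / (s_i - s_j)
        antiderivative = basis.integ()
        coeffs.append(antiderivative(sigma_next) - antiderivative(sigma_t))
    return coeffs

# Sanity check against the equal-spacing Result (decreasing sigmas, so h = -1 here):
# adams_bashforth_coefficients([3.0, 4.0, 5.0], 2.0) -> [-23/12, +4/3, -5/12],
# i.e. [23/12, -4/3, 5/12] scaled by h.
```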
Remark. Interestingly, if we repeat the above exercise with \(k=2\), the Taylor Approximation method and the Adams-Bashforth method result in the exact same update steps, assuming step sizes are constant in \(\sigma\).
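For instance, the \(k=2\) Taylor step uses only \(\hat{\boldsymbol\epsilon}^{(1)}_\theta\), giving \[ \begin{align*} \int_{\sigma_t}^{\sigma_{t+1}}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_\sigma,\sigma)d\sigma & \approx \hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t)h + \frac{1}{2}\frac{\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t)-\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-1},\sigma_{t-1})}{h}h^2 = \left(\frac{3}{2}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_t,\sigma_t)-\frac{1}{2}\hat{\boldsymbol\epsilon}_\theta(\mathbf{x}_{t-1},\sigma_{t-1})\right)h, \end{align*} \] and integrating the two-point Lagrange polynomial over \([\sigma_t, \sigma_{t+1}]\) with equal spacing yields the same \(\left(\frac{3}{2}, -\frac{1}{2}\right)\) weights.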
Remark. The DEIS solver (Zhang and Chen 2023) combines this method with either the \(t\) or the \(\sigma\) integration space.