

VI. DDPM vs DDIM

Our presentation so far has followed the DDIM formulation of Song et al. 2021, which builds on the earlier DDPM formulation introduced in Ho et al. 2020. In this section we'll make the differences between the two papers clear.


Define. The DDPM formulation is an instantiation of DDIM in which we make a specific choice of \(\boldsymbol\gamma\) in the backward model so that the model is completely Markovian. Specifically, \[ \begin{align*} \gamma^2_t = \frac{\sigma_{t-1}^2}{\sigma_t^2}\underbrace{\left(1-\frac{\alpha_t^2}{\alpha_{t-1}^2}\right)}_{\beta_t}. \end{align*} \] Using notation that's consistent with Ho et al. 2020, we'll define \[ \begin{align*} \beta_t & \triangleq \left(1-\frac{\alpha_t^2}{\alpha_{t-1}^2}\right). \end{align*} \] The completely Markovian property (soon to be proven) can be seen in this graphical model.
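To make these quantities concrete, here's a minimal NumPy sketch. The linear schedule is a hypothetical choice for illustration (any variance-preserving schedule with \(\sigma_t^2 = 1 - \alpha_t^2\) works), and the helpers `beta` and `gamma2` are our own names, not anything from the papers.

```python
import numpy as np

# Toy variance-preserving schedule (illustrative): alpha_t decreases toward 0,
# sigma_t = sqrt(1 - alpha_t^2). In Ho et al.'s notation, alpha_t here is sqrt(alpha_bar_t).
T = 1000
alpha = np.sqrt(np.cumprod(1.0 - np.linspace(1e-4, 0.02, T)))
sigma = np.sqrt(1.0 - alpha**2)

def beta(t):
    # beta_t = 1 - alpha_t^2 / alpha_{t-1}^2, defined for t >= 1
    return 1.0 - alpha[t]**2 / alpha[t - 1]**2

def gamma2(t):
    # DDPM's choice: gamma_t^2 = (sigma_{t-1}^2 / sigma_t^2) * beta_t
    return sigma[t - 1]**2 / sigma[t]**2 * beta(t)

t = 500
print(beta(t), gamma2(t))  # gamma_t^2 is slightly smaller than beta_t; this comes up again later
```

The later snippets in this section reuse `alpha`, `sigma`, `beta`, and `gamma2` from this sketch.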

Remark. Recall that because we are training with simple losses, the choice of \(\boldsymbol\gamma\) doesn't make a difference in the training procedure. Thus, the only difference lies in the sampling procedure. We'll be drawing samples from the following. \[ \begin{align*} p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) & = N\left(\alpha_{t-1}f_\theta(\mathbf{x}_t) + \sqrt{\sigma_{t-1}^2 - \gamma_t^2}\frac{\mathbf{x}_t-\alpha_tf_\theta(\mathbf{x}_t)}{{\sigma_t}}, \gamma_t^2 \mathbf{I}\right)\\ & = N\left(\alpha_{t-1} \underbrace{f_\theta(\mathbf{x}_t)}_{\hat{\mathbf{x}}_0} + \frac{\sigma_{t-1}^2\alpha_t}{\sigma_t\alpha_{t-1}}\underbrace{\frac{\mathbf{x}_t-\alpha_tf_\theta(\mathbf{x}_t)}{{\sigma_t}}}_{\hat{\boldsymbol\epsilon}_t}, \gamma_t^2\mathbf{I}\right), \end{align*} \]

where above we simplified with a bit of algebra, recalling the variance-preserving relation \(\sigma_t^2 = 1 - \alpha_t^2\), \[ \begin{align*} \sqrt{\sigma_{t-1}^2 - \gamma_t^2} & = \sqrt{\sigma_{t-1}^2 - \frac{\sigma_{t-1}^2}{\sigma_t^2}\left(1 - \frac{\alpha_t^2}{\alpha_{t-1}^2}\right)}\\ & = \sigma_{t-1}\sqrt{1 - \frac{\alpha_{t-1}^2 - \alpha_t^2}{\sigma_t^2\alpha_{t-1}^2}}\\ & = \frac{\sigma_{t-1}}{\sigma_t\alpha_{t-1}}\sqrt{\sigma_t^2\alpha_{t-1}^2 - \alpha_{t-1}^2+\alpha_t^2}\\ & = \frac{\sigma_{t-1}}{\sigma_t\alpha_{t-1}}\sqrt{\alpha_{t-1}^2 (\sigma_t^2 -1) +\alpha_t^2}\\ & = \frac{\sigma_{t-1}}{\sigma_t\alpha_{t-1}}\sqrt{-\alpha_{t-1}^2 \alpha_t^2 +\alpha_t^2}\\ & = \frac{\sigma_{t-1}}{\sigma_t\alpha_{t-1}}\sqrt{\alpha_t^2(1 - \alpha_{t-1}^2)}\\ & = \frac{\sigma_{t-1}^2\alpha_t}{\sigma_t\alpha_{t-1}}. \end{align*} \]

Result. There's an equivalent way to write the sampling procedure purely in terms of \(\beta_t\) and the estimated score \(s_\theta(\mathbf{x}_t)\). It will be convenient down the road when we study diffusion models as differential equations (Song et al. 2021). \[ \begin{align*} p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = N\left(\frac{1}{\sqrt{1-\beta_t}}\left(\mathbf{x}_t + \beta_ts_\theta(\mathbf{x}_t)\right), \gamma_t^2\mathbf{I}\right) \end{align*} \]

Proof. Starting from the algebra earlier, we can write \[ \begin{align*} \mathbf{x}_{t-1}& = \alpha_{t-1}\hat{\mathbf{x}}_0 + \frac{\sigma_{t-1}^2\alpha_t}{\sigma_t\alpha_{t-1}}\hat{\boldsymbol\epsilon}_t + \gamma_t\mathbf{z}_t, \quad\quad\mathbf{z}_t\sim N(\mathbf{0},\mathbf{I})\\ & = \frac{\alpha_{t-1}}{\alpha_t}\mathbf{x}_t-\frac{\alpha_{t-1}}{\alpha_t}\sigma_t\hat{\boldsymbol\epsilon}_t + \frac{\sigma_{t-1}^2\alpha_t}{\sigma_t\alpha_{t-1}}\hat{\boldsymbol\epsilon}_t+\gamma_t\mathbf{z}_t\\ & = \frac{\alpha_{t-1}}{\alpha_t}\mathbf{x}_t + \frac{\alpha_{t-1}}{\alpha_{t}\sigma_t}\left(\frac{\sigma_{t-1}^2\alpha_t^2}{\alpha_{t-1}^2}-\sigma_t^2\right)\hat{\boldsymbol\epsilon}_t+\gamma_t\mathbf{z}_t\\ & = \frac{\alpha_{t-1}}{\alpha_t}\mathbf{x}_t + \frac{\alpha_{t-1}}{\alpha_{t}\sigma_t}\left(\frac{\alpha_t^2-\alpha_{t-1}^2}{\alpha_{t-1}^2}\right)\hat{\boldsymbol\epsilon}_t+\gamma_t\mathbf{z}_t\\ & = \frac{\alpha_{t-1}}{\alpha_t}\mathbf{x}_t - \frac{\alpha_{t-1}}{\alpha_{t}\sigma_t}\beta_t\hat{\boldsymbol\epsilon}_t+\gamma_t\mathbf{z}_t\\ & = \frac{\alpha_{t-1}}{\alpha_t}\mathbf{x}_t + \frac{\alpha_{t-1}}{\alpha_{t}}\beta_ts_\theta(\mathbf{x}_t)+\gamma_t\mathbf{z}_t\\ & = \frac{1}{\sqrt{1-\beta_t}}\left(\mathbf{x}_t + \beta_ts_\theta(\mathbf{x}_t)\right) + \gamma_t\mathbf{z}_t. \end{align*} \] Recall the score is \(s_\theta(\mathbf{x}_t) \approx \nabla_{\mathbf{x}_t}\log p(\mathbf{x}_t|\mathbf{x}_0) = -\frac{\mathbf{x}_t-\alpha_t\mathbf{x}_0}{\sigma_t^2} = -\boldsymbol\epsilon_t / \sigma_t\).
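As a sketch of what this sampling procedure looks like in code (reusing the toy schedule above, and assuming a hypothetical trained noise-prediction network `eps_model(x, t)`), one DDPM step might be written as follows; the assertion checks that the \(\hat{\boldsymbol\epsilon}_t\) form and the score form above agree.

```python
rng = np.random.default_rng(0)

def ddpm_step(x_t, t, eps_model):
    # One step of DDPM sampling, x_t -> x_{t-1} (a sketch, not the papers' reference code).
    eps_hat = eps_model(x_t, t)                        # predicted noise
    x0_hat = (x_t - sigma[t] * eps_hat) / alpha[t]     # f_theta(x_t), the predicted clean sample
    mean = alpha[t - 1] * x0_hat + (sigma[t - 1]**2 * alpha[t]) / (sigma[t] * alpha[t - 1]) * eps_hat
    # Equivalent score-based form: (x_t + beta_t * s_theta(x_t)) / sqrt(1 - beta_t)
    score = -eps_hat / sigma[t]
    assert np.allclose(mean, (x_t + beta(t) * score) / np.sqrt(1.0 - beta(t)))
    return mean + np.sqrt(gamma2(t)) * rng.standard_normal(x_t.shape)

# usage with a dummy "network" that always predicts zero noise:
x_prev = ddpm_step(rng.standard_normal(16), 500, lambda x, t: np.zeros_like(x))
```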


Result. The DDPM backward model is completely Markovian, without needing to be conditioned on \(\mathbf{x}_0\). That is, we can factorize the distribution as \[ \begin{align*} q(\mathbf{x}_{1:T}|\mathbf{x}_0) & = \prod_{t>0}q(\mathbf{x}_{t}|\mathbf{x}_{t-1}). \end{align*} \]

Moreover, it turns out that the exact form of the backward model is \[ \begin{align*} q(\mathbf{x}_t|\mathbf{x}_{t-1}) & = N\left(\frac{\alpha_t}{\alpha_{t-1}}\mathbf{x}_{t-1},\beta_t\mathbf{I}\right) = N\left(\sqrt{1-\beta_t}\mathbf{x}_{t-1},\beta_t\mathbf{I}\right). \end{align*} \]
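As a quick numerical sanity check of this form (reusing the toy schedule sketch above): pushing \(N(\alpha_{t-1}\mathbf{x}_0, \sigma_{t-1}^2\mathbf{I})\) through this one-step kernel should reproduce the marginal \(N(\alpha_t\mathbf{x}_0, \sigma_t^2\mathbf{I})\).

```python
# Composing q(x_t | x_{t-1}) with the marginal at t-1 recovers the marginal at t.
t = 500
print(np.allclose(np.sqrt(1 - beta(t)) * alpha[t - 1], alpha[t]))           # mean coefficients agree
print(np.allclose((1 - beta(t)) * sigma[t - 1]**2 + beta(t), sigma[t]**2))  # variances agree
```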

Proof. To prove this we'll need the expression for \(q(\mathbf{x}_{t}|\mathbf{x}_{t-1},\mathbf{x}_0)\). Bayes' Rule tells us that \[ \begin{align*} q(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{x}_0) & \propto q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)q(\mathbf{x}_t|\mathbf{x}_0)\\ & = q(\mathbf{x}_{t-1},\mathbf{x}_t|\mathbf{x}_0). \end{align*} \]

Now recall from earlier, when we proved the key result of the backward model, that the log-likelihood of the joint distribution is proportional to \[ \begin{align*} \log q(\mathbf{x}_{t-1},\mathbf{x}_t|\mathbf{x}_0) & \propto -\frac{1}{2}\left(\frac{\|\mathbf{x}_{t-1} - A\mathbf{x}_t-B\mathbf{x}_0\|^2}{\gamma_t^2} + \frac{\|\mathbf{x}_t-\alpha_t\mathbf{x}_0\|^2}{\sigma_t^2}\right)\\ & \propto -\frac{1}{2\gamma_t^2}\left( - 2A\mathbf{x}_{t-1}^\top\mathbf{x}_t + A^2\|\mathbf{x}_t\|^2+2AB\mathbf{x}_{t}^\top\mathbf{x}_0\right) \\ &\quad\ -\frac{1}{2\sigma_t^2}\left(\|\mathbf{x}_t\|^2 -2\alpha_t\mathbf{x}_t^\top \mathbf{x}_0\right) \end{align*} \] On the second line we dropped the terms that don't involve \(\mathbf{x}_t\), since we're after a distribution over \(\mathbf{x}_t\). Here \[ \begin{align*} A & \triangleq\sqrt{\sigma_{t-1}^2 - \gamma_t^2}/\sigma_t \\ B & \triangleq \left(\alpha_{t-1} - \frac{\alpha_t}{\sigma_t}\sqrt{\sigma_{t-1}^2 - \gamma_t^2}\right) = \alpha_{t-1}-\alpha_tA. \end{align*} \] First, we can prove the Markovian property by showing the terms involving \(\mathbf{x}_0\) cancel each other out, and therefore \(q(\mathbf{x}_{t}|\mathbf{x}_{t-1},\mathbf{x}_0)\) has no dependence on \(\mathbf{x}_0\). Specifically, it suffices to show that \[ \begin{align*} \frac{2AB\mathbf{x}_t^\top\mathbf{x}_0}{2\gamma_t^2} \stackrel{?}{=}\frac{2\alpha_t\mathbf{x}_t^\top\mathbf{x}_0}{2\sigma_t^2}. \end{align*} \]

This is straightforward from some algebra, where we re-use our simplification of \(\sqrt{\sigma_{t-1}^2 - \gamma_t^2}\). \[ \begin{align*} \frac{AB}{\gamma_t^2} & = \underbrace{\frac{\sigma_{t-1}^2\alpha_t}{\sigma_t^2\alpha_{t-1}}}_A\underbrace{\left(\alpha_{t-1}-\frac{\sigma_{t-1}^2\alpha_t^2}{\sigma_t^2\alpha_{t-1}}\right)}_B\underbrace{\left(\frac{\sigma_{t}^2}{\sigma_{t-1}^2}\right)\left(1-\frac{\alpha_t^2}{\alpha_{t-1}^2}\right)^{-1}}_{1/\gamma_t^2}\\ & = \underbrace{\frac{\alpha_t}{\alpha_{t-1}}\left(\frac{\alpha_{t-1}^2}{\alpha_{t-1}^2-\alpha_t^2}\right)}_{A/\gamma_t^2}\underbrace{\left(\frac{\alpha_{t-1}^2\sigma_t^2 - \alpha_t^2\sigma_{t-1}^2}{\sigma_t^2\alpha_{t-1}}\right)}_B\\ & = \frac{\alpha_t\left(\alpha_{t-1}^2\sigma_t^2 - \alpha_t^2\sigma_{t-1}^2\right)}{\sigma_t^2(\alpha_{t-1}^2-\alpha_t^2)}\\ & = \frac{\alpha_t\left(\alpha_{t-1}^2(1-\alpha_t^2) - \alpha_t^2(1-\alpha_{t-1}^2)\right)}{\sigma_t^2(\alpha_{t-1}^2-\alpha_t^2)}\\ & = \frac{\alpha_t}{\sigma_t^2} \end{align*} \]

Next, we want the exact expression for \(q(\mathbf{x}_t|\mathbf{x}_{t-1})\). The approach is to once again collect the quadratic and linear terms in the log-likelihood of \(q(\mathbf{x}_{t-1},\mathbf{x}_t|\mathbf{x}_0)\) and match them to derive the parameters of a Gaussian distribution over \(\mathbf{x}_{t}\). The quadratic terms collect to \[ \begin{align*} -\frac{A^2\|\mathbf{x}_t\|^2}{2\gamma_t^2} -\frac{\|\mathbf{x}_t\|^2}{2\sigma_t^2} \end{align*} \]

which implies that the distribution \(q(\mathbf{x}_t|\mathbf{x}_{t-1})\) has precision matrix as follows. \[ \begin{align*} \boldsymbol\Sigma^{-1} & = \left(\frac{A^2}{\gamma_t^2}+\frac{1}{\sigma_t^2}\right)\mathbf{I}\\ & = \left(\underbrace{\frac{\alpha_t}{\alpha_{t-1}}\left(\frac{\alpha_{t-1}^2}{\alpha_{t-1}^2-\alpha_t^2}\right)}_{A/\gamma_t^2}\underbrace{\frac{\sigma_{t-1}^2\alpha_t}{\sigma_t^2\alpha_{t-1}}}_A + \frac{1}{\sigma_t^2}\right)\mathbf{I}\\ & = \left(\frac{\alpha_t^2\sigma_{t-1}^2}{(\alpha_{t-1}^2-\alpha_t^2)\sigma_t^2} + \frac{1}{\sigma_t^2}\right)\mathbf{I}\\ & = \left(\frac{\alpha_t^2\sigma_{t-1}^2 + \alpha_{t-1}^2-\alpha_t^2}{\sigma_t^2(\alpha_{t-1}^2-\alpha_t^2)}\right)\mathbf{I}\\ & = \left(\frac{\alpha_{t-1}^2}{\alpha_{t-1}^2-\alpha_t^2}\right)\mathbf{I}\\ & = \beta_t^{-1}\mathbf{I}\\ \boldsymbol\Sigma& = \beta_t\mathbf{I} \end{align*} \]

On the second line above we re-used some of the algebra we previously derived, expressing \(A^2/\gamma_t^2\) as the product of the simplified quantities \(A/\gamma_t^2\) and \(A\).

The linear terms in the log-likelihood collect to \[ \begin{align*} -\frac{-2A\mathbf{x}_{t-1}^\top\mathbf{x}_t}{2\gamma_t^2} \end{align*} \] which implies that the distribution \(q(\mathbf{x}_t|\mathbf{x}_{t-1})\) has mean as follows. \[ \begin{align*} \boldsymbol\mu & = \boldsymbol\Sigma \left(\frac{A}{\gamma_t^2}\mathbf{x}_{t-1}\right)\\ & = \beta_t\frac{\alpha_t\alpha_{t-1}}{\alpha_{t-1}^2-\alpha_t^2}\mathbf{x}_{t-1}\\ & = \frac{\alpha_{t-1}^2-\alpha_t^2}{\alpha_{t-1}^2}\frac{\alpha_t\alpha_{t-1}}{\alpha_{t-1}^2-\alpha_t^2}\mathbf{x}_{t-1}\\ & = \frac{\alpha_t}{\alpha_{t-1}}\mathbf{x}_{t-1}\\ & = \sqrt{1-\beta_t}\mathbf{x}_{t-1} \end{align*} \] Again, on the second line above we re-used some of the algebra previously derived for \(A/\gamma_t^2\).

This concludes the proof.
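For a concrete spot check of the proof (reusing the toy schedule sketch above), the three identities used can be verified numerically at a particular \(t\), with `A` and `B` the same quantities defined in the proof.

```python
t = 500
A = np.sqrt(sigma[t - 1]**2 - gamma2(t)) / sigma[t]
B = alpha[t - 1] - alpha[t] * A
print(np.allclose(A * B / gamma2(t), alpha[t] / sigma[t]**2))          # x_0 terms cancel (Markov property)
print(np.allclose(A**2 / gamma2(t) + 1 / sigma[t]**2, 1 / beta(t)))    # precision is beta_t^{-1}
print(np.allclose(beta(t) * A / gamma2(t), alpha[t] / alpha[t - 1]))   # mean coefficient is sqrt(1 - beta_t)
```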

Result. With this specification of \(\boldsymbol\gamma\), the weighting coefficients \(\omega_t\) in the ELBO can be written as half the difference in signal-to-noise ratios between consecutive timesteps (Kingma et al. 2021).

Even though we typically optimize the simple loss and therefore ignore \(\omega_t\), this is an interesting result that is important for understanding Variational Diffusion Models.

Proof. Again substitute our simplification of \(\sqrt{\sigma_{t-1}^2 - \gamma_t^2}\) and use some algebra. \[ \begin{align*} \omega_t & = \frac{1}{2\gamma_t^2}\underbrace{\left(\alpha_{t-1}-\frac{\alpha_t}{\sigma_t}\sqrt{\sigma_{t-1}^2-\gamma_t^2}\right)^2}_{B^2}\\ & = \frac{1}{2\gamma_t^2}\left(\frac{\alpha_{t-1}^2\sigma_t^2-\alpha_t^2\sigma_{t-1}^2}{\sigma_t^2\alpha_{t-1}}\right)^2\\ & = \frac{\sigma_t^2\alpha_{t-1}^2}{2\sigma_{t-1}^2(\alpha_{t-1}^2-\alpha_t^2)}\left(\frac{\alpha_{t-1}^2-\alpha_t^2}{\sigma_t^2\alpha_{t-1}}\right)^2\\ & = \frac{\alpha_{t-1}^2-\alpha_t^2}{2\sigma_t^2\sigma_{t-1}^2}\\ & = \frac{1}{2}\Big(\underbrace{\frac{\alpha_{t-1}^2}{\sigma_{t-1}^2}}_{\text{SNR}_{t-1}} - \underbrace{\frac{\alpha_t^2}{\sigma_t^2}}_{\text{SNR}_t}\Big) \end{align*} \]
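A quick numerical check of this identity (reusing the toy schedule sketch above), with \(B\) as defined in the previous proof:

```python
# omega_t = B^2 / (2 gamma_t^2) equals half the difference in SNR between t-1 and t.
t = 500
A = np.sqrt(sigma[t - 1]**2 - gamma2(t)) / sigma[t]
B = alpha[t - 1] - alpha[t] * A
snr = lambda s: alpha[s]**2 / sigma[s]**2
print(np.allclose(B**2 / (2 * gamma2(t)), 0.5 * (snr(t - 1) - snr(t))))
```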


Forward Model Variance Choices

There's one more minor difference to point out between DDPM and DDIM. Recall that we're sampling from the forward model \(p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)\) which approximates \(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\).

In DDIM, the forward model variance is always set to \(\gamma_t^2 \mathbf{I}\), which is the actual variance of \(q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_0)\).

In DDPM, the forward model variance is set either to \(\gamma_t^2 \mathbf{I}\) (which is the actual variance) or to \(\beta_t\mathbf{I}\), which is an upper bound on the variance of \(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\). The authors apparently found no significant difference in sample quality between the two options. Most papers nowadays ignore the upper-bound option and either use the actual variance in the forward sampling process (which is more correct) or use guidance.

Nonetheless, we motivate the upper bound interpretation of the variance below. The proof is specific to DDPM and does not generalize to DDIM.

Result. Suppose \(\mathbf{x}_0\sim N(\mathbf{c}, \tau_0\mathbf{I})\) for some fixed variance \(\tau_0 \in [0, 1]\), so \(\mathbf{x}_0\) is isotropically distributed around a constant \(\mathbf{c}\). Then we have an upper bound on the variance of \(q(\mathbf{x}_{t-1}|\mathbf{x}_t)\), \[ \begin{align*} \mathrm{Var}[q(\mathbf{x}_{t-1}|\mathbf{x}_t)] \preceq \beta_t \mathbf{I}. \end{align*} \]

Proof. The marginal distribution \(q(\mathbf{x}_t)\) in this setup is the following. \[ \begin{align*} \mathbf{x}_t|\mathbf{x}_0 & = \alpha_t\mathbf{x}_0 + \sigma_t\boldsymbol\epsilon_t\\ \mathbf{x}_0 &= \sqrt{\tau_0} \boldsymbol\epsilon_0 + \mathbf{c}\\ \mathbf{x}_t & = \alpha_t(\sqrt{\tau_0} \boldsymbol\epsilon_0+\mathbf{c}) + \sigma_t\boldsymbol\epsilon_t\\ q(\mathbf{x}_t)& = N\left(\alpha_t\mathbf{c}, \underbrace{\alpha_t^2\tau_0+\sigma_t^2}_{\tau_t}\mathbf{I}\right) \end{align*} \] Once again we can apply Bayes' Rule to find that \(q(\mathbf{x}_{t-1}|\mathbf{x}_t)\) is a Gaussian.

To simplify notation we'll define \(\tau_t \triangleq \alpha_t^2\tau_0+\sigma_t^2\). \[ \begin{align*} \log q(\mathbf{x}_{t-1}|\mathbf{x}_t) & \propto \log q(\mathbf{x}_{t-1})+ \log q(\mathbf{x}_{t}|\mathbf{x}_{t-1})\\ & \propto -\frac{1}{2}\left(\frac{\|\mathbf{x}_{t-1}-\alpha_{t-1}\mathbf{c}\|^2_2}{\tau_{t-1}} + \frac{\|\sqrt{1-\beta_t}\mathbf{x}_{t-1}-\mathbf{x}_t\|^2_2}{\beta_t}\right)\\ & \propto -\frac{1}{2}\left(\left(\frac{1-\beta_t}{\beta_t} + \frac{1}{\tau_{t-1}}\right)\|\mathbf{x}_{t-1}\|^2_2+(\dots)\mathbf{x}_{t-1}^\top(\dots) \right)\\ & = -\frac{1}{2}\left(\frac{\beta_t+(1-\beta_t)\tau_{t-1}}{\beta_t\tau_{t-1}}\|\mathbf{x}_{t-1}\|^2_2+(\dots)\mathbf{x}_{t-1}^\top(\dots) \right) \end{align*} \] Because we're only interested in the variance of this distribution (and not the mean), we ignore the terms linear in \(\mathbf{x}_{t-1}\) and hide the irrelevant constants in the ellipses above. From this we can determine the variance of \(q(\mathbf{x}_{t-1}|\mathbf{x}_t)\) as we did in the earlier proofs. The next line below shows the desired upper bound. \[ \begin{align*} \mathrm{Var}[q(\mathbf{x}_{t-1}|\mathbf{x}_t)] & = \frac{\beta_t\tau_{t-1}}{\beta_t+(1-\beta_t)\tau_{t-1}}\mathbf{I} = \frac{\beta_t}{1+\beta_t\left(\frac{1}{\tau_{t-1}} - 1\right)}\mathbf{I} \preceq \beta_t\mathbf{I} \end{align*} \] The inequality holds because \(\tau_{t-1} = \alpha_{t-1}^2\tau_0 + \sigma_{t-1}^2 \leq 1\) whenever \(\tau_0 \leq 1\), so \(\frac{1}{\tau_{t-1}} \geq 1\), and \(\beta_t \geq 0\). This completes the proof.
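To see the bound numerically (reusing the toy schedule sketch above), we can evaluate this posterior variance for a few values of \(\tau_0\); note that \(\tau_0 = 0\) (a deterministic \(\mathbf{x}_0\)) recovers \(\gamma_t^2\), the actual variance discussed earlier, while \(\tau_0 = 1\) attains the bound.

```python
# Var[q(x_{t-1} | x_t)] = beta_t * tau_{t-1} / (beta_t + (1 - beta_t) * tau_{t-1})
t = 500
print("beta_t    =", beta(t))
print("gamma_t^2 =", gamma2(t))
for tau0 in [0.0, 0.25, 1.0]:
    tau_prev = alpha[t - 1]**2 * tau0 + sigma[t - 1]**2
    var = beta(t) * tau_prev / (beta(t) + (1 - beta(t)) * tau_prev)
    print("tau_0 =", tau0, " Var =", var)  # never exceeds beta_t; tau_0 = 0 gives gamma_t^2
```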