
[Return to Index]


II. The Backward Model

Previously, we assumed a conditionally Markovian structure for the backward model, \[ \begin{align*} q(\mathbf{x}_{1:T}|\mathbf{x}_0) & = \prod_{t>0} q(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{x}_0)\\ & = q(\mathbf{x}_T|\mathbf{x}_0) \prod_{t>1} q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0). \end{align*} \] In this section we'll place specific distributional assumptions on these conditionals, which will be key to deriving several properties.

Define. The backward process is parameterized by a fixed noise schedule \(\{\boldsymbol\alpha, \boldsymbol\sigma\}\) and a fixed backward model variance schedule \(\boldsymbol\gamma\), each a non-negative \(T\)-dimensional vector. It is defined as follows. \[ \begin{align*} q(\mathbf{x}_T|\mathbf{x}_0) & = N(\alpha_T\mathbf{x}_0,\sigma_T^2 \mathbf{I})\\ q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0) & = N\left(\alpha_{t-1}\mathbf{x}_0 + \sqrt{\sigma_{t-1}^2 - \gamma_t^2}\frac{\mathbf{x}_t-\alpha_t\mathbf{x}_0}{{\sigma_t}}, \gamma_t^2 \mathbf{I}\right) \end{align*} \]
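
To make the definition concrete, here is a minimal sketch of one ancestral sampling step from \(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\). The function name and the convention that alpha, sigma, gamma are arrays indexed by \(t\) are assumptions made for illustration, not part of the text.

import numpy as np

def sample_backward_step(x_t, x_0, t, alpha, sigma, gamma):
    # Mean of q(x_{t-1} | x_t, x_0) from the definition above (valid for t >= 1, so sigma[t] > 0).
    mean = (alpha[t - 1] * x_0
            + np.sqrt(sigma[t - 1] ** 2 - gamma[t] ** 2) * (x_t - alpha[t] * x_0) / sigma[t])
    # Add isotropic noise with covariance gamma_t^2 * I.
    return mean + gamma[t] * np.random.randn(*x_t.shape)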

We will later discuss the role of \(\{\alpha_t, \sigma_t, \gamma_t\}\). For now we can think of them as hyper-parameters subject to the following constraints; a sketch of one concrete schedule is given after the list.

  1. Variance Preserving: \(\alpha_t^2 + \sigma_t^2 = 1\).
  2. Monotonic: \(\alpha_t > \alpha_{t+1}\) and consequently \(\sigma_t < \sigma_{t+1}\).
  3. Boundary Conditions: \(\alpha_0 = 1,\sigma_0=0,\alpha_T\approx 0,\sigma_T\approx 1\).
  4. Gamma Bound: \(0 < \gamma_t < \sigma_{t-1}\).
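
For concreteness, here is a minimal sketch of one schedule satisfying these constraints. The cosine/sine form, the value \(T = 1000\), and the choice \(\gamma_t = \sigma_{t-1}/2\) are illustrative assumptions rather than choices made in the text.

import numpy as np

T = 1000
t = np.arange(T + 1)
alpha = np.cos(0.5 * np.pi * t / T)    # alpha_0 = 1, alpha_T = 0, strictly decreasing
sigma = np.sin(0.5 * np.pi * t / T)    # sigma_0 = 0, sigma_T = 1, strictly increasing
assert np.allclose(alpha ** 2 + sigma ** 2, 1.0)   # variance preserving
gamma = np.concatenate(([0.0], 0.5 * sigma[:-1]))  # gamma[t] = sigma[t-1]/2 < sigma_{t-1} whenever sigma_{t-1} > 0; gamma[0] unused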

This definition and the listed constraints aren't very intuitive yet; they're chosen mainly for convenience, so that the following key result holds. Note that the result holds for any valid setting of \(\{\boldsymbol\alpha, \boldsymbol\sigma, \boldsymbol\gamma\}\).


Key Result. We have a closed form for the distribution of \(\mathbf{x}_t\) conditioned on \(\mathbf{x}_0\), \[ \begin{align*} q(\mathbf{x}_t|\mathbf{x}_0) = N(\alpha_t\mathbf{x}_0,\sigma_t^2 \mathbf{I}). \end{align*} \] This is the most important result on this page. Equivalently, with re-parameterization we can write \[ \begin{align*} \tilde{\mathbf{x}}_t & = \alpha_t \mathbf{x}_0 + \sigma_t \boldsymbol\epsilon_t&\boldsymbol\epsilon_t &\sim N(\mathbf{0}, \mathbf{I}) \end{align*} \] Recall from the previous section that we use the notation \(\tilde{\mathbf{x}}_t\) to denote a random variable drawn from the distribution \(q(\mathbf{x}_t|\mathbf{x}_0)\).
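
For a quick sense of scale (the numbers here are purely illustrative), take a step where \(\alpha_t = 0.6\) and \(\sigma_t = 0.8\), which satisfies the variance-preserving constraint since \(0.6^2 + 0.8^2 = 1\). Then \[ \tilde{\mathbf{x}}_t = 0.6\,\mathbf{x}_0 + 0.8\,\boldsymbol\epsilon_t,\qquad \boldsymbol\epsilon_t \sim N(\mathbf{0},\mathbf{I}), \] so each coordinate of \(\tilde{\mathbf{x}}_t\) retains 60% of the signal amplitude and carries additive noise with standard deviation 0.8.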

Proof. This proof is actually quite long and intimidating. If reading for the first time, you may want to skip it.

We'll use an inductive proof, descending from \(t=T\) down to \(t=0\). The base case \(t=T\) holds by assumption, since \(q(\mathbf{x}_T|\mathbf{x}_0) = N(\alpha_T\mathbf{x}_0,\sigma_T^2\mathbf{I})\) is part of the definition. For the inductive step, we need to show the following. \[ \begin{align*} q(\mathbf{x}_{t-1}|\mathbf{x}_0) & = \int_{\mathbf{x}_t}q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)q(\mathbf{x}_t|\mathbf{x}_0)d\mathbf{x}_t\\ & \stackrel{?}{=} N(\alpha_{t-1}\mathbf{x}_0,\sigma_{t-1}^2\mathbf{I}) \end{align*} \]

This boils down to Equation (2.115) of Bishop (2006), which we'll now prove. We'll simplify notation a bit by writing \(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\) as a linear model in terms of \(\mathbf{x}_t\), introducing scalars \(A,B\). \[ \begin{align*} q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0) & = N\left(\alpha_{t-1}\mathbf{x}_0 + \sqrt{\sigma_{t-1}^2 - \gamma_t^2}\frac{\mathbf{x}_t-\alpha_t\mathbf{x}_0}{{\sigma_t}}, \gamma_t^2 \mathbf{I}\right)\\ & \triangleq N(A\mathbf{x}_t +B\mathbf{x}_0,\gamma_t^2\mathbf{I}),\\ A & \triangleq\sqrt{\sigma_{t-1}^2 - \gamma_t^2}/\sigma_t \\ B & \triangleq \left(\alpha_{t-1} - \frac{\alpha_t}{\sigma_t}\sqrt{\sigma_{t-1}^2 - \gamma_t^2}\right) = \alpha_{t-1}-\alpha_tA \end{align*} \] The approach will be to compute the joint distribution \(q(\mathbf{x}_{t-1},\mathbf{x}_t|\mathbf{x}_0)\). It turns out this joint distribution is another Gaussian (a classical result we won't prove here), i.e. \([\mathbf{x}_{t-1},\mathbf{x}_t]\sim N(\boldsymbol\mu, \boldsymbol\Sigma)\). Then the marginal distribution \(q(\mathbf{x}_{t-1}|\mathbf{x}_0)\) that we're interested in can simply be read off from the components of \(\boldsymbol\mu,\boldsymbol\Sigma\) that correspond to the \(\mathbf{x}_{t-1}\) block rather than the \(\mathbf{x}_t\) block.

Recall that a Gaussian distribution \(\mathbf{z}\sim N(\boldsymbol\mu, \boldsymbol\Sigma)\) has log-likelihood \[ \begin{align*} \log p(\mathbf{z}) & \propto -\frac{1}{2}(\mathbf{z} - \boldsymbol\mu)^\top \boldsymbol\Sigma^{-1}(\mathbf{z}-\boldsymbol\mu) \tag{in the general case}\\ & \propto -\frac{\|\mathbf{z}\|^2-2\mathbf{z}^\top\boldsymbol{\mu}}{2\sigma^2} \tag{in the isotropic case} \end{align*} \]

In our case, the log-likelihood will be the following. Note that we leave out irrelevant constants i.e. those with no dependency on \(\mathbf{x}_{t-1}\) or \(\mathbf{x}_t\). \[ \begin{align*} \log q(\mathbf{x}_{t-1},\mathbf{x}_t|\mathbf{x}_0) & = \log q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0) + \log q(\mathbf{x}_t|\mathbf{x}_0)\\ & \propto -\frac{1}{2}\left(\frac{\|\mathbf{x}_{t-1} - A\mathbf{x}_t-B\mathbf{x}_0\|^2}{\gamma_t^2} + \frac{\|\mathbf{x}_t-\alpha_t\mathbf{x}_0\|^2}{\sigma_t^2}\right)\\ & \propto -\frac{1}{2\gamma_t^2}\left(\|\mathbf{x}_{t-1}\|^2 - 2A\mathbf{x}_{t-1}^\top\mathbf{x}_t - 2B\mathbf{x}_{t-1}^\top\mathbf{x}_0 + A^2\|\mathbf{x}_t\|^2+2AB\mathbf{x}_{t}^\top\mathbf{x}_0\right) \\ &\quad\ -\frac{1}{2\sigma_t^2}\left(\|\mathbf{x}_t\|^2 -2\alpha_t\mathbf{x}_t^\top \mathbf{x}_0\right) \end{align*} \] We need to do some pattern matching to compute the components of the corresponding Gaussian.

If we collect the quadratic terms in the log-likelihood we end up with \[ \begin{align*} &-\frac{1}{2}\left(\frac{\|\mathbf{x}_{t-1}\|^2 -2A\mathbf{x}_{t-1}^\top \mathbf{x}_t+A^2\|\mathbf{x}_t\|^2}{\gamma_t^2} + \frac{\|\mathbf{x}_t\|^2}{\sigma_t^2}\right)\\ =& -\frac{1}{2}\left(\frac{\|\mathbf{x}_{t-1}\|^2 -2A\mathbf{x}_{t-1}^\top \mathbf{x}_t}{\gamma_t^2} + \left(\frac{A^2}{\gamma_t^2}+\frac{1}{\sigma_t^2}\right) \|\mathbf{x}_t\|^2\right) \end{align*} \] which implies that the joint distribution \(q(\mathbf{x}_{t-1},\mathbf{x}_t|\mathbf{x}_0) = N(\boldsymbol\mu, \boldsymbol\Sigma)\) has block-wise precision matrix \[ \begin{equation*} \boldsymbol\Sigma^{-1} = \begin{bmatrix}\frac{1}{\gamma_t^2}\mathbf{I} & -\frac{A}{\gamma_t^2}\mathbf{I} \\ -\frac{A}{\gamma_t^2}\mathbf{I} & \left(\frac{A^2}{\gamma_t^2}+\frac{1}{\sigma_t^2}\right) \mathbf{I}\end{bmatrix}. \end{equation*} \] Recall that a block-wise matrix can be inverted as \[ \begin{equation*} \begin{bmatrix} \ \mathbf{P} & \mathbf{Q}\ \\ \ \mathbf{R} & \mathbf{S}\ \end{bmatrix}^{-1} = \begin{bmatrix}\ (\mathbf{P}- \mathbf{Q}\mathbf{S}^{-1}\mathbf{R})^{-1} & -(\mathbf{P}-\mathbf{Q}\mathbf{S}^{-1}\mathbf{R})^{-1}\mathbf{Q}\mathbf{S}^{-1}\ \\ \ -\mathbf{S}^{-1}\mathbf{R}(\mathbf{P}-\mathbf{Q}\mathbf{S}^{-1}\mathbf{R})^{-1}& \mathbf{S}^{-1}+\mathbf{S}^{-1}\mathbf{R}(\mathbf{P}- \mathbf{Q}\mathbf{S}^{-1}\mathbf{R})^{-1}\mathbf{Q}\mathbf{S}^{-1}\ \end{bmatrix}. \end{equation*} \] In our case the covariance of the marginal \(q(\mathbf{x}_{t-1}|\mathbf{x}_0)\) that we're interested in is the top-left block of \(\boldsymbol\Sigma\), i.e. the inverse of the Schur complement \(\mathbf{P} - \mathbf{Q}\mathbf{S}^{-1}\mathbf{R}\), \[ \begin{align*} (\mathbf{P} - \mathbf{Q}\mathbf{S}^{-1}\mathbf{R})^{-1} &= \left(\frac{1}{\gamma_t^2}-\left(\frac{A^2}{\gamma_t^2}+\frac{1}{\sigma_t^2}\right)^{-1}\frac{A^2}{\gamma_t^4}\right)^{-1}\mathbf{I}\\ & = (A^2\sigma_t^2 + \gamma_t^2)\mathbf{I}\\ & = \left(\frac{\sigma_{t-1}^2 - \gamma_t^2}{\sigma_t^2}\sigma_t^2 + \gamma_t^2\right)\mathbf{I}\\ &=\sigma_{t-1}^2\mathbf{I} \end{align*} \] (On the third line above we substituted the expression for \(A^2\) from our linear model.)
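
Since every block above is a scalar multiple of \(\mathbf{I}\), the dimensions decouple and this covariance claim can be sanity-checked numerically with a \(2\times 2\) matrix. A minimal sketch, with arbitrary illustrative values for \(\sigma_{t-1},\sigma_t,\gamma_t\):

import numpy as np

sigma_prev, sigma_t, gamma_t = 0.5, 0.7, 0.3   # illustrative values with gamma_t < sigma_{t-1}
A = np.sqrt(sigma_prev ** 2 - gamma_t ** 2) / sigma_t

# Block-wise precision of q(x_{t-1}, x_t | x_0), with each block treated as a scalar.
precision = np.array([
    [1 / gamma_t ** 2,  -A / gamma_t ** 2],
    [-A / gamma_t ** 2,  A ** 2 / gamma_t ** 2 + 1 / sigma_t ** 2],
])
cov = np.linalg.inv(precision)
print(cov[0, 0], sigma_prev ** 2)   # both are 0.25: the marginal covariance is sigma_{t-1}^2 * I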

Similarly, if we collect the linear terms in the log-likelihood we end up with \[ \begin{align*} -\frac{1}{2}\left(\frac{-2B\mathbf{x}_{t-1}^\top \mathbf{x}_0 + 2AB\mathbf{x}_t^\top\mathbf{x}_0}{\gamma_t^2} + \frac{-2\alpha_t\mathbf{x}_t^\top \mathbf{x}_0}{\sigma_t^2}\right) \end{align*} \] which implies that the joint distribution \(q(\mathbf{x}_{t-1},\mathbf{x}_t|\mathbf{x}_0) = N(\boldsymbol\mu, \boldsymbol\Sigma)\) has mean \[ \begin{equation*} \boldsymbol\mu = \boldsymbol\Sigma \begin{bmatrix}\ \frac{B\mathbf{x}_0}{\gamma_t^2}\ \\ \left(\frac{\alpha_t}{\sigma_t^2}-\frac{AB}{\gamma_t^2}\right)\mathbf{x}_0\end{bmatrix} = \begin{bmatrix}\ \mathbf{P} & \mathbf{Q}\ \\\ \mathbf{R} & \mathbf{S}\ \end{bmatrix}^{-1}\begin{bmatrix}\ \frac{B\mathbf{x}_0}{\gamma_t^2}\ \\ \left(\frac{\alpha_t}{\sigma_t^2}-\frac{AB}{\gamma_t^2}\right)\mathbf{x}_0\end{bmatrix}. \end{equation*} \] In our case the mean of the marginal \(q(\mathbf{x}_{t-1}|\mathbf{x}_0)\) that we're interested in is the first block of this vector. In order to expand it we need to compute another block of the covariance matrix \(\boldsymbol\Sigma\), \[ \begin{align*} -(\mathbf{P}-\mathbf{Q}\mathbf{S}^{-1}\mathbf{R})^{-1}\mathbf{Q}\mathbf{S}^{-1} & = \sigma_{t-1}^2 \left(\frac{A}{\gamma_t^2}\right)\left(\frac{\sigma_t^2\gamma_t^2}{A^2\sigma_t^2 + \gamma_t^2}\right)\mathbf{I}\\ & = \frac{A\sigma_{t-1}^2\sigma_t^2}{A^2\sigma_{t}^2+\gamma_t^2}\mathbf{I}\\ & = A\sigma_t^2\mathbf{I} \end{align*} \] (On the third line above we substituted the expression for \(A^2\) from our linear model in the denominator.)

Putting it together, we have mean \[ \begin{align*} \begin{bmatrix}(\mathbf{P} - \mathbf{Q}\mathbf{S}^{-1}\mathbf{R})^{-1}\\ -(\mathbf{P}-\mathbf{Q}\mathbf{S}^{-1}\mathbf{R})^{-1}\mathbf{Q}\mathbf{S}^{-1}\end{bmatrix}^\top \begin{bmatrix}\frac{B\mathbf{x}_0}{\gamma_t^2} \\ \left(\frac{\alpha_t}{\sigma_t^2}-\frac{AB}{\gamma_t^2}\right)\mathbf{x}_0 \end{bmatrix} & = \sigma_{t-1}^2\frac{B}{\gamma_t^2}\mathbf{x}_0+ A\sigma_t^2\left(\frac{\alpha_t}{\sigma_t^2}-\frac{AB}{\gamma_t^2}\right)\mathbf{x}_0\\ & = \sigma_{t-1}^2\frac{B}{\gamma_t^2}\mathbf{x}_0+ A\sigma_t^2\left(\frac{\alpha_t\gamma_t^2-\sigma_t^2AB}{\gamma_t^2\sigma_t^2}\right)\mathbf{x}_0\\ & = \left(\frac{\sigma_{t-1}^2}{\gamma_t^2}B + \alpha_tA-\frac{\sigma_t^2}{\gamma_t^2}A^2B\right)\mathbf{x}_0\\ & = \left(\frac{\sigma_{t-1}^2}{\gamma_t^2}B + \alpha_tA-\frac{\sigma_{t-1}^2-\gamma_t^2}{\gamma_t^2}B\right)\mathbf{x}_0\\ & = \left(\alpha_tA +B \right)\mathbf{x}_0\\ & = \left(\frac{\alpha_t}{\sigma_t}\sqrt{\sigma_{t-1}^2-\gamma_t^2} + \alpha_{t-1} - \frac{\alpha_t}{\sigma_t}\sqrt{\sigma_{t-1}^2-\gamma_t^2}\right)\mathbf{x}_0\\ & = \alpha_{t-1}\mathbf{x}_0 \end{align*} \]

(On the fourth line above we substituted the expression for \(A^2\) from our linear model. On the sixth line we substituted the expressions for \(A\) and \(B\). All other lines are simplifications.)
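
The two substitutions above are easy to fumble by hand, so here is a short symbolic sanity check (using sympy, purely as an optional aside) of the identities that drive them: \(A^2\sigma_t^2 + \gamma_t^2 = \sigma_{t-1}^2\) and \(\alpha_t A + B = \alpha_{t-1}\).

import sympy as sp

alpha_prev, alpha_t, sigma_prev, sigma_t, gamma_t = sp.symbols(
    'alpha_prev alpha_t sigma_prev sigma_t gamma_t', positive=True)

A = sp.sqrt(sigma_prev ** 2 - gamma_t ** 2) / sigma_t
B = alpha_prev - alpha_t * A

assert sp.simplify(A ** 2 * sigma_t ** 2 + gamma_t ** 2 - sigma_prev ** 2) == 0  # covariance identity
assert sp.simplify(alpha_t * A + B - alpha_prev) == 0                            # mean identity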

This concludes the proof.
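
As an empirical complement to the proof (not a substitute for it), we can simulate one step of the composition: draw \(\tilde{\mathbf{x}}_t \sim q(\mathbf{x}_t|\mathbf{x}_0)\), then \(\mathbf{x}_{t-1} \sim q(\mathbf{x}_{t-1}|\tilde{\mathbf{x}}_t,\mathbf{x}_0)\), and check that the samples match \(N(\alpha_{t-1}\mathbf{x}_0,\sigma_{t-1}^2\mathbf{I})\). A one-dimensional sketch; all numeric values are illustrative and chosen to satisfy the constraints above.

import numpy as np

rng = np.random.default_rng(0)
x_0 = 1.7                                 # a fixed scalar "data point"
alpha_t, sigma_t = 0.6, 0.8               # schedule at step t (variance preserving)
alpha_prev, sigma_prev = 0.8, 0.6         # schedule at step t-1
gamma_t = 0.4                             # satisfies 0 < gamma_t < sigma_{t-1}

n = 1_000_000
x_t = alpha_t * x_0 + sigma_t * rng.standard_normal(n)            # x_t ~ q(x_t | x_0)
mean = alpha_prev * x_0 + np.sqrt(sigma_prev ** 2 - gamma_t ** 2) * (x_t - alpha_t * x_0) / sigma_t
x_prev = mean + gamma_t * rng.standard_normal(n)                  # x_{t-1} ~ q(x_{t-1} | x_t, x_0)

print(x_prev.mean(), alpha_prev * x_0)    # both ~1.36  (alpha_{t-1} x_0)
print(x_prev.std(), sigma_prev)           # both ~0.6   (sigma_{t-1})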


Remark. For intuition, let's put this result in the context of our prior derivation for the ELBO. \[ \begin{align*} L_t(\mathbf{x}_0) & = \mathbb{E}_{\tilde{\mathbf{x}}_t\sim q(\mathbf{x}_t|\mathbf{x}_0)}\left[D_\mathrm{KL}(\ \underbrace{q(\mathbf{x}_{t-1}|\tilde{\mathbf{x}}_t, \mathbf{x}_0)}_{\text{groundtruth}}\ \|\ \underbrace{p_\theta(\mathbf{x}_{t-1}|\tilde{\mathbf{x}}_t)}_\text{prediction}\ )\right] \end{align*} \] We are trying to learn \(p_\theta(\mathbf{x}_{t-1}|\tilde{\mathbf{x}}_t)\) to approximate the fixed distribution \(q(\mathbf{x}_{t-1}|\tilde{\mathbf{x}}_t, \mathbf{x}_0)\).

The key result tells us the distribution of \(\tilde{\mathbf{x}}_t\sim q(\mathbf{x}_t|\mathbf{x}_0)\). Writing it down again, \[ \begin{equation*}\tilde{\mathbf{x}}_t = \alpha_t \mathbf{x}_0 + \sigma_t \boldsymbol\epsilon_t\quad\quad\boldsymbol\epsilon_t \sim N(\mathbf{0}, \mathbf{I}).\end{equation*} \] It's easy to see from the re-parameterized notation that \[ \begin{align*} \mathbb{E}[\tilde{\mathbf{x}}_t\mid\mathbf{x}_0] & = \alpha_t \mathbf{x}_0\\ \mathrm{Var}[\tilde{\mathbf{x}}_t] & = \alpha_t^2 \mathrm{Var}[\mathbf{x}_0] + \sigma_t^2 \mathbf{I}, \end{align*} \] where the variance is taken with \(\mathbf{x}_0\) itself drawn from the data distribution.

So we can interpret \(\tilde{\mathbf{x}}_t\) as a "noisy" version of \(\mathbf{x}_0\): as \(t\) increases, \(\alpha_t\) shrinks and \(\sigma_t\) grows, so \(\tilde{\mathbf{x}}_t\) interpolates from the data toward pure noise. How noisy is it?

Pseudocode. We can now update our pseudocode accordingly.

def compute_L_t(x_0):
    t = sample_t(lower=1, upper=T)
    eps = randn_like(x_0)
    monte_carlo_x_t = alpha_t * x_0 + sigma_t * eps  # this is the only line that changed
    true_distn = get_gt_q_x_t_minus_1(monte_carlo_x_t, x_0, t)  # this has a closed form
    pred_distn = get_pred_p_x_t_minus_1(monte_carlo_x_t, t)  # gradient flows into model
    loss = compute_kl_div(true_distn, pred_distn)
    return loss
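
For completeness, the call to compute_kl_div has a closed form because both arguments are Gaussians with isotropic covariance. Below is a minimal sketch of one way it could be implemented; representing each distribution as a (mean, std) pair is an assumption made here for illustration, not necessarily the interface used elsewhere in this series.

import numpy as np

def compute_kl_div(true_distn, pred_distn):
    # Each distribution is a (mean, std) pair describing N(mean, std^2 * I).
    mu_q, std_q = true_distn
    mu_p, std_p = pred_distn
    d = mu_q.size
    # Closed-form KL( N(mu_q, std_q^2 I) || N(mu_p, std_p^2 I) ), summed over the d dimensions.
    return (d * np.log(std_p / std_q)
            + (d * std_q ** 2 + np.sum((mu_q - mu_p) ** 2)) / (2 * std_p ** 2)
            - d / 2)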

Remark. The variance-preserving assumption ensures that under standard whitening of the data so that \(\mathrm{Var}[\mathbf{x}_0] = \mathbf{I}\), we have constant \(\mathrm{Var}[\tilde{\mathbf{x}}_t] = \mathbf{I}\) regardless of \(t\).
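
A quick numerical illustration of this remark, using whitened one-dimensional data and a few illustrative values of \(\alpha_t\):

import numpy as np

rng = np.random.default_rng(0)
x_0 = rng.standard_normal(1_000_000)       # whitened data: Var[x_0] = 1
for alpha_t in (0.9, 0.5, 0.1):            # illustrative points along the schedule
    sigma_t = np.sqrt(1 - alpha_t ** 2)    # variance preserving
    x_t = alpha_t * x_0 + sigma_t * rng.standard_normal(x_0.size)
    print(round(x_t.var(), 3))             # ~1.0 for every alpha_t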