If standard deep learning is about recognizing patterns, diffusion models are about creating them from scratch. From DALL-E to Midjourney, diffusion models have changed generative AI. But how do you teach a neural network to create structure from pure randomness?
The trick is counterintuitive: first, teach it to systematically forget.
This post covers the mathematical foundation of Denoising Diffusion Probabilistic Models (DDPMs): the Forward Process, the Reparameterization Trick, and the Evidence Lower Bound (ELBO).
The Forward Process: Order to Chaos
A diffusion model defines a Markov chain over $T$ timesteps (typically $T=1000$). Starting from a real image $x_0 \sim q(x)$, we add Gaussian noise at each step according to a variance schedule $\beta_1, \beta_2, \dots, \beta_T$.
The transition from step $t-1$ to $t$ is:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)$$
Scaling the mean by $\sqrt{1-\beta_t}$ keeps the variance of $x_t$ bounded near 1.0 throughout the chain. This is a subtle but essential design choice: without the scaling factor, each step would inflate the variance and the process would diverge instead of converging to a known distribution. By $t = T$, the image $x_T$ is indistinguishable from isotropic Gaussian noise $\mathcal{N}(0, \mathbf{I})$.
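This variance-preserving behavior is easy to check numerically. The following sketch (using NumPy rather than PyTorch, for brevity) applies one forward step to unit-variance data, with and without the $\sqrt{1-\beta_t}$ scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.02                                  # the largest beta in our schedule
x = rng.standard_normal(1_000_000)           # x_{t-1} with variance ~1

# One forward step: scale the mean by sqrt(1 - beta), add noise of variance beta.
x_next = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
print(x_next.var())                          # stays ~1.0

# Without the scaling, each step inflates the variance by beta, so the
# chain drifts away from unit variance instead of settling at N(0, I).
x_bad = x + np.sqrt(beta) * rng.standard_normal(x.shape)
print(x_bad.var())                           # ~1.02 after a single step
```

Since $(1-\beta_t)\cdot 1 + \beta_t = 1$, a unit-variance input stays at unit variance; dropping the scaling yields $1 + \beta_t$ per step, which compounds.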
In our implementation, $\beta$ follows a linear schedule from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$. The early steps add almost imperceptible noise, while the later steps are more aggressive. This gradual ramp matters: it gives the reverse process a smooth gradient of difficulty to learn from, rather than an abrupt jump from clean to destroyed.
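The schedule and its cumulative products can be precomputed once. A minimal sketch (NumPy here; the helper name `linear_beta_schedule` is mine, not from the post):

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linearly spaced betas, matching the schedule described in the text."""
    return np.linspace(beta_1, beta_T, T)

betas = linear_beta_schedule()
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)   # cumulative products, one per timestep

# Early steps barely perturb the image; by t = T almost no signal remains.
print(alpha_bars[0])              # close to 1
print(alpha_bars[-1])             # close to 0
```

These `alpha_bars` are exactly the $\bar{\alpha}_t$ coefficients used for direct sampling in the next section.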
The Reparameterization Trick: Sampling at Arbitrary Timesteps
Training on timestep $t=500$ by applying the transition equation 500 times in sequence is not practical. A property of Gaussians lets us sample $x_t$ directly from $x_0$ in closed form.
Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$. Recursively applying the noise addition gives:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)$$
Using the reparameterization trick, we can draw this sample in one shot:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$
This is the engine of diffusion training. Pick any image $x_0$, any timestep $t$, and you can instantly produce the corrupted $x_t$ along with the exact noise $\epsilon$ that created it. No sequential simulation, no iterating through intermediate steps. One multiplication, one addition, done.
Note the behavior at the extremes. When $t$ is small, $\bar{\alpha}_t \approx 1$, so $x_t \approx x_0$ -- the image is barely touched. When $t$ is large, $\bar{\alpha}_t \approx 0$, so $x_t \approx \epsilon$ -- the image is pure noise. The cumulative product $\bar{\alpha}_t$ acts as a smooth interpolation between the original data and Gaussian static.
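Both extremes can be seen in a few lines. This sketch samples $x_t$ directly at a small and a large $t$ (NumPy stand-in; `q_sample` is a hypothetical name, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)        # schedule from the text
alpha_bars = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    """Closed-form forward sample: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = rng.standard_normal((28, 28))           # stand-in for an MNIST image
eps = rng.standard_normal(x0.shape)

x_small = q_sample(x0, 10, eps)              # nearly the clean image
x_large = q_sample(x0, 999, eps)             # nearly pure noise
```

At $t=10$, $\bar{\alpha}_t \approx 0.998$, so `x_small` is almost identical to `x0`; at $t=999$, $\bar{\alpha}_t \approx 4 \times 10^{-5}$, so `x_large` is essentially `eps`.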
The Reverse Process and the ELBO
The forward process $q(x_t \mid x_{t-1})$ destroys structure. The reverse process $p_\theta(x_{t-1} \mid x_t)$ must learn to restore it.
We train $\theta$ by maximizing log-likelihood. Since exact likelihood is intractable, we optimize a Variational Lower Bound -- the Evidence Lower Bound (ELBO). The full ELBO decomposes into a sum of KL divergences between the learned reverse transitions $p_\theta(x_{t-1} \mid x_t)$ and the tractable posterior $q(x_{t-1} \mid x_t, x_0)$. Ho et al. (2020) showed that after simplification, the ELBO reduces to a direct objective: rather than predicting the clean image $x_0$, the network should predict the noise $\epsilon$ that was added to $x_t$ at timestep $t$.
This is an elegant result. The posterior $q(x_{t-1} \mid x_t, x_0)$ is itself Gaussian with a mean that depends on $x_0$ and $x_t$. By reparameterizing $x_0$ in terms of $x_t$ and $\epsilon$ (inverting the forward process equation), the optimal $\mu_\theta$ turns out to be a function of the predicted noise. The network does not need to reconstruct the full image -- it only needs to identify what does not belong.
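The two algebraic facts this relies on, inverting the forward equation to recover $x_0$ and writing the posterior mean in terms of the noise, can be verified directly. A sketch under the same schedule, using the true $\epsilon$ in place of a trained network's prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 500
x0 = rng.standard_normal(16)
eps = rng.standard_normal(16)
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps

# Inverting the forward equation: x0 in terms of x_t and eps.
x0_hat = (x_t - np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])

# Posterior mean rewritten in terms of the noise (Ho et al., 2020):
# mu = (x_t - beta_t / sqrt(1 - abar_t) * eps) / sqrt(alpha_t)
mu = (x_t - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])

print(np.allclose(x0_hat, x0))   # exact recovery given the true noise
```

With the true noise, the inversion recovers $x_0$ exactly; in practice the network's $\epsilon_\theta(x_t, t)$ stands in for $\epsilon$, so the accuracy of the reverse step hinges entirely on the quality of the noise prediction.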
The Simplified Objective
The loss function is the mean squared error (MSE) between the true noise $\epsilon$ and the network's prediction $\epsilon_\theta$:

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right]$$
From Math to Code
That is it. We reduced an intractable stochastic process to: noise an image, predict the noise, minimize MSE. The entire training algorithm is five lines of pseudocode: sample an image $x_0$, pick a random $t$, sample noise $\epsilon$, compute $x_t$ via the reparameterization trick, and backpropagate $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$.
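Those five lines translate almost one-to-one into code. A single-step sketch in NumPy, with a dummy predictor standing in for the time-conditioned UNet of Part 2:

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    # Placeholder for the trained network: predicts zero noise, so the
    # loss below is roughly E[||eps||^2] ~ 1 for standard-normal eps.
    return np.zeros_like(x_t)

# One training step, mirroring the five lines of pseudocode:
x0 = rng.standard_normal((28, 28))          # 1. sample an image
t = int(rng.integers(0, 1000))              # 2. pick a random timestep
eps = rng.standard_normal(x0.shape)         # 3. sample noise
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps  # 4. corrupt
loss = np.mean((eps - eps_theta(x_t, t)) ** 2)                        # 5. MSE
```

In the real training loop this step runs over minibatches, `loss` is backpropagated through the UNet, and the schedule tensors live on the GPU; the logic is otherwise unchanged.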
Part 2 implements this in PyTorch -- the linear noise schedule with precomputed $\bar{\alpha}_t$ coefficients, a time-conditioned UNet that injects sinusoidal position embeddings at the bottleneck, and the training loop that runs this procedure over 60,000 MNIST images for 15 epochs.