If standard deep learning is about recognizing patterns, diffusion models are about creating them from scratch. From DALL-E to Midjourney, diffusion models have changed generative AI. But how do you teach a neural network to create structure from pure randomness?
The trick is counterintuitive: first, teach it to systematically forget.
This post covers the mathematical foundation of Denoising Diffusion Probabilistic Models (DDPMs): the Forward Process, the Reparameterization Trick, and the Evidence Lower Bound (ELBO).
The Forward Process: Order to Chaos
A diffusion model defines a Markov chain over $T$ timesteps (typically $T=1000$). Starting from a real image $x_0 \sim q(x)$, we add Gaussian noise at each step according to a variance schedule $\beta_1, \beta_2, \dots, \beta_T$.
The transition from step $t-1$ to $t$ is:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)$$
Scaling the mean by $\sqrt{1-\beta_t}$ keeps the variance of $x_t$ bounded near 1.0 throughout the chain. This is a subtle but essential design choice: without the scaling factor, each step would inflate the variance and the process would diverge instead of converging to a known distribution. By $t = T$, the image $x_T$ is indistinguishable from isotropic Gaussian noise $\mathcal{N}(0, \mathbf{I})$.
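This variance-preserving behavior is easy to check numerically. The following sketch (using NumPy rather than PyTorch, for brevity) applies one forward step to unit-variance data, with and without the $\sqrt{1-\beta_t}$ scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.02                                  # the largest beta in our schedule
x = rng.standard_normal(1_000_000)           # x_{t-1} with variance ~1

# One forward step: scale the mean by sqrt(1 - beta), add noise of variance beta.
x_next = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
print(x_next.var())                          # stays ~1.0

# Without the scaling, each step inflates the variance by beta, so the
# chain drifts away from unit variance instead of settling at N(0, I).
x_bad = x + np.sqrt(beta) * rng.standard_normal(x.shape)
print(x_bad.var())                           # ~1.02 after a single step
```

Since $(1-\beta_t)\cdot 1 + \beta_t = 1$, a unit-variance input stays at unit variance; dropping the scaling yields $1 + \beta_t$ per step, which compounds.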
In our implementation, $\beta$ follows a linear schedule from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$. The early steps add almost imperceptible noise, while the later steps are more aggressive. This gradual ramp matters: it gives the reverse process a smooth gradient of difficulty to learn from, rather than an abrupt jump from clean to destroyed.
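The schedule and its cumulative products can be precomputed once. A minimal sketch (NumPy here; the helper name `linear_beta_schedule` is mine, not from the post):

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linearly spaced betas, matching the schedule described in the text."""
    return np.linspace(beta_1, beta_T, T)

betas = linear_beta_schedule()
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)   # cumulative products, one per timestep

# Early steps barely perturb the image; by t = T almost no signal remains.
print(alpha_bars[0])              # close to 1
print(alpha_bars[-1])             # close to 0
```

These `alpha_bars` are exactly the $\bar{\alpha}_t$ coefficients used for direct sampling in the next section.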
The Reparameterization Trick: Sampling at Arbitrary Timesteps
Training on timestep $t=500$ by applying the transition equation 500 times in sequence is not practical. A property of Gaussians lets us sample $x_t$ directly from $x_0$ in closed form.
Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$. Recursively applying the noise addition gives:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)$$
Using the reparameterization trick, we can draw this sample in one shot:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$
This is the engine of diffusion training. Pick any image $x_0$, any timestep $t$, and you can instantly produce the corrupted $x_t$ along with the exact noise $\epsilon$ that created it. No sequential simulation, no iterating through intermediate steps. One multiplication, one addition, done.
Note the behavior at the extremes. When $t$ is small, $\bar{\alpha}_t \approx 1$, so $x_t \approx x_0$ -- the image is barely touched. When $t$ is large, $\bar{\alpha}_t \approx 0$, so $x_t \approx \epsilon$ -- the image is pure noise. The cumulative product $\bar{\alpha}_t$ acts as a smooth interpolation between the original data and Gaussian static.
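Both extremes can be seen in a few lines. This sketch samples $x_t$ directly at a small and a large $t$ (NumPy stand-in; `q_sample` is a hypothetical name, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)        # schedule from the text
alpha_bars = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    """Closed-form forward sample: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = rng.standard_normal((28, 28))           # stand-in for an MNIST image
eps = rng.standard_normal(x0.shape)

x_small = q_sample(x0, 10, eps)              # nearly the clean image
x_large = q_sample(x0, 999, eps)             # nearly pure noise
```

At $t=10$, $\bar{\alpha}_t \approx 0.998$, so `x_small` is almost identical to `x0`; at $t=999$, $\bar{\alpha}_t \approx 4 \times 10^{-5}$, so `x_large` is essentially `eps`.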
The Reverse Process and the ELBO
The forward process $q(x_t \mid x_{t-1})$ destroys structure. The reverse process $p_\theta(x_{t-1} \mid x_t)$ must learn to restore it.
We train $\theta$ by maximizing log-likelihood. Since exact likelihood is intractable, we optimize a Variational Lower Bound -- the Evidence Lower Bound (ELBO). The full ELBO decomposes into a sum of KL divergences between the learned reverse transitions $p_\theta(x_{t-1} \mid x_t)$ and the tractable posterior $q(x_{t-1} \mid x_t, x_0)$. Ho et al. (2020) showed that after simplification, the ELBO reduces to a direct objective: rather than predicting the clean image $x_0$, the network should predict the noise $\epsilon$ that was added to $x_t$ at timestep $t$.
This is an elegant result. The posterior $q(x_{t-1} \mid x_t, x_0)$ is itself Gaussian with a mean that depends on $x_0$ and $x_t$. By reparameterizing $x_0$ in terms of $x_t$ and $\epsilon$ (inverting the forward process equation), the optimal $\mu_\theta$ turns out to be a function of the predicted noise. The network does not need to reconstruct the full image -- it only needs to identify what does not belong.
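The two algebraic facts this relies on, inverting the forward equation to recover $x_0$ and writing the posterior mean in terms of the noise, can be verified directly. A sketch under the same schedule, using the true $\epsilon$ in place of a trained network's prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 500
x0 = rng.standard_normal(16)
eps = rng.standard_normal(16)
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps

# Inverting the forward equation: x0 in terms of x_t and eps.
x0_hat = (x_t - np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])

# Posterior mean rewritten in terms of the noise (Ho et al., 2020):
# mu = (x_t - beta_t / sqrt(1 - abar_t) * eps) / sqrt(alpha_t)
mu = (x_t - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])

print(np.allclose(x0_hat, x0))   # exact recovery given the true noise
```

With the true noise, the inversion recovers $x_0$ exactly; in practice the network's $\epsilon_\theta(x_t, t)$ stands in for $\epsilon$, so the accuracy of the reverse step hinges entirely on the quality of the noise prediction.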
The Simplified Objective
The loss function is the mean squared error (MSE) between the true noise $\epsilon$ and the network's prediction $\epsilon_\theta$:

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right]$$
From Math to Code
That is it. We reduced an intractable stochastic process to: noise an image, predict the noise, minimize MSE. The entire training algorithm is five lines of pseudocode: sample an image $x_0$, pick a random $t$, sample noise $\epsilon$, compute $x_t$ via the reparameterization trick, and backpropagate $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$.
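Those five lines translate almost one-to-one into code. A single-step sketch in NumPy, with a dummy predictor standing in for the time-conditioned UNet of Part 2:

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    # Placeholder for the trained network: predicts zero noise, so the
    # loss below is roughly E[||eps||^2] ~ 1 for standard-normal eps.
    return np.zeros_like(x_t)

# One training step, mirroring the five lines of pseudocode:
x0 = rng.standard_normal((28, 28))          # 1. sample an image
t = int(rng.integers(0, 1000))              # 2. pick a random timestep
eps = rng.standard_normal(x0.shape)         # 3. sample noise
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps  # 4. corrupt
loss = np.mean((eps - eps_theta(x_t, t)) ** 2)                        # 5. MSE
```

In the real training loop this step runs over minibatches, `loss` is backpropagated through the UNet, and the schedule tensors live on the GPU; the logic is otherwise unchanged.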
Part 2 implements this in PyTorch -- the linear noise schedule with precomputed $\bar{\alpha}_t$ coefficients, a time-conditioned UNet that injects sinusoidal position embeddings at the bottleneck, and the training loop that runs this procedure over 60,000 MNIST images for 15 epochs.