Part 1 established the DDPM objective: corrupt an image with noise schedule $\beta_t$, train a network to predict the noise $\epsilon$ added at timestep $t$. Here we translate that into PyTorch -- an unconditional DDPM that generates MNIST digits from scratch.
The Noise Schedule
We precompute the forward/reverse coefficients for $T=1000$ timesteps, using a linear schedule from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$. We also precompute the cumulative product $\bar{\alpha}_t = \prod_{s=1}^t (1 - \beta_s)$ and its square root, since these are the coefficients we need for the reparameterization trick during training. The posterior variance $\tilde{\beta}_t = \beta_t \cdot \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}$ is precomputed as well -- we will need it for the reverse sampling step.
import torch

class LinearNoiseSchedule:
    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=2e-2):
        self.num_timesteps = num_timesteps
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1.0 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
        # \bar{alpha}_{t-1}, with \bar{alpha}_0 = 1 by convention
        alphas_cumprod_prev = torch.cat([torch.ones(1), self.alphas_cumprod[:-1]])
        # We need these components for the forward process (q_sample)
        self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
        self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)
        # Posterior variance beta-tilde_t, used in the reverse sampling step
        self.posterior_variance = self.betas * (1.0 - alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
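The reverse-step code later in this post also calls a schedule.extract helper that isn't shown there. It just gathers the coefficient for each example's timestep and reshapes it so it broadcasts over image dimensions; a minimal sketch of such a method on LinearNoiseSchedule might look like this:

    def extract(self, a, t, x_shape):
        """Gather a[t] per batch element and reshape to broadcast over x_shape.

        a: 1-D tensor of per-timestep coefficients (length T)
        t: LongTensor of shape (batch,) with one timestep index per example
        """
        out = a.to(t.device).gather(0, t)
        return out.reshape(t.shape[0], *((1,) * (len(x_shape) - 1)))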
The Network: A Time-Conditioned UNet
A standard UNet maps image to image. Ours takes a noisy image $x_t$ and the timestep $t$, and outputs predicted noise $\epsilon_\theta$. The network needs $t$ because the noise scale changes drastically between $t=1$ and $t=1000$.
We encode $t$ with a Sinusoidal Position Embedding -- the same mechanism used in Transformers to encode token position. The scalar timestep $t$ is mapped into a 256-dimensional continuous vector using sine and cosine functions at geometrically spaced frequencies, then projected through a linear layer with GELU activation. This embedding is added to the bottleneck feature maps, giving every layer downstream access to the noise level.
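The SinusoidalPositionEmbeddings module referenced in the code below isn't reproduced in this post; here is a minimal sketch of the standard construction described above (geometrically spaced frequencies, sine and cosine halves concatenated):

import math
import torch
import torch.nn as nn

class SinusoidalPositionEmbeddings(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        # time: (batch,) integer timesteps -> (batch, dim) embedding
        half_dim = self.dim // 2
        # Geometrically spaced frequencies, as in the Transformer position encoding
        freqs = torch.exp(
            torch.arange(half_dim, device=time.device) * -(math.log(10000.0) / (half_dim - 1))
        )
        angles = time[:, None].float() * freqs[None, :]
        return torch.cat([angles.sin(), angles.cos()], dim=-1)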
The UNet itself follows the classic encoder-decoder pattern with skip connections. The encoder has two downsampling stages (28 to 14 to 7), the bottleneck operates at 7x7 with 128 channels, and the decoder mirrors the encoder with transposed convolutions. Skip connections concatenate encoder features with decoder features at matching resolutions. The whole model is 324,705 parameters -- small enough to train on a laptop.
import torch.nn as nn

class SimpleUNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Time embedding: scalar timestep t -> 256-dim vector
        self.time_mlp = nn.Sequential(
            SinusoidalPositionEmbeddings(256),
            nn.Linear(256, 256),
            nn.GELU()
        )
        # Standard Down/Up Conv blocks...

    def forward(self, x, time):
        # Compute time embedding and reshape for broadcasting over spatial dims
        t_emb = self.time_mlp(time)[:, :, None, None]
        # Standard encoder passes ...
        x = self.bottleneck(x)
        # Inject time conditioning
        x = x + t_emb
        # Standard decoder passes with skip connections ...
        return self.output(x)
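A quick sanity check of the intended interface, once the elided blocks are filled in: the model maps a batch of noisy images plus per-image timesteps to a noise prediction of the same shape.

model = SimpleUNet()
x = torch.randn(8, 1, 28, 28)        # a batch of noisy MNIST-sized images
t = torch.randint(0, 1000, (8,))     # one timestep per image
eps_hat = model(x, t)
assert eps_hat.shape == x.shape      # predicted noise matches the input shape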
The Training Loop
Ho et al. Algorithm 1, in PyTorch:
# For each batch of clean images x_0:
optimizer.zero_grad()
# 1. Pick a random timestep t for each image in the batch
t = torch.randint(0, T, (batch_size,), device=device)
# 2. Sample random normal noise
noise = torch.randn_like(x_0)
# 3. Corrupt x_0 to x_t using the forward process formula
#    (.view reshapes the per-example coefficients to broadcast over C, H, W)
x_t = schedule.sqrt_alphas_cumprod[t].view(-1, 1, 1, 1) * x_0 + \
      schedule.sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1) * noise
# 4. Ask the UNet to predict the noise
predicted_noise = model(x_t, t)
# 5. The loss is simply the Mean Squared Error
loss = F.mse_loss(predicted_noise, noise)
loss.backward()
optimizer.step()
That is the whole thing. The network learns to separate structure from noise across all 1,000 corruption levels. Each batch samples fresh random timesteps, so over the course of training, every noise level gets roughly equal coverage.
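For completeness, here is one way the loop above might be wired up end to end. The optimizer, learning rate, batch size, and the [-1, 1] pixel scaling are assumptions -- the post doesn't specify them.

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = 'cpu'   # small enough to train on CPU; move the schedule tensors too if you use a GPU
T = 1000

# Scale pixels to [-1, 1] (assumption about the preprocessing)
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Lambda(lambda x: x * 2.0 - 1.0)])
loader = DataLoader(datasets.MNIST('.', train=True, download=True, transform=transform),
                    batch_size=128, shuffle=True)

schedule = LinearNoiseSchedule(num_timesteps=T)
model = SimpleUNet().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)   # optimizer and lr are assumptions

for epoch in range(15):
    for x_0, _ in loader:
        x_0 = x_0.to(device)
        batch_size = x_0.shape[0]
        optimizer.zero_grad()
        t = torch.randint(0, T, (batch_size,), device=device)
        noise = torch.randn_like(x_0)
        x_t = schedule.sqrt_alphas_cumprod[t].view(-1, 1, 1, 1) * x_0 + \
              schedule.sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1) * noise
        predicted_noise = model(x_t, t)
        loss = F.mse_loss(predicted_noise, noise)
        loss.backward()
        optimizer.step()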
The Sampling Loop (Reverse Process)
Generation starts from pure Gaussian noise $\mathcal{N}(0, \mathbf{I})$. At each step from $t = T-1$ down to $t = 0$ (zero-indexed, matching the code), we compute the predicted mean of the reverse posterior, then add a controlled amount of noise to keep the chain stochastic. At the final step ($t=0$), we return the mean directly with no added noise.
Here is the actual reverse step from our implementation:
@torch.no_grad()
def p_sample(self, model, x, t, t_index):
"""Reverse step: Sample x_{t-1} given x_t"""
betas_t = self.schedule.extract(self.schedule.betas, t, x.shape)
sqrt_one_minus_alphas_cumprod_t = self.schedule.extract(
self.schedule.sqrt_one_minus_alphas_cumprod, t, x.shape
)
sqrt_recip_alphas_t = 1.0 / torch.sqrt(
self.schedule.extract(self.schedule.alphas, t, x.shape)
)
# Compute the predicted mean (Eq. 11 from Ho et al.)
model_mean = sqrt_recip_alphas_t * (
x - betas_t * model(x, t) / sqrt_one_minus_alphas_cumprod_t
)
if t_index == 0:
return model_mean
else:
posterior_variance_t = self.schedule.extract(
self.schedule.posterior_variance, t, x.shape
)
noise = torch.randn_like(x)
return model_mean + torch.sqrt(posterior_variance_t) * noise
The key line is the mean computation. The network predicts $\epsilon_\theta(x_t, t)$, and we plug that into the closed-form expression for the reverse posterior mean: $\mu_\theta = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta \right)$. The added noise term $\sqrt{\tilde{\beta}_t} \cdot z$ keeps the samples from collapsing to a single deterministic output.
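For reference, this is just the true posterior mean with the network's noise estimate substituted in. The posterior $q(x_{t-1} \mid x_t, x_0)$ has mean

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t,$$

and the forward process lets us write $x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon\right)$. Replacing $\epsilon$ with the prediction $\epsilon_\theta(x_t, t)$ and simplifying collapses the two-term mean into the single expression above.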
The full generation loop calls p_sample 1,000 times in sequence, stepping from $t = 999$ down to $t = 0$:
@torch.no_grad()
def sample(self, model, shape, device='cpu'):
"""Full reverse loop: Generate samples from pure noise"""
x = torch.randn(shape, device=device)
for i in reversed(range(0, self.schedule.num_timesteps)):
t = torch.full((shape[0],), i, device=device, dtype=torch.long)
x = self.p_sample(model, x, t, i)
return x
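As a usage sketch (the post never names the class that owns p_sample and sample, so the wrapper name here is hypothetical):

# Hypothetical wrapper class name -- assume it holds the schedule and the two methods above.
sampler = DDPMSampler(schedule=LinearNoiseSchedule(num_timesteps=1000))
model.eval()
samples = sampler.sample(model, shape=(16, 1, 28, 28), device='cpu')  # 16 MNIST-sized samples
# If training data was scaled to [-1, 1] (an assumption), map back to [0, 1] for display
samples = (samples.clamp(-1.0, 1.0) + 1.0) / 2.0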
That is the complete generation pipeline. Start with random static, run 1,000 denoising steps, get a sample from the learned distribution. Part 3 shows the results -- and they are surprisingly good for a 324K-parameter model trained for 15 epochs.