In Part 1, we established the mathematical objective of a Denoising Diffusion Probabilistic Model (DDPM): forward-corrupt an image with a specific noise profile $\beta_t$, and train a neural network to predict the exact noise vector $\epsilon$ that was added at timestep $t$.
Today, we take that math and translate it into a pure PyTorch implementation. We will build an unconditional DDPM that learns to generate handwritten MNIST digits from scratch.
The Noise Schedule
First, we need a scheduler to precompute the coefficients of the closed-form forward process, $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$ with $\bar\alpha_t = \prod_{s=1}^{t} (1 - \beta_s)$, over $T = 1000$ timesteps.
import torch

class LinearNoiseSchedule:
    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=2e-2):
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1.0 - self.betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
        # Coefficients needed for the forward process q(x_t | x_0)
        self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
        self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)
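A quick standalone sanity check (recomputing the same coefficients) confirms the schedule behaves as intended: at $t=0$ nearly all of the signal survives, while by $t=T-1$ the cumulative product has decayed to almost zero, so $x_T$ is effectively pure Gaussian noise.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# Signal coefficient starts near 1 and decays to near 0 over the schedule
print(alphas_cumprod[0].item())   # close to 1
print(alphas_cumprod[-1].item())  # close to 0
```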
The Network: A Time-Conditioned UNet
A standard UNet maps an input image to an output image. Our UNet must map a noisy image $x_t$ and the current timestep $t$ to a predicted noise tensor $\epsilon_\theta$. The network needs to know $t$ because the noise scale varies drastically between $t=1$ and $t=1000$.
We inject $t$ into the network using a Sinusoidal Position Embedding—the exact same mechanism used in Transformers. We map the scalar $t$ into a high-dimensional continuous representation, run it through an MLP, and add it to the feature maps at the bottleneck of the UNet.
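The `SinusoidalPositionEmbeddings` module referenced below is not part of `torch.nn`; here is a minimal sketch of one possible implementation, following the log-spaced frequency scheme from the original Transformer paper:

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionEmbeddings(nn.Module):
    """Maps a batch of integer timesteps (B,) to dense embeddings (B, dim)."""
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        half_dim = self.dim // 2
        # Log-spaced frequencies from 1 down to 1/10000
        freqs = torch.exp(
            -math.log(10000) * torch.arange(half_dim, device=time.device) / (half_dim - 1)
        )
        args = time[:, None].float() * freqs[None, :]
        # Concatenate sine and cosine components along the feature dim
        return torch.cat([args.sin(), args.cos()], dim=-1)
```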
import torch.nn as nn

class SimpleUNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Time embedding: sinusoidal encoding followed by a small MLP
        self.time_mlp = nn.Sequential(
            SinusoidalPositionEmbeddings(256),
            nn.Linear(256, 256),
            nn.GELU()
        )
        # Standard down/up conv blocks ...

    def forward(self, x, time):
        # Compute time embedding, reshaped to broadcast over spatial dims
        t_emb = self.time_mlp(time)[:, :, None, None]
        # Standard encoder passes ...
        x = self.bottleneck(x)
        # Inject time conditioning
        x = x + t_emb
        # Standard decoder passes with skip connections ...
        return self.output(x)
The Training Loop
The training algorithm (Ho et al., Algorithm 1) is stunningly simple to implement in PyTorch:
# For each batch of clean images x_0:
optimizer.zero_grad()

# 1. Pick a random timestep t for each image in the batch
t = torch.randint(0, T, (batch_size,), device=device)

# 2. Sample standard normal noise
noise = torch.randn_like(x_0)

# 3. Corrupt x_0 to x_t using the closed-form forward process.
#    Reshape the (batch_size,) coefficients to broadcast over (B, C, H, W).
sqrt_ac = schedule.sqrt_alphas_cumprod[t][:, None, None, None]
sqrt_omac = schedule.sqrt_one_minus_alphas_cumprod[t][:, None, None, None]
x_t = sqrt_ac * x_0 + sqrt_omac * noise

# 4. Ask the UNet to predict the noise
predicted_noise = model(x_t, t)

# 5. The loss is simply the mean squared error
loss = F.mse_loss(predicted_noise, noise)
loss.backward()
optimizer.step()
That is the entire training procedure. The network learns to pull structure out of noise at every possible noise level.
The Sampling Loop (Reverse Process)
Once trained, we generate new data by starting from pure Gaussian noise $\mathcal{N}(\mathbf{0}, \mathbf{I})$ and running the reverse process parameterized by our UNet. At each step from $t=T$ down to $t=1$, we subtract a scaled fraction of the network's predicted noise to form the posterior mean, then add back a small amount of Langevin-style noise, except at the final step, to match the learned reverse variance.
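This reverse pass can be sketched as follows, assuming the `LinearNoiseSchedule` and trained model from above (the helper name `sample` is ours; the update rule follows Ho et al., Algorithm 2):

```python
import torch

@torch.no_grad()
def sample(model, schedule, shape, device="cpu"):
    """Ancestral sampling sketch: x_T ~ N(0, I), then denoise step by step."""
    x = torch.randn(shape, device=device)
    for t in reversed(range(len(schedule.betas))):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)                     # predicted noise
        beta_t = schedule.betas[t]
        alpha_t = schedule.alphas[t]
        coef = beta_t / schedule.sqrt_one_minus_alphas_cumprod[t]
        # Posterior mean: remove the scaled predicted-noise contribution
        mean = (x - coef * eps) / torch.sqrt(alpha_t)
        if t > 0:
            # Add fresh Gaussian noise at every step except the last
            x = mean + torch.sqrt(beta_t) * torch.randn_like(x)
        else:
            x = mean
    return x
```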
Next Steps: Visualizing the Generation
We now have a complete mathematical theory and a pure PyTorch implementation. In Part 3, we will run this network on the MNIST dataset and visualize the results. We will see the exact grid of pure static slowly resolving into clear, recognizable digits as the model steps backward through time.