Deconstructing GANs: Part 2 - PyTorch Implementation

Introduction

Part 1 covered the math: the minimax objective, the optimal discriminator, and the connection to the Jensen-Shannon divergence (JSD). Here we turn those equations into working PyTorch code.

We implement two architectures: a Vanilla GAN with fully-connected layers, and a DCGAN that uses convolutions for spatial structure. The differences in design choices matter more than you might expect.

The Vanilla Generator

The Generator maps a 100-dimensional noise vector $z$ to a $28 \times 28$ image through fully-connected layers with BatchNorm and ReLU:

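import torch
import torch.nn as nn


class Generator(nn.Module):
    def __init__(self, z_dim=100):
        super().__init__()
        # Widen step by step: z_dim -> 256 -> 512 -> 1024 -> 784 (= 28 * 28)
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(True),
            nn.Linear(256, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(True),
            nn.Linear(512, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(True),
            nn.Linear(1024, 784),
            nn.Tanh(),  # no BatchNorm on the output layer
        )

    def forward(self, z):
        out = self.net(z)
        # Reshape the flat 784-vector into a single-channel 28x28 image
        return out.view(-1, 1, 28, 28)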

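The Tanh output squashes values into $[-1, 1]$, so the training images must be normalized to the same range. If the real data sits in $[0, 1]$ while the Generator emits values in $[-1, 1]$, the Discriminator can tell the two apart from the intensity range alone, and training typically stalls.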
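
To make the range requirement concrete, here is a minimal sketch of the rescaling, assuming the images arrive as float tensors in $[0, 1]$. The helper names `to_tanh_range` and `from_tanh_range` are illustrative and not part of any library.

```python
import torch


def to_tanh_range(x: torch.Tensor) -> torch.Tensor:
    """Map pixel values from [0, 1] to [-1, 1] to match a Tanh output."""
    return x * 2.0 - 1.0


def from_tanh_range(x: torch.Tensor) -> torch.Tensor:
    """Map generator outputs from [-1, 1] back to [0, 1] for display."""
    return (x + 1.0) / 2.0


# Random stand-in for a batch of images already scaled to [0, 1].
images = torch.rand(16, 1, 28, 28)
scaled = to_tanh_range(images)
restored = from_tanh_range(scaled)
```

With torchvision, applying `transforms.Normalize((0.5,), (0.5,))` after `transforms.ToTensor()` performs the same forward mapping, since $(x - 0.5) / 0.5 = 2x - 1$.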

The Vanilla Discriminator

The Discriminator flattens the $28 \times 28$ input and classifies it through progressively narrower linear layers, ending in a single sigmoid output:

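class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),  # (N, 1, 28, 28) -> (N, 784)
            nn.Linear(784, 512),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(256, 1),
            nn.Sigmoid(),  # probability that the input is real
        )

    def forward(self, x):
        # Returns one probability per image, shape (N, 1)
        return self.net(x)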

We use LeakyReLU(0.2) rather than ReLU. A ReLU unit whose pre-activation stays negative outputs zero and passes zero gradient (the "dying ReLU" problem), which weakens the gradient signal that reaches the Generator through the Discriminator. The 0.2 negative slope keeps gradients flowing even for negative activations.
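
The difference is easy to see on a single negative input. The following is a small illustrative check, not part of the model code; the value `-2.0` is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

# A single negative pre-activation, tracked for gradients.
x = torch.tensor([-2.0], requires_grad=True)

# ReLU: the output is 0 and so is the gradient with respect to x.
F.relu(x).sum().backward()
relu_grad = x.grad.clone()

# Reset the accumulated gradient before the second backward pass.
x.grad.zero_()

# LeakyReLU(0.2): the gradient for a negative input is the slope, 0.2.
F.leaky_relu(x, negative_slope=0.2).sum().backward()
leaky_grad = x.grad.clone()

print(relu_grad.item(), leaky_grad.item())
```

The first value is exactly zero, while the second is the negative slope (up to float32 rounding), so a layer upstream of the LeakyReLU still receives a learning signal.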

The DCGAN Architecture

Radford et al. (2015) laid out architectural guidelines that stabilize GAN training significantly:

Replace pooling with strided convolutions ($D$) and transposed convolutions ($G$)
Use BatchNorm everywhere except $D$'s input layer and $G$'s output layer
ReLU in $G$, LeakyReLU in $D$
Drop fully-connected hidden layers; use convolutional structure throughout

DCGAN Generator

The noise vector is projected into a spatial feature map, then upsampled through ConvTranspose2d layers:

class DCGenerator(nn.Module):
    def __init__(self, z_dim=100):
        super().__init__()
        # Project the noise vector to 256 * 7 * 7 features
        self.project = nn.Sequential(
            nn.Linear(z_dim, 256 * 7 * 7),
            nn.BatchNorm1d(256 * 7 * 7),
            nn.ReLU(True),
        )
        self.conv_blocks = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, 2, 1),  # -> (128, 14, 14)
            nn.BatchNorm2d(128),
            nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),   # -> (64, 28, 28)
            nn.BatchNorm2d(64),
            nn.ReLU(True),
            nn.ConvTranspose2d(64, 1, 3, 1, 1),     # -> (1, 28, 28)
            nn.Tanh(),
        )

    def forward(self, z):
        x = self.project(z)
        x = x.view(-1, 256, 7, 7)  # reshape to a (256, 7, 7) feature map
        return self.conv_blocks(x)

Transposed convolutions learn spatial upsampling filters instead of forcing the network to generate each pixel independently. This gives the generator a natural bias toward local spatial coherence.
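
The shape comments in the generator follow from the output-size rule for a transposed convolution with dilation 1 and no output padding: $\text{out} = (\text{in} - 1) \cdot \text{stride} - 2 \cdot \text{padding} + \text{kernel}$. The sketch below walks the three layers through that formula; `conv_transpose_out` is a hypothetical helper written for this illustration.

```python
def conv_transpose_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output side length of ConvTranspose2d with dilation=1, output_padding=0."""
    return (size - 1) * stride - 2 * padding + kernel


# (kernel, stride, padding) for the three ConvTranspose2d layers above.
layers = [(4, 2, 1), (4, 2, 1), (3, 1, 1)]

size = 7  # side length of the projected (256, 7, 7) feature map
sizes = [size]
for kernel, stride, padding in layers:
    size = conv_transpose_out(size, kernel, stride, padding)
    sizes.append(size)

print(sizes)  # [7, 14, 28, 28]
```

The two stride-2 layers each double the resolution, and the final stride-1 layer only mixes channels down to a single image plane.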

DCGAN Discriminator

The Discriminator mirrors the Generator with strided convolutions:

class DCDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_blocks = nn.Sequential(
            nn.Conv2d(1, 64, 4, 2, 1),         # -> (64, 14, 14)
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, 2, 1),       # -> (128, 7, 7)
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 7 * 7, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Returns one probability per image, shape (N, 1)
        return self.classifier(self.conv_blocks(x))

The first convolutional layer has no BatchNorm, per the DCGAN paper, which reports that applying BatchNorm to every layer led to sample oscillation and model instability.
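
The downsampling shapes can be checked with a dummy batch. This is a quick sanity check rather than part of the model: it rebuilds only the two strided convolutions, leaves out the activations and BatchNorm (which do not change shapes), and uses an arbitrary batch size of 8.

```python
import torch
import torch.nn as nn

# The two strided convolutions from the discriminator, without activations.
conv1 = nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1)
conv2 = nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)

x = torch.randn(8, 1, 28, 28)  # dummy batch of 8 grayscale images
h1 = conv1(x)
h2 = conv2(h1)

print(h1.shape)  # torch.Size([8, 64, 14, 14])
print(h2.shape)  # torch.Size([8, 128, 7, 7])
```

Each layer halves the side length, following $\lfloor (\text{in} + 2 \cdot \text{padding} - \text{kernel}) / \text{stride} \rfloor + 1$, which is why the classifier's linear layer expects $128 \times 7 \times 7$ inputs.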

Weight Initialization

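Convolutional and linear weights are initialized from a zero-centered normal distribution with standard deviation 0.02, i.e. $\mathcal{N}(0, 0.02^2)$, following the DCGAN paper. BatchNorm scale parameters are drawn from $\mathcal{N}(1, 0.02^2)$ and BatchNorm biases are set to zero. With a poorly scaled initialization, training can collapse within the first few epochs.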

def weights_init(m):
    classname = m.__class__.__name__
    if classname.find("Conv") != -1 or classname.find("Linear") != -1:
        nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find("BatchNorm") != -1:
        nn.init.normal_(m.weight.data, 1.0, 0.02)
        nn.init.constant_(m.bias.data, 0)
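
An initializer like this only takes effect once it is applied to a model, typically through `Module.apply`, which visits every submodule recursively. The sketch below shows the pattern on a toy stand-in network. `init_normal` is an illustrative variant that uses `isinstance` checks and covers only the layer types used in this post, and the toy layer sizes are arbitrary.

```python
import torch
import torch.nn as nn


def init_normal(m: nn.Module) -> None:
    """N(0, 0.02^2) for conv/linear weights, N(1, 0.02^2) for BatchNorm scales."""
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
    elif isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
        nn.init.normal_(m.weight, mean=1.0, std=0.02)
        nn.init.constant_(m.bias, 0.0)


# Toy stand-in model, not one of the networks defined in this post.
model = nn.Sequential(
    nn.Linear(100, 512),
    nn.BatchNorm1d(512),
    nn.ReLU(True),
    nn.Linear(512, 784),
)

# Module.apply calls init_normal on every submodule, then on the model itself.
model.apply(init_normal)

linear_std = model[0].weight.std().item()
bn_mean = model[1].weight.mean().item()
print(f"linear std ~ {linear_std:.4f}, batchnorm scale mean ~ {bn_mean:.4f}")
```

The printed statistics should land close to 0.02 and 1.0 respectively, which is a cheap way to confirm the initializer actually ran.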

The Training Loop

Each batch involves two optimization steps:

Step 1, train $D$: Run a forward pass on real images (label 1) and on fake images from $G$ (label 0). Compute BCELoss on both, backprop, and update $D$.

Step 2, train $G$: Generate fakes and pass them through $D$. $G$'s loss uses label 1, because $G$ wants $D$ to classify its outputs as real. Gradients flow back through $D$ into $G$, but only $G$'s optimizer takes a step, so $D$'s weights stay unchanged during this update.

One detail that is easy to miss: we call .detach() on the fake images during $D$'s update to prevent gradients from leaking back into $G$.
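
The effect of `.detach()` can be verified directly by inspecting which parameters receive gradients. The networks below are tiny stand-ins chosen only to keep the check short; they are not the models from this post.

```python
import torch
import torch.nn as nn

# Tiny stand-in networks, just to trace where gradients go.
G = nn.Linear(4, 8)
D = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
criterion = nn.BCELoss()

z = torch.randn(16, 4)
fake = G(z)

# Discriminator loss on fakes (label 0), with the graph cut at fake.detach().
d_loss = criterion(D(fake.detach()), torch.zeros(16, 1))
d_loss.backward()

g_has_grad = any(p.grad is not None for p in G.parameters())
d_has_grad = all(p.grad is not None for p in D.parameters())
print(g_has_grad, d_has_grad)  # False True
```

Because the backward pass stops at the detached tensor, none of $G$'s parameters accumulate a gradient from $D$'s loss. Dropping the `.detach()` call would populate them and spend extra compute on a backward pass through $G$ that the $D$ update never uses.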

Optimizer Configuration

We use Adam with $\text{lr} = 0.0002$ and $(\beta_1, \beta_2) = (0.5, 0.999)$, following the DCGAN paper. The paper reports that the default $\beta_1 = 0.9$ led to training oscillation and instability, and that lowering it to 0.5 helped stabilize training.
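
Putting these optimizer settings together with the two steps of the training loop gives roughly the following. This is a minimal sketch under stated assumptions: `train_step` is a hypothetical helper, the one-layer `G` and `D` are stand-ins so the snippet runs on its own, and the "real" batch is random data rather than MNIST.

```python
import torch
import torch.nn as nn


def train_step(G, D, opt_g, opt_d, real, z_dim, criterion):
    """One D update followed by one G update on a single batch."""
    batch = real.size(0)
    ones = torch.ones(batch, 1, device=real.device)
    zeros = torch.zeros(batch, 1, device=real.device)

    # Step 1: update D on real (label 1) and detached fake (label 0) images.
    fake = G(torch.randn(batch, z_dim, device=real.device))
    d_loss = criterion(D(real), ones) + criterion(D(fake.detach()), zeros)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Step 2: update G so that D labels its fakes as real (label 1).
    g_loss = criterion(D(fake), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()


# Stand-in networks operating on flat 784-vectors.
z_dim = 100
G = nn.Sequential(nn.Linear(z_dim, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())

# Adam with the settings described above.
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
criterion = nn.BCELoss()

real = torch.rand(32, 784) * 2 - 1  # random stand-in for normalized images
d_loss, g_loss = train_step(G, D, opt_g, opt_d, real, z_dim, criterion)
print(f"d_loss={d_loss:.4f}  g_loss={g_loss:.4f}")
```

The $G$ update also leaves gradients on $D$'s parameters, but they are never applied: `opt_d.zero_grad()` clears them at the start of the next $D$ update.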

Up Next

In Part 3, we train both models on MNIST, look at loss dynamics, compare sample quality, and discuss where things go wrong --- including mode collapse.
