Deconstructing Autoencoders: Part 2 - PyTorch Implementation

Introduction

Part 1 covered the math: the information bottleneck, MSE loss, and the manifold hypothesis. Here we translate that into four PyTorch implementations (Vanilla, Denoising, Sparse, and Convolutional), each targeting a different design constraint.

Vanilla Autoencoder

The baseline architecture is a symmetric encoder-decoder with fully-connected layers:

# Encoder: 784 -> 512 -> 256 -> latent_dim
self.encoder = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(inplace=True),
    nn.Linear(512, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, latent_dim),
)

# Decoder: latent_dim -> 256 -> 512 -> 784
self.decoder = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, 512),
    nn.ReLU(inplace=True),
    nn.Linear(512, 784),
    nn.Sigmoid(),
)
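
The two fragments above are attributes of a module. A minimal sketch of how they might be wrapped into a runnable class follows; the class name `VanillaAutoencoder`, the `encode`/`decode` method names, and the default `latent_dim=32` are assumptions for illustration, and a random tensor stands in for flattened MNIST images.

```python
import torch
import torch.nn as nn


class VanillaAutoencoder(nn.Module):
    """Fully-connected autoencoder; class and method names are illustrative."""

    def __init__(self, latent_dim: int = 32):
        super().__init__()
        # Encoder: 784 -> 512 -> 256 -> latent_dim
        self.encoder = nn.Sequential(
            nn.Linear(784, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 256), nn.ReLU(inplace=True),
            nn.Linear(256, latent_dim),  # no activation on the bottleneck
        )
        # Decoder: latent_dim -> 256 -> 512 -> 784
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 784), nn.Sigmoid(),  # outputs bounded by Sigmoid
        )

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        return self.decode(self.encode(x))


model = VanillaAutoencoder(latent_dim=32)
batch = torch.rand(8, 784)  # stand-in for 8 flattened, normalized images
z = model.encode(batch)
recon = model(batch)
print(z.shape, recon.shape)
```

Splitting `encode` and `decode` out of `forward` makes it easy to inspect the latent code on its own later.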

Architecture Choices

Layer sizes (784 → 512 → 256 → d): Narrowing the width in stages creates a gradual compression funnel rather than an abrupt squeeze.
ReLU activations: Without non-linearities, a stack of linear layers collapses to a single linear map, and a linear autoencoder trained with MSE can do no better than PCA. ReLU lets the network learn non-linear manifolds.
Sigmoid output: Constrains reconstruction to $[0, 1]$, matching normalized MNIST pixel values.
No Sigmoid on the bottleneck: The latent code $\mathbf{z}$ is unconstrained, so each coordinate can take any real value.
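
The claim about stacked linear layers can be checked numerically. The sketch below (layer sizes and variable names chosen only for illustration) merges two `nn.Linear` layers into one by multiplying their weight matrices and folding the biases together, then compares the outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two linear layers with no activation in between.
f1 = nn.Linear(784, 256)
f2 = nn.Linear(256, 32)

# f2(f1(x)) = W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
merged = nn.Linear(784, 32)
with torch.no_grad():
    merged.weight.copy_(f2.weight @ f1.weight)
    merged.bias.copy_(f2.weight @ f1.bias + f2.bias)

x = torch.rand(4, 784)
with torch.no_grad():
    stacked_out = f2(f1(x))
    merged_out = merged(x)

# The gap is floating-point rounding only.
max_gap = (stacked_out - merged_out).abs().max().item()
print(max_gap)
```

Inserting a ReLU between `f1` and `f2` breaks this identity, which is what lets the network represent more than a linear projection.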

Denoising Autoencoder

The denoising variant uses the same encoder-decoder architecture but corrupts the input during training:

def add_noise(self, x):
    noise = self.noise_factor * torch.randn_like(x)
    noisy = x + noise
    return torch.clamp(noisy, 0.0, 1.0)

def forward(self, x):
    if self.training:
        x_input = self.add_noise(x)
    else:
        x_input = x
    z = self.encode(x_input)
    return self.decode(z)

Key Design Decisions

Noise injection is training-only: The self.training flag means evaluation uses clean inputs, which is necessary for fair comparison with other variants.
Clamping to $[0,1]$: Without clamping, noisy inputs can go negative or exceed 1, leaving the valid pixel range that the clean images and the Sigmoid-bounded reconstructions share.
Clean reconstruction targets: Loss is computed against the original $\mathbf{x}$, not the noisy $\tilde{\mathbf{x}}$. This forces the network to denoise.
Noise factor 0.3: The noise standard deviation is roughly 30% of the pixel range, enough to challenge the network but not enough to destroy digit identity.
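
To see these decisions working together, here is a hedged sketch of one training step followed by an evaluation pass. The class name `DenoisingAutoencoder` is an assumption, the hidden layers are shrunk to keep the example short, and random tensors stand in for MNIST.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class DenoisingAutoencoder(nn.Module):
    """Illustrative wrapper; layer sizes are shrunk to keep the sketch short."""

    def __init__(self, latent_dim: int = 32, noise_factor: float = 0.3):
        super().__init__()
        self.noise_factor = noise_factor
        self.encoder = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(inplace=True),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 784), nn.Sigmoid(),
        )

    def add_noise(self, x):
        noisy = x + self.noise_factor * torch.randn_like(x)
        return torch.clamp(noisy, 0.0, 1.0)

    def forward(self, x):
        # Corrupt the input only in training mode.
        x_input = self.add_noise(x) if self.training else x
        return self.decoder(self.encoder(x_input))


model = DenoisingAutoencoder()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
clean = torch.rand(16, 784)  # stand-in for flattened, normalized images

# Training step: forward() sees a noisy input, the loss sees the clean target.
model.train()
recon = model(clean)
loss = criterion(recon, clean)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Evaluation: no noise is injected, so repeated passes agree exactly.
model.eval()
with torch.no_grad():
    eval_a = model(clean)
    eval_b = model(clean)
noisy_sample = model.add_noise(clean)
```

Because the corruption lives inside `forward`, the training loop passes the clean batch as both input and target; measuring denoising quality at evaluation time would require calling `add_noise` explicitly.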

Sparse Autoencoder

The sparse variant adds an L1 penalty on the bottleneck activations. The key change is that forward() returns both the reconstruction and the latent code:

def forward(self, x):
    z = self.encode(x)
    x_hat = self.decode(z)
    return x_hat, z  # return both for sparsity loss

The training loop then computes the composite loss:

recon, z = model(flat)
mse = criterion(recon, flat)
l1 = sparsity_weight * torch.mean(torch.abs(z))
loss = mse + l1

L1 Sparsity Formulation

The total loss is:

\mathcal{L} = \underbrace{\frac{1}{N}\sum_{i=1}^N \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|^2}_{\text{reconstruction}} + \underbrace{\lambda \cdot \frac{1}{N}\sum_{i=1}^N \|\mathbf{z}_i\|_1}_{\text{sparsity penalty}}

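The L1 norm $\|\mathbf{z}\|_1 = \sum_k |z_k|$ is non-differentiable at zero, but PyTorch's automatic differentiation handles this by using the subgradient 0 at that point. The sparsity weight $\lambda = 10^{-3}$ balances reconstruction quality against sparsity: too high and reconstructions degrade, too low and the penalty has no effect. Note that torch.mean(torch.abs(z)) averages over every latent element (and a mean-reduced MSE criterion averages over every pixel), so the code matches the formula only up to constant factors that effectively rescale $\lambda$.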
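
A small numeric check makes the penalty concrete. The sketch below uses a hand-written latent batch (values chosen only for illustration) to compare the per-sample L1 norm from the formula with the element-wise mean used in the training snippet, and to inspect the gradient at exactly zero.

```python
import torch

# Hypothetical latent codes: N = 2 samples, d = 4 latent units.
z = torch.tensor([[0.5, -2.0, 0.0, 1.5],
                  [0.0, 0.0, -1.0, 3.0]], requires_grad=True)

# Formula: (1/N) * sum_i ||z_i||_1
per_sample_l1 = z.abs().sum(dim=1).mean()

# Training snippet: torch.mean(torch.abs(z)) averages over all N * d entries.
elementwise_mean = torch.mean(torch.abs(z))

# Gradient of the element-wise mean is sign(z) / (N * d); zero entries get 0.
elementwise_mean.backward()
grad = z.grad

print(per_sample_l1.item(), elementwise_mean.item())
print(grad)
```

The two quantities differ by the latent dimension d, so a sparsity weight tuned for one form needs rescaling by d to give the same penalty under the other.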

Convolutional Autoencoder

The convolutional variant replaces fully-connected layers with Conv2d and ConvTranspose2d:

# Encoder
self.enc_conv = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1),   # 28x28 -> 14x14
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 32, 3, stride=2, padding=1),  # 14x14 -> 7x7
    nn.ReLU(inplace=True),
)
self.enc_fc = nn.Linear(32 * 7 * 7, latent_dim)

# Decoder
self.dec_fc = nn.Linear(latent_dim, 32 * 7 * 7)
self.dec_conv = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1,
                       output_padding=1),         # 7x7 -> 14x14
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1,
                       output_padding=1),          # 14x14 -> 28x28
    nn.Sigmoid(),
)
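
The fragments above leave out the reshaping between the convolutional and fully-connected stages. One plausible way to wire them together is sketched below; the class name `ConvAutoencoder`, the `encode`/`decode` methods, and the use of `flatten` and `view` are assumptions, not necessarily how the original code does it.

```python
import torch
import torch.nn as nn


class ConvAutoencoder(nn.Module):
    """Illustrative wiring of the encoder/decoder fragments."""

    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.enc_conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),   # 28x28 -> 14x14
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),  # 14x14 -> 7x7
            nn.ReLU(inplace=True),
        )
        self.enc_fc = nn.Linear(32 * 7 * 7, latent_dim)
        self.dec_fc = nn.Linear(latent_dim, 32 * 7 * 7)
        self.dec_conv = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1,
                               output_padding=1),       # 7x7 -> 14x14
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1,
                               output_padding=1),       # 14x14 -> 28x28
            nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.enc_conv(x)                        # (N, 32, 7, 7)
        return self.enc_fc(h.flatten(start_dim=1))  # (N, latent_dim)

    def decode(self, z):
        h = self.dec_fc(z).view(-1, 32, 7, 7)       # back to a feature map
        return self.dec_conv(h)                     # (N, 1, 28, 28)

    def forward(self, x):
        return self.decode(self.encode(x))


model = ConvAutoencoder(latent_dim=32)
images = torch.rand(4, 1, 28, 28)  # stand-in for a batch of MNIST images
z = model.encode(images)
recon = model(images)
print(z.shape, recon.shape)
```

Unlike the fully-connected variants, this model takes unflattened `(N, 1, 28, 28)` input, so the training loop should skip the flattening step.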

Why Convolutions Win

Spatial structure: FC layers flatten images into 784-d vectors, treating adjacent pixels the same as distant ones. Convolutions preserve the 2D grid and learn local features (edges, corners, curves).
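Parameter efficiency: The first 3$\times$3 conv layer (1 input channel, 16 filters) has $1 \times 16 \times 3 \times 3 = 144$ weights (160 with biases) vs. $784 \times 512 = 401{,}408$ weights for the first FC layer of the vanilla model.
Translation equivariance: A "7" in the top-left produces the same feature activations as one in the bottom-right, shifted accordingly (approximately so with stride-2 layers). The network does not have to relearn each feature at every position, so it needs fewer examples.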
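
The parameter comparison can be reproduced directly from the layer definitions. This sketch uses a small helper, `count_params` (a name introduced here for illustration), that sums the element counts of a module's parameters, biases included.

```python
import torch.nn as nn


def count_params(module: nn.Module) -> int:
    """Total number of learnable scalars in a module (weights and biases)."""
    return sum(p.numel() for p in module.parameters())


conv = nn.Conv2d(1, 16, 3, stride=2, padding=1)  # first conv layer
fc = nn.Linear(784, 512)                         # first FC layer

conv_params = count_params(conv)  # 16*1*3*3 weights + 16 biases
fc_params = count_params(fc)      # 784*512 weights + 512 biases

print(conv_params, fc_params)  # 160 401920
```

The comparison covers only the first layer of each model; the convolutional variant still has sizeable fully-connected layers around the bottleneck.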

ConvTranspose2d: Learned Upsampling

The decoder uses transposed convolutions (sometimes called "deconvolutions") to upsample feature maps. With kernel size 3, stride=2, padding=1, and output_padding=1, each ConvTranspose2d doubles the spatial dimensions:

$7 \times 7 \to 14 \times 14$ (first layer)
$14 \times 14 \to 28 \times 28$ (second layer, restoring original resolution)

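The output_padding=1 argument resolves an ambiguity: under a stride-2 convolution with kernel size 3 and padding 1, several input sizes produce the same output size ($\lfloor(n-1)/2\rfloor + 1 = 7$ for both $n=13$ and $n=14$). Without it, the transposed convolution would map $7 \times 7$ to $13 \times 13$; the extra row and column select the $14 \times 14$ alternative.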
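
A quick shape check illustrates both halves of this argument. The sketch below uses zero tensors and freshly initialized layers (variable names are illustrative), since only the output shapes matter here.

```python
import torch
import torch.nn as nn

# Forward direction: 13x13 and 14x14 inputs both shrink to 7x7.
down = nn.Conv2d(1, 16, 3, stride=2, padding=1)
size_from_13 = down(torch.zeros(1, 1, 13, 13)).shape[-1]
size_from_14 = down(torch.zeros(1, 1, 14, 14)).shape[-1]

# Reverse direction: output size is
# (in - 1) * stride - 2 * padding + kernel + output_padding
x = torch.zeros(1, 32, 7, 7)
up_plain = nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1)
up_padded = nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1,
                               output_padding=1)

plain_shape = tuple(up_plain(x).shape)    # (7-1)*2 - 2 + 3 + 0 = 13
padded_shape = tuple(up_padded(x).shape)  # (7-1)*2 - 2 + 3 + 1 = 14

print(size_from_13, size_from_14)  # 7 7
print(plain_shape, padded_shape)   # (1, 16, 13, 13) (1, 16, 14, 14)
```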

Training Configuration

All four variants are trained with:

Dataset: 5,000-image MNIST subset (for fast CPU training)
Optimizer: Adam (lr=$10^{-3}$)
Loss: MSE for vanilla and conv; MSE against clean targets for denoising; MSE + L1 for sparse
Epochs: 30
Batch size: 128
Latent dim: 2 (for visualization) and 32 (for quality comparison)
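
As a rough sketch of how this configuration might translate into a training loop: the code below uses the listed optimizer, learning rate, loss, and batch size, but substitutes random tensors for the MNIST subset, a shrunken stand-in model, and only 2 epochs so that it runs in seconds. Those substitutions, along with all variable names, are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data: random "images" instead of the 5,000-image MNIST subset.
images = torch.rand(512, 784)
loader = DataLoader(TensorDataset(images), batch_size=128, shuffle=True)

# Stand-in model: a shrunken fully-connected autoencoder.
latent_dim = 32
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(inplace=True),
    nn.Linear(256, latent_dim),
    nn.Linear(latent_dim, 256), nn.ReLU(inplace=True),
    nn.Linear(256, 784), nn.Sigmoid(),
)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

num_epochs = 2  # the configuration above uses 30
history = []
for epoch in range(num_epochs):
    model.train()
    running = 0.0
    for (flat,) in loader:
        recon = model(flat)
        loss = criterion(recon, flat)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running += loss.item() * flat.size(0)  # weight by batch size
    history.append(running / len(images))      # mean loss per sample

print(history)
```

Swapping in a real dataset would mean replacing `images` with flattened, `[0, 1]`-normalized MNIST tensors; the rest of the loop stays the same for the vanilla variant.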

What Does the Bottleneck Learn?

All four variants are implemented. Part 3 trains them on MNIST and digs into the results: 2D latent space scatter plots colored by digit class, denoising under heavy Gaussian noise, and reconstruction quality across architectures. Full code is on GitHub.
