In Part 1, we established the mathematical foundation of autoencoders---the information bottleneck, MSE reconstruction loss, and the manifold hypothesis. Today, we translate that math into code. We implement four distinct autoencoder variants from scratch in pure PyTorch, each illustrating a different design principle.
Vanilla Autoencoder
The baseline architecture is a symmetric encoder-decoder with fully-connected layers:
# Encoder: 784 -> 512 -> 256 -> latent_dim
self.encoder = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(inplace=True),
    nn.Linear(512, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, latent_dim),
)
# Decoder: latent_dim -> 256 -> 512 -> 784
self.decoder = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, 512),
    nn.ReLU(inplace=True),
    nn.Linear(512, 784),
    nn.Sigmoid(),
)
Architecture Choices
- Layer sizes (784 → 512 → 256 → d): The encoder shrinks the representation in stages, roughly halving it at each step, creating a smooth compression funnel rather than an abrupt bottleneck.
- ReLU activations: Introduce non-linearity so the autoencoder can learn non-linear manifolds. Without activations, stacking linear layers collapses to a single linear map, which can do no better than PCA.
- Sigmoid output: Constrains the reconstruction to $[0, 1]$, matching the normalized pixel range of MNIST images.
- No Sigmoid on the bottleneck: The latent code $\mathbf{z}$ is unconstrained, allowing the network to use the full real line.
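Putting these pieces together, a minimal module might look like the following sketch (the class name and the flattening in forward() are my assumptions; the repo's version may differ):

```python
import torch
import torch.nn as nn

class VanillaAutoencoder(nn.Module):
    """Symmetric fully-connected autoencoder for 28x28 inputs."""

    def __init__(self, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 256), nn.ReLU(inplace=True),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 784), nn.Sigmoid(),
        )

    def forward(self, x):
        # Flatten (B, 1, 28, 28) -> (B, 784), then encode and decode.
        z = self.encoder(x.view(x.size(0), -1))
        return self.decoder(z)
```

Because the decoder ends in Sigmoid, the reconstruction is guaranteed to lie in $[0, 1]$ regardless of the latent code.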
Denoising Autoencoder
The denoising variant uses the same encoder-decoder architecture but corrupts the input during training:
def add_noise(self, x):
    noise = self.noise_factor * torch.randn_like(x)
    noisy = x + noise
    return torch.clamp(noisy, 0.0, 1.0)

def forward(self, x):
    if self.training:
        x_input = self.add_noise(x)
    else:
        x_input = x
    z = self.encode(x_input)
    return self.decode(z)
Key Design Decisions
- Noise injection is training-only: The self.training flag ensures that during evaluation, the autoencoder processes clean inputs. This is critical for fair comparison with other variants.
- Clamping to $[0,1]$: After adding Gaussian noise, we clamp to keep pixel values in the valid range. Without clamping, the noisy input could have negative values or values above 1, which would shift the input distribution away from what the Sigmoid output can produce.
- Clean reconstruction targets: The loss is always computed against the original clean input $\mathbf{x}$, not the noisy input $\tilde{\mathbf{x}}$. This is what forces the network to denoise.
- Noise factor 0.3: This adds substantial corruption---roughly 30% of the pixel range as standard deviation. High enough to challenge the network, low enough that the digit identity is usually preserved.
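A single training step then looks something like this sketch (the function and variable names are illustrative, not the repo's); the important detail is that the loss target is the clean batch, while the noise is injected inside the model's forward():

```python
import torch
import torch.nn as nn

def denoising_step(model, batch, optimizer, criterion=nn.MSELoss()):
    """One optimization step: noisy input (added inside the model), clean target."""
    flat = batch.view(batch.size(0), -1)  # clean images, flattened
    recon = model(flat)                   # model corrupts its input when model.training is True
    loss = criterion(recon, flat)         # loss is against the CLEAN input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

If the loss were computed against the noisy input instead, the network would have no incentive to remove the corruption; it would simply learn the identity on noisy images.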
Sparse Autoencoder
The sparse variant adds an L1 penalty on the bottleneck activations. The key change is that forward() returns both the reconstruction and the latent code:
def forward(self, x):
    z = self.encode(x)
    x_hat = self.decode(z)
    return x_hat, z  # return both for sparsity loss
The training loop then computes the composite loss:
recon, z = model(flat)
mse = criterion(recon, flat)
l1 = sparsity_weight * torch.mean(torch.abs(z))
loss = mse + l1
L1 Sparsity Formulation
The total loss is:

$$\mathcal{L} = \|\mathbf{x} - \hat{\mathbf{x}}\|_2^2 + \lambda \|\mathbf{z}\|_1$$

The L1 norm $\|\mathbf{z}\|_1 = \sum_k |z_k|$ is non-differentiable at zero, but PyTorch's automatic differentiation handles this correctly using subgradients. The sparsity weight $\lambda = 10^{-3}$ balances reconstruction quality against sparsity---too high and reconstructions degrade, too low and the penalty has no effect.
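To confirm the penalty is actually producing sparse codes, a small helper can report the fraction of near-zero bottleneck activations (the helper and its threshold are my own illustration, not part of the repo):

```python
import torch

def activation_sparsity(z, threshold=1e-2):
    """Fraction of latent units whose magnitude falls below `threshold`."""
    return (z.abs() < threshold).float().mean().item()
```

A well-tuned sparse autoencoder should drive this fraction noticeably higher than the vanilla variant at the same latent dimension.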
Convolutional Autoencoder
The convolutional variant replaces fully-connected layers with Conv2d and ConvTranspose2d:
# Encoder
self.enc_conv = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1),   # 28x28 -> 14x14
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 32, 3, stride=2, padding=1),  # 14x14 -> 7x7
    nn.ReLU(inplace=True),
)
self.enc_fc = nn.Linear(32 * 7 * 7, latent_dim)

# Decoder
self.dec_fc = nn.Linear(latent_dim, 32 * 7 * 7)
self.dec_conv = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1,
                       output_padding=1),  # 7x7 -> 14x14
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1,
                       output_padding=1),  # 14x14 -> 28x28
    nn.Sigmoid(),
)
Why Convolutions Win
- Spatial structure: Fully-connected layers flatten images into 784-dimensional vectors, treating adjacent pixels the same as distant ones. Convolutions preserve the 2D grid, so the encoder learns local spatial features (edges, corners, curves).
- Parameter efficiency: A 3$\times$3 conv with 16 filters has $1 \times 16 \times 3 \times 3 = 144$ parameters. The equivalent fully-connected layer would have $784 \times 512 = 401{,}408$ parameters.
- Translation equivariance: A "7" in the top-left corner produces the same feature activations as a "7" in the bottom-right, shifted accordingly. This built-in equivariance reduces the number of examples needed to learn.
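The parameter-count claim is easy to verify by inspecting the weight tensors directly (biases excluded, matching the count in the text):

```python
import torch.nn as nn

conv = nn.Conv2d(1, 16, 3)   # 16 filters, 3x3 kernel, 1 input channel
fc = nn.Linear(784, 512)     # first fully-connected encoder layer

conv_weights = conv.weight.numel()  # 16 * 1 * 3 * 3 = 144
fc_weights = fc.weight.numel()      # 512 * 784 = 401,408
```

The convolutional layer achieves its compression with roughly 2,800x fewer weights than the fully-connected layer it replaces.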
ConvTranspose2d: Learned Upsampling
The decoder uses transposed convolutions (sometimes called "deconvolutions") to upsample feature maps. With stride=2 and output_padding=1, each ConvTranspose2d doubles the spatial dimensions:
- $7 \times 7 \to 14 \times 14$ (first layer)
- $14 \times 14 \to 28 \times 28$ (second layer, restoring original resolution)
The output_padding=1 resolves the ambiguity that arises because multiple input sizes can produce the same output size under stride-2 convolution ($\lfloor(n-1)/2\rfloor + 1$ is the same for $n=13$ and $n=14$).
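The shape arithmetic is worth verifying once by hand; this sketch pushes a dummy tensor through a single stride-2 down/up pair from the architecture above:

```python
import torch
import torch.nn as nn

x = torch.rand(1, 1, 28, 28)
down = nn.Conv2d(1, 16, 3, stride=2, padding=1)
up = nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1)

h = down(x)  # floor((28 + 2*1 - 3) / 2) + 1 = 14
assert h.shape[-2:] == (14, 14)

y = up(h)    # (14 - 1)*2 - 2*1 + 3 + 1 = 28
assert y.shape[-2:] == (28, 28)
```

Dropping output_padding=1 would produce a 27x27 output here, which is exactly the stride-2 ambiguity described above.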
Training Configuration
All four variants are trained with:
- Dataset: 5,000-image MNIST subset (for fast CPU training)
- Optimizer: Adam (lr=$10^{-3}$)
- Loss: MSE for vanilla and conv; MSE against clean targets for denoising; MSE + L1 for sparse
- Epochs: 30
- Batch size: 128
- Latent dim: 2 (for visualization) and 32 (for quality comparison)
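Under those settings, the shared training loop can be sketched as follows (dataset loading omitted; the function signature and print format are my assumptions, and the sparse variant would need the two-value forward() and the extra L1 term):

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=30, lr=1e-3, device="cpu"):
    """Generic MSE training loop for the vanilla, denoising, and conv variants."""
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for epoch in range(epochs):
        total = 0.0
        for batch, _ in loader:  # MNIST loaders yield (image, label) pairs
            flat = batch.view(batch.size(0), -1).to(device)
            recon = model(flat)
            loss = criterion(recon, flat)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item() * flat.size(0)
        print(f"epoch {epoch + 1}: mse {total / len(loader.dataset):.4f}")
    return model
```

Because the denoising model corrupts its own input inside forward(), the exact same loop trains it correctly: the target flat is always the clean batch.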
Next Steps: What Does the Bottleneck Learn?
With all four variants implemented, we can now train them and investigate what the bottleneck actually discovers. In Part 3, we visualize the 2D latent space as a scatter plot colored by digit class, demonstrate the denoising autoencoder's ability to remove heavy Gaussian noise, and compare reconstruction quality across all four architectures.
Full code is on the GitHub repo. Stay tuned for the training results!