Deconstructing Autoencoders from Scratch

Part 2: PyTorch Implementation

Introduction

Part 1 covered the math: the information bottleneck, MSE loss, and the manifold hypothesis. Here we translate that into four PyTorch implementations (Vanilla, Denoising, Sparse, and Convolutional), each targeting a different design constraint.

Vanilla Autoencoder

The baseline architecture is a symmetric encoder-decoder with fully-connected layers:

# Encoder: 784 -> 512 -> 256 -> latent_dim
self.encoder = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(inplace=True),
    nn.Linear(512, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, latent_dim),
)

# Decoder: latent_dim -> 256 -> 512 -> 784
self.decoder = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, 512),
    nn.ReLU(inplace=True),
    nn.Linear(512, 784),
    nn.Sigmoid(),
)
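To run the excerpt end to end, the two `nn.Sequential` stacks can be wrapped in a module. The following is a minimal sketch, not necessarily the repository's exact class: the class name `VanillaAutoencoder`, the `encode`/`decode` helpers, the default `latent_dim=32`, and the random stand-in batch are all assumptions made for illustration.

```python
import torch
from torch import nn


class VanillaAutoencoder(nn.Module):
    """Fully-connected autoencoder for flattened 28x28 images (illustrative wrapper)."""

    def __init__(self, latent_dim: int = 32):  # default value is an assumption
        super().__init__()
        # Encoder: 784 -> 512 -> 256 -> latent_dim
        self.encoder = nn.Sequential(
            nn.Linear(784, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, latent_dim),
        )
        # Decoder: latent_dim -> 256 -> 512 -> 784
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, 784),
            nn.Sigmoid(),  # squashes every output into [0, 1]
        )

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(x))


model = VanillaAutoencoder(latent_dim=32)
batch = torch.rand(8, 784)   # stand-in for 8 flattened images with pixels in [0, 1]
recon = model(batch)         # reconstruction, same shape as the input
code = model.encode(batch)   # bottleneck representation
print(recon.shape, code.shape)
```

The final `Sigmoid` keeps reconstructions in the same [0, 1] range as normalized pixel intensities, which is what makes a pixel-wise MSE against the input meaningful.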

Architecture Choices

Denoising Autoencoder

The denoising variant uses the same encoder-decoder architecture but corrupts the input during training:

def add_noise(self, x):
    noise = self.noise_factor * torch.randn_like(x)
    noisy = x + noise
    return torch.clamp(noisy, 0.0, 1.0)

def forward(self, x):
    if self.training:
        x_input = self.add_noise(x)
    else:
        x_input = x
    z = self.encode(x_input)
    return self.decode(z)
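The two methods above depend on `self.noise_factor`, `self.encode`, and `self.decode`, which the excerpt does not show. A self-contained sketch of how they might fit together follows. The class name, the deliberately tiny encoder and decoder, and `noise_factor=0.3` are assumptions chosen to keep the example short. It also assumes the usual denoising setup, in which the loss compares the reconstruction against the clean input rather than the corrupted one.

```python
import torch
from torch import nn


class DenoisingAutoencoder(nn.Module):
    """Illustrative denoising autoencoder; layer sizes are placeholders."""

    def __init__(self, latent_dim: int = 32, noise_factor: float = 0.3):
        super().__init__()
        self.noise_factor = noise_factor  # assumed value, not taken from the article
        self.encoder = nn.Sequential(nn.Linear(784, latent_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 784), nn.Sigmoid())

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)

    def add_noise(self, x):
        # Additive Gaussian noise, clamped back into the valid pixel range.
        noise = self.noise_factor * torch.randn_like(x)
        return torch.clamp(x + noise, 0.0, 1.0)

    def forward(self, x):
        # Corrupt only while training; evaluation sees the input unchanged.
        x_input = self.add_noise(x) if self.training else x
        return self.decode(self.encode(x_input))


torch.manual_seed(0)
model = DenoisingAutoencoder()
clean = torch.rand(4, 784)  # stand-in batch of flattened images

model.train()
recon = model(clean)                         # input is corrupted inside forward()
loss = nn.functional.mse_loss(recon, clean)  # target is the clean input

model.eval()
with torch.no_grad():
    out_a = model(clean)
    out_b = model(clean)  # identical to out_a: no noise is added in eval mode
```

Because the corruption is gated on `self.training`, calling `model.eval()` is enough to get deterministic reconstructions at test time.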

Key Design Decisions

Sparse Autoencoder

The sparse variant adds an L1 penalty on the bottleneck activations. The key change is that forward() returns both the reconstruction and the latent code:

def forward(self, x):
    z = self.encode(x)
    x_hat = self.decode(z)
    return x_hat, z  # return both for sparsity loss

The training loop then computes the composite loss:

recon, z = model(flat)
mse = criterion(recon, flat)
l1 = sparsity_weight * torch.mean(torch.abs(z))
loss = mse + l1
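For context, here is one way the four lines above could sit inside a full optimization step. This is a sketch under stated assumptions: the small `SparseAutoencoder` stand-in, the Adam optimizer with `lr=1e-3`, and the random batch are illustrative and not taken from the article. `nn.MSELoss` is used for `criterion` to match the MSE reconstruction loss from Part 1.

```python
import torch
from torch import nn


class SparseAutoencoder(nn.Module):
    """Illustrative stand-in whose forward() returns (reconstruction, latent code)."""

    def __init__(self, latent_dim: int = 64):  # latent size is an assumption
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(784, latent_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 784), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z


torch.manual_seed(0)
model = SparseAutoencoder()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer settings
sparsity_weight = 1e-3

images = torch.rand(16, 1, 28, 28)       # stand-in for a batch of 28x28 images
flat = images.view(images.size(0), -1)   # flatten to (16, 784)

optimizer.zero_grad()
recon, z = model(flat)
mse = criterion(recon, flat)                      # reconstruction term
l1 = sparsity_weight * torch.mean(torch.abs(z))   # sparsity term on the latent code
loss = mse + l1
loss.backward()                                   # gradients flow through both terms
optimizer.step()
```

Returning `z` from `forward()` avoids a second encoder pass just to compute the penalty.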

L1 Sparsity Formulation

The total loss is:

$$ \mathcal{L} = \underbrace{\frac{1}{N}\sum_{i=1}^N \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|^2}_{\text{reconstruction}} + \underbrace{\lambda \cdot \frac{1}{N}\sum_{i=1}^N \|\mathbf{z}_i\|_1}_{\text{sparsity penalty}} $$

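The L1 norm $\|\mathbf{z}\|_1 = \sum_k |z_k|$ is non-differentiable at zero, but PyTorch's automatic differentiation handles this by returning a valid subgradient: the gradient of torch.abs at zero is 0. The sparsity weight $\lambda = 10^{-3}$ balances reconstruction quality against sparsity: too high and reconstructions degrade, too low and the penalty has no effect. Note that the training loop uses torch.mean(torch.abs(z)), which, for z of shape (batch, latent_dim), averages over the latent dimensions as well as the batch. It therefore equals the penalty in the formula divided by latent_dim, a constant factor that is absorbed into $\lambda$.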
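The behavior at zero is easy to check directly. The snippet below is a small illustrative probe (the tensor values are arbitrary): it differentiates $\sum_k |z_k|$ and inspects the gradient at a negative entry, at zero, and at a positive entry.

```python
import torch

# One negative entry, one exactly at zero, one positive.
z = torch.tensor([-2.0, 0.0, 3.0], requires_grad=True)

penalty = torch.abs(z).sum()  # L1 norm of z
penalty.backward()

# The gradient is sign(z): -1, 0, and 1 respectively.
# At z = 0 autograd returns 0, which lies in the subdifferential [-1, 1].
print(z.grad)
```

One consequence is that a latent unit sitting exactly at zero receives no push from the penalty itself; only the reconstruction term can move it.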

Convolutional Autoencoder

The convolutional variant replaces fully-connected layers with Conv2d and ConvTranspose2d:

# Encoder
self.enc_conv = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1),   # 28x28 -> 14x14
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 32, 3, stride=2, padding=1),  # 14x14 -> 7x7
    nn.ReLU(inplace=True),
)
self.enc_fc = nn.Linear(32 * 7 * 7, latent_dim)

# Decoder
self.dec_fc = nn.Linear(latent_dim, 32 * 7 * 7)
self.dec_conv = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1,
                       output_padding=1),         # 7x7 -> 14x14
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1,
                       output_padding=1),          # 14x14 -> 28x28
    nn.Sigmoid(),
)
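The excerpt leaves out how the convolutional and fully-connected pieces are connected. A minimal sketch of one plausible wiring follows. The class name, the `encode`/`decode` helpers, the default `latent_dim=32`, and the choice to apply no activation between `dec_fc` and the first transposed convolution are assumptions; the layer definitions themselves are copied from above.

```python
import torch
from torch import nn


class ConvAutoencoder(nn.Module):
    """Illustrative wiring of the conv encoder/decoder for 1x28x28 inputs."""

    def __init__(self, latent_dim: int = 32):  # default value is an assumption
        super().__init__()
        self.enc_conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),   # 28x28 -> 14x14
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),  # 14x14 -> 7x7
            nn.ReLU(inplace=True),
        )
        self.enc_fc = nn.Linear(32 * 7 * 7, latent_dim)
        self.dec_fc = nn.Linear(latent_dim, 32 * 7 * 7)
        self.dec_conv = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1,
                               output_padding=1),       # 7x7 -> 14x14
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1,
                               output_padding=1),       # 14x14 -> 28x28
            nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.enc_conv(x)              # (N, 32, 7, 7)
        return self.enc_fc(h.flatten(1))  # (N, latent_dim)

    def decode(self, z):
        h = self.dec_fc(z).view(-1, 32, 7, 7)  # back to a 7x7 feature map
        return self.dec_conv(h)                # (N, 1, 28, 28)

    def forward(self, x):
        return self.decode(self.encode(x))


model = ConvAutoencoder(latent_dim=32)
images = torch.rand(8, 1, 28, 28)  # stand-in batch; no flattening needed here
recon = model(images)
code = model.encode(images)
print(recon.shape, code.shape)
```

Unlike the fully-connected variants, this model takes images in their original (N, 1, 28, 28) layout, so the training loop would skip the flattening step.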

Why Convolutions Win

ConvTranspose2d: Learned Upsampling

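The decoder uses transposed convolutions (sometimes called "deconvolutions") to upsample feature maps. With kernel size $k=3$, stride $s=2$, padding $p=1$, and output_padding $p_{\text{out}}=1$, each ConvTranspose2d doubles the spatial dimensions:

$$ H_{\text{out}} = (H_{\text{in}} - 1)\,s - 2p + k + p_{\text{out}} = 2(H_{\text{in}} - 1) - 2 + 3 + 1 = 2H_{\text{in}} $$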

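Setting output_padding=1 resolves an ambiguity: under a stride-2 convolution with kernel size 3 and padding 1, several input sizes map to the same output size ($\lfloor(n-1)/2\rfloor + 1 = 7$ for both $n=13$ and $n=14$), so a transposed convolution cannot know which size to restore. With output_padding=0 it produces 13 from an input of 7; with output_padding=1 it produces 14.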
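The size arithmetic can be verified without PyTorch. The helpers below are hypothetical names that implement the standard output-size formulas for Conv2d and ConvTranspose2d, assuming dilation 1 and the kernel, stride, and padding used in this article as defaults.

```python
def conv_out(n, kernel=3, stride=2, padding=1):
    """Output size of a convolution along one spatial dimension (dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


def conv_transpose_out(n, kernel=3, stride=2, padding=1, output_padding=0):
    """Output size of a transposed convolution along one dimension (dilation 1)."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding


# Two different input sizes collapse to the same output size.
print(conv_out(13), conv_out(14))                 # 7 7

# Inverting from 7: output_padding selects which of the two sizes to restore.
print(conv_transpose_out(7))                      # 13
print(conv_transpose_out(7, output_padding=1))    # 14

# The second decoder layer then doubles 14 to 28.
print(conv_transpose_out(14, output_padding=1))   # 28
```

With these settings the decoder's 7 -> 14 -> 28 path exactly mirrors the encoder's 28 -> 14 -> 7 path.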

Training Configuration

All four variants are trained with:

What Does the Bottleneck Learn?

All four variants are implemented. Part 3 trains them on MNIST and digs into the results: 2D latent space scatter plots colored by digit class, denoising under heavy Gaussian noise, and reconstruction quality across architectures. Full code is on GitHub.