Deconstructing Autoencoders from Scratch

Part 2: PyTorch Implementation

Introduction

In Part 1, we established the mathematical foundation of autoencoders---the information bottleneck, MSE reconstruction loss, and the manifold hypothesis. Today, we translate that math into code. We implement four distinct autoencoder variants from scratch in pure PyTorch, each illustrating a different design principle.

Vanilla Autoencoder

The baseline architecture is a symmetric encoder-decoder with fully-connected layers:

# Encoder: 784 -> 512 -> 256 -> latent_dim
self.encoder = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(inplace=True),
    nn.Linear(512, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, latent_dim),
)

# Decoder: latent_dim -> 256 -> 512 -> 784
self.decoder = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, 512),
    nn.ReLU(inplace=True),
    nn.Linear(512, 784),
    nn.Sigmoid(),
)
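These fragments live inside an nn.Module. A minimal sketch of the full class (the name VanillaAutoencoder and the latent_dim default are illustrative assumptions, not fixed by the post):

```python
import torch
import torch.nn as nn

class VanillaAutoencoder(nn.Module):
    """Symmetric fully-connected autoencoder for flattened 28x28 images.
    Class name and default latent_dim are assumptions for illustration."""
    def __init__(self, latent_dim=32):
        super().__init__()
        # Encoder: 784 -> 512 -> 256 -> latent_dim
        self.encoder = nn.Sequential(
            nn.Linear(784, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, latent_dim),
        )
        # Decoder mirrors the encoder: latent_dim -> 256 -> 512 -> 784
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, 784),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, 784) flattened images in [0, 1]
        return self.decoder(self.encoder(x))
```

Passing a `(batch, 784)` tensor through the model returns a reconstruction of the same shape, with values squashed into [0, 1] by the final Sigmoid.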

Architecture Choices

Two choices in the block above deserve a note. ReLU activations in the hidden layers keep optimization simple and fast, while the final Sigmoid squashes outputs into [0, 1] to match normalized pixel intensities, so the MSE reconstruction loss from Part 1 compares like with like. The decoder deliberately mirrors the encoder's widths (784 -> 512 -> 256 -> latent_dim, then back), the standard symmetric hourglass design.

Denoising Autoencoder

The denoising variant uses the same encoder-decoder architecture but corrupts the input during training:

def add_noise(self, x):
    noise = self.noise_factor * torch.randn_like(x)
    noisy = x + noise
    return torch.clamp(noisy, 0.0, 1.0)

def forward(self, x):
    if self.training:
        x_input = self.add_noise(x)
    else:
        x_input = x
    z = self.encode(x_input)
    return self.decode(z)
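Putting the two methods together, a minimal end-to-end sketch (layer widths, the class name, and the noise_factor default here are assumptions not fixed by the post):

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Same encoder-decoder idea; the input is corrupted only in training mode.
    Class name, layer widths, and default noise_factor are assumptions."""
    def __init__(self, latent_dim=32, noise_factor=0.3):
        super().__init__()
        self.noise_factor = noise_factor
        self.encoder = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(inplace=True),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 784), nn.Sigmoid(),
        )

    def add_noise(self, x):
        # Additive Gaussian noise, clamped back into the valid pixel range
        noise = self.noise_factor * torch.randn_like(x)
        return torch.clamp(x + noise, 0.0, 1.0)

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        # Corrupt only during training; evaluation sees clean inputs
        x_input = self.add_noise(x) if self.training else x
        return self.decode(self.encode(x_input))
```

Note the train/eval asymmetry: calling model.eval() makes the forward pass deterministic, while in training mode each call draws fresh noise.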

Key Design Decisions

Three details matter here. Noise is injected only when self.training is True, so evaluation always sees clean inputs. torch.clamp keeps the corrupted input inside the valid [0, 1] pixel range. And the reconstruction loss is still computed against the clean input x, which forces the network to undo the corruption rather than merely copy its input.

Sparse Autoencoder

The sparse variant adds an L1 penalty on the bottleneck activations. The key change is that forward() returns both the reconstruction and the latent code:

def forward(self, x):
    z = self.encode(x)
    x_hat = self.decode(z)
    return x_hat, z  # return both for sparsity loss

The training loop then computes the composite loss:

recon, z = model(flat)
mse = criterion(recon, flat)
l1 = sparsity_weight * torch.mean(torch.abs(z))
loss = mse + l1
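A complete training step wrapped around that fragment might look like the following sketch. The optimizer choice (Adam, lr=1e-3), the encoder's final ReLU, and the helper names are assumptions; only the two-term loss comes from the post:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Autoencoder whose forward() returns (reconstruction, latent code).
    Layer widths and the ReLU on the latent code are assumptions."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(inplace=True),
            nn.Linear(256, latent_dim), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 784), nn.Sigmoid(),
        )

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        z = self.encode(x)
        return self.decode(z), z  # both needed for the sparsity loss

model = SparseAutoencoder()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer/lr
sparsity_weight = 1e-3  # the lambda from the loss below

def train_step(batch):
    flat = batch.view(batch.size(0), -1)
    recon, z = model(flat)
    mse = criterion(recon, flat)
    l1 = sparsity_weight * torch.mean(torch.abs(z))
    loss = mse + l1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```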

L1 Sparsity Formulation

The total loss is:

$$ \mathcal{L} = \underbrace{\frac{1}{N}\sum_{i=1}^N \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|^2}_{\text{reconstruction}} + \underbrace{\lambda \cdot \frac{1}{N}\sum_{i=1}^N \|\mathbf{z}_i\|_1}_{\text{sparsity penalty}} $$

The L1 norm $\|\mathbf{z}\|_1 = \sum_k |z_k|$ is non-differentiable at zero, but PyTorch's automatic differentiation handles this correctly using subgradients. The sparsity weight $\lambda = 10^{-3}$ balances reconstruction quality against sparsity---too high and reconstructions degrade, too low and the penalty has no effect.
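The subgradient behavior is easy to verify directly: PyTorch's backward pass for torch.abs returns sign(z), which is defined as 0 at the kink:

```python
import torch

# Gradient of sum(|z|) with respect to z is sign(z) elementwise
z = torch.tensor([-2.0, 0.0, 3.0], requires_grad=True)
torch.abs(z).sum().backward()
print(z.grad)  # tensor([-1., 0., 1.]) -- the subgradient 0 is used at z = 0
```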

Convolutional Autoencoder

The convolutional variant replaces fully-connected layers with Conv2d and ConvTranspose2d:

# Encoder
self.enc_conv = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1),   # 28x28 -> 14x14
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 32, 3, stride=2, padding=1),  # 14x14 -> 7x7
    nn.ReLU(inplace=True),
)
self.enc_fc = nn.Linear(32 * 7 * 7, latent_dim)

# Decoder
self.dec_fc = nn.Linear(latent_dim, 32 * 7 * 7)
self.dec_conv = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1,
                       output_padding=1),         # 7x7 -> 14x14
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1,
                       output_padding=1),          # 14x14 -> 28x28
    nn.Sigmoid(),
)
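Wiring those pieces together requires a forward pass that flattens between the conv stack and the bottleneck and reshapes on the way back. A minimal sketch (the class name and latent_dim default are assumptions; the layers match the text):

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Convolutional autoencoder for 1x28x28 inputs. Class name and
    default latent_dim are assumptions; layer shapes follow the text."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.enc_conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),   # 28x28 -> 14x14
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),  # 14x14 -> 7x7
            nn.ReLU(inplace=True),
        )
        self.enc_fc = nn.Linear(32 * 7 * 7, latent_dim)
        self.dec_fc = nn.Linear(latent_dim, 32 * 7 * 7)
        self.dec_conv = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1,
                               output_padding=1),        # 7x7 -> 14x14
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1,
                               output_padding=1),        # 14x14 -> 28x28
            nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.enc_conv(x)                    # (N, 32, 7, 7)
        z = self.enc_fc(h.flatten(1))           # (N, latent_dim)
        h = self.dec_fc(z).view(-1, 32, 7, 7)   # back to feature maps
        return self.dec_conv(h)                 # (N, 1, 28, 28)
```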

Why Convolutions Win

Convolutions share weights across spatial positions and operate on local neighborhoods, so the convolutional encoder needs far fewer parameters than its fully-connected counterpart while preserving the 2D structure that flattening to a 784-vector destroys. The same inductive bias that makes CNNs effective classifiers, translation equivariance, also makes them effective feature compressors.

ConvTranspose2d: Learned Upsampling

The decoder uses transposed convolutions (sometimes called "deconvolutions") to upsample feature maps. With stride=2 and output_padding=1, each ConvTranspose2d doubles the spatial dimensions, following PyTorch's output-size formula:

$$ n_\text{out} = (n_\text{in} - 1)\cdot\text{stride} - 2\cdot\text{padding} + \text{kernel} + \text{output\_padding} = (7-1)\cdot 2 - 2 + 3 + 1 = 14 $$

The output_padding=1 resolves the ambiguity that arises because multiple input sizes can produce the same output size under stride-2 convolution ($\lfloor(n-1)/2\rfloor + 1$ is the same for $n=13$ and $n=14$).
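This is easy to confirm with a throwaway shape check (not from the post): the same layer with and without output_padding lands on 14x14 versus 13x13.

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 32, 7, 7)

# stride=2 transposed conv with output_padding=1: 7x7 -> 14x14
up = nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1)
print(up(x).shape)    # torch.Size([1, 16, 14, 14])

# Same layer without output_padding picks the other valid size: 13x13
up13 = nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1)
print(up13(x).shape)  # torch.Size([1, 16, 13, 13])
```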

Training Configuration

All four variants are trained with the same configuration, so any differences in reconstruction quality reflect the architectures themselves rather than the training setup.

Next Steps: What Does the Bottleneck Learn?

With all four variants implemented, we can now train them and investigate what the bottleneck actually discovers. In Part 3, we visualize the 2D latent space as a scatter plot colored by digit class, demonstrate the denoising autoencoder's ability to remove heavy Gaussian noise, and compare reconstruction quality across all four architectures.

Full code is on the GitHub repo. Stay tuned for the training results!