In Part 1, we established the mathematical foundation of autoencoders---the information bottleneck, MSE reconstruction loss, and the manifold hypothesis. Today, we translate that math into code. We implement four distinct autoencoder variants from scratch in pure PyTorch, each illustrating a different design principle.
Vanilla Autoencoder
The baseline architecture is a symmetric encoder-decoder with fully-connected layers:
# Encoder: 784 -> 512 -> 256 -> latent_dim
self.encoder = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(inplace=True),
    nn.Linear(512, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, latent_dim),
)
# Decoder: latent_dim -> 256 -> 512 -> 784
self.decoder = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, 512),
    nn.ReLU(inplace=True),
    nn.Linear(512, 784),
    nn.Sigmoid(),
)
Architecture Choices
- Layer sizes (784 → 512 → 256 → d): The encoder shrinks the representation in stages, roughly halving it at each step, creating a smooth compression funnel rather than an abrupt bottleneck.
- ReLU activations: Introduce non-linearity so the autoencoder can learn non-linear manifolds. Without activations, stacking linear layers collapses to a single linear map, which can do no better than PCA.
- Sigmoid output: Constrains the reconstruction to $[0, 1]$, matching the normalized pixel range of MNIST images.
- No Sigmoid on the bottleneck: The latent code $\mathbf{z}$ is unconstrained, allowing the network to use the full real line.
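Putting these pieces together, a minimal module might look like the following sketch (the class name and the flattening in forward() are my assumptions; the repo's version may differ):

```python
import torch
import torch.nn as nn

class VanillaAutoencoder(nn.Module):
    """Symmetric fully-connected autoencoder for 28x28 inputs."""

    def __init__(self, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 256), nn.ReLU(inplace=True),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 784), nn.Sigmoid(),
        )

    def forward(self, x):
        # Flatten (B, 1, 28, 28) -> (B, 784), then encode and decode.
        z = self.encoder(x.view(x.size(0), -1))
        return self.decoder(z)
```

Because the decoder ends in Sigmoid, the reconstruction is guaranteed to lie in $[0, 1]$ regardless of the latent code.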
Denoising Autoencoder
The denoising variant uses the same encoder-decoder architecture but corrupts the input during training:
def add_noise(self, x):
    noise = self.noise_factor * torch.randn_like(x)
    noisy = x + noise
    return torch.clamp(noisy, 0.0, 1.0)

def forward(self, x):
    if self.training:
        x_input = self.add_noise(x)
    else:
        x_input = x
    z = self.encode(x_input)
    return self.decode(z)
Key Design Decisions
- Noise injection is training-only: The self.training flag ensures that during evaluation, the autoencoder processes clean inputs. This is critical for fair comparison with other variants.
- Clamping to $[0,1]$: After adding Gaussian noise, we clamp to keep pixel values in the valid range. Without clamping, the noisy input could have negative values or values above 1, which would shift the input distribution away from what the Sigmoid output can produce.
- Clean reconstruction targets: The loss is always computed against the original clean input $\mathbf{x}$, not the noisy input $\tilde{\mathbf{x}}$. This is what forces the network to denoise.
- Noise factor 0.3: This adds substantial corruption---roughly 30% of the pixel range as standard deviation. High enough to challenge the network, low enough that the digit identity is usually preserved.
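A single training step then looks something like this sketch (the function and variable names are illustrative, not the repo's); the important detail is that the loss target is the clean batch, while the noise is injected inside the model's forward():

```python
import torch
import torch.nn as nn

def denoising_step(model, batch, optimizer, criterion=nn.MSELoss()):
    """One optimization step: noisy input (added inside the model), clean target."""
    flat = batch.view(batch.size(0), -1)  # clean images, flattened
    recon = model(flat)                   # model corrupts its input when model.training is True
    loss = criterion(recon, flat)         # loss is against the CLEAN input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

If the loss were computed against the noisy input instead, the network would have no incentive to remove the corruption; it would simply learn the identity on noisy images.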
Sparse Autoencoder
The sparse variant adds an L1 penalty on the bottleneck activations. The key change is that forward() returns both the reconstruction and the latent code:
def forward(self, x):
    z = self.encode(x)
    x_hat = self.decode(z)
    return x_hat, z  # return both for sparsity loss
The training loop then computes the composite loss:
recon, z = model(flat)
mse = criterion(recon, flat)
l1 = sparsity_weight * torch.mean(torch.abs(z))
loss = mse + l1
L1 Sparsity Formulation
The total loss is:

$$\mathcal{L} = \|\mathbf{x} - \hat{\mathbf{x}}\|_2^2 + \lambda \|\mathbf{z}\|_1$$

The L1 norm $\|\mathbf{z}\|_1 = \sum_k |z_k|$ is non-differentiable at zero, but PyTorch's automatic differentiation handles this correctly using subgradients. The sparsity weight $\lambda = 10^{-3}$ balances reconstruction quality against sparsity---too high and reconstructions degrade, too low and the penalty has no effect.
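To confirm the penalty is actually producing sparse codes, a small helper can report the fraction of near-zero bottleneck activations (the helper and its threshold are my own illustration, not part of the repo):

```python
import torch

def activation_sparsity(z, threshold=1e-2):
    """Fraction of latent units whose magnitude falls below `threshold`."""
    return (z.abs() < threshold).float().mean().item()
```

A well-tuned sparse autoencoder should drive this fraction noticeably higher than the vanilla variant at the same latent dimension.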
Convolutional Autoencoder
The convolutional variant replaces fully-connected layers with Conv2d and ConvTranspose2d:
# Encoder
self.enc_conv = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1),   # 28x28 -> 14x14
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 32, 3, stride=2, padding=1),  # 14x14 -> 7x7
    nn.ReLU(inplace=True),
)
self.enc_fc = nn.Linear(32 * 7 * 7, latent_dim)

# Decoder
self.dec_fc = nn.Linear(latent_dim, 32 * 7 * 7)
self.dec_conv = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1,
                       output_padding=1),  # 7x7 -> 14x14
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1,
                       output_padding=1),  # 14x14 -> 28x28
    nn.Sigmoid(),
)
Why Convolutions Win
- Spatial structure: Fully-connected layers flatten images into 784-dimensional vectors, treating adjacent pixels the same as distant ones. Convolutions preserve the 2D grid, so the encoder learns local spatial features (edges, corners, curves).
- Parameter efficiency: A 3$\times$3 conv with 16 filters has $1 \times 16 \times 3 \times 3 = 144$ parameters. The equivalent fully-connected layer would have $784 \times 512 = 401{,}408$ parameters.
- Translation equivariance: A "7" in the top-left corner produces the same feature activations as a "7" in the bottom-right, shifted accordingly. This built-in equivariance reduces the number of examples needed to learn.
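The parameter-count claim is easy to verify by inspecting the weight tensors directly (biases excluded, matching the count in the text):

```python
import torch.nn as nn

conv = nn.Conv2d(1, 16, 3)   # 16 filters, 3x3 kernel, 1 input channel
fc = nn.Linear(784, 512)     # first fully-connected encoder layer

conv_weights = conv.weight.numel()  # 16 * 1 * 3 * 3 = 144
fc_weights = fc.weight.numel()      # 512 * 784 = 401,408
```

The convolutional layer achieves its compression with roughly 2,800x fewer weights than the fully-connected layer it replaces.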
ConvTranspose2d: Learned Upsampling
The decoder uses transposed convolutions (sometimes called "deconvolutions") to upsample feature maps. With stride=2 and output_padding=1, each ConvTranspose2d doubles the spatial dimensions:
- $7 \times 7 \to 14 \times 14$ (first layer)
- $14 \times 14 \to 28 \times 28$ (second layer, restoring original resolution)
The output_padding=1 resolves the ambiguity that arises because multiple input sizes can produce the same output size under stride-2 convolution ($\lfloor(n-1)/2\rfloor + 1$ is the same for $n=13$ and $n=14$).
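The shape arithmetic is worth verifying once by hand; this sketch pushes a dummy tensor through a single stride-2 down/up pair from the architecture above:

```python
import torch
import torch.nn as nn

x = torch.rand(1, 1, 28, 28)
down = nn.Conv2d(1, 16, 3, stride=2, padding=1)
up = nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1)

h = down(x)  # floor((28 + 2*1 - 3) / 2) + 1 = 14
assert h.shape[-2:] == (14, 14)

y = up(h)    # (14 - 1)*2 - 2*1 + 3 + 1 = 28
assert y.shape[-2:] == (28, 28)
```

Dropping output_padding=1 would produce a 27x27 output here, which is exactly the stride-2 ambiguity described above.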
Training Configuration
All four variants are trained with:
- Dataset: 5,000-image MNIST subset (for fast CPU training)
- Optimizer: Adam (lr=$10^{-3}$)
- Loss: MSE for vanilla and conv; MSE against clean targets for denoising; MSE + L1 for sparse
- Epochs: 30
- Batch size: 128
- Latent dim: 2 (for visualization) and 32 (for quality comparison)
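Under those settings, the shared training loop can be sketched as follows (dataset loading omitted; the function signature and print format are my assumptions, and the sparse variant would need the two-value forward() and the extra L1 term):

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=30, lr=1e-3, device="cpu"):
    """Generic MSE training loop for the vanilla, denoising, and conv variants."""
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for epoch in range(epochs):
        total = 0.0
        for batch, _ in loader:  # MNIST loaders yield (image, label) pairs
            flat = batch.view(batch.size(0), -1).to(device)
            recon = model(flat)
            loss = criterion(recon, flat)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item() * flat.size(0)
        print(f"epoch {epoch + 1}: mse {total / len(loader.dataset):.4f}")
    return model
```

Because the denoising model corrupts its own input inside forward(), the exact same loop trains it correctly: the target flat is always the clean batch.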
Next Steps: What Does the Bottleneck Learn?
With all four variants implemented, we can now train them and investigate what the bottleneck actually discovers. In Part 3, we visualize the 2D latent space as a scatter plot colored by digit class, demonstrate the denoising autoencoder's ability to remove heavy Gaussian noise, and compare reconstruction quality across all four architectures.
Full code is on the GitHub repo. Stay tuned for the training results!