Part 1 covered the math: the information bottleneck, MSE loss, and the manifold hypothesis. Here we translate that into four PyTorch implementations---Vanilla, Denoising, Sparse, and Convolutional---each targeting a different design constraint.
Vanilla Autoencoder
The baseline architecture is a symmetric encoder-decoder with fully-connected layers:
# Encoder: 784 -> 512 -> 256 -> latent_dim
self.encoder = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(inplace=True),
    nn.Linear(512, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, latent_dim),
)
# Decoder: latent_dim -> 256 -> 512 -> 784
self.decoder = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, 512),
    nn.ReLU(inplace=True),
    nn.Linear(512, 784),
    nn.Sigmoid(),
)
Architecture Choices
- Layer sizes (784 → 512 → 256 → d): Stepping the width down gradually (roughly halving at each layer) creates a smooth compression funnel rather than an abrupt squeeze.
- ReLU activations: Without non-linearities, stacked linear layers collapse to a single linear map, and a linear autoencoder trained with MSE can do no better than PCA. ReLU lets the network learn non-linear manifolds.
- Sigmoid output: Constrains reconstruction to $[0, 1]$, matching normalized MNIST pixel values.
- No Sigmoid on the bottleneck: The latent code $\mathbf{z}$ is unconstrained---free to use the full real line.
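To see how these choices fit together, here is a minimal sketch that wraps the layers above in a module and checks the shapes and output range. The class name `VanillaAutoencoder`, the default `latent_dim=32`, and the random stand-in batch are assumptions made for illustration, not necessarily the article's actual code.

```python
import torch
import torch.nn as nn


class VanillaAutoencoder(nn.Module):
    """Fully-connected autoencoder for flattened 28x28 images (illustrative wrapper)."""

    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 256), nn.ReLU(inplace=True),
            nn.Linear(256, latent_dim),          # no activation: z is unconstrained
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 784), nn.Sigmoid(),   # reconstruction constrained to [0, 1]
        )

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        return self.decode(self.encode(x))


model = VanillaAutoencoder(latent_dim=32)
batch = torch.rand(8, 784)       # stand-in for 8 flattened, normalized images
latent = model.encode(batch)     # shape (8, 32)
recon = model(batch)             # shape (8, 784)
```

Real MNIST batches arrive as `(B, 1, 28, 28)` tensors, so they would need a `flatten(1)` (or equivalent reshape) before being passed to this model.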
Denoising Autoencoder
The denoising variant uses the same encoder-decoder architecture but corrupts the input during training:
def add_noise(self, x):
    noise = self.noise_factor * torch.randn_like(x)
    noisy = x + noise
    return torch.clamp(noisy, 0.0, 1.0)

def forward(self, x):
    if self.training:
        x_input = self.add_noise(x)
    else:
        x_input = x
    z = self.encode(x_input)
    return self.decode(z)
Key Design Decisions
- Noise injection is training-only: The self.training flag means evaluation uses clean inputs, which is necessary for a fair comparison with the other variants.
- Clamping to $[0,1]$: Without clamping, noisy inputs can go negative or exceed 1, shifting the distribution away from what the Sigmoid output can represent.
- Clean reconstruction targets: Loss is computed against the original $\mathbf{x}$, not the noisy $\tilde{\mathbf{x}}$. This forces the network to denoise.
- Noise factor 0.3: Roughly 30% of the pixel range as standard deviation---enough to challenge the network but not enough to destroy digit identity.
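The decisions above can be sketched as a single training step: corrupt the input, clamp it, and score the reconstruction against the clean batch. This is a hedged illustration only. The tiny `nn.Sequential` model, the random `clean` batch, and the fixed seed are stand-ins chosen so the snippet runs on its own, and the noise is applied outside the model here rather than inside `forward()`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in model (assumption); any autoencoder with a Sigmoid output works here
model = nn.Sequential(
    nn.Linear(784, 32), nn.ReLU(),
    nn.Linear(32, 784), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
noise_factor = 0.3

clean = torch.rand(16, 784)  # stand-in for flattened, normalized images

# Corrupt with Gaussian noise, then clamp back into the valid pixel range
noisy = torch.clamp(clean + noise_factor * torch.randn_like(clean), 0.0, 1.0)

model.train()
optimizer.zero_grad()
recon = model(noisy)             # the network only sees the corrupted input
loss = criterion(recon, clean)   # but is scored against the clean target
loss.backward()
optimizer.step()
```

If the loss were computed against `noisy` instead of `clean`, the network would be rewarded for reproducing the noise, and nothing would push it to remove it.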
Sparse Autoencoder
The sparse variant adds an L1 penalty on the bottleneck activations. The key change is that forward() returns both the reconstruction and the latent code:
def forward(self, x):
    z = self.encode(x)
    x_hat = self.decode(z)
    return x_hat, z  # return both for sparsity loss
The training loop then computes the composite loss:
recon, z = model(flat)
mse = criterion(recon, flat)
l1 = sparsity_weight * torch.mean(torch.abs(z))
loss = mse + l1
L1 Sparsity Formulation
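The total loss is:
$$\mathcal{L} = \mathcal{L}_{\text{MSE}}(\mathbf{x}, \hat{\mathbf{x}}) + \lambda \, \|\mathbf{z}\|_1$$
The L1 norm $\|\mathbf{z}\|_1 = \sum_k |z_k|$ is non-differentiable at zero, but PyTorch's automatic differentiation handles this by using a subgradient (the gradient of torch.abs at zero is taken to be 0). In the training loop above, torch.mean averages $|z_k|$ over the batch and the latent dimensions, so the implemented penalty is the L1 norm scaled by a constant. The sparsity weight $\lambda = 10^{-3}$ balances reconstruction quality against sparsity: too high and reconstructions degrade, too low and the penalty has no effect.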
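A small check makes the behavior at zero concrete. The three-element vector below is a made-up latent code, chosen so that one entry sits exactly at the kink of the absolute value; the penalty uses the same `sparsity_weight * torch.mean(torch.abs(z))` form as the training loop.

```python
import torch

sparsity_weight = 1e-3

# Hypothetical latent code with one entry exactly at zero
z = torch.tensor([-2.0, 0.0, 3.0], requires_grad=True)

l1 = sparsity_weight * torch.mean(torch.abs(z))
l1.backward()

# d(l1)/dz_k = sparsity_weight * sign(z_k) / len(z); autograd uses sign(0) = 0
expected = sparsity_weight * torch.tensor([-1.0, 0.0, 1.0]) / z.numel()
```

Every non-zero entry receives a gradient of the same magnitude regardless of its size, which is what pushes small activations all the way to zero instead of merely shrinking them.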
Convolutional Autoencoder
The convolutional variant replaces fully-connected layers with Conv2d and ConvTranspose2d:
# Encoder
self.enc_conv = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1),    # 28x28 -> 14x14
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 32, 3, stride=2, padding=1),   # 14x14 -> 7x7
    nn.ReLU(inplace=True),
)
self.enc_fc = nn.Linear(32 * 7 * 7, latent_dim)
# Decoder
self.dec_fc = nn.Linear(latent_dim, 32 * 7 * 7)
self.dec_conv = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1,
                       output_padding=1),        # 7x7 -> 14x14
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1,
                       output_padding=1),        # 14x14 -> 28x28
    nn.Sigmoid(),
)
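The layer definitions leave out how tensors move between the convolutional and fully-connected parts. One plausible way to wire them is sketched below. The class name `ConvAutoencoder`, the `encode`/`decode` methods, and the choice to apply no activation after `dec_fc` are assumptions for illustration; the original implementation may differ.

```python
import torch
import torch.nn as nn


class ConvAutoencoder(nn.Module):
    """Illustrative wiring of the conv encoder/decoder around a linear bottleneck."""

    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.enc_conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.enc_fc = nn.Linear(32 * 7 * 7, latent_dim)
        self.dec_fc = nn.Linear(latent_dim, 32 * 7 * 7)
        self.dec_conv = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.enc_conv(x)                    # (B, 32, 7, 7)
        return self.enc_fc(h.flatten(1))        # (B, latent_dim)

    def decode(self, z):
        h = self.dec_fc(z).view(-1, 32, 7, 7)   # back to a feature map
        return self.dec_conv(h)                 # (B, 1, 28, 28)

    def forward(self, x):
        return self.decode(self.encode(x))


model = ConvAutoencoder(latent_dim=32)
images = torch.rand(4, 1, 28, 28)   # stand-in for a batch of MNIST images
z = model.encode(images)
recon = model(images)
```

Unlike the fully-connected variants, this model takes unflattened `(B, 1, 28, 28)` input, so the MSE target must keep the same shape.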
Why Convolutions Win
- Spatial structure: FC layers flatten images into 784-d vectors, treating adjacent pixels the same as distant ones. Convolutions preserve the 2D grid and learn local features (edges, corners, curves).
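- Parameter efficiency: The first 3$\times$3 conv layer (1 input channel, 16 filters) has $1 \times 16 \times 3 \times 3 = 144$ weights, vs. $784 \times 512 = 401{,}408$ weights for the first FC layer of the vanilla encoder (biases excluded in both counts).
- Translation equivariance: A "7" in the top-left produces the same feature activations as one in the bottom-right, shifted accordingly. Because the same filters are reused at every position, the network needs fewer examples to learn a feature.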
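The parameter comparison is easy to reproduce by hand. The helper functions below are hypothetical names written for this check; they count weights and, optionally, the bias terms that PyTorch layers include by default.

```python
def conv2d_params(in_ch: int, out_ch: int, kernel: int, bias: bool = True) -> int:
    """Parameters in a square-kernel Conv2d layer."""
    weights = in_ch * out_ch * kernel * kernel
    return weights + (out_ch if bias else 0)


def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Parameters in a fully-connected layer."""
    weights = in_features * out_features
    return weights + (out_features if bias else 0)


conv_weights = conv2d_params(1, 16, 3, bias=False)    # 144
conv_total = conv2d_params(1, 16, 3)                  # 160 with biases
fc_weights = linear_params(784, 512, bias=False)      # 401408
fc_total = linear_params(784, 512)                    # 401920 with biases
```

The two layers do different jobs, so this is a rough comparison of the first layer of each encoder rather than a like-for-like swap.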
ConvTranspose2d: Learned Upsampling
The decoder uses transposed convolutions (sometimes called "deconvolutions") to upsample feature maps. With stride=2 and output_padding=1, each ConvTranspose2d doubles the spatial dimensions:
- $7 \times 7 \to 14 \times 14$ (first layer)
- $14 \times 14 \to 28 \times 28$ (second layer, restoring original resolution)
The output_padding=1 resolves the ambiguity that arises because multiple input sizes can produce the same output size under stride-2 convolution ($\lfloor(n-1)/2\rfloor + 1$ is the same for $n=13$ and $n=14$).
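The size arithmetic can be verified with the standard output-size formulas for convolution and transposed convolution (dilation 1 assumed). The function names are hypothetical helpers; the defaults match the kernel, stride, and padding used in this network.

```python
def conv_out(n: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output size of a convolution along one spatial dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def conv_transpose_out(n: int, kernel: int = 3, stride: int = 2,
                       padding: int = 1, output_padding: int = 0) -> int:
    """Output size of a transposed convolution along one spatial dimension."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding


# Encoder path: 28 -> 14 -> 7
encoder_sizes = [28, conv_out(28), conv_out(conv_out(28))]

# Inputs of size 13 and 14 both map to 7, which is the ambiguity
ambiguous = (conv_out(13), conv_out(14))

# Without output_padding the decoder lands on 13; with it, on 14
without_pad = conv_transpose_out(7, output_padding=0)
with_pad = conv_transpose_out(7, output_padding=1)
restored = conv_transpose_out(with_pad, output_padding=1)
```

Here `without_pad` is 13 and `with_pad` is 14, so `output_padding=1` is what selects the size that mirrors the encoder and lets the second layer reach 28.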
Training Configuration
All four variants are trained with:
- Dataset: 5,000-image MNIST subset (for fast CPU training)
- Optimizer: Adam (lr=$10^{-3}$)
- Loss: MSE for vanilla and conv; MSE against clean targets for denoising; MSE + L1 for sparse
- Epochs: 30
- Batch size: 128
- Latent dim: 2 (for visualization) and 32 (for quality comparison)
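For reference, these settings can be collected in one place. The sketch below is illustrative only: the `TrainConfig` name and its fields are assumptions, and the step count assumes the final partial batch is kept (the default behavior of PyTorch's `DataLoader`, where `drop_last=False`).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters listed above."""
    n_train: int = 5_000
    batch_size: int = 128
    epochs: int = 30
    lr: float = 1e-3
    latent_dim: int = 32             # 2 for visualization, 32 for quality comparison
    noise_factor: float = 0.3        # denoising variant only
    sparsity_weight: float = 1e-3    # sparse variant only

    @property
    def steps_per_epoch(self) -> int:
        # The last, smaller batch is counted (assumes drop_last=False)
        return math.ceil(self.n_train / self.batch_size)

    @property
    def total_steps(self) -> int:
        return self.steps_per_epoch * self.epochs


cfg = TrainConfig()
viz_cfg = TrainConfig(latent_dim=2)
```

With these numbers each epoch is 40 optimizer steps and a full run is 1,200, which is why training on CPU stays fast.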
What Does the Bottleneck Learn?
All four variants are implemented. Part 3 trains them on MNIST and digs into the results: 2D latent space scatter plots colored by digit class, denoising under heavy Gaussian noise, and reconstruction quality across architectures. Full code is on GitHub.