Deconstructing VAEs: Part 3 - Walking Through Latent Space

Introduction

With the math (Part 1) and code (Part 2) in place, we train two VAE models on MNIST, one with a 2D latent space and one with 20D, and compare their latent structure, reconstruction quality, and generation ability.

Training Results

Latent Dimension 2

The 2D VAE (1,068,820 parameters) was trained on a 5,000-image MNIST subset for 30 epochs with Adam ($lr = 10^{-3}$):

	Total Loss	Recon Loss	KL Loss
Train (Epoch 1)	236.16	235.39	0.77
Train (Epoch 30)	147.14	140.92	6.21
Test (Epoch 30)	153.77	147.70	6.08

Total loss dropped 37.7% over 30 epochs. KL divergence rose from 0.77 to 6.21 as the encoder learned to spread information across the latent dimensions. Reconstruction error dominated the loss throughout.
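The figures in this paragraph can be checked directly from the table. The sketch below is plain arithmetic on the reported values, not part of the training code; the helper names `relative_drop` and `kl_share` are made up for illustration.

```python
# Values copied from the 2D training table above.
epoch1 = {"total": 236.16, "recon": 235.39, "kl": 0.77}
epoch30 = {"total": 147.14, "recon": 140.92, "kl": 6.21}


def relative_drop(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


def kl_share(row: dict) -> float:
    """Fraction of the total loss contributed by the KL term."""
    return row["kl"] / row["total"]


drop = relative_drop(epoch1["total"], epoch30["total"])
print(f"total loss drop: {drop:.1%}")
print(f"KL share at epoch 1:  {kl_share(epoch1):.1%}")
print(f"KL share at epoch 30: {kl_share(epoch30):.1%}")

# The total should equal recon + KL up to rounding of the table entries.
for row in (epoch1, epoch30):
    residual = row["recon"] + row["kl"] - row["total"]
    print(f"recon + KL - total = {residual:+.2f}")
```

Even at epoch 30 the KL term is under 5% of the total, which is what "reconstruction error dominated" means in numbers.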

Latent Dimension 20

The 20D VAE (1,082,680 parameters), trained with the identical setup:

	Total Loss	Recon Loss	KL Loss
Train (Epoch 1)	236.63	235.20	1.44
Train (Epoch 30)	110.93	94.06	16.87
Test (Epoch 30)	114.97	98.29	16.68

The 20D model achieves 25.2% lower test loss than the 2D model (114.97 vs 153.77). Its test reconstruction loss is 98.29 against 147.70 for the 2D model, at the cost of higher KL divergence (16.68 vs 6.08).
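The two parameter counts quoted above differ by 13,860 for 18 extra latent dimensions, or 770 parameters per dimension. One architecture that reproduces both counts exactly is a fully connected VAE with hidden widths 512 and 256 and separate mean and log-variance layers. This is an inference from the numbers, so treat the layer sizes below as an assumption rather than a description of the Part 2 code; `linear_params` and `vae_param_count` are illustrative helpers.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def vae_param_count(latent_dim: int, input_dim: int = 784,
                    hidden: tuple = (512, 256)) -> int:
    """Parameter count of an ASSUMED MLP VAE.

    Layout: input -> h1 -> h2 -> (mu, logvar) -> h2 -> h1 -> input.
    The hidden widths are a guess that happens to match the reported totals.
    """
    h1, h2 = hidden
    encoder = linear_params(input_dim, h1) + linear_params(h1, h2)
    heads = 2 * linear_params(h2, latent_dim)  # separate mu and log-variance layers
    decoder = (linear_params(latent_dim, h2)
               + linear_params(h2, h1)
               + linear_params(h1, input_dim))
    return encoder + heads + decoder


for d in (2, 20):
    print(f"latent_dim={d}: {vae_param_count(d):,} parameters")
```

Under this assumption, each extra latent dimension costs 2 x 257 parameters in the two encoder heads plus 256 weights in the first decoder layer, which gives the 770 per dimension. The latent size barely changes model capacity: the two models differ by about 1.3% in parameter count.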

The Reconstruction-KL Trade-off

The comparison between latent dimensions reveals a fundamental trade-off:

latent_dim=2: Low KL (6.08), high reconstruction error (147.70). The model is forced to compress 10 digit classes into just 2 dimensions, so it cannot represent fine details.
latent_dim=20: Higher KL (16.68), much lower reconstruction error (98.29). With 20 dimensions, the encoder can spread information across latent axes, preserving more detail.

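The KL term acts as an information bottleneck: averaged over the data, it upper-bounds how much information about the input the latent code can carry. In 2D, the model transmits at most roughly 6 nats per sample. In 20D, the budget is roughly 17 nats, which is why reconstructions are sharper.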
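For a diagonal Gaussian posterior and a standard normal prior, the KL term has the closed form $\frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$, measured in nats when the logarithm is natural. The sketch below evaluates that formula and converts the reported test-set KL values to bits. The function names are illustrative, and it assumes the encoder outputs a mean and a log-variance per dimension.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in nats, summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def nats_to_bits(nats):
    """Convert an information quantity from nats to bits."""
    return nats / math.log(2)


# A posterior identical to the prior costs nothing and carries no information.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))

# Test-set KL values from the tables above, converted to bits.
for name, kl in (("2D", 6.08), ("20D", 16.68)):
    print(f"{name}: {kl:.2f} nats = {nats_to_bits(kl):.2f} bits")
```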

Lower-dimensional latent spaces give smoother, more interpretable representations but sacrifice reconstruction fidelity. Higher-dimensional spaces preserve more detail but are harder to visualize and risk posterior collapse on unused dimensions.
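One common way to check for collapsed dimensions is to average the KL term per latent dimension over a batch: a dimension whose posterior always equals the prior contributes roughly zero. The sketch below uses toy encoder outputs rather than results from either model, and the 0.01-nat cutoff is an arbitrary choice for illustration.

```python
import math


def per_dim_kl(mus, logvars):
    """Batch-averaged KL to N(0, 1) for each latent dimension, in nats."""
    n, dims = len(mus), len(mus[0])
    totals = [0.0] * dims
    for mu, logvar in zip(mus, logvars):
        for j in range(dims):
            totals[j] += 0.5 * (mu[j] ** 2 + math.exp(logvar[j]) - 1.0 - logvar[j])
    return [t / n for t in totals]


def active_dims(mus, logvars, threshold=0.01):
    """Indices of dimensions whose average KL exceeds `threshold` (hypothetical cutoff)."""
    return [j for j, kl in enumerate(per_dim_kl(mus, logvars)) if kl > threshold]


# Toy encoder outputs (not model results): dimension 0 varies with the input,
# dimension 1 has collapsed to the prior (mean 0, log-variance 0).
mus = [[1.5, 0.0], [-1.2, 0.0], [0.8, 0.0]]
logvars = [[-2.0, 0.0], [-1.8, 0.0], [-2.2, 0.0]]
print("active dimensions:", active_dims(mus, logvars))
```

Run on the 20D model's encoder outputs, a check like this would show how many of the 20 dimensions actually share the 16.68-nat budget.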

Latent Space Visualization

With latent_dim=2, we can directly scatter-plot the encoded test set, coloring each point by its digit class. The result shows:

Distinct clusters: Each digit occupies a recognizable region of the latent plane.
Meaningful proximity: Visually similar digits (3, 5, 8) cluster near each other; dissimilar digits (0, 1) are far apart.
Smooth transitions: Cluster boundaries are soft rather than sharp, because the KL regularization prevents gaps in the latent space.

This structure emerges from the ELBO alone. The model was never told which digits are similar; it inferred proximity from pixel-level reconstruction pressure and the KL smoothness constraint.
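The "meaningful proximity" observation can be made quantitative by computing one centroid per digit class in the latent plane and comparing distances between centroids. The sketch below shows the computation on placeholder points that are not model output; in practice `latents` would hold the encoder means for the test set and `labels` the digit classes, used only for grouping.

```python
import math
from collections import defaultdict


def class_centroids(latents, labels):
    """Mean latent vector for each class label."""
    groups = defaultdict(list)
    for z, y in zip(latents, labels):
        groups[y].append(z)
    return {y: [sum(c) / len(zs) for c in zip(*zs)] for y, zs in groups.items()}


def centroid_distance(centroids, a, b):
    """Euclidean distance between the centroids of classes a and b."""
    return math.dist(centroids[a], centroids[b])


# Placeholder 2D latent means, NOT model output; replace with the encoded test set.
latents = [[-2.0, 0.1], [-1.8, -0.1], [2.0, 0.0], [2.2, 0.2], [0.4, 1.0], [0.6, 1.2]]
labels = [0, 0, 1, 1, 3, 3]

centroids = class_centroids(latents, labels)
for a, b in ((0, 1), (0, 3), (1, 3)):
    print(f"distance {a}-{b}: {centroid_distance(centroids, a, b):.2f}")
```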

Latent Space Interpolation

We select one example of digit 1 and one of digit 7, encode each to its mean latent vector, and linearly interpolate between them in 10 steps. Decoding each interpolated point produces a smooth morphing sequence.

Intermediate frames look like plausible handwritten characters blending features of both digits, with no abrupt jumps or artifacts. Nearby latent points decode to nearby images.

Linear interpolation works here because the KL term pushes the latent distribution toward a symmetric, unimodal Gaussian prior. The straight-line path between two encoded points passes through regions of reasonable probability density.
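The interpolation itself is a convex combination $\mathbf{z}(t) = (1 - t)\,\mathbf{z}_a + t\,\mathbf{z}_b$ for $t$ evenly spaced in $[0, 1]$. Here is a minimal sketch, assuming the 10 steps include both endpoints; the two latent vectors are hypothetical stand-ins for the encoder means of the chosen "1" and "7".

```python
def interpolate(z_start, z_end, steps=10):
    """Return `steps` evenly spaced points from z_start to z_end, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# Hypothetical encoded means; in practice these come from the trained encoder.
z_one = [-1.0, 0.5]
z_seven = [1.0, -0.5]

path = interpolate(z_one, z_seven, steps=10)
for z in path:
    print([round(c, 3) for c in z])
# Each point in `path` would then be passed through the decoder to produce one frame.
```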

Random Sampling

Sampling $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and decoding produces novel digits. The 2D model's samples are blurry but recognizable; the 20D model's are sharper and more varied.

Sample quality tests prior-posterior alignment directly. If the KL term is working, the aggregate posterior $q_\phi(\mathbf{z}) = \mathbb{E}_{\mathbf{x}}[q_\phi(\mathbf{z}|\mathbf{x})]$ should overlap with the standard normal prior. With poor alignment, prior samples land in regions the decoder rarely saw during training and decode to nonsense.
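A rough alignment check compares the first two moments of the aggregate posterior with those of the prior. By the law of total variance, the per-dimension variance of the aggregate posterior is the variance of the encoder means plus the average encoder variance. The sketch below pairs that check with prior sampling; the function names and the toy encoder outputs are illustrative, and matching moments is a necessary condition, not proof of a good fit.

```python
import math
import random


def sample_prior(n, latent_dim, seed=0):
    """Draw n vectors from N(0, I); each would be fed to the decoder."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in range(n)]


def aggregate_moments(mus, logvars):
    """Per-dimension mean and variance of the aggregate posterior E_x[q(z|x)]."""
    n = len(mus)
    dims = range(len(mus[0]))
    mean = [sum(mu[j] for mu in mus) / n for j in dims]
    # Law of total variance: Var[z] = Var_x[mu(x)] + E_x[sigma^2(x)].
    var = [
        sum((mu[j] - mean[j]) ** 2 for mu in mus) / n
        + sum(math.exp(lv[j]) for lv in logvars) / n
        for j in dims
    ]
    return mean, var


# Toy 1D encoder outputs (not model results): two narrow posteriors whose
# mixture still has zero mean and unit variance.
toy_mus = [[0.9], [-0.9]]
toy_logvars = [[math.log(0.19)], [math.log(0.19)]]
mean, var = aggregate_moments(toy_mus, toy_logvars)
print("aggregate mean:", mean, "aggregate variance:", var)

z_samples = sample_prior(n=16, latent_dim=2)
print("first prior sample:", z_samples[0])
```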

Reconstruction Quality

Side-by-side comparison of original test images and their reconstructions:

2D model: Captures overall digit shape and pose but loses fine details (stroke thickness, serifs, curves). Reconstructions appear "averaged" across similar examples.
20D model: Preserves significantly more detail: individual stroke characteristics, thickness variations, and digit-specific features are retained.

Training Dynamics

The training curves reveal characteristic VAE behavior:

Early training (epochs 1-5): Reconstruction loss drops rapidly as the decoder learns basic digit structure. KL starts near zero (the encoder has not yet learned to use the latent space meaningfully).
Mid training (epochs 5-15): KL grows steadily as the encoder begins encoding class-discriminative information. Reconstruction loss continues to decrease but more slowly.
Late training (epochs 15-30): Both losses plateau. The model has found its equilibrium between reconstruction fidelity and latent space regularity.

Conclusion

The ELBO creates a tension between reconstruction fidelity and latent regularity. Where that tension settles determines the character of the representation: 2D gives a clean, visualizable manifold; 20D gives sharp reconstructions and better generation.

Our from-scratch results match the theory: the reparameterization trick enables stable gradient-based training, the KL term produces smooth latent geometry, and the reconstruction term drives the model to capture meaningful visual structure.
