We derived the math in Part 1 and built the code in Part 2. Now we train two VAE models on MNIST and explore the structure of their latent spaces. The results reveal how VAEs organize knowledge, interpolate between concepts, and generate novel data.
Training Results
Latent Dimension 2
The 2D VAE (1,068,820 parameters) was trained on a 5,000-image MNIST subset for 30 epochs with Adam ($\mathrm{lr} = 10^{-3}$):
| | Total Loss | Recon Loss | KL Loss |
|---|---|---|---|
| Train (Epoch 1) | 236.16 | 235.39 | 0.77 |
| Train (Epoch 30) | 147.14 | 140.92 | 6.21 |
| Test (Epoch 30) | 153.77 | 147.70 | 6.08 |
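For reference, the per-batch loss behind these numbers can be sketched in NumPy. The closed-form Gaussian KL and the pixel-summed binary cross-entropy are the standard VAE objective from Part 1; the function and variable names here are illustrative, not the exact Part 2 code.

```python
import numpy as np

def vae_loss(x, x_hat, mu, logvar, eps=1e-7):
    """Per-batch VAE loss: summed BCE reconstruction + closed-form Gaussian KL.

    x, x_hat : (batch, 784) arrays in [0, 1]
    mu, logvar : (batch, latent_dim) encoder outputs
    Returns (total, recon, kl) averaged over the batch, in nats per sample.
    """
    x_hat = np.clip(x_hat, eps, 1 - eps)
    # Bernoulli negative log-likelihood, summed over pixels
    recon = -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat), axis=1)
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=1)
    return (recon + kl).mean(), recon.mean(), kl.mean()
```

With the posterior exactly at the prior (mu = 0, logvar = 0) the KL term is zero, which is why the tables above start near zero at epoch 1.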
The total loss dropped 37.7% from epoch 1 to epoch 30. The KL divergence grew from 0.77 to 6.21 as the encoder learned to use the latent space more expressively, while reconstruction error dominated the loss throughout.
Latent Dimension 20
The 20D VAE (1,082,680 parameters) was trained with identical settings:
| | Total Loss | Recon Loss | KL Loss |
|---|---|---|---|
| Train (Epoch 1) | 236.63 | 235.20 | 1.44 |
| Train (Epoch 30) | 110.93 | 94.06 | 16.87 |
| Test (Epoch 30) | 114.97 | 98.29 | 16.68 |
The 20D model achieves 25.2% lower test loss than the 2D model (114.97 vs 153.77). The reconstruction loss drops dramatically (98.29 vs 147.70), at the cost of higher KL divergence (16.68 vs 6.08).
The Reconstruction--KL Trade-off
The comparison between latent dimensions reveals a fundamental trade-off:
- latent_dim=2: Low KL (6.08), high reconstruction error (147.70). The model is forced to compress 10 digit classes into just 2 dimensions, so it cannot represent fine details.
- latent_dim=20: Higher KL (16.68), much lower reconstruction error (98.29). With 20 dimensions, the encoder can spread information across latent axes, preserving more detail.
The KL term acts as an "information bottleneck." In 2D, the bottleneck is severe---the model can only transmit roughly 6 nats of information per sample. In 20D, it transmits roughly 17 nats, enabling sharper reconstructions.
This is not a flaw; it is a design choice. Lower-dimensional latent spaces produce smoother, more interpretable representations at the cost of reconstruction fidelity. Higher-dimensional spaces preserve more detail but are harder to visualize and may exhibit "posterior collapse" on unused dimensions.
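Posterior collapse can be checked directly by splitting the KL term per latent dimension, as in the hypothetical diagnostic below; a dimension whose average KL stays near zero is carrying no information. The 0.01-nat threshold is an assumption, not a standard value.

```python
import numpy as np

def kl_per_dimension(mu, logvar):
    """Average KL contribution of each latent dimension, in nats.

    mu, logvar : (batch, latent_dim) encoder outputs.
    A dimension whose value is near zero has a posterior that sits on
    the prior for every input: it has collapsed.
    """
    return (-0.5 * (1 + logvar - mu**2 - np.exp(logvar))).mean(axis=0)

def collapsed_dims(mu, logvar, threshold=0.01):
    """Indices of dimensions with average KL below `threshold` nats (heuristic)."""
    return np.flatnonzero(kl_per_dimension(mu, logvar) < threshold)
```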
Latent Space Visualization
With latent_dim=2, we can directly scatter-plot the encoded test set, coloring each point by its digit class. The result shows:
- Distinct clusters: Each digit occupies a recognizable region of the latent plane.
- Meaningful proximity: Visually similar digits (3, 5, 8) cluster near each other; dissimilar digits (0, 1) are far apart.
- Smooth transitions: Cluster boundaries are soft, not sharp---the KL regularization prevents gaps in the latent space.
This structure emerges purely from the ELBO objective. We never told the model which digits are similar; it discovered this from pixel-level reconstruction pressure combined with the smoothness constraint of the KL term.
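The "meaningful proximity" claim can be made quantitative with a small helper (hypothetical, not part of the Part 2 code): compute the centroid of each digit class in latent space and compare pairwise centroid distances.

```python
import numpy as np

def class_centroids(z, labels):
    """Mean latent code per digit class.

    z : (N, latent_dim) encoded means; labels : (N,) integer classes.
    Returns a dict mapping each class to its centroid.
    """
    return {c: z[labels == c].mean(axis=0) for c in np.unique(labels)}

def centroid_distance(centroids, a, b):
    """Euclidean distance between the centroids of classes a and b."""
    return float(np.linalg.norm(centroids[a] - centroids[b]))
```

On the encoded test set, one would expect the 3/5/8 centroids to sit closer together than the 0/1 pair, matching the scatter plot.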
Latent Space Interpolation
We select one example of digit 1 and one of digit 7, encode each to its mean latent vector, and linearly interpolate between them in 10 steps. Decoding each interpolated point produces a smooth morphing sequence.
The transition is gradual---intermediate frames resemble plausible handwritten characters that blend features of both digits. There are no abrupt jumps or nonsensical artifacts. This smoothness is direct evidence that the latent space is well-organized: nearby points decode to nearby images.
Linear interpolation works because the KL term pushes the latent distribution toward a Gaussian prior, which is symmetric and unimodal. The "shortest path" between two encoded points passes through regions of reasonable probability density.
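The interpolation procedure itself is only a few lines; in the sketch below, `decode`, `z1`, and `z7` stand in for the trained decoder and the two encoded means (names assumed).

```python
import numpy as np

def interpolate(z_start, z_end, steps=10):
    """Linearly interpolate between two latent vectors.

    Returns a (steps, latent_dim) array whose first row is z_start and
    whose last row is z_end; decoding each row gives the morphing sequence.
    """
    t = np.linspace(0.0, 1.0, steps)[:, None]  # (steps, 1) blend weights
    return (1.0 - t) * z_start + t * z_end

# frames = [decode(z) for z in interpolate(z1, z7, steps=10)]  # decode: trained decoder (assumed)
```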
Random Sampling
Sampling $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and decoding produces novel digit images. For the 2D model, the samples are blurry but recognizable---the severe bottleneck limits detail. For the 20D model, generated digits are sharper and more varied, reflecting the richer latent representation.
The quality of random samples is a direct test of prior-posterior alignment. If the KL term is doing its job, the region where $q_\phi(\mathbf{z}|\mathbf{x})$ places mass should overlap significantly with the standard normal prior. Poor alignment would produce nonsensical samples from the prior.
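Generating from the prior is just a standard normal draw in latent space; the `decode` call below refers to the trained decoder (name assumed).

```python
import numpy as np

def sample_prior(n, latent_dim, seed=None):
    """Draw n latent codes from the standard normal prior N(0, I)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n, latent_dim))

# images = decode(sample_prior(16, latent_dim=20))  # decode: trained decoder (assumed)
```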
Reconstruction Quality
Side-by-side comparison of original test images and their reconstructions:
- 2D model: Captures overall digit shape and pose but loses fine details (stroke thickness, serifs, curves). Reconstructions appear "averaged" across similar examples.
- 20D model: Preserves significantly more detail---individual stroke characteristics, thickness variations, and digit-specific features are retained.
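The side-by-side comparison can be made quantitative with a per-image error. The sketch below uses mean squared error as one reasonable choice; the models themselves were trained with BCE, so this is a diagnostic, not the training objective.

```python
import numpy as np

def per_image_error(x, x_hat):
    """Mean squared error per image, for ranking reconstructions.

    x, x_hat : (N, 784) originals and reconstructions in [0, 1].
    Returns an (N,) array; sorting it surfaces the images each model
    reconstructs best and worst.
    """
    return ((x - x_hat) ** 2).mean(axis=1)
```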
Training Dynamics
The training curves reveal characteristic VAE behavior:
- Early training (epochs 1--5): Reconstruction loss drops rapidly as the decoder learns basic digit structure. KL starts near zero (the encoder has not yet learned to use the latent space meaningfully).
- Mid training (epochs 5--15): KL grows steadily as the encoder begins encoding class-discriminative information. Reconstruction loss continues to decrease but more slowly.
- Late training (epochs 15--30): Both losses plateau. The model has found its equilibrium between reconstruction fidelity and latent space regularity.
Conclusion
VAEs learn structured, smooth latent representations that support interpolation, random generation, and meaningful organization of data. The ELBO objective creates a natural tension between reconstruction fidelity and latent space regularity, and the balance point determines the character of the learned representation.
The results from our from-scratch implementation confirm the theoretical predictions: the reparameterization trick enables stable training, the KL term produces smooth latent geometry, and the reconstruction term drives the model to capture meaningful visual features.