What happens when you force 784-dimensional data through a bottleneck of just 2 dimensions? Remarkably, the data doesn't collapse into noise---it organizes itself. Similar inputs cluster together, and the bottleneck learns a compact coordinate system that captures the essential structure of the data.
This is the core idea behind autoencoders: learn to compress data into a low-dimensional representation and then reconstruct it, using reconstruction error as the sole training signal. No labels, no supervision---just the pressure of compression.
Autoencoders are among the oldest ideas in deep learning, dating back to the 1980s, yet they remain foundational. The latent spaces inside modern diffusion models, the embeddings in retrieval systems, and the bottleneck layers in Transformers all trace their lineage back to the same principle: forcing data through a narrow channel reveals its structure.
In Part 1 of this "Build in Public" mini-series, we deconstruct the mathematical foundation of autoencoders: the information bottleneck, reconstruction loss, and the manifold hypothesis. In Part 2, we implement four variants in pure PyTorch. In Part 3, we train on MNIST and visualize what the bottleneck learns.
The Information Bottleneck
An autoencoder consists of two functions:
- Encoder $f_\theta: \mathbb{R}^n \to \mathbb{R}^d$ maps high-dimensional input $\mathbf{x}$ to a low-dimensional latent code $\mathbf{z}$
- Decoder $g_\phi: \mathbb{R}^d \to \mathbb{R}^n$ maps the latent code back to a reconstruction $\hat{\mathbf{x}}$
The bottleneck is the constraint $d \ll n$. For MNIST images, $n = 784$ (28$\times$28 pixels). If we set $d = 32$, the encoder must compress each image by a factor of 24.5$\times$. If we set $d = 2$, the compression ratio is 392$\times$.
This compression is lossy by construction. The network cannot preserve every pixel value---it must learn which information matters for reconstruction and which can be discarded. This is what makes the bottleneck useful: it forces the network to discover compact, meaningful representations.
The full autoencoder pipeline is:

$$\hat{\mathbf{x}} = g_\phi(f_\theta(\mathbf{x}))$$
The latent code $\mathbf{z} = f_\theta(\mathbf{x})$ is the compressed representation. If $d$ is small enough, $\mathbf{z}$ must capture the essential "factors of variation" in the data---for handwritten digits, this means stroke angle, thickness, curvature, and digit identity.
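As a concrete sketch, here is a minimal fully-connected encoder/decoder pair in PyTorch. The hidden width of 128 and the ReLU/Sigmoid choices are illustrative assumptions, not prescribed above:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal fully-connected autoencoder: n -> d -> n."""
    def __init__(self, n=784, d=32):
        super().__init__()
        # Encoder f_theta: R^n -> R^d (hidden width 128 is an arbitrary choice)
        self.encoder = nn.Sequential(
            nn.Linear(n, 128), nn.ReLU(),
            nn.Linear(128, d),
        )
        # Decoder g_phi: R^d -> R^n; Sigmoid keeps pixels in [0, 1]
        self.decoder = nn.Sequential(
            nn.Linear(d, 128), nn.ReLU(),
            nn.Linear(128, n), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)      # latent code z = f_theta(x)
        return self.decoder(z)   # reconstruction x_hat = g_phi(z)

model = Autoencoder()
x = torch.rand(16, 784)          # a batch of 16 flattened "images"
x_hat = model(x)
print(x_hat.shape)               # torch.Size([16, 784])
```

The bottleneck is nothing more than the small output dimension of the encoder's last `Linear` layer; everything else is ordinary feed-forward plumbing.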
Reconstruction Loss
The training objective is to minimize the difference between input $\mathbf{x}$ and reconstruction $\hat{\mathbf{x}}$. The most common choice is the Mean Squared Error (MSE):

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n} \left\| \mathbf{x}^{(i)} - \hat{\mathbf{x}}^{(i)} \right\|^2$$

where $N$ is the number of training samples and $n$ is the input dimensionality.
Why MSE?
MSE can be derived from a probabilistic perspective. If we assume the decoder output parameterizes a Gaussian distribution:

$$p(\mathbf{x} \mid \mathbf{z}) = \mathcal{N}\!\left(\mathbf{x};\, g_\phi(\mathbf{z}),\, \sigma^2 \mathbf{I}\right)$$

then maximizing the log-likelihood is equivalent to minimizing the squared error:

$$\log p(\mathbf{x} \mid \mathbf{z}) = -\frac{n}{2} \log\!\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \left\| \mathbf{x} - \hat{\mathbf{x}} \right\|^2$$

The first term is a constant with respect to the parameters, so maximizing this expression reduces to minimizing $\| \mathbf{x} - \hat{\mathbf{x}} \|^2$. MSE is the maximum likelihood estimator under a Gaussian noise model.
Gradient Flow
Taking the gradient of the MSE loss with respect to the reconstruction:

$$\frac{\partial \mathcal{L}_{\text{MSE}}}{\partial \hat{\mathbf{x}}^{(i)}} = \frac{2}{Nn} \left( \hat{\mathbf{x}}^{(i)} - \mathbf{x}^{(i)} \right)$$
The gradient is proportional to the reconstruction error at each pixel. Pixels with large errors receive stronger gradients, driving the network to fix its worst mistakes first.
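This proportionality can be checked numerically with autograd. With the default `reduction='mean'`, `F.mse_loss` averages over all $N \cdot n$ elements, so the per-element gradient should carry a $2/(Nn)$ factor times the error:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(4, 10)                          # N=4 samples, n=10 dims
x_hat = torch.rand(4, 10, requires_grad=True)  # pretend reconstruction

loss = F.mse_loss(x_hat, x)                    # mean over all N*n elements
loss.backward()

# The gradient should be 2/(N*n) times the per-element error:
expected = 2.0 / x.numel() * (x_hat - x)
print(torch.allclose(x_hat.grad, expected))    # True
```

No surprises here, but it is a useful sanity check that the loss you code matches the derivation on paper.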
The Manifold Hypothesis
Why does compression work at all? The manifold hypothesis provides the answer: high-dimensional data typically lies on or near a much lower-dimensional manifold embedded in the ambient space.
Consider the space of all possible 28$\times$28 grayscale images: $\mathbb{R}^{784}$. The vast majority of points in this space look like random static. The set of images that look like handwritten digits occupies a vanishingly thin subspace---a manifold with far fewer intrinsic dimensions than 784.
Formally, if the data manifold $\mathcal{M}$ has intrinsic dimensionality $d^*$, then an autoencoder with bottleneck dimension $d \geq d^*$ can in principle achieve zero reconstruction error. When $d < d^*$, some information is necessarily lost, and the autoencoder learns to preserve the most important factors of variation.
This explains a key empirical observation: increasing the bottleneck dimension improves reconstruction up to a point, after which further increases yield diminishing returns. That inflection point approximates the intrinsic dimensionality of the data manifold.
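A linear toy example illustrates the inflection point. PCA is the optimal *linear* autoencoder, so for synthetic data lying on a linear 2-D manifold the reconstruction error should collapse once $d$ reaches the intrinsic dimensionality. The construction below is a made-up illustration, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)
# A linear 2-D "manifold" embedded in 10-D ambient space: intrinsic d* = 2.
Z = rng.normal(size=(500, 2))    # intrinsic coordinates
A = rng.normal(size=(2, 10))     # random linear embedding
X = Z @ A                        # ambient data, shape (500, 10)

# PCA via SVD of the centered data = best linear autoencoder per bottleneck d.
mu = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)

errors = []
for d in range(1, 5):
    X_hat = (U[:, :d] * S[:d]) @ Vt[:d] + mu   # rank-d reconstruction
    errors.append(np.mean((X - X_hat) ** 2))
    print(d, errors[-1])
# Error drops sharply from d=1 to d=2, then stays ~0: the elbow sits at d* = 2.
```

With a nonlinear (curved) manifold the same qualitative elbow appears, but a nonlinear autoencoder is needed to reach it.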
Variants: Beyond Vanilla Compression
The basic autoencoder can be extended in several directions, each addressing a different limitation:
Denoising Autoencoders
Instead of reconstructing clean inputs, corrupt the input with noise and train the network to recover the original:

$$\tilde{\mathbf{x}} = \mathbf{x} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}), \qquad \mathcal{L} = \left\| \mathbf{x} - g_\phi\!\left(f_\theta(\tilde{\mathbf{x}})\right) \right\|^2$$
The network must learn to separate signal from noise, which forces it to capture the underlying data distribution rather than memorizing individual samples. Vincent et al. (2008) introduced this idea, and Vincent (2011) later showed that denoising is closely related to learning the score function of the data distribution---a connection that resurfaced in modern diffusion models.
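In code, the corruption step and the clean-target loss might look like the sketch below; the `denoising_loss` helper, the stand-in model, and the noise level are illustrative choices, not from the post:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def denoising_loss(model, x, noise_std=0.3):
    """Corrupt the input with Gaussian noise, then reconstruct the CLEAN input."""
    x_tilde = x + noise_std * torch.randn_like(x)  # x_tilde = x + eps
    x_hat = model(x_tilde)                         # reconstruct from the noisy view
    return F.mse_loss(x_hat, x)                    # ...but compare against clean x

# Hypothetical stand-in model; any encoder-decoder with matching sizes works.
model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 784))
x = torch.rand(8, 784)
loss = denoising_loss(model, x)
```

The only change from the vanilla objective is that the model sees $\tilde{\mathbf{x}}$ while the loss target stays $\mathbf{x}$.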
Sparse Autoencoders
Add an L1 penalty on the latent activations to encourage sparsity:

$$\mathcal{L} = \left\| \mathbf{x} - \hat{\mathbf{x}} \right\|^2 + \lambda \left\| \mathbf{z} \right\|_1$$
The L1 penalty drives most latent units toward zero, so only a few are active for any given input. This encourages more interpretable representations, in which individual latent units tend to respond to specific features, and it allows using overcomplete representations ($d > n$) without the autoencoder learning the identity function.
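The penalty is a one-line addition to the loss. A possible sketch, where the `sparse_ae_loss` name and $\lambda = 10^{-3}$ are arbitrary choices:

```python
import torch
import torch.nn.functional as F

def sparse_ae_loss(x, x_hat, z, lam=1e-3):
    """MSE reconstruction term plus lambda times the mean L1 norm of z."""
    l1 = z.abs().sum(dim=1).mean()   # ||z||_1 per sample, averaged over the batch
    return F.mse_loss(x_hat, x) + lam * l1

# Toy tensors just to exercise the function (no trained model involved).
x = torch.rand(8, 784)
x_hat = torch.rand(8, 784)
z = torch.randn(8, 32)
loss = sparse_ae_loss(x, x_hat, z)
```

Because the L1 term is applied to activations rather than weights, the sparsity pattern can differ per input, which is exactly what lets different latent units specialize.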
Convolutional Autoencoders
Replace fully-connected layers with Conv2d (encoder) and ConvTranspose2d (decoder) to preserve spatial structure. Fully-connected autoencoders flatten images into vectors, destroying the 2D spatial relationships between pixels. Convolutional autoencoders process images as 2D feature maps, enabling the network to learn spatially local features like edges and textures.
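A shape-check sketch of such an encoder/decoder pair; the channel counts, kernel sizes, and strides below are our own illustrative choices:

```python
import torch
import torch.nn as nn

# Encoder halves the spatial size twice: 28x28 -> 14x14 -> 7x7.
encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)
# Decoder mirrors it with ConvTranspose2d: 7x7 -> 14x14 -> 28x28.
# output_padding=1 resolves the ambiguity of inverting a stride-2 conv.
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                       padding=1, output_padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                       padding=1, output_padding=1), nn.Sigmoid(),
)

x = torch.rand(8, 1, 28, 28)     # batch of 8 single-channel 28x28 images
z = encoder(x)                   # spatial latent: (8, 32, 7, 7)
x_hat = decoder(z)               # back to (8, 1, 28, 28)
print(z.shape, x_hat.shape)
```

Note that the "bottleneck" here is spatial (32 channels at 7x7) rather than a flat vector; a `Linear` layer can be added after flattening if a vector latent code is needed.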
Next Steps: From Math to Code
We have established the mathematical foundation: the information bottleneck forces compression, MSE reconstruction loss drives learning, and the manifold hypothesis explains why compression produces useful representations.
In Part 2, we implement all four variants---Vanilla, Denoising, Sparse, and Convolutional---in pure PyTorch. We will walk through the architecture choices, the noise injection mechanism, the L1 penalty, and the Conv2d/ConvTranspose2d encoder-decoder design.
Stay tuned for the code drop as we build Autoencoders from scratch!