Variational Autoencoders occupy a unique position in deep learning: they sit at the intersection of neural networks and probabilistic inference. Unlike standard autoencoders, which learn a deterministic compression, VAEs learn a distribution over latent representations---and this seemingly small change unlocks the ability to generate entirely new data.
In Part 1 of this series, we derive the mathematical foundations that make VAEs work: latent variable models, the Evidence Lower Bound (ELBO), KL divergence, and the reparameterization trick.
Latent Variable Models
The core assumption behind a VAE is that observed data $\mathbf{x}$ is generated by some unobserved (latent) variable $\mathbf{z}$:

$$\mathbf{z} \sim p(\mathbf{z}), \qquad \mathbf{x} \sim p(\mathbf{x} | \mathbf{z}).$$
We choose the prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$---a standard multivariate Gaussian. The likelihood $p(\mathbf{x} | \mathbf{z})$ is parameterized by a neural network (the decoder).
The problem: computing the marginal likelihood $p(\mathbf{x}) = \int p(\mathbf{x} | \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z}$ requires integrating over all possible $\mathbf{z}$, which is intractable for high-dimensional latent spaces. We need an approximation.
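Before worrying about inference, it helps to see how easy the *generative* direction is. The sketch below performs ancestral sampling from the model: draw $\mathbf{z}$ from the standard Gaussian prior, then pass it through a decoder. The decoder here is a fixed affine map, a purely illustrative stand-in; in a real VAE it is a learned network.

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder(z):
    """Stand-in for the neural decoder p(x|z): a fixed affine map.
    (Illustrative placeholder; a real decoder is a learned network.)"""
    W = np.array([[1.0, 0.5],
                  [-0.5, 1.0],
                  [0.3, -0.2]])  # maps 2-dim z to 3-dim x
    return W @ z

# Ancestral sampling: z ~ p(z) = N(0, I), then x ~ p(x|z)
z = rng.standard_normal(2)                  # sample from the standard Gaussian prior
x_mean = decoder(z)                         # decoder outputs the mean of p(x|z)
x = x_mean + 0.1 * rng.standard_normal(3)   # add Gaussian observation noise
```

Sampling is trivial; it is the reverse direction, inferring $\mathbf{z}$ from $\mathbf{x}$, that forces the variational machinery below.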
Variational Inference and the ELBO
Since we cannot compute the true posterior $p(\mathbf{z} | \mathbf{x})$, we approximate it with a learned distribution $q_\phi(\mathbf{z} | \mathbf{x})$ (the encoder), parameterized by $\phi$.
Starting from the log-evidence:

$$\log p(\mathbf{x}) = \log \int p(\mathbf{x} | \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z}$$
We introduce $q_\phi(\mathbf{z} | \mathbf{x})$ and apply Jensen's inequality:

$$\log p(\mathbf{x}) = \log \mathbb{E}_{q_\phi(\mathbf{z} | \mathbf{x})}\left[\frac{p(\mathbf{x} | \mathbf{z})\, p(\mathbf{z})}{q_\phi(\mathbf{z} | \mathbf{x})}\right] \geq \mathbb{E}_{q_\phi(\mathbf{z} | \mathbf{x})}\left[\log \frac{p(\mathbf{x} | \mathbf{z})\, p(\mathbf{z})}{q_\phi(\mathbf{z} | \mathbf{x})}\right]$$
This lower bound is the Evidence Lower Bound (ELBO):

$$\mathcal{L}(\phi; \mathbf{x}) = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z} | \mathbf{x})}\left[\log p(\mathbf{x} | \mathbf{z})\right]}_{\text{reconstruction}} - \underbrace{D_{\mathrm{KL}}\left(q_\phi(\mathbf{z} | \mathbf{x}) \,\|\, p(\mathbf{z})\right)}_{\text{regularization}}$$
The reconstruction term encourages accurate decoding. The KL term regularizes the latent space, pushing $q_\phi$ toward the prior.
KL Divergence: Closed Form
When both the approximate posterior and prior are Gaussian, the KL divergence has a closed-form solution. For $q_\phi(\mathbf{z} | \mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$ and $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$:

$$D_{\mathrm{KL}}\left(q_\phi(\mathbf{z} | \mathbf{x}) \,\|\, p(\mathbf{z})\right) = -\frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
In code, we parameterize $\log \boldsymbol{\sigma}^2$ (commonly stored as `log_var`) instead of $\boldsymbol{\sigma}$ directly, for numerical stability: the network's raw output can range over all of $\mathbb{R}$, and exponentiating guarantees a positive variance.
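A minimal NumPy sketch of the closed-form KL, taking $\boldsymbol{\mu}$ and `log_var` as inputs (the function name is my own; this is the sum over latent dimensions of the formula above):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# Sanity check: when mu = 0 and log_var = 0 (i.e. sigma^2 = 1),
# q equals the prior and the KL divergence vanishes.
kl_zero = kl_to_standard_normal(np.zeros(2), np.zeros(2))

# Shifting the mean away from zero makes the KL strictly positive.
kl_shift = kl_to_standard_normal(np.array([1.0, 0.0]), np.zeros(2))
```

Note that each latent dimension contributes independently, which is exactly why the diagonal-covariance assumption keeps the term cheap to compute.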
The Reparameterization Trick
Maximizing the ELBO requires computing gradients through the expectation $\mathbb{E}_{q_\phi(\mathbf{z} | \mathbf{x})}[\cdot]$. But sampling $\mathbf{z} \sim q_\phi(\mathbf{z} | \mathbf{x})$ is a stochastic operation---we cannot backpropagate through a random number generator.
The reparameterization trick resolves this by expressing $\mathbf{z}$ as a deterministic function of $\boldsymbol{\mu}$, $\boldsymbol{\sigma}$, and an auxiliary noise variable:

$$\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
Now $\mathbf{z}$ is differentiable with respect to $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ (and hence $\phi$), while $\boldsymbol{\epsilon}$ is treated as a fixed input. The stochasticity is "externalized," and standard backpropagation applies.
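In NumPy, the trick is a one-liner once `log_var` is converted back to $\boldsymbol{\sigma}$ (the helper name is my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps, with eps ~ N(0, I) drawn outside the
    deterministic path; z is differentiable in mu and sigma."""
    eps = rng.standard_normal(mu.shape)   # external noise, no gradient needed
    sigma = np.exp(0.5 * log_var)         # sigma = exp((log sigma^2) / 2)
    return mu + sigma * eps

mu = np.array([0.5, -1.0])
z = reparameterize(mu, np.zeros(2), rng)              # unit-variance sample
z_tight = reparameterize(mu, np.full(2, -100.0), rng) # sigma ~ 0, so z ~ mu
```

The second call illustrates the limiting behavior: as the variance shrinks, the sample collapses onto the mean, recovering a deterministic autoencoder.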
This single trick is what made VAEs practical. Without it, training would require high-variance score function estimators (REINFORCE), which converge far too slowly for deep networks.
Putting It Together
The full VAE training procedure:
- Encode: Map input $\mathbf{x}$ to distribution parameters $(\boldsymbol{\mu}, \log \boldsymbol{\sigma}^2)$.
- Reparameterize: Sample $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\epsilon} \odot e^{0.5 \cdot \log \boldsymbol{\sigma}^2}$.
- Decode: Reconstruct $\hat{\mathbf{x}}$ from $\mathbf{z}$.
- Compute the loss: the negative ELBO, i.e., reconstruction loss (BCE) plus KL divergence.
- Backpropagate: Update encoder and decoder parameters jointly.
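The five steps can be wired together in a toy forward pass. The sketch below uses random linear maps as stand-ins for the encoder and decoder networks (all weights here are hypothetical, untrained placeholders) and computes the scalar loss that training would backpropagate through:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and stand-in linear "networks" (illustrative only;
# in a real VAE these are learned neural networks).
x_dim, z_dim = 4, 2
W_mu  = 0.1 * rng.standard_normal((z_dim, x_dim))  # encoder head for mu
W_lv  = 0.1 * rng.standard_normal((z_dim, x_dim))  # encoder head for log_var
W_dec = 0.1 * rng.standard_normal((x_dim, z_dim))  # decoder

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def vae_loss(x):
    # 1. Encode: distribution parameters of q(z|x)
    mu, log_var = W_mu @ x, W_lv @ x
    # 2. Reparameterize: z = mu + eps * exp(0.5 * log_var)
    eps = rng.standard_normal(z_dim)
    z = mu + eps * np.exp(0.5 * log_var)
    # 3. Decode: Bernoulli mean of the reconstruction
    x_hat = sigmoid(W_dec @ z)
    # 4. Loss = BCE reconstruction term + closed-form KL term
    bce = -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    return bce + kl
    # 5. (Training would backpropagate through this scalar to update
    #     encoder and decoder weights jointly.)

x = rng.integers(0, 2, size=x_dim).astype(float)  # binary input, e.g. pixels
loss = vae_loss(x)
```

Part 2 replaces these linear stand-ins with real networks and lets autograd handle step 5.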
Conclusion
The mathematical machinery behind VAEs---variational inference, the ELBO, and the reparameterization trick---transforms an intractable generative modeling problem into a tractable optimization problem. In Part 2, we translate every equation into working PyTorch code.