Deconstructing VAEs from Scratch

Part 1: The Math of Latent Spaces

Introduction

Variational Autoencoders occupy a unique position in deep learning: they sit at the intersection of neural networks and probabilistic inference. Unlike standard autoencoders, which learn a deterministic compression, VAEs learn a distribution over latent representations---and this seemingly small change unlocks the ability to generate entirely new data.

In Part 1 of this series, we derive the mathematical foundations that make VAEs work: latent variable models, the Evidence Lower Bound (ELBO), KL divergence, and the reparameterization trick.

Latent Variable Models

The core assumption behind a VAE is that observed data $\mathbf{x}$ is generated by some unobserved (latent) variable $\mathbf{z}$:

$$ p(\mathbf{x}) = \int p(\mathbf{x} | \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z} $$

We choose the prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$---a standard multivariate Gaussian. The likelihood $p(\mathbf{x} | \mathbf{z})$ is parameterized by a neural network (the decoder).

The problem: computing $p(\mathbf{x})$ requires integrating over all possible $\mathbf{z}$, which is intractable for high-dimensional latent spaces. We need an approximation.
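To see why the marginal is hard, consider a hypothetical 1-D toy model where everything is Gaussian and the integral has a known answer; a naive Monte Carlo estimate from the prior recovers it here, but in a high-dimensional latent space almost every prior sample has negligible likelihood, so the same estimator becomes useless. This NumPy sketch (the model, shapes, and sample count are illustrative assumptions, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Toy 1-D latent variable model: p(z) = N(0, 1), p(x|z) = N(z, 1).
# The marginal is available analytically: p(x) = N(0, 2).
x = 1.5
z = rng.standard_normal(100_000)              # samples from the prior p(z)
mc_estimate = gaussian_pdf(x, z, 1.0).mean()  # E_{p(z)}[ p(x|z) ]
exact = gaussian_pdf(x, 0.0, 2.0)
```

In 1-D the estimate matches the analytic marginal closely; the variational machinery below exists precisely because this brute-force averaging stops working as the dimensionality of $\mathbf{z}$ grows.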

Variational Inference and the ELBO

Since we cannot compute the true posterior $p(\mathbf{z} | \mathbf{x})$, we approximate it with a learned distribution $q_\phi(\mathbf{z} | \mathbf{x})$ (the encoder), parameterized by $\phi$.

Starting from the log-evidence:

$$ \log p(\mathbf{x}) = \log \int p(\mathbf{x} | \mathbf{z}) \, p(\mathbf{z}) \, d\mathbf{z} $$

We introduce $q_\phi(\mathbf{z} | \mathbf{x})$ as an importance distribution, rewrite the integral as an expectation, and apply Jensen's inequality ($\log \mathbb{E}[\cdot] \geq \mathbb{E}[\log \cdot]$ for the concave logarithm):

$$ \log p(\mathbf{x}) = \log \mathbb{E}_{q_\phi(\mathbf{z} | \mathbf{x})} \left[ \frac{p(\mathbf{x} | \mathbf{z}) \, p(\mathbf{z})}{q_\phi(\mathbf{z} | \mathbf{x})} \right] \geq \mathbb{E}_{q_\phi(\mathbf{z} | \mathbf{x})} \left[ \log p(\mathbf{x} | \mathbf{z}) \right] - D_{\text{KL}}\left( q_\phi(\mathbf{z} | \mathbf{x}) \| p(\mathbf{z}) \right) $$

This lower bound is the Evidence Lower Bound (ELBO):

$$ \mathcal{L}(\theta, \phi; \mathbf{x}) = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z} | \mathbf{x})} \left[ \log p_\theta(\mathbf{x} | \mathbf{z}) \right]}_{\text{Reconstruction term}} - \underbrace{D_{\text{KL}}\left( q_\phi(\mathbf{z} | \mathbf{x}) \| p(\mathbf{z}) \right)}_{\text{Regularization term}} $$

The reconstruction term encourages accurate decoding. The KL term regularizes the latent space, pushing $q_\phi$ toward the prior.

KL Divergence: Closed Form

When both the approximate posterior and prior are Gaussian, the KL divergence has a closed-form solution. For $q_\phi(\mathbf{z} | \mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$ and $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$:

$$ D_{\text{KL}} = -\frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right) $$

In code, we parameterize the log-variance $\boldsymbol{\ell} = \log \boldsymbol{\sigma}^2$ (typically named log_var) instead of $\boldsymbol{\sigma}$ directly, for numerical stability:

$$ D_{\text{KL}} = -\frac{1}{2} \sum_{j=1}^{J} \left( 1 + \ell_j - \mu_j^2 - e^{\ell_j} \right) $$
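The closed form can be sanity-checked against a direct Monte Carlo estimate of the same divergence, $D_{\text{KL}} = \mathbb{E}_{q}[\log q(\mathbf{z}) - \log p(\mathbf{z})]$. A NumPy sketch, assuming a 2-dimensional latent space with arbitrary illustrative values for the mean and log-variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

mu = np.array([0.5, -1.0])
log_var = np.array([0.2, -0.3])

# Monte Carlo check: KL = E_q[ log q(z) - log p(z) ]
sigma = np.exp(0.5 * log_var)
z = mu + sigma * rng.standard_normal((200_000, 2))
log_q = -0.5 * np.sum((z - mu) ** 2 / sigma**2 + log_var + np.log(2 * np.pi), axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
mc = np.mean(log_q - log_p)
```

The two agree to within Monte Carlo noise, and the closed form is exact, cheap, and differentiable, which is why every VAE implementation uses it rather than sampling the KL term.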

The Reparameterization Trick

Maximizing the ELBO requires computing gradients through the expectation $\mathbb{E}_{q_\phi(\mathbf{z} | \mathbf{x})}[\cdot]$. But sampling $\mathbf{z} \sim q_\phi(\mathbf{z} | \mathbf{x})$ is a stochastic operation---we cannot backpropagate through a random number generator.

The reparameterization trick resolves this by expressing $\mathbf{z}$ as a deterministic function of $\boldsymbol{\mu}$, $\boldsymbol{\sigma}$, and an auxiliary noise variable:

$$ \mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) $$

Now $\mathbf{z}$ is differentiable with respect to $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ (and hence $\phi$), while $\boldsymbol{\epsilon}$ is treated as a fixed input. The stochasticity is "externalized," and standard backpropagation applies.

This single trick is what made VAEs practical. Without it, training would require high-variance score function estimators (REINFORCE), which converge far too slowly for deep networks.
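A minimal NumPy illustration of the pathwise (reparameterized) gradient, using a hypothetical test function $f(z) = z^2$ whose expectation under $\mathcal{N}(\mu, \sigma^2)$ is $\mu^2 + \sigma^2$, so the true gradient with respect to $\mu$ is known to be $2\mu$ (the function and values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 1.5, 0.8
eps = rng.standard_normal(100_000)   # auxiliary noise, fixed w.r.t. parameters
z = mu + sigma * eps                 # reparameterized sample: deterministic in mu, sigma

# Pathwise gradient of E_q[f(z)] with f(z) = z^2:
# d/dmu f(mu + sigma*eps) = f'(z) * dz/dmu = 2*z * 1, averaged over eps.
grad_mu_mc = np.mean(2 * z)
grad_mu_exact = 2 * mu               # since E[z^2] = mu^2 + sigma^2
```

Differentiating through the samples themselves, rather than through the sampling distribution as REINFORCE does, is exactly what an autodiff framework performs when the reparameterized $\mathbf{z}$ appears in the loss.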

Putting It Together

The full VAE training procedure:

  1. Encode: Map input $\mathbf{x}$ to distribution parameters $(\boldsymbol{\mu}, \log \boldsymbol{\sigma}^2)$.
  2. Reparameterize: Sample $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\epsilon} \odot e^{0.5 \cdot \log \boldsymbol{\sigma}^2}$.
  3. Decode: Reconstruct $\hat{\mathbf{x}}$ from $\mathbf{z}$.
  4. Compute the loss: the negative ELBO, i.e., reconstruction loss (e.g., BCE for binary data) + KL divergence.
  5. Backpropagate: Update encoder and decoder parameters jointly.
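The loss in steps 4 can be sketched as a single NumPy function (a toy forward pass with made-up shapes, a 784-pixel input and a 16-d latent; the real, differentiable PyTorch version is the subject of Part 2):

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_loss(x, x_hat, mu, log_var):
    """Negative ELBO: binary cross-entropy reconstruction + closed-form KL."""
    eps = 1e-7  # numerical floor so log never sees exactly 0
    bce = -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
    return bce + kl

# Hypothetical values standing in for one encode/decode pass.
x = rng.integers(0, 2, size=784).astype(float)      # binarized input
x_hat = np.clip(rng.random(784), 1e-3, 1 - 1e-3)    # decoder output in (0, 1)
mu = np.zeros(16)
log_var = np.zeros(16)                              # q matches the prior: KL = 0
loss = vae_loss(x, x_hat, mu, log_var)
```

With $\boldsymbol{\mu} = \mathbf{0}$ and $\log \boldsymbol{\sigma}^2 = \mathbf{0}$ the KL term vanishes, so the loss reduces to the reconstruction term alone; during training both terms are minimized jointly by backpropagation.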

Conclusion

The mathematical machinery behind VAEs---variational inference, the ELBO, and the reparameterization trick---transforms an intractable generative modeling problem into a tractable optimization problem. In Part 2, we translate every equation into working PyTorch code.