Deconstructing Normalizing Flows from Scratch

Part 1: The Math of Invertible Transforms

Introduction

What if a generative model could give you the exact probability of any data point? Not a lower bound (like VAEs), not an implicit density (like GANs), but the true log-likelihood under the model?

Normalizing Flows achieve this through a beautifully simple mathematical framework: construct an invertible mapping between a simple base distribution and a complex data distribution, and use the change of variables formula to track the exact density transformation at every step.

In this 3-part "Build in Public" series, we will deconstruct Normalizing Flows from first principles. Today, we lay the mathematical foundation: the change of variables formula, the Jacobian determinant, and the two key architectural ideas (planar flows and coupling layers) that make this computationally tractable. In Part 2, we build everything in pure PyTorch. In Part 3, we train on 2D density estimation benchmarks and analyze the results.

The Change of Variables Formula

The entire framework rests on one equation from probability theory. Suppose we have a random variable $z \sim p_z(z)$ drawn from a known base distribution (typically a standard Gaussian), and we apply an invertible, differentiable transformation $f: \mathbb{R}^d \to \mathbb{R}^d$ to obtain $x = f(z)$.

What is the density of $x$? The change of variables formula gives us:

$$ p_x(x) = p_z(f^{-1}(x)) \left| \det \frac{\partial f^{-1}}{\partial x} \right| $$

Taking the logarithm, and relabeling $f$ so that it now denotes the forward (data-to-latent) direction $z = f(x)$:

$$ \log p_x(x) = \log p_z(f(x)) + \log \left| \det J_f(x) \right| $$

where $J_f(x) = \frac{\partial f}{\partial x}$ is the Jacobian matrix of the transformation. This is the core equation of normalizing flows. It says: to evaluate the likelihood of any data point $x$, push it through the forward mapping $f$ to get $z$, evaluate the base density at $z$, and add a correction term for how the transformation stretches or compresses volume.
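The formula is easy to verify numerically on a toy case. The sketch below (NumPy for self-containment; Part 2 switches to PyTorch) uses a 1D affine map $x = a z + b$, for which the pushforward of a standard Gaussian has the closed form $\mathcal{N}(b, a^2)$, and checks that the change of variables formula reproduces it. All names here are illustrative.

```python
import numpy as np

def log_standard_normal(z):
    # log density of N(0, 1)
    return -0.5 * (z**2 + np.log(2 * np.pi))

# Invertible map x = a*z + b with z ~ N(0, 1)
a, b = 2.0, 1.0
x = np.array([0.5, 1.0, 3.0])

# Change of variables: log p_x(x) = log p_z(z) + log |dz/dx|
z = (x - b) / a                    # forward (data-to-latent) direction
log_det = np.log(np.abs(1.0 / a))  # Jacobian of x -> z is the scalar 1/a
log_px = log_standard_normal(z) + log_det

# Closed form: x ~ N(b, a^2)
log_px_exact = -0.5 * ((x - b) ** 2 / a**2 + np.log(2 * np.pi * a**2))
assert np.allclose(log_px, log_px_exact)
```

The same three-step recipe (push forward, evaluate base density, add the log-det correction) carries over unchanged to the high-dimensional flows below; only the cost of the log-det term changes.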

The Jacobian Determinant Problem

There is an immediate computational challenge. The Jacobian $J_f(x)$ is a $d \times d$ matrix. Computing its determinant costs $O(d^3)$ in general, which is prohibitive for high-dimensional data. The entire design challenge of normalizing flows is: how do we build expressive invertible transformations whose Jacobian determinant is cheap to compute?

Two foundational solutions have been proposed: planar flows and coupling layers.

Planar Flows

Rezende and Mohamed (2015) introduced one of the simplest possible flows. A planar flow applies the transformation:

$$ z' = z + u \cdot \tanh(w^\top z + b) $$

where $w, u \in \mathbb{R}^d$ and $b \in \mathbb{R}$ are learnable parameters. This warps the density along the hyperplane defined by $w$.

The Jacobian of this transformation has a special rank-1 structure:

$$ J = I + u \cdot \psi(z)^\top $$

where $\psi(z) = h'(w^\top z + b) \cdot w$ and $h' = 1 - \tanh^2$ is the derivative of tanh. By the matrix determinant lemma, the determinant of this rank-1 update to the identity is:

$$ \det(J) = 1 + u^\top \psi(z) $$

This is an $O(d)$ computation -- a massive improvement over the general $O(d^3)$.
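A quick numerical sanity check of the matrix determinant lemma, again as a NumPy sketch with illustrative names: the $O(d)$ scalar $1 + u^\top \psi(z)$ should match the brute-force determinant of the full $d \times d$ Jacobian $I + u\,\psi(z)^\top$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
w, u = rng.normal(size=d), rng.normal(size=d)
b = 0.3
z = rng.normal(size=d)

# psi(z) = h'(w^T z + b) * w, with h = tanh and h' = 1 - tanh^2
h_prime = 1.0 - np.tanh(w @ z + b) ** 2
psi = h_prime * w

# O(d) determinant via the matrix determinant lemma
det_fast = 1.0 + u @ psi

# Brute force: build the full Jacobian J = I + u psi^T and take det
J = np.eye(d) + np.outer(u, psi)
det_full = np.linalg.det(J)
assert np.isclose(det_fast, det_full)
```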

The Invertibility Constraint

For the transformation to be invertible, we need $\det(J) \neq 0$, or equivalently $1 + u^\top \psi(z) \neq 0$. Rezende and Mohamed enforce this by replacing $u$ with a constrained vector $\hat{u}$ satisfying $w^\top \hat{u} \geq -1$; since $h' = 1 - \tanh^2 \in (0, 1]$, this keeps $\det(J)$ strictly positive. The constrained $\hat{u}$ is computed as:

$$ \hat{u} = u + \left( m(w^\top u) - w^\top u \right) \frac{w}{\|w\|^2} $$

where $m(x) = -1 + \text{softplus}(x) = -1 + \log(1 + e^x)$. This guarantees $w^\top \hat{u} \geq -1$.
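The projection is a one-liner in code. Below is a hedged NumPy sketch (the function name `constrain_u` is my own; the series' PyTorch version arrives in Part 2) that applies the formula and checks the guarantee $w^\top \hat{u} \geq -1$.

```python
import numpy as np

def softplus(x):
    # softplus(x) = log(1 + e^x), always positive
    return np.log1p(np.exp(x))

def constrain_u(u, w):
    # Replace u with u_hat so that w^T u_hat = m(w^T u) >= -1
    wu = w @ u
    m = -1.0 + softplus(wu)
    return u + (m - wu) * w / (w @ w)

rng = np.random.default_rng(1)
d = 4
w, u = rng.normal(size=d), rng.normal(size=d)
u_hat = constrain_u(u, w)
assert w @ u_hat >= -1.0
```

Note that after the projection, $w^\top \hat{u}$ equals $m(w^\top u)$ exactly, because the correction term lies along $w / \|w\|^2$.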

By stacking $K$ such planar transformations, we obtain increasingly complex densities. However, each individual planar flow can only warp the density along a single hyperplane, limiting expressiveness.

Affine Coupling Layers

Dinh et al. introduced coupling layers in NICE (2015) and extended them to the affine form in RealNVP (2017), a far more powerful approach. An affine coupling layer partitions the input $x \in \mathbb{R}^d$ into two groups: $x_1$ (the first $d/2$ dimensions) and $x_2$ (the rest). The transformation is:

$$\begin{align} y_1 &= x_1 \\ y_2 &= x_2 \odot \exp(s(x_1)) + t(x_1) \end{align}$$

where $s$ and $t$ are arbitrary neural networks (they do not need to be invertible!) that map $\mathbb{R}^{d/2} \to \mathbb{R}^{d/2}$.

The Jacobian of this transformation is block lower triangular:

$$ J = \begin{pmatrix} I_{d/2} & 0 \\ \frac{\partial y_2}{\partial x_1} & \text{diag}(\exp(s(x_1))) \end{pmatrix} $$

The determinant of a triangular matrix is the product of its diagonal entries:

$$ \log |\det J| = \sum_{i} s_i(x_1) $$

This is $O(d)$ and requires no matrix inversion or decomposition. The inverse is equally trivial:

$$\begin{align} x_1 &= y_1 \\ x_2 &= (y_2 - t(y_1)) \odot \exp(-s(y_1)) \end{align}$$

The key insight is that while $x_1$ passes through unchanged, it controls the transformation of $x_2$ through arbitrarily complex neural networks $s$ and $t$. To ensure all dimensions are transformed, we alternate the mask between layers: in even layers, the first half is fixed; in odd layers, the second half is fixed.
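Here is a minimal NumPy sketch of a single coupling layer (the tiny untrained `mlp` stands in for the learned $s$ and $t$ networks; everything here is illustrative, and Part 2 builds the real PyTorch version). The round trip confirms that the inverse is exact and requires no solver, only a subtraction and a rescale.

```python
import numpy as np

rng = np.random.default_rng(2)
d, h = 4, 8

def mlp(x, W1, b1, W2, b2):
    # Tiny stand-in network for s or t (arbitrary, need not be invertible)
    return np.tanh(x @ W1 + b1) @ W2 + b2

def make_params():
    return (rng.normal(size=(d // 2, h)), np.zeros(h),
            rng.normal(size=(h, d // 2)), np.zeros(d // 2))

s_params, t_params = make_params(), make_params()

def coupling_forward(x):
    x1, x2 = x[: d // 2], x[d // 2 :]
    s, t = mlp(x1, *s_params), mlp(x1, *t_params)
    y2 = x2 * np.exp(s) + t
    return np.concatenate([x1, y2]), s.sum()  # log|det J| = sum_i s_i(x1)

def coupling_inverse(y):
    y1, y2 = y[: d // 2], y[d // 2 :]
    s, t = mlp(y1, *s_params), mlp(y1, *t_params)
    return np.concatenate([y1, (y2 - t) * np.exp(-s)])

x = rng.normal(size=d)
y, log_det = coupling_forward(x)
assert np.allclose(coupling_inverse(y), x)  # exact round trip
```

Alternating the mask between layers would simply swap which half plays the role of $x_1$ in `coupling_forward`.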

Composing Flows

If we compose $K$ invertible transformations $f = f_K \circ f_{K-1} \circ \cdots \circ f_1$, the chain rule factorizes the total Jacobian into a product of per-layer Jacobians (each evaluated at the output of the previous layer), so the total log-determinant is simply the sum:

$$ \log |\det J_f| = \sum_{k=1}^{K} \log |\det J_{f_k}| $$

This additivity is what makes deep normalizing flows practical. Each layer adds a small, cheap log-det correction, and together they build up an expressive transformation.
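The additivity is easy to check with toy layers. In the NumPy sketch below, each "layer" is a random invertible linear map (a stand-in for a real flow layer, chosen so the full determinant is computable for comparison); summing per-layer log-dets matches the log-det of the composed map.

```python
import numpy as np

rng = np.random.default_rng(3)
d, K = 3, 4

# K invertible linear layers x -> A_k x (diagonally dominant, hence invertible)
As = [rng.normal(size=(d, d)) + 3 * np.eye(d) for _ in range(K)]

x = rng.normal(size=d)
total_log_det = 0.0
for A in As:
    x = A @ x
    total_log_det += np.log(np.abs(np.linalg.det(A)))  # per-layer log|det|

# The composed map is A_K ... A_1; its log|det| is the sum of the parts
A_total = np.linalg.multi_dot(As[::-1])
assert np.isclose(total_log_det, np.log(np.abs(np.linalg.det(A_total))))
```

In a real flow, of course, each layer's log-det is the cheap $O(d)$ expression derived above rather than a full determinant.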

Next Steps: From Math to Code

We have established the mathematical toolkit for normalizing flows: the change of variables formula for exact likelihood, the rank-1 determinant trick for planar flows, and the triangular Jacobian trick for coupling layers.

In Part 2, we will implement both architectures in pure PyTorch: a PlanarFlow with the softplus invertibility constraint, an AffineCouplingLayer with learned scale and translation MLPs, and a complete RealNVP model with 8 alternating coupling layers and batch normalization.

Stay tuned for the code drop as we build exact-likelihood generative models from scratch!