Many generative models give you at best an approximate density. VAEs optimize a lower bound on the likelihood. GANs give you no density at all. Normalizing flows are different: they compute the exact log-likelihood of any data point by constructing an invertible mapping between a simple base distribution and the data distribution, using the change of variables formula to track density at every step.
This post covers the mathematical foundation: the change of variables formula, the Jacobian determinant, and the two architectural ideas -- planar flows and coupling layers -- that make the computation tractable. Part 2 implements everything in PyTorch. Part 3 trains on 2D benchmarks.
The Change of Variables Formula
The whole framework comes from one equation. Take a random variable $z \sim p_z(z)$ from a known base distribution (typically a standard Gaussian), and apply an invertible, differentiable transformation $f: \mathbb{R}^d \to \mathbb{R}^d$ to get $x = f(z)$.
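The density of $x$ follows from the change of variables formula:
$$p_x(x) = p_z\left(f^{-1}(x)\right) \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|$$
Taking the logarithm and reparameterizing so that $f$ now denotes the data-to-latent direction, $z = f(x)$ (the inverse of the sampling map above):
$$\log p_x(x) = \log p_z\left(f(x)\right) + \log \left| \det J_f(x) \right|$$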
where $J_f(x) = \frac{\partial f}{\partial x}$ is the Jacobian matrix. This is the core equation: to evaluate the likelihood of $x$, push it through $f$ to get $z$, evaluate the base density at $z$, and add a correction for how the transformation stretches or compresses volume.
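As a sanity check on the formula, here is a minimal NumPy sketch for a one-dimensional affine flow $z = f(x) = (x - \mu)/\sigma$ with a standard Gaussian base. The function names and the values of `mu` and `sigma` are illustrative assumptions, not part of the later implementation. The flow's log-density should match the closed-form density of $\mathcal{N}(\mu, \sigma^2)$.

```python
import numpy as np

def standard_normal_logpdf(z):
    """Log-density of N(0, 1), evaluated elementwise."""
    return -0.5 * (z ** 2 + np.log(2.0 * np.pi))

def flow_log_prob(x, mu, sigma):
    """log p_x(x) for the 1-D flow z = f(x) = (x - mu) / sigma."""
    z = (x - mu) / sigma                # push data into latent space
    log_det = -np.log(abs(sigma))       # log |df/dx| = log(1 / |sigma|)
    return standard_normal_logpdf(z) + log_det

def gaussian_logpdf(x, mu, sigma):
    """Closed-form log-density of N(mu, sigma^2), used only for comparison."""
    return (-0.5 * ((x - mu) / sigma) ** 2
            - np.log(abs(sigma))
            - 0.5 * np.log(2.0 * np.pi))

x = np.array([-1.0, 0.3, 2.5])
print(np.allclose(flow_log_prob(x, mu=1.0, sigma=2.0),
                  gaussian_logpdf(x, mu=1.0, sigma=2.0)))
```

The only place the transformation enters, apart from computing $z$, is the `log_det` term. That term is what the rest of this post tries to make cheap.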
The Jacobian Determinant Problem
The immediate problem: the Jacobian $J_f(x)$ is a $d \times d$ matrix. Computing the determinant of a general dense $d \times d$ matrix costs $O(d^3)$, and the likelihood needs one such determinant per data point, which is prohibitive for high-dimensional data. The central design question is: how do we build expressive invertible transformations whose Jacobian determinant is cheap to compute?
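To make the cost concrete, here is a small sketch of the brute-force route, assuming a hypothetical dense linear flow $z = Ax$ whose Jacobian is $A$ itself. The matrix `A` and the helper name are made up for illustration. `np.linalg.slogdet` is a general-purpose routine and does cubic work in $d$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
# Hypothetical dense flow z = A x; adding d * I makes a singular A very unlikely.
A = rng.normal(size=(d, d)) + d * np.eye(d)

def dense_linear_log_prob(x, A):
    """log p_x(x) for z = A x with a standard Gaussian base distribution."""
    z = A @ x
    log_base = -0.5 * (z @ z) - 0.5 * z.size * np.log(2.0 * np.pi)
    _, log_abs_det = np.linalg.slogdet(A)   # general-purpose determinant, O(d^3)
    return log_base + log_abs_det

x = rng.normal(size=d)
print(dense_linear_log_prob(x, A))
```

For a nonlinear flow the Jacobian changes with $x$, so this cubic cost would be paid again for every data point.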
Two solutions: planar flows and coupling layers.
Planar Flows
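Rezende and Mohamed (2015) introduced one of the simplest possible flows. A planar flow applies the transformation:
$$f(z) = z + u \, h(w^\top z + b), \qquad h = \tanh$$
where $w, u \in \mathbb{R}^d$ and $b \in \mathbb{R}$ are learnable parameters. Every point moves along the direction $u$ by an amount that depends only on its signed distance from the hyperplane $w^\top z + b = 0$, so the density is warped relative to that hyperplane.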
The Jacobian of this transformation is the identity plus a rank-1 term:
$$\frac{\partial f}{\partial z} = I + u \, \psi(z)^\top$$
where $\psi(z) = h'(w^\top z + b) \cdot w$ and $h' = 1 - \tanh^2$ is the derivative of tanh. By the matrix determinant lemma, the determinant of this rank-1 update to the identity is:
$$\det \frac{\partial f}{\partial z} = 1 + u^\top \psi(z)$$
This is an $O(d)$ computation -- a massive improvement over the general $O(d^3)$.
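The shortcut is easy to check numerically. The sketch below assumes the planar map $f(z) = z + u \tanh(w^\top z + b)$ with randomly drawn parameters. The function names are illustrative. It compares the $O(d)$ expression $\log|1 + u^\top \psi(z)|$ with a dense reference that builds $I + u\,\psi(z)^\top$ and calls `np.linalg.slogdet`.

```python
import numpy as np

def planar_forward(z, w, u, b):
    """Planar map f(z) = z + u * tanh(w^T z + b) for a single vector z."""
    return z + u * np.tanh(w @ z + b)

def planar_psi(z, w, b):
    """psi(z) = h'(w^T z + b) * w with h = tanh."""
    return (1.0 - np.tanh(w @ z + b) ** 2) * w

def planar_log_det(z, w, u, b):
    """O(d) log |det J| via the matrix determinant lemma."""
    return np.log(np.abs(1.0 + u @ planar_psi(z, w, b)))

def planar_log_det_dense(z, w, u, b):
    """O(d^3) reference: build J = I + u psi^T explicitly."""
    J = np.eye(z.size) + np.outer(u, planar_psi(z, w, b))
    return np.linalg.slogdet(J)[1]

rng = np.random.default_rng(0)
d = 6
z, w, u = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
b = 0.1
print(np.isclose(planar_log_det(z, w, u, b), planar_log_det_dense(z, w, u, b)))
```

The fast version needs only two inner products and one `tanh`, and it never forms the $d \times d$ matrix.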
The Invertibility Constraint
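For the transformation to be invertible, we need $\det(J) \neq 0$, or equivalently $1 + u^\top \psi(z) \neq 0$. Rezende and Mohamed enforce this by constraining $u$ so that $w^\top u \geq -1$. The constrained $\hat{u}$ is computed as:
$$\hat{u} = u + \left[ m(w^\top u) - w^\top u \right] \frac{w}{\|w\|^2}$$
where $m(x) = -1 + \text{softplus}(x) = -1 + \log(1 + e^x)$. Taking the inner product with $w$ gives $w^\top \hat{u} = m(w^\top u)$, and since softplus is strictly positive this guarantees $w^\top \hat{u} > -1$. Because $0 < h' \leq 1$ for tanh, it follows that $1 + \hat{u}^\top \psi(z) > 0$ for every $z$.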
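A minimal sketch of the reparameterization, assuming the correction $\hat{u} = u + [m(w^\top u) - w^\top u]\, w / \|w\|^2$ from Rezende and Mohamed. The helper names and the example vectors are illustrative. The example starts from a $u$ that violates the constraint and checks that $\hat{u}$ satisfies it.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def constrain_u(u, w):
    """Return u_hat with w^T u_hat = -1 + softplus(w^T u), which is > -1."""
    wu = w @ u
    m = -1.0 + softplus(wu)
    return u + (m - wu) * w / (w @ w)

w = np.array([1.0, -2.0, 0.5])
u = np.array([-3.0, 4.0, -1.0])     # w^T u = -11.5, which violates w^T u >= -1
u_hat = constrain_u(u, w)
print(w @ u, w @ u_hat)
```

The correction changes only the component of $u$ parallel to $w$, so the part of $u$ orthogonal to $w$ stays freely learnable.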
By stacking $K$ such planar transformations, we obtain increasingly complex densities. However, each individual planar flow can only warp the density along a single hyperplane, limiting expressiveness.
Affine Coupling Layers
Dinh et al. (2015, 2017) introduced a more powerful approach. An affine coupling layer partitions the input $x \in \mathbb{R}^d$ into two groups: $x_1$ (the first $d/2$ dimensions) and $x_2$ (the rest). The transformation is:
$$y_1 = x_1, \qquad y_2 = x_2 \odot \exp\left(s(x_1)\right) + t(x_1)$$
where $s$ and $t$ are arbitrary neural networks (they do not need to be invertible!) that map $\mathbb{R}^{d/2} \to \mathbb{R}^{d/2}$.
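The Jacobian of this transformation is lower triangular. Writing $y = (y_1, y_2)$ for the layer's output:
$$\frac{\partial y}{\partial x} = \begin{pmatrix} I & 0 \\ \frac{\partial y_2}{\partial x_1} & \text{diag}\left(\exp(s(x_1))\right) \end{pmatrix}$$
The determinant of a triangular matrix is the product of its diagonal entries:
$$\det \frac{\partial y}{\partial x} = \prod_{j} \exp\left(s(x_1)_j\right) = \exp\Big(\sum_{j} s(x_1)_j\Big), \qquad \log \left| \det \frac{\partial y}{\partial x} \right| = \sum_{j} s(x_1)_j$$
This is $O(d)$ and requires no matrix inversion or decomposition. The off-diagonal block $\partial y_2 / \partial x_1$, which contains the derivatives of $s$ and $t$, never has to be computed. The inverse is equally trivial:
$$x_1 = y_1, \qquad x_2 = \left(y_2 - t(y_1)\right) \odot \exp\left(-s(y_1)\right)$$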
The key insight is that while $x_1$ passes through unchanged, it controls the transformation of $x_2$ through arbitrarily complex neural networks $s$ and $t$. To ensure all dimensions are transformed, we alternate the mask between layers: in even layers, the first half is fixed; in odd layers, the second half is fixed.
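A single coupling layer can be sketched in a few lines of NumPy. Here `s` and `t` are hypothetical stand-ins (fixed random one-layer maps) for the neural networks, and the layer is assumed to be $y_1 = x_1$, $y_2 = x_2 \odot \exp(s(x_1)) + t(x_1)$. The point is that the inverse and the log-determinant only ever evaluate `s` and `t` in the forward direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
half = d // 2
# Hypothetical stand-ins for the networks s and t: fixed random one-layer maps.
Ws = rng.normal(size=(half, half))
Wt = rng.normal(size=(half, half))

def s(x1):
    return np.tanh(x1 @ Ws)      # bounded log-scale

def t(x1):
    return x1 @ Wt               # translation

def coupling_forward(x):
    """Return y and log |det J| for a batch x of shape (n, d)."""
    x1, x2 = x[:, :half], x[:, half:]
    log_scale = s(x1)
    y2 = x2 * np.exp(log_scale) + t(x1)
    return np.concatenate([x1, y2], axis=1), log_scale.sum(axis=1)

def coupling_inverse(y):
    """Invert the layer without inverting s or t."""
    y1, y2 = y[:, :half], y[:, half:]
    x2 = (y2 - t(y1)) * np.exp(-s(y1))
    return np.concatenate([y1, x2], axis=1)

x = rng.normal(size=(3, d))
y, log_det = coupling_forward(x)
print(np.allclose(coupling_inverse(y), x))
```

Because `s` outputs a log-scale, the multiplier `exp(s(x1))` is always positive. The layer is therefore invertible for any choice of `s` and `t`.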
Composing Flows
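If we compose $K$ invertible transformations $f = f_K \circ f_{K-1} \circ \cdots \circ f_1$, the total log-determinant is simply the sum:
$$\log \left| \det J_f(x) \right| = \sum_{k=1}^{K} \log \left| \det J_{f_k}(h_{k-1}) \right|, \qquad h_0 = x, \quad h_k = f_k(h_{k-1})$$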
This additivity makes deep flows practical. Each layer contributes a cheap log-det correction, and the composition builds up an expressive transformation.
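The accumulation can be sketched as a loop that threads the running log-determinant through the layers. This is a NumPy illustration under stated assumptions, not the PyTorch model from Part 2. `make_coupling` is a hypothetical helper that builds an affine coupling layer with fixed random stand-ins for $s$ and $t$, and alternating layers hold a different half fixed.

```python
import numpy as np

def make_coupling(rng, d, flip):
    """Build one affine coupling layer; `flip` selects which half stays fixed.

    The scale and translation maps are hypothetical fixed random stand-ins
    for the networks s and t.
    """
    half = d // 2
    Ws = 0.5 * rng.normal(size=(half, half))
    Wt = 0.5 * rng.normal(size=(half, half))

    def layer(x):
        if flip:
            x = x[:, ::-1]                  # reverse columns so the halves swap roles
        x1, x2 = x[:, :half], x[:, half:]
        log_scale = np.tanh(x1 @ Ws)
        y = np.concatenate([x1, x2 * np.exp(log_scale) + x1 @ Wt], axis=1)
        if flip:
            y = y[:, ::-1]                  # restore the original column order
        return y, log_scale.sum(axis=1)

    return layer

def flow_forward(x, layers):
    """Apply f_K(...f_1(x)) and accumulate the per-layer log-determinants."""
    total_log_det = np.zeros(x.shape[0])
    h = x
    for layer in layers:
        h, log_det = layer(h)
        total_log_det += log_det
    return h, total_log_det

rng = np.random.default_rng(0)
d, K = 4, 4
layers = [make_coupling(rng, d, flip=(k % 2 == 1)) for k in range(K)]
x = rng.normal(size=(5, d))
z, log_det = flow_forward(x, layers)
# Exact log-likelihood under a standard Gaussian base distribution.
log_prob = -0.5 * (z ** 2).sum(axis=1) - 0.5 * d * np.log(2.0 * np.pi) + log_det
print(log_prob.shape)
```

Reversing the column order is a permutation, so it contributes nothing to the log-determinant. Only the scale outputs are summed.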
From Math to Code
That covers the mathematical toolkit: the change of variables formula for exact likelihood, the rank-1 determinant trick for planar flows, and the triangular Jacobian for coupling layers.
In Part 2, we implement both architectures in pure PyTorch -- PlanarFlow with the softplus invertibility constraint, AffineCouplingLayer with learned scale and translation MLPs, and a full RealNVP model with 8 alternating coupling layers and batch normalization.