Deconstructing Normalizing Flows: Part 1 - The Math of Invertible Transforms

Introduction

Most generative models give you approximate densities. VAEs optimize a lower bound. GANs give you no density at all. Normalizing Flows are different: they compute the exact log-likelihood of any data point by constructing an invertible mapping between a simple base distribution and the data distribution, using the change of variables formula to track density at every step.

This post covers the mathematical foundation: the change of variables formula, the Jacobian determinant, and the two architectural ideas -- planar flows and coupling layers -- that make the computation tractable. Part 2 implements everything in PyTorch. Part 3 trains on 2D benchmarks.

The Change of Variables Formula

The whole framework comes from one equation. Take a random variable $z \sim p_z(z)$ from a known base distribution (typically a standard Gaussian), and apply an invertible, differentiable transformation $f: \mathbb{R}^d \to \mathbb{R}^d$ to get $x = f(z)$.

The density of $x$ follows from the change of variables formula:

p_x(x) = p_z(f^{-1}(x)) \left| \det \frac{\partial f^{-1}}{\partial x} \right|

In practice it is more convenient to parameterize the data-to-latent direction. Redefining $f$ from here on as the map from data to latent, so that $z = f(x)$, and taking the logarithm gives:

\log p_x(x) = \log p_z(f(x)) + \log \left| \det J_f(x) \right|

where $J_f(x) = \frac{\partial f}{\partial x}$ is the Jacobian matrix. This is the core equation: to evaluate the likelihood of $x$, push it through $f$ to get $z$, evaluate the base density at $z$, and add a correction for how the transformation stretches or compresses volume.
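As a sanity check, the formula can be worked through in one dimension, where the Jacobian is just a scalar derivative. The sketch below assumes a hypothetical affine flow $x = a z + b$ (the values of `a` and `b` are made up for illustration) and compares the flow's log-likelihood with the closed-form Gaussian density it should reproduce. It uses plain Python rather than PyTorch to stay dependency-free.

```python
import math

def standard_normal_logpdf(z):
    """Log-density of the standard Gaussian base distribution p_z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

# Hypothetical flow parameters: the generative direction is x = a * z + b.
a, b = 2.0, 1.0

def f(x):
    """Data-to-latent map z = f(x), the inverse of x = a * z + b."""
    return (x - b) / a

def flow_logpdf(x):
    """log p_x(x) = log p_z(f(x)) + log|det J_f(x)|, where J_f = df/dx = 1/a."""
    log_abs_det = -math.log(abs(a))
    return standard_normal_logpdf(f(x)) + log_abs_det

def gaussian_logpdf(x, mean, std):
    """Closed-form log-density of N(mean, std^2), used only as a reference."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std)
            - 0.5 * math.log(2.0 * math.pi))

x = 0.3
# Both numbers should agree up to floating-point rounding.
print(flow_logpdf(x), gaussian_logpdf(x, mean=b, std=abs(a)))
```

Since $x = a z + b$ with $z$ standard normal is exactly $\mathcal{N}(b, a^2)$, the two values should match, and the $-\log|a|$ term is the volume correction at work.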

The Jacobian Determinant Problem

The immediate problem: the Jacobian $J_f(x)$ is a $d \times d$ matrix, and computing the determinant of a general dense matrix costs $O(d^3)$. That is prohibitive for high-dimensional data, because the determinant has to be recomputed for every data point at every training step. The central design question is: how do we build expressive invertible transformations whose Jacobian determinant is cheap to compute?

Two solutions: planar flows and coupling layers.

Planar Flows

Rezende and Mohamed (2015) introduced one of the simplest possible flows. A planar flow applies the transformation:

z' = z + u \cdot \tanh(w^\top z + b)

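where $w, u \in \mathbb{R}^d$ and $b \in \mathbb{R}$ are learnable parameters. Every point is shifted in the direction $u$ by an amount that depends only on $w^\top z + b$, so the density is contracted or expanded perpendicular to the hyperplane $w^\top z + b = 0$.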

The Jacobian of this transformation has a special rank-1 structure:

J = I + u \cdot \psi(z)^\top

where $\psi(z) = h'(w^\top z + b) \cdot w$ and $h' = 1 - \tanh^2$ is the derivative of tanh. By the matrix determinant lemma, the determinant of this rank-1 update to the identity is:

\det(J) = 1 + u^\top \psi(z)

This is an $O(d)$ computation -- a massive improvement over the general $O(d^3)$.
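The identity is easy to check numerically. The following is a minimal sketch in plain Python for $d = 2$, with made-up parameter values, that computes the determinant both ways: once with the $O(d)$ formula and once from the full matrix $I + u\,\psi(z)^\top$. The helper names (`planar_forward`, `det_2x2`) are illustrative and not taken from Part 2.

```python
import math

def planar_forward(z, u, w, b):
    """Apply z' = z + u * tanh(w^T z + b); return (z', det J, psi)."""
    pre = sum(wi * zi for wi, zi in zip(w, z)) + b       # scalar w^T z + b
    h = math.tanh(pre)
    h_prime = 1.0 - h * h                                # derivative of tanh
    psi = [h_prime * wi for wi in w]                     # psi(z) = h'(.) * w
    z_new = [zi + ui * h for zi, ui in zip(z, u)]
    det = 1.0 + sum(ui * pi for ui, pi in zip(u, psi))   # O(d) determinant
    return z_new, det, psi

def det_2x2(m):
    """Explicit determinant of a 2x2 matrix, used only for comparison."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Illustrative values, not learned parameters.
z = [0.5, -1.0]
u = [0.8, 0.3]
w = [1.0, -2.0]
b = 0.1

z_new, det_fast, psi = planar_forward(z, u, w, b)

# Build the full Jacobian J = I + u psi^T and take its determinant directly.
J = [[(1.0 if i == j else 0.0) + u[i] * psi[j] for j in range(2)]
     for i in range(2)]
det_full = det_2x2(J)

print(det_fast, det_full)  # the two should agree up to rounding
```

In two dimensions the saving is invisible, but the fast path only ever needs two dot products, whatever the value of $d$.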

The Invertibility Constraint

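Invertibility is not automatic: for some parameter values $1 + u^\top \psi(z)$ crosses zero and the map folds over on itself. Since $u^\top \psi(z) = h'(w^\top z + b)\, w^\top u$ and $0 < h' \leq 1$ for tanh, the determinant stays positive whenever $w^\top u > -1$. Rezende and Mohamed give $w^\top u \geq -1$ as a sufficient condition for invertibility and enforce it by replacing $u$ with a constrained $\hat{u}$, computed as: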

\hat{u} = u + \left( m(w^\top u) - w^\top u \right) \frac{w}{\|w\|^2}

where $m(x) = -1 + \text{softplus}(x) = -1 + \log(1 + e^x)$. Because softplus is strictly positive, this gives $w^\top \hat{u} = m(w^\top u) > -1$.
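To make the reparameterization concrete, here is a small plain-Python sketch that starts from a deliberately bad $u$ (one with $w^\top u < -1$) and applies the correction. The function names and the example vectors are assumptions for illustration only.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def softplus(x):
    """Numerically stable log(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def constrain_u(u, w):
    """Return u_hat = u + (m(w^T u) - w^T u) * w / ||w||^2."""
    wu = dot(w, u)
    m = -1.0 + softplus(wu)
    scale = (m - wu) / dot(w, w)
    return [ui + scale * wi for ui, wi in zip(u, w)]

# Illustrative vectors chosen so that the raw u violates the constraint.
w = [1.0, -2.0]
u = [-3.0, 2.0]

u_hat = constrain_u(u, w)
print(dot(w, u))      # -7.0, violates w^T u >= -1
print(dot(w, u_hat))  # equals m(-7), slightly above -1
```

The correction only moves $u$ along $w$, so the component of $u$ orthogonal to $w$ is left untouched. One practical caveat: for very negative $w^\top u$, softplus underflows and $w^\top \hat{u}$ rounds to exactly $-1$ in floating point.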

By stacking $K$ such planar transformations, we obtain increasingly complex densities. However, each individual planar flow can only warp the density along a single hyperplane, limiting expressiveness.

Affine Coupling Layers

Dinh et al. (2015, 2017) introduced a more powerful approach. An affine coupling layer partitions the input $x \in \mathbb{R}^d$ into two groups: $x_1$ (the first $d/2$ dimensions) and $x_2$ (the rest). The transformation is:

\begin{align} y_1 &= x_1 \\ y_2 &= x_2 \odot \exp(s(x_1)) + t(x_1) \end{align}

where $s$ and $t$ are arbitrary neural networks (they do not need to be invertible!) that map $\mathbb{R}^{d/2} \to \mathbb{R}^{d/2}$.

The Jacobian of this transformation is lower triangular:

J = \begin{pmatrix} I_{d/2} & 0 \\ \frac{\partial y_2}{\partial x_1} & \text{diag}(\exp(s(x_1))) \end{pmatrix}

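The determinant of a triangular matrix is the product of its diagonal entries, here $\prod_i \exp(s_i(x_1))$, so the off-diagonal block $\frac{\partial y_2}{\partial x_1}$ never has to be computed: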

\log |\det J| = \sum_{i} s_i(x_1)

This is $O(d)$ and requires no matrix inversion or decomposition. The inverse is equally trivial:

\begin{align} x_1 &= y_1 \\ x_2 &= (y_2 - t(y_1)) \odot \exp(-s(y_1)) \end{align}

The key insight is that while $x_1$ passes through unchanged, it controls the transformation of $x_2$ through arbitrarily complex neural networks $s$ and $t$. To ensure all dimensions are transformed, we alternate the mask between layers: in even layers, the first half is fixed; in odd layers, the second half is fixed.
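The forward pass, inverse, and log-determinant fit in a few lines. The sketch below is plain Python on a 4-dimensional input, with two hand-written toy functions standing in for the networks $s$ and $t$. They are hypothetical placeholders, chosen to be non-invertible on purpose to show that the layer as a whole still inverts exactly.

```python
import math

def s(x1):
    """Toy stand-in for the scale network (hypothetical, not learned)."""
    total = sum(x1)
    return [math.tanh(total + v) for v in x1]

def t(x1):
    """Toy stand-in for the translation network; sin is not invertible."""
    return [2.0 * math.sin(v) for v in x1]

def coupling_forward(x):
    """y1 = x1, y2 = x2 * exp(s(x1)) + t(x1); returns (y, log|det J|)."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    scale, shift = s(x1), t(x1)
    y2 = [x2i * math.exp(si) + ti for x2i, si, ti in zip(x2, scale, shift)]
    log_det = sum(scale)          # sum of s_i(x1): O(d), no matrix needed
    return x1 + y2, log_det

def coupling_inverse(y):
    """x1 = y1, x2 = (y2 - t(y1)) * exp(-s(y1))."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    scale, shift = s(y1), t(y1)   # same networks, evaluated on y1 = x1
    x2 = [(y2i - ti) * math.exp(-si) for y2i, si, ti in zip(y2, scale, shift)]
    return y1 + x2

x = [0.4, -1.2, 0.7, 2.5]
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)
print(y, log_det)
print(x_rec)  # should reproduce x up to rounding
```

Note that the inverse calls `s` and `t` in the forward direction only. That is why they can be arbitrary networks: the layer never needs to invert them.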

Composing Flows

If we compose $K$ invertible transformations $f = f_K \circ f_{K-1} \circ \cdots \circ f_1$, the total log-determinant is simply the sum:

\log |\det J_f| = \sum_{k=1}^{K} \log |\det J_{f_k}|

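Each term is evaluated at that layer's own input, i.e. the output of the layers before it, so the sum accumulates in a single forward pass. This additivity makes deep flows practical. Each layer contributes a cheap log-det correction, and the composition builds up an expressive transformation.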
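A short plain-Python sketch of the accumulation, using one-dimensional layers so that the Jacobian is a scalar derivative and the result can be checked against a finite difference. The layer constructors (`affine`, `tanh_residual`) and their parameters are assumptions for illustration.

```python
import math

def affine(a, b):
    """Layer x -> a * x + b, with log|det J| = log|a|."""
    def layer(x):
        return a * x + b, math.log(abs(a))
    return layer

def tanh_residual(c):
    """Layer x -> x + c * tanh(x); derivative 1 + c * (1 - tanh^2 x) > 0 for c > -1."""
    def layer(x):
        h = math.tanh(x)
        return x + c * h, math.log(1.0 + c * (1.0 - h * h))
    return layer

def compose(layers, x):
    """Run x through all layers, summing the per-layer log-determinants."""
    total = 0.0
    for layer in layers:
        x, log_det = layer(x)   # log-det is evaluated at this layer's input
        total += log_det
    return x, total

layers = [affine(2.0, -1.0), tanh_residual(0.5), affine(0.5, 3.0)]
x0 = 0.7
y, log_det_sum = compose(layers, x0)

# Reference: central finite difference of the composed map.
eps = 1e-6
y_plus, _ = compose(layers, x0 + eps)
y_minus, _ = compose(layers, x0 - eps)
log_det_numeric = math.log(abs((y_plus - y_minus) / (2.0 * eps)))

print(log_det_sum, log_det_numeric)  # should agree to several decimal places
```

The same loop structure carries over to the multivariate case: each layer returns its output together with its own log-determinant, and the model just adds them up.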

From Math to Code

That covers the mathematical toolkit: the change of variables formula for exact likelihood, the rank-1 determinant trick for planar flows, and the triangular Jacobian for coupling layers.

In Part 2, we implement both architectures in pure PyTorch -- PlanarFlow with the softplus invertibility constraint, AffineCouplingLayer with learned scale and translation MLPs, and a full RealNVP model with 8 alternating coupling layers and batch normalization.
