In Part 1, we derived the change of variables formula and the Jacobian determinant tricks that make normalizing flows computationally tractable. In Part 2, we implemented PlanarFlow, AffineCouplingLayer, BatchNormFlow, and RealNVP from scratch in pure PyTorch.
Now, we train both architectures on 2D density estimation benchmarks and analyze the results. We evaluate on two classic datasets -- Two Moons and Two Circles -- that are simple enough to visualize but complex enough to test a flow's ability to model multi-modal and non-convex distributions.
## Experimental Setup

### Datasets
Both datasets contain 4,000 points in $\mathbb{R}^2$, generated with Gaussian noise ($\sigma = 0.06$) and normalized to zero mean and unit variance:
- Two Moons: Two interleaving half-circles, creating a non-convex, multi-modal density with a narrow gap between the crescents.
- Two Circles: Two concentric circles with radius ratio 0.4, testing the flow's ability to model rotationally symmetric but disconnected distributions.
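For reference, both datasets can be generated with nothing beyond NumPy (scikit-learn's `make_moons` / `make_circles` produce equivalent data). The helper names and exact offsets below are illustrative sketches matching the stated parameters, not the code used in this series:

```python
import numpy as np

def two_moons(n=4000, noise=0.06, seed=0):
    """Two interleaving half-circles with Gaussian noise, standardized."""
    rng = np.random.default_rng(seed)
    n_half = n // 2
    t = rng.uniform(0, np.pi, n_half)
    upper = np.stack([np.cos(t), np.sin(t)], axis=1)          # upper crescent
    lower = np.stack([1 - np.cos(t), 0.5 - np.sin(t)], axis=1)  # interleaved lower crescent
    x = np.concatenate([upper, lower]) + rng.normal(0, noise, (2 * n_half, 2))
    return (x - x.mean(0)) / x.std(0)  # zero mean, unit variance

def two_circles(n=4000, noise=0.06, ratio=0.4, seed=0):
    """Two concentric circles with radius ratio `ratio`, standardized."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0, 2 * np.pi, n)
    r = np.where(np.arange(n) < n // 2, 1.0, ratio)  # outer half, inner half
    x = np.stack([r * np.cos(t), r * np.sin(t)], axis=1)
    x += rng.normal(0, noise, (n, 2))
    return (x - x.mean(0)) / x.std(0)
```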
### Models
- RealNVP: 8 affine coupling layers with alternating even/odd masks, batch normalization between layers, hidden dimension 64. Total: 71,744 parameters.
- Planar Flow: 32 stacked planar transformations, each with the softplus reparameterization that enforces the invertibility condition $w^\top u \ge -1$. Total: 160 parameters.
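The alternating even/odd masks mentioned for RealNVP can be sketched in a few lines (a hypothetical helper, not Part 2's exact code):

```python
def alternating_masks(num_layers=8, dim=2):
    """Binary masks for coupling layers: even-indexed layers pass the even
    coordinates through unchanged and transform the odd ones; odd-indexed
    layers swap the roles, so every dimension gets updated every two layers."""
    even = [(j + 1) % 2 for j in range(dim)]  # [1, 0] for dim=2
    odd = [j % 2 for j in range(dim)]         # [0, 1] for dim=2
    return [even if i % 2 == 0 else odd for i in range(num_layers)]
```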
### Training
Both models are trained for 1,000 epochs with Adam (initial lr = $10^{-3}$), cosine annealing to $10^{-5}$, batch size 256, and gradient clipping at norm 5.0.
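A condensed sketch of this training loop, assuming a `flow` module exposing `log_prob(x)` as implemented in Part 2 (the function name and data handling here are illustrative):

```python
import torch

def train(flow, data, epochs=1000, batch_size=256):
    """Maximize likelihood by minimizing NLL with Adam + cosine annealing."""
    opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(
        opt, T_max=epochs, eta_min=1e-5)
    for epoch in range(epochs):
        perm = torch.randperm(len(data))  # reshuffle each epoch
        for i in range(0, len(data), batch_size):
            batch = data[perm[i:i + batch_size]]
            loss = -flow.log_prob(batch).mean()  # negative log-likelihood
            opt.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(flow.parameters(), max_norm=5.0)
            opt.step()
        sched.step()  # anneal lr from 1e-3 down to 1e-5
```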
## Results
| Model | Dataset | Best NLL | Parameters |
|---|---|---|---|
| RealNVP | Two Moons | 1.3375 | 71,744 |
| RealNVP | Two Circles | 2.0891 | 71,744 |
| Planar Flow ($K$=32) | Two Moons | 2.9236 | 160 |
| Planar Flow ($K$=32) | Two Circles | 2.8546 | 160 |
## Quantitative Analysis
RealNVP achieves dramatically better likelihoods across both datasets:
- On Two Moons, RealNVP's NLL of 1.34 is 54% lower than the planar flow's 2.92.
- On Two Circles, RealNVP scores 2.09 versus the planar flow's 2.85 -- a 27% improvement.
- The Two Circles task is harder for RealNVP (NLL 2.09 vs 1.34 for Moons), likely because the concentric ring structure requires more precise rotational modeling.
- Interestingly, the planar flow performs similarly on both tasks (2.92 vs 2.85), suggesting it has already hit its expressiveness ceiling.
## Qualitative Analysis: Density Heatmaps
The log-likelihood heatmaps reveal the starkest differences. RealNVP's learned density concentrates sharply on the two crescent ridges of the Two Moons dataset, with near-zero probability mass in the gap between them. The planar flow produces a smoother, more diffuse density that fails to fully separate the two modes.
On Two Circles, RealNVP traces the two ring structures but with some smearing between the inner and outer circles. The planar flow blurs the rings into a single broad mode centered at the origin.
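The heatmaps themselves amount to evaluating the flow's `log_prob` on a dense grid. A minimal sketch, assuming the `log_prob(x)` interface from Part 2 (plotting via e.g. `plt.imshow` is omitted):

```python
import torch

def density_grid(flow, lim=3.0, steps=200):
    """Evaluate log-density on a steps x steps grid over [-lim, lim]^2."""
    xs = torch.linspace(-lim, lim, steps)
    gx, gy = torch.meshgrid(xs, xs, indexing="xy")
    pts = torch.stack([gx.reshape(-1), gy.reshape(-1)], dim=1)
    with torch.no_grad():  # pure evaluation, no gradients needed
        logp = flow.log_prob(pts)
    return logp.reshape(steps, steps)  # hand to plt.imshow / contourf
```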
## Qualitative Analysis: Transformation Visualization
The transformation plots track 1,000 points from the base Gaussian through the inverse mapping into data space, color-coded by their angle in base space.
For RealNVP on Two Moons, the transformation reveals how the coupling layers progressively:
- Stretch the Gaussian cloud along one axis (early layers).
- Bend and fold the distribution into a crescent shape (middle layers).
- Fine-tune the density ridges and sharpen the gap between moons (final layers).
This progressive refinement is the hallmark of deep normalizing flows: each layer makes a small, invertible adjustment, and the composition builds up a complex overall transformation.
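These plots can be produced by snapshotting the point cloud after each layer's inverse. A sketch, assuming each layer exposes an `inverse()` method as in Part 2 (the helper name is illustrative):

```python
import torch

def trace_points(flow_layers, n=1000):
    """Push base-Gaussian samples through the inverse mapping, recording a
    snapshot after every layer; color by angle in base space."""
    z = torch.randn(n, 2)                    # samples from the base Gaussian
    colors = torch.atan2(z[:, 1], z[:, 0])   # angle in base space for coloring
    snapshots = [z]
    x = z
    with torch.no_grad():
        for layer in reversed(flow_layers):  # inverse applies layers in reverse
            x = layer.inverse(x)
            snapshots.append(x)
    return snapshots, colors  # scatter each snapshot, colored by `colors`
```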
## Connections to Other Generative Models

### Flows vs. VAEs
VAEs optimize the Evidence Lower Bound (ELBO), which is only a lower bound on the true log-likelihood. Normalizing flows can be used within VAEs to make the approximate posterior more flexible (as proposed by Rezende and Mohamed 2015), or they can replace VAEs entirely as standalone density estimators with exact likelihood.
### Flows vs. GANs
GANs define an implicit generative model with no tractable density. You can sample, but you cannot evaluate probabilities. This makes GANs powerful for sample quality but useless for anomaly detection, compression, or any task requiring density evaluation. Flows provide both sampling and density evaluation.
### Flows vs. Diffusion Models
Diffusion models (DDPMs) optimize a variational bound similar to VAEs and require iterative denoising over hundreds or thousands of steps to generate a single sample. Normalizing flows generate samples in a single pass through the inverse mapping. However, diffusion models are generally more expressive for high-dimensional data (images) because they are not constrained to invertible architectures.
## The Invertibility Trade-Off
The fundamental constraint of normalizing flows is architectural: every layer must be invertible with a tractable Jacobian determinant. This limits the design space compared to unconstrained neural networks. Affine coupling layers achieve this elegantly by making half the dimensions a deterministic function of the other half, but this still restricts the class of learnable transformations.
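To make the constraint concrete, here is a minimal affine coupling sketch (not Part 2's exact implementation): the inverse is exact and the log-determinant is just a sum of log-scales, but only half the dimensions are transformed per layer:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Minimal affine coupling: the first half of the dimensions passes
    through unchanged and parameterizes an affine map of the second half."""
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),  # predicts scale and shift
        )

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=1)
        s = torch.tanh(s)  # clamp the log-scale for stability (one common choice)
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=1)  # triangular Jacobian: sum of log-scales
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        s, t = self.net(y1).chunk(2, dim=1)
        s = torch.tanh(s)
        x2 = (y2 - t) * torch.exp(-s)  # exact inverse, no iteration needed
        return torch.cat([y1, x2], dim=1)
```

Stacking such layers with alternating masks is what lets every dimension eventually be transformed while keeping both directions and the Jacobian determinant cheap.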
## Summary
In this three-part series, we built normalizing flows from first principles:
- Part 1: Derived the change of variables formula, the rank-1 determinant trick for planar flows, and the triangular Jacobian for coupling layers.
- Part 2: Implemented PlanarFlow (with invertibility constraint), AffineCouplingLayer (with scale clamping), BatchNormFlow, and RealNVP (8 layers, alternating masks) in pure PyTorch.
- Part 3: Trained both architectures on 2D density estimation, achieving NLL of 1.34 (RealNVP) vs 2.92 (Planar) on Two Moons. Visualized density heatmaps and the progressive Gaussian-to-data transformation.
Normalizing flows demonstrate that exact likelihood computation is achievable in deep generative models. While modern practice has shifted toward diffusion models for image generation, the mathematical principles of normalizing flows -- invertibility, Jacobian computation, and the change of variables formula -- remain foundational to understanding the full landscape of generative AI.