Deconstructing Normalizing Flows: Part 3 - Exact Likelihood Generation

Introduction

Part 1 covered the math. Part 2 built the code. Now we train both architectures on 2D density estimation and see what they actually learn.

We use two datasets -- Two Moons and Two Circles -- simple enough to visualize as heatmaps but complex enough to test multi-modal and non-convex density modeling.

Experimental Setup

Datasets

Both datasets contain 4,000 points in $\mathbb{R}^2$, generated with Gaussian noise ($\sigma = 0.06$) and normalized to zero mean and unit variance:

Two Moons: Two interleaving half-circles with a narrow gap between them -- non-convex and multi-modal.
Two Circles: Two concentric circles (radius ratio 0.4) -- rotationally symmetric but disconnected.
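
As a rough illustration of the setup above, both datasets can be built in a few lines of NumPy. This is a hand-rolled sketch, similar in spirit to scikit-learn's `make_moons` and `make_circles`; the source does not say which generator was actually used. The helper names, the seed, and the choice to add noise before per-coordinate standardization are all assumptions.

```python
import numpy as np

def two_moons(n, noise, rng):
    """Two interleaving half-circles (hypothetical helper, not from Part 2)."""
    n_outer = n // 2
    n_inner = n - n_outer
    t_outer = rng.uniform(0.0, np.pi, n_outer)
    t_inner = rng.uniform(0.0, np.pi, n_inner)
    outer = np.stack([np.cos(t_outer), np.sin(t_outer)], axis=1)
    # Second moon is flipped and shifted so the two arcs interleave.
    inner = np.stack([1.0 - np.cos(t_inner), 0.5 - np.sin(t_inner)], axis=1)
    x = np.concatenate([outer, inner], axis=0)
    return x + rng.normal(0.0, noise, x.shape)

def two_circles(n, noise, ratio, rng):
    """Two concentric circles; `ratio` is inner radius / outer radius."""
    n_outer = n // 2
    n_inner = n - n_outer
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    radius = np.concatenate([np.ones(n_outer), np.full(n_inner, ratio)])
    x = np.stack([radius * np.cos(theta), radius * np.sin(theta)], axis=1)
    return x + rng.normal(0.0, noise, x.shape)

def standardize(x):
    """Zero mean and unit variance per coordinate (assumed convention)."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

rng = np.random.default_rng(0)  # seed chosen arbitrarily for the sketch
moons = standardize(two_moons(4000, 0.06, rng))
circles = standardize(two_circles(4000, 0.06, 0.4, rng))
print(moons.shape, circles.shape)
```

Standardizing after adding noise keeps the overall scale of both datasets comparable to the standard normal base distribution, which is usually what makes early training stable.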

Models

RealNVP: 8 affine coupling layers with alternating even/odd masks, batch normalization between layers, hidden dimension 64. Total: 71,744 parameters.
Planar Flow: 32 stacked planar transformations with softplus invertibility constraints. Total: 160 parameters.
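
The parameter totals can be sanity-checked with simple arithmetic. The planar count follows directly from the definition (two vectors in $\mathbb{R}^2$ plus a scalar bias per layer). For RealNVP, the layout below is only one guess that happens to reproduce the stated total: separate scale and shift MLPs of shape 2 -> 64 -> 64 -> 2 in each coupling layer, plus one batch-norm flow per coupling layer with a learnable scale and shift per dimension. Part 2's actual architecture may differ.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def mlp_params(sizes):
    """Total parameters of an MLP with the given layer widths."""
    return sum(linear_params(a, b) for a, b in zip(sizes[:-1], sizes[1:]))

dim, hidden = 2, 64
n_coupling, n_planar = 8, 32

# Planar flow: each layer has u in R^2, w in R^2, and a scalar b.
planar_total = n_planar * (dim + dim + 1)

# Assumed RealNVP layout (not confirmed by the source):
# two MLPs (scale and shift) per coupling layer, each 2 -> 64 -> 64 -> 2.
net = mlp_params([dim, hidden, hidden, dim])
coupling_total = n_coupling * 2 * net
# One batch-norm flow per coupling layer, with scale and shift per dimension.
batchnorm_total = n_coupling * 2 * dim
realnvp_total = coupling_total + batchnorm_total

print(planar_total, realnvp_total)  # 160 71744
```

Under this layout the ratio is about 448 to 1, and almost all of RealNVP's parameters sit in the 64 x 64 hidden layers of the scale and shift networks.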

Training

Both models are trained for 1,000 epochs with Adam (initial lr = $10^{-3}$), cosine annealing to $10^{-5}$, batch size 256, and gradient clipping at norm 5.0.
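
A minimal PyTorch sketch of this training recipe is shown below. The optimizer, scheduler, batch size, and clipping threshold follow the description above; everything else is a stand-in. In particular, `DiagonalAffineFlow` is a hypothetical one-layer flow and the toy data is synthetic, used here only so the loop runs on its own. It is not the Part 2 model or the real datasets.

```python
import math
import torch
from torch import nn

class DiagonalAffineFlow(nn.Module):
    """Hypothetical stand-in flow: z = (x - mu) * exp(-log_sigma)."""
    def __init__(self, dim=2):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(dim))
        self.log_sigma = nn.Parameter(torch.zeros(dim))

    def log_prob(self, x):
        z = (x - self.mu) * torch.exp(-self.log_sigma)
        # Standard normal base density plus the log-determinant of dz/dx.
        log_base = -0.5 * (z ** 2).sum(dim=1) - 0.5 * x.shape[1] * math.log(2 * math.pi)
        log_det = -self.log_sigma.sum()
        return log_base + log_det

torch.manual_seed(0)
data = torch.randn(1024, 2) * torch.tensor([2.0, 0.5]) + 1.0  # toy data only

model = DiagonalAffineFlow()
epochs, batch_size = 1000, 256
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs, eta_min=1e-5
)

with torch.no_grad():
    initial_nll = -model.log_prob(data).mean().item()

for epoch in range(epochs):
    perm = torch.randperm(data.shape[0])
    for start in range(0, data.shape[0], batch_size):
        batch = data[perm[start:start + batch_size]]
        loss = -model.log_prob(batch).mean()  # negative log-likelihood
        optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()
    scheduler.step()  # one cosine step per epoch (assumed granularity)

with torch.no_grad():
    final_nll = -model.log_prob(data).mean().item()
final_lr = scheduler.get_last_lr()[0]
print(initial_nll, final_nll, final_lr)
```

Stepping the scheduler once per epoch rather than once per batch is an assumption; with `T_max` set to the number of epochs, the learning rate reaches `eta_min` at the end of training.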

Results

Model	Dataset	Best NLL	Parameters
RealNVP	Two Moons	1.3375	71,744
RealNVP	Two Circles	2.0891	71,744
Planar Flow ($K = 32$)	Two Moons	2.9236	160
Planar Flow ($K = 32$)	Two Circles	2.8546	160

Quantitative Analysis

RealNVP outperforms the planar flow on both datasets:

On Two Moons, RealNVP's NLL of 1.34 is 54% lower than the planar flow's 2.92.
On Two Circles, RealNVP scores 2.09 versus the planar flow's 2.85 -- a 27% improvement.
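RealNVP scores worse on Two Circles than on Two Moons (2.09 vs. 1.34), plausibly because coupling layers transform one coordinate at a time and have no built-in notion of rotational symmetry, so each ring has to be assembled from axis-aligned pieces. The two numbers are not strictly comparable, though, since the datasets have different entropies.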
Interestingly, the planar flow performs similarly on both tasks (2.92 vs 2.85), suggesting it has already hit its expressiveness ceiling.
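
The percentages above can be reproduced from the table, and one extra reference point helps put the planar results in context. Assuming NLL is reported in nats per point, the base distribution is a standard normal, and evaluation uses the standardized data (none of which this part states explicitly), an identity map that leaves the base Gaussian untouched would score $1 + \log 2\pi \approx 2.84$. The variable names below are illustrative.

```python
import math

nll = {
    ("RealNVP", "Two Moons"): 1.3375,
    ("RealNVP", "Two Circles"): 2.0891,
    ("Planar", "Two Moons"): 2.9236,
    ("Planar", "Two Circles"): 2.8546,
}

def relative_reduction(better, worse):
    """Fraction by which `better` is lower than `worse`."""
    return (worse - better) / worse

reductions = {}
for dataset in ("Two Moons", "Two Circles"):
    reductions[dataset] = relative_reduction(
        nll[("RealNVP", dataset)], nll[("Planar", dataset)]
    )
    print(f"{dataset}: {100 * reductions[dataset]:.1f}% lower NLL")

# Expected NLL of zero-mean, unit-variance 2D data under N(0, I), in nats:
# 0.5 * E[||x||^2] + log(2*pi) = 0.5 * 2 + log(2*pi).
baseline = 1.0 + math.log(2.0 * math.pi)
print(f"standard-normal reference: {baseline:.4f}")
```

If those assumptions match the Part 2 setup, both planar results (2.92 and 2.85) sit slightly above this reference, which would be consistent with the ceiling reading: the planar flow ends up close to what an unmodified Gaussian achieves.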

Density Heatmaps

The log-likelihood heatmaps show the difference clearly. RealNVP concentrates density on the two crescent ridges of Two Moons, with near-zero mass in the gap between them. The planar flow produces a smoother, more diffuse density that does not fully separate the two modes.

On Two Circles, RealNVP traces both ring structures but with some smearing between inner and outer circles. The planar flow collapses both rings into a single broad mode at the origin.
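
For readers who want to produce similar heatmaps, one common approach is to evaluate the model's log-density on a regular grid. The sketch below assumes only that the model exposes some function mapping an `(N, 2)` array to `N` log-densities; `log_density_grid` and the standard-normal stand-in are hypothetical names, not part of the Part 2 code.

```python
import numpy as np

def log_density_grid(log_prob_fn, lim=3.0, resolution=200):
    """Evaluate log_prob_fn on a square grid over [-lim, lim]^2."""
    xs = np.linspace(-lim, lim, resolution)
    ys = np.linspace(-lim, lim, resolution)
    gx, gy = np.meshgrid(xs, ys)
    points = np.stack([gx.ravel(), gy.ravel()], axis=1)
    log_p = log_prob_fn(points).reshape(resolution, resolution)
    return xs, ys, log_p

def standard_normal_log_prob(x):
    """Stand-in for a trained flow's log-density (2D standard normal)."""
    return -0.5 * (x ** 2).sum(axis=1) - np.log(2.0 * np.pi)

xs, ys, log_p = log_density_grid(standard_normal_log_prob)

# Sanity check: the density should integrate to roughly 1 over the grid.
cell_area = (xs[1] - xs[0]) * (ys[1] - ys[0])
mass = np.exp(log_p).sum() * cell_area
print(log_p.shape, mass)
```

The resulting `log_p` array can be passed to a plotting routine such as matplotlib's `pcolormesh`. The mass check is a cheap way to catch a wrong sign on the log-determinant, since a broken flow typically integrates to something far from 1.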

Transformation Visualization

The transformation plots track 1,000 points from the base Gaussian through the inverse mapping into data space, color-coded by their angle in base space.

For RealNVP on Two Moons, the coupling layers progressively:

Stretch the Gaussian cloud along one axis (early layers).
Bend and fold the distribution into a crescent shape (middle layers).
Sharpen the density ridges and widen the gap between moons (final layers).

Each layer makes a small invertible adjustment; the composition builds up the full transformation.
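
The bookkeeping behind such a plot is simple: sample base points once, fix each point's color from its base-space angle, and record the point cloud after every layer. The sketch below uses two hand-written invertible maps as hypothetical stand-ins for trained layers; `trace_through_layers` is an illustrative helper, not a function from Part 2.

```python
import numpy as np

def trace_through_layers(z, layers):
    """Return [z, after layer 1, after layer 2, ...] in sampling order."""
    snapshots = [z]
    for layer in layers:
        snapshots.append(layer(snapshots[-1]))
    return snapshots

rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 2))
# Color is computed once in base space and reused for every snapshot,
# so each point can be followed visually from panel to panel.
angle = np.arctan2(z[:, 1], z[:, 0])

def stretch(x):
    """Stand-in layer: scale the two axes differently."""
    return x * np.array([2.0, 0.5])

def bend(x):
    """Stand-in layer: shift the second coordinate by a function of the first."""
    return np.stack([x[:, 0], x[:, 1] + 0.3 * x[:, 0] ** 2], axis=1)

snapshots = trace_through_layers(z, [stretch, bend])
print(len(snapshots), snapshots[-1].shape)
```

Each entry of `snapshots` can then be drawn as a scatter plot colored by `angle`, one panel per layer.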

Connections to Other Generative Models

Flows vs. VAEs

VAEs optimize the Evidence Lower Bound (ELBO), which is only a lower bound on the true log-likelihood. Normalizing flows can be used within VAEs to make the approximate posterior more flexible (as proposed by Rezende and Mohamed 2015), or they can replace VAEs entirely as standalone density estimators with exact likelihood.
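
Written side by side, the two objectives look as follows. The notation here is an assumption, since Part 1's symbols are not repeated in this part: $q_\phi$ is the VAE encoder, $p_\theta$ the decoder, $f$ the flow's data-to-latent map, and $p_Z$ the base density.

$$
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
$$

$$
\log p_X(x) \;=\; \log p_Z\big(f(x)\big) + \log\left|\det \frac{\partial f(x)}{\partial x}\right|
$$

The gap in the first line equals $\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big)$, which is why a more flexible approximate posterior tightens the bound. The second line is an equality, so the NLL a flow reports is the quantity it is actually optimizing.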

Flows vs. GANs

GANs define an implicit generative model with no tractable density. You can sample but cannot evaluate probabilities, which makes them a poor fit for likelihood-based anomaly detection, compression, or anything else that requires density evaluation. Flows give you both sampling and density evaluation.

Flows vs. Diffusion Models

Diffusion models such as DDPMs optimize a variational bound, much as VAEs do, and generate a sample by iterative denoising, typically over hundreds or thousands of steps in the original formulation. Normalizing flows generate samples in a single pass through the inverse mapping. However, diffusion models are generally more expressive for high-dimensional data such as images because they are not constrained to invertible architectures.

The Invertibility Trade-Off

The fundamental constraint is architectural: every layer must be invertible with a tractable Jacobian determinant, which limits the design space relative to unconstrained networks. Affine coupling layers satisfy both requirements by passing half the dimensions through unchanged and applying to the other half an elementwise scale and shift computed from the unchanged half. The Jacobian is then triangular and the layer inverts in closed form, but even so, the class of learnable transformations remains restricted.
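
To make the constraint concrete, here is a minimal NumPy sketch of a single 2D affine coupling layer: the first coordinate passes through unchanged, and the second is scaled and shifted by functions of the first. The functions `s_fn` and `t_fn` are fixed, hand-picked stand-ins for the learned networks, so this is an illustration of the mechanism rather than the `AffineCouplingLayer` from Part 2.

```python
import numpy as np

def s_fn(x1):
    """Hypothetical log-scale function (a network in a real model)."""
    return np.tanh(0.8 * x1)

def t_fn(x1):
    """Hypothetical shift function (a network in a real model)."""
    return 0.5 * x1 ** 2

def coupling_forward(x):
    """y1 = x1, y2 = x2 * exp(s(x1)) + t(x1); returns y and log|det J|."""
    x1, x2 = x[:, :1], x[:, 1:]
    s, t = s_fn(x1), t_fn(x1)
    y2 = x2 * np.exp(s) + t
    log_det = s.sum(axis=1)  # triangular Jacobian: the diagonal is (1, exp(s))
    return np.concatenate([x1, y2], axis=1), log_det

def coupling_inverse(y):
    """Closed-form inverse; s and t are evaluated but never inverted."""
    y1, y2 = y[:, :1], y[:, 1:]
    s, t = s_fn(y1), t_fn(y1)
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2], axis=1)

def numerical_log_abs_det(point, eps=1e-6):
    """Finite-difference Jacobian at one point, as an independent check."""
    jac = np.zeros((2, 2))
    for j in range(2):
        step = np.zeros(2)
        step[j] = eps
        plus, _ = coupling_forward((point + step)[None, :])
        minus, _ = coupling_forward((point - step)[None, :])
        jac[:, j] = (plus[0] - minus[0]) / (2.0 * eps)
    return np.log(abs(np.linalg.det(jac)))

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 2))
y, log_det = coupling_forward(x)
roundtrip_error = np.abs(coupling_inverse(y) - x).max()
det_error = abs(numerical_log_abs_det(x[0]) - log_det[0])
print(roundtrip_error, det_error)
```

Note that `s_fn` and `t_fn` can be arbitrarily complicated without affecting invertibility, because the inverse only ever evaluates them on the unchanged half. The price is that a single layer cannot alter that half at all, which is why masks alternate between layers.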

Summary

Part 1: Change of variables formula, rank-1 determinant trick for planar flows, triangular Jacobian for coupling layers.
Part 2: PlanarFlow, AffineCouplingLayer, BatchNormFlow, and RealNVP (8 layers, alternating masks) in pure PyTorch.
Part 3: RealNVP reaches NLL 1.34 on Two Moons vs. 2.92 for the planar flow. The roughly 450x parameter gap (71,744 vs. 160) shows up where you would expect -- in how sharply the model can carve out multi-modal structure.

Flows show that exact likelihood computation is feasible in deep generative models. Diffusion models have largely taken over for image generation, but the core principles here -- invertibility, Jacobian computation, change of variables -- underpin a lot of what came after.
