Deconstructing Normalizing Flows from Scratch

Part 2: PyTorch Implementation

Introduction

In Part 1, we established the mathematical foundation of normalizing flows: the change of variables formula for exact likelihood, the rank-1 determinant trick for planar flows, and the triangular Jacobian for affine coupling layers.

Today, we implement everything in pure PyTorch. No external normalizing flow libraries. We build four modules from scratch: PlanarFlow, AffineCouplingLayer, BatchNormFlow, and RealNVP.

Component 1: Planar Flow

A planar flow transforms $z' = z + u \cdot \tanh(w^\top z + b)$. The implementation requires three key pieces: the forward transformation, the log-determinant computation, and the invertibility constraint.
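As a reminder from Part 1, the Jacobian determinant reduces to a scalar expression via the rank-1 trick, which the code below computes directly (with $\hat{u}$ the projected version of $u$ introduced next):

$$ \log \left| \det \frac{\partial z'}{\partial z} \right| = \log \left| 1 + \hat{u}^\top \psi(z) \right|, \qquad \psi(z) = \left( 1 - \tanh^2(w^\top z + b) \right) w $$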

The Invertibility Constraint

The unconstrained parameters $w, u, b$ do not guarantee invertibility. We enforce $w^\top \hat{u} \geq -1$ using:

def _get_u_hat(self):
    # Project u so that w^T u_hat > -1, which guarantees invertibility.
    wtu = torch.dot(self.w, self.u)
    m_wtu = -1.0 + F.softplus(wtu)        # m(w^T u) lies in (-1, inf)
    u_hat = self.u + (m_wtu - wtu) * self.w / (self.w @ self.w + 1e-8)
    return u_hat

The softplus-based map $m(x) = -1 + \log(1 + e^x)$ smoothly maps any real number to $(-1, \infty)$, so the projected $\hat{u}$ satisfies $w^\top \hat{u} > -1$ no matter where gradient-based optimization drives the unconstrained parameters.
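A quick numerical check of the projection. This free-function version of `_get_u_hat` (an assumption for illustration; the article keeps it as a method) shows that even a $u$ violating the constraint is mapped back inside the invertible region:

```python
import torch
import torch.nn.functional as F

def get_u_hat(u, w):
    # Free-function version of the _get_u_hat method above.
    wtu = torch.dot(w, u)
    m_wtu = -1.0 + F.softplus(wtu)
    return u + (m_wtu - wtu) * w / (w @ w + 1e-8)

w = torch.tensor([1.0, 2.0])
u = torch.tensor([-1.0, -1.0])      # w^T u = -3: violates the constraint
u_hat = get_u_hat(u, w)             # w^T u_hat = -1 + softplus(-3) > -1
```

The projection leaves $u$ untouched whenever the constraint already holds, since then $m(w^\top u) - w^\top u$ is small.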

Forward Pass and Log-Determinant

def forward(self, z):
    u_hat = self._get_u_hat()
    linear = z @ self.w + self.b              # (batch,)
    h = torch.tanh(linear)
    z_prime = z + u_hat * h.unsqueeze(1)      # (batch, d)

    # psi(z) = h'(w^T z + b) * w, with h'(x) = 1 - tanh(x)^2
    psi = (1 - h**2).unsqueeze(1) * self.w    # (batch, d)
    det = 1.0 + (psi * u_hat).sum(dim=1)      # (batch,)
    log_det = torch.log(torch.abs(det) + 1e-8)
    return z_prime, log_det

A single planar flow has only $2d + 1$ parameters (5 parameters for $d=2$). We stack $K=32$ flows for a total of 160 parameters -- extremely lightweight but limited in expressiveness.
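Here is a minimal sketch of how the pieces above assemble into a stack. The class wrapper and the small-random initialization are assumptions for illustration; the method bodies are the ones shown above. Log-determinants simply add across the stack:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlanarFlow(nn.Module):
    # Sketch combining _get_u_hat and forward from above.
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim) * 0.1)
        self.u = nn.Parameter(torch.randn(dim) * 0.1)
        self.b = nn.Parameter(torch.zeros(1))

    def _get_u_hat(self):
        wtu = torch.dot(self.w, self.u)
        m_wtu = -1.0 + F.softplus(wtu)
        return self.u + (m_wtu - wtu) * self.w / (self.w @ self.w + 1e-8)

    def forward(self, z):
        u_hat = self._get_u_hat()
        linear = z @ self.w + self.b
        h = torch.tanh(linear)
        z_prime = z + u_hat * h.unsqueeze(1)
        psi = (1 - h**2).unsqueeze(1) * self.w
        det = 1.0 + (psi * u_hat).sum(dim=1)
        return z_prime, torch.log(torch.abs(det) + 1e-8)

# Stack K = 32 flows; log-determinants accumulate additively.
flows = nn.ModuleList([PlanarFlow(dim=2) for _ in range(32)])
z = torch.randn(128, 2)
total_log_det = torch.zeros(128)
for flow in flows:
    z, log_det = flow(z)
    total_log_det = total_log_det + log_det

n_params = sum(p.numel() for p in flows.parameters())   # 32 * (2*2 + 1) = 160
```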

Component 2: Affine Coupling Layer

The coupling layer is the workhorse of modern normalizing flows. It uses a binary mask to split the input, keeps the masked dimensions fixed, and transforms the rest using learned scale ($s$) and translation ($t$) networks.

Mask Construction

# Even mask: [1, 0, 1, 0, ...]
mask = torch.zeros(dim)
mask[::2] = 1.0

The masked input x * mask is passed to the $s$ and $t$ networks, and the outputs are zeroed on masked dimensions with * (1 - mask). This ensures the masked dimensions pass through unchanged.
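A toy illustration of the two masking steps in $d = 4$ (the constant `net_out` tensor is a stand-in for a real network output, an assumption for illustration):

```python
import torch

dim = 4
mask = torch.zeros(dim)
mask[::2] = 1.0                        # even mask: [1, 0, 1, 0]

x = torch.tensor([[1.0, 2.0, 3.0, 4.0]])
x_masked = x * mask                    # only masked dims reach s_net / t_net
net_out = torch.full((1, dim), 7.0)    # stand-in for s_net(x_masked)
s = net_out * (1 - mask)               # zeroed on masked dims
```

The two complementary multiplications guarantee that $s$ and $t$ depend only on the masked half and act only on the unmasked half.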

Scale and Translation Networks

s_net = nn.Sequential(
    nn.Linear(dim, hidden_dim),
    nn.LeakyReLU(0.2),
    nn.Linear(hidden_dim, hidden_dim),
    nn.LeakyReLU(0.2),
    nn.Linear(hidden_dim, dim),
    nn.Tanh()  # Clamp scale to [-1, 1]
)

The final Tanh on the scale network is critical for training stability. Without it, the scale can grow unboundedly during early training, causing numerical overflow in $\exp(s)$. Clamping $s$ to $(-1, 1)$ keeps the affine scaling factor $\exp(s)$ within $(e^{-1}, e) \approx (0.37, 2.72)$.

Forward and Inverse

def forward(self, x):
    x_masked = x * self.mask
    s = self.s_net(x_masked) * (1 - self.mask)   # s = 0 on masked dims
    t = self.t_net(x_masked) * (1 - self.mask)   # t = 0 on masked dims
    # On masked dims exp(s) = 1 and t = 0, so y = x there.
    y = x * torch.exp(s) + t
    log_det = s.sum(dim=1)   # masked dims contribute s = 0
    return y, log_det

def inverse(self, y):
    # The masked dims of y equal those of x, so s and t are
    # recomputed identically from y_masked.
    y_masked = y * self.mask
    s = self.s_net(y_masked) * (1 - self.mask)
    t = self.t_net(y_masked) * (1 - self.mask)
    x = (y - t) * torch.exp(-s)
    return x

Note that the inverse uses the same networks $s$ and $t$ with the same parameters -- no separate inverse network is needed. This is the key advantage of the coupling architecture.
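The round-trip property is easy to verify numerically. Below is a minimal self-contained sketch of the layer (the constructor layout and hidden size are assumptions; the `forward`/`inverse` bodies match the ones above):

```python
import torch
import torch.nn as nn

class AffineCouplingLayer(nn.Module):
    # Minimal sketch; constructor details are assumptions.
    def __init__(self, dim, hidden_dim=32):
        super().__init__()
        mask = torch.zeros(dim)
        mask[::2] = 1.0                      # even mask
        self.register_buffer('mask', mask)
        self.s_net = nn.Sequential(
            nn.Linear(dim, hidden_dim), nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, dim), nn.Tanh())
        self.t_net = nn.Sequential(
            nn.Linear(dim, hidden_dim), nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, dim))

    def forward(self, x):
        x_masked = x * self.mask
        s = self.s_net(x_masked) * (1 - self.mask)
        t = self.t_net(x_masked) * (1 - self.mask)
        return x * torch.exp(s) + t, s.sum(dim=1)

    def inverse(self, y):
        y_masked = y * self.mask
        s = self.s_net(y_masked) * (1 - self.mask)
        t = self.t_net(y_masked) * (1 - self.mask)
        return (y - t) * torch.exp(-s)

layer = AffineCouplingLayer(dim=2)
x = torch.randn(64, 2)
y, log_det = layer(x)
x_rec = layer.inverse(y)   # same s_net/t_net, no separate inverse network
```

Because the masked dimensions of `y` equal those of `x`, the inverse sees the same network inputs and recovers `x` exactly (up to floating-point error).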

Component 3: Batch Normalization for Flows

Standard batch normalization is not usable in a flow as-is: during training its output for one sample depends on the whole batch, and exact likelihood computation requires an explicit log-determinant. However, we can design a flow-compatible version that normalizes activations between coupling layers and contributes its own log-determinant:

$$ y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \exp(\gamma) + \beta $$

The log-determinant is:

$$ \log |\det J| = \sum_i \left( \gamma_i - \frac{1}{2} \log(\sigma_i^2 + \epsilon) \right) $$

At training time, $\mu$ and $\sigma^2$ are batch statistics; at evaluation time, we use exponential moving averages. This stabilizes training significantly for deep flow stacks.
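The equations above translate into a short module. This is a sketch under stated assumptions: the buffer names, the momentum value, and the use of `lerp_` for the moving averages are choices of this example, not prescribed by the article:

```python
import torch
import torch.nn as nn

class BatchNormFlow(nn.Module):
    # Sketch of the flow-compatible batch norm described above.
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(dim))   # scale applied as exp(gamma)
        self.beta = nn.Parameter(torch.zeros(dim))
        self.eps, self.momentum = eps, momentum
        self.register_buffer('running_mean', torch.zeros(dim))
        self.register_buffer('running_var', torch.ones(dim))

    def forward(self, x):
        if self.training:
            mean, var = x.mean(dim=0), x.var(dim=0, unbiased=False)
            with torch.no_grad():   # update moving averages outside autograd
                self.running_mean.lerp_(mean, self.momentum)
                self.running_var.lerp_(var, self.momentum)
        else:
            mean, var = self.running_mean, self.running_var
        y = (x - mean) / torch.sqrt(var + self.eps) * torch.exp(self.gamma) + self.beta
        # log|det J| = sum_i (gamma_i - 0.5 * log(var_i + eps)); same for every sample
        log_det = (self.gamma - 0.5 * torch.log(var + self.eps)).sum()
        return y, log_det.expand(x.shape[0])

bn = BatchNormFlow(dim=2)
x = torch.randn(256, 2) * 3.0 + 1.0
y, log_det = bn(x)   # training mode: batch statistics
```

With $\gamma = \beta = 0$ at initialization, the layer simply whitens its input, which is exactly what stabilizes deep stacks early in training.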

Component 4: RealNVP

The complete RealNVP model stacks 8 affine coupling layers with alternating even/odd masks, interleaved with batch normalization:

class RealNVP(nn.Module):
    def __init__(self, dim, hidden_dim=64, num_layers=8):
        super().__init__()  # required before registering submodules
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            mask = 'even' if i % 2 == 0 else 'odd'
            self.layers.append(AffineCouplingLayer(dim, hidden_dim, mask))
            self.layers.append(BatchNormFlow(dim))

The mask alternation is essential. In even-indexed layers, dimensions $\{0, 2, 4, \ldots\}$ are fixed while $\{1, 3, 5, \ldots\}$ are transformed. In odd-indexed layers, the roles reverse. After two consecutive layers, every dimension has been transformed at least once.

Training Objective

The training loop is pure maximum likelihood estimation:

optimizer.zero_grad()
log_prob = model.log_prob(batch)
loss = -log_prob.mean()  # Minimize NLL
loss.backward()
optimizer.step()

No reconstruction loss. No KL divergence. No discriminator. The objective directly maximizes the exact log-likelihood of the data under the model.
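To see the objective in isolation, here is the same loop run end to end on a trivial stand-in model (an assumption for illustration, not the RealNVP above): a single learnable elementwise affine flow, whose `log_prob` follows the change of variables formula exactly:

```python
import math
import torch
import torch.nn as nn

class AffineFlow(nn.Module):
    # Stand-in flow: x = z * exp(s) + t with a standard normal base,
    # so the exact log-likelihood is available in closed form.
    def __init__(self, dim):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(dim))
        self.t = nn.Parameter(torch.zeros(dim))

    def log_prob(self, x):
        z = (x - self.t) * torch.exp(-self.s)   # inverse pass
        log_base = -0.5 * (z ** 2).sum(dim=1) - 0.5 * x.shape[1] * math.log(2 * math.pi)
        return log_base - self.s.sum()           # + log|det dz/dx| = -sum(s)

torch.manual_seed(0)
model = AffineFlow(dim=2)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-2)

data = torch.randn(512, 2) * 2.0 + 3.0   # toy target: N(3, 2^2)
for step in range(1000):
    optimizer.zero_grad()
    loss = -model.log_prob(data).mean()   # negative log-likelihood
    loss.backward()
    optimizer.step()
```

Maximum likelihood drives `t` toward the data mean and `exp(s)` toward the data standard deviation; the identical loop trains RealNVP, with only `model` swapped out.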

Architecture Summary

Our RealNVP with $d=2$, hidden_dim=64, and 8 coupling layers has 71,744 trainable parameters. The planar flow with $K=32$ layers has only 160 parameters. This $450\times$ difference in capacity will be clearly reflected in the density estimation quality in Part 3.
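The parameter count can be verified directly from the network definitions (assuming each BatchNormFlow holds just $\gamma$ and $\beta$ of size $d$):

```python
import torch.nn as nn

def net_params(dim=2, hidden_dim=64):
    # One s_net or t_net exactly as defined in the coupling layer section.
    net = nn.Sequential(
        nn.Linear(dim, hidden_dim), nn.LeakyReLU(0.2),
        nn.Linear(hidden_dim, hidden_dim), nn.LeakyReLU(0.2),
        nn.Linear(hidden_dim, dim), nn.Tanh())
    return sum(p.numel() for p in net.parameters())

per_coupling = 2 * net_params()   # s_net + t_net per coupling layer
per_batchnorm = 2 * 2             # gamma and beta, each of size dim = 2
total = 8 * (per_coupling + per_batchnorm)
```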

Next Steps: Watching Gaussians Become Moons

We now have a complete, tested implementation of two normalizing flow architectures. Every forward pass computes exact log-likelihoods. Every inverse pass generates new samples.

In Part 3, we will train both models on 2D density estimation benchmarks (Two Moons and Two Circles), visualize the learned density heatmaps, watch Gaussian samples warp into the target distribution, and compare the expressiveness of planar flows versus affine coupling layers.