Deconstructing Normalizing Flows: Part 2 - PyTorch Implementation

Introduction

In Part 1, we established the mathematical foundation of normalizing flows: the change of variables formula for exact likelihood, the rank-1 determinant trick for planar flows, and the triangular Jacobian for affine coupling layers.

Today, we implement everything in pure PyTorch. No external normalizing flow libraries. We build four modules from scratch: PlanarFlow, AffineCouplingLayer, BatchNormFlow, and RealNVP.

Component 1: Planar Flow

A planar flow transforms $z' = z + u \cdot \tanh(w^\top z + b)$. The implementation requires three key pieces: the forward transformation, the log-determinant computation, and the invertibility constraint.

The Invertibility Constraint

The unconstrained parameters $w, u, b$ do not guarantee invertibility. We enforce $w^\top \hat{u} \geq -1$ using:

def _get_u_hat(self):
    wtu = torch.dot(self.w, self.u)
    m_wtu = -1.0 + F.softplus(wtu)
    u_hat = self.u + (m_wtu - wtu) * self.w / (self.w @ self.w + 1e-8)
    return u_hat

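The function $m(x) = -1 + \log(1 + e^x)$, a softplus shifted down by one, smoothly maps any real number into $(-1, \infty)$. The update replaces the component of $u$ along $w$ so that $w^\top \hat{u} = m(w^\top u)$, which keeps the constraint satisfied throughout gradient-based optimization.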
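As a quick sanity check, the sketch below applies the same reparameterization to a pair of vectors that deliberately violates the constraint. The free function `constrained_u` and the example values of `w` and `u` are illustrative assumptions, not part of the module itself.

```python
import torch
import torch.nn.functional as F

def constrained_u(w, u):
    """Same reparameterization as _get_u_hat, written as a free function."""
    wtu = torch.dot(w, u)
    m_wtu = -1.0 + F.softplus(wtu)           # m(w^T u), always above -1
    return u + (m_wtu - wtu) * w / (w @ w + 1e-8)

w = torch.tensor([1.0, 0.0])
u = torch.tensor([-3.0, 0.0])                 # w^T u = -3 violates the constraint
u_hat = constrained_u(w, u)

print(torch.dot(w, u).item())                 # -3.0
print(torch.dot(w, u_hat).item())             # about -0.951, i.e. m(-3)
```

Only the component of `u` parallel to `w` changes; any component orthogonal to `w` is left as it was.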

Forward Pass and Log-Determinant

def forward(self, z):
    u_hat = self._get_u_hat()
    linear = z @ self.w + self.b          # (batch,)
    z_prime = z + u_hat * torch.tanh(linear).unsqueeze(1)

    # psi = h'(w^T z + b) * w
    psi = (1 - torch.tanh(linear)**2).unsqueeze(1) * self.w
    det = 1.0 + (psi * u_hat).sum(dim=1)  # (batch,)
    log_det = torch.log(torch.abs(det) + 1e-8)
    return z_prime, log_det

A single planar flow has only $2d + 1$ parameters (5 for $d=2$). Stacking $K=32$ flows gives $32 \times 5 = 160$ parameters in total: extremely lightweight, but limited in expressiveness.
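To see the pieces working together, here is a minimal self-contained sketch of the module, with the closed-form log-determinant compared against a Jacobian computed by autograd. The constructor and the parameter initialization (small random `w` and `u`, zero `b`) are assumptions made for illustration; only `_get_u_hat` and `forward` follow the snippets above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlanarFlow(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Assumed initialization: small random w, u and zero bias.
        self.w = nn.Parameter(torch.randn(dim) * 0.1)
        self.u = nn.Parameter(torch.randn(dim) * 0.1)
        self.b = nn.Parameter(torch.zeros(1))

    def _get_u_hat(self):
        wtu = torch.dot(self.w, self.u)
        m_wtu = -1.0 + F.softplus(wtu)
        return self.u + (m_wtu - wtu) * self.w / (self.w @ self.w + 1e-8)

    def forward(self, z):
        u_hat = self._get_u_hat()
        linear = z @ self.w + self.b                       # (batch,)
        z_prime = z + u_hat * torch.tanh(linear).unsqueeze(1)
        psi = (1 - torch.tanh(linear) ** 2).unsqueeze(1) * self.w
        det = 1.0 + (psi * u_hat).sum(dim=1)               # rank-1 determinant
        return z_prime, torch.log(torch.abs(det) + 1e-8)

torch.manual_seed(0)
flow = PlanarFlow(dim=2)
z = torch.randn(4, 2)
z_prime, log_det = flow(z)

# Reference value: full autograd Jacobian of the map at the first sample.
jac = torch.autograd.functional.jacobian(
    lambda v: flow(v.unsqueeze(0))[0].squeeze(0), z[0]
)
reference = torch.log(torch.abs(torch.det(jac)))

n_params = sum(p.numel() for p in flow.parameters())       # 2d + 1
```

The autograd Jacobian costs far more than the rank-1 formula, so it is useful as a test but not as a replacement.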

Component 2: Affine Coupling Layer

The coupling layer does the heavy lifting in modern normalizing flows. A binary mask splits the input: masked dimensions pass through unchanged, and the rest are transformed by learned scale ($s$) and translation ($t$) networks.

Mask Construction

# Even mask: [1, 0, 1, 0, ...]
mask = torch.zeros(dim)
mask[::2] = 1.0

The masked input x * mask is passed to the $s$ and $t$ networks, and their outputs are zeroed on the masked dimensions with * (1 - mask). On those dimensions $s = 0$ and $t = 0$, so $y = x \cdot e^{0} + 0 = x$: the masked dimensions pass through unchanged.
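The following sketch makes the masking concrete on a five-dimensional input. The helper `build_mask` and the definition of the odd mask as the complement `[0, 1, 0, 1, ...]` are assumptions for illustration; the constant `raw_out` stands in for a network output.

```python
import torch

def build_mask(dim, kind="even"):
    """Hypothetical helper: 1.0 marks pass-through dims, 0.0 marks transformed dims."""
    mask = torch.zeros(dim)
    start = 0 if kind == "even" else 1
    mask[start::2] = 1.0
    return mask

even = build_mask(5, "even")          # [1, 0, 1, 0, 1]
odd = build_mask(5, "odd")            # [0, 1, 0, 1, 0]

x = torch.arange(1.0, 6.0)            # [1, 2, 3, 4, 5]
conditioner_input = x * even          # [1, 0, 3, 0, 5]: what s_net and t_net see

raw_out = torch.full((5,), 0.7)       # placeholder for a network output
s = raw_out * (1 - even)              # [0, 0.7, 0, 0.7, 0]: zero on masked dims
```

Because the two masks sum to all ones, a layer with the even mask followed by a layer with the odd mask touches every dimension.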

Scale and Translation Networks

s_net = nn.Sequential(
    nn.Linear(dim, hidden_dim),
    nn.LeakyReLU(0.2),
    nn.Linear(hidden_dim, hidden_dim),
    nn.LeakyReLU(0.2),
    nn.Linear(hidden_dim, dim),
    nn.Tanh()  # Squash scale into (-1, 1)
)

The final Tanh on the scale network is critical for training stability. Without it, the scale can explode during early training, causing numerical overflow in $\exp(s)$. Bounding $s$ to $(-1, 1)$ keeps the affine scaling factor $\exp(s)$ within $(e^{-1}, e) \approx (0.37, 2.72)$.
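A small numeric illustration of that bound, using made-up pre-activation values rather than outputs of a trained network:

```python
import math
import torch

raw = torch.tensor([-100.0, -2.0, 0.0, 2.0, 100.0])  # hypothetical pre-activations

s = torch.tanh(raw)             # squashed into [-1, 1] in floating point
scale = torch.exp(s)            # stays between 1/e and e

unbounded = torch.exp(raw)      # what exp would see without the Tanh
print(scale.min().item(), scale.max().item())
print(unbounded[-1].item())     # inf: float32 overflows for inputs above ~88.7
```

In floating point the Tanh saturates to exactly ±1 for large inputs, so the observed scale can reach the endpoints $e^{-1}$ and $e$ but never leaves them.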

Forward and Inverse

def forward(self, x):
    x_masked = x * self.mask
    s = self.s_net(x_masked) * (1 - self.mask)
    t = self.t_net(x_masked) * (1 - self.mask)
    y = x * torch.exp(s) + t
    log_det = s.sum(dim=1)
    return y, log_det

def inverse(self, y):
    y_masked = y * self.mask
    s = self.s_net(y_masked) * (1 - self.mask)
    t = self.t_net(y_masked) * (1 - self.mask)
    x = (y - t) * torch.exp(-s)
    return x

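The inverse reuses the same $s$ and $t$ networks with the same parameters, so no separate inverse network is needed. Because the masked dimensions pass through unchanged, the inverse recomputes exactly the same $s$ and $t$ from $y$. That is the main advantage of the coupling design.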
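Putting the mask, the two networks, and the forward/inverse pair together gives a layer along the following lines. This is a sketch under assumptions: the constructor signature, the `_mlp` and `_scale_shift` helpers, and a `t_net` that mirrors `s_net` without the final Tanh are illustrative choices, not something the snippets above pin down.

```python
import torch
import torch.nn as nn

class AffineCouplingLayer(nn.Module):
    def __init__(self, dim, hidden_dim=64, mask_type="even"):
        super().__init__()
        mask = torch.zeros(dim)
        mask[(0 if mask_type == "even" else 1)::2] = 1.0
        self.register_buffer("mask", mask)     # moves with .to(device), not trained
        self.s_net = self._mlp(dim, hidden_dim, final_tanh=True)
        self.t_net = self._mlp(dim, hidden_dim, final_tanh=False)

    @staticmethod
    def _mlp(dim, hidden_dim, final_tanh):
        layers = [nn.Linear(dim, hidden_dim), nn.LeakyReLU(0.2),
                  nn.Linear(hidden_dim, hidden_dim), nn.LeakyReLU(0.2),
                  nn.Linear(hidden_dim, dim)]
        if final_tanh:
            layers.append(nn.Tanh())
        return nn.Sequential(*layers)

    def _scale_shift(self, v):
        # Condition on the masked dims only; zero the outputs on those dims.
        v_masked = v * self.mask
        s = self.s_net(v_masked) * (1 - self.mask)
        t = self.t_net(v_masked) * (1 - self.mask)
        return s, t

    def forward(self, x):
        s, t = self._scale_shift(x)
        return x * torch.exp(s) + t, s.sum(dim=1)

    def inverse(self, y):
        s, t = self._scale_shift(y)
        return (y - t) * torch.exp(-s)

torch.manual_seed(0)
layer = AffineCouplingLayer(dim=4)
x = torch.randn(8, 4)
y, log_det = layer(x)
x_rec = layer.inverse(y)

print(torch.allclose(x_rec, x, atol=1e-5))     # round trip recovers the input
print(torch.equal(y[:, ::2], x[:, ::2]))       # masked dims are untouched
```

Registering the mask as a buffer keeps it out of the optimizer while still saving it in the state dict.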

Component 3: Batch Normalization for Flows

Standard batch normalization is not invertible. But we can build a flow-compatible version that normalizes activations between coupling layers and contributes its own log-determinant:

$$y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \exp(\gamma) + \beta$$

The log-determinant is:

$$\log |\det J| = \sum_i \left( \gamma_i - \frac{1}{2} \log(\sigma_i^2 + \epsilon) \right)$$

During training, $\mu$ and $\sigma^2$ come from batch statistics; at evaluation time, we use exponential moving averages. This stabilizes training for deep flow stacks.
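One way to turn these two formulas into a module is sketched below. The class layout, the `momentum` and `eps` defaults, and the use of biased batch variance are assumptions; the forward map and log-determinant follow the equations above, with the batch statistics treated as constants when computing the Jacobian.

```python
import torch
import torch.nn as nn

class BatchNormFlow(nn.Module):
    def __init__(self, dim, momentum=0.1, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(dim))    # log-scale
        self.beta = nn.Parameter(torch.zeros(dim))     # shift
        self.momentum, self.eps = momentum, eps
        self.register_buffer("running_mean", torch.zeros(dim))
        self.register_buffer("running_var", torch.ones(dim))

    def forward(self, x):
        if self.training:
            mean, var = x.mean(dim=0), x.var(dim=0, unbiased=False)
            with torch.no_grad():  # exponential moving averages for eval mode
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * mean.detach())
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * var.detach())
        else:
            mean, var = self.running_mean, self.running_var
        y = (x - mean) / torch.sqrt(var + self.eps) * torch.exp(self.gamma) + self.beta
        # Same value for every sample: sum_i (gamma_i - 0.5 * log(var_i + eps))
        log_det = (self.gamma - 0.5 * torch.log(var + self.eps)).sum()
        return y, log_det.expand(x.shape[0])

    def inverse(self, y):
        # Uses running statistics, so it matches forward() in eval mode.
        std = torch.sqrt(self.running_var + self.eps)
        return (y - self.beta) * torch.exp(-self.gamma) * std + self.running_mean

torch.manual_seed(0)
bn = BatchNormFlow(dim=2)
x = torch.randn(256, 2) * 3.0 + 1.0

bn.train()
y_train, log_det_train = bn(x)     # normalized with batch statistics

bn.eval()
y_eval, log_det_eval = bn(x)       # normalized with running averages
x_rec = bn.inverse(y_eval)         # exact inverse of the eval-mode forward
```

Parameterizing the scale as $\exp(\gamma)$ keeps it strictly positive, so the layer stays invertible without any extra constraint.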

Component 4: RealNVP

The complete RealNVP model stacks 8 affine coupling layers with alternating even/odd masks, interleaved with batch normalization:

class RealNVP(nn.Module):
    def __init__(self, dim, hidden_dim=64, num_layers=8):
        super().__init__()  # required before registering submodules
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # Alternate masks so consecutive layers transform complementary dims
            mask = 'even' if i % 2 == 0 else 'odd'
            self.layers.append(AffineCouplingLayer(dim, hidden_dim, mask))
            self.layers.append(BatchNormFlow(dim))

Mask alternation matters. Even-indexed layers fix dimensions $\{0, 2, 4, \ldots\}$ and transform $\{1, 3, 5, \ldots\}$. Odd-indexed layers reverse the roles. After two consecutive layers, every dimension has been transformed at least once.
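The constructor above only builds the layer list. A stack like this is typically traversed front to back while summing log-determinants, and back to front for the inverse. The sketch below shows that pattern with a hypothetical fixed `ScaleShift` layer standing in for the coupling and batch-norm layers, so that it runs on its own; `FlowStack` is likewise an illustrative name.

```python
import torch
import torch.nn as nn

class ScaleShift(nn.Module):
    """Hypothetical stand-in for a flow layer: y = x * exp(s) + t with fixed s, t."""
    def __init__(self, s, t):
        super().__init__()
        self.register_buffer("s", torch.tensor(s))
        self.register_buffer("t", torch.tensor(t))

    def forward(self, x):
        return x * torch.exp(self.s) + self.t, self.s.sum().expand(x.shape[0])

    def inverse(self, y):
        return (y - self.t) * torch.exp(-self.s)

class FlowStack(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        # Log-determinants of composed maps add up.
        log_det_total = torch.zeros(x.shape[0])
        for layer in self.layers:
            x, log_det = layer(x)
            log_det_total = log_det_total + log_det
        return x, log_det_total

    def inverse(self, z):
        # Undo the layers in the opposite order they were applied.
        for layer in reversed(list(self.layers)):
            z = layer.inverse(z)
        return z

torch.manual_seed(0)
stack = FlowStack([
    ScaleShift([0.5, -0.2], [1.0, 0.0]),
    ScaleShift([0.1, 0.3], [-2.0, 0.5]),
])
x = torch.randn(4, 2)
z, log_det = stack(x)
x_rec = stack.inverse(z)          # matches x up to floating-point error
```

Here the total log-determinant is $0.5 - 0.2 + 0.1 + 0.3 = 0.7$ for every sample, since the stand-in layers do not depend on the input.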

Training Objective

The training loop is pure maximum likelihood estimation:

log_prob = model.log_prob(batch)
loss = -log_prob.mean()  # Minimize NLL
loss.backward()

No reconstruction loss. No KL divergence. No discriminator. The objective directly maximizes the exact log-likelihood of the data under the model.
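For a complete picture of one optimization loop, the sketch below trains a deliberately tiny flow by minimizing the negative log-likelihood. The `AffineFlow` model, the synthetic Gaussian data, the Adam optimizer, and the hyperparameters are all assumptions chosen to keep the example short; they stand in for the models and datasets of this series.

```python
import math
import torch
import torch.nn as nn

class AffineFlow(nn.Module):
    """Hypothetical one-layer flow: z = x * exp(s) + t with a standard normal base."""
    def __init__(self, dim):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(dim))
        self.t = nn.Parameter(torch.zeros(dim))

    def log_prob(self, x):
        z = x * torch.exp(self.s) + self.t
        # log N(z; 0, I) plus the log-determinant of the map x -> z
        log_base = -0.5 * (z ** 2).sum(dim=1) - 0.5 * z.shape[1] * math.log(2 * math.pi)
        return log_base + self.s.sum()

torch.manual_seed(0)
data = torch.randn(512, 2) * 2.0 + 3.0       # synthetic data: mean 3, std 2

model = AffineFlow(dim=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

losses = []
for step in range(200):
    optimizer.zero_grad()
    loss = -model.log_prob(data).mean()      # negative log-likelihood
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])                 # the NLL should go down
```

The change of variables formula appears in `log_prob` as the base log-density of `z` plus the log-determinant, which for this layer is simply the sum of `s`.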

Architecture Summary

Our RealNVP with $d=2$, hidden_dim=64, and 8 coupling layers has 71,744 trainable parameters. The planar flow with $K=32$ layers has only 160. This roughly $450\times$ gap in parameter count ($71{,}744 / 160 \approx 448$) will be clearly reflected in the density estimation quality in Part 3.
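These totals can be reproduced by hand. The count below assumes that the translation network has the same three linear layers as the scale network and that each batch-norm layer contributes one $\gamma$ and one $\beta$ per dimension; under those assumptions the arithmetic matches the figures above.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one nn.Linear layer."""
    return n_in * n_out + n_out

def mlp_params(dim, hidden_dim):
    """Three linear layers: dim -> hidden -> hidden -> dim (activations add none)."""
    return (linear_params(dim, hidden_dim)
            + linear_params(hidden_dim, hidden_dim)
            + linear_params(hidden_dim, dim))

d, hidden_dim, num_layers, K = 2, 64, 8, 32

coupling = 2 * mlp_params(d, hidden_dim)       # s_net + t_net = 2 * 4,482
batchnorm = 2 * d                              # gamma and beta
realnvp_total = num_layers * (coupling + batchnorm)
planar_total = K * (2 * d + 1)                 # w, u in R^d plus scalar b

print(realnvp_total)                           # 71744
print(planar_total)                            # 160
print(round(realnvp_total / planar_total, 1))  # 448.4
```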

Next: Training and Evaluation

Both architectures are implemented and ready. Forward passes compute exact log-likelihoods; inverse passes generate samples.

In Part 3, we train both models on Two Moons and Two Circles, visualize learned density heatmaps, and compare how planar flows and affine coupling layers handle multi-modal distributions.
