Deconstructing CapsNets from Scratch

Part 1: The Math of Equivariance

Introduction

Convolutional Neural Networks achieve remarkable performance on visual tasks, yet they suffer from a fundamental flaw: max-pooling discards the spatial relationships between features. Capsule Networks, introduced by Sabour, Frosst, and Hinton (2017), address this by replacing scalar activations with vector outputs that encode both the probability and the pose of detected features.

This first installment introduces the mathematical foundations: why pooling loses information, how capsule vectors encode equivariant representations, and how dynamic routing by agreement enables compositional understanding.

The Problem with Pooling

What Max-Pooling Does

In a standard CNN, a max-pooling layer takes a spatial region (e.g., $2 \times 2$) and outputs only the maximum activation:

$$ y = \max(x_{1,1},\; x_{1,2},\; x_{2,1},\; x_{2,2}) $$

This achieves two goals: (1) reducing spatial dimensionality, and (2) providing a degree of translation invariance -- small shifts in the input do not change the maximum value.
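A minimal NumPy sketch makes the information loss concrete: two feature maps with the strong activation at different positions inside the same $2 \times 2$ window produce identical pooled outputs. The helper name `max_pool_2x2` is illustrative, not from any library.

```python
import numpy as np

def max_pool_2x2(x):
    """Max-pool a 2D array over non-overlapping 2x2 windows."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two 4x4 feature maps: the peak sits at different corners of the
# top-left window, but pooling cannot tell them apart.
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[1, 1] = 1.0

print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
```

The position of the maximum within each window is exactly the information that is discarded.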

What Max-Pooling Loses

The cost is significant. Max-pooling discards where within the pooling region the maximum occurred. After several pooling layers, the network knows that certain features exist but not where they are relative to each other.

Consider face detection. A CNN with pooling can detect "eyes exist" and "mouth exists," but it cannot easily verify that the eyes are above the mouth. A scrambled face -- eyes below the mouth, nose on the forehead -- activates the same pooled features as a normal face.

Invariance vs. Equivariance

This reveals a deeper issue: the distinction between invariance and equivariance.

Max-pooling pushes CNNs toward invariance. But for understanding spatial structure, we want equivariance: if the input rotates, the representation should rotate correspondingly, not remain unchanged.

Capsule Vectors: Encoding Pose

From Scalars to Vectors

The key idea of Capsule Networks is to replace scalar feature activations with vector outputs called capsules. Each capsule encodes:

  1. Existence: the length of the vector represents the probability that the feature is present.
  2. Pose: the orientation of the vector encodes the feature's instantiation parameters (rotation, scale, position, and so on).

For example, a capsule detecting an eye might output an 8-dimensional vector whose length (0.92) indicates high confidence that an eye is present, while its direction encodes that the eye is rotated 15 degrees, scaled by $0.8\times$, and shifted 3 pixels to the right.

Why Vectors Enable Equivariance

When the input image rotates, the capsule vector's orientation changes but its length (probability) remains stable. This is equivariance: the representation transforms in a predictable way corresponding to the input transformation.

A scalar neuron has no way to represent this. It can only say "eye: 0.92." A capsule says "eye: 0.92 probability, at pose $\mathbf{p}$."
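A toy 2-D example illustrates the split between probability and pose. Rotating the pose vector (as a rotated input would) changes its direction but leaves its length untouched; the values here are made up for illustration.

```python
import numpy as np

def rotate(v, theta):
    """Rotate a 2-D pose vector by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]]) @ v

u = np.array([0.8, 0.3])     # toy 2-D capsule output
v = rotate(u, np.pi / 6)     # input rotated by 30 degrees

# The direction changes (equivariant pose) ...
print(np.allclose(u, v))                                 # False
# ... but the length (detection probability) is preserved.
print(np.isclose(np.linalg.norm(u), np.linalg.norm(v)))  # True
```

This is the equivariance property in miniature: the representation transforms along with the input, while the existence signal stays stable.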

The Squashing Function

Since capsule lengths represent probabilities, we need a non-linearity that maps any vector to a vector with length in $[0, 1)$ while preserving orientation. This is the squashing function:

$$ \mathbf{v} = \frac{\|\mathbf{s}\|^2}{1 + \|\mathbf{s}\|^2} \cdot \frac{\mathbf{s}}{\|\mathbf{s}\|} $$

Dissecting the Formula

The squashing function is the product of two terms:

  1. Scaling factor: $\frac{\|\mathbf{s}\|^2}{1 + \|\mathbf{s}\|^2}$ -- when $\|\mathbf{s}\|$ is small: $\approx \|\mathbf{s}\|^2 \to 0$ (short vectors get shrunk further); when $\|\mathbf{s}\|$ is large: $\approx 1$ (long vectors get normalized to just below 1).
  2. Unit vector: $\frac{\mathbf{s}}{\|\mathbf{s}\|}$ -- preserves the direction (pose information).

The result: $\|\mathbf{v}\| \in [0, 1)$ for any input $\mathbf{s}$. Short vectors (low confidence) get pushed toward zero. Long vectors (high confidence) get pushed toward unit length.
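The two regimes are easy to verify numerically. Below is a direct NumPy transcription of the formula; the small `eps` added inside the square root is a common numerical-stability guard, not part of the original equation.

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squashing non-linearity: length in [0, 1), direction preserved."""
    sq_norm = np.sum(s ** 2)
    scale = sq_norm / (1.0 + sq_norm)          # scaling factor
    return scale * s / np.sqrt(sq_norm + eps)  # unit-vector term

short = squash(np.array([0.01, 0.0]))   # low confidence -> near zero
long_ = squash(np.array([100.0, 0.0]))  # high confidence -> near unit length

print(np.linalg.norm(short))  # ~1e-4
print(np.linalg.norm(long_))  # ~0.9999
```

Note that the output length equals the scaling factor itself, since the unit-vector term has length one.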

Comparison to Sigmoid and Softmax

The squashing function plays the role for vectors that the sigmoid plays for scalars: it maps an unbounded input to a value interpretable as a probability. Unlike the sigmoid, it preserves a direction (the pose), and unlike softmax, it normalizes each capsule independently rather than forcing different capsules to compete for a fixed probability mass.

Dynamic Routing by Agreement

The Routing Problem

Given a layer of lower-level capsules (e.g., detecting eyes, noses, mouths) and a layer of higher-level capsules (e.g., detecting faces, houses), how should we decide which lower-level capsule sends its output to which higher-level capsule?

In standard CNNs, this is determined by the fixed connectivity pattern (receptive fields). Capsule Networks use a dynamic mechanism: routing by agreement.

Prediction Vectors

Each lower-level capsule $i$ produces a prediction vector $\hat{\mathbf{u}}_{j|i}$ for each higher-level capsule $j$:

$$ \hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij} \, \mathbf{u}_i $$

where $\mathbf{W}_{ij}$ is a learned transformation matrix and $\mathbf{u}_i$ is the output of capsule $i$. The prediction vector answers: "If capsule $i$ is an eye, and the eye belongs to face $j$, then face $j$ should have this pose."
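The prediction step is just a batched matrix-vector product, one learned matrix per (lower, upper) capsule pair. The sketch below uses made-up dimensions (6 lower capsules of size 8, 2 upper capsules of size 16, echoing the 8-D to 16-D capsules of Sabour et al.) and random weights in place of learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)

num_lower, num_upper = 6, 2   # e.g. 6 part capsules, 2 whole capsules
d_in, d_out = 8, 16           # capsule dimensions (toy values)

# One transformation matrix W[i, j] per (lower, upper) pair;
# random here, learned by backpropagation in a real network.
W = rng.normal(scale=0.1, size=(num_lower, num_upper, d_out, d_in))
u = rng.normal(size=(num_lower, d_in))   # lower-level capsule outputs

# u_hat[i, j] = W[i, j] @ u[i]: capsule i's prediction for capsule j.
u_hat = np.einsum('ijkl,il->ijk', W, u)
print(u_hat.shape)  # (6, 2, 16)
```

The `einsum` call computes all $\hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij}\mathbf{u}_i$ products in one shot, which is also how batched implementations avoid explicit loops.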

The Routing Algorithm

  1. Initialize logits $b_{ij} = 0$ for all pairs $(i, j)$.
  2. For each routing iteration $r = 1, 2, \ldots, R$:
    • Compute coupling coefficients: $c_{ij} = \text{softmax}_j(b_{ij})$
    • Compute weighted sum: $\mathbf{s}_j = \sum_i c_{ij} \hat{\mathbf{u}}_{j|i}$
    • Squash: $\mathbf{v}_j = \text{squash}(\mathbf{s}_j)$
    • Update logits: $b_{ij} \leftarrow b_{ij} + \hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j$
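The four steps above translate almost line for line into NumPy. This is a simplified sketch for a single example (no batch dimension, no gradient handling); the `u_hat` tensor is assumed to have shape `(num_lower, num_upper, d_out)` as in the prediction-vector example, and the shapes are toy values.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Vector-wise squashing non-linearity along `axis`."""
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def route(u_hat, num_iters=3):
    """Dynamic routing by agreement over predictions u_hat[i, j, :]."""
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))               # routing logits
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over j
        s = np.einsum('ij,ijk->jk', c, u_hat)          # weighted sum per j
        v = squash(s)                                  # candidate outputs
        b = b + np.einsum('ijk,jk->ij', u_hat, v)      # agreement update
    return v

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 2, 16))                    # toy predictions
v = route(u_hat)
print(v.shape)  # (2, 16)
```

Each iteration sharpens the coupling coefficients: predictions whose dot product with the current output $\mathbf{v}_j$ is large receive more routing weight on the next pass.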

Intuition: Voting and Consensus

The routing algorithm implements a voting scheme:

  1. Each lower-level capsule casts a vote, $\hat{\mathbf{u}}_{j|i}$, for the pose of each higher-level capsule.
  2. The dot product $\hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j$ measures how well that vote agrees with the current consensus $\mathbf{v}_j$.
  3. Votes that agree have their coupling coefficients increased; votes that disagree are suppressed.

This is why the mechanism is called "routing by agreement": information flows toward consensus.

Why Three Iterations?

Sabour et al. found that three routing iterations work well empirically. Fewer iterations do not allow sufficient refinement of the coupling coefficients. More iterations provide diminishing returns and increase computational cost.

Looking Ahead

With the mathematical foundations in place -- capsule vectors for equivariant representation, the squashing function for probability normalization, and dynamic routing for compositional part-whole reasoning -- we are ready to implement a full Capsule Network in PyTorch.

In Part 2, we will build PrimaryCapsule (converting conv features to capsule vectors), DigitCapsule (the routing algorithm in code, with all tensor shape details), and CapsuleLoss (margin loss for classification plus reconstruction regularization).

The implementation reveals challenges the math does not: tensor broadcasting with 5D weight matrices, gradient detachment for routing stability, and the careful balance between margin loss and reconstruction loss.