CNNs work well on vision tasks, but max-pooling throws away spatial relationships between features. Capsule Networks (Sabour, Frosst, and Hinton, 2017) replace scalar activations with vector outputs that encode both the probability and the pose of detected features.
This post covers the mathematical foundations: the information loss from pooling, capsule vectors as equivariant representations, and how dynamic routing by agreement enables part-whole reasoning.
The Problem with Pooling
What Max-Pooling Does
A max-pooling layer takes a spatial window $P$ (e.g., $2 \times 2$) of input activations $x_{i,j}$ and keeps only the maximum activation:
$$y = \max_{(i,j) \in P} x_{i,j}$$
This reduces spatial dimensionality and gives some translation invariance -- small shifts in the input don't change the output.
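To make this concrete, here is a minimal sketch of non-overlapping $2 \times 2$ max-pooling in plain Python. The function name `max_pool_2x2` and the two toy images are assumptions made for illustration; a real network would use a framework layer instead.

```python
def max_pool_2x2(image):
    """Non-overlapping 2x2 max-pooling on a 2D list with even height and width."""
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]

# Toy image: a strong activation (9) in the top-left pooling window.
original = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
]
# Same image with the 9 shifted one pixel right, still inside the same window.
shifted = [
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
]

pooled_original = max_pool_2x2(original)
pooled_shifted = max_pool_2x2(shifted)
print(pooled_original)  # 4x4 input reduced to 2x2
print(pooled_shifted == pooled_original)
```

Both inputs pool to the same $2 \times 2$ output because the shift stays inside one window. A shift that crosses a window boundary would change the output, which is why the invariance is only approximate.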
What Max-Pooling Loses
Max-pooling discards where within the pooling region the maximum occurred. Stack a few pooling layers and the network knows that certain features exist but not where they are relative to each other.
Think about face detection. A heavily pooled CNN can detect "eyes exist" and "mouth exists," but it can't easily check that the eyes are above the mouth. A scrambled face -- eyes below the mouth, nose on the forehead -- can fire essentially the same pooled features as a normal face.
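The scrambled-face problem can be sketched with hand-written feature maps. Everything here is a toy assumption: the `eye` and `mouth` maps stand in for learned detectors, and `global_max` models the extreme case where pooling has been stacked until each map collapses to a single number. Real CNNs keep some coarse location information, so treat this as the limiting case rather than a description of any particular model.

```python
def global_max(feature_map):
    """Collapse a 2D feature map to its single largest activation."""
    return max(max(row) for row in feature_map)

def pooled_features(feature_maps):
    """Map each named feature map to its globally pooled activation."""
    return {name: global_max(fmap) for name, fmap in feature_maps.items()}

# Hypothetical detector outputs; rows run from the top of the image to the bottom.
normal_face = {
    "eye":   [[0.9, 0.0, 0.9],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]],
    "mouth": [[0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.8, 0.0]],
}
# Same parts, wrong arrangement: eyes at the bottom, mouth at the top.
scrambled_face = {
    "eye":   [[0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.9, 0.0, 0.9]],
    "mouth": [[0.0, 0.8, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]],
}

normal_summary = pooled_features(normal_face)
scrambled_summary = pooled_features(scrambled_face)
print(normal_summary == scrambled_summary)
```

After pooling, the two summaries are identical, so any classifier built on top of them cannot tell the faces apart.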
Invariance vs. Equivariance
This points to a deeper distinction between invariance and equivariance.
- Invariance: The output stays the same under transformation. $f(T(x)) = f(x)$.
- Equivariance: The output transforms correspondingly. $f(T(x)) = T'(f(x))$.
Max-pooling pushes CNNs toward invariance. But understanding spatial structure requires equivariance: if the input rotates, the representation should rotate with it.
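The two definitions can be checked on a toy 1D signal. In this sketch the transformation $T$ is a circular shift, and the helper names (`shift`, `peak_value`, `peak_position`) are made up for the example. The peak value plays the role of an invariant $f$; the peak position is an equivariant $f$, with $T'$ being "add the shift to the index."

```python
def shift(signal, k):
    """Circularly shift a 1D signal k positions to the right (the transformation T)."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)

def peak_value(signal):
    """Invariant summary: how strong is the feature?"""
    return max(signal)

def peak_position(signal):
    """Equivariant summary: where is the feature?"""
    return signal.index(max(signal))

x = [0.1, 0.7, 0.2, 0.0, 0.0]
k = 2
shifted_x = shift(x, k)

# Invariance: f(T(x)) == f(x)
print(peak_value(shifted_x) == peak_value(x))

# Equivariance: f(T(x)) == T'(f(x)), where T' moves the index by k
print(peak_position(shifted_x) == (peak_position(x) + k) % len(x))
```

The invariant summary has forgotten the shift entirely, while the equivariant one carries it along in a predictable way.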
Capsule Vectors: Encoding Pose
From Scalars to Vectors
Capsule Networks replace scalar feature activations with vector outputs called capsules. Each capsule encodes two things:
- Length $\|\mathbf{v}\| \in [0, 1)$: The probability that the entity exists.
- Orientation: The pose of the entity -- position, rotation, scale, and other instantiation parameters.
A capsule detecting an eye might output an 8-dimensional vector where the length (0.92) means high confidence an eye is present, and the direction encodes that it's rotated 15 degrees, scaled at $0.8\times$, and shifted 3 pixels right.
Why Vectors Enable Equivariance
When the input image rotates, a well-trained capsule's orientation changes while its length (probability) stays roughly constant. That's equivariance: the representation transforms predictably with the input.
A scalar neuron can only say "eye: 0.92." A capsule says "eye: 0.92 probability, at pose $\mathbf{p}$."
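A deliberately simplified sketch of the length/direction split: assume a 2D capsule whose direction is literally the rotation angle of the detected part. That encoding is an assumption for illustration only; real capsules learn their own pose encoding across all dimensions, and nothing guarantees a single coordinate maps to a single physical parameter.

```python
import math

def length(v):
    """Euclidean norm, read as the existence probability."""
    return math.sqrt(sum(x * x for x in v))

def angle_degrees(v):
    """Direction of a 2D vector, read here as the part's rotation (toy assumption)."""
    return math.degrees(math.atan2(v[1], v[0]))

def rotate_2d(v, degrees):
    """Rotate a 2D vector counter-clockwise by the given angle."""
    t = math.radians(degrees)
    x, y = v
    return [x * math.cos(t) - y * math.sin(t), x * math.sin(t) + y * math.cos(t)]

# Toy capsule: probability 0.92, part rotated by 15 degrees.
capsule = [0.92 * math.cos(math.radians(15)), 0.92 * math.sin(math.radians(15))]

# Rotating the input by a further 30 degrees is modeled as rotating the vector.
rotated = rotate_2d(capsule, 30)

print(round(length(capsule), 6), round(angle_degrees(capsule), 6))
print(round(length(rotated), 6), round(angle_degrees(rotated), 6))
```

The length is unchanged by the rotation while the angle moves from 15 to 45 degrees, which is the behavior a scalar activation has no way to express.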
The Squashing Function
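Capsule lengths represent probabilities, so we need a non-linearity that maps any vector $\mathbf{s}$ to a vector $\mathbf{v}$ with length in $[0, 1)$ while preserving orientation. The squashing function:
$$\mathbf{v} = \frac{\|\mathbf{s}\|^2}{1 + \|\mathbf{s}\|^2} \, \frac{\mathbf{s}}{\|\mathbf{s}\|}$$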
Dissecting the Formula
Two terms multiplied together:
- Scaling factor: $\frac{\|\mathbf{s}\|^2}{1 + \|\mathbf{s}\|^2}$ -- small $\|\mathbf{s}\|$ gives $\approx \|\mathbf{s}\|^2 \to 0$ (short vectors shrink further); large $\|\mathbf{s}\|$ gives $\approx 1$ (long vectors cap just below 1).
- Unit vector: $\frac{\mathbf{s}}{\|\mathbf{s}\|}$ -- preserves direction (pose information).
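Because the unit vector has length 1, the output length equals the scaling factor: $\|\mathbf{v}\| = \frac{\|\mathbf{s}\|^2}{1 + \|\mathbf{s}\|^2}$. So $\|\mathbf{v}\| \in [0, 1)$ for any input $\mathbf{s}$. Low-confidence capsules get pushed toward zero; high-confidence ones approach unit length.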
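Here is a small plain-Python sketch of the squashing function, built from the two terms described above. The `eps` guard against division by zero is an assumption added for numerical safety and is not part of the formula itself.

```python
import math

def length(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def squash(s, eps=1e-12):
    """Scale s to a length in [0, 1) without changing its direction."""
    sq_norm = sum(x * x for x in s)
    norm = math.sqrt(sq_norm)
    # scaling factor ||s||^2 / (1 + ||s||^2), times the unit vector s / ||s||
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)  # eps: assumed zero-vector guard
    return [scale * x for x in s]

short_vec = squash([0.1, 0.0])    # input length 0.1
long_vec = squash([30.0, 40.0])   # input length 50

print(length(short_vec))  # 0.01 / 1.01, roughly 0.0099
print(length(long_vec))   # 2500 / 2501, roughly 0.9996
print(long_vec[0] / long_vec[1])  # direction ratio 30:40 is preserved
```

The short vector shrinks by another order of magnitude, while the long one lands just under 1, and in both cases the direction of the input survives.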
Comparison to Sigmoid and Softmax
- Sigmoid maps a scalar $\to (0,1)$. No directional information.
- Softmax maps a vector of scalars $\to$ probability distribution. No pose encoding.
- Squash maps a vector $\to$ vector with $\|\cdot\| < 1$. Preserves direction as pose.
Dynamic Routing by Agreement
The Routing Problem
Given lower-level capsules (detecting eyes, noses, mouths) and higher-level capsules (detecting faces, houses), how do we decide which lower capsule sends output to which higher capsule?
Standard CNNs use fixed connectivity (receptive fields). Capsule Networks use routing by agreement -- a dynamic mechanism.
Prediction Vectors
Each lower-level capsule $i$ produces a prediction vector $\hat{\mathbf{u}}_{j|i}$ for each higher-level capsule $j$:
$$\hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij} \mathbf{u}_i$$
$\mathbf{W}_{ij}$ is a learned transformation matrix, $\mathbf{u}_i$ is capsule $i$'s output. The prediction says: "If I'm an eye belonging to face $j$, then face $j$ should have this pose."
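As a rough sketch, the prediction step is one matrix-vector product per $(i, j)$ pair. The nested-list layout (`W[i][j]` for $\mathbf{W}_{ij}$, `u[i]` for $\mathbf{u}_i$), the helper names, and the tiny dimensions (2D lower capsules, 3D higher capsule) are all assumptions chosen to keep the example readable; the weights are hand-picked rather than learned.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def prediction_vectors(W, u):
    """Return u_hat[i][j] = W[i][j] @ u[i] for every lower capsule i and higher capsule j."""
    return [
        [mat_vec(W[i][j], u[i]) for j in range(len(W[i]))]
        for i in range(len(u))
    ]

# Two lower-level capsules (2D outputs), one higher-level capsule (3D pose).
u = [
    [0.5, 0.25],
    [0.1, 0.3],
]
# W[i][j] has shape 3x2: it maps capsule i's 2D output to a 3D vote for capsule j.
W = [
    [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]],
    [[[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]],
]

u_hat = prediction_vectors(W, u)
print(u_hat[0][0])  # vote of lower capsule 0 for higher capsule 0
print(u_hat[1][0])  # vote of lower capsule 1 for higher capsule 0
```

Each lower capsule gets its own matrix for each higher capsule, so the number of weight matrices grows as the product of the two layer sizes.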
The Routing Algorithm
- Initialize logits $b_{ij} = 0$ for all pairs $(i, j)$.
- For each routing iteration $r = 1, 2, \ldots, R$:
  - Compute coupling coefficients with a softmax over the higher-level capsules: $c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$
  - Compute weighted sum: $\mathbf{s}_j = \sum_i c_{ij} \hat{\mathbf{u}}_{j|i}$
  - Squash: $\mathbf{v}_j = \text{squash}(\mathbf{s}_j)$
  - Update logits: $b_{ij} \leftarrow b_{ij} + \hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j$
- After the final iteration, $\mathbf{v}_j$ is the output of higher-level capsule $j$.
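The steps above can be sketched in plain Python with explicit loops, so that each line maps onto one symbol in the math. This is a readability-first sketch, not an efficient implementation: the names `route`, `softmax`, and `squash`, the `u_hat[i][j]` layout, and the toy votes are all assumptions made for the example.

```python
import math

def squash(s):
    """Scale s to a length in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0] * len(s)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, num_iters=3):
    """Routing by agreement. u_hat[i][j] is lower capsule i's vote for higher capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]          # logits b_ij start at zero
    for _ in range(num_iters):
        c = [softmax(row) for row in b]               # c_ij: softmax over j for each i
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in)) for d in range(dim)]
            v.append(squash(s_j))                     # v_j = squash(s_j)
        for i in range(n_in):
            for j in range(n_out):                    # b_ij += u_hat_ij . v_j
                b[i][j] += sum(p * q for p, q in zip(u_hat[i][j], v[j]))
    return v, c

# Toy votes: all three lower capsules roughly agree about higher capsule 0,
# but point in conflicting directions for higher capsule 1.
u_hat = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.1], [0.0, -1.0]],
    [[0.9, 0.0], [1.0, 0.0]],
]
v, c = route(u_hat, num_iters=3)
lengths = [math.sqrt(sum(x * x for x in v_j)) for v_j in v]
print(lengths)  # capsule 0 should be the longer (more confident) one
print(c)        # each row sums to 1 and leans toward capsule 0
```

In this toy setup the conflicting votes for capsule 1 largely cancel in the weighted sum, so its squashed output stays short, while the consistent votes for capsule 0 reinforce each other across iterations.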
Voting and Consensus
The routing algorithm is a voting scheme:
- Each lower-level capsule votes for what the higher-level capsule's pose should be.
- If many lower-level capsules agree (prediction vectors pointing in similar directions), the higher-level capsule activates strongly.
- Agreement increases $c_{ij}$, routing more information toward that capsule.
- Disagreement decreases $c_{ij}$, redirecting information elsewhere.
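A single logit update makes the voting picture concrete. In this sketch the current output of a hypothetical "face" capsule and three votes are hard-coded; the capsule names and numbers are invented for illustration. The agreement is just the dot product used in the logit update.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Current output v_j of a higher-level capsule (hypothetical values).
v_face = [0.6, 0.3]

# Votes u_hat_{j|i} from three lower-level capsules (hypothetical values).
votes = {
    "left_eye": [0.7, 0.3],
    "nose": [0.6, 0.4],
    "stray_edge": [-0.5, -0.2],
}

# Agreement a_ij = u_hat_{j|i} . v_j is what gets added to the logit b_ij.
agreement = {name: dot(vote, v_face) for name, vote in votes.items()}
for name, value in agreement.items():
    direction = "raises" if value > 0 else "lowers"
    print(f"{name}: agreement {value:+.2f} {direction} its logit")
```

The two votes that point the same way as the capsule's output get a positive update, so the softmax hands them a larger share of coupling on the next iteration; the dissenting vote gets a negative update and is routed elsewhere.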
Why Three Iterations?
Sabour et al. found three routing iterations work well empirically. Fewer iterations don't refine the coupling coefficients enough. More iterations give diminishing returns and cost more compute.
Next
In Part 2, we implement the full CapsNet in PyTorch: PrimaryCapsule, DigitCapsule with routing, and CapsuleLoss (margin loss plus reconstruction regularization). The implementation surfaces details the math hides -- tensor broadcasting with 5D weight matrices, gradient detachment for routing stability, and tuning the loss balance.