The theoretical promise of Capsule Networks is equivariant representations that handle viewpoint changes more gracefully than standard CNNs. This final installment puts that claim to the test. We evaluate our from-scratch CapsNet on MNIST classification accuracy, reconstruction quality, and -- critically -- robustness to rotations never seen during training. We compare directly against a simple CNN baseline trained on identical data, quantifying exactly where and how much capsule representations help.
## Experimental Setup

### Training Configuration
Both models were trained on identical data with comparable optimization:
| Setting | CapsNet | Simple CNN |
|---|---|---|
| Training images | 5,000 | 5,000 |
| Test images | 10,000 | 10,000 |
| Optimizer | Adam | Adam |
| Learning rate | 0.001 | 0.001 |
| Batch size | 64 | 64 |
| Epochs | 20 | 20 |
| Parameters | 8,141,840 | 421,642 |
### Rotation Test Protocol
After training on unrotated images only, we evaluate both models on the full 10,000-image test set at four rotation angles: $0^\circ$, $15^\circ$, $30^\circ$, and $45^\circ$. Rotations are applied via affine transformation with zero-padding.
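This protocol can be sketched in PyTorch with `affine_grid`/`grid_sample` (a minimal sketch of an affine rotation with zero-padding; the series' actual evaluation code may differ in detail):

```python
import math
import torch
import torch.nn.functional as F

def rotate_batch(images, angle_deg):
    """Rotate a batch of (N, C, H, W) images about the center.
    Pixels sampled from outside the input are zero-filled, matching
    the zero-padding used in the rotation test protocol."""
    rad = math.radians(angle_deg)
    cos, sin = math.cos(rad), math.sin(rad)
    n = images.size(0)
    # One 2x3 affine matrix per image: pure rotation, no translation.
    theta = torch.zeros(n, 2, 3, dtype=images.dtype)
    theta[:, 0, 0], theta[:, 0, 1] = cos, -sin
    theta[:, 1, 0], theta[:, 1, 1] = sin, cos
    grid = F.affine_grid(theta, list(images.size()), align_corners=False)
    return F.grid_sample(images, grid, padding_mode="zeros",
                         align_corners=False)
```

With `align_corners=False`, a $0^\circ$ rotation samples exactly at pixel centers and reproduces the input, which makes the helper easy to sanity-check.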
## MNIST Classification Results

### Test Accuracy
On the standard (unrotated) test set:
| Model | Test Accuracy |
|---|---|
| CapsNet | 98.34% |
| Simple CNN | 96.31% |
Both models achieve strong performance, with CapsNet holding a 2.03 percentage point advantage. Even on standard benchmarks, capsule representations provide a measurable edge -- and the benefit grows dramatically under transformation.
### Reconstruction Quality
The reconstruction decoder takes the 16-dimensional capsule vector of the predicted class and attempts to reconstruct the original $28 \times 28$ image. This serves both as a regularizer during training and as evidence that the capsule representation captures meaningful structure.
Qualitative observation: the reconstructed digits are clearly recognizable, preserving the overall shape, stroke width, and slant of the originals. Some fine details (sharp edges, thin strokes) are slightly smoothed, which is expected given the bottleneck through a 16D vector.
The reconstruction MSE on the test set was approximately 0.052, corresponding to a root-mean-square per-pixel error of about 0.23 (on the $[0,1]$ scale). For reference, a trivial reconstruction (all zeros) would have an MSE of approximately 0.13 on MNIST, so the decoder is capturing substantial structure.
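The bookkeeping behind these numbers is simple; a minimal helper (illustrative names, not the series' code) makes the MSE, the per-pixel RMS error, and the all-zeros baseline explicit:

```python
import torch

def reconstruction_error(recon, images):
    """Return (mse, per_pixel_rms) for [0,1]-scaled pixel tensors.
    The per-pixel RMS error is simply the square root of the MSE."""
    mse = torch.mean((recon - images) ** 2).item()
    return mse, mse ** 0.5

def zeros_baseline_mse(images):
    """MSE of the trivial all-zeros reconstruction: mean squared pixel."""
    return torch.mean(images ** 2).item()
```

For example, an MSE of 0.052 gives a per-pixel RMS error of $\sqrt{0.052} \approx 0.23$.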
## Rotation Robustness
This is the central experiment. Neither model saw any rotated images during training. We evaluate on the full test set rotated by $15^\circ$, $30^\circ$, and $45^\circ$.
| Rotation | CapsNet | Simple CNN | CapsNet Advantage (pp) |
|---|---|---|---|
| $0^\circ$ (baseline) | 98.34% | 96.31% | +2.03 |
| $15^\circ$ | 96.21% | 90.81% | +5.40 |
| $30^\circ$ | 83.64% | 69.09% | +14.55 |
| $45^\circ$ | 58.17% | 37.04% | +21.13 |
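The sweep behind this table reduces to a simple evaluation loop. A minimal sketch (illustrative names; `rotate_fn` stands in for whatever rotation helper is used, and `model` may return logits or capsule lengths):

```python
import torch

@torch.no_grad()
def accuracy_under_rotation(model, loader, angles, rotate_fn):
    """Accuracy of `model` on `loader` at each angle in `angles`.
    `rotate_fn(images, angle)` applies an affine rotation with zero
    padding; argmax over class scores works for logits and for
    capsule-vector lengths alike."""
    results = {}
    for angle in angles:
        correct, total = 0, 0
        for images, labels in loader:
            preds = model(rotate_fn(images, angle)).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
        results[angle] = correct / total
    return results
```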
### Analysis
The key finding: the CapsNet advantage grows monotonically with rotation angle. At $0^\circ$, CapsNet already leads by 2.03 percentage points; at $45^\circ$, the gap widens to 21.13 percentage points.
Degradation Rates:
- CapsNet: 98.34% to 58.17% = 40.17 points lost over $45^\circ$.
- Simple CNN: 96.31% to 37.04% = 59.27 points lost over $45^\circ$.
The CNN loses about 48% more accuracy than the CapsNet under the same rotational perturbation (59.27 vs. 40.17 points).
Why CapsNets Are More Robust: The capsule representation encodes pose in the vector orientation. When an input digit rotates:
- The primary capsule vectors change orientation (representing the new pose).
- The routing algorithm still finds agreement -- rotated parts still agree on a rotated whole.
- The digit capsule vector rotates correspondingly, but its length (classification confidence) remains relatively stable.
In contrast, the CNN's max-pooled features are designed to be invariant to small translations, but rotations shift features across spatial locations in ways that pooling cannot absorb.
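This length/orientation split is exactly what makes classification rotation-tolerant: prediction reads only the vector length, while pose lives in the orientation. A minimal sketch (tensor shapes assumed to match Part 2's DigitCaps output):

```python
import torch

def capsule_predict(digit_caps):
    """digit_caps: (batch, num_classes, 16) DigitCaps output vectors.
    The *length* of each 16-D vector is classification confidence; its
    *orientation* encodes pose. A rotated input mainly reorients the
    vectors, so the lengths -- and hence the argmax -- stay stable."""
    lengths = digit_caps.norm(dim=-1)   # (batch, num_classes)
    return lengths.argmax(dim=1), lengths
```

Reorienting a capsule vector within its 16-D space leaves its length, and therefore the predicted class, unchanged.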
## The Cost of Capsules
| Metric | CapsNet | Simple CNN |
|---|---|---|
| Parameters | 8,141,840 | 421,642 |
| Parameter ratio | $19.3\times$ | $1\times$ |
| Training time (per epoch) | ~30s | ~0.7s |
| Time ratio | ~$43\times$ | $1\times$ |
CapsNets are nearly $20\times$ larger and $43\times$ slower than the simple CNN. The iterative routing algorithm is the primary bottleneck -- each forward pass requires three iterations of the full routing computation.
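To make the bottleneck concrete, here is a condensed sketch of the routing loop (variable names are assumed, not quoted from Part 2): every iteration touches all `in_caps` $\times$ `out_caps` pairs, and the whole loop runs inside each forward pass.

```python
import torch
import torch.nn.functional as F

def squash(s, eps=1e-8):
    # Shrink vector length into [0, 1) while preserving orientation.
    n2 = (s ** 2).sum(dim=-1, keepdim=True)
    return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + eps)

def dynamic_routing(u_hat, num_iters=3):
    """u_hat: (batch, in_caps, out_caps, out_dim) prediction vectors.
    Each of the num_iters passes visits every (in_cap, out_cap) pair,
    which is why routing dominates forward-pass cost."""
    b = torch.zeros(u_hat.shape[:3])                  # routing logits
    v = None
    for _ in range(num_iters):
        c = F.softmax(b, dim=2)                       # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)      # parts vote on wholes
        v = squash(s)                                 # (batch, out_caps, out_dim)
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)  # agreement update
    return v
```

Nothing here is a single fused kernel: the softmax, weighted sum, squash, and agreement update each materialize tensors over all capsule pairs, three times per forward pass.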
## Discussion and Broader Perspective

### What Capsule Networks Get Right
- Equivariant representations: Capsule vectors encode pose information that scalar activations cannot represent.
- Part-whole reasoning: Dynamic routing implements a soft attention mechanism where parts vote on wholes -- a form of compositional understanding.
- Reconstruction as regularization: Forcing the capsule to reconstruct the input ensures the representation is information-rich, not just discriminative.
### What Capsule Networks Get Wrong (or at Least Hard)
- Scalability: The routing algorithm scales quadratically with the number of capsules. CapsNets have not been successfully scaled to ImageNet-scale tasks.
- Computational cost: The iterative routing makes both forward and backward passes expensive.
- Training stability: The routing algorithm can be sensitive to initialization and hyperparameters.
- Marginal gains on standard benchmarks: On unrotated data, the advantage over well-tuned CNNs is small.
## Conclusion
This three-part series deconstructed Capsule Networks from first principles:
- Part 1: The math -- why pooling loses spatial information, how capsule vectors encode equivariant pose, and how routing by agreement enables compositional reasoning.
- Part 2: The implementation -- squashing, primary capsules, digit capsules with dynamic routing, margin loss, and reconstruction decoder, all in pure PyTorch.
- Part 3: The results -- 98.34% test accuracy on MNIST, recognizable reconstructions from 16D vectors, and a growing robustness advantage over CNNs that reaches +21.13 percentage points at $45^\circ$ rotation.
Capsule Networks are not the dominant paradigm in deep learning today. But the core insight -- that neural representations should encode how something appears, not just that it appears -- remains one of the most important ideas in the field.