CapsNets promise equivariant representations that handle viewpoint changes better than standard CNNs. This post tests that claim. We evaluate our from-scratch CapsNet on MNIST accuracy, reconstruction quality, and robustness to rotations never seen during training, comparing directly against a CNN baseline trained on the same data.
Experimental Setup
Training Configuration
Both models were trained on identical data with the same optimizer settings:
| Setting | CapsNet | Simple CNN |
|---|---|---|
| Training images | 5,000 | 5,000 |
| Test images | 10,000 | 10,000 |
| Optimizer | Adam | Adam |
| Learning rate | 0.001 | 0.001 |
| Batch size | 64 | 64 |
| Epochs | 20 | 20 |
| Parameters | 8,141,840 | 421,642 |
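The CapsNet parameter count in the table can be reproduced with a quick tally. The sketch below assumes the standard CapsNet layer sizes (a 9×9 convolution with 256 channels, a 9×9 primary-capsule convolution producing 1,152 capsules of 8D, 10 digit capsules of 16D) and a 512-1024-784 decoder fed by a single 16D capsule vector. These sizes and the helper names are assumptions made for illustration; they are used here because they sum to the reported total.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a square 2D convolution."""
    return kernel * kernel * c_in * c_out + c_out


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Assumed layer sizes (standard CapsNet layout, 16D decoder input).
capsnet_layers = {
    "conv1": conv_params(9, 1, 256),
    "primary_caps_conv": conv_params(9, 256, 256),  # 32 maps x 8D capsules
    "digit_caps_weights": 1152 * 10 * 8 * 16,       # one 8x16 matrix per (input, output) pair
    "decoder_fc1": linear_params(16, 512),
    "decoder_fc2": linear_params(512, 1024),
    "decoder_fc3": linear_params(1024, 28 * 28),
}

total_params = sum(capsnet_layers.values())

for name, count in capsnet_layers.items():
    print(f"{name:>20}: {count:>10,}")
print(f"{'total':>20}: {total_params:>10,}")
```

Under these assumptions, roughly two thirds of the parameters sit in the primary-capsule convolution rather than in the routing weights.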
Rotation Test Protocol
Both models were trained on unrotated images only. We then evaluate on the full 10,000-image test set at $0^\circ$, $15^\circ$, $30^\circ$, and $45^\circ$, rotating each image with an affine transformation and filling the exposed corners with zeros.
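A minimal NumPy sketch of this kind of rotation is shown below. It assumes rotation about the image center with bilinear interpolation; the function name `rotate_zero_pad` and the interpolation choice are illustrative assumptions, not necessarily what the experiments used.

```python
import numpy as np


def rotate_zero_pad(img, angle_deg):
    """Rotate a 2D image about its center; samples outside the frame read as 0."""
    h, w = img.shape
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0

    # Inverse mapping: for every output pixel, find where it comes from.
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    dy, dx = ys - cy, xs - cx
    src_x = np.cos(theta) * dx - np.sin(theta) * dy + cx
    src_y = np.sin(theta) * dx + np.cos(theta) * dy + cy

    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    fx, fy = src_x - x0, src_y - y0

    out = np.zeros((h, w), dtype=np.float64)
    # Bilinear blend of the four neighbouring source pixels.
    corners = (
        (0, 0, (1 - fy) * (1 - fx)),
        (0, 1, (1 - fy) * fx),
        (1, 0, fy * (1 - fx)),
        (1, 1, fy * fx),
    )
    for oy, ox, weight in corners:
        yy, xx = y0 + oy, x0 + ox
        inside = (yy >= 0) & (yy < h) & (xx >= 0) & (xx < w)
        vals = img[np.clip(yy, 0, h - 1), np.clip(xx, 0, w - 1)]
        out += weight * np.where(inside, vals, 0.0)  # zero-padding
    return out


# Usage: a synthetic 28x28 "stroke" rotated through the test angles.
stroke = np.zeros((28, 28))
stroke[4:24, 13:15] = 1.0
rotated = {angle: rotate_zero_pad(stroke, angle) for angle in (0, 15, 30, 45)}
```

In a PyTorch pipeline, the same step would more likely be a library call; the explicit version here just makes the zero-padding behaviour visible.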
MNIST Classification Results
Test Accuracy
On the standard (unrotated) test set:
| Model | Test Accuracy |
|---|---|
| CapsNet | 98.34% |
| Simple CNN | 96.31% |
CapsNet leads by 2.03 percentage points on standard data. The gap widens under rotation.
Reconstruction Quality
The decoder takes the 16D capsule vector of the predicted class and reconstructs the original $28 \times 28$ image. It acts as a regularizer during training and confirms that the capsule representation carries meaningful structure.
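As a rough illustration of that selection step, the sketch below picks the longest digit capsule per example and returns its 16D vector, which is what the decoder would consume. The array layout `(batch, 10, 16)` and the function name `select_predicted_capsule` are assumptions for this example, written in NumPy rather than the PyTorch code from Part 2.

```python
import numpy as np


def select_predicted_capsule(digit_caps):
    """digit_caps: (batch, 10, 16) array of digit capsule vectors.

    Returns the predicted class per example (longest capsule) and
    the corresponding 16D vector that would be fed to the decoder.
    """
    lengths = np.linalg.norm(digit_caps, axis=-1)       # (batch, 10)
    pred = lengths.argmax(axis=1)                       # (batch,)
    selected = digit_caps[np.arange(len(pred)), pred]   # (batch, 16)
    return pred, selected


# Usage with random stand-in capsules; class 7 is made dominant for example 0.
rng = np.random.default_rng(0)
caps = rng.normal(size=(4, 10, 16)) * 0.1
caps[0, 7] *= 20.0
pred, selected = select_predicted_capsule(caps)
```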
Reconstructed digits are clearly recognizable -- shape, stroke width, and slant are preserved. Fine details (sharp edges, thin strokes) get slightly smoothed, which is expected from a 16D bottleneck.
Test-set reconstruction MSE was about 0.052, which corresponds to a root-mean-square per-pixel error of ~0.23 on the $[0,1]$ scale. For comparison, an all-zeros reconstruction gives MSE ~0.13 on MNIST, so the decoder captures most of the image structure.
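The arithmetic behind those two figures is short enough to check directly. The snippet uses only the numbers quoted above; the variable names are illustrative.

```python
import math

mse_decoder = 0.052  # reported test-set reconstruction MSE
mse_zeros = 0.13     # all-zeros baseline quoted above

# Root-mean-square per-pixel error on the [0, 1] scale.
rmse_decoder = math.sqrt(mse_decoder)

# Fraction of the baseline squared error that the decoder removes.
error_removed = 1.0 - mse_decoder / mse_zeros

print(f"RMSE: {rmse_decoder:.3f}")
print(f"Squared error removed vs. all-zeros: {error_removed:.0%}")
```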
Rotation Robustness
Neither model saw any rotated images during training.
| Rotation | CapsNet | Simple CNN | CapsNet Advantage (pp) |
|---|---|---|---|
| $0^\circ$ (baseline) | 98.34% | 96.31% | +2.03 |
| $15^\circ$ | 96.21% | 90.81% | +5.40 |
| $30^\circ$ | 83.64% | 69.09% | +14.55 |
| $45^\circ$ | 58.17% | 37.04% | +21.13 |
Analysis
The CapsNet advantage grows monotonically with rotation angle, from +2.03 percentage points at $0^\circ$ to +21.13 percentage points at $45^\circ$.
Degradation over the full $45^\circ$ range:
- CapsNet: 98.34% $\to$ 58.17% = 40.17 points lost.
- Simple CNN: 96.31% $\to$ 37.04% = 59.27 points lost.
The CNN loses about 48% more accuracy than the CapsNet under the same perturbation (59.27 vs. 40.17 points, a ratio of roughly 1.48).
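These figures follow directly from the rotation table. A short script to recompute them, using only the accuracies reported above (variable names are illustrative):

```python
# Accuracy (%) by rotation angle: (CapsNet, Simple CNN), copied from the table.
accuracy = {
    0: (98.34, 96.31),
    15: (96.21, 90.81),
    30: (83.64, 69.09),
    45: (58.17, 37.04),
}

# CapsNet advantage in percentage points at each angle.
advantage = {angle: round(caps - cnn, 2) for angle, (caps, cnn) in accuracy.items()}

# Accuracy lost between 0 and 45 degrees for each model.
caps_drop = round(accuracy[0][0] - accuracy[45][0], 2)
cnn_drop = round(accuracy[0][1] - accuracy[45][1], 2)

# How much larger the CNN's drop is relative to the CapsNet's.
drop_ratio = cnn_drop / caps_drop

for angle, gap in advantage.items():
    print(f"{angle:>2} deg: CapsNet ahead by {gap:.2f} points")
print(f"CapsNet drop: {caps_drop:.2f}, CNN drop: {cnn_drop:.2f}, ratio: {drop_ratio:.3f}")
```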
Why? Capsule vectors encode pose in their orientation. When an input digit rotates:
- Primary capsule vectors change orientation to represent the new pose.
- Routing still finds agreement -- rotated parts still agree on a rotated whole.
- The digit capsule vector rotates correspondingly, but its length (classification confidence) stays relatively stable.
The CNN's max-pooled features handle small translations, but rotations shift features across spatial locations in ways pooling can't absorb.
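The capsule side of this argument can be illustrated with an idealized NumPy sketch of routing by agreement. It assumes the standard three-iteration dynamic routing and squash function, and it models a pose change as one orthogonal transform applied to every prediction vector. That is a strong simplification: a real image rotation does not act this cleanly on learned capsules, which is consistent with accuracy still dropping at large angles. The names `squash` and `route`, and the small capsule count, are choices made for this example.

```python
import numpy as np


def squash(s, eps=1e-9):
    """Shrink vector length into [0, 1) while keeping direction."""
    sq_norm = np.sum(s * s, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)


def route(u_hat, n_iters=3):
    """Dynamic routing. u_hat: (n_in, n_out, dim) prediction vectors."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))  # routing logits
    for _ in range(n_iters):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)          # softmax over output capsules
        s = np.einsum("io,iod->od", c, u_hat)         # weighted vote per output capsule
        v = squash(s)
        b = b + np.einsum("iod,od->io", u_hat, v)     # agreement = dot product
    return v


rng = np.random.default_rng(0)
u_hat = rng.normal(size=(32, 10, 16)) * 0.5           # 32 input capsules for speed
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))        # random orthogonal "pose change"

v = route(u_hat)
v_transformed = route(u_hat @ Q.T)                    # every prediction vector transformed by Q

lengths = np.linalg.norm(v, axis=-1)
lengths_transformed = np.linalg.norm(v_transformed, axis=-1)
```

In this idealized setting the output vectors change direction by exactly the same transform while their lengths, and therefore the class scores, are unchanged, because both the agreement dot products and the squash depend only on norms and inner products.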
The Cost of Capsules
| Metric | CapsNet | Simple CNN |
|---|---|---|
| Parameters | 8,141,840 | 421,642 |
| Parameter ratio | $19.3\times$ | $1\times$ |
| Training time (per epoch) | ~30s | ~0.7s |
| Time ratio | ~$43\times$ | $1\times$ |
The CapsNet is ~$19\times$ larger and ~$43\times$ slower per epoch. The bottleneck is iterative routing -- three routing iterations over all 1,152 primary capsules on every forward pass.
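The ratios in the table, and a rough sense of the routing workload, can be recomputed from the reported numbers. The count of 10 output capsules is an assumption (one per MNIST digit class); everything else is taken from the figures above.

```python
caps_params, cnn_params = 8_141_840, 421_642
caps_epoch_s, cnn_epoch_s = 30.0, 0.7   # approximate seconds per epoch

param_ratio = caps_params / cnn_params
time_ratio = caps_epoch_s / cnn_epoch_s

# Routing workload per example: one coupling coefficient per
# (primary capsule, digit capsule) pair, refreshed each iteration.
n_primary, n_digit, n_iters = 1152, 10, 3   # n_digit = 10 is assumed
couplings_per_iter = n_primary * n_digit
coupling_updates = couplings_per_iter * n_iters

print(f"Parameter ratio: {param_ratio:.1f}x")
print(f"Time ratio: {time_ratio:.1f}x")
print(f"Coupling coefficients per iteration: {couplings_per_iter:,}")
print(f"Coupling updates per forward pass: {coupling_updates:,}")
```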
Discussion
What Works
- Equivariant representations: Capsule vectors encode pose information that scalar activations can't.
- Part-whole reasoning: Routing is a soft attention mechanism where parts vote on wholes.
- Reconstruction regularization: Forcing reconstruction keeps the representation information-rich, not just discriminative.
What Doesn't (Yet)
- Scalability: Routing cost grows with the product of the capsule counts in adjacent layers, so it scales roughly quadratically as layers widen. CapsNets haven't reached ImageNet scale.
- Compute cost: Iterative routing makes forward and backward passes expensive.
- Training stability: Routing is sensitive to initialization and hyperparameters.
- Standard benchmarks: On unrotated data, the advantage over well-tuned CNNs is modest.
Conclusion
Across this three-part series:
- Part 1 covered the math -- pooling's information loss, capsule vectors for equivariant pose encoding, routing by agreement.
- Part 2 implemented the full CapsNet in PyTorch -- squashing, primary capsules, digit capsules with routing, margin loss, decoder.
- Part 3 showed the results -- 98.34% on MNIST, recognizable 16D reconstructions, and a rotation robustness advantage that reaches +21.13 points at $45^\circ$.
CapsNets aren't the dominant paradigm today. But the core idea -- that neural representations should encode how something appears, not just that it appears -- remains worth understanding.