Deconstructing CapsNets: Part 3 - Viewpoint Invariance

Introduction

CapsNets promise equivariant representations that handle viewpoint changes better than standard CNNs. This post tests that claim. We evaluate our from-scratch CapsNet on MNIST accuracy, reconstruction quality, and robustness to rotations never seen during training, comparing directly against a CNN baseline trained on the same data.

Experimental Setup

Training Configuration

Both models were trained on identical data with the same optimizer settings:

Setting	CapsNet	Simple CNN
Training images	5,000	5,000
Test images	10,000	10,000
Optimizer	Adam	Adam
Learning rate	0.001	0.001
Batch size	64	64
Epochs	20	20
Parameters	8,141,840	421,642

Rotation Test Protocol

Both models were trained on unrotated images only. We then evaluate each on the full 10,000-image test set rotated by $0^\circ$, $15^\circ$, $30^\circ$, and $45^\circ$. Rotation is applied as an affine transformation, and any region rotated in from outside the frame is zero-filled.
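To make the protocol concrete, here is a minimal pure-Python sketch of rotating an image about its center with zero-filling. It is an illustration under assumptions, not the code used for the experiments: the function name `rotate_image`, the nearest-neighbour sampling, and the stand-in stroke image are all hypothetical, and the sign convention for the angle is arbitrary.

```python
import math

def rotate_image(img, angle_deg):
    """Rotate a 2D list-of-lists image about its center, zero-filling gaps."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: find the source pixel that lands on (x, y).
            dx, dy = x - cx, y - cy
            sx = round(cos_t * dx + sin_t * dy + cx)
            sy = round(-sin_t * dx + cos_t * dy + cy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
            # Otherwise the pixel keeps its 0.0 fill (zero-padding).
    return out

# Hypothetical stand-in for a digit: a single vertical stroke on a 28x28 canvas.
image = [[0.0] * 28 for _ in range(28)]
for row in range(6, 22):
    image[row][14] = 1.0

# One rotated copy per test angle from the protocol above.
angles = (0, 15, 30, 45)
rotated = {a: rotate_image(image, a) for a in angles}
```

In practice a library routine with bilinear interpolation would give smoother strokes; nearest-neighbour sampling is used here only to keep the sketch dependency-free.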

MNIST Classification Results

Test Accuracy

On the standard (unrotated) test set:

Model	Test Accuracy
CapsNet	98.34%
Simple CNN	96.31%

CapsNet leads by 2.03 percentage points on standard data. The gap widens under rotation.

Reconstruction Quality

The decoder takes the 16D capsule vector of the predicted class and reconstructs the original $28 \times 28$ image. It acts as a regularizer during training and confirms that the capsule representation carries meaningful structure.

Reconstructed digits are clearly recognizable -- shape, stroke width, and slant are preserved. Fine details (sharp edges, thin strokes) get slightly smoothed, which is expected from a 16D bottleneck.

Test-set reconstruction MSE was about 0.052, which corresponds to a root-mean-square per-pixel error of about 0.23 on the $[0,1]$ scale. For comparison, an all-zeros reconstruction gives MSE ~0.13 on MNIST, so the decoder removes roughly 60% of that baseline error.
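The arithmetic behind these figures can be checked in a few lines. This is only a sketch: the two MSE values are copied from the text, and the helper `mse` is a hypothetical illustration of how the metric is defined over flattened pixel lists.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length flat pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

reported_mse = 0.052  # test-set reconstruction MSE quoted in the text
zeros_mse = 0.13      # all-zeros baseline MSE quoted in the text

rmse = math.sqrt(reported_mse)            # typical per-pixel error, ~0.228
reduction = 1 - reported_mse / zeros_mse  # fraction of baseline error removed, ~0.6

print(f"RMSE ~ {rmse:.3f}")
print(f"error reduction vs. all-zeros ~ {reduction:.0%}")
```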

Rotation Robustness

Neither model saw any rotated images during training.

Rotation	CapsNet	Simple CNN	CapsNet Advantage (percentage points)
$0^\circ$ (baseline)	98.34%	96.31%	+2.03
$15^\circ$	96.21%	90.81%	+5.40
$30^\circ$	83.64%	69.09%	+14.55
$45^\circ$	58.17%	37.04%	+21.13

Analysis

The CapsNet advantage grows monotonically with rotation angle, from +2.03 percentage points at $0^\circ$ to +21.13 percentage points at $45^\circ$.

Degradation over the full $45^\circ$ range:

CapsNet: 98.34% $\to$ 58.17% = 40.17 points lost.
Simple CNN: 96.31% $\to$ 37.04% = 59.27 points lost.

The CNN loses about 48% more accuracy than the CapsNet under the same perturbation (59.27 vs. 40.17 points).
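The gaps and degradation figures follow directly from the accuracy table. A short sketch that recomputes them, with the accuracies copied from the table and the variable names chosen only for illustration:

```python
# Accuracy (%) per rotation angle: (CapsNet, Simple CNN), copied from the table.
acc = {
    0: (98.34, 96.31),
    15: (96.21, 90.81),
    30: (83.64, 69.09),
    45: (58.17, 37.04),
}

# CapsNet advantage in percentage points at each angle.
gaps = {angle: round(caps - cnn, 2) for angle, (caps, cnn) in acc.items()}

# Points lost between 0 and 45 degrees for each model.
caps_lost = round(acc[0][0] - acc[45][0], 2)
cnn_lost = round(acc[0][1] - acc[45][1], 2)

# How much more the CNN degrades, relative to the CapsNet.
relative_extra_loss = cnn_lost / caps_lost - 1

for angle, gap in gaps.items():
    print(f"{angle:>2} deg: +{gap:.2f} points")
print(f"CapsNet lost {caps_lost:.2f}, CNN lost {cnn_lost:.2f}")
print(f"CNN loses {relative_extra_loss:.1%} more")
```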

Why? Capsule vectors encode pose in their orientation. When an input digit rotates:

Primary capsule vectors change orientation to represent the new pose.
Routing still finds agreement -- rotated parts still agree on a rotated whole.
The digit capsule vector rotates correspondingly, but its length (classification confidence) stays relatively stable.
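The last point, that direction can change while length stays put, can be illustrated with a toy example. This is an idealized sketch, not the trained network: it assumes the standard squash form (which we take to match Part 2), uses a made-up 16D vector, and models the pose change as an exact rotation in capsule space, something a real CapsNet only approximates.

```python
import math

def squash(s, eps=1e-9):
    """Squash nonlinearity: keeps direction, maps length into [0, 1)."""
    sq_norm = sum(x * x for x in s)
    norm = math.sqrt(sq_norm)
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * x for x in s]

def length(v):
    """Euclidean length, read as classification confidence for a digit capsule."""
    return math.sqrt(sum(x * x for x in v))

def rotate_plane(v, i, j, angle_deg):
    """Rotate v in the plane of coordinates i and j (an orthogonal map)."""
    t = math.radians(angle_deg)
    out = list(v)
    out[i] = math.cos(t) * v[i] - math.sin(t) * v[j]
    out[j] = math.sin(t) * v[i] + math.cos(t) * v[j]
    return out

# Hypothetical 16D pre-activation for one digit capsule.
s = [0.5 * ((-1) ** k) * (k + 1) / 16 for k in range(16)]

v = squash(s)
v_rot = squash(rotate_plane(s, 0, 1, 45))

print(f"length before: {length(v):.4f}, after: {length(v_rot):.4f}")
print(f"first two coordinates before: {v[:2]}, after: {v_rot[:2]}")
```

The lengths agree because a rotation preserves the norm and the squash scale depends only on the norm. The accuracy drop at $45^\circ$ shows that the trained network departs from this ideal.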

The CNN's max-pooled features handle small translations, but rotations shift features across spatial locations in ways pooling can't absorb.

The Cost of Capsules

Metric	CapsNet	Simple CNN
Parameters	8,141,840	421,642
Parameter ratio	$19.3\times$	$1\times$
Training time (per epoch)	~30s	~0.7s
Time ratio	~$43\times$	$1\times$

Our CapsNet is ~$19\times$ larger than the CNN baseline and ~$43\times$ slower per training epoch. The bottleneck is iterative routing -- three routing iterations over 1,152 capsules on every forward call.
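A quick sketch of where these ratios come from, plus a rough count of routing work per image. The parameter counts, epoch times, capsule count, and iteration count are taken from the text; `num_classes = 10` (one digit capsule per MNIST class) is an assumption, and the count ignores the per-update vector arithmetic.

```python
caps_params, cnn_params = 8_141_840, 421_642  # from the cost table
caps_epoch_s, cnn_epoch_s = 30.0, 0.7         # approximate seconds per epoch

param_ratio = caps_params / cnn_params   # ~19.3
time_ratio = caps_epoch_s / cnn_epoch_s  # ~42.9

num_primary = 1_152  # capsules being routed, from the text
num_classes = 10     # assumption: one digit capsule per MNIST class
routing_iters = 3    # from the text

# One coupling coefficient per (lower capsule, digit capsule) pair,
# recomputed on every routing iteration, for every image.
coupling_updates = num_primary * num_classes * routing_iters

print(f"parameter ratio ~ {param_ratio:.1f}x")
print(f"time ratio ~ {time_ratio:.0f}x")
print(f"coupling updates per image: {coupling_updates:,}")
```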

Discussion

What Works

Equivariant representations: Capsule vectors encode pose information that scalar activations can't.
Part-whole reasoning: Routing is a soft attention mechanism where parts vote on wholes.
Reconstruction regularization: Forcing reconstruction keeps the representation information-rich, not just discriminative.

What Doesn't (Yet)

Scalability: Routing cost grows with the product of lower-layer and upper-layer capsule counts, so it scales roughly quadratically as both layers grow. CapsNets haven't reached ImageNet scale.
Compute cost: Iterative routing makes forward and backward passes expensive.
Training stability: Routing is sensitive to initialization and hyperparameters.
Standard benchmarks: On unrotated data, the advantage over well-tuned CNNs is modest.

Conclusion

Across this three-part series:

Part 1 covered the math -- pooling's information loss, capsule vectors for equivariant pose encoding, routing by agreement.
Part 2 implemented the full CapsNet in PyTorch -- squashing, primary capsules, digit capsules with routing, margin loss, decoder.
Part 3 showed the results -- 98.34% on MNIST, recognizable 16D reconstructions, and a rotation robustness advantage that reaches +21.13 points at $45^\circ$.

CapsNets aren't the dominant paradigm today. But the core idea -- that neural representations should encode how something appears, not just that it appears -- remains worth understanding.
