Deconstructing MoE: Part 3 - 8 Experts, 2 Active

Introduction

We train the from-scratch MoE implementation on MNIST and compare it against a dense MLP baseline. The MoE model (8 experts, top-2 routing, 732,946 total parameters) reaches 95.14% test accuracy, within 0.22 percentage points of the dense baseline (95.36%), while leaving 54% of its parameters inactive for any given input. We also look at how experts specialize across digit classes without any explicit routing supervision.

Experimental Setup

Property	MoE	Dense MLP
Total parameters	732,946	269,450
Active parameters	337,426	269,450
Experts	8 (top-2 active)	--
Hidden dim	256	256

Training: MNIST, 5,000-image training subset, 30 epochs, Adam lr=$10^{-3}$, batch size 64, auxiliary loss weight $\alpha = 0.1$.
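
For reference, the settings above can be gathered into a small config object. This is only a sketch: the class and field names are hypothetical, and the step counts assume the last partial batch of each epoch is kept rather than dropped.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values copied from the setup described above; names are illustrative.
    train_subset_size: int = 5_000
    epochs: int = 30
    batch_size: int = 64
    learning_rate: float = 1e-3
    aux_loss_weight: float = 0.1  # alpha
    num_experts: int = 8
    top_k: int = 2

    def steps_per_epoch(self) -> int:
        # Assumption: the final, smaller batch is not dropped.
        return math.ceil(self.train_subset_size / self.batch_size)

    def total_steps(self) -> int:
        return self.steps_per_epoch() * self.epochs


config = TrainConfig()
print(config.steps_per_epoch(), config.total_steps())
```

Under that assumption each model sees a few thousand optimizer updates in total, which helps put the early-epoch accuracy gap in context.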

Accuracy Comparison

Metric	MoE	Dense MLP
Final training accuracy	99.86%	100.00%
Test accuracy	95.14%	95.36%
Final training loss	0.2147	0.0001

The MoE model lands at 95.14% test accuracy, 0.22 percentage points below the dense baseline. Its training loss plateaus around 0.21, but only about 0.004 of that is classification loss -- the rest is the weighted auxiliary load balancing term ($0.1 \times 2.10 \approx 0.21$). An unweighted auxiliary value close to $k = 2$ is the expected equilibrium for balanced routing.
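
The loss decomposition can be checked with a few lines of arithmetic. The sketch below assumes a Switch-Transformer-style auxiliary loss, $N \sum_i f_i P_i$, where $f_i$ is the fraction of inputs whose top-$k$ set includes expert $i$ and $P_i$ is the mean gate probability for expert $i$. The exact definition used in this series may differ; the function name is hypothetical.

```python
def load_balance_loss(assign_frac, mean_prob):
    """Assumed form: N * sum_i f_i * P_i (not necessarily the series' exact loss)."""
    n = len(assign_frac)
    return n * sum(f * p for f, p in zip(assign_frac, mean_prob))


NUM_EXPERTS, TOP_K, ALPHA = 8, 2, 0.1

# Perfectly balanced routing: every expert gets k/N of the inputs and 1/N of the mass.
balanced = load_balance_loss(
    [TOP_K / NUM_EXPERTS] * NUM_EXPERTS,
    [1 / NUM_EXPERTS] * NUM_EXPERTS,
)

# Collapsed routing: two experts take every input and all of the probability mass.
collapsed = load_balance_loss(
    [1.0, 1.0] + [0.0] * 6,
    [0.5, 0.5] + [0.0] * 6,
)

# Reported values from the table and paragraph above.
total_loss = 0.2147
aux_unweighted = 2.10
classification = total_loss - ALPHA * aux_unweighted

print(balanced, collapsed, round(classification, 4))
```

Under this assumed form, balanced routing gives exactly $k = 2$ and collapsed routing gives a much larger value, so an observed 2.10 is close to the floor. The implied classification loss comes out near 0.005, consistent with the reported figure once the rounding of 2.10 is taken into account.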

Learning Dynamics

Epoch 1: MoE 63.98% vs Dense 75.68%. The dense model starts faster -- every parameter contributes immediately, while MoE still has to learn where to route.
Epoch 5: MoE 95.66% vs Dense 96.84%. Gap narrows as experts specialize.
Epoch 15: MoE 99.60% vs Dense 99.80%. Both near training set saturation.
Epoch 30: MoE 99.86% vs Dense 100.00%, the final training accuracies. The residual MoE error plausibly comes from gating noise occasionally misrouting samples during training.

Expert Specialization Analysis

No supervision was provided about which expert should handle which digits -- the gating network learned these assignments on its own.

Digit	Primary Expert	Secondary Expert
0	Expert 4 (49.18%)	Expert 2 (25.36%)
1	Expert 7 (49.78%)	Expert 1 (29.65%)
2	Expert 5 (47.63%)	Expert 0 (28.97%)
3	Expert 0 (47.92%)	Expert 7 (27.18%)
4	Expert 6 (46.23%)	Expert 3 (40.58%)
5	Expert 0 (29.48%)	Expert 4 (24.83%)
6	Expert 1 (49.06%)	Expert 4 (32.31%)
7	Expert 2 (42.85%)	Expert 3 (37.35%)
8	Expert 6 (35.78%)	Expert 5 (17.71%)
9	Expert 3 (47.47%)	Expert 6 (33.45%)
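
A table like this can be built from the top-2 expert indices recorded for each test image. The sketch below is one way to do it, not necessarily how the numbers above were produced. It assumes each percentage is a share of routing slots (two slots per input), which would explain why no entry exceeds 50%. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def routing_shares(labels, topk_indices, top_k=2):
    """Per-class share of routing slots taken by each expert.

    labels: class label for each input
    topk_indices: tuple of selected expert ids for each input
    """
    counts = defaultdict(Counter)
    slots = Counter()
    for label, experts in zip(labels, topk_indices):
        counts[label].update(experts)
        slots[label] += top_k
    return {
        label: {expert: c / slots[label] for expert, c in ctr.items()}
        for label, ctr in counts.items()
    }


def top_two(shares):
    """Return the two (expert, share) pairs with the largest shares."""
    return sorted(shares.items(), key=lambda kv: kv[1], reverse=True)[:2]


# Toy routing log, for illustration only.
labels = [0, 0, 0, 0, 1, 1]
topk = [(4, 2), (4, 2), (4, 5), (4, 2), (7, 1), (7, 1)]

shares = routing_shares(labels, topk)
print(top_two(shares[0]))
```

With this normalization, an expert selected for every input of a class tops out at 50%, so entries such as 49.78% for digit 1 would mean that expert is chosen almost every time.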

Patterns in Expert Sharing

Expert 0 is the primary expert for digits 3 and 5 -- both have curved strokes in their lower halves. It is also the secondary expert for digit 2.
Expert 6 handles digits 4, 8, and 9 -- all three feature crossing or closed-loop structures in the upper portion.
Expert 3 is shared between digits 4, 7, and 9 -- all have prominent vertical or angular strokes.
Expert 4 receives digits 0, 5, and 6 (primary for 0, secondary for 5 and 6) -- digits with rounded, enclosed shapes.

The experts appear to specialize on visual features -- strokes, curves, structural elements -- rather than on individual digit identities. Top-2 routing lets each digit be represented as a combination of two feature-processing specialists.
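
The sharing patterns are easier to see when the table is inverted to list digits per expert. The snippet below does that using the primary and secondary experts copied from the specialization table; the variable names are illustrative.

```python
from collections import defaultdict

# (primary, secondary) expert per digit, copied from the specialization table.
TOP2_BY_DIGIT = {
    0: (4, 2), 1: (7, 1), 2: (5, 0), 3: (0, 7), 4: (6, 3),
    5: (0, 4), 6: (1, 4), 7: (2, 3), 8: (6, 5), 9: (3, 6),
}

digits_by_expert = defaultdict(list)
for digit, experts in TOP2_BY_DIGIT.items():
    for expert in experts:
        digits_by_expert[expert].append(digit)

for expert in sorted(digits_by_expert):
    print(f"Expert {expert}: digits {digits_by_expert[expert]}")
```

Every expert appears in the top two for at least two digits, and Experts 0, 3, 4, and 6 each appear for three, which fits the reading that experts track shared visual features rather than single classes.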

Parameter Efficiency

Metric	Value
MoE total parameters	732,946
MoE active parameters per input	337,426
Dense baseline parameters	269,450
MoE sparsity ratio	54.0% inactive
MoE capacity multiplier	$2.72\times$

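The MoE model stores $2.72\times$ as many parameters as the dense baseline, but only 337,426 of its 732,946 parameters are active for any given input. The trade-off: more GPU memory for the full parameter set, but fewer FLOPs per input than a dense model of the same total size would need. Note that the active count still exceeds the dense baseline's 269,450, so this MoE is not cheaper per input than the baseline itself.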
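
The counts in the table can be reproduced with simple arithmetic. The layer shapes below are an assumption, reverse-engineered from the reported totals rather than taken from the implementation: a 784-to-256 input layer, experts shaped 256-to-128-to-256, a biased 256-to-8 gate, and a 256-to-10 output layer, with the dense baseline using a single expert-shaped block.

```python
def linear_params(n_in, n_out, bias=True):
    """Parameter count of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


# Assumed shapes (hypothetical; chosen because they match the reported counts).
stem = linear_params(784, 256)
expert = linear_params(256, 128) + linear_params(128, 256)
gate = linear_params(256, 8)
head = linear_params(256, 10)

NUM_EXPERTS, TOP_K = 8, 2

moe_total = stem + gate + NUM_EXPERTS * expert + head
moe_active = stem + gate + TOP_K * expert + head
dense_total = stem + expert + head

print(moe_total, moe_active, dense_total)
print(f"inactive: {1 - moe_active / moe_total:.1%}")
print(f"capacity multiplier: {moe_total / dense_total:.2f}x")
```

Under these assumed shapes, the six unselected experts account for the entire inactive share, while the input layer, gate, and output layer run for every input.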

Lessons Learned

The gating network is just a linear layer. No attention, no deep network. The expressiveness comes from softmax competition between experts.
Load balancing is non-negotiable. Without the auxiliary loss, 2 of 8 experts ended up handling over 80% of all inputs within a few epochs.
Specialization is unsupervised. We never told Expert 4 to handle zeros or Expert 7 to handle ones. Routing patterns fell out of the classification gradient and the load balancing constraint together.
MoE starts slower. The dense model converges faster early on -- it does not need to simultaneously learn what to compute and who should compute it.
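The auxiliary loss dominates the total loss. At convergence, classification loss is about 0.004 and total loss is 0.21. Nearly all of the total is the weighted auxiliary term: its unweighted value (about 2.10) sits close to the balanced-routing equilibrium of $k = 2$, and scaling by $\alpha = 0.1$ gives roughly 0.21.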
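
To make the first lesson concrete, here is a dependency-free sketch of a linear gate with top-2 selection. It is not the series' `TopKGating` class. It assumes softmax is taken over all experts before selection and that the kept weights are renormalized, and it leaves out any training-time gating noise. The function names, weights, and input are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(x, weight, bias, k=2):
    """Linear layer -> softmax -> keep the k largest -> renormalize.

    weight holds one row of input weights per expert; bias holds one value per expert.
    Returns {expert_index: mixing_weight} for the selected experts.
    """
    logits = [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weight, bias)
    ]
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}


# Toy example: a 3-dimensional input and 4 experts.
x = [1.0, 0.0, -1.0]
weight = [
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
bias = [0.0, 0.0, 0.0, 0.0]

gates = top_k_gate(x, weight, bias)
print(gates)
```

Only the selected experts would be evaluated, with their outputs mixed using the returned weights. Everything else about routing comes from the single linear map and the softmax.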

Conclusion

Across three parts, we:

Derived the math: sparse gating, top-$k$ routing, and load balancing loss.
Implemented Expert, TopKGating, MoELayer, and MoEClassifier from scratch in PyTorch.
Trained on MNIST: the 8-expert, top-2 MoE reached 95.14% test accuracy (vs 95.36% for the dense baseline) while activating only 46% of its parameters per input.

The same components -- gating, load balancing, top-$k$ selection -- scale from this 733K-parameter model to Mixtral's 47B-parameter system. The mechanism does not change; only the numbers do.
