We train the from-scratch MoE implementation on MNIST and compare it against a dense MLP baseline. The MoE model (8 experts, top-2 routing, 732,946 total parameters) reaches 95.14% test accuracy -- 0.22 percentage points below the dense baseline (95.36%) -- with 54% of its parameters inactive for any given input. We also look at how experts specialize across digit classes without any explicit routing supervision.
Experimental Setup
| Property | MoE | Dense MLP |
|---|---|---|
| Total parameters | 732,946 | 269,450 |
| Active parameters | 337,426 | 269,450 |
| Experts | 8 (top-2 active) | -- |
| Hidden dim | 256 | 256 |
Training: MNIST, 5,000-image training subset, 30 epochs, Adam with learning rate $10^{-3}$, batch size 64, auxiliary loss weight $\alpha = 0.1$.
Accuracy Comparison
| Metric | MoE | Dense MLP |
|---|---|---|
| Final training accuracy | 99.86% | 100.00% |
| Test accuracy | 95.14% | 95.36% |
| Final training loss | 0.2147 | 0.0001 |
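The MoE model lands at 95.14% test accuracy, 0.22 percentage points below the dense baseline. Its training loss plateaus around 0.21, but only 0.004 of that is classification loss. The rest is the weighted auxiliary load balancing term ($\alpha \times 2.10 = 0.1 \times 2.10 \approx 0.21$): the raw auxiliary loss sits close to $k = 2$, its expected value when routing is balanced. The two training-loss figures in the table are therefore not directly comparable, since the dense model has no auxiliary term.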
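The equilibrium value of roughly 2 is what a Switch-Transformer-style balancing loss, $N \sum_i f_i P_i$, gives under uniform routing. The sketch below assumes that form; the exact loss used in the implementation is not restated in this part, so the function name `load_balance_loss` and the normalization of $f_i$ are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def load_balance_loss(gate_logits: torch.Tensor, k: int) -> torch.Tensor:
    """Assumed Switch-style auxiliary loss: N * sum_i f_i * P_i.

    f_i is the fraction of inputs whose top-k set contains expert i
    (so the f_i sum to k), and P_i is the mean softmax probability
    assigned to expert i over the batch.
    """
    num_experts = gate_logits.shape[-1]
    probs = F.softmax(gate_logits, dim=-1)        # (batch, N)
    topk_idx = probs.topk(k, dim=-1).indices      # (batch, k)
    # One-hot the selected experts and sum over the k slots -> (batch, N)
    dispatch = F.one_hot(topk_idx, num_experts).sum(dim=1).float()
    f = dispatch.mean(dim=0)                      # (N,)
    p = probs.mean(dim=0)                         # (N,)
    return num_experts * torch.sum(f * p)


# Uniform gate probabilities give a loss of k.
uniform = torch.zeros(16, 8)
balanced_value = load_balance_loss(uniform, k=2).item()

# All inputs routed to the same two experts give a loss close to N.
collapsed = torch.full((16, 8), -20.0)
collapsed[:, :2] = 20.0
collapsed_value = load_balance_loss(collapsed, k=2).item()
```

Under this assumed form the loss ranges from about $k = 2$ (balanced) to about $N = 8$ (fully collapsed), so a raw value of 2.10 indicates near-uniform expert usage. With the figures reported above, the total is $0.004 + 0.1 \times 2.10 \approx 0.214$.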
Learning Dynamics
- Epoch 1: MoE 63.98% vs Dense 75.68% (training accuracy, as for all epochs below). The dense model starts faster -- every parameter contributes immediately, while MoE still has to learn where to route.
- Epoch 5: MoE 95.66% vs Dense 96.84%. The gap narrows as experts specialize.
- Epoch 15: MoE 99.60% vs Dense 99.80%. Both are near training set saturation.
- Epoch 30: MoE 99.86% vs Dense 100.00%. The residual MoE error likely comes from gating noise occasionally misrouting samples during training.
Expert Specialization Analysis
No supervision was provided about which expert should handle which digits -- the gating network learned these assignments on its own.
| Digit | Primary Expert | Secondary Expert |
|---|---|---|
| 0 | Expert 4 (49.18%) | Expert 2 (25.36%) |
| 1 | Expert 7 (49.78%) | Expert 1 (29.65%) |
| 2 | Expert 5 (47.63%) | Expert 0 (28.97%) |
| 3 | Expert 0 (47.92%) | Expert 7 (27.18%) |
| 4 | Expert 6 (46.23%) | Expert 3 (40.58%) |
| 5 | Expert 0 (29.48%) | Expert 4 (24.83%) |
| 6 | Expert 1 (49.06%) | Expert 4 (32.31%) |
| 7 | Expert 2 (42.85%) | Expert 3 (37.35%) |
| 8 | Expert 6 (35.78%) | Expert 5 (17.71%) |
| 9 | Expert 3 (47.47%) | Expert 6 (33.45%) |
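A table like this can be built by counting, for each digit class, how often each expert appears among the selected top-2. The sketch below is one way to do that, not necessarily how the numbers above were produced. It assumes the routing indices have been collected into a `(num_samples, k)` tensor and that shares are normalized over all top-$k$ routing slots, which would be consistent with no entry in the table exceeding 50%. The function name `expert_usage_by_class` is hypothetical.

```python
import torch


def expert_usage_by_class(topk_idx, labels, num_experts, num_classes):
    """Return a (num_classes, num_experts) tensor of routing shares in percent.

    topk_idx: (num_samples, k) int64 tensor of selected expert indices.
    labels:   (num_samples,) int64 tensor of class labels.
    Each row is normalized over all k routing slots of that class, so a row
    sums to 100 and a single expert can reach at most 100 / k percent.
    """
    k = topk_idx.shape[1]
    # Encode every (label, expert) pair as one integer and count occurrences.
    flat = labels.repeat_interleave(k) * num_experts + topk_idx.reshape(-1)
    counts = torch.bincount(flat, minlength=num_classes * num_experts)
    counts = counts.reshape(num_classes, num_experts).float()
    totals = counts.sum(dim=1, keepdim=True).clamp(min=1.0)
    return 100.0 * counts / totals


# Toy example: two samples of digit 0 and one sample of digit 1.
labels = torch.tensor([0, 0, 1])
topk_idx = torch.tensor([[4, 2], [4, 3], [7, 1]])
usage = expert_usage_by_class(topk_idx, labels, num_experts=8, num_classes=10)
# Primary and secondary expert per class are the two largest entries per row.
shares, experts = usage.topk(2, dim=1)
```

In the toy example, digit 0 sends half of its routing slots to Expert 4 and a quarter each to Experts 2 and 3.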
Patterns in Expert Sharing
- Expert 0 is the primary expert for digits 3 and 5, both of which have curved strokes in their lower halves; it is also the secondary expert for digit 2.
- Expert 6 handles digits 4, 8, and 9 (primary for 4 and 8, secondary for 9) -- all three feature crossing or closed-loop structures in the upper portion.
- Expert 3 is shared between digits 4, 7, and 9 (primary for 9, secondary for 4 and 7) -- all have prominent vertical or angular strokes.
- Expert 4 receives digits 0, 5, and 6 (primary for 0, secondary for 5 and 6) -- digits with rounded or enclosed shapes.
The experts appear to specialize on visual features -- strokes, curves, structural elements -- rather than on individual digit identities. Top-2 routing lets each digit be represented as a combination of two feature-processing specialists.
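The sharing patterns can be checked mechanically by inverting the routing table. The snippet below copies the primary and secondary expert for each digit from the table and groups digits by expert; the variable names are illustrative only.

```python
from collections import defaultdict

# (primary, secondary) expert per digit, copied from the routing table.
top2_by_digit = {
    0: (4, 2), 1: (7, 1), 2: (5, 0), 3: (0, 7), 4: (6, 3),
    5: (0, 4), 6: (1, 4), 7: (2, 3), 8: (6, 5), 9: (3, 6),
}

digits_by_expert = defaultdict(list)
for digit, experts in top2_by_digit.items():
    for expert in experts:
        digits_by_expert[expert].append(digit)

for expert in sorted(digits_by_expert):
    print(f"Expert {expert}: digits {digits_by_expert[expert]}")
# Expert 0: digits [2, 3, 5]
# Expert 1: digits [1, 6]
# Expert 2: digits [0, 7]
# Expert 3: digits [4, 7, 9]
# Expert 4: digits [0, 5, 6]
# Expert 5: digits [2, 8]
# Expert 6: digits [4, 8, 9]
# Expert 7: digits [1, 3]
```

All eight experts appear in the top-2 of at least two digits, and no digit shares both of its experts with another digit, which fits the reading that each digit is encoded as a distinct pair of feature specialists.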
Parameter Efficiency
| Metric | Value |
|---|---|
| MoE total parameters | 732,946 |
| MoE active parameters per input | 337,426 |
| Dense baseline parameters | 269,450 |
| MoE sparsity ratio | 54.0% inactive |
| MoE capacity multiplier | $2.72\times$ |
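The MoE model stores $2.72\times$ as many parameters as the dense baseline, but only 337,426 of 732,946 (46%) are active for any given input. The trade-off: more memory in exchange for fewer FLOPs per input than a dense model of the same total size. Note that the active count is still about 25% higher than this particular 269,450-parameter baseline, so the MoE here does not compute less than the dense MLP it is compared against.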
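The ratios in the table follow directly from the three parameter counts. The short calculation below reproduces them and also backs out the size of a single expert, assuming all eight experts are the same size (an assumption, though the counts divide evenly).

```python
total_params = 732_946    # MoE, all 8 experts included
active_params = 337_426   # MoE, shared layers plus the 2 selected experts
dense_params = 269_450    # dense MLP baseline
num_experts, k = 8, 2

inactive_fraction = 1 - active_params / total_params
capacity_multiplier = total_params / dense_params
active_vs_dense = active_params / dense_params

# The inactive parameters belong to the (num_experts - k) unselected experts.
params_per_expert = (total_params - active_params) // (num_experts - k)
shared_params = total_params - num_experts * params_per_expert

print(f"inactive fraction:   {inactive_fraction:.1%}")    # 54.0%
print(f"capacity multiplier: {capacity_multiplier:.2f}x")  # 2.72x
print(f"active vs dense:     {active_vs_dense:.2f}x")      # 1.25x
print(f"params per expert:   {params_per_expert:,}")       # 65,920
print(f"shared params:       {shared_params:,}")           # 205,586
```

The last two lines show that roughly 72% of the MoE's parameters sit in the experts, which is why skipping six of them per input removes more than half of the model from each forward pass.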
Lessons Learned
- The gating network is just a linear layer. No attention, no deep network. The expressiveness comes from softmax competition between experts.
- Load balancing is non-negotiable. Without the auxiliary loss, 2 of 8 experts ended up handling over 80% of all tokens within a few epochs.
- Specialization is unsupervised. We never told Expert 4 to handle zeros or Expert 7 to handle ones. Routing patterns fell out of the classification gradient and the load balancing constraint together.
- MoE starts slower. The dense model converges faster early on -- it does not need to simultaneously learn what to compute and who should compute it.
- The auxiliary loss dominates the total loss. At convergence, classification loss is 0.004 and total loss is 0.21. Nearly all of it is the weighted auxiliary term: the raw balancing loss sits near its balanced equilibrium value of $k = 2$ (about 2.10 here) and is scaled by $\alpha = 0.1$.
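To make the first lesson concrete, here is a minimal sketch of a gate that is nothing more than a linear layer followed by softmax and top-$k$ selection. It is not the `TopKGating` class from the implementation part: the class name `LinearTopKGate` is hypothetical, the noise term mentioned earlier is omitted, and renormalizing the selected weights to sum to 1 is an assumption about a common design choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearTopKGate(nn.Module):
    """Hypothetical minimal gate: one linear layer, softmax, top-k."""

    def __init__(self, dim: int, num_experts: int, k: int):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor):
        probs = F.softmax(self.proj(x), dim=-1)        # (batch, num_experts)
        weights, indices = probs.topk(self.k, dim=-1)  # (batch, k) each
        # Renormalize so the k selected weights sum to 1 (an assumption).
        weights = weights / weights.sum(dim=-1, keepdim=True)
        return weights, indices


gate = LinearTopKGate(dim=256, num_experts=8, k=2)
weights, indices = gate(torch.randn(4, 256))
num_gate_params = sum(p.numel() for p in gate.parameters())
print(weights.shape, indices.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
print(num_gate_params)               # 2056
```

With a hidden dimension of 256 and 8 experts, such a gate has only 2,056 parameters, a tiny fraction of the model, yet it decides which experts every input sees.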
Conclusion
Across three parts, we:
- Derived the math: sparse gating, top-$k$ routing, and load balancing loss.
- Implemented Expert, TopKGating, MoELayer, and MoEClassifier from scratch in PyTorch.
- Trained on MNIST: the 8-expert, top-2 MoE reached 95.14% test accuracy (vs 95.36% for the dense baseline) while activating only 46% of its parameters per input.
The same components -- gating, load balancing, top-$k$ selection -- scale from this 733K-parameter model to Mixtral's 47B-parameter system. The mechanism does not change; only the numbers do.