In the final installment of this series, we train our from-scratch MoE implementation on MNIST and compare it against a dense MLP baseline. The MoE model (8 experts, top-2 routing, 732,946 total parameters) achieves 95.14% test accuracy -- within 0.22 percentage points of the dense baseline (95.36%) -- while keeping 54% of its parameters inactive per input. We analyze expert specialization patterns, demonstrating that experts autonomously learn to handle distinct digit classes without any explicit supervision of the routing.
Experimental Setup
| Property | MoE | Dense MLP |
|---|---|---|
| Total parameters | 732,946 | 269,450 |
| Active parameters | 337,426 | 269,450 |
| Experts | 8 (top-2 active) | -- |
| Hidden dim | 256 | 256 |
Training: MNIST, 5,000-image training subset, 30 epochs, Adam lr=$10^{-3}$, batch size 64, auxiliary loss weight $\alpha = 0.1$.
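For concreteness, the training loop reduces to a few lines. The sketch below is a minimal reconstruction of the setup described above; it assumes the `MoEClassifier` from Part 2 returns both the logits and the auxiliary load-balancing loss, and names like `train_loader` and the constructor arguments are placeholders rather than the exact API.

```python
import torch
import torch.nn.functional as F

# Hypothetical reconstruction of the training setup; exact signatures may differ from Part 2.
model = MoEClassifier(num_experts=8, top_k=2, hidden_dim=256)   # assumed constructor
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
alpha = 0.1  # auxiliary loss weight

for epoch in range(30):
    for images, labels in train_loader:  # 5,000-image MNIST subset, batch size 64
        logits, aux_loss = model(images.view(images.size(0), -1))  # assumed to return (logits, aux)
        loss = F.cross_entropy(logits, labels) + alpha * aux_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```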
Accuracy Comparison
| Metric | MoE | Dense MLP |
|---|---|---|
| Final training accuracy | 99.86% | 100.00% |
| Test accuracy | 95.14% | 95.36% |
| Final training loss | 0.2147 | 0.0001 |
The MoE model achieves 95.14% test accuracy, within 0.22 percentage points of the dense baseline. The MoE model's training loss plateaus around 0.21, of which only 0.004 is classification loss -- the remainder is the auxiliary load balancing term ($0.1 \times 2.10 \approx 0.21$), and an auxiliary loss of roughly 2.1 is what balanced routing settles at in this setup.
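For reference, one common formulation of that auxiliary term is the Switch-Transformer-style product of assignment fractions and mean gate probabilities. The sketch below is not necessarily the exact formula from Part 1, but with 8 experts and top-2 routing it also settles near 2 when routing is balanced, which lines up with the observed plateau.

```python
import torch

def load_balancing_loss(gate_probs: torch.Tensor, topk_idx: torch.Tensor, num_experts: int = 8):
    """Switch-style balance loss (a sketch; Part 1's exact formulation may differ).

    gate_probs: (batch, num_experts) softmax output of the gate.
    topk_idx:   (batch, k) indices of the experts selected for each sample.
    """
    # f_i: fraction of top-k assignments that go to expert i (sums to k over experts)
    assignments = torch.zeros_like(gate_probs).scatter_(1, topk_idx, 1.0)
    f = assignments.mean(dim=0)
    # P_i: mean gate probability assigned to expert i
    p = gate_probs.mean(dim=0)
    return num_experts * torch.sum(f * p)

# Perfectly uniform top-2 routing over 8 experts gives 8 * sum(0.25 * 0.125) = 2.0,
# close to the ~2.1 plateau reported above.
```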
Learning Dynamics
- Epoch 1: MoE 63.98% vs Dense 75.68%. The dense model starts faster because every parameter contributes immediately. MoE needs time for the gating network to learn useful routing.
- Epoch 5: MoE 95.66% vs Dense 96.84%. The gap narrows as experts begin to specialize.
- Epoch 15: MoE 99.60% vs Dense 99.80%. Both models approach training set saturation.
- Epoch 30: MoE 99.86% vs Dense 100.00%. The MoE model's residual error is due to the noise in gating, which occasionally misroutes samples during training.
Expert Specialization Analysis
The most striking result is the emergent specialization of experts. No supervision was provided about which expert should handle which digits -- the gating network discovered these assignments autonomously. The percentages below indicate each expert's share of the routing for that digit class.
| Digit | Primary Expert | Secondary Expert |
|---|---|---|
| 0 | Expert 4 (49.18%) | Expert 2 (25.36%) |
| 1 | Expert 7 (49.78%) | Expert 1 (29.65%) |
| 2 | Expert 5 (47.63%) | Expert 0 (28.97%) |
| 3 | Expert 0 (47.92%) | Expert 7 (27.18%) |
| 4 | Expert 6 (46.23%) | Expert 3 (40.58%) |
| 5 | Expert 0 (29.48%) | Expert 4 (24.83%) |
| 6 | Expert 1 (49.06%) | Expert 4 (32.31%) |
| 7 | Expert 2 (42.85%) | Expert 3 (37.35%) |
| 8 | Expert 6 (35.78%) | Expert 5 (17.71%) |
| 9 | Expert 3 (47.47%) | Expert 6 (33.45%) |
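The table can be produced by tallying routing decisions class by class. The sketch below shows one way to collect such statistics; `model.route(...)` is a hypothetical helper returning the top-2 expert indices per sample, not necessarily the actual interface from Part 2.

```python
import torch

counts = torch.zeros(10, 8)  # (digit class, expert)
model.eval()
with torch.no_grad():
    for images, labels in test_loader:
        # hypothetical helper: top-2 expert indices per sample, shape (batch, 2)
        topk_idx = model.route(images.view(images.size(0), -1))
        for digit in range(10):
            idx = topk_idx[labels == digit].reshape(-1)
            counts[digit] += torch.bincount(idx, minlength=8).float()

# Per-digit share of routing assignments going to each expert, as in the table above.
shares = counts / counts.sum(dim=1, keepdim=True)
```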
Patterns in Expert Sharing
Several interesting patterns emerge:
- Expert 0 is shared between digits 3 and 5 -- both have curved strokes in their lower halves.
- Expert 6 handles digits 4, 8, and 9 -- all three feature crossing or closed-loop structures in the upper portion.
- Expert 3 is shared between digits 4, 7, and 9 -- all have prominent vertical or angular strokes.
- Expert 4 routes digits 0, 5, and 6 -- digits with rounded, enclosed shapes.
These sharing patterns suggest the experts have learned to specialize not on individual digit identities but on visual features -- strokes, curves, and structural elements shared across digit classes. This is a form of compositional representation: the top-2 routing allows each digit to be represented as a combination of two feature-processing specialists.
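In symbols, each input $x$ is processed as $y(x) = \sum_{i \in \text{top-2}(x)} g_i(x)\, E_i(x)$: a gate-weighted sum of the two selected experts. (This is the generic top-2 mixture form; the notation here is ours, not necessarily Part 1's.)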
Parameter Efficiency
| Metric | Value |
|---|---|
| MoE total parameters | 732,946 |
| MoE active parameters per input | 337,426 |
| Dense baseline parameters | 269,450 |
| MoE sparsity ratio | 54.0% inactive |
| MoE capacity multiplier | $2.72\times$ |
The MoE model stores $2.72\times$ as many parameters as the dense baseline, but for any given input, only 337,426 of those 732,946 parameters are active. This is the fundamental trade-off of MoE: memory for compute. You pay more in storage (GPU memory) but less in computation (FLOPs per token).
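As a sanity check, the parameter counts can be reproduced from the layer sizes. The breakdown below is one layout consistent with the reported totals: a shared 784→256 input projection, eight two-layer 256→128→256 experts, a 256→8 gate, and a 256→10 classifier head. These layer shapes are inferred from the totals, not quoted from Part 2's code.

```python
# Reconstructing the parameter counts from inferred layer shapes (Linear = in*out + out).
input_proj = 784 * 256 + 256                      # 200,960 -- shared by every input
expert = (256 * 128 + 128) + (128 * 256 + 256)    #  65,920 per expert (256->128->256 MLP)
gate = 256 * 8 + 8                                #   2,056
head = 256 * 10 + 10                              #   2,570

total  = input_proj + 8 * expert + gate + head    # 732,946
active = input_proj + 2 * expert + gate + head    # 337,426
dense  = 784 * 256 + 256 + expert + head          # 269,450 (a matching dense layout)
print(total, active, dense, 1 - active / total)   # sparsity ~= 0.54
```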
Lessons Learned
- The gating network is simple. It is a single linear layer -- no attention, no deep network. The expressiveness comes from the softmax competition between experts.
- Load balancing is non-negotiable. Without the auxiliary loss, expert utilization collapses within the first few epochs. In our experiments, removing the load balancing loss caused 2 of 8 experts to handle over 80% of all tokens.
- Specialization is emergent. We never told Expert 4 to handle zeros or Expert 7 to handle ones. The routing patterns emerged purely from the combination of the classification gradient signal and the load balancing constraint.
- MoE starts slower. The dense model converges faster in early epochs because it does not need to simultaneously learn what to compute and who should compute it.
- The auxiliary loss dominates the total loss. At convergence, the classification loss is 0.004 but the total loss is 0.21 -- nearly all of it is the auxiliary loss at its balanced equilibrium value. This is expected and correct.
Conclusion
Over three parts, we have:
- Derived the mathematical foundations of MoE: sparse gating, top-$k$ routing, and load balancing loss.
- Implemented every component from scratch in pure PyTorch: Expert, TopKGating, MoELayer, and MoEClassifier.
- Demonstrated on MNIST that an 8-expert, top-2 MoE model achieves 95.14% accuracy (within 0.22 percentage points of a dense baseline at 95.36%) while activating only 46% of its parameters per input.
The architectural principle scales from our 733K-parameter toy model to Mixtral's 47B-parameter production system without conceptual change. The gating mechanism, the load balancing loss, the top-$k$ selection -- they are all the same. Only the numbers change.
Mixture of Experts is not just an optimization trick. It is a fundamentally different way of thinking about neural network capacity: not as a fixed resource that every input must share, but as a pool of specialists that can be dynamically composed based on what each input requires.