Deconstructing MoE from Scratch

Part 3: 8 Experts, 2 Active

Introduction

We train the from-scratch MoE implementation on MNIST and compare it against a dense MLP baseline. The MoE model (8 experts, top-2 routing, 732,946 total parameters) reaches 95.14% test accuracy -- within 0.22 percentage points of the dense baseline (95.36%) -- while leaving 54% of its parameters inactive for any given input. We also look at how experts specialize across digit classes without any explicit routing supervision.

Experimental Setup

| Property | MoE | Dense MLP |
| --- | --- | --- |
| Total parameters | 732,946 | 269,450 |
| Active parameters | 337,426 | 269,450 |
| Experts | 8 (top-2 active) | -- |
| Hidden dim | 256 | 256 |

Training: MNIST, 5,000-image training subset, 30 epochs, Adam lr=$10^{-3}$, batch size 64, auxiliary loss weight $\alpha = 0.1$.
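To make the training budget concrete, the setup can be written down as a small config object. This is only a sketch: the class and field names are hypothetical (they are not taken from the earlier parts), and the step count assumes the final partial batch of each epoch is kept rather than dropped.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values are the ones reported in the setup; the field names are hypothetical.
    train_subset_size: int = 5_000
    epochs: int = 30
    learning_rate: float = 1e-3
    batch_size: int = 64
    aux_loss_weight: float = 0.1  # alpha
    num_experts: int = 8
    top_k: int = 2

    def steps_per_epoch(self) -> int:
        # Assumption: the last, smaller batch is kept (no drop_last).
        return math.ceil(self.train_subset_size / self.batch_size)

    def total_steps(self) -> int:
        return self.epochs * self.steps_per_epoch()


cfg = TrainConfig()
print(cfg.steps_per_epoch(), cfg.total_steps())
```

Under that assumption the whole run is a few thousand optimizer steps, which is worth keeping in mind when reading the accuracy numbers.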

Accuracy Comparison

| Metric | MoE | Dense MLP |
| --- | --- | --- |
| Final training accuracy | 99.86% | 100.00% |
| Test accuracy | 95.14% | 95.36% |
| Final training loss | 0.2147 | 0.0001 |

The MoE model lands at 95.14% test accuracy, 0.22 percentage points below the dense baseline. Its training loss plateaus around 0.21, but only 0.004 of that is classification loss -- the rest is the weighted auxiliary load balancing term ($0.1 \times 2.10 \approx 0.21$), which is close to the level expected under balanced routing.
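The loss decomposition is easy to check by hand. The sketch below uses only the reported numbers and assumes the total loss is the classification loss plus $\alpha$ times the auxiliary loss; the variable names are ours.

```python
alpha = 0.1                  # auxiliary loss weight from the setup
total_loss = 0.2147          # reported final MoE training loss
classification_loss = 0.004  # reported classification part

# Assumption: total = classification + alpha * auxiliary
weighted_aux = total_loss - classification_loss  # alpha * auxiliary loss
raw_aux = weighted_aux / alpha                   # auxiliary loss before weighting
aux_share = weighted_aux / total_loss            # fraction of the total loss

print(f"weighted aux term: {weighted_aux:.4f}")
print(f"raw aux loss:      {raw_aux:.3f}")
print(f"aux share of loss: {aux_share:.1%}")
```

The raw auxiliary loss comes out slightly above 2, so comparing the MoE and dense training losses directly is misleading: almost all of the MoE figure is the balancing term rather than classification error.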

Learning Dynamics

Expert Specialization Analysis

No supervision was provided about which expert should handle which digits -- the gating network learned these assignments on its own.

| Digit | Primary Expert | Secondary Expert |
| --- | --- | --- |
| 0 | Expert 4 (49.18%) | Expert 2 (25.36%) |
| 1 | Expert 7 (49.78%) | Expert 1 (29.65%) |
| 2 | Expert 5 (47.63%) | Expert 0 (28.97%) |
| 3 | Expert 0 (47.92%) | Expert 7 (27.18%) |
| 4 | Expert 6 (46.23%) | Expert 3 (40.58%) |
| 5 | Expert 0 (29.48%) | Expert 4 (24.83%) |
| 6 | Expert 1 (49.06%) | Expert 4 (32.31%) |
| 7 | Expert 2 (42.85%) | Expert 3 (37.35%) |
| 8 | Expert 6 (35.78%) | Expert 5 (17.71%) |
| 9 | Expert 3 (47.47%) | Expert 6 (33.45%) |
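A table like this can be built by tallying, per digit, which experts the gate selected. The sketch below is one way to do it, not necessarily how the numbers above were produced: it assumes the percentages are shares of top-2 routing slots (the section does not say), and the function names and toy routing log are made up for illustration.

```python
from collections import Counter, defaultdict


def expert_shares_by_digit(labels, topk_indices, num_experts=8):
    """Return {digit: [share of routing slots per expert]}.

    labels: iterable of int digit labels.
    topk_indices: iterable of tuples, the experts selected for each input.
    """
    counts = defaultdict(Counter)
    for digit, experts in zip(labels, topk_indices):
        counts[digit].update(experts)
    shares = {}
    for digit, counter in counts.items():
        total_slots = sum(counter.values())
        shares[digit] = [counter[e] / total_slots for e in range(num_experts)]
    return shares


def top_two(shares_row):
    """Indices of the primary and secondary expert for one digit."""
    ranked = sorted(range(len(shares_row)), key=lambda e: shares_row[e], reverse=True)
    return ranked[0], ranked[1]


# Toy routing log (illustrative only, not the article's data).
labels = [0, 0, 0, 1, 1, 1]
topk = [(4, 2), (4, 2), (4, 5), (7, 1), (7, 1), (7, 3)]
shares = expert_shares_by_digit(labels, topk)
print(top_two(shares[0]), top_two(shares[1]))
```

Under this slot-share definition no expert can exceed 50% for a digit with top-2 routing, since each input fills two slots with two different experts. Every primary share in the table is below 50%, which is consistent with that reading but does not confirm it.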

Patterns in Expert Sharing

  1. Expert 0 is the primary expert for digits 3 and 5 -- both have curved strokes in their lower halves. (It is also the secondary expert for digit 2.)
  2. Expert 6 handles digits 4, 8, and 9 (primary for 4 and 8, secondary for 9) -- all three feature crossing or closed-loop structures in the upper portion.
  3. Expert 3 is shared between digits 4, 7, and 9 (secondary for 4 and 7, primary for 9) -- all have prominent vertical or angular strokes.
  4. Expert 4 routes digits 0, 5, and 6 (primary for 0, secondary for 5 and 6) -- digits with rounded, enclosed shapes.

The experts appear to specialize on visual features -- strokes, curves, structural elements -- rather than on individual digit identities. Top-2 routing lets each digit be represented as a combination of two feature-processing specialists.
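To show what "a combination of two specialists" means mechanically, here is a minimal pure-Python sketch of top-2 gating for a single input. It is not the PyTorch code from Part 2: the helper names and toy experts are hypothetical, and it assumes the two selected gate probabilities are renormalized to sum to 1 (equivalent to a softmax over just the two selected logits).

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def top_k_gate(logits, k=2):
    """Pick the k largest gate logits and renormalize their softmax weights."""
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}


def moe_output(x, experts, gate_logits, k=2):
    """Weighted sum of the k selected experts' outputs; the others are never called."""
    weights = top_k_gate(gate_logits, k)
    out = [0.0] * len(x)
    for i, w in weights.items():
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


# Toy experts: expert i scales its input by (i + 1). Purely illustrative.
experts = [lambda v, s=i + 1: [s * a for a in v] for i in range(8)]
gate_logits = [0.1, 0.0, 2.0, -1.0, 1.0, 0.3, -0.5, 0.0]
weights = top_k_gate(gate_logits)
out = moe_output([1.0, -1.0], experts, gate_logits)
print(weights, out)
```

Here experts 2 and 4 are selected and the remaining six are skipped entirely, which is where the inactive-parameter saving comes from.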

Parameter Efficiency

| Metric | Value |
| --- | --- |
| MoE total parameters | 732,946 |
| MoE active parameters per input | 337,426 |
| Dense baseline parameters | 269,450 |
| MoE sparsity ratio | 54.0% inactive |
| MoE capacity multiplier | $2.72\times$ |

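The MoE model stores $2.72\times$ as many parameters as the dense baseline, but only 337,426 of 732,946 are active for any given input. The trade-off: more memory to hold all eight experts, in exchange for per-input compute that scales with the active parameters rather than the total. The saving is relative to a dense model of the same total size; the MoE's active count is still about 25% larger than this particular dense baseline (337,426 vs 269,450).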
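The ratios in the table follow directly from the three parameter counts. The sketch below recomputes them. The per-expert size at the end is an inference, not a reported figure: it assumes all experts are the same size and that the inactive parameters are exactly the six unselected experts.

```python
total_params = 732_946   # MoE, all experts
active_params = 337_426  # MoE, per input with top-2 routing
dense_params = 269_450   # dense MLP baseline
num_experts, top_k = 8, 2

active_fraction = active_params / total_params
inactive_fraction = 1 - active_fraction
capacity_multiplier = total_params / dense_params
active_vs_dense = active_params / dense_params

# Assumption: equal-sized experts, and only unselected experts are inactive.
params_per_expert = (total_params - active_params) // (num_experts - top_k)

print(f"inactive:            {inactive_fraction:.1%}")
print(f"capacity multiplier: {capacity_multiplier:.2f}x")
print(f"active vs dense:     {active_vs_dense:.2f}x")
print(f"params per expert:   {params_per_expert:,}")
```

The inactive parameters divide evenly by six, which is at least consistent with the equal-sized-experts assumption.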

Lessons Learned

  1. The gating network is just a linear layer. No attention, no deep network. The expressiveness comes from softmax competition between experts.
  2. Load balancing is non-negotiable. Without the auxiliary loss, 2 of 8 experts ended up handling over 80% of all tokens within a few epochs.
  3. Specialization is unsupervised. We never told Expert 4 to handle zeros or Expert 7 to handle ones. Routing patterns fell out of the classification gradient and the load balancing constraint together.
  4. MoE starts slower. The dense model converges faster early on -- it does not need to simultaneously learn what to compute and who should compute it.
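  5. The auxiliary loss dominates the total loss. At convergence, classification loss is 0.004 and total loss is 0.21. Nearly all of it is the auxiliary term: the raw auxiliary loss sits near its balanced-routing value of $k = 2$ (measured: about 2.10), and the weight $\alpha = 0.1$ scales that to about 0.21.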
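Lessons 2 and 5 can be illustrated together with a small calculation. The sketch assumes the auxiliary loss has the common form $N \sum_i f_i P_i$, where $f_i$ is the fraction of inputs that select expert $i$ among their top-$k$ (so the $f_i$ sum to $k$) and $P_i$ is the mean gate probability for expert $i$. The exact definition from Part 1 may differ, and the collapsed case is a hypothetical extreme rather than measured data.

```python
def load_balance_loss(slot_fractions, mean_gate_probs):
    """Assumed form N * sum_i f_i * P_i; Part 1's definition may differ."""
    n = len(slot_fractions)
    return n * sum(f * p for f, p in zip(slot_fractions, mean_gate_probs))


num_experts, top_k = 8, 2

# Balanced: each expert is selected by k/N of the inputs and gets 1/N of the gate mass.
balanced = load_balance_loss([top_k / num_experts] * num_experts,
                             [1 / num_experts] * num_experts)

# Fully collapsed (hypothetical): two experts are selected by every input
# and share all of the gate probability.
collapsed = load_balance_loss([1.0, 1.0] + [0.0] * 6,
                              [0.5, 0.5] + [0.0] * 6)

print(balanced, collapsed)
```

Under this assumed form, balanced routing gives exactly $k$, matching the equilibrium value quoted in lesson 5, while full collapse onto two experts gives $N$, four times larger here. That gap is the pressure that keeps the gate from routing everything to a couple of experts.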

Conclusion

Across three parts, we:

  1. Derived the math: sparse gating, top-$k$ routing, and load balancing loss.
  2. Implemented Expert, TopKGating, MoELayer, and MoEClassifier from scratch in PyTorch.
  3. Trained on MNIST: the 8-expert, top-2 MoE reached 95.14% test accuracy (vs 95.36% for the dense baseline) while activating only 46% of its parameters per input.

The same components -- gating, load balancing, top-$k$ selection -- scale from this 733K-parameter model to Mixtral's 47B-parameter system. The mechanism does not change; only the numbers do.