Deconstructing MoE from Scratch

Part 3: 8 Experts, 2 Active

Introduction

In the final installment of this series, we train our from-scratch MoE implementation on MNIST and compare it against a dense MLP baseline. The MoE model (8 experts, top-2 routing, 732,946 total parameters) achieves 95.14% test accuracy -- within 0.22 percentage points of the dense baseline (95.36%) -- while keeping 54% of its parameters inactive for any given input. We analyze expert specialization patterns, demonstrating that experts autonomously learn to handle distinct digit classes without any explicit supervision of the routing.

Experimental Setup

| Property | MoE | Dense MLP |
|---|---|---|
| Total parameters | 732,946 | 269,450 |
| Active parameters | 337,426 | 269,450 |
| Experts | 8 (top-2 active) | -- |
| Hidden dim | 256 | 256 |

Training: MNIST, 5,000-image training subset, 30 epochs, Adam lr=$10^{-3}$, batch size 64, auxiliary loss weight $\alpha = 0.1$.
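For reference, the same setup collected into a single configuration dict. This is just a restatement of the numbers above; the key names are my own, not taken from the series' code:

```python
# Hyperparameters from the experimental setup above; key names are illustrative.
config = {
    "dataset": "MNIST",
    "train_subset": 5_000,   # training images
    "epochs": 30,
    "optimizer": "Adam",
    "lr": 1e-3,
    "batch_size": 64,
    "aux_loss_weight": 0.1,  # alpha multiplying the load balancing loss
    "num_experts": 8,
    "top_k": 2,
    "hidden_dim": 256,
}
```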

Accuracy Comparison

| Metric | MoE | Dense MLP |
|---|---|---|
| Final training accuracy | 99.86% | 100.00% |
| Test accuracy | 95.14% | 95.36% |
| Final training loss | 0.2147 | 0.0001 |

The MoE model achieves 95.14% test accuracy, within 0.22 percentage points of the dense baseline. The MoE model's training loss plateaus around 0.21, of which only 0.004 is classification loss -- the remainder is the auxiliary load balancing loss ($0.1 \times 2.10 \approx 0.21$), which is the expected value for balanced routing.
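The equilibrium value of the auxiliary term can be checked with a quick calculation. The sketch below assumes the Switch-style formulation $L_{\text{aux}} = N \sum_i f_i P_i$, where $f_i$ is the fraction of routed top-2 slots assigned to expert $i$ and $P_i$ is the mean gate probability; the exact formulation in the from-scratch code may differ slightly. At perfect balance this gives $L_{\text{aux}} = k = 2.0$, close to the observed $\approx 2.10$:

```python
# Balanced-routing value of a Switch-style auxiliary loss,
# L_aux = N * sum_i(f_i * P_i), scaled by alpha.
# Assumption: f_i counts both top-2 slots, so sum_i(f_i) = k.
N, k, alpha = 8, 2, 0.1

f = [k / N] * N   # each expert gets an equal share of the k routing slots
P = [1 / N] * N   # mean gate probability is uniform at balance

aux = N * sum(fi * pi for fi, pi in zip(f, P))   # = k = 2.0
total_contrib = alpha * aux                      # = 0.2, matching the ~0.21 plateau

print(aux, total_contrib)
```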

Learning Dynamics

The dense model pulls ahead during the first few epochs, since the MoE must simultaneously learn what to compute and which expert should compute it; by the final epoch the test accuracy gap has narrowed to 0.22 percentage points.

Expert Specialization Analysis

The most striking result is the emergent specialization of experts. No supervision was provided about which expert should handle which digits -- the gating network discovered these assignments autonomously.

| Digit | Primary Expert | Secondary Expert |
|---|---|---|
| 0 | Expert 4 (49.18%) | Expert 2 (25.36%) |
| 1 | Expert 7 (49.78%) | Expert 1 (29.65%) |
| 2 | Expert 5 (47.63%) | Expert 0 (28.97%) |
| 3 | Expert 0 (47.92%) | Expert 7 (27.18%) |
| 4 | Expert 6 (46.23%) | Expert 3 (40.58%) |
| 5 | Expert 0 (29.48%) | Expert 4 (24.83%) |
| 6 | Expert 1 (49.06%) | Expert 4 (32.31%) |
| 7 | Expert 2 (42.85%) | Expert 3 (37.35%) |
| 8 | Expert 6 (35.78%) | Expert 5 (17.71%) |
| 9 | Expert 3 (47.47%) | Expert 6 (33.45%) |

Patterns in Expert Sharing

Several interesting patterns emerge:

  1. Expert 0 is shared between digits 3 and 5 -- both have curved strokes in their lower halves.
  2. Expert 6 handles digits 4, 8, and 9 -- all three feature crossing or closed-loop structures in the upper portion.
  3. Expert 3 is shared between digits 4, 7, and 9 -- all have prominent vertical or angular strokes.
  4. Expert 4 routes digits 0, 5, and 6 -- digits with rounded, enclosed shapes.

These sharing patterns suggest the experts have learned to specialize not on individual digit identities but on visual features -- strokes, curves, and structural elements shared across digit classes. This is a form of compositional representation: the top-2 routing allows each digit to be represented as a combination of two feature-processing specialists.
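Specialization tables like the one above can be produced by tallying, for each class, how often each expert appears among the top-2 selections. A minimal numpy sketch -- the gate probabilities and labels here are synthetic stand-ins; the real analysis would use the trained gating network's outputs on the test set:

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, k = 8, 2

# Synthetic stand-ins: gate probabilities and digit labels for 1,000 inputs.
gate_probs = rng.dirichlet(np.ones(num_experts), size=1000)
labels = rng.integers(0, 10, size=1000)

# Indices of the top-2 experts selected for each input.
topk = np.argsort(gate_probs, axis=1)[:, -k:]

for digit in range(10):
    mask = labels == digit
    counts = np.bincount(topk[mask].ravel(), minlength=num_experts)
    share = counts / counts.sum()        # fraction of top-2 slots per expert
    order = np.argsort(share)[::-1]
    print(f"digit {digit}: primary expert {order[0]} ({share[order[0]]:.2%}), "
          f"secondary expert {order[1]} ({share[order[1]]:.2%})")
```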

Parameter Efficiency

| Metric | Value |
|---|---|
| MoE total parameters | 732,946 |
| MoE active parameters per input | 337,426 |
| Dense baseline parameters | 269,450 |
| MoE sparsity ratio | 54.0% inactive |
| MoE capacity multiplier | $2.72\times$ |

The MoE model stores $2.72\times$ as many parameters as the dense baseline, but for any given input, only 337,426 of those 732,946 parameters are active. This is the fundamental trade-off of MoE: memory for compute. You pay more in storage (GPU memory) but less in computation (FLOPs per token).
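Both ratios in the table follow directly from the parameter counts:

```python
moe_total, moe_active, dense_total = 732_946, 337_426, 269_450

inactive_frac = 1 - moe_active / moe_total   # fraction of MoE parameters idle per input
capacity_mult = moe_total / dense_total      # stored capacity relative to the dense MLP

print(f"{inactive_frac:.1%} inactive")    # 54.0% inactive
print(f"{capacity_mult:.2f}x capacity")   # 2.72x capacity
```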

Lessons Learned

  1. The gating network is simple. It is a single linear layer -- no attention, no deep network. The expressiveness comes from the softmax competition between experts.
  2. Load balancing is non-negotiable. Without the auxiliary loss, expert utilization collapses within the first few epochs. In our experiments, removing the load balancing loss caused 2 of 8 experts to handle over 80% of all tokens.
  3. Specialization is emergent. We never told Expert 4 to handle zeros or Expert 7 to handle ones. The routing patterns emerged purely from the combination of the classification gradient signal and the load balancing constraint.
  4. MoE starts slower. The dense model converges faster in early epochs because it does not need to simultaneously learn what to compute and who should compute it.
  5. The auxiliary loss dominates the total loss. At convergence, the classification loss is 0.004 but the total loss is 0.21 -- nearly all of it is the auxiliary loss at its balanced equilibrium value. This is expected and correct.
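The simplicity of the gate (lesson 1) is easy to see in code. Below is a numpy sketch of single-linear-layer top-2 gating with renormalized weights; the function name and shapes are illustrative, not the exact from-scratch implementation:

```python
import numpy as np

def top2_gate(x, W_g):
    """x: (batch, d_model); W_g: (d_model, num_experts)."""
    logits = x @ W_g                                  # the entire gating network
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)         # softmax competition over experts
    idx = np.argsort(probs, axis=1)[:, -2:]           # indices of the top-2 experts
    w = np.take_along_axis(probs, idx, axis=1)
    w /= w.sum(axis=1, keepdims=True)                 # renormalize so the 2 weights sum to 1
    return idx, w

rng = np.random.default_rng(0)
idx, w = top2_gate(rng.standard_normal((4, 16)), rng.standard_normal((16, 8)))
```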

Conclusion

Over three parts, we have:

  1. Derived the mathematical foundations of MoE: sparse gating, top-$k$ routing, and load balancing loss.
  2. Implemented every component from scratch in pure PyTorch: Expert, TopKGating, MoELayer, and MoEClassifier.
  3. Demonstrated on MNIST that an 8-expert, top-2 MoE model achieves 95.14% accuracy (matching a dense baseline at 95.36%) while activating only 46% of its parameters per input.

The architectural principle scales from our 733K-parameter toy model to Mixtral's 47B-parameter production system without conceptual change. The gating mechanism, the load balancing loss, the top-$k$ selection -- they are all the same. Only the numbers change.

Mixture of Experts is not just an optimization trick. It is a fundamentally different way of thinking about neural network capacity: not as a fixed resource that every input must share, but as a pool of specialists that can be dynamically composed based on what each input requires.