Deconstructing MoE from Scratch
Explore the mathematics of conditional computation, sparse gating, and expert specialization — built entirely from scratch in pure PyTorch.
Part 1
The Math of Conditional Computation
Breaking down gating networks, Top-K routing, load balancing loss, and why sparsity enables massive scale.
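The Top-K routing idea above can be sketched in a few lines: softmax only over the k largest gate logits per input, leaving every other expert weight at exactly zero. This is a minimal illustration, not the article's exact code; the function name `top_k_gate` is an assumption.

```python
import torch
import torch.nn.functional as F

def top_k_gate(logits: torch.Tensor, k: int = 2):
    """Sparse gating sketch (hypothetical helper, not the article's code).

    logits: (batch, num_experts) raw gate scores.
    Returns (batch, num_experts) weights with at most k nonzeros per row,
    plus the indices of the selected experts.
    """
    top_vals, top_idx = logits.topk(k, dim=-1)                 # keep k best experts
    weights = torch.zeros_like(logits).scatter_(               # zeros elsewhere -> sparsity
        -1, top_idx, F.softmax(top_vals, dim=-1)               # renormalize over the k kept
    )
    return weights, top_idx

torch.manual_seed(0)
w, idx = top_k_gate(torch.randn(4, 8), k=2)
print((w > 0).sum(dim=-1))   # exactly 2 active experts per input
print(w.sum(dim=-1))         # kept weights sum to 1 per input
```

Because the non-selected experts receive an exactly-zero weight, their forward passes can be skipped entirely, which is what makes the computation conditional rather than merely reweighted.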
Part 2
PyTorch Implementation
Building Expert MLPs, TopKGating with noisy exploration, and auxiliary load-balancing loss — entirely from scratch.
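As a rough sketch of what a noisy top-k gate with an auxiliary balancing term might look like (the class name `TopKGating`, its attributes, and the Switch-style loss are assumptions based on this description, not the linked repository's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGating(nn.Module):
    """Hypothetical noisy top-k gate sketch; see the linked repo for the real one."""

    def __init__(self, dim: int, num_experts: int, k: int = 2, noise_std: float = 1.0):
        super().__init__()
        self.w_gate = nn.Linear(dim, num_experts, bias=False)
        self.k = k
        self.noise_std = noise_std

    def forward(self, x):
        logits = self.w_gate(x)                                # (batch, num_experts)
        if self.training and self.noise_std > 0:
            # Noisy exploration: perturb logits so under-used experts still get traffic.
            logits = logits + torch.randn_like(logits) * self.noise_std
        top_vals, top_idx = logits.topk(self.k, dim=-1)
        gates = torch.zeros_like(logits).scatter_(
            -1, top_idx, F.softmax(top_vals, dim=-1)
        )
        # Switch-style auxiliary loss (assumed variant): penalize correlation between
        # mean router probability and fraction of tokens routed to each expert.
        probs = F.softmax(logits, dim=-1)
        importance = probs.mean(dim=0)                         # soft usage per expert
        load = (gates > 0).float().mean(dim=0)                 # hard usage per expert
        aux_loss = logits.size(-1) * (importance * load).sum()
        return gates, aux_loss

torch.manual_seed(0)
gate = TopKGating(dim=16, num_experts=8, k=2).eval()           # eval: no noise
g, aux = gate(torch.randn(32, 16))
print(g.shape, aux.item())
```

The auxiliary term is minimized when routing is uniform across experts, which counteracts the gate's tendency to collapse onto a few favorites early in training.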
View Code on GitHub
Part 3
8 Experts, 2 Active
Reaching 95.14% MNIST accuracy with only 54% of parameters active per input, plus visualizations of emergent expert specialization per digit class.
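Why 54% and not 2/8 = 25%? The gate, embeddings, and any shared layers are active for every input; only the expert MLPs are sparsely activated. The arithmetic below uses illustrative parameter counts (chosen to land near the reported figure, not taken from the actual model):

```python
# Hypothetical parameter budget -- illustrative numbers only, not the article's model.
num_experts, k = 8, 2
shared_params = 50_000      # gate + shared layers, always active
per_expert = 10_000         # one expert MLP

total_params = shared_params + num_experts * per_expert
active_params = shared_params + k * per_expert

print(f"active fraction: {active_params / total_params:.2%}")  # -> 53.85%
```

The larger the shared backbone relative to each expert, the closer the active fraction drifts toward 100%; sparsity pays off most when the experts hold the bulk of the parameters.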