Deconstructing MoE from Scratch
Explore the mathematics of conditional computation, sparse gating, and expert specialization — built entirely from scratch in pure PyTorch.
Part 1
The Math of Conditional Computation
Breaking down gating networks, Top-K routing, load balancing loss, and why sparsity enables massive scale.
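The Top-K routing idea above can be sketched in a few lines: softmax only over the k largest gate logits per input, leaving every other expert weight at exactly zero. This is a minimal illustration, not the article's exact code; the function name `top_k_gate` is an assumption.

```python
import torch
import torch.nn.functional as F

def top_k_gate(logits: torch.Tensor, k: int = 2):
    """Sparse gating sketch (hypothetical helper, not the article's code).

    logits: (batch, num_experts) raw gate scores.
    Returns (batch, num_experts) weights with at most k nonzeros per row,
    plus the indices of the selected experts.
    """
    top_vals, top_idx = logits.topk(k, dim=-1)                 # keep k best experts
    weights = torch.zeros_like(logits).scatter_(               # zeros elsewhere -> sparsity
        -1, top_idx, F.softmax(top_vals, dim=-1)               # renormalize over the k kept
    )
    return weights, top_idx

torch.manual_seed(0)
w, idx = top_k_gate(torch.randn(4, 8), k=2)
print((w > 0).sum(dim=-1))   # exactly 2 active experts per input
print(w.sum(dim=-1))         # kept weights sum to 1 per input
```

Because the non-selected experts receive an exactly-zero weight, their forward passes can be skipped entirely, which is what makes the computation conditional rather than merely reweighted.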
Part 2
PyTorch Implementation
Building Expert MLPs, TopKGating with noisy exploration, and auxiliary load-balancing loss — entirely from scratch.
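As a rough sketch of what a noisy top-k gate with an auxiliary balancing term might look like (the class name `TopKGating`, its attributes, and the Switch-style loss are assumptions based on this description, not the linked repository's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGating(nn.Module):
    """Hypothetical noisy top-k gate sketch; see the linked repo for the real one."""

    def __init__(self, dim: int, num_experts: int, k: int = 2, noise_std: float = 1.0):
        super().__init__()
        self.w_gate = nn.Linear(dim, num_experts, bias=False)
        self.k = k
        self.noise_std = noise_std

    def forward(self, x):
        logits = self.w_gate(x)                                # (batch, num_experts)
        if self.training and self.noise_std > 0:
            # Noisy exploration: perturb logits so under-used experts still get traffic.
            logits = logits + torch.randn_like(logits) * self.noise_std
        top_vals, top_idx = logits.topk(self.k, dim=-1)
        gates = torch.zeros_like(logits).scatter_(
            -1, top_idx, F.softmax(top_vals, dim=-1)
        )
        # Switch-style auxiliary loss (assumed variant): penalize correlation between
        # mean router probability and fraction of tokens routed to each expert.
        probs = F.softmax(logits, dim=-1)
        importance = probs.mean(dim=0)                         # soft usage per expert
        load = (gates > 0).float().mean(dim=0)                 # hard usage per expert
        aux_loss = logits.size(-1) * (importance * load).sum()
        return gates, aux_loss

torch.manual_seed(0)
gate = TopKGating(dim=16, num_experts=8, k=2).eval()           # eval: no noise
g, aux = gate(torch.randn(32, 16))
print(g.shape, aux.item())
```

The auxiliary term is minimized when routing is uniform across experts, which counteracts the gate's tendency to collapse onto a few favorites early in training.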
View Code on GitHub
Part 3
8 Experts, 2 Active
Reaching 95.14% MNIST accuracy with only 54% of parameters active per input, plus visualizations of emergent expert specialization per digit class.
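Why 54% and not 2/8 = 25%? The gate, embeddings, and any shared layers are active for every input; only the expert MLPs are sparsely activated. The arithmetic below uses illustrative parameter counts (chosen to land near the reported figure, not taken from the actual model):

```python
# Hypothetical parameter budget -- illustrative numbers only, not the article's model.
num_experts, k = 8, 2
shared_params = 50_000      # gate + shared layers, always active
per_expert = 10_000         # one expert MLP

total_params = shared_params + num_experts * per_expert
active_params = shared_params + k * per_expert

print(f"active fraction: {active_params / total_params:.2%}")  # -> 53.85%
```

The larger the shared backbone relative to each expert, the closer the active fraction drifts toward 100%; sparsity pays off most when the experts hold the bulk of the parameters.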