Deconstructing MoE from Scratch

Explore the mathematics of conditional computation, sparse gating, and expert specialization — built entirely from scratch in pure PyTorch.

Part 1

The Math of Conditional Computation

Breaking down gating networks, Top-K routing, load balancing loss, and why sparsity enables massive scale.
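The routing math above can be sketched in a few lines of PyTorch: keep only the top-K gate logits, softmax over the survivors so the gates stay sparse but sum to 1, and penalize uneven expert usage with a Switch-style auxiliary loss. This is a minimal illustration, not the repo's API; the function names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def top_k_gating(logits, k=2):
    # Keep the k largest gate logits per input; mask the rest to -inf
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float("-inf")).scatter_(-1, topk_idx, topk_vals)
    # Softmax over the surviving logits: exactly k nonzero gates per row
    return F.softmax(masked, dim=-1)

def load_balancing_loss(router_probs, expert_index, num_experts):
    # Switch-Transformer-style auxiliary loss: N * sum_i(f_i * P_i), where
    # f_i = fraction of inputs whose top-1 choice is expert i, and
    # P_i = mean router probability assigned to expert i.
    # Minimized when both are uniform across experts.
    f = F.one_hot(expert_index, num_experts).float().mean(dim=0)
    p = router_probs.mean(dim=0)
    return num_experts * (f * p).sum()
```

With perfectly uniform routing the loss evaluates to 1.0, which is its minimum; training adds a small multiple of this term to the task loss to keep experts evenly loaded.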

Part 2

PyTorch Implementation

Building Expert MLPs, TopKGating with noisy exploration, and auxiliary load-balancing loss — entirely from scratch.
View Code on GitHub
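A minimal sketch of the components Part 2 describes, with assumed layer sizes and class names (`Expert` and `TopKGating` here are illustrative stand-ins, not necessarily the repo's classes): noisy exploration adds input-dependent Gaussian noise to the router logits during training, so under-used experts still occasionally win the Top-K and receive gradient.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A small feed-forward expert MLP (sizes are illustrative)."""
    def __init__(self, dim_in, dim_hidden, dim_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, dim_hidden),
            nn.ReLU(),
            nn.Linear(dim_hidden, dim_out),
        )

    def forward(self, x):
        return self.net(x)

class TopKGating(nn.Module):
    """Noisy Top-K gating in the style of Shazeer et al. (2017)."""
    def __init__(self, dim_in, num_experts, k=2):
        super().__init__()
        self.k = k
        self.w_gate = nn.Linear(dim_in, num_experts, bias=False)
        self.w_noise = nn.Linear(dim_in, num_experts, bias=False)

    def forward(self, x):
        logits = self.w_gate(x)
        if self.training:
            # Input-dependent noise scale encourages exploration of experts
            noise_std = F.softplus(self.w_noise(x))
            logits = logits + torch.randn_like(logits) * noise_std
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        masked = torch.full_like(logits, float("-inf")).scatter_(-1, topk_idx, topk_vals)
        # Sparse gate weights (k nonzero per row) and the chosen expert indices
        return F.softmax(masked, dim=-1), topk_idx
```

The MoE layer then runs each input through its k selected experts and sums their outputs weighted by the gates; the softplus keeps the noise scale non-negative without hard clipping.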

Part 3

8 Experts, 2 Active

Achieves 95.14% MNIST accuracy with only 54% of parameters active per input, and visualizes the emergent expert specialization for each digit class.
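The specialization visualization reduces to a tally: for each digit class, count how often each expert wins the top-1 route. A hypothetical helper (name and signature are illustrative, not the repo's):

```python
import torch

def expert_usage_by_class(top1_expert, labels, num_experts=8, num_classes=10):
    """Build a [num_classes x num_experts] matrix counting how often each
    expert was the top-1 choice for inputs of each digit class."""
    counts = torch.zeros(num_classes, num_experts, dtype=torch.long)
    for e, y in zip(top1_expert.tolist(), labels.tolist()):
        counts[y, e] += 1
    return counts
```

Plotting this matrix as a heatmap (rows = digits, columns = experts) makes emergent specialization visible: a strongly diagonal-ish or blocky pattern means particular experts have claimed particular digit classes.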