Deconstructing Normalization
BatchNorm, LayerNorm, RMSNorm, and GroupNorm implemented from scratch and compared head-to-head on a 20-layer MLP. The mean subtraction in LayerNorm turned out to be load-bearing for nothing.
Part 1
The Family
Why we normalize, what each choice of axis implies, why RMSNorm drops the mean subtraction, and which norm dominates which model class.
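To make the axis choice concrete, here is a minimal pure-Python sketch, not the code from this series. It assumes a batch stored as a list of rows (one row per sample) and leaves out the learnable scale and shift and BatchNorm's running statistics. The helper names `normalize`, `batch_norm`, and `layer_norm` are illustrative.

```python
import math

def normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (roughly) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(x, eps=1e-5):
    """Statistics per feature (column), computed across the batch."""
    cols = [normalize(list(col), eps) for col in zip(*x)]
    return [list(row) for row in zip(*cols)]  # transpose back to rows

def layer_norm(x, eps=1e-5):
    """Statistics per sample (row), computed across the features."""
    return [normalize(row, eps) for row in x]

# Illustrative batch: 4 samples, 3 features on very different scales.
x = [[1.0, 10.0, 100.0],
     [2.0, 20.0, 200.0],
     [3.0, 30.0, 300.0],
     [4.0, 40.0, 400.0]]

bn = batch_norm(x)  # each column now has mean ~0
ln = layer_norm(x)  # each row now has mean ~0
```

In this sketch, the output of `batch_norm` for a sample changes when the rest of the batch changes, while `layer_norm` treats every sample independently. That dependence on the batch is the practical consequence of choosing the batch axis.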
Part 2
Four Layers, 20 Lines Each
BatchNorm, LayerNorm, RMSNorm, and GroupNorm implemented in a single file, without torch.nn.LayerNorm.
View Code on GitHub
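As a rough illustration of how small these layers are, the sketch below writes the per-sample norms in pure Python. It is not the repository's code: it assumes a single feature vector as a list of floats, omits the learnable parameters, and uses illustrative function names. BatchNorm is left out because it needs statistics across a batch.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Divide by the root mean square; no mean subtraction."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]

def group_norm(x, num_groups, eps=1e-5):
    """Apply layer_norm independently to contiguous groups of features."""
    if len(x) % num_groups != 0:
        raise ValueError("feature count must be divisible by num_groups")
    size = len(x) // num_groups
    out = []
    for start in range(0, len(x), size):
        out.extend(layer_norm(x[start:start + size], eps))
    return out

x = [2.0, 4.0, 6.0, 8.0]           # illustrative input
ln = layer_norm(x)                 # zero mean, unit variance
rms = rms_norm(x)                  # unit mean square, mean not removed
gn = group_norm(x, num_groups=2)   # two groups of two features each

# On input that already has zero mean, the two norms coincide.
z = [-3.0, -1.0, 1.0, 3.0]
same = all(abs(a - b) < 1e-12 for a, b in zip(layer_norm(z), rms_norm(z)))
```

The last two lines show the relationship between the two layers: when the input already has zero mean, `rms_norm` and `layer_norm` give the same result, so the mean subtraction only matters to the extent that activations drift away from zero mean.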
Part 3
The Mean Was Load-Bearing for Nothing
At small scale, the network with no normalization wins (97.5%) and BatchNorm hurts (95.7%). RMSNorm beats LayerNorm with half the FLOPs, which is the design choice most modern LLMs made.