
Deconstructing Normalization

BatchNorm, LayerNorm, RMSNorm, and GroupNorm from scratch — head-to-head on a 20-layer MLP. The mean subtraction in LayerNorm was load-bearing for nothing.

Part 1

The Family

Why we normalize, what each axis choice implies, why RMSNorm drops the mean subtraction, and which norm dominates which model class.
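To make the mean-subtraction question concrete, here is a minimal, dependency-free sketch of LayerNorm and RMSNorm on a single feature vector. It is not the project's code: the function names, the `EPS` value, and the omission of learnable scale and shift parameters are all assumptions made for illustration.

```python
import math

EPS = 1e-5  # assumed stabiliser; the source does not state a value


def layer_norm(x, eps=EPS):
    """Normalize one feature vector: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=EPS):
    """Normalize one feature vector by its root mean square; no centering."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


x = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(x)  # output is centered: its entries sum to ~0
rn = rms_norm(x)    # output keeps its offset; its mean square is ~1

# On an input that already has zero mean, the two functions agree.
z = [-1.5, -0.5, 0.5, 1.5]
same = all(abs(a - b) < 1e-9 for a, b in zip(layer_norm(z), rms_norm(z)))
```

The only difference between the two functions is the centering step, so RMSNorm skips one reduction over the features and one subtraction per element.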

Part 2

Four Layers, 20 Lines Each

BatchNorm, LayerNorm, RMSNorm, GroupNorm in a single file. No torch.nn.LayerNorm.
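As a rough illustration of how the layers differ mainly in which axis the statistics are taken over, the sketch below contrasts BatchNorm and GroupNorm on a small batch of feature vectors. It is a hypothetical, pure-Python sketch rather than the file described above: the helper `_standardize`, the `EPS` value, and the lack of learnable parameters and running statistics are assumptions.

```python
import math

EPS = 1e-5  # assumed stabiliser; the source does not state a value


def _standardize(values, eps=EPS):
    """Hypothetical helper: shift to zero mean and scale to unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm(batch, eps=EPS):
    """Training-mode BatchNorm: statistics per feature, taken across the batch."""
    columns = [_standardize(list(col), eps) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def group_norm(batch, num_groups, eps=EPS):
    """GroupNorm: statistics per sample, taken within each group of features."""
    out = []
    for row in batch:
        size = len(row) // num_groups  # assumes the width divides evenly
        normed = []
        for g in range(num_groups):
            normed.extend(_standardize(row[g * size:(g + 1) * size], eps))
        out.append(normed)
    return out


batch = [[1.0, 10.0, 100.0, 1000.0],
         [2.0, 20.0, 200.0, 2000.0],
         [3.0, 30.0, 300.0, 3000.0]]
bn = batch_norm(batch)               # each column now has ~zero mean
gn = group_norm(batch, num_groups=2)  # each half of each row has ~zero mean
```

With `num_groups=1`, `group_norm` normalizes over all features of each sample, which matches LayerNorm without its scale and shift. `batch_norm` is the only one of the two whose output for a sample depends on the other samples in the batch.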

Part 3

The Mean Was Load-Bearing for Nothing

At small scale, the no-norm baseline wins (97.5%) and BatchNorm hurts (95.7%). RMSNorm beats LayerNorm with half the FLOPs, which is the design choice most modern LLMs made.