BatchNorm, LayerNorm, RMSNorm, GroupNorm. Four ways to normalize activations. Each is a different choice of which axes to reduce over and which statistics to compute: a mean to subtract, a scale to divide by, or both.
Why Normalize at All?
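Without intervention, the distribution of activations inside a deep network drifts. By layer 50 it can have a wildly different mean and variance from layer 1, causing vanishing or exploding signals and coupling each layer's behavior to the layers before it. Normalization layers counter this by rescaling (and usually re-centering) activations so that their statistics along a chosen axis stay fixed, typically zero mean and unit variance, before a learned scale and shift are applied.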
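The drift is easy to reproduce in a toy setting. The sketch below is an illustration only, not a model of any real network: it pushes a random batch through a stack of random ReLU layers and reports the standard deviation of the output with and without per-sample normalization. The function name `final_std`, the width, and the weight scale are all arbitrary assumptions chosen for the demo.

```python
import numpy as np

def final_std(depth, normalize, width=128, weight_scale=0.15, seed=0):
    """Push a random batch through `depth` random ReLU layers; return the output std."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((256, width))  # (batch, features)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_scale
        h = np.maximum(h @ W, 0.0)  # linear layer followed by ReLU
        if normalize:
            # Re-center and rescale each sample over its own features.
            mu = h.mean(axis=1, keepdims=True)
            sigma = h.std(axis=1, keepdims=True)
            h = (h - mu) / (sigma + 1e-5)
    return float(h.std())

for depth in (1, 10, 50):
    print(depth, final_std(depth, normalize=False), final_std(depth, normalize=True))
```

With this weight scale the unnormalized standard deviation grows by orders of magnitude with depth, while the normalized one stays near 1. A smaller `weight_scale` such as 0.1 would instead make the unnormalized signal shrink toward zero.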
The Four Norms, Side by Side
Consider a 2D tensor of shape $(B, C)$ where $B$ is batch and $C$ is features.
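BatchNorm: reduce over the batch dimension, per feature. Writing $x_{b,c}$ for the entry of sample $b$ and feature $c$, with a learned scale $\gamma_c$, a learned shift $\beta_c$, and a small constant $\epsilon$ for numerical stability:
$$\mu_c = \frac{1}{B}\sum_{b=1}^{B} x_{b,c}, \qquad \sigma_c^2 = \frac{1}{B}\sum_{b=1}^{B} (x_{b,c} - \mu_c)^2, \qquad y_{b,c} = \gamma_c\,\frac{x_{b,c} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}} + \beta_c$$
BatchNorm tracks running estimates of $\mu_c$ and $\sigma_c^2$ during training and uses them in place of the batch statistics at inference. It dominates CNNs.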
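As a minimal NumPy sketch of the above, assuming a `(B, C)` input and an exponential moving average for the running statistics: the function names and the `momentum` and `eps` defaults are illustrative choices, and real frameworks differ in details such as bias correction of the running variance.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, running_mean, running_var,
                     momentum=0.1, eps=1e-5):
    """Normalize each feature over the batch and update running statistics."""
    mu = x.mean(axis=0)   # shape (C,): one mean per feature
    var = x.var(axis=0)   # shape (C,): one variance per feature
    y = gamma * (x - mu) / np.sqrt(var + eps) + beta
    # Exponential moving averages, used later at inference time.
    running_mean = (1 - momentum) * running_mean + momentum * mu
    running_var = (1 - momentum) * running_var + momentum * var
    return y, running_mean, running_var

def batch_norm_eval(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """At inference, use the stored statistics instead of the batch's own."""
    return gamma * (x - running_mean) / np.sqrt(running_var + eps) + beta
```

Because `batch_norm_eval` never looks at the other samples, it works on a batch of size one, which `batch_norm_train` cannot do meaningfully.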
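LayerNorm: reduce over the feature dimension, per sample. With $x_{b,c}$ the entry of sample $b$ and feature $c$, a learned scale $\gamma_c$, a learned shift $\beta_c$, and a small constant $\epsilon$:
$$\mu_b = \frac{1}{C}\sum_{c=1}^{C} x_{b,c}, \qquad \sigma_b^2 = \frac{1}{C}\sum_{c=1}^{C} (x_{b,c} - \mu_b)^2, \qquad y_{b,c} = \gamma_c\,\frac{x_{b,c} - \mu_b}{\sqrt{\sigma_b^2 + \epsilon}} + \beta_c$$
Each sample is normalized using only its own features, so there is no batch dependence and there are no running statistics to track. That is why Transformers use LayerNorm.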
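A minimal NumPy sketch of LayerNorm on a `(B, C)` input might look like the following. The function name and the `eps` default are assumptions for illustration; the only change from a per-feature batch reduction is the axis.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample over its own features (the last axis)."""
    mu = x.mean(axis=-1, keepdims=True)   # shape (B, 1): one mean per sample
    var = x.var(axis=-1, keepdims=True)   # shape (B, 1): one variance per sample
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)) * 2.0 + 3.0
gamma, beta = np.ones(16), np.zeros(16)

full = layer_norm(x, gamma, beta)
single = layer_norm(x[:1], gamma, beta)   # the first sample on its own
batch_independent = np.allclose(full[:1], single)
```

Here `batch_independent` is `True`: a sample's output does not change when other samples are added to or removed from the batch.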
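RMSNorm: drop the mean subtraction and divide by the root mean square of each sample's features. With $x_{b,c}$ the entry of sample $b$ and feature $c$, a learned scale $\gamma_c$, and a small constant $\epsilon$:
$$y_{b,c} = \gamma_c\,\frac{x_{b,c}}{\sqrt{\frac{1}{C}\sum_{c'=1}^{C} x_{b,c'}^2 + \epsilon}}$$
There is no additive bias either. Activations keep their mean, and any offset along the feature axis is left for the rest of the model to handle.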
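In NumPy, a minimal RMSNorm sketch for a `(B, C)` input could be written as below. The function name and the placement of `eps` inside the square root are assumptions; implementations vary on the latter.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-5):
    """Scale each sample by the root mean square of its features; no centering."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)  # shape (B, 1)
    return gamma * x / rms

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)) + 2.0   # inputs with a clearly non-zero mean
y = rms_norm(x, np.ones(64))

out_rms = np.sqrt(np.mean(y ** 2, axis=-1))   # close to 1 for every sample
out_mean = y.mean(axis=-1)                    # still positive: the mean survives
```

The output has unit root mean square per sample, but its mean is not forced to zero.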
GroupNorm: split the $C$ features into $G$ groups and normalize within each group, per sample. It is independent of batch size and more local than LayerNorm; with $G = 1$ it reduces to LayerNorm. Standard in image diffusion models.
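For the same `(B, C)` layout, a minimal GroupNorm sketch reshapes the feature axis into groups and normalizes each group. The function and parameter names are illustrative assumptions. On image tensors, the reduction would typically also cover the spatial axes, which this 2D sketch omits.

```python
import numpy as np

def group_norm(x, gamma, beta, num_groups, eps=1e-5):
    """Normalize each sample within contiguous groups of features."""
    B, C = x.shape
    if C % num_groups != 0:
        raise ValueError("C must be divisible by num_groups")
    g = x.reshape(B, num_groups, C // num_groups)
    mu = g.mean(axis=-1, keepdims=True)   # one mean per (sample, group)
    var = g.var(axis=-1, keepdims=True)   # one variance per (sample, group)
    g = (g - mu) / np.sqrt(var + eps)
    # Scale and shift are still per feature, as in LayerNorm.
    return gamma * g.reshape(B, C) + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 12)) * 2.0 + 1.0
y = group_norm(x, np.ones(12), np.zeros(12), num_groups=4)
```

Setting `num_groups=1` normalizes over all features of a sample at once, which is exactly the LayerNorm computation.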
Why RMSNorm Drops the Mean
Zhang & Sennrich (2019) argued that the mean subtraction in LayerNorm is dispensable: in their experiments, models that skipped it reached comparable quality. RMSNorm needs one fewer reduction, no subtraction, and no additive bias parameter. At LLM scale (100+ transformer layers, trillions of tokens), saving 30% of the normalization FLOPs is real money. Llama, Mistral, and Gemma all use RMSNorm.
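To make the difference concrete, the sketch below strips both norms down to their normalization step (no learned parameters) and compares them. The function names are hypothetical. It shows that the two coincide on inputs that already have zero mean and differ otherwise, and the comments mark where LayerNorm spends its extra reduction.

```python
import numpy as np

def layer_norm_core(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)                   # reduction 1
    var = ((x - mu) ** 2).mean(axis=-1, keepdims=True)    # reduction 2
    return (x - mu) / np.sqrt(var + eps)

def rms_norm_core(x, eps=1e-5):
    ms = (x ** 2).mean(axis=-1, keepdims=True)            # single reduction
    return x / np.sqrt(ms + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512)) + 3.0
x_centered = x - x.mean(axis=-1, keepdims=True)

# With zero-mean inputs the variance equals the mean square, so the norms agree.
same_when_centered = np.allclose(layer_norm_core(x_centered),
                                 rms_norm_core(x_centered))
# With a non-zero mean they produce different outputs.
differ_when_shifted = not np.allclose(layer_norm_core(x), rms_norm_core(x))
```

Both flags come out `True`, so the only thing RMSNorm gives up relative to LayerNorm is invariance to a shift of the input.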
Which One When?
- CNNs on ImageNet-scale data: BatchNorm.
- Transformers: LayerNorm or RMSNorm.
- Image diffusion (UNets): GroupNorm.
- Tiny models on simple tasks: arguably none — toy benchmarks frequently show normalization hurting at small scale.