Deconstructing Normalization: Part 1 — The Family

Overview

BatchNorm, LayerNorm, RMSNorm, GroupNorm. Four ways to normalize activations. They differ in two choices: which axes the statistics are computed over, and which statistics are used (mean and variance, or only the root mean square).

Why Normalize at All?

Without intervention, the distribution of activations inside a deep network drifts. By layer 50 it can have a very different mean and variance from layer 1, which causes vanishing or exploding signals and couples each layer's behavior to the layers before it. Normalization layers counter this by rescaling activations to a controlled scale (zero mean and unit variance for most variants), usually followed by a learned scale and shift.
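To make the drift concrete, here is a minimal sketch, not a claim about any real architecture. It pushes random inputs through a stack of 50 random linear layers with a deliberately mis-scaled initialization, once without normalization and once with per-sample standardization after every layer. The `gain`, `depth`, `width`, and the absence of nonlinearities are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width, batch = 50, 256, 64
gain = 1.5  # hypothetical, deliberately too large: each layer scales the signal by ~1.5


def run(normalize):
    """Return the overall activation std after each of `depth` linear layers."""
    x = rng.standard_normal((batch, width))
    stds = []
    for _ in range(depth):
        w = rng.standard_normal((width, width)) * gain / np.sqrt(width)
        x = x @ w
        if normalize:
            # Standardize each sample across its features (LayerNorm without gamma/beta).
            x = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
        stds.append(x.std())
    return stds


plain = run(normalize=False)
normed = run(normalize=True)
print(f"no norm:   layer 1 std {plain[0]:.2e}, layer 50 std {plain[-1]:.2e}")
print(f"with norm: layer 1 std {normed[0]:.2e}, layer 50 std {normed[-1]:.2e}")
```

Without normalization the scale compounds geometrically with depth. With it, every layer sees inputs at the same scale regardless of how the weights were initialized.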

The Four Norms, Side by Side

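Consider a 2D tensor $x$ of shape $(B, C)$, where $B$ is the batch size and $C$ is the number of features, so $x_{b,c}$ is feature $c$ of sample $b$. Throughout, $\gamma$ and $\beta$ are learned per-feature scale and shift parameters, and $\varepsilon$ is a small constant for numerical stability.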

BatchNorm — reduce over the batch dimension, per feature:

$$\mathrm{BN}(x)_{b,c} = \gamma_c \cdot \frac{x_{b,c} - \mu_c}{\sqrt{\sigma_c^2 + \varepsilon}} + \beta_c.$$

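Here $\mu_c$ and $\sigma_c^2$ are the mean and variance of feature $c$ over the current batch. BatchNorm also tracks running statistics during training and uses them in place of batch statistics at inference. Dominates CNNs.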
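A minimal NumPy sketch of the formula above, including the running statistics. The function name `batch_norm`, the `momentum` default, and the use of the biased variance for the running estimate are simplifying assumptions for illustration, and framework implementations differ in such details.

```python
import numpy as np


def batch_norm(x, gamma, beta, running_mean, running_var,
               training, momentum=0.1, eps=1e-5):
    """BatchNorm on a (B, C) array. Returns the output and the updated running stats."""
    if training:
        mu = x.mean(axis=0)   # shape (C,): one mean per feature, taken over the batch
        var = x.var(axis=0)   # biased variance over the batch
        # Exponential moving average, kept for use at inference time.
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        # At inference, the stored statistics replace the batch statistics.
        mu, var = running_mean, running_var
    y = gamma * (x - mu) / np.sqrt(var + eps) + beta
    return y, running_mean, running_var


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 4))
gamma, beta = np.ones(4), np.zeros(4)
y, rm, rv = batch_norm(x, gamma, beta, np.zeros(4), np.ones(4), training=True)
print(y.mean(axis=0))  # close to 0 for every feature
print(y.std(axis=0))   # close to 1 for every feature
```

Note that the output for one sample depends on every other sample in the batch, which is exactly the property the remaining three norms avoid.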

LayerNorm — reduce over the feature dimension, per sample:

$$\mathrm{LN}(x)_{b,c} = \gamma_c \cdot \frac{x_{b,c} - \mu_b}{\sqrt{\sigma_b^2 + \varepsilon}} + \beta_c.$$

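Here $\mu_b$ and $\sigma_b^2$ are the mean and variance of sample $b$ over its $C$ features. No batch dependence, no running statistics, and identical behavior in training and inference. This is a large part of why Transformers use LayerNorm.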
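The same formula as a short NumPy sketch (the function name `layer_norm` and the `eps` default are illustrative choices). The only change from BatchNorm is the reduction axis, and the check at the end illustrates the batch independence: normalizing a sample alone gives the same result as normalizing it inside a larger batch.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm on a (B, C) array: statistics per sample, over the feature axis."""
    mu = x.mean(axis=1, keepdims=True)   # shape (B, 1)
    var = x.var(axis=1, keepdims=True)   # shape (B, 1)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
gamma, beta = np.ones(16), np.zeros(16)

full = layer_norm(x, gamma, beta)        # whole batch at once
single = layer_norm(x[:1], gamma, beta)  # first sample on its own
print(np.allclose(full[:1], single))     # True
```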

RMSNorm — drops the mean subtraction:

$$\mathrm{RMS}(x)_{b,c} = \gamma_c \cdot \frac{x_{b,c}}{\sqrt{\frac{1}{C} \sum_{c'} x_{b,c'}^2 + \varepsilon}}.$$

There is no additive bias $\beta$ either. Activations keep their mean: only their scale is normalized, and any offset is left for the surrounding layers to handle.
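As a sketch in NumPy (the name `rms_norm` and the `eps` default are assumptions for illustration), the whole layer is one reduction and one division. The small example shows that the output has unit mean square but a clearly nonzero mean.

```python
import numpy as np


def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm on a (B, C) array: divide by the root mean square over features."""
    mean_sq = np.mean(x ** 2, axis=1, keepdims=True)  # shape (B, 1)
    return gamma * x / np.sqrt(mean_sq + eps)


x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = rms_norm(x, np.ones(4))
print(np.mean(y ** 2))  # close to 1: the scale is normalized
print(y.mean())         # clearly nonzero: the mean is left alone
```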

GroupNorm — split features into $G$ groups, normalize within each. Independent of batch size, more local than LayerNorm. Standard in image diffusion.
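A minimal sketch for the $(B, C)$ case used in this post, with hypothetical names `group_norm` and `num_groups`. On image tensors the per-group statistics are also pooled over the spatial dimensions, which this 2D sketch leaves out.

```python
import numpy as np


def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """GroupNorm on a (B, C) array: statistics per sample and per group of features."""
    b, c = x.shape
    assert c % num_groups == 0, "C must be divisible by the number of groups"
    g = x.reshape(b, num_groups, c // num_groups)
    mu = g.mean(axis=2, keepdims=True)   # shape (B, G, 1)
    var = g.var(axis=2, keepdims=True)   # shape (B, G, 1)
    g = (g - mu) / np.sqrt(var + eps)
    return gamma * g.reshape(b, c) + beta


rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, size=(2, 8))
gamma, beta = np.ones(8), np.zeros(8)
y = group_norm(x, num_groups=4, gamma=gamma, beta=beta)
print(y.reshape(2, 4, 2).mean(axis=2))  # close to 0 within every group
```

With `num_groups=1` this 2D version reduces to LayerNorm, which is one way to see GroupNorm as the more local member of the same family.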

Why RMSNorm Drops the Mean

Zhang & Sennrich (2019) hypothesized that the re-scaling in LayerNorm, not the re-centering, is what does the work, and reported that RMSNorm reaches comparable quality to LayerNorm in their experiments. RMSNorm needs one fewer reduction, no subtraction, and no additive bias parameter. At LLM scale (up to 100+ transformer layers, trillions of tokens), saving roughly 30% of the normalization FLOPs is real money. Llama, Mistral, and Gemma all use RMSNorm.
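One way to see exactly what the mean subtraction contributes: with $\gamma = 1$ and $\beta = 0$, LayerNorm is the same computation as RMSNorm applied to mean-centered input. The sketch below checks this numerically. The helper names and the shared `eps` are assumptions for illustration, and it says nothing about model quality, only about what the two layers compute.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """LayerNorm without the learned scale and shift."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def rms_norm(x, eps=1e-5):
    """RMSNorm without the learned scale."""
    return x / np.sqrt(np.mean(x ** 2, axis=1, keepdims=True) + eps)


rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, size=(4, 64))          # inputs with a large mean
centered = x - x.mean(axis=1, keepdims=True)

print(np.allclose(layer_norm(x), rms_norm(centered)))  # True
print(np.allclose(layer_norm(x), rms_norm(x)))         # False
```

So the two layers differ only in whether the input mean is removed first. If the activations reaching the layer are already close to zero-mean, or the following layers can absorb the offset, the extra reduction buys little.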

Which One When?

CNNs on ImageNet-scale data: BatchNorm.
Transformers: LayerNorm or RMSNorm.
Image diffusion (UNets): GroupNorm.
Tiny models on simple tasks: arguably none, since on toy benchmarks normalization can hurt at small scale.