Deconstructing Normalization

Part 1: The Family

Overview

BatchNorm, LayerNorm, RMSNorm, GroupNorm. Four ways to normalize activations. Each is a different choice of which axes to reduce over and which statistics to compute: a mean and a variance, or just a root mean square.

Why Normalize at All?

Without intervention, the distribution of activations inside a deep network drifts. By layer 50 it can have a wildly different mean and variance from layer 1, causing vanishing or exploding signals and coupling between layers. Normalization layers fix this by rescaling activations to fixed statistics, typically zero mean and unit variance, before applying a learned scale and shift.
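A toy sketch of the drift, under assumptions that are not in the text: a purely linear stack with hypothetical width 32 and depth 50, and random Gaussian weights whose gain of 1.2 is picked to exaggerate the effect. It only illustrates how a small per-layer gain compounds with depth; real networks have nonlinearities and trained weights.

```python
import math
import random

def rms(v):
    """Root mean square of a vector: a simple proxy for activation scale."""
    return math.sqrt(sum(a * a for a in v) / len(v))

random.seed(0)
width, depth, gain = 32, 50, 1.2  # hypothetical toy settings, not from the article

x = [random.gauss(0.0, 1.0) for _ in range(width)]
scales = []
for _ in range(depth):
    # Weights ~ N(0, gain^2 / width), so each layer scales the RMS by roughly `gain`.
    w = [[random.gauss(0.0, gain / math.sqrt(width)) for _ in range(width)]
         for _ in range(width)]
    x = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]
    scales.append(rms(x))

print(f"layer 1 RMS: {scales[0]:.3g}, layer {depth} RMS: {scales[-1]:.3g}")
```

Setting `gain` below 1 makes the scale shrink with depth instead of growing, which is the vanishing side of the same problem.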

The Four Norms, Side by Side

Consider a 2D tensor of shape $(B, C)$ where $B$ is batch and $C$ is features.

BatchNorm — reduce over the batch dimension, per feature:

$$ \mathrm{BN}(x)_{b,c} = \gamma_c \cdot \frac{x_{b,c} - \mu_c}{\sqrt{\sigma_c^2 + \varepsilon}} + \beta_c. $$

Tracks running statistics for inference. Dominates CNNs.
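The BatchNorm formula can be sketched in plain Python, with the $(B, C)$ tensor held as a list of $B$ rows so the reduction axis stays explicit. This is a minimal training-mode sketch: the function name and the default `eps` are illustrative assumptions, and the running statistics used at inference are not modelled.

```python
import math

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm on a (B, C) tensor given as a list of B rows."""
    B, C = len(x), len(x[0])
    # Reduce over the batch axis: one mean and one (biased) variance per feature c.
    mu = [sum(row[c] for row in x) / B for c in range(C)]
    var = [sum((row[c] - mu[c]) ** 2 for row in x) / B for c in range(C)]
    return [
        [gamma[c] * (row[c] - mu[c]) / math.sqrt(var[c] + eps) + beta[c]
         for c in range(C)]
        for row in x
    ]

x = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]  # B = 3, C = 2
y = batch_norm(x, gamma=[1.0, 1.0], beta=[0.0, 0.0])

# Each feature column now has mean ~0 and variance ~1 across the batch.
col_means = [sum(row[c] for row in y) / len(y) for c in range(2)]
col_vars = [sum(row[c] ** 2 for row in y) / len(y) for c in range(2)]
print(col_means, col_vars)
```

The two columns differ in scale by a factor of ten on input but come out with matching statistics, because each feature is normalized against its own batch mean and variance.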

LayerNorm — reduce over the feature dimension, per sample:

$$ \mathrm{LN}(x)_{b,c} = \gamma_c \cdot \frac{x_{b,c} - \mu_b}{\sqrt{\sigma_b^2 + \varepsilon}} + \beta_c. $$

No batch dependence, no running statistics. This is why Transformers use LayerNorm.
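A matching sketch of LayerNorm, again on a list of rows and with an assumed function name and default `eps`. The statistics are computed inside each row, so a sample's output should not change when other samples are added to the batch.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm on a (B, C) tensor given as a list of B rows."""
    out = []
    for row in x:
        C = len(row)
        # Reduce over the feature axis: one mean and one variance per sample b.
        mu = sum(row) / C
        var = sum((v - mu) ** 2 for v in row) / C
        inv_std = 1.0 / math.sqrt(var + eps)
        out.append([g * (v - mu) * inv_std + b
                    for v, g, b in zip(row, gamma, beta)])
    return out

gamma, beta = [1.0] * 4, [0.0] * 4
x_batch = [[1.0, 2.0, 3.0, 4.0], [10.0, 0.0, -10.0, 0.0]]

y_batch = layer_norm(x_batch, gamma, beta)
y_alone = layer_norm([x_batch[0]], gamma, beta)  # same sample, batch of one

print(y_alone[0] == y_batch[0])  # the first sample is unaffected by its batch-mates
```

Because nothing is shared across rows, the same code path serves training and inference, and a batch of one is no special case.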

RMSNorm — drops the mean subtraction:

$$ \mathrm{RMS}(x)_{b,c} = \gamma_c \cdot \frac{x_{b,c}}{\sqrt{\frac{1}{C} \sum_{c'} x_{b,c'}^2 + \varepsilon}}. $$

There is no additive bias $\beta$ either. Activations are rescaled but not re-centered, so any offset in the mean is left for the rest of the model to handle.
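The RMSNorm formula in the same style, as a sketch with an assumed function name and default `eps`. Only one reduction per row is needed (the mean of squares), and the output has unit RMS but a non-zero mean.

```python
import math

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm on a (B, C) tensor given as a list of B rows."""
    out = []
    for row in x:
        # Single reduction per sample: the mean of squared features.
        mean_sq = sum(v * v for v in row) / len(row)
        inv_rms = 1.0 / math.sqrt(mean_sq + eps)
        out.append([g * v * inv_rms for v, g in zip(row, gamma)])
    return out

x = [[1.0, 2.0, 3.0, 4.0]]
y = rms_norm(x, gamma=[1.0] * 4)

out_mean = sum(y[0]) / 4
out_rms = math.sqrt(sum(v * v for v in y[0]) / 4)
print(out_mean, out_rms)  # mean stays well away from 0, RMS is ~1
```

The input mean of 2.5 is scaled along with everything else rather than removed, which is the practical difference from LayerNorm.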

GroupNorm — split features into $G$ groups, normalize within each. Independent of batch size, more local than LayerNorm. Standard in image diffusion.
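GroupNorm has no formula above, so the following is a sketch under stated assumptions: features are split into `num_groups` contiguous, equal-sized groups, and the learned scale and shift are per feature, as with the other norms here. On a $(B, C)$ tensor the groups contain only features; in image models each group would also pool over spatial positions.

```python
import math

def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """GroupNorm on a (B, C) tensor given as a list of B rows."""
    C = len(x[0])
    if C % num_groups != 0:
        raise ValueError("C must be divisible by num_groups")
    size = C // num_groups
    out = []
    for row in x:
        normed = []
        for g in range(num_groups):
            # Statistics are computed per sample and per group of features.
            chunk = row[g * size:(g + 1) * size]
            mu = sum(chunk) / size
            var = sum((v - mu) ** 2 for v in chunk) / size
            inv_std = 1.0 / math.sqrt(var + eps)
            normed.extend((v - mu) * inv_std for v in chunk)
        out.append([g_c * v + b_c for v, g_c, b_c in zip(normed, gamma, beta)])
    return out

x = [[1.0, 3.0, 100.0, 300.0]]  # two groups on very different scales
y = group_norm(x, num_groups=2, gamma=[1.0] * 4, beta=[0.0] * 4)
print(y[0])  # each pair is normalized on its own: roughly [-1, 1, -1, 1]
```

With `num_groups=1` the statistics span all features of a sample, which in this 2D setting gives the same result as LayerNorm.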

Why RMSNorm Drops the Mean

Zhang & Sennrich (2019) argued that the mean subtraction in LayerNorm contributes little: in their experiments RMSNorm matched LayerNorm's quality while running faster. RMSNorm needs one fewer reduce, no subtraction, and no additive bias parameter. At LLM scale (100+ transformer layers, trillions of tokens), saving 30% of the normalization FLOPs is real money. Llama, Mistral, and Gemma all use RMSNorm.
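One way to see how close the two are: since $\sigma_b^2 = \frac{1}{C}\sum_{c'} x_{b,c'}^2 - \mu_b^2$, LayerNorm with $\beta = 0$ and RMSNorm produce the same output whenever a row already has zero mean. The sketch below checks this on a single row; the helper names are illustrative, $\gamma$ is fixed to 1, and it does not reproduce any result from the paper.

```python
import math

def layer_norm_row(row, eps=1e-5):
    """LayerNorm on one row with gamma = 1, beta = 0: two reductions."""
    mu = sum(row) / len(row)                           # reduction 1: mean
    var = sum((v - mu) ** 2 for v in row) / len(row)   # reduction 2: variance
    return [(v - mu) / math.sqrt(var + eps) for v in row]

def rms_norm_row(row, eps=1e-5):
    """RMSNorm on one row with gamma = 1: a single reduction."""
    mean_sq = sum(v * v for v in row) / len(row)       # the only reduction
    return [v / math.sqrt(mean_sq + eps) for v in row]

centered = [-1.5, -0.5, 0.5, 1.5]         # mean is exactly 0
shifted = [v + 10.0 for v in centered]    # same shape, mean 10

def max_gap(a, b):
    return max(abs(p - q) for p, q in zip(a, b))

gap_centered = max_gap(layer_norm_row(centered), rms_norm_row(centered))
gap_shifted = max_gap(layer_norm_row(shifted), rms_norm_row(shifted))
print(gap_centered, gap_shifted)  # ~0 for the centered row, large for the shifted one
```

The two norms only diverge when a row carries a substantial mean offset, which is the case the mean subtraction exists to handle.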

Which One When?