
Deconstructing Normalization

Part 3: Head-to-Head on a 20-Layer Deep MLP

The Question

With four normalization layers from Part 2 in hand, we want to know: which one actually works best? Or more interestingly: do any of them matter on a real training problem, or is normalization just an idea we keep using because we've always used it?

This part trains a $20$-layer non-residual MLP on the two-moons dataset using five variants — None (Identity), BatchNorm1d, LayerNorm, RMSNorm, and GroupNorm — from identical initialisation. The result is not the textbook story.

Experimental Setup

A $20$-layer non-residual MLP is intentionally chosen as an adversarial benchmark for normalization. Twenty layers is deep enough that vanishing/exploding gradients would have been a problem in the pre-normalization era; "non-residual" means we don't have skip connections to ease gradient flow, so the model is forced to rely on whatever stabilisation normalization provides.

Each layer is $\text{Linear}(64 \to 64) \to \tanh$. Twenty such layers stacked produce an MLP with roughly $83{,}500$ to $86{,}200$ parameters, depending on whether normalization adds its own: about $2{,}700$ extra for BatchNorm/LayerNorm/GroupNorm, and about half that for RMSNorm because it has no $\beta$.
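The parameter counts reported in the Results section can be reproduced with a few lines of arithmetic. The sketch below assumes a layout that is inferred from those totals rather than stated in the article: a $2 \to 64$ input layer, the twenty $64 \to 64$ hidden layers, a $64 \to 2$ output layer, and one normalization layer after each of the first $21$ Linear layers.

```python
# Hypothetical layout inferred from the reported totals, not confirmed by the article:
# Linear(2 -> 64), then 20 x Linear(64 -> 64), then Linear(64 -> 2),
# with one norm layer after each of the first 21 Linear layers.
IN_DIM, HIDDEN, OUT_DIM, DEPTH = 2, 64, 2, 20


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


base = (
    linear_params(IN_DIM, HIDDEN)
    + DEPTH * linear_params(HIDDEN, HIDDEN)
    + linear_params(HIDDEN, OUT_DIM)
)
n_norm_layers = DEPTH + 1

# Scale and shift (gamma, beta) versus scale only (gamma).
affine_extra = n_norm_layers * 2 * HIDDEN
rms_extra = n_norm_layers * HIDDEN

param_counts = {
    "None": base,
    "BatchNorm/LayerNorm/GroupNorm": base + affine_extra,
    "RMSNorm": base + rms_extra,
}
for name, count in param_counts.items():
    print(f"{name:32s}{count:>8,d}")
```

Under these assumptions the three totals match the Params column exactly, which suggests the extra cost is $128$ parameters per norm layer with $\gamma$ and $\beta$, and $64$ per RMSNorm layer.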

All five variants train from identical initial weights. We re-seed PyTorch before constructing each model so the only thing varying between runs is the choice of normalization. The dataset is $1{,}000$ two-moons points with $\sigma = 0.2$ noise. Training uses AdamW with $\eta = 10^{-3}$ for $60$ epochs at batch size $128$. Each variant is timed on an Apple M-series CPU.
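A minimal sketch of how the five variants might be built from identical weights. Several details are assumptions rather than the article's code: the names make_mlp and NORMS are hypothetical, the norm is placed between each Linear and its tanh, the input and output layers are $2 \to 64$ and $64 \to 2$, GroupNorm uses $8$ groups, and the small RMSNorm class is a stand-in for the Part 2 implementation.

```python
import torch
from torch import nn


class RMSNorm(nn.Module):
    """Minimal RMSNorm stand-in; the Part 2 implementation may differ."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.weight


# Hypothetical registry: each entry builds one norm layer for a given width.
NORMS = {
    "none": lambda d: nn.Identity(),
    "batch": lambda d: nn.BatchNorm1d(d),
    "layer": lambda d: nn.LayerNorm(d),
    "rms": lambda d: RMSNorm(d),
    "group": lambda d: nn.GroupNorm(8, d),  # group count of 8 is an assumption
}


def make_mlp(norm: str, seed: int = 0, hidden: int = 64, depth: int = 20) -> nn.Sequential:
    # Re-seed so every variant draws the same random Linear weights.
    # The norm layers initialise to constants and consume no random numbers.
    torch.manual_seed(seed)
    layers, n_in = [], 2
    for _ in range(depth + 1):  # input layer plus `depth` hidden layers
        layers += [nn.Linear(n_in, hidden), NORMS[norm](hidden), nn.Tanh()]
        n_in = hidden
    layers.append(nn.Linear(hidden, 2))  # two-class output head
    return nn.Sequential(*layers)


models = {name: make_mlp(name) for name in NORMS}
optimizers = {
    name: torch.optim.AdamW(model.parameters(), lr=1e-3)
    for name, model in models.items()
}
```

Because the seed is reset inside the constructor, comparing models["none"][0].weight against the same layer in any other variant is a quick way to confirm that only the normalization differs.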

Results

| Norm | Params | Final loss | Final acc | Train time | FLOPs/call |
| --- | --- | --- | --- | --- | --- |
| None | $83{,}522$ | $0.1116$ | $97.50\%$ | $1.8$ s | $0$ |
| BatchNorm | $86{,}210$ | $0.1060$ | $95.70\%$ | $4.3$ s | $65{,}536$ |
| LayerNorm | $86{,}210$ | $0.0881$ | $96.60\%$ | $4.4$ s | $65{,}536$ |
| RMSNorm | $\mathbf{84{,}866}$ | $\mathbf{0.0927}$ | $97.10\%$ | $\mathbf{2.9}$ s | $\mathbf{32{,}896}$ |
| GroupNorm | $86{,}210$ | $0.1229$ | $97.20\%$ | $6.8$ s | $65{,}536$ |

Result 1: No-Norm Wins on Accuracy

The Identity baseline (no normalization at all) reaches $97.50\%$ accuracy, higher than any of the four normalization variants. BatchNorm actively hurts accuracy by $1.8$ percentage points relative to no normalization. LayerNorm hurts by $0.9$ points and RMSNorm by $0.4$. GroupNorm hurts least, at $0.3$ points.

This is not what the textbook story predicts. The conventional wisdom — "normalization is essential for training deep networks" — turns out to be regime-dependent. Three things explain why it doesn't apply here.

Reason 1: Tanh naturally bounds activations to $(-1, 1)$. The original motivation for normalization was vanishing/exploding gradients in deep networks with ReLU or sigmoid activations. Tanh is already saturating: its outputs are bounded, its gradients vanish smoothly at the saturation boundaries, and the activation distribution naturally stays in a well-conditioned range. Add normalization on top and you are rescaling an already well-scaled distribution, which adds noise without removing instability.
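A quick numeric check of the two tanh properties named above, using only the standard library. The helper name tanh_grad is introduced here for illustration.

```python
import math


def tanh_grad(z: float) -> float:
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2


# Outputs stay inside the unit interval while the gradient decays smoothly.
for z in (0.0, 1.0, 3.0, 10.0):
    print(f"z={z:5.1f}  tanh={math.tanh(z):.6f}  grad={tanh_grad(z):.2e}")
```

However large the pre-activation gets, the output magnitude never exceeds $1$, and the gradient falls off smoothly rather than cutting to zero at a threshold.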

Reason 2: AdamW already compensates for poorly scaled gradients. Its per-parameter adaptive learning rates effectively normalize the gradient scale of each parameter, which overlaps with much of what BatchNorm-style normalization does for activations. Once you have AdamW, the marginal benefit of activation normalization shrinks. Pre-AdamW deep-learning training (vanilla SGD or SGD with momentum) benefited much more from BatchNorm because it had less per-parameter adaptation.
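The scale-normalizing effect can be seen in a scalar sketch of the Adam update rule. This is a simplified illustration, not PyTorch's implementation: adam_steps is a hypothetical helper, it handles a single parameter, and it leaves out AdamW's decoupled weight decay.

```python
import math


def adam_steps(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the updates Adam would apply for a stream of scalar gradients."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


# Two parameters whose gradients differ by six orders of magnitude.
small = adam_steps([1e-3] * 5)
large = adam_steps([1e+3] * 5)
print(small[0], large[0])
```

Both parameters move by roughly the learning rate per step, because the update divides the gradient by a running estimate of its own magnitude. Plain SGD would instead move the second parameter a million times further.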

Reason 3: BatchNorm's running statistics are noisy at small scale. Batch size $128$ on a $1{,}000$-sample dataset gives $\sim 8$ mini-batches per epoch. Each mini-batch's mean and variance are estimated from $128$ samples, which is a noisy estimate. The EMA-updated running statistics inherit that noise. The regularisation benefit of BatchNorm assumes its statistics are reasonably accurate; at this scale they aren't, and the noise overwhelms the benefit.
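To put a rough number on that noise, the sketch below counts the mini-batches per epoch and estimates how much a batch mean fluctuates for a single unit-variance feature. The Gaussian feature is a synthetic stand-in, not the actual activations from the experiment.

```python
import math
import random
import statistics

N_SAMPLES, BATCH_SIZE = 1_000, 128

# Seven full batches plus one partial batch of 104 samples.
batches_per_epoch = math.ceil(N_SAMPLES / BATCH_SIZE)

# Standard error of a batch mean for a unit-variance feature: sigma / sqrt(n).
theoretical_se = 1.0 / math.sqrt(BATCH_SIZE)

# Empirical check on a synthetic feature (an assumption, not the real activations).
rng = random.Random(0)
feature = [rng.gauss(0.0, 1.0) for _ in range(N_SAMPLES)]
batch_means = [
    statistics.fmean(rng.sample(feature, BATCH_SIZE)) for _ in range(2_000)
]
# Sampling without replacement from only 1,000 points makes this
# somewhat smaller than the theoretical value.
empirical_sd = statistics.stdev(batch_means)

print(f"batches per epoch:       {batches_per_epoch}")
print(f"theoretical std error:   {theoretical_se:.4f}")
print(f"empirical std of means:  {empirical_sd:.4f}")
```

Each batch mean is off by several percent of a standard deviation, and that error is re-drawn on every step at each normalization layer in the stack.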

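Result 2: RMSNorm Beats LayerNorm on Nearly Every Measured Axis

This is the result that transfers to LLM scale. Among the normalizations that do apply, comparing RMSNorm against LayerNorm in the results table: RMSNorm has fewer parameters ($84{,}866$ vs $86{,}210$), higher final accuracy ($97.10\%$ vs $96.60\%$), a shorter training time ($2.9$ s vs $4.4$ s), and about half the FLOPs per call ($32{,}896$ vs $65{,}536$). LayerNorm is ahead only on final loss ($0.0881$ vs $0.0927$).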

This is the Llama / Mistral / Gemma design choice in microcosm. The mean-subtraction step in LayerNorm is doing no measurable work on this benchmark — and at LLM scale (trillions of training tokens, $80$+ Transformer blocks, billions of inference tokens) those $30$–$50\%$ FLOP savings are a real fraction of the total compute budget. The mean was load-bearing for nothing.
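The difference between the two layers comes down to one subtraction. The parameter-free sketch below, written in plain Python rather than taken from Part 2, shows that they coincide on zero-mean input and diverge once the input has an offset.

```python
import math


def layer_norm(x, eps=1e-6):
    """LayerNorm without learnable parameters: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-6):
    """RMSNorm without learnable parameters: divide by the root mean square only."""
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) for v in x]


centered = [-1.5, -0.5, 0.5, 1.5]      # zero mean
shifted = [v + 3.0 for v in centered]  # same spread, mean of 3

print("centered:", layer_norm(centered), rms_norm(centered))
print("shifted: ", layer_norm(shifted), rms_norm(shifted))
```

RMSNorm skips the mean reduction and the subtraction, which is where its lower per-call cost comes from. Whether the offset matters in practice depends on how far the incoming activations drift from zero mean.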

Result 3: BatchNorm Underperforms LayerNorm on This Problem

BatchNorm not only loses to no-norm — it also loses to LayerNorm by $0.9$ points of accuracy, with the same parameter count and same FLOPs/call. The difference is entirely in the reduction axis.

LayerNorm uses per-sample statistics, which are completely deterministic given the input. BatchNorm uses per-batch statistics, which depend on the random composition of the mini-batch. During training, this means BatchNorm injects mini-batch-shuffling noise into every forward pass — a form of implicit regularisation when the batch is large and informative, and just noise when it isn't. At the batch size and dataset size of this experiment, it's the noise regime.
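The reduction-axis difference is easy to see on a tiny example. In this plain-Python sketch (parameter-free, with made-up numbers), the same sample is placed in two different mini-batches and normalized both ways.

```python
import math


def layer_norm_row(row, eps=1e-5):
    """Normalise one sample across its own features."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def batch_norm_train(batch, eps=1e-5):
    """Normalise each feature column across the batch, as in training mode."""
    n = len(batch)
    columns = list(zip(*batch))
    means = [sum(col) / n for col in columns]
    variances = [sum((v - m) ** 2 for v in col) / n for col, m in zip(columns, means)]
    return [
        [(v - m) / math.sqrt(s + eps) for v, m, s in zip(row, means, variances)]
        for row in batch
    ]


sample = [0.2, -1.0, 0.7, 1.5]
batch_a = [sample, [1.0, 0.0, -1.0, 0.5], [0.3, 0.3, 0.3, -0.2]]
batch_b = [sample, [-2.0, 2.0, 0.1, 0.0], [0.9, -0.4, 1.2, 2.0]]

ln_a, ln_b = layer_norm_row(batch_a[0]), layer_norm_row(batch_b[0])
bn_a, bn_b = batch_norm_train(batch_a)[0], batch_norm_train(batch_b)[0]

print("LayerNorm identical across batches:", ln_a == ln_b)
print("BatchNorm identical across batches:", bn_a == bn_b)
```

The LayerNorm output for the sample depends only on the sample itself, while the BatchNorm output changes with whichever other samples happen to share its mini-batch.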

BatchNorm shines specifically when the batch statistics are reliable (large batches, $\sim 256$+) and informative (feature distributions vary systematically across the batch). Image classification on ImageNet — the original motivating use case — sits squarely in that regime. Small-batch / small-dataset training does not.

Result 4: GroupNorm Is Slow Because of the Reshape

GroupNorm reaches a similar accuracy to RMSNorm ($97.20\%$ vs $97.10\%$) but takes about $2.3\times$ longer to train ($6.8$ s vs $2.9$ s). The reason is the reshape: PyTorch's x.view(B, G, C // G) followed by the back-view produces a different memory layout that doesn't always vectorise as efficiently as a flat operation. On heavily parallel hardware (CUDA) the difference is smaller, but the overhead is real.
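For readers who have not seen the reshape-based formulation, here is a minimal parameter-free sketch of it, checked against PyTorch's built-in layer. The function name group_norm_view and the choice of $8$ groups are assumptions for illustration; the sketch shows what the reshape does and makes no claim about timing.

```python
import torch
from torch import nn


def group_norm_view(x: torch.Tensor, num_groups: int, eps: float = 1e-5) -> torch.Tensor:
    """GroupNorm on a (B, C) tensor via view, without learnable parameters."""
    B, C = x.shape
    g = x.view(B, num_groups, C // num_groups)        # split channels into groups
    mean = g.mean(dim=-1, keepdim=True)               # one mean per (sample, group)
    var = g.var(dim=-1, unbiased=False, keepdim=True)
    g = (g - mean) / torch.sqrt(var + eps)
    return g.view(B, C)                               # back to the flat layout


torch.manual_seed(0)
x = torch.randn(128, 64)
reference = nn.GroupNorm(8, 64, affine=False)(x)
max_diff = (group_norm_view(x, 8) - reference).abs().max().item()
print(f"max abs difference vs nn.GroupNorm: {max_diff:.2e}")
```

With $64$ channels and $8$ groups, each reduction runs over only $8$ values, so a batch of $128$ rows turns into $1{,}024$ very small reductions per call.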

For image diffusion models (where GroupNorm is standard), the reshape overhead is amortised by the much larger spatial dimension. For a $1$D MLP, the overhead is proportionally larger because each normalization call is small.

Three Lessons

Lesson 1: Normalization is not free magic. At small scale it can hurt accuracy and slow training. The "always add LayerNorm" advice is regime-dependent. Test before assuming.

Lesson 2: Toy benchmarks lie about norm importance. Real production constraints — huge model depth ($80$+ layers), tiny per-token effective batch sizes, trillions of tokens — are what make normalization essential. Not "the network has many layers" in the abstract.

Lesson 3: The RMSNorm reframe travels everywhere. Whenever a paper introduces a method that subtracts something, ask: is the subtraction load-bearing? Often it is not. The mean subtraction in LayerNorm was treated as load-bearing by convention for years, until someone tested whether dropping it would matter. It doesn't.

What This Means for LLM Training

At LLM scale, the regime changes. Models have $80$+ Transformer blocks, each with two normalization layers. Without normalization, the activations explode in the first few hundred training steps and the model collapses. With LayerNorm, training is stable and reaches state-of-the-art quality. With RMSNorm, training is also stable and reaches the same quality, with $30$–$50\%$ savings on the per-call normalization cost.

The benchmark in this article cannot reach the regime where normalization is essential — our network is too shallow and too well-conditioned. What it can demonstrate cleanly is the RMSNorm vs LayerNorm equivalence at lower scale, which is reassuring evidence that the design choice transfers.

If you are training an LLM in 2026, the answer is unambiguous: RMSNorm. Llama, Mistral, Gemma, Qwen, and most modern open-weight models use it. The mean subtraction step is doing nothing useful at LLM scale and dropping it saves real compute. PyTorch provides nn.RMSNorm in recent versions; before that, the $8$-line custom implementation from Part 2 works fine.
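One way to stay portable across PyTorch versions is to prefer the built-in layer and fall back to a custom one. The sketch below is an assumption-laden illustration: FallbackRMSNorm and make_rms_norm are hypothetical names, and the fallback is a generic RMSNorm rather than the exact Part 2 code.

```python
import torch
from torch import nn


class FallbackRMSNorm(nn.Module):
    """Generic RMSNorm for PyTorch builds that lack nn.RMSNorm."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # gamma only, no beta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.weight


def make_rms_norm(dim: int, eps: float = 1e-6) -> nn.Module:
    """Use the built-in layer when this PyTorch build has one, else the fallback."""
    if hasattr(nn, "RMSNorm"):
        return nn.RMSNorm(dim, eps=eps)
    return FallbackRMSNorm(dim, eps)


torch.manual_seed(0)
x = torch.randn(4, 64)
y = make_rms_norm(64)(x)
print(type(make_rms_norm(64)).__name__, tuple(y.shape))
```

Passing eps explicitly keeps the two paths numerically aligned, since the built-in layer otherwise picks a dtype-dependent default.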

If You Are Training a Small Model

If you are training a small model (a few layers, modest depth, AdamW, tanh or GELU activations), the benchmark above suggests that no normalization may be your best option. This is unintuitive but borne out by the numbers. Run an ablation. If the no-norm baseline is comparable to or better than your normalized version, save yourself the compute and complexity.

If you are training a deep ResNet-style CNN on ImageNet-scale data, BatchNorm is still the standard answer — the regime that originally motivated its design. If you are training a Transformer, RMSNorm. If you are training a U-Net for image diffusion, GroupNorm. The choice is not "best norm in the abstract" but "which norm matches your regime."

Summary

Full code on GitHub: github.com/soveshmohapatra/Normalization-Layers