Deconstructing Optimizers: Part 3 - Head-to-Head on Three Problems

The Question

We have five optimizers from Part 2. Each one imposes a different prior on the loss landscape — Momentum bets on consistent gradient direction, Adam bets on per-parameter scale variation, Lion bets on sign-only updates. Whose prior is right? The honest answer is "it depends on the landscape", and this part demonstrates exactly what that dependence looks like.

Three experiments, in increasing realism: a smooth narrow valley (Rosenbrock), a flat-plateau-with-sharp-minimum (Beale), and a real 4-layer MLP on the two-moons dataset. Each one probes a different optimizer property. By the end of the third experiment, the textbook ranking of optimizers will look very different.

Experimental Setup

For the 2D landscapes, each optimizer gets its own tuned learning rate. This is important. Using a single learning rate for all five would be a fundamentally unfair comparison: Adam's $\sqrt{\hat{v}}$ scaling means it operates in a very different "effective learning rate" regime than SGD, and Lion's sign-only updates move every coordinate by exactly the learning rate, so Lion typically needs a smaller value still. A like-for-like comparison requires each optimizer at its own best lr.

For the MLP experiment we use the same lr for all five (the value is given in the Experiment 3 setup) because that is closer to how practitioners actually choose optimizers in production: you pick an lr that works for most adaptive methods, not one separately tuned per optimizer.

All five runs on the MLP start from identical initial weights. We re-seed PyTorch before constructing each model so the only thing varying between runs is the optimizer. Without this control, accuracy differences could come from initialisation noise rather than optimizer choice.
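The seeding control can be sketched in a few lines. This is a stand-in in plain Python, not the article's training code: `init_weights`, `SEED` and the parameter count are hypothetical, and Python's `random.Random` plays the role that `torch.manual_seed` would play before building each PyTorch model.

```python
import random

def init_weights(seed, n_params=10):
    """Hypothetical stand-in for model construction: draw weights from a seeded RNG."""
    rng = random.Random(seed)  # fresh generator, so earlier draws cannot leak in
    return [rng.gauss(0.0, 0.1) for _ in range(n_params)]

OPTIMIZERS = ["SGD", "Momentum", "Adam", "AdamW", "Lion"]
SEED = 0  # assumed value; any fixed seed gives the same control

# Re-seed before every construction so each optimizer starts from the same point.
initial = {name: init_weights(SEED) for name in OPTIMIZERS}

identical = all(initial[name] == initial["SGD"] for name in OPTIMIZERS)
print("identical initial weights:", identical)
```

Seeding once at the top of the script would not be enough, because each model construction consumes random numbers and the second model would start from a different state.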

Experiment 1 — Rosenbrock's Banana Valley

Rosenbrock's function is the standard pathological test for optimizers:

$$ f(x, y) = (a - x)^2 + b\,(y - x^2)^2, \qquad a = 1,\ b = 100. $$

The minimum sits at $(1, 1)$ where $f = 0$. The curvature is wildly anisotropic — the function changes slowly along a curving valley floor (the parabola $y = x^2$) and rapidly perpendicular to it. Naive gradient descent oscillates across the steep direction and crawls along the shallow one; this is the canonical test of whether an optimizer can navigate ill-conditioned curvature.

Starting point: $(-1.5, 1.5)$, well outside the valley. We integrate for $2{,}000$ iterations.
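For reference, here is a minimal sketch of the function and its analytic gradient in plain Python. The function names are illustrative and not taken from the article's repository.

```python
def rosenbrock(x, y, a=1.0, b=100.0):
    """f(x, y) = (a - x)^2 + b (y - x^2)^2."""
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rosenbrock_grad(x, y, a=1.0, b=100.0):
    """Analytic gradient (df/dx, df/dy)."""
    dfdx = -2.0 * (a - x) - 4.0 * b * x * (y - x ** 2)
    dfdy = 2.0 * b * (y - x ** 2)
    return dfdx, dfdy

start = (-1.5, 1.5)
f_start = rosenbrock(*start)       # 62.5
g_start = rosenbrock_grad(*start)  # (-455.0, -150.0)
print(f_start, g_start)
print(rosenbrock(1.0, 1.0), rosenbrock_grad(1.0, 1.0))  # zero value and zero gradient at the minimum
```

At the starting point both gradient components are dominated by the $b\,(y - x^2)^2$ term, which is the steep cross-valley direction described above.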

Optimizer	Final $f(x, y)$	$\|p - p^\star\|$
SGD	$8.80 \times 10^{-3}$	$0.20$
Momentum	$6.26 \times 10^{-9}$	$1.8 \times 10^{-4}$
Adam	$4.02 \times 10^{-2}$	$0.41$
AdamW	$4.94 \times 10^{-2}$	$0.45$
Lion	$4.03$	$2.00$

Momentum wins by six orders of magnitude. Adam and AdamW plateau between $4 \times 10^{-2}$ and $5 \times 10^{-2}$, roughly five times worse than vanilla SGD.

This result is genuinely counter-intuitive and worth unpacking. Why does Momentum dominate Adam on a smooth problem? Two reasons, both inherent to the optimizers' design:

Reason 1: Adam's step does not shrink with the gradient. Near a smooth minimum, the gradient direction is consistent and the gradient magnitude is small. The running average $\hat{v}$ tracks the squared gradient, so $\sqrt{\hat{v}}$ shrinks along with $\hat{m}$ and the ratio $\hat{m} / \sqrt{\hat{v}}$ stays of order one. Adam's per-step displacement is therefore roughly $\eta$ however small the gradient becomes. The optimizer cannot land softly: it overshoots the minimum and oscillates around it. The plateau at $\sim 5 \times 10^{-2}$ is consistent with this overshoot regime.

Reason 2: Momentum's velocity buffer naturally shrinks. The velocity update $v \leftarrow \beta v + g$ inherits the gradient's magnitude. Near a smooth minimum where $g \to 0$, the velocity also decays towards zero geometrically, at a rate set by $\beta$. Momentum converges far more tightly than the others (to $f \approx 6 \times 10^{-9}$ here) because the step $\eta v$ in $\theta \leftarrow \theta - \eta v$ vanishes along with the gradient.

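Lion's $\mathrm{sign}(c)$ update never shrinks: every step moves each coordinate by exactly $\eta$. On a smooth 2D problem this means it cannot settle closer to the minimum than about $\eta$. The final values, $f = 4.03$ at distance $2.00$ from the minimum, are consistent with a point near $(-1, 1)$ on the valley floor (where $f = 4$), which suggests that Lion reaches the valley but makes little net progress along it within $2{,}000$ iterations.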
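The step-size argument can be checked numerically by feeding each update rule a constant gradient at two different scales. This is a minimal 1D sketch, not the Part 2 implementation: the hyperparameters (`lr`, the betas, `eps`, `n`) are assumed textbook-style defaults, and the Lion rule is written in its commonly published form, with $c$ an interpolation between the momentum buffer and the current gradient.

```python
import math

def momentum_step(g, lr=0.01, beta=0.9, n=200):
    """Size of the n-th update lr * v when every gradient equals g."""
    v = 0.0
    for _ in range(n):
        v = beta * v + g
    return lr * v

def adam_step(g, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, n=200):
    """Size of the n-th bias-corrected Adam update for a constant gradient g."""
    m = v = 0.0
    for _ in range(n):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** n)
    v_hat = v / (1 - beta2 ** n)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

def lion_step(g, lr=0.01, beta1=0.9, beta2=0.99, n=200):
    """Size of the n-th Lion update lr * sign(c) for a constant gradient g."""
    m = 0.0
    step = 0.0
    for _ in range(n):
        c = beta1 * m + (1 - beta1) * g
        step = lr * math.copysign(1.0, c) if c != 0 else 0.0
        m = beta2 * m + (1 - beta2) * g
    return step

for g in (1.0, 1e-3):
    print(f"g={g:g}  momentum={momentum_step(g):.3e}  "
          f"adam={adam_step(g):.3e}  lion={lion_step(g):.3e}")
```

Shrinking the gradient by a factor of 1000 shrinks the Momentum step by the same factor, while the Adam and Lion steps stay at about `lr`. That is the same property that prevents a soft landing near a smooth minimum and that keeps the step useful when the gradient is tiny.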

Experiment 2 — Beale's Function

Beale's function poses the opposite challenge to Rosenbrock:

$$ f(x, y) = (1.5 - x + xy)^2 + (2.25 - x + xy^2)^2 + (2.625 - x + xy^3)^2. $$

It has a global minimum at $(3, 0.5)$ with $f = 0$, surrounded by broad flat regions where the gradient is nearly zero. Optimizers that take steps proportional to the gradient magnitude (SGD, Momentum) crawl through these plateaus; optimizers that normalise the step magnitude (Adam, Lion) escape quickly.

Starting point: $(-2, -2)$, far from the minimum. $2{,}000$ iterations.
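As with Rosenbrock, a short plain-Python sketch of the function and its analytic gradient makes the setup concrete. The helper names are illustrative and not taken from the article's repository.

```python
def beale_terms(x, y):
    """The three residuals whose squares sum to Beale's function."""
    t1 = 1.5 - x + x * y
    t2 = 2.25 - x + x * y ** 2
    t3 = 2.625 - x + x * y ** 3
    return t1, t2, t3

def beale(x, y):
    t1, t2, t3 = beale_terms(x, y)
    return t1 ** 2 + t2 ** 2 + t3 ** 2

def beale_grad(x, y):
    """Analytic gradient (df/dx, df/dy) via the chain rule on each residual."""
    t1, t2, t3 = beale_terms(x, y)
    dfdx = 2 * t1 * (y - 1) + 2 * t2 * (y ** 2 - 1) + 2 * t3 * (y ** 3 - 1)
    dfdy = 2 * t1 * x + 2 * t2 * (2 * x * y) + 2 * t3 * (3 * x * y ** 2)
    return dfdx, dfdy

print(beale(3.0, 0.5), beale_grad(3.0, 0.5))  # all three residuals vanish at the minimum
```

All three residuals are exactly zero at $(3, 0.5)$, so both the value and the gradient vanish there.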

Optimizer	Final $f(x, y)$
Adam	$2.89 \times 10^{-1}$
AdamW	$3.62 \times 10^{-1}$
Momentum	$1.17$
SGD	$1.40$
Lion	$38.7$ (diverged)

The order has reversed. Adam beats Momentum and SGD by a factor of four to five. This is the mirror image of Experiment 1: Adam's $\sqrt{\hat{v}}$ normalisation, which hurts on smooth minima, helps decisively on flat plateaus.

The mechanism: in the flat regions, the true gradient is small in magnitude. SGD takes a tiny step proportional to that gradient and barely moves. Momentum accumulates a velocity proportional to the gradient, so it moves somewhat faster but still slowly. Adam's $\hat{v}$ also builds up small values, but $\hat{m} / \sqrt{\hat{v}}$ keeps the step size at a useful magnitude regardless of the gradient's absolute scale. Adam escapes the plateau in roughly $50$ iterations; SGD is still in the plateau after $2{,}000$.

Lion's divergence is illustrative. With fixed-magnitude steps and no second-moment buffer to dampen anything, Lion's trajectory bounces around the plateau, accumulates large momentum from spurious gradients, and ends up driven away from the minimum entirely. Lion's design is optimised for very-high-dimensional landscapes (LLM training) where its sign-only behaviour is a feature; on a 2D problem it is brittle.

The two experiments together demonstrate something important: no optimizer dominates across landscapes. Different optimizers are good at different things. Anyone who claims "Adam is always best" or "SGD is the gold standard" is reasoning from a single regime.

Experiment 3 — MLP on Moons (Real Training)

The 2D landscapes are pedagogical extremes. Real neural network training operates in a regime with millions of parameters, stochastic gradients from mini-batches, and a loss surface whose curvature changes constantly as the parameters move. The 2D differences may or may not survive in this regime.

Setup: a $2 \to 64 \to 64 \to 2$ MLP with tanh activations on the two-moons dataset ($1{,}000$ points). 60 epochs, batch size 128, learning rate $10^{-2}$ for all five optimizers, identical initial weights.
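The size of this problem follows directly from the setup. The arithmetic below assumes every linear layer has a bias and that the final partial mini-batch of each epoch is kept rather than dropped; neither detail is stated above.

```python
import math

layer_sizes = [2, 64, 64, 2]

# Weights plus biases for each consecutive pair of layers (assumes biases are used).
n_params = sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

n_points, batch_size, epochs = 1000, 128, 60
steps_per_epoch = math.ceil(n_points / batch_size)  # assumes the partial last batch is kept
total_steps = steps_per_epoch * epochs

print(n_params, steps_per_epoch, total_steps)  # 4482 8 480
```

Under these assumptions each optimizer gets 480 updates on about 4,500 parameters, which is tiny by production standards but already far from the 2D setting.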

Optimizer	Final training loss	Final training accuracy
SGD	$0.2845$	$88.30\%$
Momentum	$0.2220$	$91.20\%$
Adam	$\mathbf{0.0746}$	$\mathbf{97.70\%}$
AdamW	$0.0759$	$97.60\%$
Lion	$0.0757$	$97.50\%$

Three findings here, each interesting in its own way.

Finding 1: The Adam family clusters at $97.5$–$97.7\%$ accuracy. Adam, AdamW, and Lion are within $0.2$ percentage points of each other. The dramatic differences from Experiments 1 and 2 have vanished. On a real training problem with stochasticity, parameter count $\sim 4{,}500$, and a loss surface whose curvature is constantly changing, the adaptive optimizers all converge to roughly the same answer. The choice between them stops being a question of optimization quality and starts being a question of compute, memory, and engineering convenience.

Finding 2: The Adam family beats Momentum by $\sim 6$ percentage points and SGD by $\sim 10$. This part of the textbook story is correct. Adaptive optimizers do beat non-adaptive ones on real high-dimensional training. The gap is not enormous on this small problem, but it is consistent. At scale (large vision or language models) the gap is usually larger.

Finding 3: Lion catches up despite struggling on the 2D landscapes. This is the surprise. Lion was the worst performer on both Rosenbrock and Beale, yet on the MLP it finishes within $0.2$ percentage points of Adam and AdamW. The reason is dimensionality: in $\sim 4{,}500$-dimensional weight space, the sign-only updates do not have the directional pathologies they exhibit on 2D toy problems. The coarse fixed-size steps that hurt Lion in 2D average out across many parameters, and the constant-magnitude updates act as a form of implicit regularisation. Lion was designed for this regime.

Why The Differences Vanish

The mechanism behind Finding 1 is worth understanding because it transfers to real production training. On the 2D landscapes, the optimizer's specific behaviour near features like smooth minima or flat plateaus is exposed directly — every iteration matters and the landscape is fully visible to the optimizer.

On a real training problem, three things average out the differences:

(a) Stochastic gradients. Each step uses a different mini-batch, so the "gradient" is noisy. The signal-to-noise ratio of any single step is low. Optimizer-specific behaviour on the deterministic loss surface is smeared by mini-batch noise.

(b) Dimensionality. In a high-dimensional weight space, gradient directions are nearly orthogonal between iterations. Most of an optimizer's specialised behaviour (Momentum's accumulation, Adam's variance normalisation) becomes a per-coordinate average rather than a directional advantage.

(c) Loss-surface plasticity. The loss surface changes during training as the parameters move. Whatever pathological feature one optimizer would have struggled with at one point in training, the parameters are somewhere else by the next epoch. There is no fixed valley or plateau the optimizer can get stuck in for long.
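Effect (a) is easy to see on a toy problem. The sketch below is an illustrative 1D least-squares fit with synthetic data, not the moons experiment: it measures how much the mini-batch gradient scatters around its mean for two batch sizes. The data model, seed and batch counts are all assumptions chosen for the demonstration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable
N = 1000
xs = [rng.uniform(-1.0, 1.0) for _ in range(N)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]  # synthetic targets, true slope 2
w = 0.0  # current parameter value

def batch_grad(indices):
    """Gradient of the mean of 0.5 * (w*x - y)^2 over the given samples."""
    return sum((w * xs[i] - ys[i]) * xs[i] for i in indices) / len(indices)

def grad_spread(batch_size, n_batches=200):
    """Standard deviation of the mini-batch gradient across random batches."""
    grads = [batch_grad(rng.sample(range(N), batch_size)) for _ in range(n_batches)]
    return statistics.stdev(grads)

full_grad = batch_grad(range(N))
spread_small = grad_spread(8)
spread_large = grad_spread(128)
print(f"full gradient {full_grad:.3f}, "
      f"spread at batch 8: {spread_small:.3f}, at batch 128: {spread_large:.3f}")
```

The smaller the batch, the wider the scatter around the full-data gradient, so any single step carries less information about the deterministic surface that the 2D experiments exposed directly.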

These three effects together explain why production ML training is forgiving of optimizer choice within the adaptive family. AdamW is the default, but Lion would have produced statistically indistinguishable training curves on most workloads. The flip side is that production training is unforgiving of non-adaptive optimizers: plain SGD generally leaves $5$–$10$ points of accuracy on the table compared to AdamW, which is a much bigger deal than the difference between AdamW and Lion.

Practical Recipe

For a new model on a new dataset: AdamW. Learning rate $3 \times 10^{-4}$ to $10^{-3}$. Weight decay $0.01$. Cosine annealing over the training run. Move on; don't optimize the optimizer.
If optimizer-state memory is a bottleneck (training large language models on consumer hardware): Lion. Drop the learning rate to $1/10$ of what AdamW would use. Drop weight decay to $0$ or near it (Lion has implicit regularisation).
Use plain SGD with Momentum only if you have a specific reason — historically image classification with batch normalisation has favoured it, and some convex problems benefit. For deep learning in 2026 it is mostly pedagogical.
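Do not use Adam-without-W in 2026. With weight decay set to zero AdamW reduces to Adam, so choosing AdamW gives up nothing, and with non-zero weight decay its decoupled form is generally the better-behaved choice. The original Adam exists for backward compatibility, not for any active engineering reason.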
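The cosine annealing mentioned in the first recipe item can be written out in a few lines. This is a plain-Python sketch of the usual schedule formula with hypothetical names and defaults. In PyTorch the corresponding built-in is `torch.optim.lr_scheduler.CosineAnnealingLR`.

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-3, lr_min=0.0):
    """Cosine-annealed learning rate: lr_max at step 0, lr_min at total_steps."""
    progress = min(step, total_steps) / total_steps  # clamp so late calls stay at lr_min
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 480  # hypothetical run length
for step in (0, 120, 240, 360, 480):
    print(step, f"{cosine_lr(step, TOTAL_STEPS):.6f}")
```

The schedule starts at the peak rate, passes through half of it at the midpoint, and reaches the floor on the final step, so the largest updates happen early and the last ones are small.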

What the Three Experiments Actually Prove

Optimizers are not "better" or "worse" in the abstract. They impose different priors on the landscape, and each prior matches some landscapes well and others badly.
The dramatic differences on 2D toy landscapes mostly disappear on real training. The differences that survive (Adam-family vs SGD) are large enough to matter; the rest (AdamW vs Adam vs Lion) come down to second-order effects, memory budget, and the LR schedule.
The framework matters more than the optimizer. PyTorch's torch.optim ships more than a dozen optimizers, and we just implemented five optimizers in $\sim 75$ lines of pure Python. Once you understand the family tree, switching between optimizers is a one-line change.

In subsequent series we treat the optimizer as a settled matter (AdamW with sensible hyperparameters) and focus on architecture and loss design.

Full code on GitHub: github.com/soveshmohapatra/Optimizers