Nearly every neural network you have heard of was trained by a small variation on the same update rule. Put the major optimizers side by side and the family tree becomes obvious: each one is a small modification of its predecessor.
Gradient Descent and Momentum
The starting point: $\theta_{t+1} = \theta_t - \eta \, \nabla \mathcal{L}(\theta_t)$. On ill-conditioned problems, plain GD oscillates across the steep direction and crawls along the shallow one. The fix: add memory.
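A small numerical sketch of that behaviour, assuming a hypothetical two-dimensional quadratic $\mathcal{L}(\theta) = \tfrac{1}{2}(100\,\theta_1^2 + \theta_2^2)$ and a step size picked purely for illustration:

```python
def gd_quadratic(theta, curvatures, lr, steps):
    """Plain gradient descent on L(theta) = 0.5 * sum(c_i * theta_i**2).

    The loss and its curvatures are illustrative assumptions, not taken
    from any particular model.
    """
    path = [list(theta)]
    for _ in range(steps):
        # Gradient of the quadratic: dL/dtheta_i = c_i * theta_i
        grad = [c * t for c, t in zip(curvatures, theta)]
        theta = [t - lr * g for t, g in zip(theta, grad)]
        path.append(list(theta))
    return path


# Steep axis has curvature 100, shallow axis has curvature 1.
path = gd_quadratic([1.0, 1.0], curvatures=[100.0, 1.0], lr=0.019, steps=50)

# Steep coordinate is multiplied by (1 - 0.019 * 100) = -0.9 each step,
# so it flips sign every iteration. The shallow coordinate is multiplied
# by (1 - 0.019) = 0.981, so it shrinks very slowly.
steep = [p[0] for p in path]
shallow = [p[1] for p in path]
```

The step size is capped by the steep axis (anything above $2/100$ diverges), which is what leaves the shallow axis crawling.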
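Momentum (Polyak, 1964):
$$v_t = \beta v_{t-1} + g_t, \qquad \theta_{t+1} = \theta_t - \eta \, v_t, \qquad g_t = \nabla \mathcal{L}(\theta_t)$$
Typically $\beta = 0.9$. The velocity buffer $v$ averages out high-frequency oscillation and reinforces consistent directions.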
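To make the averaging effect concrete, here is a minimal sketch assuming the heavy-ball form $v \leftarrow \beta v + g$, $\theta \leftarrow \theta - \eta v$ (other texts scale the gradient by $1-\beta$). The function name and the synthetic gradients are illustrative assumptions:

```python
def momentum_step(theta, v, grad, lr=0.01, beta=0.9):
    """One heavy-ball step: v <- beta * v + g, theta <- theta - lr * v."""
    v = [beta * vi + gi for vi, gi in zip(v, grad)]
    theta = [ti - lr * vi for ti, vi in zip(theta, v)]
    return theta, v


# Coordinate 0 sees a consistent gradient, coordinate 1 an alternating one.
theta, v = [0.0, 0.0], [0.0, 0.0]
for step in range(100):
    grad = [1.0, 1.0 if step % 2 == 0 else -1.0]
    theta, v = momentum_step(theta, v, grad)

# v[0] approaches 1 / (1 - beta) = 10: the consistent direction is amplified.
# v[1] settles near 1 / (1 + beta) in magnitude: the oscillation is damped.
```

With $\beta = 0.9$ the consistent direction ends up with roughly ten times the effective step size, while the oscillating one is roughly halved.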
Adam (Kingma & Ba, 2014)
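Adam tracks two running averages of the gradient $g_t = \nabla \mathcal{L}(\theta_t)$: the first moment $m_t$ (like momentum) and the second moment $v_t$ (an elementwise running mean of $g^2$):
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$$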
The bias correction un-biases the early-step estimates, which are pulled toward zero by the initialisation $m_0 = v_0 = 0$. Dividing by $\sqrt{\hat{v}}$ normalises the step size per parameter: a diagonal preconditioner that adapts to per-axis gradient scale, a cheap proxy for curvature, without computing the Hessian.
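A minimal sketch of one Adam step in plain Python, assuming the default hyperparameters from the Adam paper ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$). The function name and list-based layout are illustrative choices, not any library's API:

```python
import math


def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for step count t (starting at t = 1)."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # Bias correction: undo the pull toward the zero initialisation.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Per-parameter normalised step.
    theta = [
        ti - lr * mh / (math.sqrt(vh) + eps)
        for ti, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return theta, m, v


# Two parameters whose gradients differ by a factor of 10^4.
theta, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
theta, m, v = adam_step(theta, m, v, grad=[100.0, 0.01], t=1)
# Both coordinates move by roughly lr, despite the gap in gradient scale.
```

On the very first step the bias-corrected ratio reduces to $g / |g|$, so each parameter moves by about $\eta$ whatever the size of its gradient.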
AdamW (Loshchilov & Hutter, 2019)
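Identical to Adam except that weight decay is applied directly to $\theta$ rather than added to the gradient before computing $m$ and $v$. With standard L2-in-loss, the decay term $\lambda\theta$ is divided by $\sqrt{\hat{v}}$ along with the rest of the gradient, so the effective decay strength depends on each parameter's gradient history: weights with large gradients are decayed less than weights with small ones. Decoupling fixes this by shrinking every weight at the same rate $\eta\lambda$. Most modern LLM training recipes use AdamW rather than Adam.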
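The difference can be sketched on a single scalar weight. Both functions below are illustrative, with hypothetical names and Adam's usual default hyperparameters assumed. The only change is where the `wd * theta` term enters:

```python
import math


def _adam_direction(m, v, g, t, beta1, beta2, eps):
    """Shared Adam machinery: returns m_hat / (sqrt(v_hat) + eps) and new state."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(theta, m, v, grad, t, lr=1e-3, wd=0.01,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2-in-loss: the decay term is folded into the gradient."""
    g = grad + wd * theta  # decay gets rescaled by sqrt(v_hat) below
    d, m, v = _adam_direction(m, v, g, t, beta1, beta2, eps)
    return theta - lr * d, m, v


def adamw_step(theta, m, v, grad, t, lr=1e-3, wd=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: decay multiplies theta directly and bypasses m and v."""
    d, m, v = _adam_direction(m, v, grad, t, beta1, beta2, eps)
    return (1 - lr * wd) * theta - lr * d, m, v


# Zero loss gradient isolates the decay term; try two decay strengths.
l2_weak, _, _ = adam_l2_step(1.0, 0.0, 0.0, grad=0.0, t=1, wd=0.01)
l2_strong, _, _ = adam_l2_step(1.0, 0.0, 0.0, grad=0.0, t=1, wd=0.1)
w_weak, _, _ = adamw_step(1.0, 0.0, 0.0, grad=0.0, t=1, wd=0.01)
w_strong, _, _ = adamw_step(1.0, 0.0, 0.0, grad=0.0, t=1, wd=0.1)
```

In this isolated case the L2 variant shrinks the weight by about `lr` whichever `wd` is chosen, because the normalisation cancels the coefficient. The decoupled variant shrinks it by exactly `lr * wd * theta`.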
Lion (Chen et al., 2023)
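Found by symbolic search through optimizer programs. Lion keeps a single momentum buffer and replaces Adam's division by $\sqrt{\hat{v}}$ with the sign function:
$$c_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad \theta_{t+1} = \theta_t - \eta \, (\mathrm{sign}(c_t) + \lambda \theta_t), \qquad m_t = \beta_2 m_{t-1} + (1-\beta_2)\, g_t$$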
The second-moment buffer disappears, so Lion carries half the optimizer state of Adam. Every coordinate of the sign update has the same magnitude $\eta$, so the learning rate must shrink to roughly $0.1\times$ Adam's.
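A minimal sketch of one Lion step, assuming the form from the Lion paper: the update direction is the sign of an interpolation $c = \beta_1 m + (1-\beta_1) g$, and the momentum buffer is then refreshed with a second coefficient $\beta_2$. The defaults $\beta_1 = 0.9$, $\beta_2 = 0.99$ and the helper names are assumptions made for illustration:

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(theta, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update; m is the only optimizer state."""
    # Interpolate momentum and current gradient, then keep only the sign.
    c = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    # Decoupled weight decay is added to the sign direction.
    theta = [ti - lr * (sign(ci) + wd * ti) for ti, ci in zip(theta, c)]
    # Momentum is updated after the parameter step, with its own coefficient.
    m = [beta2 * mi + (1 - beta2) * gi for mi, gi in zip(m, grad)]
    return theta, m


# Gradients differing by a factor of 10^4 produce steps of identical size.
theta, m = lion_step([0.0, 0.0], [0.0, 0.0], grad=[100.0, -0.01])
```

Because the gradient magnitude is discarded entirely, the step length is set by `lr` alone, which is why the learning rate has to be chosen smaller than for Adam.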
The Family Tree, Compact
| Optimizer | Update rule | State |
|---|---|---|
| SGD | $\theta \leftarrow \theta - \eta g$ | none |
| Momentum | $\theta \leftarrow \theta - \eta v$ | $v$ |
| Adam | $\theta \leftarrow \theta - \eta \hat{m}/(\sqrt{\hat{v}} + \varepsilon)$ | $m, v$ |
| AdamW | $\theta \leftarrow (1-\eta\lambda)\theta - \eta \hat{m}/(\sqrt{\hat{v}} + \varepsilon)$ | $m, v$ |
| Lion | $\theta \leftarrow \theta - \eta (\mathrm{sign}(c) + \lambda \theta)$ | $m$ |
Five lines. Three buffers. Two preconditioners ($\sqrt{\hat{v}}$ vs $\mathrm{sign}(c)$). That is the entire family.