Deconstructing Optimizers

Part 1: The Family Tree

Overview

Nearly every neural network you have heard of was trained by a small variation on the same update rule. Read the major optimizers side by side and the family tree becomes obvious: each one is a short modification of its predecessor.

Gradient Descent and Momentum

The starting point: $\theta_{t+1} = \theta_t - \eta \, \nabla \mathcal{L}(\theta_t)$. On ill-conditioned problems, plain GD oscillates across the steep direction and crawls along the shallow one. The fix: add memory.

Momentum (Polyak, 1964):

$$ v_{t+1} = \beta v_t + g_t, \qquad \theta_{t+1} = \theta_t - \eta v_{t+1}. $$

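Here $g_t = \nabla \mathcal{L}(\theta_t)$, and $\beta = 0.9$ is typical. The velocity buffer averages out high-frequency oscillation and reinforces consistent directions: under a constant gradient $g$, $v_t$ converges to $g/(1-\beta)$, a tenfold larger effective step at $\beta = 0.9$.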
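The difference is easiest to see on a toy problem. The sketch below is a minimal illustration, not a benchmark: it assumes a two-dimensional quadratic $\mathcal{L}(\theta) = \tfrac12(100\,\theta_1^2 + \theta_2^2)$, a shared learning rate of 0.018, and 200 steps, all chosen here for the example. The helper names (`grad`, `loss`, `run_gd`, `run_momentum`) are hypothetical, and plain Python lists keep it dependency-free.

```python
def grad(theta, curv):
    # Gradient of L(theta) = 0.5 * sum(c_i * theta_i**2)
    return [c * t for c, t in zip(curv, theta)]

def loss(theta, curv):
    return 0.5 * sum(c * t * t for c, t in zip(curv, theta))

def run_gd(theta, curv, lr, steps):
    theta = list(theta)
    for _ in range(steps):
        g = grad(theta, curv)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

def run_momentum(theta, curv, lr, beta, steps):
    theta = list(theta)
    v = [0.0] * len(theta)  # velocity buffer, same shape as theta
    for _ in range(steps):
        g = grad(theta, curv)
        v = [beta * vi + gi for vi, gi in zip(v, g)]      # v <- beta * v + g
        theta = [t - lr * vi for t, vi in zip(theta, v)]  # theta <- theta - lr * v
    return theta

CURV = [100.0, 1.0]   # steep axis, shallow axis (assumed for illustration)
START = [1.0, 1.0]

gd_final = run_gd(START, CURV, lr=0.018, steps=200)
mom_final = run_momentum(START, CURV, lr=0.018, beta=0.9, steps=200)
print("GD loss:      ", loss(gd_final, CURV))
print("Momentum loss:", loss(mom_final, CURV))
```

With these settings plain GD is limited by the shallow axis, where each step shrinks the error by only a factor of 0.982, while momentum contracts both axes at roughly $\sqrt{\beta} \approx 0.95$ per step and ends with a much lower loss.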

Adam (Kingma & Ba, 2014)

Adam tracks two running averages: the first moment $m_t$ (like momentum) and the second moment $v_t$ (a running mean of $g_t^2$, taken elementwise; this $v_t$ is unrelated to the momentum velocity above):

$$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 $$
$$ \hat{m}_t = m_t/(1 - \beta_1^t), \quad \hat{v}_t = v_t/(1 - \beta_2^t), \quad \theta \leftarrow \theta - \eta\,\hat{m}_t/(\sqrt{\hat{v}_t} + \varepsilon). $$

The bias correction undoes the pull toward zero that the initialisation $m_0 = v_0 = 0$ puts on the early-step estimates. Dividing by $\sqrt{\hat{v}}$ normalises the step size per parameter: a diagonal preconditioner that adapts to the per-axis gradient scale, a rough stand-in for curvature, without computing the Hessian.
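The update equations translate almost line for line into code. The following is a minimal sketch on plain Python lists, not any library's implementation; the function name `adam_step` is hypothetical, and the defaults ($\eta = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$) are the commonly used values, assumed here. It also shows the normalisation at work: on the first step, two coordinates whose gradients differ by six orders of magnitude both move by about $\eta$.

```python
import math

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on lists of floats; t is the 1-based step count."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Bias correction: rescale the zero-initialised averages.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [th - lr * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

theta0 = [0.0, 0.0]
g = [1e-3, -1e3]  # gradients six orders of magnitude apart
theta1, m1, v1 = adam_step(theta0, g, [0.0, 0.0], [0.0, 0.0], t=1)
step_sizes = [abs(a - b) for a, b in zip(theta1, theta0)]
print(step_sizes)  # both entries are close to lr
```

Without the bias correction, the first step would instead be scaled by $(1-\beta_1)/\sqrt{1-\beta_2} \approx 3.2$, which is the early-step distortion the correction removes.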

AdamW (Loshchilov & Hutter, 2019)

Identical to Adam except that weight decay is applied directly to $\theta$ instead of being added to the gradient before $m$ and $v$ are computed. With standard L2-in-loss, the decay term $\lambda\theta$ is divided by $\sqrt{\hat{v}}$ like everything else, so the effective decay strength depends on each parameter's gradient history, which is not what you want. Decoupling fixes this. Most modern LLM training recipes use AdamW, not Adam.
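A small experiment makes the coupling visible. The sketch below is a toy setup invented for illustration: a single scalar weight starts at 1.0 and receives a zero-mean "loss" gradient that alternates between $+\sigma$ and $-\sigma$, so any net shrinkage comes from weight decay. The function `run`, the values of $\sigma$, and the hyperparameters ($\eta = 10^{-3}$, $\lambda = 0.1$, 1000 steps) are all assumptions of the example.

```python
import math

def run(sigma, steps=1000, decoupled=True, lr=1e-3, lam=0.1,
        beta1=0.9, beta2=0.999, eps=1e-8):
    """Train one scalar weight whose loss gradient alternates +sigma, -sigma."""
    theta, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = sigma if t % 2 else -sigma  # zero-mean loss gradient
        if not decoupled:
            g = g + lam * theta         # L2-in-loss: decay enters m and v
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        if decoupled:
            theta = (1 - lr * lam) * theta  # AdamW: decay applied directly
        theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

l2 = [run(s, decoupled=False) for s in (0.01, 10.0)]
adamw = [run(s, decoupled=True) for s in (0.01, 10.0)]
print("Adam + L2:", l2)
print("AdamW:    ", adamw)
```

Under Adam with L2-in-loss, the weight with small gradients is decayed far more aggressively than the weight with large gradients, even though $\lambda$ is the same. Under AdamW both weights end at essentially the same value, because the decay no longer passes through the $\sqrt{\hat{v}}$ normalisation.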

Lion (Chen et al., 2023)

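Found by symbolic search over optimizer programs. Lion is Adam with the division by $\sqrt{\hat{v}}$ replaced by a sign function applied to an interpolated momentum $c_t$; the momentum buffer itself is then updated with a second coefficient $\beta_2$:

$$ c_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad \theta \leftarrow \theta - \eta(\mathrm{sign}(c_t) + \lambda \theta), \qquad m_t = \beta_2 m_{t-1} + (1-\beta_2) g_t. $$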

The second-moment buffer disappears, so Lion carries half the optimizer state of Adam. Apart from the weight-decay term, every coordinate moves by exactly $\eta$ per step regardless of gradient size, so the learning rate usually has to shrink to $\sim 0.1\times$ Adam's.
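A minimal sketch of one Lion step on plain Python lists follows. The names `sign` and `lion_step` are hypothetical, and the default values are illustrative assumptions. The step uses the interpolation $c_t$ above for the direction and then refreshes the single buffer as $m_t = \beta_2 m_{t-1} + (1-\beta_2) g_t$, the momentum update given in the Lion paper.

```python
def sign(x):
    # Returns -1, 0, or 1.
    return (x > 0) - (x < 0)

def lion_step(theta, g, m, lr=1e-4, beta1=0.9, beta2=0.99, lam=0.0):
    """One Lion update; the only optimizer state is the momentum buffer m."""
    # Direction: sign of an interpolation between old momentum and new gradient.
    c = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    theta = [th - lr * (sign(ci) + lam * th) for th, ci in zip(theta, c)]
    # Momentum buffer is updated separately, with its own coefficient beta2.
    m = [beta2 * mi + (1 - beta2) * gi for mi, gi in zip(m, g)]
    return theta, m

theta0 = [1.0, -2.0, 0.5]
g = [1e-4, -50.0, 3.0]  # wildly different gradient magnitudes
theta1, m1 = lion_step(theta0, g, [0.0, 0.0, 0.0], lr=1e-4)
print([abs(a - b) for a, b in zip(theta1, theta0)])  # each is about lr
```

With `lam=0.0` every coordinate moves by the same amount, which is why the learning rate has to be tuned separately from Adam's rather than reused.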

The Family Tree, Compact

| Optimizer | Update rule | State |
| --- | --- | --- |
| SGD | $\theta \leftarrow \theta - \eta g$ | none |
| Momentum | $\theta \leftarrow \theta - \eta v$ | $v$ |
| Adam | $\theta \leftarrow \theta - \eta \hat{m}/(\sqrt{\hat{v}} + \varepsilon)$ | $m, v$ |
| AdamW | $\theta \leftarrow (1-\eta\lambda)\theta - \eta \hat{m}/(\sqrt{\hat{v}} + \varepsilon)$ | $m, v$ |
| Lion | $\theta \leftarrow \theta - \eta (\mathrm{sign}(c) + \lambda \theta)$ | $m$ |

Five lines. Three buffers. Two preconditioners ($\sqrt{\hat{v}}$ vs $\mathrm{sign}(c)$). That is the entire family.