Part 1 established the family relationship: every modern optimizer is a one-line modification of SGD. Now we implement all five — SGD, Momentum, Adam, AdamW, Lion — in roughly $75$ lines of PyTorch total. No torch.optim inheritance, no abstract base class, no state_dict() bookkeeping. Each class exposes exactly two methods: .step() and .zero_grad().
The point of writing them at this level is to make the differences visible. With every optimizer in a single $\sim$15-line block, the modifications that distinguish Adam from Momentum, or Lion from Adam, are precisely the lines that change. The family tree from Part 1 stops being an abstract description and starts being something you can scroll through.
The Common Interface
Every optimizer has the same two-method surface. .step() applies one update to the parameters using their current .grad tensors; .zero_grad() resets the gradients between mini-batches. PyTorch's real torch.optim.Optimizer adds state-dict serialization, parameter groups, and a closure-based step variant — none of which we need for pedagogy.
The constructor's job is to capture the parameter list and allocate per-parameter state (momentum buffers, second-moment buffers). All five optimizers below follow the same pattern.
SGD — The Parent
Three lines of update logic, zero lines of state to carry between steps. There is nothing to remember:
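```python
import torch


class SGD:
    def __init__(self, params, lr=1e-2):
        self.params = list(params)  # materialise: model.parameters() is a generator
        self.lr = lr

    def step(self):
        with torch.no_grad():  # keep the update out of the autograd graph
            for p in self.params:
                if p.grad is None:  # parameter received no gradient
                    continue
                p.add_(p.grad, alpha=-self.lr)  # p <- p - lr * grad, in place

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()
```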
Three things to notice in this baseline that carry through to every other optimizer in this article.
The torch.no_grad() context. The parameter update is itself an arithmetic operation on tensors that have requires_grad=True. Without the no-grad wrapper, PyTorch refuses the in-place p.add_ outright (in-place operations on a leaf tensor that requires grad raise a RuntimeError), and an out-of-place update would be recorded in the autograd graph, leaking memory and confusing future .backward() calls. Every optimizer in this article wraps its step in no_grad.
The if p.grad is None: continue guard. Some parameters in a model may not have received gradients on a given iteration (e.g., layers used only in a different branch of a conditional architecture). Their p.grad is None, and passing None to p.add_ would raise an error. Real-world optimizer code is full of this kind of defensive bookkeeping.
In-place arithmetic with add_. The line p.add_(p.grad, alpha=-self.lr) is mathematically equivalent to p = p - lr * p.grad but allocates no new tensor: the update happens in the existing storage. For a 70B-parameter model stored in float32, each full-size temporary is about $280$ GB (70 billion values at 4 bytes each), so the in-place form avoids allocation on that scale at every training step.
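The three points above can be checked on a toy problem. The sketch below repeats the SGD class so it runs on its own; the scalar objective $(w - 3)^2$, the `unused` tensor, and the learning rate are illustrative assumptions, not part of the original series.

```python
import torch


class SGD:
    """Same minimal SGD as above, repeated so this block is self-contained."""

    def __init__(self, params, lr=1e-2):
        self.params = list(params)
        self.lr = lr

    def step(self):
        with torch.no_grad():
            for p in self.params:
                if p.grad is None:
                    continue
                p.add_(p.grad, alpha=-self.lr)

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()


# Hypothetical toy problem: minimise (w - 3)^2 for a single weight.
w = torch.tensor([0.0], requires_grad=True)
unused = torch.tensor([1.0], requires_grad=True)  # never enters the loss, so .grad stays None
opt = SGD([w, unused], lr=0.1)

storage_before = w.data_ptr()
for _ in range(100):
    opt.zero_grad()
    loss = ((w - 3.0) ** 2).sum()
    loss.backward()
    opt.step()

same_storage = w.data_ptr() == storage_before  # in-place update reuses the buffer
print(w.item(), unused.item(), same_storage)
```

The weight converges towards 3, the parameter without a gradient is skipped rather than crashing the step, and `w` still occupies the storage it started with.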
Momentum — The First Memory
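Momentum adds one buffer per parameter: a velocity vector that accumulates gradients with exponential decay. The update rule is two equations, $v_t = \beta v_{t-1} + g_t$ and $\theta_t = \theta_{t-1} - \eta v_t$, which map directly onto the two lines inside the loop: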
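```python
class Momentum:
    def __init__(self, params, lr=1e-2, momentum=0.9):
        self.params = list(params)
        self.lr = lr
        self.momentum = momentum
        self.v = [torch.zeros_like(p) for p in self.params]  # one velocity buffer per parameter

    def step(self):
        with torch.no_grad():
            for p, v in zip(self.params, self.v):
                if p.grad is None:
                    continue
                v.mul_(self.momentum).add_(p.grad)  # v <- beta * v + g
                p.add_(v, alpha=-self.lr)           # p <- p - lr * v

    def zero_grad(self):  # identical to SGD.zero_grad
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()
```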
The crucial two lines are v.mul_(self.momentum).add_(p.grad), which is the velocity update, and p.add_(v, alpha=-self.lr), which is the parameter step. Notice that we update the velocity buffer in place and then use the new velocity for the parameter update — this is the standard formulation. Some references use Nesterov-style "look-ahead" momentum where the gradient is evaluated at the projected point $\theta_t - \eta \beta v_t$ instead of at $\theta_t$; that is a separate variant we are not implementing here.
Initialisation of $v$. The velocity buffer starts at zero, which means the first step is identical to SGD — there is no accumulated momentum yet. The buffer "warms up" over the first few iterations: after $k$ steps with roughly constant gradient $g$, $v_k \approx (1 + \beta + \beta^2 + \dots + \beta^{k-1}) g \to g/(1 - \beta)$. For $\beta = 0.9$, the steady-state velocity is $10 \times$ the per-step gradient, which is why classical training literature distinguished between "early epochs" (effectively SGD-like) and "steady state" (momentum-dominated) behaviour.
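The warm-up is easy to reproduce numerically. This is a minimal sketch in plain Python, assuming an idealised constant gradient $g = 1$; real gradients fluctuate, so the approach to the steady state is only approximate in practice.

```python
beta = 0.9
g = 1.0  # constant gradient, an idealisation
v = 0.0  # velocity starts at zero

history = []
for k in range(1, 51):
    v = beta * v + g  # same recurrence as v.mul_(momentum).add_(grad)
    history.append(v)

steady_state = g / (1 - beta)  # geometric-series limit
print(history[0], history[9], history[49], steady_state)
```

The first entry equals the raw gradient (a plain SGD step), the tenth is roughly two thirds of the way to the limit, and by step 50 the velocity is within about half a percent of $g/(1 - \beta)$.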
What momentum buys you. On a smooth valley with consistent gradient direction (think Rosenbrock's banana), the velocity vector accumulates the consistent component while oscillations average out — the ball rolls along the floor of the valley rather than bouncing off the walls. On a flat plateau where gradients are tiny but consistent, momentum lets you accelerate gradually rather than crawling. The cost is one extra tensor of state per parameter and an extra in-place multiplication per step. For most modern training that is trivial overhead.
Adam — Two Running Averages
Adam keeps two running averages: the first moment $m_t$ (mean of the gradient, like momentum) and the second moment $v_t$ (mean of $g^2$, i.e. the uncentered per-parameter gradient variance). Both are bias-corrected before use:
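```python
class Adam:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.t = 0  # step counter, needed for bias correction
        self.m = [torch.zeros_like(p) for p in self.params]  # first moment
        self.v = [torch.zeros_like(p) for p in self.params]  # second moment

    def step(self):
        self.t += 1
        bc1 = 1 - self.beta1 ** self.t  # bias-correction factor for m
        bc2 = 1 - self.beta2 ** self.t  # bias-correction factor for v
        with torch.no_grad():
            for p, m, v in zip(self.params, self.m, self.v):
                if p.grad is None:
                    continue
                g = p.grad
                m.mul_(self.beta1).add_(g, alpha=1 - self.beta1)         # EMA of g
                v.mul_(self.beta2).addcmul_(g, g, value=1 - self.beta2)  # EMA of g * g
                # p <- p - lr * m_hat / (sqrt(v_hat) + eps)
                p.addcdiv_(m / bc1, (v / bc2).sqrt().add_(self.eps), value=-self.lr)

    def zero_grad(self):  # identical to SGD.zero_grad
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()
```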
This is the longest update in the series: about ten lines of step, six lines of state. Let's walk through it carefully.
The step counter self.t. Adam's bias correction needs to know how many steps have been taken since initialisation. The first step is $t = 1$ (not zero), which is what gives $\beta_1^t = \beta_1 < 1$ and makes the correction factor $1 - \beta_1^t$ meaningfully smaller than $1$. As $t$ grows, $\beta_1^t \to 0$ and the correction factor approaches $1$ — bias correction becomes a no-op after enough steps, which is exactly the desired behaviour.
The two PyTorch primitives addcmul_ and addcdiv_. These are fused multiply-add and multiply-divide ops that fit Adam-style updates naturally. v.addcmul_(g, g, value=1 - self.beta2) computes v += (1 - beta2) * g * g in a single kernel rather than allocating an intermediate tensor for g * g. p.addcdiv_(num, den, value=-lr) computes p += -lr * num / den. (The version above still allocates temporaries for m / bc1 and (v / bc2).sqrt(); folding the bias corrections into scalar factors would remove them, at some cost in readability.) On a 70B-parameter model, avoiding such intermediates saves substantial memory bandwidth.
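Why bias correction matters. We initialise $m_0 = v_0 = 0$. After one step, $m_1 = (1 - \beta_1) g_1$, a tenfold underestimate of the gradient at $\beta_1 = 0.9$. The second moment is underestimated even more strongly: $v_1 = (1 - \beta_2) g_1^2$, a thousandfold underestimate at $\beta_2 = 0.999$. Left uncorrected, the ratio $m_1 / \sqrt{v_1}$ has magnitude $(1 - \beta_1)/\sqrt{1 - \beta_2} \approx 3.2$, so the earliest steps would be several times too large rather than too small. The correction factors $1/(1 - \beta_1^t)$ and $1/(1 - \beta_2^t)$ rescale each EMA back to an unbiased estimate of the true running mean, under the assumption that the gradient distribution is approximately stationary, which is roughly true for short windows during training.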
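The effect of the correction can be seen with a few lines of scalar arithmetic. The sketch below assumes a constant gradient of $2.0$ (an arbitrary illustrative value) and tracks both the raw and the bias-corrected ratio $m/\sqrt{v}$ for the first three steps. Note that the first moment alone is a tenfold underestimate at $t = 1$, but the second moment is underestimated even more, so the uncorrected ratio comes out larger than 1.

```python
import math

beta1, beta2 = 0.9, 0.999
g = 2.0  # constant gradient, chosen only for illustration

m = 0.0
v = 0.0
raw, corrected = [], []
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    raw.append(m / math.sqrt(v))
    corrected.append(m_hat / math.sqrt(v_hat))

print([round(r, 4) for r in raw])
print([round(c, 4) for c in corrected])
```

With a constant gradient the corrected ratio is 1 at every step, while the uncorrected ratio starts near $0.1/\sqrt{0.001} \approx 3.16$.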
What dividing by $\sqrt{\hat{v}_t}$ actually does. This is the most consequential design choice in Adam. $\sqrt{\hat{v}_t}$ approximates the per-parameter gradient magnitude. A parameter with consistently large gradients sees its effective learning rate shrink (the denominator is large); a parameter with consistently small gradients sees its effective learning rate grow. This is a diagonal preconditioner: Adam adapts the step size per axis without needing to compute (or even approximate) the full Hessian.
The $\varepsilon = 10^{-8}$. Its role is to prevent division by zero when $\hat{v}_t$ has not built up yet (e.g., on parameters that have not yet received non-trivial gradients). In float32, $10^{-8}$ is small enough not to affect parameters with ordinary gradient magnitudes but large enough to dominate when $\hat{v}_t$ is genuinely zero. Some Adam variants use $10^{-12}$ for better precision at the cost of occasional NaN risk; PyTorch's default is $10^{-8}$, the value suggested in the original paper.
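The last two paragraphs can be illustrated with the very first Adam step on a single scalar. The helper `adam_first_step` below is a hypothetical name introduced for this sketch; it simply evaluates the update formula at $t = 1$ for a given gradient.

```python
import math


def adam_first_step(g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Magnitude-and-sign of the first Adam step for a scalar gradient g (illustrative helper)."""
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    m_hat = m / (1 - beta1)  # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


for g in (1e-3, 1.0, 1e3, 0.0, 1e-10):
    print(g, adam_first_step(g))
```

Gradients spanning six orders of magnitude all produce a step of about `lr`, which is the per-axis rescaling at work. A zero gradient gives a zero step instead of a division by zero, and for a gradient far below $\varepsilon$ the step shrinks well below `lr` because $\varepsilon$ dominates the denominator.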
AdamW — A One-Line Modification
AdamW is exactly Adam with one extra constructor argument, weight_decay (stored as self.wd), and the parameter update replaced by:
```python
p.mul_(1 - self.lr * self.wd)  # decoupled weight decay (the new line)
p.addcdiv_(m / bc1, (v / bc2).sqrt().add_(self.eps), value=-self.lr)
```
Everything else — the buffers, the bias correction, the momentum and variance updates — is identical. This single insertion is the entire AdamW innovation.
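For completeness, here is one way the full class might look once the decay line is dropped into the Adam implementation. This is a sketch rather than the series' own listing: the `weight_decay` argument name and its default of `1e-2` are assumptions, and the attribute name `self.wd` is taken from the snippet above.

```python
import torch


class AdamW:
    """Sketch: the Adam class from this article plus one decoupled-decay line."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.wd = weight_decay  # assumed argument name and default
        self.t = 0
        self.m = [torch.zeros_like(p) for p in self.params]
        self.v = [torch.zeros_like(p) for p in self.params]

    def step(self):
        self.t += 1
        bc1 = 1 - self.beta1 ** self.t
        bc2 = 1 - self.beta2 ** self.t
        with torch.no_grad():
            for p, m, v in zip(self.params, self.m, self.v):
                if p.grad is None:
                    continue
                g = p.grad
                m.mul_(self.beta1).add_(g, alpha=1 - self.beta1)
                v.mul_(self.beta2).addcmul_(g, g, value=1 - self.beta2)
                p.mul_(1 - self.lr * self.wd)  # decoupled weight decay
                p.addcdiv_(m / bc1, (v / bc2).sqrt().add_(self.eps), value=-self.lr)

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()


# With a zero gradient the Adam term vanishes, so only the decay acts.
p = torch.tensor([2.0], requires_grad=True)
p.grad = torch.zeros_like(p)
opt = AdamW([p], lr=0.1, weight_decay=0.5)  # exaggerated values to make the shrink visible
opt.step()
print(p.item())
```

The parameter shrinks by the factor $1 - \eta\lambda = 0.95$ even though the gradient is zero, which isolates the contribution of the new line.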
The question is: why does this one-line change matter? L2 regularisation in classical training is added directly to the gradient: $g \leftarrow g + \lambda \theta$. Under SGD this is mathematically equivalent to "weight decay" applied to the parameter: $\theta \leftarrow (1 - \eta \lambda)\theta - \eta g$. Adam breaks that equivalence. When we add $\lambda \theta$ to the gradient before computing $m$ and $v$, the decay term gets divided by $\sqrt{\hat{v}}$ along with everything else — which means parameters with small gradients receive disproportionately large effective decay, and parameters with large gradients receive almost none.
This is empirically harmful: it couples the strength of regularisation to the gradient noise distribution of each parameter, which is not what the user intended when they set weight_decay=0.01. Loshchilov & Hutter's 2019 paper showed that decoupling the decay (applying it directly to $\theta$ outside the Adam update) produced more robust training and better generalisation in their experiments. As of 2026, most LLM training recipes default to AdamW rather than Adam.
The implementation detail to notice: $p \leftarrow (1 - \eta \lambda) p$ shrinks the parameter slightly toward zero. The Adam update then adds the gradient-driven step to the already-shrunken parameter. The shrinkage rate $\eta \lambda$ is independent of $\hat{v}$ — that is the entire point.
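The argument of this section reduces to a few lines of arithmetic. The sketch below first checks the SGD equivalence, then compares how much a parameter shrinks per step when $\lambda\theta$ is folded into an Adam-style update versus applied directly. It is deliberately simplified: `rms_grad` stands in for $\sqrt{\hat{v}}$ and is treated as a fixed, given number, ignoring the decay term's own contribution to $m$ and $v$. All numeric values are illustrative assumptions.

```python
lr, lam, theta = 1e-3, 1e-2, 1.0

# Under SGD, L2-in-the-gradient and weight decay produce the same parameter.
g = 0.5
l2_sgd = theta - lr * (g + lam * theta)
decoupled_sgd = (1 - lr * lam) * theta - lr * g


def l2_shrink_under_adam(rms_grad, eps=1e-8):
    """Shrink per step when lam * theta is divided by sqrt(v_hat) (simplified model)."""
    return lr * lam * theta / (rms_grad + eps)


decoupled_shrink = lr * lam * theta  # AdamW: independent of the gradient scale

print(l2_sgd, decoupled_sgd)
for rms in (1e-3, 1.0, 1e3):
    print(rms, l2_shrink_under_adam(rms), decoupled_shrink)
```

In this simplified model, a parameter whose gradients have RMS $10^{-3}$ is decayed a thousand times harder than the user asked for, and one with RMS $10^{3}$ a thousand times more weakly, while the decoupled shrink stays at $\eta\lambda\theta$ throughout.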
Lion — Sign Instead of $\sqrt{\hat{v}}$
Chen et al. (Google, 2023) used a symbolic program search (an evolutionary search over the space of optimizer programs). The output: Lion (EvoLved Sign Momentum). It is Adam with two structural simplifications and a small re-ordering:
```python
class Lion:
    def __init__(self, params, lr=1e-4, betas=(0.9, 0.99), weight_decay=0.0):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.wd = weight_decay
        self.m = [torch.zeros_like(p) for p in self.params]  # the only buffer

    def step(self):
        with torch.no_grad():
            for p, m in zip(self.params, self.m):
                if p.grad is None:
                    continue
                g = p.grad
                c = self.beta1 * m + (1 - self.beta1) * g  # interpolate with the OLD buffer
                update = torch.sign(c)                     # keep only the direction
                if self.wd:
                    update = update + self.wd * p          # decoupled weight decay
                p.add_(update, alpha=-self.lr)
                m.mul_(self.beta2).add_(g, alpha=1 - self.beta2)  # roll buffer forward with beta2

    def zero_grad(self):  # identical to SGD.zero_grad
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()
```
Three differences from Adam, all visible in the code.
One buffer instead of two. The second-moment buffer $v$ is gone. Lion stores only $m$. For a 70B-parameter model with 16-bit optimizer state, that is roughly $140$ GB instead of $280$ GB (double both figures for float32 state). When memory is the binding constraint, this can be the difference between fitting and not fitting.
The update direction is $\mathrm{sign}(c)$ instead of $\hat{m} / \sqrt{\hat{v}}$. Before weight decay, every parameter with a nonzero $c$ moves by exactly $\pm \eta$ per step, regardless of the gradient's magnitude. Adam's adaptive per-parameter scaling has been replaced by an across-the-board fixed scale. This is the design choice that demands a smaller learning rate, typically $3 \times$ to $10 \times$ smaller than what AdamW would use. Because the strength of decoupled decay is the product $\eta \lambda$, the Lion authors pair that smaller $\eta$ with a correspondingly larger $\lambda$ to keep the effective decay comparable; a weight_decay value does not transfer unchanged from an AdamW recipe.
The momentum is updated after the parameter step, using $\beta_2$. The update direction is $c = \beta_1 m_{t-1} + (1 - \beta_1) g_t$, computed from the old momentum buffer. Then the buffer is rolled forward with a different decay rate $\beta_2 = 0.99$ (compared to $\beta_1 = 0.9$). This two-rate dance on a single buffer is the part of Lion that does not have a one-line conceptual justification: it was selected by the program search, and the authors report it consistently outperforms alternatives. Real-world ML is full of design choices like this: discovered empirically, kept because they work, not always explainable by first principles.
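A single Lion step written out by hand makes the last two differences concrete. The tensors below are made-up illustrative values, weight decay is left at zero, and the arithmetic is done out of place for readability rather than with the in-place ops used in the class.

```python
import torch

lr, beta1, beta2 = 1e-4, 0.9, 0.99
p = torch.tensor([1.0, -2.0, 0.5])
g = torch.tensor([1e-6, 3.0, -250.0])  # gradients spanning many orders of magnitude
m = torch.zeros_like(p)                # momentum buffer, zero-initialised

c = beta1 * m + (1 - beta1) * g        # direction uses the old buffer and beta1
step = -lr * torch.sign(c)             # weight decay omitted (wd = 0)
p_new = p + step
m_new = beta2 * m + (1 - beta2) * g    # buffer is rolled forward with beta2

print(step)
print(m_new)
```

All three parameters move by the same magnitude `lr` despite their very different gradients, and the stored momentum after the step is built with $\beta_2$, not the $\beta_1$ that shaped the direction.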
Why the Common Skeleton Matters
Reading these five implementations one after another reveals something the equations alone do not: the optimizers are mostly the same code. The list of parameters, the in-place updates, the no_grad context, the iteration pattern — all identical. The differences live in three places: which buffers exist, how the update direction is computed, and (for AdamW) whether decoupled decay is applied.
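One way to make the shared skeleton explicit is to factor the three varying pieces into arguments. The class and helper names below (`SkeletonOptimizer`, `momentum_state`, `momentum_direction`, and so on) are hypothetical and appear neither in the series' code nor in PyTorch; this is a sketch of the idea, not a recommended design.

```python
import torch


class SkeletonOptimizer:
    """Hypothetical illustration: buffers, direction, and decay are the only moving parts."""

    def __init__(self, params, lr, init_state, direction, weight_decay=0.0):
        self.params = list(params)
        self.lr = lr
        self.wd = weight_decay
        self.direction = direction                          # slot 2: update direction
        self.state = [init_state(p) for p in self.params]   # slot 1: which buffers exist

    def step(self):
        with torch.no_grad():
            for p, s in zip(self.params, self.state):
                if p.grad is None:
                    continue
                if self.wd:
                    p.mul_(1 - self.lr * self.wd)           # slot 3: decoupled decay
                p.add_(self.direction(p.grad, s), alpha=-self.lr)

    def zero_grad(self):
        for p in self.params:
            if p.grad is not None:
                p.grad.zero_()


def sgd_state(p):
    return {}  # SGD keeps no buffers


def sgd_direction(g, s):
    return g   # raw gradient


def momentum_state(p):
    return {"v": torch.zeros_like(p)}


def momentum_direction(g, s, beta=0.9):
    s["v"].mul_(beta).add_(g)  # v <- beta * v + g
    return s["v"]


w = torch.tensor([1.0], requires_grad=True)
opt = SkeletonOptimizer([w], lr=0.1, init_state=momentum_state, direction=momentum_direction)
for _ in range(2):
    w.grad = torch.tensor([2.0])  # stand-in for a constant gradient from backward()
    opt.step()
print(w.item())
```

Two momentum steps with a constant gradient of 2 take the weight from 1.0 to 0.8 and then to 0.42, matching what the hand-written Momentum class would do. Adam and Lion would fit the same slots with different state dictionaries and direction functions.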
This is not coincidence. The space of useful gradient-based optimizers is small. Most optimizer papers of the last decade (Adafactor, Sophia, Shampoo, Tiger, Adan, Schedule-Free SGD) are variations on this same skeleton. Knowing the skeleton inside out makes reading new optimizer papers much faster: you scan for the buffer list, the update direction, and the regularisation handling, and the rest is fixed.
It is also why most ML practitioners do not need to write their own optimizers. The PyTorch implementations of these five — with proper state-dict serialization, parameter groups, learning-rate schedulers, distributed-training support, mixed-precision wrappers, and so on — are battle-tested across thousands of production deployments. The 75-line versions in this post are for understanding; in production, torch.optim.AdamW is the right answer.
What Part 3 Tests
With all five optimizers in hand, Part 3 benchmarks them on three problems: Rosenbrock (a smooth narrow valley), Beale (flat plateaus with a sharp minimum), and a real 4-layer MLP on the two-moons dataset. Each problem is designed to probe a different optimizer property — and the result is not the textbook story that "Adam is always best."
Full code on GitHub: github.com/soveshmohapatra/Optimizers