Back to Backpropagation Hub

Deconstructing Backpropagation from Scratch

Part 1: The Math of Reverse-Mode Autodiff

Introduction

Every neural network on Earth is trained by the same algorithm: reverse-mode automatic differentiation, popularly called backpropagation. PyTorch's autograd is roughly 60,000 lines of C++ and Python. The algorithm it implements is ten lines.

This three-part series strips the engineering away and builds a working autograd engine from first principles in pure NumPy. By the end of Part 3 we will have 245 lines of code whose gradients match PyTorch to within machine epsilon on every parameter of a multilayer perceptron.

This first part covers the math: the chain rule, computation graphs, and why reverse mode is the only viable choice when the parameter count exceeds the loss dimension by ten orders of magnitude.

The Chain Rule, As You Already Know It

Given $L = f(g(x))$, calculus tells us:

$$ \frac{dL}{dx} = \frac{dL}{dg}\cdot \frac{dg}{dx}. $$

For composed functions $L = h \circ g \circ f$ this generalises:

$$ \frac{dL}{dx} = \frac{dL}{dh}\cdot \frac{dh}{dg}\cdot \frac{dg}{df}\cdot \frac{df}{dx}. $$

This is exactly what neural networks need. Replace each function with a layer, and $L$ becomes the loss. Computing $\partial L / \partial \theta$ for every parameter $\theta$ in the network is just the chain rule, repeated.

The only question is the order in which we multiply.

Forward Mode vs Reverse Mode

Consider a network with $P$ parameters and a scalar loss $L$. To compute every gradient, we must evaluate $P$ derivatives.

Forward Mode

Forward mode propagates derivatives alongside the forward computation. For each input variable $x$, we carry a "dual number" $(v,\, dv/dx)$ through every operation. At the end, the dual associated with the loss gives $dL/dx$.

Cost: one forward pass per input variable. To differentiate the loss with respect to $P$ parameters, we need $P$ forward passes. For a 7-billion-parameter model, that is 7 billion forward passes per training step.

Reverse Mode

Reverse mode performs one forward pass and then one backward pass that fans out from the scalar loss into every parameter at once. The trick: at each operation $z = f(x, y)$, we store a function that knows how to convert $\partial L / \partial z$ into $\partial L / \partial x$ and $\partial L / \partial y$. After the forward pass we have a graph of such functions. The backward pass walks this graph from the loss outward.

Cost: one forward pass plus one backward pass, regardless of $P$.

Why This Asymmetry Matters

Forward mode is $O(P)$ in parameter count. Reverse mode is $O(1)$. When $P = 7 \times 10^9$ and the loss is a single scalar, reverse mode is a billion times cheaper.

Important caveat: reverse mode is only cheaper when the output is low-dimensional. If we had a vector loss of dimension $D$ and many fewer parameters, forward mode would win. Neural networks land squarely in the reverse-mode regime: a single scalar loss against billions of parameters.

Computation Graphs

Reverse mode requires a record of every operation performed during the forward pass. We represent this record as a directed acyclic graph (DAG):

For example, the expression $L = (W x + b)^2$ produces:

$$ x,\, W,\, b \;\longrightarrow\; u = Wx \;\longrightarrow\; z = u + b \;\longrightarrow\; L = z^2. $$

Each node stores both its value and the operation that produced it. After the forward pass, the graph contains everything needed to evaluate the chain rule in reverse.

Key design choice: who stores what?

This is the design we will implement in Part 2.

The Backward Pass in Three Steps

Given the graph, the backward algorithm is:

  1. Topological sort. Order the nodes so that every parent appears before all its children.
  2. Seed. Set $\partial L / \partial L = 1$.
  3. Reverse walk. For each node in reverse topological order, invoke its closure. The closure consumes the node's accumulated gradient and pushes contributions into its parents.

That is the entire algorithm. Every modern framework -- PyTorch, JAX, TensorFlow -- implements exactly this, with differences only in graph representation (define-by-run vs.\ define-and-compile) and execution backend.

The Backward of Matrix Multiplication

The matmul backward formula is worth deriving by hand because it appears in every neural network. For $Z = X W$ with $X \in \mathbb{R}^{m \times n}$, $W \in \mathbb{R}^{n \times k}$:

$$ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Z}\, W^\top, \qquad \frac{\partial L}{\partial W} = X^\top \frac{\partial L}{\partial Z}. $$

Two lines. The forward pass is two lines. The backward pass is two lines.

This pattern repeats for every operation: a forward formula and a closure that uses incoming gradient $\partial L / \partial Z$ to produce gradients with respect to each input. The art is in the closures; once they are written, the engine works.

Broadcasting and Gradient Reduction

NumPy and PyTorch use broadcasting: a shape-$(D,)$ bias added to a shape-$(B, D)$ activation expands to shape $(B, D)$ implicitly. The gradient flowing back into the bias must be summed across the broadcast axes.

In general, after broadcasting promoted a tensor of shape $S_\text{in}$ to shape $S_\text{out}$, the gradient with respect to the input is obtained by summing the outgoing gradient over the axes that were inserted or stretched:

$$ \text{grad}_{\text{in}} = \sum_{\text{broadcast axes}} \text{grad}_{\text{out}}. $$

This is the subtle bookkeeping detail that trips up most hand-rolled implementations. Part 2 contains a six-line helper that handles it.

Summary

In Part 2 we implement all of this in 245 lines of NumPy: a Tensor class, an autograd graph, an MLP, and a numerically stable cross-entropy loss. In Part 3 we benchmark it against PyTorch on the two-moons dataset and verify the gradients match to within $10^{-17}$.