Every neural network on Earth is trained by the same algorithm: reverse-mode automatic differentiation, popularly called backpropagation. PyTorch's autograd is roughly 60,000 lines of C++ and Python. The algorithm it implements is ten lines.
This three-part series strips the engineering away and builds a working autograd engine from first principles in pure NumPy. By the end of Part 3 we will have 245 lines of code whose gradients match PyTorch to within machine epsilon on every parameter of a multilayer perceptron.
This first part covers the math: the chain rule, computation graphs, and why reverse mode is the only viable choice when the parameter count exceeds the loss dimension by ten orders of magnitude.
The Chain Rule, As You Already Know It
Given $L = f(g(x))$, calculus tells us:
For composed functions $L = h \circ g \circ f$ this generalises:
This is exactly what neural networks need. Replace each function with a layer, and $L$ becomes the loss. Computing $\partial L / \partial \theta$ for every parameter $\theta$ in the network is just the chain rule, repeated.
The only question is the order in which we multiply.
Forward Mode vs Reverse Mode
Consider a network with $P$ parameters and a scalar loss $L$. To compute every gradient, we must evaluate $P$ derivatives.
Forward Mode
Forward mode propagates derivatives alongside the forward computation. For each input variable $x$, we carry a "dual number" $(v,\, dv/dx)$ through every operation. At the end, the dual associated with the loss gives $dL/dx$.
Cost: one forward pass per input variable. To differentiate the loss with respect to $P$ parameters, we need $P$ forward passes. For a 7-billion-parameter model, that is 7 billion forward passes per training step.
Reverse Mode
Reverse mode performs one forward pass and then one backward pass that fans out from the scalar loss into every parameter at once. The trick: at each operation $z = f(x, y)$, we store a function that knows how to convert $\partial L / \partial z$ into $\partial L / \partial x$ and $\partial L / \partial y$. After the forward pass we have a graph of such functions. The backward pass walks this graph from the loss outward.
Cost: one forward pass plus one backward pass, regardless of $P$.
Why This Asymmetry Matters
Forward mode is $O(P)$ in parameter count. Reverse mode is $O(1)$. When $P = 7 \times 10^9$ and the loss is a single scalar, reverse mode is a billion times cheaper.
Important caveat: reverse mode is only cheaper when the output is low-dimensional. If we had a vector loss of dimension $D$ and many fewer parameters, forward mode would win. Neural networks land squarely in the reverse-mode regime: a single scalar loss against billions of parameters.
Computation Graphs
Reverse mode requires a record of every operation performed during the forward pass. We represent this record as a directed acyclic graph (DAG):
- Each node is an intermediate value.
- Each edge points from an operation's inputs to its output.
- The graph is built dynamically as the forward pass executes.
For example, the expression $L = (W x + b)^2$ produces:
Each node stores both its value and the operation that produced it. After the forward pass, the graph contains everything needed to evaluate the chain rule in reverse.
Key design choice: who stores what?
- Each node stores its own value.
- Each node stores a closure that, given $\partial L / \partial \text{(this node)}$, knows how to update its parents' gradients.
This is the design we will implement in Part 2.
The Backward Pass in Three Steps
Given the graph, the backward algorithm is:
- Topological sort. Order the nodes so that every parent appears before all its children.
- Seed. Set $\partial L / \partial L = 1$.
- Reverse walk. For each node in reverse topological order, invoke its closure. The closure consumes the node's accumulated gradient and pushes contributions into its parents.
That is the entire algorithm. Every modern framework -- PyTorch, JAX, TensorFlow -- implements exactly this, with differences only in graph representation (define-by-run vs.\ define-and-compile) and execution backend.
The Backward of Matrix Multiplication
The matmul backward formula is worth deriving by hand because it appears in every neural network. For $Z = X W$ with $X \in \mathbb{R}^{m \times n}$, $W \in \mathbb{R}^{n \times k}$:
Two lines. The forward pass is two lines. The backward pass is two lines.
This pattern repeats for every operation: a forward formula and a closure that uses incoming gradient $\partial L / \partial Z$ to produce gradients with respect to each input. The art is in the closures; once they are written, the engine works.
Broadcasting and Gradient Reduction
NumPy and PyTorch use broadcasting: a shape-$(D,)$ bias added to a shape-$(B, D)$ activation expands to shape $(B, D)$ implicitly. The gradient flowing back into the bias must be summed across the broadcast axes.
In general, after broadcasting promoted a tensor of shape $S_\text{in}$ to shape $S_\text{out}$, the gradient with respect to the input is obtained by summing the outgoing gradient over the axes that were inserted or stretched:
This is the subtle bookkeeping detail that trips up most hand-rolled implementations. Part 2 contains a six-line helper that handles it.
Summary
- Reverse-mode autodiff computes all parameter gradients in one backward pass; forward mode would require $O(P)$ passes.
- The forward pass builds a DAG of operations.
- The backward pass topologically sorts the DAG, seeds the output gradient to one, and walks the order in reverse, invoking a per-op closure at each node.
- Each closure is short: matmul is two lines, add is two lines (plus broadcast bookkeeping).
In Part 2 we implement all of this in 245 lines of NumPy: a Tensor class, an autograd graph, an MLP, and a numerically stable cross-entropy loss. In Part 3 we benchmark it against PyTorch on the two-moons dataset and verify the gradients match to within $10^{-17}$.