Part 1 covered the math. Now we implement it.
The full engine occupies a single file (autograd.py) of 245 lines including docstrings. It supports arbitrary tensor shapes with NumPy broadcasting, the operations needed to train an MLP (+, *, @, sum, mean, tanh, relu, exp, log, log_softmax, cross_entropy), and a backward() entry point that performs the topological sort and reverse walk.
We then build an MLP on top of it and a PyTorch model with identical architecture, so Part 3 can verify the two produce matching gradients.
The Tensor Class
Every node in the autograd graph is a Tensor. It carries five fields:
class Tensor:
def __init__(self, data, _parents=(), _op="", requires_grad=True):
self.data = np.asarray(data, dtype=np.float64)
self.grad = np.zeros_like(self.data)
self.requires_grad = requires_grad
self._parents = _parents
self._op = _op
self._backward = lambda: None
The interesting one is _backward: a closure that, when called, pushes out.grad into the parents' .grad fields using the chain rule for this particular operation. Each elementary op overwrites it during the forward pass.
Elementary Operations
Addition (with broadcasting)
def __add__(self, other):
other = other if isinstance(other, Tensor) else Tensor(other, requires_grad=False)
out = Tensor(self.data + other.data, (self, other), "+")
def _backward():
self.grad += Tensor._unbroadcast(out.grad, self.data.shape)
other.grad += Tensor._unbroadcast(out.grad, other.data.shape)
out._backward = _backward
return out
The _unbroadcast helper sums the outgoing gradient along whichever axes were introduced or stretched by broadcasting. Without it, adding a $(D,)$ bias to a $(B, D)$ activation would leave the bias gradient with shape $(B, D)$, which cannot be assigned back.
Matrix multiplication
def __matmul__(self, other):
out = Tensor(self.data @ other.data, (self, other), "@")
def _backward():
self.grad += out.grad @ other.data.T
other.grad += self.data.T @ out.grad
out._backward = _backward
return out
Two lines of forward, two lines of backward. This is the workhorse of every neural network.
Tanh
def tanh(self):
t = np.tanh(self.data)
out = Tensor(t, (self,), "tanh")
def _backward():
self.grad += (1.0 - t**2) * out.grad
out._backward = _backward
return out
The derivative is captured in the closure -- $1 - \tanh^2(z)$ -- using $t$ that we just computed in the forward.
Numerically Stable Log-Softmax
The naive log-softmax overflows for large logits. The stable form subtracts the max:
Its gradient, given an incoming gradient $g$, is:
This compresses to a single line of NumPy inside the closure.
Cross-Entropy Loss
For integer class targets, mean cross-entropy reduces to:
The gradient with respect to the log-probability of the correct class is $-1/B$; all other entries are zero. Three lines of NumPy implement this. Crucially, we apply this gradient to the output of log_softmax, and let the engine handle propagation back through log_softmax and into the logits. No manual derivation of the combined softmax+cross-entropy gradient is required.
The Backward Engine
def backward(self):
topo, visited = [], set()
def build(v):
if id(v) in visited: return
visited.add(id(v))
for p in v._parents: build(p)
topo.append(v)
build(self)
self.grad = np.ones_like(self.data)
for v in reversed(topo):
v._backward()
Ten lines. This is reverse-mode autodiff, complete. Everything else in the file is per-operation forward/backward formulas.
Building an MLP
A Linear layer holds two parameters as Tensors:
class Linear:
def __init__(self, in_dim, out_dim, rng):
W, b = init_linear(in_dim, out_dim, rng)
self.W = Tensor(W)
self.b = Tensor(b)
def __call__(self, x):
return x @ self.W + self.b
An MLP composes three Linears separated by tanh:
class MyMLP:
def __init__(self, in_dim, hidden, out_dim, rng):
self.fc1 = Linear(in_dim, hidden, rng)
self.fc2 = Linear(hidden, hidden, rng)
self.fc3 = Linear(hidden, out_dim, rng)
def __call__(self, x):
h = self.fc1(x).tanh()
h = self.fc2(h).tanh()
return self.fc3(h)
Tanh is intentional: its non-trivial derivative $(1 - \tanh^2)$ is a more rigorous test of the engine than ReLU's binary mask.
Why Reproduce the PyTorch Architecture?
The mirror class TorchMLP uses torch.nn.Linear with the same shapes. A helper copy_weights_to_torch forces the PyTorch model to use exactly the same parameter values as our autograd model. This lets Part 3 perform a numerical cross-check:
- Run forward + backward in our engine $\to$ obtain analytic gradients.
- Run forward + backward in PyTorch on identical inputs and weights $\to$ obtain reference gradients.
- Report $\max_i \lvert \nabla_\theta L_\text{ours} - \nabla_\theta L_\text{torch} \rvert$ per parameter.
If the engine is correct, this difference is bounded by floating-point arithmetic, not algorithmic error.
Summary
- A 245-line
Tensorclass implements reverse-mode autodiff. - Each operation installs a closure that pushes outgoing gradients into its parents using the chain rule.
backward()does a topological sort then walks the reverse order, calling each closure exactly once.- An MLP is just three
Linearlayers with tanh in between; the engine handles training without further code.
In Part 3 we run the cross-check, train on the two-moons dataset, compare against PyTorch, and explain the surprising 4$\times$ speed advantage of NumPy on small models.
Full code on GitHub: github.com/soveshmohapatra/Backpropagation