Back to Backpropagation Hub

Deconstructing Backpropagation from Scratch

Part 2: Pure NumPy Implementation

Overview

Part 1 covered the math. Now we implement it.

The full engine occupies a single file (autograd.py) of 245 lines including docstrings. It supports arbitrary tensor shapes with NumPy broadcasting, the operations needed to train an MLP (+, *, @, sum, mean, tanh, relu, exp, log, log_softmax, cross_entropy), and a backward() entry point that performs the topological sort and reverse walk.

We then build an MLP on top of it and a PyTorch model with identical architecture, so Part 3 can verify the two produce matching gradients.

The Tensor Class

Every node in the autograd graph is a Tensor. It carries five fields:

class Tensor:
    def __init__(self, data, _parents=(), _op="", requires_grad=True):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = np.zeros_like(self.data)
        self.requires_grad = requires_grad
        self._parents = _parents
        self._op = _op
        self._backward = lambda: None

The interesting one is _backward: a closure that, when called, pushes out.grad into the parents' .grad fields using the chain rule for this particular operation. Each elementary op overwrites it during the forward pass.

Elementary Operations

Addition (with broadcasting)

def __add__(self, other):
    other = other if isinstance(other, Tensor) else Tensor(other, requires_grad=False)
    out = Tensor(self.data + other.data, (self, other), "+")
    def _backward():
        self.grad  += Tensor._unbroadcast(out.grad, self.data.shape)
        other.grad += Tensor._unbroadcast(out.grad, other.data.shape)
    out._backward = _backward
    return out

The _unbroadcast helper sums the outgoing gradient along whichever axes were introduced or stretched by broadcasting. Without it, adding a $(D,)$ bias to a $(B, D)$ activation would leave the bias gradient with shape $(B, D)$, which cannot be assigned back.

Matrix multiplication

def __matmul__(self, other):
    out = Tensor(self.data @ other.data, (self, other), "@")
    def _backward():
        self.grad  += out.grad @ other.data.T
        other.grad += self.data.T @ out.grad
    out._backward = _backward
    return out

Two lines of forward, two lines of backward. This is the workhorse of every neural network.

Tanh

def tanh(self):
    t = np.tanh(self.data)
    out = Tensor(t, (self,), "tanh")
    def _backward():
        self.grad += (1.0 - t**2) * out.grad
    out._backward = _backward
    return out

The derivative is captured in the closure -- $1 - \tanh^2(z)$ -- using $t$ that we just computed in the forward.

Numerically Stable Log-Softmax

The naive log-softmax overflows for large logits. The stable form subtracts the max:

$$ \text{log\_softmax}(z)_i = z_i - \max_k z_k - \log \sum_j \exp(z_j - \max_k z_k). $$

Its gradient, given an incoming gradient $g$, is:

$$ \frac{\partial L}{\partial z_i} = g_i \;-\; \text{softmax}(z)_i \cdot \sum_j g_j. $$

This compresses to a single line of NumPy inside the closure.

Cross-Entropy Loss

For integer class targets, mean cross-entropy reduces to:

$$ \mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \text{log\_softmax}(z^{(i)})_{y^{(i)}}. $$

The gradient with respect to the log-probability of the correct class is $-1/B$; all other entries are zero. Three lines of NumPy implement this. Crucially, we apply this gradient to the output of log_softmax, and let the engine handle propagation back through log_softmax and into the logits. No manual derivation of the combined softmax+cross-entropy gradient is required.

The Backward Engine

def backward(self):
    topo, visited = [], set()
    def build(v):
        if id(v) in visited: return
        visited.add(id(v))
        for p in v._parents: build(p)
        topo.append(v)
    build(self)
    self.grad = np.ones_like(self.data)
    for v in reversed(topo):
        v._backward()

Ten lines. This is reverse-mode autodiff, complete. Everything else in the file is per-operation forward/backward formulas.

Building an MLP

A Linear layer holds two parameters as Tensors:

class Linear:
    def __init__(self, in_dim, out_dim, rng):
        W, b = init_linear(in_dim, out_dim, rng)
        self.W = Tensor(W)
        self.b = Tensor(b)
    def __call__(self, x):
        return x @ self.W + self.b

An MLP composes three Linears separated by tanh:

class MyMLP:
    def __init__(self, in_dim, hidden, out_dim, rng):
        self.fc1 = Linear(in_dim, hidden, rng)
        self.fc2 = Linear(hidden, hidden, rng)
        self.fc3 = Linear(hidden, out_dim, rng)
    def __call__(self, x):
        h = self.fc1(x).tanh()
        h = self.fc2(h).tanh()
        return self.fc3(h)

Tanh is intentional: its non-trivial derivative $(1 - \tanh^2)$ is a more rigorous test of the engine than ReLU's binary mask.

Why Reproduce the PyTorch Architecture?

The mirror class TorchMLP uses torch.nn.Linear with the same shapes. A helper copy_weights_to_torch forces the PyTorch model to use exactly the same parameter values as our autograd model. This lets Part 3 perform a numerical cross-check:

  1. Run forward + backward in our engine $\to$ obtain analytic gradients.
  2. Run forward + backward in PyTorch on identical inputs and weights $\to$ obtain reference gradients.
  3. Report $\max_i \lvert \nabla_\theta L_\text{ours} - \nabla_\theta L_\text{torch} \rvert$ per parameter.

If the engine is correct, this difference is bounded by floating-point arithmetic, not algorithmic error.

Summary

In Part 3 we run the cross-check, train on the two-moons dataset, compare against PyTorch, and explain the surprising 4$\times$ speed advantage of NumPy on small models.

Full code on GitHub: github.com/soveshmohapatra/Backpropagation