If standard deep learning recognizes patterns bottom-up and learns through top-down error correction, Predictive Coding Networks (PCNs) flip the picture: they actively anticipate reality before it arrives.
Backpropagation is efficient on GPUs and mathematically clean. But it has a biological problem: it requires a precise, global error signal transmitted symmetrically backward through the entire network. Neuroscientists broadly agree the brain does not do this. Real synapses update based only on the activity of immediately adjacent neurons; there is no global supervisor freezing the network to compute chain-rule gradients.
This post covers the mathematical foundation of PCNs: top-down predictions, local errors, and energy minimization.
The Generative Brain: Top-Down vs Bottom-Up
Predictive Coding proposes that the brain does not passively wait for sensory input to flow upward (like a standard MLP). Instead, it constantly generates predictions of what it expects to see, sending them downward.
- Standard MLP: Data flows Bottom $\rightarrow$ Top. Errors flow Top $\rightarrow$ Bottom.
- PCN: Beliefs flow Top $\rightarrow$ Bottom (predicting the layer below). Only the prediction errors (Target - Prediction) are passed Bottom $\rightarrow$ Top.
The Mathematics of Local Beliefs
Consider two adjacent layers: a higher layer of latent nodes $v_l$ and a lower layer $v_{l-1}$.
The higher layer uses weights $W_l$ to predict the activation of the lower layer:

$$\hat{v}_{l-1} = W_l \, f(v_l)$$

where $f$ is a non-linear activation (e.g., Tanh).

When the actual input arrives, the local prediction error at layer $l-1$ is the difference between the true state and the predicted state:

$$\epsilon_{l-1} = v_{l-1} - \hat{v}_{l-1} = v_{l-1} - W_l \, f(v_l)$$
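In code, for one pair of layers, this is a single matrix-vector product and a subtraction. A minimal PyTorch sketch (the sizes and variable names here are illustrative, not from the post):

```python
import torch

torch.manual_seed(0)

# Illustrative sizes: a 10-unit higher layer predicting a 20-unit lower layer.
v_l = torch.randn(10)            # higher-layer latent state  v_l
v_lm1 = torch.randn(20)          # lower-layer state          v_{l-1}
W_l = 0.1 * torch.randn(20, 10)  # generative weights W_l (top -> bottom)

# Top-down prediction of the layer below: v_hat = W_l f(v_l)
v_hat = W_l @ torch.tanh(v_l)

# Local prediction error: eps_{l-1} = v_{l-1} - v_hat.
# This error is the only signal that travels bottom -> top.
eps_lm1 = v_lm1 - v_hat
```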
Phase 1: Energy Minimization (Inference)
In a standard ANN, inference is a single forward pass. In a PCN, inference is iterative. Before learning, the network must first settle its beliefs to explain the sensory input by minimizing the total prediction error, or the Energy ($E$), summed over all layers:

$$E = \frac{1}{2} \sum_{l} \|\epsilon_l\|^2$$
Weights stay frozen. Gradient descent runs only on the latent node states $v_l$ to minimize $E$, iteratively adjusting internal beliefs until they align with the input. For an interior layer, the state update is:

$$\Delta v_l = -\eta \frac{\partial E}{\partial v_l} = \eta \left( f'(v_l) \odot (W_l^\top \epsilon_{l-1}) - \epsilon_l \right)$$

where $\odot$ is element-wise multiplication. This update, too, involves only the errors and weights immediately adjacent to layer $l$.
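A minimal sketch of this settling loop for a single latent layer (the sizes, tanh activation, learning rate, and step count are illustrative assumptions, not values from the post):

```python
import torch

torch.manual_seed(0)
x = torch.randn(20)            # sensory input, clamped at the bottom layer
v = torch.zeros(10)            # latent beliefs, to be settled
W = 0.1 * torch.randn(20, 10)  # generative weights, frozen during inference

lr_v = 0.1
for step in range(50):         # iterative inference: many steps per sample
    pred = W @ torch.tanh(v)   # top-down prediction of the input
    eps = x - pred             # local prediction error at the bottom layer
    # Gradient descent on the *states*, not the weights:
    # dE/dv = -f'(v) * (W^T eps), with f'(v) = 1 - tanh(v)^2
    v = v + lr_v * (1 - torch.tanh(v) ** 2) * (W.T @ eps)

# Settled energy: should be much lower than at step 0.
energy = 0.5 * (x - W @ torch.tanh(v)).pow(2).sum()
```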
Phase 2: The Hebbian Weight Update (Learning)
Once the latent states have settled, learning occurs. Because the energy function depends entirely on local prediction errors, the gradient with respect to $W_l$ uses only local variables:

$$\Delta W_l = -\eta \frac{\partial E}{\partial W_l} = \eta \, \epsilon_{l-1} f(v_l)^\top$$
The weight update is the outer product of the post-synaptic error ($\epsilon_{l-1}$) and the pre-synaptic activity ($f(v_l)$). This is a strictly local Hebbian rule derived from minimizing local energy -- no global backpropagation required.
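A sketch of one such update, assuming inference has already settled the states (the tensors below are illustrative stand-ins for those settled values):

```python
import torch

torch.manual_seed(0)
v_l = torch.randn(10)            # settled higher-layer state (from inference)
eps_lm1 = torch.randn(20)        # settled error at the layer below
W_l = 0.1 * torch.randn(20, 10)

# Delta W_l = lr * eps_{l-1} f(v_l)^T : post-synaptic error times
# pre-synaptic activity. No chain rule through other layers is needed.
lr_w = 0.01
W_l = W_l + lr_w * torch.outer(eps_lm1, torch.tanh(v_l))
```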
The Cost of Biological Plausibility
PCNs currently underperform standard backpropagation on speed and accuracy for concrete reasons:
- Inference Bottleneck: An MLP does one forward pass. A PCN iterates 20-50 times per sample to settle its latent states before any learning happens. Training is proportionally slower.
- Local vs. Global Gradients: Backpropagation computes exact global error gradients via the chain rule. PCN weights only see their immediate neighbors, making coordination across deep layers harder.
- Generative vs. Discriminative: MLPs draw decision boundaries between labels. PCNs predict layer states from the top down, effectively learning a generative model of the full data distribution -- a harder problem.
Despite these costs, removing the global synchronous gradient is exactly what makes PCNs compatible with analog, low-power neuromorphic hardware.
Next: From Math to Code
The formulation is an energy-based objective: predict the input, settle the latent states, update the weights locally. Part 2 implements this in pure PyTorch using torch.autograd.grad on the latent nodes and a custom Hebbian update rule -- no loss.backward().
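As a rough preview of that mechanism (a sketch under the same toy assumptions as above, not the actual Part 2 code):

```python
import torch

torch.manual_seed(0)
x = torch.randn(20)                      # clamped sensory input
W = 0.1 * torch.randn(20, 10)            # generative weights (no grad tracking)
v = torch.zeros(10, requires_grad=True)  # only the latent states carry grads

for _ in range(50):                      # Phase 1: settle the beliefs
    energy = 0.5 * (x - W @ torch.tanh(v)).pow(2).sum()
    # Gradient of the energy w.r.t. the states only -- no loss.backward()
    (grad_v,) = torch.autograd.grad(energy, v)
    with torch.no_grad():
        v -= 0.1 * grad_v

with torch.no_grad():                    # Phase 2: one local Hebbian update
    eps = x - W @ torch.tanh(v)
    W += 0.01 * torch.outer(eps, torch.tanh(v))
```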