Back to Hopfield Hub

Deconstructing Hopfield Networks from Scratch

Part 1: The Math of Associative Memory

Abstract

Hopfield Networks are energy-based models where memory storage and retrieval reduce to energy minimization. This post covers the mathematical foundations: the energy function, Hebbian learning, storage capacity bounds, and the modern continuous extension that maps directly onto Transformer attention.

Energy-Based Models and Associative Memory

The premise of a Hopfield network is straightforward: memories are energy minima.

Consider a system of $N$ binary neurons, each taking values in $\{-1, +1\}$. The state of the network is a vector $\mathbf{x} \in \{-1, +1\}^N$. We define an energy function over states:

$$ E(\mathbf{x}) = -\frac{1}{2} \mathbf{x}^\top W \mathbf{x} $$

where $W \in \mathbb{R}^{N \times N}$ is a symmetric weight matrix with zero diagonal ($W_{ii} = 0$).

If we construct $W$ so that the stored patterns $\{\boldsymbol{\xi}^1, \ldots, \boldsymbol{\xi}^P\}$ sit at local minima of $E$, then retrieval is just gradient descent on the energy landscape. Feed in a corrupted query, let the dynamics roll downhill, and the state settles into the nearest minimum---recovering the stored memory.

This is content-addressable memory: you retrieve by content (a partial or noisy version of the memory), not by address.

Hebbian Learning: Constructing the Weight Matrix

How do we construct $W$ so that the desired patterns are energy minima? The answer comes from Hebb's rule (1949): "Neurons that fire together, wire together."

Given $P$ patterns $\boldsymbol{\xi}^\mu \in \{-1, +1\}^N$ for $\mu = 1, \ldots, P$, the Hebbian weight matrix is:

$$ W = \frac{1}{N} \sum_{\mu=1}^{P} \boldsymbol{\xi}^\mu (\boldsymbol{\xi}^\mu)^\top, \qquad W_{ii} = 0 $$

The diagonal is zeroed to prevent self-reinforcement.

Why This Works

Consider the "local field" acting on neuron $i$ when the network is in state $\boldsymbol{\xi}^\nu$ (one of the stored patterns):

$$ h_i = \sum_{j \neq i} W_{ij} \xi^\nu_j = \frac{1}{N} \sum_{\mu=1}^{P} \xi^\mu_i \sum_{j \neq i} \xi^\mu_j \xi^\nu_j $$

For the term $\mu = \nu$ (the "signal"):

$$ \frac{1}{N} \xi^\nu_i \sum_{j \neq i} (\xi^\nu_j)^2 = \frac{N-1}{N} \xi^\nu_i \approx \xi^\nu_i $$

When patterns are random and uncorrelated, the crosstalk term (from $\mu \neq \nu$) has mean zero and variance $\sim P/N$. As long as $P/N$ is small, the signal dominates and $\text{sign}(h_i) = \xi^\nu_i$---the pattern is a fixed point.

The Update Rule

Retrieval uses asynchronous updates. At each step, a neuron $i$ is selected (randomly or sequentially) and updated:

$$ x_i \leftarrow \text{sign}\left(\sum_{j} W_{ij} x_j\right) $$

Energy Always Decreases

Each asynchronous update can only decrease (or maintain) the energy. The change when neuron $i$ flips:

$$ \Delta E = -\left(x_i^{\text{new}} - x_i\right) \sum_j W_{ij} x_j \leq 0 $$

Since the energy is bounded below and strictly non-increasing, the dynamics must converge to a fixed point (a local minimum of $E$).

Storage Capacity

How many patterns can a Hopfield network reliably store?

The Classical Bound

The answer, established by Amit, Gutfreund, and Sompolinsky (1985), is:

$$ P_{\max} \approx \frac{N}{2 \ln N} $$

for perfect retrieval. In practice, if we allow a small fraction of bit errors, the capacity is approximately:

$$ P_{\max} \approx 0.14 N $$

This is linear in $N$. For a network with $N = 100$ neurons, the practical capacity is roughly 13--14 patterns.

Why Capacity Is Limited

The bottleneck is crosstalk. Each additional stored pattern adds noise to the local field. When $P$ grows relative to $N$, the crosstalk overwhelms the signal, creating spurious attractors---stable states that do not correspond to any stored pattern. These are "chimeric" mixtures of multiple patterns.

The Modern Continuous Hopfield Network

In 2020, Ramsauer et al. replaced the quadratic energy with a log-sum-exp.

Continuous States and the New Energy

Instead of binary states, we allow $\mathbf{x} \in \mathbb{R}^d$. The stored patterns are continuous vectors $\boldsymbol{\xi}^1, \ldots, \boldsymbol{\xi}^P \in \mathbb{R}^d$. The energy function becomes:

$$ E(\mathbf{x}) = -\frac{1}{\beta} \ln \left( \sum_{\mu=1}^{P} \exp\left(\beta \, (\boldsymbol{\xi}^\mu)^\top \mathbf{x}\right) \right) + \frac{1}{2} \|\mathbf{x}\|^2 + \text{const} $$

where $\beta > 0$ is an inverse temperature parameter.

The Update Rule Is Attention

Setting $\nabla_\mathbf{x} E = 0$ and solving, the fixed-point update becomes:

$$ \mathbf{x}^{\text{new}} = \sum_{\mu=1}^{P} \text{softmax}\left(\beta \, \Xi^\top \mathbf{x}\right)_\mu \, \boldsymbol{\xi}^\mu = \Xi^\top \, \text{softmax}\left(\beta \, \Xi^\top \mathbf{x}\right) $$

Now compare with standard softmax attention:

$$ \text{Attention}(Q, K, V) = V^\top \, \text{softmax}\left(\frac{K^\top Q}{\sqrt{d_k}}\right) $$

Setting $K = V = \Xi$ (stored patterns), $Q = \mathbf{x}$ (query), and $\beta = 1/\sqrt{d_k}$, the modern Hopfield update is exactly the attention mechanism.

Exponential Capacity

The log-sum-exp energy creates much sharper minima than the quadratic energy. The resulting storage capacity is:

$$ P_{\max} \sim \exp\left(\frac{d}{2}\right) $$

which is exponential in the pattern dimension $d$. For $d = 64$, this dwarfs the classical $0.14N$ bound.

Summary and Preview

Property Classical (1982) Modern (2020)
State spaceBinary $\{-1, +1\}^N$Continuous $\mathbb{R}^d$
Update$\text{sign}(W\mathbf{x})$$\Xi^\top \text{softmax}(\beta \Xi^\top \mathbf{x})$
LearningHebbian outer productsDirect pattern storage
Capacity$\sim 0.14N$ (linear)$\sim \exp(d/2)$ (exponential)
Connection---= Softmax Attention

In Part 2, we implement both networks from scratch in pure PyTorch and verify these mathematical properties empirically.