Hopfield Networks are energy-based models where memory storage and retrieval reduce to energy minimization. This post covers the mathematical foundations: the energy function, Hebbian learning, storage capacity bounds, and the modern continuous extension that maps directly onto Transformer attention.
Energy-Based Models and Associative Memory
The premise of a Hopfield network is straightforward: memories are energy minima.
Consider a system of $N$ binary neurons, each taking values in $\{-1, +1\}$. The state of the network is a vector $\mathbf{x} \in \{-1, +1\}^N$. We define an energy function over states:
where $W \in \mathbb{R}^{N \times N}$ is a symmetric weight matrix with zero diagonal ($W_{ii} = 0$).
If we construct $W$ so that the stored patterns $\{\boldsymbol{\xi}^1, \ldots, \boldsymbol{\xi}^P\}$ sit at local minima of $E$, then retrieval is just gradient descent on the energy landscape. Feed in a corrupted query, let the dynamics roll downhill, and the state settles into the nearest minimum---recovering the stored memory.
This is content-addressable memory: you retrieve by content (a partial or noisy version of the memory), not by address.
Hebbian Learning: Constructing the Weight Matrix
How do we construct $W$ so that the desired patterns are energy minima? The answer comes from Hebb's rule (1949): "Neurons that fire together, wire together."
Given $P$ patterns $\boldsymbol{\xi}^\mu \in \{-1, +1\}^N$ for $\mu = 1, \ldots, P$, the Hebbian weight matrix is:
The diagonal is zeroed to prevent self-reinforcement.
Why This Works
Consider the "local field" acting on neuron $i$ when the network is in state $\boldsymbol{\xi}^\nu$ (one of the stored patterns):
For the term $\mu = \nu$ (the "signal"):
When patterns are random and uncorrelated, the crosstalk term (from $\mu \neq \nu$) has mean zero and variance $\sim P/N$. As long as $P/N$ is small, the signal dominates and $\text{sign}(h_i) = \xi^\nu_i$---the pattern is a fixed point.
The Update Rule
Retrieval uses asynchronous updates. At each step, a neuron $i$ is selected (randomly or sequentially) and updated:
Energy Always Decreases
Each asynchronous update can only decrease (or maintain) the energy. The change when neuron $i$ flips:
Since the energy is bounded below and strictly non-increasing, the dynamics must converge to a fixed point (a local minimum of $E$).
Storage Capacity
How many patterns can a Hopfield network reliably store?
The Classical Bound
The answer, established by Amit, Gutfreund, and Sompolinsky (1985), is:
for perfect retrieval. In practice, if we allow a small fraction of bit errors, the capacity is approximately:
This is linear in $N$. For a network with $N = 100$ neurons, the practical capacity is roughly 13--14 patterns.
Why Capacity Is Limited
The bottleneck is crosstalk. Each additional stored pattern adds noise to the local field. When $P$ grows relative to $N$, the crosstalk overwhelms the signal, creating spurious attractors---stable states that do not correspond to any stored pattern. These are "chimeric" mixtures of multiple patterns.
The Modern Continuous Hopfield Network
In 2020, Ramsauer et al. replaced the quadratic energy with a log-sum-exp.
Continuous States and the New Energy
Instead of binary states, we allow $\mathbf{x} \in \mathbb{R}^d$. The stored patterns are continuous vectors $\boldsymbol{\xi}^1, \ldots, \boldsymbol{\xi}^P \in \mathbb{R}^d$. The energy function becomes:
where $\beta > 0$ is an inverse temperature parameter.
The Update Rule Is Attention
Setting $\nabla_\mathbf{x} E = 0$ and solving, the fixed-point update becomes:
Now compare with standard softmax attention:
Setting $K = V = \Xi$ (stored patterns), $Q = \mathbf{x}$ (query), and $\beta = 1/\sqrt{d_k}$, the modern Hopfield update is exactly the attention mechanism.
Exponential Capacity
The log-sum-exp energy creates much sharper minima than the quadratic energy. The resulting storage capacity is:
which is exponential in the pattern dimension $d$. For $d = 64$, this dwarfs the classical $0.14N$ bound.
Summary and Preview
| Property | Classical (1982) | Modern (2020) |
|---|---|---|
| State space | Binary $\{-1, +1\}^N$ | Continuous $\mathbb{R}^d$ |
| Update | $\text{sign}(W\mathbf{x})$ | $\Xi^\top \text{softmax}(\beta \Xi^\top \mathbf{x})$ |
| Learning | Hebbian outer products | Direct pattern storage |
| Capacity | $\sim 0.14N$ (linear) | $\sim \exp(d/2)$ (exponential) |
| Connection | --- | = Softmax Attention |
In Part 2, we implement both networks from scratch in pure PyTorch and verify these mathematical properties empirically.