Hopfield networks are energy-based models where memory storage and retrieval emerge from the physics of energy minimization. In this first installment of a three-part series, we develop the mathematical foundations: the energy function, Hebbian learning, storage capacity bounds, and the modern continuous extension that connects directly to Transformer attention. The full implementation waits for Part 2; here we focus on the math that makes it all work.
Energy-Based Models and Associative Memory
The central premise of a Hopfield network is deceptively simple: memories are energy minima.
Consider a system of $N$ binary neurons, each taking values in $\{-1, +1\}$. The state of the network is a vector $\mathbf{x} \in \{-1, +1\}^N$. We define an energy function over states:

$$E(\mathbf{x}) = -\frac{1}{2} \mathbf{x}^\top W \mathbf{x} = -\frac{1}{2} \sum_{i \neq j} W_{ij}\, x_i x_j,$$
where $W \in \mathbb{R}^{N \times N}$ is a symmetric weight matrix with zero diagonal ($W_{ii} = 0$).
The key insight: if we can construct $W$ such that the stored patterns $\{\boldsymbol{\xi}^1, \ldots, \boldsymbol{\xi}^P\}$ correspond to local minima of $E$, then retrieval becomes gradient descent on the energy landscape. Starting from a corrupted query, the dynamics roll downhill until they settle into the nearest minimum---recovering the stored memory.
This is content-addressable memory: you retrieve by content (a partial or noisy version of the memory), not by address.
Hebbian Learning: Constructing the Weight Matrix
How do we construct $W$ so that the desired patterns are energy minima? The answer comes from Hebb's rule (1949): "Neurons that fire together, wire together."
Given $P$ patterns $\boldsymbol{\xi}^\mu \in \{-1, +1\}^N$ for $\mu = 1, \ldots, P$, the Hebbian weight matrix is:

$$W_{ij} = \frac{1}{N} \sum_{\mu=1}^{P} \xi^\mu_i \xi^\mu_j \quad (i \neq j), \qquad W_{ii} = 0.$$
The diagonal is zeroed to prevent self-reinforcement.
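As a concrete sketch (NumPy here purely for illustration; Part 2 does this properly in PyTorch), the Hebbian construction is a single outer-product sum:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 5                          # neurons, stored patterns
xi = rng.choice([-1, 1], size=(P, N))  # random bipolar patterns, one per row

# Hebbian rule: W_ij = (1/N) * sum_mu xi^mu_i xi^mu_j, with zero diagonal
W = (xi.T @ xi) / N
np.fill_diagonal(W, 0.0)
```

Here `xi.T @ xi` sums the outer products $\boldsymbol{\xi}^\mu (\boldsymbol{\xi}^\mu)^\top$ over all $P$ patterns in one matrix multiply; the result is symmetric by construction.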
Why This Works
Consider the "local field" acting on neuron $i$ when the network is in state $\boldsymbol{\xi}^\nu$ (one of the stored patterns):

$$h_i = \sum_j W_{ij}\, \xi^\nu_j = \frac{1}{N} \sum_{j \neq i} \sum_{\mu=1}^{P} \xi^\mu_i \xi^\mu_j \xi^\nu_j.$$
For the term $\mu = \nu$ (the "signal"):

$$\frac{1}{N} \sum_{j \neq i} \xi^\nu_i \left(\xi^\nu_j\right)^2 = \frac{N-1}{N}\, \xi^\nu_i \approx \xi^\nu_i.$$
When patterns are random and uncorrelated, the crosstalk term (from $\mu \neq \nu$) has mean zero and variance $\sim P/N$. As long as $P/N$ is small, the signal dominates and $\text{sign}(h_i) = \xi^\nu_i$---the pattern is a fixed point.
The Update Rule
Retrieval uses asynchronous updates. At each step, a neuron $i$ is selected (randomly or sequentially) and updated:

$$x_i \leftarrow \text{sign}\!\left(\sum_j W_{ij}\, x_j\right).$$
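A minimal retrieval loop, sketched in NumPy (the proper implementation comes in Part 2): corrupt a stored pattern, then apply asynchronous updates until a full sweep produces no change.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 200, 5
xi = rng.choice([-1, 1], size=(P, N))
W = (xi.T @ xi) / N
np.fill_diagonal(W, 0.0)

# Corrupt pattern 0 by flipping 15% of its bits
x = xi[0].copy()
flips = rng.choice(N, size=int(0.15 * N), replace=False)
x[flips] *= -1

# Asynchronous updates: sweep neurons in random order until a full sweep
# passes with no change, i.e. we have reached a fixed point
changed = True
while changed:
    changed = False
    for i in rng.permutation(N):
        new = 1 if W[i] @ x >= 0 else -1
        if new != x[i]:
            x[i] = new
            changed = True

overlap = (x @ xi[0]) / N              # 1.0 means exact recovery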
Energy Always Decreases
A crucial property: each asynchronous update can only decrease (or maintain) the energy. The change in energy when neuron $i$ is updated from $x_i$ to $x_i' = \text{sign}(h_i)$ is:

$$\Delta E = -\left(x_i' - x_i\right) h_i \le 0,$$

because $x_i'$ shares the sign of $h_i$, so $(x_i' - x_i)\, h_i \ge 0$.
Since the energy is bounded below and non-increasing under the dynamics, the network must converge to a fixed point (a local minimum of $E$).
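A quick numerical check of monotonicity (again a NumPy sketch): track $E$ along an asynchronous trajectory starting from a random state.

```python
import numpy as np

def energy(W, x):
    # E(x) = -1/2 x^T W x  (the diagonal of W is zero)
    return -0.5 * (x @ W @ x)

rng = np.random.default_rng(3)
N, P = 100, 3
xi = rng.choice([-1, 1], size=(P, N))
W = (xi.T @ xi) / N
np.fill_diagonal(W, 0.0)

x = rng.choice([-1, 1], size=N)        # random initial state
energies = [energy(W, x)]
for _ in range(3):                     # a few asynchronous sweeps
    for i in rng.permutation(N):
        x[i] = 1 if W[i] @ x >= 0 else -1
        energies.append(energy(W, x))

# Energy never increases along the trajectory
monotone = all(b <= a + 1e-9 for a, b in zip(energies, energies[1:]))
```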
Storage Capacity
The fundamental question: how many patterns can a Hopfield network reliably store?
The Classical Bound
The answer, established by Amit, Gutfreund, and Sompolinsky (1985), is:

$$P_{\max} \approx \frac{N}{2 \ln N}$$

for perfect retrieval. In practice, if we allow a small fraction of bit errors, the capacity is approximately:

$$P_{\max} \approx 0.138\, N.$$
This is linear in $N$. For a network with $N = 100$ neurons, the practical capacity is roughly 13--14 patterns.
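The threshold shows up in a crude experiment (a sketch, not a careful phase-boundary measurement): store $P$ patterns and measure the fraction of bits that flip when each stored pattern is updated once.

```python
import numpy as np

N = 200
rng = np.random.default_rng(4)

def bit_error_rate(P):
    """Fraction of stored-pattern bits that flip under one update."""
    xi = rng.choice([-1, 1], size=(P, N))
    W = (xi.T @ xi) / N
    np.fill_diagonal(W, 0.0)
    H = xi @ W                         # local fields at each stored pattern
    return np.mean(np.sign(H) != xi)

below = bit_error_rate(int(0.05 * N))  # P/N = 0.05: patterns stay put
above = bit_error_rate(int(0.50 * N))  # P/N = 0.50: crosstalk wins
```

Well below capacity the error rate is essentially zero; well above it, a substantial fraction of bits flip immediately and retrieval breaks down.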
Why Capacity Is Limited
The bottleneck is crosstalk. Each additional stored pattern adds noise to the local field. When $P$ grows relative to $N$, the crosstalk overwhelms the signal, creating spurious attractors---stable states that do not correspond to any stored pattern. These are "chimeric" mixtures of multiple patterns.
The Modern Continuous Hopfield Network
In 2020, Ramsauer et al. proposed a radical upgrade: replace the quadratic energy with a log-sum-exp.
Continuous States and the New Energy
Instead of binary states, we allow $\mathbf{x} \in \mathbb{R}^d$. The stored patterns are continuous vectors $\boldsymbol{\xi}^1, \ldots, \boldsymbol{\xi}^P \in \mathbb{R}^d$. The energy function becomes:

$$E(\mathbf{x}) = -\frac{1}{\beta} \log \sum_{\mu=1}^{P} \exp\!\left(\beta\, {\boldsymbol{\xi}^\mu}^\top \mathbf{x}\right) + \frac{1}{2} \|\mathbf{x}\|^2 + \text{const},$$
where $\beta > 0$ is an inverse temperature parameter.
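Evaluating this energy in code needs the usual max-subtraction trick for a stable log-sum-exp. A sketch (dropping the $\mathbf{x}$-independent constants, which do not affect the minima):

```python
import numpy as np

def modern_energy(Xi, x, beta=1.0):
    """E(x) = -(1/beta) log sum_mu exp(beta xi_mu . x) + 0.5 ||x||^2."""
    s = beta * (Xi @ x)                # similarities to all stored patterns
    m = s.max()                        # stabilised log-sum-exp
    lse = m + np.log(np.exp(s - m).sum())
    return -lse / beta + 0.5 * (x @ x)

rng = np.random.default_rng(5)
d, P = 16, 8
Xi = rng.standard_normal((P, d))       # continuous patterns, one per row

E0 = modern_energy(Xi, Xi[0])          # energy evaluated at a stored pattern
```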
The Update Rule Is Attention
Setting $\nabla_\mathbf{x} E = 0$ and solving, the fixed-point update becomes:

$$\mathbf{x}^{\text{new}} = \Xi^\top \text{softmax}\!\left(\beta\, \Xi\, \mathbf{x}\right),$$

where $\Xi \in \mathbb{R}^{P \times d}$ stacks the stored patterns as rows and the softmax is taken over the $P$ pattern scores.
Now compare with standard softmax attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V.$$
Setting $K = V = \Xi$ (stored patterns), $Q = \mathbf{x}$ (query), and $\beta = 1/\sqrt{d_k}$, the modern Hopfield update is exactly the attention mechanism.
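The correspondence can be verified in a few lines (NumPy sketch with a single query vector; the batched matrix version is identical in structure):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(6)
d, P = 64, 10
Xi = rng.standard_normal((P, d))       # stored patterns as rows (keys = values)
x = rng.standard_normal(d)             # query state

# Modern Hopfield update with beta = 1/sqrt(d_k)
beta = 1.0 / np.sqrt(d)
hopfield_out = Xi.T @ softmax(beta * (Xi @ x))

# Single-query scaled dot-product attention with Q = x, K = V = Xi
attention_out = softmax((x @ Xi.T) / np.sqrt(d)) @ Xi
```

The two outputs agree to machine precision: the Hopfield retrieval step and the attention step are the same computation.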
Exponential Capacity
The log-sum-exp energy creates much sharper minima than the quadratic energy. The resulting storage capacity is:

$$P_{\max} \sim \exp\!\left(\frac{d}{2}\right),$$
which is exponential in the pattern dimension $d$. For $d = 64$, this is astronomically larger than the classical $0.14N$ bound.
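A sketch of what this buys in practice: with $d = 64$ we can store thousands of unit-norm random patterns, far beyond any linear-in-dimension regime, and still retrieve one in a single update. The inverse temperature $\beta = 32$ below is an illustrative choice, not a tuned value.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(7)
d, P, beta = 64, 5000, 32.0            # far more patterns than dimensions
Xi = rng.standard_normal((P, d))
Xi /= np.linalg.norm(Xi, axis=1, keepdims=True)  # unit-norm patterns

target = Xi[123]
query = target + 0.05 * rng.standard_normal(d)   # mildly noisy query

# One modern Hopfield update
retrieved = Xi.T @ softmax(beta * (Xi @ query))
cosine = (retrieved @ target) / np.linalg.norm(retrieved)
```

One update already puts almost all the softmax mass on the correct pattern, so `retrieved` is nearly parallel to `target` despite 5,000 competing memories in only 64 dimensions.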
Summary and Preview
| Property | Classical (1982) | Modern (2020) |
|---|---|---|
| State space | Binary $\{-1, +1\}^N$ | Continuous $\mathbb{R}^d$ |
| Update | $\text{sign}(W\mathbf{x})$ | $\Xi^\top \text{softmax}(\beta\, \Xi \mathbf{x})$ |
| Learning | Hebbian outer products | Direct pattern storage |
| Capacity | $\sim 0.14N$ (linear) | $\sim \exp(d/2)$ (exponential) |
| Connection | --- | = Softmax Attention |
In Part 2, we implement both networks from scratch in pure PyTorch and verify these mathematical properties empirically.