In the final installment of this three-part series, we present empirical results from our from-scratch Hopfield Network implementations. We measure classical retrieval accuracy under corruption, map the storage capacity curve, demonstrate the modern network's exponential capacity advantage, and provide a numerical proof that the modern Hopfield update is identical to softmax attention. All results are from actual benchmark runs.
Experimental Setup
All experiments use PyTorch 2.5.1 on CPU with fixed random seed (42) for reproducibility. We test:
- Classical retrieval: 7 patterns in 100 neurons, 25% bit corruption
- Classical capacity: 1--30 patterns in 100 neurons, 20% corruption, 5 trials each
- Modern retrieval: 20 patterns in 64 dimensions, 30% Gaussian noise
- Modern capacity: 5--500 patterns in 64 dimensions, single-step attention retrieval
- Attention equivalence: Hopfield update vs. manual softmax attention
Total runtime: approximately 10 seconds.
Classical Hopfield: Retrieval Accuracy
We stored 7 random binary patterns in a 100-neuron network and corrupted each by flipping 25 of 100 bits.
Key observations:
- Perfect retrieval: All 7 patterns recovered exactly, zero bit errors.
- Fast convergence: 2--3 asynchronous sweeps suffice. The network reaches a fixed point rapidly.
- Energy descent: Energy drops by roughly 4$\times$ from the corrupted state to the stored pattern (from approximately $-11$ to $-48$), confirming that stored patterns sit in deep energy minima.
With only 7 patterns in 100 neurons ($P/N = 0.07$), we are well below the capacity limit. The signal-to-noise ratio in the local field is high enough for error-free retrieval.
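The whole procedure fits in a few lines. Below is a minimal NumPy sketch of the idea (the article's actual implementation is in PyTorch, and the exact energies quoted above come from that run; the noise level and sweep limit here mirror the experiment):

```python
import numpy as np

rng = np.random.default_rng(42)
N, P = 100, 7                          # neurons, stored patterns

# Hebbian storage: W = (1/N) * sum of outer products, zero diagonal
patterns = rng.choice([-1, 1], size=(P, N))
W = patterns.T @ patterns / N
np.fill_diagonal(W, 0)

def energy(x):
    return -0.5 * x @ W @ x

def retrieve(x, max_sweeps=10):
    """Asynchronous updates until a fixed point (or the sweep limit)."""
    x = x.copy()
    for _ in range(max_sweeps):
        prev = x.copy()
        for i in rng.permutation(N):   # random update order
            x[i] = 1 if W[i] @ x >= 0 else -1
        if np.array_equal(x, prev):    # fixed point: no bit changed
            break
    return x

# Flip 25% of the bits of a stored pattern, then retrieve
target = patterns[0]
probe = target.copy()
probe[rng.choice(N, size=25, replace=False)] *= -1
recovered = retrieve(probe)
print((recovered == target).mean(), energy(probe), energy(recovered))
```

Because the diagonal of $W$ is zero, each asynchronous flip can only lower the energy, which is why the corrupted state slides monotonically into the stored pattern's minimum.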
Classical Hopfield: Storage Capacity
We systematically increased the number of stored patterns from 1 to 30, using 20% corruption and averaging over 5 trials per count.
| Patterns Stored | Mean Accuracy |
|---|---|
| 1--9 | 100.0% |
| 10 | 98.6% |
| 13 | ~95% (threshold) |
| 15 | 91.5% |
| 20 | 83.7% |
| 25 | 73.4% |
| 30 | 71.2% |
Analysis
The capacity at 95% accuracy is approximately 13 patterns, giving a ratio of $P/N = 0.130$. This aligns closely with the theoretical bound of $\sim 0.14N$.
The degradation is smooth, not catastrophic. At 30 patterns, the network still recovers about 71% of bits correctly---it is not random ($\approx 50\%$), suggesting partial retrieval even beyond capacity.
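The $\sim 0.14N$ scaling can be motivated by a signal-to-noise argument on the local field (a standard heuristic; the precise $0.138N$ constant comes from a more involved statistical-mechanics calculation). For a stored pattern $\xi^\mu$, the field at neuron $i$ splits into a unit signal plus crosstalk from the other patterns:

$$h_i^\mu = \xi_i^\mu + \underbrace{\frac{1}{N}\sum_{\nu \neq \mu}\xi_i^\nu \sum_{j \neq i} \xi_j^\nu \xi_j^\mu}_{\text{crosstalk}\ \approx\ \mathcal{N}(0,\ P/N)}$$

A bit flips when the crosstalk overwhelms the unit signal, which happens with probability $\approx \Phi(-\sqrt{N/P})$. Near $P/N \approx 0.14$ this is only a fraction of a percent per bit, and beyond that point errors begin to compound, consistent with the smooth rather than catastrophic degradation in the table.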
Modern Hopfield: Continuous Retrieval
We stored 20 random continuous patterns in 64 dimensions and tested retrieval with 30% additive Gaussian noise.
All 10 tested queries correctly retrieved the target pattern with cosine similarity 1.0000 (four decimal places). The modern Hopfield network using iterative softmax updates with $\beta = 1.0$ achieves perfect pattern discrimination even with substantial noise.
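The retrieval loop is just the softmax update applied a few times. A NumPy sketch under the same settings (20 patterns, $d = 64$, $\beta = 1.0$; we take "30% noise" to mean additive Gaussian noise with standard deviation 0.3, and the article's implementation uses PyTorch):

```python
import numpy as np

rng = np.random.default_rng(42)
P, d, beta = 20, 64, 1.0

Xi = rng.standard_normal((P, d))                 # stored patterns, one per row
target = Xi[0]
query = target + 0.3 * rng.standard_normal(d)    # 30% additive Gaussian noise

def softmax(z):
    e = np.exp(z - z.max())                      # subtract max for stability
    return e / e.sum()

# Iterate the update x <- Xi^T softmax(beta * Xi x)
x = query
for _ in range(3):
    x = Xi.T @ softmax(beta * (Xi @ x))

cos = x @ target / (np.linalg.norm(x) * np.linalg.norm(target))
print(round(cos, 4))
```

With random Gaussian patterns, the query's inner product with its target ($\approx d = 64$) towers over the cross terms ($\approx \pm\sqrt{d}$), so the softmax collapses onto the target after a single step.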
Modern Hopfield: Exponential Capacity
The most striking result: we tested the modern network's capacity by storing up to 500 patterns in 64 dimensions.
| Patterns Stored | Retrieval Accuracy |
|---|---|
| 5 | 100.0% |
| 10 | 100.0% |
| 20 | 100.0% |
| 50 | 100.0% |
| 100 | 100.0% |
| 200 | 100.0% |
| 500 | 100.0% |
100% accuracy at every tested pattern count, up to 500 patterns in 64 dimensions.
Compare with the classical network: 100 neurons can reliably store ~13 patterns. The modern network in 64 dimensions stores at least 500 patterns with no degradation. The theoretical capacity is exponential: $P_{\max} \sim \exp(d/2) = \exp(32) \approx 8 \times 10^{13}$.
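The sweep itself is a small loop: store $P$ patterns, probe with noisy copies of a sample of them, and count how often the single-step update lands nearest the right pattern. A NumPy sketch (probe count and noise level here are our assumptions about the benchmark; the article's run uses PyTorch):

```python
import numpy as np

rng = np.random.default_rng(42)
d, beta = 64, 1.0

results = {}
for P in (5, 10, 20, 50, 100, 200, 500):
    Xi = rng.standard_normal((P, d))
    n_probes = min(P, 50)                 # sample up to 50 stored patterns
    correct = 0
    for t in range(n_probes):
        q = Xi[t] + 0.3 * rng.standard_normal(d)
        s = beta * (Xi @ q)
        w = np.exp(s - s.max())
        x = Xi.T @ (w / w.sum())          # single-step softmax retrieval
        correct += int(np.argmax(Xi @ x) == t)   # nearest stored pattern wins?
    results[P] = correct / n_probes
print(results)
```

Even at $P = 500$ the target's score ($\approx d$) still dominates the largest of 499 random cross terms ($\approx \sqrt{d}\,\sqrt{2\ln P}$), so every pattern count retrieves perfectly.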
Why Exponential?
The log-sum-exp energy creates exponentially many well-separated minima in $\mathbb{R}^d$. Intuitively:
- The classical quadratic energy has its minima at corners of the hypercube $\{-1,+1\}^N$ aligned with the stored patterns; crosstalk between patterns limits the number of stable minima to $O(N)$.
- The modern log-sum-exp energy has minima near each stored pattern, and the softmax sharpening at high $\beta$ creates exponentially narrow basins of attraction.
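Concretely, writing the energy in its standard log-sum-exp form (stored patterns as the rows of $\Xi$, consistent with the update used throughout this series):

$$E(\mathbf{x}) = -\frac{1}{\beta}\log\sum_{\mu=1}^{P}\exp\!\big(\beta\,\xi_\mu^\top \mathbf{x}\big) + \frac{1}{2}\|\mathbf{x}\|^2, \qquad \mathbf{x} \leftarrow \Xi^\top \operatorname{softmax}(\beta\,\Xi\,\mathbf{x})$$

Raising $\beta$ sharpens the softmax and shrinks each basin around its stored pattern, which is what lets exponentially many minima coexist without merging.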
Attention Equivalence: Numerical Proof
We computed both the Hopfield single-step update and standard softmax attention with identical inputs:
- Stored patterns: 20 random vectors in $\mathbb{R}^{64}$
- $K = V = \Xi$ (pattern matrix), $Q = \mathbf{x}$ (random query)
- $\beta = 1/\sqrt{d} = 1/\sqrt{64} = 0.125$
| Metric | Value |
|---|---|
| Max absolute difference | $0.0$ |
| Mean absolute difference | $0.0$ |
| Cosine similarity | 0.9999999404 |
| Identical (tolerance $10^{-6}$) | True |
The two computations produce numerically identical results. The deviation of the cosine similarity from 1.0 ($\approx 6 \times 10^{-8}$) is float32 rounding in the cosine computation itself, not a difference between the two outputs.
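The comparison can be reproduced in a few lines. A NumPy sketch in float32 (the article's benchmark uses PyTorch tensors, but the algebra is the same; the only difference between the two branches is how the $1/\sqrt{d}$ scaling is written, so any residual difference sits at float rounding level):

```python
import numpy as np

rng = np.random.default_rng(42)
P, d = 20, 64
Xi = rng.standard_normal((P, d)).astype(np.float32)   # stored patterns = K = V
q = rng.standard_normal(d).astype(np.float32)         # query = Q
beta = 1.0 / np.sqrt(d)                               # 0.125

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Modern Hopfield single-step update: Xi^T softmax(beta * Xi x)
hopfield = Xi.T @ softmax(beta * (Xi @ q))

# Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
attention = softmax((q @ Xi.T) / np.sqrt(d)) @ Xi

print(np.max(np.abs(hopfield - attention)))
```

The two expressions are the same formula read in two vocabularies: "energy update" on one side, "attention" on the other.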
What This Means
- Transformers are energy-based models. Each attention head minimizes a modern Hopfield energy function. The key-value pairs are stored patterns; the query is the probe.
- Attention is memory retrieval. The softmax operation is not just "weighting by importance"---it is the fixed-point update of an energy-based associative memory.
- Multi-head attention = multiple Hopfield memories. Each head stores a different set of patterns (keys/values) and retrieves independently.
- Capacity explains head dimension. The exponential capacity $\sim \exp(d_{\text{head}}/2)$ explains why Transformers work well with relatively small head dimensions (64 or 128): even modest dimensions provide enormous storage.
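Plugging the two common head dimensions into the capacity estimate makes the point:

$$P_{\max} \sim e^{d_{\text{head}}/2}: \qquad d_{\text{head}} = 64 \;\Rightarrow\; e^{32} \approx 8 \times 10^{13}, \qquad d_{\text{head}} = 128 \;\Rightarrow\; e^{64} \approx 6 \times 10^{27}.$$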
Capacity Comparison: The Full Picture
| | Classical | Modern |
|---|---|---|
| Network/Dim | $N = 100$ | $d = 64$ |
| Max patterns tested | 30 | 500 |
| At 13 patterns | ~95% | 100% |
| Theoretical capacity | $\sim 14$ | $\sim 10^{14}$ |
| Scaling | Linear $O(N)$ | Exponential $\sim \exp(d/2)$ |
Connection to Transformers
The chain of equivalences we have established: modern Hopfield energy minimization $\to$ single-step softmax update $\to$ scaled dot-product attention.
This means:
- Every attention layer in GPT, BERT, or any Transformer is performing energy minimization in a Hopfield landscape.
- The "attention pattern" (the softmax distribution) is the retrieval weight vector of the Hopfield memory.
- Training a Transformer is, in part, learning which patterns to store in each attention head's Hopfield memory.
This perspective offers new ways to analyze Transformers: attention head capacity, energy landscape visualization, and memory interference between stored patterns (analogous to crosstalk in classical Hopfield networks).
Conclusion
Across three parts, we have:
- Derived the mathematics of associative memory from energy functions and Hebbian learning.
- Implemented both classical and modern Hopfield networks from scratch in pure PyTorch.
- Demonstrated empirically that:
  - Classical Hopfield achieves perfect retrieval below capacity ($\sim 0.14N$).
  - Modern Hopfield stores 500+ patterns in 64 dimensions with 100% accuracy.
  - The modern Hopfield update is numerically identical to softmax attention.
The deep insight is that attention is not just a mechanism: it is the equilibrium dynamics of an energy-based memory system. Understanding this connection opens new avenues for designing, analyzing, and improving Transformer architectures.