Deconstructing Hopfield Networks: Part 3 - Memory Retrieval vs Attention

Abstract

This part reports empirical results from the from-scratch implementations in Part 2. We measure classical retrieval accuracy under corruption, map the storage capacity curve, test the modern network's exponential capacity, and verify numerically that the modern Hopfield update matches softmax attention.

Experimental Setup

PyTorch 2.5.1, CPU, seed 42. Five experiments:

Classical retrieval: 7 patterns in 100 neurons, 25% bit corruption
Classical capacity: 1--30 patterns in 100 neurons, 20% corruption, 5 trials each
Modern retrieval: 20 patterns in 64 dimensions, 30% Gaussian noise
Modern capacity: 5--500 patterns in 64 dimensions, single-step attention retrieval
Attention equivalence: Hopfield update vs. manual softmax attention

Total runtime: ~10 seconds.

Classical Hopfield: Retrieval Accuracy

7 random binary patterns stored in a 100-neuron network, each corrupted by flipping 25 of 100 bits.

Perfect retrieval: All 7 patterns recovered exactly, zero bit errors.
Convergence: 2--3 asynchronous sweeps to reach a fixed point.
Energy descent: Energy falls from roughly $-11$ at the corrupted state to roughly $-48$ at the stored pattern, about 4x in magnitude.

At $P/N = 0.07$, we are well below capacity. The signal-to-noise ratio in the local field is high enough for error-free retrieval.
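The retrieval experiment can be sketched in a few lines. This is a minimal illustration, not the Part 2 implementation: it uses NumPy rather than PyTorch, assumes a Hebbian outer-product rule scaled by $1/N$ with zero self-connections, and the function names are hypothetical. Exact numbers depend on the random seed and will not match the reported run bit for bit.

```python
import numpy as np

def hebbian_weights(patterns):
    """Hebbian outer-product rule (assumed 1/N scaling, no self-connections)."""
    n = patterns.shape[1]
    w = patterns.T @ patterns / n
    np.fill_diagonal(w, 0.0)
    return w

def energy(w, s):
    """Classical quadratic energy E = -1/2 s^T W s."""
    return float(-0.5 * s @ w @ s)

def retrieve(w, s, rng, max_sweeps=20):
    """Asynchronous updates in random order until a sweep changes no neuron."""
    s = s.copy()
    for sweep in range(1, max_sweeps + 1):
        changed = False
        for i in rng.permutation(len(s)):
            new = 1.0 if w[i] @ s >= 0 else -1.0
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break
    return s, sweep

rng = np.random.default_rng(42)
n_neurons, n_patterns, n_flips = 100, 7, 25
patterns = rng.choice([-1.0, 1.0], size=(n_patterns, n_neurons))
w = hebbian_weights(patterns)

accuracies, energy_pairs = [], []
for p in patterns:
    probe = p.copy()
    flip = rng.choice(n_neurons, size=n_flips, replace=False)
    probe[flip] *= -1  # corrupt 25 of 100 bits
    out, sweeps = retrieve(w, probe, rng)
    accuracies.append(float(np.mean(out == p)))
    energy_pairs.append((energy(w, probe), energy(w, out)))

mean_accuracy = float(np.mean(accuracies))
print(f"mean bit accuracy: {mean_accuracy:.3f}")
```

Because the weights are symmetric with a zero diagonal, each asynchronous flip can only lower the energy or leave it unchanged, which is why the loop is guaranteed to reach a fixed point.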

Classical Hopfield: Storage Capacity

Pattern count swept from 1 to 30, 20% corruption, 5 trials per count.

Patterns Stored	Mean Accuracy
1--9	100.0%
10	98.6%
13	~95% (threshold)
15	91.5%
20	83.7%
25	73.4%
30	71.2%

Analysis

The 95% accuracy threshold falls at 13 patterns, or $P/N = 0.13$, close to the theoretical capacity of $P_{\max} \approx 0.14N$ (about 14 patterns for $N = 100$).

Degradation is smooth, not catastrophic. At 30 patterns the network still recovers ~71% of bits, well above the 50% random baseline. Partial retrieval persists beyond capacity.
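A capacity sweep of this kind can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the Part 2 code: it probes only the first stored pattern in each trial, runs a fixed number of asynchronous sweeps instead of testing for convergence, and `recall_accuracy` is a hypothetical helper name. The resulting curve should show the same qualitative shape as the table, but not the same digits.

```python
import numpy as np

def recall_accuracy(n_neurons, n_patterns, corruption, rng, sweeps=10):
    """Store random patterns, corrupt the first one, return the fraction of bits recovered."""
    patterns = rng.choice([-1.0, 1.0], size=(n_patterns, n_neurons))
    w = patterns.T @ patterns / n_neurons  # Hebbian rule, assumed 1/N scaling
    np.fill_diagonal(w, 0.0)
    target = patterns[0]
    s = target.copy()
    n_flip = int(round(corruption * n_neurons))
    flip = rng.choice(n_neurons, size=n_flip, replace=False)
    s[flip] *= -1
    for _ in range(sweeps):
        for i in rng.permutation(n_neurons):
            s[i] = 1.0 if w[i] @ s >= 0 else -1.0
    return float(np.mean(s == target))

rng = np.random.default_rng(42)
counts = [2, 5, 10, 13, 15, 20, 30]
capacity_curve = {
    p: float(np.mean([recall_accuracy(100, p, 0.20, rng) for _ in range(5)]))
    for p in counts
}
for p, acc in capacity_curve.items():
    print(f"P={p:2d}  P/N={p / 100:.2f}  accuracy={acc:.3f}")
```

With only 5 trials per count the curve is noisy, so the 95% crossing point can shift by a pattern or two between seeds.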

Modern Hopfield: Continuous Retrieval

20 random continuous patterns in 64 dimensions, retrieval with 30% additive Gaussian noise.

All 10 tested queries returned the correct pattern at cosine similarity 1.0000. With $\beta = 1.0$ and iterative softmax updates, the modern network discriminates patterns cleanly even under substantial noise.
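The iterative retrieval can be sketched as below. Several details are assumptions made for illustration: patterns are drawn from a standard normal distribution, "30% noise" is read as additive Gaussian noise with standard deviation 0.3, three update steps are used, and NumPy stands in for the PyTorch code from Part 2.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

def modern_retrieve(patterns, query, beta=1.0, steps=3):
    """Iterate x <- Xi^T softmax(beta * Xi x); one stored pattern per row of `patterns`."""
    x = query
    for _ in range(steps):
        x = patterns.T @ softmax(beta * patterns @ x)
    return x

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
patterns = rng.standard_normal((20, 64))

hits, similarities = [], []
for idx in range(10):
    query = patterns[idx] + 0.3 * rng.standard_normal(64)  # assumed noise model
    out = modern_retrieve(patterns, query)
    sims = [cosine(out, p) for p in patterns]
    hits.append(int(np.argmax(sims)) == idx)
    similarities.append(sims[idx])

print(f"all correct: {all(hits)}, min cosine: {min(similarities):.4f}")
```

With unnormalized Gaussian patterns the score of the correct pattern is close to its squared norm (around 64), far above the scores of the others, so the softmax is nearly one-hot even at $\beta = 1.0$.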

Modern Hopfield: Exponential Capacity

We pushed the modern network to 500 patterns in 64 dimensions.

Patterns Stored	Retrieval Accuracy
5	100.0%
10	100.0%
20	100.0%
50	100.0%
100	100.0%
200	100.0%
500	100.0%

100% accuracy at every count, all the way to 500.

For comparison: the classical network with 100 neurons tops out around 13 patterns. The modern network in 64 dimensions stores at least 500 with no degradation. Theoretical capacity: $P_{\max} \sim \exp(d/2) = \exp(32) \approx 7.9 \times 10^{13}$.
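A single-step capacity sweep along these lines might look like the sketch below. The noise level (standard deviation 0.1), $\beta = 1.0$, the standard normal patterns, and the helper name `single_step_accuracy` are all assumptions for illustration, since the text does not specify them. A query counts as correct when its retrieved vector is closest, by cosine similarity, to the pattern it was generated from.

```python
import numpy as np

def unit_rows(m):
    """Normalize each row to unit length."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def single_step_accuracy(n_patterns, dim, beta, noise, rng):
    """Fraction of noisy queries whose one-step retrieval is nearest to its source pattern."""
    patterns = rng.standard_normal((n_patterns, dim))
    queries = patterns + noise * rng.standard_normal((n_patterns, dim))
    scores = beta * queries @ patterns.T              # shape: (queries, patterns)
    scores -= scores.max(axis=1, keepdims=True)       # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    retrieved = weights @ patterns                    # one update step per query
    nearest = np.argmax(unit_rows(retrieved) @ unit_rows(patterns).T, axis=1)
    return float(np.mean(nearest == np.arange(n_patterns)))

rng = np.random.default_rng(42)
modern_curve = {
    p: single_step_accuracy(p, dim=64, beta=1.0, noise=0.1, rng=rng)
    for p in [5, 10, 20, 50, 100, 200, 500]
}
for p, acc in modern_curve.items():
    print(f"P={p:3d}  accuracy={acc:.3f}")
```

Note that 500 patterns is still tiny compared with the theoretical scale, so a sweep like this confirms the absence of degradation in the tested range rather than probing the capacity limit itself.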

Why Exponential?

The log-sum-exp energy supports exponentially many well-separated minima in $\mathbb{R}^d$:

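The classical quadratic energy is built from a Hebbian weight matrix of rank at most $P$. Crosstalk between stored patterns grows with $P/N$, so only a linear number of patterns remain stable minima.
The log-sum-exp energy places a minimum near each stored pattern, and softmax sharpening at high $\beta$ suppresses interference from the other patterns exponentially, so well-separated patterns keep distinct basins of attraction.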
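For reference, the two energies being contrasted can be written in their commonly used forms. The notation ($W$, $\Xi$ with one stored pattern $\boldsymbol{\xi}^{\mu}$ per row, inverse temperature $\beta$) is assumed to match Parts 1 and 2, which are not reproduced here.

$$E_{\text{classical}}(\mathbf{s}) = -\tfrac{1}{2}\,\mathbf{s}^\top W \mathbf{s}, \qquad W = \frac{1}{N}\sum_{\mu=1}^{P} \boldsymbol{\xi}^{\mu} (\boldsymbol{\xi}^{\mu})^\top$$

$$E_{\text{modern}}(\mathbf{x}) = -\beta^{-1} \log \sum_{\mu=1}^{P} \exp\!\left(\beta\, (\boldsymbol{\xi}^{\mu})^\top \mathbf{x}\right) + \tfrac{1}{2}\,\mathbf{x}^\top \mathbf{x} + \text{const}$$

$$\nabla_{\mathbf{x}} E_{\text{modern}} = \mathbf{x} - \Xi^\top \text{softmax}(\beta\, \Xi\, \mathbf{x})$$

Setting the gradient to zero gives the fixed-point condition $\mathbf{x} = \Xi^\top \text{softmax}(\beta\, \Xi\, \mathbf{x})$, which is exactly the modern update rule applied until it stops changing.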

Attention Equivalence: Numerical Proof

Hopfield single-step update vs. standard softmax attention, same inputs:

Stored patterns: 20 random vectors in $\mathbb{R}^{64}$
$K = V = \Xi$ (pattern matrix), $Q = \mathbf{x}$ (random query)
$\beta = 1/\sqrt{d} = 1/\sqrt{64} = 0.125$

Metric	Value
Max absolute difference	$0.00 \times 10^{0}$
Mean absolute difference	$0.00 \times 10^{0}$
Cosine similarity	0.9999999404
Identical (tolerance $10^{-6}$)	True

Numerically identical. The cosine similarity deviation from 1.0 is within float32 precision ($\sim 10^{-7}$).
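The check itself is short enough to sketch in full. This version uses NumPy in float64 rather than the PyTorch float32 setup reported above, so the difference may come out as a tiny nonzero value instead of exactly zero; the function names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

def hopfield_update(patterns, x, beta):
    """One modern Hopfield step: Xi^T softmax(beta * Xi x)."""
    return patterns.T @ softmax(beta * patterns @ x)

def attention(q, k, v):
    """Single-query softmax attention with the usual 1/sqrt(d_k) scaling."""
    d_k = k.shape[1]
    return softmax(k @ q / np.sqrt(d_k)) @ v

rng = np.random.default_rng(42)
xi = rng.standard_normal((20, 64))  # 20 stored patterns in R^64
x = rng.standard_normal(64)         # random query
beta = 1.0 / np.sqrt(64)            # 0.125

h = hopfield_update(xi, x, beta)
a = attention(x, xi, xi)            # K = V = Xi, Q = x
max_abs_diff = float(np.max(np.abs(h - a)))
print(f"max abs diff: {max_abs_diff:.2e}")
```

The two functions compute the same expression with the terms grouped differently, so any residual difference is floating-point rounding only.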

Implications

Transformers can be read as energy-based models. Each attention head performs one update step on a modern Hopfield energy. Keys and values play the role of stored patterns (after their learned projections); the query is the probe.
Attention is memory retrieval. The softmax is not just "weighting by importance": it is the update rule of an associative memory.
Multi-head attention = multiple Hopfield memories. Each head stores a different pattern set and retrieves independently.
Capacity is consistent with small head dimensions. Exponential capacity $\sim \exp(d_{\text{head}}/2)$ suggests why Transformers work with head dimensions of 64 or 128: even modest $d$ provides enormous storage.

Capacity Comparison: The Full Picture

	Classical	Modern
Network size / dimension	$N = 100$	$d = 64$
Max patterns tested	30	500
At 13 patterns	~95%	100%
Theoretical capacity	$\sim 14$	$\sim 8 \times 10^{13}$
Scaling	Linear $O(N)$	Exponential $O(\exp(d))$

Connection to Transformers

The equivalence chain:

$$\underbrace{\mathbf{x}^{\text{new}} = \Xi^\top \text{softmax}(\beta\, \Xi\, \mathbf{x})}_{\text{Modern Hopfield update}} \;=\; \underbrace{V^\top \text{softmax}\!\left(\frac{K Q}{\sqrt{d_k}}\right)}_{\text{Softmax attention}}$$

with $K = V = \Xi$, $Q = \mathbf{x}$, and $\beta = 1/\sqrt{d_k}$.

Concretely:

Every attention layer in GPT, BERT, or any Transformer performs one step of energy minimization in a Hopfield landscape.
The attention pattern (softmax distribution) is the retrieval weight vector of the Hopfield memory.
Training a Transformer partly amounts to learning which patterns to store in each head's memory.

This gives concrete tools for analyzing Transformers: attention head capacity, energy landscape visualization, and memory interference between stored patterns (the continuous analogue of classical crosstalk).

Conclusion

Across three parts we:

Derived associative memory from energy functions and Hebbian learning.
Implemented both classical and modern Hopfield networks from scratch in PyTorch.
Showed empirically that:
- Classical Hopfield achieves perfect retrieval below capacity ($\sim 0.14N$).
- Modern Hopfield stores 500+ patterns in 64 dimensions at 100% accuracy.
- The modern Hopfield update is numerically identical to softmax attention.

Attention is the equilibrium dynamics of an energy-based memory system. That is not a metaphor---it is a mathematical identity, and it gives us a physics-grounded framework for reasoning about Transformer architectures.
