Deconstructing Hopfield Networks from Scratch

Part 3: Memory Retrieval vs Attention

Abstract

Empirical results from the from-scratch implementations in Part 2. We measure classical retrieval accuracy under corruption, map the storage capacity curve, test the modern network's exponential capacity, and verify numerically that the modern Hopfield update matches softmax attention.

Experimental Setup

PyTorch 2.5.1, CPU, seed 42. Five experiments:

  1. Classical retrieval: 7 patterns in 100 neurons, 25% bit corruption
  2. Classical capacity: 1--30 patterns in 100 neurons, 20% corruption, 5 trials each
  3. Modern retrieval: 20 patterns in 64 dimensions, 30% Gaussian noise
  4. Modern capacity: 5--500 patterns in 64 dimensions, single-step attention retrieval
  5. Attention equivalence: Hopfield update vs. manual softmax attention

Total runtime: ~10 seconds.

Classical Hopfield: Retrieval Accuracy

7 random binary patterns stored in a 100-neuron network, each corrupted by flipping 25 of 100 bits.

At $P/N = 0.07$, we are well below capacity. The signal-to-noise ratio in the local field is high enough for error-free retrieval.
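The setup above can be reproduced with a minimal sketch like the following. It uses NumPy rather than the PyTorch code from Part 2, and it assumes Hebbian weights with a zeroed diagonal and synchronous sign updates; the function names are illustrative, not the ones used in the series.

```python
import numpy as np

def store_hebbian(patterns):
    """Hebbian weights W = (1/N) * sum_mu xi_mu xi_mu^T, with zero diagonal (assumed)."""
    n = patterns.shape[1]
    w = patterns.T @ patterns / n
    np.fill_diagonal(w, 0.0)
    return w

def recall(w, state, max_steps=20):
    """Synchronous sign updates until the state stops changing or max_steps is hit."""
    s = state.copy()
    for _ in range(max_steps):
        new_s = np.where(w @ s >= 0, 1.0, -1.0)
        if np.array_equal(new_s, s):
            break
        s = new_s
    return s

def retrieval_accuracy(n=100, p=7, n_flips=25, seed=42):
    """Mean fraction of bits recovered, averaged over all p stored patterns."""
    rng = np.random.default_rng(seed)
    patterns = rng.choice([-1.0, 1.0], size=(p, n))
    w = store_hebbian(patterns)
    accs = []
    for xi in patterns:
        probe = xi.copy()
        flip = rng.choice(n, size=n_flips, replace=False)  # distinct bit positions
        probe[flip] *= -1
        accs.append(np.mean(recall(w, probe) == xi))
    return float(np.mean(accs))

if __name__ == "__main__":
    print(f"mean bit accuracy: {retrieval_accuracy():.3f}")
```

For intuition: with 25 of 100 bits flipped, the probe overlaps the target by 0.5, while the crosstalk from the other six patterns has standard deviation of roughly $\sqrt{6/100} \approx 0.24$, so most bits are corrected in the first update.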

Classical Hopfield: Storage Capacity

Pattern count swept from 1 to 30, 20% corruption, 5 trials per count.

Patterns Stored    Mean Accuracy
1--9               100.0%
10                 98.6%
13                 ~95% (threshold)
15                 91.5%
20                 83.7%
25                 73.4%
30                 71.2%

Analysis

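The 95% accuracy threshold falls at 13 patterns, or $P/N = 0.130$, close to the theoretical capacity of $\approx 0.138N$.

Degradation is smooth, not catastrophic. At 30 patterns the network still recovers ~71% of bits, well above the 50% random baseline. Partial retrieval persists beyond capacity. Note, however, that a probe with 20% corruption already matches 80% of the target bits, so at 25 and 30 patterns the dynamics leave the state further from the stored pattern than the probe was.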
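A capacity sweep of this kind can be sketched as follows. This is a NumPy approximation under stated assumptions (Hebbian weights with zero diagonal, a fixed number of synchronous updates, hypothetical function names), so the exact numbers will differ from the table; only the shape of the curve should be comparable.

```python
import numpy as np

def mean_bit_accuracy(n, p, corruption, trials, rng, steps=20):
    """Average bit accuracy after recall, over `trials` random pattern sets."""
    n_flips = int(round(corruption * n))
    accs = []
    for _ in range(trials):
        patterns = rng.choice([-1.0, 1.0], size=(p, n))
        w = patterns.T @ patterns / n          # Hebbian rule
        np.fill_diagonal(w, 0.0)               # no self-connections (assumed)
        for xi in patterns:
            s = xi.copy()
            flip = rng.choice(n, size=n_flips, replace=False)
            s[flip] *= -1
            for _ in range(steps):             # synchronous sign updates
                s = np.where(w @ s >= 0, 1.0, -1.0)
            accs.append(np.mean(s == xi))
    return float(np.mean(accs))

def capacity_curve(n=100, counts=(1, 5, 10, 13, 15, 20, 25, 30),
                   corruption=0.2, trials=5, seed=42):
    """Map each pattern count to its mean bit accuracy."""
    rng = np.random.default_rng(seed)
    return {p: mean_bit_accuracy(n, p, corruption, trials, rng) for p in counts}

if __name__ == "__main__":
    for p, acc in capacity_curve().items():
        print(f"{p:3d} patterns: {acc:.1%}")
```

A useful reference line when reading the output is `1 - corruption` (0.8 here), the accuracy of the corrupted probe before any update is applied.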

Modern Hopfield: Continuous Retrieval

20 random continuous patterns in 64 dimensions, retrieval with 30% additive Gaussian noise.

All 10 tested queries returned the correct pattern at cosine similarity 1.0000. With $\beta = 1.0$ and iterative softmax updates, the modern network discriminates patterns cleanly even under substantial noise.
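The continuous retrieval test can be sketched as below. The sketch assumes the stored patterns are unnormalized standard-normal vectors and reads "30% Gaussian noise" as additive noise with standard deviation 0.3; both are assumptions, as are the function names, and NumPy stands in for the PyTorch implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def modern_retrieve(patterns, query, beta=1.0, steps=5):
    """Iterate x <- Xi^T softmax(beta * Xi x); rows of `patterns` are the stored patterns."""
    x = query.copy()
    for _ in range(steps):
        x = patterns.T @ softmax(beta * (patterns @ x))
    return x

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def run_modern_retrieval(p=20, d=64, noise_std=0.3, n_queries=10, seed=42):
    """Cosine similarity between each retrieved vector and its target pattern."""
    rng = np.random.default_rng(seed)
    patterns = rng.standard_normal((p, d))     # assumed pattern distribution
    sims = []
    for i in range(n_queries):
        query = patterns[i] + noise_std * rng.standard_normal(d)
        sims.append(cosine(modern_retrieve(patterns, query, beta=1.0), patterns[i]))
    return sims

if __name__ == "__main__":
    print([f"{s:.4f}" for s in run_modern_retrieval()])
```

Under these assumptions the self-overlap of a pattern is about $d = 64$ while overlaps with other patterns have standard deviation about 8, so at $\beta = 1$ the softmax is close to one-hot. If the patterns were normalized to unit length, a much larger $\beta$ would be needed for the same sharpness.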

Modern Hopfield: Exponential Capacity

We pushed the modern network to 500 patterns in 64 dimensions.

Patterns Stored    Retrieval Accuracy
5                  100.0%
10                 100.0%
20                 100.0%
50                 100.0%
100                100.0%
200                100.0%
500                100.0%

100% accuracy at every count, all the way to 500.

For comparison: the classical network with 100 neurons tops out around 13 patterns. The modern network in 64 dimensions stores at least 500 with no degradation. Theoretical capacity: $P_{\max} \sim \exp(d/2) = \exp(32) \approx 8 \times 10^{13}$.
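A single-step capacity check along these lines might look like the following sketch. The pattern distribution (standard normal), the small query noise, the nearest-pattern accuracy criterion, and the function names are all assumptions made for illustration; the source does not specify them.

```python
import math
import numpy as np

def single_step_accuracy(p, d=64, beta=1.0, noise_std=0.1, seed=42):
    """Fraction of queries whose one-step retrieval lands nearest to the right pattern."""
    rng = np.random.default_rng(seed)
    patterns = rng.standard_normal((p, d))
    queries = patterns + noise_std * rng.standard_normal((p, d))
    scores = beta * queries @ patterns.T            # (p, p): row i scores query i
    scores -= scores.max(axis=1, keepdims=True)     # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    retrieved = weights @ patterns                  # one attention-style update
    nearest = np.argmax(retrieved @ patterns.T, axis=1)
    return float(np.mean(nearest == np.arange(p)))

def heuristic_capacity(d):
    """The exp(d/2) scaling quoted in the text, evaluated numerically."""
    return math.exp(d / 2)

if __name__ == "__main__":
    for p in (5, 10, 20, 50, 100, 200, 500):
        print(f"{p:4d} patterns: {single_step_accuracy(p):.1%}")
    print(f"exp(64/2) = {heuristic_capacity(64):.3e}")
```

The sweep stops at 500 patterns, many orders of magnitude below $\exp(32)$, so it shows the absence of degradation in this range rather than testing the exponential bound itself.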

Why Exponential?

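The log-sum-exp energy supports exponentially many well-separated minima in $\mathbb{R}^d$. The softmax in the update weights each stored pattern by $\exp(\beta\, \boldsymbol{\xi}_i^\top \mathbf{x})$, so interference from non-matching patterns is suppressed exponentially instead of adding up as crosstalk the way it does under the classical Hebbian rule.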
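For reference, the log-sum-exp energy as it is usually written for modern Hopfield networks is given below. The notation is an assumption chosen to match the update rule used in this part ($\boldsymbol{\xi}_i$ is the $i$-th stored pattern, i.e. the $i$-th row of $\Xi$, and $P$ is the number of patterns); Part 2 may use different symbols.

$$ E(\mathbf{x}) = -\frac{1}{\beta} \log \sum_{i=1}^{P} \exp\!\left(\beta\, \boldsymbol{\xi}_i^\top \mathbf{x}\right) + \frac{1}{2}\, \mathbf{x}^\top \mathbf{x} + \text{const} $$

Setting the gradient to zero recovers the fixed-point condition of the update:

$$ \nabla_{\mathbf{x}} E = -\Xi^\top \text{softmax}(\beta \Xi \mathbf{x}) + \mathbf{x} = 0 \quad\Longrightarrow\quad \mathbf{x} = \Xi^\top \text{softmax}(\beta \Xi \mathbf{x}) $$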

Attention Equivalence: Numerical Proof

Hopfield single-step update vs. standard softmax attention, same inputs:

Metric                              Value
Max absolute difference             $0.00 \times 10^{0}$
Mean absolute difference            $0.00 \times 10^{0}$
Cosine similarity                   0.9999999404
Identical (tolerance $10^{-6}$)     True

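The two outputs are numerically identical. The reported cosine similarity falls short of 1.0 by about $6 \times 10^{-8}$, a single float32 rounding step (float32 machine epsilon is $\sim 1.2 \times 10^{-7}$). Since the elementwise differences are exactly zero, that deviation comes from rounding in the dot-product and norm computation, not from any difference between the outputs.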
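A comparison of this kind can be sketched by coding the two forms independently: the column-vector Hopfield update on one state at a time, and batched row-convention attention. The sketch assumes $K = V = \Xi$ and $\beta = 1/\sqrt{d_k}$, uses NumPy in float64 rather than PyTorch in float32, and the function names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hopfield_update(patterns, x, beta):
    """x_new = Xi^T softmax(beta * Xi x) for a single state vector x."""
    return patterns.T @ softmax(beta * (patterns @ x))

def attention(q, k, v):
    """Row-convention scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d_k), axis=-1) @ v

def max_abs_difference(p=20, d=64, n_queries=8, seed=42):
    """Largest elementwise gap between the two computations on shared inputs."""
    rng = np.random.default_rng(seed)
    patterns = rng.standard_normal((p, d))
    queries = rng.standard_normal((n_queries, d))
    beta = 1.0 / np.sqrt(d)                      # matches the 1/sqrt(d_k) scaling
    hop = np.stack([hopfield_update(patterns, x, beta) for x in queries])
    att = attention(queries, patterns, patterns)  # keys = values = stored patterns
    return float(np.abs(hop - att).max())

if __name__ == "__main__":
    print(f"max abs difference: {max_abs_difference():.2e}")
```

Because the two code paths order their floating-point operations differently, the gap here need not be exactly zero; it should sit at the level of float64 rounding error.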

Implications

  1. Transformers can be read as energy-based models. Each attention head performs one update step on a modern Hopfield energy. Keys and values play the role of stored patterns; the query is the probe.
  2. Attention is memory retrieval. The softmax is not just "weighting by importance": it is the update rule that drives an associative memory toward its fixed point.
  3. Multi-head attention = multiple Hopfield memories. Each head stores a different pattern set and retrieves independently.
  4. Capacity is consistent with small head dimensions. Exponential capacity $\sim \exp(d_{\text{head}}/2)$ suggests why Transformers work with small head dimensions (64 or 128): even modest $d$ provides enormous storage.

Capacity Comparison: The Full Picture

                        Classical        Modern
Network/Dim             $N = 100$        $d = 64$
Max patterns tested     30               500
Accuracy at 13 patterns ~95%             100%
Theoretical capacity    $\sim 14$        $\sim 8 \times 10^{13}$
Scaling                 Linear $O(N)$    Exponential $O(\exp(d))$

Connection to Transformers

The equivalence chain:

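$$ \underbrace{\mathbf{x}^{\text{new}} = \Xi^\top \text{softmax}(\beta \Xi \mathbf{x})}_{\text{Modern Hopfield update}} \;=\; \underbrace{V^\top \text{softmax}\!\left(\frac{K \mathbf{q}}{\sqrt{d_k}}\right)}_{\text{Softmax Attention}} $$

Concretely: the stored patterns (the rows of $\Xi$) serve as both keys and values, $K = V = \Xi$; the state $\mathbf{x}$ is the query $\mathbf{q}$; and the inverse temperature is $\beta = 1/\sqrt{d_k}$.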

This gives concrete tools for analyzing Transformers: attention head capacity, energy landscape visualization, and memory interference between stored patterns (the continuous analogue of classical crosstalk).
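As one example of such a tool, the energy can be tracked along the update trajectory. The sketch below assumes the standard log-sum-exp energy $E(\mathbf{x}) = -\beta^{-1} \log \sum_i \exp(\beta\, \boldsymbol{\xi}_i^\top \mathbf{x}) + \tfrac{1}{2} \mathbf{x}^\top \mathbf{x}$ with constants dropped, uses a deliberately small $\beta$ so convergence takes more than one step, and uses hypothetical function names.

```python
import numpy as np

def logsumexp(z):
    """Stable log(sum(exp(z))) for a 1-D array."""
    m = z.max()
    return float(m + np.log(np.exp(z - m).sum()))

def energy(patterns, x, beta):
    """Assumed energy: -(1/beta) * logsumexp(beta * Xi x) + 0.5 * x.x (constants dropped)."""
    return -logsumexp(beta * (patterns @ x)) / beta + 0.5 * float(x @ x)

def update(patterns, x, beta):
    """One modern Hopfield step: x <- Xi^T softmax(beta * Xi x)."""
    z = beta * (patterns @ x)
    w = np.exp(z - z.max())
    return patterns.T @ (w / w.sum())

def energy_trace(p=20, d=64, beta=0.05, steps=6, seed=42):
    """Energy at the noisy start and after each of `steps` updates."""
    rng = np.random.default_rng(seed)
    patterns = rng.standard_normal((p, d))
    x = patterns[0] + 0.3 * rng.standard_normal(d)
    trace = [energy(patterns, x, beta)]
    for _ in range(steps):
        x = update(patterns, x, beta)
        trace.append(energy(patterns, x, beta))
    return trace

if __name__ == "__main__":
    for t, e in enumerate(energy_trace()):
        print(f"step {t}: E = {e:.4f}")
```

The trace should be non-increasing, since the update is a concave-convex procedure step on this energy. Plotting it for different values of $\beta$ is a simple way to see how sharply the landscape funnels a query toward a stored pattern.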

Conclusion

Across three parts we:

  1. Derived associative memory from energy functions and Hebbian learning.
  2. Implemented both classical and modern Hopfield networks from scratch in PyTorch.
  3. Showed empirically that:
    • Classical Hopfield retrieves reliably below capacity ($\sim 0.14N$): 100% accuracy up to 9 patterns and ~95% at 13 patterns in 100 neurons.
    • Modern Hopfield stores 500+ patterns in 64 dimensions at 100% accuracy.
    • The modern Hopfield update is numerically identical to softmax attention.

Attention is the equilibrium dynamics of an energy-based memory system. That is not a metaphor---it is a mathematical identity, and it gives us a physics-grounded framework for reasoning about Transformer architectures.