
Deconstructing Hopfield Networks from Scratch

Part 3: Memory Retrieval vs Attention

Abstract

In the final installment of this three-part series, we present empirical results from our from-scratch Hopfield Network implementations. We measure classical retrieval accuracy under corruption, map the storage capacity curve, demonstrate the modern network's exponential capacity advantage, and provide a numerical proof that the modern Hopfield update is identical to softmax attention. All results are from actual benchmark runs.

Experimental Setup

All experiments use PyTorch 2.5.1 on CPU with fixed random seed (42) for reproducibility. We test:

  1. Classical retrieval: 7 patterns in 100 neurons, 25% bit corruption
  2. Classical capacity: 1--30 patterns in 100 neurons, 20% corruption, 5 trials each
  3. Modern retrieval: 20 patterns in 64 dimensions, 30% Gaussian noise
  4. Modern capacity: 5--500 patterns in 64 dimensions, single-step attention retrieval
  5. Attention equivalence: Hopfield update vs. manual softmax attention

Total runtime: approximately 10 seconds.

Classical Hopfield: Retrieval Accuracy

We stored 7 random binary patterns in a 100-neuron network and corrupted each by flipping 25 of 100 bits.

Key observations:

With only 7 patterns in 100 neurons ($P/N = 0.07$), we are well below the capacity limit. The signal-to-noise ratio in the local field is high enough for error-free retrieval.
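The experiment above can be sketched in a few lines. This uses NumPy for brevity rather than the series' PyTorch implementation, and the helper names (`retrieve`, etc.) are illustrative, not the actual code:

```python
import numpy as np

rng = np.random.default_rng(42)
N, P = 100, 7

# Store P random bipolar patterns with the Hebbian outer-product rule
patterns = rng.choice([-1, 1], size=(P, N))
W = (patterns.T @ patterns) / N
np.fill_diagonal(W, 0)  # no self-connections

def retrieve(x, steps=20):
    """Synchronous sign updates until a fixed point (or step limit)."""
    for _ in range(steps):
        x_new = np.where(W @ x >= 0, 1, -1)
        if np.array_equal(x_new, x):
            break
        x = x_new
    return x

# Corrupt 25 of 100 bits, then retrieve
target = patterns[0]
probe = target.copy()
probe[rng.choice(N, size=25, replace=False)] *= -1
accuracy = (retrieve(probe) == target).mean()
```

At this load the corrupted probe falls well inside the target's basin of attraction, so the update dynamics recover it.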

Classical Hopfield: Storage Capacity

We systematically increased the number of stored patterns from 1 to 30, using 20% corruption and averaging over 5 trials per count.

Patterns Stored    Mean Accuracy
1--9               100.0%
10                 98.6%
13                 ~95% (threshold)
15                 91.5%
20                 83.7%
25                 73.4%
30                 71.2%

Analysis

The capacity at 95% accuracy is approximately 13 patterns, giving a ratio of $P/N = 0.130$. This aligns closely with the theoretical bound of $\sim 0.14N$.

The degradation is smooth, not catastrophic. At 30 patterns, the network still recovers about 71% of bits correctly---it is not random ($\approx 50\%$), suggesting partial retrieval even beyond capacity.
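The capacity sweep amounts to a loop over pattern counts; a minimal NumPy sketch following the setup above (20% corruption, 5 trials per count, with contrived helper names):

```python
import numpy as np

rng = np.random.default_rng(42)
N, trials, n_flip = 100, 5, 20  # 20% corruption, 5 trials per count

def mean_accuracy(P):
    """Average bit-recovery accuracy when P patterns are stored."""
    accs = []
    for _ in range(trials):
        pats = rng.choice([-1, 1], size=(P, N))
        W = (pats.T @ pats) / N
        np.fill_diagonal(W, 0)
        x = pats[0].copy()
        x[rng.choice(N, size=n_flip, replace=False)] *= -1
        for _ in range(20):  # synchronous updates to (near) fixed point
            x_new = np.where(W @ x >= 0, 1, -1)
            if np.array_equal(x_new, x):
                break
            x = x_new
        accs.append((x == pats[0]).mean())
    return float(np.mean(accs))

# Accuracy stays near 100% well below ~0.14N and degrades smoothly above it
below = mean_accuracy(5)
above = mean_accuracy(30)
```

The smooth (rather than catastrophic) degradation is visible directly: accuracy drifts down as crosstalk between stored patterns grows.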

Modern Hopfield: Continuous Retrieval

We stored 20 random continuous patterns in 64 dimensions and tested retrieval with 30% additive Gaussian noise.

All 10 tested queries correctly retrieved the target pattern with cosine similarity 1.0000 (to four decimal places). The modern Hopfield network, using iterative softmax updates with $\beta = 1.0$, achieves perfect pattern discrimination even under substantial noise.
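The continuous retrieval step is the update $\mathbf{x}^{\text{new}} = \Xi^\top \text{softmax}(\beta \Xi \mathbf{x})$; a NumPy sketch of this experiment (illustrative, not the series' exact PyTorch harness):

```python
import numpy as np

rng = np.random.default_rng(42)
d, P, beta = 64, 20, 1.0
Xi = rng.standard_normal((P, d))  # rows are stored continuous patterns

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def modern_retrieve(x, steps=3):
    """Iterate the softmax update; it converges in very few steps."""
    for _ in range(steps):
        x = Xi.T @ softmax(beta * (Xi @ x))
    return x

target = Xi[0]
probe = target + 0.3 * rng.standard_normal(d)  # 30% additive Gaussian noise
out = modern_retrieve(probe)
cos = out @ target / (np.linalg.norm(out) * np.linalg.norm(target))
```

Because random Gaussian patterns in 64 dimensions have squared norms near 64 while cross-pattern dot products are much smaller, the softmax puts essentially all its weight on the target, and one or two steps suffice.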

Modern Hopfield: Exponential Capacity

The most striking result: we tested the modern network's capacity by storing up to 500 patterns in 64 dimensions.

Patterns Stored    Retrieval Accuracy
5                  100.0%
10                 100.0%
20                 100.0%
50                 100.0%
100                100.0%
200                100.0%
500                100.0%

100% accuracy at every tested pattern count, up to 500 patterns in 64 dimensions.

Compare with the classical network: 100 neurons can reliably store ~13 patterns. The modern network in 64 dimensions stores at least 500 patterns with no degradation. The theoretical capacity is exponential: $P_{\max} \sim \exp(d/2) = \exp(32) \approx 8 \times 10^{13}$.
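The capacity probe reduces to checking, for each query, that a single softmax step lands nearest the intended pattern. A sketch, where "correct" means the retrieved vector's best match among the stored patterns is the target (a judgment call; the series' exact success criterion may differ):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 64

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def retrieval_accuracy(P, noise=0.3, n_queries=10, beta=1.0):
    """Fraction of noisy queries whose single-step retrieval hits the target."""
    Xi = rng.standard_normal((P, d))
    correct = 0
    for _ in range(n_queries):
        i = int(rng.integers(P))
        probe = Xi[i] + noise * rng.standard_normal(d)
        out = Xi.T @ softmax(beta * (Xi @ probe))  # single-step retrieval
        correct += int(np.argmax(Xi @ out) == i)   # nearest stored pattern
    return correct / n_queries

acc = {P: retrieval_accuracy(P) for P in (5, 50, 500)}
```

Even at 500 patterns the score gap between the target and its nearest competitor is tens of nats, so the softmax retrieval is effectively exact.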

Why Exponential?

The log-sum-exp energy creates exponentially many well-separated minima in $\mathbb{R}^d$. Intuitively: random patterns in high dimensions are almost surely nearly orthogonal, so each stored pattern dominates the softmax within its own neighborhood, and the number of directions that can remain well separated from one another grows exponentially with $d$.
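For reference, the energy in question can be written in the form used in the modern Hopfield literature (with inverse temperature $\beta$ and stored patterns $\boldsymbol{\xi}_i$ as the rows of $\Xi$; the quadratic regularizer keeps the minima bounded):

$$ E(\mathbf{x}) = -\frac{1}{\beta} \log \sum_{i=1}^{P} \exp\!\left(\beta\, \boldsymbol{\xi}_i^\top \mathbf{x}\right) + \frac{1}{2}\|\mathbf{x}\|^2 $$

A single softmax update $\mathbf{x}^{\text{new}} = \Xi^\top \text{softmax}(\beta \Xi \mathbf{x})$ is one descent step on this energy, which is why retrieval converges so quickly.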

Attention Equivalence: Numerical Proof

We computed both the Hopfield single-step update and standard softmax attention with identical inputs:

Metric                             Value
Max absolute difference            $0.00 \times 10^{0}$
Mean absolute difference           $0.00 \times 10^{0}$
Cosine similarity                  0.9999999404
Identical (tolerance $10^{-6}$)    True

The two computations produce numerically identical results. The tiny deviation in cosine similarity from 1.0 is within floating-point precision ($\sim 10^{-7}$).
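The check itself is only a few lines: compute the Hopfield step, then the same quantity spelled as single-query attention with $Q = \mathbf{x}$, $K = V = \Xi$, and $\beta = 1/\sqrt{d_k}$. A NumPy sketch of the comparison (not the series' exact harness):

```python
import numpy as np

rng = np.random.default_rng(42)
d, P = 64, 20
Xi = rng.standard_normal((P, d))  # stored patterns = keys = values
x = rng.standard_normal(d)        # probe = query
beta = 1.0 / np.sqrt(d)           # matches the 1/sqrt(d_k) attention scaling

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Modern Hopfield single-step update
hopfield = Xi.T @ softmax(beta * (Xi @ x))

# The same computation spelled as softmax attention over keys/values
attention = softmax((Xi @ x) / np.sqrt(d)) @ Xi

max_abs_diff = float(np.abs(hopfield - attention).max())
```

The two lines are algebraically identical rearrangements, so any difference is pure floating-point noise.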

What This Means

  1. Transformers are energy-based models. Each attention head minimizes a modern Hopfield energy function. The key-value pairs are stored patterns; the query is the probe.
  2. Attention is memory retrieval. The softmax operation is not just "weighting by importance"---it is the fixed-point update of an energy-based associative memory.
  3. Multi-head attention = multiple Hopfield memories. Each head stores a different set of patterns (keys/values) and retrieves independently.
  4. Capacity explains head dimension. The exponential capacity $\sim \exp(d_{\text{head}}/2)$ explains why Transformers work well with relatively small head dimensions (64 or 128): even modest dimensions provide enormous storage.

Capacity Comparison: The Full Picture

                        Classical          Modern
Network/Dim             $N = 100$          $d = 64$
Max patterns tested     30                 500
At 13 patterns          ~95%               100%
Theoretical capacity    $\sim 14$          $\sim 10^{14}$
Scaling                 Linear $O(N)$      Exponential $O(\exp(d))$

Connection to Transformers

The chain of equivalences we have established:

$$ \underbrace{\mathbf{x}^{\text{new}} = \Xi^\top \text{softmax}(\beta \Xi \mathbf{x})}_{\text{Modern Hopfield update}} \;=\; \underbrace{V^\top \text{softmax}\!\left(\frac{K \mathbf{q}}{\sqrt{d_k}}\right)}_{\text{Softmax Attention (single query } \mathbf{q}\text{)}} $$

This means: the stored-pattern matrix $\Xi$ plays the role of both the keys $K$ and the values $V$, the probe $\mathbf{x}$ is the query, and the inverse temperature $\beta$ corresponds to the attention scaling $1/\sqrt{d_k}$.

This perspective offers new ways to analyze Transformers: attention head capacity, energy landscape visualization, and memory interference between stored patterns (analogous to crosstalk in classical Hopfield networks).

Conclusion

Across three parts, we have:

  1. Derived the mathematics of associative memory from energy functions and Hebbian learning.
  2. Implemented both classical and modern Hopfield networks from scratch in pure PyTorch.
  3. Demonstrated empirically that:
    • Classical Hopfield achieves perfect retrieval below capacity ($\sim 0.14N$).
    • Modern Hopfield stores 500+ patterns in 64 dimensions with 100% accuracy.
    • The modern Hopfield update is numerically identical to softmax attention.

The deep insight is that attention is not just a mechanism; it is the equilibrium dynamics of an energy-based memory system. Understanding this connection opens new avenues for designing, analyzing, and improving Transformer architectures.