Hopfield networks are energy-based models where memory storage and retrieval emerge from the physics of energy minimization. In this first installment of a three-part series, we develop the mathematical foundations: the energy function, Hebbian learning, storage capacity bounds, and the modern continuous extension that connects directly to Transformer attention. The full implementation waits for Part 2; here we focus on the math that makes it all work.
Energy-Based Models and Associative Memory
The central premise of a Hopfield network is deceptively simple: memories are energy minima.
Consider a system of $N$ binary neurons, each taking values in $\{-1, +1\}$. The state of the network is a vector $\mathbf{x} \in \{-1, +1\}^N$. We define an energy function over states:

$$E(\mathbf{x}) = -\frac{1}{2} \mathbf{x}^\top W \mathbf{x} = -\frac{1}{2} \sum_{i \neq j} W_{ij}\, x_i x_j,$$
where $W \in \mathbb{R}^{N \times N}$ is a symmetric weight matrix with zero diagonal ($W_{ii} = 0$).
The key insight: if we can construct $W$ such that the stored patterns $\{\boldsymbol{\xi}^1, \ldots, \boldsymbol{\xi}^P\}$ correspond to local minima of $E$, then retrieval becomes gradient descent on the energy landscape. Starting from a corrupted query, the dynamics roll downhill until they settle into the nearest minimum---recovering the stored memory.
This is content-addressable memory: you retrieve by content (a partial or noisy version of the memory), not by address.
Hebbian Learning: Constructing the Weight Matrix
How do we construct $W$ so that the desired patterns are energy minima? The answer comes from Hebb's rule (1949): "Neurons that fire together, wire together."
Given $P$ patterns $\boldsymbol{\xi}^\mu \in \{-1, +1\}^N$ for $\mu = 1, \ldots, P$, the Hebbian weight matrix is:

$$W_{ij} = \frac{1}{N} \sum_{\mu=1}^{P} \xi^\mu_i \xi^\mu_j \quad (i \neq j), \qquad W_{ii} = 0.$$
The diagonal is zeroed to prevent self-reinforcement.
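As a concrete sketch (NumPy here purely for illustration; Part 2 does this properly in PyTorch), the Hebbian construction is a single outer-product sum:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 5                          # neurons, stored patterns
xi = rng.choice([-1, 1], size=(P, N))  # random bipolar patterns, one per row

# Hebbian rule: W_ij = (1/N) * sum_mu xi^mu_i xi^mu_j, with zero diagonal
W = (xi.T @ xi) / N
np.fill_diagonal(W, 0.0)
```

Here `xi.T @ xi` sums the outer products $\boldsymbol{\xi}^\mu (\boldsymbol{\xi}^\mu)^\top$ over all $P$ patterns in one matrix multiply; the result is symmetric by construction.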
Why This Works
Consider the "local field" acting on neuron $i$ when the network is in state $\boldsymbol{\xi}^\nu$ (one of the stored patterns):

$$h_i = \sum_j W_{ij}\, \xi^\nu_j = \frac{1}{N} \sum_{j \neq i} \sum_{\mu=1}^{P} \xi^\mu_i \xi^\mu_j \xi^\nu_j.$$
For the term $\mu = \nu$ (the "signal"):

$$\frac{1}{N} \sum_{j \neq i} \xi^\nu_i \left(\xi^\nu_j\right)^2 = \frac{N-1}{N}\, \xi^\nu_i \approx \xi^\nu_i.$$
When patterns are random and uncorrelated, the crosstalk term (from $\mu \neq \nu$) has mean zero and variance $\sim P/N$. As long as $P/N$ is small, the signal dominates and $\text{sign}(h_i) = \xi^\nu_i$---the pattern is a fixed point.
The Update Rule
Retrieval uses asynchronous updates. At each step, a neuron $i$ is selected (randomly or sequentially) and updated:

$$x_i \leftarrow \text{sign}\!\left(\sum_j W_{ij}\, x_j\right).$$
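A minimal retrieval loop, sketched in NumPy (the proper implementation comes in Part 2): corrupt a stored pattern, then apply asynchronous updates until a full sweep produces no change.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 200, 5
xi = rng.choice([-1, 1], size=(P, N))
W = (xi.T @ xi) / N
np.fill_diagonal(W, 0.0)

# Corrupt pattern 0 by flipping 15% of its bits
x = xi[0].copy()
flips = rng.choice(N, size=int(0.15 * N), replace=False)
x[flips] *= -1

# Asynchronous updates: sweep neurons in random order until a full sweep
# passes with no change, i.e. we have reached a fixed point
changed = True
while changed:
    changed = False
    for i in rng.permutation(N):
        new = 1 if W[i] @ x >= 0 else -1
        if new != x[i]:
            x[i] = new
            changed = True

overlap = (x @ xi[0]) / N              # 1.0 means exact recovery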
Energy Always Decreases
A crucial property: each asynchronous update can only decrease (or maintain) the energy. The change in energy when neuron $i$ is updated from $x_i$ to $x_i' = \text{sign}(h_i)$ is:

$$\Delta E = -\left(x_i' - x_i\right) h_i \le 0,$$

because $x_i'$ shares the sign of $h_i$, so $(x_i' - x_i)\, h_i \ge 0$.
Since the energy is bounded below and non-increasing under the dynamics, the network must converge to a fixed point (a local minimum of $E$).
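A quick numerical check of monotonicity (again a NumPy sketch): track $E$ along an asynchronous trajectory starting from a random state.

```python
import numpy as np

def energy(W, x):
    # E(x) = -1/2 x^T W x  (the diagonal of W is zero)
    return -0.5 * (x @ W @ x)

rng = np.random.default_rng(3)
N, P = 100, 3
xi = rng.choice([-1, 1], size=(P, N))
W = (xi.T @ xi) / N
np.fill_diagonal(W, 0.0)

x = rng.choice([-1, 1], size=N)        # random initial state
energies = [energy(W, x)]
for _ in range(3):                     # a few asynchronous sweeps
    for i in rng.permutation(N):
        x[i] = 1 if W[i] @ x >= 0 else -1
        energies.append(energy(W, x))

# Energy never increases along the trajectory
monotone = all(b <= a + 1e-9 for a, b in zip(energies, energies[1:]))
```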
Storage Capacity
The fundamental question: how many patterns can a Hopfield network reliably store?
The Classical Bound
The answer, established by Amit, Gutfreund, and Sompolinsky (1985), is:

$$P_{\max} \approx \frac{N}{2 \ln N}$$

for perfect retrieval. In practice, if we allow a small fraction of bit errors, the capacity is approximately:

$$P_{\max} \approx 0.138\, N.$$
This is linear in $N$. For a network with $N = 100$ neurons, the practical capacity is roughly 13--14 patterns.
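The threshold shows up in a crude experiment (a sketch, not a careful phase-boundary measurement): store $P$ patterns and measure the fraction of bits that flip when each stored pattern is updated once.

```python
import numpy as np

N = 200
rng = np.random.default_rng(4)

def bit_error_rate(P):
    """Fraction of stored-pattern bits that flip under one update."""
    xi = rng.choice([-1, 1], size=(P, N))
    W = (xi.T @ xi) / N
    np.fill_diagonal(W, 0.0)
    H = xi @ W                         # local fields at each stored pattern
    return np.mean(np.sign(H) != xi)

below = bit_error_rate(int(0.05 * N))  # P/N = 0.05: patterns stay put
above = bit_error_rate(int(0.50 * N))  # P/N = 0.50: crosstalk wins
```

Well below capacity the error rate is essentially zero; well above it, a substantial fraction of bits flip immediately and retrieval breaks down.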
Why Capacity Is Limited
The bottleneck is crosstalk. Each additional stored pattern adds noise to the local field. When $P$ grows relative to $N$, the crosstalk overwhelms the signal, creating spurious attractors---stable states that do not correspond to any stored pattern. These are "chimeric" mixtures of multiple patterns.
The Modern Continuous Hopfield Network
In 2020, Ramsauer et al. proposed a radical upgrade: replace the quadratic energy with a log-sum-exp.
Continuous States and the New Energy
Instead of binary states, we allow $\mathbf{x} \in \mathbb{R}^d$. The stored patterns are continuous vectors $\boldsymbol{\xi}^1, \ldots, \boldsymbol{\xi}^P \in \mathbb{R}^d$. The energy function becomes:

$$E(\mathbf{x}) = -\frac{1}{\beta} \log \sum_{\mu=1}^{P} \exp\!\left(\beta\, {\boldsymbol{\xi}^\mu}^\top \mathbf{x}\right) + \frac{1}{2} \|\mathbf{x}\|^2 + \text{const},$$
where $\beta > 0$ is an inverse temperature parameter.
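Evaluating this energy in code needs the usual max-subtraction trick for a stable log-sum-exp. A sketch (dropping the $\mathbf{x}$-independent constants, which do not affect the minima):

```python
import numpy as np

def modern_energy(Xi, x, beta=1.0):
    """E(x) = -(1/beta) log sum_mu exp(beta xi_mu . x) + 0.5 ||x||^2."""
    s = beta * (Xi @ x)                # similarities to all stored patterns
    m = s.max()                        # stabilised log-sum-exp
    lse = m + np.log(np.exp(s - m).sum())
    return -lse / beta + 0.5 * (x @ x)

rng = np.random.default_rng(5)
d, P = 16, 8
Xi = rng.standard_normal((P, d))       # continuous patterns, one per row

E0 = modern_energy(Xi, Xi[0])          # energy evaluated at a stored pattern
```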
The Update Rule Is Attention
Setting $\nabla_\mathbf{x} E = 0$ and solving, the fixed-point update becomes:

$$\mathbf{x}^{\text{new}} = \Xi^\top \text{softmax}\!\left(\beta\, \Xi\, \mathbf{x}\right),$$

where $\Xi \in \mathbb{R}^{P \times d}$ stacks the stored patterns as rows and the softmax is taken over the $P$ pattern scores.
Now compare with standard softmax attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V.$$
Setting $K = V = \Xi$ (stored patterns), $Q = \mathbf{x}$ (query), and $\beta = 1/\sqrt{d_k}$, the modern Hopfield update is exactly the attention mechanism.
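The correspondence can be verified in a few lines (NumPy sketch with a single query vector; the batched matrix version is identical in structure):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(6)
d, P = 64, 10
Xi = rng.standard_normal((P, d))       # stored patterns as rows (keys = values)
x = rng.standard_normal(d)             # query state

# Modern Hopfield update with beta = 1/sqrt(d_k)
beta = 1.0 / np.sqrt(d)
hopfield_out = Xi.T @ softmax(beta * (Xi @ x))

# Single-query scaled dot-product attention with Q = x, K = V = Xi
attention_out = softmax((x @ Xi.T) / np.sqrt(d)) @ Xi
```

The two outputs agree to machine precision: the Hopfield retrieval step and the attention step are the same computation.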
Exponential Capacity
The log-sum-exp energy creates much sharper minima than the quadratic energy. The resulting storage capacity is:

$$P_{\max} \sim \exp\!\left(\frac{d}{2}\right),$$
which is exponential in the pattern dimension $d$. For $d = 64$, this is astronomically larger than the classical $0.14N$ bound.
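A sketch of what this buys in practice: with $d = 64$ we can store thousands of unit-norm random patterns, far beyond any linear-in-dimension regime, and still retrieve one in a single update. The inverse temperature $\beta = 32$ below is an illustrative choice, not a tuned value.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(7)
d, P, beta = 64, 5000, 32.0            # far more patterns than dimensions
Xi = rng.standard_normal((P, d))
Xi /= np.linalg.norm(Xi, axis=1, keepdims=True)  # unit-norm patterns

target = Xi[123]
query = target + 0.05 * rng.standard_normal(d)   # mildly noisy query

# One modern Hopfield update
retrieved = Xi.T @ softmax(beta * (Xi @ query))
cosine = (retrieved @ target) / np.linalg.norm(retrieved)
```

One update already puts almost all the softmax mass on the correct pattern, so `retrieved` is nearly parallel to `target` despite 5,000 competing memories in only 64 dimensions.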
Summary and Preview
| Property | Classical (1982) | Modern (2020) |
|---|---|---|
| State space | Binary $\{-1, +1\}^N$ | Continuous $\mathbb{R}^d$ |
| Update | $\text{sign}(W\mathbf{x})$ | $\Xi^\top \text{softmax}(\beta\, \Xi \mathbf{x})$ |
| Learning | Hebbian outer products | Direct pattern storage |
| Capacity | $\sim 0.14N$ (linear) | $\sim \exp(d/2)$ (exponential) |
| Connection | --- | = Softmax Attention |
In Part 2, we implement both networks from scratch in pure PyTorch and verify these mathematical properties empirically.