Before 2017, NLP ran on RNNs and LSTMs -- architectures that processed data sequentially. Two bottlenecks followed: training could not be parallelized, and vanishing gradients caused models to lose information over long sequences.
Vaswani et al.'s "Attention Is All You Need" (2017) discarded recurrence entirely in favor of Self-Attention. This post covers the mathematical core: Multi-Head Self-Attention and Positional Encoding.
The Intuition Behind Attention
Imagine reading a sentence. When you look at the word "bank," you subconsciously check the surrounding words (e.g., "river" vs. "money") to determine its context. Self-Attention allows a neural network to do exactly this mathematically. It computes a weighted representation of each word based on its relevance to every other word in the sequence.
The key departure from prior work is that attention operates over the entire sequence simultaneously. An LSTM must compress everything it has seen into a single hidden vector, creating a bottleneck that grows worse as sequences get longer. Self-Attention sidesteps this entirely: every token has a direct connection to every other token, and the model learns which connections matter.
Queries, Keys, and Values
For each word in the sequence, the model projects its embedding into three distinct vectors:
- Query (Q): What information is this word looking for?
- Key (K): What information does this word possess?
- Value (V): What is the actual content/meaning of this word?
The relevance between two words is computed via the dot product of their respective Query and Key vectors.
Concretely, given an input embedding matrix $X \in \mathbb{R}^{n \times d_{model}}$, we obtain these projections through learned weight matrices: $Q = XW^Q$, $K = XW^K$, $V = XW^V$, where each $W$ matrix has shape $d_{model} \times d_k$. The parameter count here is $3 \times d_{model} \times d_k$ per head -- in our implementation with $d_{model} = 64$, that is 12,288 parameters just for the attention projections across 4 heads.
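To make the projections concrete, here is a minimal PyTorch sketch using the $d_{model} = 64$, $d_k = 16$ numbers above. The variable names and the choice of bias-free `nn.Linear` layers are illustrative, not taken from any particular implementation:

```python
import torch
import torch.nn as nn

d_model, d_k = 64, 16                        # the numbers used in this post

# One learned projection matrix per role (bias omitted for clarity)
W_q = nn.Linear(d_model, d_k, bias=False)    # W^Q
W_k = nn.Linear(d_model, d_k, bias=False)    # W^K
W_v = nn.Linear(d_model, d_k, bias=False)    # W^V

x = torch.randn(2, 10, d_model)              # (batch, sequence length, d_model)
Q, K, V = W_q(x), W_k(x), W_v(x)             # each: (2, 10, d_k)
```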
Scaled Dot-Product Attention
Multiplying $Q$ by $K^T$ produces a raw attention score matrix. Large dot products push softmax into saturation, so we scale by $\sqrt{d_k}$. After softmax normalizes the scores into probabilities, we multiply by $V$:

$$ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
Why $\sqrt{d_k}$ specifically? Consider two random vectors of dimension $d_k$ whose components are independent with mean 0 and variance 1. Their dot product has mean 0 and variance $d_k$. As $d_k$ grows, the dot products grow in magnitude, pushing softmax outputs toward one-hot vectors. Dividing by $\sqrt{d_k}$ normalizes the variance back to 1, keeping gradients healthy throughout training. Without this scaling, we observed attention weights collapsing to near-deterministic distributions early in training, starving the model of gradient signal.
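Putting the formula into code, here is a sketch of scaled dot-product attention in PyTorch; the function name and the decision to also return the attention weights are illustrative choices, not prescribed by the paper:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # raw scores, shape (..., n, n)
    weights = F.softmax(scores, dim=-1)                # each row now sums to 1
    return weights @ V, weights

# Example: reuse the Q, K, V from the projection sketch above
# out, attn = scaled_dot_product_attention(Q, K, V)    # out: (2, 10, d_k)
```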
Multi-Head Attention
Instead of computing attention once, the Transformer splits the vectors into multiple "heads." This allows the model to jointly attend to information from different representation subspaces at different positions.
$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O $$

where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$.
Each head operates on a reduced dimensionality of $d_k = d_{model} / h$. With $d_{model} = 64$ and $h = 4$ heads, each head works in a 16-dimensional subspace. One head might learn to attend to syntactic relationships (subject-verb agreement), while another captures semantic similarity. The final linear projection $W^O$ learns to combine these diverse attention patterns into a single unified representation. The total parameter cost of multi-head attention remains identical to single-head attention with the same $d_{model}$ -- the extra representational diversity comes essentially for free.
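Below is a minimal sketch of a multi-head attention module under the assumptions above ($d_{model} = 64$, $h = 4$). The class and variable names are illustrative; it uses one fused projection per role, which is mathematically equivalent to $h$ separate per-head projections:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Sketch: split d_model across h heads, attend per head, recombine."""
    def __init__(self, d_model=64, h=4):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # One fused projection per role == h per-head projections stacked together
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)   # combines the heads

    def forward(self, x):
        b, n, _ = x.shape
        # Project, then reshape to (batch, heads, n, d_k) so heads attend independently
        def split(t):
            return t.view(b, n, self.h, self.d_k).transpose(1, 2)
        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)  # (b, h, n, n)
        out = F.softmax(scores, dim=-1) @ V                     # (b, h, n, d_k)
        out = out.transpose(1, 2).contiguous().view(b, n, self.h * self.d_k)
        return self.W_o(out)                                    # (b, n, d_model)

# mha = MultiHeadAttention()
# y = mha(torch.randn(2, 10, 64))   # y: (2, 10, 64)
```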
Injecting Order: Positional Encodings
Unlike RNNs, the attention mechanism has no inherent notion of order; "A B C" is processed the same as "C B A". To fix this, we inject positional encodings directly into the input embeddings. By combining sine and cosine functions of different frequencies, the encodings give the model access to both absolute and relative positional relationships:
$$ PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}}) $$
$$ PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}}) $$
The choice of sinusoidal functions is deliberate. For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, which means the model can learn relative positions through simple linear transformations. Each dimension $i$ operates at a different frequency, creating a unique signature for every position -- much like how binary digits represent integers, but in a continuous space. The encoding is added (not concatenated) to the input embeddings, so it does not increase the model dimension.
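Here is one possible PyTorch sketch of the sinusoidal table; the function name is illustrative, and the sine-on-even, cosine-on-odd interleaving follows the formulas above:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))             # 10000^(-2i/d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # sine on even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # cosine on odd dimensions
    return pe                                      # (max_len, d_model)

# Added, not concatenated (token_embeddings is a placeholder name):
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, 64)
```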
What's Next
That covers the engine. Part 2 turns these equations into a working PyTorch Encoder-Decoder.