A complete decoder-only Transformer, the architecture family behind GPT-2, GPT-3, LLaMA, and (as far as is publicly known) GPT-4 and most other modern large language models, fits in approximately 150 lines of PyTorch. Much of what separates today's largest models from GPT-2 is parameter count, training data, and infrastructure rather than the core architecture.
This three-part series builds one such Transformer from first principles, trains it on synthetic three-digit addition, and analyses the per-position learning trajectory. Part 1 covers the math.
Causal Self-Attention as a Soft Dictionary Lookup
Each token's representation $x_t \in \mathbb{R}^d$ is projected into three vectors, a query, a key, and a value:

$$q_t = W_q x_t, \qquad k_t = W_k x_t, \qquad v_t = W_v x_t$$
Conceptually, $q$ asks "what am I looking for?", $k$ advertises "what do I represent?", and $v$ carries the content that gets retrieved.
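Attention is then the soft generalisation of a dictionary lookup:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_\text{head}}}\right) V$$

Here the rows of $Q$, $K$, and $V$ stack the per-token queries, keys, and values.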
The softmax produces a probability distribution over positions. Each output token is a weighted sum of the value vectors of all positions, where the weights are determined by how well each key matches the query. A hard dictionary lookup is the limit where one weight is 1 and the rest are 0.
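The soft-versus-hard lookup contrast can be made concrete with a small sketch. The helper name `soft_lookup` and the toy keys and values are illustrative assumptions, not part of the series' implementation.

```python
import math
import torch

def soft_lookup(q, K, V):
    """Return the attention-weighted sum of the rows of V for a single query q."""
    d_head = q.shape[-1]
    scores = K @ q / math.sqrt(d_head)        # one score per stored key, shape (T,)
    weights = torch.softmax(scores, dim=-1)   # probability distribution over positions
    return weights @ V, weights

# Three orthogonal keys, each paired with a distinct value vector (toy data).
K = torch.eye(3)
V = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# A query weakly aligned with key 1 returns a blend of all three values.
soft_out, soft_w = soft_lookup(torch.tensor([0.0, 1.0, 0.0]), K, V)

# Scaling the same query up sharpens the softmax toward a hard lookup of V[1].
hard_out, hard_w = soft_lookup(torch.tensor([0.0, 100.0, 0.0]), K, V)

print(soft_w, soft_out)
print(hard_w, hard_out)
```

With the small query the best-matching key receives the largest weight but the other values still leak into the output; with the large query the result is numerically indistinguishable from `V[1]`.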
Why $\sqrt{d_\text{head}}$?
If $q, k \in \mathbb{R}^{d_\text{head}}$ have independent, zero-mean, unit-variance entries, then $q \cdot k = \sum_{i=1}^{d_\text{head}} q_i k_i$ has mean zero and variance $d_\text{head}$, so typical dot products grow in magnitude like $\sqrt{d_\text{head}}$. After softmax, large logits push almost all probability onto a single position. The gradient through softmax then vanishes for the other positions, and the layer becomes hard to train.
Dividing by $\sqrt{d_\text{head}}$ brings the variance of the logits back to $1$ under these assumptions, which keeps softmax in a regime where multiple positions retain non-negligible probability. It is a one-line fix, and the one Vaswani et al. adopted.
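The variance argument is easy to check numerically. The following is a minimal sketch that assumes standard-normal entries for $q$ and $k$; the sample size and the three head widths are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
n_samples = 20_000
variances = {}

for d_head in (16, 64, 256):
    q = torch.randn(n_samples, d_head)   # i.i.d. zero-mean, unit-variance entries
    k = torch.randn(n_samples, d_head)
    raw = (q * k).sum(dim=-1)            # unscaled dot products, one per sample
    scaled = raw / d_head ** 0.5         # scaled as in attention
    variances[d_head] = (raw.var().item(), scaled.var().item())

for d_head, (raw_var, scaled_var) in variances.items():
    print(f"d_head={d_head:4d}  raw var={raw_var:8.2f}  scaled var={scaled_var:5.2f}")
```

The raw variance should track `d_head`, while the scaled variance should stay near 1 regardless of head width.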
Causal Masking
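For a decoder-only language model, each token must only see itself and earlier tokens. Otherwise the model could cheat by attending to the very token it is trying to predict. Writing $S = QK^\top / \sqrt{d_\text{head}}$ for the attention logit matrix, we enforce this by setting its strictly upper-triangular part to $-\infty$:

$$\tilde{S}_{ij} = \begin{cases} S_{ij} & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$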
After softmax, those entries become $0$, so future positions contribute nothing to the weighted sum at position $t$.
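A minimal sketch of the masking step, using random numbers as a stand-in for the real attention logits; the sequence length `T = 4` is an arbitrary choice for illustration.

```python
import torch

torch.manual_seed(0)
T = 4
logits = torch.randn(T, T)   # stand-in for Q K^T / sqrt(d_head)

# True on and below the diagonal: position t may attend to positions <= t.
allowed = torch.tril(torch.ones(T, T, dtype=torch.bool))

# Future positions get -inf, so softmax assigns them exactly zero weight.
masked_logits = logits.masked_fill(~allowed, float("-inf"))
weights = torch.softmax(masked_logits, dim=-1)

print(weights)
```

Every row still sums to 1, and the first row puts all of its weight on position 0, since that token has nothing earlier to attend to.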
Multi-Head Attention
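Instead of running attention once with one large head, we run $H$ heads in parallel, each operating on a $d_\text{head} = d / H$ subspace:

$$\text{head}_h = \text{Attention}(Q_h, K_h, V_h), \qquad \text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_H)\, W_o$$

Here $Q_h$, $K_h$, and $V_h$ are the queries, keys, and values of head $h$, and $W_o$ is the output projection used by Vaswani et al.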
Each head learns its own $W_q, W_k, W_v$. The motivation is empirical: different heads tend to specialise in different relational patterns (relative position, syntactic dependency, identity copy, etc.). Compared with a single head of width $d$, the total parameter count is unchanged, but the representational variety is larger.
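One way to sketch the head split in PyTorch is shown below. The class name `MultiHeadCausalAttention`, the bias-free projections, and the output projection `w_o` are assumptions made for illustration; Part 2's implementation may differ in these details.

```python
import math
import torch
import torch.nn as nn

class MultiHeadCausalAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One d_model x d_model projection each for q, k, v. The heads are
        # slices of these matrices, so the parameter count does not depend on H.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)  # output projection (assumed)

    def _split(self, x):
        # (B, T, d_model) -> (B, H, T, d_head)
        B, T, _ = x.shape
        return x.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x):
        B, T, d_model = x.shape
        q = self._split(self.w_q(x))
        k = self._split(self.w_k(x))
        v = self._split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)    # (B, H, T, T)
        allowed = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~allowed, float("-inf"))         # causal mask
        weights = torch.softmax(scores, dim=-1)
        out = weights @ v                                            # (B, H, T, d_head)
        out = out.transpose(1, 2).contiguous().view(B, T, d_model)   # concatenate heads
        return self.w_o(out)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

torch.manual_seed(0)
x = torch.randn(2, 5, 32)
one_head = MultiHeadCausalAttention(32, 1)
four_heads = MultiHeadCausalAttention(32, 4)
y = four_heads(x)
print(y.shape, n_params(one_head), n_params(four_heads))
```

Both modules hold the same number of parameters; the only difference is how the projected vectors are partitioned before the attention weights are computed.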
Pre-Norm vs Post-Norm
The original Transformer used post-norm: $x \leftarrow \text{LN}(x + \text{Sublayer}(x))$. GPT-2 switched to pre-norm: $x \leftarrow x + \text{Sublayer}(\text{LN}(x))$.
The difference is consequential. In pre-norm, the residual stream is never normalised, only the inputs to each sublayer are. This means the variance of the residual stream grows additively with depth, but each sublayer sees a normalised, well-conditioned input — and gradient signals flow straight back through the residual path without being attenuated by LayerNorm.
Empirically, pre-norm models tend to train stably at much greater depth, and with less sensitivity to learning-rate warmup, than post-norm models, which typically need careful tuning as depth grows. Most modern LLMs use pre-norm or a close variant.
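The two residual layouts differ by a single line of code. The sketch below wraps a small MLP as the sublayer; the class names, the MLP shape, the depth of 24, and the width of 64 are illustrative assumptions, and the comparison is done at initialisation only, without any training.

```python
import torch
import torch.nn as nn

class PostNormBlock(nn.Module):
    """x <- LN(x + Sublayer(x)), as in the original Transformer."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.ln(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    """x <- x + Sublayer(LN(x)), as in GPT-2."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.ln(x))

def mlp(d_model):
    # Stand-in sublayer; an attention layer would slot in the same way.
    return nn.Sequential(
        nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
    )

torch.manual_seed(0)
d_model, depth = 64, 24
x = torch.randn(8, d_model)

pre = nn.Sequential(*[PreNormBlock(d_model, mlp(d_model)) for _ in range(depth)])
post = nn.Sequential(*[PostNormBlock(d_model, mlp(d_model)) for _ in range(depth)])

with torch.no_grad():
    pre_std = pre(x).std(dim=-1).mean().item()    # residual stream is never renormalised
    post_std = post(x).std(dim=-1).mean().item()  # the last LayerNorm resets the scale

print(f"pre-norm output std:  {pre_std:.2f}")
print(f"post-norm output std: {post_std:.2f}")
```

The post-norm stack ends on a LayerNorm, so its output scale stays near 1, while the pre-norm stack accumulates each sublayer's contribution on top of an untouched residual path.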
Weight Tying
The token embedding $E \in \mathbb{R}^{V \times d}$ maps a token id to a vector. The unembedding (lm head) $U \in \mathbb{R}^{d \times V}$ maps a vector back to logits over the vocabulary. Setting $U = E^\top$, that is, reusing the same matrix transposed, halves the parameter count of these two matrices and typically works about as well in practice. We adopt this in our implementation.
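In PyTorch, tying amounts to pointing both modules at one parameter tensor. This is a minimal sketch; the sizes and the names `tok_emb` and `lm_head` are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 14, 32   # hypothetical sizes for illustration

tok_emb = nn.Embedding(vocab_size, d_model)            # E, stored with shape (V, d)
lm_head = nn.Linear(d_model, vocab_size, bias=False)   # weight also has shape (V, d)

untied = tok_emb.weight.numel() + lm_head.weight.numel()

# nn.Linear computes h @ weight.T, so sharing the (V, d) tensor gives U = E^T.
lm_head.weight = tok_emb.weight

# Count each underlying tensor once, keyed by object identity.
unique = {id(p): p for p in (tok_emb.weight, lm_head.weight)}
tied = sum(p.numel() for p in unique.values())

torch.manual_seed(0)
h = torch.randn(3, d_model)
logits = lm_head(h)            # shape (3, V)

print(untied, tied, logits.shape)
```

Because both modules reference the same tensor, a gradient step on the output logits also updates the input embeddings, and vice versa.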
Why Synthetic Addition?
The Transformer is a very general architecture. A toy task lets us see whether the architecture (and our implementation) is correct without the confounds of language modelling. Three-digit addition is well-defined with a clean ground truth, a small vocabulary, and a non-trivial internal dependency structure — the carry chain. Training to 100% accuracy in under two minutes is a strong sanity check for the entire pipeline.
The output-reversal trick
A subtle point: if we ask the model to emit the sum left-to-right (e.g., "472" for $127+345$), the leftmost answer digit depends on carries from the columns to its right, whose digits the model has not yet emitted. The model must resolve the whole carry chain internally before writing its first output token. By reversing the output ("2740" for $127+345$, with the sum padded to four digits), the units digit comes first and depends only on the units digits of $A$ and $B$. Each subsequent digit depends only on its own column of $A$ and $B$ and on the carry out of the columns already emitted. The dependency now runs in the same direction as generation, which is precisely the shape a causal Transformer is designed to model.
Without this trick, the same architecture struggles to exceed roughly $25\%$ test accuracy on this task. With it, the architecture converges to $100\%$ in tens of seconds. This is the deeper lesson: architectures matter less than how you frame the problem for them.
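The reversed target and the carry chain it follows can be sketched in plain Python. The prompt layout `"127+345="` and the function names are assumptions for illustration; the actual data format used in this series may differ.

```python
def make_example(a, b):
    """Return (prompt, target) with the sum zero-padded to four digits and reversed."""
    total = f"{a + b:04d}"
    return f"{a:03d}+{b:03d}=", total[::-1]

def add_units_first(a, b):
    """Emit the sum digit by digit, units first, passing the carry forward."""
    digits, carry = [], 0
    for _ in range(4):
        a, da = divmod(a, 10)   # peel off the current column of each operand
        b, db = divmod(b, 10)
        carry, digit = divmod(da + db + carry, 10)
        digits.append(str(digit))
    return "".join(digits)

prompt, target = make_example(127, 345)
print(prompt, target)
print(add_units_first(127, 345))
```

`add_units_first` only ever needs the current column and the carry it has already computed, which mirrors the left-to-right order in which the model generates the reversed target.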
Summary
- Attention is a soft dictionary lookup: query/key dot products produce weights; values get summed.
- Scaled dot-product, causal masking, multi-head splitting, and pre-norm residuals are the four design choices that distinguish modern GPTs.
- Position embeddings inject order. Weight tying halves the parameter cost of the I/O layers.
- The addition task is a clean stress test that exposes whether the architecture and its supervision are aligned — as the reversed-output trick demonstrates.
Part 2 implements all of this in roughly 150 lines of PyTorch.