Deconstructing TinyGPT: Part 1 - The Math Behind a Working GPT

Introduction

A complete decoder-only Transformer, the architecture family behind GPT-2, GPT-3, and LLaMA, fits in approximately 150 lines of PyTorch. Much of what separates today's largest models from GPT-2 is parameter count, training data, and infrastructure rather than a change to the core architecture.

This three-part series builds one such Transformer from first principles, trains it on synthetic three-digit addition, and analyses the per-position learning trajectory. Part 1 covers the math.

Causal Self-Attention as a Soft Dictionary Lookup

Each token's representation $x_t \in \mathbb{R}^d$ is projected into three vectors:

$$q_t = W_q x_t, \qquad k_t = W_k x_t, \qquad v_t = W_v x_t.$$

Conceptually, $q$ asks "what am I looking for?", $k$ advertises "what do I represent?", and $v$ carries the content that gets retrieved.

Attention is then the soft generalisation of a dictionary lookup:

$$\text{attn}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_\text{head}}}\right) V.$$

The softmax produces a probability distribution over positions. Each output token is a weighted sum of the value vectors of all positions, where the weights are determined by how well each key matches the query. A hard dictionary lookup is the limit where one weight is 1 and the rest are 0.
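The soft-lookup picture can be made concrete for a single query. The following is a minimal pure-Python sketch, not the series' implementation; the helper names `softmax` and `attend` and the toy keys and values are assumptions chosen for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over lists of key/value vectors."""
    d_head = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_head)
              for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output coordinate at a time.
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return weights, output

# Hypothetical toy "dictionary": three keys, each with a one-dimensional value.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]

# A mild query blends all three values; a large-magnitude query in the same
# direction concentrates almost all weight on the best-matching key.
soft_weights, soft_out = attend([1.0, 0.0], keys, values)
sharp_weights, sharp_out = attend([50.0, 0.0], keys, values)
```

With the larger query the weight on the first key is essentially 1, so the output is essentially the first value: the hard-lookup limit described above.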

Why $\sqrt{d_\text{head}}$?

If $q, k \in \mathbb{R}^{d_\text{head}}$ have independent zero-mean, unit-variance entries, then $q \cdot k = \sum_{i=1}^{d_\text{head}} q_i k_i$ is a sum of $d_\text{head}$ independent terms, each with variance 1, so it has variance $d_\text{head}$ and typical magnitude $\sqrt{d_\text{head}}$. As $d_\text{head}$ grows, the dot products grow in magnitude. After softmax, large logits push almost all probability onto a single position; the softmax saturates, its gradients become vanishingly small, and the layer becomes hard to train.

Dividing by $\sqrt{d_\text{head}}$ brings the variance of the logits back to roughly 1 regardless of head size, which keeps softmax in a regime where multiple positions retain non-negligible probability. This is the fix Vaswani et al. adopted.
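The variance argument is easy to check numerically. This sketch assumes standard-normal entries (one distribution that satisfies the zero-mean, unit-variance condition); the function name `dot_variance`, the sample count, and the seed are illustrative choices.

```python
import random
import statistics

def dot_variance(d_head, n_samples=2000, scale=False, seed=0):
    """Empirical variance of q.k for random Gaussian q and k of size d_head."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_head)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_head)]
        dot = sum(a * b for a, b in zip(q, k))
        # Optionally apply the 1/sqrt(d_head) scaling used in attention.
        dots.append(dot / d_head ** 0.5 if scale else dot)
    return statistics.pvariance(dots)

raw_var = dot_variance(64)                 # expected to land near d_head = 64
scaled_var = dot_variance(64, scale=True)  # expected to land near 1
```

Because both calls use the same seed, the scaled estimate is the raw estimate divided by 64; both are sample estimates and fluctuate by a few percent around the theoretical values.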

Causal Masking

For a decoder-only language model, each token must only see itself and earlier tokens — otherwise the model could cheat by attending to the very token it is trying to predict. We enforce this by setting the upper-triangular part of the attention logit matrix to $-\infty$:

$$\text{logits}_{t, t'} \leftarrow -\infty \quad \text{whenever } t' > t.$$

After softmax, those entries become $0$, so future positions contribute nothing to the weighted sum at position $t$.
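The masking rule can be sketched on a small logit matrix. This is a pure-Python illustration with made-up logits, not the vectorised version a PyTorch implementation would use; `causal_softmax` is a hypothetical helper name.

```python
import math

def causal_softmax(logits):
    """Row-wise softmax of a square logit matrix with future positions masked."""
    n = len(logits)
    weights = []
    for t in range(n):
        # Position t may attend to positions 0..t; later columns are set to -inf.
        masked = [logits[t][s] if s <= t else -math.inf for s in range(n)]
        m = max(masked)  # finite, because the diagonal entry is never masked
        exps = [math.exp(x - m) for x in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical logits for a three-token sequence.
logits = [[0.5, 2.0, -1.0],
          [1.0, 0.0, 3.0],
          [0.2, 0.2, 0.2]]
attn = causal_softmax(logits)
```

The first row can only attend to itself, so it becomes exactly `[1.0, 0.0, 0.0]`, even though the unmasked logits favoured the second position.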

Multi-Head Attention

Instead of running attention once with one large head, we run $H$ heads in parallel, each operating on a $d_\text{head} = d / H$ subspace:

$$\text{MHA}(x) = W_o \, \text{concat}\!\left( \text{attn}_1(x), \ldots, \text{attn}_H(x) \right).$$

Each head learns its own $W_q, W_k, W_v$. The motivation is empirical: different heads specialise to different relational patterns (relative position, syntactic dependency, identity copy, etc.). The total parameter count is unchanged, but the representational variety is larger.
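The claim that splitting into heads leaves the parameter count unchanged follows from $d_\text{head} = d / H$, and can be checked with a short count. The sketch below ignores bias terms and assumes each head's $W_q, W_k, W_v$ map from $d$ to $d_\text{head}$ dimensions; the function name and the choice $d = 128$ are illustrative.

```python
def attention_projection_params(d_model, n_heads):
    """Count weights in W_q, W_k, W_v (all heads) plus W_o, ignoring biases."""
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads
    per_head = 3 * d_model * d_head   # W_q, W_k, W_v for a single head
    output_proj = d_model * d_model   # W_o acts on the concatenated heads
    return n_heads * per_head + output_proj

# The same total for every head count: 3*d^2 for the projections plus d^2 for W_o.
counts = {h: attention_projection_params(128, h) for h in (1, 2, 4, 8)}
```

Every entry of `counts` equals $4 d^2 = 65536$, since $H \cdot 3 d \cdot (d/H) = 3 d^2$ whatever the value of $H$.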

Pre-Norm vs Post-Norm

The original Transformer used post-norm: $x \leftarrow \text{LN}(x + \text{Sublayer}(x))$. GPT-2 switched to pre-norm: $x \leftarrow x + \text{Sublayer}(\text{LN}(x))$.

The difference is consequential. In pre-norm, the residual stream is never normalised, only the inputs to each sublayer are. This means the variance of the residual stream grows additively with depth, but each sublayer sees a normalised, well-conditioned input — and gradient signals flow straight back through the residual path without being attenuated by LayerNorm.

Empirically, pre-norm models are easier to optimise at depth: they tend to tolerate little or no learning-rate warmup and stay stable as layers are added, whereas post-norm models typically need warmup and careful tuning of learning rate and initialisation as depth grows. Most modern LLMs use pre-norm or a close variant.
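The two update rules differ only in where the normalisation sits, which a few lines make explicit. This is a toy sketch: `layer_norm` omits the learned scale and shift, and `toy_sublayer` is a hypothetical stand-in for attention or an MLP, chosen only so the loop has something to apply.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale or shift: zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm_step(x, sublayer):
    # x <- LN(x + Sublayer(x)): the residual stream itself is normalised.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_step(x, sublayer):
    # x <- x + Sublayer(LN(x)): only the sublayer input is normalised.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or an MLP."""
    return [0.5 * v for v in reversed(x)]

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

x_pre = x_post = [1.0, -2.0, 0.5, 3.0]
for _ in range(8):  # eight stacked "layers"
    x_pre = pre_norm_step(x_pre, toy_sublayer)
    x_post = post_norm_step(x_post, toy_sublayer)
```

After the loop, `rms(x_post)` is pinned at about 1 because every layer renormalises the stream, while `rms(x_pre)` has grown, since each layer adds to a stream that is never rescaled.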

Weight Tying

The token embedding $E \in \mathbb{R}^{V \times d}$ maps a token id to a vector. The unembedding (lm head) $U \in \mathbb{R}^{d \times V}$ maps a vector back to logits over the vocabulary. Setting $U = E^\top$, that is, reusing the same matrix transposed, halves the combined parameter count of these two matrices ($Vd$ instead of $2Vd$) and in practice typically costs little or nothing in quality. We adopt this in our implementation.
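A small sketch shows what tying means operationally: the same matrix serves as the lookup table on the way in and as the scoring matrix on the way out. The helper names and the tiny 3-token, 2-dimensional matrix are assumptions for illustration only.

```python
def embed(E, token_id):
    """Input side: look up the embedding row for a token id."""
    return E[token_id]

def tied_logits(E, h):
    """Output side with U = E^T: one dot product of h with each embedding row."""
    return [sum(e_i * h_i for e_i, h_i in zip(row, h)) for row in E]

def io_param_count(vocab_size, d_model, tied):
    """Weights in the embedding plus the lm head, with or without tying."""
    matrices = 1 if tied else 2
    return matrices * vocab_size * d_model

# Hypothetical embedding matrix: V = 3 tokens, d = 2 dimensions, unit-norm rows.
E = [[1.0, 0.0],
     [0.0, 1.0],
     [0.6, 0.8]]
h = embed(E, 2)
logits = tied_logits(E, h)
```

Feeding a token's own embedding straight back through the tied head gives that token the largest logit here, because each row has unit norm and the dot product of a row with itself is the largest.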

Why Synthetic Addition?

The Transformer is a very general architecture. A toy task lets us see whether the architecture (and our implementation) is correct without the confounds of language modelling. Three-digit addition is well-defined with a clean ground truth, a small vocabulary, and a non-trivial internal dependency structure — the carry chain. Training to 100% accuracy in under two minutes is a strong sanity check for the entire pipeline.

The output-reversal trick

A subtle point: if we ask the model to emit the sum left-to-right (e.g., "472" for $127+345$), the leftmost answer digit depends on carries from every column to its right, and none of those columns has been emitted yet. The model would have to resolve the entire carry chain internally before producing its first answer token. By reversing the output ("2740" for $127+345$ with the sum padded to four digits), the units digit comes first and depends only on the units of $A$ and $B$. Each subsequent digit depends only on the matching digits of $A$ and $B$ and on the carry out of the column just emitted. The dependency now runs in the same direction as generation, which is precisely the shape a causal Transformer is designed to model.

Without this trick, the same architecture struggles to exceed roughly $25\%$ test accuracy on this task. With it, the architecture converges to $100\%$ in tens of seconds. This is the deeper lesson: architectures matter less than how you frame the problem for them.
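The reversed target and the column-by-column dependency it exposes can be sketched as follows. The prompt layout `"127+345="` is an assumed format for illustration; the text above specifies only the four-digit padding and the reversal.

```python
def format_example(a, b, width=4):
    """Return (prompt, target): the target is the zero-padded sum, reversed."""
    total = str(a + b).zfill(width)
    return f"{a:03d}+{b:03d}=", total[::-1]

def add_units_first(a, b, width=4):
    """Emit the sum one digit at a time, units first, carrying one value forward."""
    digits, carry = [], 0
    for _ in range(width):
        # Each column needs only the current digits of a and b plus the carry.
        column = a % 10 + b % 10 + carry
        digits.append(str(column % 10))
        carry = column // 10
        a, b = a // 10, b // 10
    return "".join(digits)

prompt, target = format_example(127, 345)  # ("127+345=", "2740")
```

`add_units_first` produces the same string as the reversed target while looking only at one column and one carry per step, which is the left-to-right computation the reversed format asks the model to learn.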

Summary

Attention is a soft dictionary lookup: query/key dot products produce weights; values get summed.
Scaled dot-product, causal masking, multi-head splitting, and pre-norm residuals are the four design choices that distinguish modern GPTs.
Position embeddings inject order. Weight tying halves the parameter cost of the I/O layers.
The addition task is a clean stress test that exposes whether the architecture and its supervision are aligned — as the reversed-output trick demonstrates.

Part 2 implements all of this in 150 lines of PyTorch.
