Deconstructing TinyGPT from Scratch
A working decoder-only Transformer in about 150 lines of PyTorch: the same decoder-only design behind the GPT family of models, trained here to 100% test accuracy on 3-digit addition.
Part 1
The Math Behind a Working GPT
Attention as a soft dictionary lookup, scaled dot-product, causal masking, multi-head splitting, pre-norm vs post-norm, weight tying, and the output-reversal trick that makes addition causal.
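As a rough preview of the first three ideas, here is a minimal sketch of single-head scaled dot-product attention with a causal mask. The function name `causal_attention` and the `(batch, seq_len, d_k)` tensor layout are assumptions made for illustration, not necessarily what the series uses.

```python
import math
import torch

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: tensors of shape (batch, seq_len, d_k) (assumed layout).
    Returns the attended values and the attention weights.
    """
    d_k = q.size(-1)
    seq_len = q.size(-2)
    # Soft dictionary lookup: every query is compared against every key.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, T, T)
    # Causal mask: position i may only look at positions j <= i.
    allowed = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device)
    )
    scores = scores.masked_fill(~allowed, float("-inf"))
    # Softmax turns the scores into lookup weights that sum to 1 per query.
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights
```

Dividing by `sqrt(d_k)` keeps the score variance roughly independent of the head dimension, so the softmax does not saturate as `d_k` grows. Because the first position can only attend to itself, its output is exactly its own value vector.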
Part 2
Pure PyTorch Implementation
Building the attention module, the Transformer block, the full TinyGPT, autoregressive generation, and the synthetic addition dataset — all in ~150 lines.
View Code on GitHub
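To make the synthetic addition dataset and the output-reversal idea concrete, here is a small sketch in plain Python. The string format (zero-padded operands, `+` and `=` separators, a fixed-width reversed sum) and the names `encode_sum` and `make_example` are assumptions for illustration; the actual encoding in the series may differ.

```python
import random

def encode_sum(a, b, n_digits=3):
    """Format one addition problem as text, with the answer digits reversed.

    Assumed format: zero-padded operands and a sum padded to n_digits + 1
    characters, written least-significant digit first.
    """
    total = str(a + b).zfill(n_digits + 1)
    return f"{a:0{n_digits}d}+{b:0{n_digits}d}={total[::-1]}"

def make_example(rng, n_digits=3):
    """Sample two random operands and encode them as one training string."""
    a = rng.randrange(10 ** n_digits)
    b = rng.randrange(10 ** n_digits)
    return encode_sum(a, b, n_digits)

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(make_example(rng))
```

With this encoding, `encode_sum(123, 989)` yields `123+989=2111`, which is the sum 1112 written backwards. Emitting the ones digit first means each later digit depends only on operand digits and on carries from digits the model has already produced, which fits left-to-right generation.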
Part 3
Learning Curriculum and Attention
100% test accuracy in 98 seconds with 200K parameters. The per-position learning curriculum exposes the carry-chain bottleneck. Attention maps show column-by-column addition emerging.
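One way to observe a per-position curriculum is to track accuracy separately for each output digit. The helper below is a hypothetical sketch, assuming predictions and targets are equal-width digit strings; it is not taken from the series code.

```python
def per_position_accuracy(preds, targets):
    """Fraction of examples that are correct at each output position.

    preds, targets: lists of strings; every target is assumed to have the
    same length. A prediction that is too short counts as wrong at the
    positions it is missing.
    """
    n_pos = len(targets[0])
    correct = [0] * n_pos
    for pred, target in zip(preds, targets):
        for i in range(n_pos):
            if i < len(pred) and pred[i] == target[i]:
                correct[i] += 1
    return [c / len(targets) for c in correct]
```

Logging this list at regular intervals during training would show which digit positions are learned first and which lag behind.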