Deconstructing TinyGPT from Scratch

A working decoder-only Transformer in about 150 lines of PyTorch: the same basic architecture that large GPT models scale up, trained here to 100% test accuracy on 3-digit addition.

Part 1

The Math Behind a Working GPT

Attention as a soft dictionary lookup, scaled dot-product, causal masking, multi-head splitting, pre-norm vs post-norm, weight tying, and the output-reversal trick that makes addition causal.
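
As a rough illustration of the output-reversal trick: carries in addition flow from the ones column upward, so writing the sum least-significant digit first lets each generated digit depend only on the input and on digits already emitted. The sketch below assumes a string format like `457+168=5260` (operands zero-padded to three digits, sum zero-padded to four and reversed); the function name `format_example` and the exact format are assumptions for illustration, not necessarily what the project uses.

```python
def format_example(a: int, b: int, reverse: bool = True) -> str:
    """Render a + b as one training string (format is an assumption)."""
    total = f"{a + b:04d}"      # a 3-digit plus 3-digit sum fits in 4 digits
    if reverse:
        total = total[::-1]     # least-significant digit first
    return f"{a:03d}+{b:03d}={total}"


print(format_example(457, 168))                 # 457+168=5260
print(format_example(457, 168, reverse=False))  # 457+168=0625
```

With the reversed target, the first digit the model must produce is the ones digit, which needs no carry information from digits it has not generated yet.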

Part 2

Pure PyTorch Implementation

Building the attention module, the Transformer block, the full TinyGPT, autoregressive generation, and the synthetic addition dataset — all in ~150 lines.
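
For a sense of scale, here is a minimal sketch of a causal multi-head self-attention module of the kind this part builds. It is not the project's code: the class name `CausalSelfAttention`, the fused `qkv` projection, and the `max_len` argument are assumptions made for illustration.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Multi-head scaled dot-product attention with a causal mask (sketch)."""

    def __init__(self, d_model: int, n_heads: int, max_len: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # queries, keys, values in one matmul
        self.proj = nn.Linear(d_model, d_model)      # output projection
        # Lower-triangular mask: position t may attend only to positions <= t.
        mask = torch.tril(torch.ones(max_len, max_len, dtype=torch.bool))
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        q, k, v = (
            t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
            for t in (q, k, v)
        )
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        scores = scores.masked_fill(~self.mask[:T, :T], float("-inf"))
        weights = F.softmax(scores, dim=-1)          # each row sums to 1 over visible keys
        out = (weights @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)


attn = CausalSelfAttention(d_model=16, n_heads=4, max_len=8)
print(attn(torch.randn(2, 8, 16)).shape)
```

The mask is stored as a buffer so it moves with the module across devices without being treated as a trainable parameter.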

Part 3

Learning Curriculum and Attention

100% test accuracy in 98 seconds with 200K parameters. The per-position learning curriculum exposes the carry-chain bottleneck, and attention maps show column-by-column addition emerging.
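
One way a per-position view of learning could be measured is to score each output digit position separately across the test set. The sketch below assumes predictions and targets are equal-length digit strings in the reversed output order; the function name `per_position_accuracy` and the toy strings are hypothetical and are not results from the project.

```python
def per_position_accuracy(predictions, targets):
    """Fraction of examples with the correct digit at each output position.

    Assumes every prediction and target string has the same length.
    """
    if not targets:
        return []
    width = len(targets[0])
    correct = [0] * width
    for pred, tgt in zip(predictions, targets):
        for i in range(width):
            correct[i] += pred[i] == tgt[i]   # True counts as 1
    return [c / len(targets) for c in correct]


# Made-up toy data, only to show the call shape.
targets = ["5260", "8991", "0100"]
preds = ["5260", "8901", "0100"]
print(per_position_accuracy(preds, targets))
```

Tracking these per-position numbers over training steps is what would make it visible whether digits that depend on longer carry chains are learned later than the others.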