Deconstructing Tokenization
Byte-pair encoding in 80 lines of pure Python: compression curves, the linguistic structure BPE recovers from frequency alone, and the failure mode behind many odd LLM behaviours.
Part 1
Why Byte-Pair Encoding Wins
Character-level vs word-level vs subword, why byte-level input eliminates out-of-vocabulary (OOV) tokens, the greedy-merge schedule, and why tokenization is the silent foundation under every LLM.
Part 2
BPE in Eighty Lines
A single class with train/encode/decode. No dependencies beyond the standard library's collections.Counter.
Part 3
Compression & Failure Modes
Compression curves at vocab 300–2000, the linguistic structure BPE recovers, and the failure mode: 116 characters of code collapsed into ONE token at merge #200.
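For a rough feel of the greedy-merge schedule and the train/encode/decode interface named above, here is a minimal sketch. It is not the series' 80-line class: the function names (train_bpe, merge, encode, decode), the tuple-based merge list, and the rule of stopping once no pair repeats are all assumptions made for illustration.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(data: bytes, num_merges: int):
    """Greedily learn up to `num_merges` merges over raw bytes.

    Returns a list of ((left, right), new_id) in the order they were learned.
    """
    ids = list(data)   # byte-level start: every input maps to ids 0..255
    merges = []
    next_id = 256      # first id after the 256 base byte tokens
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))   # adjacent-pair frequencies
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # assumption: stop when no pair repeats
            break
        ids = merge(ids, pair, next_id)
        merges.append((pair, next_id))
        next_id += 1
    return merges


def encode(text: str, merges):
    """Encode text by replaying the learned merges in order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges:
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand token ids back to bytes, then to text."""
    table = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges:
        table[new_id] = table[left] + table[right]
    return b"".join(table[i] for i in ids).decode("utf-8", errors="replace")
```

Because the base vocabulary is the 256 possible byte values, any UTF-8 string encodes without an unknown token, even if it shares nothing with the training text. Calling train_bpe on a small corpus and then checking decode(encode(s, merges), merges) against s is a quick round-trip test.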