Back to Projects

Deconstructing Tokenization

Byte-pair encoding in 80 lines of pure Python — compression curves, the linguistic structure BPE recovers from frequency alone, and the failure mode behind every weird LLM behaviour.

Part 1

Why Byte-Pair Encoding Wins

Character-level vs word-level vs subword, why byte-level eliminates OOV, the greedy-merge schedule, and why tokenization is the silent foundation under every LLM.

Part 2

BPE in Eighty Lines

A single class with train/encode/decode. No external libraries beyond collections.Counter.
View Code on GitHub

Part 3

Compression & Failure Modes

Compression curves at vocab 300–2000, the linguistic structure BPE recovers, and the failure mode: 116 characters of code collapsed into ONE token at merge #200.