Deconstructing Tokenization: Part 1 — Why BPE Wins

Overview

A neural language model does not consume text. It consumes a sequence of integer ids. The function that turns text into ids — the tokenizer — is trained separately, frozen for the model's lifetime, and shapes everything downstream.

Three Granularities

Character-level: one token per Unicode codepoint. Long sequences, fixed vocabulary, wasted capacity on common patterns.

Word-level: short sequences, lopsided vocabulary, brittle on out-of-vocabulary tokens.

Subword-level: the compromise everyone settled on. Common words become single tokens; rare words decompose. Byte-pair encoding (BPE) is the dominant algorithm.

Byte-Pair Encoding

Algorithm from Gage (1994), originally a data compression paper. Sennrich et al. (2016) adapted it to NMT; GPT-2 made it byte-level in 2019.

Initial vocabulary: the 256 raw bytes.
Tokenize the corpus byte-by-byte.
Repeat until target vocabulary size:
1. Count adjacent token pair frequencies.
2. Pick the most frequent pair $(a, b)$.
3. Assign a new token id, replace every occurrence in the corpus.
4. Record the merge $((a, b) \to \text{new\_id})$.

Inference: replay recorded merges in order on the byte-level representation of new text. Order matters — earlier merges combine into later ones.

Why Byte-Level Beats Character-Level

Every UTF-8 text encodes to bytes losslessly — no OOV problem, ever.
Multilingual text, emoji, mathematical symbols all decompose to bytes uniformly.
The initial vocabulary is always 256 (the bytes), independent of corpus.

Greedy Compression Is Not Language Understanding

BPE is greedy frequency-based compression. It has no concept of words, syntax, or morphology. Patterns it discovers are linguistically meaningful only because we run it on natural language. Train it on repetitive code and it will merge entire lines of code into single tokens — exactly the failure mode Part 3 demonstrates.

The Frozen-Tokenizer Problem

Once trained, BPE merges are fixed. The model is trained on top of those tokens. Changing the tokenizer post hoc would invalidate every learned weight. Tokenization artifacts (failure to count letters, whitespace sensitivity, weird arithmetic) are essentially permanent.