Deconstructing Tokenization: Part 3 - Compression and Failure Modes

The Three Questions

With our 80-line BPE tokenizer in hand from Part 2, three questions naturally follow. How well does it compress natural text? What does it actually learn — does it discover anything linguistically meaningful, or just arbitrary byte clusters? And critically, what happens when we feed it the kind of structured, repetitive input that production LLMs see all the time?

This part trains BPE at four vocabulary sizes ($300$, $500$, $1{,}000$, $2{,}000$) on a deliberately mixed corpus: prose, templated sentences, and intentionally repetitive code patterns. The compression curve confirms a universal shape. The first merges confirm that linguistic structure emerges from pure frequency counting. And the corpus's repetitive segment exposes the failure mode that explains a substantial fraction of strange LLM behaviour.

The Corpus

The training corpus is $32{,}945$ characters ($64{,}419$ bytes in UTF-8) assembled from three components, each chosen to expose a different aspect of BPE's behaviour.

The first component is several paragraphs of natural English prose — the kind of writing you would find in a textbook or technical blog post. This is BPE's home turf: where it is expected to perform well and where its learned merges should look linguistically sensible.

The second is $800$ templated sentences generated from a small lexicon: "The clever scientist analyzed the brilliant experiment", "After many years the ancient tree grew through the small river", and so on. These have natural English structure but limited vocabulary, which should let BPE saturate the high-frequency merges quickly.

The third is the trap: seven Python-like code lines (model.parameters(), torch.nn.functional, loss.backward(), etc.) repeated $80$ times in a block. This is included specifically to demonstrate the failure mode — BPE should aggressively compress this repetitive structure in a way that exposes its greedy-compression nature.
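A corpus with this three-part shape could be assembled along the following lines. This is a minimal sketch, not the actual build script: the lexicon, the sentence template, and the prose placeholder are invented for illustration, and only the five code lines named above are used because the remaining two are not listed.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical lexicon and template; the real word lists are not given in the text.
ADJECTIVES = ["clever", "brilliant", "ancient", "small"]
NOUNS = ["scientist", "experiment", "tree", "river"]
VERBS = ["analyzed", "observed", "measured", "described"]


def templated_sentence() -> str:
    """Fill one fixed template with random words from the small lexicon."""
    return (
        f"The {random.choice(ADJECTIVES)} {random.choice(NOUNS)} "
        f"{random.choice(VERBS)} the {random.choice(ADJECTIVES)} "
        f"{random.choice(NOUNS)}."
    )


# Placeholder standing in for the natural-prose paragraphs.
PROSE = "Tokenization converts raw text into a sequence of integer ids."

# The five code lines named in the text (the other two are not listed).
CODE_LINES = [
    "model.parameters()",
    "torch.nn.functional",
    "loss.backward()",
    "optimizer.step()",
    "for epoch in range(num_epochs)",
]

sentences = [templated_sentence() for _ in range(800)]
code_block = ("\n".join(CODE_LINES) + "\n") * 80  # the repetitive trap

corpus = "\n".join([PROSE, " ".join(sentences), code_block])
print(len(corpus), "characters,", len(corpus.encode("utf-8")), "bytes")
```

The sizes printed by this sketch will not match the figures quoted above, since the real prose and lexicon differ; only the structure is the point.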

The Compression Curve

Training BPE at four vocabulary sizes and measuring the number of tokens needed to encode the full corpus:

Vocab size	Encoded length (tokens)	Compression ratio vs bytes	Train time
$300$	$39{,}593$	$1.63\times$	$0.5$ s
$500$	$16{,}693$	$3.86\times$	$1.7$ s
$1{,}000$	$4{,}564$	$14.11\times$	$2.7$ s
$2{,}000$	$3{,}432$	$18.77\times$	$3.3$ s

The shape of this curve recurs across BPE training runs on natural-language corpora. A sharp initial improvement (the first few hundred merges absorb most of the redundancy) is followed by sharply diminishing returns (the final thousand merges shorten the encoding by only about a quarter, against the $14\times$ reduction achieved before them).

Look at the jump from vocab $500$ to vocab $1{,}000$: doubling the vocabulary reduced the token count from $16{,}693$ to $4{,}564$, a $3.66\times$ compression improvement from a $2\times$ vocabulary increase. Now look at vocab $1{,}000$ to $2{,}000$: another doubling, but the token count only drops from $4{,}564$ to $3{,}432$, a $1.33\times$ compression improvement. Measured as tokens removed from the $64{,}419$-byte baseline, the $744$ merges up to vocab $1{,}000$ (the vocabulary minus the $256$ base bytes) account for about $98\%$ of the total reduction ($59{,}855$ of $60{,}987$ tokens); the next $1{,}000$ merges account for the remaining $2\%$.
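The ratios quoted here follow directly from the table. As a quick check, the sketch below recomputes them from the byte count and the encoded lengths; the variable names are illustrative and the numbers are copied from the table, not re-measured.

```python
CORPUS_BYTES = 64_419  # UTF-8 size of the training corpus

# Encoded lengths (in tokens) copied from the table above.
encoded_tokens = {300: 39_593, 500: 16_693, 1_000: 4_564, 2_000: 3_432}

# Compression ratio versus raw bytes: bytes divided by tokens.
ratios = {vocab: CORPUS_BYTES / n for vocab, n in encoded_tokens.items()}

# Improvement from one vocabulary size to the next: old tokens / new tokens.
sizes = sorted(encoded_tokens)
step_gains = {
    (small, large): encoded_tokens[small] / encoded_tokens[large]
    for small, large in zip(sizes, sizes[1:])
}

for vocab in sizes:
    print(f"vocab {vocab:>5}: {ratios[vocab]:.2f}x vs bytes")
for (small, large), gain in step_gains.items():
    print(f"{small} -> {large}: {gain:.2f}x fewer tokens")
```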

The same diminishing-returns curve appears at larger scales. When OpenAI moved from GPT-2's roughly $50{,}000$-token vocabulary to the roughly $100{,}000$-token cl100k_base vocabulary used by GPT-3.5 and GPT-4, the second $50{,}000$ tokens yielded much less compression per added token than the first. The shape is preserved across orders of magnitude in vocabulary size and corpus size: it is essentially a property of natural-language token distributions, not of any particular BPE implementation.

Why does the curve flatten? Each merge shortens the encoding by one token per occurrence of the merged pair, so a merge is worth exactly the pair's frequency at the time it is made. BPE always takes the most frequent pair, so the payoff per merge shrinks as training proceeds. After a certain number of merges, the remaining pairs all have low frequency: there is not enough repetition left in the corpus to justify spending vocabulary slots on them. Adding more merges captures rare patterns that contribute very little to total compression. The flattening point depends on the corpus's intrinsic redundancy.

The First Five Merges Are Recognisably Linguistic

Looking at what BPE chose to merge at each step, ranked by frequency at the time of merging:

Merge #	Pair frequency	Resulting token (UTF-8)
$1$	$2{,}159$	`"e "` (e + space)
$2$	$1{,}792$	`"he "`
$3$	$1{,}588$	`" t"` (space + t)
$4$	$1{,}412$	`" the "`
$5$	$1{,}133$	`"er"`

Every one of these is structurally meaningful to a human reader of English. Merge $1$ is the end-of-word "e" followed by a space, the most frequent pair in this corpus because so many words end in "e" (the, time, more, made, like). Merge $2$ joins "h" to that token to form "he" plus a space, which captures the endings of "he", "she", and "the". Merge $3$ is a space followed by "t", which precedes every "t"-initial word that follows a space. Merge $4$ is the whole word "the" with both spaces: by merge $4$, BPE has rediscovered the most common word in the English language.

Merge $5$ is "er", the standard English comparative suffix and a very common interior bigram. By the fifth merge, BPE has identified word boundaries, the definite article, and one of the most common morphological suffixes — all without ever being told what a word is, what English is, or what a suffix is.

This is the standard result on English text. BPE recovers word boundaries and morphological structure as a side effect of counting frequencies. The discovery happens because spaces and common words are by far the most frequent things in the corpus. It is critical to internalise that this is not language understanding — BPE has no notion of words. It is pure statistical compression that happens to find linguistically meaningful units when applied to natural language.
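The frequency ranking behind a first merge is easy to reproduce on a toy string. The sketch below counts adjacent byte pairs exactly as one BPE step would; the sample text is invented for illustration and is not the training corpus, so only the qualitative outcome (an end-of-word "e" plus space winning) carries over.

```python
from collections import Counter

# Toy text, not the article's corpus: six words that all end in "e".
text = "the time the tale the more "
tokens = [bytes([b]) for b in text.encode("utf-8")]  # start from single bytes

# Count every adjacent pair, as a BPE training step does before merging.
pair_counts = Counter(zip(tokens, tokens[1:]))

for (left, right), freq in pair_counts.most_common(3):
    print(repr((left + right).decode("utf-8")), freq)

# The pair BPE would merge first on this string.
best_pair, best_freq = pair_counts.most_common(1)[0]
```

On this string the winning pair is "e" followed by a space, with six occurrences, ahead of space followed by "t" with four. No notion of a word is involved: the space simply happens to be the most frequent neighbour.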

The Failure Mode — Merge #200

The previous section was BPE on its best behaviour. Now the failure mode.

Our corpus deliberately included a seven-line block of Python-like syntax repeated $80$ times. By the time BPE has done $200$ merges, it has not just merged individual lines of this block: it has merged most of the block into one token. The token at id $455$ (the $200$th new vocabulary entry, since ids $0$ to $255$ are the raw bytes) is:

"model.parameters()\ntorch.nn.functional\nloss.backward()\noptimizer.step()\nfor epoch in range(num_epochs)"

As displayed, that is $102$ characters (counting each newline as one) encoded as a single token id. Five lines of code, four newlines, multiple identifiers, method calls with parentheses, a control-flow keyword, several dots and an underscore: all compressed into one token.

The compression happened from the bottom up. First BPE merged the individual identifiers ("parameters", "functional"), then the method-call patterns ("backward()", "step()"), then full lines ("loss.backward()\n"), then pairs of lines, then larger combinations. Because the same $7$-line block appeared $80$ times, every level of this hierarchy had enough frequency to justify its own merge — and the merges chained together until a substantial fraction of the block had been collapsed.
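The bottom-up chaining can be reproduced in isolation with a naive trainer. The sketch below is a simplified stand-in for the Part 2 tokenizer, under several assumptions: the corpus is only the repeated block (no prose), it uses the five lines quoted above because the other two are not listed, tokens are stored as byte strings instead of integer ids, and training stops early once no pair repeats.

```python
from collections import Counter


def train_bpe(data: bytes, num_merges: int) -> list[bytes]:
    """Naive byte-level BPE; returns the learned tokens in merge order."""
    seq = [bytes([b]) for b in data]  # start from single bytes
    learned = []
    for _ in range(num_merges):
        counts = Counter(zip(seq, seq[1:]))  # adjacent-pair frequencies
        if not counts:
            break
        (left, right), freq = counts.most_common(1)[0]
        if freq < 2:  # nothing repeats any more, so stop
            break
        merged, out, i = left + right, [], 0
        while i < len(seq):  # replace the pair, scanning left to right
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        learned.append(merged)
    return learned


# Five of the seven lines; the remaining two are not given in the text.
BLOCK = "\n".join([
    "model.parameters()",
    "torch.nn.functional",
    "loss.backward()",
    "optimizer.step()",
    "for epoch in range(num_epochs)",
]) + "\n"

corpus = (BLOCK * 80).encode("utf-8")
learned = train_bpe(corpus, num_merges=120)

longest = max(learned, key=len)
print(len(learned), "merges learned; longest token is", len(longest), "bytes")
print(longest[:60])  # first bytes of the longest token
```

Because every pair inside the block occurs at least $80$ times, each merge removes at least one token from every copy, so a budget slightly larger than the block's length is enough for the longest learned token to span several lines. In a mixed corpus the prose competes for the same merge budget, which is why the collapse in the article is only partial.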

Why This Failure Mode Matters

The Python-code example is illustrative but the same pathology affects production tokenizers when their training data contains any systematically repetitive content:

Repetitive code. If a tokenizer's training corpus contained a common Stack Overflow boilerplate block often enough relative to the corpus size, and no pre-tokenization rule split the block into smaller pieces first, BPE would collapse the entire block into one token. The model would then have to learn the mapping from "this giant token" to "whatever comes after the boilerplate." The giant token occupies a single slot in the context window, yet it contains an enormous amount of structure that the model must treat as atomic.

Log files and structured outputs. Log lines often share a common prefix ("ERROR 2024-01-05T12:34:56Z ..."). If BPE was trained on data containing many such lines, the entire prefix can collapse into one token. The model can predict what comes after that prefix but cannot easily reason about its components. Asked "what year is this log from?", for example, the model would have to invert the giant token to recover the year, which is a much harder operation than reading it from a finer-grained tokenization.

Common prompts. Instruction-tuning datasets often contain repeated phrasings such as "You are a helpful assistant. Please answer the following question:". If BPE was trained on data that included thousands of copies of this prefix, the entire prefix could collapse into one token. The model then learns associations from "that giant token" to the response, which may be part of why fine-tuned LLMs are sometimes brittle to phrasing changes: a slight wording variation breaks the giant-token shortcut.

Specific LLM Failures Traceable to Tokenization

This failure mode explains several famous LLM behaviours:

Letter counting. "How many R's are in strawberry?" GPT-3 and many follow-ups fail this question. The reason: "strawberry" reaches the model as one or a few multi-letter tokens, never as individual letters. The model cannot see the letters directly because they have been collapsed into opaque ids. Asking the model to count letters is asking it to invert tokenization, which it was rarely trained to do.

Arithmetic with specific digit lengths. LLMs can be oddly bad at certain arithmetic, say adding $3$-digit numbers, even when they handle $2$-digit and $4$-digit numbers well. A likely reason: $3$-digit numbers like "371" may have been a single token in the training data, while $2$-digit and $4$-digit numbers tokenized differently. The model's arithmetic abilities are fragmented across token boundaries the user does not see.

Whitespace sensitivity. The same prompt with subtle whitespace changes (a leading space, a trailing newline, a different indentation level) can produce dramatically different LLM behaviour. The reason: those whitespace variations tokenize differently. The model never sees "the same prompt with different formatting"; it sees two different token sequences. Tokenization variance is invisible to the user but is a major source of prompt-engineering instability.

Andrej Karpathy's tokenizer videos make much the same point: many of the "weird things" about LLMs have their roots in tokenization choices the user cannot see and the model cannot inspect.
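The letter-counting and whitespace cases can be made concrete with a toy encoder. The sketch below is illustrative only: the three-entry vocabulary and its ids are invented, and greedy longest-match is used as a simple stand-in for applying BPE merges, so real tokenizers will split these strings differently.

```python
# Invented vocabulary and ids, for illustration only.
VOCAB = {" straw": 0, "straw": 1, "berry": 2}
DECODE = {token_id: piece for piece, token_id in VOCAB.items()}
PIECES = sorted(VOCAB, key=len, reverse=True)  # try longer pieces first


def encode(text: str) -> list[int]:
    """Greedy longest-match encoding over the toy vocabulary."""
    ids, i = [], 0
    while i < len(text):
        for piece in PIECES:
            if text.startswith(piece, i):
                ids.append(VOCAB[piece])
                i += len(piece)
                break
        else:
            ids.append(1000 + ord(text[i]))  # fallback: one id per character
            i += 1
    return ids


plain = encode("strawberry")    # [1, 2]
spaced = encode(" strawberry")  # [0, 2]
print(plain, spaced)

# Counting letters requires mapping ids back to strings, a lookup the
# model itself never performs on its input.
r_count = sum(DECODE[token_id].count("r") for token_id in plain)
print(r_count)  # 3
```

The model receives only the id lists. One leading space changes the first id, and the three occurrences of "r" are recoverable only through the decode table.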

How Vocabulary Size Affects a Specific Sentence

To make the compression concrete, here is the sentence "The tokenizer compresses common substrings into single tokens. Every modern large language model uses some variant of byte-level BPE for this reason." encoded by each of our four tokenizers:

Vocab size	Tokens	Notes
$300$	$105$	Mostly bytes — barely any compression of words
$500$	$85$	"The ", "ing", "the " merged
$1{,}000$	$61$	"tokenizer", "modern", "large" become single tokens
$2{,}000$	$45$	"single tokens" becomes one token

The transition between vocab sizes is informative. At $300$ tokens of vocabulary, very little has been merged beyond individual letters and a few common bigrams — the sentence is encoded almost byte-per-byte. By vocab $500$, common short words and morphological suffixes are single tokens. By vocab $1{,}000$, content words like "tokenizer" and "language" fit in one token each. By vocab $2{,}000$, even two-word phrases ("single tokens") can collapse.

For reference, GPT-4 uses the cl100k_base tokenizer with vocab $\sim 100{,}000$. The same sentence in cl100k_base encodes to roughly $28$ tokens. That is roughly $1.6\times$ fewer tokens than our toy tokenizer needs at vocab $2{,}000$ ($45$ tokens), and roughly $5\times$ fewer than byte-level encoding, which needs one token for each of the sentence's $149$ bytes.
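The per-sentence ratios can be recomputed from the sentence's byte length and the token counts in the table. In the sketch below the toy counts are copied from the table and the cl100k_base figure is the approximate value quoted above; neither is re-measured here, and the dictionary labels are illustrative.

```python
sentence = (
    "The tokenizer compresses common substrings into single tokens. "
    "Every modern large language model uses some variant of byte-level "
    "BPE for this reason."
)
n_bytes = len(sentence.encode("utf-8"))  # byte-level baseline: one token per byte

# Token counts from the table, plus the approximate cl100k_base figure.
token_counts = {
    "toy vocab 300": 105,
    "toy vocab 500": 85,
    "toy vocab 1000": 61,
    "toy vocab 2000": 45,
    "cl100k_base (approx.)": 28,
}

for name, n_tokens in token_counts.items():
    print(f"{name:>22}: {n_tokens:>3} tokens, {n_bytes / n_tokens:.2f} bytes per token")

# How much shorter the cl100k_base encoding is than the two reference points.
vs_toy_2000 = token_counts["toy vocab 2000"] / token_counts["cl100k_base (approx.)"]
vs_bytes = n_bytes / token_counts["cl100k_base (approx.)"]
print(f"vs toy vocab 2000: {vs_toy_2000:.1f}x, vs raw bytes: {vs_bytes:.1f}x")
```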

What This Proves

The three results together tell a consistent story:

BPE is greedy compression. It is not language understanding. On natural-language corpora it incidentally produces linguistically meaningful tokens (words, morphemes, prefixes, suffixes) because those are the high-frequency patterns. On repetitive or structured corpora it produces absurdly long single tokens because those are also high-frequency patterns. The algorithm makes no distinction between "good" and "bad" patterns to compress.

The tokenizer is frozen and shapes everything downstream. Once trained, the merges are fixed. The model is trained on top of those merges. Every weird LLM behaviour that traces back to tokenization is essentially permanent — fixing it would require retraining the model from scratch with a different tokenizer.

Tokenization is the silent foundation under every LLM. Every behaviour that depends on character counting, position-within-word, or sub-token reasoning is downstream of these decisions. The tokenizer is a separate piece of frozen software, trained once, and the model has no introspection into it. The next time you see a strange LLM output, the first question to ask is: "what does the tokenizer do to this input?"

Open Research Directions

Several recent research threads explore whether BPE is the right substrate at all:

Byte-level Transformers (ByT5, MegaByte) train directly on raw bytes, skipping tokenization entirely. This eliminates the tokenization-induced pathologies, but sequences are longer, so compute per character rises. Empirically these models close most of the gap with BPE-tokenized models, at substantially higher compute cost.

Learned tokenizers jointly optimise the tokenizer and the model. The motivation: tokenization currently gets one shot, before training, with no feedback from how well the resulting tokens work for the task. Joint optimisation could in principle produce tokens better suited to downstream performance. This has not yet succeeded at large scale.

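Patch-based input groups raw input into patches instead of vocabulary tokens. This is most natural for image and video models such as V-JEPA. The text analogue, explored in the Byte Latent Transformer paper ("Patches Scale Better Than Tokens"), groups bytes into variable-length patches and is still under exploration.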

For now, byte-level BPE remains the dominant choice. As of 2026, widely deployed LLM families, including GPT-4o, Claude 3, Llama 3, Gemini, and Mistral Large, use BPE or a closely related subword scheme. The pathologies described here are present in all of them. They are not going away soon.

Full code on GitHub: github.com/soveshmohapatra/BPE-Tokenizer