Deconstructing Tokenization: Part 2 - BPE in Eighty Lines

Overview

Part 1 established that byte-level BPE is the algorithm behind every modern LLM tokenizer. Now we implement it from scratch in pure Python. The full tokenizer is a single class with three methods — train, encode, decode — and around 80 lines of code. The only standard-library import we need is collections.Counter; everything else is plain Python lists and byte arithmetic.

The interesting parts are not the data structures (which are obvious) but the subtle correctness points: how to apply merges in order so that earlier merges produce tokens that later merges can themselves combine, how to handle the boundary between bytes and strings that Python forces you to think about, and what defensive choices keep the code safe on adversarial inputs.

Why a Class with Two Pieces of State

The full tokenizer needs two pieces of state: an ordered list of merges and a vocabulary dictionary that maps token ids to byte sequences. The vocabulary is initialised to the 256 raw bytes; the merges list is empty until train is called.

class BPETokenizer:
    def __init__(self):
        self.merges = []
        self.vocab = {i: bytes([i]) for i in range(256)}

Two design choices to notice. First, the merges are stored as a list, not a set or dict — the order matters at encoding time, because a merge that creates token id $300$ may itself be the input to a later merge that creates token id $500$. If we shuffled the order, encoding would no longer be deterministic.

Second, the vocabulary maps to bytes objects, not strings. This is the most important data-type choice in the entire tokenizer. A single Unicode codepoint may take 1-4 bytes in UTF-8, and BPE operates on the byte level, not the codepoint level. Storing vocabulary entries as bytes means we can concatenate them with b"" + b"" without worrying about codepoint boundaries — and the eventual .decode("utf-8") at output time handles any boundary issues that arise.

The Training Loop

BPE training is iterative: at each step, find the most common adjacent pair in the corpus, replace it with a new token id, repeat until you reach the target vocabulary size. The full loop fits in a dozen lines:

def train(self, text, vocab_size):
    tokens = list(text.encode("utf-8"))    # bytes → list of ints (0..255)
    next_id = 256
    while next_id < vocab_size:
        pairs = self._count_pairs(tokens)
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break                          # nothing repeated — stop
        tokens = self._merge(tokens, (a, b), next_id)
        self.merges.append(((a, b), next_id))
        self.vocab[next_id] = self.vocab[a] + self.vocab[b]
        next_id += 1

The bytes-to-tokens conversion happens on the first line: text.encode("utf-8") turns a Python string into a bytes object, and list(...) turns that into a list of integers in the range $0$ to $255$. This is the canonical Python boundary between string-land (where indexing yields codepoints) and byte-land (where indexing yields integers). BPE lives entirely in byte-land.

Why we break when freq < 2. If the most frequent pair appears only once, there is no point creating a new vocabulary entry for it — adding a token id that compresses exactly one occurrence saves us zero tokens at encoding time while costing one slot in the vocabulary. In practice this guard rarely triggers because corpora have plenty of repeated bigrams long before the vocabulary is exhausted, but it is a safety net.

Why we update both merges and vocab. The merges list records the recipe (in order) so we can replay it at encoding time. The vocab dictionary records the byte expansion of every token, so we can reverse the process at decoding time. Both are needed and they encode different views of the same information: merges as instructions, vocab as state.

The new token's byte expansion. self.vocab[next_id] = self.vocab[a] + self.vocab[b] concatenates the two parent byte sequences. Crucially, the parents may themselves be merged tokens (with multi-byte expansions), so this is a recursive structure: every token's full byte sequence is built up from its parents' sequences, which were built up from their parents' sequences, all the way down to the 256 raw bytes. This recursive expansion is what makes decoding trivial — we look up each token id and concatenate.

The Merge Helper

The inner loop is the merge operation: scan the token list left to right, replace every occurrence of the target pair with the new token id. It looks simple, but there is one subtle correctness point:

@staticmethod
def _merge(tokens, pair, new_id):
    out = []
    i = 0
    a, b = pair
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

The subtlety is the i += 2 after a successful merge. Consider what happens to a sequence like [a, b, b] when we are merging (a, b). With i += 2, the new token id $z$ replaces the first (a, b) and we move past both, producing [z, b]. With i += 1, the iterator would still be sitting on what used to be position $1$ (now containing $z$), and the next comparison would test $z$ against the trailing $b$ — creating ambiguity in what got merged. The i += 2 convention is the standard "left-to-right, non-overlapping" interpretation that every BPE implementation uses.

The other detail worth flagging: this implementation does one full left-to-right pass per merge, returning a new list. Production tokenizers (tiktoken, sentencepiece) use linked-list or suffix-array data structures so each merge runs in $O(\log n)$ instead of $O(n)$, but the algorithmic content is the same. Our $O(n)$ pass is fine for educational purposes and small corpora.

Counting Pairs

The pair-counting helper is one short loop:

@staticmethod
def _count_pairs(tokens):
    c = Counter()
    for i in range(len(tokens) - 1):
        c[(tokens[i], tokens[i + 1])] += 1
    return c

collections.Counter is a dict subclass with two useful properties: += on missing keys initialises to $0$ instead of throwing, and .most_common(1)[0] returns the highest-frequency entry in $O(n \log n)$ time. We could write this with a plain dict and max(d.items(), key=lambda kv: kv[1]), but Counter is idiomatic Python.

This is the function that gets called once per merge. For a vocabulary of $2{,}000$ entries on a $33$ KB corpus, that is around $1{,}700$ pair-count passes over a list that starts at $64{,}000$ tokens and shrinks as merges progress. Total time: about $3$ seconds on a laptop. Most of it is the pair-counting loop, not the merge step. Production implementations cache pair frequencies and incrementally update them after each merge, which is where the $1000\times$ speedup comes from.

Encoding

At inference time, we need to convert arbitrary new text into the same token-id space the model was trained on. The encoding procedure is to replay the trained merges in the order they were recorded:

def encode(self, text):
    tokens = list(text.encode("utf-8"))
    for (a, b), new_id in self.merges:
        tokens = self._merge(tokens, (a, b), new_id)
    return tokens

The order is essential and worth slowing down on. Suppose merge $1$ created token $256$ from the byte pair (b'e', b' '), and merge $4$ created token $259$ from the pair (b' t', b'h'), which itself uses an earlier merge of (b' ', b't'). When we encode the string "the ", we must apply merge $1$ before merge $4$ — because merge $4$ expects to find token $258$ (or whatever id " t" received) sitting in the token stream, and that token won't exist until merge $1$ runs.

Replaying merges in training order guarantees correctness because that is the same order they were created. Every merge can assume its dependencies (the parent tokens) already exist in the token stream. Reversing or shuffling the order would break this invariant and produce different encodings — or, worse, encodings the model has never seen at training time.

For long text, this $O(M \cdot |t|)$ replay is slow. Production tokenizers use suffix-array or trie-based approaches that recognise the longest matching merge at each position in roughly linear time. Same algorithm, different data structures.

Decoding

Decoding is trivial because of the recursive vocab structure we built up at training time:

def decode(self, ids):
    data = b"".join(self.vocab[i] for i in ids)
    return data.decode("utf-8", errors="replace")

Look up each token id in the vocab, concatenate the byte sequences, then UTF-8 decode the result. The errors="replace" argument is the only interesting choice here.

Why errors="replace"? A subtle problem: BPE merges operate at the byte level, not the codepoint level. It is possible (though rare) for a sequence of token ids to correspond to a byte sequence that slices through the middle of a multi-byte UTF-8 character. Concretely, a character like é is two bytes (0xC3 0xA9); if a sequence of merges produces token boundaries that split it, the resulting bytes are not valid UTF-8. errors="replace" says "if you find an invalid byte sequence, emit the replacement character U+FFFD and continue" — graceful degradation instead of a crash.

This pathology essentially never happens on text we tokenized ourselves (the merges respect the UTF-8 structure of the training data). It can happen on adversarial inputs designed to exploit the byte-level nature of BPE — for example, certain prompt-injection attacks use deliberately malformed UTF-8 to confuse the tokenizer. The replace policy makes that fail-safe instead of fail-stop.

What This Implementation Skips

Real-world BPE implementations like tiktoken, sentencepiece, and Hugging Face Tokenizers add several optimisations and features we deliberately leave out for clarity:

Pre-tokenization with a regex. GPT-2's tokenizer applies a regex to split text into "pre-tokens" (typically words and punctuation) before BPE runs. This prevents merges from crossing word boundaries — so a token like "the next" can never form, even if the substring is frequent. Pre-tokenization adds a layer of linguistic prior that our raw-byte version does not have. The trade-off is that pre-tokenized BPE handles whitespace-rich text better, while raw-byte BPE handles arbitrary input (code, emoji, mathematical notation) more uniformly.

Special tokens. Real tokenizers reserve specific ids for special tokens: <|endoftext|>, <pad>, system role markers, end-of-turn markers, etc. These are added post-hoc to the vocabulary and treated specially during encoding (they cannot be split by merges). Our implementation has no concept of special tokens — every byte sequence is just data.

Incremental pair-frequency updates. The biggest performance optimisation is not re-counting all pairs at every iteration. Instead, after a merge, you can incrementally update the pair counts in $O(k)$ where $k$ is the number of merge occurrences. Production tokenizers use linked lists or hash maps that support fast updates. Our $O(n)$ re-count per merge is roughly $1000\times$ slower than tiktoken but fits in $30$ lines instead of $3000$.

Optimised data structures. Suffix arrays for $O(\log n)$ merge application at encoding time, hash-based merge tables, parallel chunk processing for very large corpora. None of this changes the algorithm — it changes the constants.

The Whole Implementation

Excluding comments and helpers, the substantive code is roughly $50$ lines: $14$ for the train method, $11$ for the merge helper, $4$ for encoding, $3$ for decoding, plus the constructor and the pair counter. The full file with documentation is around $80$ lines.

What you can do with this $80$ lines: tokenize arbitrary text losslessly, train new tokenizers on new corpora, inspect what merges were learned and why. What you cannot do: handle production-scale corpora in reasonable time, support special tokens, integrate with Hugging Face's model loading machinery. For learning, that trade is fine. For production, use tiktoken.

What Part 3 Tests

With this implementation in hand, Part 3 trains BPE on a $33$-KB English corpus at four vocabulary sizes and exposes two things: the universal compression curve shape that holds across every BPE training run on natural text, and the failure mode where repetitive code collapses into absurdly long single tokens — the same pathology that explains a wide range of strange LLM behaviour.

Full code on GitHub: github.com/soveshmohapatra/BPE-Tokenizer