The loss curve

Build a BPE tokenizer from scratch

Train a Byte Pair Encoding (BPE) tokenizer from scratch. Build the merge table, encode, decode, and compare to GPT-2's tokenizer.

Byte Pair Encoding is the tokenization scheme behind GPT-2, GPT-3, GPT-4, and most modern open LLMs. The algorithm is simple enough to implement in an afternoon. This guide walks through it from first principles, with a runnable browser cell and a local Python implementation.

1. Why BPE

Tokenizing text is harder than it looks. The naive approach — split on whitespace, build a vocabulary — fails on real text:

  • typos and rare words explode the vocabulary
  • subword structure (e.g. running = run + ning) is lost
  • unseen words at inference time have no representation
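The first and third failure modes are easy to see on a toy corpus (this example is mine, not from the guide):

```python
# A whitespace tokenizer's vocabulary grows with every surface form,
# and any word not seen during training has no entry at all.
corpus = "the cat runs the cats ran running runner"
vocab = set(corpus.split())

# "runs", "ran", "running", and "runner" are unrelated entries --
# the shared stem "run" is invisible to the tokenizer.
print(sorted(vocab))

# An unseen form at inference time is simply out-of-vocabulary.
print("runners" in vocab)  # False
```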

BPE solves this with a learned vocabulary that compresses common patterns and falls back to characters for rare ones. It was originally a 1994 data-compression algorithm; Sennrich et al. (2016) adapted it for NMT, and OpenAI used it in GPT-2.

2. The algorithm

The training algorithm is four steps:

  1. Start with a vocabulary of every character in the corpus.
  2. Find the most frequent adjacent pair of tokens.
  3. Merge that pair into a new token. Add it to the vocabulary.
  4. Repeat until you've done N merges.

N is your only knob. For GPT-2, it's tuned so the final vocabulary is ~50,000 tokens.
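The four steps above fit in a few lines of Python. This is a minimal, unoptimized sketch (a real implementation would pre-tokenize into words and update pair counts incrementally rather than rescanning the corpus):

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    tokens = list(corpus)                      # step 1: one token per character
    merges = []
    for _ in range(num_merges):                # step 4: repeat N times
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)       # step 2: most frequent adjacent pair
        merges.append(pair)                    # step 3: record it as a new token
        # Rewrite the corpus with the pair merged wherever it occurs.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = train_bpe("the thief and the theme", 3)
print(merges)  # first merge is ('t', 'h'), then ('th', 'e')
```

Merging only ever concatenates existing tokens, so joining the final token list always reconstructs the original corpus exactly.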

Chapter 3 — Train a BPE tokenizer from scratch implements this with interactive browser cells you can step through.

3. Training on your own corpus

Once trained, the tokenizer is just a list of merges (in order) plus a vocabulary. You can serialize both to JSON and load them back later.

merges = [("t", "h"), ("th", "e"), ...]   # ordered list of pairs
vocab = ["a", "b", ..., "th", "the", ...] # all known tokens
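The round trip is a few lines with the standard `json` module. One sketch of it (note that JSON has no tuple type, so the pairs come back as lists and need restoring on load):

```python
import json

merges = [("t", "h"), ("th", "e")]
vocab = ["a", "b", "t", "h", "e", "th", "the"]

# Serialize both pieces of state to a single JSON string (or file).
blob = json.dumps({"merges": merges, "vocab": vocab})

# On load, convert each merge pair back from a JSON array to a tuple.
loaded = json.loads(blob)
merges_back = [tuple(p) for p in loaded["merges"]]
assert merges_back == merges
```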

Encoding is iterative: start with characters, then apply each merge in order wherever the pair appears. Decoding is just "".join(tokens).
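That encode/decode pair can be sketched as follows (the function names are mine; the inner loop is the same pair-replacement pass used during training):

```python
def encode(text, merges):
    tokens = list(text)                        # start with characters
    for pair in merges:                        # apply merges in training order
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

def decode(tokens):
    return "".join(tokens)                     # merges preserve every character

merges = [("t", "h"), ("th", "e")]
toks = encode("the theme", merges)
print(toks)  # ['the', ' ', 'the', 'm', 'e']
assert decode(toks) == "the theme"
```

Note that "theme" splits as `the` + `m` + `e`: the greedy merge order applies `("th", "e")` as soon as it can, which is exactly the behavior the next section discusses.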

4. Encoding and decoding

A working encoder needs to handle:

  • the greedy merge order (apply earlier merges first)
  • bytes-vs-characters (modern BPE often operates on bytes for full Unicode coverage)
  • whitespace conventions (GPT-2 bakes the leading space into the token itself, rendered as `Ġ` in its vocab files, so word boundaries are preserved)

Chapter 3 covers the first two. The whitespace convention is a GPT-2-specific detail; it's worth understanding but doesn't change the algorithm.
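The bytes-vs-characters point is easy to see in isolation: UTF-8 bytes give a fixed 256-symbol base alphabet, so any Unicode input (emoji included) is representable before a single merge is learned. A minimal sketch of the idea, separate from the tokenizer itself:

```python
text = "café 🙂"

# Byte-level BPE starts from the UTF-8 bytes, not the characters.
# Every value is in 0..255, so the base vocabulary is fixed at 256 entries.
base = list(text.encode("utf-8"))
print(base)

# Decoding the bytes recovers the original text exactly -- no
# out-of-vocabulary symbol is possible at the base level.
restored = bytes(base).decode("utf-8")
assert restored == text
```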

5. Comparing to GPT-2's tokenizer

Once you have your own BPE running, run the same input string through both your tokenizer and GPT-2's. You'll see:

  • the token counts are similar
  • the splits differ — GPT-2 has seen vastly more text, so its merges discover more morphology
  • both produce a fixed-size, decodable representation

The point is that yours is a smaller version of the exact same thing.

6. What comes next

Tokenization is the first piece of an LLM. Everything downstream (embeddings, attention, the training loop) operates on the token IDs it produces.

Frequently asked questions

What is BPE in plain English?

BPE (Byte Pair Encoding) is a learned tokenizer. It starts by treating every character as a token, then repeatedly finds the most frequent adjacent pair of tokens in the corpus and merges them into a new token. You stop when you've done enough merges. The resulting tokens are subwords — somewhere between characters and full words.

Why not just split on whitespace?

Whitespace tokenizers explode in vocabulary size on real text — every typo, plural, and conjugation becomes a new token. BPE produces a fixed-size vocabulary that compresses common patterns and falls back to characters for rare ones, so it handles any input without out-of-vocabulary errors.

Why is BPE used in GPT-2/3/4?

It's a Goldilocks tokenization — finer than words (handles typos and morphology), coarser than characters (fewer tokens per sequence). Trained on a large corpus, BPE discovers morphology automatically — suffixes like `-ing` and `-ed` emerge as common merges.

How big is GPT-2's vocabulary?

50,257 tokens. That's the vocab size of the BPE tokenizer OpenAI trained on the WebText corpus: 256 base byte tokens, 50,000 learned merges, and one special token (`<|endoftext|>`) on top.

Should I train my own tokenizer or use an existing one?

For learning, train your own — it's a small algorithm and makes the rest of the course concrete. For shipping, use GPT-2's or a domain-specific one. Tokenization is one of those things that's easy to get wrong if you change it mid-project.
