Guide
Build a BPE tokenizer from scratch
Train a Byte Pair Encoding (BPE) tokenizer from scratch. Build the merge table, encode, decode, and compare to GPT-2's tokenizer.
Byte Pair Encoding is the tokenization scheme behind GPT-2, GPT-3, GPT-4, and most modern open LLMs. The algorithm is simple enough to implement in an afternoon. This guide walks through it from first principles, with a runnable browser cell and a local Python implementation.
1. Why BPE
Tokenizing text is harder than it looks. The naive approach — split on whitespace, build a vocabulary — fails on real text:
- typos and rare words explode the vocabulary
- subword structure (e.g. running = run + ning) is lost
- unseen words at inference time have no representation
BPE solves this with a learned vocabulary that compresses common patterns and falls back to characters for rare ones. It was originally a 1994 data-compression algorithm (Gage); Sennrich et al. (2016) adapted it for neural machine translation, and OpenAI used it in GPT-2.
2. The algorithm
The training algorithm is four steps:
- Start with a vocabulary of every character in the corpus.
- Find the most frequent adjacent pair of tokens.
- Merge that pair into a new token. Add it to the vocabulary.
- Repeat until you've done N merges.
N is your only knob. For GPT-2, it's tuned so the final vocabulary is ~50,000 tokens.
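Here is a minimal sketch of that loop in Python. The `train_bpe` name, the whitespace pre-split, and the character-level start are assumptions of this sketch rather than any particular library's API:

```python
from collections import Counter

def train_bpe(corpus, n_merges):
    """A minimal BPE trainer: character-level start, whitespace pre-split."""
    # Each word is a tuple of tokens; start with single characters.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(n_merges):
        # Count every adjacent token pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[a, b] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

print(train_bpe("the theory of the thing that thinks", 5))
# [('t', 'h'), ('th', 'e'), ...]
```

Real implementations differ mainly in bookkeeping (byte-level start, faster pair counting), not in the loop itself.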
Chapter 3 — Train a BPE tokenizer from scratch implements this with interactive browser cells you can step through.
3. Training on your own corpus
Once trained, the tokenizer is just a list of merges (in order) plus a vocabulary. You can serialize both to JSON and load them back later.
merges = [("t", "h"), ("th", "e"), ...] # ordered list of pairs
vocab = ["a", "b", ..., "th", "the", ...] # all known tokensEncoding is iterative: start with characters, then apply each merge in order wherever the pair appears. Decoding is just "".join(tokens).
4. Encoding and decoding
A working encoder needs to handle:
- the greedy merge order (apply earlier merges first)
- bytes-vs-characters (modern BPE often operates on bytes for full Unicode coverage)
- whitespace conventions (GPT-2 prepends a space to most tokens so word boundaries are preserved)
Chapter 3 covers the first two. The whitespace convention is a GPT-2-specific detail; it's worth understanding but doesn't change the algorithm.
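A minimal character-level encoder and decoder sketch, assuming `merges` is the ordered list produced during training (byte-level handling and GPT-2's whitespace convention are left out):

```python
def encode(text, merges):
    """Greedy BPE encode: start from characters, apply merges in training order."""
    # A character-level sketch; byte-level BPE would start from text.encode("utf-8").
    tokens = list(text)
    for a, b in merges:                     # earlier merges take priority
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)           # fuse the pair into one token
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

def decode(tokens):
    return "".join(tokens)

print(encode("the thing", [("t", "h"), ("th", "e")]))
# ['the', ' ', 'th', 'i', 'n', 'g']
```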
5. Comparing to GPT-2's tokenizer
Once you have your own BPE running, run the same input string through both your tokenizer and GPT-2's. You'll see:
- the token count is roughly similar
- the splits differ — GPT-2 has seen vastly more text, so its merges discover more morphology
- both map any input onto a fixed vocabulary and decode back losslessly
The point is that yours is a smaller version of the exact same thing.
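One way to run the comparison, assuming the `tiktoken` package is installed and that `encode`/`merges` come from your own sketches above:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "Tokenization is harder than it looks."

ids = enc.encode(text)
print(len(ids), ids)                     # GPT-2's token count and ids
print([enc.decode([i]) for i in ids])    # the individual token strings

# Compare against your own tokenizer on the same string:
# print(len(encode(text, merges)), encode(text, merges))
```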
6. What comes next
Tokenization is the first piece of an LLM. From here:
- Chapter 4 — Embeddings turns token ids into dense vectors the model can learn over.
- Chapter 1 — The dumbest model uses your tokenizer in a bigram counter — the simplest working language model.
- The full curriculum takes the tokenizer all the way to a working chatbot.
Frequently asked questions
What is BPE in plain English?
BPE (Byte Pair Encoding) is a learned tokenizer. It starts by treating every character as a token, then repeatedly finds the most frequent adjacent pair of tokens in the corpus and merges them into a new token. You stop when you've done enough merges. The resulting tokens are subwords — somewhere between characters and full words.
Why not just split on whitespace?
Whitespace tokenizers explode in vocabulary size on real text — every typo, plural, and conjugation becomes a new token. BPE produces a fixed-size vocabulary that compresses common patterns and falls back to characters for rare ones, so it handles any input without out-of-vocabulary errors.
Why is BPE used in GPT-2/3/4?
It's a Goldilocks tokenization — finer than words (handles typos and morphology), coarser than characters (fewer tokens per sequence). Trained on a large corpus, BPE discovers morphology automatically — suffixes like `-ing` and `-ed` emerge as common merges.
How big is GPT-2's vocabulary?
50,257 tokens: 256 byte-level base tokens, 50,000 learned merges, and the `<|endoftext|>` special token. OpenAI trained the merges on the WebText corpus.
Should I train my own tokenizer or use an existing one?
For learning, train your own — it's a small algorithm and makes the rest of the course concrete. For shipping, use GPT-2's or a domain-specific one. Tokenization is one of those things that's easy to get wrong if you change it mid-project.