Guide
Build a BPE tokenizer from scratch
Train a Byte Pair Encoding (BPE) tokenizer from scratch. Build the merge table, encode, decode, and compare to GPT-2's tokenizer.
Byte Pair Encoding is the tokenization scheme behind GPT-2, GPT-3, GPT-4, and most modern open LLMs. The algorithm is simple enough to implement in an afternoon. This guide walks through it from first principles, with a runnable browser cell and a local Python implementation.
1. Why BPE
Tokenizing text is harder than it looks. The naive approach — split on whitespace, build a vocabulary — fails on real text:
- typos and rare words explode the vocabulary
- subword structure (e.g. running = run + ning) is lost
- unseen words at inference time have no representation
BPE solves this with a learned vocabulary that compresses common patterns and falls back to characters for rare ones. It was originally a 1994 data-compression algorithm (Gage); Sennrich et al. (2016) adapted it for neural machine translation, and OpenAI used it in GPT-2.
2. The algorithm
The training algorithm is four steps:
- Start with a vocabulary of every character in the corpus.
- Find the most frequent adjacent pair of tokens.
- Merge that pair into a new token. Add it to the vocabulary.
- Repeat until you've done N merges.
N is your only knob. For GPT-2, it's tuned so the final vocabulary is ~50,000 tokens.
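Here is a minimal sketch of that loop in Python. The `train_bpe` name, the whitespace pre-split, and the character-level start are assumptions of this sketch rather than any particular library's API:

```python
from collections import Counter

def train_bpe(corpus, n_merges):
    """A minimal BPE trainer: character-level start, whitespace pre-split."""
    # Each word is a tuple of tokens; start with single characters.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(n_merges):
        # Count every adjacent token pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[a, b] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

print(train_bpe("the theory of the thing that thinks", 5))
# [('t', 'h'), ('th', 'e'), ...]
```

Real implementations differ mainly in bookkeeping (byte-level start, faster pair counting), not in the loop itself.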
Chapter 3 — Train a BPE tokenizer from scratch implements this with interactive browser cells you can step through.
3. Training on your own corpus
Once trained, the tokenizer is just a list of merges (in order) plus a vocabulary. You can serialize both to JSON and load them back later.
merges = [("t", "h"), ("th", "e"), ...] # ordered list of pairs
vocab = ["a", "b", ..., "th", "the", ...] # all known tokensEncoding is iterative: start with characters, then apply each merge in order wherever the pair appears. Decoding is just "".join(tokens).
4. Encoding and decoding
A working encoder needs to handle:
- the greedy merge order (apply earlier merges first)
- bytes-vs-characters (modern BPE often operates on bytes for full Unicode coverage)
- whitespace conventions (GPT-2 prepends a space to most tokens so word boundaries are preserved)
Chapter 3 covers the first two. The whitespace convention is a GPT-2-specific detail; it's worth understanding but doesn't change the algorithm.
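A minimal character-level encoder and decoder sketch, assuming `merges` is the ordered list produced during training (byte-level handling and GPT-2's whitespace convention are left out):

```python
def encode(text, merges):
    """Greedy BPE encode: start from characters, apply merges in training order."""
    # A character-level sketch; byte-level BPE would start from text.encode("utf-8").
    tokens = list(text)
    for a, b in merges:                     # earlier merges take priority
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)           # fuse the pair into one token
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

def decode(tokens):
    return "".join(tokens)

print(encode("the thing", [("t", "h"), ("th", "e")]))
# ['the', ' ', 'th', 'i', 'n', 'g']
```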
5. Comparing to GPT-2's tokenizer
Once you have your own BPE running, run the same input string through both your tokenizer and GPT-2's. You'll see:
- the token count is roughly similar
- the splits differ — GPT-2 has seen vastly more text, so its merges discover more morphology
- both map any input onto a fixed vocabulary and decode back losslessly
The point is that yours is a smaller version of the exact same thing.
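One way to run the comparison, assuming the `tiktoken` package is installed and that `encode`/`merges` come from your own sketches above:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "Tokenization is harder than it looks."

ids = enc.encode(text)
print(len(ids), ids)                     # GPT-2's token count and ids
print([enc.decode([i]) for i in ids])    # the individual token strings

# Compare against your own tokenizer on the same string:
# print(len(encode(text, merges)), encode(text, merges))
```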
6. What comes next
Tokenization is the first piece of an LLM. From here:
- Chapter 4 — Embeddings turns token ids into dense vectors the model can learn over.
- Chapter 1 — The dumbest model uses your tokenizer in a bigram counter — the simplest working language model.
- The full curriculum takes the tokenizer all the way to a working chatbot.
Frequently asked questions
What is BPE in plain English?
BPE (Byte Pair Encoding) is a learned tokenizer. It starts by treating every character as a token, then repeatedly finds the most frequent adjacent pair of tokens in the corpus and merges them into a new token. You stop when you've done enough merges. The resulting tokens are subwords — somewhere between characters and full words.
Why not just split on whitespace?
Whitespace tokenizers explode in vocabulary size on real text — every typo, plural, and conjugation becomes a new token. BPE produces a fixed-size vocabulary that compresses common patterns and falls back to characters for rare ones, so it handles any input without out-of-vocabulary errors.
Why is BPE used in GPT-2/3/4?
It's a Goldilocks tokenization — finer than words (handles typos and morphology), coarser than characters (fewer tokens per sequence). Trained on a large corpus, BPE discovers morphology automatically — suffixes like `-ing` and `-ed` emerge as common merges.
How big is GPT-2's vocabulary?
50,257 tokens: 256 byte-level base tokens, 50,000 learned merges, and the `<|endoftext|>` special token. OpenAI trained the merges on the WebText corpus.
Should I train my own tokenizer or use an existing one?
For learning, train your own — it's a small algorithm and makes the rest of the course concrete. For shipping, use GPT-2's or a domain-specific one. Tokenization is one of those things that's easy to get wrong if you change it mid-project.