
Byte-Pair Encoding (BPE)

Tokenization scheme that starts from individual characters and repeatedly merges the most frequent adjacent pair of symbols into a new token. The result is a vocabulary of subword tokens.

Originally a 1994 data-compression algorithm. Used in GPT-2/3/4 and most modern LLMs. The merges discover morphology — suffixes like 'ing' or 'ed' emerge naturally as common subwords.