The loss curve

Chapter 3 · 16 min

Train your own tokens

Byte Pair Encoding from scratch — count pairs, merge, encode, decode. Train your own tokenizer and compare it to GPT-2's.

Up to now we've been splitting on whitespace. Every word is a token. That's how the model in chapter 1 defined its vocabulary, and it seemed innocent enough.

It's not innocent. It's terrible.

A whitespace tokenizer treats running, runs, and run as three completely unrelated words. It treats cat, cats, and cat's as three completely unrelated words. Worse, the moment your model meets a word it has never seen — lowest, say, when training only had low and lower — it has zero information. No fallback. The vocabulary is closed.

Going the other way doesn't help either. If we split on characters — every letter is a token — the vocabulary is tiny (maybe 80 symbols) and totally open. But now running is seven tokens: r u n n i n g. The model has to carry information across seven positions just to know which word it's looking at. Sequences explode.

There's a middle ground that's almost embarrassingly simple, and it powers every modern tokenizer. It's called Byte Pair Encoding (BPE) — originally a 1994 data-compression algorithm that someone in 2015 (Sennrich et al.) realized was great for language too.

You're going to build it from scratch. Four functions, ~30 lines of JS total. Then you will save the same idea as llm/tokenizer.py, because every later local script needs a stable way to turn text into ids.

1. Initialize: every word as a list of characters

The starting state is the most uninformative thing possible: every word is split into individual characters. Whitespace separates words, but inside a word everything is single-letter.

Code · JavaScript
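
The interactive listing isn't reproduced here; a minimal sketch of the init step, using the bpeInit name the chapter introduces later, could look like this:

// Split on whitespace, then break each word into single characters.
// Sketch only; the widget's version may differ in the details.
function bpeInit(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter(Boolean)
    .map(word => word.split(""));
}

// bpeInit("low lower") → [["l","o","w"], ["l","o","w","e","r"]]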

That's the unit BPE will start merging from. Each merge step picks the most frequent adjacent pair and fuses it into a new single token. Repeat enough times and you get back something that looks like words.

2. Find the most common pair

The trainer needs to ask: which pair of adjacent tokens shows up the most? Walk through the corpus, count every (a, b) where they sit next to each other, return the winner.

Code · JavaScript
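
Again, the real listing lives in the widget; one way to write countPairs, assuming it returns the single most frequent adjacent pair, is:

function countPairs(corpus) {
  const counts = new Map();
  for (const word of corpus) {
    for (let i = 0; i < word.length - 1; i++) {
      // Key the pair with a separator so "a"+"bc" and "ab"+"c" can't collide.
      const key = word[i] + "\u0000" + word[i + 1];
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  let best = null;
  let bestCount = 0;
  for (const [key, count] of counts) {
    if (count > bestCount) {
      best = key.split("\u0000");
      bestCount = count;
    }
  }
  return best; // e.g. ["e", "s"], or null if no word has two symbols left
}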

Top adjacent pairs in the corpus: e+s, s+t, w+e, l+o, o+w, n+e, e+w, w+i.

In our corpus, the winning pair is going to be something boring like e + s or n + e — common letter sequences in English. That's the whole story: BPE is going to rediscover morphology by chasing whatever appears together a lot.

3. Apply one merge

Once we know the winning pair, we sweep through the corpus and replace every occurrence of (a, b) with the single new token ab. The new token is just the concatenation. Nothing magic.

Code · JavaScript
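
A sketch of the merge sweep, mirroring the Python version saved later in the chapter:

function applyMerge(corpus, [a, b]) {
  const merged = a + b;
  return corpus.map(word => {
    const out = [];
    let i = 0;
    while (i < word.length) {
      if (i < word.length - 1 && word[i] === a && word[i + 1] === b) {
        out.push(merged); // fuse the winning pair into one symbol
        i += 2;           // skip both halves so neither is reused
      } else {
        out.push(word[i]);
        i += 1;
      }
    }
    return out;
  });
}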

Compare the before and after. Wherever a + b appeared, you now see the single token ab. The token count went down. The vocabulary grew by one (the new merged token).

4. Train and encode

Now we put it together. Run "find most frequent + apply merge" N times in a loop, accumulating each merge in order. Then take a brand-new piece of text — lowest newer — and apply your trained merges to it.

The script brings all four building blocks together explicitly: bpeInit, countPairs, applyMerge, and encode (the last one applies a sequence of merges to new text).

Code · JavaScript
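
A sketch of the training loop and of encode, built on the three functions above. The name trainBpe is assumed here for the loop, which the chapter doesn't name itself:

// Training: repeat "count, pick winner, merge", recording each winner in order.
function trainBpe(text, steps) {
  let corpus = bpeInit(text);
  const merges = [];
  for (let s = 0; s < steps; s++) {
    const pair = countPairs(corpus);
    if (!pair) break; // nothing left to merge
    merges.push(pair);
    corpus = applyMerge(corpus, pair);
  }
  return merges;
}

// Encoding: replay the learned merges, in the same order, on new text.
function encode(text, merges) {
  let corpus = bpeInit(text);
  for (const pair of merges) {
    corpus = applyMerge(corpus, pair);
  }
  return corpus.flat();
}

// With enough merge steps on the toy corpus, encode("lowest newer", merges)
// should come out as something like ["low", "est", "new", "er"].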

Look at the encoded test text. The word lowest is a word your trainer has never seen — it doesn't appear in the training corpus. But after applying the trained merges, lowest decomposes into something like low + est. The pieces exist because:

  • low was in the corpus, so the merges built up l + o → lo, then lo + w → low.
  • est was the suffix of newest and widest, so the merges built up e + s → es, then es + t → est.

You never showed the trainer the word lowest. The tokenizer can still represent it. That's the headline result: BPE generalizes to unseen words by composing the pieces it learned.

What you give up

BPE is greedy and irreversible. The first few merges are critical: they shape every subsequent one. If your training corpus is small or biased — say, lots of code or one specific dialect — the discovered vocabulary will reflect that. Tokenizers trained on English text tokenize French badly (and vice versa); tokenizers trained on web text often produce eye-wateringly long token sequences for math notation, emoji, or rare scripts.

The merges are also position-blind. est works for newest and for estimate even though those are unrelated words. The tokenizer doesn't know or care. It just knows that the pair e + s is common and that the pair es + t is too.

5. Save your tokenizer locally

Create llm/tokenizer.py:

"""A tiny BPE-style tokenizer for the teaching project."""
from __future__ import annotations
 
import json
from collections import Counter
from pathlib import Path
 
 
# [1]
def words(text: str) -> list[list[str]]:
    return [list(word) for word in text.lower().split()]
 
 
# [2]
def count_pairs(corpus: list[list[str]]) -> Counter[tuple[str, str]]:
    pairs: Counter[tuple[str, str]] = Counter()
    for word in corpus:
        for left, right in zip(word, word[1:]):
            pairs[(left, right)] += 1
    return pairs
 
 
# [3]
def apply_merge(corpus: list[list[str]], pair: tuple[str, str]) -> list[list[str]]:
    merged = "".join(pair)
    out: list[list[str]] = []
    for word in corpus:
        next_word: list[str] = []
        i = 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                next_word.append(merged)
                # [4]
                i += 2
            else:
                next_word.append(word[i])
                i += 1
        out.append(next_word)
    return out
 
 
# [5]
def train_bpe(text: str, steps: int) -> list[tuple[str, str]]:
    corpus = words(text)
    merges: list[tuple[str, str]] = []
    for _ in range(steps):
        pairs = count_pairs(corpus)
        if not pairs:
            break
        pair, _ = pairs.most_common(1)[0]
        merges.append(pair)
        corpus = apply_merge(corpus, pair)
    return merges
 
 
# [6]
def save_merges(merges: list[tuple[str, str]], path: str = "data/merges.json") -> None:
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    Path(path).write_text(json.dumps(merges, indent=2), encoding="utf-8")

Walk through the code as a compression loop:

  • [1] words starts at the character level. That is the safest possible starting vocabulary: every word can be represented.
  • [2] count_pairs scans inside each word and counts adjacent symbols. It is asking “which two symbols most want to become one?”
  • [3] apply_merge replaces every winning pair with a single new symbol.
  • [4] jumps by 2 after a merge so the same character is not reused twice.
  • [5] train_bpe repeats “count, pick winner, merge” and records the winners in order.
  • [6] save_merges writes the learned recipe to disk. The merge order matters; applying the same pairs in a different order gives a different tokenization.

Create scripts/train_tokenizer.py:

from llm.tokenizer import save_merges, train_bpe
 
TEXT = "low lower newest widest low lower newer"
 
# [1]
merges = train_bpe(TEXT, steps=20)
# [2]
save_merges(merges)
 
# [3]
print(f"saved {len(merges)} merges to data/merges.json")
print(merges[:10])

This local script is deliberately tiny because the interesting work is in llm/tokenizer.py:

  • TEXT is a toy corpus with repeated roots and suffixes.
  • [1] trains at most 20 new merged tokens.
  • [2] gives later scripts a stable artifact instead of retraining every time.
  • [3] lets you sanity-check that common pieces like lo, low, or est are appearing.

Run it:

python -m scripts.train_tokenizer

The tokenizer is still tiny, but it now exists as a reusable part of your local model. Chapter 11 will replace this toy trainer with a production tokenizer for speed and coverage; the shape is the same.

Recap

  • Whitespace tokenization is too coarse: a closed vocabulary, no relationship between morphologically related words.
  • Character tokenization is too fine: the vocab is tiny but sequences explode.
  • BPE sits in the middle: start with characters, greedily merge the most common adjacent pair, repeat. The vocabulary grows by one entry per merge.
  • The merges discover morphology (suffixes, common roots) without being told.
  • It generalizes: words the trainer never saw can still be encoded, by composing learned pieces.
  • Your local project now has llm/tokenizer.py and a saved data/merges.json.
  • Every modern tokenizer is a variant of this: BPE (GPT-2/3/4), WordPiece (BERT), SentencePiece (T5, LLaMA). The implementation details differ; the principle is the same.

Going further

Next up: giving meaning to words — a token is just a number, but you can give it a vector, and that vector can know things.