The loss curve

Chapter 1 · 18 min

The dumbest model that exists

Build the simplest possible language model — a bigram counter. Tokens, probability tables, sampling. Runs in your browser, then locally.

Open any book. Read the first line. Now, can you guess the next word? You probably can — because you've seen it a thousand times before, somewhere else.

Let's build a model that can do nothing else but that. By the end of the chapter, you'll have run four small functions, seen how they compose into a generator, and saved the first file in your own my-llm/ project.

The corpus we'll work with is a fragment of a children's rhyme. It's deliberately small and repetitive — small enough that you can sanity-check your code by eye, and repetitive enough that the model will say something almost-coherent.

the cat sat on the mat the dog sat on the rug the cat watched the dog the dog watched the cat

0. Start your local project

The browser cells in this chapter are the fast lab bench. Your local folder is the thing you keep.

Create it now:

# macOS / Linux
mkdir -p my-llm/llm my-llm/scripts my-llm/data && cd my-llm && python3 -m venv .venv && source .venv/bin/activate

# Windows (PowerShell)
mkdir my-llm; mkdir my-llm\llm,my-llm\scripts,my-llm\data; cd my-llm; py -m venv .venv; .\.venv\Scripts\Activate.ps1

Then add an empty package marker:

# macOS / Linux
touch llm/__init__.py

# Windows (PowerShell)
New-Item llm\__init__.py -ItemType File

You will use this folder for the rest of the book. Every chapter adds one brick to it.

1. Cut the text into tokens

Before we can model anything, we need to chop the text into the units the model will reason about. The simplest possible choice: every whitespace-separated word is a token. Lowercase everything so "The" and "the" count as the same word.

Run the cell below. It returns the array of tokens, and the visualization shows exactly what the model will see.

Code · JavaScript
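If you're reading without the interactive cell, here's a minimal sketch of what it computes. The helper name tokenize matches the recap; the cell's exact code may differ.

const text = "the cat sat on the mat the dog sat on the rug the cat watched the dog the dog watched the cat";

// Lowercase, split on whitespace, drop empty strings.
function tokenize(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

const tokens = tokenize(text);
console.log(tokens.length, tokens.slice(0, 6)); // 22 [ 'the', 'cat', 'sat', 'on', 'the', 'mat' ]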

That's it. That's training data preparation in its simplest form. Real tokenizers do something much more clever (we cover that in chapter 3), but for now whitespace + lowercase is enough.

2. Count the pairs

The whole "model" is going to be a counts table: for every pair of adjacent tokens (a, b) in the corpus, how many times did b follow a? That table is the model. Training is filling it in.

Some terminology before we code: the vocabulary is the set of unique tokens in the order they first appear. The counts table is a square matrix of size vocab.length × vocab.length, where counts[i][j] is the number of times vocab[i] was immediately followed by vocab[j] in the corpus.

Run the loop and inspect the matrix it builds.

Code · JavaScript
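The cell isn't reproduced here, but the loop is small enough to sketch. buildCounts is the name the recap uses; the structure of the return value is an assumption.

// vocab: unique tokens in first-seen order.
// counts[i][j]: how many times vocab[j] immediately followed vocab[i].
function buildCounts(tokens) {
  const vocab = [...new Set(tokens)];
  const index = new Map(vocab.map((t, i) => [t, i]));
  const counts = vocab.map(() => new Array(vocab.length).fill(0));
  for (let t = 0; t + 1 < tokens.length; t++) {
    counts[index.get(tokens[t])][index.get(tokens[t + 1])] += 1;
  }
  return { vocab, counts };
}

const { vocab, counts } = buildCounts(tokens);
console.table(counts); // rows and columns follow vocab order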

The matrix you just produced is the entire model. Skim it. Most cells are zero — most pairs of words never appeared next to each other. The non-zero cells tell you what the model has learned. The row for "the" is probably the busiest, because "the" is followed by lots of different things. The row for "mat" is sparse, because "mat" only ever appeared once, followed by "the".

This is the math expressed in one line:

P(w_t \mid w_{t-1}) = \frac{C(w_{t-1}, w_t)}{C(w_{t-1})}

The probability that the next word is w_t, given that the previous word was w_{t-1}, is the number of times you saw the pair (w_{t-1}, w_t) divided by the total number of times you saw w_{t-1}. Each row of your counts matrix, divided by its sum, is exactly such a distribution.
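Worked against the corpus above, which you can check by eye: "the" appears eight times, and three of those occurrences are immediately followed by "cat", so

P(\text{cat} \mid \text{the}) = \frac{C(\text{the}, \text{cat})}{C(\text{the})} = \frac{3}{8}

The rest of that row works out to P(dog | the) = 3/8 and P(mat | the) = P(rug | the) = 1/8.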

3. Sample from a distribution

We have probabilities. We need to turn them into actual choices. If a row says "the" is followed by "cat" 40% of the time and by "dog" 60% of the time, we want to draw "dog" six times out of ten, on average.

The standard trick: roll a random number in [0, 1), walk the cumulative sum of the probabilities, and pick the first index whose cumulative sum exceeds the roll.

Code · JavaScript
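A minimal sketch of that trick, assuming the helper is named sampleNext (as in the recap) and takes one row of the counts matrix:

// Draw an index from a row of counts: normalize, roll in [0, 1), walk the cumulative sum.
function sampleNext(row) {
  const total = row.reduce((a, b) => a + b, 0);
  if (total === 0) return null;              // dead end: nothing ever followed this token
  const roll = Math.random();
  let cumulative = 0;
  for (let j = 0; j < row.length; j++) {
    cumulative += row[j] / total;
    if (roll < cumulative) return j;         // first index whose cumulative sum exceeds the roll
  }
  return row.length - 1;                     // guard against floating-point rounding
}

// Example: draw one successor of "the".
console.log(vocab[sampleNext(counts[vocab.indexOf("the")])]); // cat, dog, mat, or rug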

[Interactive figure: "Distribution" shows the target probabilities over the, cat, dog, rug; "Empirical counts" fills in as you click Run.]

Run it a few times. Watch the empirical-counts plot fill in. After a couple dozen draws, the right side should start to look like the left side. That's the law of large numbers, sneaking into our chapter.

4. Put it together: generate

Now we chain everything. Start from a seed token. Find its row in the counts matrix. Normalize the row into a distribution. Sample the next token. Make that token the new seed. Repeat.

The cell now puts the corpus, the seed, the desired length, and the three helper functions directly in the script. The loop stops early if it hits a "dead end" — a token whose row is all zeros, meaning the corpus never showed anything that could follow it.

Code · JavaScript
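Again a sketch rather than the cell's exact code, reusing the helpers from the earlier sketches:

// Chain the pieces: seed -> row -> distribution -> sample -> new seed -> repeat.
function generate(vocab, counts, seed, steps) {
  const out = [seed];
  let current = vocab.indexOf(seed);          // assumes the seed is in the vocabulary
  while (out.length < steps) {
    const next = sampleNext(counts[current]); // draw the next token's index from this row
    if (next === null) break;                 // dead end: the row is all zeros
    out.push(vocab[next]);
    current = next;
  }
  return out.join(" ");
}

console.log(generate(vocab, counts, "the", 12));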

Click Run a few times. You'll get different sequences each time because each step is a random draw, not a deterministic pick. Some will be coherent, some will be nonsense. None of them are verbatim repetitions of the training data, unless you got unlucky. They're new sequences drawn from the same statistical shape.

This is, structurally, the same thing every modern LLM does: previous tokens go in, a probability distribution over the next token comes out, you sample. The mechanism that produces the distribution is hugely more expressive than counts[i] / row_sum[i] — billions of parameters, context over thousands of tokens, deep stacks of transformations — but the interface is identical.

You've just built a language model. It happens to be terrible. The rest of the book is about replacing each part of it with something that isn't.

5. Put the baseline in your repo

Now save the same idea locally. Create llm/bigram.py:

"""A tiny bigram language model."""
from __future__ import annotations
 
import random
from collections import Counter, defaultdict
 
 
# [1]
def tokenize(text: str) -> list[str]:
    return [part for part in text.lower().split() if part]
 
 
# [2]
def train(tokens: list[str]) -> dict[str, Counter[str]]:
    counts: dict[str, Counter[str]] = defaultdict(Counter)
    # [3]
    for left, right in zip(tokens, tokens[1:]):
        counts[left][right] += 1
    return counts
 
 
# [4]
def sample_next(row: Counter[str]) -> str | None:
    total = sum(row.values())
    if total == 0:
        return None
 
    roll = random.random() * total
    acc = 0.0
    for token, count in row.items():
        acc += count
        if roll <= acc:
            return token
    return next(reversed(row))
 
 
# [5]
def generate(model: dict[str, Counter[str]], seed: str, steps: int) -> list[str]:
    out = [seed]
    for _ in range(steps - 1):
        nxt = sample_next(model.get(out[-1], Counter()))
        if nxt is None:
            break
        out.append(nxt)
    return out

Read this file in small pieces:

  • [1] tokenize is deliberately boring. Lowercase, split, drop empty chunks. The output is the only thing the model can see.
  • [2] train builds a dictionary of dictionaries: previous token on the outside, possible next tokens on the inside.
  • [3] zip(tokens, tokens[1:]) is the training loop in miniature. It walks (the, cat), then (cat, sat), then (sat, on), one adjacent pair at a time.
  • [4] sample_next does not pick the biggest count. It rolls a random number and walks through the row, so frequent tokens win often but not always.
  • [5] generate repeats that one-token step. The latest token becomes the input for the next prediction.

Create scripts/train_bigram.py:

from llm.bigram import generate, tokenize, train
 
TEXT = "the cat sat on the mat the dog sat on the rug the cat watched the dog the dog watched the cat"
 
# [1]
tokens = tokenize(TEXT)
# [2]
model = train(tokens)
 
# [3]
for _ in range(5):
    print(" ".join(generate(model, seed="the", steps=12)))

This script is intentionally thin:

  • TEXT is the data.
  • [1] turns raw text into the model's units.
  • [2] fills the counts table.
  • [3] prints five different continuations so you can see that generation is stochastic.

Run it:

python -m scripts.train_bigram

You now have the first real artifact of the course: a local language model with training and generation. Tiny, crude, but yours.

Recap

  • You ran four functions: tokenize, buildCounts, sampleNext, and a generate loop that combines them.
  • You started my-llm/ and saved the first reusable module: llm/bigram.py.
  • The model is the counts table. Training is filling in the table. Sampling is picking a row, normalizing it, and drawing from it.
  • Generation is repeated sampling. Different runs give different outputs because each step is a random draw.
  • It's still a language model. A bad one — no memory beyond one token, no generalization, no notion of similar words. But it has the same input/output shape as a real LLM. We're going to fix one flaw at a time.

Going further

  • Karpathy's makemore, episode 1 walks through the same model in Python with a different framing — character-level, with a smoothing tweak.
  • Step by Token, chapter 1 covers the same ideas from the understanding angle.
  • The full reference implementation lives in lib/ml/bigram/ if you want to see what the project's "official" version looks like.

Next up: counting isn't enough — what happens when your local model meets a continuation it has never seen.