The loss curve

Chapter 2 · 15 min

Counting isn't enough

Why counts alone fail and how smoothing fixes them — Laplace, Kneser-Ney, a held-out set, and the first perplexity number.

In chapter 1, you wrote a generate loop that bailed out at "dead ends" — words whose row in the counts table was all zeros. That's not just a UI issue. The model has been asked: "what comes after X?" and the table has a single, honest answer: I have no idea, the row is empty.

That answer is fine for one chapter. It's not fine at any kind of scale. A real model can't just give up the moment it sees a pair it doesn't recognize — it needs to assign some probability to every possible next word, even ones it has never seen in this context. The general name for that move is smoothing.

We're going to build it from scratch. Same kind of corpus as last chapter, but bigger, so we can split it 80/20 into train and validation sets. In the browser, you will see exactly what smoothing does. In your local my-llm/ repo, you will turn the baseline into something you can evaluate.

1. Look the problem in the face

Take one row of the counts table — the row for the previous word with the most observed continuations in the corpus. Most cells will be positive (lots of seen continuations); some will be exactly zero (continuations the model has never witnessed).

The cell turns counts into a conditional distribution: divide every cell by the row sum.

Code · JavaScript
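The real cell runs in the browser; a minimal sketch of the same computation, where the row data and the helper name rowToProbabilities are invented for illustration, might look like this:

// One row of the counts table: next word -> how many times it followed our word.
// These counts are invented for illustration.
const row = { cat: 3, dog: 2, mat: 1, rug: 0, watched: 0 };

// Turn counts into a conditional distribution: divide every cell by the row sum.
function rowToProbabilities(row) {
  const total = Object.values(row).reduce((sum, count) => sum + count, 0);
  const probabilities = {};
  for (const [word, count] of Object.entries(row)) {
    probabilities[word] = count / total;
  }
  return probabilities;
}

console.log(rowToProbabilities(row));
// { cat: 0.5, dog: 0.33..., mat: 0.16..., rug: 0, watched: 0 }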

The bars in the loss color are the zeros. As far as the model can tell, those continuations have no probability at all. If any one of them shows up in the validation set, the validation perplexity is mathematically infinite — log(0) diverges to −∞, so the geometric mean of inverse probabilities blows up the moment a single probability is zero.

That's the whole problem. Smoothing is how we route around it.

2. Add a little to everything: Laplace

The simplest fix has been around since the 1700s. Take every cell of the counts table — yes, including the zeros — and add a small constant α. Then renormalize. Formally:

P_\alpha(w_t \mid w_{t-1}) = \frac{C(w_{t-1}, w_t) + \alpha}{C(w_{t-1}) + \alpha\,V}

where V is the vocabulary size. With α = 1 this is the original "add-one" Laplace smoothing; smaller values are usually better in practice.
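To make the formula concrete with invented numbers: suppose the previous word was seen 6 times in total (C(w_{t-1}) = 6), the pair you are scoring was seen 3 times, the vocabulary has V = 10 words, and α = 0.1. Then

P_\alpha(\text{seen} \mid w_{t-1}) = \frac{3 + 0.1}{6 + 0.1 \times 10} = \frac{3.1}{7} \approx 0.443,
\qquad
P_\alpha(\text{unseen} \mid w_{t-1}) = \frac{0 + 0.1}{7} \approx 0.014

so an unseen continuation now gets a small but non-zero share of the row's mass.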

The cell writes the formula as a function. It receives (count, rowSum, alpha, V), returns a probability, and the chapter draws what that function says about the same row as before — with an α slider so you can watch the bars flatten as α grows.

Code · JavaScript
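A sketch of what such a function could look like, following the (count, rowSum, alpha, V) signature described above (the example values are invented, not the chapter's data):

// Laplace-smoothed probability for one cell of the counts table.
// count = C(w_{t-1}, w_t), rowSum = C(w_{t-1}), V = vocabulary size.
function laplaceProbability(count, rowSum, alpha, V) {
  return (count + alpha) / (rowSum + alpha * V);
}

// A seen pair keeps most of its mass; an unseen pair gets a little.
console.log(laplaceProbability(3, 6, 0.1, 10)); // ≈ 0.443
console.log(laplaceProbability(0, 6, 0.1, 10)); // ≈ 0.014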

Slide α from 0.001 to 2. Two things happen:

  • All the zeros become positive. The model is no longer surprised by anything.
  • The non-zero bars shrink. The model is also less confident about the things it had seen.

That second effect is the cost of smoothing. Add too much and you're effectively saying "every word is roughly as likely as any other".

3. Find the sweet spot: perplexity vs α

If α = 0 makes the model crash on unseen pairs and a huge α makes the model say nothing useful, there's a sweet spot in between. The standard way to find it: compute perplexity (the geometric mean of inverse probability) on the validation set at a range of α values and pick the lowest point.

The cell computes perplexity from transitions (each a { count, rowSum } object — the model-side numbers needed to score a validation pair under our train counts), an alpha, and a vocabulary size V. The script includes the canonical laplaceProbability from the previous cell so the sweep can call the same formula repeatedly.

Code · JavaScript
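A sketch of the sweep, assuming laplaceProbability from the previous cell and a hand-made transitions array (the counts are invented, not the chapter's data):

// Perplexity = exp(average negative log probability) over the validation transitions.
function perplexity(transitions, alpha, V) {
  let totalLoss = 0;
  for (const { count, rowSum } of transitions) {
    totalLoss -= Math.log(laplaceProbability(count, rowSum, alpha, V));
  }
  return Math.exp(totalLoss / transitions.length);
}

// Two seen pairs and one unseen pair, with made-up counts.
const transitions = [
  { count: 3, rowSum: 6 },
  { count: 1, rowSum: 4 },
  { count: 0, rowSum: 6 },
];

// Sweep alpha and look for the bottom of the U.
for (const alpha of [0.001, 0.01, 0.1, 0.5, 1, 2]) {
  console.log(alpha, perplexity(transitions, alpha, 10).toFixed(2));
}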

You should see a U-shape. Far left (tiny α): the model trusts its data too much; one rare unseen pair is enough to send perplexity through the roof. Far right (large α): the model is too uniform; it's spending its probability mass equally on plausible and absurd continuations. The bottom of the U is the best Laplace smoothing can do for this corpus.

4. A smarter family: Kneser-Ney

Laplace's laziness is that it treats unseen pairs uniformly. Every unseen (w_{t-1}, w_t) gets the same tiny probability. But that's silly — words like Francisco should mostly only follow San, not "any context with equal weight".

Kneser-Ney smoothing is a smarter member of the same family:

  • Subtract a small constant d from every seen count (call it the discount). The total mass you took away is d × (number of distinct continuations seen).
  • Redistribute that mass through a fallback distribution. Real Kneser-Ney uses a clever continuation distribution; we use the uniform 1/V for simplicity (the resulting formula is written out below).
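Written out, the simplified version used here (with the uniform fallback) looks roughly like this, where N_{1+}(w_{t-1}) is the number of distinct continuations seen after w_{t-1}:

P_d(w_t \mid w_{t-1}) = \frac{\max\bigl(C(w_{t-1}, w_t) - d,\ 0\bigr)}{C(w_{t-1})} + \lambda(w_{t-1}) \cdot \frac{1}{V},
\qquad
\lambda(w_{t-1}) = \frac{d \, N_{1+}(w_{t-1})}{C(w_{t-1})}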

Run the formula. Same shape as laplaceProbability, but with the extra distinctContexts argument and the lambda-times-fallback construction.

Code · JavaScript
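A sketch under the same assumptions as before, with the extra distinctContexts argument the chapter mentions (the argument order and example numbers are guesses for illustration):

// Simplified Kneser-Ney: discount every seen count by d, then hand the collected
// mass (d * distinctContexts, normalized by the row sum) to a uniform 1/V fallback.
function kneserNeyProbability(count, rowSum, distinctContexts, d, V) {
  const discounted = Math.max(count - d, 0) / rowSum;
  const lambda = (d * distinctContexts) / rowSum;
  return discounted + lambda * (1 / V);
}

// Made-up row: 6 observations, 3 distinct continuations, discount d = 0.5.
console.log(kneserNeyProbability(3, 6, 3, 0.5, 10)); // seen pair, ≈ 0.442
console.log(kneserNeyProbability(0, 6, 3, 0.5, 10)); // unseen pair, = 0.025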

We sweep d and overlay the result on top of the curve from the previous cell. The two curves usually have different shapes — often Kneser-Ney has a sharper minimum at a smaller smoothing strength. On a real corpus the gap is meaningful; on this toy text the differences are subtle, but the point is that "smoothing" is a family of choices, and Laplace is the simplest member of the family.

5. Upgrade your local baseline

Open llm/bigram.py and add a smoothed probability function:

def laplace_probability(
    model: dict[str, Counter[str]],
    previous: str,
    token: str,
    vocab: set[str],
    alpha: float = 0.1,
) -> float:
    # [1]
    row = model.get(previous, Counter())
    # [2]
    count = row[token]
    # [3]
    total = sum(row.values())
    # [4]
    return (count + alpha) / (total + alpha * len(vocab))

Read the formula directly from the code:

  • [1] row is everything the model has ever seen after previous.
  • [2] count is the specific pair you are scoring: previous -> token.
  • [3] total is the row sum: the total number of times anything was observed after previous.
  • [4] gives every possible continuation α extra mass (α V in total across the row), then renormalizes so the probabilities still sum to 1.

Then create scripts/eval_bigram.py:

from __future__ import annotations
 
import math
 
from llm.bigram import laplace_probability, tokenize, train
 
TEXT = "the cat sat on the mat the dog sat on the rug the cat watched the dog the dog watched the cat"
 
tokens = tokenize(TEXT)
# [1]
split = int(len(tokens) * 0.8)
train_tokens = tokens[:split]
val_tokens = tokens[split:]
 
# [2]
model = train(train_tokens)
vocab = set(tokens)
 
loss = 0.0
n = 0
# [3]
for previous, token in zip(val_tokens, val_tokens[1:]):
    p = laplace_probability(model, previous, token, vocab, alpha=0.1)
    # [4]
    loss -= math.log(p)
    n += 1
 
# [5]
print(f"validation perplexity: {math.exp(loss / n):.2f}")

The evaluation script has one job: make the model answer on text it did not train on.

  • [1] creates a tiny train/validation split.
  • [2] trains only on train_tokens; val_tokens stays held back.
  • [3] walks each validation pair and asks, “what probability did the model assign to the true next token?”
  • [4] accumulates next-token loss. High probability means small loss; low probability means large loss.
  • [5] turns average loss into perplexity, a more readable “how many plausible choices did the model feel it had?” score.

Run it:

python -m scripts.eval_bigram

The number is not impressive. That is the point. You now have a baseline you can beat.

Why this whole chapter feels like a hack

Because it is.

Smoothing is what you do when you've decided to model language with a counts table. It's not that a better smoother would save the model — no amount of patching helps with the fundamental limitation: the model has no memory beyond one word. It can't tell that the sentence is about cats. It can't notice that "the" tends to be followed by a noun. It is reasoning from one word of context.

The next several chapters are about replacing the counts table with something that can actually generalize — words that mean similar things should behave similarly, even in contexts they've never seen. The trick is to stop storing facts about every pair and start storing facts about every word individually, then learn how the words combine.

That trick has a name: embeddings. We get there in chapter 4. First we have to teach the model to chop up the input properly.

Recap

  • Smoothing turns a model with hard zeros into a model with soft, finite probabilities everywhere. Without it, any pair the model has never seen makes the whole sequence's probability collapse to zero.
  • Add-α (Laplace) is the simplest move — add a constant to every cell and renormalize. The α slider trades over-fitting (small α) for over-smoothing (large α).
  • Perplexity is the geometric mean of inverse probability, computed via a sum of logs for numerical stability.
  • Kneser-Ney is a smarter member of the same family: it takes mass away from seen pairs and redistributes it through a more thoughtful fallback than the uniform.
  • Your local model now has a validation score. Perplexity gives you a number to improve, not just samples to eyeball.
  • None of this fixes the real problem with bigram models, which is that they can only see one word back. That's what the next several chapters address.

Going further

Next up: training your own tokenizer — what if "the cat" and "thecat" should be the same input? What if "running" and "run" should share something? Time to stop splitting on whitespace.