The loss curve

Chapter 16 · 13 min

Why your model talks badly

Compare your trained model to GPT-2 small and see exactly where size, data, and tuning matter. The honest gap.

Run scripts/generate.py from chapter 14 and read the output. You'll get something like:

ROMEO: Speak, my lord. The night is past, and yet the dawn is far. I have not seen the lady. JULIET: My lord, the friar hath sent for thee. ROMEO: I cannot. The hour is...

Recognizable. Shakespeare-ish line breaks, character names in caps, occasional Elizabethan vocabulary. Locally coherent — within a sentence, the syntax mostly works. Globally incoherent — the scenes don't add up, characters say things their plays would never have them say, names sometimes don't match the speaker.

Now run scripts/sample_gpt2.py from chapter 15 on the same prompt. Coherent paragraphs, factual completions, no Elizabethan vocabulary. The weights changed; the architecture did not. The difference in output quality comes from three orthogonal axes: scale, data, and alignment — none of which live in the code you wrote.

1. Scale

Your model: ~14M parameters, ~272k training tokens (the TinyShakespeare corpus from chapter 11), ~10 minutes of CPU time.

GPT-3: 175B parameters, 300B tokens, weeks on hundreds of GPUs.

That's roughly four orders of magnitude in parameters and six in training tokens. The empirical observation — codified in scaling laws by Kaplan et al. (2020) and refined by Hoffmann et al. (2022, "Chinchilla") — is that model quality improves predictably and smoothly with compute and data, with diminishing returns at each scale. The Chinchilla rule of thumb: the compute-optimal number of training tokens is roughly 20× the number of parameters. Train less, you underfit. Train more, you waste compute that would have been better spent on a bigger model.

Implement it and plot a few known models on the scaling map.

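A minimal sketch of the rule in plain JavaScript. It prints each model's position relative to the Chinchilla line rather than plotting it; the parameter and token counts are the figures quoted in this chapter:

```js
// Chinchilla rule of thumb: compute-optimal training tokens ≈ 20 × parameters.
const CHINCHILLA_TOKENS_PER_PARAM = 20;

// Figures as quoted in this chapter.
const models = [
  { name: 'yours (chapter 13)', params: 14e6, tokens: 272e3 },
  { name: 'GPT-3', params: 175e9, tokens: 300e9 },
  { name: 'Chinchilla', params: 70e9, tokens: 1.4e12 },
];

for (const m of models) {
  const optimalTokens = m.params * CHINCHILLA_TOKENS_PER_PARAM;
  const fraction = m.tokens / optimalTokens; // 1.0 means exactly on the line
  console.log(
    `${m.name.padEnd(20)} optimal ≈ ${optimalTokens.toExponential(1)} tokens, ` +
      `trained on ${(fraction * 100).toFixed(2)}% of that`,
  );
}
// yours (chapter 13)   optimal ≈ 2.8e+8 tokens, trained on 0.10% of that
// GPT-3                optimal ≈ 3.5e+12 tokens, trained on 8.57% of that
// Chinchilla           optimal ≈ 1.4e+12 tokens, trained on 100.00% of that
```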

Your 14M model from chapter 13 sits well below the Chinchilla line — it's wildly under-trained (which is why "more steps" usually helps so much at this scale). GPT-3, famously, is also under-trained by Chinchilla's estimate; Chinchilla itself (DeepMind, 70B params, 1.4T tokens) is the data point that proved smaller-but-better-trained beats bigger-but-under-trained at the same compute budget.

The headline takeaway: most of the quality gap between your model and a modern LLM is just scale. Run the exact same architecture at 10,000× the size and you'd have something more capable. That does not mean from-scratch pretraining is the best business move; it usually is not.

2. Data

The second axis is what the model learns from. Your training corpus is TinyShakespeare — about 1.1M characters (~272k GPT-2 tokens), one author, one era, one register. A modern LLM trains on hundreds of billions of tokens spanning books, code, math, conversation, technical manuals, news, prose fiction, social media, web pages, scientific papers, multiple languages, and curated instruction-following examples.

What that means for output:

  • Out-of-distribution prompts. Ask your Shakespeare model about JavaScript and it has no idea — the word never appeared in training. ChatGPT can answer because the relevant text was in its training corpus.
  • Factuality. Your model never saw factual text, so it can't produce facts. It hallucinates as a baseline, not as a failure mode.
  • Diversity of register. Your model speaks one register: archaic English drama. A general LLM has to pick a register from context and stay in it.

Just adding more of the same kind of text helps less than adding different kinds. The composition of the training mix is a research problem on its own.

3. Alignment

Scaling and data give you a model that's good at predicting the next token. That is not the same as being useful. A raw GPT-3, prompted with a question, will often produce a plausible-looking continuation that isn't actually an answer — because in the training data, "questions" are often followed by more questions, by quotes from articles, by silence, by tangents.

The third axis is the work that turns a next-token predictor into an assistant:

  • Supervised fine-tuning (SFT) on human-written examples of "good" answers.
  • Reinforcement learning from human feedback (RLHF) or related techniques: train a reward model on human preferences, then optimize the language model against it.
  • Direct preference optimization (DPO) and friends: more recent, cheaper alternatives to RLHF.

A modern chat model has spent a meaningful chunk of its compute on this alignment phase. Skip it and the same base model is far less pleasant to use.

For our scope, we do the first of those three steps. Chapter 17 walks through SFT on your chapter-13 model with a small dataset and the right mask. The preference-tuning steps (RLHF, DPO) need preference data (and, for RLHF, a reward model) that we will not collect or train — but SFT alone closes most of the format gap. Most of the "this doesn't feel like an assistant" feeling is the absence of SFT, not the absence of RLHF.
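As a preview, the core of that mask fits in a few lines. IGNORE and the token arrays are illustrative stand-ins; chapter 17 does this properly:

```js
// Loss is computed only on the answer tokens. Prompt tokens get an
// "ignore" label so the model is never trained to imitate the question.
const IGNORE = -1; // stand-in for whatever sentinel your loss function skips

function buildLabels(promptTokens, answerTokens) {
  return [
    ...promptTokens.map(() => IGNORE), // no gradient from the prompt
    ...answerTokens,                   // learn to produce the answer
  ];
}
```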

4. How do you know you're getting better?

The training loop reports a number called loss. Lower is better. Necessary but not sufficient — a single scalar that does not survive comparison across datasets, vocabularies, or model sizes.

Perplexity — the readable version of loss

Perplexity is exp(loss). It has a unit you can reason about: the average number of equally-likely next tokens the model still considers after seeing the context. A perplexity of 1 means total certainty; a perplexity of vocab_size means a uniform guess (no learning at all). Rough landmarks on natural English:

  • The model from chapter 1: ~100-1000 depending on the tokenizer.
  • Your chapter-13 model on Shakespeare validation: ~10-30 (depending on context length).
  • GPT-2 small (~120M params) on Wikipedia: ~30-40.
  • Frontier LLMs on the same Wikipedia text: ~15-20.

Lower is better, but the comparison is only meaningful on the same dataset. Perplexity on Shakespeare is not directly comparable to perplexity on Wikipedia or code.
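In code the conversion is one line, assuming loss is the mean cross-entropy in nats per token (the usual convention):

```js
// perplexity = exp(mean cross-entropy), with loss in nats per token.
const perplexity = (loss) => Math.exp(loss);

console.log(perplexity(0));               // 1: total certainty
console.log(perplexity(Math.log(50257))); // ≈ 50257: uniform over GPT-2's vocab
console.log(perplexity(2.9));             // ≈ 18: a mid-range Shakespeare validation loss
```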

Benchmark suites

When a paper says "improved performance by X%", it usually means a benchmark like:

  • HellaSwag — pick the most plausible sentence completion from 4 candidates (common-sense reasoning).
  • MMLU — multiple-choice questions across 57 topics (knowledge breadth).
  • LAMBADA — predict the final word of a paragraph (long-range dependency).

These exist so different models can be compared on the same axis. They are designed for ≥1B-parameter models. At your scale, benchmark scores are mostly noise — a 14M model on MMLU will sit near the 25% random-guess baseline no matter how well you trained it. Do not chase them.
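For the record, base models are typically scored on these benchmarks by likelihood rather than by asking for an answer: score each candidate completion with the model's loss and pick the lowest. A sketch, with model.loss as a hypothetical stand-in that returns mean cross-entropy per token:

```js
// Likelihood-based multiple choice: the "chosen" answer is the candidate
// the model finds least surprising after the context.
function pickCompletion(model, context, candidates) {
  let bestIndex = 0;
  let bestLoss = Infinity;
  candidates.forEach((candidate, i) => {
    const loss = model.loss(context + ' ' + candidate); // mean nats/token
    if (loss < bestLoss) {
      bestLoss = loss;
      bestIndex = i;
    }
  });
  return bestIndex; // accuracy near 25% on 4 candidates means no signal
}
```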

Qualitative evaluation is honest

For small models, the most useful evaluation is to read what the model says. Pick 5-10 prompts that exercise the behavior you care about. Generate from your model and from a trusted baseline (the chapter-13 model, or GPT-2 small via transformers.js) side by side. After a few hundred such comparisons, you build an intuition no scalar can replace.
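A minimal harness for that habit, assuming GPT-2 small via the @xenova/transformers package as the baseline; myModelGenerate is a hypothetical stand-in for your own model's sampler:

```js
import { pipeline } from '@xenova/transformers';

// Hypothetical stand-in: replace with your chapter-13 model's sampler.
async function myModelGenerate(prompt, { maxNewTokens }) {
  return prompt + ' ...';
}

const baseline = await pipeline('text-generation', 'Xenova/gpt2');
const prompts = ['ROMEO:', 'The meaning of life is', 'function add(a, b) {'];

for (const prompt of prompts) {
  const ours = await myModelGenerate(prompt, { maxNewTokens: 40 });
  const theirs = await baseline(prompt, { max_new_tokens: 40 });
  console.log(`--- ${prompt}`);
  console.log(`yours: ${ours}`);
  console.log(`gpt-2: ${theirs[0].generated_text}`);
}
```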

That intuition is what the rest of the book leverages: SFT changes the format, LoRA tunes a narrow specialization, quantization changes cost. Each lever moves a different axis of the output, and you need to see those axes before you can choose which lever to pull.

5. Where the money usually is

Honesty: at this scale, your trained model is not useful as a general product. It's great as a pedagogical object. You wrote every line, watched every loss curve, can explain every parameter. That is the foundation you need before deciding where to spend money.

Commercially, value usually comes from one of four places:

  • Data advantage. You have private, clean, domain-specific data that competitors do not.
  • Workflow integration. The model sits inside a painful business process and removes time, errors, or support cost.
  • Specialization. A small model fine-tuned for one narrow job can beat a larger generic model on cost, latency, privacy, or reliability.
  • Serving efficiency. Quantization, caching, batching, and routing make the same capability cheaper to run.

There are also real applications for tiny models near our scale, if you train on the right data:

  • Embedded autocomplete for a specific small domain (a code editor for one project, a CMS that auto-completes article titles).
  • Anomaly detection by perplexity. Train on one domain, flag anything with much higher perplexity than baseline (sketched below, after this list).
  • Style transfer if you have a tightly-defined target style.
  • Classification with a small instruction tail — the model produces a token, you check whether it matches your expected class.
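The anomaly-detection item is small enough to sketch here, again with model.loss as a hypothetical stand-in:

```js
// Flag text whose perplexity under a single-domain model is far above the
// baseline measured on held-out in-domain data. The factor of 3 is a
// tunable assumption, not a standard value.
function isAnomalous(model, text, baselinePerplexity, factor = 3) {
  const perplexity = Math.exp(model.loss(text)); // model.loss: mean nats/token
  return perplexity > baselinePerplexity * factor;
}

// Usage: train on normal server logs, then flag lines the model finds
// surprising, e.g. isAnomalous(logModel, suspiciousLine, 12).
```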

The "we need 100B+ parameters to be useful" claim is true for general dialogue. For narrow tasks, much smaller models are often enough. Chapters 17 and 18 are practical for that reason: fine-tuning changes behavior; quantization changes cost.

Recap

  • Architecture is almost identical between your model and a frontier LLM. The gap is in scale, data composition, and alignment, not in the code.
  • Chinchilla rule: optimal tokens ≈ 20 × parameters. Under-training is the most common failure mode at small scales.
  • Data matters as much as size. The diversity and quality of the mix determines what the model can learn, even before scale determines how much it does.
  • Alignment (SFT + RLHF/DPO) is what turns a next-token predictor into an assistant. We do the SFT half in chapter 17; preference tuning is beyond scope.
  • Perplexity is loss made readable; benchmarks are noise at small scale; qualitative side-by-side reading is the honest evaluation for models the size of yours.
  • Small models can be useful for narrow tasks. The trick is matching scale to the task, not chasing frontier numbers.
  • The commercial path is usually data + workflow + efficiency, not pretraining a frontier model from scratch.

Going further

That closes the "build your LLM" arc. Part V is practical work on top of it: instruction-tuning the model so it follows the chat shape, then making it cheaper and usable.

Next up: Part V begins with "give your model instructions" — the cheapest, most direct way to turn the next-token predictor from chapter 13 into a model that actually answers questions.