Chapters · in reading order
All chapters
Each chapter is a working artifact you can manipulate, and a short piece of prose that explains what you just did. They build on each other; read them in order.
Part 0 — Before you start
ch 0
Python, venv, PyTorch — the local toolchain. Skip it if you already run Python 3.11+ and pip install torch is muscle memory.
Part 1 — Start the project
ch 1-4
Tokens, bigrams, BPE, embeddings. You start the local project and build the first pieces of a language model.
01
The dumbest model that exists
Build the simplest possible language model — a bigram counter. Tokens, probability tables, sampling. Runs in your browser, then locally.
18 min
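A sketch of chapter 1's idea in plain Python, with a toy word-level corpus standing in for real tokens:

```python
import random
from collections import defaultdict

# Count how often each token follows each other token.
corpus = "the cat sat on the mat and the cat sat".split()
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def sample_next(token):
    # Sample the next token in proportion to its count after `token`.
    followers = counts[token]
    if not followers:
        return None
    r = random.random() * sum(followers.values())
    for nxt, c in followers.items():
        r -= c
        if r <= 0:
            return nxt

token, out = "the", ["the"]
for _ in range(6):
    token = sample_next(token)
    if token is None:
        break
    out.append(token)
print(" ".join(out))
```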
02
Counting isn't enough
Why counts alone fail and how smoothing fixes them — Laplace, Kneser-Ney, a held-out set, and the first perplexity number.
15 min
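A sketch of the add-one (Laplace) half of the chapter; Kneser-Ney needs more machinery than fits here, and the toy texts are stand-ins:

```python
import math
from collections import defaultdict

train = "the cat sat on the mat".split()
heldout = "the dog sat on the mat".split()
vocab = set(train) | set(heldout)

bigram = defaultdict(lambda: defaultdict(int))
context = defaultdict(int)
for prev, nxt in zip(train, train[1:]):
    bigram[prev][nxt] += 1
    context[prev] += 1

def p_laplace(prev, nxt):
    # Add-one smoothing: every possible bigram gets a pseudo-count of 1,
    # so an unseen pair like ("the", "dog") no longer has probability zero.
    return (bigram[prev][nxt] + 1) / (context[prev] + len(vocab))

# Perplexity of the held-out text under the smoothed model.
log_prob = sum(math.log(p_laplace(p, n)) for p, n in zip(heldout, heldout[1:]))
perplexity = math.exp(-log_prob / (len(heldout) - 1))
print(f"held-out perplexity: {perplexity:.1f}")
```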
03
Train your own tokens
Byte Pair Encoding from scratch — count pairs, merge, encode, decode. Train your own tokenizer and compare it to GPT-2's.
16 min
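The core loop in miniature: count adjacent pairs, merge the most frequent, repeat. This sketch runs on raw characters rather than the byte-level, per-word setup a real tokenizer uses:

```python
from collections import Counter

def get_pairs(tokens):
    return Counter(zip(tokens, tokens[1:]))

def merge(tokens, pair):
    # Replace every occurrence of `pair` with the merged symbol.
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")   # start from single characters
merges = []
for _ in range(5):                  # learn 5 merge rules
    pairs = get_pairs(tokens)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]
    merges.append(best)
    tokens = merge(tokens, best)

print(merges)
print(tokens)
```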
04
Giving meaning to words
Give meaning to tokens. One-hot vectors, embeddings, cosine similarity, skip-gram training — and what "king − man + woman" really shows.
16 min
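The arithmetic behind the analogy, sketched with hand-made 2-D vectors; real embeddings come out of the skip-gram training the chapter covers:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D "embeddings", just to show the arithmetic.
vecs = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
}

target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w != "king"), key=lambda w: cosine(target, vecs[w]))
print(best)   # "queen" for these toy vectors
```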
Part 2 — Make it learn
ch 5-7
Single neuron, MLP, optimizers. The model stops counting and starts improving through gradients.
05
A neuron that learns
One neuron, one loss, one gradient. Build a learnable linear unit by hand and watch it converge — the smallest possible training loop.
16 min
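The smallest training loop, sketched in plain Python: one weight, one bias, gradients written out by hand:

```python
# One learnable linear unit y = w*x + b, trained with the hand-derived gradient of MSE.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]            # true relation: y = 2x + 1

w, b, lr = 0.0, 0.0, 0.01
for step in range(2000):
    # forward pass and mean-squared-error loss
    preds = [w * x + b for x in xs]
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # gradients of the loss with respect to w and b
    dw = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    db = sum(2 * (p - y) for p, y in zip(preds, ys)) / len(xs)
    w, b = w - lr * dw, b - lr * db

print(f"w={w:.2f} b={b:.2f} loss={loss:.4f}")   # converges toward w≈2, b≈1
```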
06
Stacking layers
A single neuron is a line. Stack them with a non-linearity and you get an MLP — the feed-forward block at the heart of every Transformer.
14 min
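The shape of that feed-forward block, as a minimal PyTorch sketch (the dimensions are illustrative):

```python
import torch
import torch.nn as nn

# Two linear layers with a non-linearity in between: expand, activate, project back.
class MLP(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

x = torch.randn(2, 8, 64)            # (batch, sequence, d_model)
print(MLP(64, 256)(x).shape)         # torch.Size([2, 8, 64])
```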
07
Gradient descent live
Watch SGD, momentum, and Adam navigate the same loss surface. Build each optimizer step in plain code.
15 min
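The three update rules side by side as plain functions; the hyperparameters here are illustrative defaults:

```python
def sgd(p, g, lr=0.1):
    return p - lr * g

def momentum(p, g, v, lr=0.1, beta=0.9):
    v = beta * v + g                      # running velocity
    return p - lr * v, v

def adam(p, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g             # first moment: mean of gradients
    v = b2 * v + (1 - b2) * g * g         # second moment: mean of squared gradients
    m_hat = m / (1 - b1 ** t)             # bias correction for the early steps
    v_hat = v / (1 - b2 ** t)
    return p - lr * m_hat / (v_hat ** 0.5 + eps), m, v

# Minimize f(p) = p**2 from p = 5 with Adam.
p, m, v = 5.0, 0.0, 0.0
for t in range(1, 101):
    p, m, v = adam(p, 2 * p, m, v, t)
print(round(p, 3))                        # heads toward the minimum at 0
```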
Part 3 — Build the transformer
ch 8-10
Attention, multiple heads, residual connections, and the complete transformer block used by modern LLMs.
08
An attention head by hand
Q, K, V, scaled dot product, causal mask, softmax. Build a self-attention head by hand and visualize what it attends to.
18 min
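A sketch of the whole computation for a single head, with random weights standing in for learned ones:

```python
import math
import torch

def attention_head(x, w_q, w_k, w_v):
    # x: (T, d_model); project to queries, keys, values
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / math.sqrt(d_k)                   # scaled dot product
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))    # causal mask: no peeking ahead
    weights = torch.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v, weights

T, d_model, d_head = 5, 16, 8
x = torch.randn(T, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
out, weights = attention_head(x, w_q, w_k, w_v)
print(out.shape, weights[2])   # position 2 only attends to positions 0-2
```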
09
Multi-head and residuals
From one head to many. Add residual connections and LayerNorm — the wiring that makes Transformers trainable at depth.
14 min
10
The full transformer block
Attention + feed-forward + residuals + LayerNorm, assembled into the block that GPT stacks N times. End-to-end forward pass.
16 min
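A sketch of the assembled block using PyTorch's built-in multi-head attention; the chapter builds the same wiring from the pieces of chapters 8 and 9 instead of nn.MultiheadAttention:

```python
import torch
import torch.nn as nn

# A pre-LayerNorm block: attention and feed-forward, each wrapped in a residual.
class Block(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        T = x.shape[1]
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + a                             # residual around attention
        return x + self.mlp(self.ln2(x))      # residual around feed-forward

x = torch.randn(2, 10, 64)
print(Block()(x).shape)                       # torch.Size([2, 10, 64])
```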
Part 4 — Train and use the LLM
ch 11-16
Prepare data, switch to PyTorch, train a small GPT, load real GPT-2 weights into the same code, sample, and read its failure modes honestly.
11
Prepare a dataset
Move off toy data — load Shakespeare, tokenize with your BPE, build train/val splits, save tensors ready for training.
16 min
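One way the prep can look; shakespeare.txt is a placeholder path, and the character-level ids below are a stand-in for the BPE encoder trained in chapter 3:

```python
import torch

text = open("shakespeare.txt").read()          # any plain-text file works here
# Stand-in tokenizer: character-level ids. Swap in your BPE encoder from chapter 3.
stoi = {ch: i for i, ch in enumerate(sorted(set(text)))}
ids = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)

n = int(0.9 * len(ids))                        # 90/10 train/validation split
train_ids, val_ids = ids[:n], ids[n:]
torch.save({"train": train_ids, "val": val_ids}, "dataset.pt")

def get_batch(data, block_size=64, batch_size=8):
    # Random windows; the target is the same window shifted one token to the right.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return x, y
```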
12
The minimum code
The minimum PyTorch code for a GPT-style model — embeddings, blocks, head, loss. Reads in one sitting and trains in chapter 13.
15 min
13
The training loop
Write the training loop, plot the loss curve, save a checkpoint, generate a sample. This is where the project starts to feel real.
16 min
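A sketch of the loop's shape; model, get_batch, and train_ids stand in for the pieces built in chapters 11 and 12 and are not defined here:

```python
import torch
import torch.nn.functional as F

# `model` is the GPT from chapter 12, `get_batch` and `train_ids` come from
# chapter 11; all three are placeholders in this sketch.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
losses = []

for step in range(1000):
    x, y = get_batch(train_ids)               # inputs and shifted targets
    logits = model(x)                         # (batch, block, vocab)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())                # plot this to watch the curve flatten
    if step % 100 == 0:
        print(step, loss.item())

torch.save(model.state_dict(), "checkpoint.pt")
```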
14
Generation and sampling
How a trained model becomes text — temperature, top-k, top-p (nucleus). Visualize each strategy on the same logits.
12 min
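The three strategies on one logits vector, sketched as a single function; this is a minimal version of the cut-off logic, not a library implementation:

```python
import torch

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    # Three knobs on the same logits: temperature rescales, top-k and top-p truncate.
    logits = logits / temperature
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")          # drop everything below the k-th logit
    probs = torch.softmax(logits, dim=-1)
    if top_p is not None:
        sorted_probs, idx = torch.sort(probs, descending=True)
        keep = torch.cumsum(sorted_probs, dim=-1) - sorted_probs < top_p   # the nucleus
        probs = torch.zeros_like(probs).scatter(-1, idx[keep], sorted_probs[keep])
        probs = probs / probs.sum()
    return torch.multinomial(probs, 1).item()

logits = torch.tensor([2.0, 1.5, 0.3, -1.0, -2.0])
print(sample(logits, temperature=0.8, top_k=3))
```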
15
Load real weights
Map GPT-2's parameter names to yours and load real weights into the architecture you wrote. From toy model to small GPT.
14 min
16
Why your model talks badly
Compare your trained model to GPT-2 small and see exactly where size, data, and tuning matter. The honest gap.
13 min
Part 5 — Make it useful, cheaper, and usable
ch 17-21
Instruction-tuning, LoRA, quantization, a chat loop, and a capstone where you ship one specialized assistant end-to-end.
17
Give your model instructions
Turn a base model into an instruction-follower. Chat templates, supervised fine-tuning, loss masking — the SFT recipe in code.
16 min
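The loss-masking trick in miniature: the token ids and vocabulary size below are toy stand-ins, but the ignore_index mechanism is the real one cross_entropy provides:

```python
import torch
import torch.nn.functional as F

# The model sees the whole chat-formatted sequence, but only the assistant's
# reply contributes to the loss.
IGNORE = -100                                   # cross_entropy skips this label

prompt_ids = torch.tensor([10, 11, 12, 13])     # instruction + template tokens (toy ids)
reply_ids  = torch.tensor([20, 21, 22])         # assistant reply tokens (toy ids)

input_ids = torch.cat([prompt_ids, reply_ids])
labels = input_ids.clone()
labels[: len(prompt_ids)] = IGNORE              # mask out the prompt tokens

# Shift so each position predicts the next token, as in ordinary LM training.
vocab = 50
logits = torch.randn(len(input_ids), vocab)
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE)
print(loss.item())
```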
18
Fine-tuning with LoRA
Implement Low-Rank Adaptation in ~30 lines and fine-tune GPT-2 with a fraction of the parameters. Math, code, results.
12 min
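The core of LoRA as a wrapper around an existing nn.Linear; the rank and scaling here are illustrative:

```python
import torch
import torch.nn as nn

# Low-Rank Adaptation: freeze the original weight, learn a small A·B update on top.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                   # frozen pretrained weights
        self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, base.out_features))   # starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 12288 trainable values next to the 768×768 = 589824 frozen weight
```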
19
Simple quantization
Quantize your model to INT8 — half the memory, almost the same outputs. See where it breaks and what the KV cache costs.
10 min
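Symmetric per-tensor INT8 quantization in a few lines; real schemes usually work per channel, so treat this as the minimal version:

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Map the float range onto the integer range [-127, 127] with one scale factor.
    scale = w.abs().max() / 127.0
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(768, 768)
q, scale = quantize_int8(w)
print(q.element_size(), "byte per weight vs", w.element_size())   # 1 vs 4
print((w - dequantize(q, scale)).abs().max().item())              # small rounding error
```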
20
Talk to your model
A minimal chat loop with a KV cache — and the difference between cached and uncached generation.
12 min
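The caching idea in miniature, for a single attention head with random weights: keys and values are stored once, and each new token only computes its own query:

```python
import math
import torch

d = 16
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
cache_k, cache_v = [], []                 # grows by one entry per generated token

def step(x_new):
    # x_new: (1, d), the embedding of just the newest token.
    q, k, v = x_new @ w_q, x_new @ w_k, x_new @ w_v
    cache_k.append(k)
    cache_v.append(v)
    K, V = torch.cat(cache_k), torch.cat(cache_v)
    weights = torch.softmax(q @ K.T / math.sqrt(d), dim=-1)
    return weights @ V                    # attends over everything cached so far

for t in range(5):
    out = step(torch.randn(1, d))
print(out.shape, len(cache_k))            # torch.Size([1, 16]) 5
```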
21
Ship a useful one
Pick a narrow domain, fine-tune GPT-2 with your SFT recipe, evaluate side-by-side, and ship a small useful model.
22 min
Part 6 — Appendices
optional
Optional deep dives that complement the main path: math derivations and second-look explanations of concepts the chapters use without unpacking.
22
Appendix · Backprop by hand
Derive backprop on a small graph — every gradient written out. The math behind every loss.backward() you've ever called.
14 min
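The appendix in miniature: a two-parameter graph differentiated by hand, then checked against autograd:

```python
import torch

# A tiny graph, differentiated by hand: loss = (w*x + b - y)**2
x, y = 2.0, 5.0
w, b = 1.5, 0.5

# forward pass, keeping the intermediates
z = w * x + b              # z = 3.5
d = z - y                  # d = -1.5
loss = d ** 2              # loss = 2.25

# backward pass: chain rule, node by node
dloss_dd = 2 * d           # d(loss)/d(d)
dloss_dz = dloss_dd * 1.0  # d = z - y, so dd/dz = 1
dloss_dw = dloss_dz * x    # z = w*x + b, so dz/dw = x
dloss_db = dloss_dz * 1.0  # and dz/db = 1
print(dloss_dw, dloss_db)  # -6.0 -3.0

# the same gradients from autograd
wt = torch.tensor(w, requires_grad=True)
bt = torch.tensor(b, requires_grad=True)
((wt * x + bt - y) ** 2).backward()
print(wt.grad.item(), bt.grad.item())   # matches the hand-derived values
```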
23
Appendix · RLHF and DPO
A conceptual walk-through of RLHF and DPO. What preference data is, what reward models are for, and where DPO simplifies things.
12 min
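The DPO objective itself fits in a few lines; the sequence log-probabilities below are stand-in numbers rather than real model outputs:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # How much more the policy prefers each answer than the frozen reference model does.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# One preference pair with made-up log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```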