The loss curve

Chapters · in reading order


Each chapter is a working artifact you can manipulate, and a short piece of prose that explains what you just did. They build on each other; read them in order.

Part 0 · Before you start

ch 0

Python, venv, PyTorch — the local toolchain. Skip it if you already have Python 3.11+ and pip install torch is muscle memory.

  1. 00 · Before you start

    Set up Python 3.11+, a virtual environment, and PyTorch in 10 minutes. Mac, Windows, Linux. The toolchain for the rest of the course.

    12 min

Part 1 · Start the project

ch 1-4

Tokens, bigrams, BPE, embeddings. You start the local project and build the first pieces of a language model.

  1. 01 · The dumbest model that exists

    Build the simplest possible language model — a bigram counter. Tokens, probability tables, sampling. Runs in your browser, then locally. (A toy version is sketched after this part's chapter list.)

    18 min

  2. 02 · Counting isn't enough

    Why counts alone fail and how smoothing fixes them — Laplace, Kneser-Ney, a held-out set, and the first perplexity number.

    15 min

  3. 03 · Train your own tokens

    Byte Pair Encoding from scratch — count pairs, merge, encode, decode. Train your own tokenizer and compare it to GPT-2's. (The core merge loop is sketched after this part's chapter list.)

    16 min

  4. 04 · Giving meaning to words

    Give meaning to tokens. One-hot vectors, embeddings, cosine similarity, skip-gram training — and what "king − man + woman" really shows.

    16 min
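
A toy preview of chapter 01's bigram counter, as promised above. A minimal sketch, not the chapter's code: the one-line corpus is a stand-in, and whitespace-split words play the role of tokens.

```python
import random
from collections import Counter, defaultdict

# Stand-in corpus; the chapter uses real text and real tokens.
tokens = "the cat sat on the mat and the cat ran".split()

# The whole model: a table of "what follows what" counts.
counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def sample_next(prev):
    # Normalize the counts for `prev` into probabilities, then sample.
    nexts = counts[prev]
    total = sum(nexts.values())
    return random.choices(list(nexts), [c / total for c in nexts.values()])[0]

word, out = "the", ["the"]
for _ in range(5):
    if not counts[word]:  # dead end: this token was never followed by anything
        break
    word = sample_next(word)
    out.append(word)
print(" ".join(out))
```

Everything the model "knows" lives in that counts table, which is exactly why chapter 02 is about where pure counting breaks.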
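
And a preview of chapter 03: the heart of BPE is one loop, count adjacent pairs and merge the most frequent one everywhere. The word-frequency table below is made up; real tokenizers start from bytes and far more data.

```python
from collections import Counter

# Each word is a tuple of symbols with a corpus frequency; start from characters.
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
         ("n", "e", "w", "e", "s", "t"): 6}

def most_frequent_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(vocab, pair):
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # fuse the pair into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(3):
    pair = most_frequent_pair(vocab)
    vocab = merge(vocab, pair)
    print("merged", pair)
```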

Part 2 · Make it learn

ch 5-7

Single neuron, MLP, optimizers. The model stops counting and starts improving through gradients.

  1. 05 · A neuron that learns

    One neuron, one loss, one gradient. Build a learnable linear unit by hand and watch it converge — the smallest possible training loop. (A toy version is sketched after this part's chapter list.)

    16 min

  2. 06 · Stacking layers

    A single neuron is a line. Stack them with a non-linearity and you get an MLP — the feed-forward block at the heart of every Transformer.

    14 min

  3. 07 · Gradient descent live

    Watch SGD, momentum, and Adam navigate the same loss surface. Build each optimizer step in plain code. (The three update rules are sketched after this part's chapter list.)

    15 min
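
The sketch promised in chapter 05's blurb: one linear unit y = w*x + b with hand-derived gradients, trained on made-up data whose true rule is y = 3x + 1.

```python
# Squared loss 0.5 * err**2, so d(loss)/d(pred) is just err.
data = [(x, 3 * x + 1) for x in range(-5, 6)]

w, b, lr = 0.0, 0.0, 0.05
for step in range(500):
    grad_w = grad_b = 0.0
    for x, y in data:
        err = (w * x + b) - y   # prediction error
        grad_w += err * x       # chain rule through w*x
        grad_b += err           # chain rule through +b
    n = len(data)
    w -= lr * grad_w / n        # gradient descent step
    b -= lr * grad_b / n
print(round(w, 2), round(b, 2))  # converges to 3.0 and 1.0
```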
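
And chapter 07's three update rules, side by side on a single parameter minimizing f(x) = x**2. The hyperparameters below are typical defaults, not the chapter's settings.

```python
# Gradient of f(x) = x**2 is 2x.
def grad(x):
    return 2 * x

x_sgd = x_mom = x_adam = 5.0
v = m = s = 0.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 101):
    # SGD: step straight down the gradient.
    x_sgd -= lr * grad(x_sgd)

    # Momentum: accumulate a velocity, step along it.
    v = beta1 * v + grad(x_mom)
    x_mom -= lr * v

    # Adam: moving averages of the gradient and its square,
    # bias-corrected, giving a per-parameter step size.
    g = grad(x_adam)
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    x_adam -= lr * m_hat / (s_hat ** 0.5 + eps)

print(x_sgd, x_mom, x_adam)  # all three have moved from 5.0 to near the minimum at 0
```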

Part 3 · Build the transformer

ch 8-10

Attention, multiple heads, residual connections, and the complete transformer block used by modern LLMs.

  1. 08 · An attention head by hand

    Q, K, V, scaled dot product, causal mask, softmax. Build a self-attention head by hand and visualize what it attends to. (A minimal version is sketched after this part's chapter list.)

    18 min

  2. 09 · Multi-head and residuals

    From one head to many. Add residual connections and LayerNorm — the wiring that makes Transformers trainable at depth.

    14 min

  3. 10 · The full transformer block

    Attention + feed-forward + residuals + LayerNorm, assembled into the block that GPT stacks N times. End-to-end forward pass.

    16 min
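
The sketch promised in chapter 08's blurb: a causal self-attention head in a few lines of PyTorch. The sequence length and head dimension are made up, and random matrices stand in for the learned projections.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d = 5, 8                      # sequence length and head dimension (made up)
x = torch.randn(T, d)            # one sequence of token vectors

# Learned projections would be nn.Linear; random matrices stand in here.
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

scores = Q @ K.T / d ** 0.5                       # scaled dot product
mask = torch.tril(torch.ones(T, T)).bool()        # causal: no looking ahead
scores = scores.masked_fill(~mask, float("-inf"))
weights = F.softmax(scores, dim=-1)               # each row sums to 1
out = weights @ V                                 # weighted mix of value vectors
print(weights.round(decimals=2))
```

The printed matrix is lower-triangular: the causal mask guarantees each position mixes information only from itself and earlier tokens.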

Part 4 · Train and use the LLM

ch 11-16

Prepare data, switch to PyTorch, train a small GPT, load real GPT-2 weights into the same code, sample, and read its failure modes honestly.

  1. 11 · Prepare a dataset

    Move off toy data — load Shakespeare, tokenize with your BPE, build train/val splits, save tensors ready for training.

    16 min

  2. 12 · The minimum code

    The minimum PyTorch code for a GPT-style model — embeddings, blocks, head, loss. Reads in one sitting and trains in chapter 13.

    15 min

  3. 13 · The training loop

    Write the training loop, plot the loss curve, save a checkpoint, generate a sample. This is where the project starts to feel real.

    16 min

  4. 14 · Generation and sampling

    How a trained model becomes text — temperature, top-k, top-p (nucleus). Visualize each strategy on the same logits. (Each strategy is sketched after this part's chapter list.)

    12 min

  5. 15 · Load real weights

    Map GPT-2's parameter names to yours and load real weights into the architecture you wrote. From toy model to small GPT.

    14 min

  6. 16 · Why your model talks badly

    Compare your trained model to GPT-2 small and see exactly where size, data, and tuning matter. The honest gap.

    13 min
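
The sketch promised in chapter 14's blurb: temperature, top-k, and top-p on one made-up logits vector.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.5, 0.5, -1.0, -3.0])  # made-up next-token logits

# Temperature: rescale logits before softmax. <1 sharpens, >1 flattens.
for temp in (0.5, 1.0, 2.0):
    print(temp, F.softmax(logits / temp, dim=-1).round(decimals=3))

# Top-k: keep the k best logits, renormalize, sample from those.
k = 2
topk = torch.topk(logits, k)
probs = F.softmax(topk.values, dim=-1)
token = topk.indices[torch.multinomial(probs, 1)]

# Top-p (nucleus): keep the smallest set whose probability mass reaches p.
p = 0.9
sorted_probs, order = F.softmax(logits, dim=-1).sort(descending=True)
keep = sorted_probs.cumsum(0) - sorted_probs < p   # always keeps the top token
probs = sorted_probs[keep] / sorted_probs[keep].sum()
token = order[keep][torch.multinomial(probs, 1)]
print("sampled token id:", token.item())
```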

Part 5 · Make it useful, cheaper, and usable

ch 17-21

Instruction-tuning, LoRA, quantization, a chat loop, and a capstone where you ship one specialized assistant end-to-end.

  1. 17 · Give your model instructions

    Turn a base model into an instruction-follower. Chat templates, supervised fine-tuning, loss masking — the SFT recipe in code.

    16 min

  2. 18 · Fine-tuning with LoRA

    Implement Low-Rank Adaptation in ~30 lines and fine-tune GPT-2 with a fraction of the parameters. Math, code, results. (A minimal adapter is sketched after this part's chapter list.)

    12 min

  3. 19 · Simple quantization

    Quantize your model to INT8 — half the memory, almost the same outputs. See where it breaks and what the KV cache costs. (The basic absmax scheme is sketched after this part's chapter list.)

    10 min

  4. 20 · Talk to your model

    A minimal chat loop with a KV cache — and the difference between cached and uncached generation.

    12 min

  5. 21 · Ship a useful one

    Pick a narrow domain, fine-tune GPT-2 with your SFT recipe, evaluate side-by-side, and ship a small useful model.

    22 min
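
The adapter promised in chapter 18's blurb, sketched under assumptions: the class name LoRALinear, the rank, and alpha below are illustrative choices, not the chapter's.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # freeze the big matrix
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
x = torch.randn(2, 64)
print(layer(x).shape)  # torch.Size([2, 64])
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable, "trainable params vs", 64 * 64 + 64, "in the base layer")
```

Initializing B to zero means the adapted layer starts out identical to the frozen one; only A and B receive gradients during fine-tuning.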
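
And chapter 19's starting point: symmetric per-tensor absmax quantization to INT8. A sketch, not a production scheme.

```python
import torch

torch.manual_seed(0)
w = torch.randn(256, 256)                 # a stand-in weight matrix

# Symmetric absmax: map [-max|w|, max|w|] onto the int8 range [-127, 127].
scale = w.abs().max() / 127
q = (w / scale).round().clamp(-127, 127).to(torch.int8)
w_hat = q.float() * scale                 # dequantize before the matmul

print(q.element_size(), "byte/weight vs", w.element_size())   # 1 vs 4
print("max abs error:", (w - w_hat).abs().max().item())       # about scale / 2
```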

Part 6 · Appendices

optional

Optional deep dives that complement the main path: math derivations and second-look explanations of concepts the chapters use without unpacking.

  1. 22 · Appendix · Backprop by hand

    Derive backprop on a small graph — every gradient written out. The math behind every loss.backward() you've ever called. (A tiny worked example closes this page.)

    14 min

  2. 23 · Appendix · RLHF and DPO

    A conceptual walk-through of RLHF and DPO. What preference data is, what reward models are for, and where DPO simplifies things.

    12 min
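
The worked example promised in appendix 22's blurb: gradients derived by the chain rule on a two-node graph, checked against autograd.

```python
import torch

# A tiny graph: z = (x * y + 3) ** 2.
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(-1.0, requires_grad=True)

a = x * y          # a = -2
b = a + 3          # b = 1
z = b ** 2         # z = 1
z.backward()

# Chain rule by hand: dz/db = 2b, db/da = 1, da/dx = y, da/dy = x.
dz_db = 2 * b.detach()
print(x.grad, dz_db * y.detach())   # both -2.0
print(y.grad, dz_db * x.detach())   # both 4.0
```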