The loss curve

Chapters · in reading order


Each chapter is a working artifact you can manipulate, and a short piece of prose that explains what you just did. They build on each other; read them in order.

Part 0 · Before you start

ch 0

Python, venv, PyTorch — the local toolchain. Skip it if you already have Python 3.11+ and pip install torch is muscle memory.

  1. 00 · Before you start

    Set up Python 3.11+, a virtual environment, and PyTorch in 10 minutes. Mac, Windows, Linux. The toolchain for the rest of the course.

    12 min

Part 1 · Start the project

ch 1-4

Tokens, bigrams, BPE, embeddings. You start the local project and build the first pieces of a language model.

  1. 01 · The dumbest model that exists

    Build the simplest possible language model — a bigram counter. Tokens, probability tables, sampling. Runs in your browser, then locally. (A toy version is sketched after this part's chapter list.)

    18 min

  2. 02 · Counting isn't enough

    Why counts alone fail and how smoothing fixes them — Laplace, Kneser-Ney, a held-out set, and the first perplexity number.

    15 min

  3. 03 · Train your own tokens

    Byte Pair Encoding from scratch — count pairs, merge, encode, decode. Train your own tokenizer and compare it to GPT-2's. (The core merge loop is sketched after this part's chapter list.)

    16 min

  4. 04 · Giving meaning to words

    Give meaning to tokens. One-hot vectors, embeddings, cosine similarity, skip-gram training — and what "king − man + woman" really shows.

    16 min
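
A toy preview of chapter 01's bigram counter, as promised above. A minimal sketch, not the chapter's code: the one-line corpus is a stand-in, and whitespace-split words play the role of tokens.

```python
import random
from collections import Counter, defaultdict

# Stand-in corpus; the chapter uses real text and real tokens.
tokens = "the cat sat on the mat and the cat ran".split()

# The whole model: a table of "what follows what" counts.
counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def sample_next(prev):
    # Normalize the counts for `prev` into probabilities, then sample.
    nexts = counts[prev]
    total = sum(nexts.values())
    return random.choices(list(nexts), [c / total for c in nexts.values()])[0]

word, out = "the", ["the"]
for _ in range(5):
    if not counts[word]:  # dead end: this token was never followed by anything
        break
    word = sample_next(word)
    out.append(word)
print(" ".join(out))
```

Everything the model "knows" lives in that counts table, which is exactly why chapter 02 is about where pure counting breaks.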
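
And a preview of chapter 03: the heart of BPE is one loop, count adjacent pairs and merge the most frequent one everywhere. The word-frequency table below is made up; real tokenizers start from bytes and far more data.

```python
from collections import Counter

# Each word is a tuple of symbols with a corpus frequency; start from characters.
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
         ("n", "e", "w", "e", "s", "t"): 6}

def most_frequent_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(vocab, pair):
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # fuse the pair into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(3):
    pair = most_frequent_pair(vocab)
    vocab = merge(vocab, pair)
    print("merged", pair)
```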

Part 2 · Make it learn

ch 5-7

Single neuron, MLP, optimizers. The model stops counting and starts improving through gradients.

  1. 05 · A neuron that learns

    One neuron, one loss, one gradient. Build a learnable linear unit by hand and watch it converge — the smallest possible training loop. (A toy version is sketched after this part's chapter list.)

    16 min

  2. 06 · Stacking layers

    A single neuron is a line. Stack them with a non-linearity and you get an MLP — the feed-forward block at the heart of every Transformer.

    14 min

  3. 07 · Gradient descent live

    Watch SGD, momentum, and Adam navigate the same loss surface. Build each optimizer step in plain code. (The three update rules are sketched after this part's chapter list.)

    15 min
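
The sketch promised in chapter 05's blurb: one linear unit y = w*x + b with hand-derived gradients, trained on made-up data whose true rule is y = 3x + 1.

```python
# Squared loss 0.5 * err**2, so d(loss)/d(pred) is just err.
data = [(x, 3 * x + 1) for x in range(-5, 6)]

w, b, lr = 0.0, 0.0, 0.05
for step in range(500):
    grad_w = grad_b = 0.0
    for x, y in data:
        err = (w * x + b) - y   # prediction error
        grad_w += err * x       # chain rule through w*x
        grad_b += err           # chain rule through +b
    n = len(data)
    w -= lr * grad_w / n        # gradient descent step
    b -= lr * grad_b / n
print(round(w, 2), round(b, 2))  # converges to 3.0 and 1.0
```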
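
And chapter 07's three update rules, side by side on a single parameter minimizing f(x) = x**2. The hyperparameters below are typical defaults, not the chapter's settings.

```python
# Gradient of f(x) = x**2 is 2x.
def grad(x):
    return 2 * x

x_sgd = x_mom = x_adam = 5.0
v = m = s = 0.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 101):
    # SGD: step straight down the gradient.
    x_sgd -= lr * grad(x_sgd)

    # Momentum: accumulate a velocity, step along it.
    v = beta1 * v + grad(x_mom)
    x_mom -= lr * v

    # Adam: moving averages of the gradient and its square,
    # bias-corrected, giving a per-parameter step size.
    g = grad(x_adam)
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    x_adam -= lr * m_hat / (s_hat ** 0.5 + eps)

print(x_sgd, x_mom, x_adam)  # all three have moved from 5.0 to near the minimum at 0
```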

Part 3 · Build the transformer

ch 8-10

Attention, multiple heads, residual connections, and the complete transformer block used by modern LLMs.

  1. 08 · An attention head by hand

    Q, K, V, scaled dot product, causal mask, softmax. Build a self-attention head by hand and visualize what it attends to. (A minimal version is sketched after this part's chapter list.)

    18 min

  2. 09 · Multi-head and residuals

    From one head to many. Add residual connections and LayerNorm — the wiring that makes Transformers trainable at depth.

    14 min

  3. 10 · The full transformer block

    Attention + feed-forward + residuals + LayerNorm, assembled into the block that GPT stacks N times. End-to-end forward pass.

    16 min
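
The sketch promised in chapter 08's blurb: a causal self-attention head in a few lines of PyTorch. The sequence length and head dimension are made up, and random matrices stand in for the learned projections.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d = 5, 8                      # sequence length and head dimension (made up)
x = torch.randn(T, d)            # one sequence of token vectors

# Learned projections would be nn.Linear; random matrices stand in here.
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

scores = Q @ K.T / d ** 0.5                       # scaled dot product
mask = torch.tril(torch.ones(T, T)).bool()        # causal: no looking ahead
scores = scores.masked_fill(~mask, float("-inf"))
weights = F.softmax(scores, dim=-1)               # each row sums to 1
out = weights @ V                                 # weighted mix of value vectors
print(weights.round(decimals=2))
```

The printed matrix is lower-triangular: the causal mask guarantees each position mixes information only from itself and earlier tokens.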

Part 4 · Train and use the LLM

ch 11-16

Prepare data, switch to PyTorch, train a small GPT, load real GPT-2 weights into the same code, sample, and read its failure modes honestly.

  1. 11 · Prepare a dataset

    Move off toy data — load Shakespeare, tokenize with your BPE, build train/val splits, save tensors ready for training.

    16 min

  2. 12 · The minimum code

    The minimum PyTorch code for a GPT-style model — embeddings, blocks, head, loss. Reads in one sitting and trains in chapter 13.

    15 min

  3. 13 · The training loop

    Write the training loop, plot the loss curve, save a checkpoint, generate a sample. This is where the project starts to feel real.

    16 min

  4. 14 · Generation and sampling

    How a trained model becomes text — temperature, top-k, top-p (nucleus). Visualize each strategy on the same logits. (Each strategy is sketched after this part's chapter list.)

    12 min

  5. 15 · Load real weights

    Map GPT-2's parameter names to yours and load real weights into the architecture you wrote. From toy model to small GPT.

    14 min

  6. 16 · Why your model talks badly

    Compare your trained model to GPT-2 small and see exactly where size, data, and tuning matter. The honest gap.

    13 min
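
The sketch promised in chapter 14's blurb: temperature, top-k, and top-p on one made-up logits vector.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.5, 0.5, -1.0, -3.0])  # made-up next-token logits

# Temperature: rescale logits before softmax. <1 sharpens, >1 flattens.
for temp in (0.5, 1.0, 2.0):
    print(temp, F.softmax(logits / temp, dim=-1).round(decimals=3))

# Top-k: keep the k best logits, renormalize, sample from those.
k = 2
topk = torch.topk(logits, k)
probs = F.softmax(topk.values, dim=-1)
token = topk.indices[torch.multinomial(probs, 1)]

# Top-p (nucleus): keep the smallest set whose probability mass reaches p.
p = 0.9
sorted_probs, order = F.softmax(logits, dim=-1).sort(descending=True)
keep = sorted_probs.cumsum(0) - sorted_probs < p   # always keeps the top token
probs = sorted_probs[keep] / sorted_probs[keep].sum()
token = order[keep][torch.multinomial(probs, 1)]
print("sampled token id:", token.item())
```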

Part 5 · Make it useful, cheaper, and usable

ch 17-21

Instruction-tuning, LoRA, quantization, a chat loop, and a capstone where you ship one specialized assistant end-to-end.

  1. 17 · Give your model instructions

    Turn a base model into an instruction-follower. Chat templates, supervised fine-tuning, loss masking — the SFT recipe in code.

    16 min

  2. 18 · Fine-tuning with LoRA

    Implement Low-Rank Adaptation in ~30 lines and fine-tune GPT-2 with a fraction of the parameters. Math, code, results. (A minimal adapter is sketched after this part's chapter list.)

    12 min

  3. 19 · Simple quantization

    Quantize your model to INT8 — half the memory, almost the same outputs. See where it breaks and what the KV cache costs. (The basic absmax scheme is sketched after this part's chapter list.)

    10 min

  4. 20 · Talk to your model

    A minimal chat loop with a KV cache — and the difference between cached and uncached generation.

    12 min

  5. 21 · Ship a useful one

    Pick a narrow domain, fine-tune GPT-2 with your SFT recipe, evaluate side-by-side, and ship a small useful model.

    22 min
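
The adapter promised in chapter 18's blurb, sketched under assumptions: the class name LoRALinear, the rank, and alpha below are illustrative choices, not the chapter's.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # freeze the big matrix
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
x = torch.randn(2, 64)
print(layer(x).shape)  # torch.Size([2, 64])
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable, "trainable params vs", 64 * 64 + 64, "in the base layer")
```

Initializing B to zero means the adapted layer starts out identical to the frozen one; only A and B receive gradients during fine-tuning.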
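
And chapter 19's starting point: symmetric per-tensor absmax quantization to INT8. A sketch, not a production scheme.

```python
import torch

torch.manual_seed(0)
w = torch.randn(256, 256)                 # a stand-in weight matrix

# Symmetric absmax: map [-max|w|, max|w|] onto the int8 range [-127, 127].
scale = w.abs().max() / 127
q = (w / scale).round().clamp(-127, 127).to(torch.int8)
w_hat = q.float() * scale                 # dequantize before the matmul

print(q.element_size(), "byte/weight vs", w.element_size())   # 1 vs 4
print("max abs error:", (w - w_hat).abs().max().item())       # about scale / 2
```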

Part 6 · Appendices

optional

Optional deep dives that complement the main path: math derivations and second-look explanations of concepts the chapters use without unpacking.

  1. 22 · Appendix · Backprop by hand

    Derive backprop on a small graph — every gradient written out. The math behind every loss.backward() you've ever called. (A tiny worked example closes this page.)

    14 min

  2. 23 · Appendix · RLHF and DPO

    A conceptual walk-through of RLHF and DPO. What preference data is, what reward models are for, and where DPO simplifies things.

    12 min
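
The worked example promised in appendix 22's blurb: gradients derived by the chain rule on a two-node graph, checked against autograd.

```python
import torch

# A tiny graph: z = (x * y + 3) ** 2.
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(-1.0, requires_grad=True)

a = x * y          # a = -2
b = a + 3          # b = 1
z = b ** 2         # z = 1
z.backward()

# Chain rule by hand: dz/db = 2b, db/da = 1, da/dx = y, da/dy = x.
dz_db = 2 * b.detach()
print(x.grad, dz_db * y.detach())   # both -2.0
print(y.grad, dz_db * x.detach())   # both 4.0
```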