Build a GPT from scratch
Build a GPT-style decoder model from scratch in PyTorch. Architecture, training loop, GPT-2 124M weight loading, and instruction tuning.
GPT is a decoder-only Transformer trained on next-token prediction. This guide builds one in ~150 lines of PyTorch, trains it on a small corpus, loads real GPT-2 124M weights into your architecture, and fine-tunes it to follow instructions.
1. What GPT actually is
GPT is the simplest possible recipe that scales:
- A stack of Transformer blocks (chapter 10).
- Causal masking, so each token only sees the past.
- A next-token cross-entropy loss.
- Trained on lots of text.
That's it. GPT-2, GPT-3, and GPT-4 all share this recipe — they differ in the size of the stack, the size of the dataset, and the post-training (instruction tuning, RLHF).
2. The architecture
A GPT model has four pieces:
- Token embeddings: every input token id is mapped to a d_model-dimensional vector.
- Position embeddings: added to the token embeddings so the model knows where each token sits.
- Transformer blocks (×N): the workhorse. Multi-head attention + feed-forward, with residuals and LayerNorm.
- Output head: a linear projection from d_model back to the vocabulary, producing logits for every possible next token.
Chapter 12 — A GPT model in ~150 lines of PyTorch writes this whole architecture out cleanly.
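To make the four pieces concrete, here is a minimal sketch of how they compose. It uses nn.MultiheadAttention as a stand-in for the hand-written attention in chapter 12, and every name and hyperparameter below is illustrative rather than the chapter's exact code:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One Transformer block: causal self-attention + feed-forward, pre-LayerNorm."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        T = x.size(1)
        # True above the diagonal means "may not attend": each token sees only the past.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out                   # residual around attention
        return x + self.ff(self.ln2(x))    # residual around feed-forward

class GPT(nn.Module):
    def __init__(self, vocab_size, d_model, n_layers, n_heads, max_seq_len):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)    # token embeddings
        self.pos_emb = nn.Embedding(max_seq_len, d_model)   # learned position embeddings
        self.blocks = nn.ModuleList([Block(d_model, n_heads) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)  # output head

    def forward(self, idx):  # idx: (B, T) token ids
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)           # (B, T, d_model)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))                      # logits: (B, T, vocab_size)
```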
3. The training loop
Training a GPT looks like training any neural network, with one language-modeling twist: the targets are the inputs shifted by one position. Predict token 2 from token 1, token 3 from tokens 1–2, and so on.
The minimum loop:
```python
import torch.nn.functional as F

for step in range(n_steps):
    x, y = get_batch()    # x: token ids (B, T); y: x shifted by one position
    logits = model(x)     # (B, T, V)
    loss = F.cross_entropy(logits.view(-1, V), y.view(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Chapter 13 — The training loop builds the loop end-to-end with checkpointing, validation loss, and the first time you see your own model generate text.
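The get_batch helper is easy to sketch, assuming the corpus has already been tokenized into one long 1-D tensor of token ids (the signature and defaults below are illustrative):

```python
import torch

def get_batch(data, block_size=128, batch_size=32):
    # data: 1-D LongTensor of token ids for the whole corpus
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix.tolist()])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix.tolist()])  # same tokens, shifted by one
    return x, y
```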
4. Sampling
Once trained, the model produces a probability distribution over the vocabulary at every step. How you turn that into actual text matters more than it should.
- Greedy: take the most likely token. Boring, repetitive.
- Temperature: scale logits by 1/T. Higher T → flatter distribution → more random.
- Top-K: only consider the K most likely tokens.
- Top-P (nucleus): only consider the smallest set of tokens whose cumulative probability exceeds P.
Chapter 14 — Sampling: temperature, top-k, top-p visualizes each strategy on the same logits.
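As a rough sketch, the last three strategies combine into one helper (greedy is just an argmax over the logits; chapter 14's exact code may differ):

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=1.0, top_k=None, top_p=None):
    # logits: (V,) vector for the next-token position; greedy would be logits.argmax()
    logits = logits / temperature                 # T > 1 flattens, T < 1 sharpens
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))  # keep the K most likely
    if top_p is not None:
        sorted_logits, order = torch.sort(logits, descending=True)
        probs = F.softmax(sorted_logits, dim=-1)
        cum = torch.cumsum(probs, dim=-1)
        # drop a token if the cumulative mass *before* it already exceeds P
        sorted_logits[cum - probs > top_p] = float("-inf")
        logits = torch.full_like(logits, float("-inf")).scatter(0, order, sorted_logits)
    return torch.multinomial(F.softmax(logits, dim=-1), 1).item()
```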
5. Loading real GPT-2 weights
The model you wrote in chapter 12 is architecturally identical to GPT-2 small. So GPT-2's published weights can be loaded into it directly — once you map the parameter names.
OpenAI's release implements some linear layers as TensorFlow-style Conv1D modules, which store their weight matrices transposed relative to PyTorch's nn.Linear. Hugging Face's port keeps that convention; your model doesn't. So a few weights need to be transposed at load time. The rest is one big rename table.
Chapter 15 — Load GPT-2 124M weights into your model walks through the mapping, with a side-by-side parameter table.
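The transpose step looks roughly like this, assuming Hugging Face's GPT-2 parameter names (the full rename table lives in chapter 15):

```python
import torch

# The GPT-2 checkpoint's Conv1D weights are stored as (in, out), while
# nn.Linear stores (out, in). These four matrices per block need a transpose.
NEEDS_TRANSPOSE = ("attn.c_attn.weight", "attn.c_proj.weight",
                   "mlp.c_fc.weight", "mlp.c_proj.weight")

def copy_gpt2_param(my_param, gpt2_name, gpt2_tensor):
    tensor = gpt2_tensor.T if gpt2_name.endswith(NEEDS_TRANSPOSE) else gpt2_tensor
    assert my_param.shape == tensor.shape, f"shape mismatch for {gpt2_name}"
    with torch.no_grad():
        my_param.copy_(tensor)
```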
6. Instruction tuning
A base GPT trained on raw text will complete prompts as if they were continuations of a webpage. "What is two plus two?" gets a paragraph about arithmetic education, not the answer.
Instruction tuning fixes this. You fine-tune on prompt/completion pairs — questions paired with desired answers — and the model learns the chat template. Chapter 17 — Instruction tuning from scratch does this with a small SFT recipe, loss masking, and a clean evaluation comparing before/after.
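Loss masking itself is small: replace the prompt positions in the labels with cross-entropy's ignore index, so only completion tokens contribute to the loss. A minimal sketch, with illustrative helper names:

```python
import torch

IGNORE_INDEX = -100  # F.cross_entropy's default ignore_index

def sft_example(prompt_ids, completion_ids):
    """Build (input_ids, labels) so loss is computed only on the completion."""
    input_ids = torch.cat([prompt_ids, completion_ids])
    labels = input_ids.clone()
    labels[: len(prompt_ids)] = IGNORE_INDEX  # masked prompt tokens contribute no loss
    return input_ids, labels
```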
7. The capstone
Chapter 21 — Capstone picks a narrow domain, hand-writes 150 SFT examples, fine-tunes GPT-2 small with the recipe from chapter 17, and ships a domain-specific assistant. It's the chapter where everything in the course gets used at once.
8. Where to go next
If you've followed along, you have a GPT you wrote and trained yourself. From here:
- Scale: swap in a bigger base model (Llama, Mistral) — the same recipe works.
- Quantize: chapter 19 shrinks the model to INT8.
- Cache: chapter 20 adds a KV cache for faster inference.
- LoRA: chapter 18 shows parameter-efficient fine-tuning if full SFT is too expensive.
Frequently asked questions
What does GPT stand for?
Generative Pre-trained Transformer. The name describes the recipe — a Transformer (architecture) pre-trained (objective) on a generative task (next-token prediction). Everything else about modern GPT models follows from that recipe.
How big is GPT-2 small?
124 million parameters. About 500 MB in float32, 250 MB in float16, ~125 MB in INT8. Trainable, loadable, and quantizable on a normal laptop.
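The sizes are just parameter count times bytes per parameter:

```python
params = 124_000_000  # GPT-2 small
for dtype, bytes_per in [("float32", 4), ("float16", 2), ("INT8", 1)]:
    print(f"{dtype}: ~{params * bytes_per / 1e6:.0f} MB")
# float32: ~496 MB, float16: ~248 MB, INT8: ~124 MB
```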
Can I load GPT-2 weights into my own architecture?
Yes — that's chapter 15. You map GPT-2's parameter names to your variable names, transpose where the conventions differ (Conv1D vs Linear), and load. The architecture is identical; only the variable names move.
What's the difference between GPT and a Transformer?
GPT is a *decoder-only* Transformer trained with causal next-token prediction. The original 2017 Transformer paper described an encoder–decoder used for translation. GPT keeps only the decoder half and trains it to predict the next word.
Will my from-scratch GPT compete with GPT-4?
No, and that's deliberate. You'll have a small model (~14M to 124M parameters) that produces readable text. Real frontier models are 1000–10000× larger and trained on far more data. The point is to understand the recipe, not beat the benchmarks.