Build a GPT from scratch
Build a GPT-style decoder model from scratch in PyTorch. Architecture, training loop, GPT-2 124M weight loading, and instruction tuning.
GPT is a decoder-only Transformer trained on next-token prediction. This guide builds one in ~150 lines of PyTorch, trains it on a small corpus, loads real GPT-2 124M weights into your architecture, and fine-tunes it to follow instructions.
1. What GPT actually is
GPT is the simplest possible recipe that scales:
- A stack of Transformer blocks (chapter 10).
- Causal masking, so each token only sees the past.
- A next-token cross-entropy loss.
- Trained on lots of text.
That's it. GPT-2, GPT-3, and GPT-4 all share this recipe — they differ in the size of the stack, the size of the dataset, and the post-training (instruction tuning, RLHF).
2. The architecture
A GPT model has four pieces:
- Token embeddings: every input token id is mapped to a d_model-dimensional vector.
- Position embeddings: added to the token embeddings so the model knows where each token sits.
- Transformer blocks (×N): the workhorse. Multi-head attention + feed-forward, with residuals and LayerNorm.
- Output head: a linear projection from d_model back to the vocabulary, producing logits for every possible next token.
Chapter 12 — A GPT model in ~150 lines of PyTorch writes this whole architecture out cleanly.
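To make the four pieces concrete, here is a minimal sketch of how they compose. It uses nn.MultiheadAttention as a stand-in for the hand-written attention in chapter 12, and every name and hyperparameter below is illustrative rather than the chapter's exact code:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One Transformer block: causal self-attention + feed-forward, pre-LayerNorm."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        T = x.size(1)
        # True above the diagonal means "may not attend": each token sees only the past.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out                   # residual around attention
        return x + self.ff(self.ln2(x))    # residual around feed-forward

class GPT(nn.Module):
    def __init__(self, vocab_size, d_model, n_layers, n_heads, max_seq_len):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)    # token embeddings
        self.pos_emb = nn.Embedding(max_seq_len, d_model)   # learned position embeddings
        self.blocks = nn.ModuleList([Block(d_model, n_heads) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)  # output head

    def forward(self, idx):  # idx: (B, T) token ids
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)           # (B, T, d_model)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))                      # logits: (B, T, vocab_size)
```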
3. The training loop
Training a GPT looks like training any neural network, with one language-modeling twist: the targets are the inputs shifted by one position. Predict token 2 from token 1, token 3 from tokens 1–2, and so on.
The minimum loop:
```python
import torch.nn.functional as F

for step in range(n_steps):
    x, y = get_batch()    # x: token ids (B, T); y: x shifted by one position
    logits = model(x)     # (B, T, V)
    loss = F.cross_entropy(logits.view(-1, V), y.view(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Chapter 13 — The training loop builds the loop end-to-end with checkpointing, validation loss, and the first time you see your own model generate text.
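The get_batch helper is easy to sketch, assuming the corpus has already been tokenized into one long 1-D tensor of token ids (the signature and defaults below are illustrative):

```python
import torch

def get_batch(data, block_size=128, batch_size=32):
    # data: 1-D LongTensor of token ids for the whole corpus
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix.tolist()])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix.tolist()])  # same tokens, shifted by one
    return x, y
```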
4. Sampling
Once trained, the model produces a probability distribution over the vocabulary at every step. How you turn that into actual text matters more than it should.
- Greedy: take the most likely token. Boring, repetitive.
- Temperature: scale logits by 1/T. Higher T → flatter distribution → more random.
- Top-K: only consider the K most likely tokens.
- Top-P (nucleus): only consider the smallest set of tokens whose cumulative probability exceeds P.
Chapter 14 — Sampling: temperature, top-k, top-p visualizes each strategy on the same logits.
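As a rough sketch, the last three strategies combine into one helper (greedy is just an argmax over the logits; chapter 14's exact code may differ):

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=1.0, top_k=None, top_p=None):
    # logits: (V,) vector for the next-token position; greedy would be logits.argmax()
    logits = logits / temperature                 # T > 1 flattens, T < 1 sharpens
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))  # keep the K most likely
    if top_p is not None:
        sorted_logits, order = torch.sort(logits, descending=True)
        probs = F.softmax(sorted_logits, dim=-1)
        cum = torch.cumsum(probs, dim=-1)
        # drop a token if the cumulative mass *before* it already exceeds P
        sorted_logits[cum - probs > top_p] = float("-inf")
        logits = torch.full_like(logits, float("-inf")).scatter(0, order, sorted_logits)
    return torch.multinomial(F.softmax(logits, dim=-1), 1).item()
```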
5. Loading real GPT-2 weights
The model you wrote in chapter 12 is architecturally identical to GPT-2 small. So GPT-2's published weights can be loaded into it directly — once you map the parameter names.
OpenAI's release implements some linear layers as TensorFlow-style Conv1D modules, which store their weight matrices transposed relative to PyTorch's nn.Linear. Hugging Face's port keeps that convention; your model doesn't. So a few weights need to be transposed at load time. The rest is one big rename table.
Chapter 15 — Load GPT-2 124M weights into your model walks through the mapping, with a side-by-side parameter table.
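The transpose step looks roughly like this, assuming Hugging Face's GPT-2 parameter names (the full rename table lives in chapter 15):

```python
import torch

# The GPT-2 checkpoint's Conv1D weights are stored as (in, out), while
# nn.Linear stores (out, in). These four matrices per block need a transpose.
NEEDS_TRANSPOSE = ("attn.c_attn.weight", "attn.c_proj.weight",
                   "mlp.c_fc.weight", "mlp.c_proj.weight")

def copy_gpt2_param(my_param, gpt2_name, gpt2_tensor):
    tensor = gpt2_tensor.T if gpt2_name.endswith(NEEDS_TRANSPOSE) else gpt2_tensor
    assert my_param.shape == tensor.shape, f"shape mismatch for {gpt2_name}"
    with torch.no_grad():
        my_param.copy_(tensor)
```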
6. Instruction tuning
A base GPT trained on raw text will complete prompts as if they were continuations of a webpage. "What is two plus two?" gets a paragraph about arithmetic education, not the answer.
Instruction tuning fixes this. You fine-tune on prompt/completion pairs — questions paired with desired answers — and the model learns the chat template. Chapter 17 — Instruction tuning from scratch does this with a small SFT recipe, loss masking, and a clean evaluation comparing before/after.
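Loss masking itself is small: replace the prompt positions in the labels with cross-entropy's ignore index, so only completion tokens contribute to the loss. A minimal sketch, with illustrative helper names:

```python
import torch

IGNORE_INDEX = -100  # F.cross_entropy's default ignore_index

def sft_example(prompt_ids, completion_ids):
    """Build (input_ids, labels) so loss is computed only on the completion."""
    input_ids = torch.cat([prompt_ids, completion_ids])
    labels = input_ids.clone()
    labels[: len(prompt_ids)] = IGNORE_INDEX  # masked prompt tokens contribute no loss
    return input_ids, labels
```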
7. The capstone
Chapter 21 — Capstone picks a narrow domain, hand-writes 150 SFT examples, fine-tunes GPT-2 small with the recipe from chapter 17, and ships a domain-specific assistant. It's the chapter where everything in the course gets used at once.
8. Where to go next
If you've followed along, you have a GPT you wrote and trained yourself. From here:
- Scale: swap in a bigger base model (Llama, Mistral) — the same recipe works.
- Quantize: chapter 19 shrinks the model to INT8.
- Cache: chapter 20 adds a KV cache for faster inference.
- LoRA: chapter 18 shows parameter-efficient fine-tuning if full SFT is too expensive.
Frequently asked questions
What does GPT stand for?
Generative Pre-trained Transformer. The name describes the recipe — a Transformer (architecture) pre-trained (objective) on a generative task (next-token prediction). Everything else about modern GPT models follows from that recipe.
How big is GPT-2 small?
124 million parameters. About 500 MB in float32, 250 MB in float16, ~125 MB in INT8. Trainable, loadable, and quantizable on a normal laptop.
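The sizes are just parameter count times bytes per parameter:

```python
params = 124_000_000  # GPT-2 small
for dtype, bytes_per in [("float32", 4), ("float16", 2), ("INT8", 1)]:
    print(f"{dtype}: ~{params * bytes_per / 1e6:.0f} MB")
# float32: ~496 MB, float16: ~248 MB, INT8: ~124 MB
```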
Can I load GPT-2 weights into my own architecture?
Yes — that's chapter 15. You map GPT-2's parameter names to your variable names, transpose where the conventions differ (Conv1D vs Linear), and load. The architecture is identical; only the variable names move.
What's the difference between GPT and a Transformer?
GPT is a *decoder-only* Transformer trained with causal next-token prediction. The original 2017 Transformer paper described an encoder–decoder used for translation. GPT keeps only the decoder half and trains it to predict the next word.
Will my from-scratch GPT compete with GPT-4?
No, and that's deliberate. You'll have a small model (~14M to 124M parameters) that produces readable text. Real frontier models are 1000–10000× larger and trained on far more data. The point is to understand the recipe, not beat the benchmarks.