Chapter 12 · 15 min
The minimum code
The minimum PyTorch code for a GPT-style model — embeddings, blocks, head, loss. Reads in one sitting and trains in chapter 13.
This is the chapter where the readable list-based llm/model.py from chapter 10 becomes a PyTorch model you can actually train. We're going to keep the code under 150 lines, leaning on PyTorch's built-in modules for the things that are not pedagogically interesting (linear layers, layer norm, embedding lookup) but writing the attention and the block ourselves so the chapter-by-chapter mapping stays clean.
The result is a self-contained llm/model.py that defines a GPT-like decoder you can train on data/train.bin from chapter 11 and generate from in chapter 14.
1. How big is this thing?
Before showing the code, do a back-of-envelope estimate. The model has these knobs:
- vocab_size = 50,257 (GPT-2 BPE)
- block_size = 64 (max context length we train on)
- n_layer = 4
- n_head = 4
- n_embd = 128
- ffn_mult = 4 (FFN hidden dim = 4 × n_embd)
What's the total parameter count? Break it down by source — you should be able to reason it out from chapters 8–10.
Code · JavaScript (interactive cell: work out the count yourself before reading on)
Roughly 14 million parameters, most of them in the embedding and unembedding matrices (the vocab is large). The attention sublayers themselves are surprisingly small — the FFN dominates the per-block budget. This is a useful intuition: once a model is big enough that the embeddings stop dominating, most of an LLM's parameters live in the FFN, not the attention.
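Here is the arithmetic behind that number, as a small Python sketch. It assumes the config above and the code in the next section, where the attention linears and the head carry no bias but the FFN linears and LayerNorms keep theirs:

# Back-of-envelope parameter count for the chapter-12 config.
vocab_size, block_size = 50_257, 64
n_layer, n_embd, ffn_mult = 4, 128, 4

tok_emb = vocab_size * n_embd                      # token embedding table
pos_emb = block_size * n_embd                      # learned position table
ln = 2 * n_embd                                    # LayerNorm weight + bias
attn = n_embd * 3 * n_embd + n_embd * n_embd       # qkv + output projection, no bias
ffn = (n_embd * ffn_mult * n_embd + ffn_mult * n_embd) + (ffn_mult * n_embd * n_embd + n_embd)
block = 2 * ln + attn + ffn                        # ln1 + attention + ln2 + FFN
head = n_embd * vocab_size                         # unembedding, no bias

total = tok_emb + pos_emb + n_layer * block + ln + head
print(f"{total:,}")  # 13,665,280; the two vocab-sized matrices are ~94% of it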
2. The model code
Replace llm/model.py with this PyTorch version:
"""llm/model.py — a teaching-scale decoder transformer."""
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass
@dataclass
class GPTConfig:
vocab_size: int = 50257
block_size: int = 64
n_layer: int = 4
n_head: int = 4
n_embd: int = 128
ffn_mult: int = 4
dropout: float = 0.0
# Switches we'll use later in the book. Defaults preserve chapter-12 behavior.
bias: bool = False # bias on attention linears (ch.15 → True for GPT-2)
tied_lm_head: bool = False # share tok_emb ↔ head (ch.15 → True for GPT-2)
gelu_approximate: str = "none" # "none" or "tanh" (ch.15 → "tanh" for GPT-2)
class CausalSelfAttention(nn.Module):
"""Multi-head attention with a causal mask. Chapters 8-9."""
def __init__(self, cfg):
super().__init__()
assert cfg.n_embd % cfg.n_head == 0
self.n_head = cfg.n_head
self.n_embd = cfg.n_embd
# combined Q, K, V projection for speed
self.qkv = nn.Linear(cfg.n_embd, 3 * cfg.n_embd, bias=cfg.bias)
self.proj = nn.Linear(cfg.n_embd, cfg.n_embd, bias=cfg.bias)
# causal mask: lower-triangular ones up to block_size
self.register_buffer(
"mask",
torch.tril(torch.ones(cfg.block_size, cfg.block_size)).view(1, 1, cfg.block_size, cfg.block_size),
)
def forward(self, x, past_kv=None):
B, T, C = x.shape
q, k, v = self.qkv(x).split(self.n_embd, dim=2)
# split into heads: (B, T, n_head, head_dim) -> (B, n_head, T, head_dim)
q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
# KV cache: extend k, v with previously-computed keys/values.
if past_kv is not None:
past_k, past_v = past_kv
k = torch.cat([past_k, k], dim=2)
v = torch.cat([past_v, v], dim=2)
present_kv = (k, v)
# scaled dot-product attention with causal mask
att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
if past_kv is None:
# full prefix pass — apply causal mask
att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
# else: T == 1, the new token attends to everything in the cache; no mask
att = F.softmax(att, dim=-1)
out = att @ v # (B, n_head, T, head_dim)
out = out.transpose(1, 2).contiguous().view(B, T, C)
return self.proj(out), present_kv
class FFN(nn.Module):
"""Per-token MLP with GELU. Chapter 10."""
def __init__(self, cfg):
super().__init__()
self.fc1 = nn.Linear(cfg.n_embd, cfg.ffn_mult * cfg.n_embd)
self.fc2 = nn.Linear(cfg.ffn_mult * cfg.n_embd, cfg.n_embd)
self.approximate = cfg.gelu_approximate
def forward(self, x):
return self.fc2(F.gelu(self.fc1(x), approximate=self.approximate))
class Block(nn.Module):
"""One transformer block, pre-norm. Chapter 10."""
def __init__(self, cfg):
super().__init__()
self.ln1 = nn.LayerNorm(cfg.n_embd)
self.attn = CausalSelfAttention(cfg)
self.ln2 = nn.LayerNorm(cfg.n_embd)
self.ffn = FFN(cfg)
def forward(self, x, past_kv=None):
attn_out, present_kv = self.attn(self.ln1(x), past_kv=past_kv)
x = x + attn_out
x = x + self.ffn(self.ln2(x))
return x, present_kv
class GPT(nn.Module):
"""Full decoder transformer. Chapters 8-10 + token + position embeddings."""
def __init__(self, cfg):
super().__init__()
self.cfg = cfg
self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.n_embd)
self.pos_emb = nn.Embedding(cfg.block_size, cfg.n_embd)
self.blocks = nn.ModuleList([Block(cfg) for _ in range(cfg.n_layer)])
self.ln_f = nn.LayerNorm(cfg.n_embd)
self.head = nn.Linear(cfg.n_embd, cfg.vocab_size, bias=False)
if cfg.tied_lm_head:
self.head.weight = self.tok_emb.weight
def forward(self, idx, targets=None, past_kvs=None):
B, T = idx.shape
# When caching, position embeddings must offset by the cached length.
offset = 0 if past_kvs is None else past_kvs[0][0].size(2)
tok = self.tok_emb(idx)
pos = self.pos_emb(torch.arange(offset, offset + T, device=idx.device))
x = tok + pos
present_kvs = []
past_kvs = past_kvs or [None] * len(self.blocks)
for block, past_kv in zip(self.blocks, past_kvs):
x, present_kv = block(x, past_kv=past_kv)
present_kvs.append(present_kv)
x = self.ln_f(x)
logits = self.head(x)
if targets is None:
return logits, present_kvs
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

Read the model from bottom to top once, then top to bottom:
- GPT.forward is the whole language-model interface. Token ids go in; logits come out.
- tok_emb(idx) turns ids into vectors. This is chapter 4, now as a PyTorch table.
- pos_emb(torch.arange(T, ...)) gives every position its own learned vector, so order matters.
- Each Block updates the residual stream without changing its shape.
- CausalSelfAttention is chapters 8–9: make Q/K/V, split heads, mask the future, mix values, project back.
- FFN is chapter 10's per-token MLP.
- F.cross_entropy compares all logits against the true next-token ids and gives training one scalar to minimize.
That's 80-odd lines of actual code. Skim it. Every block maps to a chapter:
- CausalSelfAttention.forward → chapters 8 (single head) + 9 (multi-head), plus the causal mask, which is new: a token at position t isn't allowed to attend to positions > t. We mask the upper triangle of the attention matrix to -inf before the softmax, so those positions get zero weight. That's the one line that makes this a language model and not just a sequence encoder.
- FFN → chapter 10 cell 1.
- Block → chapter 10 cell 2.
- GPT.forward → chapter 10 cell 3, plus the token + positional embedding step we skipped.
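To make the mask concrete, here is a tiny standalone sketch (not part of llm/model.py) of what the causal pattern looks like for a toy sequence of four tokens:

import torch

T = 4
mask = torch.tril(torch.ones(T, T))       # 1 = may attend, 0 = future position (blocked)
scores = torch.randn(T, T)                # stand-in attention scores
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)   # row t now puts zero weight on positions > t
print(mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])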
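One piece of the code the list above skips: forward also threads a key/value cache (past_kvs / present_kvs) through the blocks. Chapter 14 uses it for generation; here is a minimal sketch of the contract, assuming the model above:

import torch
from llm.model import GPT, GPTConfig

model = GPT(GPTConfig())
prompt = torch.randint(0, 50257, (1, 5))              # batch of 1, five token ids

# Full pass over the prompt: logits plus one (k, v) pair per block.
logits, past_kvs = model(prompt)
next_id = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy pick, purely for illustration

# Later steps feed ONE new token and reuse the cache; forward offsets
# the position embeddings by the cached length automatically.
logits, past_kvs = model(next_id, past_kvs=past_kvs)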
3. Install PyTorch
The model uses PyTorch. From your project virtualenv:
pip install torch

PyTorch's CPU build is ~200 MB, so this takes a minute. If you have a recent Mac with Apple Silicon, the install also enables MPS (Metal Performance Shaders) for accelerated inference and training.
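To confirm the install worked, and to see which accelerator (if any) PyTorch found, a quick check you can run from the same virtualenv:

import torch

print(torch.__version__)
if torch.backends.mps.is_available():      # Apple Silicon (Metal)
    print("MPS available")
elif torch.cuda.is_available():            # NVIDIA GPU
    print("CUDA available")
else:
    print("CPU only")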
4. Sanity-check the model
Save this as scripts/check_model.py:
"""check_model.py — confirm the model's contract on a fake batch."""
import math
import torch
from llm.model import GPT, GPTConfig
# [1]
cfg = GPTConfig()
model = GPT(cfg)
# [2]
n_params = sum(p.numel() for p in model.parameters())
assert 13_500_000 < n_params < 14_500_000, f"expected ~14M params, got {n_params:,}"
# [3]
idx = torch.randint(0, cfg.vocab_size, (2, cfg.block_size))
logits, _ = model(idx)
expected_shape = (2, cfg.block_size, cfg.vocab_size)
assert tuple(logits.shape) == expected_shape, (
f"expected output shape {expected_shape}, got {tuple(logits.shape)}"
)
# [4]
targets = torch.randint(0, cfg.vocab_size, (2, cfg.block_size))
_, loss = model(idx, targets=targets)
expected_loss = math.log(cfg.vocab_size)
assert abs(loss.item() - expected_loss) < 1.0, (
f"expected loss near log(vocab) = {expected_loss:.2f}, got {loss.item():.2f}"
)
print(f"✓ {n_params:,} params")
print(f"✓ forward shape: {tuple(logits.shape)}")
print(f"✓ initial loss: {loss.item():.2f} (expected ~{expected_loss:.2f})")The check script is not a training script. It locks in three guarantees the rest of the book depends on:
- [1] GPTConfig() fixes the same dimensions used in the parameter-count cell.
- [2] asserts the parameter count is in the right ballpark (~14M). Drift past this band means a layer dropped, doubled, or got the wrong dimension.
- [3] asserts the forward shape is (batch, time, vocab) — one score per vocabulary token per position. This is the contract every later chapter assumes.
- [4] asserts the initial loss on random targets sits near log(vocab_size) ≈ 10.83. That is the "uniform-distribution" baseline of an untrained model. Far from it means the model is broken in a way that will not improve with training.
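Why log(vocab_size)? A freshly initialized model's logits carry essentially no information, so its softmax is close to uniform over the vocabulary, and the cross-entropy of a uniform guess is exactly log(vocab_size). A tiny sketch of that baseline:

import math
import torch
import torch.nn.functional as F

vocab = 50257
logits = torch.zeros(1, vocab)           # all-equal logits: uniform after softmax
target = torch.tensor([42])              # any token id gives the same loss
print(F.cross_entropy(logits, target))   # tensor(10.8249)
print(math.log(vocab))                   # 10.8249...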
Run it:
python -m scripts.check_model

You should see three ticks:
✓ 13,665,280 params
✓ forward shape: (2, 64, 50257)
✓ initial loss: 10.84 (expected ~10.83)
If any assertion fires instead, the message tells you which contract broke. ~14M params matches the estimate from the browser cell above. The output shape is one logit vector per (batch, position) pair — exactly what an autoregressive language model produces. The initial loss near log(50257) ≈ 10.83 is the uniform-distribution baseline of an untrained model.
What we're skipping vs the original nanoGPT
Karpathy's nanoGPT (~300 lines) adds:
- Weight tying between tok_emb and head (the embedding matrix is reused as the unembedding). Cuts ~vocab_size × n_embd parameters; mostly a memory optimization.
- Dropout in attention and FFN. We left dropout=0.0 for clarity; in practice you turn it on during training.
- Bias terms in the linear layers. Modern GPTs leave them off.
- Hyperparameter loading from a checkpoint.
The architecture is identical. We're trading some efficiency for clarity.
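The first of those, weight tying, is already wired into GPTConfig as a switch, so you can try it now. A small sketch, assuming the model code above:

from llm.model import GPT, GPTConfig

# Chapter 15 flips this on to match GPT-2; here it just shows the effect.
model = GPT(GPTConfig(tied_lm_head=True))
assert model.head.weight is model.tok_emb.weight   # one shared (vocab_size, n_embd) matrix
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,}")  # ~7.2M: the shared matrix saves vocab_size * n_embd = 6,432,896 params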
Recap
- The whole architecture fits in ~100 lines of PyTorch.
- Within a block, most parameters live in the FFN, not in attention. The FFN ratio is the architecture knob with the biggest impact on model size.
- Causal masking is the one line that distinguishes a decoder transformer (GPT) from an encoder transformer (BERT): only attend to past tokens.
- Positional embeddings are added to token embeddings before the first block. Without them, attention is permutation-equivariant.
- The chapter's llm/model.py is a complete, runnable architecture; the next chapter trains it.
Going further
- karpathy/nanoGPT — the reference. ~300 lines, very close to what we wrote.
- Karpathy's "Let's build GPT from scratch" — two-hour live walkthrough of the same code.
- Andrej's minGPT — the slightly less stripped-down predecessor of nanoGPT.
Next up: the training loop — we have a model that can produce logits. Now we make it produce correct ones.