The loss curve

Chapter 12 · 15 min

The minimum code

The minimum PyTorch code for a GPT-style model — embeddings, blocks, head, loss. Reads in one sitting and trains in chapter 13.

This is the chapter where the readable list-based llm/model.py from chapter 10 becomes a PyTorch model you can actually train. We're going to keep the code under 150 lines, leaning on PyTorch's built-in modules for the things that are not pedagogically interesting (linear layers, layer norm, embedding lookup) but writing the attention and the block ourselves so the chapter-by-chapter mapping stays clean. The result is a self-contained llm/model.py that defines a GPT-like decoder you can train on data/train.bin from chapter 11 and generate from in chapter 14.

1. How big is this thing?

Before we get to the code, do a back-of-envelope estimate. The model has these knobs:

  • vocab_size = 50,257 (GPT-2 BPE)
  • block_size = 64 (max context length we train on)
  • n_layer = 4
  • n_head = 4
  • n_embd = 128
  • ffn_mult = 4 (FFN hidden dim = 4 × n_embd)

What's the total parameter count? Break it down by source — you should be able to reason it out from chapters 8–10.

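If you want to check your arithmetic, here is one way to tally it in plain Python (our sketch, not book code; the shapes match llm/model.py in the next section, including which layers carry biases — n_head is absent because splitting heads is a reshape and doesn't change parameter counts):

vocab_size, block_size, n_layer, n_embd, ffn_mult = 50257, 64, 4, 128, 4
tok_emb = vocab_size * n_embd                     # token embedding table
pos_emb = block_size * n_embd                     # learned positional embeddings
attn = n_embd * 3 * n_embd + n_embd * n_embd      # qkv + output projection (bias=False)
ffn = (n_embd * (ffn_mult * n_embd) + ffn_mult * n_embd   # fc1 weight + bias
       + (ffn_mult * n_embd) * n_embd + n_embd)           # fc2 weight + bias
norms = 2 * 2 * n_embd                            # ln1 + ln2, weight and bias each
block = attn + ffn + norms
head = n_embd * vocab_size                        # unembedding (bias=False)
total = tok_emb + pos_emb + n_layer * block + 2 * n_embd + head  # 2*n_embd is ln_f
print(f"per block: attn={attn:,}  ffn={ffn:,}")   # the FFN is ~2x the attention
print(f"total: {total:,}")                        # 13,665,280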

Roughly 14 million parameters, most of them in the embedding and unembedding matrices (the vocab is large). The attention sublayers themselves are surprisingly small — the FFN dominates the per-block budget. This intuition survives scaling: in larger models the embeddings stop dominating, and most of an LLM's parameters live in the FFNs, not the attention.

2. The model code

Replace llm/model.py with this PyTorch version:

"""llm/model.py — a teaching-scale decoder transformer."""
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass
 
@dataclass
class GPTConfig:
    vocab_size: int = 50257
    block_size: int = 64
    n_layer: int = 4
    n_head: int = 4
    n_embd: int = 128
    ffn_mult: int = 4
    dropout: float = 0.0
    # Switches we'll use later in the book. Defaults preserve chapter-12 behavior.
    bias: bool = False               # bias on attention linears (ch.15 → True for GPT-2)
    tied_lm_head: bool = False       # share tok_emb ↔ head (ch.15 → True for GPT-2)
    gelu_approximate: str = "none"   # "none" or "tanh" (ch.15 → "tanh" for GPT-2)
 
class CausalSelfAttention(nn.Module):
    """Multi-head attention with a causal mask. Chapters 8-9."""
    def __init__(self, cfg):
        super().__init__()
        assert cfg.n_embd % cfg.n_head == 0
        self.n_head = cfg.n_head
        self.n_embd = cfg.n_embd
        # combined Q, K, V projection for speed
        self.qkv = nn.Linear(cfg.n_embd, 3 * cfg.n_embd, bias=cfg.bias)
        self.proj = nn.Linear(cfg.n_embd, cfg.n_embd, bias=cfg.bias)
        # causal mask: lower-triangular ones up to block_size
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(cfg.block_size, cfg.block_size)).view(1, 1, cfg.block_size, cfg.block_size),
        )
 
    def forward(self, x, past_kv=None):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(self.n_embd, dim=2)
        # split into heads: (B, T, n_head, head_dim) -> (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
 
        # KV cache: extend k, v with previously-computed keys/values.
        if past_kv is not None:
            past_k, past_v = past_kv
            k = torch.cat([past_k, k], dim=2)
            v = torch.cat([past_v, v], dim=2)
        present_kv = (k, v)
 
        # scaled dot-product attention with causal mask
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        if past_kv is None:
            # full prefix pass — apply causal mask
            att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        # else: T == 1, the new token attends to everything in the cache; no mask
        att = F.softmax(att, dim=-1)
        out = att @ v  # (B, n_head, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out), present_kv
 
class FFN(nn.Module):
    """Per-token MLP with GELU. Chapter 10."""
    def __init__(self, cfg):
        super().__init__()
        self.fc1 = nn.Linear(cfg.n_embd, cfg.ffn_mult * cfg.n_embd)
        self.fc2 = nn.Linear(cfg.ffn_mult * cfg.n_embd, cfg.n_embd)
        self.approximate = cfg.gelu_approximate
 
    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x), approximate=self.approximate))
 
class Block(nn.Module):
    """One transformer block, pre-norm. Chapter 10."""
    def __init__(self, cfg):
        super().__init__()
        self.ln1 = nn.LayerNorm(cfg.n_embd)
        self.attn = CausalSelfAttention(cfg)
        self.ln2 = nn.LayerNorm(cfg.n_embd)
        self.ffn = FFN(cfg)
 
    def forward(self, x, past_kv=None):
        attn_out, present_kv = self.attn(self.ln1(x), past_kv=past_kv)
        x = x + attn_out
        x = x + self.ffn(self.ln2(x))
        return x, present_kv
 
class GPT(nn.Module):
    """Full decoder transformer. Chapters 8-10 + token + position embeddings."""
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.n_embd)
        self.pos_emb = nn.Embedding(cfg.block_size, cfg.n_embd)
        self.blocks = nn.ModuleList([Block(cfg) for _ in range(cfg.n_layer)])
        self.ln_f = nn.LayerNorm(cfg.n_embd)
        self.head = nn.Linear(cfg.n_embd, cfg.vocab_size, bias=False)
        if cfg.tied_lm_head:
            self.head.weight = self.tok_emb.weight
 
    def forward(self, idx, targets=None, past_kvs=None):
        B, T = idx.shape
        # When caching, position embeddings must offset by the cached length.
        offset = 0 if past_kvs is None else past_kvs[0][0].size(2)
        tok = self.tok_emb(idx)
        pos = self.pos_emb(torch.arange(offset, offset + T, device=idx.device))
        x = tok + pos
        present_kvs = []
        past_kvs = past_kvs or [None] * len(self.blocks)
        for block, past_kv in zip(self.blocks, past_kvs):
            x, present_kv = block(x, past_kv=past_kv)
            present_kvs.append(present_kv)
        x = self.ln_f(x)
        logits = self.head(x)
        if targets is None:
            return logits, present_kvs
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

Read the model from bottom to top once, then top to bottom:

  • GPT.forward is the whole language-model interface. Token ids go in; logits come out.
  • tok_emb(idx) turns ids into vectors. This is chapter 4, now as a PyTorch table.
  • pos_emb(torch.arange(offset, offset + T, ...)) gives every position its own learned vector, so order matters (offset is 0 whenever there is no KV cache).
  • Each Block updates the residual stream without changing its shape.
  • CausalSelfAttention is chapters 8–9: make Q/K/V, split heads, mask the future, mix values, project back.
  • FFN is chapter 10's per-token MLP.
  • F.cross_entropy compares all logits against the true next-token ids and gives training one scalar to minimize.

That's roughly a hundred lines of actual code. Skim it. Every block maps to a chapter:

  • CausalSelfAttention.forward → chapters 8 (single head) + 9 (multi-head), plus the causal mask, which is new: a token at position t isn't allowed to attend to positions > t. We fill the upper triangle of the attention scores with -inf before the softmax, which drives those attention weights to zero. That's the one line that makes this a language model and not just a sequence encoder.
  • FFN → chapter 10 cell 1.
  • Block → chapter 10 cell 2.
  • GPT.forward → chapter 10 cell 3, plus the token + positional embedding step we skipped.
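
One piece of the code that has no chapter of its own yet is the past_kv plumbing. It only matters at generation time, but the contract is already visible from the signatures. Here is a minimal sketch of how a caller might use it (the prompt tokens are fake, and chapter 14's generate code may look different):

import torch
from llm.model import GPT, GPTConfig

model = GPT(GPTConfig())
model.eval()
prompt = torch.randint(0, 50257, (1, 8))                  # fake 8-token prompt
with torch.no_grad():
    logits, past_kvs = model(prompt)                      # full prefix pass, causal mask applied
    next_id = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy pick of the next token
    logits, past_kvs = model(next_id, past_kvs=past_kvs)  # feed one token; cache grows by one

On the cached pass T == 1 and the new token attends over everything already in the cache, which is why the mask is skipped there.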

3. Install PyTorch

The model uses PyTorch. From your project virtualenv:

pip install torch

PyTorch's CPU build is ~200 MB, so this takes a minute. If you have a recent Mac with Apple Silicon, the install also enables MPS (Metal Performance Shaders) for accelerated inference and training.
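
To confirm what acceleration you ended up with, a quick check (a sketch; chapter 13's training script will pick its device on its own):

import torch
if torch.cuda.is_available():
    device = "cuda"   # NVIDIA GPU
elif torch.backends.mps.is_available():
    device = "mps"    # Apple Silicon
else:
    device = "cpu"
print(device)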

4. Sanity-check the model

Save this as scripts/check_model.py:

"""check_model.py — confirm the model's contract on a fake batch."""
import math
import torch
 
from llm.model import GPT, GPTConfig
 
 
# [1]
cfg = GPTConfig()
model = GPT(cfg)
 
# [2]
n_params = sum(p.numel() for p in model.parameters())
assert 13_500_000 < n_params < 14_500_000, f"expected ~14M params, got {n_params:,}"
 
# [3]
idx = torch.randint(0, cfg.vocab_size, (2, cfg.block_size))
logits, _ = model(idx)
expected_shape = (2, cfg.block_size, cfg.vocab_size)
assert tuple(logits.shape) == expected_shape, (
    f"expected output shape {expected_shape}, got {tuple(logits.shape)}"
)
 
# [4]
targets = torch.randint(0, cfg.vocab_size, (2, cfg.block_size))
_, loss = model(idx, targets=targets)
expected_loss = math.log(cfg.vocab_size)
assert abs(loss.item() - expected_loss) < 1.0, (
    f"expected loss near log(vocab) = {expected_loss:.2f}, got {loss.item():.2f}"
)
 
print(f"✓ {n_params:,} params")
print(f"✓ forward shape: {tuple(logits.shape)}")
print(f"✓ initial loss: {loss.item():.2f} (expected ~{expected_loss:.2f})")

The check script is not a training script. It locks in three guarantees the rest of the book depends on:

  • [1] GPTConfig() fixes the same dimensions used in the back-of-envelope from section 1.
  • [2] asserts the parameter count is in the right ballpark (~14M). Drift past this band means a layer dropped, doubled, or got the wrong dimension.
  • [3] asserts the forward shape is (batch, time, vocab) — one logit per vocabulary token at every position. This is the contract every later chapter assumes.
  • [4] asserts the initial loss on random targets sits near log(vocab_size) ≈ 10.82. That is the "uniform-distribution" baseline of an untrained model (see the two-liner below). Far from it means the model is broken in a way that training will not fix.
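
Why log(vocab_size)? An untrained model's logits are roughly uniform, so the correct token gets probability ≈ 1/vocab_size and the cross-entropy is -log(1/vocab_size) = log(vocab_size). You can check the arithmetic in two lines:

import math
print(math.log(50257))  # 10.8249..., the baseline check [4] tolerates +/-1.0 around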

Run the check script:

python -m scripts.check_model

You should see three ticks:

✓ 13,665,280 params
✓ forward shape: (2, 64, 50257)
✓ initial loss: 10.84 (expected ~10.82)

If any assertion fires instead, the message tells you which contract broke. The ~14M parameter count matches the back-of-envelope in section 1. The output shape is one logit vector per (batch, position) pair — exactly what an autoregressive model produces. The initial loss near log(50257) ≈ 10.82 is the uniform-distribution baseline of an untrained model.

What we're skipping vs the original nanoGPT

Karpathy's nanoGPT (~300 lines) adds:

  • Weight tying between tok_emb and head (the embedding matrix is reused as the unembedding). Cuts ~vocab_size × n_embd parameters; mostly a memory optimization. Our GPTConfig already exposes this as tied_lm_head, off by default until chapter 15.
  • Dropout in attention and FFN. We left dropout=0.0 for clarity; in practice you turn it on during training.
  • Bias terms in the linear layers and LayerNorms behind a single flag. Our bias switch covers only the attention linears (the FFN keeps nn.Linear's default biases); modern GPTs often leave biases off entirely.
  • Hyperparameter loading from a checkpoint.

The architecture is identical. We're trading some efficiency for clarity.
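
If you're wondering about the dormant switches in GPTConfig: chapter 15 flips them to match GPT-2's conventions. Nothing in this chapter depends on them, but the config already accepts the GPT-2 flavor:

cfg = GPTConfig(bias=True, tied_lm_head=True, gelu_approximate="tanh")  # per the comments in GPTConfig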

Recap

  • The whole architecture fits in ~100 lines of PyTorch.
  • Within each block, most parameters live in the FFN, not in attention; the FFN multiplier is the architecture knob with the biggest impact on per-block size.
  • Causal masking is the one line that distinguishes a decoder transformer (GPT) from an encoder transformer (BERT): only attend to past tokens.
  • Positional embeddings are added to token embeddings before the first block. Without them, attention is permutation-equivariant.
  • The chapter's llm/model.py is a complete, runnable architecture; the next chapter trains it.

Going further

Next up: the training loop — we have a model that can produce logits. Now we make it produce correct ones.