The loss curve

Chapter 12 · 15 min

The minimum code

The minimum PyTorch code for a GPT-style model — embeddings, blocks, head, loss. Reads in one sitting and trains in chapter 13.

This is the chapter where the readable list-based llm/model.py from chapter 10 becomes a PyTorch model you can actually train. We're going to keep the code under 150 lines, leaning on PyTorch's built-in modules for the things that are not pedagogically interesting (linear layers, layer norm, embedding lookup) but writing the attention and the block ourselves so the chapter-by-chapter mapping stays clean. The result is a self-contained llm/model.py that defines a GPT-like decoder you can train on data/train.bin from chapter 11 and generate from in chapter 14.

1. How big is this thing?

Before we get to the code, do a back-of-envelope estimate. The model has these knobs:

  • vocab_size = 50,257 (GPT-2 BPE)
  • block_size = 64 (max context length we train on)
  • n_layer = 4
  • n_head = 4
  • n_embd = 128
  • ffn_mult = 4 (FFN hidden dim = 4 × n_embd)

What's the total parameter count? Break it down by source — you should be able to reason it out from chapters 8–10.

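If you want to check your arithmetic, here is one way to tally it in plain Python (our sketch, not book code; the shapes match llm/model.py in the next section, including which layers carry biases — n_head is absent because splitting heads is a reshape and doesn't change parameter counts):

vocab_size, block_size, n_layer, n_embd, ffn_mult = 50257, 64, 4, 128, 4
tok_emb = vocab_size * n_embd                     # token embedding table
pos_emb = block_size * n_embd                     # learned positional embeddings
attn = n_embd * 3 * n_embd + n_embd * n_embd      # qkv + output projection (bias=False)
ffn = (n_embd * (ffn_mult * n_embd) + ffn_mult * n_embd   # fc1 weight + bias
       + (ffn_mult * n_embd) * n_embd + n_embd)           # fc2 weight + bias
norms = 2 * 2 * n_embd                            # ln1 + ln2, weight and bias each
block = attn + ffn + norms
head = n_embd * vocab_size                        # unembedding (bias=False)
total = tok_emb + pos_emb + n_layer * block + 2 * n_embd + head  # 2*n_embd is ln_f
print(f"per block: attn={attn:,}  ffn={ffn:,}")   # the FFN is ~2x the attention
print(f"total: {total:,}")                        # 13,665,280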

Roughly 14 million parameters, most of them in the embedding and unembedding matrices (the vocab is large). The attention sublayers themselves are surprisingly small — the FFN dominates the per-block budget. This intuition survives scaling: in larger models the embeddings stop dominating, and most of an LLM's parameters live in the FFNs, not the attention.

2. The model code

Replace llm/model.py with this PyTorch version:

"""llm/model.py — a teaching-scale decoder transformer."""
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass
 
@dataclass
class GPTConfig:
    vocab_size: int = 50257
    block_size: int = 64
    n_layer: int = 4
    n_head: int = 4
    n_embd: int = 128
    ffn_mult: int = 4
    dropout: float = 0.0
    # Switches we'll use later in the book. Defaults preserve chapter-12 behavior.
    bias: bool = False               # bias on attention linears (ch.15 → True for GPT-2)
    tied_lm_head: bool = False       # share tok_emb ↔ head (ch.15 → True for GPT-2)
    gelu_approximate: str = "none"   # "none" or "tanh" (ch.15 → "tanh" for GPT-2)
 
class CausalSelfAttention(nn.Module):
    """Multi-head attention with a causal mask. Chapters 8-9."""
    def __init__(self, cfg):
        super().__init__()
        assert cfg.n_embd % cfg.n_head == 0
        self.n_head = cfg.n_head
        self.n_embd = cfg.n_embd
        # combined Q, K, V projection for speed
        self.qkv = nn.Linear(cfg.n_embd, 3 * cfg.n_embd, bias=cfg.bias)
        self.proj = nn.Linear(cfg.n_embd, cfg.n_embd, bias=cfg.bias)
        # causal mask: lower-triangular ones up to block_size
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(cfg.block_size, cfg.block_size)).view(1, 1, cfg.block_size, cfg.block_size),
        )
 
    def forward(self, x, past_kv=None):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(self.n_embd, dim=2)
        # split into heads: (B, T, n_head, head_dim) -> (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
 
        # KV cache: extend k, v with previously-computed keys/values.
        if past_kv is not None:
            past_k, past_v = past_kv
            k = torch.cat([past_k, k], dim=2)
            v = torch.cat([past_v, v], dim=2)
        present_kv = (k, v)
 
        # scaled dot-product attention with causal mask
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        if past_kv is None:
            # full prefix pass — apply causal mask
            att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        # else: T == 1, the new token attends to everything in the cache; no mask
        att = F.softmax(att, dim=-1)
        out = att @ v  # (B, n_head, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out), present_kv
 
class FFN(nn.Module):
    """Per-token MLP with GELU. Chapter 10."""
    def __init__(self, cfg):
        super().__init__()
        self.fc1 = nn.Linear(cfg.n_embd, cfg.ffn_mult * cfg.n_embd)
        self.fc2 = nn.Linear(cfg.ffn_mult * cfg.n_embd, cfg.n_embd)
        self.approximate = cfg.gelu_approximate
 
    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x), approximate=self.approximate))
 
class Block(nn.Module):
    """One transformer block, pre-norm. Chapter 10."""
    def __init__(self, cfg):
        super().__init__()
        self.ln1 = nn.LayerNorm(cfg.n_embd)
        self.attn = CausalSelfAttention(cfg)
        self.ln2 = nn.LayerNorm(cfg.n_embd)
        self.ffn = FFN(cfg)
 
    def forward(self, x, past_kv=None):
        attn_out, present_kv = self.attn(self.ln1(x), past_kv=past_kv)
        x = x + attn_out
        x = x + self.ffn(self.ln2(x))
        return x, present_kv
 
class GPT(nn.Module):
    """Full decoder transformer. Chapters 8-10 + token + position embeddings."""
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.n_embd)
        self.pos_emb = nn.Embedding(cfg.block_size, cfg.n_embd)
        self.blocks = nn.ModuleList([Block(cfg) for _ in range(cfg.n_layer)])
        self.ln_f = nn.LayerNorm(cfg.n_embd)
        self.head = nn.Linear(cfg.n_embd, cfg.vocab_size, bias=False)
        if cfg.tied_lm_head:
            self.head.weight = self.tok_emb.weight
 
    def forward(self, idx, targets=None, past_kvs=None):
        B, T = idx.shape
        # When caching, position embeddings must offset by the cached length.
        offset = 0 if past_kvs is None else past_kvs[0][0].size(2)
        tok = self.tok_emb(idx)
        pos = self.pos_emb(torch.arange(offset, offset + T, device=idx.device))
        x = tok + pos
        present_kvs = []
        past_kvs = past_kvs or [None] * len(self.blocks)
        for block, past_kv in zip(self.blocks, past_kvs):
            x, present_kv = block(x, past_kv=past_kv)
            present_kvs.append(present_kv)
        x = self.ln_f(x)
        logits = self.head(x)
        if targets is None:
            return logits, present_kvs
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

Read the model from bottom to top once, then top to bottom:

  • GPT.forward is the whole language-model interface. Token ids go in; logits come out.
  • tok_emb(idx) turns ids into vectors. This is chapter 4, now as a PyTorch table.
  • pos_emb(torch.arange(offset, offset + T, ...)) gives every position its own learned vector, so order matters (offset is 0 whenever there is no KV cache).
  • Each Block updates the residual stream without changing its shape.
  • CausalSelfAttention is chapters 8–9: make Q/K/V, split heads, mask the future, mix values, project back.
  • FFN is chapter 10's per-token MLP.
  • F.cross_entropy compares all logits against the true next-token ids and gives training one scalar to minimize.

That's roughly a hundred lines of actual code. Skim it. Every block maps to a chapter:

  • CausalSelfAttention.forward → chapters 8 (single head) + 9 (multi-head), plus the causal mask, which is new: a token at position t isn't allowed to attend to positions > t. We fill the upper triangle of the attention scores with -inf before the softmax, which drives those attention weights to zero. That's the one line that makes this a language model and not just a sequence encoder.
  • FFN → chapter 10 cell 1.
  • Block → chapter 10 cell 2.
  • GPT.forward → chapter 10 cell 3, plus the token + positional embedding step we skipped.
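
One piece of the code that has no chapter of its own yet is the past_kv plumbing. It only matters at generation time, but the contract is already visible from the signatures. Here is a minimal sketch of how a caller might use it (the prompt tokens are fake, and chapter 14's generate code may look different):

import torch
from llm.model import GPT, GPTConfig

model = GPT(GPTConfig())
model.eval()
prompt = torch.randint(0, 50257, (1, 8))                  # fake 8-token prompt
with torch.no_grad():
    logits, past_kvs = model(prompt)                      # full prefix pass, causal mask applied
    next_id = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy pick of the next token
    logits, past_kvs = model(next_id, past_kvs=past_kvs)  # feed one token; cache grows by one

On the cached pass T == 1 and the new token attends over everything already in the cache, which is why the mask is skipped there.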

3. Install PyTorch

The model uses PyTorch. From your project virtualenv:

pip install torch

PyTorch's CPU build is ~200 MB, so this takes a minute. If you have a recent Mac with Apple Silicon, the install also enables MPS (Metal Performance Shaders) for accelerated inference and training.
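
To confirm what acceleration you ended up with, a quick check (a sketch; chapter 13's training script will pick its device on its own):

import torch
if torch.cuda.is_available():
    device = "cuda"   # NVIDIA GPU
elif torch.backends.mps.is_available():
    device = "mps"    # Apple Silicon
else:
    device = "cpu"
print(device)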

4. Sanity-check the model

Save this as scripts/check_model.py:

"""check_model.py — confirm the model's contract on a fake batch."""
import math
import torch
 
from llm.model import GPT, GPTConfig
 
 
# [1]
cfg = GPTConfig()
model = GPT(cfg)
 
# [2]
n_params = sum(p.numel() for p in model.parameters())
assert 13_500_000 < n_params < 14_500_000, f"expected ~14M params, got {n_params:,}"
 
# [3]
idx = torch.randint(0, cfg.vocab_size, (2, cfg.block_size))
logits, _ = model(idx)
expected_shape = (2, cfg.block_size, cfg.vocab_size)
assert tuple(logits.shape) == expected_shape, (
    f"expected output shape {expected_shape}, got {tuple(logits.shape)}"
)
 
# [4]
targets = torch.randint(0, cfg.vocab_size, (2, cfg.block_size))
_, loss = model(idx, targets=targets)
expected_loss = math.log(cfg.vocab_size)
assert abs(loss.item() - expected_loss) < 1.0, (
    f"expected loss near log(vocab) = {expected_loss:.2f}, got {loss.item():.2f}"
)
 
print(f"✓ {n_params:,} params")
print(f"✓ forward shape: {tuple(logits.shape)}")
print(f"✓ initial loss: {loss.item():.2f} (expected ~{expected_loss:.2f})")

The check script is not a training script. It locks in three guarantees the rest of the book depends on:

  • [1] GPTConfig() fixes the same dimensions used in the back-of-envelope from section 1.
  • [2] asserts the parameter count is in the right ballpark (~14M). Drift past this band means a layer dropped, doubled, or got the wrong dimension.
  • [3] asserts the forward shape is (batch, time, vocab) — one logit per vocabulary token at every position. This is the contract every later chapter assumes.
  • [4] asserts the initial loss on random targets sits near log(vocab_size) ≈ 10.82. That is the "uniform-distribution" baseline of an untrained model (see the two-liner below). Far from it means the model is broken in a way that training will not fix.
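
Why log(vocab_size)? An untrained model's logits are roughly uniform, so the correct token gets probability ≈ 1/vocab_size and the cross-entropy is -log(1/vocab_size) = log(vocab_size). You can check the arithmetic in two lines:

import math
print(math.log(50257))  # 10.8249..., the baseline check [4] tolerates +/-1.0 around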

Run the check script:

python -m scripts.check_model

You should see three ticks:

✓ 13,665,280 params
✓ forward shape: (2, 64, 50257)
✓ initial loss: 10.84 (expected ~10.82)

If any assertion fires instead, the message tells you which contract broke. The ~14M parameter count matches the back-of-envelope in section 1. The output shape is one logit vector per (batch, position) pair — exactly what an autoregressive model produces. The initial loss near log(50257) ≈ 10.82 is the uniform-distribution baseline of an untrained model.

What we're skipping vs the original nanoGPT

Karpathy's nanoGPT (~300 lines) adds:

  • Weight tying between tok_emb and head (the embedding matrix is reused as the unembedding). Cuts ~vocab_size × n_embd parameters; mostly a memory optimization. Our GPTConfig already exposes this as tied_lm_head, off by default until chapter 15.
  • Dropout in attention and FFN. We left dropout=0.0 for clarity; in practice you turn it on during training.
  • Bias terms in the linear layers and LayerNorms behind a single flag. Our bias switch covers only the attention linears (the FFN keeps nn.Linear's default biases); modern GPTs often leave biases off entirely.
  • Hyperparameter loading from a checkpoint.

The architecture is identical. We're trading some efficiency for clarity.
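
If you're wondering about the dormant switches in GPTConfig: chapter 15 flips them to match GPT-2's conventions. Nothing in this chapter depends on them, but the config already accepts the GPT-2 flavor:

cfg = GPTConfig(bias=True, tied_lm_head=True, gelu_approximate="tanh")  # per the comments in GPTConfig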

Recap

  • The whole architecture fits in ~100 lines of PyTorch.
  • Within each block, most parameters live in the FFN, not in attention; the FFN multiplier is the architecture knob with the biggest impact on per-block size.
  • Causal masking is the one line that distinguishes a decoder transformer (GPT) from an encoder transformer (BERT): only attend to past tokens.
  • Positional embeddings are added to token embeddings before the first block. Without them, attention is permutation-equivariant.
  • The chapter's llm/model.py is a complete, runnable architecture; the next chapter trains it.

Going further

Next up: the training loop — we have a model that can produce logits. Now we make it produce correct ones.