The loss curve

Chapter 15 · 14 min

Load real weights

Map GPT-2's parameter names to yours and load real weights into the architecture you wrote. From toy model to small GPT.

You trained a 14M-parameter model on ~1 MB of Shakespeare. GPT-2 small is 124M parameters trained on 40 GB of WebText.

Same architecture. Exactly the same architecture, modulo three small implementation details. So let's prove it: take OpenAI's pretrained weights, push them into your own llm/model.py, and watch your code generate coherent English.

This chapter is not about a new model. It is about realizing that the code you wrote in chapters 8–12 already supports models orders of magnitude larger than the one you trained. The rest of the book — SFT, LoRA, quantization, chat — applies just as well on top of these weights as it did on top of yours.

1. The three deltas

Compare your model from chapter 12 with the original GPT-2:

  • Tokenizer: GPT-2 BPE, 50,257 tokens. ✓ Identical (you used tiktoken in chapter 11).
  • Attention + FFN + LayerNorm pre-attention: identical block structure.
  • Learned positional embeddings: same mechanism, just a longer table (block_size = 1024 vs your 64).
  • n_embd / n_head / n_layer: GPT-2 small uses 768 / 12 / 12, you used 128 / 4 / 4. Same hyperparameters, different values.

Then the three small implementation differences:

  • Bias on linear layers. GPT-2 trained with bias=True on every linear (Q, K, V, the output projection, both FFN linears). Your chapter-12 model has bias=False on attention. If we load weights without biases, the learned bias contributions are silently dropped.
  • GELU approximation. GPT-2 used the tanh approximation of GELU: 0.5x(1 + tanh(√(2/π)(x + 0.044715 x³))). PyTorch's default F.gelu is the exact formulation. The two differ by roughly 0.001 per activation — tiny, but compounded over 12 layers and 768 dimensions on weights that were trained with the tanh version, the output drifts. (A quick numerical check follows this list.)
  • Embedding ↔ unembedding tying. GPT-2 reuses the token embedding matrix as the final unembedding (lm_head.weight = wte.weight), a free ~38M-parameter reduction. Your chapter-12 model has a separate head layer.
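
To see the size of that GELU gap for yourself, PyTorch exposes the tanh variant directly via the approximate="tanh" argument. A quick check:

import torch
import torch.nn.functional as F

x = torch.randn(4, 768)
exact = F.gelu(x)                            # PyTorch default: exact (erf-based) GELU
approx = F.gelu(x, approximate="tanh")       # GPT-2's tanh approximation
print((exact - approx).abs().max().item())   # on the order of 1e-3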

That's the whole compatibility surface. Three knobs — and they are already in your GPTConfig since chapter 12. Look at the dataclass: bias, tied_lm_head, gelu_approximate are right there with the defaults that preserve chapter-12 behavior. We never asked you to patch llm/model.py retroactively; we just left the switches off until now.
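
For reference, the relevant slice of that dataclass might look like the sketch below; the exact defaults (for example how gelu_approximate spells "exact") may differ slightly in your llm/model.py:

from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 50257          # GPT-2 BPE, chapter 11
    block_size: int = 64             # chapter-12 context length
    n_layer: int = 4
    n_head: int = 4
    n_embd: int = 128
    ffn_mult: int = 4
    bias: bool = False               # chapter 12: no bias on attention linears
    tied_lm_head: bool = False       # chapter 12: separate unembedding head
    gelu_approximate: str = "none"   # chapter 12: PyTorch's exact GELU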

2. Flip the switches

To load GPT-2 small, instantiate GPTConfig with the three flags turned on and the GPT-2 dimensions plugged in:

GPTConfig(
    vocab_size=50257,
    block_size=1024,
    n_layer=12,
    n_head=12,
    n_embd=768,
    ffn_mult=4,
    bias=True,
    tied_lm_head=True,
    gelu_approximate="tanh",
)

Same model class. Same forward pass. Your trained checkpoint from chapter 13 still loads via the default config — nothing changes for it. This larger config is just a different instantiation of the same architecture.
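
If you want to see the jump in raw numbers, instantiate both configs and count parameters. A quick sketch, assuming GPTConfig() with no arguments reproduces the chapter-12 defaults:

from llm.model import GPT, GPTConfig

gpt2_small = GPTConfig(
    vocab_size=50257, block_size=1024, n_layer=12, n_head=12,
    n_embd=768, ffn_mult=4, bias=True, tied_lm_head=True,
    gelu_approximate="tanh",
)

for name, cfg in [("chapter-12 defaults", GPTConfig()), ("gpt2-small", gpt2_small)]:
    n = sum(p.numel() for p in GPT(cfg).parameters())
    print(f"{name}: {n:,} parameters")   # roughly 14M vs 124,439,808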

3. The weight-loading script

Install transformers once:

pip install transformers

Then save scripts/load_gpt2.py:

"""scripts/load_gpt2.py — load HuggingFace GPT-2 small into our model."""
from pathlib import Path
 
import torch
 
from llm.model import GPT, GPTConfig
 
 
# [1]
def gpt2_small_config() -> GPTConfig:
    return GPTConfig(
        vocab_size=50257,
        block_size=1024,
        n_layer=12,
        n_head=12,
        n_embd=768,
        ffn_mult=4,
        bias=True,
        tied_lm_head=True,
        gelu_approximate="tanh",
    )
 
 
# [2]
TOP_LEVEL = {
    "transformer.wte.weight": "tok_emb.weight",
    "transformer.wpe.weight": "pos_emb.weight",
    "transformer.ln_f.weight": "ln_f.weight",
    "transformer.ln_f.bias": "ln_f.bias",
}
 
# [3]
PER_LAYER = [
    ("transformer.h.{i}.ln_1.weight", "blocks.{i}.ln1.weight"),
    ("transformer.h.{i}.ln_1.bias", "blocks.{i}.ln1.bias"),
    ("transformer.h.{i}.attn.c_attn.weight", "blocks.{i}.attn.qkv.weight"),
    ("transformer.h.{i}.attn.c_attn.bias", "blocks.{i}.attn.qkv.bias"),
    ("transformer.h.{i}.attn.c_proj.weight", "blocks.{i}.attn.proj.weight"),
    ("transformer.h.{i}.attn.c_proj.bias", "blocks.{i}.attn.proj.bias"),
    ("transformer.h.{i}.ln_2.weight", "blocks.{i}.ln2.weight"),
    ("transformer.h.{i}.ln_2.bias", "blocks.{i}.ln2.bias"),
    ("transformer.h.{i}.mlp.c_fc.weight", "blocks.{i}.ffn.fc1.weight"),
    ("transformer.h.{i}.mlp.c_fc.bias", "blocks.{i}.ffn.fc1.bias"),
    ("transformer.h.{i}.mlp.c_proj.weight", "blocks.{i}.ffn.fc2.weight"),
    ("transformer.h.{i}.mlp.c_proj.bias", "blocks.{i}.ffn.fc2.bias"),
]
 
# [4]
TRANSPOSE_SUFFIXES = (
    "attn.c_attn.weight",
    "attn.c_proj.weight",
    "mlp.c_fc.weight",
    "mlp.c_proj.weight",
)
 
 
def translate(hf_state: dict, n_layer: int) -> dict:
    out: dict = {}
    for hf_key, our_key in TOP_LEVEL.items():
        out[our_key] = hf_state[hf_key]
    for i in range(n_layer):
        for hf_template, our_template in PER_LAYER:
            hf_key = hf_template.format(i=i)
            tensor = hf_state[hf_key]
            # [5]
            if hf_key.endswith(TRANSPOSE_SUFFIXES):
                tensor = tensor.t().contiguous()
            out[our_template.format(i=i)] = tensor
    return out
 
 
def main() -> None:
    from transformers import GPT2LMHeadModel
 
    cfg = gpt2_small_config()
    model = GPT(cfg)
    n_params = sum(p.numel() for p in model.parameters())
 
    # [6]
    assert n_params == 124_439_808, f"expected 124,439,808 params, got {n_params:,}"
 
    # [7]
    hf_state = GPT2LMHeadModel.from_pretrained("gpt2").state_dict()
    our_state = translate(hf_state, n_layer=cfg.n_layer)
 
    # [8]
    missing, unexpected = model.load_state_dict(our_state, strict=False)
    assert not unexpected, f"unexpected keys in our_state: {unexpected}"
    # With tied_lm_head, head.weight shares storage with tok_emb.weight,
    # so it's reported as "missing" but is loaded via the tie.
    allowed_missing = lambda k: "attn.mask" in k or k == "head.weight"
    assert all(allowed_missing(k) for k in missing), (
        f"missing keys other than causal masks / tied head: {missing}"
    )
 
    Path("checkpoints").mkdir(exist_ok=True)
    torch.save(model.state_dict(), "checkpoints/gpt2_small.pt")
    print(f"✓ {n_params:,} params (matches GPT-2 small exactly)")
    print(f"✓ {len(missing)} missing keys, all causal-mask buffers (expected)")
    print(f"✓ {len(unexpected)} unexpected keys")
    print("✓ saved checkpoints/gpt2_small.pt")
 
 
if __name__ == "__main__":
    main()

Eight things worth reading carefully:

  • [1] is the GPT-2 small spec, expressed as your own GPTConfig. Three flags flipped from default; everything else is just numbers.
  • [2] Top-level keys map 1:1: token embeddings, position embeddings, final LayerNorm.
  • [3] Per-layer keys are templates because every transformer block has the same shape.
  • [4] TRANSPOSE_SUFFIXES lists the keys that need transposing. HuggingFace stores attention and FFN linears as Conv1D, whose weight matrix is the transpose of nn.Linear's — (in, out) vs (out, in). The math is the same; the storage convention differs.

Try the translation logic yourself on a sample of GPT-2 keys:

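A quick way to poke at it in a REPL (this downloads the same ~500 MB checkpoint the loader uses; the printed shapes illustrate the Conv1D transpose):

from transformers import GPT2LMHeadModel
from scripts.load_gpt2 import translate

hf_state = GPT2LMHeadModel.from_pretrained("gpt2").state_dict()
ours = translate(hf_state, n_layer=12)

# Conv1D stores (in, out); nn.Linear expects (out, in), hence the transpose.
print(hf_state["transformer.h.0.attn.c_attn.weight"].shape)  # torch.Size([768, 2304])
print(ours["blocks.0.attn.qkv.weight"].shape)                # torch.Size([2304, 768])

# The translated dict uses your model's key names throughout.
print(list(ours)[:4])  # tok_emb.weight, pos_emb.weight, ln_f.weight, ln_f.bias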

  • [5] flips those tensors at load time. .contiguous() because PyTorch dislikes loading from non-contiguous storage.
  • [6] asserts your GPTConfig instantiates to the exact GPT-2 small count (124,439,808). Off by even one means a dimension is wrong before the loader has a chance to fail more obscurely later.
  • [7] downloads GPT-2 small (~500 MB) on first run, then caches it. from_pretrained is the only call to transformers we need — it returns a state dict and we never touch their model class again.
  • [8] strict=False plus two assertions: no unexpected keys means our map covered everything; the only allowed missing keys are the blocks.{i}.attn.mask causal-mask buffers (which we don't need to load) and head.weight (which is tied to tok_emb.weight, so loading tok_emb.weight propagates automatically).
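
One line is enough to convince yourself of that tie after the loader runs (assuming the module attributes are named tok_emb and head, as the state-dict keys above imply):

# Inside main(), after load_state_dict: the tied unembedding shares storage with
# the token embedding, so copying tok_emb.weight in also fills head.weight.
assert model.head.weight.data_ptr() == model.tok_emb.weight.data_ptr()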

Run it:

python -m scripts.load_gpt2

The script downloads ~500 MB on first run, then writes checkpoints/gpt2_small.pt. Four ticks expected:

✓ 124,439,808 params (matches GPT-2 small exactly)
✓ 13 missing keys (12 causal-mask buffers + tied head.weight, all expected)
✓ 0 unexpected keys
✓ saved checkpoints/gpt2_small.pt

124M parameters. Roughly 9× the size of your trained model. Loaded into the same code. If any assertion fires instead, the message points at the exact mismatch.
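
You can also sanity-check the numbers, not just the key bookkeeping. A minimal parity sketch, assuming your forward pass returns (logits, loss) as in the sampler below, compares your logits against HuggingFace's on the same tokens:

"""Optional check: our logits vs. HuggingFace's on identical input."""
import torch
from transformers import GPT2LMHeadModel

from llm.model import GPT
from scripts.load_gpt2 import gpt2_small_config

hf = GPT2LMHeadModel.from_pretrained("gpt2").eval()
ours = GPT(gpt2_small_config())
ours.load_state_dict(torch.load("checkpoints/gpt2_small.pt"))
ours.eval()

tokens = torch.randint(0, 50257, (1, 16))  # any token ids will do for a parity check
with torch.no_grad():
    hf_logits = hf(tokens).logits
    our_logits, _ = ours(tokens)           # assumes (logits, loss) return signature

# Should be tiny (float noise); a large gap points at a mapping or config error.
print("max abs logit diff:", (hf_logits - our_logits).abs().max().item())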

4. Sample from GPT-2 using your own model

Save as scripts/sample_gpt2.py:

"""scripts/sample_gpt2.py — sample from GPT-2 small using our GPT class."""
import torch
import tiktoken
 
from llm.model import GPT
from scripts.load_gpt2 import gpt2_small_config
 
 
device = "mps" if torch.backends.mps.is_available() else (
    "cuda" if torch.cuda.is_available() else "cpu"
)
cfg = gpt2_small_config()
model = GPT(cfg).to(device)
model.load_state_dict(torch.load("checkpoints/gpt2_small.pt", map_location=device))
model.eval()
 
enc = tiktoken.get_encoding("gpt2")
prompt = "The capital of France is"
idx = torch.tensor([enc.encode_ordinary(prompt)], device=device)
 
with torch.no_grad():
    for _ in range(40):
        ctx = idx if idx.size(1) <= cfg.block_size else idx[:, -cfg.block_size :]
        logits, _ = model(ctx)
        probs = torch.softmax(logits[:, -1, :] / 0.7, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
 
print(enc.decode(idx[0].tolist()))

This is the same sampler from chapter 14. The only thing that changed is the checkpoint being loaded.

python -m scripts.sample_gpt2

Expected output (the exact text varies from run to run, but the shape is consistent):

The capital of France is Paris. It is also the capital of the country of France, which is the largest country in the European Union.

This is your llm/model.py. The architecture you assembled in chapters 8–12. With weights trained by OpenAI in 2019. The output is not ChatGPT — GPT-2 small is a 2019 model, no SFT, no RLHF — but the next-token prediction is fluent and factual in a way your Shakespeare model could never be.

None of the code in this chapter is exotic. You loaded weights into the same forward pass you wrote. That is the moment to internalize.

5. What you proved

You proved one thing: your transformer is GPT. Same logic, same shapes, same forward pass. The size of the model and the data it saw are variables. The architecture is the constant.

This matters because:

  • Loading a base model is the starting point of most real projects. You don't pretrain; you adapt.
  • The "magic" hiding inside companies is, mechanically, what you just did. The hard parts at frontier scale are data engineering and serving infrastructure, not the model class.
  • Your project is now a workbench. Plug in any open weights, fine-tune them with chapter 17's SFT, adapt them with chapter 18's LoRA, serve them with chapter 19's quantization. The wrapper code does not change.

Recap

  • The architecture you built is GPT modulo three flags: bias, gelu_approximate, tied_lm_head. Defaults preserve chapter 12's behavior.
  • Three additions to GPTConfig open the door to loading any GPT-2 family model.
  • Name mapping + Conv1D transpose is the entirety of the weight translation. About 30 lines of Python in scripts/load_gpt2.py.
  • The same llm/model.py now hosts your 14M Shakespeare model and a 124M GPT-2 small. Different weights, identical shapes.
  • Your local project has checkpoints/gpt2_small.pt — a real, fluent model ready for the rest of part V.

Going further

Next up: why your model talks badly — now that you can sample from a 124M-parameter GPT-2 and from your 14M Shakespeare model side by side, the gaps in scale, data, and training become obvious.