Chapter 15 · 14 min
Load real weights
Map GPT-2's parameter names to yours and load real weights into the architecture you wrote. From toy model to small GPT.
You trained a 14M-parameter model on ~1 MB of Shakespeare. GPT-2 small is 124M parameters trained on 40 GB of WebText.
Same architecture. Exactly the same architecture, modulo three small implementation details. So let's prove it: take OpenAI's pretrained weights, push them into your own llm/model.py, and watch your code generate coherent English.
This chapter is not about a new model. It is about realizing that the code you wrote in chapters 8–12 already supports models orders of magnitude larger than the one you trained. The rest of the book — SFT, LoRA, quantization, chat — applies just as well on top of these weights as it did on top of yours.
1. The three deltas
Compare your model from chapter 12 with the original GPT-2:
- Tokenizer: GPT-2 BPE, 50,257 tokens. ✓ Identical (you used `tiktoken` in chapter 11).
- Attention + FFN + LayerNorm pre-attention: identical structure.
- Learned positional embeddings: same shape, just longer (`block_size = 1024` vs your `64`).
- `n_embd` / `n_head` / `n_layer`: GPT-2 small uses `768 / 12 / 12`, you used `128 / 4 / 4`. Same hyperparameters, different values.
Then the three small implementation differences:
- Bias on linear layers. GPT-2 trained with `bias=True` on every linear (Q, K, V, the output projection, both FFN linears). Your chapter-12 model has `bias=False` on attention. If we load weights without biases, the learned contribution is silently dropped.
- GELU approximation. GPT-2 used the tanh approximation of GELU: `0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))`. PyTorch's default `F.gelu` is the exact formulation. The two differ by roughly 0.001 per activation — tiny on its own, but compounded over 12 layers in 768 dimensions on a model not trained with it, the output drifts.
- Tied embedding ↔ unembedding. GPT-2 reuses the token-embedding matrix as the final unembedding (`lm_head.weight = wte.weight`). That's a free ~38M-parameter reduction. Your chapter-12 model has a separate `head` layer.
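You can see the size of the GELU drift without any framework. Here is a quick pure-Python comparison of the two formulas (the exact form via `math.erf`, the GPT-2 tanh approximation as written above):

```python
# Exact GELU vs GPT-2's tanh approximation, in pure Python.
# Illustrates why the gelu_approximate flag matters: the per-activation
# disagreement is small, but it compounds across 12 layers.
import math

def gelu_exact(x: float) -> float:
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # GPT-2's tanh approximation
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

# Scan a range of activations and find the largest disagreement.
xs = [i / 100 for i in range(-500, 501)]
max_diff = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - tanh| on [-5, 5]: {max_diff:.6f}")
```

The maximum gap lands in the sub-0.001 range, consistent with the "roughly 0.001 per activation" figure above.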
That's the whole compatibility surface. Three knobs — and they are already in your `GPTConfig` since chapter 12. Look at the dataclass: `bias`, `tied_lm_head`, `gelu_approximate` are right there with defaults that preserve chapter-12 behavior. We never asked you to patch `llm/model.py` retroactively; we just left the switches off until now.
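For orientation, the relevant slice of the dataclass looks roughly like this. The field names come from the chapter; the default values shown are my reconstruction of the chapter-12 behavior, not verbatim `llm/model.py`:

```python
from dataclasses import dataclass

# Hypothetical sketch of GPTConfig — field names from the chapter,
# default values assumed (chapter-12 toy dimensions, switches off).
@dataclass
class GPTConfig:
    vocab_size: int = 50257
    block_size: int = 64        # chapter-12 toy context length
    n_layer: int = 4
    n_head: int = 4
    n_embd: int = 128
    ffn_mult: int = 4
    # The three compatibility switches, off by default:
    bias: bool = False               # GPT-2 wants True on every linear
    tied_lm_head: bool = False       # GPT-2 ties lm_head.weight to wte.weight
    gelu_approximate: str = "none"   # GPT-2 wants "tanh"

cfg = GPTConfig()
print(cfg.bias, cfg.tied_lm_head, cfg.gelu_approximate)
```

Instantiating with no arguments reproduces the chapter-12 model; the GPT-2 config below only changes values, never code.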
2. Flip the switches
To load GPT-2 small, instantiate GPTConfig with the three flags turned on and the GPT-2 dimensions plugged in:
GPTConfig(
vocab_size=50257,
block_size=1024,
n_layer=12,
n_head=12,
n_embd=768,
ffn_mult=4,
bias=True,
tied_lm_head=True,
gelu_approximate="tanh",
)

Same model class. Same forward pass. Your trained checkpoint from chapter 13 still loads via the default config — nothing changes for it. This larger config is just a different instantiation of the same architecture.
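The 124,439,808 parameter count that the loader will assert later falls straight out of these numbers. A back-of-the-envelope check, assuming the standard GPT-2 layout (tied unembedding, biases on every linear, weight+bias LayerNorms):

```python
# Parameter count of GPT-2 small, derived from the config above.
V, T, L, D, M = 50257, 1024, 12, 768, 4   # vocab, block, layers, width, ffn_mult
F = M * D                                  # FFN hidden width: 3072

emb = V * D + T * D                        # token + position embeddings
per_layer = (
    2 * (2 * D)             # two LayerNorms, weight + bias each
    + (D * 3 * D + 3 * D)   # fused QKV projection (weight + bias)
    + (D * D + D)           # attention output projection
    + (D * F + F)           # FFN up-projection
    + (F * D + D)           # FFN down-projection
)
total = emb + L * per_layer + 2 * D        # + final LayerNorm; tied head is free
print(total)  # → 124439808
```

Note the head contributes nothing: with `tied_lm_head=True` the unembedding reuses the token-embedding storage, which is exactly the ~38M-parameter saving mentioned earlier.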
3. The weight-loading script
Install transformers once:
pip install transformers

Then save `scripts/load_gpt2.py`:
"""scripts/load_gpt2.py — load HuggingFace GPT-2 small into our model."""
from pathlib import Path
import torch
from llm.model import GPT, GPTConfig
# [1]
def gpt2_small_config() -> GPTConfig:
return GPTConfig(
vocab_size=50257,
block_size=1024,
n_layer=12,
n_head=12,
n_embd=768,
ffn_mult=4,
bias=True,
tied_lm_head=True,
gelu_approximate="tanh",
)
# [2]
TOP_LEVEL = {
"transformer.wte.weight": "tok_emb.weight",
"transformer.wpe.weight": "pos_emb.weight",
"transformer.ln_f.weight": "ln_f.weight",
"transformer.ln_f.bias": "ln_f.bias",
}
# [3]
PER_LAYER = [
("h.{i}.ln_1.weight", "blocks.{i}.ln1.weight"),
("h.{i}.ln_1.bias", "blocks.{i}.ln1.bias"),
("h.{i}.attn.c_attn.weight", "blocks.{i}.attn.qkv.weight"),
("h.{i}.attn.c_attn.bias", "blocks.{i}.attn.qkv.bias"),
("h.{i}.attn.c_proj.weight", "blocks.{i}.attn.proj.weight"),
("h.{i}.attn.c_proj.bias", "blocks.{i}.attn.proj.bias"),
("h.{i}.ln_2.weight", "blocks.{i}.ln2.weight"),
("h.{i}.ln_2.bias", "blocks.{i}.ln2.bias"),
("h.{i}.mlp.c_fc.weight", "blocks.{i}.ffn.fc1.weight"),
("h.{i}.mlp.c_fc.bias", "blocks.{i}.ffn.fc1.bias"),
("h.{i}.mlp.c_proj.weight", "blocks.{i}.ffn.fc2.weight"),
("h.{i}.mlp.c_proj.bias", "blocks.{i}.ffn.fc2.bias"),
]
# [4]
TRANSPOSE_SUFFIXES = (
"attn.c_attn.weight",
"attn.c_proj.weight",
"mlp.c_fc.weight",
"mlp.c_proj.weight",
)
def translate(hf_state: dict, n_layer: int) -> dict:
out: dict = {}
for hf_key, our_key in TOP_LEVEL.items():
out[our_key] = hf_state[hf_key]
for i in range(n_layer):
for hf_template, our_template in PER_LAYER:
hf_key = hf_template.format(i=i)
tensor = hf_state[hf_key]
# [5]
if hf_key.endswith(TRANSPOSE_SUFFIXES):
tensor = tensor.t().contiguous()
out[our_template.format(i=i)] = tensor
return out
def main() -> None:
from transformers import GPT2LMHeadModel
cfg = gpt2_small_config()
model = GPT(cfg)
n_params = sum(p.numel() for p in model.parameters())
# [6]
assert n_params == 124_439_808, f"expected 124,439,808 params, got {n_params:,}"
# [7]
hf_state = GPT2LMHeadModel.from_pretrained("gpt2").state_dict()
our_state = translate(hf_state, n_layer=cfg.n_layer)
# [8]
missing, unexpected = model.load_state_dict(our_state, strict=False)
assert not unexpected, f"unexpected keys in our_state: {unexpected}"
# With tied_lm_head, head.weight shares storage with tok_emb.weight,
# so it's reported as "missing" but is loaded via the tie.
allowed_missing = lambda k: "attn.mask" in k or k == "head.weight"
assert all(allowed_missing(k) for k in missing), (
f"missing keys other than causal masks / tied head: {missing}"
)
Path("checkpoints").mkdir(exist_ok=True)
torch.save(model.state_dict(), "checkpoints/gpt2_small.pt")
print(f"✓ {n_params:,} params (matches GPT-2 small exactly)")
print(f"✓ {len(missing)} missing keys, all causal-mask buffers (expected)")
print(f"✓ {len(unexpected)} unexpected keys")
print("✓ saved checkpoints/gpt2_small.pt")
if __name__ == "__main__":
main()Eight things worth reading carefully:
- [1] is the GPT-2 small spec, expressed as your own `GPTConfig`. Three flags flipped from default; everything else is just numbers.
- [2] Top-level keys map 1:1: token embeddings, position embeddings, final LayerNorm.
- [3] Per-layer keys are templates because every transformer block has the same shape.
- [4] `TRANSPOSE_SUFFIXES` lists the keys that need transposing. HuggingFace stores attention and FFN linears as `Conv1D`, whose weight matrix is the transpose of `nn.Linear`'s — `(in, out)` vs `(out, in)`. The math is the same; the storage convention differs.
Try the translation logic yourself on a sample of GPT-2 keys:
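Here is a pure-Python sketch of the renaming-plus-transpose decision on a few sample keys — no tensors involved, just the naming logic, using HuggingFace's full `GPT2LMHeadModel` key names. The sample templates are a subset of the tables in the script:

```python
# The renaming half of translate(), run on a few sample GPT-2 keys.
TRANSPOSE_SUFFIXES = (
    "attn.c_attn.weight", "attn.c_proj.weight",
    "mlp.c_fc.weight", "mlp.c_proj.weight",
)
SAMPLE_TEMPLATES = [
    ("transformer.h.{i}.attn.c_attn.weight", "blocks.{i}.attn.qkv.weight"),
    ("transformer.h.{i}.ln_1.bias", "blocks.{i}.ln1.bias"),
    ("transformer.h.{i}.mlp.c_proj.weight", "blocks.{i}.ffn.fc2.weight"),
]

plan = []
for i in (0, 11):  # first and last block of GPT-2 small
    for hf_t, our_t in SAMPLE_TEMPLATES:
        hf_key = hf_t.format(i=i)
        # Weights stored as Conv1D need transposing; biases and LayerNorms don't.
        action = "transpose" if hf_key.endswith(TRANSPOSE_SUFFIXES) else "copy"
        plan.append((hf_key, our_t.format(i=i), action))

for hf_key, our_key, action in plan:
    print(f"{hf_key:40s} -> {our_key:30s} [{action}]")
```

Note that `str.endswith` accepts a tuple of suffixes, which is what makes the one-line transpose test work.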
- [5] flips those tensors at load time. `.contiguous()` because PyTorch dislikes loading from non-contiguous storage.
- [6] asserts your `GPTConfig` instantiates to the exact GPT-2 small parameter count (124,439,808). Off by even one means a dimension is wrong before the loader has a chance to fail more obscurely later.
- [7] downloads GPT-2 small (~500 MB) on first run, then caches it. `from_pretrained` is the only call to `transformers` we need — it gives us a state dict, and we never touch their model class again.
- [8] `strict=False` plus two assertions: no unexpected keys means our map covered everything; the only allowed missing keys are the `blocks.{i}.attn.mask` causal-mask buffers (which we don't need to load) and `head.weight` (which is tied to `tok_emb.weight`, so loading `tok_emb.weight` propagates automatically).
Run it:
python -m scripts.load_gpt2

The script downloads ~500 MB on first run, then writes checkpoints/gpt2_small.pt. Four ticks expected:
✓ 124,439,808 params (matches GPT-2 small exactly)
✓ 13 missing keys (causal-mask buffers + tied head.weight, expected)
✓ 0 unexpected keys
✓ saved checkpoints/gpt2_small.pt

124M parameters. Roughly 9× the size of your trained model. Loaded into the same code. If any assertion fires instead, the message points at the exact mismatch.
4. Sample from GPT-2 using your own model
Save as scripts/sample_gpt2.py:
"""scripts/sample_gpt2.py — sample from GPT-2 small using our GPT class."""
import torch
import tiktoken
from llm.model import GPT
from scripts.load_gpt2 import gpt2_small_config
device = "mps" if torch.backends.mps.is_available() else (
"cuda" if torch.cuda.is_available() else "cpu"
)
cfg = gpt2_small_config()
model = GPT(cfg).to(device)
model.load_state_dict(torch.load("checkpoints/gpt2_small.pt", map_location=device))
model.eval()
enc = tiktoken.get_encoding("gpt2")
prompt = "The capital of France is"
idx = torch.tensor([enc.encode_ordinary(prompt)], device=device)
with torch.no_grad():
for _ in range(40):
ctx = idx if idx.size(1) <= cfg.block_size else idx[:, -cfg.block_size :]
logits, _ = model(ctx)
probs = torch.softmax(logits[:, -1, :] / 0.7, dim=-1)
next_id = torch.multinomial(probs, num_samples=1)
idx = torch.cat([idx, next_id], dim=1)
print(enc.decode(idx[0].tolist()))This is the same sampler from chapter 14. The only thing that changed is the being loaded.
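One detail worth calling out: the `/ 0.7` divides the logits by a temperature before the softmax, sharpening the distribution toward the most likely tokens. A tiny pure-Python illustration with made-up logits (the specific numbers are for illustration only):

```python
# Temperature scaling: divide logits by T before softmax.
# T < 1 sharpens the distribution, T > 1 flattens it.
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [4.0, 2.0, 1.0]                  # made-up next-token logits
p_std = softmax(logits, temperature=1.0)
p_hot = softmax(logits, temperature=0.7)  # what the sampler above does
print(f"T=1.0 top prob: {p_std[0]:.3f}, T=0.7 top prob: {p_hot[0]:.3f}")
```

At temperature 0.7 the top token soaks up more probability mass, which is why the samples read less erratic than greedy-vs-uniform extremes.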
python -m scripts.sample_gpt2

Expected output (the exact text varies from run to run, but the shape is consistent):
The capital of France is Paris. It is also the capital of the country of France, which is the largest country in the European Union.
This is your `llm/model.py`. The architecture you assembled in chapters 8–12. With weights trained by OpenAI in 2019. The output is not ChatGPT — GPT-2 small is a 2019 model, no SFT, no RLHF — but the next-token prediction is fluent and factual in a way your Shakespeare model could never be.
None of the code in this chapter is exotic. You loaded weights into the same forward pass you wrote. That is the moment to internalize.
5. What you proved
You proved one thing: your transformer is GPT. Same logic, same shapes, same forward pass. The size of the model and the data it saw are variables. The architecture is the constant.
This matters because:
- Loading a base model is the starting point of most real projects. You don't pretrain; you adapt.
- The "magic" hiding inside companies is, mechanically, what you just did. The hard parts at frontier scale are data engineering and serving infrastructure, not the model class.
- Your project is now a workbench. Plug in any open weights, fine-tune them with chapter 17's SFT, adapt them with chapter 18's LoRA, serve them with chapter 19's quantization. The wrapper code does not change.
Recap
- The architecture you built is GPT modulo three flags: `bias`, `gelu_approximate`, `tied_lm_head`. Defaults preserve chapter 12's behavior.
- Three additions to `GPTConfig` open the door to loading any GPT-2 family model.
- Name mapping + Conv1D transpose is the entirety of the weight translation. About 30 lines of Python in `scripts/load_gpt2.py`.
- The same `llm/model.py` now hosts your 14M Shakespeare model and a 124M GPT-2 small. Different weights, identical shapes.
- Your local project has `checkpoints/gpt2_small.pt` — a real, fluent model ready for the rest of part V.
Going further
- HuggingFace GPT-2 docs — official documentation for the model and tokenizer.
- nanoGPT's `from_pretrained` — the reference implementation; this chapter follows its structure closely.
- Pythia model suite — open-source LM family with an identical loading scheme and consistent scaling from 70M to 12B parameters.
Next up: why your model talks badly — now that you can sample from a 124M-parameter GPT-2 and from your 14M Shakespeare model side by side, the gaps in scale, data, and training become obvious.