Chapter 17 · 16 min
Give your model instructions
Turn a base model into an instruction-follower. Chat templates, supervised fine-tuning, loss masking — the SFT recipe in code.
You closed chapter 14 by sampling from your trained model. The output was Shakespeare-shaped: line breaks, character names in caps, archaic phrasing. Now ask it "what is two plus two?" and you get a continuation in iambic pentameter, not an answer.
That gap is the subject of chapter 16. This chapter teaches the cheapest, most direct technique that closes it: supervised fine-tuning — SFT.
You will take the checkpoint from chapter 13, continue training it on a small set of prompt/completion examples, and watch the output format shift. The same recipe — at much larger scale — is how OpenAI turned GPT-3 (a raw completion model) into InstructGPT, and how every open-source instruction-tuned model (Llama-Instruct, Mistral-Instruct, etc.) is built before any preference tuning.
1. The chat template
A pretrained model has no idea that User: and Assistant: are special. It only knows statistics. If you train it on a corpus where those markers are consistent, it learns to associate them with role-shift behavior.
The simplest convention — close to what production systems use — is a three-role frame:
System: You answer questions briefly and accurately.
User: What is two plus two?
Assistant: Four.

Pick the same frame for every example in your training set. The model learns the rhythm: see User: ..., produce Assistant: .... Production systems use richer templates (ChatML's <|im_start|> markers, Llama's [INST] brackets, etc.) but the principle is identical: a consistent textual frame the model can learn to recognize and complete.
Write the renderer. The cell takes one example and splits it into prompt (the part shown to the model) and completion (the part the model produces). Roles are highlighted so you can see the frame.
Code · JavaScript
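If you are working in Python rather than the interactive cell, a minimal renderer can be sketched as follows; it mirrors the render function that appears in scripts/sft.py later in this chapter:

```python
def render(example: dict) -> tuple[str, str]:
    """Split one chat example into (prompt, completion).

    The prompt is what the model is shown; the completion is
    what the model must learn to produce.
    """
    prompt = (
        f"System: {example['system']}\n"
        f"User: {example['user']}\n"
        f"Assistant: "
    )
    # the trailing newline doubles as a stop marker at sampling time
    completion = example["assistant"] + "\n"
    return prompt, completion


prompt, completion = render({
    "system": "You answer questions briefly and accurately.",
    "user": "What is two plus two?",
    "assistant": "Four.",
})
print(prompt + completion)
```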
2. The trick: prompt-token loss masking
If you naively train on the full sequence, the model is asked to predict every token — including the user's question. That is wasted effort: the user's tokens are given, not generated. Worse, training on them nudges the model toward imitating user-style phrasing instead of assistant-style phrasing.
The fix is the one piece of SFT that is most often skipped in tutorials: only train on the assistant's tokens. Build a mask over the sequence that is 0 everywhere except on the tokens the model is supposed to learn to produce, then multiply the per-token loss by that mask and average over the masked positions only.
For a sequence laid out as:
tokens:  System: ...   User: ...   Assistant:   Four  .  \n
mask:    0  0  0  0    0  0  0     0            1     1  1

The loss reduces to: given everything before, predict the assistant's tokens. Everything else is conditioning context. Without this masking, SFT trains roughly 5–10× more slowly and converges to a worse format-following model. With it, even 50–200 examples produce a visible shift.
Build the mask. The cell tokenizes the same example and asks you to write the per-token mask. Green-highlighted tokens contribute to the loss; the others are masked out.
Code · JavaScript
The masked loss is

loss = ( Σ_t m_t · ℓ_t ) / ( Σ_t m_t )

where m_t = 1 if token t is part of an assistant turn and 0 otherwise, and ℓ_t is the per-token cross-entropy. The denominator keeps the scale comparable across batches with different prompt lengths.
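In PyTorch, the masked mean takes only a few lines. The sketch below uses random logits in place of a real model so it runs standalone; the shapes and variable names follow the training script in section 4:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, V = 2, 8, 50257          # batch, sequence length, GPT-2 vocab size
logits = torch.randn(B, T, V)  # stand-in for model output
targets = torch.randint(0, V, (B, T))

# 1 on assistant tokens, 0 on prompt and padding;
# here we pretend the last three positions are the completion
mask = torch.zeros(B, T)
mask[:, 5:] = 1.0

# per-token cross-entropy, kept unreduced so it can be masked
per_token = F.cross_entropy(
    logits.reshape(-1, V), targets.reshape(-1), reduction="none"
).reshape(B, T)

# masked mean: only assistant tokens contribute to the loss
loss = (per_token * mask).sum() / mask.sum().clamp(min=1)
print(loss.item())
```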
3. A tiny SFT dataset
Real SFT datasets are thousands to millions of curated examples. You do not need that to feel the effect. Create data/sft.jsonl:
{"system": "You answer questions briefly.", "user": "What is two plus two?", "assistant": "Four."}
{"system": "You answer questions briefly.", "user": "Capital of France?", "assistant": "Paris."}
{"system": "You answer questions briefly.", "user": "How many sides does a triangle have?", "assistant": "Three."}

Aim for 30–100 lines like that. For a real run, scale to a few hundred examples on a narrow domain you actually care about: refund policy, code snippets in one language, summaries in your tone. The narrower the domain, the smaller the dataset can be and still produce a useful model.
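Before training, it is worth checking that every line of data/sft.jsonl parses and carries the three expected keys. A small helper along these lines works; check_sft_file is a name invented here, not part of the chapter's scripts:

```python
import json
from pathlib import Path


def check_sft_file(path: str) -> int:
    """Validate a JSONL SFT file; return the number of usable examples."""
    n = 0
    for lineno, line in enumerate(Path(path).read_text().splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are skipped, matching the training script
        ex = json.loads(line)  # raises on malformed JSON
        missing = {"system", "user", "assistant"} - ex.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        n += 1
    return n

# usage: print(check_sft_file("data/sft.jsonl"))
```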
4. The SFT training script
Save as scripts/sft.py:
"""scripts/sft.py — supervised fine-tune the chapter-13 checkpoint."""
import json
from pathlib import Path
import numpy as np
import torch
import torch.nn.functional as F
import tiktoken
from llm.model import GPT, GPTConfig
# --- config ---
cfg = GPTConfig()
batch_size = 8
max_steps = 500
lr = 1e-4
device = "mps" if torch.backends.mps.is_available() else (
"cuda" if torch.cuda.is_available() else "cpu"
)
print(f"sft on {device}")
# --- data ---
enc = tiktoken.get_encoding("gpt2")
# [1]
def render(ex):
prompt = (
f"System: {ex['system']}\n"
f"User: {ex['user']}\n"
f"Assistant: "
)
completion = ex["assistant"] + "\n"
return prompt, completion
# [2]
records = []
for line in Path("data/sft.jsonl").read_text().splitlines():
line = line.strip()
if not line:
continue
ex = json.loads(line)
prompt, completion = render(ex)
prompt_ids = enc.encode_ordinary(prompt)
completion_ids = enc.encode_ordinary(completion)
records.append((prompt_ids, completion_ids))
# [3]
def make_batch():
idx = np.random.randint(0, len(records), size=batch_size)
seqs, masks = [], []
for i in idx:
prompt_ids, completion_ids = records[i]
ids = (prompt_ids + completion_ids)[: cfg.block_size]
mask = ([0] * len(prompt_ids) + [1] * len(completion_ids))[: cfg.block_size]
pad = cfg.block_size - len(ids)
ids = ids + [0] * pad
mask = mask + [0] * pad
seqs.append(ids)
masks.append(mask)
x = torch.tensor(seqs, dtype=torch.long, device=device)
m = torch.tensor(masks, dtype=torch.float, device=device)
return x, m
# --- model ---
# [4]
model = GPT(cfg).to(device)
model.load_state_dict(torch.load("checkpoints/model.pt", map_location=device))
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
# --- loop ---
for step in range(max_steps):
x, mask = make_batch()
inputs = x[:, :-1]
targets = x[:, 1:]
target_mask = mask[:, 1:]
logits, _ = model(inputs)
# [5]
per_token = F.cross_entropy(
logits.reshape(-1, logits.size(-1)),
targets.reshape(-1),
reduction="none",
).reshape(targets.shape)
# [6]
loss = (per_token * target_mask).sum() / target_mask.sum().clamp(min=1)
optimizer.zero_grad(set_to_none=True)
loss.backward()
optimizer.step()
if step % 50 == 0 or step == max_steps - 1:
print(f"step {step:4d} | loss {loss.item():.4f}")
Path("checkpoints").mkdir(exist_ok=True)
torch.save(model.state_dict(), "checkpoints/model_sft.pt")
print("saved checkpoints/model_sft.pt")

Read it as chapter 13's loop with two surgical changes:
- [1] render lays out every example in the chat template. The same string format must be reused at inference time.
- [2] tokenizes every example into a (prompt_ids, completion_ids) pair separately — that is what lets us build the right mask.
- [3] make_batch concatenates the two halves and creates a mask that is 0 over the prompt and 1 over the completion. Padding gets a mask of 0 too, so it cannot leak into the loss.
- [4] loads the chapter-13 weights. SFT is a continuation of pretraining, not a fresh start. The learning rate is also smaller (1e-4 vs 3e-4 in chapter 13): the model is already a good language model and we only want a gentle nudge toward the format.
- [5] reduction="none" keeps the per-token cross-entropy instead of averaging it. That is what lets us mask before reducing.
- [6] is the masked mean — the heart of SFT. Multiply by the mask, sum, divide by the mask sum, never divide by zero.
Run it:
python -m scripts.sft

500 steps on a CPU takes roughly two minutes. The loss usually falls from ~6 to ~2 within 200 steps — fast, because the model already speaks; it only has to learn the chat shape.
5. Generate from the SFT model
Save as scripts/sample_sft.py:
"""scripts/sample_sft.py — sample with chat template from the SFT checkpoint."""
import torch
import tiktoken
from llm.model import GPT, GPTConfig
device = "mps" if torch.backends.mps.is_available() else (
"cuda" if torch.cuda.is_available() else "cpu"
)
cfg = GPTConfig()
model = GPT(cfg).to(device)
model.load_state_dict(torch.load("checkpoints/model_sft.pt", map_location=device))
model.eval()
enc = tiktoken.get_encoding("gpt2")
prompt = (
"System: You answer questions briefly.\n"
"User: What is two plus two?\n"
"Assistant: "
)
prompt_ids = enc.encode_ordinary(prompt)
idx = torch.tensor([prompt_ids], device=device)
newline = enc.encode_ordinary("\n")[0]
with torch.no_grad():
for _ in range(40):
ctx = idx if idx.size(1) <= cfg.block_size else idx[:, -cfg.block_size :]
logits, _ = model(ctx)
probs = torch.softmax(logits[:, -1, :] / 0.7, dim=-1)
next_id = torch.multinomial(probs, num_samples=1)
if next_id.item() == newline and idx.size(1) > len(prompt_ids):
break
idx = torch.cat([idx, next_id], dim=1)
print(enc.decode(idx[0].tolist()))

Two differences from chapter 14's sampler:
- The prompt uses the same chat template the model was fine-tuned on. Drop the System: line or rename the roles and you lose most of the format gain — the model latches onto the textual frame.
- The newline is a stop token, so we end after the assistant's first line. Real chat loops use longer stop sequences (\nUser: etc.); we cover that pattern in chapter 20.
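A multi-token stop sequence can be handled by decoding the completion and cutting it at the first stop string. This is only a sketch of the idea, not chapter 20's full chat loop; the stop strings listed are assumptions matching this chapter's template:

```python
def truncate_at_stop(text: str, stops=("\nUser:", "\nSystem:")) -> str:
    """Cut a decoded completion at the earliest stop sequence, if any."""
    cut = len(text)
    for stop in stops:
        i = text.find(stop)
        if i != -1:
            cut = min(cut, i)
    return text[:cut]


print(truncate_at_stop("Four.\nUser: what is three plus three?"))  # Four.
```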
Run it and compare against chapter 14's base sampler on the same prompt. The base model continues as Shakespeare-themed nonsense. The SFT model produces something shaped like an answer — and if your SFT dataset only had 30 examples, the answers will mostly be a remix of the training answers. That is fine. What you have proven is the format. Quality of content from here is a function of dataset size, dataset quality, and base model scale. The architecture is no longer the bottleneck.
6. Where real SFT data comes from
OpenAI's InstructGPT used about 13,000 human-written examples for the SFT step, with a much larger pool reserved for the reward-model training that came afterward. Modern open-source SFT datasets tend to be:
- Human-curated: Anthropic's HH-RLHF, ShareGPT, OpenAssistant.
- Distilled from larger models: Alpaca (LLaMA outputs filtered), Dolly (Databricks employees).
- Domain-mined: turning code repos, support tickets, or documentation into prompt/completion pairs programmatically.
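As an illustration of the domain-mined route, a Q/A-formatted FAQ file can be converted into SFT records in a few lines. The Q:/A: line format and the faq_to_sft name are hypothetical; adapt the parsing to whatever structure your data actually has:

```python
import json


def faq_to_sft(text: str, system: str = "You answer support questions briefly."):
    """Turn alternating 'Q: ...' / 'A: ...' lines into SFT records."""
    records, question = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Q: "):
            question = line[len("Q: "):]
        elif line.startswith("A: ") and question is not None:
            records.append(
                {"system": system, "user": question, "assistant": line[len("A: "):]}
            )
            question = None
    return records


faq = "Q: How do I reset my password?\nA: Use the link on the login page.\n"
for record in faq_to_sft(faq):
    print(json.dumps(record))  # one data/sft.jsonl line per record
```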
For a project at your scale, the highest-value angle is domain-specific SFT on data only you have. Generic SFT against an open dataset only produces a worse version of what is already free on Hugging Face. Narrow, private SFT is where small models earn their keep.
Recap
- SFT turns a next-token predictor into a model that follows the chat shape. Same architecture, same training loop, different data and different loss.
- The chat template is a consistent textual frame (System / User / Assistant) the model learns to recognize and complete.
- Loss masking on prompt tokens is the trick that matters. Only the assistant's tokens contribute to the loss.
- Initial weights are the chapter-13 checkpoint. SFT is a continuation of pretraining; use a smaller learning rate (1e-4).
- Your local project now has data/sft.jsonl, scripts/sft.py, scripts/sample_sft.py, and checkpoints/model_sft.pt.
- A tiny SFT pass (30–100 examples, 500 steps) is enough to feel the format shift. Quality past that is a function of dataset size and quality — and that is exactly where small models earn their place.
Going further
- Ouyang et al., "Training language models to follow instructions with human feedback" (2022) — the InstructGPT paper. SFT + reward modeling + RL, at scale.
- Stanford Alpaca — SFT on 52k distilled examples. The recipe that kicked off the small-instruct-model wave.
- Hugging Face TRL SFTTrainer — production-grade SFT with masking, chat templates, and packing handled for you.
Next up: LoRA fine-tuning — same idea (fine-tune the model on a new objective), but with parameter-efficient training so each adapter stays tiny and you can ship 50 of them on top of one base model.