The loss curve

Chapter 17 · 16 min

Give your model instructions

Turn a base model into an instruction-follower. Chat templates, supervised fine-tuning, loss masking — the SFT recipe in code.

You closed chapter 14 by sampling from your trained model. The output was Shakespeare-shaped: line breaks, character names in caps, archaic phrasing. Now ask it "what is two plus two?" and you get a continuation in iambic pentameter, not an answer.

That gap is the subject of chapter 16. This chapter teaches the cheapest, most direct technique that closes it: supervised fine-tuning — SFT.

You will take the checkpoint from chapter 13, continue training it on a small set of prompt/completion examples, and watch the output format shift. The same recipe — at much larger scale — is how OpenAI turned GPT-3 (a raw completion model) into InstructGPT, and how every open-source instruction-tuned model (Llama-Instruct, Mistral-Instruct, etc.) is built before any preference tuning.

1. The chat template

A pretrained model has no idea that User: and Assistant: are special. It only knows statistics. If you train it on a dataset where those markers are consistent, it learns to associate them with role-shift behavior.

The simplest convention — close to what production systems use — is a three-role frame:

System: You answer questions briefly and accurately.
User: What is two plus two?
Assistant: Four.

Pick the same frame for every example in your set. The model learns the rhythm: see User: ..., produce Assistant: .... Production systems use richer templates (ChatML's <|im_start|> markers, Llama's [INST] brackets, etc.) but the principle is identical: a consistent textual frame the model can learn to recognize and complete.
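
For comparison, here is roughly how the same exchange looks with ChatML-style markers (the exact token names and whitespace vary by model family; treat this as an illustration, not a spec):

<|im_start|>system
You answer questions briefly and accurately.
<|im_end|>
<|im_start|>user
What is two plus two?
<|im_end|>
<|im_start|>assistant
Four.
<|im_end|>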

Write the renderer. The cell takes one example and splits it into prompt (the part shown to the model) and completion (the part the model produces). Roles are highlighted so you can see the frame.

Code · JavaScript

2. The trick: prompt-token loss masking

If you naively train on the full sequence, the model is asked to predict every token — including the user's question. That is wasted effort: the user's tokens are given, not generated. Worse, training on them nudges the model toward imitating user-style phrasing instead of assistant-style phrasing.

The fix is the one piece of SFT that is most often skipped in tutorials: only train on the assistant's tokens. Build a mask over the sequence that is 0 everywhere except on the tokens the model is supposed to learn to produce, then multiply the per-token loss by that mask and average over the masked positions only.

For a sequence laid out as:

tokens:  System: ...   User: ...   Assistant:   Four   .   \n
mask:    0   0   0     0   0   0   0            1      1   1

The loss reduces to: given everything before, predict the assistant's tokens. Everything else is conditioning context. Without this masking, SFT trains roughly 5–10× more slowly and converges to a worse format-following model. With it, even 50–200 examples produce a visible shift.

Build the mask. The cell tokenizes the same example and asks you to write the per-token mask. Green-highlighted tokens contribute to the loss; the others are masked out.

Code · JavaScript

\mathcal{L}_{\text{SFT}} = \frac{\sum_{t} m_t \cdot \text{CE}(\hat{y}_t, y_t)}{\sum_t m_t}

Where m_t = 1 if token t is part of an assistant turn and 0 otherwise. The denominator keeps the loss scale comparable across batches with different prompt lengths.
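
A minimal PyTorch sketch of that formula on a toy batch (the shapes and mask values here are made up purely for illustration):

import torch
import torch.nn.functional as F

# toy batch: 2 sequences of 5 target positions, vocab of 10
logits = torch.randn(2, 5, 10)               # model outputs
targets = torch.randint(0, 10, (2, 5))       # shifted labels
mask = torch.tensor([[0., 0., 1., 1., 1.],
                     [0., 0., 0., 1., 1.]])  # 1 = assistant token

per_token = F.cross_entropy(
    logits.reshape(-1, 10), targets.reshape(-1), reduction="none"
).reshape(targets.shape)                     # one CE value per position

loss = (per_token * mask).sum() / mask.sum().clamp(min=1)  # masked mean
print(loss.item())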

3. A tiny SFT dataset

Real SFT datasets are thousands to millions of curated examples. You do not need that to feel the effect. Create data/sft.jsonl:

{"system": "You answer questions briefly.", "user": "What is two plus two?", "assistant": "Four."}
{"system": "You answer questions briefly.", "user": "Capital of France?", "assistant": "Paris."}
{"system": "You answer questions briefly.", "user": "How many sides does a triangle have?", "assistant": "Three."}

Aim for 30–100 lines like that. For a real run, scale to a few hundred examples on a narrow domain you actually care about: refund policy, code snippets in one language, summaries in your tone. The narrower the domain, the smaller the dataset can be and still produce a useful model.
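
Before training, a quick sanity check catches malformed lines early. A minimal sketch, assuming the three-field schema above:

import json
from pathlib import Path

for n, line in enumerate(Path("data/sft.jsonl").read_text().splitlines(), 1):
    line = line.strip()
    if not line:
        continue
    ex = json.loads(line)  # raises on invalid JSON
    missing = {"system", "user", "assistant"} - ex.keys()
    if missing:
        print(f"line {n}: missing fields {missing}")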

4. The SFT training script

Save as scripts/sft.py:

"""scripts/sft.py — supervised fine-tune the chapter-13 checkpoint."""
import json
from pathlib import Path
 
import numpy as np
import torch
import torch.nn.functional as F
import tiktoken
 
from llm.model import GPT, GPTConfig
 
 
# --- config ---
cfg = GPTConfig()
batch_size = 8
max_steps = 500
lr = 1e-4
device = "mps" if torch.backends.mps.is_available() else (
    "cuda" if torch.cuda.is_available() else "cpu"
)
print(f"sft on {device}")
 
# --- data ---
enc = tiktoken.get_encoding("gpt2")
 
 
# [1]
def render(ex):
    prompt = (
        f"System: {ex['system']}\n"
        f"User: {ex['user']}\n"
        f"Assistant: "
    )
    completion = ex["assistant"] + "\n"
    return prompt, completion
 
 
# [2]
records = []
for line in Path("data/sft.jsonl").read_text().splitlines():
    line = line.strip()
    if not line:
        continue
    ex = json.loads(line)
    prompt, completion = render(ex)
    prompt_ids = enc.encode_ordinary(prompt)
    completion_ids = enc.encode_ordinary(completion)
    records.append((prompt_ids, completion_ids))
 
 
# [3]
def make_batch():
    idx = np.random.randint(0, len(records), size=batch_size)
    seqs, masks = [], []
    for i in idx:
        prompt_ids, completion_ids = records[i]
        ids = (prompt_ids + completion_ids)[: cfg.block_size]
        mask = ([0] * len(prompt_ids) + [1] * len(completion_ids))[: cfg.block_size]
        pad = cfg.block_size - len(ids)
        ids = ids + [0] * pad
        mask = mask + [0] * pad
        seqs.append(ids)
        masks.append(mask)
    x = torch.tensor(seqs, dtype=torch.long, device=device)
    m = torch.tensor(masks, dtype=torch.float, device=device)
    return x, m
 
 
# --- model ---
# [4]
model = GPT(cfg).to(device)
model.load_state_dict(torch.load("checkpoints/model.pt", map_location=device))
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
 
# --- loop ---
for step in range(max_steps):
    x, mask = make_batch()
    inputs = x[:, :-1]
    targets = x[:, 1:]
    target_mask = mask[:, 1:]
 
    logits, _ = model(inputs)
    # [5]
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    # [6]
    loss = (per_token * target_mask).sum() / target_mask.sum().clamp(min=1)
 
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
 
    if step % 50 == 0 or step == max_steps - 1:
        print(f"step {step:4d} | loss {loss.item():.4f}")
 
Path("checkpoints").mkdir(exist_ok=True)
torch.save(model.state_dict(), "checkpoints/model_sft.pt")
print("saved checkpoints/model_sft.pt")

Read it as chapter 13's loop with two surgical changes:

  • [1] render lays out every example in the chat template. The same string format must be reused at inference time.
  • [2] tokenizes every example into a (prompt_ids, completion_ids) pair separately — that is what lets us build the right mask.
  • [3] make_batch concatenates the two halves and creates a mask that is 0 over the prompt and 1 over the completion. Padding gets a mask of 0 too, so it cannot leak into the loss.
  • [4] loads the chapter-13 weights. SFT is a continuation of pretraining, not a fresh start. The learning rate is also smaller (1e-4 vs 3e-4 in chapter 13): the model is already a good language model and we only want a gentle nudge toward the format.
  • [5] reduction="none" keeps the per-token cross-entropy instead of averaging it. That is what lets us mask before reducing.
  • [6] is the masked mean — the heart of SFT. Multiply by the mask, sum, divide by the mask sum, and clamp the denominator so you never divide by zero.

Run it:

python -m scripts.sft

500 steps on a CPU takes roughly two minutes. The loss usually falls from ~6 to ~2 within 200 steps — fast, because the model already speaks; it only has to learn the chat shape.

5. Generate from the SFT model

Save as scripts/sample_sft.py:

"""scripts/sample_sft.py — sample with chat template from the SFT checkpoint."""
import torch
import tiktoken
 
from llm.model import GPT, GPTConfig
 
 
device = "mps" if torch.backends.mps.is_available() else (
    "cuda" if torch.cuda.is_available() else "cpu"
)
cfg = GPTConfig()
model = GPT(cfg).to(device)
model.load_state_dict(torch.load("checkpoints/model_sft.pt", map_location=device))
model.eval()
 
enc = tiktoken.get_encoding("gpt2")
prompt = (
    "System: You answer questions briefly.\n"
    "User: What is two plus two?\n"
    "Assistant: "
)
prompt_ids = enc.encode_ordinary(prompt)
idx = torch.tensor([prompt_ids], device=device)
newline = enc.encode_ordinary("\n")[0]
 
with torch.no_grad():
    for _ in range(40):
        ctx = idx if idx.size(1) <= cfg.block_size else idx[:, -cfg.block_size :]
        logits, _ = model(ctx)
        probs = torch.softmax(logits[:, -1, :] / 0.7, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        if next_id.item() == newline and idx.size(1) > len(prompt_ids):
            break
        idx = torch.cat([idx, next_id], dim=1)
 
print(enc.decode(idx[0].tolist()))

Two differences from chapter 14's sampler:

  • The prompt uses the same chat template the model was fine-tuned on. Drop the System: line or rename the roles and you lose most of the format gain — the model latches onto the textual frame.
  • The newline is a stop token, so we end after the assistant's first line. Real chat loops use longer stop sequences (\nUser: etc.); we cover that pattern in chapter 20.
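
The invocation mirrors the training script, assuming the same project layout:

python -m scripts.sample_sft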

Run it and compare against chapter 14's base sampler on the same prompt. The base model continues with Shakespeare-themed nonsense. The SFT model produces something shaped like an answer — and if your SFT dataset only had 30 examples, the answers will mostly be a remix of those 30. That is fine. What you have proven is the format. Quality of content from here is a function of dataset size, dataset quality, and base model scale. The architecture is no longer the bottleneck.

6. Where real SFT data comes from

OpenAI's InstructGPT used about 13,000 human-written demonstrations for the SFT step, with a much larger pool reserved for the reward-model training that came afterward. Modern open-source SFT datasets tend to be:

  • Human-curated: Anthropic's HH-RLHF, OpenAssistant, Dolly (written by Databricks employees).
  • Distilled from larger models: Alpaca (LLaMA fine-tuned on filtered text-davinci-003 outputs), ShareGPT (user-shared ChatGPT conversations).
  • Domain-mined: turning code repos, support tickets, or documentation into prompt/completion pairs programmatically.

For a project at your scale, the highest-value angle is domain-specific SFT on data only you have. Generic SFT against an open dataset only produces a worse version of what is already free on Hugging Face. Narrow, private SFT is where small models earn their keep.
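
To give the domain-mined route some shape, here is a hypothetical sketch that converts a CSV of support-ticket question/answer pairs into sft.jsonl lines. The file name and column names are invented for illustration; adapt them to whatever data you actually have:

import csv
import json
from pathlib import Path

# hypothetical input with columns "question" and "answer"
with Path("data/tickets.csv").open(newline="") as f, Path("data/sft.jsonl").open("w") as out:
    for row in csv.DictReader(f):
        ex = {
            "system": "You answer support questions briefly.",
            "user": row["question"].strip(),
            "assistant": row["answer"].strip(),
        }
        out.write(json.dumps(ex) + "\n")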

Recap

  • SFT turns a next-token predictor into a model that follows the chat shape. Same architecture, same training loop, different data and a different loss.
  • The chat template is a consistent textual frame (System / User / Assistant) the model learns to recognize and complete.
  • Loss masking on prompt tokens is the trick that matters. Only the assistant's tokens contribute to the loss.
  • Initial weights are the chapter-13 checkpoint. SFT is a continuation of pretraining; use a smaller learning rate (1e-4).
  • Your local project now has data/sft.jsonl, scripts/sft.py, scripts/sample_sft.py, and checkpoints/model_sft.pt.
  • A tiny SFT pass (30–100 examples, 500 steps) is enough to feel the format shift. Quality past that is a function of dataset size and quality — and that is exactly where small models earn their place.

Going further

Next up: LoRA fine-tuning — same idea (fine-tune the model on a new objective), but with parameter-efficient training so each adapter stays tiny and you can ship 50 of them on top of one base model.