Chapter 21 · 22 min

Ship a useful one

Pick a narrow domain, fine-tune GPT-2 with your SFT recipe, evaluate side-by-side, and ship a small useful model.

You have every piece. Chapter 12 gave you the model class. Chapter 13 made it train. Chapter 15 loaded real GPT-2 weights into the same code. Chapter 17 taught the chat template via SFT. Chapter 18 showed cheap adaptation. Chapter 19 cut serving cost. Chapter 20 wrapped it all in a KV-cached REPL.

This chapter glues them into one thing that actually works for one specific use case.

A "useful" assistant is the smallest combination that delivers a behavior someone would pay for. Frontier models earn their money on breadth. Your model can earn its place on depth — narrow domain, private data, controlled output. This chapter walks the recipe end-to-end on a toy example, and gives you a template to reuse.

1. Pick a narrow domain

Three rules for picking well:

The output shape is constrained. Short JSON, a fixed-length summary, a numbered list, a yes/no plus one-sentence justification. The model imitates structure from few examples.
The knowledge is bounded. A specific FAQ, a domain vocabulary, a single product. Not "the entire web".
You have a clear oracle for "good". You can look at an output and say yes/no without an expert in the loop.

For this chapter we use a fictional product, Acme Cloud. Output shape: a short answer in plain English, optionally followed by one clarifying clause. Knowledge: a one-page made-up policy doc. Oracle: you wrote the policies, so you know what is true.

The technique scales to any narrow domain: refund policy for a real company you work with, a specific code style, a single ticket triage taxonomy, a tutoring assistant for one textbook. Pick something you understand well — your hour-by-hour judgment about "is this right?" is the bottleneck, not the model.

2. Get the SFT dataset

For this walkthrough, the loss-curve repo ships a starter dataset of 50 SFT examples and 20 held-out eval examples covering Acme Cloud's pricing, limits, security, regions, and integrations. Use it as-is to run the chapter end-to-end, then duplicate the pattern for your own domain.

Copy the two files from examples/ into your local project's data/:

cp examples/acme_sft.example.jsonl my-llm/data/acme_sft.jsonl && cp examples/acme_eval.example.jsonl my-llm/data/acme_eval.jsonl

Copy-Item examples\\acme_sft.example.jsonl my-llm\\data\\acme_sft.jsonl; Copy-Item examples\\acme_eval.example.jsonl my-llm\\data\\acme_eval.jsonl

cp examples/acme_sft.example.jsonl my-llm/data/acme_sft.jsonl && cp examples/acme_eval.example.jsonl my-llm/data/acme_eval.jsonl

Each line of acme_sft.jsonl is one chat-template example:

{"system": "You are Acme Cloud support. Answer briefly.", "user": "Can I cancel anytime?", "assistant": "Yes. Cancel from Settings → Billing. Refunds are pro-rated for the unused period."}
{"system": "You are Acme Cloud support. Answer briefly.", "user": "How big can each upload be?", "assistant": "5 GB per file on Free, 50 GB on Pro, 500 GB on Enterprise."}
{"system": "You are Acme Cloud support. Answer briefly.", "user": "Do you support SSO?", "assistant": "Yes, on Pro and Enterprise plans. SAML 2.0 and OIDC."}

The starter set is 50 examples — enough to feel the shift and observe a fact-recall pattern, smaller than the 150 you would want for a real run. Once the pipeline is wired end-to-end you can scale up by hand or generate more examples with a larger model.

The dataset is the work in any real project — the model is the easy part. For your own domain you would derive it from:

Existing FAQ / docs / sales enablement
Support ticket transcripts (sanitized)
Synthetic generation from a larger model, then human review
Brainstorm sessions with the team who owns the domain

The acme_eval.jsonl file holds 20 held-out examples that are never used during SFT. They are your sanity check at the end.

3. SFT on top of GPT-2 small

We start from checkpoints/gpt2_small.pt (from chapter 15), not from the chapter-13 Shakespeare model. GPT-2 already speaks English; SFT just teaches it the Acme shape. Starting from Shakespeare would waste the SFT budget re-learning English.

Save as scripts/sft_acme.py:

"""scripts/sft_acme.py — SFT GPT-2 small on the Acme Cloud dataset."""
import json
import math
from pathlib import Path
 
import numpy as np
import torch
import torch.nn.functional as F
import tiktoken
 
from llm.model import GPT
from scripts.load_gpt2 import gpt2_small_config
 
 
# --- config ---
cfg = gpt2_small_config()      # 124M params, block_size=1024
batch_size = 4                  # GPT-2 small on CPU; bump to 16+ on GPU
max_steps = 1000                # ~7 epochs over 150 examples
lr = 5e-5                       # smaller than ch.17's 1e-4 — base is bigger
device = "mps" if torch.backends.mps.is_available() else (
    "cuda" if torch.cuda.is_available() else "cpu"
)
print(f"sft on {device}")
 
enc = tiktoken.get_encoding("gpt2")
 
 
def render(ex):
    prompt = (
        f"System: {ex['system']}\n"
        f"User: {ex['user']}\n"
        f"Assistant: "
    )
    completion = ex["assistant"] + "\n"
    return prompt, completion
 
 
records = []
for line in Path("data/acme_sft.jsonl").read_text().splitlines():
    line = line.strip()
    if not line:
        continue
    ex = json.loads(line)
    prompt, completion = render(ex)
    records.append((enc.encode_ordinary(prompt), enc.encode_ordinary(completion)))
 
assert 30 < len(records) < 500, f"expected 50-300 SFT examples, got {len(records)}"
 
 
def make_batch():
    idx = np.random.randint(0, len(records), size=batch_size)
    seqs, masks = [], []
    for i in idx:
        prompt_ids, completion_ids = records[i]
        ids = (prompt_ids + completion_ids)[: cfg.block_size]
        mask = ([0] * len(prompt_ids) + [1] * len(completion_ids))[: cfg.block_size]
        pad = cfg.block_size - len(ids)
        seqs.append(ids + [0] * pad)
        masks.append(mask + [0] * pad)
    x = torch.tensor(seqs, dtype=torch.long, device=device)
    m = torch.tensor(masks, dtype=torch.float, device=device)
    return x, m
 
 
# --- model ---
model = GPT(cfg).to(device)
model.load_state_dict(torch.load("checkpoints/gpt2_small.pt", map_location=device))
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
 
# --- loop ---
for step in range(max_steps):
    x, mask = make_batch()
    inputs = x[:, :-1]
    targets = x[:, 1:]
    target_mask = mask[:, 1:]
 
    logits, _ = model(inputs)
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    loss = (per_token * target_mask).sum() / target_mask.sum().clamp(min=1)
 
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
 
    if step % 100 == 0 or step == max_steps - 1:
        print(f"step {step:4d} | loss {loss.item():.4f}")
 
Path("checkpoints").mkdir(exist_ok=True)
torch.save(model.state_dict(), "checkpoints/acme.pt")
print("✓ saved checkpoints/acme.pt")

Two differences from chapter 17's scripts/sft.py:

Base config is gpt2_small_config() instead of GPTConfig(). The architecture is the same code; the dimensions are GPT-2 small.
Hyperparameters tuned for a bigger base: batch_size=4 (memory), lr=5e-5 (smaller than 1e-4 because GPT-2 already speaks), max_steps=1000 (~7 passes over 150 examples).

Run it:

python -m scripts.sft_acme

python -m scripts.sft_acme

python -m scripts.sft_acme

Expected wall-clock: ~20 minutes on CPU, ~5 on Apple Silicon MPS, ~2 on a recent NVIDIA card. typically falls from ~4 to ~0.5 over the run — much lower than chapter 17's ~2-3 because the model is bigger and the dataset is more uniform.

4. Evaluate side by side

Save as scripts/eval_acme.py:

"""scripts/eval_acme.py — qualitative side-by-side, base vs SFT."""
import json
from pathlib import Path
 
import torch
import tiktoken
 
from llm.model import GPT
from scripts.load_gpt2 import gpt2_small_config
 
 
device = "mps" if torch.backends.mps.is_available() else (
    "cuda" if torch.cuda.is_available() else "cpu"
)
cfg = gpt2_small_config()
enc = tiktoken.get_encoding("gpt2")
 
 
def load(path):
    m = GPT(cfg).to(device)
    m.load_state_dict(torch.load(path, map_location=device))
    m.eval()
    return m
 
 
def generate(model, prompt: str, max_new: int = 80) -> str:
    prompt_ids = enc.encode_ordinary(prompt)
    idx = torch.tensor([prompt_ids], device=device)
    newline = enc.encode_ordinary("\n")[0]
    with torch.no_grad():
        for _ in range(max_new):
            ctx = idx if idx.size(1) <= cfg.block_size else idx[:, -cfg.block_size :]
            logits, _ = model(ctx)
            probs = torch.softmax(logits[:, -1, :] / 0.7, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
            if next_id.item() == newline and idx.size(1) > len(prompt_ids):
                break
            idx = torch.cat([idx, next_id], dim=1)
    return enc.decode(idx[0, len(prompt_ids) :].tolist()).strip()
 
 
base = load("checkpoints/gpt2_small.pt")
sft = load("checkpoints/acme.pt")
 
held_out = [json.loads(l) for l in Path("data/acme_eval.jsonl").read_text().splitlines() if l.strip()]
 
for ex in held_out:
    prompt = f"System: {ex['system']}\nUser: {ex['user']}\nAssistant: "
    print(f"\n--- {ex['user']!r}")
    print(f"BASE: {generate(base, prompt)}")
    print(f"SFT:  {generate(sft,  prompt)}")
    print(f"GOLD: {ex['assistant']}")

Run it:

python -m scripts.eval_acme

python -m scripts.eval_acme

python -m scripts.eval_acme

Read the output. For each held-out question, three lines: base GPT-2's freeform answer, your SFT model's answer, and the gold answer you wrote.

What to look for:

Shape: does the SFT answer end with \n after a brief response? Does it stay in the Acme support register? The base will ramble, hallucinate companies, suggest unrelated products. The SFT model should not.
Facts: does the SFT answer pick up the right Acme-specific claims (5 GB limit, EU region replica, SAML)? Misses are expected for facts that only appeared once or twice; consistent hits on common facts are the win.
Mode collapse: are all SFT answers worryingly similar? If yes, your dataset has a dominant template and the model latched onto it. Diversify the next batch.

A realistic result on 20 held-out questions with 150 SFT examples on GPT-2 small: ~70% format-correct, ~50% fact-correct for the SFT model. The base GPT-2 on the same prompts: ~10% on shape, ~5% on facts (mostly because it guesses).

That is the answer to "is your tiny model useful?" Not "compared to GPT-4". Compared to the same architecture, untrained on your data. The delta is what your dataset bought.

5. Ship

Wrap the SFT checkpoint in the chapter-20 chat REPL. The only differences are the config and checkpoint:

# scripts/chat_acme.py
import torch
from llm.model import GPT
from scripts.load_gpt2 import gpt2_small_config
 
# ... reuse the rest of scripts/chat.py from chapter 20 ...
 
def load_model(device: str) -> GPT:
    cfg = gpt2_small_config()
    model = GPT(cfg).to(device)
    model.load_state_dict(torch.load("checkpoints/acme.pt", map_location=device))
    model.eval()
    return model
 
SYSTEM_PROMPT = "You are Acme Cloud support. Answer briefly."
# ... rest unchanged ...

Run it. Type questions. Get answers that look like Acme Cloud support.

python -m scripts.chat_acme

python -m scripts.chat_acme

python -m scripts.chat_acme

6. What you actually shipped

Not ChatGPT. You shipped:

An assistant with ~124M that follows a chat template you specified
on data that is yours (or your customer's — a real differentiator)
Running locally on a laptop, no API costs, no data leakage
Cheap to iterate: regenerate the dataset, re-run sft_acme.py, ~20 min round trip
Plugged into the same chat REPL as before — same KV cache from chapter 20

This is the shape of most real LLM products that are not OpenAI / Anthropic / Google. Companies shipping "AI assistant for X" mostly do this, at slightly larger scale, with proprietary data they curated themselves. The hard part is the data and the product fit. The architecture, the loop, the serving — you wrote those across the previous 20 chapters.

7. Where you go from here

Each lever moves a different axis. Pull the one that matches your bottleneck:

More SFT data — when the model fails on shape or domain facts you did cover. Diminishing returns past a few thousand for narrow tasks.
Bigger base — when the model fails on language, not on domain. Swap GPT-2 small for Pythia 410M, or for Llama-3 8B if you have a real GPU. Same load_gpt2.py recipe with the right name map.
LoRA (chapter 18) — when you want multiple domain-specific overlays on one base model. One shared base, one tiny adapter per customer.
Quantization (chapter 19) — when inference latency or memory is the constraint. Cuts serving cost without changing behavior much.
Real evaluation — when you start shipping to users. The 20-example side-by-side from this chapter is a sanity check, not a production eval. Build a real held-out set and a grading script the moment the project is doing real work.
Preference tuning (appendix · RLHF and DPO) — when SFT-trained answers are well-shaped but you still see consistent quality gaps between equally-shaped responses. The third axis of alignment, with DPO as the modern open-source recipe.

Recap

A useful small model is a narrow one. Constrained output shape, bounded knowledge, clear oracle for "good". - The dataset is the work. ~150 examples is a viable first run. Hold 20 out for evaluation. - Start from GPT-2 small (or a bigger base), not from your Shakespeare model. SFT teaches the shape and the domain; it does not teach English. - The SFT script from chapter 17 works unchanged; only the base config and a few hyperparameters move. - Evaluation is qualitative at this scale. Run the held-out set through base and SFT side by side, read the outputs, count format and fact hits. - The chat REPL from chapter 20 is the shipping vehicle. Same KV cache, different . - Your local project now has data/acme_sft.jsonl, data/acme_eval.jsonl, scripts/sft_acme.py, scripts/eval_acme.py, scripts/chat_acme.py, and checkpoints/acme.pt — a template you can copy-and-rename for any narrow domain.

Going further

Stanford Alpaca — the recipe that popularized small-instruct models. 52k SFT examples on a 7B base; same idea, larger scale.
Hugging Face TRL — production-grade SFTTrainer with packing, masking, and chat templates handled. Use it for real projects past the prototype.
LangChain's OpenAI vs open-source decision matrix — the kind of trade-off this chapter taught you to evaluate.

That's the book. You built every piece of a working , watched it , loaded real GPT-2 weights into your own code, taught it to follow a chat template, made it cheaper to serve, made it faster to generate, and shipped a specialized assistant. The next domain is yours.