The loss curve

Chapter 23 · 12 min

Appendix · RLHF and DPO

A conceptual walk-through of RLHF and DPO. What preference data is, what reward models are for, and where DPO simplifies things.

Chapter 17 taught the model the chat shape with supervised fine-tuning. Chapter 21 shipped a specialized assistant by extending that idea to a narrow domain. What neither did is teach the model what humans prefer between two equally-shaped answers.

That third axis — preference tuning — is what separates "follows the template" from "the response your users would pick". This appendix is conceptual, not hands-on. Real preference tuning needs preference data we do not have and infrastructure beyond the scope of this book. But the mechanism is small and intuitive once you have built SFT, and naming it lets you read the modern literature without surprise.

1. The setup

After SFT you have a model that produces plausible responses. Plausible is necessary, not sufficient. For the prompt "explain bubble sort", two responses can both be in the right shape:

  • A: "Bubble sort repeatedly steps through the list, compares adjacent elements, swaps if needed. O(n²) worst case."
  • B: "Sure here is how Bubble Sort works it is a sorting algorithm that sorts things by swapping them."

Both follow the SFT template. A is what a careful human would prefer. SFT does not give us a way to express "prefer A over B" — it only knows imitation.

Preference learning closes that gap. The recipe:

  1. Collect preference pairs: humans rate response A vs B for the same prompt.
  2. Fit a model (or a clever loss function) that captures "A is better".
  3. Update the SFT model to produce more A-like answers.

Two branches: RLHF (uses a reward model + RL) and DPO (skips the reward model with one slick loss function). RLHF came first; DPO is a 2023 simplification that the open-source community has largely standardized on.

2. RLHF — the original recipe

InstructGPT (2022) and the early ChatGPT used this. Two steps stacked on top of SFT:

a. Train a reward model

Take the SFT model. Add a scalar head: same backbone, but the final layer outputs one number instead of a vocab-sized distribution. Train it on preference pairs (prompt, response_A, response_B, label="A>B") with a margin-style loss:

\mathcal{L}_{\text{RM}} = -\log \sigma\big(r(\text{prompt}, \text{A}) - r(\text{prompt}, \text{B})\big)

This pushes the reward of the preferred response higher than the rejected one. After enough pairs the reward model is a learned scoring function for "human-preferred".
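
In code, that loss is a few lines. A minimal sketch, assuming a reward_model callable that maps the token ids of (prompt + response) to one scalar per sequence; the names here are illustrative, not the book's code:

  import torch
  import torch.nn.functional as F

  def reward_model_loss(reward_model, prompt_ids, chosen_ids, rejected_ids):
      """Pairwise loss: push r(prompt, chosen) above r(prompt, rejected)."""
      r_chosen = reward_model(torch.cat([prompt_ids, chosen_ids], dim=-1))      # one scalar per pair
      r_rejected = reward_model(torch.cat([prompt_ids, rejected_ids], dim=-1))  # one scalar per pair
      # -log sigmoid(margin): the loss shrinks as the preferred response scores higher
      return -F.logsigmoid(r_chosen - r_rejected).mean()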

b. PPO loop

Now the policy (the SFT model) generates responses. Each response gets a reward from the frozen reward model. The policy is updated by Proximal Policy Optimization (PPO) — a reinforcement-learning method that nudges the policy toward higher-reward responses while staying close to its SFT version (so it doesn't drift into reward-hacking degeneracies).

The KL term against the SFT model is the regularizer that keeps the model honest:

\mathcal{L}_{\text{PPO}} \approx -\mathbb{E}\big[r(\text{prompt}, \text{response})\big] + \beta \cdot \text{KL}(\pi_{\text{policy}} \,\|\, \pi_{\text{SFT}})

Without the KL, the policy collapses into producing whatever happens to maximize the reward — often nonsense the reward model rates highly because of its own quirks.
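
What PPO actually maximizes is the shaped reward: the reward-model score minus the KL penalty. A sketch of just that piece, with illustrative names and the simplest per-token KL estimate (the PPO update itself is a separate, larger machine):

  def shaped_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.05):
      """KL-regularized reward fed to the PPO update.

      rm_score:        scalar from the frozen reward model for the whole response
      policy_logprobs: per-token log-prob tensor of the response under the current policy
      ref_logprobs:    per-token log-prob tensor of the same tokens under the frozen SFT model
      """
      kl = (policy_logprobs - ref_logprobs).sum()   # drift away from the SFT model; beta value is illustrative
      return rm_score - beta * kl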

c. Why it's painful

  • Two extra models to train (reward, policy) on top of the SFT model.
  • PPO has many tuning knobs that interact (learning rates, clip ratios, KL coefficient).
  • Reward hacking is a real failure mode.
  • Compute cost is several times the SFT step.

Modern open-source SFT pipelines mostly skip RLHF for these reasons.

3. DPO — direct preference optimization

Rafailov et al. (2023) noticed something elegant: under a specific mathematical setup, the optimal RLHF policy has a closed form involving only the SFT model and the preference pairs. No reward model. No RL. One loss function.

The DPO loss per pair (prompt, chosen, rejected):

\mathcal{L}_{\text{DPO}} = -\log \sigma\Big(\beta\big[\log \pi_\theta(\text{chosen}) - \log \pi_\theta(\text{rejected}) - \log \pi_{\text{ref}}(\text{chosen}) + \log \pi_{\text{ref}}(\text{rejected})\big]\Big)

In words: nudge the model so the gap between the log-probabilities of the chosen and the rejected response grows, relative to the same gap under a frozen reference (the SFT model). The β is the KL temperature, playing the same role as in RLHF.
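
In code it is one function. A sketch, assuming you already have sequence-level log-probabilities (summed over the response tokens) under the policy and the frozen reference; the names are illustrative:

  import torch.nn.functional as F

  def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
      """DPO loss from summed response log-probs under the policy (pi_*) and the reference (ref_*)."""
      policy_gap = pi_chosen - pi_rejected    # log-prob gap under the model being trained
      ref_gap = ref_chosen - ref_rejected     # same gap under the frozen SFT reference
      return -F.logsigmoid(beta * (policy_gap - ref_gap)).mean()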

What this means operationally:

  • One model to train, not three. The reference is just the SFT model, frozen.
  • One training loop that runs exactly like SFT's, just consuming a (prompt, chosen, rejected) triplet instead of a (prompt, completion) pair (see the sketch after this list).
  • No RL — same loss.backward() machinery from appendix · backprop.
  • Same hyperparameters as SFT, plus β (typically 0.1).
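
Those sequence-level log-probs are the only new plumbing relative to SFT. A sketch of how to get them, assuming a Hugging Face-style causal LM that returns .logits and a 0/1 mask marking response positions; names are illustrative:

  import torch

  def sequence_logprob(model, input_ids, response_mask):
      """Sum of log-probs of the response tokens in the full (prompt + response) sequence."""
      logits = model(input_ids).logits[:, :-1, :]             # position t predicts token t+1
      targets = input_ids[:, 1:]
      logps = torch.log_softmax(logits, dim=-1)
      token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
      return (token_logps * response_mask[:, 1:]).sum(-1)     # keep only response positions

Call it four times per batch (policy and reference, chosen and rejected) and feed the results to the loss above.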

DPO is what most open-source instruct models released after mid-2023 use: Llama-3-Instruct, Mistral-Instruct, Zephyr, Tulu, OLMo-Instruct. The code is roughly 100 lines on top of chapter 17's SFT loop.

4. Why this book stops at SFT

Three reasons:

  • Data. Preference pairs are expensive. Anthropic's HH-RLHF has 170k pairs collected from contractors. Open replacements like UltraFeedback exist but are noisy. Generating your own preference data is a real project.
  • Marginal return at small scale. A 14M or 124M model that already struggles on facts benefits less from preference tuning than from more pretraining data or more domain SFT. The pyramid is base → SFT → preferences, and each level requires the one below to be solid.
  • Scope. The book's contract is to build the architecture and feel each piece. Preference tuning adds infrastructure (logging chosen/rejected pairs, reference model serving, β sweeps) without adding new conceptual machinery beyond what SFT already taught.

If you want to ship a preference-tuned model after chapter 21's capstone, the practical recipe is:

  1. Use the SFT model from ch.21 as both the policy and the reference.
  2. Collect 500–2,000 preference pairs (rank your own model's outputs by hand, or distill judgments from a larger model).
  3. Plug the pairs into Hugging Face TRL's DPOTrainer; the whole run is about 30 lines of code (sketched below).
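
A sketch of that third step, assuming TRL's DPOTrainer and DPOConfig interface; argument names shift between TRL versions, and the paths and data here are placeholders:

  from datasets import Dataset
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from trl import DPOConfig, DPOTrainer

  sft_path = "path/to/ch21-sft-model"                          # placeholder path
  policy = AutoModelForCausalLM.from_pretrained(sft_path)      # copy that gets trained
  reference = AutoModelForCausalLM.from_pretrained(sft_path)   # frozen reference copy
  tokenizer = AutoTokenizer.from_pretrained(sft_path)

  pairs = Dataset.from_dict({                                  # one row per preference pair
      "prompt":   ["explain bubble sort"],
      "chosen":   ["Bubble sort repeatedly steps through the list..."],
      "rejected": ["Sure here is how Bubble Sort works..."],
  })

  config = DPOConfig(output_dir="dpo-out", beta=0.1, per_device_train_batch_size=2)

  trainer = DPOTrainer(
      model=policy,
      ref_model=reference,
      args=config,
      train_dataset=pairs,
      processing_class=tokenizer,   # older TRL versions call this argument `tokenizer`
  )
  trainer.train()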

5. What to take away

  • SFT teaches the shape; preference tuning teaches the choice between two equally-shaped responses.
  • RLHF = reward model + PPO. Original recipe; powerful, painful.
  • DPO = one function, no reward model, no RL. Closed-form derivation of the same idea. Dominant in open-source since 2023.
  • The mechanism is small enough to fit in your head: a margin-style loss on log-prob differences, regularized by the SFT model.
  • The data is the moat, not the algorithm. Preference pairs are where the work is.

Going further