The loss curve

Chapter 14 · 12 min

Generation and sampling

How a trained model becomes text — temperature, top-k, top-p (nucleus). Visualize each strategy on the same logits.

In chapter 13 you trained a model that, given a prefix of tokens, produces a probability distribution over the next token. The scripts/sample.py script picked the next token with torch.multinomial(probs, num_samples=1) — sample proportionally to the probabilities. That works, but it's only one of several reasonable choices, and the choice affects what the model says. This chapter looks at the three knobs every modern LLM exposes: temperature, top-K, top-P. They're cheap, they don't require retraining, and they change everything about output quality.
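To see the difference concretely, here is a standalone toy sketch (not the chapter's sample.py) contrasting proportional sampling with greedy argmax on a made-up three-token distribution:

import torch

probs = torch.tensor([0.6, 0.3, 0.1])                 # toy next-token distribution
greedy = probs.argmax()                                # always picks token 0
sampled = torch.multinomial(probs, num_samples=1)      # token 0 on ~60% of runs, 1 on ~30%, 2 on ~10%
print(greedy.item(), sampled.item())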

Three knobs, one pipeline

The standard sampling pipeline takes raw logits and produces the final distribution to sample from:

  1. Temperature scaling. Divide every logit by T. T = 1 is unchanged. T < 1 sharpens (more deterministic). T > 1 flattens (more random).
  2. Softmax to convert logits to probabilities.
  3. Top-K filter. Zero out everything except the top K entries. Renormalize. Default K = 50 in many APIs.
  4. Top-P filter (nucleus sampling). Sort, take the smallest prefix whose cumulative probability ≥ P. Zero the rest. Renormalize. Default P = 0.9 or 0.95.
  5. Sample from the resulting distribution.

The browser cell below builds steps 1-4 of the pipeline — given some raw logits and the three parameters, it returns the final distribution. The visualization shows you how each knob reshapes the curve.

[Interactive cell · JavaScript — the sampling pipeline with temperature, top-K, and top-P sliders]

Play with the sliders. Some things to notice:

  • Temperature → 0: the distribution collapses onto the single most-probable token (argmax). Output becomes greedy.
  • Temperature → 2: the distribution flattens. The model is much more willing to pick unusual tokens.
  • Top-K = 1: same as temperature → 0. Greedy decoding.
  • Top-P = 1: nucleus sampling does nothing (it keeps everything). Equivalent to no top-P filter.
  • Top-P = 0.5 + Top-K = ∞: the model only samples from the head of its distribution — common tokens only.
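If you want to check these behaviours outside the browser, here is a minimal PyTorch sketch of steps 1-4 (a standalone stand-in for the JavaScript cell, applied to a made-up five-token vocabulary):

import torch
import torch.nn.functional as F

def final_distribution(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Steps 1-4: return the distribution that step 5 would sample from."""
    probs = F.softmax(logits / temperature, dim=-1)                    # steps 1-2
    if top_k is not None:                                              # step 3
        kth = probs.topk(top_k).values[..., -1, None]
        probs = torch.where(probs < kth, torch.zeros_like(probs), probs)
        probs = probs / probs.sum(dim=-1, keepdim=True)
    if top_p < 1.0:                                                    # step 4
        sorted_probs, sorted_idx = probs.sort(descending=True, dim=-1)
        keep = sorted_probs.cumsum(dim=-1) - sorted_probs <= top_p     # smallest prefix reaching top_p
        sorted_probs = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
        probs = torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)
        probs = probs / probs.sum(dim=-1, keepdim=True)
    return probs

logits = torch.tensor([3.0, 2.0, 1.0, 0.0, -1.0])       # toy logits for a 5-token vocabulary
print(final_distribution(logits, temperature=0.1))       # near one-hot: effectively greedy
print(final_distribution(logits, temperature=2.0))       # much flatter
print(final_distribution(logits, top_k=2))               # only two tokens survive
print(final_distribution(logits, top_p=0.5))             # only the head of the distribution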

The full generate function (Python)

Wire your trained model + the sampling pipeline together. Save as scripts/generate.py:

"""scripts/generate.py — autoregressive generation with temperature + top-K + top-P."""
import torch
import torch.nn.functional as F
import tiktoken
 
from llm.model import GPT, GPTConfig
 
# config
# [1]
prompt = "ROMEO:"
max_new_tokens = 200
temperature = 0.8
top_k = 50
top_p = 0.9
 
# [2]
device = "mps" if torch.backends.mps.is_available() else (
    "cuda" if torch.cuda.is_available() else "cpu"
)
cfg = GPTConfig()
model = GPT(cfg).to(device)
model.load_state_dict(torch.load("checkpoints/model.pt", map_location=device))
model.eval()
 
# [3]
enc = tiktoken.get_encoding("gpt2")
idx = torch.tensor([enc.encode_ordinary(prompt)], device=device)
 
@torch.no_grad()
def sample_next(logits):
    # [4]
    logits = logits / temperature
    probs = F.softmax(logits, dim=-1)
    if top_k is not None:
        # [5]
        top_vals, _ = probs.topk(top_k)
        probs[probs < top_vals[..., -1, None]] = 0
        probs = probs / probs.sum(dim=-1, keepdim=True)
    if top_p < 1.0:
        # [6]
        sorted_probs, sorted_idx = probs.sort(descending=True, dim=-1)
        cum = sorted_probs.cumsum(dim=-1)
        # find cutoff
        mask = cum > top_p
        mask[..., 1:] = mask[..., :-1].clone()  # shift right so the token that crosses top_p is kept
        mask[..., 0] = False  # always keep the most probable
        sorted_probs[mask] = 0
        # scatter back
        probs = torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)
        probs = probs / probs.sum(dim=-1, keepdim=True)
    return torch.multinomial(probs, num_samples=1)
 
# [7]
for _ in range(max_new_tokens):
    idx_cond = idx if idx.size(1) <= cfg.block_size else idx[:, -cfg.block_size:]
    logits, _ = model(idx_cond)
    next_id = sample_next(logits[:, -1, :])
    idx = torch.cat([idx, next_id], dim=1)
 
print(enc.decode(idx[0].tolist()))

The heart of the script is sample_next; the numbered comments mark each piece:

  • [1] Config holds the generation knobs you will edit and rerun.
  • [2] Model load restores the checkpoint and moves it to the best available device.
  • [3] Prompt encoding turns the starting text into GPT-2 token ids.
  • [4] Temperature happens before softmax because it reshapes the logits themselves.
  • [5] Top-K happens after softmax by zeroing every probability outside the K largest.
  • [6] Top-P sorts probabilities from largest to smallest, keeps the smallest prefix that reaches top_p, then scatters those probabilities back to their original token ids.
  • [7] The outer loop is still the same autoregressive loop from chapter 13: crop context, run model, sample one token, append.
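To make [6] concrete, here is the same sort, cumulative sum, and scatter on a made-up five-token distribution with top_p = 0.8 (a standalone sketch, not part of generate.py):

import torch

probs = torch.tensor([0.05, 0.40, 0.10, 0.30, 0.15])
top_p = 0.8

sorted_probs, sorted_idx = probs.sort(descending=True)    # [0.40, 0.30, 0.15, 0.10, 0.05]
cum = sorted_probs.cumsum(dim=-1)                          # [0.40, 0.70, 0.85, 0.95, 1.00]
mask = cum > top_p                                         # [F, F, T, T, T]
mask[1:] = mask[:-1].clone()                               # shift: keep the token that crosses 0.8
mask[0] = False
sorted_probs[mask] = 0                                     # [0.40, 0.30, 0.15, 0.00, 0.00]
nucleus = torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)
print(nucleus / nucleus.sum())                             # nonzero only for tokens 1, 3, 4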

Then run it a few times (each run draws a different sample):

python -m scripts.generate
python -m scripts.generate
python -m scripts.generate

Edit the parameters at the top, re-run, see how the output changes. With temperature = 0.1 you should get repetitive, almost copy-paste output. With temperature = 1.5 and top_p = 0.95 the output should be more varied (and more frequently incoherent).
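If editing the file between runs gets tedious, one optional extension (not part of the chapter's script; the flag names here are illustrative) is to expose the knobs as command-line arguments with argparse:

# hypothetical replacement for the hard-coded config block in scripts/generate.py
import argparse

parser = argparse.ArgumentParser(description="Sample from the trained checkpoint")
parser.add_argument("--prompt", default="ROMEO:")
parser.add_argument("--max-new-tokens", type=int, default=200)
parser.add_argument("--temperature", type=float, default=0.8)
parser.add_argument("--top-k", type=int, default=50)
parser.add_argument("--top-p", type=float, default=0.9)
args = parser.parse_args()
# then read args.temperature, args.top_k, args.top_p below instead of the constants

# usage: python -m scripts.generate --temperature 1.5 --top-p 0.95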

What about beam search?

For language modeling, don't. Beam search (explore multiple candidate continuations and pick the most-probable sequence) reliably produces output with lower perplexity than sampling — but worse subjective quality. The reason: human language is full of low-probability creative choices that beam search prunes. Sampling lets the model take those.

Beam search is still useful for tasks with a strictly correct answer (translation, summarization with fixed expected length). Free-form generation almost always samples.
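For completeness, here is a minimal beam-search sketch over the same model, so you can compare its output against sampling yourself. It is not part of the chapter's scripts; it assumes the model(idx) -> (logits, loss) interface from generate.py, and you would pass cfg.block_size for block_size (the default below is just a placeholder):

import torch
import torch.nn.functional as F

@torch.no_grad()
def beam_search(model, idx, steps, beam_width=4, block_size=256):
    # each beam is (token tensor of shape (1, T), cumulative log-probability)
    beams = [(idx, 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            logits, _ = model(tokens[:, -block_size:])
            logp = F.log_softmax(logits[:, -1, :], dim=-1)
            top_logp, top_ids = logp.topk(beam_width)      # expand each beam by its best continuations
            for lp, tid in zip(top_logp[0], top_ids[0]):
                candidates.append(
                    (torch.cat([tokens, tid.view(1, 1)], dim=1), score + lp.item())
                )
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]                                     # highest-probability sequence found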

Recap

  • Temperature scales logits before softmax. Below 1, sharpens. Above 1, flattens. 0 is greedy.
  • Top-K keeps the K most-probable tokens; renormalizes. Bounds vocabulary diversity.
  • Top-P (nucleus) keeps the smallest prefix summing to P; renormalizes. Adapts to the shape of the distribution per step.
  • The defaults for chat-style models are usually T = 0.7-1.0, K = 50, P = 0.9-0.95. Stricter (lower T or lower P) for factual tasks; looser for creative.
  • Beam search has its place, but not in free-form language generation. Sampling wins on perceived quality.
  • Your local project now has scripts/generate.py, the first usable interface to your checkpoint.

Going further

Next up: load real weights — your model has the shape of a language model but its weights are tiny and Shakespeare-poisoned. Time to drop OpenAI's pretrained weights into the same architecture and see what your code can really do.