Self-attention from scratch
Build an attention head by hand — query, key, value, scaled dot product, causal mask, softmax. With runnable visualizations.
Self-attention is the mechanism that lets each token in a sequence look at every other token. It's the central operation in every modern language model. This guide builds an attention head from scratch — query, key, value, scaled dot product, causal mask, softmax — and links into the chapter that does it interactively.
1. The intuition
A model that processes tokens one at a time (like an RNN) has to compress the entire past into a fixed-size hidden state. Self-attention skips the compression: at every position, every token can directly pull information from every other token.
The trick is how much each token pulls. Self-attention learns that.
2. Query, Key, Value
For each input token, three linear projections produce three vectors:
- Query (Q): what this token is looking for.
- Key (K): what this token represents to others.
- Value (V): what this token will contribute if selected.
For a sequence of length T with model dimension d_model, each projection maps the T × d_model input down to a T × d_head matrix:
Q, K, V ∈ ℝ^(T × d_head)
Three projections, three matrices, all learnable.
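To make this concrete, here is a minimal sketch in PyTorch; the library choice and the sizes (T=8, d_model=64, d_head=16) are illustrative, not taken from this guide:

import torch
import torch.nn as nn

torch.manual_seed(0)
T, d_model, d_head = 8, 64, 16           # illustrative sizes

x = torch.randn(T, d_model)              # token embeddings for one sequence

# three learnable linear projections, one per role
W_q = nn.Linear(d_model, d_head, bias=False)
W_k = nn.Linear(d_model, d_head, bias=False)
W_v = nn.Linear(d_model, d_head, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)         # each is T × d_head
print(Q.shape, K.shape, V.shape)         # torch.Size([8, 16]) three times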
3. Scaled dot-product attention
The attention score from token i to token j is the dot product of Q[i] with K[j], divided by √d_head:
score[i, j] = (Q[i] · K[j]) / √d_head
Compute these for all pairs and you have a T × T matrix of raw scores.
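A short PyTorch sketch of that matrix; the random Q and K here stand in for the learned projections from the previous section:

import math
import torch

torch.manual_seed(0)
T, d_head = 8, 16
Q, K = torch.randn(T, d_head), torch.randn(T, d_head)   # stand-ins for the projections

scores = (Q @ K.T) / math.sqrt(d_head)   # dot product of every query with every key
print(scores.shape)                      # torch.Size([8, 8]), one score per (i, j) pair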
Chapter 8 visualizes this matrix at every step.
4. The causal mask
In a decoder-only language model, token i must not attend to tokens j > i: otherwise the model could cheat during training by looking ahead at the answer.
The fix: set those future-position scores to negative infinity before softmax. After softmax they become zero.
score[i, j] = -inf if j > i
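One common way to do this in code (a PyTorch sketch, not the chapter's exact implementation) is to build a boolean upper-triangular mask and fill those positions with -inf:

import torch

torch.manual_seed(0)
T = 8
scores = torch.randn(T, T)                                          # stand-in raw scores
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)   # True where j > i
masked_scores = scores.masked_fill(mask, float("-inf"))
print(masked_scores[0])   # row 0: only position 0 keeps its score, the rest are -inf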
5. Softmax and the weighted sum
Apply softmax row-wise to the (masked) scores to turn them into a probability distribution over tokens:
A = softmax(score) # T × T, each row sums to 1
Then the output for each token is a weighted sum of values:
output = A · V # T × d_head
That's one attention head. End to end:
A = softmax((Q · Kᵀ) / √d_head + causal_mask)
output = A · V
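Putting the whole head together in one runnable PyTorch sketch (sizes illustrative, assuming the same setup as above):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
T, d_model, d_head = 8, 64, 16

x = torch.randn(T, d_model)              # token embeddings
W_q = nn.Linear(d_model, d_head, bias=False)
W_k = nn.Linear(d_model, d_head, bias=False)
W_v = nn.Linear(d_model, d_head, bias=False)
Q, K, V = W_q(x), W_k(x), W_v(x)

scores = (Q @ K.T) / math.sqrt(d_head)                              # T × T raw scores
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)   # future positions
scores = scores.masked_fill(mask, float("-inf"))

A = F.softmax(scores, dim=-1)            # each row sums to 1
output = A @ V                           # T × d_head
print(A.sum(dim=-1))                     # all ones
print(output.shape)                      # torch.Size([8, 16])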
6. Multi-head attention
A single head can only learn one routing pattern. Multi-head attention runs several heads in parallel, each with its own Q/K/V projections, then concatenates their outputs and projects back through a learned W_O matrix.
If you have n_heads heads with dimension d_head each, the concatenated output has dimension n_heads × d_head = d_model.
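A compact sketch of a multi-head module in PyTorch; the class name, variable names, and sizes are illustrative, not taken from the chapters:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    # minimal causal multi-head self-attention for a single (unbatched) sequence
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.W_qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # all heads' Q, K, V at once
        self.W_o = nn.Linear(d_model, d_model, bias=False)        # output projection

    def forward(self, x):                                  # x: (T, d_model)
        T, _ = x.shape
        q, k, v = self.W_qkv(x).chunk(3, dim=-1)
        # split into heads: (n_heads, T, d_head)
        q = q.view(T, self.n_heads, self.d_head).transpose(0, 1)
        k = k.view(T, self.n_heads, self.d_head).transpose(0, 1)
        v = v.view(T, self.n_heads, self.d_head).transpose(0, 1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)   # (n_heads, T, T)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        A = F.softmax(scores, dim=-1)
        out = A @ v                                        # (n_heads, T, d_head)
        out = out.transpose(0, 1).reshape(T, -1)           # concatenate heads: (T, d_model)
        return self.W_o(out)

torch.manual_seed(0)
x = torch.randn(8, 64)
mha = MultiHeadSelfAttention(d_model=64, n_heads=4)
print(mha(x).shape)                                        # torch.Size([8, 64])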
Chapter 9 implements this and shows head-by-head attention visualizations on trained text.
7. Where this fits
Attention is one piece of the Transformer block. The other pieces — residuals, LayerNorm, feed-forward — wrap around it and make the stack trainable at depth.
- Chapter 10 — The full Transformer block assembles all the pieces into the block that GPT stacks N times.
- Build a Transformer from scratch gives the same tour at the block level.
- Build a GPT from scratch takes attention all the way to a trained GPT-2-scale model.
Frequently asked questions
What is self-attention in plain terms?
A way for each token to look at every other token in the sequence and decide which ones matter. Each token produces a query, a key, and a value vector; attention is a weighted sum of values, weighted by how well queries match keys.
Why divide the attention score by √d_k?
Without it, dot products grow with dimension and push the softmax into saturation, where almost all probability sits on one token. Dividing by √d_k keeps the variance roughly constant so gradients stay healthy.
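A quick numerical check of this claim (a PyTorch sketch): the variance of raw dot products grows roughly like the dimension d, and dividing by √d brings it back to about 1.

import torch

torch.manual_seed(0)
for d in (16, 64, 256):
    q = torch.randn(10_000, d)
    k = torch.randn(10_000, d)
    raw = (q * k).sum(dim=-1)            # dot products of random unit-variance vectors
    scaled = raw / d ** 0.5
    print(d, raw.var().item(), scaled.var().item())
# raw variance is close to d in each case; scaled variance stays near 1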
What is the causal mask?
A constraint that prevents each token from attending to future tokens. We set future-position scores to negative infinity before softmax, so they end up at zero probability. This is what makes a Transformer a decoder.
How is multi-head different from a single head?
One head learns one attention pattern. Multi-head runs several in parallel with different Q/K/V projections, letting different heads specialize on different patterns (syntax, identity, long-range) before their outputs are combined.