
Self-attention from scratch

Build an attention head by hand — query, key, value, scaled dot product, causal mask, softmax. With runnable visualizations.

Self-attention is the mechanism that lets each token in a sequence look at every other token. It's the central operation in every modern language model. This guide builds an attention head from scratch — query, key, value, scaled dot product, causal mask, softmax — and links into the chapter that does it interactively.

1. The intuition

A model that processes tokens one at a time (like an RNN) has to compress the entire past into a fixed-size hidden state. Self-attention skips the compression: at every position, every token can directly pull information from every other token.

The question is how much each token should pull from each of the others. Self-attention learns that.

2. Query, Key, Value

For each input token, three linear projections produce three vectors:

  • Query (Q): what this token is looking for.
  • Key (K): what this token represents to others.
  • Value (V): what this token will contribute if selected.

For a sequence of length T, each projection maps the T × d_model input embeddings to a T × d_head matrix:

Q, K, V ∈ ℝ^(T × d_head)

Three projections, three matrices, all learnable.
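
Here is a minimal NumPy sketch of those projections, assuming a toy sequence of T = 4 tokens with d_model = 8 and d_head = 4 (the sizes, the seed, and the names W_q, W_k, W_v are illustrative, not taken from the chapter code):

import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_head = 4, 8, 4               # toy sizes for illustration

x = rng.normal(size=(T, d_model))          # input token embeddings, one row per token

# three learnable projection matrices (random stand-ins for trained weights)
W_q = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
W_k = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
W_v = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)

Q = x @ W_q   # (T, d_head): what each token is looking for
K = x @ W_k   # (T, d_head): what each token advertises to others
V = x @ W_v   # (T, d_head): what each token contributes if selected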

3. Scaled dot-product attention

The attention score from token i to token j is the dot product of Q[i] with K[j], divided by √d_head:

score[i, j] = (Q[i] · K[j]) / √d_head

Compute these for all pairs and you have a T × T matrix of raw scores.

Chapter 8 visualizes this matrix at every step.
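
Continuing the sketch above, the whole score matrix is a single matrix product (illustrative NumPy, reusing Q, K, and d_head from the previous snippet):

scores = (Q @ K.T) / np.sqrt(d_head)   # (T, T); scores[i, j] is the raw score from token i to token j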

4. The causal mask

For a decoder-only language model, token i cannot attend to tokens j > i — it would let the model cheat during training by looking at the answer.

The fix: set those future-position scores to negative infinity before softmax. After softmax they become zero.

score[i, j] = -inf  if j > i
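
In code this is one boolean mask applied before softmax (again an illustrative continuation of the NumPy sketch):

# True wherever j > i, i.e. strictly above the diagonal
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)   # future positions can never win the softmax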

5. Softmax and the weighted sum

Apply softmax row-wise to the (masked) scores to turn each row into a probability distribution over the positions that token can attend to:

A = softmax(score)    # T × T, each row sums to 1

Then the output for each token is a weighted sum of values:

output = A · V        # T × d_head

That's one attention head. End to end:

A = softmax((Q · Kᵀ) / √d_head + causal_mask)
output = A · V
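
The same two steps in the running NumPy sketch, with a numerically stable softmax (the helper name is mine, any row-wise softmax works):

def softmax(s):
    # subtract the row max before exponentiating for numerical stability;
    # exp(-inf) = 0, so masked positions drop out exactly
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

A = softmax(scores)   # (T, T), each row sums to 1, masked entries are 0
output = A @ V        # (T, d_head), one mixed value vector per token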

6. Multi-head attention

A single head can only learn one routing pattern. Multi-head attention runs several heads in parallel, each with its own Q/K/V projections, then concatenates their outputs and projects back through a learned W_O matrix.

If you have n_heads heads with dimension d_head each, the concatenated output has dimension n_heads × d_head, which is usually chosen to equal d_model so the block's input and output shapes match.

Chapter 9 implements this and shows head-by-head attention visualizations on trained text.
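
A compact sketch of the multi-head version, reusing the toy x, mask, and softmax from the snippets above and assuming d_model = n_heads × d_head (real implementations fuse the per-head loop into one batched matmul; the loop here is only for readability):

n_heads = 2
d_head = d_model // n_heads             # 4 with the toy sizes above

heads = []
for _ in range(n_heads):
    Wq = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
    Wk = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
    Wv = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
    Qh, Kh, Vh = x @ Wq, x @ Wk, x @ Wv
    s = np.where(mask, -np.inf, (Qh @ Kh.T) / np.sqrt(d_head))
    heads.append(softmax(s) @ Vh)       # (T, d_head) per head

W_o = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
multi_head_out = np.concatenate(heads, axis=-1) @ W_o   # (T, d_model)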

7. Where this fits

Attention is one piece of the Transformer block. The other pieces — residuals, LayerNorm, feed-forward — wrap around it and make the stack trainable at depth.

Frequently asked questions

What is self-attention in plain terms?

A way for each token to look at every other token in the sequence and decide which ones matter. Each token produces a query, a key, and a value vector; attention is a weighted sum of values, weighted by how well queries match keys.

Why divide the attention score by √d_k?

Without it, dot products grow with dimension and push the softmax into saturation, where almost all probability sits on one token. Dividing by √d_k keeps the variance roughly constant so gradients stay healthy.
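
A quick numerical check of this claim (illustrative NumPy, independent of the sketches above):

import numpy as np

d_k = 64
rng = np.random.default_rng(0)
q = rng.normal(size=(100_000, d_k))     # unit-variance query components
k = rng.normal(size=(100_000, d_k))     # unit-variance key components
dots = (q * k).sum(axis=-1)             # 100k sample dot products

print(dots.var())                       # ≈ d_k = 64: variance grows with dimension
print((dots / np.sqrt(d_k)).var())      # ≈ 1 after dividing by √d_k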

What is the causal mask?

A constraint that prevents each token from attending to future tokens. We set future-position scores to negative infinity before softmax, so they end up at zero probability. This is what makes a Transformer a decoder.

How is multi-head different from a single head?

One head learns one attention pattern. Multi-head runs several in parallel with different Q/K/V projections, letting different heads specialize on different patterns (syntax, identity, long-range) before their outputs are combined.
