The loss curve

Chapter 8 · 18 min

An attention head by hand

Q, K, V, scaled dot product, causal mask, softmax. Build a self-attention head by hand and visualize what it attends to.

Up to chapter 7 we had a model that could fit functions on individual examples. Loss curves go down, MLPs solve XOR, optimizers do their job. The problem we haven't solved: the model has no way for a token to look at other tokens. The bigram model in chapter 1 could only see the previous word; an MLP applied to a fixed-size embedding can only see what's in that single vector. There's no mechanism for context to flow.

Attention is that mechanism, and it has powered every state-of-the-art language model since 2017. The original paper called it "all you need", which was an exaggeration that proved approximately correct.

The mechanics are mechanical. Once you've run each piece and inspected the shapes, you can explain attention to a friend at a bar. We're going to do exactly that — four runnable cells that build a single attention head from the inside out, on a 5-token toy sentence with 4-dimensional embeddings so the matrices fit on a screen. After that, you will save a minimal causal attention function locally.

The sentence: the cat sat on mat. Five tokens, hand-crafted 4D embeddings.

1. Project to Q, K, V

A single attention head has three learned matrices: W_Q, W_K, W_V, each [d × d_head]. Multiplying the input X (shape [seq_len × d]) by each gives queries, keys, and values — three views of the same input, each computed by a different set of weights.

Q = X \cdot W_Q, \quad K = X \cdot W_K, \quad V = X \cdot W_V

Conceptually:

  • Queries ask "what am I looking for?".
  • Keys advertise "this is what I have".
  • Values carry "this is what I'll contribute if you pick me".

Write the projection for Q. (The same routine works for K and V — once you've built it, you're done.)

Code · JavaScript

The result is [seq_len × d_head], the same shape as the input in our toy case (d_head = d = 4). Each row is the query vector for one token in the sentence.
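
The cells on this page are JavaScript; if you're following along in plain Python instead, a minimal sketch of this projection might look like the block below. The embeddings and weights here are made-up toy numbers, not the values the cell uses:

# Five tokens ("the cat sat on mat"), hand-picked 4-dim embeddings.
X = [
    [1.0, 0.0, 1.0, 0.0],  # the
    [0.0, 1.0, 0.0, 1.0],  # cat
    [1.0, 1.0, 0.0, 0.0],  # sat
    [0.0, 0.0, 1.0, 1.0],  # on
    [1.0, 0.0, 0.0, 1.0],  # mat
]

# Hypothetical stand-in for the learned W_Q, shape [d x d_head] = [4 x 4].
W_Q = [
    [0.2, 0.0, 0.1, 0.0],
    [0.0, 0.3, 0.0, 0.1],
    [0.1, 0.0, 0.2, 0.0],
    [0.0, 0.1, 0.0, 0.3],
]

def matmul(x, w):
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]

Q = matmul(X, W_Q)  # [seq_len x d_head] = [5 x 4]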

2. Score every pair

Now ask: how much does every query care about every key? The standard answer is a dot product — large when two vectors point the same way, small when they're orthogonal, negative when they point opposite.

S_{ij} = Q_i \cdot K_j

The result S is a [seq_len × seq_len] matrix. S[i][j] is "how relevant is token j to token i?". Row i holds the scores from token i's perspective.

Code · JavaScript

The heatmap shows the raw scores. The matrix is not symmetric in general — i asking about j is a different question than j asking about i, because they use different W_Q and W_K. That asymmetry is what lets attention express directional relations like "this verb is governed by this subject".
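
Continuing the Python sketch from step 1 (reusing X, matmul, and Q), with one more made-up matrix standing in for the learned W_K:

# Hypothetical W_K, same shape as W_Q.
W_K = [
    [0.1, 0.2, 0.0, 0.0],
    [0.0, 0.1, 0.2, 0.0],
    [0.0, 0.0, 0.1, 0.2],
    [0.2, 0.0, 0.0, 0.1],
]
K = matmul(X, W_K)

# S[i][j] = Q_i . K_j: token i's query against token j's key.
S = [
    [sum(qi * kj for qi, kj in zip(q_row, k_row)) for k_row in K]
    for q_row in Q
]
# S is [seq_len x seq_len] = [5 x 5], and not symmetric in general.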

3. Scale and softmax

The scores aren't probabilities yet — they can be any sign, any magnitude. Two transformations turn them into a probability distribution over tokens, row by row:

  1. Scale every entry by 1/√d_k. Without this, the dot products of high-dimensional vectors would grow large enough that softmax saturates and gradients vanish during training.
  2. Softmax each row. Now each row sums to 1.

A = \text{softmax}\!\left(\frac{S}{\sqrt{d_k}}\right)

(The softmax is applied row-wise: each token's scores are normalized independently.)

Code · JavaScript

Look at row i in the heatmap. The cells in that row tell you, for token i, how much of its updated representation will come from each other token. If token 2 has a strong value at column 4, the model thinks token 4 is highly relevant as a source for token 2's update. The row sums are 1 — every row is a real probability distribution.
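
In the running Python sketch, both transformations fit in a few lines:

import math

d_k = len(K[0])  # key dimension, 4 in the toy

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

A = [softmax([s / math.sqrt(d_k) for s in row]) for row in S]
# Every row of A now sums to 1: one probability distribution per token.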

4. Mix the values

Last step: each token's output is the attention-weighted sum of value vectors. Token i takes a weighted average over all tokens' values, weighted by the attention row A[i].

\text{output}_i = \sum_j A_{ij} \cdot V_j

In matrix form, output = A · V, shape [seq_len × d_head], which matches the input X in our toy case since d_head = d. The head reshuffles each token's representation by pulling in pieces of others.

Code · JavaScript
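
One last made-up matrix for W_V finishes the running Python sketch:

# Hypothetical W_V, shaped like W_Q and W_K.
W_V = [
    [0.3, 0.0, 0.0, 0.1],
    [0.0, 0.2, 0.1, 0.0],
    [0.1, 0.0, 0.3, 0.0],
    [0.0, 0.1, 0.0, 0.2],
]
V = matmul(X, W_V)

# output[i] = sum_j A[i][j] * V[j], i.e. output = A . V
output = [
    [sum(a * v[d] for a, v in zip(a_row, V)) for d in range(len(V[0]))]
    for a_row in A
]
# [seq_len x d_head] = [5 x 4]: same shape as X in this toy setup.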

That's the full single-head attention computation. Five lines of math, four matrix operations. Stack the same machinery a few dozen times in a row (which is what the next chapters do), train it on a billion tokens of text, and you get GPT.

Why this works

The mechanism isn't deep but the implications are. A single attention head can implement, depending on its trained weights:

  • A copy — every token attends to the previous one, output = previous token's value. (Useful for repetition; see the sketch after this list.)
  • A lookup — for every "the", attend to the noun that follows. (Common in language modeling.)
  • A dependency — for every verb, attend to its subject. (Long-range agreement.)
  • A summary — every token attends roughly equally to every other token, averaging the sequence. (Useful for the final layer.)
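
To make the first of these concrete, here is a hand-written (idealized, not trained) attention matrix that implements the copy pattern for our 5-token sentence:

# Row i puts all its weight on column i-1.
# Row 0 has no past, so it attends to itself.
A_copy = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
# With this A, output[i] is exactly V[i-1]: each token copies its predecessor.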

The training procedure (gradient descent on next-token loss) figures out which patterns the network needs. We never tell it. A modern LLM has tens of these heads per layer and dozens of layers; each head specializes during training.

5. Add causal attention locally

Create llm/attention.py:

"""Readable attention helpers before the PyTorch version."""
from __future__ import annotations
 
import math
 
 
Vector = list[float]
Matrix = list[Vector]
 
 
def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))
 
 
def softmax(values: Vector) -> Vector:
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [v / total for v in exps]
 
 
def matmul(x: Matrix, w: Matrix) -> Matrix:
    columns = list(zip(*w))
    return [[dot(row, list(col)) for col in columns] for row in x]
 
 
def causal_attention(x: Matrix, wq: Matrix, wk: Matrix, wv: Matrix) -> Matrix:
    # [1]
    q = matmul(x, wq)
    k = matmul(x, wk)
    v = matmul(x, wv)
    scale = math.sqrt(len(k[0]))
 
    out: Matrix = []
    for i, query in enumerate(q):
        # [2]
        scores = [
            dot(query, key) / scale if j <= i else -1e9
            for j, key in enumerate(k)
        ]
        # [3]
        weights = softmax(scores)
        # [4]
        out.append([
            sum(weight * value[d] for weight, value in zip(weights, v))
            for d in range(len(v[0]))
        ])
    return out

Read it as four passes over the same sequence:

  • [1] q, k, and v are three learned views of x. Same tokens, different questions.
  • [2] scores compares token i's query against every key. j <= i is the causal mask: past and current tokens are visible; future tokens get -1e9.
  • [3] softmax(scores) makes one probability distribution for token i.
  • [4] builds a weighted average of value vectors. That is the new representation for token i.

The important addition is the causal mask: token i can only read tokens 0..i. Without that, the model could cheat during next-token training by looking at the answer.
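
A quick sanity check you can run from the project root, assuming llm/ is importable. The identity weight matrices here are just a convenient stand-in so the result is easy to verify by eye:

from llm.attention import causal_attention

# Identity weights make q = k = v = x.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
x = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
]
out = causal_attention(x, identity, identity, identity)
print(out[0])  # row 0 can only see itself, so it equals x[0]: [1.0, 0.0, 1.0, 0.0]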

Recap

  • Three projections of the same input — Q (queries), K (keys), V (values).
  • Scores are pairwise dot products of queries and keys: how relevant is every other token?
  • Scale and softmax turn scores into a probability distribution per token.
  • Output is a weighted sum of value vectors, weighted by attention.
  • Your local project now has llm/attention.py with causal attention.
  • One head is a single information-routing pattern. Multiple heads (next chapter) let different routes coexist.

Going further

Next up: multi-head and residuals — why one head isn't enough, and the connection that lets us stack many.