Chapter 9 · 14 min
Multi-head and residuals
From one head to many. Add residual connections and LayerNorm — the wiring that makes Transformers trainable at depth.
In chapter 8 you built one attention head. It computed one attention pattern — one matrix of who-listens-to-whom — and reshuffled the token representations accordingly. That's one routing pattern per layer.
That's not enough. Real text has many simultaneous relationships: subject/verb agreement, pronoun resolution, modifier attachment, semantic similarity, position bias. Any one of these can be the right thing to attend to depending on the token. A model with only one routing pattern per layer has to pick just one.
The fix is the obvious one: run several heads in parallel. Each head has its own W_Q, W_K, W_V, so each ends up with a different attention pattern. Concatenate their outputs, project the result, move on. That's multi-head attention.
This chapter also introduces residual connections and layer normalization, the two pieces of plumbing that make it possible to actually stack attention layers without the network collapsing during training. They look like afterthoughts; they're load-bearing. Locally, you will add those shape-preserving operations to the pieces you already wrote. The residual connection and LayerNorm are essential for deep networks.
Same toy sentence as chapter 8 (the cat sat on mat), but with d_model = 8, H = 4 heads, so each head's d_head = 2.
1. Combine multiple heads
You already have a working single attention head. The chapter has pre-run 4 different heads (different random seeds for W_Q, W_K, W_V) and hands you each head's output as a [seq_len × d_head] matrix. The cell combines them.
The standard recipe:
- Concatenate every head's output along the feature axis. Each token's row becomes [head₀, head₁, head₂, head₃], length H × d_head = d_model.
- Project the concatenated matrix through a learned W_O of shape [d_model × d_model].
The output projection lets the model decide how to mix the heads. Without it, the heads' outputs would just be glued together with no chance to interact. This projection is the mixing mechanism.
Code · JavaScript
The result has the same shape as the input — [seq_len × d_model] — but each token's new representation now reflects four different attention patterns blended together.
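The on-page cell is JavaScript; if you want to see the same combine step outside the browser, here is a minimal Python sketch. The names head_outputs, w_o, and matmul are placeholders for this sketch, not the chapter's exact variables, and the random values stand in for the pre-run head outputs and the learned projection.

import random

seq_len, d_model, n_heads = 5, 8, 4
d_head = d_model // n_heads  # 2 features per head

random.seed(0)

# Stand-ins for the four pre-run head outputs, each [seq_len x d_head].
head_outputs = [
    [[random.gauss(0, 1) for _ in range(d_head)] for _ in range(seq_len)]
    for _ in range(n_heads)
]
# A stand-in for the learned output projection W_O: [d_model x d_model].
w_o = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_model)]

def matmul(a, b):
    # Plain list-of-lists matrix multiply.
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# 1. Concatenate along the feature axis: row t becomes [head0_t, head1_t, head2_t, head3_t].
concat = [
    [value for head in head_outputs for value in head[t]]
    for t in range(seq_len)
]

# 2. Project through W_O so the heads can mix.
output = matmul(concat, w_o)
print(len(output), len(output[0]))  # 5 8: same [seq_len x d_model] shape as the input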
2. Inspect what the heads actually learn
We've claimed the heads see different things. Let's verify. The four pre-computed attention matrices are shown below as heatmaps. Some are sharp (a few cells dominate each row); some are diffuse (mass spread evenly across the row).
A standard way to quantify "concentration" is entropy: H = -Σ p log p. Low entropy means the head focuses on a few tokens. The maximum entropy for a row of 5 tokens is log(5) ≈ 1.61.
Compute the average entropy per head.
Code · JavaScript
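The cell itself is JavaScript; for reference, here is a minimal Python sketch of the same computation, with row_entropy and average_entropy as illustrative names and two made-up attention matrices as inputs.

import math

def row_entropy(p):
    # H = -sum(p * log p), skipping zero probabilities.
    return -sum(q * math.log(q) for q in p if q > 0)

def average_entropy(attention):
    # attention: [seq_len x seq_len], each row sums to 1.
    return sum(row_entropy(row) for row in attention) / len(attention)

# Example: a sharp head vs. a diffuse head over 5 tokens.
sharp = [[0.96, 0.01, 0.01, 0.01, 0.01]] * 5
diffuse = [[0.2] * 5] * 5
print(average_entropy(sharp))    # ≈ 0.22, far below log(5) ≈ 1.61
print(average_entropy(diffuse))  # ≈ 1.61, the maximum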
[Heatmaps: attention patterns across heads (Head 0, Head 1, Head 2, Head 3) · log(5) ≈ 1.61 = maximum entropy for 5 tokens]
You should see meaningful differences between heads. Some end up nearly uniform (high entropy — the head doesn't strongly prefer anything). Others have sharp peaks (low entropy — the head has decided to focus). In a real trained transformer, you'd find specialized heads: copy heads, induction heads, name-following heads, syntax-tracking heads. We named patterns we found, not patterns we asked for.
3. Residual + LayerNorm
If we wired the attention sublayer straight into the next attention sublayer, training would collapse. Two reasons, both gnarly:
- Gradient vanishing. Each sublayer compresses the gradient signal a bit. Stack 12 of them and the gradient at the bottom is microscopic.
- Representation drift. Each sublayer transforms the activations into a different geometry. Stack many and the activation magnitudes blow up or shrink to zero.
The residual connection fixes the first: instead of output = sublayer(input), we use output = input + sublayer(input). The gradient now has a clean path back through the addition regardless of what the sublayer does, because differentiating input + sublayer(input) with respect to input gives an identity term plus the sublayer's own contribution. (This is the trick that made ResNets practical in 2015 and is now in every deep model.)
Layer normalization fixes the second: after the residual addition, normalize every token's row to have mean 0 and standard deviation 1. The downstream layer sees activations of a known scale regardless of what came before.
Combined, the per-sublayer recipe is: output = LayerNorm(input + sublayer(input)).
Run it. The chapter feeds the cell the original input (X) and sublayerOutput (the multi-head output from cell 1), plus a small eps for numerical stability inside the standard-deviation division.
Code · JavaScript
Look at the row statistics. Each row's mean should be 0 (to floating-point precision) and each row's std should be very close to 1. That's the invariant LayerNorm gives the next sublayer: every token, every layer, comes in with the same scale.
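If you would rather script that check than eyeball it, here is a small Python sketch; normalized is a stand-in for whatever matrix your LayerNorm cell produced.

import math

def row_stats(matrix):
    # Returns (mean, std) per row; after LayerNorm, expect roughly (0, 1) everywhere.
    stats = []
    for row in matrix:
        mean = sum(row) / len(row)
        std = math.sqrt(sum((v - mean) ** 2 for v in row) / len(row))
        stats.append((mean, std))
    return stats

# Placeholder for the cell's output, [seq_len x d_model].
normalized = [[-1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0]] * 5
for mean, std in row_stats(normalized):
    print(round(mean, 6), round(std, 6))  # 0.0 1.0 on every row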
Why this chapter matters
We've now got every piece of a transformer block:
- Attention (chapter 8): pull information from other positions.
- Multi-head (this chapter): run several attention routes in parallel.
- Residual + LayerNorm (this chapter): the connectivity and normalization that let us stack many blocks.
The next chapter assembles them into a complete transformer block and stacks several of them into the actual architecture.
4. Add residual and normalization helpers
Append these helpers to llm/nn.py:
import math

def add(a: Matrix, b: Matrix) -> Matrix:
    # Residual connection: element-wise sum of two same-shape matrices.
    return [
        [x + y for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]

def layer_norm(x: Matrix, eps: float = 1e-5) -> Matrix:
    # Normalize each row (each token) to mean 0 and standard deviation 1.
    out: Matrix = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((value - mean) ** 2 for value in row) / len(row)
        denom = math.sqrt(var + eps)
        out.append([(value - mean) / denom for value in row])
    return out

These helpers are small because their job is structural:
- add is the residual connection. It keeps the old representation and adds the sublayer's proposed change.
- layer_norm works row by row, so every token is normalized independently.
- mean recenters a token's features around zero.
- var measures how spread out that token's features are.
- Dividing by sqrt(var + eps) gives the next layer a predictable scale. eps prevents division by zero.
- layer_norm stabilizes training by maintaining consistent activation scales.
You now have the invariant a transformer depends on: every sublayer accepts a matrix and returns a matrix of the same shape, so the residual stream can keep flowing.
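As a quick sanity check of that invariant, here is one way you might exercise the helpers from a scratch script. The toy shapes and the import path are illustrative, and Matrix is assumed to already be defined in llm/nn.py from the earlier chapters.

# Hypothetical quick check run next to the llm package.
from llm.nn import add, layer_norm

x = [[0.5, -1.2, 3.0, 0.1], [2.0, 0.0, -0.7, 1.4]]                 # [2 tokens x 4 features]
sublayer_output = [[0.1, 0.2, -0.3, 0.0], [0.0, 0.5, 0.5, -0.2]]   # same shape

out = layer_norm(add(x, sublayer_output))   # the per-sublayer recipe

assert len(out) == len(x) and len(out[0]) == len(x[0])   # shape preserved
for row in out:
    mean = sum(row) / len(row)
    std = (sum((v - mean) ** 2 for v in row) / len(row)) ** 0.5
    print(round(mean, 6), round(std, 6))   # ≈ 0.0 and ≈ 1.0 for every row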
Recap
- Multi-head attention runs H attention computations in parallel with separate Q/K/V projections per head, then concatenates the outputs and projects with W_O.
- Different heads learn different patterns by virtue of starting with different random weights and being trained on the same loss. Interpretability research catalogs the recurring patterns.
- Residual connection: output = input + sublayer(input). Lets gradients flow cleanly through deep stacks.
- LayerNorm normalizes each token's row to mean 0, std 1. Stabilizes activation scales across layers.
- Your local project now has residual and LayerNorm helpers, the glue that lets attention stack.
- The transformer block is just these three pieces glued together, twice (once for attention, once for feed-forward). Next chapter assembles the whole thing.
The residual connection and LayerNorm are essential for deep networks.
Going further
- Vaswani et al., "Attention Is All You Need" (2017). The original transformer paper. The architecture is still the dominant one nine years later.
- Anthropic's transformer-circuits.pub — the mechanistic-interpretability lab's writings on what heads actually do.
- He et al., "Deep Residual Learning" (2015) — the paper that put residual connections on the map.
Next up: the full transformer block — combine everything we've built into the actual unit you stack.