
Build a Transformer from scratch

Build a Transformer block from scratch — Q/K/V attention, multi-head, residuals, LayerNorm, feed-forward. Runnable in the browser, then in PyTorch.

A Transformer block is six small operations stacked in a specific order. Most explanations skip past the assembly. This guide builds each part in isolation — attention, multi-head, residuals, LayerNorm, feed-forward — then puts them together in a working PyTorch model. Three runnable chapters back this page.

1. The Transformer in one sentence

The Transformer is a stack of identical blocks. Each block lets every token in the sequence look at every other token (via attention), then independently processes each token (via a feed-forward network). Residual connections and LayerNorm keep gradients flowing as the stack gets deep.

2. The attention head

The simplest version of attention takes three projections of the input — Query, Key, Value — and computes a weighted sum of values. The weights come from how well each query matches each key.

The exact formula:

A = softmax(Q · Kᵀ / √d_k) · V
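
A minimal PyTorch sketch of this formula (the function name and shapes here are illustrative, not the chapter's code):

import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) projections of the same input
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # how well each query matches each key
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                             # weighted sum of values

# e.g. attention(torch.randn(5, 8), torch.randn(5, 8), torch.randn(5, 8)) has shape (5, 8)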

Chapter 8 — An attention head by hand builds this one operation at a time, with interactive visualizations of the attention matrix and the causal mask.

3. Multi-head attention

Once you have one attention head, multi-head is the obvious extension: run several heads in parallel, each with its own Q/K/V projections, then concatenate their outputs and project back through a learned W_O matrix.
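
A sketch of that wiring in PyTorch, assuming batched input of shape (batch, seq_len, d_model); the class and variable names are illustrative:

import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # Q, K, V projections for all heads at once
        self.w_o = nn.Linear(d_model, d_model)      # the learned W_O output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split the channel dimension into heads: (B, n_heads, T, d_head)
        q, k, v = (t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        weights = (q @ k.transpose(-2, -1) / self.d_head ** 0.5).softmax(dim=-1)
        out = (weights @ v).transpose(1, 2).reshape(B, T, C)  # concatenate heads back
        return self.w_o(out)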

Different heads specialize. Some attend to the previous token (syntax). Some attend to a specific word elsewhere in the sequence (semantics). Some attend mostly to the token's own position (identity). The model doesn't need to be told which — it learns the routing from gradients.

Chapter 9 — Multi-head and residuals walks through head splitting, head concatenation, and the residual connection that makes the whole stack trainable.

4. Residuals and LayerNorm

Two pieces of wiring keep the Transformer trainable at depth.

  • Residual connections: output = input + sublayer(input). The block contributes a delta to the input rather than replacing it. Gradients flow through the addition cleanly; depth stops being a bottleneck.
  • LayerNorm: normalizes each token's activation vector to mean 0, std 1, then applies a learned scale and shift. Stabilizes activation scales as the stack gets deep. Modern Transformers apply LayerNorm before attention and feed-forward (the "pre-norm" variant), which is more stable than the post-norm in the original paper. Both pieces are sketched together below.
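
A minimal sketch of the pre-norm wiring, wrapping an arbitrary sublayer (the class name is illustrative):

import torch.nn as nn

class PreNormResidual(nn.Module):
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)  # per-token normalization
        self.sublayer = sublayer           # e.g. attention or the feed-forward network

    def forward(self, x):
        return x + self.sublayer(self.norm(x))  # the block contributes a delta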

Both are covered in chapter 9 with side-by-side visualizations of trained vs. untrained gradients.

5. The feed-forward network

After attention does its routing, each token is processed independently by a small MLP — two linear layers with a non-linearity (usually GELU) between them. This is where most of the model's parameters live in practice: the FFN's hidden dimension is typically 4× the model dimension.
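
A sketch of that standard shape in PyTorch (the function name is illustrative; the 4× multiplier is a convention, not a requirement):

import torch.nn as nn

def feed_forward(d_model, mult=4):
    return nn.Sequential(
        nn.Linear(d_model, mult * d_model),  # expand to the hidden dimension
        nn.GELU(),                           # non-linearity
        nn.Linear(mult * d_model, d_model),  # project back to the model dimension
    )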

6. The full block

Putting it all together, one Transformer block is:

x = x + Attention(LayerNorm(x))    # attention residual
x = x + FFN(LayerNorm(x))          # feed-forward residual

That's it. Stack N of these blocks (12 for GPT-2 small, 96 for GPT-3) and you have a Transformer.
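
In PyTorch, reusing the MultiHeadAttention and feed_forward sketches above (again illustrative, not the chapters' exact code):

import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)  # from the sketch in section 3
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = feed_forward(d_model)                  # from the sketch in section 5

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # attention residual
        x = x + self.ffn(self.ln2(x))   # feed-forward residual
        return x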

Chapter 10 — The full Transformer block assembles the block end-to-end and runs a forward pass with cleanly traced shapes.

7. Where to go next

Work through the three runnable chapters in order: Chapter 8 — An attention head by hand, Chapter 9 — Multi-head and residuals, and Chapter 10 — The full Transformer block.

Frequently asked questions

What's inside a Transformer block?

Six operations in a specific order. LayerNorm, then multi-head attention, then a residual connection adds the input back. Then LayerNorm, then a small feed-forward MLP, then another residual. That's it — the rest is repeating this block N times.

Why are there three vectors per token (Q, K, V) in attention?

A query says "what am I looking for", a key advertises "what I represent", and a value carries "what to contribute". Splitting these three roles lets the model decide attention weights from queries-and-keys while the actual content moves through values.

What does multi-head attention add over a single head?

A single head learns one attention pattern. Multi-head runs several in parallel, each with its own projections — letting different heads attend to different patterns (syntax, identity, long-range), then combining the results.

Why does the Transformer use LayerNorm instead of BatchNorm?

LayerNorm normalizes per-token, so it works regardless of batch size — important for variable-length sequences and inference where batch size can be 1. BatchNorm would couple tokens across the batch, which is the wrong invariant for language.

What does the causal mask do?

It prevents each token from attending to future tokens during training. We set future-position scores to negative infinity before softmax, so they end up at zero probability. This is what makes a Transformer a *decoder* — it can only see the past.
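
A minimal sketch of the mask itself, with an illustrative sequence length:

import torch

T = 4
scores = torch.randn(T, T)  # raw query-key match scores
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # True above the diagonal = future positions
weights = scores.masked_fill(mask, float('-inf')).softmax(dim=-1)  # future positions get zero weight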