
Build a Transformer from scratch

Build a Transformer block from scratch — Q/K/V attention, multi-head, residuals, LayerNorm, feed-forward. Runnable in the browser, then in PyTorch.

A Transformer block is six small operations stacked in a specific order. Most explanations skip past the assembly. This guide builds each part in isolation — attention, multi-head, residuals, LayerNorm, feed-forward — then puts them together in a working PyTorch model. Three runnable chapters back this page.

1. The Transformer in one sentence

The Transformer is a stack of identical blocks. Each block lets every token in the sequence look at every other token (via attention), then independently processes each token (via a feed-forward network). Residual connections and LayerNorm keep gradients flowing as the stack gets deep.

2. The attention head

The simplest version of attention takes three projections of the input — Query, Key, Value — and computes a weighted sum of values. The weights come from how well each query matches each key.

The exact formula:

A = softmax(Q · Kᵀ / √d_k) · V
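
A minimal PyTorch sketch of this formula (the function name and shapes here are illustrative, not the chapter's code):

import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) projections of the same input
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # how well each query matches each key
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                             # weighted sum of values

# e.g. attention(torch.randn(5, 8), torch.randn(5, 8), torch.randn(5, 8)) has shape (5, 8)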

Chapter 8 — An attention head by hand builds this one operation at a time, with interactive visualizations of the attention matrix and the causal mask.

3. Multi-head attention

Once you have one attention head, multi-head is the obvious extension: run several heads in parallel, each with its own Q/K/V projections, then concatenate their outputs and project back through a learned W_O matrix.
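
A sketch of that wiring in PyTorch, assuming batched input of shape (batch, seq_len, d_model); the class and variable names are illustrative:

import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # Q, K, V projections for all heads at once
        self.w_o = nn.Linear(d_model, d_model)      # the learned W_O output projection

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split the channel dimension into heads: (B, n_heads, T, d_head)
        q, k, v = (t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        weights = (q @ k.transpose(-2, -1) / self.d_head ** 0.5).softmax(dim=-1)
        out = (weights @ v).transpose(1, 2).reshape(B, T, C)  # concatenate heads back
        return self.w_o(out)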

Different heads specialize. Some attend to the previous token (syntax). Some attend to a specific word elsewhere in the sequence (semantics). Some attend mostly to the token's own position (identity). The model doesn't need to be told which — it learns the routing from gradients.

Chapter 9 — Multi-head and residuals walks through head splitting, head concatenation, and the residual connection that makes the whole stack trainable.

4. Residuals and LayerNorm

Two pieces of wiring keep the Transformer trainable at depth.

  • Residual connections: output = input + sublayer(input). The block contributes a delta to the input rather than replacing it. Gradients flow through the addition cleanly; depth stops being a bottleneck.
  • LayerNorm: normalizes each token's activation vector to mean 0, std 1, then applies a learned scale and shift. Stabilizes activation scales as the stack gets deep. Modern Transformers apply LayerNorm before attention and feed-forward (the "pre-norm" variant), which is more stable than the post-norm in the original paper. Both pieces are sketched together below.
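
A minimal sketch of the pre-norm wiring, wrapping an arbitrary sublayer (the class name is illustrative):

import torch.nn as nn

class PreNormResidual(nn.Module):
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)  # per-token normalization
        self.sublayer = sublayer           # e.g. attention or the feed-forward network

    def forward(self, x):
        return x + self.sublayer(self.norm(x))  # the block contributes a delta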

Both are covered in chapter 9 with side-by-side visualizations of trained vs. untrained gradients.

5. The feed-forward network

After attention does its routing, each token is processed independently by a small MLP — two linear layers with a non-linearity (usually GELU) between them. This is where most of the model's parameters live in practice: the FFN's hidden dimension is typically 4× the model dimension.
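
A sketch of that standard shape in PyTorch (the function name is illustrative; the 4× multiplier is a convention, not a requirement):

import torch.nn as nn

def feed_forward(d_model, mult=4):
    return nn.Sequential(
        nn.Linear(d_model, mult * d_model),  # expand to the hidden dimension
        nn.GELU(),                           # non-linearity
        nn.Linear(mult * d_model, d_model),  # project back to the model dimension
    )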

6. The full block

Putting it all together, one Transformer block is:

x = x + Attention(LayerNorm(x))    # attention residual
x = x + FFN(LayerNorm(x))          # feed-forward residual

That's it. Stack N of these blocks (12 for GPT-2 small, 96 for GPT-3) and you have a Transformer.
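
In PyTorch, reusing the MultiHeadAttention and feed_forward sketches above (again illustrative, not the chapters' exact code):

import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)  # from the sketch in section 3
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = feed_forward(d_model)                  # from the sketch in section 5

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # attention residual
        x = x + self.ffn(self.ln2(x))   # feed-forward residual
        return x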

Chapter 10 — The full Transformer block assembles the block end-to-end and runs a forward pass with cleanly traced shapes.

7. Where to go next

Work through the three runnable chapters in order: Chapter 8 — An attention head by hand, Chapter 9 — Multi-head and residuals, and Chapter 10 — The full Transformer block.

Frequently asked questions

What's inside a Transformer block?

Six operations in a specific order. LayerNorm, then multi-head attention, then a residual connection adds the input back. Then LayerNorm, then a small feed-forward MLP, then another residual. That's it — the rest is repeating this block N times.

Why are there three vectors per token (Q, K, V) in attention?

A query says "what am I looking for", a key advertises "what I represent", and a value carries "what to contribute". Splitting these three roles lets the model decide attention weights from queries-and-keys while the actual content moves through values.

What does multi-head attention add over a single head?

A single head learns one attention pattern. Multi-head runs several in parallel, each with its own projections — letting different heads attend to different patterns (syntax, identity, long-range), then combining the results.

Why does the Transformer use LayerNorm instead of BatchNorm?

LayerNorm normalizes per-token, so it works regardless of batch size — important for variable-length sequences and inference where batch size can be 1. BatchNorm would couple tokens across the batch, which is the wrong invariant for language.

What does the causal mask do?

It prevents each token from attending to future tokens during training. We set future-position scores to negative infinity before softmax, so they end up at zero probability. This is what makes a Transformer a *decoder* — it can only see the past.
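
A minimal sketch of the mask itself, with an illustrative sequence length:

import torch

T = 4
scores = torch.randn(T, T)  # raw query-key match scores
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # True above the diagonal = future positions
weights = scores.masked_fill(mask, float('-inf')).softmax(dim=-1)  # future positions get zero weight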