Chapter 10 · 16 min

The complete transformer block

Assemble attention, residuals, and feed-forward layers into the first local, GPT-like model skeleton.

You have all the pieces. Let's assemble the real unit.

A transformer block is the unit you stack to build a transformer. Each block does two things:

Multi-head attention: each token can look at the others.
Feed-forward network (FFN): each token representation is processed independently.

Both are wrapped in the residual + LayerNorm machinery from chapter 9. The output keeps the same shape as the input, which is what lets you stack identical blocks.

1. The feed-forward network

Each block contains a per-token MLP:

\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2

The hidden dimension is often 4× d_model. The FFN does not mix tokens with one another; only attention does that. Its role is to process, one position at a time, whatever attention brought back.

Code · JavaScript

function gelu(x) {
  return 0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (x + 0.044715 * x * x * x)));
}

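// Per-token MLP. The FFN weights W1, b1, W2, b2 are assumed to be in scope.
function ffn(x) {
  return x.map((row) => {
    // Hidden layer: h = GELU(row · W1 + b1)
    const h = b1.map((bv, j) => {
      let s = bv;
      for (let k = 0; k < row.length; k++) s += row[k] * W1[k][j];
      return gelu(s);
    });
    // Output layer: h · W2 + b2, back to d_model values per token
    return b2.map((bv, j) => {
      let s = bv;
      for (let k = 0; k < h.length; k++) s += h[k] * W2[k][j];
      return s;
    });
  });
}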

The output has the same shape as the input: [seq_len × d_model].
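To see concretely that the FFN never mixes tokens, here is a minimal Python sketch of the same per-token MLP. The toy sizes and weight values are made up for illustration; only the structure (linear, GELU, linear) comes from the text above.

```python
import math


def gelu(x: float) -> float:
    """Tanh approximation of GELU, the same formula as the JavaScript above."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def ffn(x, w1, b1, w2, b2):
    """Apply linear -> GELU -> linear to each row (token) of x independently."""
    out = []
    for row in x:
        hidden = [gelu(b + sum(r * w1[k][j] for k, r in enumerate(row)))
                  for j, b in enumerate(b1)]
        out.append([b + sum(h * w2[k][j] for k, h in enumerate(hidden))
                    for j, b in enumerate(b2)])
    return out


# Hypothetical toy sizes: d_model = 2, hidden = 4 * d_model = 8, seq_len = 3.
d_model, d_hidden = 2, 8
w1 = [[0.1 * (k + 1) - 0.05 * j for j in range(d_hidden)] for k in range(d_model)]
b1 = [0.01 * j for j in range(d_hidden)]
w2 = [[0.02 * (k - j) for j in range(d_model)] for k in range(d_hidden)]
b2 = [0.0] * d_model
x = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]

y = ffn(x, w1, b1, w2, b2)
# Reversing the token order just reverses the output rows.
swapped = ffn([x[2], x[1], x[0]], w1, b1, w2, b2)
print(swapped == [y[2], y[1], y[0]])
```

Because each output row depends only on its own input row, reordering the tokens only reorders the outputs. Attention is the only place where rows interact.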

2. The complete block

We compose attention and the FFN using the pre-norm pattern:

\begin{aligned} x' &= x + \text{attention}(\text{LayerNorm}(x)) \\ x'' &= x' + \text{FFN}(\text{LayerNorm}(x')) \end{aligned}

Code · JavaScript

function matmul(A, B) {
  return A.map((row) =>
    B[0].map((_, j) => row.reduce((s, x, k) => s + x * B[k][j], 0))
  );
}

function rowSoftmax(rows) {
  return rows.map((row) => {
    let max = -Infinity;
    for (const v of row) if (v > max) max = v;
    const exps = row.map((v) => Math.exp(v - max));
    const sum = exps.reduce((a, b) => a + b, 0);
    return exps.map((e) => e / sum);
  });
}

function layerNorm(rows, eps = 1e-5) {
  return rows.map((row) => {
    const mean = row.reduce((a, b) => a + b, 0) / row.length;
    const variance = row.reduce((a, x) => a + (x - mean) * (x - mean), 0) / row.length;
    const std = Math.sqrt(variance + eps);
    return row.map((x) => (x - mean) / std);
  });
}

function attention(input) {
  const headOuts = [];
  const scale = Math.sqrt(dHead);
  for (const head of block.heads) {
    const Q = matmul(input, head.W_Q);
    const K = matmul(input, head.W_K);
    const V = matmul(input, head.W_V);
    const scores = Q.map((qRow) =>
      K.map((kRow) => qRow.reduce((s, x, k) => s + x * kRow[k], 0) / scale)
    );
    headOuts.push(matmul(rowSoftmax(scores), V));
  }

  const concatenated = input.map(() => []);
  for (const headOut of headOuts) {
    for (let i = 0; i < headOut.length; i++) concatenated[i].push(...headOut[i]);
  }
  return matmul(concatenated, block.W_O);
}

function gelu(x) {
  return 0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (x + 0.044715 * x * x * x)));
}

const a = attention(layerNorm(x));
const afterAttn = x.map((row, i) => row.map((v, j) => v + a[i][j]));
const f = ffn(layerNorm(afterAttn));
return afterAttn.map((row, i) => row.map((v, j) => v + f[i][j]));

That is the entire transformer block. It works because the residual stream keeps a stable scale, attention routes information between positions, and the FFN processes each position on its own.
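The pre-norm wiring can also be sketched in isolation, separately from what the sublayers compute. In the Python sketch below, `mean_mixer`, `half`, and `zeros` are hypothetical stand-ins for attention and the FFN, chosen only to keep the example short; they are not the real sublayers.

```python
import math


def layer_norm(rows, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned parameters)."""
    out = []
    for row in rows:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def pre_norm_residual(x, sublayer):
    """Return x + sublayer(LayerNorm(x)), element by element."""
    update = sublayer(layer_norm(x))
    return [[v + u for v, u in zip(row, urow)] for row, urow in zip(x, update)]


def transformer_block(x, attention, ffn):
    """Two pre-norm residual steps in a row: attention first, then the FFN."""
    return pre_norm_residual(pre_norm_residual(x, attention), ffn)


def mean_mixer(rows):
    """Stand-in 'attention': every position receives the average of all rows."""
    avg = [sum(col) / len(rows) for col in zip(*rows)]
    return [avg[:] for _ in rows]


def half(rows):
    """Stand-in per-token 'FFN': scale each row by 0.5."""
    return [[0.5 * v for v in row] for row in rows]


def zeros(rows):
    """Stand-in sublayer that contributes nothing."""
    return [[0.0] * len(row) for row in rows]


x = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
y = transformer_block(x, mean_mixer, half)
unchanged = transformer_block(x, zeros, zeros)
print(len(y), len(y[0]))   # same shape as x
print(unchanged == x)      # with silent sublayers the block is the identity
```

The second check illustrates why the residual path matters: each sublayer only adds an update to the stream, so a block whose sublayers output zero passes its input through untouched.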

3. Stacking blocks + unembedding

The transformer architecture is simply N blocks in sequence, a final LayerNorm, then an unembedding matrix that projects to vocabulary logits.

\begin{aligned} h &= x \\ h &\leftarrow \text{block}_i(h) \quad \text{for } i = 0..N-1 \\ h &\leftarrow \text{LayerNorm}(h) \\ \text{logits} &= h \cdot W_{\text{unembed}} \end{aligned}

Code · JavaScript

function matmul(A, B) {
  return A.map((row) =>
    B[0].map((_, j) => row.reduce((s, x, k) => s + x * B[k][j], 0))
  );
}

function attention(input, block) {
  const headOuts = [];
  const scale = Math.sqrt(dHead);
  for (const head of block.heads) {
    const Q = matmul(input, head.W_Q);
    const K = matmul(input, head.W_K);
    const V = matmul(input, head.W_V);
    const scores = Q.map((qRow) =>
      K.map((kRow) => qRow.reduce((s, x, k) => s + x * kRow[k], 0) / scale)
    );
    headOuts.push(matmul(rowSoftmax(scores), V));
  }
  // Concatenate the head outputs along the feature axis, then project with W_O.
  const concatenated = input.map(() => []);
  for (const headOut of headOuts) {
    for (let i = 0; i < headOut.length; i++) concatenated[i].push(...headOut[i]);
  }
  return matmul(concatenated, block.W_O);
}

function gelu(x) {
  return 0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (x + 0.044715 * x * x * x)));
}

function block(input, idx) {
  const weights = blocks[idx];
  const a = attention(layerNorm(input), weights);
  const afterAttn = input.map((row, i) => row.map((v, j) => v + a[i][j]));
  const f = ffn(layerNorm(afterAttn), weights);
  return afterAttn.map((row, i) => row.map((v, j) => v + f[i][j]));
}

let h = x;
for (let i = 0; i < blocks.length; i++) {
  h = block(h, i);
}
h = layerNorm(h);
return matmul(h, unembedding);

That's a transformer. The weights are random, so the logits don't mean anything yet, but the shape is right. A real LLM is this same architecture plus causal masking, positional embeddings, much larger dimensions, and massive training.
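To follow the shapes through the stack, here is a rough Python sketch. `toy_block` is a hypothetical stand-in for a full block (a single residual update through one square matrix), and the sizes are arbitrary; the point is only that the hidden state stays [seq_len × d_model] until the unembedding turns it into [seq_len × vocab_size].

```python
import math
import random


def matmul(a, b):
    """Multiply an [n x k] list-of-lists by a [k x m] list-of-lists."""
    return [[sum(v * b[k][j] for k, v in enumerate(row)) for j in range(len(b[0]))]
            for row in a]


def layer_norm(rows, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    out = []
    for row in rows:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def toy_block(h, w):
    """Stand-in block (assumption): h + LayerNorm(h) @ w, which preserves the shape."""
    update = matmul(layer_norm(h), w)
    return [[v + u for v, u in zip(row, urow)] for row, urow in zip(h, update)]


def random_matrix(rows, cols, rng):
    """Small random weights, untrained, as in the text."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, vocab_size, n_blocks = 3, 4, 10, 2
x = random_matrix(seq_len, d_model, rng)
block_weights = [random_matrix(d_model, d_model, rng) for _ in range(n_blocks)]
unembedding = random_matrix(d_model, vocab_size, rng)

h = x
for w in block_weights:
    h = toy_block(h, w)          # still [seq_len x d_model]
logits = matmul(layer_norm(h), unembedding)
print(len(logits), len(logits[0]))  # 3 10
```

Each row of `logits` holds one score per vocabulary entry for the corresponding position.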

What we skipped

A real transformer also adds:

Token embeddings: here we started from an X that was already embedded.
Position encoding or RoPE: without position information, attention does not know the order of the tokens.
Causal masking: GPT cannot look at future tokens.
Dropout during training.
Learned LayerNorm parameters.

These are important additions, but they don't change the skeleton.
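As an illustration of one item from the list, causal masking, here is a small Python sketch. It is not the `causal_attention` function from `llm/attention.py`; `causal_softmax` is a hypothetical helper that shows the idea of letting position i see only positions j ≤ i, which is equivalent to setting future scores to −∞ before the softmax.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax where position i only sees positions j <= i."""
    out = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                      # drop the future positions
        top = max(visible)
        exps = [math.exp(v - top) for v in visible]
        total = sum(exps)
        # Future positions get exactly zero attention weight.
        out.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return out


# Made-up attention scores for a 3-token sequence.
scores = [
    [0.0, 5.0, 5.0],
    [1.0, 1.0, 9.0],
    [0.0, 0.0, 0.0],
]
weights = causal_softmax(scores)
for row in weights:
    print(row)
```

The large scores at future positions (5.0 and 9.0) have no effect: the first token attends only to itself, and the second splits its attention evenly between the first two tokens.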

4. Creating the first model skeleton

Create llm/model.py:

"""A tiny GPT-shaped model skeleton.
 
Chapter 12 replaces the list math with PyTorch tensors. The architecture stays:
token embedding, position embedding, transformer blocks, final logits.
"""
from __future__ import annotations
 
from llm.attention import Matrix, causal_attention, matmul
from llm.nn import add, layer_norm, linear, relu
 
 
# [1]
def feed_forward(x: Matrix, w1: Matrix, b1: list[float], w2: Matrix, b2: list[float]) -> Matrix:
    return [linear(relu(linear(row, w1, b1)), w2, b2) for row in x]
 
 
def transformer_block(
    x: Matrix,
    wq: Matrix,
    wk: Matrix,
    wv: Matrix,
    ffn_w1: Matrix,
    ffn_b1: list[float],
    ffn_w2: Matrix,
    ffn_b2: list[float],
) -> Matrix:
    # [2]
    attended = causal_attention(layer_norm(x), wq, wk, wv)
    # [3]
    x = add(x, attended)
    # [4]
    return add(x, feed_forward(layer_norm(x), ffn_w1, ffn_b1, ffn_w2, ffn_b2))
 
 
# [5]
def logits(hidden: Matrix, unembed: Matrix) -> Matrix:
    return matmul(layer_norm(hidden), unembed)

[1] feed_forward applies the same MLP to every token. Note that this skeleton uses relu, whereas the JavaScript demo above uses GELU.
[2] starts with pre-norm attention.
[3] adds the attention update to the residual stream.
[4] repeats the pattern with the FFN.
[5] logits converts the hidden vectors into vocabulary scores.

This file is not yet a trainable model. Its role is to make the architecture concrete before the PyTorch version.

Recap

The FFN is a per-token MLP: linear → GELU → linear.
The block = attention + FFN, both pre-norm + residual.
A transformer = N blocks + final LayerNorm + unembedding.
Your local project now has llm/model.py.
The invariant input shape = output shape lets you stack as many blocks as you need.

Going further

Next step: this is the end of Part III. Part IV begins with preparing a dataset.