The loss curve

Chapter 10 · 16 min

The full transformer block

Attention + feed-forward + residuals + LayerNorm, assembled into the block that GPT stacks N times. End-to-end forward pass.

You have every piece. Let's assemble the actual thing.

A transformer block is the unit you stack to make a transformer. Each block does two things, in order:

  1. Multi-head attention — let every token look at every other token.
  2. Feed-forward network (FFN) — process each token's representation independently.

Both are wrapped in the residual + LayerNorm machinery from chapter 9. The block's output has the same shape as its input, which is what lets us stack identical blocks.

This chapter has three runnable cells: build the FFN, assemble one block, stack two blocks and add an unembedding to get a full forward pass. Then you will create the first llm/model.py in your local repo. It will still be tiny and mostly shape-checking, but it will finally look like a language model.

Same toy sentence as chapters 8–9. d_model = 8, n_heads = 4, ffn_hidden = 16, n_blocks = 2, vocab_size = 5. Tiny — but the architecture is the same as GPT.
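If it helps to see the setup as code, here is the same toy configuration as a plain Python dict (the name config and the comments are just for this page):

# toy configuration used throughout this chapter
config = {
    "d_model": 8,      # width of each token's vector
    "n_heads": 4,      # attention heads per block
    "ffn_hidden": 16,  # FFN hidden width (2x d_model here; real models use ~4x)
    "n_blocks": 2,     # stacked transformer blocks
    "vocab_size": 5,   # size of the toy vocabulary
}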

1. The feed-forward network

Each transformer block has a per-token MLP that runs independently on each token's representation. It has two linear layers with a non-linearity in between:

\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2

The hidden dimension ffn_hidden is usually 4× d_model (so the FFN has a lot more parameters than the attention sublayer: at 4×, its two weight matrices hold roughly 8·d_model² entries, versus about 4·d_model² for attention's projections). The GELU non-linearity is the smooth descendant of ReLU that modern transformers use; we provide it.

Crucially, the FFN does not let tokens see each other. The only cross-token mixing in a transformer block is attention. The FFN's job is to think about what attention pulled in.

Code · JavaScript

The output has the same shape as the input — same [seq_len × d_model]. That's the contract every sublayer in a transformer respects.
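Here is a minimal Python sketch of what the cell computes, written to match the list-based style of llm/model.py below. The tanh approximation of GELU is the standard one; the helper names and loop structure are illustrative, not the chapter's exact cell:

import math

def gelu(v: float) -> float:
    # tanh approximation of GELU (the GPT-2 variant)
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def ffn_row(row, w1, b1, w2, b2):
    # row: [d_model], w1: [d_model][ffn_hidden], w2: [ffn_hidden][d_model]
    hidden = [gelu(sum(x * w1[i][j] for i, x in enumerate(row)) + b1[j])
              for j in range(len(b1))]
    return [sum(h * w2[i][j] for i, h in enumerate(hidden)) + b2[j]
            for j in range(len(b2))]

def feed_forward(x, w1, b1, w2, b2):
    # applied independently to every token row: no cross-token mixing
    return [ffn_row(row, w1, b1, w2, b2) for row in x]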

2. The full block

Now compose attention and FFN with the pre-norm residual recipe:

\begin{aligned} x' &= x + \text{attention}(\text{LayerNorm}(x)) \\ x'' &= x' + \text{FFN}(\text{LayerNorm}(x')) \end{aligned}

The cell below has the multi-head attention and FFN wired up with their weights pre-bound, so it can show the composition directly.

Code · JavaScript

This is the entire transformer block. It works because of the combination of three properties: the residual stream keeps a stable scale (LayerNorm + add), attention lets information flow across positions, and the FFN lets each position think about that information independently. The pre-norm LayerNorms and residual adds are what keep activation scales consistent as blocks stack.
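Sketched in Python, using the feed_forward from the previous sketch (add and layer_norm are written out so the snippet stands alone, with γ fixed at 1 and β at 0, the simplification this chapter uses):

import math

def add(a, b):
    # residual add: elementwise sum of two [seq_len x d_model] matrices
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-5):
    # normalize each row to zero mean / unit variance (gamma=1, beta=0)
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def transformer_block(x, attention, ffn):
    # attention and ffn are pre-bound callables (weights already baked in),
    # each mapping [seq_len x d_model] -> [seq_len x d_model]
    x = add(x, attention(layer_norm(x)))  # sublayer 1: mix across tokens
    return add(x, ffn(layer_norm(x)))     # sublayer 2: per-token MLP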

3. Stack blocks + unembedding

The transformer architecture is just N blocks in sequence, with a final LayerNorm and an unembedding matrix that projects the final hidden state to vocabulary logits.

\begin{aligned} h &= x \\ h &\leftarrow \text{block}_i(h) \quad \text{for } i = 0..N-1 \\ h &\leftarrow \text{LayerNorm}(h) \\ \text{logits} &= h \cdot W_{\text{unembed}} \end{aligned}

The output is [seq_len × vocab_size]. One logit row per input position. For autoregressive generation, you sample from the last row's softmax.

Code · JavaScript

That's a transformer. The model has random weights, so the bar plot at the bottom is meaningless — but the shape is right. A real LLM is this same architecture with a few additions (causal masking, positional encoding, much bigger numbers), trained on billions of tokens of text.
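As a Python sketch, reusing layer_norm from the block sketch above (matmul and softmax are written out; blocks stands for any list of pre-bound block callables):

def matmul(a, b):
    # [n x k] @ [k x m] -> [n x m]
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, blocks, w_unembed):
    h = x
    for block in blocks:          # h stays [seq_len x d_model] throughout
        h = block(h)
    h = layer_norm(h)             # final LayerNorm
    return matmul(h, w_unembed)   # [seq_len x vocab_size] logits

# for generation: a distribution over the next token after the last position
# next_token_probs = softmax(forward(x, blocks, w_unembed)[-1])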

What we skipped

A real transformer would also have:

  • Token embeddings. We started with X as if it were already embedded. A real model first looks up each token's row in an embedding matrix.
  • Positional encoding (or RoPE). Attention is permutation-equivariant — without positional information, the model can't tell "the cat sat" from "sat cat the". Real models add or interleave a position signal.
  • Causal masking. In a decoder-only transformer (GPT-style), token i isn't allowed to attend to tokens j > i during training. Implemented by setting future-position scores to −∞ before softmax (sketched below).
  • Dropout during training.
  • Layer-norm parameters (γ, β) that are learned, not fixed at 1 and 0.

Most of these are one-line additions. None changes the shape of the architecture. The thing you just built is the core; the rest is plumbing.
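For instance, a causal mask is a few lines on top of the attention-score matrix. This sketch assumes scores is the pre-softmax [seq_len × seq_len] matrix from the attention chapter:

def causal_mask(scores):
    # position i may attend only to positions j <= i;
    # -inf scores become exactly 0 after softmax (math.exp(-inf) == 0.0)
    n = len(scores)
    return [[scores[i][j] if j <= i else float("-inf")
             for j in range(n)] for i in range(n)]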

4. Create your first model skeleton

Create llm/model.py:

"""A tiny GPT-shaped model skeleton.
 
Chapter 12 replaces the list math with PyTorch tensors. The architecture stays:
token embedding, position embedding, transformer blocks, final logits.
"""
from __future__ import annotations
 
from llm.attention import Matrix, causal_attention, matmul
from llm.nn import add, layer_norm, linear, relu
 
 
# [1]
def feed_forward(x: Matrix, w1: Matrix, b1: list[float], w2: Matrix, b2: list[float]) -> Matrix:
    return [linear(relu(linear(row, w1, b1)), w2, b2) for row in x]
 
 
def transformer_block(
    x: Matrix,
    wq: Matrix,
    wk: Matrix,
    wv: Matrix,
    ffn_w1: Matrix,
    ffn_b1: list[float],
    ffn_w2: Matrix,
    ffn_b2: list[float],
) -> Matrix:
    # [2]
    attended = causal_attention(layer_norm(x), wq, wk, wv)
    # [3]
    x = add(x, attended)
    # [4]
    return add(x, feed_forward(layer_norm(x), ffn_w1, ffn_b1, ffn_w2, ffn_b2))
 
 
# [5]
def logits(hidden: Matrix, unembed: Matrix) -> Matrix:
    return matmul(layer_norm(hidden), unembed)

This skeleton is a map of the full model:

  • [1] feed_forward applies the same MLP to each token row. It does not mix positions; attention already did that. (It uses relu from llm.nn where section 1 used GELU; the structure and shapes are identical.)
  • [2] starts the block with pre-norm attention: normalize first, then route information between tokens.
  • [3] adds the attended update back into the residual stream.
  • [4] repeats the same pattern with the feed-forward network: normalize, transform, add back.
  • [5] logits converts hidden vectors into vocabulary scores. One row of logits means “scores for every possible next token at this position”.

This file is intentionally incomplete: no learned initialization, no training, no batching. Its job is to make the architecture concrete before the PyTorch version turns it into something fast and trainable.
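A quick smoke test makes the shape contracts concrete. This is a hypothetical snippet: the random initialization and the square wq/wk/wv shapes are assumptions about the chapter-8 causal_attention helper, not part of the file above:

import random

from llm.model import logits, transformer_block

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

d_model, ffn_hidden, vocab_size, seq_len = 8, 16, 5, 3

x = rand_matrix(seq_len, d_model)
h = transformer_block(
    x,
    rand_matrix(d_model, d_model),     # wq (square shape assumed)
    rand_matrix(d_model, d_model),     # wk
    rand_matrix(d_model, d_model),     # wv
    rand_matrix(d_model, ffn_hidden),  # ffn_w1
    [0.0] * ffn_hidden,                # ffn_b1
    rand_matrix(ffn_hidden, d_model),  # ffn_w2
    [0.0] * d_model,                   # ffn_b2
)
scores = logits(h, rand_matrix(d_model, vocab_size))
assert len(scores) == seq_len and len(scores[0]) == vocab_size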

Recap

  • The FFN is a per-token MLP: linear → GELU → linear. Wider hidden layer (usually 4× d_model).
  • The block = attention sublayer + FFN sublayer, both pre-norm + residual. Output has the same shape as input. - A transformer = N blocks in sequence + final LayerNorm + unembedding. Output is [seq_len × vocab_size] logits. - Your local project now has llm/model.py, the first GPT-shaped skeleton. - The block's invariant — input shape = output shape — is what lets us stack arbitrarily many copies. A modern model has dozens. - Causal masking, positional encoding, real embeddings, dropout are all one-line additions on top of this scaffold. The architectural skeleton is what you just wrote.

Going further

Next up: this is the end of part III. Part IV begins with "Prepare a dataset" — your local project already exists, so now we feed it a real dataset.