The loss curve

Chapter 10 · 16 min

The full transformer block

Attention + feed-forward + residuals + LayerNorm, assembled into the block that GPT stacks N times. End-to-end forward pass.

You have every piece. Let's assemble the actual thing.

A transformer block is the unit you stack to make a transformer. Each block does two things, in order:

  1. Multi-head attention — let every token look at every other token.
  2. Feed-forward network (FFN) — process each token's representation independently.

Both are wrapped in the residual + LayerNorm machinery from chapter 9. The block's output has the same shape as its input, which is what lets us stack identical blocks.

This chapter has three runnable cells: build the FFN, assemble one block, stack two blocks and add an unembedding to get a full forward pass. Then you will create the first llm/model.py in your local repo. It will still be tiny and mostly shape-checking, but it will finally look like a language model.

Same toy sentence as chapters 8–9. d_model = 8, n_heads = 4, ffn_hidden = 16, n_blocks = 2, vocab_size = 5. Tiny — but the architecture is the same as GPT.
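If it helps to see the setup as code, here is the same toy configuration as a plain Python dict (the name config and the comments are just for this page):

# toy configuration used throughout this chapter
config = {
    "d_model": 8,      # width of each token's vector
    "n_heads": 4,      # attention heads per block
    "ffn_hidden": 16,  # FFN hidden width (2x d_model here; real models use ~4x)
    "n_blocks": 2,     # stacked transformer blocks
    "vocab_size": 5,   # size of the toy vocabulary
}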

1. The feed-forward network

Each transformer block has a per-token MLP that runs independently on each token's representation. It has two linear layers with a non-linearity in between:

\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2

The hidden dimension ffn_hidden is usually 4× d_model (so the FFN has a lot more parameters than the attention sublayer: at 4×, its two weight matrices hold roughly 8·d_model² entries, versus about 4·d_model² for attention's projections). The GELU non-linearity is the smooth descendant of ReLU that modern transformers use; we provide it.

Crucially, the FFN does not let tokens see each other. The only cross-token mixing in a transformer block is attention. The FFN's job is to think about what attention pulled in.

Code · JavaScript

The output has the same shape as the input — same [seq_len × d_model]. That's the contract every sublayer in a transformer respects.
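Here is a minimal Python sketch of what the cell computes, written to match the list-based style of llm/model.py below. The tanh approximation of GELU is the standard one; the helper names and loop structure are illustrative, not the chapter's exact cell:

import math

def gelu(v: float) -> float:
    # tanh approximation of GELU (the GPT-2 variant)
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def ffn_row(row, w1, b1, w2, b2):
    # row: [d_model], w1: [d_model][ffn_hidden], w2: [ffn_hidden][d_model]
    hidden = [gelu(sum(x * w1[i][j] for i, x in enumerate(row)) + b1[j])
              for j in range(len(b1))]
    return [sum(h * w2[i][j] for i, h in enumerate(hidden)) + b2[j]
            for j in range(len(b2))]

def feed_forward(x, w1, b1, w2, b2):
    # applied independently to every token row: no cross-token mixing
    return [ffn_row(row, w1, b1, w2, b2) for row in x]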

2. The full block

Now compose attention and FFN with the pre-norm residual recipe:

\begin{aligned} x' &= x + \text{attention}(\text{LayerNorm}(x)) \\ x'' &= x' + \text{FFN}(\text{LayerNorm}(x')) \end{aligned}

The cell below has the multi-head attention and FFN wired up with their weights pre-bound, so it can show the composition directly.

Code · JavaScript

This is the entire transformer block. It works because of the combination of three properties: the residual stream keeps a stable scale (LayerNorm + add), attention lets information flow across positions, and the FFN lets each position think about that information independently. The pre-norm LayerNorms and residual adds are what keep activation scales consistent as blocks stack.
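Sketched in Python, using the feed_forward from the previous sketch (add and layer_norm are written out so the snippet stands alone, with γ fixed at 1 and β at 0, the simplification this chapter uses):

import math

def add(a, b):
    # residual add: elementwise sum of two [seq_len x d_model] matrices
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-5):
    # normalize each row to zero mean / unit variance (gamma=1, beta=0)
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def transformer_block(x, attention, ffn):
    # attention and ffn are pre-bound callables (weights already baked in),
    # each mapping [seq_len x d_model] -> [seq_len x d_model]
    x = add(x, attention(layer_norm(x)))  # sublayer 1: mix across tokens
    return add(x, ffn(layer_norm(x)))     # sublayer 2: per-token MLP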

3. Stack blocks + unembedding

The transformer architecture is just N blocks in sequence, with a final LayerNorm and an unembedding matrix that projects the final hidden state to vocabulary logits.

\begin{aligned} h &= x \\ h &\leftarrow \text{block}_i(h) \quad \text{for } i = 0..N-1 \\ h &\leftarrow \text{LayerNorm}(h) \\ \text{logits} &= h \cdot W_{\text{unembed}} \end{aligned}

The output is [seq_len × vocab_size]. One logit row per input position. For autoregressive generation, you sample from the last row's softmax.

Code · JavaScript

That's a transformer. The model has random weights, so the bar plot at the bottom is meaningless — but the shape is right. A real LLM is this same architecture with a few additions (causal masking, positional encoding, much bigger numbers), trained on billions of tokens of text.
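As a Python sketch, reusing layer_norm from the block sketch above (matmul and softmax are written out; blocks stands for any list of pre-bound block callables):

def matmul(a, b):
    # [n x k] @ [k x m] -> [n x m]
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, blocks, w_unembed):
    h = x
    for block in blocks:          # h stays [seq_len x d_model] throughout
        h = block(h)
    h = layer_norm(h)             # final LayerNorm
    return matmul(h, w_unembed)   # [seq_len x vocab_size] logits

# for generation: a distribution over the next token after the last position
# next_token_probs = softmax(forward(x, blocks, w_unembed)[-1])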

What we skipped

A real transformer would also have:

  • Token embeddings. We started with X as if it were already embedded. A real model first looks up each token's row in an embedding matrix.
  • Positional encoding (or RoPE). Attention is permutation-equivariant — without positional information, the model can't tell "the cat sat" from "sat cat the". Real models add or interleave a position signal.
  • Causal masking. In a decoder-only transformer (GPT-style), token i isn't allowed to attend to tokens j > i during training. Implemented by setting future-position scores to −∞ before softmax (sketched below).
  • Dropout during training.
  • Layer-norm parameters (γ, β) that are learned, not fixed at 1 and 0.

Most of these are one-line additions. None changes the shape of the architecture. The thing you just built is the core; the rest is plumbing.
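For instance, a causal mask is a few lines on top of the attention-score matrix. This sketch assumes scores is the pre-softmax [seq_len × seq_len] matrix from the attention chapter:

def causal_mask(scores):
    # position i may attend only to positions j <= i;
    # -inf scores become exactly 0 after softmax (math.exp(-inf) == 0.0)
    n = len(scores)
    return [[scores[i][j] if j <= i else float("-inf")
             for j in range(n)] for i in range(n)]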

4. Create your first model skeleton

Create llm/model.py:

"""A tiny GPT-shaped model skeleton.
 
Chapter 12 replaces the list math with PyTorch tensors. The architecture stays:
token embedding, position embedding, transformer blocks, final logits.
"""
from __future__ import annotations
 
from llm.attention import Matrix, causal_attention, matmul
from llm.nn import add, layer_norm, linear, relu
 
 
# [1]
def feed_forward(x: Matrix, w1: Matrix, b1: list[float], w2: Matrix, b2: list[float]) -> Matrix:
    return [linear(relu(linear(row, w1, b1)), w2, b2) for row in x]
 
 
def transformer_block(
    x: Matrix,
    wq: Matrix,
    wk: Matrix,
    wv: Matrix,
    ffn_w1: Matrix,
    ffn_b1: list[float],
    ffn_w2: Matrix,
    ffn_b2: list[float],
) -> Matrix:
    # [2]
    attended = causal_attention(layer_norm(x), wq, wk, wv)
    # [3]
    x = add(x, attended)
    # [4]
    return add(x, feed_forward(layer_norm(x), ffn_w1, ffn_b1, ffn_w2, ffn_b2))
 
 
# [5]
def logits(hidden: Matrix, unembed: Matrix) -> Matrix:
    return matmul(layer_norm(hidden), unembed)

This skeleton is a map of the full model:

  • [1] feed_forward applies the same MLP to each token row. It does not mix positions; attention already did that. (It uses relu from llm.nn where section 1 used GELU; the structure and shapes are identical.)
  • [2] starts the block with pre-norm attention: normalize first, then route information between tokens.
  • [3] adds the attended update back into the residual stream.
  • [4] repeats the same pattern with the feed-forward network: normalize, transform, add back.
  • [5] logits converts hidden vectors into vocabulary scores. One row of logits means “scores for every possible next token at this position”.

This file is intentionally incomplete: no learned initialization, no training, no batching. Its job is to make the architecture concrete before the PyTorch version turns it into something fast and trainable.
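A quick smoke test makes the shape contracts concrete. This is a hypothetical snippet: the random initialization and the square wq/wk/wv shapes are assumptions about the chapter-8 causal_attention helper, not part of the file above:

import random

from llm.model import logits, transformer_block

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

d_model, ffn_hidden, vocab_size, seq_len = 8, 16, 5, 3

x = rand_matrix(seq_len, d_model)
h = transformer_block(
    x,
    rand_matrix(d_model, d_model),     # wq (square shape assumed)
    rand_matrix(d_model, d_model),     # wk
    rand_matrix(d_model, d_model),     # wv
    rand_matrix(d_model, ffn_hidden),  # ffn_w1
    [0.0] * ffn_hidden,                # ffn_b1
    rand_matrix(ffn_hidden, d_model),  # ffn_w2
    [0.0] * d_model,                   # ffn_b2
)
scores = logits(h, rand_matrix(d_model, vocab_size))
assert len(scores) == seq_len and len(scores[0]) == vocab_size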

Recap

  • The FFN is a per-token MLP: linear → GELU → linear. Wider hidden layer (usually 4× d_model).
  • The block = attention sublayer + FFN sublayer, both pre-norm + residual. Output has the same shape as input. - A transformer = N blocks in sequence + final LayerNorm + unembedding. Output is [seq_len × vocab_size] logits. - Your local project now has llm/model.py, the first GPT-shaped skeleton. - The block's invariant — input shape = output shape — is what lets us stack arbitrarily many copies. A modern model has dozens. - Causal masking, positional encoding, real embeddings, dropout are all one-line additions on top of this scaffold. The architectural skeleton is what you just wrote.

Going further

Next up: this is the end of part III. Part IV begins with "Prepare a dataset" — your local project already exists, so now we feed it a real dataset.