Chapter 10 · 16 min
The full transformer block
Attention + feed-forward + residuals + LayerNorm, assembled into the block that GPT stacks N times. End-to-end forward pass.
You have every piece. Let's assemble the actual thing.
A transformer block is the unit you stack to make a transformer. Each block does two things, in order:
- Multi-head attention — let every token look at every other token.
- Feed-forward network (FFN) — process each token's representation independently.

Both are wrapped in the residual + LayerNorm machinery from chapter 9. The block's output has the same shape as its input, which is what lets us stack identical blocks.
This chapter has three runnable cells: build the FFN, assemble one block, stack two blocks and add an unembedding to get a full forward pass. Then you will create the first llm/model.py in your local repo. It will still be tiny and mostly shape-checking, but it will finally look like a language model.
Same toy sentence as chapters 8–9.
d_model = 8, n_heads = 4, ffn_hidden = 16, n_blocks = 2, vocab_size = 5. Tiny — but the architecture is the same as GPT.
1. The feed-forward network
Each transformer block has a per-token MLP that runs independently on each token's representation. It has two linear layers with a non-linearity in between:

FFN(x) = W2 · GELU(W1 · x + b1) + b2
The hidden dimension ffn_hidden is usually 4× d_model (so the FFN has a lot more parameters than the attention sublayer). The GELU non-linearity is the smooth descendant of ReLU that modern transformers use; we provide it.
Crucially, the FFN does not let tokens see each other. The only cross-token mixing in a transformer block is attention. The FFN's job is to think about what attention pulled in.
Code · JavaScript
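What this cell computes, sketched in Python to match the llm/ code later in the chapter (the cell itself is JavaScript; the [in_dim × out_dim] weight layout and the tanh GELU approximation are assumptions of this sketch, not the cell's exact code):

```python
import math

def gelu(v: float) -> float:
    # smooth descendant of ReLU; standard tanh approximation
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v**3)))

def ffn_row(row: list[float], w1, b1, w2, b2) -> list[float]:
    # linear -> GELU -> linear on a single token's d_model-sized vector
    hidden = [gelu(sum(x * w1[i][j] for i, x in enumerate(row)) + b1[j]) for j in range(len(b1))]
    return [sum(h * w2[i][j] for i, h in enumerate(hidden)) + b2[j] for j in range(len(b2))]

def feed_forward(X, w1, b1, w2, b2):
    # applied row by row: no token ever sees another token here
    return [ffn_row(row, w1, b1, w2, b2) for row in X]
```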
The output has the same shape as the input — same [seq_len × d_model]. That's the contract every sublayer in a transformer respects.
2. The full block
Now compose attention and FFN with the pre-norm residual recipe:
The chapter wires the multi-head attention and FFN with their weights pre-bound; the cell shows the composition directly.
Code · JavaScript
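The composition the cell shows, as a sketch (attention and ffn arrive as functions with their weights already bound; add and layer_norm are the chapter-9 helpers from llm.nn):

```python
from llm.nn import add, layer_norm

def block(x, attention, ffn):
    # sublayer 1: pre-norm, mix information across tokens, add back
    x = add(x, attention(layer_norm(x)))
    # sublayer 2: pre-norm, per-token FFN, add back
    x = add(x, ffn(layer_norm(x)))
    return x  # same [seq_len × d_model] shape as the input
```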
This is the entire transformer block. The reason it works is the combination of three properties: the residual stream stays unchanged in scale (LayerNorm + add), attention lets information flow across positions, and the FFN lets each position think about that information independently. The layer_norm and add helpers keep activation scales consistent.
3. Stack blocks + unembedding
The transformer architecture is just N blocks in sequence, with a final LayerNorm and an unembedding matrix that projects the final hidden state to vocabulary logits.
The output is [seq_len × vocab_size]. One logit row per input position. For autoregressive generation, you sample from the last row's softmax.
Code · JavaScript
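A sketch of the full forward pass the cell runs (blocks is a list of pre-bound block functions like the one above; layer_norm and matmul are the helpers from llm.nn and llm.attention):

```python
from llm.attention import matmul
from llm.nn import layer_norm

def forward(x, blocks, unembed):
    for block in blocks:           # each block preserves [seq_len × d_model]
        x = block(x)
    x = layer_norm(x)              # final LayerNorm
    return matmul(x, unembed)      # unembed: [d_model × vocab_size] -> logits [seq_len × vocab_size]
```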
That's a transformer. The model has random weights, so the bar plot at the bottom is meaningless — but the shape is right. A real LLM is this same architecture with a few additions (causal masking, positional encoding, much bigger numbers) trained on billions of tokens of text.
What we skipped
A real transformer would also have:
- Token embeddings. We started with X as if it were already embedded. A real model first looks up each token's row in an embedding matrix.
- Positional encoding (or RoPE). Attention is permutation-equivariant — without positional information, the model can't tell "the cat sat" from "sat cat the". Real models add or interleave a position signal.
- Causal masking. In a decoder-only transformer (GPT-style), token i isn't allowed to attend to tokens j > i during training. Implemented by setting future-position scores to −∞ before softmax (sketched after this list).
- Dropout during training.
- Layer-norm parameters (γ, β) that are learned, not fixed at 0/1.
Most of these are one-line additions. None changes the shape of the architecture. The thing you just built is the core; the rest is plumbing.
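For instance, the causal mask is that one line applied to the [seq_len × seq_len] score matrix before softmax (a sketch, following the score-matrix layout from chapter 8):

```python
def causal_mask(scores: list[list[float]]) -> list[list[float]]:
    # position i may only attend to positions j <= i; future scores become -inf,
    # so softmax gives them zero weight
    return [
        [s if j <= i else float("-inf") for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]
```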
4. Create your first model skeleton
Create llm/model.py:
"""A tiny GPT-shaped model skeleton.
Chapter 12 replaces the list math with PyTorch tensors. The architecture stays:
token embedding, position embedding, transformer blocks, final logits.
"""
from __future__ import annotations
from llm.attention import Matrix, causal_attention, matmul
from llm.nn import add, layer_norm, linear, relu
# [1]
def feed_forward(x: Matrix, w1: Matrix, b1: list[float], w2: Matrix, b2: list[float]) -> Matrix:
return [linear(relu(linear(row, w1, b1)), w2, b2) for row in x]
def transformer_block(
x: Matrix,
wq: Matrix,
wk: Matrix,
wv: Matrix,
ffn_w1: Matrix,
ffn_b1: list[float],
ffn_w2: Matrix,
ffn_b2: list[float],
) -> Matrix:
# [2]
attended = causal_attention(layer_norm(x), wq, wk, wv)
# [3]
x = add(x, attended)
# [4]
return add(x, feed_forward(layer_norm(x), ffn_w1, ffn_b1, ffn_w2, ffn_b2))
# [5]
def logits(hidden: Matrix, unembed: Matrix) -> Matrix:
return matmul(layer_norm(hidden), unembed)This skeleton is a map of the full model:
- [1] feed_forward applies the same MLP to each token row. It does not mix positions; attention already did that.
- [2] starts the block with pre-norm attention: normalize first, then route information between tokens.
- [3] adds the attended update back into the residual stream.
- [4] repeats the same pattern with the feed-forward network: normalize, transform, add back.
- [5] logits converts hidden vectors into vocabulary scores. One row of logits means “scores for every possible next token at this position”.
This file is intentionally incomplete: no learned initialization, no training, no batching. Its job is to make the architecture concrete before the PyTorch version turns it into something fast and trainable.
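A quick way to exercise the skeleton's shape contract (illustrative only; it assumes the linear, causal_attention and matmul helpers from the earlier chapters treat weight matrices as [in_dim × out_dim]):

```python
import random

from llm.model import logits, transformer_block

def rand(rows: int, cols: int) -> list[list[float]]:
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

d_model, ffn_hidden, vocab_size, seq_len = 8, 16, 5, 4

x = rand(seq_len, d_model)  # stand-in for already-embedded tokens
h = transformer_block(
    x,
    rand(d_model, d_model), rand(d_model, d_model), rand(d_model, d_model),  # wq, wk, wv
    rand(d_model, ffn_hidden), [0.0] * ffn_hidden,                           # ffn_w1, ffn_b1
    rand(ffn_hidden, d_model), [0.0] * d_model,                              # ffn_w2, ffn_b2
)
scores = logits(h, rand(d_model, vocab_size))

assert len(h) == seq_len and len(h[0]) == d_model                # block preserves shape
assert len(scores) == seq_len and len(scores[0]) == vocab_size   # one logit row per position
```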
Recap
- The FFN is a per-token MLP: linear → GELU → linear. Wider hidden layer (usually 4× d_model).
- The block = attention sublayer + FFN sublayer, both pre-norm + residual. Output has the same shape as input.
- A transformer = N blocks in sequence + final LayerNorm + unembedding. Output is [seq_len × vocab_size] logits.
- Your local project now has llm/model.py, the first GPT-shaped skeleton.
- The block's invariant — input shape = output shape — is what lets us stack arbitrarily many copies. A modern model has dozens.
- Causal masking, positional encoding, real embeddings, dropout are all one-line additions on top of this scaffold. The architectural skeleton is what you just wrote.
Going further
- Vaswani et al., "Attention Is All You Need" (2017). The full transformer paper.
- Karpathy's "Let's build GPT from scratch" — builds the same model end to end in PyTorch.
- The Annotated Transformer — paper-by-line implementation with every modification spelled out.
- Step by Token, chapter 5 covers the transformer from the understanding angle.
Next up: this is the end of part III. Part IV begins with "prepare a dataset" — your local project already exists, so now we feed it a real dataset.