Skip to content
The loss curve

Feed-forward network

The per-token MLP inside a transformer block: two linear layers with a non-linearity between them, applied independently to every position. Where most of the model's parameters live.