Feed-forward network

The per-token MLP inside a transformer block: two linear layers with a non-linearity between them, applied independently and with the same weights to every position. In a standard dense transformer, this is where most of the model's parameters live.
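A minimal sketch of the definition above, in plain Python with no libraries. The names `init_ffn`, `ffn_token`, `ffn`, and `param_count` are hypothetical, and the GELU activation (tanh approximation), the bias terms, and the small Gaussian initialization are assumptions made for illustration; real models vary in all three.

```python
import math
import random


def gelu(x):
    # Tanh approximation of GELU; the choice of non-linearity is an assumption.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(x, weight, bias):
    # x: vector of length d_in; weight: d_in rows by d_out columns; bias: length d_out.
    return [
        sum(xi * row[j] for xi, row in zip(x, weight)) + bias[j]
        for j in range(len(bias))
    ]


def init_ffn(d_model, d_ff, seed=0):
    # Hypothetical initializer: small Gaussian weights, zero biases.
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": matrix(d_model, d_ff), "b1": [0.0] * d_ff,
        "w2": matrix(d_ff, d_model), "b2": [0.0] * d_model,
    }


def ffn_token(x, params):
    # First linear layer expands to d_ff, the non-linearity is applied
    # element-wise, and the second linear layer projects back to d_model.
    hidden = [gelu(h) for h in linear(x, params["w1"], params["b1"])]
    return linear(hidden, params["w2"], params["b2"])


def ffn(sequence, params):
    # The same weights are applied to each position independently:
    # no information moves between tokens here.
    return [ffn_token(x, params) for x in sequence]


def param_count(d_model, d_ff):
    # Weights and biases of both linear layers.
    return d_model * d_ff + d_ff + d_ff * d_model + d_model
```

If the hidden width is four times the model width (a common choice, assumed here rather than stated in the definition), the two weight matrices hold about 8 × d_model² parameters, compared with about 4 × d_model² for the four attention projection matrices in the same block. That ratio is the usual basis for the claim about where the parameters live.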
