Feed-forward network
The per-token MLP inside a transformer block: two linear layers with a non-linearity between them, applied independently to every position. Where most of the model's parameters live.
Continue
The per-token MLP inside a transformer block: two linear layers with a non-linearity between them, applied independently to every position. Where most of the model's parameters live.
Continue