The loss curve

Transformer

Neural network architecture built from a stack of identical blocks, each combining multi-head self-attention and a position-wise feed-forward network (FFN), wrapped in residual connections and layer normalization. It has been the dominant architecture for language models since its introduction in 2017.
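The block structure above can be sketched as a single forward pass. This is a minimal NumPy illustration, not any particular library's implementation: the weight matrices are random stand-ins for learned parameters, the head count and model width are arbitrary, and it uses the pre-norm variant (`x + Attn(LN(x))`) with a ReLU FFN.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(x, n_heads):
    seq, d = x.shape
    dh = d // n_heads
    # Random projections stand in for learned Q, K, V, and output weights.
    rng = np.random.default_rng(0)
    Wq, Wk, Wv, Wo = (rng.normal(0, d**-0.5, (d, d)) for _ in range(4))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Split the model dimension into heads: (n_heads, seq, dh).
    split = lambda t: t.reshape(seq, n_heads, dh).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)
    out = softmax(scores) @ v                       # (n_heads, seq, dh)
    out = out.transpose(1, 0, 2).reshape(seq, d)    # merge heads back
    return out @ Wo

def transformer_block(x, n_heads=4):
    # Pre-norm block: x + Attn(LN(x)), then x + FFN(LN(x)).
    rng = np.random.default_rng(1)
    d = x.shape[-1]
    W1 = rng.normal(0, d**-0.5, (d, 4 * d))         # FFN expands 4x ...
    W2 = rng.normal(0, (4 * d)**-0.5, (4 * d, d))   # ... then projects back
    x = x + multi_head_attention(layer_norm(x), n_heads)
    ffn = np.maximum(layer_norm(x) @ W1, 0) @ W2    # ReLU FFN
    return x + ffn

x = np.random.default_rng(2).normal(size=(8, 32))   # 8 tokens, d_model=32
y = transformer_block(x)
print(y.shape)  # (8, 32): same shape in and out, so blocks stack
```

Because each block maps a `(seq, d_model)` array to another array of the same shape, blocks can be stacked arbitrarily deep, which is what "stacked blocks" refers to.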