The loss curve

Layer normalization

Normalizes each token's activation vector to zero mean and unit variance across the feature dimension, then applies a learned per-feature scale and shift. This keeps activation scales stable from layer to layer, which stabilizes training.
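A minimal NumPy sketch of the operation described above; `gamma` and `beta` stand in for the learned scale and shift parameters, and `eps` is the usual small constant that guards against division by zero:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token's activation vector over its feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learned per-feature scale (gamma) and shift (beta).
    return gamma * x_hat + beta

# One token with a 4-dimensional activation vector.
x = np.array([[1.0, 2.0, 3.0, 4.0]])
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(), out.std())  # per-token mean ~0, std ~1
```

With `gamma = 1` and `beta = 0` the output is exactly the normalized activations; during training the network is free to learn other values, so normalization does not limit what the layer can represent.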