The loss curve

Multi-head attention

Multi-head attention runs several attention heads in parallel, each with its own Q/K/V projections, then concatenates their outputs and projects the result through a learned W_O.
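
To make the data flow concrete, here is a minimal NumPy sketch of the parallel heads, the concatenation, and the final W_O projection. It assumes standard scaled dot-product attention inside each head, a single unbatched sequence, and no masking or biases. The function name, the stacked weight layout, and the toy sizes are illustrative choices, not anything the text above prescribes.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, W_q, W_k, W_v, W_o):
    """Hypothetical helper for illustration.

    x:             (seq_len, d_model)
    W_q, W_k, W_v: (num_heads, d_model, d_head), one projection per head
    W_o:           (num_heads * d_head, d_model)
    """
    num_heads, _, d_head = W_q.shape
    seq_len = x.shape[0]

    # Each head applies its own Q/K/V projection: (num_heads, seq_len, d_head).
    q = np.einsum("sd,hdk->hsk", x, W_q)
    k = np.einsum("sd,hdk->hsk", x, W_k)
    v = np.einsum("sd,hdk->hsk", x, W_v)

    # Scaled dot-product attention per head (assumed; the scaling is not
    # stated in the text above): scores are (num_heads, seq_len, seq_len).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, seq_len, d_head)

    # Concatenate the heads along the feature axis, then project with W_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, num_heads * d_head)
    return concat @ W_o, weights


# Toy sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads

x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(num_heads, d_model, d_head)) for _ in range(3))
W_o = rng.normal(size=(num_heads * d_head, d_model))

out, weights = multi_head_attention(x, W_q, W_k, W_v, W_o)
print(out.shape)      # (4, 8)
print(weights.shape)  # (2, 4, 4)
```

Stacking the per-head weights along a leading axis lets one `einsum` compute every head's projection at once. Slicing a single `(d_model, d_model)` matrix into heads is an equivalent layout that many implementations use instead.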