Multi-head attention

Running several attention heads in parallel, each with its own Q/K/V projections, then concatenating their outputs and projecting through a learned W_O.
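
The definition above can be sketched in plain Python. This is a minimal, dependency-free illustration and not a reference implementation. The entry does not specify several details, so the following are assumptions: each head uses scaled dot-product self-attention, the head width is d_model / num_heads, and there are no masks or biases. All function and variable names (attention_head, multi_head_attention, w_q, w_k, w_v, w_o) are hypothetical, and the weights are random, not learned.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, w_q, w_k, w_v):
    """One head: project x to Q, K, V, then apply scaled dot-product attention.

    The 1/sqrt(d_head) scaling is an assumption; the entry does not state it.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(q[0])
    k_t = [list(col) for col in zip(*k)]   # transpose K
    scores = matmul(q, k_t)                # (seq_len x seq_len) similarity scores
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    return matmul(weights, v)              # weighted sum of value vectors

def multi_head_attention(x, heads, w_o):
    """heads is a list of (w_q, w_k, w_v) triples, one triple per head."""
    outputs = [attention_head(x, *h) for h in heads]
    # Concatenate the head outputs along the feature axis, token by token.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    # Mix the concatenated heads through the output projection W_O.
    return matmul(concat, w_o)

def random_matrix(rows, cols, rng):
    """Stand-in for learned weights: uniform random values in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads              # assumed split of the model width

x = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(num_heads)]
w_o = random_matrix(num_heads * d_head, d_model, rng)

y = multi_head_attention(x, heads, w_o)    # one d_model-wide vector per token
```

Each head here is computed in a Python loop for readability. The heads are independent of one another, which is why practical implementations typically batch them into a single tensor operation.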
