Attention
The mechanism that lets a token look at other tokens. It computes a weighted sum of value vectors, with each weight measuring how relevant that token is to the current position.
Scaled dot-product attention: A = softmax(QKᵀ / √d_k), and the output is A · V. Q, K, V are learned projections of the input; dividing by √d_k keeps the dot products from growing large and saturating the softmax. The attention matrix A says, for every token, how much it draws from every other token.
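A minimal NumPy sketch of the formula above. The projection matrices W_q, W_k, W_v, the toy dimensions, and the function names are illustrative assumptions, not any particular library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) relevance scores
    A = softmax(scores, axis=-1)        # each row sums to 1: how much each token draws from every other
    return A @ V                        # weighted sum of value vectors

# Illustrative usage: project a toy input X into Q, K, V with random matrices (assumption).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)  # (4, 8): one output vector per token
```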