Causal mask
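The constraint that token i can attend only to positions j ≤ i, never to later tokens j > i. It matters most during training, when the whole sequence is processed in parallel and future tokens would otherwise be visible. Implemented by setting future-position scores to −∞ before softmax, so those positions receive zero attention weight. This masking is what makes a transformer a *decoder*.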
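As a rough illustration of the −∞ trick, here is a minimal sketch in plain Python. The function name `causal_attention_weights` and the all-zero score matrix are assumptions made up for this example, not part of any library; real implementations operate on batched tensors rather than nested lists.

```python
import math

def causal_attention_weights(scores):
    """Mask a square matrix of raw attention scores causally, then softmax each row.

    scores[i][j] is the raw score of query position i against key position j.
    (Hypothetical helper, written only for illustration.)
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions (j > i) are set to -inf, so exp(-inf) = 0 after softmax.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        # Subtract the row maximum for numerical stability; it is always finite
        # because position i itself is never masked.
        row_max = max(masked)
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Made-up 3-token example: equal raw scores everywhere.
weights = causal_attention_weights([[0.0] * 3 for _ in range(3)])
print(weights[0])  # [1.0, 0.0, 0.0]
print(weights[1])  # [0.5, 0.5, 0.0]
```

Each row still sums to 1, but all of the weight falls on the current and earlier positions: token 0 sees only itself, token 1 splits its attention between positions 0 and 1, and so on.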