The loss curve

Bigram

A pair of consecutive tokens. The simplest unit of context a language model can use.

A bigram model assigns a probability to a token based only on the token immediately before it: P(w_t | w_{t-1}). It cannot capture anything past one position back, but it's already a working language model.
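A count-based version of this model can be built by tallying how often each token follows each other token and normalizing. A minimal sketch in Python (the toy corpus and function name are invented for illustration):

```python
from collections import Counter, defaultdict

# Toy corpus, treated as one flat token sequence (illustrative data only).
tokens = "the cat sat on the mat the cat ran".split()

# Count each (previous token, next token) pair.
counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def bigram_prob(prev, nxt):
    """P(nxt | prev) = count(prev, nxt) / count(prev, *)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

print(bigram_prob("the", "cat"))  # "the" is followed by "cat" in 2 of its 3 occurrences
```

In this corpus, "the" appears three times with a following token, twice followed by "cat", so P(cat | the) = 2/3. Real bigram models add smoothing so unseen pairs don't get probability zero.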