
Loss

A number measuring how badly the model predicts the right answers. Training minimizes it.

For language models, the per-token loss is typically the negative log of the probability the model assigned to the true next token, i.e. -log(p). Averaging across a sequence gives a comparable per-token number; assigning probability 0.25 to the correct token, for example, costs -log(0.25) ≈ 1.39 nats. Cross-entropy loss is the standard formulation.
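
A minimal sketch of this computation in plain Python. The names (per_token_loss, step_probs, target_ids) are illustrative, not from any particular library:

    import math

    def per_token_loss(step_probs, target_ids):
        # step_probs: one dict per position, mapping token id -> predicted probability
        # target_ids: the true next-token id at each position
        # Per-token loss is -log(p) of the true token; the mean is the
        # average cross-entropy per token.
        losses = [-math.log(probs[t]) for probs, t in zip(step_probs, target_ids)]
        return sum(losses) / len(losses)

    # Toy example: three positions where the model gave the true token
    # probability 0.5, 0.25, and 0.8 respectively.
    step_probs = [{7: 0.5}, {3: 0.25}, {9: 0.8}]
    target_ids = [7, 3, 9]
    print(per_token_loss(step_probs, target_ids))  # ≈ 0.768 nats per token

In practice a model outputs a full distribution over the vocabulary at each step, but only the probability of the true token enters the loss, which is why the sketch keeps just that entry.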