The loss curve

Training

Adjusting a model's parameters so that it does its job better on a given dataset. For a language model, that usually means lowering next-token loss across the training set.

Training a counts-table bigram is just incrementing counters. Training a neural network is running gradient descent on millions to trillions of parameters. The underlying loop — measure how wrong you are, change something to be less wrong, repeat — is the same.
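That shared loop can be sketched in a few lines. The bigram trainer below really is just incrementing counters; the gradient-descent trainer is a toy one-parameter regression (not a language model) chosen to keep the "measure, adjust, repeat" loop visible. All names here are illustrative, not from the original.

```python
from collections import defaultdict

def train_bigram(tokens):
    """Training a counts table: see a pair, bump its counter."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1
    return counts

def train_gd(pairs, lr=0.01, steps=200):
    """Same loop, gradient flavor: fit y = w * x by descending
    the squared-error loss. One parameter instead of trillions,
    but the shape of training is identical."""
    w = 0.0
    for _ in range(steps):
        for x, y in pairs:
            err = w * x - y        # measure how wrong we are
            w -= lr * 2 * err * x  # change w to be less wrong
    return w

counts = train_bigram(list("abab"))   # counts['a']['b'] == 2
w = train_gd([(1.0, 2.0), (2.0, 4.0)])  # w converges toward 2.0
```

Both functions walk the data, compare the model's current answer to the truth, and make the model a little less wrong; only the "adjust" step differs.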