The loss curve

Lexicon

Every technical term used across the chapters. Hover any underlined word in a chapter to get a quick definition; this page holds the longer ones.

Adam
The most popular optimizer in modern deep learning. Maintains per-parameter running averages of the gradient (m) and the squared gradient (v), and normalizes the step by √v.
Combines momentum (first moment) with per-dimension adaptive scaling (second moment). Bias-corrects both. Default optimizer for transformers. AdamW is Adam with weight decay applied outside the gradient.
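A minimal NumPy sketch of a single Adam update (the hyperparameter defaults are the usual ones, not taken from any particular chapter):

import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # running averages of the gradient (first moment) and squared gradient (second moment)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # bias-correct both, then scale the step per dimension by 1/sqrt(v_hat)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v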
Alignment
The work of turning a next-token predictor into a useful assistant: SFT (supervised fine-tuning) on human-written examples, then RLHF or DPO on human preference data.
Attention
The mechanism that lets a token look at other tokens. Computes a weighted sum of value vectors, weighted by how relevant each is to the current position.
Scaled dot-product attention: A = softmax(QKᵀ / √d_k) · V. Q, K, V are projections of the input. The softmaxed attention matrix says, for every token, how much it draws from every other token.
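A toy NumPy version of that formula, with Q, K, V as [seq_len, d_k] arrays (names are illustrative):

import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])                       # QK^T / sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                # softmax over keys
    return weights @ V                                            # weighted sum of value vectors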
Backpropagation
Algorithm that computes the gradient of the loss with respect to every parameter by walking the chain rule backwards through the network's operations.
Bigram
A pair of consecutive tokens. The simplest unit of context a language model can use.
A bigram model assigns a probability to a token based only on the token immediately before it: P(w_t | w_{t-1}). It cannot capture anything past one position back, but it's already a working language model.
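A minimal counts-table bigram in plain Python (the six-word corpus is made up for illustration):

from collections import defaultdict

tokens = "the cat sat on the mat".split()
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1                      # training is just incrementing counters

def p_next(prev, nxt):
    row = counts[prev]
    total = sum(row.values())
    return row[nxt] / total if total else 0.0

p_next("the", "cat")   # 0.5 — "the" is followed once by "cat" and once by "mat"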
Block size
Maximum context length the model attends over during training. Determines how many tokens of history the model can see.
Byte-Pair Encoding (BPE)
Tokenization scheme that starts with characters and iteratively merges the most-frequent adjacent pair. Produces subword tokens.
Originally a 1994 data-compression algorithm. Used in GPT-2/3/4 and most modern LLMs. The merges discover morphology — suffixes like 'ing' or 'ed' emerge naturally as common subwords.
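One merge step on a toy word list, to show the mechanism (real BPE repeats this thousands of times and records each merge):

from collections import Counter

def merge_step(words):
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))             # count adjacent symbol pairs
    best = max(pairs, key=pairs.get)            # most frequent pair wins
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                out.append(w[i] + w[i + 1])     # fuse the pair into one symbol
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged, best

words = [list("lower"), list("lowest"), list("newer")]
words, pair = merge_step(words)   # first merge here is ('w', 'e'); repeat to grow subwords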
Causal mask
The constraint that token i cannot attend to tokens j > i during training. Implemented by setting future-position scores to −∞ before softmax. What makes a transformer a *decoder*.
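A sketch of the masking itself, applied to a random score matrix (sizes are illustrative):

import numpy as np

seq_len = 4
scores = np.random.randn(seq_len, seq_len)
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)   # True where j > i
scores[mask] = -np.inf                                         # future positions scored -inf
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)                 # row i only draws from j <= i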
Corpus
The body of text used to train, inspect, or evaluate a language model. Can be a sentence, a book, or trillions of words scraped from the web.
Cosine similarity
Measure of how aligned two vectors are: cos(a, b) = (a · b) / (‖a‖·‖b‖). Value 1 means same direction, 0 orthogonal, −1 opposite.
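As a small NumPy function (illustrative):

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))   # 0.0 — orthogonal vectors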
Cross-entropy
The standard loss for classification: −Σ y·log(p). Designed to pair cleanly with softmax/sigmoid outputs.
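For a one-hot target y the sum collapses to a single term; a minimal sketch:

import math

def cross_entropy(true_index, probs):
    return -math.log(probs[true_index])   # -log of the probability given to the true class

cross_entropy(2, [0.1, 0.2, 0.7])   # ~0.36 — low loss, the model favored the right answer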
Embedding
A dense low-dimensional vector representing a token. Words used in similar contexts end up with similar vectors.
Embeddings replace one-hot encodings of tokens with continuous vectors that capture semantic relationships. The geometry of the embedding space is meaningful: directions can encode features like gender, formality, register.
Entropy
A measure of how spread-out a probability distribution is. H = -Σ p log p. Low entropy = concentrated on a few outcomes; high entropy = nearly uniform.
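A small sketch in plain Python (natural log, so the units are nats):

import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)   # 0·log 0 treated as 0

entropy([0.5, 0.5])      # ~0.693 — spread evenly over two outcomes
entropy([0.99, 0.01])    # ~0.056 — concentrated on one outcome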
Feed-forward network
The per-token MLP inside a transformer block: two linear layers with a non-linearity between them, applied independently to every position. Where most of the model's parameters live.
GELU
Smooth variant of ReLU used in transformers: GELU(x) = x·Φ(x), where Φ is the standard Gaussian CDF.
Generation
Producing a new sequence of tokens by sampling the model one step at a time. Each emitted token feeds back as input for the next step.
Gradient
Vector of partial derivatives of the loss with respect to every parameter. Tells you which way (and how much) to nudge each parameter to lower the loss.
Gradient descent
Optimization procedure: subtract a small multiple of the gradient from the parameters at each step.
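The whole procedure on a one-parameter toy loss, (w − 3)², with a made-up learning rate:

w, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)     # derivative of (w - 3)^2
    w = w - lr * grad      # subtract a small multiple of the gradient
# w is now essentially 3, the minimum of the loss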
Hyperparameter
A number that controls training but isn't learned by the optimizer: learning rate, batch size, hidden size, dropout probability, etc. You set them; you don't gradient-descent them.
Inference
Running a trained model on new inputs to get predictions. Opposite of training — no parameters move.
Input shape
The dimensions of the input tensor to a neural network layer. For example, [batch_size, sequence_length] for text inputs.
Kneser-Ney
Smarter smoothing that subtracts a small fixed discount from every seen count and redistributes the mass through a lower-order fallback based on how many distinct contexts each token appears after, rather than a uniform one.
Language model
A model that assigns a probability to the next token given the tokens that came before. Generation is repeated sampling from that probability.
Every modern LLM is a language model: input is a sequence of tokens, output is a probability distribution over the vocabulary. The bigram in chapter 1, the transformer in chapter 10, and GPT-4 all share that interface — only the function in the middle differs.
Laplace smoothing
Simplest smoothing: add a constant α to every cell of the counts table before normalizing.
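Applied to a tiny made-up counts table:

import numpy as np

counts = np.array([[2.0, 0.0], [1.0, 3.0]])            # hypothetical bigram counts
alpha = 1.0
probs = (counts + alpha) / (counts + alpha).sum(axis=1, keepdims=True)
# the never-seen (0, 1) transition now has probability 0.25 instead of 0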
Layer normalization
Normalizes each token's activation vector to mean 0, std 1. Stabilizes activation scales across layers.
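A minimal version over the last axis (real layers also learn a per-dimension scale and shift):

import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)      # each token's vector -> mean 0, std 1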
Learning rate
Scalar that scales each gradient-descent step. Too small and training crawls; too large and the loss diverges. The single most-important hyperparameter.
LLM (large language model)
A language model with enough parameters and training data to produce coherent multi-paragraph text. Modern LLMs are transformers with billions to trillions of parameters.
There is no exact size threshold — "large" is a moving target. In practice, the term covers transformer-based language models from a few hundred million parameters upward, trained on hundreds of billions of tokens.
LoRA
Low-Rank Adaptation. Fine-tune a model without retraining its weights: freeze W, learn a small A·B update. Cuts trainable parameters by ~100×.
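A shape-level sketch with made-up sizes, to show where the savings come from:

import numpy as np

d_out, d_in, r = 512, 512, 8
W = np.random.randn(d_out, d_in)        # frozen pretrained weight, never updated
A = np.random.randn(d_out, r) * 0.01    # trainable
B = np.zeros((r, d_in))                 # trainable, zero-init so the update starts at 0

def forward(x):
    return (W + A @ B) @ x              # effective weight is W plus the low-rank A·B

# trainable numbers: (d_out + d_in) * r = 8,192 vs. d_out * d_in = 262,144 frozen ones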
Loss
A number that says how badly the model is predicting the right answers. Training minimizes it.
For language models, the per-token loss is typically -log(probability) the model assigned to the true next token. Average it across a sequence and you get a per-token number you can compare across runs and datasets. Cross-entropy loss is the standard formulation.
Matrix multiplication
(A · B)[i,j] = Σ_k A[i,k] · B[k,j]. The arithmetic core of every neural network layer; modern GPUs and Apple Silicon have dedicated paths to make it fast.
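The definition written out as three loops in plain Python (libraries do the same arithmetic, just far faster):

def matmul(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])   # [[19, 22], [43, 50]]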
MLP (multi-layer perceptron)
Two or more linear layers stacked with non-linearities between them. Can learn non-linear boundaries unlike a single neuron.
Momentum
Trick that accumulates a running velocity of recent gradients. Smooths the trajectory when consecutive steps reinforce, dampens it when they oscillate.
Multi-head attention
Running several attention heads in parallel, each with its own Q/K/V projections, then concatenating their outputs and projecting through a learned W_O.
Neuron
Weighted sum of inputs followed by a non-linearity: σ(Σ w_i·x_i + b). The smallest learnable unit in a neural network.
One-hot encoding
A vector of zeros with a single 1 at the position of the token. Wasteful and asserts that every pair of distinct tokens is equally unrelated.
Out-of-vocabulary
A token (or pair) that never appeared during training. Unsmoothed models can't assign it any probability.
Parameter
One of the model's learnable numbers. Modern LLMs have billions to trillions; the bigram model in chapter 1 has |vocab|² of them (one per cell of the counts table).
Perplexity
Geometric mean of inverse-probability over a sequence. Low when the model assigns high probability to the tokens it sees; infinite when any one token has probability zero.
Reported as exp(mean negative log-likelihood). Common evaluation metric for language models. A perplexity of 50 means the model is, on average, as confused as if it had to choose uniformly between 50 equally-likely tokens.
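Computed from the probabilities the model gave to the true tokens (the numbers here are made up):

import math

def perplexity(probs):
    nll = [-math.log(p) for p in probs]          # per-token negative log-likelihood
    return math.exp(sum(nll) / len(nll))         # exp of the mean

perplexity([0.02] * 10)   # 50.0 — like picking uniformly among 50 tokens each step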
Probability distribution
A list of non-negative numbers that sum to 1, one per possible outcome. A language model's output is one such distribution over the vocabulary.
Quantization
Storing weights as low-precision integers (INT8, INT4) instead of floats. Cuts model size 4–8× and speeds up inference, usually with minimal quality loss.
Query / Key / Value
Three projections of the input used in attention. Queries ask, keys advertise, values contribute.
ReLU
Rectified linear unit: max(0, x). The standard non-linearity inside modern neural networks because it's cheap and avoids vanishing gradients.
Residual connection
output = input + sublayer(input). Lets gradients flow cleanly through deep stacks; was the key innovation that made deep CNNs practical, now in every transformer.
Sampling
Drawing a random value from a probability distribution. In a language model, picking the next token by rolling a number against the model's output.
Sampling strategies (greedy, temperature, top-k, top-p) all answer the same question — given a probability distribution over the vocabulary, which token do we actually emit? — but trade off determinism for diversity differently.
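A sketch combining temperature and top-k over a logits vector (parameter names are illustrative):

import numpy as np

def sample(logits, temperature=1.0, top_k=None):
    logits = np.asarray(logits, dtype=float) / temperature   # <1 sharpens, >1 flattens
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]
        logits[logits < cutoff] = -np.inf                    # drop everything outside the top k
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)             # greedy would be argmax instead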
Scaling law
Empirical observation that model quality improves predictably with parameters, training tokens, and compute. The Chinchilla rule: optimal tokens ≈ 20 × parameters.
Seed token
The first token (or short prompt) you hand to a language model to start generation. Everything after is sampled from the model.
Sigmoid
Squashing function σ(x) = 1 / (1 + e^(-x)) that maps any real number into (0, 1). The classical activation for binary classification outputs.
Skip-gram
Algorithm that learns word embeddings by pushing each center word's vector toward the vectors of its neighbors in the corpus.
Smoothing
Family of techniques that ensure every possible token transition gets a positive probability, even ones never seen during training.
Without smoothing, an n-gram model collapses to perplexity = ∞ as soon as it sees an unseen transition on the validation set. Laplace add-α and Kneser-Ney are the two classic methods.
Softmax
Normalizes a vector of real numbers into a probability distribution: exp(x_i) / Σ exp(x_j). Used at the end of attention and at the model's output.
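In NumPy, with the usual max-subtraction for numerical stability (it doesn't change the result):

import numpy as np

def softmax(x):
    e = np.exp(np.asarray(x, dtype=float) - np.max(x))
    return e / e.sum()

softmax([2.0, 1.0, 0.1])   # three positive numbers summing to 1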
Subword
Token that is shorter than a word but longer than a character. The granularity BPE settles on.
Token
The basic unit a model reads and writes. Often a subword, sometimes a word, sometimes a single character.
Tokenization is the first step in any language model pipeline. Whitespace tokenizers split on spaces; BPE tokenizers find a granularity between words and characters by learning which subwords appear most often.
Tokenizer
The function that splits raw text into tokens. Ranges from naive whitespace + lowercase to learned schemes like BPE.
Training
Adjusting a model's parameters so that it does its job better on a given dataset. For a language model, that usually means lowering next-token loss across the training set.
Training a counts-table bigram is just incrementing counters. Training a neural network is running gradient descent on millions to trillions of parameters. The underlying loop — measure how wrong you are, change something to be less wrong, repeat — is the same.
Transformer
Neural network architecture built from stacked blocks of (multi-head attention + FFN) with residuals and layer norms. The dominant architecture for language models since 2017.
Validation set
Held-out portion of the data the model never sees during training. Used to estimate metrics like perplexity on data the model can't have memorized.
A typical split is 80% train / 10% validation / 10% test. Validation drives decisions during development (which hyperparameters? when to stop?); test is touched only once at the end to report a final number.
Vocabulary
The set of all distinct tokens a model can read or produce. Size ranges from ~80 (character-level) to ~100,000 (modern subword tokenizers).