The loss curve

Lexicon

Every technical term used across the chapters. Hover any underlined word in a chapter to get a quick definition; this page holds the longer ones.

Adam
The most popular optimizer in modern deep learning. Maintains per-parameter running averages of the gradient (m) and the squared gradient (v), and normalizes the step by √v.
Combines momentum (first moment) with per-dimension adaptive scaling (second moment). Bias-corrects both. Default optimizer for transformers. AdamW is Adam with weight decay applied outside the gradient.
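A minimal NumPy sketch of a single Adam update (the hyperparameter defaults are the usual ones, not taken from any particular chapter):

import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # running averages of the gradient (first moment) and squared gradient (second moment)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # bias-correct both, then scale the step per dimension by 1/sqrt(v_hat)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v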
Alignment
The work of turning a next-token predictor into a useful assistant: SFT (supervised fine-tuning) on human-written examples, then RLHF or DPO on human preference data.
Attention
The mechanism that lets a token look at other tokens. Computes a weighted sum of value vectors, weighted by how relevant each is to the current position.
Scaled dot-product attention: A = softmax(QKᵀ / √d_k) · V. Q, K, V are projections of the input. The softmaxed attention matrix says, for every token, how much it draws from every other token.
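A toy NumPy version of that formula, with Q, K, V as [seq_len, d_k] arrays (names are illustrative):

import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])                       # QK^T / sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                # softmax over keys
    return weights @ V                                            # weighted sum of value vectors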
Backpropagation
Algorithm that computes the gradient of the loss with respect to every parameter by walking the chain rule backwards through the network's operations.
Bigram
A pair of consecutive tokens. The simplest unit of context a language model can use.
A bigram model assigns a probability to a token based only on the token immediately before it: P(w_t | w_{t-1}). It cannot capture anything past one position back, but it's already a working language model.
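A minimal counts-table bigram in plain Python (the six-word corpus is made up for illustration):

from collections import defaultdict

tokens = "the cat sat on the mat".split()
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1                      # training is just incrementing counters

def p_next(prev, nxt):
    row = counts[prev]
    total = sum(row.values())
    return row[nxt] / total if total else 0.0

p_next("the", "cat")   # 0.5 — "the" is followed once by "cat" and once by "mat"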
Block size
Maximum context length the model attends over during training. Determines how many tokens of history the model can see.
Byte-Pair Encoding (BPE)
Tokenization scheme that starts with characters and iteratively merges the most-frequent adjacent pair. Produces subword tokens.
Originally a 1994 data-compression algorithm. Used in GPT-2/3/4 and most modern LLMs. The merges discover morphology — suffixes like 'ing' or 'ed' emerge naturally as common subwords.
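One merge step on a toy word list, to show the mechanism (real BPE repeats this thousands of times and records each merge):

from collections import Counter

def merge_step(words):
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))             # count adjacent symbol pairs
    best = max(pairs, key=pairs.get)            # most frequent pair wins
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                out.append(w[i] + w[i + 1])     # fuse the pair into one symbol
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged, best

words = [list("lower"), list("lowest"), list("newer")]
words, pair = merge_step(words)   # first merge here is ('w', 'e'); repeat to grow subwords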
Causal mask
The constraint that token i cannot attend to tokens j > i during training. Implemented by setting future-position scores to −∞ before softmax. What makes a transformer a *decoder*.
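A sketch of the masking itself, applied to a random score matrix (sizes are illustrative):

import numpy as np

seq_len = 4
scores = np.random.randn(seq_len, seq_len)
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)   # True where j > i
scores[mask] = -np.inf                                         # future positions scored -inf
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)                 # row i only draws from j <= i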
Corpus
The body of text used to train, inspect, or evaluate a language model. Can be a sentence, a book, or trillions of words scraped from the web.
Cosine similarity
Measure of how aligned two vectors are: cos(a, b) = (a · b) / (‖a‖·‖b‖). Value 1 means same direction, 0 orthogonal, −1 opposite.
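As a small NumPy function (illustrative):

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))   # 0.0 — orthogonal vectors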
Cross-entropy
The standard loss for classification: −Σ y·log(p). Designed to pair cleanly with softmax/sigmoid outputs.
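For a one-hot target y the sum collapses to a single term; a minimal sketch:

import math

def cross_entropy(true_index, probs):
    return -math.log(probs[true_index])   # -log of the probability given to the true class

cross_entropy(2, [0.1, 0.2, 0.7])   # ~0.36 — low loss, the model favored the right answer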
Embedding
A dense low-dimensional vector representing a token. Words used in similar contexts end up with similar vectors.
Embeddings replace one-hot encodings of tokens with continuous vectors that capture semantic relationships. The geometry of the embedding space is meaningful: directions can encode features like gender, formality, register.
Entropy
A measure of how spread-out a probability distribution is. H = -Σ p log p. Low entropy = concentrated on a few outcomes; high entropy = nearly uniform.
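A small sketch in plain Python (natural log, so the units are nats):

import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)   # 0·log 0 treated as 0

entropy([0.5, 0.5])      # ~0.693 — spread evenly over two outcomes
entropy([0.99, 0.01])    # ~0.056 — concentrated on one outcome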
Feed-forward network
The per-token MLP inside a transformer block: two linear layers with a non-linearity between them, applied independently to every position. Where most of the model's parameters live.
GELU
Smooth variant of ReLU used in transformers: GELU(x) = x·Φ(x), where Φ is the standard Gaussian CDF.
Generation
Producing a new sequence of tokens by sampling the model one step at a time. Each emitted token feeds back as input for the next step.
Gradient
Vector of partial derivatives of the loss with respect to every parameter. Tells you which way (and how much) to nudge each parameter to lower the loss.
Gradient descent
Optimization procedure: subtract a small multiple of the gradient from the parameters at each step.
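The whole procedure on a one-parameter toy loss, (w − 3)², with a made-up learning rate:

w, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)     # derivative of (w - 3)^2
    w = w - lr * grad      # subtract a small multiple of the gradient
# w is now essentially 3, the minimum of the loss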
Hyperparameter
A number that controls training but isn't learned by the optimizer: learning rate, batch size, hidden size, dropout probability, etc. You set them; you don't gradient-descent them.
Inference
Running a trained model on new inputs to get predictions. Opposite of training — no parameters move.
Input shape
The dimensions of the input tensor to a neural network layer. For example, [batch_size, sequence_length] for text inputs.
Kneser-Ney
Smarter smoothing that subtracts a small fixed discount from every seen count and redistributes the mass through a lower-order fallback based on how many distinct contexts each token appears after, rather than a uniform one.
Language model
A model that assigns a probability to the next token given the tokens that came before. Generation is repeated sampling from that probability.
Every modern LLM is a language model: input is a sequence of tokens, output is a probability distribution over the vocabulary. The bigram in chapter 1, the transformer in chapter 10, and GPT-4 all share that interface — only the function in the middle differs.
Laplace smoothing
Simplest smoothing: add a constant α to every cell of the counts table before normalizing.
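Applied to a tiny made-up counts table:

import numpy as np

counts = np.array([[2.0, 0.0], [1.0, 3.0]])            # hypothetical bigram counts
alpha = 1.0
probs = (counts + alpha) / (counts + alpha).sum(axis=1, keepdims=True)
# the never-seen (0, 1) transition now has probability 0.25 instead of 0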
Layer normalization
Normalizes each token's activation vector to mean 0, std 1. Stabilizes activation scales across layers.
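A minimal version over the last axis (real layers also learn a per-dimension scale and shift):

import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)      # each token's vector -> mean 0, std 1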
Learning rate
Scalar that scales each gradient-descent step. Too small and training crawls; too large and the loss diverges. The single most-important hyperparameter.
LLM (large language model)
A language model with enough parameters and training data to produce coherent multi-paragraph text. Modern LLMs are transformers with billions to trillions of parameters.
There is no exact size threshold — "large" is a moving target. In practice, the term covers transformer-based language models from a few hundred million parameters upward, trained on hundreds of billions of tokens.
LoRA
Low-Rank Adaptation. Fine-tune a model without retraining its weights: freeze W, learn a small A·B update. Cuts trainable parameters by ~100×.
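A shape-level sketch with made-up sizes, to show where the savings come from:

import numpy as np

d_out, d_in, r = 512, 512, 8
W = np.random.randn(d_out, d_in)        # frozen pretrained weight, never updated
A = np.random.randn(d_out, r) * 0.01    # trainable
B = np.zeros((r, d_in))                 # trainable, zero-init so the update starts at 0

def forward(x):
    return (W + A @ B) @ x              # effective weight is W plus the low-rank A·B

# trainable numbers: (d_out + d_in) * r = 8,192 vs. d_out * d_in = 262,144 frozen ones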
Loss
A number that says how badly the model is predicting the right answers. Training minimizes it.
For language models, the per-token loss is typically -log(probability) the model assigned to the true next token. Average it across a sequence and you get a per-token number you can compare across runs and datasets. Cross-entropy loss is the standard formulation.
Matrix multiplication
(A · B)[i,j] = Σ_k A[i,k] · B[k,j]. The arithmetic core of every neural network layer; modern GPUs and Apple Silicon have dedicated paths to make it fast.
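The definition written out as three loops in plain Python (libraries do the same arithmetic, just far faster):

def matmul(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])   # [[19, 22], [43, 50]]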
MLP (multi-layer perceptron)
Two or more linear layers stacked with non-linearities between them. Can learn non-linear boundaries unlike a single neuron.
Momentum
Trick that accumulates a running velocity of recent gradients. Smooths the trajectory when consecutive steps reinforce, dampens it when they oscillate.
Multi-head attention
Running several attention heads in parallel, each with its own Q/K/V projections, then concatenating their outputs and projecting through a learned W_O.
Neuron
Weighted sum of inputs followed by a non-linearity: σ(Σ w_i·x_i + b). The smallest learnable unit in a neural network.
One-hot encoding
A vector of zeros with a single 1 at the position of the token. Wasteful and asserts that every pair of distinct tokens is equally unrelated.
Out-of-vocabulary
A token (or pair) that never appeared during training. Unsmoothed models can't assign it any probability.
Parameter
One of the model's learnable numbers. Modern LLMs have billions to trillions; the bigram model in chapter 1 has |vocab|² of them (one per cell of the counts table).
Perplexity
Geometric mean of inverse-probability over a sequence. Low when the model assigns high probability to the tokens it sees; infinite when any one token has probability zero.
Reported as exp(mean negative log-likelihood). Common evaluation metric for language models. A perplexity of 50 means the model is, on average, as confused as if it had to choose uniformly between 50 equally-likely tokens.
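Computed from the probabilities the model gave to the true tokens (the numbers here are made up):

import math

def perplexity(probs):
    nll = [-math.log(p) for p in probs]          # per-token negative log-likelihood
    return math.exp(sum(nll) / len(nll))         # exp of the mean

perplexity([0.02] * 10)   # 50.0 — like picking uniformly among 50 tokens each step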
Probability distribution
A list of non-negative numbers that sum to 1, one per possible outcome. A language model's output is one such distribution over the vocabulary.
Quantization
Storing weights as low-precision integers (INT8, INT4) instead of floats. Cuts model size 4–8× and speeds up inference, usually with minimal quality loss.
Query / Key / Value
Three projections of the input used in attention. Queries ask, keys advertise, values contribute.
ReLU
Rectified linear unit: max(0, x). The standard non-linearity inside modern neural networks because it's cheap and avoids vanishing gradients.
Residual connection
output = input + sublayer(input). Lets gradients flow cleanly through deep stacks; was the key innovation that made deep CNNs practical, now in every transformer.
Sampling
Drawing a random value from a probability distribution. In a language model, picking the next token by rolling a number against the model's output.
Sampling strategies (greedy, temperature, top-k, top-p) all answer the same question — given a probability distribution over the vocabulary, which token do we actually emit? — but trade off determinism for diversity differently.
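A sketch combining temperature and top-k over a logits vector (parameter names are illustrative):

import numpy as np

def sample(logits, temperature=1.0, top_k=None):
    logits = np.asarray(logits, dtype=float) / temperature   # <1 sharpens, >1 flattens
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]
        logits[logits < cutoff] = -np.inf                    # drop everything outside the top k
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)             # greedy would be argmax instead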
Scaling law
Empirical observation that model quality improves predictably with parameters, training tokens, and compute. The Chinchilla rule: optimal tokens ≈ 20 × parameters.
Seed token
The first token (or short prompt) you hand to a language model to start generation. Everything after is sampled from the model.
Sigmoid
Squashing function σ(x) = 1 / (1 + e^(-x)) that maps any real number into (0, 1). The classical activation for binary classification outputs.
Skip-gram
Algorithm that learns word embeddings by pushing each center word's vector toward the vectors of its neighbors in the corpus.
Smoothing
Family of techniques that ensure every possible token transition gets a positive probability, even ones never seen during training.
Without smoothing, an n-gram model collapses to perplexity = ∞ as soon as it sees an unseen transition on the validation set. Laplace add-α and Kneser-Ney are the two classic methods.
Softmax
Normalizes a vector of real numbers into a probability distribution: exp(x_i) / Σ exp(x_j). Used at the end of attention and at the model's output.
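In NumPy, with the usual max-subtraction for numerical stability (it doesn't change the result):

import numpy as np

def softmax(x):
    e = np.exp(np.asarray(x, dtype=float) - np.max(x))
    return e / e.sum()

softmax([2.0, 1.0, 0.1])   # three positive numbers summing to 1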
Subword
Token that is shorter than a word but longer than a character. The granularity BPE settles on.
Token
The basic unit a model reads and writes. Often a subword, sometimes a word, sometimes a single character.
Tokenization is the first step in any language model pipeline. Whitespace tokenizers split on spaces; BPE tokenizers find a granularity between words and characters by learning which subwords appear most often.
Tokenizer
The function that splits raw text into tokens. Ranges from naive whitespace + lowercase to learned schemes like BPE.
Training
Adjusting a model's parameters so that it does its job better on a given dataset. For a language model, that usually means lowering next-token loss across the training set.
Training a counts-table bigram is just incrementing counters. Training a neural network is running gradient descent on millions to trillions of parameters. The underlying loop — measure how wrong you are, change something to be less wrong, repeat — is the same.
Transformer
Neural network architecture built from stacked blocks of (multi-head attention + FFN) with residuals and layer norms. The dominant architecture for language models since 2017.
Validation set
Held-out portion of the data the model never sees during training. Used to estimate metrics like perplexity on data the model can't have memorized.
A typical split is 80% train / 10% validation / 10% test. Validation drives decisions during development (which hyperparameters? when to stop?); test is touched only once at the end to report a final number.
Vocabulary
The set of all distinct tokens a model can read or produce. Size ranges from ~80 (character-level) to ~100,000 (modern subword tokenizers).