Lexicon
Every technical term used across the chapters. Hover any underlined word in a chapter to get a quick definition; this page holds the longer ones.
- Adam
- The most popular optimizer in modern deep learning. Maintains per-parameter running averages of the gradient (m) and the squared gradient (v), and normalizes the step by √v.
- Combines momentum (first moment) with per-dimension adaptive scaling (second moment). Bias-corrects both. Default optimizer for transformers. AdamW is Adam with weight decay applied outside the gradient.
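- A minimal sketch of one Adam update in NumPy (illustrative, not any chapter's code; the hyperparameter defaults shown are the usual ones):
  ```python
  import numpy as np

  def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
      # First moment: running average of the gradient (the momentum direction).
      m = beta1 * m + (1 - beta1) * grad
      # Second moment: running average of the squared gradient (per-dimension scale).
      v = beta2 * v + (1 - beta2) * grad**2
      # Bias-correct both: they start at zero, so early estimates are too small.
      m_hat = m / (1 - beta1**t)
      v_hat = v / (1 - beta2**t)
      # Step, normalized per dimension by the square root of the second moment.
      param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
      return param, m, v
  ```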
- Alignment
- The work of turning a next-token predictor into a useful assistant: SFT (supervised fine-tuning) on human-written examples, then RLHF or DPO on human preference data.
- Attention
- The mechanism that lets a token look at other tokens. Computes a weighted sum of value vectors, weighted by how relevant each is to the current position.
- Scaled dot-product attention: A = softmax(QKᵀ / √d_k) · V. Q, K, V are projections of the input. The softmaxed attention matrix says, for every token, how much it draws from every other token.
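- A NumPy sketch of scaled dot-product attention for a single head (illustrative shapes, no batch dimension, no mask):
  ```python
  import numpy as np

  def attention(Q, K, V):
      # Q, K, V: (seq_len, d_k) projections of the input sequence.
      d_k = Q.shape[-1]
      scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) relevance scores
      scores -= scores.max(axis=-1, keepdims=True)    # subtract the max for numerical stability
      weights = np.exp(scores)
      weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key positions
      return weights @ V                              # weighted sum of value vectors
  ```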
- Backpropagation
- Algorithm that computes the gradient of the loss with respect to every parameter by walking the chain rule backwards through the network's operations.
- Bigram
- A pair of consecutive tokens. The simplest unit of context a language model can use.
- A bigram model assigns a probability to a token based only on the token immediately before it: P(w_t | w_{t-1}). It cannot capture anything past one position back, but it's already a working language model.
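- A counts-table bigram model in a few lines of Python (a sketch; chapter 1's version may differ in detail):
  ```python
  from collections import defaultdict

  def bigram_probs(tokens):
      counts = defaultdict(lambda: defaultdict(int))
      for prev, nxt in zip(tokens, tokens[1:]):
          counts[prev][nxt] += 1
      # Normalize each row of the counts table into P(next | prev).
      return {prev: {nxt: c / sum(row.values()) for nxt, c in row.items()}
              for prev, row in counts.items()}

  probs = bigram_probs("the cat sat on the mat".split())
  print(probs["the"])   # {'cat': 0.5, 'mat': 0.5}
  ```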
- Block size
- Maximum context length the model attends over during training. Determines how many tokens of history it can see.
- Byte-Pair Encoding (BPE)
- Tokenization scheme that starts with characters and iteratively merges the most-frequent adjacent pair. Produces subword tokens.
- Originally a 1994 data-compression algorithm. Used in GPT-2/3/4 and most modern LLMs. The merges discover morphology — suffixes like 'ing' or 'ed' emerge naturally as common subwords.
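- A sketch of one BPE training iteration: count adjacent pairs, then merge the most frequent pair everywhere (simplified; real tokenizers train over a full corpus and record the merge order):
  ```python
  from collections import Counter

  def merge_most_frequent(tokens):
      # Count every adjacent pair in the current token sequence.
      pairs = Counter(zip(tokens, tokens[1:]))
      if not pairs:
          return tokens, None
      best = max(pairs, key=pairs.get)
      # Replace every occurrence of the most frequent pair with one merged token.
      merged, i = [], 0
      while i < len(tokens):
          if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
              merged.append(tokens[i] + tokens[i + 1])
              i += 2
          else:
              merged.append(tokens[i])
              i += 1
      return merged, best

  tokens = list("low lower lowest")
  for _ in range(4):
      tokens, merge = merge_most_frequent(tokens)
      print(merge, tokens)   # first ('l','o'), then ('lo','w'), ... subwords emerge
  ```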
- Causal mask
- The constraint that token i cannot attend to tokens j > i during training. Implemented by setting future-position scores to −∞ before softmax. What makes a transformer a *decoder*.
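- A NumPy sketch of applying the mask to a matrix of attention scores (illustrative):
  ```python
  import numpy as np

  seq_len = 4
  scores = np.random.randn(seq_len, seq_len)                      # raw attention scores
  future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True where j > i
  scores[future] = -np.inf                                        # mask future positions
  weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
  weights /= weights.sum(axis=-1, keepdims=True)                  # softmax: masked scores get weight 0
  print(weights.round(2))   # lower-triangular: row i attends only to positions <= i
  ```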
- Corpus
- The body of text used to train, inspect, or evaluate a language model. Can be a sentence, a book, or trillions of words scraped from the web.
- Cosine similarity
- Measure of how aligned two vectors are: cos(a, b) = (a · b) / (‖a‖·‖b‖). Value 1 means same direction, 0 orthogonal, −1 opposite.
- Cross-entropy
- The standard loss for classification: −Σ y·log(p). Designed to pair cleanly with softmax/sigmoid outputs.
- Embedding
- A dense low-dimensional vector representing a token. Words used in similar contexts end up with similar vectors.
- Embeddings replace one-hot encodings of tokens with continuous vectors that capture semantic relationships. The geometry of the embedding space is meaningful: directions can encode features like gender, formality, register.
- Entropy
- A measure of how spread-out a probability distribution is. H = -Σ p log p. Low entropy = concentrated on a few outcomes; high entropy = nearly uniform.
- Feed-forward network
- The per-token MLP inside a transformer block: two linear layers with a non-linearity between them, applied independently to every position. Where most of the model's parameters live.
- GELU
- Smooth approximation of ReLU used in transformers. Defined as x·Φ(x), where Φ is the Gaussian CDF.
- Generation
- Producing a new sequence of tokens by sampling the model one step at a time. Each emitted token feeds back as input for the next step.
- Gradient
- Vector of partial derivatives of the loss with respect to every parameter. Tells you which way (and how much) to nudge each parameter to lower the loss.
- Gradient descent
- Optimization procedure: subtract a small multiple of the gradient from the parameters at each step.
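- The whole procedure fits in a line of Python (a sketch; `lr` is the learning rate):
  ```python
  def sgd_step(params, grads, lr=0.01):
      # Move every parameter a small step against its gradient.
      return [p - lr * g for p, g in zip(params, grads)]

  # Minimizing loss = x^2 by hand: the gradient is 2x, so x shrinks toward 0.
  x = 5.0
  for _ in range(100):
      (x,) = sgd_step([x], [2 * x], lr=0.1)
  print(x)   # very close to 0
  ```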
- Hyperparameter
- A number that controls training but isn't learned by the optimizer: learning rate, batch size, hidden size, dropout probability, etc. You set them; you don't gradient-descent them.
- Inference
- Running a trained model on new inputs to get predictions. Opposite of training — no parameters move.
- Input shape
- The dimensions of the input tensor to a neural network layer. For example, [batch_size, sequence_length] for text inputs.
- Kneser-Ney
- Smarter smoothing that subtracts a small discount from seen counts and redistributes the freed mass through a fallback distribution based on how many distinct contexts each token appears in (continuation counts), rather than falling back uniformly.
- Language model
- A model that assigns a probability to the next token given the tokens that came before. Generation is repeated sampling from that probability.
- Every modern LLM is a language model: input is a sequence of tokens, output is a probability distribution over the vocabulary. The bigram in chapter 1, the transformer in chapter 10, and GPT-4 all share that interface — only the function in the middle differs.
- Laplace smoothing
- Simplest smoothing: add a constant α to every cell of the counts table before normalizing.
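- A sketch of add-α smoothing applied to one row of a bigram counts table (the counts, vocabulary size, and α here are made up for illustration):
  ```python
  def laplace_smooth(row_counts, vocab_size, alpha=1.0):
      # row_counts: next-token -> count for one previous token.
      total = sum(row_counts.values()) + alpha * vocab_size
      # Every token in the vocabulary now gets at least alpha / total probability.
      return lambda token: (row_counts.get(token, 0) + alpha) / total

  p = laplace_smooth({"cat": 3, "dog": 1}, vocab_size=10)
  print(p("cat"), p("never_seen"))   # 0.2857... and 0.0714...; nothing is zero anymore
  ```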
- Layer normalization
- Normalizes each token's activation vector to mean 0, std 1. Stabilizes activation scales across layers.
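- A minimal NumPy sketch (omitting the learned scale and shift that most implementations apply after normalizing):
  ```python
  import numpy as np

  def layer_norm(x, eps=1e-5):
      # x: the activation vector for one token.
      mean = x.mean(axis=-1, keepdims=True)
      std = x.std(axis=-1, keepdims=True)
      return (x - mean) / (std + eps)   # mean 0, std 1 per token
  ```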
- Learning rate
- Scalar that scales each gradient-descent step. Too small and training crawls; too large and the loss diverges. The single most-important hyperparameter.
- LLM (large language model)
- A language model with enough parameters and training data to produce coherent multi-paragraph text. Modern LLMs are transformers with billions to trillions of parameters.
- There is no exact size threshold — "large" is a moving target. In practice, the term covers transformer-based language models from a few hundred million parameters upward, trained on hundreds of billions of tokens.
- LoRA
- Low-Rank Adaptation. Fine-tune a model without updating its original weights: freeze W, learn a small low-rank update A·B added to it. Cuts trainable parameters by ~100×.
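- A sketch of the LoRA forward pass (the shapes and rank are illustrative):
  ```python
  import numpy as np

  d_in, d_out, r = 512, 512, 8
  W = np.random.randn(d_in, d_out)       # frozen pretrained weight
  A = np.random.randn(d_in, r) * 0.01    # trainable, d_in x r
  B = np.zeros((r, d_out))               # trainable, r x d_out; starts at zero so the update begins as a no-op

  def lora_forward(x):
      # Original path plus the low-rank update; only A and B receive gradients.
      return x @ W + x @ A @ B

  x = np.random.randn(1, d_in)
  print(lora_forward(x).shape)   # (1, 512)
  ```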
- Loss
- A number that says how badly the model is predicting the right answers. Training minimizes it.
- For language models, the per-token loss is typically −log of the probability the model assigned to the true next token. Average it across a sequence to get a comparable per-token number; cross-entropy loss is the standard formulation.
- Matrix multiplication
- (A · B)[i,j] = Σ_k A[i,k] · B[k,j]. The arithmetic core of every neural network layer; modern GPUs and Apple Silicon have dedicated paths to make it fast.
- MLP (multi-layer perceptron)
- Two or more linear layers stacked with non-linearities between them. Unlike a single neuron, it can learn non-linear decision boundaries.
- Momentum
- Trick that accumulates a running velocity of recent gradients. Smooths the trajectory when consecutive steps reinforce, dampens it when they oscillate.
- Multi-head attention
- Running several attention heads in parallel, each with its own Q/K/V projections, then concatenating their outputs and projecting through a learned W_O.
- Neuron
- Weighted sum of inputs followed by a non-linearity: σ(Σ w_i·x_i + b). The smallest learnable unit in a neural network.
- One-hot encoding
- A vector of zeros with a single 1 at the position of the token. Wasteful and asserts that every pair of distinct tokens is equally unrelated.
- Out-of-vocabulary
- A token (or pair) that never appeared during training. Unsmoothed models can't assign it any probability.
- Parameter
- One of the model's learnable numbers. Modern LLMs have billions to trillions; the bigram model in chapter 1 has |vocab|² of them (one per cell of the counts table).
- Perplexity
- Geometric mean of inverse-probability over a sequence. Low when the model assigns high probability to the tokens it sees; infinite when any one token has probability zero.
- Reported as exp(mean negative log-likelihood). Common evaluation metric for language models. A perplexity of 50 means the model is, on average, as confused as if it had to choose uniformly between 50 equally-likely tokens.
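- The computation in a few lines (a sketch; `probs` stands for the probabilities the model assigned to each true next token):
  ```python
  import math

  def perplexity(probs):
      nll = [-math.log(p) for p in probs]     # per-token negative log-likelihood
      return math.exp(sum(nll) / len(nll))    # exp of the mean

  print(perplexity([0.5, 0.5, 0.5]))   # 2.0: as confused as a fair coin flip per token
  print(perplexity([1.0, 1.0]))        # 1.0: perfectly certain
  ```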
- Probability distribution
- A list of non-negative numbers that sum to 1, one per possible outcome. A language model's output is one such distribution over the vocabulary.
- Quantization
- Storing weights as low-precision integers (INT8, INT4) instead of floats. Cuts model size 4–8× and speeds up inference, usually with minimal quality loss.
- Query / Key / Value
- Three projections of the input used in attention. Queries ask, keys advertise, values contribute.
- ReLU
- Rectified linear unit: max(0, x). The standard non-linearity inside modern neural networks because it's cheap and avoids vanishing gradients.
- Residual connection
- output = input + sublayer(input). Lets gradients flow cleanly through deep stacks; was the key innovation that made deep CNNs practical, now in every transformer.
- Sampling
- Drawing a random value from a probability distribution. In a language model, picking the next token by rolling a number against the model's output.
- Sampling strategies (greedy, temperature, top-k, top-p) all answer the same question — given a probability distribution over the vocabulary, which token do we actually emit? — but trade off determinism for diversity differently.
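- A NumPy sketch combining temperature and top-k (the logits and cutoff values are made up for illustration):
  ```python
  import numpy as np

  def sample(logits, temperature=1.0, top_k=None):
      logits = np.asarray(logits, dtype=float) / temperature
      if top_k is not None:
          # Drop everything outside the k highest-scoring tokens.
          cutoff = np.sort(logits)[-top_k]
          logits = np.where(logits < cutoff, -np.inf, logits)
      probs = np.exp(logits - logits.max())
      probs /= probs.sum()                     # softmax over what remains
      return np.random.choice(len(probs), p=probs)

  next_token = sample([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_k=3)
  ```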
- Scaling law
- Empirical observation that model quality improves predictably with parameters, training tokens, and compute. The Chinchilla rule: optimal tokens ≈ 20 × parameters.
- Seed token
- The first token (or short prompt) you hand to a language model to start generation. Everything after is sampled from the model.
- Sigmoid
- Squashing function σ(x) = 1 / (1 + e^(-x)) that maps any real number into (0, 1). The classical activation for binary classification outputs.
- Skip-gram
- Algorithm that learns word embeddings by pushing each center word's vector toward the vectors of its neighbors in the corpus.
- Smoothing
- Family of techniques that ensure every possible token transition gets a positive probability, even ones never seen during training.
- Without smoothing, an n-gram model collapses to perplexity = ∞ as soon as it sees an unseen transition on the validation set. Laplace add-α and Kneser-Ney are the two classic methods.
- Softmax
- Normalizes a vector of real numbers into a probability distribution: exp(x_i) / Σ exp(x_j). Used at the end of attention and at the model's output.
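- In code it is usually computed after subtracting the maximum, which changes nothing mathematically but avoids overflow (a sketch):
  ```python
  import numpy as np

  def softmax(x):
      e = np.exp(x - np.max(x))   # shift by the max for numerical stability
      return e / e.sum()

  print(softmax(np.array([2.0, 1.0, 0.1])))   # non-negative, sums to 1
  ```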
- Subword
- Token that is shorter than a word but longer than a character. The granularity BPE settles on.
- Token
- The basic unit a model reads and writes. Often a subword, sometimes a word, sometimes a single character.
- Tokenization is the first step in any language model pipeline. Whitespace tokenizers split on spaces; BPE tokenizers find a granularity between words and characters by learning which subwords appear most often.
- Tokenizer
- The function that splits raw text into tokens. Ranges from naive whitespace + lowercase to learned schemes like BPE.
- Training
- Adjusting a model's parameters so that it does its job better on a given dataset. For a language model, that usually means lowering next-token loss across the training set.
- Training a counts-table bigram is just incrementing counters. Training a neural network is running gradient descent on millions to trillions of parameters. The underlying loop — measure how wrong you are, change something to be less wrong, repeat — is the same.
- Transformer
- Neural network architecture built from stacked blocks of (multi-head attention + FFN) with residuals and layer norms. The dominant architecture for language models since 2017.
- Validation set
- Held-out portion of the data the model never sees during training. Used to estimate metrics like perplexity on data the model can't have memorized.
- A typical split is 80% train / 10% validation / 10% test. Validation drives decisions during development (which hyperparameters? when to stop?); test is touched only once at the end to report a final number.
- Vocabulary
- The set of all distinct tokens a model can read or produce. Size ranges from ~80 (character-level) to ~100,000 (modern subword tokenizers).