The loss curve

Chapter 7 · 15 min

Gradient descent live

Watch SGD, momentum, and Adam navigate the same loss surface. Build each optimizer step in plain code.

In chapter 5 you wrote a loop that subtracted lr * gradient from each parameter and called it a day. It worked because the loss surface was nice and convex. In chapter 6 the same loop trained a real model — but you may have noticed it sometimes plateaued, sometimes oscillated, sometimes diverged.

The reason: gradient descent is one specific way of navigating a loss landscape, and it's not always the best one. There's a whole family of optimizers, each making different assumptions about the shape of the landscape, with different costs and behaviors. Adam, the most popular optimizer in modern deep learning, is closer to "SGD with extra accounting" than "something fundamentally new" — but the accounting matters.

We're going to use a deliberately bad-for-vanilla-gradient-descent test function — f(x, y) = ½(x² + 10y²), a stretched bowl — and watch three optimizers attempt to slide a ball down to the origin. The function's been chosen because the curvature is 10× steeper in y than in x, which causes vanilla gradient descent to oscillate vertically while crawling horizontally. Momentum smooths it; Adam fixes it. Then you will save those update rules in llm/optim.py.

1. The gradient

Before we descend anywhere we need a gradient. For our f(x, y) = ½(x² + 10y²):

\frac{\partial f}{\partial x} = x, \qquad \frac{\partial f}{\partial y} = 10y

Write the function. The chapter checks it at three test points.

Code · JavaScript
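
If you want something to check your version against, here is a minimal sketch in Python (the in-page editor uses JavaScript, and the test points below are illustrative, not the chapter's):

def grad(x: float, y: float) -> tuple[float, float]:
    """Gradient of f(x, y) = 0.5 * (x**2 + 10 * y**2)."""
    return x, 10 * y

# Spot checks at a few arbitrary points.
assert grad(0.0, 0.0) == (0.0, 0.0)
assert grad(2.0, -1.0) == (2.0, -10.0)
assert grad(-3.0, 0.5) == (-3.0, 5.0)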

If your gradient matches at every test point, we're good to descend.

2. Vanilla gradient descent

The simplest possible step: subtract lr * gradient from the current position.

Code · JavaScript

Look at the trajectory. The bowl's minimum is at (0, 0) (where the surface is brightest). The optimizer is trying to get there, but the steep y direction makes it overshoot, swing back, overshoot, swing back. The horizontal direction barely moves because the gradient there is small.

This is the canonical reason vanilla gradient descent with a single learning rate is hard to tune: pick lr for the steep direction and the gentle one moves nowhere; pick it for the gentle one and you diverge in the steep one.
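
One way to see the bind concretely: on this bowl the two coordinates evolve independently, so a single vanilla step just multiplies each by a fixed factor.

x \leftarrow (1 - \eta)\,x, \qquad y \leftarrow (1 - 10\eta)\,y

y stays stable only while |1 − 10η| < 1, i.e. η < 0.2, and values small enough to kill the oscillation (say η ≈ 0.05, giving a y factor of 0.5) leave x shrinking by just 5% per step.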

3. Momentum

The fix is simple: keep a running velocity. Each step adds a damped version of the previous step's velocity, so consecutive moves in the same direction reinforce, and oscillating moves cancel.

v \leftarrow \mu v - \eta\,\nabla f, \quad x \leftarrow x + v

Where μ is the momentum coefficient (typically 0.9). Try the slider.

Code · JavaScript

With μ at 0.9, the trajectory should be much smoother — the vertical oscillations cancel because consecutive v_y values flip sign and average out, while the horizontal motion accumulates because consecutive v_x values reinforce.

Crank μ to 0.99 and you'll see the optimizer overshoot the minimum and bounce around — too much velocity, not enough damping. Drop it to 0.0 and you recover vanilla gradient descent. The interesting region is between 0.7 and 0.95.
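
If you want to reproduce the slider offline, here is a minimal Python sketch (the starting point, learning rate, and step count are arbitrary choices, not the chapter's; with these settings μ = 0.9 should finish closest to the origin):

def momentum_run(mu: float, lr: float = 0.02, steps: int = 50) -> tuple[float, float]:
    """Heavy-ball momentum on f(x, y) = 0.5 * (x**2 + 10 * y**2)."""
    x, y = -8.0, 3.0
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = x, 10 * y          # gradient of the stretched bowl
        vx = mu * vx - lr * gx      # damped history plus the new step
        vy = mu * vy - lr * gy
        x, y = x + vx, y + vy
    return x, y

for mu in (0.0, 0.9, 0.99):
    x, y = momentum_run(mu)
    print(f"mu={mu:.2f}  ended at ({x:+.3f}, {y:+.3f})")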

4. Adam

Momentum smooths the trajectory but doesn't change the fundamental issue: every dimension uses the same step size. Adam (Kingma & Ba, 2015) maintains a running estimate of the variance of the gradient in each dimension, and divides the step by the square root of that variance. Steep, oscillating directions get small effective steps; quiet, stable ones get large effective steps. It's a per-dimension adaptive learning rate.

The full update has bias correction (which matters most early in training, when the running averages are still warming up):

\begin{aligned} m &\leftarrow \beta_1 m + (1 - \beta_1)\,\nabla f \\ v &\leftarrow \beta_2 v + (1 - \beta_2)\,(\nabla f)^2 \\ \hat{m} &= m / (1 - \beta_1^t), \quad \hat{v} = v / (1 - \beta_2^t) \\ x &\leftarrow x - \eta\,\frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon} \end{aligned}

Defaults: β₁ = 0.9, β₂ = 0.999, ε = 1e-8. Type the formula in.
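To see why the correction matters, work through the very first step (t = 1) starting from m = v = 0:

\hat{m} = \frac{(1 - \beta_1)\,\nabla f}{1 - \beta_1^1} = \nabla f, \qquad \hat{v} = \frac{(1 - \beta_2)\,(\nabla f)^2}{1 - \beta_2^1} = (\nabla f)^2

so the first update has magnitude roughly η in every dimension (the gradient divided by its own magnitude), no matter how large or small the raw gradient is. Without the corrections, m and v would still be biased toward their zero initialization for many early steps.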

Code · JavaScript

Adam's trajectory should be the smoothest of the three — almost a straight line from start to minimum, regardless of the curvature mismatch. That's why it's the default optimizer for most modern deep learning: it adapts to whatever weirdness the surface throws at it.

What this matters for

Every chapter from here on uses an optimizer. The model in chapter 6 was trained with vanilla gradient descent; if you found the loss curve choppy, that's why. Modern LLMs use Adam (or AdamW) almost exclusively. The choice isn't dogma — it's the empirical observation that most deep loss landscapes have the kind of pathology our toy bowl illustrates: directions of very different curvature, in the same network, in the same iteration.

5. Add optimizer state locally

Create llm/optim.py:

"""Tiny optimizer updates before we delegate tensors to PyTorch."""
from __future__ import annotations
 
import math
 
 
Vector = list[float]
 
 
# [1]
def sgd(params: Vector, grads: Vector, lr: float) -> Vector:
    return [p - lr * g for p, g in zip(params, grads)]
 
 
def momentum(
    params: Vector,
    grads: Vector,
    velocity: Vector,
    lr: float,
    beta: float = 0.9,
) -> tuple[Vector, Vector]:
    # [2]
    next_velocity = [beta * v - lr * g for v, g in zip(velocity, grads)]
    return [p + v for p, v in zip(params, next_velocity)], next_velocity
 
 
# [3]
def adam(
    params: Vector,
    grads: Vector,
    m: Vector,
    v: Vector,
    step: int,
    lr: float = 3e-4,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
) -> tuple[Vector, Vector, Vector]:
    next_m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grads)]
    next_v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grads)]
    # [4]
    m_hat = [mi / (1 - beta1**step) for mi in next_m]
    v_hat = [vi / (1 - beta2**step) for vi in next_v]
    # [5]
    next_params = [
        p - lr * mh / (math.sqrt(vh) + eps)
        for p, mh, vh in zip(params, m_hat, v_hat)
    ]
    return next_params, next_m, next_v

The three functions are the same idea with more memory:

  • [1] sgd has no memory. It looks at the current gradient and steps downhill.
  • [2] momentum carries velocity, so repeated gradients in the same direction accumulate and back-and-forth cancel.
  • [3] adam carries two memories: m for average direction, v for average squared size.
  • [4] m_hat and v_hat correct the early steps, when those running averages are still biased toward zero.
  • [5] divides by sqrt(v_hat), so dimensions with huge gradients get smaller effective steps.

PyTorch will eventually manage this state for us. Writing it once makes the black box less black.
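
A quick way to exercise the module once the file exists; this sketch assumes llm/ is importable as a package (it has an __init__.py) and reuses the stretched bowl from above. The starting point, learning rate, and step count are arbitrary:

from llm.optim import adam

params = [-8.0, 3.0]        # start somewhere off-center on the bowl
m = [0.0, 0.0]
v = [0.0, 0.0]

for step in range(1, 301):  # adam's bias correction expects step >= 1
    grads = [params[0], 10 * params[1]]   # gradient of 0.5 * (x**2 + 10 * y**2)
    params, m, v = adam(params, grads, m, v, step, lr=0.1)

print(params)  # both coordinates should end up near 0.0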

Recap

  • Vanilla gradient descent is the simplest possible optimizer. It struggles when different dimensions have very different curvatures.
  • Momentum smooths the trajectory by accumulating velocity. Same step formula plus a running velocity term; one extra hyperparameter.
  • Adam divides each dimension's step by the square root of its gradient variance. Adaptive per-dimension learning rates. Bias correction matters at the start.
  • Why Adam wins by default: in real networks, dimensions have very different gradient magnitudes. A single global lr is a compromise; Adam removes the compromise.
  • Your local project now has llm/optim.py, the stateful update machinery that training loops need.

Going further

Next up: this is the end of part II. Part III begins with an attention head — the mechanism that lets a token actually look at other tokens, instead of being limited to a fixed-size context window.