The loss curve

Chapter 7 · 15 min

Gradient descent live

Watch SGD, momentum, and Adam navigate the same loss surface. Build each optimizer step in plain code.

In chapter 5 you wrote a loop that subtracted lr * gradient from each parameter and called it a day. It worked because the loss surface was nice and convex. In chapter 6 the same loop trained a real model — but you may have noticed it sometimes plateaued, sometimes oscillated, sometimes diverged.

The reason: gradient descent is one specific way of navigating a loss landscape, and it's not always the best one. There's a whole family of optimizers, each making different assumptions about the shape of the landscape, with different costs and behaviors. Adam, the most popular optimizer in modern deep learning, is closer to "SGD with extra accounting" than "something fundamentally new" — but the accounting matters.

We're going to use a deliberately bad-for-vanilla-gradient-descent test function — f(x, y) = ½(x² + 10y²), a stretched bowl — and watch three optimizers attempt to slide a ball down to the origin. The function's been chosen because the curvature is 10× steeper in y than in x, which causes vanilla gradient descent to oscillate vertically while crawling horizontally. Momentum smooths it; Adam fixes it. Then you will save those update rules in llm/optim.py.

1. The gradient

Before we descend anywhere we need a gradient. For our f(x, y) = ½(x² + 10y²):

\frac{\partial f}{\partial x} = x, \qquad \frac{\partial f}{\partial y} = 10y

Write the function. The chapter checks it at three test points.

Code · JavaScript
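
If you want something to check your version against, here is a minimal sketch in Python (the in-page editor uses JavaScript, and the test points below are illustrative, not the chapter's):

def grad(x: float, y: float) -> tuple[float, float]:
    """Gradient of f(x, y) = 0.5 * (x**2 + 10 * y**2)."""
    return x, 10 * y

# Spot checks at a few arbitrary points.
assert grad(0.0, 0.0) == (0.0, 0.0)
assert grad(2.0, -1.0) == (2.0, -10.0)
assert grad(-3.0, 0.5) == (-3.0, 5.0)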

If your gradient matches at every test point, we're good to descend.

2. Vanilla gradient descent

The simplest possible step: subtract lr * gradient from the current position.

Code · JavaScript

Look at the trajectory. The bowl's minimum is at (0, 0) (where the surface is brightest). The optimizer is trying to get there, but the steep y direction makes it overshoot, swing back, overshoot, swing back. The horizontal direction barely moves because the gradient there is small.

This is the canonical reason vanilla gradient descent with a single learning rate is hard to tune: pick lr for the steep direction and the gentle one moves nowhere; pick it for the gentle one and you diverge in the steep one.
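
One way to see the bind concretely: on this bowl the two coordinates evolve independently, so a single vanilla step just multiplies each by a fixed factor.

x \leftarrow (1 - \eta)\,x, \qquad y \leftarrow (1 - 10\eta)\,y

y stays stable only while |1 − 10η| < 1, i.e. η < 0.2, and values small enough to kill the oscillation (say η ≈ 0.05, giving a y factor of 0.5) leave x shrinking by just 5% per step.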

3. Momentum

The fix is simple: keep a running velocity. Each step adds a damped version of the previous step's velocity, so consecutive moves in the same direction reinforce, and oscillating moves cancel.

v \leftarrow \mu v - \eta\,\nabla f, \quad x \leftarrow x + v

Where μ is the momentum coefficient (typically 0.9). Try the slider.

Code · JavaScript

With μ at 0.9, the trajectory should be much smoother — the vertical oscillations cancel because consecutive v_y values flip sign and average out, while the horizontal motion accumulates because consecutive v_x values reinforce.

Crank μ to 0.99 and you'll see the optimizer overshoot the minimum and bounce around — too much velocity, not enough damping. Drop it to 0.0 and you recover vanilla gradient descent. The interesting region is between 0.7 and 0.95.
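
If you want to reproduce the slider offline, here is a minimal Python sketch (the starting point, learning rate, and step count are arbitrary choices, not the chapter's; with these settings μ = 0.9 should finish closest to the origin):

def momentum_run(mu: float, lr: float = 0.02, steps: int = 50) -> tuple[float, float]:
    """Heavy-ball momentum on f(x, y) = 0.5 * (x**2 + 10 * y**2)."""
    x, y = -8.0, 3.0
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = x, 10 * y          # gradient of the stretched bowl
        vx = mu * vx - lr * gx      # damped history plus the new step
        vy = mu * vy - lr * gy
        x, y = x + vx, y + vy
    return x, y

for mu in (0.0, 0.9, 0.99):
    x, y = momentum_run(mu)
    print(f"mu={mu:.2f}  ended at ({x:+.3f}, {y:+.3f})")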

4. Adam

Momentum smooths the trajectory but doesn't change the fundamental issue: every dimension uses the same step size. Adam (Kingma & Ba, 2015) maintains a running estimate of the variance of the gradient in each dimension, and divides the step by the square root of that variance. Steep, oscillating directions get small effective steps; quiet, stable ones get large effective steps. It's a per-dimension adaptive learning rate.

The full update has bias correction (which matters most early in training, when the running averages are still warming up):

\begin{aligned} m &\leftarrow \beta_1 m + (1 - \beta_1)\,\nabla f \\ v &\leftarrow \beta_2 v + (1 - \beta_2)\,(\nabla f)^2 \\ \hat{m} &= m / (1 - \beta_1^t), \quad \hat{v} = v / (1 - \beta_2^t) \\ x &\leftarrow x - \eta\,\frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon} \end{aligned}

Defaults: β₁ = 0.9, β₂ = 0.999, ε = 1e-8. Type the formula in.
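To see why the correction matters, work through the very first step (t = 1) starting from m = v = 0:

\hat{m} = \frac{(1 - \beta_1)\,\nabla f}{1 - \beta_1^1} = \nabla f, \qquad \hat{v} = \frac{(1 - \beta_2)\,(\nabla f)^2}{1 - \beta_2^1} = (\nabla f)^2

so the first update has magnitude roughly η in every dimension (the gradient divided by its own magnitude), no matter how large or small the raw gradient is. Without the corrections, m and v would still be biased toward their zero initialization for many early steps.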

Code · JavaScript

Adam's trajectory should be the smoothest of the three — almost a straight line from start to minimum, regardless of the curvature mismatch. That's why it's the default optimizer for most modern deep learning: it adapts to whatever weirdness the surface throws at it.

What this matters for

Every chapter from here on uses an optimizer. The model in chapter 6 was trained with vanilla gradient descent; if you found the loss curve choppy, that's why. Modern LLMs use Adam (or AdamW) almost exclusively. The choice isn't dogma — it's the empirical observation that most deep loss landscapes have the kind of pathology our toy bowl illustrates: directions of very different curvature, in the same network, in the same iteration.

5. Add optimizer state locally

Create llm/optim.py:

"""Tiny optimizer updates before we delegate tensors to PyTorch."""
from __future__ import annotations
 
import math
 
 
Vector = list[float]
 
 
# [1]
def sgd(params: Vector, grads: Vector, lr: float) -> Vector:
    return [p - lr * g for p, g in zip(params, grads)]
 
 
def momentum(
    params: Vector,
    grads: Vector,
    velocity: Vector,
    lr: float,
    beta: float = 0.9,
) -> tuple[Vector, Vector]:
    # [2]
    next_velocity = [beta * v - lr * g for v, g in zip(velocity, grads)]
    return [p + v for p, v in zip(params, next_velocity)], next_velocity
 
 
# [3]
def adam(
    params: Vector,
    grads: Vector,
    m: Vector,
    v: Vector,
    step: int,
    lr: float = 3e-4,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
) -> tuple[Vector, Vector, Vector]:
    next_m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grads)]
    next_v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grads)]
    # [4]
    m_hat = [mi / (1 - beta1**step) for mi in next_m]
    v_hat = [vi / (1 - beta2**step) for vi in next_v]
    # [5]
    next_params = [
        p - lr * mh / (math.sqrt(vh) + eps)
        for p, mh, vh in zip(params, m_hat, v_hat)
    ]
    return next_params, next_m, next_v

The three functions are the same idea with more memory:

  • [1] sgd has no memory. It looks at the current gradient and steps downhill.
  • [2] momentum carries velocity, so repeated gradients in the same direction accumulate and back-and-forth cancel.
  • [3] adam carries two memories: m for average direction, v for average squared size.
  • [4] m_hat and v_hat correct the early steps, when those running averages are still biased toward zero.
  • [5] divides by sqrt(v_hat), so dimensions with huge gradients get smaller effective steps.

PyTorch will eventually manage this state for us. Writing it once makes the black box less black.
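
A quick way to exercise the module once the file exists; this sketch assumes llm/ is importable as a package (it has an __init__.py) and reuses the stretched bowl from above. The starting point, learning rate, and step count are arbitrary:

from llm.optim import adam

params = [-8.0, 3.0]        # start somewhere off-center on the bowl
m = [0.0, 0.0]
v = [0.0, 0.0]

for step in range(1, 301):  # adam's bias correction expects step >= 1
    grads = [params[0], 10 * params[1]]   # gradient of 0.5 * (x**2 + 10 * y**2)
    params, m, v = adam(params, grads, m, v, step, lr=0.1)

print(params)  # both coordinates should end up near 0.0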

Recap

  • Vanilla gradient descent is the simplest possible optimizer. It struggles when different dimensions have very different curvatures.
  • Momentum smooths the trajectory by accumulating velocity. Same step formula plus a running velocity term; one extra hyperparameter.
  • Adam divides each dimension's step by the square root of its gradient variance. Adaptive per-dimension learning rates. Bias correction matters at the start.
  • Why Adam wins by default: in real networks, dimensions have very different gradient magnitudes. A single global lr is a compromise; Adam removes the compromise.
  • Your local project now has llm/optim.py, the stateful update machinery that training loops need.

Going further

Next up: this is the end of part II. Part III begins with an attention head — the mechanism that lets a token actually look at other tokens, instead of being limited to a fixed-size context window.