Chapter 7 · 15 min
Gradient descent live
Watch SGD, momentum, and Adam navigate the same loss surface. Build each optimizer step in plain code.
In chapter 5 you wrote a loop that subtracted lr * gradient from each parameter and called it a day. It worked because the loss surface was nice and convex. In chapter 6 the same loop trained an MLP — but you may have noticed it sometimes plateaued, sometimes oscillated, sometimes diverged.
The reason: gradient descent is one specific way of navigating a loss landscape, and it's not always the best one. There's a whole family of optimizers, each making different assumptions about the shape of the landscape, with different costs and behaviors. Adam, the most popular optimizer in modern deep learning, is closer to "SGD with extra accounting" than "something fundamentally new" — but the accounting matters.
We're going to use a deliberately bad-for-vanilla-SGD test function — f(x, y) = ½(x² + 10y²), a stretched bowl — and watch three optimizers attempt to slide a ball down to the origin. The function's been chosen because the curvature is 10× steeper in y than in x, which causes vanilla SGD to oscillate vertically while crawling horizontally. Momentum smooths it; Adam fixes it. Then you will save those update rules in llm/optim.py.
1. The gradient
Before we descend anywhere we need a gradient. For our f(x, y) = ½(x² + 10y²), the partial derivatives are ∂f/∂x = x and ∂f/∂y = 10y:
Write the gradient function. The chapter checks it at three test points.
Code · JavaScript
If your gradient matches at every test point, we're good to descend.
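The chapter's widget is JavaScript, but since the project code lives in Python, here's a Python sketch of the same gradient (function names and test points here are illustrative, not the chapter's own), checked against a central finite difference:

```python
def f(x: float, y: float) -> float:
    return 0.5 * (x**2 + 10 * y**2)

def grad(x: float, y: float) -> tuple[float, float]:
    # ∂f/∂x = x, ∂f/∂y = 10y
    return x, 10 * y

def numerical_grad(x: float, y: float, h: float = 1e-6) -> tuple[float, float]:
    # central differences: (f(p + h) - f(p - h)) / 2h per coordinate
    return (
        (f(x + h, y) - f(x - h, y)) / (2 * h),
        (f(x, y + h) - f(x, y - h)) / (2 * h),
    )

for point in [(1.0, 1.0), (-2.0, 0.5), (0.0, -3.0)]:
    gx, gy = grad(*point)
    nx, ny = numerical_grad(*point)
    assert abs(gx - nx) < 1e-4 and abs(gy - ny) < 1e-4
```

The finite-difference check is a habit worth keeping: it catches sign errors and dropped factors before they silently corrupt a descent.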
2. Vanilla gradient descent
The simplest possible step: subtract lr * gradient from the current position.
Code · JavaScript
Look at the trajectory. The bowl's minimum is at (0, 0) (where the surface is brightest). The optimizer is trying to get there, but the steep y direction makes it overshoot, swing back, overshoot, swing back. The horizontal direction barely moves because the gradient there is small.
This is the canonical reason vanilla SGD with a single learning rate is hard to tune: pick lr for the steep direction and the gentle one moves nowhere; pick it for the gentle one and you diverge in the steep one.
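That dilemma can be sketched in a few lines of Python (illustrative names, not the widget's code). With lr low enough to keep the steep y direction stable, y flips sign every step while x barely moves:

```python
def grad(x: float, y: float) -> tuple[float, float]:
    return x, 10 * y  # gradient of f(x, y) = ½(x² + 10y²)

def gd(lr: float, steps: int = 15, start: tuple[float, float] = (-2.0, 1.0)):
    x, y = start
    ys = []
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        ys.append(y)
    return x, y, ys

# lr = 0.15: each step multiplies y by (1 - 10·lr) = -0.5 — it flips sign and
# shrinks — while x is multiplied by 0.85 and crawls toward 0.
x, y, ys = gd(lr=0.15)
# lr = 0.25 would give y a multiplier of -1.5, so |y| grows: divergence.
```

On a quadratic the per-step multiplier (1 − lr·curvature) tells the whole story: stability requires it to stay inside (−1, 1) for *every* direction at once.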
3. Momentum
The fix is simple: keep a running velocity. Each step adds a damped version of the previous step's velocity, so consecutive moves in the same direction reinforce, and oscillating moves cancel.
v ← μ·v − lr·∇f(θ)
θ ← θ + v
Where μ is the momentum coefficient (typically 0.9). Try the slider.
Code · JavaScript
With μ at 0.9, the trajectory should be much smoother — the vertical oscillations cancel because consecutive v_y values flip sign and average out, while the horizontal motion accumulates because consecutive v_x values reinforce.
Crank μ to 0.99 and you'll see the optimizer overshoot the minimum and bounce around — too much velocity, not enough damping. Drop it to 0.0 and you recover vanilla SGD. The interesting region is between 0.7 and 0.95.
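In Python the velocity update described above is two lines per step (a sketch; `momentum_descent` is an illustrative name). Setting μ to 0 reduces it to the vanilla loop:

```python
def momentum_descent(lr: float = 0.15, mu: float = 0.9, steps: int = 100,
                     start: tuple[float, float] = (-2.0, 1.0)) -> tuple[float, float]:
    x, y = start
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = x, 10 * y       # gradient of the stretched bowl
        vx = mu * vx - lr * gx   # consecutive x contributions reinforce
        vy = mu * vy - lr * gy   # oscillating y contributions cancel
        x, y = x + vx, y + vy
    return x, y

x, y = momentum_descent()                       # μ = 0.9: ends near (0, 0)
x0, y0 = momentum_descent(mu=0.0, steps=15)     # μ = 0: vanilla SGD again
```

Note the velocity is damped, not averaged: μ = 0.9 means a gradient's influence decays by 10% per step, so a persistent direction can build up to roughly 1/(1 − μ) = 10× a single step.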
4. Adam
Momentum smooths the trajectory but doesn't change the fundamental issue: every dimension uses the same step size. Adam (Kingma & Ba, 2015) maintains a running estimate of the variance of the gradient in each dimension, and divides the step by the square root of that variance. Steep, oscillating directions get small effective steps; quiet, stable ones get large effective steps. It's per-dimension adaptive learning rates.
The full update has bias correction (which matters most early in training, when the running averages are still warming up):
m ← β₁·m + (1 − β₁)·g
v ← β₂·v + (1 − β₂)·g²
m̂ = m / (1 − β₁ᵗ),  v̂ = v / (1 − β₂ᵗ)
θ ← θ − lr·m̂ / (√v̂ + ε)
Defaults: β₁ = 0.9, β₂ = 0.999, ε = 1e-8. Type the formula in.
Code · JavaScript
Adam's trajectory should be the smoothest of the three — almost a straight line from start to minimum, regardless of the curvature mismatch. That's why it's the default optimizer for most modern deep learning: it adapts to whatever weirdness the surface throws at it.
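Here's the same descent with Adam's update typed out in Python (`adam_descent` is an illustrative name). Watch the very first step: at t = 1 the bias correction makes m̂ = g and v̂ = g², so each coordinate moves by almost exactly lr·sign(g) — the gradient's magnitude cancels out:

```python
import math

def adam_descent(steps: int, lr: float = 0.1, b1: float = 0.9,
                 b2: float = 0.999, eps: float = 1e-8,
                 start: tuple[float, float] = (-2.0, 1.0)) -> list[float]:
    p, m, v = list(start), [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):
        g = [p[0], 10 * p[1]]             # gradient of the stretched bowl
        for i in range(2):
            m[i] = b1 * m[i] + (1 - b1) * g[i]
            v[i] = b2 * v[i] + (1 - b2) * g[i] ** 2
            m_hat = m[i] / (1 - b1 ** t)  # bias-corrected first moment
            v_hat = v[i] / (1 - b2 ** t)  # bias-corrected second moment
            p[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return p

# one step from (-2, 1): both coordinates move by ≈ lr = 0.1 toward the
# origin, even though the y gradient (10) is 5× the x gradient (2)
```

That first-step behavior is the per-dimension adaptivity in miniature: the steep direction no longer gets a 10× larger step than the gentle one.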
Why this matters
Every chapter from here on uses an optimizer. The MLP in chapter 6 was trained with vanilla SGD; if you found the loss curve choppy, that's why. Modern LLMs use Adam (or AdamW) almost exclusively. The choice isn't dogma — it's the empirical observation that most deep loss landscapes have the kind of pathology our toy bowl illustrates: directions of very different curvature, in the same network, in the same iteration.
5. Add optimizer state locally
Create llm/optim.py:
"""Tiny optimizer updates before we delegate tensors to PyTorch."""
from __future__ import annotations
import math
Vector = list[float]
# [1]
def sgd(params: Vector, grads: Vector, lr: float) -> Vector:
return [p - lr * g for p, g in zip(params, grads)]
def momentum(
params: Vector,
grads: Vector,
velocity: Vector,
lr: float,
beta: float = 0.9,
) -> tuple[Vector, Vector]:
# [2]
next_velocity = [beta * v - lr * g for v, g in zip(velocity, grads)]
return [p + v for p, v in zip(params, next_velocity)], next_velocity
# [3]
def adam(
params: Vector,
grads: Vector,
m: Vector,
v: Vector,
step: int,
lr: float = 3e-4,
beta1: float = 0.9,
beta2: float = 0.999,
eps: float = 1e-8,
) -> tuple[Vector, Vector, Vector]:
next_m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grads)]
next_v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grads)]
# [4]
m_hat = [mi / (1 - beta1**step) for mi in next_m]
v_hat = [vi / (1 - beta2**step) for vi in next_v]
# [5]
next_params = [
p - lr * mh / (math.sqrt(vh) + eps)
for p, mh, vh in zip(params, m_hat, v_hat)
]
return next_params, next_m, next_vThe three functions are the same idea with more memory:
- [1] `sgd` has no memory. It looks at the current gradient and steps downhill.
- [2] `momentum` carries `velocity`, so repeated gradients in the same direction accumulate and back-and-forth gradients cancel.
- [3] `adam` carries two memories: `m` for average gradient direction, `v` for average squared gradient size.
- [4] `m_hat` and `v_hat` correct the early steps, when those running averages are still biased toward zero.
- [5] The update divides by `sqrt(v_hat)`, so dimensions with huge gradients get smaller effective steps.
PyTorch will eventually manage this state for us. Writing it once makes the black box less black.
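A quick smoke test of the update rules (the driver below copies `sgd()` and `momentum()` inline so the snippet runs standalone; in the project you'd import them from `llm.optim`):

```python
Vector = list[float]

# copies of the update rules from llm/optim.py, inlined for standalone use
def sgd(params: Vector, grads: Vector, lr: float) -> Vector:
    return [p - lr * g for p, g in zip(params, grads)]

def momentum(params: Vector, grads: Vector, velocity: Vector,
             lr: float, beta: float = 0.9) -> tuple[Vector, Vector]:
    next_velocity = [beta * v - lr * g for v, g in zip(velocity, grads)]
    return [p + v for p, v in zip(params, next_velocity)], next_velocity

def bowl_grad(p: Vector) -> Vector:
    return [p[0], 10 * p[1]]   # ∇f for f(x, y) = ½(x² + 10y²)

p_sgd: Vector = [-2.0, 1.0]
p_mom, vel = [-2.0, 1.0], [0.0, 0.0]
for _ in range(100):
    p_sgd = sgd(p_sgd, bowl_grad(p_sgd), lr=0.05)
    p_mom, vel = momentum(p_mom, bowl_grad(p_mom), vel, lr=0.05)
# both end near the origin on this small, well-conditioned-enough lr
```

Note the state-threading pattern: each call returns the next parameters *and* the next optimizer state, which the loop feeds back in. PyTorch hides exactly this bookkeeping inside `optimizer.step()`.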
Recap
- Vanilla SGD is the simplest possible optimizer. It struggles when different dimensions have very different curvatures.
- Momentum smooths the trajectory by accumulating velocity. Same SGD step formula plus a running velocity term; one extra hyperparameter.
- Adam scales each dimension's step by the square root of its gradient variance. Adaptive per-dimension learning rates. Bias correction matters at the start.
- Why Adam wins by default: in real networks, different dimensions have very different gradient magnitudes. A single global lr is a compromise; Adam removes the compromise.
- Your local project now has llm/optim.py, the stateful update machinery that training loops need.
Going further
- Kingma & Ba, "Adam: A Method for Stochastic Optimization" (2014). The paper. Short.
- Ruder's "An overview of gradient descent optimization algorithms" — a survey covering vanilla SGD, momentum, Nesterov, Adagrad, RMSprop, Adam, AdamW.
- Distill.pub's "Why Momentum Really Works" — gorgeous interactive walkthrough of momentum's geometry.
Next up: this is the end of part II. Part III begins with an attention head — the mechanism that lets a token actually look at other tokens, instead of being limited to a fixed-size context.