The loss curve

Chapter 22 · 14 min

Appendix · Backprop by hand

Derive backprop on a small graph — every gradient written out. The math behind every loss.backward() you've ever called.

You called loss.backward() in chapter 5 and again in every chapter after. It quietly populated .grad on every parameter in the model. This appendix unpacks what that one line does, by deriving the gradients of a tiny network by hand and checking the result against PyTorch.

Once you have done it by hand, every loss.backward() afterward feels like a function you understand, not magic.

1. The network

The smallest interesting network: 2 inputs, 1 hidden unit, 1 output. Sigmoid activation on the hidden unit, identity on the output. Mean-squared error against a single target.

\begin{aligned}
z &= w_1 x_1 + w_2 x_2 \\
h &= \sigma(z) \\
\hat y &= v \cdot h \\
\mathcal{L} &= (\hat y - y)^2
\end{aligned}

Three parameters: w₁, w₂, v. Four named intermediates: z, h, ŷ, L. Two inputs x₁, x₂ and one target y.

Concrete values to keep things calm:

symbol    value
x₁, x₂    1.0, 0.5
w₁, w₂    0.4, 0.6
v         0.8
y         1.0

2. Forward pass by hand

Plug the numbers in:

\begin{aligned}
z &= 0.4 \cdot 1.0 + 0.6 \cdot 0.5 = 0.70 \\
h &= \sigma(0.70) = \frac{1}{1 + e^{-0.70}} \approx 0.6682 \\
\hat y &= 0.8 \cdot 0.6682 \approx 0.5346 \\
\mathcal{L} &= (0.5346 - 1.0)^2 \approx 0.2166
\end{aligned}

That is one full forward pass. Three multiplications, one addition, one sigmoid, one subtraction, one square. Seven arithmetic operations to turn two inputs into one number.
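To sanity-check the arithmetic as you go, the same forward pass is a few lines of plain Python (no PyTorch yet; the variable names simply mirror the symbols above):

import math

# inputs, weights, and target: the same concrete values as the table above
x1, x2 = 1.0, 0.5
w1, w2 = 0.4, 0.6
v, y = 0.8, 1.0

z = w1 * x1 + w2 * x2            # 0.70
h = 1.0 / (1.0 + math.exp(-z))   # sigmoid(z), ≈ 0.6682
y_hat = v * h                    # ≈ 0.5346
loss = (y_hat - y) ** 2          # ≈ 0.2166
print(f"z={z:.4f} h={h:.4f} y_hat={y_hat:.4f} loss={loss:.4f}")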

3. Backward by chain rule

The gradients we want are ∂L/∂w₁, ∂L/∂w₂, and ∂L/∂v — one number per parameter. The chain rule walks the computation graph backwards.
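Written out as a single chain for w₁, the path the steps below follow is:

\frac{\partial \mathcal{L}}{\partial w_1}
= \frac{\partial \mathcal{L}}{\partial \hat y}
\cdot \frac{\partial \hat y}{\partial h}
\cdot \frac{\partial h}{\partial z}
\cdot \frac{\partial z}{\partial w_1}

Each factor is the local derivative of one operation in the graph; the derivation computes them one edge at a time, and the chains for w₂ and v reuse the shared prefix.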

Start with the loss and propagate sensitivities outward:

\frac{\partial \mathcal{L}}{\partial \hat y} = 2(\hat y - y) = 2 \cdot (-0.4654) \approx -0.9308

That number is the "loss's sensitivity to the output" — how much the loss would change for a unit change in ŷ. Now push it through v:

\frac{\partial \mathcal{L}}{\partial v} = \frac{\partial \mathcal{L}}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial v} = -0.9308 \cdot h \approx -0.9308 \cdot 0.6682 \approx -0.6220

The output unit's gradient with respect to v is just h, because ŷ = v·h. Multiply by the upstream sensitivity and you have ∂L/∂v. Done for v.

For w₁ and w₂ the path is longer. First push through h:

\frac{\partial \mathcal{L}}{\partial h} = \frac{\partial \mathcal{L}}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial h} = -0.9308 \cdot v = -0.9308 \cdot 0.8 \approx -0.7446

Then through z. The sigmoid's derivative is σ(z)·(1 − σ(z)), which we already have:

\frac{\partial h}{\partial z} = h(1 - h) = 0.6682 \cdot 0.3318 \approx 0.2217

\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial h} \cdot \frac{\partial h}{\partial z} \approx -0.7446 \cdot 0.2217 \approx -0.1651

Finally through w₁ and w₂. Since z = w₁ x₁ + w₂ x₂, the partials are just the inputs:

\frac{\partial \mathcal{L}}{\partial w_1} = \frac{\partial \mathcal{L}}{\partial z} \cdot x_1 \approx -0.1651 \cdot 1.0 \approx -0.1651

\frac{\partial \mathcal{L}}{\partial w_2} = \frac{\partial \mathcal{L}}{\partial z} \cdot x_2 \approx -0.1651 \cdot 0.5 \approx -0.0826

Three numbers: ∂L/∂v ≈ -0.6220, ∂L/∂w₁ ≈ -0.1651, ∂L/∂w₂ ≈ -0.0826. Each tells the optimizer how to change one weight to lower the loss.

Now write the chain rule yourself: take the forward-pass values from section 2, fill in the four backward lines, and watch the gradients land on the numbers above.

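For comparison after you have tried it, here is one way those backward lines can look in plain Python, continuing from the forward-pass snippet in section 2 (the dL_d* names are just a convention for this sketch, not anything PyTorch requires):

# sensitivities, from the loss back toward the weights
dL_dyhat = 2 * (y_hat - y)          # ∂L/∂ŷ
dL_dv    = dL_dyhat * h             # ∂L/∂v
dL_dh    = dL_dyhat * v             # ∂L/∂h
dL_dz    = dL_dh * h * (1 - h)      # ∂L/∂z, using σ'(z) = h(1 − h)
dL_dw1, dL_dw2 = dL_dz * x1, dL_dz * x2
print(f"dL/dv={dL_dv:.4f}  dL/dw1={dL_dw1:.4f}  dL/dw2={dL_dw2:.4f}")
# dL/dv=-0.6220  dL/dw1=-0.1651  dL/dw2=-0.0826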

4. Verify against PyTorch

Save this as scripts/check_backprop.py (or just paste into a Python REPL):

"""check_backprop.py — verify hand-derived gradients match PyTorch."""
import math
import torch
 
# inputs and target
x = torch.tensor([1.0, 0.5])
y = torch.tensor(1.0)
 
# parameters (require grad so PyTorch tracks them)
w = torch.tensor([0.4, 0.6], requires_grad=True)
v = torch.tensor(0.8, requires_grad=True)
 
# forward
z = (w * x).sum()
h = torch.sigmoid(z)
y_hat = v * h
loss = (y_hat - y) ** 2
 
# backward
loss.backward()
 
# results
print(f"forward:  z={z.item():.4f} h={h.item():.4f} y_hat={y_hat.item():.4f} loss={loss.item():.4f}")
print(f"grads:    w1={w.grad[0].item():.4f} w2={w.grad[1].item():.4f} v={v.grad.item():.4f}")
 
# expected from the hand derivation
expected = {"w1": -0.1651, "w2": -0.0825, "v": -0.6220}
assert math.isclose(w.grad[0].item(), expected["w1"], abs_tol=1e-3)
assert math.isclose(w.grad[1].item(), expected["w2"], abs_tol=1e-3)
assert math.isclose(v.grad.item(), expected["v"], abs_tol=1e-3)
print("✓ hand-derived gradients match PyTorch")

Run it. Output:

forward:  z=0.7000 h=0.6682 y_hat=0.5346 loss=0.2166
grads:    w1=-0.1651 w2=-0.0826 v=-0.6220
✓ hand-derived gradients match PyTorch

The numbers match to four decimal places. That is what loss.backward() did, for every parameter of every model in the rest of the book. The mechanism is the same; only the graph is bigger.
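A third check uses no chain rule at all: nudge each weight slightly, recompute the loss, and divide. Here is a minimal finite-difference sketch; the step size 1e-5 and the helper names loss_fn and central_diff are just illustrative choices:

import math

def loss_fn(w1, w2, v, x1=1.0, x2=0.5, y=1.0):
    """Forward pass from the appendix: returns the scalar loss."""
    z = w1 * x1 + w2 * x2
    h = 1.0 / (1.0 + math.exp(-z))
    return (v * h - y) ** 2

def central_diff(f, args, i, eps=1e-5):
    """Approximate the partial derivative of f with respect to args[i]."""
    lo, hi = list(args), list(args)
    lo[i] -= eps
    hi[i] += eps
    return (f(*hi) - f(*lo)) / (2 * eps)

weights = (0.4, 0.6, 0.8)  # w1, w2, v
for i, name in enumerate(("w1", "w2", "v")):
    print(f"dL/d{name} ≈ {central_diff(loss_fn, weights, i):.4f}")
# dL/dw1 ≈ -0.1651   dL/dw2 ≈ -0.0826   dL/dv ≈ -0.6220

Finite differences are far too slow for training, since they need a couple of extra forward passes per parameter, but they are a handy sanity check on any gradient derived by hand.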

5. Why this generalizes

A real model has millions to billions of parameters. The hand derivation does not scale. The chain rule does:

  1. The computation graph is built automatically by PyTorch as you compute the forward pass. Every operation records its inputs and how to compute its local partial derivative (the sketch after this list peeks at that recorded graph).
  2. The backward pass walks the graph from the loss to every parameter, multiplying local partials along the way. Each parameter gets one number, written into .grad.
  3. The cost is the same order as the forward pass — roughly 2-3× as expensive, not exponentially worse. This is the entire reason modern deep learning is feasible.
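To make item 1 concrete, you can poke at the recorded graph in check_backprop.py right after the forward pass. The exact node names (PowBackward0 and friends) can vary across PyTorch versions, but the structure is stable; this sketch just follows one input edge at a time:

# add after the forward pass in check_backprop.py
print(loss.grad_fn)                 # the node recorded for (y_hat - y) ** 2
node = loss.grad_fn
while node is not None:
    print(type(node).__name__)      # e.g. PowBackward0, SubBackward0, ...
    node = node.next_functions[0][0] if node.next_functions else None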

The local partials are mechanical. For every operation y = f(x), PyTorch knows dy/dx. Sigmoid, multiplication, addition, sum, softmax — each has a one-line backward formula. The chain rule glues them together.
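One of those one-line formulas is exactly the σ'(z) = h(1 − h) used in section 3. A quick way to watch autograd produce it for a single operation, reusing the same z = 0.70:

import torch

z = torch.tensor(0.70, requires_grad=True)
h = torch.sigmoid(z)
(dh_dz,) = torch.autograd.grad(h, z)   # autograd's local partial for sigmoid
print(f"{dh_dz.item():.4f}")           # 0.2217
print(f"{(h * (1 - h)).item():.4f}")   # 0.2217, the closed form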

That is the whole story. When chapter 8's module calls loss.backward(), the graph being walked has hundreds of operations and tens of thousands of parameters, but every individual edge is one of the simple rules from this appendix.

Recap

  • loss.backward() walks a computation graph backward, applying the chain rule one edge at a time.
  • Each operation contributes a local partial derivative. The forward pass remembers what was computed; the backward pass uses that record.
  • The hand derivation matches PyTorch to four decimals on a 3-parameter network. The same mechanism scales to billion-parameter models without changing.
  • One number per parameter comes out, written into .grad. The optimizer reads those numbers and decides how to move each weight.

Going further