
Chapter 5 · 16 min

A neuron that learns

One neuron, one loss, one gradient. Build a learnable linear unit by hand and watch it converge — the smallest possible training loop.

We've now got tokens, embeddings, and a way to measure similarity. But nothing we've built so far actually learns. The model in chapter 1 was trained by counting; that's not learning, that's bookkeeping. To go further we need a function whose parameters get adjusted in response to its mistakes.

A neuron is the smallest possible such function. It takes a vector input, weighs it, sums, adds a bias, and squashes the result through a non-linearity:

\text{output} = \sigma\!\left(\sum_i w_i x_i + b\right)

That single line is the whole story of feed-forward networks. Stack a bunch in parallel and you get a layer; stack layers and you get a deep network. But the neuron itself is what learns. Let's build one, watch it work, then save the update rule locally.

1. The forward pass

Write the function that takes inputs x = [x_0, x_1], weights w = [w_0, w_1], bias b, and returns the neuron's output. The script defines sigmoid first, then uses it inside the forward pass.

Code · JavaScript

The bars show the neuron's output for several test inputs. With w = [1, 0.5] and b = 0, points where w · x + b is positive should produce outputs > 0.5 (warm), and points where it is negative should produce outputs < 0.5 (cool). The neuron is just a weighted sum followed by a squash; everything else in this chapter is mechanics for adjusting w and b automatically.
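If you want something to compare against, here is a minimal sketch of one possible solution (function and variable names are illustrative; the exercise's own scaffolding may differ):

// Squash any real-valued score into the (0, 1) range.
function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

// One neuron: weighted sum of the inputs, plus a bias, squashed through sigmoid.
function neuron(x, w, b) {
  let z = b;
  for (let i = 0; i < x.length; i++) {
    z += w[i] * x[i];
  }
  return sigmoid(z);
}

// With w = [1, 0.5] and b = 0: positive w · x + b lands above 0.5, negative below.
console.log(neuron([2, 1], [1, 0.5], 0));   // > 0.5
console.log(neuron([-2, -1], [1, 0.5], 0)); // < 0.5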

2. The loss function

Given the neuron's output p and the true label y (0 or 1), we need a number that says "how bad was that prediction?". The standard choice is binary cross-entropy:

L = -\bigl[y \log p + (1 - y)\log(1 - p)\bigr]

It's near zero when p matches y, and grows rapidly as p moves toward the wrong answer. The asymmetry (log p handles the y = 1 case, log(1 − p) the y = 0 case) makes the gradient clean later.

Add a small epsilon (~1e-12) before taking logs to avoid log(0) when the model is overconfident.

Code · JavaScript

The bars show the loss across six (prediction, true label) combinations. When prediction matches label, the loss is small (green). When they disagree, especially when the model is confident in the wrong answer, the loss is large (red).
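For reference, a sketch of the loss with the epsilon clamp from the tip above (again, names are illustrative):

// Binary cross-entropy between a prediction p in (0, 1) and a label y in {0, 1}.
function bce(p, y) {
  const eps = 1e-12;
  // Clamp so an overconfident prediction never produces log(0).
  const q = Math.min(Math.max(p, eps), 1 - eps);
  return -(y * Math.log(q) + (1 - y) * Math.log(1 - q));
}

console.log(bce(0.99, 1)); // ~0.01: confident and right, tiny loss
console.log(bce(0.01, 1)); // ~4.6: confident and wrong, large loss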

3. The gradient

Now the magic. The composition of sigmoid + binary cross-entropy has a beautiful property: the gradient of the loss with respect to the pre-activation (w · x + b) is just p − y. That is, the error itself is the gradient. From there:

\frac{\partial L}{\partial w_i} = (p - y)\,x_i, \qquad \frac{\partial L}{\partial b} = (p - y)
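If you want the intermediate step, write z = w · x + b and p = σ(z). Differentiating the loss and the sigmoid separately, then chaining them, gives:

\frac{\partial L}{\partial p} = \frac{p - y}{p(1 - p)}, \qquad
\frac{\partial p}{\partial z} = p(1 - p), \qquad
\frac{\partial L}{\partial z} = \frac{p - y}{p(1 - p)} \cdot p(1 - p) = p - y

The p(1 − p) factors cancel, which is exactly why this pairing of sigmoid and binary cross-entropy is so convenient.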

Write the gradient function. The chapter compares your output to a reference computation on a known test case.

Code · JavaScript

If the comparison shows ✓, your gradient agrees with the closed form. This is the lemma that makes logistic regression and neural networks practical: there's no need to differentiate the loss by hand; the algebraic shortcut takes care of it.
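A sketch of what the gradient function might look like, assuming the neuron and bce helpers from the earlier sections are in scope; the finite-difference check at the end is an extra sanity test, not part of the exercise:

// Gradient of the loss with respect to w and b, using the p − y shortcut.
function gradients(x, y, w, b) {
  const p = neuron(x, w, b);   // forward pass
  const error = p - y;         // dL/d(w · x + b)
  return { dw: x.map(xi => error * xi), db: error };
}

// Optional: estimate dL/dw_0 numerically and compare.
function numericalDw0(x, y, w, b, h = 1e-6) {
  const wPlus = [w[0] + h, ...w.slice(1)];
  const wMinus = [w[0] - h, ...w.slice(1)];
  return (bce(neuron(x, wPlus, b), y) - bce(neuron(x, wMinus, b), y)) / (2 * h);
}

const { dw } = gradients([2, 1], 1, [0.3, -0.2], 0.1);
console.log(dw[0], numericalDw0([2, 1], 1, [0.3, -0.2], 0.1)); // should agree closely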

4. Train the neuron

Now we put it together. Use the dataset of two clusters (~60 points in 2D), iterate gradient descent, and watch the decision boundary fall into place.

Code · JavaScript

The scatter on the left shows the data; the orange line is the decision boundary w · x + b = 0, where the neuron's output crosses 0.5. After training, the line should slice between the two clusters: a piece of geometry the neuron figured out by repeatedly adjusting w and b to lower the loss.

The line on the right is the loss curve over iterations. It should slope downward, leveling off as the neuron approaches the best linear fit it can produce.
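For readers following along outside the widget, here is one possible shape for the loop; the cluster centers, learning rate, and iteration count are made up for illustration, and it assumes the neuron and bce sketches above:

// Two toy clusters: class 0 near (-1, -1), class 1 near (1, 1).
const data = [];
for (let i = 0; i < 30; i++) {
  const jitter = () => 0.4 * (Math.random() - 0.5);
  data.push({ x: [-1 + jitter(), -1 + jitter()], y: 0 });
  data.push({ x: [1 + jitter(), 1 + jitter()], y: 1 });
}

let w = [0, 0];
let b = 0;
const lr = 0.5;

for (let step = 0; step < 200; step++) {
  let totalLoss = 0;
  for (const { x, y } of data) {
    const p = neuron(x, w, b);   // forward pass
    const error = p - y;         // gradient of the loss w.r.t. w · x + b
    w = w.map((wi, i) => wi - lr * error * x[i]);  // nudge weights against the gradient
    b -= lr * error;                               // and the bias
    totalLoss += bce(p, y);
  }
  if (step % 50 === 0) console.log(step, (totalLoss / data.length).toFixed(4));
}
// The decision boundary is the line w[0] * x0 + w[1] * x1 + b = 0.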

5. Add the first learning step locally

Create llm/nn.py with the simplest trainable unit:

"""Small neural-network pieces before the PyTorch rewrite."""
from __future__ import annotations
 
import math
 
 
# [1]
def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))
 
 
# [2]
def neuron(x: list[float], w: list[float], b: float) -> float:
    return sigmoid(sum(xi * wi for xi, wi in zip(x, w)) + b)
 
 
# [3]
def bce(pred: float, target: float) -> float:
    eps = 1e-12
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))
 
 
def neuron_step(
    x: list[float],
    target: float,
    w: list[float],
    b: float,
    lr: float,
) -> tuple[list[float], float, float]:
    pred = neuron(x, w, b)
    # [4]
    error = pred - target
    # [5]
    next_w = [wi - lr * error * xi for wi, xi in zip(w, x)]
    next_b = b - lr * error
    return next_w, next_b, bce(pred, target)

Read neuron_step as one turn of the training loop:

  • [1] sigmoid squashes any score into the 0-1 range.
  • [2] neuron computes weighted sum + bias, then turns it into a probability-like output.
  • [3] bce is the loss: confident wrong answers are punished hard.
  • [4] error = pred - target is the key simplification. If the target is 1 and the prediction is too low, error is negative. If the target is 0 and the prediction is too high, error is positive.
  • [5] updates the weights and bias against that error. Each weight update is scaled by xi, so active input dimensions get credit or blame.

This is the first file in your project that updates because of a mistake. That idea scales all the way to chapter 13.

What the neuron can't do

Run the same procedure on data that isn't linearly separable (say, points arranged in rings, with the inner ring being class 0 and the outer ring class 1) and the neuron will fail. No line through the input space can carve out a circular region. The neuron is fundamentally a linear classifier with a sigmoid on the output.
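If you want to see the failure concretely, here is a quick sketch of ring-shaped data (radii and counts chosen arbitrarily) that you can feed to the same training loop:

// Inner ring (radius 0.5) is class 0, outer ring (radius 1.5) is class 1.
const rings = [];
for (let i = 0; i < 60; i++) {
  const angle = Math.random() * 2 * Math.PI;
  const cls = i % 2;
  const r = cls === 0 ? 0.5 : 1.5;
  rings.push({ x: [r * Math.cos(angle), r * Math.sin(angle)], y: cls });
}
// Running the same gradient-descent loop on `rings` leaves the loss stuck well
// above zero: no single line can keep the inner ring on one side and the whole
// outer ring on the other, so some points are always on the wrong side.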

That's the limitation we attack in chapter 6: one neuron makes one boundary line; a layer of neurons makes many boundary lines and combines them into something flexible enough to handle XOR, spirals, and other non-linear datasets. Same algorithm; the model just gets bigger.

Recap

  • A neuron is a weighted sum followed by a non-linearity. Its parameters are one weight per input dimension plus a shared bias.
  • The loss measures how wrong the prediction is. Binary cross-entropy + sigmoid is the standard pair for binary classification.
  • The gradient of BCE with respect to the weights is (p − y) · x: the error scales the input. Mathematically, this falls out of the chain rule once the BCE/sigmoid pair has been chosen.
  • Training is repeated gradient descent: nudge each parameter against its gradient, watch the loss go down, watch the decision boundary slide into place.
  • Your local project now has llm/nn.py, starting with one trainable neuron.
  • Limit: a single neuron can only learn linear boundaries. One line. That's the wall the next chapter breaks through.

Going further

  • Appendix · Backprop by hand — derives the gradients for this same kind of network step by step, then verifies the result against PyTorch's loss.backward(). Optional but recommended if "where do the gradients come from?" is still fuzzy.
  • Andrej Karpathy's "Building micrograd" — derives the same gradients by hand and builds a tiny autograd engine that handles the chain rule for arbitrary expressions. We'll use one of those in chapter 6.
  • 3Blue1Brown's neural networks series — visual derivation of backpropagation over a network.
  • The full reference implementations live in lib/ml/nn/{linear,activations,loss}.ts.

Next up: stacking layers. One neuron is one line; multiple neurons in a layer are multiple lines. Combine them and you can carve out anything.