Chapter 6 · 14 min
Stacking layers
A single neuron is a line. Stack them with a non-linearity and you get an MLP — the feed-forward block at the heart of every Transformer.
In chapter 5, you trained a single neuron and watched it slide a line between two clouds of points. A line was the only thing it could draw. Try the same trainer on data that isn't linearly separable — say, points arranged in a checkerboard pattern, or a circle inside another circle — and the neuron fails. There's no single line that separates them.
The fix is almost embarrassingly simple: stack neurons. A layer is several neurons running in parallel on the same input. They each carve out their own line. Combine their outputs and you get curves, kinks, regions — anything you want, given enough neurons. That's the whole story of why deep learning works.
We're going to inspect a 2-layer MLP (multi-layer perceptron), train it on XOR, and watch it succeed where the single neuron failed. Three runnable cells in the browser, then a local forward pass you will reuse when the Transformer block arrives.
1. The linear layer
A linear layer is y = Wx + b, where W is a weight matrix and b a bias vector. If x has length n and the layer has out neurons, then W has shape [out, n] and b has length out. The output has length out.
Conceptually, each row of W is the weight vector of one neuron. Each entry of Wx + b is that neuron's pre-activation.
Code · JavaScript
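The cell's source isn't reproduced here; below is a minimal sketch of what it computes, assuming plain arrays (the argument order mirrors the Python linear you'll write in section 4):

// Hypothetical sketch, not the chapter's cell: y = Wx + b with plain arrays.
// W is [out][n], b has length out, x has length n.
function linear(x, W, b) {
  return W.map((row, i) =>
    row.reduce((acc, w, j) => acc + w * x[j], b[i])
  );
}

// One neuron per row: a [2, 3] weight matrix maps length-3 inputs
// to length-2 outputs.
const W = [[1, 0, -1], [0.5, 0.5, 0.5]];
const b = [0, 1];
console.log(linear([2, 3, 4], W, b)); // [-2, 5.5]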
That's the workhorse of every dense neural network. Read it once, get used to the shapes, then reuse it everywhere.
2. The MLP forward pass
Now we chain two linear layers with a non-linearity in between. The non-linearity is critical: without it, two stacked linear layers collapse to a single linear layer, because the composition of two linear maps is itself linear: W2(W1x + b1) + b2 = (W2W1)x + (W2b1 + b2). The non-linearity in modern networks is almost always ReLU: relu(x) = max(0, x).
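To see the collapse concretely, here is a small numeric check (my own sketch, not from the chapter), reusing linear from the section-1 sketch:

// Two stacked linear layers with no non-linearity in between...
const A = [[2, 0], [1, 3]], a = [1, -1]; // first layer: 2 -> 2
const B = [[1, 1]], c = [0.5];           // second layer: 2 -> 1

const x = [3, -2];
const twoSteps = linear(linear(x, A, a), B, c);

// ...equal one combined linear layer: W = B·A, b = B·a + c.
const Wc = [[B[0][0] * A[0][0] + B[0][1] * A[1][0],
             B[0][0] * A[0][1] + B[0][1] * A[1][1]]];
const bc = [B[0][0] * a[0] + B[0][1] * a[1] + c[0]];

console.log(twoSteps, linear(x, Wc, bc)); // both [3.5]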
The full forward pass for our 2-input, 1-output network is y = W2 · relu(W1x + b1) + b2.
Write this as a function. The chapter compares your output against a reference implementation on four test points (one per quadrant of the plane).
Code · JavaScript
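The cell itself isn't shown here; a sketch of the forward pass under the same contract, with made-up weights standing in for the chapter's fixed reference weights:

const relu = (v) => v.map((z) => Math.max(0, z));

function mlpForward(net, x) {
  const hidden = relu(linear(x, net.W1, net.b1)); // learned hidden features
  return linear(hidden, net.W2, net.b2);          // mixed into the output
}

// Made-up 2 -> 4 -> 1 net, just to exercise the shapes.
const net = {
  W1: [[1, 0], [0, 1], [-1, 0], [0, -1]],
  b1: [0, 0, 0, 0],
  W2: [[0.5, 0.5, 0.5, 0.5]],
  b2: [0.1],
};

// One test point per quadrant of the plane.
for (const p of [[1, 1], [-1, 1], [-1, -1], [1, -1]]) {
  console.log(p, mlpForward(net, p));
}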
If your bars match the reference's, your forward pass is correct. The two should agree exactly because we use the same weights and the same arithmetic. Untrained, the network's predictions are mostly noise — that's expected. Training is where the layers earn their keep.
3. Train on XOR
XOR is the textbook case where a single neuron fails. Two clusters in opposite quadrants are class 0; the other two quadrants are class 1. No line through the plane separates them.
Train an MLP with hidden size 8 on a noisy XOR dataset. The chapter provides initMlp, mlpStep (one full-batch gradient-descent step, returning the updated net plus the loss), and the dataset; the cell runs the loop.
Code · JavaScript
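The cell body isn't shown here; a plausible version of the loop, assuming hypothetical signatures initMlp(hiddenSize) → net and mlpStep(net, data, lr) → { net, loss }, with data being the provided dataset:

let net = initMlp(8); // hidden size 8
const lr = 0.5;       // learning rate; lower it if the loss oscillates

const losses = [];
for (let i = 0; i < 2000; i++) {
  const step = mlpStep(net, data, lr); // one full-batch gradient step
  net = step.net;
  losses.push(step.loss);
}
console.log(losses[0], losses.at(-1)); // should end far below where it started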
The left panel shows the network's decision regions — every pixel is colored by what the trained MLP thinks the class probability is at that location. Warm = class 1, cool = class 0, faint = uncertain. After training, you should see a checkerboard-shaped pattern carving the plane into four regions, with the data points landing in matching colors.
The loss curve on the right should decrease, possibly with a couple of plateaus. Training is famously not always smooth — the first few iterations often go nowhere while the random init untangles itself, then the loss drops fast.
If training fails (the regions are wrong, the loss plateaus high), try increasing hiddenSize, increasing iterations, or adjusting lr. Pretty much every neural-network failure mode has the same fix: more capacity, more time, or more careful step sizes.
4. Extend llm/nn.py
Add these helpers below the existing code:
# [1]
Vector = list[float]
Matrix = list[Vector]

# [2]
def relu(x: Vector) -> Vector:
    return [max(0.0, value) for value in x]

# [3]
def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    return [
        sum(xi * wi for xi, wi in zip(x, row)) + b
        for row, b in zip(weight, bias)
    ]

def mlp_forward(
    x: Vector,
    w1: Matrix,
    b1: Vector,
    w2: Matrix,
    b2: Vector,
) -> Vector:
    # [4]
    hidden = relu(linear(x, w1, b1))
    # [5]
    return linear(hidden, w2, b2)

Read the shapes first, then the code:
- [1] gives names to the shapes: a vector is one row, a matrix is many rows.
- [2] relu keeps positive signals and clips negative ones to zero. That is the non-linear bend the previous chapter did not have.
- [3] linear takes one input vector and many rows of weights. Each row is one neuron, and the list comprehension returns every neuron's output as a new vector.
- [4] creates learned hidden features from the raw input.
- [5] mixes those hidden features into the final output.
The code is plain lists for now. That is deliberate: you can see every shape. In chapter 12, linear becomes torch.nn.Linear, but the contract stays the same.
What this fixes (and what it costs)
Fixes: any boundary you can imagine. The universal approximation theorem guarantees that an MLP with enough hidden neurons can approximate any continuous function on a compact domain to arbitrary precision. You can fit XOR, spirals, concentric circles, anything.
Costs:
- Many more parameters. Hidden size 8 with 2 inputs is 8×2 + 8 + 8 + 1 = 33 parameters (see the sanity check after this list). Real networks have millions or billions.
- Local minima. Gradient descent on a non-convex loss can get stuck. Modern networks almost always escape because their loss landscapes are forgiving in high dimensions, but that understanding is still mostly empirical.
- Hyperparameters. Hidden size, learning rate, initialization, optimizer choice — all of these affect whether training works. Chapter 7 starts that conversation.
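A quick sanity check of the parameter arithmetic from the first bullet (paramCount is a hypothetical helper, not part of the chapter's code):

// Weights plus biases for both layers of a 2-layer MLP.
function paramCount(inputs, hidden, outputs) {
  return hidden * inputs + hidden     // W1 is [hidden, inputs], b1 is [hidden]
       + outputs * hidden + outputs;  // W2 is [outputs, hidden], b2 is [outputs]
}
console.log(paramCount(2, 8, 1)); // 33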
Recap
- A linear layer is y = Wx + b. One row of W per output neuron.
- An MLP is two or more linear layers with non-linearities (usually ReLU) between them. Without the non-linearity, the network collapses to one linear function.
- Training is the same algorithm as for one neuron — gradient descent on the loss — but the gradient flows back through every layer (chain rule).
- MLPs solve XOR. That's the canonical example, but the principle generalizes: any boundary you can describe, an MLP with enough capacity can learn.
- Your local llm/nn.py now has linear, relu, and mlp_forward, the same pieces used inside Transformer feed-forward layers.
- The next chapter looks at how gradient descent actually finds those weights — and why naïve gradient descent isn't usually enough.
Going further
- Karpathy's "Building micrograd" — derives the chain rule visually and builds an autograd engine that handles the gradient computation for arbitrary expressions.
- 3Blue1Brown's "What is backpropagation really doing?" — a gentle visual explanation of the algorithm.
- Tinker (TensorFlow Playground) — drag a network's layers and neurons around and watch it train on built-in datasets. Good for building intuition.
- The full reference implementation lives in lib/ml/nn/.
Next up: gradient descent live — we've been calling mlpStep as a black box. Time to look inside, see why vanilla gradient descent often gets stuck, and meet the optimizers that work around it.