Station 02 · neural networks

The Descent

Training a network is not a search for the right answer; it's a walk downhill. The loss is a landscape, the weights are your position, and every gradient-based optimizer is just a different way of taking the next step. Below, three of them race down the same hill.

Optimizer race — three solvers, one landscape

[interactive 3D figure: SGD, momentum, and Adam descend the same loss surface, each with a live loss readout and a shared learning-rate control. On the ravine, steep across and shallow along, SGD zigzags while Adam rescales each axis.]

fig. 01 — sgd, momentum, and adam on the same landscape, same learning rate

§1

A neuron is a line; a network is a bend

Strip the mystique and a neuron is the smallest model from station 01: weigh each input, add a bias, output the sum. One neuron over two inputs draws one straight line — which is exactly why the checker dataset embarrassed it for twenty years. XOR has no straight-line answer, and stacking linear layers doesn’t help: a chain of linear maps collapses into a single linear map. Depth, on its own, buys you nothing.
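The collapse is easy to check by hand. A minimal sketch in plain Python, assuming made-up weights for a 2→3→1 stack with no activation in between (the numbers and helper names are illustrative, not taken from the figures):

```python
# Two stacked linear layers with no activation equal one linear layer.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Hypothetical weights: layer 1 maps 2 -> 3, layer 2 maps 3 -> 1.
W1 = [[1.0, -2.0], [0.5, 0.5], [3.0, 1.0]]
b1 = [0.1, -0.2, 0.3]
W2 = [[2.0, -1.0, 0.5]]
b2 = [0.7]

def two_layers(x):
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The collapsed single layer: W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_layer(x):
    return add(matvec(W, x), b)

x = [0.3, -1.2]
gap = abs(two_layers(x)[0] - one_layer(x)[0])  # zero up to rounding error
```

The same algebra holds for any number of stacked layers, which is why depth without a nonlinearity adds no expressive power.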

The fix costs one line of code: after each weighted sum, push the result through a small nonlinear function before handing it to the next layer. That bend is the entire trick. With it, two layers of a few neurons can carve curved regions no single line could, and with enough hidden neurons an MLP can approximate any continuous function on a bounded range of inputs as closely as you like (the universal approximation theorem). Everything else in deep learning is making that possibility cheap to find.
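To see the bend do its job, here is a sketch of a two-neuron ReLU network that computes XOR. The weights are hand-picked for illustration; they are an assumption, not what training in the figures would necessarily find.

```python
def relu(z):
    return max(0.0, z)

def xor_net(x1, x2):
    """Two hidden ReLU neurons and a linear output, with hand-set weights."""
    h1 = relu(x1 + x2)         # grows with the number of inputs that are on
    h2 = relu(x1 + x2 - 1.0)   # fires only when both inputs are on
    return h1 - 2.0 * h2       # subtract the "both on" case twice

outputs = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
```

Remove the two `relu` calls and the output becomes `-(x1 + x2) + 2`, a single linear function again, which cannot separate the four corners.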

§2

The activation is where the bend comes from

The choice of bend matters less for what the network can express than for whether gradients survive the trip backwards through it. The sigmoid looks harmless until you read its derivative: never more than 0.25. Chain eight sigmoid layers and the activations alone shrink the training signal by at least a factor of four at every layer. In the best case that leaves 0.25^8, about 0.000015 of the original signal, so after eight layers effectively nothing arrives. That’s the vanishing gradient, and it’s why early deep nets simply refused to get deep.
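The arithmetic can be sketched in a few lines. This assumes the simplest possible model of the effect: only the activation derivatives are multiplied together, and the weight matrices between layers (which also scale the signal) are ignored.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), largest at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)            # the maximum the derivative can reach
best_case = peak ** 8               # eight layers, every one sitting at its peak
off_peak = sigmoid_grad(2.0) ** 8   # eight layers with inputs away from zero
```

`best_case` is the most optimistic outcome; any input away from zero, as in `off_peak`, shrinks the signal further.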

ReLU’s answer is blunt: slope exactly 1 when active, so gradients pass through untouched. The price is the other branch — slope exactly 0 — and a neuron whose input goes negative for every example stops learning forever: a dead ReLU. Drag the input marker below zero in the gallery and watch all eight bars go dark at once. Leaky ReLU keeps a trickle flowing for exactly this reason.
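The same bookkeeping for ReLU and leaky ReLU, as a rough sketch. It assumes every one of the eight layers sees the same input z, and the leaky slope of 0.01 is a commonly used value chosen here as an assumption.

```python
def relu_grad(z):
    """Slope of ReLU: exactly 1 when active, exactly 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, slope=0.01):
    """Slope of leaky ReLU: a small nonzero slope on the negative side."""
    return 1.0 if z > 0 else slope

def surviving(grad_fn, z, layers=8):
    """Gradient factor left after `layers` identical activations at input z."""
    factor = 1.0
    for _ in range(layers):
        factor *= grad_fn(z)
    return factor

alive = surviving(relu_grad, 1.0)            # active: the signal passes untouched
dead = surviving(relu_grad, -1.0)            # inactive: the signal is cut to zero
trickle = surviving(leaky_relu_grad, -1.0)   # tiny, but never exactly zero
```

The point of the leaky variant is the last line: a neuron stuck on the negative side still receives some gradient, so it can in principle recover.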

Activation gallery — the function, its derivative, and what survives

[interactive figure: each activation f(z) drawn solid with its derivative f′(z) dashed, an adjustable input z, and bars showing the gradient left after each of 8 layers on a log scale. On ReLU the derivative is exactly 1 or exactly 0, alive or dead with nothing between, so a negative input kills every layer.]

fig. 02 — four bends, and how much gradient each lets back through

§3

Backprop is the chain rule, run backwards

To walk downhill you need the slope — the gradient of the loss with respect to every weight. Backpropagation computes all of them in one backwards sweep: each layer asks “how much did my output matter?” and multiplies in how much it, in turn, depended on each of its weights. It’s the calculus chain rule with good bookkeeping — nothing more exotic, even in a model with a billion weights instead of the seventeen below.
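A minimal sketch of that backwards sweep for a 2→H→1 network, checked against a finite-difference estimate. The tanh activation, squared-error loss, weights, and training example are all assumptions picked for illustration; the live figure may use different choices.

```python
import math

def forward(params, x):
    """2 -> H -> 1 network: tanh hidden layer, linear output."""
    W1, b1, w2, b2 = params
    z = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [math.tanh(v) for v in z]
    y = sum(w * hv for w, hv in zip(w2, h)) + b2
    return h, y

def loss(params, x, target):
    _, y = forward(params, x)
    return 0.5 * (y - target) ** 2

def backward(params, x, target):
    """One backwards sweep: every gradient from a single pass of the chain rule."""
    W1, b1, w2, b2 = params
    h, y = forward(params, x)
    dy = y - target                        # dL/dy for squared error
    dw2 = [dy * hv for hv in h]            # output weights see the hidden values
    db2 = dy
    # Back through the output weight, then through tanh' = 1 - h^2.
    dz = [dy * w * (1.0 - hv * hv) for w, hv in zip(w2, h)]
    dW1 = [[d * xi for xi in x] for d in dz]   # hidden weights see the inputs
    db1 = dz
    return dW1, db1, dw2, db2

# Made-up weights for a 2 -> 2 -> 1 network and one training example.
params = ([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1], [1.5, -2.0], 0.05)
x, target = [1.0, -1.0], 1.0
dW1, db1, dw2, db2 = backward(params, x, target)

# Sanity check: nudge one weight and compare with the backprop value.
eps = 1e-6

def nudged(delta):
    W1 = [row[:] for row in params[0]]
    W1[0][1] += delta
    return (W1,) + params[1:]

numeric = (loss(nudged(eps), x, target) - loss(nudged(-eps), x, target)) / (2 * eps)
analytic = dW1[0][1]
```

The finite-difference check needs two forward passes per weight, while `backward` returns every gradient from one sweep. That difference is what makes training large models feasible.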

The figure runs that loop for real: forward pass, backward pass, step, hundreds of times a second. Give it the checker — the dataset station 01’s KNN could only handle by memorizing neighbors — and watch a four-neuron network bend its way to the same answer with seventeen numbers. Then cut it to two neurons: the loss stalls high, because two bends genuinely cannot carve four quadrants.
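Where the seventeen comes from, as a short sketch. It assumes a fully connected 2→H→1 layout with one bias per neuron; `param_count` is a helper name invented here.

```python
def param_count(n_in, hidden, n_out=1):
    """Weights plus biases in a fully connected n_in -> hidden -> n_out network."""
    hidden_layer = n_in * hidden + hidden     # one weight per input, one bias per neuron
    output_layer = hidden * n_out + n_out     # one weight per hidden neuron, one bias
    return hidden_layer + output_layer

four_neurons = param_count(2, 4)   # 2*4 + 4 + 4*1 + 1
two_neurons = param_count(2, 2)    # 2*2 + 2 + 2*1 + 1
```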

Live training — watch the boundary bend

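[interactive figure: choose the dataset, the number of hidden neurons, and the activation, then watch the loss and the decision boundary update. Checker is the XOR problem: try 2 hidden neurons and watch it struggle. If ReLU stalls flat, you’ve met a dead neuron (re-roll).]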

fig. 03 — a 2→H→1 network learning live, full-batch, in your browser

§4

Why everyone uses Adam

Plain SGD has one move — step straight downhill — and the race above shows both ways it goes wrong. On the ravine it zigzags: the walls are steep, the floor is shallow, and one learning rate can’t serve both. On the saddle it crawls, because almost-flat is almost-stopped when your step is proportional to the slope. Real loss landscapes, in millions of dimensions, are mostly ravines and saddles.
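The ravine problem shows up even on a toy bowl. This sketch assumes a made-up quadratic loss, 0.5 * (A*x^2 + B*y^2), with curvature 20 across and 0.2 along, and uses exact gradients (no minibatch noise); it is not the surface used in the race.

```python
A, B = 20.0, 0.2   # made-up curvatures: steep across (x), shallow along (y)

def grad(x, y):
    """Gradient of 0.5 * (A*x^2 + B*y^2)."""
    return A * x, B * y

def sgd_path(x, y, lr, steps):
    """Plain gradient descent: step straight downhill, same lr on both axes."""
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        path.append((x, y))
    return path

# lr must stay below 2 / A = 0.1 or the steep axis diverges.
path = sgd_path(1.0, 1.0, lr=0.09, steps=50)

# Count how often x jumps to the other wall of the ravine.
sign_flips = sum(1 for (x0, _), (x1, _) in zip(path, path[1:]) if x0 * x1 < 0)
final_x, final_y = path[-1]
```

With these numbers x overshoots to the opposite wall on every step, while y is still a large fraction of the way from the minimum after 50 steps. The learning rate that keeps the steep axis stable is far too timid for the shallow one.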

Momentum fixes the first problem with memory: keep 90% of last step’s velocity, and the zigzags cancel while progress compounds. Adam adds the second fix: each weight gets its own step size, divided by a running average of how big its gradients have recently been, so the shallow direction gets boosted and the steep one reined in. Watch it cross the saddle’s plateau while SGD is still deciding. Every model in the later stations, and almost certainly every diffusion checkpoint you’ll ever load into ComfyUI, was trained by Adam or one of its descendants.
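The three update rules side by side, as a hedged sketch on a made-up ravine (curvature 20 across, 0.2 along). The momentum form shown is one common convention, and the Adam constants are the widely used defaults; library implementations differ in details, so treat this as an illustration rather than any framework's exact code.

```python
import math

A, B = 20.0, 0.2   # made-up ravine: steep across (x), shallow along (y)

def grad(w):
    """Gradient of 0.5 * (A*x^2 + B*y^2) at w = [x, y]."""
    return [A * w[0], B * w[1]]

def sgd_step(w, state, lr):
    g = grad(w)
    return [wi - lr * gi for wi, gi in zip(w, g)], state

def momentum_step(w, state, lr, beta=0.9):
    g = grad(w)
    # Keep 90% of the old velocity and add the new gradient.
    v = [beta * vi + gi for vi, gi in zip(state["v"], g)]
    return [wi - lr * vi for wi, vi in zip(w, v)], {"v": v}

def adam_step(w, state, lr, b1=0.9, b2=0.999, eps=1e-8):
    g = grad(w)
    t = state["t"] + 1
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(state["m"], g)]       # running mean
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(state["v"], g)]  # running mean of squares
    m_hat = [mi / (1 - b1 ** t) for mi in m]   # bias correction for the zero start
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    # Each weight's step is divided by the recent size of its own gradient.
    new_w = [wi - lr * mh / (math.sqrt(vh) + eps)
             for wi, mh, vh in zip(w, m_hat, v_hat)]
    return new_w, {"t": t, "m": m, "v": v}

def run(step, state, lr, steps, start=(1.0, 1.0)):
    w = list(start)
    for _ in range(steps):
        w, state = step(w, state, lr)
    return w

sgd_w = run(sgd_step, None, lr=0.09, steps=50)
mom_w = run(momentum_step, {"v": [0.0, 0.0]}, lr=0.09, steps=50)
adam_first = run(adam_step, {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}, lr=0.1, steps=1)
```

Two things are worth comparing. After 50 steps `mom_w[1]` is much closer to zero than `sgd_w[1]`, so momentum has made more progress along the shallow floor. And `adam_first` moves both coordinates by about the learning rate, even though the two gradients differ by a factor of 100: that is the per-weight rescaling at work.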

Bridge

The networks here saw two numbers at a time. Images are about to hand them a quarter million — and the first instinct, treating every pixel as an unrelated input, wastes everything the picture is. The road to convolution starts at the next bench. But first: a short interlude about the word nearest, which is about to mean two completely different things.
