Station 02 · neural networks
The Descent
Training a network is not a search for the right answer — it's a walk downhill. The loss is a landscape, the weights are your position, and every optimizer ever invented is just a different way of taking the next step. Below, three of them race down the same hill.
steep across, shallow along — SGD zigzags, Adam rescales each axis — drag to orbit · click the surface to re-drop all three
fig. 01 — sgd, momentum, and adam on the same landscape, same learning rate
A neuron is a line; a network is a bend
Strip the mystique and a neuron (the smallest unit of a network: multiply each input by a weight, add them up, pass the total through a simple bend) is the smallest model from station 01: weigh each input, add a bias, output the sum. A model, recall, is a program whose behavior was tuned from examples rather than written by hand: a big pile of adjustable numbers. One neuron over two inputs draws one straight line, which is exactly why the checker dataset embarrassed it for twenty years. XOR has no straight-line answer, and stacking linear layers doesn’t help: a chain of linear maps collapses into a single linear map. Depth, on its own, buys you nothing.
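The collapse is easy to check by hand. Here is a minimal sketch in plain Python; the weight values, the input, and the helper names `matvec` and `matmul` are illustrative assumptions, and biases are left out for brevity (they fold into a single bias the same way).

```python
def matvec(W, x):
    """Apply a weight matrix (list of rows) to an input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


# Hypothetical weights: layer 1 maps 2 inputs to 3 units, layer 2 maps 3 units to 1 output.
W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
W2 = [[1.0, -2.0, 0.5]]
x = [0.7, -1.3]

# Two linear layers applied one after the other...
two_layers = matvec(W2, matvec(W1, x))

# ...give the same output as one layer whose weights are the product W2 @ W1.
one_layer = matvec(matmul(W2, W1), x)

print(two_layers, one_layer)
```

Whatever values you pick, the two results agree up to floating-point rounding, because the product of the two matrices is itself just one matrix.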
The fix costs one line of code: after each weighted sum, push the result through a small nonlinear function before handing it to the next layer. That bend is the entire trick. With it, two layers of a few neurons can carve regions no single line could, and with enough of them an MLP (a multilayer perceptron: the plainest neural network, layers of neurons where each one feeds every neuron in the next layer) can approximate any reasonable function, meaning any continuous function on a bounded region, as closely as you like. Everything else in deep learning is making that possibility cheap to find.
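As a small illustration of what the bend buys, the sketch below hand-picks weights for a tiny network and evaluates it on the four corner points of XOR, once with a ReLU bend and once without. The weights and the function names `xor_net` and `linear_only` are assumptions chosen for the example, not anything trained. Note that this is the four-point XOR only; covering the whole continuous checker, as in the figure further down, is a harder target.

```python
def relu(z):
    """The bend: pass positive values through, clamp negatives to zero."""
    return max(0.0, z)


def xor_net(x1, x2):
    """Two hidden neurons with a ReLU bend, then a weighted sum (hand-picked weights)."""
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    return h1 - 2.0 * h2


def linear_only(x1, x2):
    """The same weights with the bend removed: this is just one straight-line function."""
    h1 = x1 + x2
    h2 = x1 + x2 - 1.0
    return h1 - 2.0 * h2


corners = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
with_bend = [xor_net(a, b) for a, b in corners]
without_bend = [linear_only(a, b) for a, b in corners]

print(with_bend)      # [0.0, 1.0, 1.0, 0.0]
print(without_bend)   # [2.0, 1.0, 1.0, 0.0]
```

Without the bend the output is simply 2 minus the sum of the inputs, which can never be low at both (0, 0) and (1, 1) while being high in between.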
The activation is where the bend comes from
The choice of bend matters less for what the network can express than for whether gradients survive the trip backwards through it. The sigmoid looks harmless until you read its derivative: never more than 0.25. Chain eight sigmoid layers and the activations alone shrink the training signal by at least a factor of four at every layer, so by layer eight at most 0.25^8 of it is left, about one part in 65,000. Effectively nothing arrives. That’s the vanishing gradient (when the training signal shrinks layer by layer on its way backwards until the early layers learn nothing at all), and it’s why early deep nets simply refused to get deep.
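The arithmetic behind that claim can be sketched in a few lines. This example looks only at the activation slopes and ignores the weights, which is a simplifying assumption; the function name `surviving_gradient` is made up for the illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def surviving_gradient(z, layers):
    """Product of the activation slopes alone across `layers` layers (weights ignored)."""
    return sigmoid_grad(z) ** layers


# Best case: every layer sits exactly at z = 0, where the slope is largest.
best_case = surviving_gradient(0.0, 8)

# A more typical case: pre-activations away from zero shrink the signal much faster.
typical_case = surviving_gradient(2.0, 8)

print(best_case, typical_case)
```

Even the best case leaves 0.25 to the eighth power, roughly 1.5e-05 of the original signal; anywhere off the sigmoid's midpoint it is smaller still.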
ReLU’s answer is blunt: slope exactly 1 when active, so gradients pass through untouched. The price is the other branch — slope exactly 0 — and a neuron whose input goes negative for every example stops learning forever: a dead ReLU. Drag the input marker below zero in the gallery and watch all eight bars go dark at once. Leaky ReLU keeps a trickle flowing for exactly this reason.
solid = f(z) · dashed = f′(z) · bars = gradient left after each layer (log scale)
derivative is exactly 1 or exactly 0 — alive or dead, nothing between — drag the input negative on relu to watch every layer die
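The alive-or-dead behaviour in the gallery can be mimicked in code. This is a rough sketch that feeds the same pre-activation to all eight layers, a simplification borrowed from the figure; the leaky slope of 0.01 is a common default but an assumption here, and `chain` is a hypothetical helper name.

```python
def relu_grad(z):
    """Slope of ReLU: exactly 1 when active, exactly 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def leaky_relu_grad(z, slope=0.01):
    """Slope of Leaky ReLU: 1 when active, a small trickle (assumed 0.01) otherwise."""
    return 1.0 if z > 0 else slope


def chain(grad_fn, z, layers=8):
    """Multiply the activation slope through `layers` layers that all see input z."""
    g = 1.0
    for _ in range(layers):
        g *= grad_fn(z)
    return g


alive = chain(relu_grad, 1.5)         # positive input: the gradient passes untouched
dead = chain(relu_grad, -1.5)         # negative input: every layer multiplies by zero
leaky = chain(leaky_relu_grad, -1.5)  # tiny, but not zero

print(alive, dead, leaky)
```

The leaky version still loses most of the signal over eight inactive layers, but because the result is not exactly zero, the weights can keep moving and the neuron has a chance to recover.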
Backprop is the chain rule, run backwards
To walk downhill you need the slope: the gradient of the loss with respect to every weight. The loss is a single number scoring how wrong the model currently is, and training is the art of pushing it down; the gradient is the direction of steepest improvement, telling you for every adjustable number which way (and how hard) to nudge it. Backpropagation is the bookkeeping that works out, for every weight in the network, how much it contributed to the error, and it computes all of them in one backwards sweep: each layer asks “how much did my output matter?” and multiplies in how much it, in turn, depended on each of its weights. It’s the calculus chain rule with good bookkeeping, nothing more exotic, even in a model with a billion weights instead of the seventeen below.
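To make the backwards sweep concrete, here is a minimal sketch of a network with two inputs, four hidden neurons, and one output, which comes to seventeen parameters. The tanh activation, the squared-error loss, the random initial weights, and the sample input are all assumptions for the illustration, not details taken from the figure. The last few lines compare one backprop gradient against a finite-difference estimate as a sanity check.

```python
import math
import random


def forward(params, x):
    """Forward pass: 2 inputs -> 4 tanh hidden neurons -> 1 linear output."""
    W1, b1, w2, b2 = params
    h = [math.tanh(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(4)]
    y = sum(w2[j] * h[j] for j in range(4)) + b2
    return h, y


def loss(params, x, target):
    """Squared error on a single example (an assumed loss)."""
    _, y = forward(params, x)
    return 0.5 * (y - target) ** 2


def backward(params, x, target):
    """Chain rule run backwards: output first, then hidden layer, then its weights."""
    W1, b1, w2, b2 = params
    h, y = forward(params, x)
    dy = y - target                                   # dL/dy
    dw2 = [dy * h[j] for j in range(4)]               # output weights
    db2 = dy                                          # output bias
    dh = [dy * w2[j] for j in range(4)]               # how much each hidden output mattered
    dz = [dh[j] * (1.0 - h[j] ** 2) for j in range(4)]  # through tanh: slope is 1 - h^2
    dW1 = [[dz[j] * x[0], dz[j] * x[1]] for j in range(4)]
    db1 = dz
    return dW1, db1, dw2, db2


rng = random.Random(0)
params = (
    [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(4)],  # W1: 8 weights
    [rng.uniform(-1, 1) for _ in range(4)],                      # b1: 4 biases
    [rng.uniform(-1, 1) for _ in range(4)],                      # w2: 4 weights
    rng.uniform(-1, 1),                                          # b2: 1 bias
)
n_params = 4 * 2 + 4 + 4 + 1

x, target = [0.5, -0.8], 1.0
dW1, db1, dw2, db2 = backward(params, x, target)

# Finite-difference check on one weight: nudge it up and down, watch the loss.
eps = 1e-6
W1 = params[0]
W1[2][1] += eps
loss_up = loss(params, x, target)
W1[2][1] -= 2 * eps
loss_down = loss(params, x, target)
W1[2][1] += eps  # restore the original value
numeric = (loss_up - loss_down) / (2 * eps)

print(n_params, dW1[2][1], numeric)
```

The finite-difference estimate needs two forward passes per weight, while the backwards sweep delivers every gradient at once, which is the whole reason backprop scales.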
The figure runs that loop for real: forward pass, backward pass, step, hundreds of times a second. Give it the checker — the dataset station 01’s KNN could only handle by memorizing neighbors — and watch a four-neuron network bend its way to the same answer with seventeen numbers. Then cut it to two neurons: the loss stalls high, because two bends genuinely cannot carve four quadrants.
controls: dataset · hidden neurons · activation · readout: loss
checker is the XOR problem — try 2 hidden neurons and watch it struggle; if relu stalls flat, you’ve met a dead neuron (re-roll)
Why everyone uses Adam
Plain SGD has one move, a step straight downhill, and the race above shows both ways it goes wrong. On the ravine it zigzags: the walls are steep, the floor is shallow, and one learning rate (the step size of training: how big a nudge each adjustable number gets per update) can’t serve both. On the saddle it crawls, because almost-flat is almost-stopped when your step is proportional to the slope. Real loss landscapes, in millions of dimensions, are mostly ravines and saddles.
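The ravine dilemma shows up even on a toy surface. The sketch below assumes a stand-in landscape, 0.5 * (50 x^2 + y^2), that is fifty times steeper across than along; it is not the surface used in the figure, and the learning rates are picked only to make the behaviour visible.

```python
def sgd_path(lr, steps=20, start=(1.0, 1.0), steep=50.0):
    """Plain gradient descent on the toy ravine 0.5 * (steep * x^2 + y^2)."""
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = steep * x, y              # gradient of the toy ravine
        x, y = x - lr * gx, y - lr * gy    # one step straight downhill
        path.append((x, y))
    return path


# A learning rate small enough to be stable on the steep axis...
path = sgd_path(lr=0.035)
zigzags = all(path[i][0] * path[i + 1][0] < 0 for i in range(len(path) - 1))
shallow_left = path[-1][1]

# ...and a slightly larger one, which the steep axis cannot tolerate.
diverged = sgd_path(lr=0.05)

print(zigzags, shallow_left, abs(diverged[-1][0]))
```

With the smaller rate, x flips sign on every step while y has covered only about half the distance to the floor after twenty steps; with the larger one, x grows without bound. No single step size is good for both axes.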
Momentum fixes the first problem with memory: keep 90% of last step’s velocity, and the zigzags cancel while progress compounds. Adam adds the second fix: each weight gets its own step size, divided by how big its gradients have recently been, so the shallow direction gets boosted and the steep one reined in. Watch it cross the saddle’s plateau while SGD is still deciding. Every model in the later stations, and almost every diffusion checkpoint you’ll load into ComfyUI, was trained by Adam or one of its descendants.
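Both update rules fit in a few lines each. The sketch below uses one common formulation of momentum (velocity = 0.9 * velocity + gradient) and the standard Adam update with bias correction; the hyperparameter values are typical defaults assumed for illustration, and the function names are hypothetical. It checks two things: under a constant slope, momentum's velocity builds toward ten times the raw gradient, and Adam's first step is about the same size on a steep axis as on a shallow one.

```python
import math


def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """Keep 90% of the previous velocity, add the new gradient, then step."""
    v = beta * v + grad
    return w - lr * v, v


def adam_step(w, m, s, grad, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single weight; t is the 1-based step count."""
    m = b1 * m + (1 - b1) * grad           # running mean of gradients
    s = b2 * s + (1 - b2) * grad * grad    # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)              # bias correction for the zero start
    s_hat = s / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(s_hat) + eps), m, s


# Momentum on a constant slope: velocity approaches grad / (1 - beta) = 10.
w, v = 0.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=1.0)

# First step from (1, 1) on a ravine whose gradients are 50 (steep) and 1 (shallow).
steep_grad, shallow_grad = 50.0, 1.0
x1, _, _ = adam_step(1.0, 0.0, 0.0, steep_grad, t=1)
y1, _, _ = adam_step(1.0, 0.0, 0.0, shallow_grad, t=1)
adam_moves = (1.0 - x1, 1.0 - y1)
sgd_moves = (0.01 * steep_grad, 0.01 * shallow_grad)

print(v, adam_moves, sgd_moves)
```

Plain SGD moves fifty times further on the steep axis than on the shallow one, whereas Adam's per-weight rescaling makes both first steps roughly equal to the learning rate.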
Bridge
The networks here saw two numbers at a time. Images are about to hand them a quarter million — and the first instinct, treating every pixel as an unrelated input, wastes everything the picture is. The road to convolution starts at the next bench. But first: a short interlude about the word nearest, which is about to mean two completely different things.