Station 03 · CNNs & computer vision
The Convolution Bench
A convolutional network is built on one small idea: slide a tiny grid of weights across an image and write down how strongly each spot answers. Everything else — pooling, depth, feature hierarchies — is that idea, repeated. Paint something below and watch it happen.
instrument live — drag to paint · hover the map · press sweep
fires on vertical edges (left↔right change)
Σ w·x — hover the map
= —
violet = positive response, cyan = negative
the nonlinearity: keep what fired, silence the rest
keep the strongest response in each window — same features, quarter the pixels
fig. 01 — convolve · rectify · pool: one full CNN stage, by hand
One small window, slid everywhere
A fully-connected network treats every pixel as its own independent input — move the subject two pixels left and, as far as the math is concerned, it’s a brand-new image. ConvolutionSliding a small grid of weights across an image and recording how strongly each spot responds. makes the opposite bet: a pattern worth detecting is worth detecting anywhere. So instead of learning one weight per pixel, you learn a tiny 3×3 window of weights — a kernelA tiny grid of weights (here 3×3) slid across an image, asking the same question at every position. — and reuse it at every position.
At each stop, the kernel does one multiply–accumulate: nine weights times nine pixels, summed to a single number. That number is the answer to one question — how much does this patch look like my pattern? — and the grid of answers is the feature mapThe grid of responses a kernel writes as it slides — bright where the image matches its pattern. you watched build in the bench above. Weight sharing is also why CNNs are cheap: the bench’s 16×16 input would need 256 weights per neuron in an MLP; the kernel gets by with nine, total.
Kernels are pattern detectors
Before deep learning, vision engineers designed kernels by hand — Sobel for edges, Gaussian for smoothing, Laplacian for outlines. Each one is just a different 3×3 arrangement of weights, and each asks the image a different question. Flip through them below: the input never changes, only the question does.
The deep-learning move was to stop designing them. A CNN learns its kernels by gradient descent — and when you inspect a trained network’s first layer, it has usually rediscovered these same shapes: oriented edges, color contrasts, little gradients. The difference is it learned the questions worth asking for its task, and it asks hundreds of them at once.
fires on vertical edges (left↔right change)
Pooling: keep the what, loosen the where
Feature maps are wasteful. If a vertical edge fired strongly at one spot, it also fired weakly at the neighbors — you don’t need all of that. Max poolingShrinking a feature map by keeping a summary (usually the strongest response) of each small neighborhood. slides a 2×2 window and keeps only the strongest response, shrinking the map to a quarter of the pixels while preserving that the feature was found, just not exactly where.
That blur on position is a feature, not a bug: it buys a little translation invariance, so the network stops caring whether the edge was at pixel 6 or pixel 7. Average pooling keeps the mean instead — gentler, but it waters down strong detections. Toggle between the two and notice which one lets the bright responses survive.
hover a pooled cell to trace its 2×2 source window
Depth buys context
One 3×3 kernel sees three pixels across — enough for an edge, hopeless for an eye. The fix isn’t bigger kernels, it’s stacking: a second layer’s kernel slides over the first layer’s feature maps, so each of its cells indirectly watches a 5×5 patch of the original image. Add pooling and a few more layers and the deepest cells see most of the picture.
This is the feature hierarchy your image classifier learned without being told to: edges → textures → parts → objects. Layer one finds oriented lines, layer three composes them into corners and fur, the last layers respond to whole faces. Each layer asks its questions about the answers of the layer before.
hover any cell in the deepest layer — two stacked 3×3 kernels give it a 5×5 view of the input
Bridge
Convolution’s superpower is also its limit: every cell only ever sees its neighborhood, and context has to climb the stack one layer at a time. What if, instead, every position could look directly at every other position and decide for itself what’s relevant? That single idea — attention — is why transformers displaced recurrence, and it’s waiting at the next station.
Next station · 04
Transformers — The Attention Lens
Every token looks at every other token.