Station 03 · CNNs & computer vision

The Convolution Bench

A convolutional network is built on one small idea: slide a tiny grid of weights across an image and write down how strongly each spot answers. Everything else — pooling, depth, feature hierarchies — is that idea, repeated. Paint something below and watch it happen.


01 · Input · 16×16 px

02 · Kernel 3×3 · sobel-x · fires on vertical edges (left↔right change) · each output cell is Σ w·x over the window

03 · Feature map · 14×14 · violet = positive response, cyan = negative

04 · ReLU, max(0, v) · negatives → 0 · the nonlinearity: keep what fired, silence the rest

05 · Max pool 2×2 · 14×14 → 7×7 · keep the strongest response in each window: same features, quarter the pixels

fig. 01 — convolve · rectify · pool: one full CNN stage, by hand
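The rectify step of the bench is small enough to sketch in a few lines. The row of responses below is a hypothetical example, not data from the bench: it is what sobel-x would produce across a bright bar on a dark background, where the dark-to-bright edge answers with +4 and the bright-to-dark edge with −4.

```python
def relu(feature_map):
    """Apply max(0, v) to every cell: keep what fired, zero out the rest."""
    return [[max(0, v) for v in row] for row in feature_map]


# Hypothetical sobel-x responses for one row crossing a bright bar:
# positive on the bar's left edge, negative on its right edge.
responses = [[0, 4, 4, 0, -4, -4, 0]]

rectified = relu(responses)
print(rectified)  # [[0, 4, 4, 0, 0, 0, 0]]
```

After rectification only the dark-to-bright edge survives, so a kernel followed by ReLU asks a one-sided question.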

§1

One small window, slid everywhere

A fully-connected network treats every pixel as its own independent input — move the subject two pixels left and, as far as the math is concerned, it’s a brand-new image. Convolution makes the opposite bet: a pattern worth detecting is worth detecting anywhere. So instead of learning one weight per pixel, you learn a tiny 3×3 window of weights — a kernel — and reuse it at every position.

At each stop, the kernel does one multiply–accumulate: nine weights times nine pixels, summed to a single number. That number answers one question: how much does this patch look like my pattern? The grid of answers is the feature map you watched build in the bench above. Weight sharing is also why CNNs are cheap: in an MLP, each neuron looking at the bench’s 16×16 input would need its own 256 weights, while the kernel reuses the same nine at all 196 positions.
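The sliding multiply-accumulate can be sketched in plain Python. This is a minimal illustration, not the bench's own code: the function name `conv2d_valid` and the half-dark, half-bright test image are assumptions made for the example. Like most deep-learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # One multiply-accumulate: each weight times the pixel under it, summed.
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += kernel[i][j] * image[r + i][c + j]
            row.append(acc)
        feature_map.append(row)
    return feature_map


SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# Assumed test image: 16x16, dark left half (0), bright right half (1).
image = [[1 if c >= 8 else 0 for c in range(16)] for _ in range(16)]

fmap = conv2d_valid(image, SOBEL_X)
print(len(fmap), len(fmap[0]))  # 14 14
print(fmap[0])                  # [0, 0, 0, 0, 0, 0, 4, 4, 0, 0, 0, 0, 0, 0]

# Weight sharing: one MLP neuron on this input vs. the whole kernel.
mlp_weights_per_neuron = 16 * 16               # 256
kernel_weights = len(SOBEL_X) * len(SOBEL_X[0])  # 9
```

The response is nonzero only in the two output columns whose windows straddle the dark-to-bright boundary, and its peak of 4 is the sum of the kernel's positive column.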

§2

Kernels are pattern detectors

Before deep learning, vision engineers designed kernels by hand — Sobel for edges, Gaussian for smoothing, Laplacian for outlines. Each one is just a different 3×3 arrangement of weights, and each asks the image a different question. Flip through them below: the input never changes, only the question does.

The deep-learning move was to stop designing them. A CNN learns its kernels by gradient descent — and when you inspect a trained network’s first layer, it has usually rediscovered these same shapes: oriented edges, color contrasts, little gradients. The difference is it learned the questions worth asking for its task, and it asks hundreds of them at once.
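The hand-designed kernels named above can be written out and compared directly. The sketch below uses common textbook forms of each kernel (the 4-neighbor Laplacian and a 1/16-normalized Gaussian are assumptions, since several variants exist), and the three 3×3 patches are made up for illustration.

```python
KERNELS = {
    "sobel_x":   [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],
    "sobel_y":   [[-1, -2, -1], [0, 0, 0], [1, 2, 1]],
    "gaussian":  [[1 / 16, 2 / 16, 1 / 16],
                  [2 / 16, 4 / 16, 2 / 16],
                  [1 / 16, 2 / 16, 1 / 16]],
    "laplacian": [[0, 1, 0], [1, -4, 1], [0, 1, 0]],
}

# Illustrative 3x3 patches (assumed, not taken from the test card).
PATCHES = {
    "flat":            [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    "vertical_edge":   [[0, 0, 1], [0, 0, 1], [0, 0, 1]],
    "horizontal_edge": [[0, 0, 0], [0, 0, 0], [1, 1, 1]],
}


def response(patch, kernel):
    """One multiply-accumulate of a 3x3 kernel against a 3x3 patch."""
    return sum(kernel[i][j] * patch[i][j] for i in range(3) for j in range(3))


for patch_name, patch in PATCHES.items():
    answers = {name: response(patch, k) for name, k in KERNELS.items()}
    print(patch_name, answers)
```

On the flat patch both Sobel kernels and the Laplacian answer 0 while the Gaussian returns the patch's brightness, 1.0. On the vertical edge sobel_x answers 4 and sobel_y answers 0, and the two swap on the horizontal edge: same input, different question.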

Kernel gallery — same image, different weights

sobel x · test card (input) beside |response| after sobel x · fires on vertical edges (left↔right change)

fig. 02 — classic hand-designed kernels on a calibration card

§3

Pooling: keep the what, loosen the where

Feature maps are redundant. If a vertical edge fired strongly at one spot, it also fired weakly at the neighbors, and you don’t need all of that. Max pooling moves a 2×2 window across the map in steps of two and keeps only the strongest response in each window. The map shrinks to a quarter of the pixels but still records that the feature was found, just not exactly where.

That blur on position is a feature, not a bug: it buys a little translation invariance, so the network stops caring whether the edge was at pixel 6 or pixel 7. Average pooling keeps the mean instead — gentler, but it waters down strong detections. Toggle between the two and notice which one lets the bright responses survive.
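The difference between the two pooling rules is easy to see on a tiny map. This is a minimal sketch with a made-up 4×4 feature map; `pool2x2` is a hypothetical helper that takes the reduction (max or mean) as an argument.

```python
def pool2x2(feature_map, reduce):
    """Downsample with non-overlapping 2x2 windows, combining each with `reduce`."""
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[0]) - 1, 2):
            window = [feature_map[r][c], feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1]]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


# Assumed example map: one strong isolated response (4.0) and one uniform patch.
fmap = [
    [0.0, 0.0, 4.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 1.0],
]

max_pooled = pool2x2(fmap, max)
avg_pooled = pool2x2(fmap, mean)
print(max_pooled)  # [[1.0, 4.0], [2.0, 1.0]]
print(avg_pooled)  # [[0.25, 1.0], [2.0, 0.25]]
```

The isolated 4.0 survives max pooling intact but is diluted to 1.0 by averaging, while the uniform patch of 2.0 comes out the same either way.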

Pooling — what survives downsampling

8×8 → 4×4 · each pooled cell traces back to its 2×2 source window

fig. 03 — a real feature map (bolt × sobel-x), pooled

§4

Depth buys context

One 3×3 kernel sees three pixels across — enough for an edge, hopeless for an eye. The fix isn’t bigger kernels, it’s stacking: a second layer’s kernel slides over the first layer’s feature maps, so each of its cells indirectly watches a 5×5 patch of the original image. Add pooling and a few more layers and the deepest cells see most of the picture.
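The growth of the view with depth follows a simple recurrence: each layer widens the receptive field by (kernel − 1) times the current spacing between neighboring cells, measured in input pixels, and each stride multiplies that spacing. The sketch below is a standard textbook calculation; `receptive_field` and the layer stacks are illustrative assumptions, not a description of any particular network.

```python
def receptive_field(layers):
    """Return the receptive field width (in input pixels) of one output cell.

    `layers` is a list of (kernel_size, stride) pairs, ordered input to output.
    """
    field = 1  # a single input pixel sees itself
    jump = 1   # distance in input pixels between adjacent cells of this layer
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


CONV3 = (3, 1)  # 3x3 convolution, stride 1
POOL2 = (2, 2)  # 2x2 pooling, stride 2

print(receptive_field([CONV3]))                              # 3
print(receptive_field([CONV3, CONV3]))                       # 5
print(receptive_field([CONV3, POOL2, CONV3]))                # 8
print(receptive_field([CONV3, CONV3, POOL2, CONV3, CONV3]))  # 14
```

Two stacked 3×3 convolutions give the 5×5 view described above. Pooling matters because it doubles the spacing, so every later 3×3 layer adds four input pixels of width instead of two.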

This is the feature hierarchy your image classifier learned without being told to: edges → textures → parts → objects. Early layers find oriented lines, middle layers compose them into corners and textures like fur, and the last layers respond to whole objects such as faces. Each layer asks its questions about the answers of the layer before.

Receptive field — why depth means context

two stacked 3×3 kernels give every cell in the deepest layer a 5×5 view of the input · example: deep cell (3,4) sees 5×5

fig. 04 — receptive field growth across two stacked convolutions

Bridge

Convolution’s superpower is also its limit: every cell only ever sees its neighborhood, and context has to climb the stack one layer at a time. What if, instead, every position could look directly at every other position and decide for itself what’s relevant? That single idea — attention — is why transformers displaced recurrence, and it’s waiting at the next station.

Next station · 04

Transformers — The Attention Lens

Every token looks at every other token.