Station 04 · transformers

The Attention Lens

A convolution gathers context a neighborhood at a time. Attention skips the wait: every token gets to ask one question of every token before it, and decide for itself what matters. The weights below aren't a diagram of that idea — they're pulled from a real GPT-2, head by head.

instrument live — hover the tokens · switch heads · try the finders

The attention lens — real GPT-2 weights

fig. 01 — 144 real attention heads (12 layers × 12), extracted from gpt-2 small

§1

One formula runs the whole show

Every head you flipped through above is the same five symbols: softmax(QKᵀ/√d)·V. (A head is one independent copy of the attention mechanism; each layer runs many in parallel so they can specialize.) Each token, one of the chunks a language model actually reads (usually word pieces, not whole words), publishes three small vectors: a query (“what am I looking for?”), a key (“what do I advertise?”), and a value (“what do I hand over if you pick me?”). A token’s query is dotted against every key. The dot product is a similarity score between two arrows of numbers: multiply matching entries and add them up, and the result is big when they point the same way. Softmax then turns the raw scores into positive weights that sum to one, with bigger scores getting a bigger share, and the output is the weighted mix of values. That mix, not the original token, is what the next layer reads.
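
To make the formula concrete, here is a minimal sketch of one query attending over three tokens in plain Python. The function names (`softmax`, `dot`, `attend`) and the toy vectors are illustrative assumptions, not values pulled from GPT-2; a real head would also apply learned projection matrices to produce its queries, keys, and values.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Multiply matching entries and add them up."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """One query against every key: returns (weights, weighted mix of values)."""
    d = len(query)
    scores = [dot(query, k) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    mixed = [sum(w * v[i] for w, v in zip(weights, values))
             for i in range(len(values[0]))]
    return weights, mixed

# Hypothetical toy vectors in two dimensions (not taken from the model).
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, mixed = attend([1.0, 0.0], keys, values)
print(weights, mixed)
```

The query points the same way as the first key, so the first value should dominate the mix, with the opposing third key getting the smallest share.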

The sandbox strips it to three tokens in two dimensions. Rotate the query toward a key and watch its weight eat the others; then play with the scale dial. The √d divisor is a thermostat: without it, dot products grow with dimension, softmax saturates to one-hot, and gradients die — the same vanishing story as station 02’s sigmoid, solved by a constant.
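
The claim that dot products grow with dimension can be checked numerically. The sketch below assumes query and key entries are independent draws with unit variance (a simplification, not a property measured from GPT-2), and the helper name `score_std` is made up for this example. Under that assumption the spread of q·k should grow roughly like √d, and dividing by √d should bring it back to about 1.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def score_std(d, trials=500, scaled=False):
    """Empirical standard deviation of q.k for random unit-variance vectors."""
    scores = []
    for _ in range(trials):
        q = [random.gauss(0.0, 1.0) for _ in range(d)]
        k = [random.gauss(0.0, 1.0) for _ in range(d)]
        s = dot(q, k)
        scores.append(s / math.sqrt(d) if scaled else s)
    mean = sum(scores) / trials
    return math.sqrt(sum((s - mean) ** 2 for s in scores) / trials)

raw_std = score_std(64)                  # expected to sit near sqrt(64) = 8
scaled_std = score_std(64, scaled=True)  # expected to sit near 1
print(raw_std, scaled_std)
```

Larger raw scores push softmax toward one-hot outputs, which is the saturation the thermostat is there to prevent.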

Q·K·V sandbox: attention with three tokens (interactive figure)

softmax(q·k / scale), sample readout:

k1: q·k = 0.99, w = 0.55
k2: q·k = 0.18, w = 0.31
k3: q·k = -1.02, w = 0.13

√d = 1.4 here — small scale sharpens toward one-hot, large scale flattens toward average

drag q, the keys, or the diamond values — out is the weighted mix Σ wᵢvᵢ the next layer receives

fig. 02 — softmax(q·k / √d) · v, with your hands on all three
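
The sandbox readout can be reproduced by hand. This sketch takes the three dot products shown in the figure and assumes the displayed "√d = 1.4" is √2 (two dimensions); the other two scale values are arbitrary picks to imitate turning the dial, and `weights_for` is a name invented for this example.

```python
import math

def weights_for(scores, scale):
    """softmax(score / scale) over a list of raw dot products."""
    scaled = [s / scale for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.99, 0.18, -1.02]                   # q.k values from the sandbox readout
default = weights_for(scores, math.sqrt(2))    # assumed scale behind "1.4"
sharp = weights_for(scores, 0.2)               # small scale: close to one-hot
flat = weights_for(scores, 10.0)               # large scale: close to a plain average
print(default, sharp, flat)
```

With the assumed scale, the weights should land close to the 0.55 / 0.31 / 0.13 shown in the figure; small differences come from the readout's rounded dot products.
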
§2

Heads are specialists, and you just met three

Why twelve heads per layer instead of one? Because one softmax is one spotlight — it can’t look two places with full strength at once. Multiple heads let the model run many small, different queries in parallel, and trained networks turn them into specialists. The finder buttons in the lens locate three famous ones in this very model: a previous-token head that always reads one step back, an attention sink that parks its budget on the first token when it has nothing better to do, and whichever head answers most decisively.
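
One way such finders could work is to score every head's attention matrix against a simple pattern. The page does not show its actual finder logic, so the three scoring functions below are hypothetical heuristics, and the two 4×4 matrices are hand-made examples rather than real GPT-2 weights. Rows are query positions, columns are key positions, and each row sums to one.

```python
def previous_token_score(attn):
    """Average weight each query (from position 1 on) puts on the token just before it."""
    rows = range(1, len(attn))
    return sum(attn[i][i - 1] for i in rows) / len(rows)

def sink_score(attn):
    """Average weight on position 0, skipping rows 0 and 1, where the first
    token is also the only or the previous token."""
    rows = range(2, len(attn))
    return sum(attn[i][0] for i in rows) / len(rows)

def decisiveness(attn):
    """Average of each row's largest weight: near 1.0 means near one-hot rows."""
    return sum(max(row) for row in attn) / len(attn)

# Hand-made illustrative matrices (causal: nothing above the diagonal).
prev_head = [
    [1.00, 0.00, 0.00, 0.00],
    [0.90, 0.10, 0.00, 0.00],
    [0.05, 0.90, 0.05, 0.00],
    [0.05, 0.05, 0.85, 0.05],
]
sink_head = [
    [1.00, 0.00, 0.00, 0.00],
    [0.95, 0.05, 0.00, 0.00],
    [0.90, 0.05, 0.05, 0.00],
    [0.90, 0.03, 0.03, 0.04],
]

for name, head in [("prev", prev_head), ("sink", sink_head)]:
    print(name, previous_token_score(head), sink_score(head), decisiveness(head))
```

Ranking all 144 heads by each score and taking the top one would give a finder of the kind described, under these assumptions.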

These aren’t curiosities — they’re the units interpretability researchers actually study. Try the wizard sentence: somewhere in the middle layers a head attends from the second “wizard” back toward the first occurrence’s neighborhood. That’s the shadow of an induction head — the circuit that lets a model repeat and extend patterns it has only just seen, widely credited as a seed of in-context learning.
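
The pattern an induction head is usually described as following can be written down without any model: from a repeated token, look back to the position just after its earlier occurrence. The sketch below only computes where such a head would be expected to look; the function name and the example sentence are made up for illustration and are not the wizard sentence from the lens.

```python
def induction_targets(tokens):
    """Map each repeated token's position to the position right after its
    most recent earlier occurrence."""
    last_seen = {}
    targets = {}
    for pos, tok in enumerate(tokens):
        if tok in last_seen:
            targets[pos] = last_seen[tok] + 1
        last_seen[tok] = pos
    return targets

# Hypothetical word-level tokens; a real model would see subword pieces.
tokens = ["the", "wizard", "waved", "and", "the", "wizard"]
targets = induction_targets(tokens)
print(targets)
```

Here the second "wizard" (position 5) maps to position 2, the word that followed the first "wizard". Comparing these expected positions with where a head actually puts its weight is one rough way to spot induction-like behavior.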

§3

Reading the lens like an engineer

Two artifacts in the instrument are load-bearing. The empty upper triangle in the matrix is the causal mask: a GPT predicting the next token is forbidden to peek at the future, so every query may only dot against keys at or before its own position. Models like BERT drop the mask and let every token see both directions. That helps with understanding tasks, but it makes the model poorly suited to left-to-right generation, because the token to be predicted would already be visible.
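
The causal mask is easy to sketch: each row of the score matrix is softmaxed over only the positions at or before the query. The version below slices each row instead of filling future positions with negative infinity, which is the more common trick in real implementations; the function name and the all-equal scores are illustrative assumptions.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax where query i may only see keys 0..i."""
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]               # drop every future position
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row + [0.0] * (n - i - 1))  # the empty upper triangle
    return weights

# All-equal raw scores, purely illustrative: each row spreads evenly
# over the positions it is allowed to see.
scores = [[0.0] * 4 for _ in range(4)]
W = causal_attention_weights(scores)
for row in W:
    print(row)
```

The first row can only attend to itself, while the last row splits its weight across all four positions.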

And the dashed chips: GPT-2 never sees words. It sees BPE subwords — “didn’t” arrives as didn + ’t, “Diffusion” as two pieces. The attention machinery is also how those fragments reassemble into meaning: the second half of a split word reliably attends hard to its first half.
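
How a word ends up split into pieces can be sketched with a toy version of BPE merging: start from single characters and repeatedly join the adjacent pair with the highest priority in a merge table. The merge table below is hypothetical and hand-picked to reproduce the didn + ’t split mentioned above; GPT-2's real tokenizer works on bytes and uses a learned table with tens of thousands of merges.

```python
def bpe_split(word, merges):
    """Greedy toy BPE: apply the best-ranked available merge until none applies."""
    rank = {pair: i for i, pair in enumerate(merges)}
    pieces = list(word)
    while len(pieces) > 1:
        pairs = [(pieces[i], pieces[i + 1]) for i in range(len(pieces) - 1)]
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break                                  # no known merge left
        merged, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == best:
                merged.append(pieces[i] + pieces[i + 1])
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces

# Hypothetical merge table, listed from highest to lowest priority.
toy_merges = [("d", "i"), ("di", "d"), ("did", "n"), ("’", "t")]
pieces = bpe_split("didn’t", toy_merges)
print(pieces)
```

Because no merge joins "didn" with "’t" in this table, the word stays in two pieces, and it falls to attention to tie them back together.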

§4

From one head to a transformer

The rest of the architecture is plumbing around the lens. Each block runs its heads, mixes their outputs, then hands every token to a small MLP — station 02’s machinery, applied position-by-position. A residual connection adds each block’s output back onto its input, so information flows down a highway that attention and MLPs only edit, never replace — that’s what lets gradients survive 96 layers where eight sigmoids starved. Stack the blocks, add a positional signal so “order” exists at all, and you have GPT.
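
The block structure can be sketched in a few lines. Everything below is a simplified stand-in: the attention uses each vector as its own query, key, and value (no learned projections, a single head), `tiny_mlp` is a made-up position-wise function, and layer normalization and the positional signal are left out. The point is only the residual pattern, where each sublayer's output is added onto its input.

```python
import math

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def causal_self_attention(xs):
    """Toy attention: each vector serves as its own query, key, and value."""
    d = len(xs[0])
    out = []
    for i, q in enumerate(xs):
        seen = xs[: i + 1]                          # causal: positions 0..i only
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seen]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, seen)) for j in range(d)])
    return out

def tiny_mlp(x):
    """Hypothetical position-wise stand-in for the MLP: ReLU, then scale by 0.1."""
    return [0.1 * max(0.0, v) for v in x]

def block(xs, attention=causal_self_attention, mlp=tiny_mlp):
    xs = [add(x, a) for x, a in zip(xs, attention(xs))]  # residual around attention
    xs = [add(x, mlp(x)) for x in xs]                    # residual around the MLP
    return xs

def stack(xs, n_layers):
    for _ in range(n_layers):
        xs = block(xs)
    return xs

tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = stack(tokens, 3)

# If both sublayers contribute nothing, the block passes its input through untouched.
identity = block(tokens,
                 attention=lambda xs: [[0.0] * len(x) for x in xs],
                 mlp=lambda x: [0.0] * len(x))
print(out, identity)
```

The last check is the highway in miniature: with zeroed sublayers the input arrives unchanged, so attention and the MLP can only edit what flows past them.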

This same lens is already in your image pipeline. The prompt you type into ComfyUI goes through CLIP’s text transformer (these exact mechanics) before it steers a single pixel. CLIP is a model trained to put images and their captions near each other in the same space of numbers, which is how prompts get understood. Cross-attention, attention pointed across modalities, is where a diffusion model’s denoiser looks at your words while deciding what the noise should become: the image-maker's layers read your prompt's tokens while they work. Which is exactly where this journey ends next.
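
Cross-attention reuses the same arithmetic with one change: the queries come from one sequence (image positions) and the keys and values from another (prompt tokens), with no causal mask. The sketch below assumes toy vectors and a made-up function name, `cross_attend`; real diffusion models derive all three sets of vectors through learned projections.

```python
import math

def cross_attend(image_queries, text_keys, text_values):
    """Each image position mixes the text values, weighted by query-key similarity."""
    d = len(text_keys[0])
    all_weights, outputs = [], []
    for q in image_queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in text_keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        all_weights.append(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, text_values))
                        for j in range(len(text_values[0]))])
    return all_weights, outputs

# Hypothetical toy data: 2 image positions reading 3 prompt tokens.
image_queries = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
text_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cross_weights, cross_out = cross_attend(image_queries, text_keys, text_values)
print(cross_weights, cross_out)
```

Each image position ends up with its own blend of the prompt's value vectors, so different regions can listen to different words.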

Bridge

You now hold both halves: networks that learn features from pixels, and attention that binds meaning across a sequence. The last station puts them together backwards — start from pure noise, and remove it one step at a time, with a transformer-guided U-Net deciding at every step what your prompt implies the image should be.

Next station · 05

Diffusion · The Denoising Deck

From pure noise to an image, one step at a time.