Station 01 · classical machine learning

The Boundary Lab

Every classifier draws a line between classes — the whole game is how wiggly to let that line get. Below, the line belongs to k-nearest-neighbors and the wiggle dial is k. Drag points, add your own, and watch the bias–variance tradeoff stop being an equation and start being a feeling.

01 · Arena
02 · Dials (k, noise)
03 · Error vs k (error from 0% to 40%; overfit at low k, underfit at high k)

solid = test error · dashed = train error; the gap between them is overfitting

fig. 01: KNN decision regions with live train/test error

§1

The dial every model has

Set k to 1 in the lab and the boundary traces every single point, including the noisy ones sitting in enemy territory. Train error: zero, because each point’s nearest neighbor is itself. That model has memorized, not learned. It’s high variance: draw a fresh sample of the data and the boundary changes completely. Crank k toward 25 and the boundary goes smooth and stubborn, ignoring real structure along with the noise: high bias.

The error chart makes the tradeoff visible. Train error (dashed) generally climbs as k grows, but test error is a U-curve: it falls while smoothing kills variance, then rises when smoothing starts killing signal. The gap between the two lines is overfitting (memorizing the training examples instead of learning the pattern: perfect on what it saw, poor on anything new), and the sweet spot at the bottom of the U moves when you turn the noise dial. Noisier data wants a bigger k. That’s the whole bias–variance tradeoff, operated by hand: a model too simple to fit the pattern (bias) versus one so flexible it fits the noise (variance).
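The sweep the error chart performs can be approximated in a few lines. This is a minimal sketch, not the lab's actual code: the two-blob dataset, the noise value, and every function name (`make_blobs`, `knn_predict`, `error_rate`) are assumptions made for illustration.

```python
import random

def make_blobs(n, noise, rng):
    """Hypothetical data: two classes centered at (-1, 0) and (+1, 0)."""
    points = []
    for i in range(n):
        label = i % 2
        cx = 1.0 if label else -1.0
        points.append((rng.gauss(cx, noise), rng.gauss(0.0, noise), label))
    return points

def knn_predict(train, x, y, k):
    """Majority vote among the k training points closest to (x, y)."""
    nearest = sorted(train, key=lambda p: (p[0] - x) ** 2 + (p[1] - y) ** 2)[:k]
    votes_for_one = sum(p[2] for p in nearest)
    return 1 if 2 * votes_for_one > k else 0

def error_rate(train, data, k):
    """Fraction of points in `data` that the k-NN vote gets wrong."""
    wrong = sum(knn_predict(train, x, y, k) != label for x, y, label in data)
    return wrong / len(data)

rng = random.Random(0)
train = make_blobs(80, noise=0.8, rng=rng)
test = make_blobs(400, noise=0.8, rng=rng)

# Odd k only, so a two-class vote can never tie.
for k in (1, 3, 5, 9, 15, 25):
    train_err = error_rate(train, train, k)
    test_err = error_rate(train, test, k)
    print(f"k={k:2d}  train={train_err:.3f}  test={test_err:.3f}")
```

At k=1 every training point votes for itself, so the train column starts at zero; where the test column bottoms out depends on the noise value you pick.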

§2

The same disease in regression

Classification has k; regression has polynomial degree. Push the degree slider to 15 with λ (the regularization strength, explained next) at zero and the curve threads every training point while flailing wildly in between: train error near zero, test error ugly. Same disease, different symptom.

But instead of capping capacity, there’s a second cure: regularization, a penalty for complexity added during training so the model only uses as much wiggle as the data can justify. The λ slider adds an L2 (ridge) penalty on the weights (the adjustable numbers inside a model; training is the process of finding good values for them), so the model pays for every unit of wiggle and spends its capacity only where the data insists. Slide λ up and watch the degree-15 polynomial calm down without changing its degree at all. L1 (lasso) goes further and can zero some weights out entirely; dropout is the same instinct transplanted into neural networks.
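To see the ridge penalty at work outside the widget, here is a rough sketch that solves the ridge normal equations (XᵀX + λI)w = Xᵀy by hand. Several things are assumptions: the true function `sin(3x)`, the noise level, the helper names, and the choice of degree 9 rather than 15 (hand-rolled normal equations get numerically shaky at higher degrees). For simplicity it also penalizes the intercept, which many ridge implementations do not.

```python
import math
import random

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - tail) / M[i][i]
    return w

def fit_ridge(xs, ys, degree, lam):
    """Minimize sum of squared errors + lam * ||w||^2 for a polynomial fit."""
    X = [[x ** j for j in range(degree + 1)] for x in xs]
    d = degree + 1
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(d)]
    return solve(A, b)

def mse(w, xs, ys):
    """Mean squared error of the polynomial with weights w."""
    preds = [sum(wj * x ** j for j, wj in enumerate(w)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

def sample(n, rng):
    """Hypothetical data: noisy samples of sin(3x) on an even grid over [-1, 1]."""
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    ys = [math.sin(3.0 * x) + rng.gauss(0.0, 0.2) for x in xs]
    return xs, ys

rng = random.Random(1)
train_x, train_y = sample(14, rng)
test_x, test_y = sample(200, rng)
DEGREE = 9

for lam in (0.0, 1e-3, 1e-1):
    w = fit_ridge(train_x, train_y, DEGREE, lam)
    norm = math.sqrt(sum(v * v for v in w))
    print(f"lambda={lam:<6g} |w|={norm:9.3f}  "
          f"train={mse(w, train_x, train_y):.4f}  test={mse(w, test_x, test_y):.4f}")
```

The degree never changes across the three rows; only the price of large weights does. The weight norm shrinks as λ rises, while train error can only get worse, which is the trade the slider is making.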

Polynomial fit: capacity vs restraint

filled = train · hollow = test · dashed = the true function it never sees

fig. 02: polynomial capacity vs ridge restraint

§3

Learning without labels

Everything above had answer keys: labels to be right or wrong against. Strip the labels away and structure is still there; k-means finds it with two alternating moves: assign every point to its nearest centroid, then move each centroid to the mean of its points. Step through it below. The inertia readout (the total squared distance from each point to its centroid) never goes up, because neither move can increase it, and there are only finitely many ways to assign the points, which is why the loop is guaranteed to settle.
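The two moves fit in a short loop. The following is a sketch under assumptions, not the widget's code: the three-blob dataset is synthetic, the starting centroids are simply `k` randomly chosen data points, and the function names are made up for illustration.

```python
import random

def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def assign(points, centroids):
    """Move 1: give every point the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]

def update(points, labels, centroids):
    """Move 2: shift each centroid to the mean of its points.
    A centroid that owns no points stays where it is."""
    moved = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            moved.append((sum(p[0] for p in members) / len(members),
                          sum(p[1] for p in members) / len(members)))
        else:
            moved.append(old)
    return moved

def inertia(points, labels, centroids):
    """Total squared distance from each point to its assigned centroid."""
    return sum(sq_dist(p, centroids[label]) for p, label in zip(points, labels))

def kmeans(points, k, rng, max_iter=100):
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    history = []
    for _ in range(max_iter):
        labels = assign(points, centroids)
        history.append(inertia(points, labels, centroids))
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:  # nothing moved, so the loop has settled
            break
        centroids = new_centroids
    return centroids, labels, history

data_rng = random.Random(0)
centers = [(-2.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
points = [(data_rng.gauss(cx, 0.6), data_rng.gauss(cy, 0.6))
          for cx, cy in centers for _ in range(40)]

# Different seeds mean different starting centroids, and possibly different endings.
for seed in range(3):
    centroids, labels, history = kmeans(points, 3, random.Random(seed))
    print(f"seed={seed}  iterations={len(history)}  final inertia={history[-1]:.2f}")
```

Each entry in `history` is the inertia after an assign step, so the list should never rise from one entry to the next. Comparing the final inertia across seeds is a quick way to spot a run that settled somewhere worse.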

Guaranteed to settle — not guaranteed to settle well. Hit reshuffle a few times: bad starting centroids find bad clusterings and stay there. (And note “nearest centroid” doing the work here — the same instinct as KNN’s “nearest neighbors.” That word nearest is about to cause some trouble at the next stop.)

k-means: assign, average, repeat

fig. 03: k-means converging, two moves at a time

§4

How models actually learn

KNN stores its data and k-means alternates two moves, but almost everything else, from ridge regression to every neural network, can learn the same way: define a loss (a single number scoring how wrong the model currently is; training is the art of pushing it down), compute its gradient (for every adjustable number, which way and how hard a nudge changes the loss), and take a small step downhill, against the gradient. Repeat. That’s gradient descent, and the contour map below is the honest picture: the ball only ever feels the local slope.

The step size is the learning rate (how big a nudge each adjustable number gets per update), and it is often the hyperparameter that matters most, a hyperparameter being a setting you choose before training starts (like the step size or k) rather than a number the model learns. Try the ravine: too small and the ball crawls; nudge it higher and it bounces wall-to-wall across the steep axis while inching along the shallow one; a touch more and it diverges entirely. That tension, where steep directions limit the step that shallow directions need, is the problem momentum and Adam were invented to solve. They’re waiting at the next station.
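The ravine behavior can be reproduced on a toy surface. This sketch assumes a simple quadratic bowl, loss = ½(x² + 20y²), which is not the surface in the figure; the curvatures, the starting point, and the function names are all illustrative choices. On this particular quadratic, plain gradient descent is stable only while the learning rate stays below 2/20 = 0.1.

```python
def loss(x, y, shallow=1.0, steep=20.0):
    """Hypothetical ravine: gentle curvature along x, sharp curvature along y."""
    return 0.5 * (shallow * x * x + steep * y * y)

def grad(x, y, shallow=1.0, steep=20.0):
    """Partial derivatives of the loss with respect to x and y."""
    return shallow * x, steep * y

def descend(lr, steps=50, start=(4.0, 1.0)):
    """Run plain gradient descent and return the loss at the final position."""
    x, y = start
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy  # step against the gradient
    return loss(x, y)

print(f"start loss = {loss(4.0, 1.0):.4g}")
for lr in (0.01, 0.09, 0.11):
    print(f"lr={lr:.2f}  loss after 50 steps = {descend(lr):.4g}")
```

At 0.01 the ball crawls along the shallow axis; at 0.09 it flips sides of the steep axis on every step yet still makes it down; at 0.11 the steep axis wins and the loss blows up.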

Gradient descent: drop a ball, pick a step size

steep one way, shallow the other: the classic learning-rate dilemma

fig. 04: gradient descent and the learning-rate dilemma, live

Bridge

You just watched gradient descent solve a two-dimensional toy. A neural network is the same ball rolling down the same kind of surface — except the surface lives in millions of dimensions, one per weight, and the gradient arrives by a clever bookkeeping trick called backprop. The ball, the learning rate, the ravines: all of it scales up at the next station.

Next station · 02

Neural Networks · The Descent

Watch optimizers race down a loss surface.