Station 01 · classical machine learning

The Boundary Lab

Every classifier draws a line between classes — the whole game is how wiggly to let that line get. Below, the line belongs to k-nearest-neighbors and the wiggle dial is k. Drag points, add your own, and watch the bias–variance tradeoff stop being an equation and start being a feeling.

instrument live — drag points · sweep k · click the error chart

01 · Arena

80 pts · k=5 · train 1% · test 0%

02 · Dials

k=5 · noise 0.5

k: neighbors voting (1 · memorize, 25 · generalize) · noise: class overlap

03 · Error vs k

click the chart to set k

solid = test error · dashed = train error — the gap between them is overfitting

fig. 01 — knn decision regions with live train/test error

§1

The dial every model has

Set k to 1 in the lab and the boundary traces every single point, including the noisy ones sitting in enemy territory. Train error is zero, because each point’s nearest neighbor is itself (the only exception is two identical points carrying different labels). That model has memorized, not learned. It is high variance: resample the data and the boundary changes completely. Crank k toward 25 and the boundary goes smooth and stubborn, ignoring real structure along with the noise. That is high bias.

The error chart makes the tradeoff visible. Train error (dashed) tends to climb as k grows, but test error typically traces a U-curve: it falls while smoothing kills variance, then rises when smoothing starts killing signal. The gap between the two lines is overfitting, and the sweet spot at the bottom of the U moves when you turn the noise dial. Noisier data wants a bigger k. That’s the whole bias–variance tradeoff, operated by hand.
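For readers who want the dial in code, here is a minimal sketch of the sweep behind an error chart like this one, in plain Python. It is not the lab's actual implementation: the two-blob generator `make_data`, its `noise` spread, the sample sizes and the random seed are all assumptions made for illustration. Only odd values of k are tried so that a two-class vote never ties.

```python
import random
from collections import Counter

def make_data(n, noise, rng):
    """Hypothetical toy data: two 2-D Gaussian blobs at (-1, 0) and (+1, 0)."""
    data = []
    for i in range(n):
        label = i % 2
        cx = -1.0 if label == 0 else 1.0
        data.append(((rng.gauss(cx, noise), rng.gauss(0.0, noise)), label))
    return data

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def knn_predict(train, point, k):
    """Majority vote among the k training points closest to `point`."""
    nearest = sorted(train, key=lambda item: dist2(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def error_rate(train, data, k):
    """Fraction of `data` that a k-NN model built on `train` gets wrong."""
    wrong = sum(knn_predict(train, p, k) != y for p, y in data)
    return wrong / len(data)

rng = random.Random(0)                 # assumed seed, for repeatability
train = make_data(80, 0.8, rng)        # 80 points, like the arena
test = make_data(400, 0.8, rng)        # held-out points from the same blobs

# Sweep odd k from 1 to 25 and record (train error, test error) for each.
curve = {k: (error_rate(train, train, k), error_rate(train, test, k))
         for k in range(1, 26, 2)}

for k, (tr, te) in curve.items():
    print(f"k={k:2d}  train={tr:.3f}  test={te:.3f}")
```

Try a different seed or a larger `noise` and compare where the lowest test error lands. The one number that should not move is the train error at k=1, which stays at zero because every training point votes for itself.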

§2

The same disease in regression

Classification has k; regression has polynomial degree. Push the degree slider to 15 with λ at zero and the curve threads every training point while flailing wildly in between — train error near zero, test error ugly. Same disease, different symptom.

Capping capacity is one cure. Regularization is a second: the λ slider adds an L2 (ridge) penalty, λ times the sum of the squared weights, so the model pays for every unit of wiggle and spends its capacity only where the data insists. Slide λ up and watch the degree-15 polynomial calm down without changing its degree at all. L1 (lasso) goes further and can zero weights out entirely; dropout is a related instinct transplanted into neural networks.
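The ridge idea fits in a few lines. The sketch below fits a polynomial by solving the regularized normal equations (ΦᵀΦ + λI)w = Φᵀy with a hand-rolled Gaussian elimination, so it needs nothing beyond the standard library. Everything specific here is an assumption for illustration rather than the figure's actual setup: the true function sin(πx), the noise level, the sample sizes, the seed, and degree 9 instead of 15 (chosen to keep the small solver numerically comfortable). For simplicity the penalty also applies to the intercept, which many libraries exclude.

```python
import math
import random

def features(x, degree):
    """Polynomial features [1, x, x^2, ..., x^degree]."""
    return [x ** p for p in range(degree + 1)]

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):          # back substitution
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w

def fit_ridge(xs, ys, degree, lam):
    """Minimize sum (y - w.phi(x))^2 + lam * ||w||^2 via the normal equations."""
    Phi = [features(x, degree) for x in xs]
    d = degree + 1
    A = [[sum(row[i] * row[j] for row in Phi) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * y for row, y in zip(Phi, ys)) for i in range(d)]
    return solve(A, b)

def predict(w, x):
    return sum(wi * fi for wi, fi in zip(w, features(x, len(w) - 1)))

def mse(w, xs, ys):
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def sample(n, rng, noise=0.2):
    """Hypothetical data: noisy samples of sin(pi * x) on [-1, 1]."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [math.sin(math.pi * x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

rng = random.Random(1)
train_x, train_y = sample(20, rng)
test_x, test_y = sample(200, rng)

DEGREE = 9
results = {}
for lam in (0.0, 1e-3, 1e-1):
    w = fit_ridge(train_x, train_y, DEGREE, lam)
    size = math.sqrt(sum(wi * wi for wi in w))      # how "wiggly" the weights are
    results[lam] = (size, mse(w, train_x, train_y), mse(w, test_x, test_y))
    print(f"lambda={lam:<6} |w|={size:9.3f}  train={results[lam][1]:.4f}  test={results[lam][2]:.4f}")
```

Two things are guaranteed by the math: the unpenalized fit has the lowest train error, and the weight norm shrinks as λ grows. Whether test error improves depends on the data and on λ, which is exactly what the slider lets you explore.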

Polynomial fit — capacity vs restraint

deg 11 · λ 0 · train 0.004 · test 0.024

degree: model capacity · λ: ridge penalty (L2)

filled = train · hollow = test · dashed = the true function it never sees

fig. 02 — polynomial capacity vs ridge restraint

§3

Learning without labels

Everything above had answer keys: labels to be right or wrong against. Strip the labels away and structure is still there; k-means finds it with two alternating moves: assign every point to its nearest centroid, then move each centroid to the mean of its points. Step through it below. The inertia readout (the total squared distance from each point to its centroid) never goes up, and there are only finitely many ways to assign points to clusters, which is why the loop is guaranteed to settle.

Guaranteed to settle — not guaranteed to settle well. Hit reshuffle a few times: bad starting centroids find bad clusterings and stay there. (And note “nearest centroid” doing the work here — the same instinct as KNN’s “nearest neighbors.” That word nearest is about to cause some trouble at the next stop.)
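The two alternating moves, and the sensitivity to starting centroids, can be sketched in plain Python. This is an illustrative version, not the figure's code: the three-blob data from `make_blobs`, the seeds, and the choice to start from randomly sampled data points are assumptions. Inertia is recorded after every assignment step so the never-goes-up behavior can be checked directly.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def assign(points, centroids):
    """Move 1: index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: dist2(p, centroids[c]))
            for p in points]

def update(points, labels, centroids):
    """Move 2: each centroid jumps to the mean of its points.
    A centroid that lost all its points stays where it was."""
    new = []
    for c, old in enumerate(centroids):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            new.append((sum(p[0] for p in members) / len(members),
                        sum(p[1] for p in members) / len(members)))
        else:
            new.append(old)
    return new

def inertia(points, labels, centroids):
    """Total squared distance from each point to its assigned centroid."""
    return sum(dist2(p, centroids[l]) for p, l in zip(points, labels))

def kmeans(points, k, rng, max_iter=100):
    centroids = rng.sample(points, k)       # random start, like hitting reshuffle
    labels = assign(points, centroids)
    history = [inertia(points, labels, centroids)]
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        history.append(inertia(points, new_labels, centroids))
        if new_labels == labels:            # nothing moved: the loop has settled
            break
        labels = new_labels
    return centroids, labels, history

def make_blobs(rng, per_blob=40, spread=0.5):
    """Hypothetical data: three Gaussian blobs."""
    centers = [(-2.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
    return [(rng.gauss(cx, spread), rng.gauss(cy, spread))
            for cx, cy in centers for _ in range(per_blob)]

points = make_blobs(random.Random(0))
runs = [kmeans(points, 3, random.Random(seed)) for seed in range(5)]

for seed, (_, _, history) in enumerate(runs):
    print(f"start {seed}: {len(history) - 1} iterations, final inertia {history[-1]:.2f}")
```

Different starts can finish at different final inertias. A common remedy is to run several starts and keep the one with the lowest inertia.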

k-means — assign, average, repeat

iteration 0

fig. 03 — k-means converging, two moves at a time

§4

How models actually learn

KNN stores its data and k-means alternates two moves, but almost everything else, from ridge regression to neural networks, can be trained the same way: define a loss, compute its gradient, and take a small step downhill. Repeat. That’s gradient descent, and the contour map below is the honest picture: the ball only ever feels the local slope.

The step size is the learning rate, and it is arguably the hyperparameter that matters most. Try the ravine: too small and the ball crawls; nudge it higher and it bounces wall-to-wall across the steep axis while inching along the shallow one; a touch more and it diverges entirely. That tension, where steep directions limit the step that shallow directions need, is the problem momentum and Adam were invented to solve. They’re waiting at the next station.
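The ravine behavior can be reproduced on the simplest possible surface. The sketch below assumes a quadratic bowl, loss = ½(a·x² + b·y²) with a = 1 (shallow) and b = 20 (steep). The figure's real surface is not specified, so the loss, the starting point and the three learning rates are illustrative choices. On this bowl each step multiplies x by (1 − lr·a) and y by (1 − lr·b), so the steep axis overshoots once lr exceeds 1/b and diverges once lr exceeds 2/b = 0.1.

```python
def loss(x, y, a=1.0, b=20.0):
    """Assumed ravine: shallow along x (curvature a), steep along y (curvature b)."""
    return 0.5 * (a * x * x + b * y * y)

def grad(x, y, a=1.0, b=20.0):
    """Gradient of the quadratic loss above."""
    return a * x, b * y

def descend(lr, steps=50, start=(4.0, 1.0)):
    """Plain gradient descent: repeatedly step against the local slope."""
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        path.append((x, y))
    return path

# Too small, just under the steep-axis limit of 0.1, and just over it.
finals = {}
for lr in (0.005, 0.09, 0.11):
    x, y = descend(lr)[-1]
    finals[lr] = loss(x, y)
    print(f"lr={lr:<6} loss after 50 steps = {finals[lr]:.4g}")
```

With the smallest rate the ball has barely moved along the shallow axis after 50 steps. At 0.09 the y coordinate flips sign every step yet still shrinks. At 0.11 the same flipping grows without bound, even though the shallow axis alone could tolerate a far larger step.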

Gradient descent — drop a ball, pick a step size

step 0 · loss 8.420

lr 10^0.050

steep one way, shallow the other: the classic LR dilemma · click anywhere to drop the ball

fig. 04 — gradient descent: the learning-rate dilemma, live

Bridge

You just watched gradient descent solve a two-dimensional toy. A neural network is the same ball rolling down the same kind of surface — except the surface lives in millions of dimensions, one per weight, and the gradient arrives by a clever bookkeeping trick called backprop. The ball, the learning rate, the ravines: all of it scales up at the next station.

Next station · 02

Neural Networks — The Descent

Watch optimizers race down a loss surface.