Station 01 · classical machine learning
The Boundary Lab
Every classifier draws a line between classes — the whole game is how wiggly to let that line get. Below, the line belongs to k-nearest-neighbors and the wiggle dial is k. Drag points, add your own, and watch the bias–variance tradeoff stop being an equation and start being a feeling.
instrument live — drag points · sweep k · click the error chart
solid = test error · dashed = train error — the gap between them is overfitting
fig. 01 — knn decision regions with live train/test error
The dial every model has
Set k to 1 in the lab and the boundary traces every single point, including the noisy ones sitting in enemy territory. Train error: zero, because each point’s nearest neighbor is itself (two identical points with different labels are the one exception). That model has memorized, not learned: it’s high variance, so a fresh sample of the data gives a completely different boundary. Crank k toward 25 and the boundary goes smooth and stubborn, ignoring real structure along with the noise: high bias.
The error chart makes the tradeoff visible. Train error (dashed) tends to climb as k grows, but test error is typically a U-curve: it falls while smoothing kills variance, then rises when smoothing starts killing signal. The gap between the two lines is overfitting (memorizing the training examples instead of learning the pattern: perfect on what it saw, poor on anything new), and the sweet spot at the bottom of the U moves when you turn the noise dial. Noisier data wants a bigger k. That’s the whole bias–variance tradeoff, operated by hand: a model too simple to fit the pattern (bias) versus one so flexible it fits the noise (variance).
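The sweep the error chart performs can be approximated offline. What follows is a minimal sketch, not the lab's own code: the two-Gaussian dataset, the seed, the sample sizes, and the helper names (`make_blobs`, `knn_predict`, `error_rate`) are all assumptions made for illustration.

```python
import numpy as np

def make_blobs(n_per_class, noise, rng):
    """Two Gaussian classes centered at (-1, 0) and (+1, 0); `noise` is the spread.
    (Hypothetical stand-in for the lab's draggable points.)"""
    x0 = rng.normal(loc=(-1.0, 0.0), scale=noise, size=(n_per_class, 2))
    x1 = rng.normal(loc=(1.0, 0.0), scale=noise, size=(n_per_class, 2))
    X = np.vstack([x0, x1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

def knn_predict(X_train, y_train, X_query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    nearest = np.argsort(dists, axis=1)[:, :k]   # indices of the k closest points
    votes = y_train[nearest].mean(axis=1)        # fraction of neighbors in class 1
    return (votes > 0.5).astype(int)

def error_rate(y_true, y_pred):
    return float(np.mean(y_true != y_pred))

rng = np.random.default_rng(0)
X_train, y_train = make_blobs(60, noise=1.0, rng=rng)
X_test, y_test = make_blobs(300, noise=1.0, rng=rng)

ks = [1, 3, 5, 9, 15, 25]  # odd k avoids tied votes with two classes
train_err = {k: error_rate(y_train, knn_predict(X_train, y_train, X_train, k)) for k in ks}
test_err = {k: error_rate(y_test, knn_predict(X_train, y_train, X_test, k)) for k in ks}

for k in ks:
    print(f"k={k:2d}  train={train_err[k]:.3f}  test={test_err[k]:.3f}")
```

At k = 1 the train error comes out as zero because every training point is its own nearest neighbor. Where the test error bottoms out depends on the `noise` value and the seed, so the printed numbers are one sample of the curve, not the curve itself.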
The same disease in regression
Classification has k; regression has polynomial degree. Push the degree slider to 15 with λ at zero and the curve threads every training point while flailing wildly in between — train error near zero, test error ugly. Same disease, different symptom.
Capping capacity isn’t the only cure, though. The second is regularization: a penalty for complexity added during training, so the model only uses as much wiggle as the data can justify. The λ slider adds an L2 (ridge) penalty on the weights (the adjustable numbers inside a model; training is the process of finding good values for them), so the model pays for every unit of wiggle and spends its capacity only where the data insists. Slide λ up and watch the degree-15 polynomial calm down without changing its degree at all. L1 (lasso) goes further and can zero weights out entirely; dropout is the same instinct transplanted into neural networks.
filled = train · hollow = test · dashed = the true function it never sees
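The degree-15 experiment can be reproduced in a few lines. This is a sketch under stated assumptions: the true function (`sin(πx)`), the noise level, the sample sizes, and the names `poly_features`, `fit_ridge`, and `mse` are all hypothetical, and for simplicity the penalty here also covers the constant term, which many implementations leave unpenalized.

```python
import numpy as np

def poly_features(x, degree):
    """Design matrix with columns x^0, x^1, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_ridge(x, y, degree, lam):
    """Minimize ||Phi w - y||^2 + lam * ||w||^2.
    Solved as an augmented least-squares problem: the extra rows
    sqrt(lam) * I pull every weight toward zero. lam = 0 is plain least squares."""
    Phi = poly_features(x, degree)
    A = np.vstack([Phi, np.sqrt(lam) * np.eye(degree + 1)])
    b = np.concatenate([y, np.zeros(degree + 1)])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def mse(x, y, w):
    """Mean squared error of the polynomial with weights w."""
    pred = poly_features(x, len(w) - 1) @ w
    return float(np.mean((pred - y) ** 2))

rng = np.random.default_rng(1)
true_fn = lambda x: np.sin(np.pi * x)          # assumed "true function"
x_train = rng.uniform(-1, 1, 20)
y_train = true_fn(x_train) + 0.2 * rng.normal(size=20)
x_test = rng.uniform(-1, 1, 200)
y_test = true_fn(x_test) + 0.2 * rng.normal(size=200)

w_free = fit_ridge(x_train, y_train, degree=15, lam=0.0)    # no penalty
w_ridge = fit_ridge(x_train, y_train, degree=15, lam=1e-2)  # same degree, L2 penalty

for name, w in [("lam=0", w_free), ("lam=0.01", w_ridge)]:
    print(f"{name:9s} train={mse(x_train, y_train, w):.4f} "
          f"test={mse(x_test, y_test, w):.4f}  |w|={np.linalg.norm(w):.2f}")
```

The unpenalized fit can never have a higher train error than the penalized one, and the penalized weights always have the smaller norm. Whether test error improves, and by how much, depends on the data and on λ, which is what the slider lets you probe.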
Learning without labels
Everything above had answer keys: labels to be right or wrong against. Strip the labels away and structure is still there; k-means finds it with two alternating moves: assign every point to its nearest centroid, then move each centroid to the mean of its points. Step through it below. The inertia readout (the summed squared distance from each point to its centroid) never goes up, and there are only finitely many ways to assign the points, which is why the loop is guaranteed to settle.
Guaranteed to settle — not guaranteed to settle well. Hit reshuffle a few times: bad starting centroids find bad clusterings and stay there. (And note “nearest centroid” doing the work here — the same instinct as KNN’s “nearest neighbors.” That word nearest is about to cause some trouble at the next stop.)
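The two alternating moves fit in a short loop. The sketch below is illustrative only: the three-blob dataset, the random-point initialization, and the function name `kmeans` are assumptions, and an empty cluster simply keeps its old centroid.

```python
import numpy as np

def kmeans(X, k, rng, max_iter=100):
    """Plain k-means. Returns (centroids, labels, inertia_history)."""
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    history = []
    for _ in range(max_iter):
        # Move 1: assign every point to its nearest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        history.append(float(d2[np.arange(len(X)), labels].sum()))  # inertia
        # Move 2: move each centroid to the mean of its points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                          # nothing moved: the loop has settled
        centroids = new_centroids
    return centroids, labels, history

data_rng = np.random.default_rng(0)
centers = [(-2.0, 0.0), (2.0, 0.0), (0.0, 3.0)]   # assumed blob centers
X = np.vstack([data_rng.normal(loc=c, scale=0.6, size=(50, 2)) for c in centers])

# "Reshuffle": the same data with different starting centroids.
runs = [kmeans(X, k=3, rng=np.random.default_rng(seed)) for seed in range(5)]
for seed, (_, _, history) in enumerate(runs):
    print(f"seed={seed}  steps={len(history)}  final inertia={history[-1]:.2f}")
```

Within any one run the recorded inertia never increases. Across runs the final value can differ, which is the "not guaranteed to settle well" part; a common remedy is to run several initializations and keep the one with the lowest inertia.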
How models actually learn
KNN stores its data and k-means alternates two moves, but almost everything else, from ridge regression to every neural network, learns the same way: define a loss (a single number scoring how wrong the model currently is; training is the art of pushing it down), compute its gradient (for every adjustable number, which way and how hard to nudge it), and take a small step downhill. Repeat. That’s gradient descent, and the contour map below is the honest picture: the ball only ever feels the local slope.
The step size is the learning rate (how big a nudge each adjustable number gets per update), and it is often the hyperparameter that matters most (a hyperparameter is a setting you choose before training starts, like the step size or k, rather than a number the model learns). Try the ravine: too small and the ball crawls; nudge it higher and it bounces wall-to-wall across the steep axis while inching along the shallow one; a touch more and it diverges entirely. That tension, steep directions limiting the step that shallow directions need, is the problem momentum and Adam were invented to solve. They’re waiting at the next station.
steep one way, shallow the other — the classic LR dilemma — click anywhere to drop the ball
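The three regimes in the ravine (crawl, bounce, diverge) show up on even the simplest elongated bowl. The sketch below assumes a quadratic loss with curvature 1 along one axis and 25 along the other. That surface, the starting point, the three learning rates, and the names `loss`, `grad`, and `descend` are illustrative choices, not the lab's actual surface.

```python
import numpy as np

# Assumed ravine: loss(p) = 0.5 * (1 * p[0]**2 + 25 * p[1]**2).
# Axis 0 is shallow, axis 1 is steep.
CURVATURE = np.array([1.0, 25.0])

def loss(p):
    return 0.5 * float(np.sum(CURVATURE * p ** 2))

def grad(p):
    """Gradient of the quadratic: one slope per coordinate."""
    return CURVATURE * p

def descend(start, lr, steps):
    """Plain gradient descent; returns every position the ball visits."""
    p = np.array(start, dtype=float)
    path = [p.copy()]
    for _ in range(steps):
        p = p - lr * grad(p)   # the ball only feels the local slope
        path.append(p.copy())
    return np.array(path)

start = (4.0, 1.0)
paths = {lr: descend(start, lr, steps=50) for lr in (0.01, 0.07, 0.085)}

for lr, path in paths.items():
    print(f"lr={lr:<6} final loss={loss(path[-1]):.4g}  "
          f"steep-axis positions: {np.round(path[:4, 1], 3)}")
```

On this surface each step multiplies the steep coordinate by (1 − 25·lr). Below lr = 0.04 that factor is positive and the ball slides in; between 0.04 and 0.08 it is negative, so the ball flips sides every step while still shrinking; above 0.08 its magnitude exceeds 1 and the ball diverges. The shallow axis would happily take a step far larger than 0.08, which is the dilemma in miniature.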
Bridge
You just watched gradient descent solve a two-dimensional toy. A neural network is the same ball rolling down the same kind of surface — except the surface lives in millions of dimensions, one per weight, and the gradient arrives by a clever bookkeeping trick called backprop. The ball, the learning rate, the ravines: all of it scales up at the next station.
Next station · 02
Neural Networks — The Descent
Watch optimizers race down a loss surface.