#interpretability
3 posts tagged #interpretability.
- Attention is Explainable Because it is a Kernel
  A reading of self-attention through kernel smoothing and RKHS.
- Not All Infinities Are Equal
  The singularity structure of cross-entropy explains hallucination, the modality gap, and why contrastive losses need such big batches.
- Activations Are Bad for Geometry
  Pointwise activations factor into the layer's Jacobian as a diagonal modulation. The same modulation that buys selectivity destroys geometric structure on the data manifold.