What an MLP Knows, When It's a Kernel


A transformer block has two layers, and one of them is read in a way the other is not. Attention heads have names — induction, name-mover, positional, copying — drawn from the algorithmic behaviour they implement, pointed at in code, ablated to confirm a single function disappears. The position-wise MLP, sitting in the same residual stream and consuming a comparable parameter budget, almost never affords this kind of reading. The standard interpretability move for the MLP is to give up on its native representation and project into something else — fit a sparse autoencoder, train a probe, look at activation magnitudes — then try to map back to the layer’s units. The MLP does not help.

It does not help for a structural reason, not a difficulty of depth or scale. Attention is a kernel machine — I made the argument at length in Attention is Explainable Because it is a Kernel — and the four objects it supplies (a real-valued pairwise score on every input/unit pair, a normalised contribution mass, a geometry on the inputs, and a basis of meaningful per-unit directions) are exactly the ones interpretability tools spend years rebuilding for layers that lack them. The standard MLP layer is not a kernel machine. Its primitive is an affine map followed by a pointwise nonlinearity, and pointwise nonlinearities are bad for geometry: the same diagonal modulation that buys selectivity destroys the geometry the linear part might have carried, leaving the layer’s output without a native similarity score, without a normalised contribution, without a distance.

This post asks the constructive question. What does an MLP look like when its primitive is a kernel? Which of attention’s four objects show up, and what do they let you do?

What’s actually inside the block

Before the structural argument, here are three step-through diagrams of what the layers in question do, useful even if the rest of the post is review for you.

A standard transformer block has two sub-layers wired around a residual stream:

A single transformer block. The residual stream carries token vectors top-to-bottom; each sub-block reads it, computes a delta, and adds the delta back. Attention mixes tokens with each other via a kernel similarity; the MLP transforms each token independently. The dashed curves are the residual skip paths.

The attention sub-block is a kernel machine in plain sight: a query–key inner product makes a similarity matrix, a row-wise softmax turns it into a contribution distribution, and the result is applied to values. (For the longer reading — RKHS, Mercer, why softmax-normalised similarity is the Nadaraya–Watson smoother — see Attention is Explainable Because it is a Kernel.)

The attention sub-block, stage by stage. X is the token matrix [N, d]; three learned projections produce Q, K, V (each [N, dₖ] with rows aligned to the same N tokens). The kernel step is S = Q Kᵀ / √dₖ — every entry S[i, j] is the inner product of one query and one key, a real-valued pairwise score with named rows and columns. A row-wise softmax turns it into α, a row-stochastic [N, N] matrix, and the output is α V — each output row a convex mix of value rows. Hover any S, α, or output cell to see the row pair that produced it; the dashed lines connect back to the source Q, K, V rows.
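The three stages in the caption (score, normalise, mix) can be sketched in a few lines. This is a minimal single-head sketch in plain Python; the function names and the toy `Q`, `K`, `V` values are illustrative assumptions, and the learned projections that produce them from X are omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head attention on small list-of-lists matrices.

    Q and K are [N, d_k]; V is [N, d_v]. Returns (S, alpha, out).
    """
    d_k = len(Q[0])
    # Kernel step: S[i][j] is the scaled inner product of query i and key j.
    S = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d_k) for kj in K]
         for qi in Q]
    # Row-wise softmax: each row of alpha is a distribution over source tokens.
    alpha = [softmax(row) for row in S]
    # Output row i is a convex combination of the value rows.
    out = [[sum(a * v[c] for a, v in zip(row, V)) for c in range(len(V[0]))]
           for row in alpha]
    return S, alpha, out

# Toy values, chosen only for illustration.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
S, alpha, out = attention(Q, K, V)
```

Every entry of `S` and `alpha` is indexed by a (query token, key token) pair, which is the "named rows and columns" property the caption points at.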

The MLP sub-block is the other side. It is an affine map, a pointwise activation, and another affine map, with nothing in between that names which inputs the layer is responding to.

A standard MLP block, stage by stage. The input vector x goes through a first linear map (W₁x + b₁), then a pointwise activation φ, then a second linear map (W₂a + b₂). Stage 3 is the geometric breakpoint: φ acts coordinatewise, the singular values of the layer's Jacobian get multiplied by activation derivatives that vanish (ReLU) or saturate (sigmoid/tanh), and any geometry the linear part provided collapses. The collapse is the topic of an earlier post — see *Activations Are Bad for Geometry*.

Three diagrams, two structurally different layers. Attention has a kernel; the MLP doesn’t. The rest of the post is about what changes if we put one in.

What a kernel layer carries

The shortest argument for what a kernel-shaped unit gives you is to put one next to the standard ReLU unit. The ReLU unit’s response is a half-plane: above the affine hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$ it is positive and grows without bound; below the hyperplane it is zero. There is no notion of a peak in input space, no symmetric “this is the input the unit is looking for”: the unit carries a direction, not a point. (This is the same distinction Opposite Is Not Different draws when it argues that cosine similarity has three landmarks: the unit’s direction on the sphere is only ever maximally different from its negation, not from an orthogonal alternative.) The kernel unit, by contrast, is a localised bump centred at a learnable point.

[Figure: two activation maps over a 2D input space. Left panel: standard MLP unit, max(0, w · x + b). Right panel: kernel unit, (x · W)² / (‖x − W‖² + ε).]

Activation pattern of a standard ReLU unit (left) and a Yat kernel unit (right) over a 2D input space. Both have the same parameter direction. The ReLU unit fires across the entire half-plane in the direction of w, with no notion of where the response is centred. The kernel unit has a single localised peak at the prototype W. The contrast is the entire argument: the kernel unit carries a prototype in input space; the ReLU unit carries only a direction.
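The contrast can be checked numerically by walking along the ray through the parameter vector. The sketch below assumes zero bias and ε = 0.1, and the names `relu_unit` and `yat_score` are illustrative rather than taken from any library.

```python
def relu_unit(x, w, b=0.0):
    """Standard MLP unit: max(0, w . x + b)."""
    return max(0.0, sum(wi * xi for wi, xi in zip(w, x)) + b)

def yat_score(x, W, eps=0.1):
    """Kernel unit without bias: (x . W)^2 / (||x - W||^2 + eps)."""
    dot = sum(wi * xi for wi, xi in zip(W, x))
    dist2 = sum((xi - wi) ** 2 for wi, xi in zip(W, x))
    return dot ** 2 / (dist2 + eps)

# The same parameter vector is used as a direction (ReLU) and a prototype (Yat).
W = [1.0, 0.0]

# Sample inputs t * W along the ray through W; t = 1.0 is the prototype itself.
ts = (0.5, 1.0, 2.0, 4.0)
ray = [[t * W[0], t * W[1]] for t in ts]

relu_vals = [relu_unit(x, W) for x in ray]  # keeps growing with t
yat_vals = [yat_score(x, W) for x in ray]   # largest at t = 1.0 among these samples
```

Among the sampled points the ReLU response is largest at the farthest input, while the kernel response is largest at the prototype.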

Four properties fall out of the kernel type, in direct parallel to attention.

A pairwise score on every input/unit pair. $y_u(\mathbf{x}) = K(\mathbf{x}, \mathbf{W}_u)$ is a real number that says how strongly $\mathbf{x}$ matches the unit’s centre. The unit’s activation is literally that score; this is what the layer computes. The ReLU unit’s output is a scalar function of one direction, with no symmetric notion of similarity between the input and any other point.

A learnable centre in input space. $\mathbf{W}_u$ is the point the unit is looking for. Because the response is highest when $\mathbf{x}$ matches $\mathbf{W}_u$, the weight vector is, formally, a soft prototype, interpretable in the same vocabulary as the network’s inputs. The standard MLP weight $\mathbf{w}$ is a direction; the kernel unit’s weight is a point.

A normalised contribution mass. If the layer’s downstream consumer normalises the activations (by softmax, $L_1$, anything that converts non-negative scores to a partition of unity), the result is a contribution distribution over prototypes. Statements of the form “$45\%$ of this output came from unit $u_7$” are first-class.

A geometry on the inputs. The kernel induces a metric on input space: two inputs are close iff their unit-score profiles are close. For a Mercer kernel the metric is genuinely Riemannian; the layer pulls the network’s downstream notion of “near” back to a metric on $\mathbb{R}^n$ that the network respects.

These are the four objects practitioners pick up when they read an attention head and put down when they try to read a position-wise MLP.
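To make three of these objects concrete (the score, the contribution mass, and the induced distance), here is a minimal sketch. It assumes a Gaussian kernel with σ = 1, L1 normalisation, and a Euclidean distance between score profiles; the prototypes and inputs are made-up toy values.

```python
import math

def gaussian_kernel(x, W, sigma=1.0):
    """Pairwise score between an input x and a prototype W."""
    d2 = sum((xi - wi) ** 2 for xi, wi in zip(x, W))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def score_profile(x, prototypes):
    """One score per unit: the layer's activations for input x."""
    return [gaussian_kernel(x, W) for W in prototypes]

def contribution_mass(scores):
    """L1-normalise non-negative scores into shares that sum to one."""
    total = sum(scores)
    return [s / total for s in scores]

def profile_distance(x1, x2, prototypes):
    """Distance between two inputs, measured on their score profiles."""
    p1 = score_profile(x1, prototypes)
    p2 = score_profile(x2, prototypes)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

# Hypothetical prototypes and input, for illustration only.
prototypes = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
x = [0.2, 0.1]

scores = score_profile(x, prototypes)
mass = contribution_mass(scores)
for u, share in enumerate(mass):
    print(f"unit {u}: {100 * share:.1f}% of the normalised mass")
```

The printed shares are the "this much of the output came from unit u" statements; `profile_distance` is one simple way to read "two inputs are close iff their unit-score profiles are close".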

The two parameters of a kernel unit

Each unit in a kernel layer has two parameters, and they play different roles. The first is the prototype $\mathbf{W}_u \in \mathbb{R}^n$. It is a point in input space, formally the location of a kernel section $K(\cdot, \mathbf{W}_u)$, which is a single function in the RKHS associated to the kernel $K$. The unit’s response to an input $\mathbf{x}$ is a measurement against that section: how close, in the kernel’s geometry, is $\mathbf{x}$ to $\mathbf{W}_u$. Selecting a neuron, in this layer, means evaluating $K(\mathbf{x}, \mathbf{W}_u)$, the prototype’s RKHS section sampled at the input.

The second parameter is the readout coefficient $\alpha_u \in \mathbb{R}$. It is not a similarity score. It is the weight the layer places on the prototype’s contribution to the layer’s output. Positive $\alpha_u$ means “this prototype pushes the output up when its kernel fires”; negative $\alpha_u$ means “this prototype pushes the output down.” Magnitude is how loudly. With $m$ units, the layer computes a finite kernel expansion in the RKHS,

$$f(\mathbf{x}) \;=\; \sum_{u=1}^{m} \alpha_u \, K(\mathbf{x}, \mathbf{W}_u).$$

This is the same finite kernel expansion classical SVMs and Gaussian processes use; the difference is that here the centres $\mathbf{W}_u$ are learned end-to-end rather than fixed to the training data, and the $\alpha_u$ are the readout coefficients that turn the population of prototype responses into a layer output.
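The expansion is short enough to write out directly. The sketch below assumes a Gaussian kernel with σ = 1 as a stand-in for $K$, and the two prototypes and readouts are hand-picked toy values rather than trained ones.

```python
import math

def gaussian_kernel(x, W, sigma=1.0):
    """Stand-in kernel K(x, W); any positive-definite choice could be swapped in."""
    d2 = sum((xi - wi) ** 2 for xi, wi in zip(x, W))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def kernel_layer(x, prototypes, alphas, kernel=gaussian_kernel):
    """Return f(x) = sum_u alpha_u * K(x, W_u) and the per-unit contributions."""
    contributions = [a * kernel(x, W) for a, W in zip(alphas, prototypes)]
    return sum(contributions), contributions

# Two hand-placed prototypes: one with a positive readout, one with a negative one.
prototypes = [[-1.0, 0.0], [1.0, 0.0]]
alphas = [1.5, -0.5]

f_left, contrib_left = kernel_layer([-1.0, 0.0], prototypes, alphas)
f_right, contrib_right = kernel_layer([1.0, 0.0], prototypes, alphas)
```

Reading off the largest absolute entry of `contrib_left` or `contrib_right` is the "selecting a neuron" step: it names the prototype that dominates the output at that input.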

A kernel-layer MLP block in one picture. Six prototypes W_u live in the 2D input space, each shown as a coloured dot — orange for positive readout α_u, slate for negative; saturation encodes magnitude. The heatmap behind them is the layer's output function f(x) = Σ αᵤ K(x, Wᵤ); orange regions where the layer outputs positive values, slate regions where it outputs negative. Drag the input x (the open white circle) to see, on the right, each unit's contribution αᵤ · K(x, Wᵤ) as a bar, plus the running sum f(x). Selecting a neuron is just reading off the largest |αᵤ · K(x, Wᵤ)|: the closer x is to Wᵤ in the kernel's geometry, the bigger the contribution.

The picture above is the structural content of an entire MLP block compressed to one page. Each prototype is a function in the RKHS, its bump in the heatmap. The readout coefficients pick out which bumps add positively and which subtract. The input $\mathbf{x}$ activates the units in proportion to how close it is to each prototype. The layer’s output is the signed sum of those activations.

One instantiation: the Yat unit

To make the kernel concrete, fix a particular choice of $K$. The Yat unit on $\mathbf{x} \in \mathbb{R}^n$ uses

$$y_u(\mathbf{x}) \;=\; \alpha_u \,\frac{(\mathbf{x} \cdot \mathbf{W}_u + b)^2}{\|\mathbf{x} - \mathbf{W}_u\|^2 + \varepsilon},$$

with prototype $\mathbf{W}_u \in \mathbb{R}^n$, bias $b$, regulariser $\varepsilon > 0$, and readout $\alpha_u$. The fraction is the kernel similarity $K(\mathbf{x}, \mathbf{W}_u)$: its denominator is a regularised squared Euclidean distance from $\mathbf{x}$ to $\mathbf{W}_u$, minimised at $\mathbf{x} = \mathbf{W}_u$; its numerator is a squared inner product. The unit’s score is high when $\mathbf{x}$ is both close to $\mathbf{W}_u$ and aligned with it. The readout $\alpha_u$ multiplies that score to give the unit’s actual contribution to the layer.

The Yat kernel section K(·, W) over a 2D input space. The peak sits at the prototype W; the activation falls off as 1/‖x − W‖² (regularised by ε) outside the peak. The regulariser ε controls how sharp the peak is: smaller ε means a more localised, more selective unit. This is the RKHS section attached to one prototype; the full layer is a weighted sum of m such sections, one per unit.
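The Yat formula translates almost symbol for symbol into code. This is a sketch of the expression as written in this post, not a reference implementation; the default values for `b` and `eps` are arbitrary assumptions.

```python
def yat_similarity(x, W, b=0.0, eps=1e-2):
    """Kernel part of the Yat unit: (x . W + b)^2 / (||x - W||^2 + eps)."""
    dot = sum(xi * wi for xi, wi in zip(x, W))
    dist2 = sum((xi - wi) ** 2 for xi, wi in zip(x, W))
    return (dot + b) ** 2 / (dist2 + eps)

def yat_unit(x, W, alpha, b=0.0, eps=1e-2):
    """Full unit output: the readout alpha times the kernel similarity."""
    return alpha * yat_similarity(x, W, b, eps)

W = [1.0, 1.0]

# At x = W the distance term vanishes, so with b = 0 the score is ||W||^4 / eps.
at_prototype = yat_similarity(W, W, eps=0.1)

# An input orthogonal to W has zero inner product, so with b = 0 the score is zero.
orthogonal = yat_similarity([1.0, -1.0], W, eps=0.1)

# Shrinking eps raises the value at the prototype, which sharpens the peak.
sharper = yat_similarity(W, W, eps=0.01)
```

The sign of `alpha` only flips the direction in which the unit pushes the layer output; where the unit responds is decided entirely by `W`, `b`, and `eps`.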

The Yat kernel is one choice in a family. Any positive-definite $K$ furnishes the four objects above; the geometry it induces, and therefore what the layer treats as “near” and “far”, depends on the particular kernel. The next figure puts four common kernels side by side with the same prototype $\mathbf{W}$ and the same input $\mathbf{x}$, so the choice’s signature is visible.

[Figure: four kernel sections K(·, W) sharing one prototype W and one input x. Gaussian: exp(−‖x − W‖² / 2σ²). Laplace: exp(−‖x − W‖ / σ). Yat: (x · W)² / (‖x − W‖² + ε). Polynomial: (x · W + 1)².]

Four kernels K(·, W) with the same prototype W and the same input x. Gaussian and Laplace are bumps falling off with distance; Yat has the same bump shape but multiplied by the squared inner product, so far-from-origin directions are amplified; the polynomial kernel grows away from the origin and has no decay, so picking it as your layer's primitive means the layer no longer treats W as a prototype in the geometric sense at all. The kernel is the choice; the four affordances follow whichever choice you make.
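The four formulas can be evaluated side by side at a near input and a far input. The sketch below assumes σ = 1 and ε = 0.1 and uses hypothetical toy points; it is only meant to show each kernel's qualitative signature.

```python
import math

def sq_dist(x, W):
    return sum((xi - wi) ** 2 for xi, wi in zip(x, W))

def dot(x, W):
    return sum(xi * wi for xi, wi in zip(x, W))

def gaussian(x, W, sigma=1.0):
    return math.exp(-sq_dist(x, W) / (2.0 * sigma ** 2))

def laplace(x, W, sigma=1.0):
    return math.exp(-math.sqrt(sq_dist(x, W)) / sigma)

def yat(x, W, eps=0.1):
    return dot(x, W) ** 2 / (sq_dist(x, W) + eps)

def polynomial(x, W):
    return (dot(x, W) + 1.0) ** 2

KERNELS = {"gaussian": gaussian, "laplace": laplace,
           "yat": yat, "polynomial": polynomial}

W = [1.0, 0.0]      # shared prototype
near = [1.0, 0.0]   # input sitting on the prototype
far = [5.0, 0.0]    # input far along the same direction

# For each kernel: (score at the near input, score at the far input).
table = {name: (k(near, W), k(far, W)) for name, k in KERNELS.items()}
for name, (s_near, s_far) in table.items():
    print(f"{name:10s} near={s_near:.4f} far={s_far:.4f}")
```

With these toy points the first three kernels score the near input higher than the far one, while the polynomial kernel does the opposite, which is the sense in which it stops treating W as a prototype.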

A small architecture that uses it

The interesting question is what the four affordances buy when a whole network depends on them. Build a network with three inputs: an image of digit $a$, an image of an operator ($+, -, \times, \div$), and an image of digit $b$. A shared CNN encoder maps each $28 \times 28$ input to a $64$-d embedding, a single Yat layer maps the concatenated $192$-d vector to $256$ unit activations, and a small ConvTranspose decoder paints the answer as a $28 \times 84$ image with three slots: sign, tens, units.

The architecture. Three image inputs share a CNN encoder into ℝ⁶⁴ each; the embeddings are concatenated into ℝ¹⁹²; a single Yat layer maps to 256 unit activations; a small ConvTranspose decoder paints the answer image, partitioned into sign/tens/units slots. The only nonlinear layer between encoder and decoder is the Yat trunk — so everything observable about the network's mid-stream representation is observable about that single layer.

The single-layer trunk is the point. With one Yat layer rather than a stack, every row of its $256 \times 192$ weight matrix is a named prototype in the encoder’s embedding space, and the network’s entire mid-stream representation is the matrix and the 256 unit activations it produces.

Each row $\mathbf{W}_u \in \mathbb{R}^{192}$ partitions naturally into three $\mathbb{R}^{64}$ slots matching the three input embeddings,

$$\mathbf{W}_u \;=\; \bigl(\mathbf{W}_u^{(a)},\; \mathbf{W}_u^{(\mathrm{op})},\; \mathbf{W}_u^{(b)}\bigr),$$

so the unit’s prototype factorises by input role: a centre for the first digit, a centre for the operator image, a centre for the second digit. The four operations below all act on this matrix.
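In code the factorisation is plain slicing. The sketch below assumes the concatenation order (a, op, b) given above and a 64-d slot width; `split_slots` is a hypothetical helper, and the row is filled with dummy values so the slot boundaries are easy to see.

```python
SLOT = 64                       # width of one encoder embedding
SLOT_NAMES = ("a", "op", "b")   # assumed concatenation order of the three inputs

def split_slots(row, slot=SLOT):
    """Split one 192-d prototype row into its three 64-d input-role slots."""
    if len(row) != slot * len(SLOT_NAMES):
        raise ValueError("row length must equal slot width times number of slots")
    return {name: row[i * slot:(i + 1) * slot]
            for i, name in enumerate(SLOT_NAMES)}

# Dummy row whose entries are their own indices, for illustration only.
row = [float(i) for i in range(192)]
slots = split_slots(row)
```

Here `slots["op"]` holds entries 64 to 127 of the row, the part of the prototype that is compared against the operator image's embedding.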

Four operations that follow

Naming. For each unit, find the library symbol whose encoder embedding is nearest to each slot of $\mathbf{W}_u$. The unit’s role is “fire on inputs that look like this $(a, \text{op}, b)$ triple.” No SAE, no probe; the weights of the layer are the dictionary. Of the $256$ trunk units in the trained network, $51$ have an operator symbol as the maximiser of their middle-slot prototype $\mathbf{W}_u^{(\mathrm{op})}$.

Visualisation. Push the one-hot activation $\alpha \cdot \mathbf{e}_u$ through the decoder. The result is the per-unit footprint image: the pixels unit $u$ alone paints into the output, with everything else off. Most footprints are spatially localised to a single output slot; the model has carved itself into a slot alphabet whose neurons are the trunk’s prototypes pushed forward by the decoder.

Ablation by name. Identify the units whose prototype matches a category in the named vocabulary. Zero those rows. By the layer’s kernel structure the rest of the units still fire on their own prototypes; only the targeted ones go silent. Specificity is a property of the prototype, not of a probe trained to find it.

Slot-level surgery. Where the unit’s input partitions into named subspaces, the prototype partitions with it. Zeroing only the operator-slot subspace of a tagged unit silences the unit’s reading of the operator image while leaving the digit pathways intact. This is the difference between “the unit contributes to behaviour $X$” and “the unit contributes to behaviour $X$ specifically through input role $r$”, which is a mechanistic claim activation steering on a black-box MLP cannot make.
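Three of the four operations (naming, ablation by name, slot-level surgery) are direct edits of the weight matrix and can be sketched on a toy version of it. Everything concrete here is an assumption for illustration: the 2-d slots, the four-symbol `library` of embeddings, the two-row matrix `W`, and the use of squared Euclidean distance for "nearest".

```python
SLOT = 2  # toy slot width; the network described above uses 64

def nearest_symbol(vec, library):
    """Library key whose embedding is closest to vec (squared Euclidean)."""
    return min(library,
               key=lambda s: sum((v - e) ** 2 for v, e in zip(vec, library[s])))

def name_unit(row, library, slot=SLOT):
    """Tag one prototype row with an (a, op, b) triple of library symbols."""
    return tuple(nearest_symbol(row[i * slot:(i + 1) * slot], library)
                 for i in range(3))

def ablate_rows(W, units):
    """Row ablation: zero the whole prototype of each listed unit."""
    return [[0.0] * len(r) if u in units else list(r) for u, r in enumerate(W)]

def ablate_slot(W, units, slot=SLOT, slot_index=1):
    """Slot surgery: zero one slot (default: operator) of each listed unit."""
    out = [list(r) for r in W]
    for u in units:
        for j in range(slot_index * slot, (slot_index + 1) * slot):
            out[u][j] = 0.0
    return out

# Hypothetical encoder embeddings for two digits and two operators.
library = {"3": [1.0, 0.0], "7": [0.0, 1.0], "x": [1.0, 1.0], "+": [-1.0, -1.0]}

# Toy 2 x 6 weight matrix: each row is one unit's (a, op, b) prototype.
W = [[0.9, 0.1, 1.1, 0.9, 0.1, 1.0],
     [0.0, 1.0, -1.0, -0.9, 1.0, 0.1]]

names = [name_unit(row, library) for row in W]
# Units whose middle-slot prototype is tagged with the multiplication glyph.
tagged = [u for u, n in enumerate(names) if n[1] == "x"]

row_ablated = ablate_rows(W, tagged)    # whole row silenced
slot_ablated = ablate_slot(W, tagged)   # only the operator slot silenced
```

Both interventions return edited copies and leave `W` untouched, so the baseline and the ablated layers can be evaluated side by side. For a Yat unit with zero bias, a zeroed row gives a zero numerator, so the unit's activation is exactly zero.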

The interventions, live

The ablation operations are the ones that turn descriptive interpretability into a falsifiable causal claim. The widget below applies each of three interventions — zero the entire row of a tagged unit, zero only its operator slot, or zero a random subset of the same size — to the trained trunk, and reports the per-operator OCR change.

Per-operator OCR before and after intervention, by targeted subset (the operator whose prototype tags the targeted units) and intervention type (row, slot, or random). The targeted operator's bar should fall sharply; off-target operators should barely move. The 'row' and 'slot' interventions on the ×-tagged subset drop multiplication accuracy by ~44 and ~40 percentage points respectively while leaving division within 1.6 pp of baseline; a random subset of the same size drops everything roughly equally. The slot variant is the one that says the multiplication computation flows specifically through the 30 × 64 weights of these units' operator-slot prototypes.

The ×-tagged subset is the clearest case. Zeroing the $30$ rows whose middle-slot prototype is the $\times$ glyph drops multiplication OCR from $86.5\%$ to $42.7\%$, a fall of $43.7$ percentage points, while division loses only $1.6$ pp. Zeroing only the middle slot of those same $30$ rows reproduces $40.0$ pp of that drop, while collateral damage on every other operator stays below $2$ pp. The multiplication computation does not flow through these units’ digit pathways or through a distributed residual code; it flows specifically through the $30 \times 64$ weights of their operator-slot prototypes.

A random subset of $30$ units, ablated the same way, drops multiplication by only $9.2$ pp. The targeted interventions produce a drop more than four times larger than random ablation of the same size.
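The specificity comparison is simple arithmetic on the reported drops. The sketch below only restates the figures quoted above; the variable names are illustrative. The quoted endpoints ($86.5$ and $42.7$) differ by $43.8$, so the reported $43.7$ pp presumably reflects rounding of the underlying accuracies, which is an assumption on my part.

```python
# Multiplication OCR drops in percentage points, as reported in the text.
row_drop = 43.7      # whole rows of the 30 x-tagged units zeroed
slot_drop = 40.0     # only the operator slot of those rows zeroed
random_drop = 9.2    # 30 randomly chosen units zeroed

# How many times larger each targeted drop is than the random baseline.
row_ratio = row_drop / random_drop
slot_ratio = slot_drop / random_drop

# Share of the full-row drop that the slot-only intervention reproduces.
slot_share = slot_drop / row_drop

print(f"row vs random:  {row_ratio:.2f}x")
print(f"slot vs random: {slot_ratio:.2f}x")
print(f"slot share of the row drop: {100 * slot_share:.1f}%")
```

Both ratios land between four and five, and the slot-only edit accounts for roughly nine tenths of the full-row effect.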

None of this required a sparse autoencoder, a probe, or an external dictionary. The model’s “knowledge of multiplication” is a subset of rows in a $256$-row matrix, named directly off the kernel structure of the layer, and the prototype labels survive surgical editing of the matrix.

Be the optimiser

The point of the construction is that the layer’s parameters are legible enough that a human can place them by hand. The widget below puts you in the place of gradient descent: pick a classic 2D dataset, click anywhere on the chart to drop a prototype $\mathbf{W}_u$, drag it to a position you think matters, and dial each $\alpha_u$ on the slider beneath. The decision boundary $f(\mathbf{x}) = 0$ and the per-point accuracy update live. Outlined points are misclassified.

Classification by hand. Click empty space to add a prototype; drag to move; shift-click (or right-click) to delete. Per-unit α sliders are below the chart. Try the moons, circles, XOR, and spirals datasets: three to four well-placed prototypes per class with alternating-sign α is usually enough to clear the easy ones, and the spirals reward narrower σ plus more prototypes. The 'auto-fit α' button leaves your prototype positions alone and solves a ridge-regularised least-squares problem for the readouts — the linear projection the layer's downstream consumer would solve for you. The kernel toggle switches between Gaussian and Yat; the same prototypes give different boundaries because the geometry on input space changes.

The reason this is possible at all is that the layer’s two parameter families do separable, interpretable work. $\mathbf{W}_u$ is where the unit listens; $\alpha_u$ is how its kernel score is read out. Placing prototypes near each class’s mass and turning on the readouts is the same algorithm a kernel SVM would run, performed in your head, and the same algorithm a trained kernel-MLP layer ends up at, performed by gradient descent. The picture you get on the chart is the picture the layer encodes in its weight matrix. Train it instead of clicking it and you’d see the prototypes drift toward the same kind of placement.
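The split between the two parameter families can be sketched as a two-step procedure: place the prototypes by hand, then solve for the readouts. The code below is one way to do the second step, assuming a Gaussian kernel with σ = 0.5, a ridge penalty of 1e-3, and the four-point XOR dataset; it is not the widget's actual implementation, and `solve` and `fit_readouts` are illustrative names.

```python
import math

def gaussian_kernel(x, W, sigma=0.5):
    d2 = sum((xi - wi) ** 2 for xi, wi in zip(x, W))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def fit_readouts(X, y, prototypes, lam=1e-3):
    """Ridge least squares for alpha with the prototypes held fixed.

    Solves (K^T K + lam I) alpha = K^T y, where K[i][u] = K(x_i, W_u).
    """
    K = [[gaussian_kernel(x, W) for W in prototypes] for x in X]
    m = len(prototypes)
    A = [[sum(row[u] * row[v] for row in K) + (lam if u == v else 0.0)
          for v in range(m)] for u in range(m)]
    rhs = [sum(row[u] * t for row, t in zip(K, y)) for u in range(m)]
    return solve(A, rhs)

def predict(x, prototypes, alphas):
    return sum(a * gaussian_kernel(x, W) for a, W in zip(alphas, prototypes))

# XOR: same-parity corners are class +1, mixed corners are class -1.
X = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
y = [1.0, 1.0, -1.0, -1.0]

# "By hand" placement: one prototype on each corner.
prototypes = [list(p) for p in X]
alphas = fit_readouts(X, y, prototypes)
preds = [predict(x, prototypes, alphas) for x in X]
```

In this toy run the fitted readouts are positive on the two +1 corners and negative on the two -1 corners, the alternating-sign pattern the caption describes.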

This is the affordance the standard MLP doesn’t have. There is no analogous game for ReLU units — placing affine half-planes by hand to separate a moons dataset is possible but unilluminating, because the units don’t tell you what they listen for. The kernel layer’s two-parameter structure makes the construction not just possible but pedagogical: you can see the layer’s job, do its job by hand, check your work, and read out what you did.

The MLP chose to be opaque

The point of the experiment is not that this particular architecture is the right one to scale. The Yat unit is rational where the standard MLP is affine, the optimisation behaves differently, the FLOPs are different, and the practical scaling regime is an open question that a $0.81$ M-parameter network does not answer.

The point is structural. The MLP block was illegible not because it was deep, not because high dimensions are opaque, not because activations get distributed across many neurons. It was illegible because its primitive — affine map plus pointwise nonlinearity — does not carry the four objects that make a kernel layer readable. The collapse is mechanical (see Activations Are Bad for Geometry): the resulting layer has no native similarity score, no prototype, no normalised contribution, no induced metric. Everything downstream interpretability has spent the past five years building is an external apparatus to recover the objects the primitive does not supply.

Replace the primitive with one that does supply them and the apparatus becomes superfluous. The same operations attention has enjoyed by construction — name a unit, visualise it, ablate it, slice it by input role — become one-line edits of the layer’s weight matrix, and the resulting claims have the strength attention claims do, because they are statements about the layer’s actual representation rather than about a learned proxy for it.

The MLP block is not inherently opaque. It chose to be, by adopting a primitive that does not carry geometry. The choice is reversible.

Cite as

Bouhsine, T. (2026). What an MLP Knows, When It's a Kernel. Records of the !mmortal Data Scientist. https://tahabouhsine.com/blog/what-an-mlp-knows/

BibTeX
@misc{bouhsine2026whatanmlpknows,
  author       = {Bouhsine, Taha},
  title        = {What an MLP Knows, When It's a Kernel},
  year         = {2026},
  month        = {may},
  howpublished = {\url{https://tahabouhsine.com/blog/what-an-mlp-knows/}},
  note         = {Blog post, Records of the !mmortal Data Scientist}
}

For the underlying paper

Bouhsine, T. (2026). Painting Arithmetic: A Rational-Form Network for Visual Symbolic Computation in Latent Space. Supporting experiment.

BibTeX
@unpublished{bouhsine2026paintingarithmetic,
  author = {Bouhsine, T.},
  title  = {Painting Arithmetic: A Rational-Form Network for Visual Symbolic Computation in Latent Space},
  year   = {2026},
  note   = {Supporting experiment}
}