What an MLP Knows, When It's a Kernel

May 21, 2026 · 20 min read

#ml #interpretability #kernels #mlp #transformers #mechanistic-interpretability #neural-networks #yat-unit #deep-learning

Part 2 of 5 in the series Attention Is a Kernel

1. Attention is Explainable Because it is a Kernel
2. What an MLP Knows, When It's a Kernel (you are here)
3. Cheap Attention: Linear-Time Kernel Approximation
4. Why Attention Needs Q and K Projections
5. The Kernel Between the Roles

A transformer block has two layers, and one of them is read in a way the other is not. Attention heads have names (induction, name-mover, positional, copying) drawn from the algorithmic behaviour they implement; they can be pointed at in code and ablated to confirm that a single function disappears. The position-wise MLP, sitting in the same residual stream and consuming a comparable parameter budget, almost never affords this kind of reading. The standard interpretability move for the MLP is to give up on its native representation and project into something else (fit a sparse autoencoder, train a probe, look at activation magnitudes), then try to map back to the layer’s units. The MLP does not help.

It does not help for a structural reason, not a difficulty of depth or scale. Attention is a kernel machine (I made the argument at length in Attention is Explainable Because it is a Kernel), and the four objects it supplies (a real-valued pairwise score on every input/unit pair, a normalised contribution mass, a geometry on the inputs, and a basis of meaningful per-unit directions) are exactly the ones interpretability tools spend years rebuilding for layers that lack them. The standard MLP layer is not a kernel machine. Its primitive is an affine map followed by a pointwise nonlinearity, and pointwise nonlinearities are bad for geometry: the same diagonal modulation that buys selectivity destroys the geometry the linear part might have carried, leaving the layer’s output without a native similarity score, without a normalised contribution, without a distance.

This post asks the constructive question. What does an MLP look like when its primitive is a kernel? Which of attention’s four objects show up, and what do they let you do?

What’s actually inside the block

What do the two layers actually do, mechanically, before any interpretation gets layered on top? Three step-through diagrams settle that first; they are worth a minute even if the rest is review for you.

A standard transformer block wires two sub-layers around a residual stream:

A single transformer block. The residual stream carries token vectors top-to-bottom; each sub-block reads it, computes a delta, and adds the delta back. Attention mixes tokens with each other via a kernel similarity; the MLP transforms each token independently. The dashed curves are the residual skip paths.

The attention sub-block is a kernel machine in plain sight: a query–key inner product makes a similarity matrix, a row-wise softmax turns it into a contribution distribution, and the result is applied to values. (For the longer reading, covering RKHS, Mercer, and why softmax-normalised similarity is the Nadaraya–Watson smoother, see Attention is Explainable Because it is a Kernel.)

The attention sub-block, stage by stage. X is the token matrix [N, d]; three learned projections produce Q, K, V (each [N, dₖ] with rows aligned to the same N tokens). The kernel step is S = Q Kᵀ / √dₖ: every entry S[i, j] is the scaled inner product of one query and one key, a real-valued pairwise score with named rows and columns. A row-wise softmax turns it into α, a row-stochastic [N, N] matrix, and the output is α V, each output row a convex mix of value rows.
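The three stages in that caption fit in a few lines of NumPy. This is a minimal single-head sketch with no masking and no output projection; the sizes `N`, `d`, `dk` and the random weights are placeholders, not values from any trained model.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Single-head attention: pairwise scores S, contribution mass alpha, output alpha @ V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    S = Q @ K.T / np.sqrt(K.shape[1])   # S[i, j]: score of query i against key j
    alpha = softmax_rows(S)             # each row is a distribution over the N tokens
    return S, alpha, alpha @ V          # each output row is a convex mix of value rows

rng = np.random.default_rng(0)
N, d, dk = 5, 8, 4                      # hypothetical sizes
X = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, dk)) for _ in range(3))

S, alpha, out = attention(X, Wq, Wk, Wv)
print(alpha.sum(axis=1))                # every row sums to 1
```

Everything the rest of the post asks of the MLP is already visible here: `S` is the named pairwise score and `alpha` is the normalised contribution.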

The MLP sub-block is the other side. It is an affine map, a pointwise activation, and another affine map, with nothing in between that names which inputs the layer is responding to.

A standard MLP block, stage by stage. The input vector x goes through a first linear map (W₁x + b₁), then a pointwise activation φ, then a second linear map (W₂a + b₂). Stage 3 is the geometric breakpoint: φ acts coordinatewise, the layer's Jacobian picks up a diagonal factor of activation derivatives that are exactly zero on inactive units (ReLU) or shrink toward zero where the activation saturates (sigmoid/tanh), and any geometry the linear part provided collapses. The collapse is the topic of an earlier post; see Activations Are Bad for Geometry.
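To make the Jacobian remark concrete, here is a small sketch for the ReLU case. The Jacobian of the block is W₂ · diag(φ′(z)) · W₁, so its rank cannot exceed the number of active units. The widths `n` and `h` and the random weights are assumptions chosen so the effect is visible; a real MLP block is usually wider than its input.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Standard MLP block: affine map, pointwise ReLU, affine map."""
    z = W1 @ x + b1               # pre-activations, shape [h]
    a = np.maximum(z, 0.0)        # pointwise ReLU
    return W2 @ a + b2, z

def mlp_jacobian(z, W1, W2):
    """d(output)/d(input) = W2 @ diag(relu'(z)) @ W1."""
    gate = (z > 0).astype(float)  # relu'(z): 1 on active units, 0 on inactive ones
    return W2 @ (gate[:, None] * W1)

rng = np.random.default_rng(0)
n, h = 6, 8                       # hypothetical input and hidden widths
W1, b1 = rng.normal(size=(h, n)), rng.normal(size=h)
W2, b2 = rng.normal(size=(n, h)), rng.normal(size=n)
x = rng.normal(size=n)

y, z = mlp_forward(x, W1, b1, W2, b2)
J = mlp_jacobian(z, W1, W2)
n_active = int((z > 0).sum())
rank = int(np.linalg.matrix_rank(J))
print(f"active units: {n_active}/{h}, Jacobian rank: {rank}")
```

Every inactive unit removes one row of W₁ from the product, which is the coordinatewise gating the caption describes.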

Three diagrams, two structurally different layers. Attention has a kernel; the MLP doesn’t. The rest of the post is about what changes if we put one in.

What a kernel layer carries

What changes the moment a single unit computes a kernel instead of an affine map? Put one next to the standard ReLU unit and the difference is visible before any algebra. The ReLU unit’s response is a half-plane: above the affine hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$ it is positive and grows without bound; below the hyperplane it is zero. There is no notion of a peak in input space, no symmetric “this is the input the unit is looking for”: the unit carries a direction, not a point. (This is the same distinction Opposite Is Not Different draws when it argues that cosine similarity has three landmarks: the unit’s direction on the sphere is only ever maximally different from its negation, not from an orthogonal alternative.) The kernel unit, by contrast, is a localised bump centred at a learnable point.

Interactive figure, two panels sharing one parameter direction θ. Left: standard MLP unit, max(0, w · x + b). Right: kernel unit, (x · W)² / (‖x − W‖² + ε).

Activation pattern of a standard ReLU unit (left) and a Yat kernel unit (right) over a 2D input space. Both have the same parameter direction. The ReLU unit fires across the entire half-plane in the direction of w, with no notion of where the response is centred. The kernel unit has a single localised peak at the prototype W. Click either chart to rotate the parameter; the slider does the same. The contrast is the entire argument: the kernel unit carries a prototype in input space; the ReLU unit carries only a direction.
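The same contrast can be read off numerically by walking along the ray through the parameter vector. This is a sketch using the two formulas from the panels; the 29° direction, the unit norm of `W`, and `eps = 0.05` are just the values shown in the figures, not tuned settings.

```python
import numpy as np

def relu_unit(x, w, b=0.0):
    """Standard MLP unit: max(0, w . x + b)."""
    return max(0.0, float(np.dot(w, x) + b))

def yat_unit(x, W, eps=0.05):
    """Kernel unit from the right-hand panel: (x . W)^2 / (||x - W||^2 + eps)."""
    return float(np.dot(x, W) ** 2 / (np.sum((x - W) ** 2) + eps))

theta = np.deg2rad(29.0)
W = np.array([np.cos(theta), np.sin(theta)])   # unit-norm parameter vector

scales = (0.5, 1.0, 2.0, 10.0)
relu_vals = [relu_unit(s * W, W) for s in scales]
yat_vals = [yat_unit(s * W, W) for s in scales]
for s, r, k in zip(scales, relu_vals, yat_vals):
    print(f"x = {s:>4} * W   relu = {r:7.3f}   yat = {k:7.3f}")
```

Along this ray the ReLU response keeps growing with the scale of the input, while the kernel response is largest close to `x = W` and drops on either side.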

That one localised peak is not a single property. Pull on it and four objects come out in sequence, each dragging the next one with it, in direct parallel to attention.

A pairwise score on every input/unit pair. The unit’s activation $K(\mathbf{x}, \mathbf{W}_u)$ is a real number that says how strongly $\mathbf{x}$ matches the unit’s centre, and the score is not a diagnostic bolted on afterwards: it is literally what the layer computes. The ReLU unit’s output, by contrast, is a scalar function of the input’s projection onto one direction, with no symmetric notion of similarity between the input and any other point.

A learnable centre in input space. A score of “how close” has to be close to something, and that something is $\mathbf{W}_u$ : the point the unit is looking for. Because the response is highest when $\mathbf{x}$ matches $\mathbf{W}_u$ , the weight vector is, formally, a soft prototype, interpretable in the same vocabulary as the network’s inputs. The standard MLP weight $\mathbf{w}$ is a direction; the kernel unit’s weight is a point.

A normalised contribution mass. When kernel scores are non-negative, as they are for the Gaussian, Laplace, and Yat kernels used in this post, they can be normalised: softmax, $L_1$, anything that converts non-negative scores to a partition of unity turns the layer’s activations into a contribution distribution over prototypes. Statements of the form “$45\%$ of this output came from unit $u_7$” become first-class.

A geometry on the inputs. And once every input carries a full profile of scores, the profiles induce a metric on input space: two inputs are close iff their unit-score profiles are close. For a Mercer kernel the metric is genuinely Riemannian; the layer pulls the network’s downstream notion of “near” back to a metric on $\mathbb{R}^n$ that the network respects.

These are the four objects practitioners pick up when they read an attention head and put down when they try to read a position-wise MLP.
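Two of those objects, the contribution mass and the induced geometry, are easy to compute once the scores exist. The sketch below uses a Gaussian kernel and three hand-placed prototypes as stand-ins; the Euclidean distance between score profiles is one simple choice of profile distance, assumed here for illustration.

```python
import numpy as np

def gaussian_scores(x, W, sigma=0.5):
    """Score of one input against every prototype row of W."""
    d2 = np.sum((W - x) ** 2, axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

def contribution_mass(scores):
    """L1-normalise non-negative scores into a distribution over units."""
    return scores / scores.sum()

def profile_distance(x1, x2, W, sigma=0.5):
    """Distance between two inputs, measured on their unit-score profiles."""
    diff = gaussian_scores(x1, W, sigma) - gaussian_scores(x2, W, sigma)
    return float(np.linalg.norm(diff))

W = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # hypothetical prototypes
x = np.array([0.1, 0.0])

mass = contribution_mass(gaussian_scores(x, W))
print("share per unit:", np.round(mass, 3))
near = profile_distance(x, np.array([0.2, 0.0]), W)
far = profile_distance(x, np.array([1.0, 1.0]), W)
print("near:", round(near, 3), "far:", round(far, 3))
```

The `mass` vector is the object that licenses sentences like "most of this output came from unit 0", and `profile_distance` is the layer's own notion of which inputs are neighbours.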

The two parameters of a kernel unit

So what does gradient descent actually get to move in such a layer? Each unit exposes two parameters, and they do separable work. The first is the prototype $\mathbf{W}_u \in \mathbb{R}^n$ . It is a point in input space, formally the location of a kernel section $K(\cdot, \mathbf{W}_u)$ , which is a single function in the RKHS associated to the kernel $K$ . The unit’s response to an input $\mathbf{x}$ is a measurement against that section: how close, in the kernel’s geometry, is $\mathbf{x}$ to $\mathbf{W}_u$ . Selecting a neuron, in this layer, means evaluating $K(\mathbf{x}, \mathbf{W}_u)$ , the prototype’s RKHS section sampled at the input.

The second parameter is the readout coefficient $\alpha_u \in \mathbb{R}$ . It is not a similarity score. It is the weight the layer places on the prototype’s contribution to the layer’s output. Positive $\alpha_u$ means “this prototype pushes the output up when its kernel fires”; negative $\alpha_u$ means “this prototype pushes the output down.” Magnitude is how loudly. With $m$ units, the layer computes a finite kernel expansion in the RKHS,

$$
f(\mathbf{x}) \;=\; \sum_{u=1}^{m} \alpha_u \, K(\mathbf{x}, \mathbf{W}_u).
$$

This is the same finite kernel expansion classical SVMs and Gaussian processes use; the difference is that here the centres $\mathbf{W}_u$ are learned end-to-end rather than fixed to the training data, and the $\alpha_u$ are the readout coefficients that turn the population of prototype responses into a layer output.
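As a sketch, the expansion is a two-step forward pass: build the matrix of kernel scores, then take the signed sum with the readouts. A Gaussian kernel is assumed here purely for concreteness, and the three prototypes and their readouts are made-up values.

```python
import numpy as np

def kernel_layer(X, W, alpha, sigma=0.4):
    """f(x) = sum_u alpha_u * K(x, W_u) for a batch X, with a Gaussian K.

    X: [N, n] inputs, W: [m, n] prototypes, alpha: [m] readouts.
    Returns f with shape [N] and the per-unit contributions with shape [N, m].
    """
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=-1)   # [N, m] squared distances
    K = np.exp(-d2 / (2 * sigma ** 2))                         # kernel scores
    contrib = K * alpha                                        # alpha_u * K(x, W_u)
    return contrib.sum(axis=1), contrib

W = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # hypothetical prototypes
alpha = np.array([1.0, -1.0, 0.5])                   # hypothetical readouts
X = W.copy()                                         # probe the layer at its own centres

f, contrib = kernel_layer(X, W, alpha)
print(np.round(f, 3))
print(np.abs(contrib).argmax(axis=1))                # dominant unit for each probe
```

Probing at the prototypes themselves shows the separable roles: the sign of `f` at each centre follows that unit's readout, and the dominant contribution is the unit whose prototype the input sits on.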

Interactive figure: a kernel layer with adjustable kernel width σ and one readout slider α_u per unit. Each readout is learned per neuron and acts as a linear projection from the kernel-similarity vector to the layer output.

A kernel-layer MLP block in one picture. Six prototypes W_u live in the 2D input space, each shown as a coloured dot, orange for positive readout α_u, slate for negative; saturation encodes magnitude. The heatmap behind them is the layer's output function f(x) = Σ αᵤ K(x, Wᵤ); orange regions where the layer outputs positive values, slate regions where it outputs negative. Drag the input x (the open white circle) to see, on the right, each unit's contribution αᵤ · K(x, Wᵤ) as a bar, plus the running sum f(x). Selecting a neuron is just reading off the largest |αᵤ · K(x, Wᵤ)|: the closer x is to Wᵤ in the kernel's geometry, the bigger the contribution.

The picture above is the structural content of an entire MLP block compressed to one page. Each prototype is a function in the RKHS: its bump in the heatmap. The readout coefficients pick out which bumps add positively and which subtract. The input $\mathbf{x}$ activates each unit according to how close it is to that unit's prototype. The layer’s output is the signed sum of those activations.

One instantiation: the Yat unit

Which kernel should a unit actually carry? The experiments below use the Yat unit, which on $\mathbf{x} \in \mathbb{R}^n$ computes

$$
y_u(\mathbf{x}) \;=\; \alpha_u \,\frac{(\mathbf{x} \cdot \mathbf{W}_u + b)^2}{\|\mathbf{x} - \mathbf{W}_u\|^2 + \varepsilon},
$$

with prototype $\mathbf{W}_u \in \mathbb{R}^n$ , bias $b$ , regulariser $\varepsilon > 0$ , and readout $\alpha_u$ . The fraction is the kernel similarity $K(\mathbf{x}, \mathbf{W}_u)$ : its denominator is a regularised squared Euclidean distance from $\mathbf{x}$ to $\mathbf{W}_u$ , minimised at $\mathbf{x} = \mathbf{W}_u$ ; its numerator is a squared inner product. The unit’s score is high when $\mathbf{x}$ is both close to $\mathbf{W}_u$ and aligned with it. The readout $\alpha_u$ multiplies that score to give the unit’s actual contribution to the layer.
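A vectorised sketch of that formula for a whole layer follows. It treats `b` as a single scalar shared by all units, which is how the equation is written, and uses two hand-picked prototypes so the values can be checked by eye; this is an illustration, not the paper's implementation.

```python
import numpy as np

def yat_layer(X, W, alpha, b=0.0, eps=0.05):
    """Yat units for a batch: y[i, u] = alpha_u * (x_i . W_u + b)^2 / (||x_i - W_u||^2 + eps).

    X: [N, n] inputs, W: [m, n] prototypes, alpha: [m] readouts.
    Returns the contributions y and the raw kernel scores K, both [N, m].
    """
    dot = X @ W.T + b                                          # numerator term, [N, m]
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=-1)   # squared distances, [N, m]
    K = dot ** 2 / (d2 + eps)                                  # kernel similarity, always >= 0
    return alpha * K, K

W = np.array([[1.0, 0.0], [0.0, 2.0]])   # hypothetical prototypes
alpha = np.array([1.0, -0.5])            # hypothetical readouts
X = W.copy()                             # evaluate each unit at its own prototype

y, K = yat_layer(X, W, alpha)
print(K)   # with b = 0 the diagonal is ||W_u||^4 / eps; off-diagonals are 0 (orthogonal rows)
print(y)
```

At `x = W_u` with `b = 0` the distance term vanishes and the score is `||W_u||^4 / eps`, so `eps` directly caps how tall the peak can get.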

Where the previous contrast settled that the unit has a peak at all, this panel is about the peak's shape. The Yat kernel section K(·, W) over a 2D input space is damped by the factor 1/(‖x − W‖² + ε) away from the peak. Click anywhere to move W; the ε slider is the selectivity dial: smaller ε means a sharper, more localised unit. This is the RKHS section attached to one prototype; the full layer is a weighted sum of m such sections, one per unit.

The Yat kernel is one choice in a family. Any positive-definite $K$ furnishes the four objects above; the geometry it induces, and therefore what the layer treats as “near” and “far”, depends on the particular kernel. The next viz puts four common kernels side by side with the same prototype $\mathbf{W}$ and the same input $\mathbf{x}$ , so the choice’s signature is visible.

Interactive figure: four kernel sections K(·, W) with a shared prototype W, a shared input x, and an adjustable width σ.

Gaussian: exp(−‖x − W‖² / (2σ²))
Laplace: exp(−‖x − W‖ / σ)
Yat: (x · W)² / (‖x − W‖² + ε)
polynomial: (x · W + 1)²

Four kernels K(·, W) with the same prototype W and the same input x. Gaussian and Laplace are bumps that fall off with distance; Yat has a similar falloff with distance but multiplied by the squared inner product, so far-from-origin directions are amplified; the polynomial kernel grows away from the origin and has no decay, so picking it as your layer's primitive means the layer no longer treats W as a prototype in the geometric sense at all. The kernel is the choice; the four affordances follow whichever choice you make.
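The four formulas can be put side by side as plain functions and evaluated at increasing distance from the prototype. The prototype `W = (1, 0)`, the probe points, and the default `sigma` and `eps` below are arbitrary choices for illustration.

```python
import numpy as np

def gaussian(x, W, sigma=0.5):
    return float(np.exp(-np.sum((x - W) ** 2) / (2 * sigma ** 2)))

def laplace(x, W, sigma=0.5):
    return float(np.exp(-np.linalg.norm(x - W) / sigma))

def yat(x, W, eps=0.05):
    return float(np.dot(x, W) ** 2 / (np.sum((x - W) ** 2) + eps))

def polynomial(x, W):
    return float((np.dot(x, W) + 1.0) ** 2)

W = np.array([1.0, 0.0])                       # hypothetical shared prototype
kernels = {"gaussian": gaussian, "laplace": laplace, "yat": yat, "polynomial": polynomial}
scales = (1.0, 3.0, 10.0)                      # probe at W, 3W and 10W

table = {name: [k(s * W, W) for s in scales] for name, k in kernels.items()}
for name, vals in table.items():
    print(f"{name:>10}: " + "  ".join(f"{v:10.4f}" for v in vals))
```

In this sketch the Gaussian and Laplace scores go to zero as the probe moves away, the polynomial score keeps growing, and the Yat score drops from its peak but levels off near ‖W‖² along the prototype's own direction instead of vanishing, which is the amplification of far-from-origin directions mentioned in the caption.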

A small architecture that uses it

Do the four affordances survive contact with a real task, when a whole network depends on them? The Painting Arithmetic experiment (the paper cited at the end of this post; every measured number below is from the run it reports) builds exactly the test case. The network takes three inputs: an image of digit $a$, an image of an operator ($+, -, \times, \div$), and an image of digit $b$. A shared CNN encoder maps each $28 \times 28$ input to a $64$-d embedding, a single Yat layer maps the concatenated $192$-d vector to $256$ unit activations, and a small ConvTranspose decoder paints the answer as a $28 \times 84$ image with three slots: sign, tens, units.

The architecture. Three image inputs share a CNN encoder into ℝ⁶⁴ each; the embeddings are concatenated into ℝ¹⁹²; a single Yat layer maps to 256 unit activations; a small ConvTranspose decoder paints the answer image, partitioned into sign/tens/units slots. The only nonlinear layer between encoder and decoder is the Yat trunk, so everything observable about the network's mid-stream representation is observable about that single layer.

The single-layer trunk is the point. With one Yat layer rather than a stack, every row of its $256 \times 192$ weight matrix is a named prototype in the encoder’s embedding space, and the network’s entire mid-stream representation is the matrix and the 256 unit activations it produces.

Each row $\mathbf{W}_u \in \mathbb{R}^{192}$ partitions naturally into three $\mathbb{R}^{64}$ slots matching the three input embeddings,

$$
\mathbf{W}_u \;=\; \bigl(\mathbf{W}_u^{(a)},\; \mathbf{W}_u^{(\mathrm{op})},\; \mathbf{W}_u^{(b)}\bigr),
$$

so the unit’s prototype factorises by input role: a centre for the first digit, a centre for the operator image, a centre for the second digit. The four operations below all act on this matrix.
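One consequence of the slot structure is worth checking numerically: both ingredients of the Yat score, the inner product and the squared distance, split into a sum of per-slot terms. The sketch below uses random stand-ins for the trunk matrix and the three embeddings, and assumes the concatenation order (a, op, b) matches the slot order of the rows.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 192))              # stand-in for the trunk matrix, not trained weights
W_a, W_op, W_b = np.split(W, 3, axis=1)      # three [256, 64] slot matrices

e_a, e_op, e_b = rng.normal(size=(3, 64))    # stand-ins for the three encoder embeddings
x = np.concatenate([e_a, e_op, e_b])         # the 192-d trunk input

# Inner product and squared distance against every prototype, computed on the full vector...
dot_full = W @ x
d2_full = ((x - W) ** 2).sum(axis=1)

# ...and as sums of per-slot terms.
dot_slots = W_a @ e_a + W_op @ e_op + W_b @ e_b
d2_slots = (((e_a - W_a) ** 2).sum(axis=1)
            + ((e_op - W_op) ** 2).sum(axis=1)
            + ((e_b - W_b) ** 2).sum(axis=1))

print(np.allclose(dot_full, dot_slots), np.allclose(d2_full, d2_slots))
```

Because each role contributes its own additive term to the numerator and the denominator, editing one slot of a row changes only that role's term, which is what makes slot-level edits well defined.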

Four operations that follow

If every row of the trunk is a named point in the encoder’s embedding space, then operations that are research programs on a standard MLP shrink to one-line edits of a matrix. Four of them, in escalating strength.

Naming. For each unit, find the library symbol whose encoder embedding is nearest to each slot of $\mathbf{W}_u$ . The unit’s role is “fire on inputs that look like this $(a, \text{op}, b)$ triple.” No SAE, no probe; the weights of the layer are the dictionary. And the dictionary says something five years of superposition results teach you not to expect. A polysemantic layer should smear “operator-ness” across everything and concentrate it nowhere; instead, in the paper’s trained network, $55$ of the $256$ trunk units have an operator class as the winner of their middle-slot prototype $\mathbf{W}_u^{(\mathrm{op})}$ , a distinct operator-reading block sitting inside the trunk, visible from the weights alone.

Visualisation. Push the one-hot activation $\alpha \cdot \mathbf{e}_u$ through the decoder. The result is the per-unit footprint image, the pixels unit $u$ alone paints into the output, with everything else off. In the paper’s unit atlas, many footprints are spatially localised to a single output slot; the model has carved itself into a slot alphabet whose neurons are the trunk’s prototypes pushed forward by the decoder.

Ablation by name. Identify the units whose prototype matches a category in the named vocabulary. Zero those rows. By the layer’s kernel structure the rest of the units still fire on their own prototypes; only the targeted ones go silent. Specificity is a property of the prototype, not of a probe trained to find it.

Slot-level surgery. Where the unit’s input partitions into named subspaces, the prototype partitions with it. Zeroing only the operator-slot subspace of a tagged unit silences the unit’s reading of the operator image while leaving the digit pathways intact. This is the difference between “the unit contributes to behaviour $X$ ” and “the unit contributes to behaviour $X$ specifically through input role $r$ ”, a mechanistic claim activation steering on a black-box MLP cannot make.
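Three of these operations are short enough to sketch directly as matrix edits. The code below is an illustration on synthetic data: the "library" of operator embeddings and the trunk matrix are random stand-ins, nearest means Euclidean nearest (an assumption), and the Yat score is taken without bias so that a zeroed row is exactly silent.

```python
import numpy as np

SLOT = 64  # embedding width per input role

def nearest_symbol(W_slot, library):
    """Naming: index of the library embedding nearest to each unit's slot prototype."""
    d2 = ((W_slot[:, None, :] - library[None, :, :]) ** 2).sum(axis=-1)   # [m, s]
    return d2.argmin(axis=1)

def ablate_rows(W, units):
    """Ablation by name: zero the whole prototype of each listed unit."""
    W = W.copy()
    W[units] = 0.0
    return W

def ablate_slot(W, units, slot):
    """Slot-level surgery: zero only one role's columns of each listed unit."""
    W = W.copy()
    W[units, slot * SLOT:(slot + 1) * SLOT] = 0.0
    return W

def yat_scores(x, W, eps=0.05):
    """Bias-free Yat score of one input against every prototype row."""
    return (W @ x) ** 2 / (((x - W) ** 2).sum(axis=1) + eps)

rng = np.random.default_rng(0)
library = rng.normal(size=(4, SLOT))                 # stand-ins for 4 operator embeddings
true_op = rng.integers(0, 4, size=256)               # synthetic ground-truth tags
W = rng.normal(size=(256, 3 * SLOT))                 # stand-in trunk matrix
W[:, SLOT:2 * SLOT] = library[true_op] + 0.01 * rng.normal(size=(256, SLOT))

names = nearest_symbol(W[:, SLOT:2 * SLOT], library) # read the tags off the weights
tagged = np.flatnonzero(names == 2)                  # units tagged with symbol index 2

x = rng.normal(size=3 * SLOT)
base = yat_scores(x, W)
after_row = yat_scores(x, ablate_rows(W, tagged))
W_slot_cut = ablate_slot(W, tagged, slot=1)
after_slot = yat_scores(x, W_slot_cut)
print(len(tagged), "units tagged;", "max score after row knockout:", after_row[tagged].max())
```

With a nonzero bias the numerator of a zeroed row would be `b²` rather than 0, so "silent" in that case would mean "input-independent apart from the distance term" rather than exactly zero.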

The knockout

Names are cheap until you bet on them, so here is the bet, and the expectation to measure it against. If the trunk’s knowledge were distributed and polysemantic the way a standard MLP’s is, deleting $30$ of its $256$ rows should hurt every operation a little and no operation much, whichever $30$ you pick. The paper runs that bet three ways: zero the entire row of each tagged unit, zero only its operator slot, or zero a random subset of the same size, and it reports the per-operator OCR change for each. The panel below replays the paper’s measured intervention table; pick a subset and an intervention and watch which expectation survives.

ΔOCR per operator, replaying the measured intervention table (Table 2) of the Painting Arithmetic paper; every bar is a number from the paper's run, none is computed on this page. Pick the knocked-out subset (the operator whose prototype tags the targeted units) and the intervention (row, slot, or random). The targeted operator's bar collapses while off-target bars barely move; a random subset of the same size hurts everything indiscriminately. The paper reports the slot and random variants only for the × and + subsets, and a per-operator baseline only for × (86.5%), which is why the chart shows changes in percentage points rather than absolute accuracies.

The distributed-knowledge expectation breaks on the first click. In the paper’s run, zeroing the $30$ rows whose middle-slot prototype is the $\times$ glyph drops multiplication OCR from $86.5\%$ to $42.7\%$ , a fall of $43.7$ percentage points, while division loses only $1.6$ pp. Zeroing only the middle slot of those same $30$ rows reproduces $-40.0$ pp of that drop, and the worst collateral hit on any other operator is $3.6$ pp, on subtraction. The multiplication computation does not flow through these units’ digit pathways or through a distributed residual code; it flows specifically through the $30 \times 64$ weights of their operator-slot prototypes.

Could any $30$ rows have done that? The paper’s control says no: a random subset of $30$ units, ablated the same way, drops multiplication by only $9.2$ pp. By the paper’s specificity score (the target’s drop minus the mean drop on the other three operators), the ×-row knockout scores $+33.8$ pp against a random-ablation baseline of $+2.1$ pp.
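The specificity score as described here (the target's drop minus the mean drop on the other operators) is a one-liner. The numbers in the example are invented to show the arithmetic and are not taken from the paper's table.

```python
def specificity(drops, target):
    """Target operator's drop minus the mean drop on all other operators (percentage points)."""
    others = [v for op, v in drops.items() if op != target]
    return drops[target] - sum(others) / len(others)

# Illustrative drops in percentage points; these are NOT the paper's measurements.
example_drops = {"+": 2.0, "-": 4.0, "×": 40.0, "÷": 3.0}
score = specificity(example_drops, "×")
print(score)   # 40.0 - (2.0 + 4.0 + 3.0) / 3 = 37.0
```

A knockout that hurts every operator equally scores near zero on this measure, which is why it separates a targeted subset from a random one of the same size.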

None of this required a sparse autoencoder, a probe, or an external dictionary. The model’s “knowledge of multiplication” is a subset of rows in a $256$ -row matrix, named directly off the kernel structure of the layer, and the prototype labels survive surgical editing of the matrix. Every number in this section is from the run reported in the paper cited below; a runnable companion that reproduces the table from a fresh training run is planned.

Be the optimiser

The point of the construction is that the layer’s parameters are legible enough that a human can place them by hand. The widget below puts you in the place of gradient descent: pick a classic 2D dataset, click anywhere on the chart to drop a prototype $\mathbf{W}_u$ , drag it to a position you think matters, and dial each $\alpha_u$ on the slider beneath. The decision boundary $f(\mathbf{x}) = 0$ and the per-point accuracy update live. Outlined points are misclassified.

Classification by hand. Click empty space to add a prototype; drag to move; shift-click (or right-click) to delete. Per-unit α sliders are below the chart. Try the moons, circles, XOR, and spirals datasets: three to four well-placed prototypes per class with alternating-sign α is usually enough to clear the easy ones, and the spirals reward narrower σ plus more prototypes. The 'auto-fit α' button leaves your prototype positions alone and solves a ridge-regularised least-squares problem for the readouts, the linear projection the layer's downstream consumer would solve for you. The kernel toggle switches between Gaussian and Yat; the same prototypes give different boundaries because the geometry on input space changes.

The reason this is possible at all is that the layer’s two parameter families do separable, interpretable work. $\mathbf{W}_u$ is where the unit listens; $\alpha_u$ is how its kernel score is read out. Placing prototypes near each class’s mass and turning on the readouts is the same algorithm a kernel SVM would run, performed in your head. The picture you get on the chart is the kind of picture such a layer encodes in its weight matrix. Whether gradient descent, given the same layer, drifts the prototypes toward the same kind of placement is a hypothesis this playground does not test; the trained trunk of the previous section, whose rows land on nameable symbol prototypes, is one data point in its favour.
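The readout half of the game can be written down as ordinary ridge regression on the kernel features. The sketch below is one plausible form of such a fit, not the widget's actual code: a Gaussian kernel, labels in {−1, +1}, a small ridge penalty `lam`, and a four-point XOR layout with one hand-placed prototype on each point.

```python
import numpy as np

def gaussian_features(X, W, sigma=0.5):
    """Kernel scores of every input against every prototype, shape [N, m]."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit_alpha(X, y, W, sigma=0.5, lam=1e-3):
    """Ridge-regularised least squares for the readouts, prototypes held fixed."""
    Phi = gaussian_features(X, W, sigma)
    A = Phi.T @ Phi + lam * np.eye(W.shape[0])
    return np.linalg.solve(A, Phi.T @ y)

# Four-point XOR: same-sign corners share a label.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
W = X.copy()                                   # one hand-placed prototype per corner

alpha = fit_alpha(X, y, W)
pred = np.sign(gaussian_features(X, W) @ alpha)
accuracy = float((pred == y).mean())
print(np.round(alpha, 3), accuracy)
```

The fitted readouts come out with alternating signs matching the labels of the corners their prototypes sit on, which is the same pattern the caption suggests setting by hand.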

This is the affordance the standard MLP doesn’t have. There is no analogous game for ReLU units: placing affine half-planes by hand to separate a moons dataset is possible but unilluminating, because the units don’t tell you what they listen for. The kernel layer’s two-parameter structure makes the construction not just possible but pedagogical: you can see the layer’s job, do its job by hand, check your work, and read out what you did.

The MLP chose to be opaque

The point of the experiment is not that this particular architecture is the right one to scale. The Yat unit is rational where the standard MLP is affine, the optimisation behaves differently, the FLOPs are different, and the practical scaling regime is an open question that a $0.81$M-parameter network does not answer.

The point is structural. The MLP block was illegible not because it was deep, not because high dimensions are opaque, not because activations get distributed across many neurons. It was illegible because its primitive, affine map plus pointwise nonlinearity, does not carry the four objects that make a kernel layer readable. The collapse is mechanical (see Activations Are Bad for Geometry): the resulting layer has no native similarity score, no prototype, no normalised contribution, no induced metric. Everything downstream interpretability has spent the past five years building is an external apparatus to recover the objects the primitive does not supply.

Replace the primitive with one that does supply them and the apparatus becomes superfluous. The same operations attention has enjoyed by construction (name a unit, visualise it, ablate it, slice it by input role) become one-line edits of the layer’s weight matrix, and the resulting claims have the strength attention claims do, because they are statements about the layer’s actual representation rather than about a learned proxy for it.

The MLP block is not inherently opaque. It chose to be, by adopting a primitive that does not carry geometry. The choice is reversible.

Cite as

Bouhsine, T. (2026, May 21). What an MLP Knows, When It's a Kernel. Records of the !mmortal Data Scientist. https://tahabouhsine.com/blog/what-an-mlp-knows/

BibTeX

@misc{bouhsine2026whatanmlpknows,
  author       = {Bouhsine, Taha},
  title        = {What an MLP Knows, When It's a Kernel},
  year         = {2026},
  month        = {may},
  howpublished = {\url{https://tahabouhsine.com/blog/what-an-mlp-knows/}},
  note         = {Blog post, Records of the !mmortal Data Scientist}
}

For the underlying paper

Bouhsine, T. (2026). Painting Arithmetic: A Rational-Form Network for Visual Symbolic Computation in Latent Space. Supporting experiment. [PDF]

BibTeX

@unpublished{bouhsine2026paintingarithmetic,
  author = {Bouhsine, T.},
  title  = {Painting Arithmetic: A Rational-Form Network for Visual Symbolic Computation in Latent Space},
  year   = {2026},
  note   = {Supporting experiment}
}

References

Mercer, J. (1909). Functions of Positive and Negative Type, and their Connection with the Theory of Integral Equations. Philosophical Transactions of the Royal Society A 209, 415–446.
Nadaraya, E. A. (1964). On Estimating Regression. Theory of Probability & Its Applications 9(1), 141–142.
Watson, G. S. (1964). Smooth Regression Analysis. Sankhyā: The Indian Journal of Statistics, Series A 26(4), 359–372.
