Attention is Explainable Because it is a Kernel


#ml #attention #kernels #interpretability

A reading of self-attention through kernel smoothing and RKHS.


Self-attention is the part of a transformer that practitioners can usually read. Heads can be visualized, individual attention patterns line up with recognizable behaviors, and a token’s output can be written as an explicit weighted sum over other tokens. The position-wise MLP, sitting in the very same block, almost never affords this kind of reading.

This piece is an explanation, not a new result. The observation that self-attention is mathematically a kernel smoother is due to Tsai et al. (2019) and has been developed further by Song et al. (2021), Choromanski et al. (2021), Katharopoulos et al. (2020), and Han et al. (2022), among others. What I want to emphasize, and what seems to me underdiscussed, is that this is not just a reformulation useful for designing efficient attention variants. It is the reason attention is explainable in the first place.

The claim of this piece, in one sentence: everything practitioners find “explainable” about attention is downstream of the single structural fact that attention has a kernel, and therefore it has a geometry, normalized contribution mass, and (when one is willing to symmetrize) an RKHS in which to reason about it. A standard ReLU MLP has none of these, not because it is more powerful, but because it is not a kernel machine.

The asymmetry that the architecture does not advertise

Inside a single transformer block, self-attention and the position-wise MLP sit on equal footing. They share the residual stream and consume comparable parameter budgets. Yet attention heads are routinely described in algorithmic terms (induction heads, copying heads, name-mover heads, positional heads), while MLPs are described, when described at all, as “computation,” “feature synthesis,” or simply “the part we don’t yet understand.”

The disparity is so familiar that it is easy to mistake for a fact about visualization tooling. It is not. The asymmetry is mathematical, and it has a name in the older literature: attention is a kernel smoother, and the MLP is not.

Once one sees attention as a kernel smoother, the things practitioners do when they “explain a head” (reading off pairwise affinities $q_i \cdot k_j$, attributing portions of the output to specific tokens, reasoning about locality and retrieval) turn out to be exactly the operations that kernel smoothers were designed to support a half-century before transformers existed. The MLP, by contrast, has no kernel and so admits none of these operations natively. Whatever interpretability one can extract from it has to be imposed externally, by training a separate decoder, projecting into a learned dictionary, or otherwise constructing the geometry that the layer itself does not carry.

Attention as a kernel smoother

The starting point is the definition every reader knows. Given a token sequence with queries $Q$, keys $K$, and values $V$ obtained by learned linear projections, scaled dot-product attention returns:

$$\mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt d}\right) V,$$

or, written token by token,

$$y_i = \sum_j \alpha_{ij}\, v_j, \qquad \alpha_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt d)}{\sum_m \exp(q_i \cdot k_m / \sqrt d)}.$$

Side by side with this, write down the Nadaraya–Watson kernel regression estimator from classical nonparametric statistics. Given observations $(x_i, y_i)$ and a query point $x$:

$$\hat f(x) = \sum_i \frac{K(x, x_i)}{\sum_j K(x, x_j)}\, y_i.$$

These two expressions are the same up to relabeling.
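To make the relabeling concrete, here is a minimal numpy sketch (names and shapes are illustrative, not any particular library's API) that computes a single attention head both ways and checks they agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                         # sequence length, head dimension
X = rng.normal(size=(n, d))         # token representations
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Attention as usually written: softmax(Q K^T / sqrt(d)) V.
scores = Q @ K.T / np.sqrt(d)
A = np.exp(scores)
A /= A.sum(axis=1, keepdims=True)
attn_out = A @ V

# Nadaraya-Watson as usually written: sum_i K(x, x_i) y_i / sum_j K(x, x_j).
def nadaraya_watson(x, points, targets, kernel):
    w = np.array([kernel(x, p) for p in points])
    return (w / w.sum()) @ targets

exp_dot = lambda q, k: np.exp(q @ k / np.sqrt(d))
nw_out = np.stack([nadaraya_watson(q, K, V, exp_dot) for q in Q])

assert np.allclose(attn_out, nw_out)   # same operator, two notations
```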

Nadaraya–Watson kernel regression. The bell at the bottom of the chart is the kernel centered at the query $x$; data points are sized by their kernel weight; the smoothed curve is the prediction for every $x$ in the domain. Self-attention is this picture, with the kernel replaced by an exponential dot product over learned q/k projections and the $y_i$ replaced by learned value projections.
| Kernel regression | Self-attention |
| --- | --- |
| Query point $x$ | Query projection $q_i$ |
| Data point $x_i$ | Key projection $k_j$ |
| Kernel $K(x, x_i)$ | $\exp(q_i \cdot k_j / \sqrt d)$ |
| Target $y_i$ | Value projection $v_j$ |
| Normalized weighting | Softmax denominator |

The only structural difference is that the kernel in attention is non-symmetric: queries and keys are projected by different matrices $W_Q$ and $W_K$, so the exp-inner-product “kernel” is not positive semi-definite as a function of $(q, k)$ and not a Mercer kernel in the strict sense. This caveat will return when we get to RKHS, but for the kernel-smoother reading itself it changes nothing. Attention is a kernel smoother that has learned its kernel and its targets end-to-end, and this fact alone supplies the structural affordances that make attention readable.
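A quick check of that caveat, in the same illustrative numpy style: with distinct projections the score function is generically non-symmetric.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
score = lambda x, y: np.exp((x @ W_Q) @ (y @ W_K) / np.sqrt(d))

a, b = rng.normal(size=d), rng.normal(size=d)
print(score(a, b), score(b, a))   # generically unequal: not a Mercer kernel
```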

Three affordances kernel structure gives you

These are what you are actually using whenever you “read” an attention pattern.

1. The kernel is an explicit pairwise score. For every pair of tokens $(i, j)$ there is a single real number $q_i \cdot k_j$ that summarizes how relevant token $j$ is to token $i$ under this head. Visualizing a head means visualizing the matrix of $\alpha_{ij}$. Comparing two heads means comparing two such matrices. The algorithmic descriptions one finds in the mechanistic interpretability literature (“this head copies from the most recent occurrence of the current token,” “this head attends from each token to its syntactic head”) are statements about the structure of this matrix. The kernel supplies a geometry on tokens, and geometry is the kind of object humans can reason about.

2. The weights normalize. The softmax enforces $\sum_j \alpha_{ij} = 1$ with $\alpha_{ij} \ge 0$, so each output $y_i$ is a convex combination of value vectors. This is the source of every attribution-style statement one ever makes about attention. When we say a head “moved information from token $j$ to token $i$,” we mean $\alpha_{ij}$ was large. When we say a head “ignored token $j$,” we mean $\alpha_{ij}$ was small. These statements are coherent precisely because the weights are normalized contribution masses, not arbitrary activations.

A standard linear layer, where some output coordinate is a linear combination of input coordinates with weights that can be positive, negative, or large in magnitude, does not admit this reading at all. The weights are not a partition of unity, do not compose across layers in any attribution-respecting way, and need not even be of consistent sign for nearby inputs. The fact that attention does admit it is again a direct consequence of its being a kernel smoother. In the Nadaraya–Watson form the weights normalize for the same reason and serve the same role.

Three causal attention matrices over a 12-token sequence, each induced by a different kernel structure. Row $i$ shows what token $i$ attends to; brighter cells are higher $\alpha_{ij}$. Left: a positional kernel that decays with distance. Center: an induction-like pattern where each token looks back to positions following earlier occurrences of the previous token. Right: a previous-token kernel. These are not three different layers; they are three different heads. The kernel is what tells you which one you are reading.

3. Kernels carry locality. Kernel smoothers, by construction, weight nearby points more heavily than distant ones, and the notion of “nearby” is whatever the kernel says it is. In attention, “nearby” is high $q_i \cdot k_j$, which the model is free to shape during training. The upshot is that attention behaves, by default, like a content-addressable nearest-neighbor retrieval over the sequence: $j^{*}_i = \operatorname*{arg\,max}_j q_i \cdot k_j$ in the sharp limit. This is exactly how it gets used in many of the algorithms that mechanistic interpretability has uncovered. Induction heads retrieve previous occurrences of a token. Name-mover heads retrieve antecedents. Positional heads retrieve fixed offsets. None of these descriptions requires anything beyond the kernel-smoother view; they are different specializations of “look up the most similar previous token under this kernel.” The sketch just after this list plays out the sharp limit.
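A minimal numpy sketch of that sharp limit, with illustrative names: as the kernel sharpens (a small temperature in place of $\sqrt d$), each row of the attention matrix collapses onto its argmax and the smoother becomes literal retrieval.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 16
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))

def attn_weights(Q, K, temp):
    s = Q @ K.T / temp
    s -= s.max(axis=1, keepdims=True)   # numerical stability only
    a = np.exp(s)
    return a / a.sum(axis=1, keepdims=True)

soft = attn_weights(Q, K, temp=np.sqrt(d))   # ordinary scaled attention
sharp = attn_weights(Q, K, temp=1e-2)        # sharp limit

# Each row of `sharp` is ~one-hot on j*_i = argmax_j q_i . k_j:
assert np.array_equal(sharp.argmax(axis=1), (Q @ K.T).argmax(axis=1))
print(np.round(soft[0], 2))    # smoothing: mass spread over tokens
print(np.round(sharp[0], 2))   # retrieval: mass on the argmax
```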

And, when you symmetrize, an RKHS

There is one further connection, weaker but worth stating, that completes the picture. If one is willing to symmetrize the kernel, replacing the query/key asymmetry with a single symmetric, positive-definite kernel $K(z, z')$ acting on a shared representation (as several variants in the literature do), then attention lives inside a reproducing kernel Hilbert space

$$\mathcal{H}_K = \overline{\mathrm{span}}\{K(\cdot, z) : z \in \mathbb{R}^d\}.$$

The Han et al. analysis makes this explicit: attention can be read as kernel density estimation, and KDE itself is a kernel regression problem in an RKHS.

The reason this matters for explanation, beyond the formal pleasure of having a Hilbert space, is that the RKHS view supplies two further objects that a generic layer does not have.

The first is a function-level norm. The function $f \in \mathcal{H}_K$ that the layer computes has a well-defined $\|f\|_{\mathcal{H}_K}$, and this norm controls smoothness, generalization, and the per-input complexity of the prediction in ways that are classical and quantitative.

The second is a basis. The kernel sections $\{K(\cdot, k_j)\}$ are basis elements of $\mathcal{H}_K$, and the function the layer computes is a finite expansion

$$f(\cdot) = \sum_j \beta_j\, K(\cdot, k_j)$$

in this basis with explicit coefficients. “Which token does this output depend on, and by how much?” becomes a literal question about coefficients in a fixed basis, not an interpretive one.

A function (dashed) approximated as a finite sum of kernel sections $K(\cdot, k_j)$ centered at fixed anchor points. The orange curve is the sum; the lighter strokes are the individual $\beta_j\, K(\cdot, k_j)$. Shrinking the kernel width makes each section more local and the explanation more token-specific; widening it pools more sections together. This is what the inside of a kernel-machine layer looks like.
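Because the strict RKHS reading needs symmetry, here is a sketch with a symmetric Gaussian stand-in kernel: a target function is expanded in kernel sections at fixed anchors, the coefficients $\beta_j$ are solved for directly, and the closed-form norm $\beta^\top K \beta$ falls out at the end. Anchors, widths, and the target are all illustrative.

```python
import numpy as np

# Symmetric PSD stand-in kernel and fixed anchor points (the "k_j").
kernel = lambda x, z: np.exp(-(x - z) ** 2 / 0.5)
anchors = np.linspace(-3, 3, 9)

xs = np.linspace(-3, 3, 200)
target = np.sin(xs) + 0.3 * xs          # function to expand

# Solve for beta in f = sum_j beta_j K(., k_j) by least squares.
Phi = kernel(xs[:, None], anchors[None, :])     # (200, 9) kernel sections
beta, *_ = np.linalg.lstsq(Phi, target, rcond=None)

f_hat = Phi @ beta
print("coefficients per anchor:", np.round(beta, 2))   # literal attribution
print("max fit error:", np.abs(f_hat - target).max())

# Closed-form RKHS norm of the expansion: ||f||^2 = beta^T K beta.
Gram = kernel(anchors[:, None], anchors[None, :])
print("RKHS norm^2:", float(beta @ Gram @ beta))
```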

Even where the strict RKHS structure breaks (because the attention kernel is non-symmetric), the kernel-smoother view retains a geometry on tokens: a learned distance, an explicit similarity score. It is this geometry that makes attention legible. The MLP, as we now argue, has no geometry at all.

The MLP has no kernel, and that is the whole story

We can now state plainly the side of the story that the interpretability literature usually phrases as “MLPs are hard,” because the kernel reading makes the obstruction concrete.

A position-wise MLP is, by definition,

$$\mathrm{MLP}(x) = W_2\, \sigma(W_1 x + b_1) + b_2,$$

typically with $\sigma$ a ReLU or GELU.

There is no kernel $K(x, x')$ in this expression, learned or otherwise. There is no similarity score one can point at. There is no normalized weighting of inputs into outputs. There is no notion of which directions in $\mathbb{R}^d$ the layer treats as “nearby.” There is no Hilbert space in which the function $\mathrm{MLP}$ lives with a controllable norm.

The pre-activations $(W_1 x + b_1)_i = w_i \cdot x + b_i$ are linear features of the input, but a linear feature is not a kernel: it scores in one direction only, $w_i \in \mathbb{R}^d$, and supplies no geometry on pairs $(x, x')$. The ReLU nonlinearity then composes with the second linear map $W_2$, which mixes neurons arbitrarily and erases any chance that an individual unit corresponds to a human-readable feature.
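For contrast, the whole MLP forward pass fits in a few lines (shapes illustrative), and the point is what never appears in it: no pairwise score, no normalization, no geometry.

```python
import numpy as np

rng = np.random.default_rng(3)
d, h = 16, 64
W1, b1 = rng.normal(size=(h, d)), np.zeros(h)
W2, b2 = rng.normal(size=(d, h)), np.zeros(d)

def mlp(x):
    pre = W1 @ x + b1                      # one-directional features w_i . x
    return W2 @ np.maximum(pre, 0.0) + b2  # ReLU, then arbitrary mixing

y = mlp(rng.normal(size=d))
# The effective input weights, W2 @ diag(pre > 0) @ W1, are signed,
# unnormalized, and input-dependent: not a partition of unity over anything.
```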

This is, in a precise sense, the structural origin of the now-standard observations about MLP neurons. They are polysemantic because the parameterization does not reward monosemy. They are distributed because the parameterization does not pick out a privileged basis. They are basis-dependent in their feature decomposition because no canonical basis is on offer. The phenomenon called “superposition” is just the absence of a geometry. There is no metric on input space that the layer respects, and so features have nowhere natural to live except in arbitrary linear combinations of activations.

By contrast, attention does carry a geometry: the kernel. And so attention exposes its features in that geometry, in the only basis the model already uses, namely the token basis. The asymmetry is not that one layer is more powerful than the other; it is that one layer is a kernel machine and the other is not.

But isn’t this just the price of universal approximation?

The natural objection here is that universal approximation forces this state of affairs. A function class powerful enough to approximate arbitrary continuous functions cannot afford the rigidity of a kernel expansion.

The objection conflates the function class with the parameterization. Classical kernel methods are themselves universal in the relevant sense. An RKHS $\mathcal{H}_K$ with a sufficiently rich kernel (for example, a characteristic kernel) is dense in $C(\mathcal{X})$ on a compact set, and yet every predictor in $\mathcal{H}_K$ remains a finite expansion

$$f = \sum_i \alpha_i\, K(\cdot, x_i)$$

in kernel sections with closed-form norm $\|f\|^2_{\mathcal{H}_K} = \alpha^\top K \alpha$.

The reason a transformer MLP is not such an object is not that the function class is too rich. It is that the architecture has chosen the cheapest possible primitive (an affine map followed by a pointwise nonlinearity) and accepted the loss of structure as the cost. This is a design decision, not a theorem.

Recent work makes the point constructively. The Yat kernel,

$$k_{b,\varepsilon}(w, x) = \frac{(w^\top x + b)^2}{\|x - w\|^2 + \varepsilon}, \qquad b \ge 0,\ \varepsilon > 0,$$

is a hidden-unit primitive that is a Mercer kernel for $b \ge 0$, dominates a scaled inverse-multiquadric in the Loewner order so that its RKHS is universal and characteristic, and yields a layer that is by construction a finite learned-center kernel expansion

$$f(x) = \sum_{i=1}^{n} \alpha_i\, k_{b,\varepsilon}(w_i, x)$$

with closed-form RKHS norm $\alpha^\top K \alpha$.
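A direct transcription of these two formulas into numpy, as a sketch rather than a reference implementation (class and variable names are mine):

```python
import numpy as np

def yat_kernel(w, x, b=1.0, eps=1e-3):
    """k_{b,eps}(w, x) = (w.x + b)^2 / (||x - w||^2 + eps), b >= 0, eps > 0."""
    return (w @ x + b) ** 2 / (np.sum((x - w) ** 2) + eps)

class YatLayer:
    """f(x) = sum_i alpha_i k_{b,eps}(w_i, x): a finite learned-center
    kernel expansion whose per-center contributions can be read off."""
    def __init__(self, n_centers, d, rng):
        self.W = rng.normal(size=(n_centers, d))    # learned centers w_i
        self.alpha = rng.normal(size=n_centers)     # expansion coefficients

    def __call__(self, x):
        k = np.array([yat_kernel(w, x) for w in self.W])
        return self.alpha @ k, self.alpha * k       # output, per-center terms

rng = np.random.default_rng(4)
layer = YatLayer(n_centers=8, d=16, rng=rng)
y, contributions = layer(rng.normal(size=16))
print(y, np.round(contributions, 3))
# Each hidden unit is a kernel section with an explicit coefficient,
# not an anonymous ReLU feature; the RKHS norm is alpha^T K alpha with
# K the Gram matrix of the centers.
```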

I am not arguing here that Yat-style MLPs are a practical replacement for the transformer FFN. That is an empirical question and not the subject of this piece. The point is to close the explanatory loop. The opacity of standard MLPs is not the price of expressivity. It is the price of giving up the kernel.

What this does and does not claim

Attention weights are not the same as explanations, and the literature contains a well-known back-and-forth on exactly how far one can trust them. What the kernel reading provides is not faithfulness but affordances: the structural objects (pairwise scores, normalized contributions, a geometry, sometimes an RKHS) that any honest explanation has to be grounded in.

The MLP lacks those objects natively, so every explanation of MLP behavior has to import its geometry from elsewhere, whether through a sparse dictionary, a probe, or a trained decoder. Until the kernel is restored, the work of explaining MLPs will remain the work of supplying, after the fact, the structure that attention has carried all along.


References inline. The kernel-smoother view of attention is due to Tsai et al., with related developments by Song et al., Choromanski et al., Katharopoulos et al., and Han et al. The Yat kernel is from Bouhsine, 2026. The superposition framing is from Elhage et al., and the mechanistic interpretability descriptions are from the circuits thread and related work.

Cite as

Bouhsine, T. (2026). Attention is Explainable Because it is a Kernel. Records of the !mmortal Data Scientist. https://tahabouhsine.com/blog/attention-is-a-kernel/

BibTeX
@misc{bouhsine2026attentionisakernel,
  author       = {Bouhsine, Taha},
  title        = {Attention is Explainable Because it is a Kernel},
  year         = {2026},
  month        = {may},
  howpublished = {\url{https://tahabouhsine.com/blog/attention-is-a-kernel/}},
  note         = {Blog post, Records of the !mmortal Data Scientist}
}

For the underlying paper

Bouhsine, T. (2026). A Universal Reproducing Kernel Hilbert Space from Polynomial Alignment and IMQ Distance. arXiv:2605.03262.

BibTeX
@article{bouhsine2026260503262,
  author        = {Bouhsine, T.},
  title         = {A Universal Reproducing Kernel Hilbert Space from Polynomial Alignment and IMQ Distance},
  year          = {2026},
  eprint        = {2605.03262},
  archivePrefix = {arXiv}
}