Self-Attention

Self-attention as a learned bilinear relation and a Nadaraya–Watson kernel smoother: why Q and K projections matter and how heads become readable.

5 posts tagged #self-attention.

Jun 4, 2026

Q and K Projections in JAX/Flax NNX

A runnable companion to Why Attention Needs Q and K Projections: build scaled dot-product attention with separate query and key projections in Flax NNX, pull the bilinear form B = W_Q W_Kᵀ out of the module, split it into a symmetric metric and an antisymmetric directed part, wire a toy induction head, add RoPE, and measure the low-rank budget and the gauge freedom, all in plain JAX.
Jun 4, 2026

Why Attention Needs Q and K Projections

The dot product in attention is not enough on its own. Without learned query and key projections, attention can only compare tokens in the residual stream’s native geometry. With a shared projection it learns a symmetric metric. With separate Q and K projections, the score becomes a learned bilinear form x_iᵀW_QW_Kᵀx_j: directional, role-aware, low-rank, and different per head. That bilinearity is what lets attention ask one kind of question and lets tokens advertise another kind of answer.
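As a rough illustration of the bilinear form described in this summary, here is a minimal JAX sketch. The sizes `d`, `r`, `n` and the random weights are hypothetical and chosen only for the demonstration; this is not code from the post itself. It checks that projecting and then taking dot products gives the same scores as x_iᵀ B x_j with B = W_Q W_Kᵀ, and that a shared projection makes the scores symmetric.

```python
import jax
import jax.numpy as jnp

# Hypothetical toy sizes: model width d, head width r (r < d makes B low-rank).
d, r, n = 8, 2, 5
k_x, k_q, k_k = jax.random.split(jax.random.PRNGKey(0), 3)

X = jax.random.normal(k_x, (n, d))    # n token vectors from the residual stream
W_Q = jax.random.normal(k_q, (d, r))  # query projection
W_K = jax.random.normal(k_k, (d, r))  # key projection

# Scores computed the usual way: project, then take dot products.
scores = (X @ W_Q) @ (X @ W_K).T

# The same scores written as a bilinear form x_i^T B x_j with B = W_Q W_K^T.
B = W_Q @ W_K.T
scores_bilinear = X @ B @ X.T

# Shared projection: B is symmetric, so score(i, j) equals score(j, i).
B_shared = W_Q @ W_Q.T
scores_shared = X @ B_shared @ X.T

# Separate projections: B splits into a symmetric and an antisymmetric part.
S = 0.5 * (B + B.T)  # symmetric part (metric-like)
A = 0.5 * (B - B.T)  # antisymmetric part (directed)

print("max |scores - bilinear|:", jnp.max(jnp.abs(scores - scores_bilinear)))
print("antisymmetric norm:", jnp.linalg.norm(A))
```

With separate projections the antisymmetric part is generally nonzero, which is what makes the score directional: token i attending to token j need not score the same as j attending to i.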
May 31, 2026

Cheap Attention: Linear-Time Kernel Approximation

A 128K-token context creates billions of pairwise questions per attention head. But the N×N matrix is not the essence of attention; it is the receipt for an infinite feature map we never wrote down. Approximate that feature map with random features, reassociate the sum, and softmax attention becomes linear-time kernel attention. The whole argument is built from live in-browser visualizations.
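The reassociation step can be sketched in a few lines of JAX. This sketch assumes positive random features of the form exp(ωᵀx − ‖x‖²/2)/√m, which is one common choice and may differ from the one used in the post. The sizes, the `phi` helper, and the function names are hypothetical, and the usual 1/√d scaling is assumed to be folded into Q.

```python
import jax
import jax.numpy as jnp

n, d, m = 64, 4, 256  # hypothetical: sequence length, head width, feature count
k_q, k_k, k_v, k_w = jax.random.split(jax.random.PRNGKey(0), 4)
Q = 0.5 * jax.random.normal(k_q, (n, d))
K = 0.5 * jax.random.normal(k_k, (n, d))
V = jax.random.normal(k_v, (n, d))
omega = jax.random.normal(k_w, (m, d))  # random directions, omega ~ N(0, I)

def phi(X):
    # Positive random features: in expectation, phi(q) @ phi(k) = exp(q @ k).
    half_sq_norm = 0.5 * jnp.sum(X**2, axis=-1, keepdims=True)
    return jnp.exp(X @ omega.T - half_sq_norm) / jnp.sqrt(m)

def softmax_attention(Q, K, V):
    # Exact attention: materializes the n x n weight matrix.
    return jax.nn.softmax(Q @ K.T, axis=-1) @ V

def quadratic_feature_attention(Q, K, V):
    # Same random-feature kernel, but still forming the n x n matrix.
    W = phi(Q) @ phi(K).T
    return (W @ V) / jnp.sum(W, axis=-1, keepdims=True)

def linear_feature_attention(Q, K, V):
    # Reassociated: summarize keys and values once, then apply to every query.
    fq, fk = phi(Q), phi(K)
    kv = fk.T @ V        # (m, d) summary, cost O(n * m * d)
    z = fk.sum(axis=0)   # (m,) normalizer summary
    return (fq @ kv) / (fq @ z)[:, None]

out_exact = softmax_attention(Q, K, V)
out_quad = quadratic_feature_attention(Q, K, V)
out_lin = linear_feature_attention(Q, K, V)
print("reassociation gap:", jnp.max(jnp.abs(out_quad - out_lin)))
print("approximation gap:", jnp.max(jnp.abs(out_exact - out_lin)))
```

The two feature-based functions compute the same quantity up to floating-point error; only the order of the matrix products changes. The gap to exact softmax attention depends on the number of features `m` and on the scale of Q and K.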
May 14, 2026

Self-Attention as Kernel Regression in JAX/Flax NNX

A runnable companion to Attention is Explainable Because it is a Kernel: build scaled dot-product attention from scratch in Flax NNX, prove in code that it is exactly a Nadaraya–Watson kernel smoother, watch the separate q/k projections break positive-definiteness numerically, swap the exp-dot-product kernel for Gaussian, Yat, and linear kernels to see which keep the weights a convex partition of unity, read the temperature as a kernel bandwidth, and train a single head end-to-end to route to a marked token.
May 14, 2026

Attention is Explainable Because it is a Kernel

Self-attention in transformers is a Nadaraya–Watson kernel smoother. That fact, and not "we visualize the matrix", is why attention heads are readable while MLPs are not.
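The identity behind this claim can be checked numerically. The following is a minimal sketch with hypothetical sizes and random inputs, not the post's own code: it compares softmax attention with a Nadaraya–Watson estimate that uses the kernel exp(q·k/√d).

```python
import jax
import jax.numpy as jnp

n, d = 6, 4  # hypothetical: number of tokens and head width
k_q, k_k, k_v = jax.random.split(jax.random.PRNGKey(0), 3)
Q = jax.random.normal(k_q, (n, d))
K = jax.random.normal(k_k, (n, d))
V = jax.random.normal(k_v, (n, d))

def attention(Q, K, V):
    # Standard scaled dot-product attention.
    scale = jnp.sqrt(Q.shape[-1])
    return jax.nn.softmax(Q @ K.T / scale, axis=-1) @ V

def nadaraya_watson(Q, K, V):
    # Kernel smoother: each output is a kernel-weighted average of the values.
    scale = jnp.sqrt(Q.shape[-1])
    kernel = jnp.exp(Q @ K.T / scale)                      # k(q_i, k_j) > 0
    weights = kernel / kernel.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

out_attn = attention(Q, K, V)
out_nw, weights = nadaraya_watson(Q, K, V)
print("max difference:", jnp.max(jnp.abs(out_attn - out_nw)))
```

Because every row of `weights` is nonnegative and sums to one, each output is a convex combination of the value vectors, so each weight can be read directly as how much one token contributes to another.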