Performer
Performers and FAVOR+: positive random features that make softmax attention linear-time while staying an unbiased kernel estimate.
-
Cheap Attention in JAX/Flax NNX
A runnable companion to Cheap Attention: implement positive-feature linear attention in JAX and Flax NNX, watch the all-pairs ledger turn into a shared feature state, and see exactly where the N×N matrix disappears.
-
Cheap Attention: Linear-Time Kernel Approximation
A 128K-token context creates billions of pairwise questions per attention head. But the N×N matrix is not the essence of attention; it is the receipt for an infinite feature map we never wrote down. Approximate that feature map with random features, reassociate the sum, and softmax attention becomes linear-time kernel attention. The whole argument is built from live in-browser visualizations.