Latent on the Spectrum: Why Cats Sit Closer to Dogs Than to Cars

June 7, 2026 · 19 min read

#ml #representation-learning #neural-collapse #simplex-etf #latent-space #hierarchy #label-structure #contrastive-learning #kernel-methods #spectral-embedding #taxonomy

Part 6 of 8 · Geometry of Representations

1. Activations Are Bad for Geometry
2. Opposite Is Not Different: The Cosine-Similarity Bug in CLIP and Contrastive Learning
3. Not All Infinities Are Equal: The Cross-Entropy Asymmetry Behind Hallucination
4. Untangling the Moons: A Visual History of Contrastive Learning
5. What Makes a Good Latent Space? The Welch Bound and the Simplex
6. Latent on the Spectrum: Why Cats Sit Closer to Dogs Than to Cars (you are here)
7. The Three States of Information
8. Distillation Is a Geometry, Not an Answer Key

Runnable JAX companion: Latent on the Spectrum, in JAX. Prefer to read the code? This post has a hands-on JAX / Flax NNX implementation.

A follow-up to What Makes a Good Latent Space?. There the answer was the regular simplex, every class equally far from every other. That answer is exactly right, and quietly wrong. It is right for classes that are strangers, and real classes are not strangers. The fix turns out to be spectral, and it’s more beautiful than the simplex.

The Welch-bound post ended on a clean note: a good latent space packs its class codes as far apart as geometry allows, and for $C$ classes that fit the dimension ($C \le d+1$) the optimum is the regular simplex: every pair of codes at the same angle, with cosine $-1/(C-1)$. Every class gets the same deal.

“The same deal” should make you suspicious. It is fair only if the classes are equivalent, if the only thing you know is that they differ. Take three real labels:

cat, dog, car.

The simplex demands $\langle\text{cat},\text{dog}\rangle = \langle\text{cat},\text{car}\rangle = \langle\text{dog},\text{car}\rangle$ , all three pairs at one angle. But a cat and a dog are both animals; a car is not. Any space that puts cat exactly as far from dog as from car has thrown away something true. The simplex is the perfect codebook for strangers, and these classes are not strangers.

A codebook is an encoding of a similarity

So what is the right target, if “equally far apart” is wrong? Stop thinking of the codebook as “points to spread out” and start thinking of it as an encoding of a similarity matrix. You have a desired similarity between every pair of classes, $S^\star_{ij}$ , high for cat–dog, low for cat–car. The codebook is the set of unit vectors whose Gram matrix (their pairwise cosines) best matches $S^\star$ :

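$$\min_{\;\lVert\mathbf{e}_i\rVert=1}\ \big\lVert\, \mathbf{E}^\top\mathbf{E} - S^\star \,\big\rVert_F^2 .$$

Here $\mathbf{E}$ is the matrix whose columns are the codes $\mathbf{e}_i$, so $\mathbf{E}^\top\mathbf{E}$ is their Gram matrix.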

The Welch simplex is one special case of this: the case $S^\star =$ “every off-diagonal equal,” the structureless target. Up to centering and scale, that label kernel is the identity: no class resembles any other. Feed in a different $S^\star$ and the codebook follows it. This is the old, exactly-right idea of kernel-target alignment (Cristianini et al., 2002): the ideal embedding kernel is the label kernel.
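To make the objective concrete, here is a minimal NumPy sketch (not the companion's JAX code) that scores two hand-placed codebooks against two targets. The values in `S_struct` are made up for illustration, `gram_mismatch` is a hypothetical helper name, and the structured codebook is placed by hand rather than optimised.

```python
import numpy as np

def gram_mismatch(E, S):
    """Squared Frobenius distance between the codebook's Gram matrix and the target."""
    return float(np.sum((E.T @ E - S) ** 2))

# Columns are unit codes for (cat, dog, car) in the plane.
angles_simplex = np.deg2rad([90.0, 210.0, 330.0])       # regular simplex: 120 degrees apart
E_simplex = np.stack([np.cos(angles_simplex), np.sin(angles_simplex)])

half = 0.5 * np.arccos(0.5)                             # cat and dog at +/- half, car opposite them
angles_struct = np.array([half, -half, np.pi])
E_struct = np.stack([np.cos(angles_struct), np.sin(angles_struct)])

S_flat = np.full((3, 3), -0.5) + 1.5 * np.eye(3)        # every off-diagonal equal to -1/(C-1)
S_struct = np.array([[1.0, 0.5, -0.6],                  # hypothetical target:
                     [0.5, 1.0, -0.6],                  # cat resembles dog,
                     [-0.6, -0.6, 1.0]])                # both are far from car

loss_flat_simplex = gram_mismatch(E_simplex, S_flat)
loss_struct_simplex = gram_mismatch(E_simplex, S_struct)
loss_struct_struct = gram_mismatch(E_struct, S_struct)
print(loss_flat_simplex, loss_struct_simplex, loss_struct_struct)
```

Against the flat target the simplex scores zero (up to rounding). Against the structured target it scores 2.04, while the hand-placed structured codebook scores about 0.28: the same objective, a different winner.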

Steer it. Below, nine codes descend toward a target similarity you control. With structure at 0 the target is flat and they settle into the even simplex, equidistant strangers. Turn it up and the codes split into superclasses: the within-group pull keeps cat and dog close while the across-group push sends car to the far side. The middle panel is the kernel you’re dialing; ignore the third panel for one more paragraph.

The spectral law: the codebook is the top modes of the kernel

That third panel is the whole post. A kernel’s eigenspectrum tells you everything about how it wants to be embedded:

A flat spectrum on the centered label subspace, all non-mean modes equal, is a kernel with no preferred direction among classes. (A regular-simplex Gram, once centered, has one zero mean-direction eigenvalue and $C-1$ equal nonzero ones, so “flat” means flat across those $C-1$ class modes, not all $C$ .) Every class mode matters the same. That is the white kernel, and its embedding is the maximally even one: the simplex (a tight frame is exactly the flat-spectrum condition from the Welch post). Slide structure to 0 above and watch all the bars snap to the same height.
A peaked spectrum, a few large eigenvalues, the rest small, is a kernel with dominant structure. The top mode is the coarsest distinction (animals vs vehicles); the next modes are finer. Slide structure up and watch one bar shoot up.
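The two spectra can be checked directly. The sketch below assumes a toy block kernel (nine classes in three superclasses, similarity `structure` inside a superclass and 0 across); `label_kernel` and `centered_spectrum` are hypothetical helper names, not functions from the companion.

```python
import numpy as np

def label_kernel(n_groups=3, per_group=3, structure=0.0):
    """Toy target: 1 on the diagonal, `structure` within a superclass, 0 across."""
    group = np.repeat(np.arange(n_groups), per_group)
    same = (group[:, None] == group[None, :]).astype(float)
    K = structure * same
    np.fill_diagonal(K, 1.0)
    return K

def centered_spectrum(K):
    """Eigenvalues of the double-centered kernel H K H, largest first."""
    C = K.shape[0]
    H = np.eye(C) - np.ones((C, C)) / C          # removes the mean direction
    return np.linalg.eigvalsh(H @ K @ H)[::-1]

flat = centered_spectrum(label_kernel(structure=0.0))
peaked = centered_spectrum(label_kernel(structure=0.8))
print(np.round(flat, 3))
print(np.round(peaked, 3))
```

With structure at 0 the eight class modes all sit at 1 and the mean mode at 0. With structure at 0.8 the two between-superclass modes rise to 2.6 and the six within-superclass modes drop to 0.2. With three superclasses there are two dominant bars; a two-way split such as animals vs vehicles would give one.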

The simplex was never the universal answer. It was the answer for the one spectrum that has no shape.

And the shape is not a metaphor. Diagonalise the kernel, keep its top two modes, and look: the codebook has a geometry you can see, and it changes with the spectrum. Independent labels give the even spread of the simplex. A taxonomy gives tight clusters, a simplex of simplexes. But the most telling case is a graded label kernel, where similarity falls off smoothly with how related two classes are (cat–dog–wolf–…–trout, each close to its neighbours, far from the ends). Embed that and the classes neither scatter nor clump. They fall onto a curved arch, the famous horseshoe of multidimensional scaling (Diaconis, Goel & Holmes, 2008): the first eigenmode recovers the latent ordering, the second bends it into a curve. The classes literally lie on a spectrum, a one-dimensional manifold curled into two dimensions. “Latent on the spectrum” is not a figure of speech; it is what the eigenvectors do.
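A small sketch of the arch, under one specific assumption: the graded kernel here falls off linearly with the rank gap between classes, which is only one of many possible graded kernels. For that choice the centered kernel is proportional to the pseudo-inverse of a path-graph Laplacian, so its leading modes are cosines of increasing frequency.

```python
import numpy as np

n = 12                                               # classes along a hypothetical ordering
idx = np.arange(n)
S = 1.0 - np.abs(idx[:, None] - idx[None, :]) / (n - 1)   # similarity decays linearly with rank gap

H = np.eye(n) - np.ones((n, n)) / n                  # centering projector
vals, vecs = np.linalg.eigh(H @ S @ H)               # eigenvalues in ascending order
top = np.argsort(vals)[::-1][:2]                     # indices of the two strongest modes
coords = vecs[:, top] * np.sqrt(vals[top])           # rows: classes, columns: 2-D coordinates

first, second = coords[:, 0], coords[:, 1]
print(np.round(coords, 3))
```

The first coordinate is monotone in the class index, so it recovers the ordering (possibly reversed, since eigenvector signs are arbitrary). The second is symmetric about the middle of the ordering, which is what bends the line into an arch.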

Why is “keep the strongest modes” exactly right, and not just a heuristic? Ignore the unit-norm renormalisation for a moment and write the eigendecomposition of a positive semidefinite target as $S^\star = U\Lambda U^\top$. The best rank-$d$ approximation to that Gram matrix is $U_d\Lambda_d U_d^\top$, and a coordinate matrix that realises it is $\Lambda_d^{1/2}U_d^\top$: the top-$d$ eigenvectors of $S^\star$, scaled by the square roots of their eigenvalues. If we want cosine codes we normalise those vectors afterward, which makes the exact Eckart–Young optimum a design principle rather than the literal constrained minimiser. This is classical multidimensional scaling, the same spectral fact behind kernel PCA and Laplacian eigenmaps: to embed a similarity, diagonalise it and keep the strongest modes.
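The recipe fits in a few lines. This is a sketch on a made-up three-class target (the numbers in `S` are illustrative, and `top_d_codes` is a hypothetical helper name); it assumes the target is positive semidefinite, which this one is.

```python
import numpy as np

def top_d_codes(S, d):
    """Return the (d, C) matrix Lambda_d^{1/2} U_d^T and all eigenvalues, largest first."""
    vals, vecs = np.linalg.eigh(S)                   # ascending order
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    E = np.sqrt(np.clip(vals[:d], 0.0, None))[:, None] * vecs[:, :d].T
    return E, vals

S = np.array([[1.0, 0.5, -0.6],                      # hypothetical cat/dog/car target
              [0.5, 1.0, -0.6],
              [-0.6, -0.6, 1.0]])
E, vals = top_d_codes(S, d=2)

rank_d_error = np.linalg.norm(E.T @ E - S)           # Frobenius norm of what was discarded
codes = E / np.linalg.norm(E, axis=0, keepdims=True) # renormalise columns to get cosine codes
cosines = codes.T @ codes
print(np.round(vals, 4), round(float(rank_d_error), 4))
print(np.round(cosines, 3))
```

With one mode dropped, the reconstruction error equals the single discarded eigenvalue (roughly 0.37 here), as Eckart–Young says it should. After renormalising, cat still sits closer to dog than to car, though the cosines no longer match the target exactly.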

Your dimension budget buys eigenmodes

If the codebook is the top- $d$ eigenmodes, then the embedding dimension is a budget: $d$ dimensions buy you the $d$ strongest relationships in the label kernel, and the rest are discarded. A latent space is a lossy compression of a similarity.

This reframes the over-complete regime of the Welch post. Watch the budget at work: a structured (hierarchical) kernel concentrates almost everything into a couple of modes, so even one or two dimensions reconstruct it well. The flat simplex kernel has no dominant mode, every eigenvalue is equal, so there is nothing to compress, and you need the full $C-1$ dimensions to represent it. Structure is what makes a latent space compressible.

The residual makes the compression literal. Keeping $d$ modes does not make a smaller picture of the same kernel; it leaves a specific error pattern behind. If the discarded modes were fine-grained within-family distinctions, the residual lights up inside families. If the kernel is flat, every omitted mode matters equally.
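A sketch of the budget in numbers, assuming the same kind of toy kernel as before (nine classes, three hypothetical families, within-family similarity 0.8). `captured_fraction` is a made-up helper that reports how much of the centered kernel's squared Frobenius norm the top $d$ modes keep.

```python
import numpy as np

def captured_fraction(K, d):
    """Share of the centered kernel's squared Frobenius norm kept by its top-d modes."""
    C = K.shape[0]
    H = np.eye(C) - np.ones((C, C)) / C
    vals = np.clip(np.linalg.eigvalsh(H @ K @ H)[::-1], 0.0, None)
    return float(np.sum(vals[:d] ** 2) / np.sum(vals ** 2))

group = np.repeat(np.arange(3), 3)                       # 9 classes in 3 hypothetical families
same = (group[:, None] == group[None, :]).astype(float)
K_tree = 0.2 * np.eye(9) + 0.8 * same                    # 1 on the diagonal, 0.8 within a family
K_flat = np.eye(9)                                       # strangers: no class resembles another

tree_curve = [captured_fraction(K_tree, d) for d in range(1, 9)]
flat_curve = [captured_fraction(K_flat, d) for d in range(1, 9)]
print(np.round(tree_curve, 3))
print(np.round(flat_curve, 3))
```

Two dimensions already keep about 98% of the structured kernel, and everything they miss is within-family detail. The flat kernel gives back exactly $d/8$: each dimension buys one eighth, and nothing comes cheaper.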

Where does the structure even come from? The data.

You might still object: maybe in practice labels really are independent, and the simplex is what you get. It isn’t, and the reason is that the structure doesn’t come from the labels at all. It comes from the data.

Cats and dogs look alike. Their images occupy overlapping regions of feature space long before any label is attached. So a representation feels two forces at once: the data pulls perceptually similar classes together, and the supervised loss pushes every class apart to make them classifiable. The geometry you end up with is the negotiation between them, and crucially, the one-hot label only ever supplies the second force. Left to itself it whitens the kernel toward the simplex.

This is exactly what Müller, Kornblith & Hinton (2019) found for label smoothing: nudging targets toward uniform tightens each class and “erases information in the logits about resemblances between classes.” Smoothing turns up the simplex force and grinds the data’s structure away. It is also why dark knowledge works (Hinton, Vinyals & Dean, 2015): a teacher’s soft outputs, “mostly dog, a little cat, never car”, are the surviving structure that the hard label tried to delete.
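The difference between the three kinds of target is easy to see side by side. The logits below are invented for illustration, and the smoothing rate 0.1 and temperature 4 are arbitrary choices; the temperature softmax follows the distillation recipe of Hinton et al. (2015).

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax at a given temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

teacher_logits = np.array([2.0, 6.0, -4.0])       # hypothetical logits for (cat, dog, car) on a dog image
hard = np.eye(3)[np.argmax(teacher_logits)]       # one-hot label: no resemblance information
smoothed = 0.9 * hard + 0.1 / 3                   # label smoothing with epsilon = 0.1
soft = softmax(teacher_logits, temperature=4.0)   # softened teacher output

print(np.round(smoothed, 3))
print(np.round(soft, 3))
```

The smoothed target gives cat and car exactly the same mass, so it says nothing about which wrong answer is nearer. The teacher's soft output ranks dog, then cat, then car, which is the structure the student inherits.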

Turn the supervision up below and watch the codebook tear itself off the data:

So far the spectrum belonged to a target label kernel. Now the object changes: we ask what spectrum the trained features keep after the data force and supervision force have negotiated.

The spectrum of information: the prototypes are only the top

But where do the features live? Everything so far has been about the prototypes, the $C$ class codes, and a real feature isn’t a prototype; it’s a point near one, leaning toward its neighbours. So the codebook is only part of the story. The rest, the part that actually carries information, is the spectrum below the prototypes.

Make it exact. Decompose the feature covariance the way Fisher and neural collapse both do:

$$\Sigma_{\text{total}} = \underbrace{\Sigma_B}_{\text{between prototypes}} \;+\; \underbrace{\Sigma_W}_{\text{within / residual}} .$$

$\Sigma_B$ has rank $\le C-1$ and is spanned by the centered prototypes; its geometry is the codebook, whether that is the simplex, a Welch frame, or the structured kernel we’ve been steering. $\Sigma_W$ is everything else: the within-class variation, the fine distinctions, the gradations between classes. Stack them and a representation’s eigenspectrum splits into two regimes: the top $C-1$ modes are the prototype frame (the separation channel), and the tail is the information. The split is clean only when the between-class variance dominates, as it does near neural collapse; when within-class variance is large the two regimes overlap and “top $C-1$” is an approximation, not a hard cut.
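The decomposition is an exact identity on any labelled feature set, which a short check on synthetic data confirms. Everything here is assumed for illustration: three classes, random prototypes, isotropic noise of scale 0.3.

```python
import numpy as np

rng = np.random.default_rng(0)
C, n_per, d = 3, 200, 5
means = rng.normal(size=(C, d))                              # hypothetical class prototypes
labels = np.repeat(np.arange(C), n_per)
X = means[labels] + 0.3 * rng.normal(size=(C * n_per, d))    # synthetic features around them

mu = X.mean(axis=0)
class_means = np.stack([X[labels == c].mean(axis=0) for c in range(C)])

Xc = X - mu
sigma_total = Xc.T @ Xc / len(X)
B = class_means[labels] - mu                                 # each sample replaced by its class mean
sigma_b = B.T @ B / len(X)
W = X - class_means[labels]                                  # residual around the class mean
sigma_w = W.T @ W / len(X)

rank_b = np.linalg.matrix_rank(sigma_b, tol=1e-10)           # explicit tolerance to ignore rounding
print(rank_b, np.round(np.linalg.eigvalsh(sigma_total)[::-1], 3))
```

The between-class part has rank at most $C-1 = 2$ even though the features live in five dimensions. Shrinking the noise scale toward zero is a crude stand-in for collapse: the tail of the total spectrum goes with it.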

Press Play and watch what training does to it. As the network collapses, $\Sigma_W \to 0$, the celebrated neural collapse (Papyan, Han & Donoho, 2020), and the information tail is ground to nothing. The clusters get tighter, the prototypes lock into their frame, and the spectrum the features lived in is erased. Drag label structure to shape the prototype modes: flat for a simplex, peaked for a taxonomy.

Call the trade-off separate or represent: a space can spend its spectrum pushing classes apart, or keep it to describe what the classes contain, and the tension is readable off a single picture: how much spectrum you leave standing below the prototypes. Pure separation, a fully collapsed simplex, is a representation that has thrown its information away.

Information is a coordinate between prototypes

So where, exactly, does that information live? Between the prototypes. Project a feature onto the prototype frame and you get its soft assignment, how much it leans toward each code. A feature between cat and dog that leans toward cat reads “70% cat, 30% dog, 0% car.” That vector is not noise around the label; it is the information, and it is exactly the dark knowledge a teacher distills (Hinton et al., 2015). In this frame view, the information shows up as the coefficients: the soft mixture of nearby prototypes. (When the prototypes are overcomplete or non-orthogonal, those coefficients are pinned down only once you fix a decoding rule, a dual frame, or a logits/softmax convention.)
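Here is a sketch of two decoding rules applied to the same feature. The prototypes are hand-picked unit vectors (cat and dog at cosine 0.6, car orthogonal to both), and the softmax temperature 0.25 is an arbitrary assumption; neither comes from the post's demo.

```python
import numpy as np

P = np.array([[1.0, 0.0, 0.0],      # cat
              [0.6, 0.8, 0.0],      # dog: cosine 0.6 with cat
              [0.0, 0.0, 1.0]])     # car: orthogonal to both

def softmax_assignment(z, P, temperature=0.25):
    """Softmax over cosine similarities to each prototype (one possible decoding rule)."""
    logits = P @ (z / np.linalg.norm(z)) / temperature
    logits = logits - logits.max()
    p = np.exp(logits)
    return p / p.sum()

feature = 0.7 * P[0] + 0.3 * P[1]             # leans toward cat, a little dog, no car

coeffs = np.linalg.solve(P.T, feature)        # linear rule: exact since the prototypes form a basis
weights = softmax_assignment(feature, P)      # softmax rule: a different convention, same ordering
print(np.round(coeffs, 3), np.round(weights, 3))
```

The linear rule recovers the mixture (0.7, 0.3, 0) exactly because these three prototypes form a basis. The softmax rule gives different numbers with the same ranking, which is the sense in which the coefficients depend on the decoding convention.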

Watch a feature wander the frame and read its coordinates:

Colour is useful, but the actual object is a coefficient vector. Read it as logits, soft labels, or frame coefficients, the point is the same: the information is the mixture.

And here is where Welch closes the loop. The prototypes pack on the sphere subject to the Welch bound, and that floor on their coherence is what fixes the basis vectors of this between-space. Two regimes, both from the Welch-bound post:

Under-complete ( $C \le d+1$ ): the simplex spends exactly $C-1$ dimensions on prototypes and leaves $d-(C-1)$ dimensions free for information, a clean split into a label channel and an information channel.
Over-complete ( $C > d$ ): the prototypes now span the whole space, so every feature is a frame expansion $\mathbf{z} = \sum_c a_c\,\boldsymbol{\mu}_c$ , and the coefficients $a_c$ are the information. The Welch bound on the prototypes is the floor on how well-conditioned that expansion is: the lower the coherence, the more stably the features can be written in the codebook.
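The smallest over-complete case makes the frame expansion tangible: three unit prototypes in the plane ($C=3$, $d=2$). The sketch assumes the canonical dual frame as the decoding rule (minimum-norm coefficients via the pseudo-inverse); the feature `z` is arbitrary.

```python
import numpy as np

C, d = 3, 2
angles = 2 * np.pi * np.arange(C) / C + np.pi / 2
M = np.stack([np.cos(angles), np.sin(angles)])        # (d, C): three unit prototypes, 120 degrees apart

gram = M.T @ M
coherence = np.max(np.abs(gram - np.eye(C)))          # largest off-diagonal |cosine|
welch = np.sqrt((C - d) / (d * (C - 1)))              # Welch lower bound on the coherence

z = np.array([0.3, -1.1])                             # an arbitrary feature
a = np.linalg.pinv(M) @ z                             # minimum-norm coefficients (canonical dual frame)
z_rebuilt = M @ a
print(coherence, welch, np.round(a, 3))
```

This frame meets the Welch bound with equality (coherence 0.5), and because it is tight the coefficients reduce to $(d/C)\,M^\top\mathbf{z}$: a rescaled set of inner products with the prototypes. A less evenly spread frame would still reconstruct $\mathbf{z}$, but with a worse-conditioned pseudo-inverse.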

So Welch was never only about packing the labels. It sets the geometry of the channel the information rides on. The prototypes are the top of the spectrum; the features live in the rest; and the bound governs both.

Hierarchy is a spectral cascade, and the field already builds it

What does a real taxonomy, animals then mammals then breeds, ask of the spectrum? Nothing new. A hierarchy is just a particular spectrum: the coarsest split is the top eigenmode, the next level the next modes, and so on, a cascade from general to specific. The codebook that matches it is a simplex of simplexes: superclasses arranged as a simplex, each refined into a smaller one.

This is not hypothetical. Hierarchy-aware frames (Liang et al., ICCV 2023) replace the fixed simplex-ETF classifier with a target geometry where related classes sit closer, and show a network collapsing to that frame makes less severe mistakes. In the crowded regime, vocabularies, retrieval, faces, where there are far more classes than dimensions, generalized neural collapse (Jiang et al., ICML 2024) proves the class means form a “softmax code” maximising the minimum one-vs-rest margin, an object tied directly to the Tammes/Welch packing from the previous post. The packing carries over; structure bends it.

One geometric caveat worth a sentence: deep trees don’t fit in flat Euclidean space. A tree’s leaves grow exponentially with depth while Euclidean volume grows polynomially, so something has to give. Hyperbolic space, negative curvature, exponential volume, fits them naturally; Nickel & Kiela (2017) showed Poincaré embeddings capture hierarchy and similarity together at a fraction of the dimensions. If your label kernel is a deep taxonomy, the right codebook may not live on a sphere at all.
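For a feel of why the Poincaré ball suits trees, here is the standard distance formula for the unit ball evaluated on three hand-picked points: a root at the origin and two leaves near the rim. The points are illustrative and `poincare_distance` is a hypothetical helper name, not the API of any library.

```python
import numpy as np

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    gap = np.sum((u - v) ** 2)
    scale = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * gap / scale))

root = np.zeros(2)
leaf_a = np.array([0.95, 0.0])                        # two leaves near the rim, 90 degrees apart
leaf_b = np.array([0.0, 0.95])

d_root = poincare_distance(root, leaf_a)              # root to leaf
d_leaves = poincare_distance(leaf_a, leaf_b)          # leaf to leaf
euclid_ratio = np.linalg.norm(leaf_a - leaf_b) / (2 * 0.95)
print(d_root, d_leaves, d_leaves / (2 * d_root), euclid_ratio)
```

In the ball, the leaf-to-leaf distance is about 90% of the detour through the root, compared with about 71% for the same points measured Euclideanly. Distances near the rim behave more like path lengths in a tree, which is the property hierarchical embeddings exploit.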

Why it’s worth the trouble: better mistakes

The clean payoff. A structured codebook fails toward the right neighbours. Here is the same nearest-code classifier under the same feature noise, run two ways, once on a simplex, once on a structured codebook:

There is a real trade-off here: the simplex often wins raw top-1 accuracy, because maximal margin is maximal margin. But its mistakes are catastrophic (cat→car), because it has no notion of a nearer wrong answer. This is the whole point of Bertinetto et al.’s “Making Better Mistakes” (CVPR 2020): a hierarchy-aware geometry turns absurd errors into forgivable ones. And the same structure is what a retrieval index, a transfer head, or a few-shot learner leans on: the cat-is-like-dog prior that makes downstream learning cheap. It is also why you should expect a trained vision-language space to put cat near dog and far from car: the encoder learned from data in which cats and dogs look alike and share contexts, so that structure survives into the embedding. Such a space has not failed to reach the simplex. It is correctly encoding a label spectrum.
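The comparison can be reproduced in a small simulation. This is a sketch under assumed settings, not the post's demo: nine classes in three hypothetical families, a structured codebook with cosine 0.8 within a family and -0.4 across, isotropic Gaussian noise of scale 0.4, and 20,000 samples per run.

```python
import numpy as np

def codes_from_gram(K):
    """Rows are codes whose Gram matrix is K (assumes K is positive semidefinite)."""
    vals, vecs = np.linalg.eigh(K)
    return vecs * np.sqrt(np.clip(vals, 0.0, None))

def evaluate(codes, family, sigma, n, rng):
    """Nearest-code classification under noise: accuracy, and share of errors that stay in-family."""
    y = rng.integers(0, len(codes), size=n)
    z = codes[y] + sigma * rng.normal(size=(n, codes.shape[1]))
    pred = np.argmax(z @ codes.T, axis=1)        # equal-norm codes: nearest code = largest dot product
    wrong = pred != y
    mild = family[pred[wrong]] == family[y[wrong]]
    return float(np.mean(~wrong)), float(np.mean(mild))

C = 9
family = np.repeat(np.arange(3), 3)              # three hypothetical superclasses of three
same = (family[:, None] == family[None, :]).astype(float)
K_simplex = (C / (C - 1)) * (np.eye(C) - np.ones((C, C)) / C)       # cosine -1/8 everywhere
K_struct = 0.2 * np.eye(C) + 1.2 * same - 0.4 * np.ones((C, C))     # cosine 0.8 within, -0.4 across

rng = np.random.default_rng(0)
acc_s, mild_s = evaluate(codes_from_gram(K_simplex), family, 0.4, 20000, rng)
acc_h, mild_h = evaluate(codes_from_gram(K_struct), family, 0.4, 20000, rng)
print(acc_s, mild_s)
print(acc_h, mild_h)
```

On the simplex, a wrong answer is equally likely to be any of the other eight classes, so only about a quarter of errors land inside the right family. On the structured codebook most errors stay in-family, at the cost of lower top-1 accuracy under the same noise.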

The caveats

A reframing, not a free lunch, and it carries choices:

Where does $S^\star$ come from? A taxonomy, soft labels from a teacher, co-occurrence counts, the text embedding of the class names, each is a modelling decision, and a wrong similarity is worse than none.
It’s a trade-off, not a strict win. Maximal margin (the simplex) versus fidelity to structure, the structure slider is literally that dial, and you saw it cost top-1 accuracy. Leave it at zero and you’re back to strangers; push it too far and you sacrifice separation.
This is the inter-class story. How class codes relate to one another is a different axis from the within-class manifold. Both matter; we touched only the first.

The picture, completed

Two posts, one picture. The Welch-bound post asked how to pack the prototypes and answered with the simplex and its over-complete frame. This post gave that packing structure: match the label kernel, and the simplex is just its flat-spectrum corner. And the final act says the prototypes are only the top of the spectrum. The features live in the modes below them, the continuum between the codes where the information rides, and the Welch bound that fixed the packing also fixes the geometry of that channel.

A latent space is a spectrum. The top is the label frame, prototypes packed as far apart as the Welch bound allows, shaped by whatever structure the label kernel carries. The bottom is the information, the residual the features spend between prototypes, the dark knowledge, the fine distinctions. A good latent space packs the top tightly and leaves the bottom standing; collapse the bottom and you have separation with nothing left to say. The regular simplex is the special, beautiful, structureless top, the codebook for classes that have nothing to do with each other.

Cats end up closer to dogs than to cars. And between them is where the information lives.

Cite as

Bouhsine, T. (2026, June 7). Latent on the Spectrum: Why Cats Sit Closer to Dogs Than to Cars. Records of the !mmortal Data Scientist. https://tahabouhsine.com/blog/latent-on-the-spectrum/

BibTeX

@misc{bouhsine2026latentonthespectrum,
  author       = {Bouhsine, Taha},
  title        = {Latent on the Spectrum: Why Cats Sit Closer to Dogs Than to Cars},
  year         = {2026},
  month        = {jun},
  howpublished = {\url{https://tahabouhsine.com/blog/latent-on-the-spectrum/}},
  note         = {Blog post, Records of the !mmortal Data Scientist}
}

References

Welch, L. R. (1974). Lower Bounds on the Maximum Cross Correlation of Signals. IEEE Transactions on Information Theory 20(3), 397–399. doi:10.1109/TIT.1974.1055219
Cristianini, N., Shawe-Taylor, J., Elisseeff, A., Kandola, J. (2002). On Kernel-Target Alignment. NIPS 2001.
Diaconis, P., Goel, S., Holmes, S. (2008). Horseshoes in Multidimensional Scaling and Local Kernel Methods. Annals of Applied Statistics 2(3), 777–807. arXiv:0811.1477
Hinton, G., Vinyals, O., Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531
Nickel, M., Kiela, D. (2017). Poincaré Embeddings for Learning Hierarchical Representations. NeurIPS 2017. arXiv:1705.08039
Müller, R., Kornblith, S., Hinton, G. (2019). When Does Label Smoothing Help? NeurIPS 2019. arXiv:1906.02629
Papyan, V., Han, X. Y., Donoho, D. L. (2020). Prevalence of Neural Collapse During the Terminal Phase of Deep Learning Training. Proceedings of the National Academy of Sciences 117(40). arXiv:2008.08186
Bertinetto, L., Mueller, R., Tertikas, K., Samangooei, S., Lord, N. A. (2020). Making Better Mistakes: Leveraging Class Hierarchies with Deep Networks. CVPR 2020. arXiv:1912.09393
Liang, T., et al. (2023). Inducing Neural Collapse to a Fixed Hierarchy-Aware Frame for Reducing Mistake Severity. ICCV 2023. arXiv:2303.05689
Jiang, J., Zhou, J., Wang, P., Qu, Q., Mixon, D., You, C., Zhu, Z. (2024). Generalized Neural Collapse for a Large Number of Classes. ICML 2024. arXiv:2310.05351
