The Three States of Information

June 7, 2026 · 10 min read

#ml #training-dynamics #representation-learning #neural-collapse #loss-landscape #phase-transitions #contrastive #simplex #information

Part 7 of 8 · Geometry of Representations

Runnable JAX companion: The Three States of Information, in JAX. Prefer to read the code? This post has a hands-on JAX / Flax NNX implementation.

Watch a network train for long enough and the loss curve stops looking like a smooth slide. It looks like a staircase: long flat stretches where nothing seems to happen, broken by sudden drops. The usual reaction is “it’s stuck.” It is not stuck. It is reorganizing.

The cleanest way to see why is to stop watching the loss and watch the representation, the geometry of the activations the network is shaping. That geometry moves through three states, and the move from one to the next is exactly the plateau.

Three states, like matter

A pot of steam, a glass of water, a cube of ice: the same molecules, three degrees of order. Training runs that movie in reverse for a representation, from disorder to order, and the analogy is older than it looks: the statistical mechanics of learning has long read training as an ordering process, with order parameters and a falling representation entropy. None of the three states below is a new phenomenon; the contribution here is to name them cleanly and let you watch the transitions.

What is the thing that freezes? By “information” here I mean label-relevant structure in representation space: not Shannon information in the abstract, but the geometry that makes classes recoverable.

Random: the gas. At initialization the representation is high-entropy and isotropic: points spread with no relation to their class, pairwise similarities look like noise. There is information in the labels, but none of it is in the geometry yet.

Organized: the liquid. Partway through training, same-class points pull together into clusters. There is now local order (you can tell which points belong together), but the clusters themselves are not yet arranged. They can sit at arbitrary, even overlapping, angles. The representation has organized within classes without organizing between them. This “low-order structure first” is the distributional simplicity bias: networks fit the mean and covariance of the data before its higher-order correlations, a bias documented from small CNNs to ImageNet models and LLMs (Refinetti et al., 2023; Belrose et al., 2024).

Structured: the crystal. At convergence the clusters settle into a globally ordered arrangement: maximally separated, equiangular, a simplex. This is the neural collapse endpoint the Welch-bound post built up in detail (Papyan, Han & Donoho, 2020). Global order, minimal redundancy.

The correspondence is not decoration; each state has an order parameter you can measure on a real run:

state	phase	order parameter
random	gas	within-class variance ≈ total variance; class-mean cosines look like noise
organized	liquid	within-class variance small (local order); class means still bunched
structured	crystal	class-mean cosines lock onto $-1/(C-1)$, the simplex (global order)
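
Where does the value $-1/(C-1)$ come from? A short derivation, assuming the $C$ class-mean directions are normalized to unit vectors $\mu_1, \dots, \mu_C$ (my notation, introduced only for this sketch) and share a single pairwise cosine $c$:

$$
0 \;\le\; \Big\| \sum_{i=1}^{C} \mu_i \Big\|^2
= \sum_{i=1}^{C} \|\mu_i\|^2 + \sum_{i \neq j} \mu_i^\top \mu_j
= C + C(C-1)\,c
\quad\Longrightarrow\quad
c \;\ge\; -\frac{1}{C-1}.
$$

Equality holds exactly when the unit vectors sum to zero, which is the simplex arrangement. So $-1/(C-1)$ is the most negative cosine that $C$ equiangular directions can share: $-1/3$ for $C = 4$, and $-1/9 \approx -0.11$ for $C = 10$.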

The figure below is a supervised contrastive encoder trained live in your browser. Press play and watch the representation crystallize, clusters gathering and then spreading into a simplex, while a structure metric climbs alongside the loss. (A random projection already preserves some of the structure of the input blobs, so depending on the random initialization the run may open partway into the story.)

That two-step mirrors the alignment and uniformity decomposition of contrastive learning (Wang & Isola, 2020): alignment builds the local order of the organized state, uniformity the global order of the structured one. But the decomposition names two properties, not a schedule. In the run above alignment wins the race, so the states arrive in their canonical gas-liquid-crystal order: clusters first, then the slow symmetry-breaking into the simplex. Nothing forces that order. The companion trains the same two forces under strong augmentations and watches the race flip: the global spread resolves quickly, but the same spreading force flings positive pairs apart, so local order arrives last. The states are geometries, not a fixed itinerary; what every route shares is the endpoint, uniform and aligned at once.
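
To make the two forces concrete, here is a minimal NumPy sketch of the alignment and uniformity losses in the form Wang & Isola give them, evaluated on three synthetic point clouds on the unit sphere. Everything about the toy data is my assumption, not the post's figure: the helper names (`unit`, `same_class_partner`), the sizes `C`, `n`, `d`, the noise level, and the choice to treat same-class points as positive pairs.

```python
import numpy as np

def align_loss(x, y, alpha=2.0):
    """Mean distance**alpha between positive pairs (x[i], y[i]); lower = tighter."""
    return float(np.mean(np.linalg.norm(x - y, axis=1) ** alpha))

def uniform_loss(x, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs; lower = more spread."""
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(x), k=1)  # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))

def unit(v):
    """Project rows onto the unit sphere."""
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def same_class_partner(labels):
    """For each point, the index of a different point with the same label."""
    partner = np.arange(len(labels))
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        partner[members] = np.roll(members, 1)
    return partner

rng = np.random.default_rng(0)
C, n, d = 4, 50, 8  # classes, points per class, embedding dimension (toy values)
labels = np.repeat(np.arange(C), n)
partner = same_class_partner(labels)

# Simplex directions (pairwise cosine -1/(C-1)), embedded in the first C coordinates.
simplex = unit(np.pad(np.eye(C) - 1.0 / C, ((0, 0), (0, d - C))))
# "Bunched" directions: the same simplex, squeezed around one shared axis.
bunched = unit(np.eye(d)[-1] + 0.3 * simplex)
noise = 0.05 * rng.normal(size=(C * n, d))

states = {
    "gas": unit(rng.normal(size=(C * n, d))),   # geometry ignores the labels
    "liquid": unit(bunched[labels] + noise),    # tight clusters, not yet arranged
    "crystal": unit(simplex[labels] + noise),   # tight clusters on a simplex
}
report = {name: (align_loss(z, z[partner]), uniform_loss(z)) for name, z in states.items()}
for name, (a, u) in report.items():
    print(f"{name:8s} align={a:.3f}  uniform={u:.3f}")
```

On this toy data the liquid state should already have a low alignment loss, and the step from liquid to crystal should show up only in the uniformity loss. The gas scores well on uniformity while failing alignment, which fits the point that the endpoint has to be uniform and aligned at once.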

The plateau is the phase transition

Why would a reorganization read as a flat line in the loss? Because the representation does not move between states at a constant rate. It stalls near saddle points of the loss: configurations that are flat in most directions, where the gradient is tiny. The network creeps along until some symmetry breaks and it falls toward the next state. That creep is the plateau; the symmetry-breaking is the drop.

This shows up most starkly in the simplest possible model, a deep linear network, where the dynamics can be solved exactly (Saxe, McClelland & Ganguli, 2013). Such a network learns the structure of its target one mode at a time, strongest first, and each mode switches on only after a delay spent near a saddle. The result is a loss that falls in clean steps: one plateau, one drop, per mode.
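
The staircase is easy to reproduce numerically. The sketch below is a plain gradient-descent simulation, not the closed-form solution from the paper: a two-layer linear network fit to a diagonal target with three modes. The singular values, hidden width, initialization scale, learning rate, and step count are all illustrative assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target linear map with three modes of decreasing strength (assumed values).
strengths = np.array([4.0, 2.0, 1.0])
target = np.diag(strengths)

hidden, init_scale, lr, steps = 16, 1e-3, 0.01, 1500
W1 = init_scale * rng.normal(size=(hidden, 3))  # input -> hidden
W2 = init_scale * rng.normal(size=(3, hidden))  # hidden -> output

losses = np.empty(steps)
modes = np.empty((steps, 3))  # learned strength of each mode over time

for t in range(steps):
    product = W2 @ W1
    err = target - product
    losses[t] = 0.5 * np.sum(err ** 2)
    # The target is diagonal, so its modes are the coordinate axes and the
    # diagonal of the product tracks how much of each mode has been learned.
    modes[t] = np.diag(product)

    # Gradients of 0.5 * ||target - W2 @ W1||_F^2, computed before either update.
    grad_W1 = -W2.T @ err
    grad_W2 = -err @ W1.T
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

# First step at which each mode reaches half of its target strength.
half_times = [int(np.argmax(modes[:, k] > 0.5 * strengths[k])) for k in range(3)]
print("half-learned at steps:", half_times)
print("final loss:", losses[-1])
```

Plotting `losses` on a linear axis should show the steps, and `half_times` should come out in increasing order: the strongest mode switches on first, the weakest last. Shrinking `init_scale` lengthens every plateau, since each mode has to grow from a smaller starting value before it becomes visible in the loss.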

Is that staircase a quirk of linear toys? Chase the shape into other settings and it keeps turning up, which is the tell that it belongs to learning, not to any one architecture. From small initialization, linear and near-linear networks exhibit saddle-to-saddle dynamics: training hops between saddles of increasing rank, each hop adding one direction of structure, and in the vanishing-init limit the trajectory becomes literally piecewise-constant (Saxe et al., 2013; Pesme & Flammarion, 2023). The information bottleneck tells a similar two-phase story, a fast fitting phase followed by a slow phase often described as compression, though whether that second phase is a genuine information-theoretic effect is contested (Shwartz-Ziv & Tishby, 2017; Saxe et al., 2018), so take it as a suggestive picture rather than a law.

And then there is the case that stretches the plateau to absurdity. Train a small transformer on modular arithmetic and it memorizes the training set almost immediately, while test accuracy sits at chance. And sits there. For thousands of steps the curve flatlines, long enough that any reasonable person would kill the run as a memorizer that will never generalize. Wait instead, and the curve snaps upward: the model abruptly generalizes, long after it looked finished. That is grokking (Power et al., 2022), and the mechanistic post-mortem found exactly what the three-states picture predicts: the generalizing structure had been forming gradually under the flat curve the whole time, with a final cleanup phase removing the memorizing circuit (Nanda et al., 2023).

In every one of these, the flat stretch is not idle time. It is the representation doing the slow work of reorganizing, work that is invisible in the loss until the new structure is complete enough to pay off all at once.

Two caveats. First, this clean three-stage path is regime-dependent: it is vivid when the network actually learns features (the “rich” regime, typically from small initialization) and nearly absent in the “lazy” regime, where the representation barely moves and the model behaves like a fixed kernel. Second, not every plateau is a saddle: a flat stretch can also be a low-curvature valley, a vanishing gradient from saturated units, or simply an interval where the active-neuron pattern stops changing (Ainsworth & Shin, 2021). “Near a saddle” is the cleanest mechanism, not the only one.

The “study the phases through the linear case” move is more than a teaching device. A recent position paper argues that layerwise-linear models already reproduce neural collapse, the lazy/rich split, emergence, and grokking, and should be solved first to understand them (Nam et al., 2025). And a parallel program, developmental interpretability, built on singular learning theory, makes the stages precise: it detects the discrete phase transitions of training from the geometry of the loss landscape via the local learning coefficient (Hoogland et al., 2024). The three states are the cartoon; these are the instruments.

Reading a training curve

Once you see plateaus as transitions between the three states, a few practical readings follow.

A plateau is a question, not a failure. Before killing a run that has flattened, ask which transition it is stuck on. Is the representation random (still no clusters, so maybe the signal is too weak), or organized but not structured (clusters formed but not yet separated, often a symmetry waiting to break)? The two call for different fixes.

Structure can lead the loss. As the figure above shows, the geometry often reorganizes during the plateau, before the loss moves. If you only watch the loss you will miss it; a cheap structure probe sees the transition coming. Concretely, during a plateau log three numbers: intra-class variance, mean inter-class cosine similarity, and the distance from the class-mean Gram matrix to a simplex ETF. If intra-class variance is still falling, the network is organizing; if the class means are drifting toward equiangularity, it is structuring.
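
A minimal sketch of such a probe, in NumPy. The function name `structure_probe`, the dictionary keys, and the normalizations are my assumptions: variance is reported as a fraction of total variance, and the ETF distance as a relative Frobenius norm. Class means are used uncentered by default, which suits embeddings that live on a hypersphere; pass `center=True` for the neural-collapse convention of subtracting the global mean first.

```python
import numpy as np

def structure_probe(features, labels, center=False):
    """Three order parameters for an (N, d) feature matrix with integer labels."""
    classes = np.unique(labels)
    C = len(classes)
    global_mean = features.mean(axis=0)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])

    # 1) Local order: share of total variance that sits inside the classes.
    own_mean = means[np.searchsorted(classes, labels)]
    within = np.mean(np.sum((features - own_mean) ** 2, axis=1))
    total = np.mean(np.sum((features - global_mean) ** 2, axis=1))

    # 2) Global order: cosines between class-mean directions.
    directions = means - global_mean if center else means
    directions = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    gram = directions @ directions.T
    off_diag = gram[~np.eye(C, dtype=bool)]

    # 3) Distance to the simplex ETF Gram matrix: 1 on the diagonal, -1/(C-1) elsewhere.
    etf_gram = (C / (C - 1)) * (np.eye(C) - np.ones((C, C)) / C)
    return {
        "intra_class_variance_fraction": float(within / total),
        "mean_inter_class_cosine": float(off_diag.mean()),
        "etf_distance": float(np.linalg.norm(gram - etf_gram) / np.linalg.norm(etf_gram)),
    }

# Toy check on two synthetic states (sizes and noise level are illustrative).
rng = np.random.default_rng(0)
C, n, d = 4, 50, 8
labels = np.repeat(np.arange(C), n)
vertices = np.pad(np.eye(C) - 1.0 / C, ((0, 0), (0, d - C)))
vertices /= np.linalg.norm(vertices, axis=1, keepdims=True)

gas = rng.normal(size=(C * n, d))                            # no class structure
crystal = vertices[labels] + 0.05 * rng.normal(size=(C * n, d))  # clusters on a simplex

gas_probe = structure_probe(gas, labels)
crystal_probe = structure_probe(crystal, labels)
print("gas    ", gas_probe)
print("crystal", crystal_probe)
```

Called every few hundred steps on a held-out batch, a probe like this costs one pass over the features. The ETF distance is worth logging next to the mean cosine because it also catches unequal angles, which the mean averages away.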

The destination is a simplex, in the clean limit. The structured state is not arbitrary. In the balanced supervised-classification limit it is the maximally separated, equiangular arrangement: the simplex ETF of neural collapse, the saturating configuration of the Welch bound. More generally the endpoint is a globally constrained code whose geometry reflects the objective, the labels, and the data: hierarchy, class imbalance, and multi-label structure all bend it away from the plain simplex. But the kind of endpoint is the same one these posts keep arriving at from different directions: a good latent space is a structured one, and organizing randomness is the process of getting there.

The point

Training is not a smooth descent; it is a sequence of reorganizations. Information starts random, becomes organized, and ends structured, and the plateaus in the loss are the seams between those states: the flat stretches where the representation rebuilds itself before the loss is allowed to fall. The staircase is not the model failing to learn. It is the model changing phase.

Cite as

Bouhsine, T. (2026, June 7). The Three States of Information. Records of the !mmortal Data Scientist. https://tahabouhsine.com/blog/three-states-of-information/

BibTeX

@misc{bouhsine2026threestatesofinformation,
  author       = {Bouhsine, Taha},
  title        = {The Three States of Information},
  year         = {2026},
  month        = {jun},
  howpublished = {\url{https://tahabouhsine.com/blog/three-states-of-information/}},
  note         = {Blog post, Records of the !mmortal Data Scientist}
}

References

Saxe, A. M., McClelland, J. L., Ganguli, S. (2013). Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks. ICLR 2014. arXiv:1312.6120
Papyan, V., Han, X. Y., Donoho, D. L. (2020). Prevalence of Neural Collapse during the Terminal Phase of Deep Learning Training. PNAS 117(40). arXiv:2008.08186
Shwartz-Ziv, R., Tishby, N. (2017). Opening the Black Box of Deep Neural Networks via Information. arXiv preprint. arXiv:1703.00810
Power, A., Burda, Y., Edwards, H., Babuschkin, I., Misra, V. (2022). Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. arXiv preprint. arXiv:2201.02177
Pesme, S., Flammarion, N. (2023). Saddle-to-Saddle Dynamics in Diagonal Linear Networks. NeurIPS 2023. arXiv:2304.00488
Ainsworth, M., Shin, Y. (2021). Plateau Phenomenon in Gradient Descent Training of ReLU Networks: Explanation, Quantification and Avoidance. SIAM J. Scientific Computing. arXiv:2007.07213
Refinetti, M., Ingrosso, A., Goldt, S. (2023). Neural Networks Trained with SGD Learn Distributions of Increasing Complexity. ICML 2023. arXiv:2211.11567
Belrose, N., et al. (2024). Neural Networks Learn Statistics of Increasing Complexity. ICML 2024. arXiv:2402.04362
Wang, T., Isola, P. (2020). Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. ICML 2020. arXiv:2005.10242
Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., Cox, D. D. (2018). On the Information Bottleneck Theory of Deep Learning. ICLR 2018.
Nanda, N., Chan, L., Lieberum, T., Smith, J., Steinhardt, J. (2023). Progress Measures for Grokking via Mechanistic Interpretability. ICLR 2023. arXiv:2301.05217
Hoogland, J., Wang, G., Farrugia-Roberts, M., Carroll, L., Wei, S., Murfet, D. (2024). The Developmental Landscape of In-Context Learning. arXiv preprint. arXiv:2402.02364
Nam, Y., Lee, S. H., Domine, C. C. J., Park, Y., London, C., Choi, W., Goring, N., Lee, S. (2025). Position: Solve Layerwise Linear Models First to Understand Neural Dynamical Phenomena (Neural Collapse, Emergence, Lazy/Rich Regime, and Grokking). ICML 2025. arXiv:2502.21009
