The Three States of Information
#ml #training-dynamics #representation-learning #neural-collapse #loss-landscape #phase-transitions #contrastive #simplex #information
Watch a network train for long enough and the loss curve stops looking like a smooth slide. It looks like a staircase: long flat stretches where nothing seems to happen, broken by sudden drops. The usual reaction is “it’s stuck.” It is not stuck. It is reorganizing.
The cleanest way to see why is to stop watching the loss and watch the representation, the geometry of the activations the network is shaping. That geometry moves through three states, and the move from one to the next is exactly the plateau.
By “information” here I mean label-relevant structure in representation space: not Shannon information in the abstract, but the geometry that makes classes recoverable.
Three states, like matter
Borrow the picture from phases of matter. It is an old analogy in the statistical mechanics of learning, where training is read as an ordering process with its own order parameters and a falling representation entropy. None of the three states below is a new phenomenon; the contribution here is to name them cleanly and let you watch the transitions.
Random: the gas. At initialization the representation is high-entropy and isotropic: points spread with no relation to their class, pairwise similarities look like noise. There is information in the labels, but none of it is in the geometry yet.
Organized: the liquid. Partway through training, same-class points pull together into clusters. There is now local order (you can tell which points belong together), but the clusters themselves are not yet arranged: they can sit at arbitrary, even overlapping, angles. The representation has organized within classes without organizing between them. This “low-order structure first” pattern is the distributional simplicity bias: networks fit the mean and covariance of the data before its higher-order correlations, a bias documented from small CNNs to ImageNet models and LLMs (Refinetti et al., 2023; Belrose et al., 2024).
Structured: the crystal. At convergence the clusters settle into a globally ordered arrangement: maximally separated, equiangular, a simplex. This is the neural collapse regime (Papyan, Han & Donoho, 2020), where class means converge to a simplex equiangular tight frame, the same maximally-spread configuration the Welch bound describes. Global order, minimal redundancy.
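To make the target geometry concrete, here is a minimal sketch that builds a simplex ETF for K classes and checks its Gram matrix. The helper name `simplex_etf` and the choice K = 4 are illustrative assumptions; the construction (center the one-hot vectors, then rescale to unit norm) is one standard way to write the frame, not the only one.

```python
import numpy as np

def simplex_etf(num_classes: int) -> np.ndarray:
    """Return a (K, K) array whose rows are unit vectors forming a simplex ETF.

    Hypothetical helper for illustration: rows live in R^K but span only
    K - 1 dimensions, because they sum to zero.
    """
    k = num_classes
    # Center the one-hot vectors by subtracting their common mean (1/k in every slot).
    centered = np.eye(k) - np.ones((k, k)) / k
    # Each centered row has norm sqrt((k - 1) / k); rescale to unit norm.
    return centered * np.sqrt(k / (k - 1))

means = simplex_etf(4)
gram = means @ means.T  # pairwise cosine similarities, since rows are unit norm
off_diagonal = gram[~np.eye(4, dtype=bool)]
print(np.round(gram, 3))
```

Every off-diagonal entry equals -1/(K - 1): all pairs of class means sit at the same angle, and that angle is as wide as K unit vectors can manage simultaneously.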
The figure below is a supervised contrastive encoder trained live in your browser. Press play and watch the representation pass through all three (scattered, then clustered, then crystallized into a simplex) while a structure metric climbs alongside the loss.
That two-step progression mirrors the standard decomposition of contrastive learning into alignment and uniformity. Wang & Isola (2020) show the contrastive loss splits into alignment, pulling positive pairs together (the organized state), and uniformity, spreading features evenly on the sphere (the structured state). That decomposition is about which two properties the loss optimizes, not a guaranteed training-time order; but in the run above you can watch alignment resolve first, then uniformity slowly break the symmetry into a simplex.
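The two terms are easy to compute on their own. Below is a NumPy sketch of the alignment and uniformity metrics in the form Wang & Isola define them (mean distance between positive pairs raised to a power alpha, and the log of the mean Gaussian potential over all pairs). The defaults alpha = 2 and t = 2, the function names, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def align_loss(x: np.ndarray, y: np.ndarray, alpha: float = 2.0) -> float:
    """Alignment: mean distance (to the power alpha) between positive pairs x[i], y[i]."""
    return float(np.mean(np.linalg.norm(x - y, axis=1) ** alpha))

def uniform_loss(x: np.ndarray, t: float = 2.0) -> float:
    """Uniformity: log of the mean Gaussian potential over all distinct pairs."""
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(x), k=1)  # each unordered pair once, no self-pairs
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))

def normalize(v: np.ndarray) -> np.ndarray:
    """Project rows onto the unit sphere."""
    return v / np.linalg.norm(v, axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Synthetic stand-ins for embeddings: one set spread over the sphere,
# one set squeezed into a tiny patch of it.
spread = normalize(rng.normal(size=(256, 16)))
collapsed = normalize(np.ones((256, 16)) + 1e-3 * rng.normal(size=(256, 16)))
positives = normalize(spread + 0.05 * rng.normal(size=spread.shape))

print("align:", align_loss(spread, positives))
print("uniform (spread):", uniform_loss(spread))
print("uniform (collapsed):", uniform_loss(collapsed))
```

Lower is better for both. A representation that collapses to a point scores perfectly on alignment and worst (zero) on uniformity, which is why the two have to be tracked together.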
The plateau is the phase transition
Here is the part that connects the two pictures, in the feature-learning regimes this post is about. The representation does not move between states at a constant rate. It stalls near saddle points of the loss: configurations that are flat in most directions, where the gradient is tiny. The network creeps along until some symmetry breaks and it falls toward the next state. That creep is the plateau; the symmetry-breaking is the drop.
This shows up most starkly in the simplest possible model, a deep linear network, where the dynamics can be solved exactly (Saxe, McClelland & Ganguli, 2013). Such a network learns the structure of its target one mode at a time, strongest first, and each mode switches on only after a delay spent near a saddle. The result is a loss that falls in clean steps: one plateau, one drop, per mode.
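A toy version of that staircase fits in a few lines. The sketch below assumes the modes decouple, so each one is a scalar two-layer linear network with loss 0.5 * (s - a*b)^2, started from a small balanced initialization. It is written in the spirit of the exact solution rather than as a reproduction of it; `train_modes`, the mode strengths, the learning rate, and the init scale are all illustrative choices.

```python
import numpy as np

def train_modes(strengths, init=1e-3, lr=0.01, steps=4000):
    """Gradient descent on L = 0.5 * sum_i (s_i - a_i * b_i)^2.

    Each mode i has its own scalar weight pair (a_i, b_i); the product a_i * b_i
    is the learned strength of that mode. Returns an array of shape (steps, modes).
    """
    s = np.asarray(strengths, dtype=float)
    a = np.full_like(s, init)
    b = np.full_like(s, init)
    history = np.empty((steps, s.size))
    for step in range(steps):
        err = s - a * b
        # dL/da = -err * b and dL/db = -err * a; update both from the old values.
        a, b = a + lr * err * b, b + lr * err * a
        history[step] = a * b
    return history

strengths = np.array([3.0, 2.0, 1.0])
history = train_modes(strengths)
loss = 0.5 * np.sum((strengths - history) ** 2, axis=1)

# First step at which each mode reaches half of its target strength.
half_steps = np.argmax(history >= 0.5 * strengths, axis=0)
print("half-strength step per mode:", half_steps)
```

Near the origin the gradient of each mode is proportional to its own tiny weights, so every mode sits almost still before taking off, and the stronger modes escape first. Plotting `loss` against the step index gives one plateau and one drop per mode.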
The same shape recurs across very different settings, which is the tell that it is a property of learning, not of any one architecture:
- Saddle-to-saddle dynamics in linear and small networks: from small initialization, training hops between saddles of increasing rank, each hop adding one direction of structure. In the vanishing-init limit the trajectory becomes literally piecewise-constant, jumping from saddle to saddle (Saxe et al., 2013; Pesme & Flammarion, 2023).
- The information bottleneck’s two phases (Shwartz-Ziv & Tishby, 2017): a fast fitting phase, then a slow phase often described as compression. Whether that second phase is a genuine information-theoretic effect is contested (Saxe et al., 2018), so take it as a suggestive picture, not a law.
- Grokking (Power et al., 2022): the extreme case, where the plateau lasts so long that the model appears to have merely memorized the training set, then a late transition snaps it into a generalizing solution. Mechanistically the structure forms gradually underneath the flat curve, and a final cleanup removes the memorizing part (Nanda et al., 2023).
In every one, the flat stretch is not idle time. It is the representation doing the slow work of reorganizing, work that is invisible in the loss until the new structure is complete enough to pay off all at once.
Two honest caveats. First, this clean three-stage path is regime-dependent: it is vivid when the network actually learns features (the “rich” regime, typically from small initialization) and nearly absent in the “lazy” regime, where the representation barely moves and the model behaves like a fixed kernel. Second, not every plateau is a saddle: a flat stretch can also be a low-curvature valley, a vanishing gradient from saturated units, or simply an interval where the active-neuron pattern stops changing (Ainsworth & Shin, 2021). “Near a saddle” is the cleanest mechanism, not the only one.
The “study the phases through the linear case” move is more than a teaching device. A recent line of work argues that layerwise-linear models already reproduce neural collapse, the lazy/rich split, emergence, and grokking, and should be solved first to understand them. And a parallel program, developmental interpretability, built on singular learning theory, makes the stages precise: it detects the discrete phase transitions of training from the geometry of the loss landscape via the local learning coefficient (Hoogland et al., 2024). The three states are the cartoon; these are the instruments.
Reading a training curve
Once you see plateaus as transitions between the three states, a few practical readings follow.
A plateau is a question, not a failure. Before killing a run that has flattened, ask which transition it is stuck on. Is the representation random (still no clusters, so maybe the signal is too weak), or organized but not structured (clusters formed but not yet separated, often a symmetry waiting to break)? The two call for different fixes.
Structure can lead the loss. As the figure above shows, the geometry often reorganizes during the plateau, before the loss moves. If you only watch the loss you will miss it; a cheap structure probe sees the transition coming. Concretely, during a plateau log three numbers: intra-class variance, mean inter-class cosine similarity, and the distance from the class-mean Gram matrix to a simplex ETF. If intra-class variance is still falling, the network is organizing; if the class means are drifting toward equiangularity, it is structuring.
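Here is one way those three numbers could be computed from a batch of features and integer labels. Treat it as a sketch under stated assumptions: class means are centered by their global mean before taking cosines (the convention used in the neural collapse literature), the distance is a Frobenius norm, and `structure_probes` is a hypothetical helper name. It also assumes the class means are distinct, so none of the centered means has zero norm.

```python
import numpy as np

def structure_probes(features: np.ndarray, labels: np.ndarray):
    """Return (intra_class_variance, mean_inter_class_cosine, etf_distance).

    features: (N, D) array of representations; labels: (N,) integer class ids.
    """
    classes, idx = np.unique(labels, return_inverse=True)
    k = len(classes)
    means = np.stack([features[idx == c].mean(axis=0) for c in range(k)])

    # 1. Intra-class variance: mean squared distance of each point to its class mean.
    intra_var = float(np.mean(np.sum((features - means[idx]) ** 2, axis=1)))

    # 2. Mean cosine similarity between distinct, globally centered class means.
    centered = means - means.mean(axis=0)
    unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    gram = unit @ unit.T
    mean_cos = float(gram[~np.eye(k, dtype=bool)].mean())

    # 3. Frobenius distance from that Gram matrix to the simplex ETF Gram matrix
    #    (ones on the diagonal, -1/(k - 1) everywhere else).
    etf_gram = (k * np.eye(k) - np.ones((k, k))) / (k - 1)
    etf_dist = float(np.linalg.norm(gram - etf_gram))
    return intra_var, mean_cos, etf_dist

# Synthetic check: points exactly on simplex vertices versus a noisy copy.
k = 4
vertices = (np.eye(k) - 1.0 / k) * np.sqrt(k / (k - 1))
labels = np.repeat(np.arange(k), 50)
rng = np.random.default_rng(0)
collapsed = vertices[labels]
noisy = collapsed + 0.3 * rng.normal(size=collapsed.shape)

print("collapsed:", structure_probes(collapsed, labels))
print("noisy:    ", structure_probes(noisy, labels))
```

Logged every few hundred steps, a falling first number with the other two roughly flat suggests the organizing phase; a third number shrinking toward zero suggests the structuring phase.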
The destination is a simplex, in the clean limit. The structured state is not arbitrary. In the balanced supervised-classification limit it is the maximally-separated, equiangular arrangement: the simplex ETF of neural collapse, the saturating configuration of the Welch bound. More generally the endpoint is a globally constrained code whose geometry reflects the objective, the labels, and the data: hierarchy, class imbalance, and multi-label structure all bend it away from the plain simplex. But the kind of endpoint is the same one these posts keep arriving at from different directions: a good latent space is a structured one, and organizing randomness is the process of getting there.
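For reference, this is the arithmetic behind “saturating configuration of the Welch bound,” written for real unit vectors. The symbols are notation introduced here for illustration: N unit vectors x_i in R^d, and K unit-norm, globally centered class means mu_i, which span at most K - 1 dimensions.

```latex
\begin{align*}
  % Welch bound for N unit vectors in R^d, with N > d:
  \max_{i \neq j} \bigl|\langle x_i, x_j \rangle\bigr|
    &\ge \sqrt{\frac{N - d}{d\,(N - 1)}} \\
  % Substitute N = K class means living in d = K - 1 dimensions:
  \sqrt{\frac{K - (K - 1)}{(K - 1)(K - 1)}}
    &= \frac{1}{K - 1} \\
  % The simplex ETF attains the bound with equality:
  \langle \mu_i, \mu_j \rangle
    &= -\frac{1}{K - 1} \quad (i \neq j), \qquad \lVert \mu_i \rVert = 1
\end{align*}
```

No arrangement of K unit vectors in K - 1 dimensions can make its worst-case pair less correlated than 1/(K - 1) in absolute value, and the simplex hits exactly that value for every pair.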
The point
Training is not a smooth descent; it is a sequence of reorganizations. Information starts random, becomes organized, and ends structured, and the plateaus in the loss are the seams between those states: the flat stretches where the representation rebuilds itself before the loss is allowed to fall. The staircase is not the model failing to learn. It is the model changing phase.
Cite as
Bouhsine, T. (2026). The Three States of Information. Records of the !mmortal Data Scientist. https://tahabouhsine.com/blog/three-states-of-information/
BibTeX
@misc{bouhsine2026threestatesofinformation,
author = {Bouhsine, Taha},
title = {The Three States of Information},
year = {2026},
month = {jun},
howpublished = {\url{https://tahabouhsine.com/blog/three-states-of-information/}},
note = {Blog post, Records of the !mmortal Data Scientist}
}
References
- Saxe, McClelland & Ganguli (2013). Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks. ICLR 2014. arXiv:1312.6120
- Papyan, Han & Donoho (2020). Prevalence of Neural Collapse during the Terminal Phase of Deep Learning Training. PNAS 117(40). arXiv:2008.08186
- Shwartz-Ziv & Tishby (2017). Opening the Black Box of Deep Neural Networks via Information. arXiv preprint. arXiv:1703.00810
- Power et al. (2022). Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. arXiv preprint. arXiv:2201.02177
- Pesme & Flammarion (2023). Saddle-to-Saddle Dynamics in Diagonal Linear Networks. NeurIPS 2023. arXiv:2304.00488
- Ainsworth & Shin (2021). Plateau Phenomenon in Gradient Descent Training of ReLU Networks: Explanation, Quantification and Avoidance. SIAM J. Scientific Computing. arXiv:2007.07213
- Refinetti et al. (2023). Neural Networks Trained with SGD Learn Distributions of Increasing Complexity. ICML 2023. arXiv:2211.11567
- Belrose et al. (2024). Neural Networks Learn Statistics of Increasing Complexity. ICML 2024. arXiv:2402.04362
- Wang & Isola (2020). Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. ICML 2020. arXiv:2005.10242
- Saxe et al. (2018). On the Information Bottleneck Theory of Deep Learning. ICLR 2018.
- Nanda et al. (2023). Progress Measures for Grokking via Mechanistic Interpretability. ICLR 2023. arXiv:2301.05217
- Hoogland et al. (2024). The Developmental Landscape of In-Context Learning. arXiv preprint. arXiv:2402.02364