Not All Infinities Are Equal


#ml #information-theory #loss-functions #interpretability

Cross-entropy diverges. Everyone knows this. The standard story stops at “if the supports don’t overlap, the loss is infinite.” That story is incomplete in three ways that turn out to matter.

First, the divergence condition is weaker than “disjoint support.” $H(p, q) = +\infty$ requires only that there exist a single coordinate $i$ with $p_i > 0$ and $q_i = 0$ — the supports can overlap on 99% of mass and the loss is still infinite if even one violating coordinate is left uncovered. Worse, the rate of divergence is not uniform. It is controlled by the total violating mass $S(p, q)$, which gives a continuous spectrum of “infinities” by severity.

Second, cross-entropy is asymmetric. $H(p, q) \neq H(q, p)$, and the asymmetry has a precise operational meaning: missing truth ($q_i = 0$ where $p_i > 0$) is an infinite penalty; adding falsehood ($q_i > 0$ where $p_i = 0$) is a zero penalty. The loss enforces coverage with infinite force and imposes no constraint on precision.

Third, KL divergence is undefined at independence. Disjoint support is the probabilistic analog of vector orthogonality. Vector geometry has a perfectly well-defined distance at orthogonality — $\sqrt{2}$ on the unit sphere. Probability geometry has a singularity. The configuration that, geometrically, represents “no shared information” is the configuration where the information-theoretic distance metric breaks.

Those three statements are proven facts about the loss. They compose, I will argue, into a unifying reading of three of the best-known failure modes of modern machine learning: hallucination, the modality gap in CLIP and SigLIP, and the extreme batch sizes required by InfoNCE-style contrastive losses. The composition is interpretive, not a derivation — each phenomenon has competing explanations in the literature, and the singularity-shadow reading is one factor among them rather than a sole cause. Where I make that move below I flag it as a hypothesis; where the underlying math is a theorem I state it as one. The aim is to give the three phenomena a single mathematical reference point, not to displace the existing accounts of any of them.

The singularity is not binary

The textbook condition for $H(p, q) = +\infty$ is “if the supports are disjoint.” The actual condition is strictly weaker. Define the violating set

$$V_{p, q} = \{\, i : p_i > 0 \text{ and } q_i = 0 \,\},$$

and the violating mass $S(p, q) = \sum_{i \in V_{p,q}} p_i$. Then

$$H(p, q) = +\infty \iff V_{p, q} \neq \varnothing.$$

A single uncovered coordinate is sufficient. Two distributions $p = (0.5, 0.3, 0.2)$ and $q = (0.7, 0.3, 0)$ overlap on 80% of $p$’s mass and still give $H(p, q) = +\infty$. The textbook framing of “disjoint support” is presenting a sufficient condition as if it were necessary and sufficient.

But the binary infinite/finite picture is itself misleading. The path to infinity is continuous. If we smooth $q$ uniformly, $q^{(\varepsilon)} = (1 - \varepsilon) q + \varepsilon u$ with $u$ uniform on $|\mathcal{X}| = n$ symbols, then as $\varepsilon \to 0^+$,

$$H(p, q^{(\varepsilon)}) = S(p, q) \cdot (-\log \varepsilon) + S(p, q) \log n + C_{p,q} + O(\varepsilon).$$

The slope of the divergence in $-\log \varepsilon$ is exactly the violating mass $S(p, q)$. The “infinity” of cross-entropy is not a single rate — it is a real-valued function of how much truth $q$ is missing.

[Figure: cross-entropy under uniform smoothing q^(ε) for varying violating mass S(p, q), with reference curves at S = 0.1, 0.5, 1.0. The slope of the divergence in −log ε is exactly S: a model that misses 1% of the truth diverges 100× slower than one that misses everything, and the S = 0 case is the flat finite line — no violating coordinates, no divergence.]

A model that fails to assign mass to a coordinate $i$ where $p_i = 0.01$ is, in the limit, infinitely wrong. A model that fails on a coordinate where $p_i = 0.9$ is ninety times more wrong, in the precise sense that its loss diverges ninety times faster as you smooth toward zero. The textbook view collapses these together into “the loss is infinite.” The view that lets you reason about which model is worse, and by how much, distinguishes them by $S(p, q)$.
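
To make the rate concrete, here is a minimal numpy sketch (the `cross_entropy` helper and the ε grid are mine, written for this post, not any library's API). It smooths the 80%-overlap example from above and reads off the slope of the loss against −log ε, which comes out as the violating mass S = 0.2:

```python
import numpy as np

# The 80%-overlap example from above: one uncovered coordinate, S(p, q) = 0.2.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.7, 0.3, 0.0])
n = len(p)
u = np.full(n, 1.0 / n)
S = p[(p > 0) & (q == 0)].sum()

def cross_entropy(p, q):
    # Convention: terms with p_i = 0 contribute 0; p_i > 0 with q_i = 0 gives +inf.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(p > 0, -p * np.log(q), 0.0).sum()

eps_grid = [1e-2, 1e-4, 1e-6, 1e-8]
H = [cross_entropy(p, (1 - e) * q + e * u) for e in eps_grid]

# Slope of H against -log(eps) between successive grid points -> S(p, q).
for (e1, h1), (e2, h2) in zip(zip(eps_grid, H), zip(eps_grid[1:], H[1:])):
    slope = (h2 - h1) / (np.log(e1) - np.log(e2))
    print(f"eps {e1:.0e} -> {e2:.0e}: slope ≈ {slope:.4f}  (S = {S:.1f})")
```

The printed slope is ≈ 0.2 on every interval; uncovering a heavier coordinate instead raises the slope in proportion.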

Missing truth is infinite. Adding falsehood is free.

Cross-entropy is not symmetric: in general $H(p, q) \neq H(q, p)$, and the asymmetry has a precise consequence at the boundary of the simplex. For each coordinate:

$$\begin{aligned} q_i = 0, \; p_i > 0 \;&\Longrightarrow\; -p_i \log q_i = +\infty \quad \text{(infinite penalty)}, \\ p_i = 0, \; q_i > 0 \;&\Longrightarrow\; -p_i \log q_i = 0 \quad \text{(zero penalty)}. \end{aligned}$$

The first case is the model failing to cover a real outcome. The second is the model assigning probability to something that cannot happen. Cross-entropy treats these radically differently: the first is unbounded, the second is exactly zero.

[Interactive figure: truth p in muted outline, model q in accent, with readouts for H(p, q), H(q, p), and the violating mass S(p, q). The 'match' preset makes q = p — both cross-entropies are finite and equal to H(p). 'Miss truth on C' sets q_C = 0 while p_C > 0 — H(p, q) blows up, H(q, p) stays finite. 'Add falsehood on D' gives q mass on a coordinate where p has zero — H(p, q) is essentially unchanged, but H(q, p) blows up. Cross-entropy enforces coverage of p with infinite force; it doesn't care about q's precision.]

The per-coordinate reading is sharp. In isolation, the model assigning zero probability to a real outcome is the only way to incur an unbounded penalty, and the model assigning probability to an outcome that cannot happen is, term by term, free.
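
A two-line numeric check of that sharpness, using the same ad hoc `cross_entropy` helper as in the sketch above (repeated here so the block stands alone):

```python
import numpy as np

def cross_entropy(p, q):
    # Convention: terms with p_i = 0 contribute 0; p_i > 0 with q_i = 0 gives +inf.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(p > 0, -p * np.log(q), 0.0).sum()

p = np.array([0.6, 0.4, 0.0])   # truth: the third outcome cannot happen
q = np.array([0.5, 0.3, 0.2])   # model: covers both real outcomes, adds falsehood on the third

print(cross_entropy(p, q))  # ≈ 0.90, finite: the fabricated mass costs nothing per-term
print(cross_entropy(q, p))  # inf: with the roles swapped, the uncovered coordinate is fatal
```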

It is worth being precise about what this is and what it is not. Per-coordinate, the asymmetry is exact. As a description of a training step, it is one half of the story. The other half is the simplex constraint: in a softmax output, mass placed on a falsehood necessarily comes from somewhere, including, in general, from real outcomes. So adding falsehood is not literally free in practice; it competes with coverage through the simplex. The asymmetry is a bias, not an unconstrained optimum. The bias is real and points unambiguously toward overcoverage under uncertainty; the strict “optimal strategy” framing would require that the model’s mass on real outcomes be insensitive to its mass on imaginary ones, which is not how the softmax works.
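
To see the competition through the normalizer, here is a toy four-way softmax (the logits are invented for this sketch): pushing up the logit of an impossible outcome raises the loss only gently, because the stolen mass is spread across every coordinate, while pushing a real outcome's logit down walks straight toward the divergence.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, q):
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(p > 0, -p * np.log(q), 0.0).sum()

p = np.array([0.5, 0.5, 0.0, 0.0])        # two real outcomes, two impossible ones
base = np.array([2.0, 2.0, -5.0, -5.0])   # logits that roughly match p

# Raising the logit on an impossible outcome drains mass from the real ones through
# the normalizer, so the loss rises, but only gently.
for extra in [0.0, 1.0, 3.0, 5.0]:
    z = base.copy(); z[2] += extra
    print(f"falsehood logit +{extra:.0f}:  H = {cross_entropy(p, softmax(z)):.3f}")

# Suppressing a real outcome's logit instead sends the loss toward the singularity.
z = base.copy(); z[0] = -30.0
print(f"real outcome suppressed: H = {cross_entropy(p, softmax(z)):.3f}")
```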

With that caveat in place, the bias still does most of the explanatory work. A vision-language model that describes ten objects in an image containing five real objects pays a far smaller marginal cross-entropy for distributing some of its probability over the five hallucinated descriptions than it would pay for zeroing out any of the five real ones. The per-coordinate calculus shows up at every training step, in every gradient, and is consistent with the overgenerate-rather-than-undergenerate pattern observed in instruction-tuned models. Pre-training spent its gradients on coverage; refusal — declining to assign probability mass to a wrong answer — looks like the behaviour the loss had been steadily discouraging.

Cross-entropy diverges at independence

Both the asymmetry above and the growth-rate story reduce, in the worst case, to the same singularity. The Kullback–Leibler divergence

$$D_\text{KL}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i} = H(p, q) - H(p)$$

inherits the singularity directly. $H(p)$ is finite for any distribution on a finite alphabet, so $D_\text{KL}(p \,\|\, q) = +\infty$ exactly when $H(p, q)$ is.
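
A quick check of the decomposition on the same pair as before, using scipy's elementwise relative entropy, which builds in the zero-conventions:

```python
import numpy as np
from scipy.special import rel_entr  # elementwise p_i * log(p_i / q_i), with 0-conventions built in

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.7, 0.3, 0.0])

H_p  = -(p[p > 0] * np.log(p[p > 0])).sum()  # entropy of p: always finite on a finite alphabet
D_kl = rel_entr(p, q).sum()                  # +inf: q misses a coordinate p cares about

# H(p, q) = H(p) + D_KL(p || q), so it is +inf exactly when the KL term is.
print(H_p, D_kl, H_p + D_kl)
```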

The case of disjoint support is the cleanest example, and the most consequential. Disjoint support is the probabilistic analog of vector orthogonality — two distributions that share no events are like two vectors that share no projection. But the two settings handle this case in incompatible ways.

[Figure: vector space vs. probability space. Left: two orthogonal unit vectors; the Euclidean distance between them is exactly √2 — finite, well-defined, and reached by orthogonal pairs at no special cost. Right: two distributions p and q with varying support overlap. At zero overlap the supports are disjoint and D_KL(p || q) = +∞; as overlap grows, the divergence becomes finite and decreases smoothly. Geometric independence is a configuration with a distance; probabilistic independence is a configuration where the distance metric breaks.]

This is not a notational accident. Cross-entropy and KL were designed to be finite divergences, useful as proxies for “how far apart are these distributions.” But they only work for distributions in the interior of the simplex — distributions with full support. At the boundary of the simplex, where mass goes to zero on some coordinates, the divergence is undefined.

Contrastive losses ignore this. InfoNCE, SimCLR’s loss, CLIP’s loss, and supervised contrastive learning all want negative pairs to be orthogonal in representation space — they all reach for the boundary as their objective. Each one is built on a divergence that does not exist there.

The modality gap as a singularity shadow

CLIP and SigLIP train image and text encoders to produce embeddings on a shared unit sphere. The contrastive objective wants matched image-text pairs close together and mismatched pairs far apart, with “far” operationalised as small (eventually negative) cosine similarity. In the distributional reading of the loss, the objective wants the image- and text-embedding distributions to occupy disjoint regions of the sphere — formally, $\mathrm{supp}(p_\text{img}) \cap \mathrm{supp}(p_\text{txt}) = \varnothing$.

But the loss is undefined there. The optimisation cannot reach orthogonal separation; the gradient diverges as the objective approaches the boundary. So it settles for near-orthogonal clusters separated by a residual distance. The well-documented “modality gap” in CLIP and SigLIP — the persistent shift between image and text clusters that no amount of training closes — is consistent with the optimisation keeping a safe distance from a singularity at exactly the configuration the loss is asking it to reach.

This is offered here as a hypothesis, not a proof. The literature has at least three other live explanations of the modality gap that any complete account has to engage with. Liang et al. (2022) showed the gap exists at initialisation, before any contrastive training; that part of it is a property of random initial encoders, not of the loss. Deep-net cone effects produce embeddings concentrated in narrow regions of the sphere irrespective of contrastive objectives. Optimisation dynamics around the temperature parameter shape how aggressively the loss pushes negatives across the equator. The singularity-shadow lens is, at most, one factor among these; it offers a unifying reading rather than a sole-cause explanation, and at the level of evidence available now I would not claim more than that.

The batch-size story is similarly a hypothesis-shaped claim. InfoNCE’s gradient on a negative pair is weighted by its softmax probability, and in high dimensions random unit vectors concentrate at cosine zero with variance $1/d$. The negatives the loss wants to push away are mostly already near orthogonality, so the gradient on each one is small. Accumulating useful gradient from a flat region is consistent with the regime that demands enormous batches. The catch is that this is one factor among several known to make large batches help contrastive learning: hard-negative mining (large batches see more genuinely hard negatives), gradient variance reduction (large batches give cleaner stochastic gradients), and architectural effects on the temperature schedule all contribute. CLIP’s 32,768-pair batches and the multi-gigabyte similarity matrices are the result of these factors composing; the singularity-flatness reading explains some of that pressure, not all of it.
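
A sketch of the flat-region claim, with invented but CLIP-ish numbers (d = 512, temperature 0.07, a positive-pair similarity of 0.6, 4096 random negatives; none of these are taken from the CLIP recipe). The cosines of random unit vectors concentrate at zero with standard deviation about 1/√d, and the softmax weight each individual negative receives, which is what scales its term in the InfoNCE gradient, is tiny:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_neg, tau = 512, 4096, 0.07        # embedding dim, negatives per anchor, temperature

# Random unit vectors: cosine with a fixed anchor concentrates at 0 with std ~ 1/sqrt(d).
anchor = rng.standard_normal(d); anchor /= np.linalg.norm(anchor)
negs = rng.standard_normal((n_neg, d))
negs /= np.linalg.norm(negs, axis=1, keepdims=True)
cos = negs @ anchor
print(f"cosine std: {cos.std():.4f}   (1/sqrt(d) = {1 / np.sqrt(d):.4f})")

# InfoNCE weights each negative's gradient by its softmax probability against the positive.
pos_sim = 0.6                          # assumed similarity of the matched pair
logits = np.concatenate([[pos_sim], cos]) / tau
probs = np.exp(logits - logits.max()); probs /= probs.sum()
print(f"largest single-negative weight: {probs[1:].max():.1e}")
print(f"weight summed over all {n_neg} negatives: {probs[1:].sum():.3f}")
```

Each negative contributes on the order of 1e-4; only by summing thousands of them does the repulsive term become comparable to the attractive one, which is the flat-region pressure described above.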

What the singularity actually says

The three observations compose into a single architectural statement. Cross-entropy is a finite-distribution divergence with three structural properties:

  1. It diverges as soon as $q$ leaves any of $p$’s support uncovered, at a rate determined by the missing mass.
  2. It treats coverage failures and precision failures infinitely asymmetrically.
  3. It is undefined at the configuration that represents genuine independence.

A practitioner can do one of three things with this. Live with it, and apply the standard mitigations: label smoothing pulls $q$ off the simplex boundary, temperature scaling slows the gradient near the singularity, support regularization explicitly penalizes the violating mass. Avoid the boundary, and design objectives that don’t ask for what cross-entropy cannot deliver — SigLIP’s bias parameter is exactly this move, since it requires only that mismatched similarities sit below some threshold, not at the limit. Change the divergence: Jensen–Shannon is bounded by $\log 2$ everywhere; Wasserstein measures distance under transport rather than overlap; MMD operates in an RKHS and is finite for any pair of distributions. Each removes the singularity at a different structural cost.
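
A toy comparison of those routes at the fully disjoint configuration (pure numpy; the helpers are mine, not a library API): cross-entropy diverges, Jensen–Shannon caps at log 2, and label smoothing with α = 0.1 pulls q back into the interior.

```python
import numpy as np

def cross_entropy(p, q):
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(p > 0, -p * np.log(q), 0.0).sum()

def kl(p, q):
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(p > 0, p * np.log(p / q), 0.0).sum()

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])                     # fully disjoint support

print(cross_entropy(p, q))                   # inf: the singularity
print(js(p, q), np.log(2))                   # log 2: JS stays bounded at the same configuration

alpha = 0.1                                  # label smoothing pulls q off the boundary
q_smooth = (1 - alpha) * q + alpha / len(q)
print(cross_entropy(p, q_smooth))            # finite again: -log(alpha / n) on the violating coordinate
```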

The proven part of the argument is small and load-bearing: the singularity exists, it is weaker than disjoint support, its severity is a continuous function of violating mass, and its asymmetric structure is a precise per-coordinate fact. The interpretive part — the link from these properties to hallucination, the modality gap, and contrastive batch sizes — is a unifying lens, not a derivation. The lens is, I think, a useful one: it puts three failure modes that are usually discussed separately under a single mathematical structure, and it makes specific predictions about which interventions should help and which should not. But each of those phenomena has competing explanations in the literature, and a complete account of any of them has to do more than point at the singularity.

What is not an option is pretending the singularity isn’t there. The choice is not whether the loss has a singularity. The choice is whether one knows where it is, and how much of the model’s behaviour one is willing to read through that lens.

Cite as

Bouhsine, T. (2026). Not All Infinities Are Equal. Records of the !mmortal Data Scientist. https://tahabouhsine.com/blog/not-all-infinities-are-equal/

BibTeX
@misc{bouhsine2026notallinfinitiesareequal,
  author       = {Bouhsine, Taha},
  title        = {Not All Infinities Are Equal},
  year         = {2026},
  month        = {feb},
  howpublished = {\url{https://tahabouhsine.com/blog/not-all-infinities-are-equal/}},
  note         = {Blog post, Records of the !mmortal Data Scientist}
}

For the underlying paper

Bouhsine, T. (2026). On the Singularity Structure of Information-Theoretic Losses. Unpublished manuscript. [PDF]

BibTeX
@unpublished{bouhsine2026onthe,
  author = {Bouhsine, T.},
  title  = {On the Singularity Structure of Information-Theoretic Losses},
  year   = {2026},
  note   = {Unpublished manuscript}
}