No principled definition of what makes a valid thought representation
No intrinsic evaluation: only downstream benchmark accuracy, which conflates representation quality with model capacity
Unknown whether current latent reasoning methods actually encode meaningful thought
𝐯 encodes a sufficient statistic of P(Y|X): the minimal information needed to generate the right answer. No interpretability required; only functional validity.
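One way to make this precise (a sketch of the standard sufficiency condition, not necessarily the exact formalization used here): conditioning on 𝐯 should render the output independent of the input.

```latex
% Sufficiency: v carries everything in X that matters for Y,
% i.e. Y is conditionally independent of X given v.
P(Y \mid X, \mathbf{v}) = P(Y \mid \mathbf{v})
\quad\Longleftrightarrow\quad
I(X; Y \mid \mathbf{v}) = 0
```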
𝐯 must functionally substitute for the output prefix. Conditioning on 𝐯 should yield the same distribution over the output suffix as conditioning on the prefix itself.
"Yes. 13 is prime because no integer from 2โ12 divides it. Therefore, 13 is prime." Split this into a prefix and a suffix. Replace the prefix embeddings inside the LM with projected ๐ฏ and measure KL(P(suffix | prefix) โ P(suffix | ๐ฏ)). If ๐ฏ is causally valid, the suffix distribution stays the same โ KL โ 0.
𝐯 must discard irrelevant input information. It retains only what the output actually depends on, filtering out everything else.
"Hamlet was written by Shakespeare around 1600. Is 13 prime?" โ the output addresses only primality. A good ๐ฏ discards the literary detail (high residual entropy of X given ๐ฏ and Y) while preserving the math question (low prediction error of Y from ๐ฏ). Encoding both topics redundantly violates Minimality.
Inputs with non-overlapping output spaces must produce resolvable representations. A bounded-capacity discriminator trained on 𝐯 pairs must tell them apart above chance.
"Is 13 prime?" (Yes) and "Is 14 prime?" (No) have disjoint answer spaces. A discriminator trained on their representation pairs must classify them above chance. We test two regimes: same-task (two questions from one task โ hardest) and cross-task (questions from different domains โ easier). Collapse on same-task is the critical failure mode.
𝐯 must capture the distributional character of P(Y|X). Semantically equivalent outputs should yield similar representations; when outputs are uncertain, 𝐯 must not collapse to one mode.
"Yes. 13 is prime because no integer from 2โ12 divides it." and "Yes, 13 has no divisors other than 1 and itself." say the same thing โ ๐ฏ should be approximately the same for both. Mode-collapse resistance: if the model sometimes answers Yes and sometimes No to an ambiguous question, ๐ฏ must reflect that uncertainty rather than locking onto one answer.
Every candidate representation knows what kind of task it's on.
But none can tell which question within that task.
"Is 13 prime?" vs "Is 14 prime?" โ all candidates collapse to chance. Only the Output Embedding oracle reaches 63โ73%.
Math vs spatial reasoning: all candidates distinguish task type near-perfectly.