
Formalizing
Latent Thoughts

Four Axioms for Evaluating Thought Representations in LLMs

Fahd Seddik · Fatemeh Fard, University of British Columbia

The Problem

LLMs are thinking in silence.
But are they really thinking?

Explicit CoT: "Is 13 prime?" → "Yes. 13 is prime because no divisor..." → "Yes. 13 is prime."
Latent Thought: "Is 13 prime?" → 𝒯 → "Yes. 13 is prime."
Does 𝒯 actually encode the reasoning, or is it a lucky bypass?
01

No principled definition of what makes a valid thought representation

02

No intrinsic evaluation: only downstream benchmark accuracy, which conflates representation quality with model capacity

03

Unknown whether current latent reasoning methods actually encode meaningful thought

The Framework

A thought representation
mediates input and output.

X (Input): "Is 13 prime, and why?"
→ encode → 𝒯 (Thought): any thought extractor output
→ decode → Y (Output): "Yes. 13 is prime because no divisor exists."

𝒯 encodes sufficient statistics of P(Y|X): the minimal information needed to generate the right answer. No interpretability required; only functional validity.
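One way to state this sufficiency requirement in symbols, with enc denoting the thought extractor (our paraphrase of the property above, not a formula quoted from the paper):

```latex
\mathcal{T} = \mathrm{enc}(X), \qquad
P(Y \mid X) \;\approx\; P\!\left(Y \mid \mathcal{T}\right)
```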

The Axioms

Any valid 𝒯 must satisfy
four functional properties.

01
Causality
[Diagram: an LM conditioned on the output-prefix tokens, P(Y | tokens), and the same LM conditioned on 𝒯, P(Y | 𝒯), should produce approximately the same distribution]

𝒯 must functionally substitute the output prefix. Conditioning on 𝒯 should yield the same distribution over the output suffix as conditioning on the prefix itself.

Example The model generates "Yes. 13 is prime because no integer from 2–12 divides it. Therefore, 13 is prime." Split this into a prefix and a suffix. Replace the prefix embeddings inside the LM with projected 𝒯 and measure KL(P(suffix | prefix) ‖ P(suffix | 𝒯)). If 𝒯 is causally valid, the suffix distribution stays the same: KL ≈ 0.
KL substitution error ↓ lower is better
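A minimal sketch of how such a substitution test can be run, assuming a Hugging Face-style causal LM that accepts inputs_embeds; the names thought and thought_proj (a learned projection into the model's embedding space) are ours and this is not the paper's released code.

```python
# Sketch of the KL substitution test for Causality (assumed setup, not the
# paper's code): replace the output-prefix embeddings with projected thought
# vectors and compare the predictive distributions over the suffix.
import torch
import torch.nn.functional as F

@torch.no_grad()
def kl_substitution_error(model, prefix_ids, suffix_ids, thought, thought_proj):
    """Mean KL( P(suffix | prefix) || P(suffix | projected thought) ) in nats.

    prefix_ids, suffix_ids: [1, P] and [1, S] token-id tensors.
    thought: [K, d_thought] candidate representation; thought_proj is a
    hypothetical learned projection into the LM's embedding dimension.
    """
    emb = model.get_input_embeddings()
    n_suffix = suffix_ids.shape[-1]

    # Reference pass: suffix conditioned on the real output prefix.
    ref_logits = model(input_ids=torch.cat([prefix_ids, suffix_ids], dim=-1)).logits

    # Substituted pass: suffix conditioned on projected thought vectors instead.
    sub_embeds = torch.cat([thought_proj(thought).unsqueeze(0), emb(suffix_ids)], dim=1)
    sub_logits = model(inputs_embeds=sub_embeds).logits

    # The positions whose logits predict the suffix tokens (the last S
    # predictions before the final one) line up in both passes.
    ref_logp = F.log_softmax(ref_logits[:, -(n_suffix + 1):-1, :], dim=-1)
    sub_logp = F.log_softmax(sub_logits[:, -(n_suffix + 1):-1, :], dim=-1)
    kl = F.kl_div(sub_logp, ref_logp, log_target=True, reduction="none").sum(-1)
    return kl.mean().item()  # ~0 nats if the thought substitutes the prefix
```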
02
Minimality
[Diagram: X (full input, including the irrelevant "Hamlet was written by Shakespeare...") → 𝒯 (primality only) → Y (answer)]

𝒯 must discard irrelevant input information. It retains only what the output actually depends on, filtering out everything else.

Example Input: "Hamlet was written by Shakespeare around 1600. Is 13 prime?" The output addresses only primality. A good 𝒯 discards the literary detail (high residual entropy of X given 𝒯 and Y) while preserving the math question (low prediction error of Y from 𝒯). Encoding both topics redundantly violates Minimality.
Minimality gap ΔIB ↑ higher is better
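The ΔIB estimator itself is information-theoretic; as a rough illustration of the same idea, one can probe whether 𝒯 still carries the irrelevant part of X. The probing setup below is an assumption of ours, not the paper's estimator.

```python
# Probing sketch in the spirit of the Minimality axiom (assumed setup):
# a minimal thought keeps the task signal and drops the injected distractor.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def minimality_probe(thoughts, answer_labels, distractor_labels):
    """thoughts: [n, d] pooled thought vectors.
    answer_labels: what Y actually depends on (e.g. prime vs. not prime).
    distractor_labels: the irrelevant attribute injected into X (e.g. which
    off-topic sentence was prepended)."""
    keep = cross_val_score(LogisticRegression(max_iter=1000),
                           thoughts, answer_labels, cv=5).mean()
    leak = cross_val_score(LogisticRegression(max_iter=1000),
                           thoughts, distractor_labels, cv=5).mean()
    # High `keep` with `leak` near chance is consistent with Minimality;
    # high accuracy on both means the distractor is redundantly encoded.
    return {"answer_probe_acc": keep, "distractor_probe_acc": leak}
```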
03
Separability
[Diagram: representations of "13 prime? ✓ Yes" and "14 prime? ✗ No" as separable points on either side of a decision boundary]

Inputs with non-overlapping output spaces must produce resolvable representations. A bounded-capacity discriminator trained on 𝒯 pairs must tell them apart above chance.

Example "Is 13 prime?" (Yes) and "Is 14 prime?" (No) have disjoint answer spaces. A discriminator trained on their representation pairs must classify them above chance. We test two regimes: same-task (two questions from one task โ€” hardest) and cross-task (questions from different domains โ€” easier). Collapse on same-task is the critical failure mode.
Discriminator accuracy ↑ higher is better
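A sketch of the separability test, under the assumption that the bounded-capacity discriminator is something as simple as logistic regression (the paper's probe may differ):

```python
# Separability sketch (assumed setup): can a small discriminator tell which
# of two questions a thought representation came from?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def pairwise_separability(thoughts_a, thoughts_b, cv=5):
    """thoughts_a, thoughts_b: [n, d] representations sampled for question A
    and question B (e.g. "Is 13 prime?" vs. "Is 14 prime?").
    Returns mean held-out accuracy; 0.5 means the pair has collapsed."""
    X = np.vstack([thoughts_a, thoughts_b])
    y = np.concatenate([np.zeros(len(thoughts_a)), np.ones(len(thoughts_b))])
    clf = LogisticRegression(max_iter=1000)  # bounded-capacity stand-in
    return cross_val_score(clf, X, y, cv=cv).mean()
```

Averaging this accuracy over same-task question pairs gives the within-task number; over pairs drawn from different tasks, the cross-task number.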
04
Stability
[Diagram: mode collapse ✗ (locks onto one answer) vs. stable ✓ (reflects uncertainty)]

𝒯 must capture the distributional character of P(Y|X). Semantically equivalent outputs should yield similar representations; when outputs are uncertain, 𝒯 must not collapse to one mode.

Example Lexical invariance: "Yes. 13 is prime because no integer from 2–12 divides it." and "Yes, 13 has no divisors other than 1 and itself." say the same thing, so 𝒯 should be approximately the same for both. Mode-collapse resistance: if the model sometimes answers Yes and sometimes No to an ambiguous question, 𝒯 must reflect that uncertainty rather than locking onto one answer.
Distributional Consistency Score (DCS) ↑ higher is better
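The paper reports DCS as an AUROC at a similarity threshold τ; the pairwise-cosine reading below is one plausible way to compute such a score and is offered as an assumption, not the official estimator.

```python
# Stability sketch (assumed reading of DCS): thoughts for semantically
# equivalent outputs should be more similar than thoughts for unrelated ones.
import numpy as np
from sklearn.metrics import roc_auc_score

def stability_auroc(pairs, labels):
    """pairs: list of (t1, t2) thought vectors.
    labels: 1 if the two underlying outputs are paraphrases, 0 otherwise.
    Returns AUROC of cosine similarity against the equivalence labels."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    sims = [cos(t1, t2) for t1, t2 in pairs]
    return roc_auc_score(labels, sims)
```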
The Key Finding

Representational Collapse

Every candidate representation knows what kind of task it's on.
But none can tell which question within that task.

[Chart: within-task vs. cross-task discriminator accuracy. All candidates cluster near random (50%) within-task while saturating cross-task; the Output Embedding oracle reaches roughly 73% within-task; the ideal sits at 100% on both axes.]
✗
Within-task: 50–55%

"Is 13 prime?" vs. "Is 14 prime?": all candidates collapse to chance. Only the Output Embedding oracle reaches 63–73%.

✓
Cross-task: 95–99%

Math vs. spatial reasoning: all candidates distinguish task type near perfectly.

This failure is invisible to accuracy benchmarks: Spearman ρ = 0.10 (p = 0.31) between within-task discrimination and downstream pass@1. Models answer correctly while representations collapse on question identity.
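For reference, this kind of rank correlation is computed directly with SciPy; the per-task scores below are placeholder values for illustration, not the paper's data.

```python
# Rank correlation between within-task separability and downstream accuracy.
# The input lists are hypothetical per-task scores, shown only to make the
# computation concrete.
from scipy.stats import spearmanr

within_task_sep = [0.51, 0.55, 0.50, 0.53, 0.52]  # discriminator accuracy per task
pass_at_1 = [0.62, 0.31, 0.78, 0.44, 0.55]         # downstream pass@1 per task

rho, p = spearmanr(within_task_sep, pass_at_1)
print(f"Spearman rho = {rho:.2f}, p = {p:.2f}")
```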
Results at a Glance

Audited across five models
and twenty-three reasoning tasks.

5 Open-weight LLMs
23 Reasoning tasks (BBEH)
21 Candidate representations
4 Intrinsic axes measured
Source LLMs
Llama-3.1-8B-Instruct (Dense · Instruct)
Llama-3.3-70B-Instruct (Dense · Instruct)
DeepSeek-R1-Distill-Qwen-32B (Dense · Reasoning-distilled)
Skywork-OR1-32B (Dense · Native RL)
GPT-OSS-20B (Sparse MoE)
Candidate Families
Last Input Token (LIT): Hidden state at the final prompt position
Soft Thinking (ST): Weighted token-embedding combination
Soft Thinking + Gumbel (STN): ST with sampling noise
Latent Thinking (LT): Recurrent hidden state (COCONUT-like)
Output Embedding (OE): Upper-bound anchor (has output access)
Causality
Partial
All candidates: 3.8–5.4 nats KL (vs. random ≈ 9.4 nats), yet no candidate consistently beats the plain Input Embedding across all 5 LLMs
Minimality
Mixed
ΔIB range −0.40 to +0.37 nats: ST/STN at or above Input Embedding; LIT below; no candidate provides a systematic compression advantage across LLMs
Separability
Critical Failure
Within-task accuracy 50–55% (random = 50%) across all candidates and all 5 LLMs. Output Embedding (oracle) reaches 63–73%. Cross-task saturates at 95–99%: candidates distinguish task type but not question identity.
Stability
Partial
DCS AUROC 0.84–0.97 at τ = 0.9, degrading as steps grow (ST@128: 0.85 vs. ST@1: 0.94 on Llama-70B); Input Embedding matches or exceeds iterative candidates
F1
No candidate satisfies all four axioms simultaneously. Across 4 axes and 5 LLMs, no iterative method consistently beats the Input Embedding reference (OE aside: 16/20 cells; all others: 2–13/20).
F2
Per-question identity collapses. Representations distinguish task type reliably (95–99%) but cannot distinguish between two questions within the same task (50–55% ≈ chance). The failure is uniform across all candidate families and architectures.
F3
More thinking steps degrade representations. ST, STN, and LT all worsen as step count grows from 1 to 128 on Stability and Causality; adding budget does not improve representational quality.
F4
The failure is structural, not scale-dependent. Consistent across dense (8B–70B), reasoning-distilled, RL-trained, and sparse-MoE architectures, indicating a fundamental gap, not a parameter-count artifact.
F5
Accuracy benchmarks miss these failures. Spearman ρ = 0.10 between within-task separability and downstream pass@1: models answer correctly while representations carry no per-question structure.