
Formalizing
Latent Thoughts

Four Axioms for Evaluating Thought Representations in LLMs

Fahd Seddik · Fatemeh Fard, University of British Columbia

The Problem

LLMs are thinking in silence.
But are they really thinking?

Explicit CoT: "Is 13 prime?" → "Yes. 13 is prime because no divisor..." → "Yes. 13 is prime."
Latent Thought: "Is 13 prime?" → 𝒯 → "Yes. 13 is prime."
Does 𝒯 actually encode the reasoning, or is it a lucky bypass?
01

No principled definition of what makes a valid thought representation

02

No intrinsic evaluation: only downstream benchmark accuracy, which conflates representation quality with model capacity

03

Unknown whether current latent reasoning methods actually encode meaningful thought

The Framework

A thought representation
mediates input and output.

X (Input): "Is 13 prime, and why?"
→ encode → 𝒯 (Thought): any thought extractor output
→ decode → Y (Output): "Yes. 13 is prime because no divisor exists."

𝒯 encodes sufficient statistics of P(Y|X): the minimal information needed to generate the right answer. No interpretability required; only functional validity.
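One way to state this sufficiency requirement in symbols, with enc denoting the thought extractor (our paraphrase of the property above, not a formula quoted from the paper):

```latex
\mathcal{T} = \mathrm{enc}(X), \qquad
P(Y \mid X) \;\approx\; P\!\left(Y \mid \mathcal{T}\right)
```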

The Axioms

Any valid 𝒯 must satisfy
four functional properties.

01
Causality
[Diagram: an LM conditioned on the output-prefix tokens, P(Y | tokens), and the same LM conditioned on 𝒯, P(Y | 𝒯), should produce approximately the same distribution]

𝒯 must functionally substitute the output prefix. Conditioning on 𝒯 should yield the same distribution over the output suffix as conditioning on the prefix itself.

Example The model generates "Yes. 13 is prime because no integer from 2–12 divides it. Therefore, 13 is prime." Split this into a prefix and a suffix. Replace the prefix embeddings inside the LM with projected 𝒯 and measure KL(P(suffix | prefix) ‖ P(suffix | 𝒯)). If 𝒯 is causally valid, the suffix distribution stays the same: KL ≈ 0.
KL substitution error ↓ lower is better
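A minimal sketch of how such a substitution test can be run, assuming a Hugging Face-style causal LM that accepts inputs_embeds; the names thought and thought_proj (a learned projection into the model's embedding space) are ours and this is not the paper's released code.

```python
# Sketch of the KL substitution test for Causality (assumed setup, not the
# paper's code): replace the output-prefix embeddings with projected thought
# vectors and compare the predictive distributions over the suffix.
import torch
import torch.nn.functional as F

@torch.no_grad()
def kl_substitution_error(model, prefix_ids, suffix_ids, thought, thought_proj):
    """Mean KL( P(suffix | prefix) || P(suffix | projected thought) ) in nats.

    prefix_ids, suffix_ids: [1, P] and [1, S] token-id tensors.
    thought: [K, d_thought] candidate representation; thought_proj is a
    hypothetical learned projection into the LM's embedding dimension.
    """
    emb = model.get_input_embeddings()
    n_suffix = suffix_ids.shape[-1]

    # Reference pass: suffix conditioned on the real output prefix.
    ref_logits = model(input_ids=torch.cat([prefix_ids, suffix_ids], dim=-1)).logits

    # Substituted pass: suffix conditioned on projected thought vectors instead.
    sub_embeds = torch.cat([thought_proj(thought).unsqueeze(0), emb(suffix_ids)], dim=1)
    sub_logits = model(inputs_embeds=sub_embeds).logits

    # The positions whose logits predict the suffix tokens (the last S
    # predictions before the final one) line up in both passes.
    ref_logp = F.log_softmax(ref_logits[:, -(n_suffix + 1):-1, :], dim=-1)
    sub_logp = F.log_softmax(sub_logits[:, -(n_suffix + 1):-1, :], dim=-1)
    kl = F.kl_div(sub_logp, ref_logp, log_target=True, reduction="none").sum(-1)
    return kl.mean().item()  # ~0 nats if the thought substitutes the prefix
```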
02
Minimality
[Diagram: X (full input, including the irrelevant "Hamlet was written by Shakespeare...") → 𝒯 (primality only) → Y (answer)]

𝒯 must discard irrelevant input information. It retains only what the output actually depends on, filtering out everything else.

Example Input: "Hamlet was written by Shakespeare around 1600. Is 13 prime?" The output addresses only primality. A good 𝒯 discards the literary detail (high residual entropy of X given 𝒯 and Y) while preserving the math question (low prediction error of Y from 𝒯). Encoding both topics redundantly violates Minimality.
Minimality gap ΔIB ↑ higher is better
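The ΔIB estimator itself is information-theoretic; as a rough illustration of the same idea, one can probe whether 𝒯 still carries the irrelevant part of X. The probing setup below is an assumption of ours, not the paper's estimator.

```python
# Probing sketch in the spirit of the Minimality axiom (assumed setup):
# a minimal thought keeps the task signal and drops the injected distractor.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def minimality_probe(thoughts, answer_labels, distractor_labels):
    """thoughts: [n, d] pooled thought vectors.
    answer_labels: what Y actually depends on (e.g. prime vs. not prime).
    distractor_labels: the irrelevant attribute injected into X (e.g. which
    off-topic sentence was prepended)."""
    keep = cross_val_score(LogisticRegression(max_iter=1000),
                           thoughts, answer_labels, cv=5).mean()
    leak = cross_val_score(LogisticRegression(max_iter=1000),
                           thoughts, distractor_labels, cv=5).mean()
    # High `keep` with `leak` near chance is consistent with Minimality;
    # high accuracy on both means the distractor is redundantly encoded.
    return {"answer_probe_acc": keep, "distractor_probe_acc": leak}
```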
03
Separability
[Diagram: representations of "13 prime? ✓ Yes" and "14 prime? ✗ No" as separable points on either side of a decision boundary]

Inputs with non-overlapping output spaces must produce resolvable representations. A bounded-capacity discriminator trained on 𝒯 pairs must tell them apart above chance.

Example "Is 13 prime?" (Yes) and "Is 14 prime?" (No) have disjoint answer spaces. A discriminator trained on their representation pairs must classify them above chance. We test two regimes: same-task (two questions from one task โ€” hardest) and cross-task (questions from different domains โ€” easier). Collapse on same-task is the critical failure mode.
Discriminator accuracy ↑ higher is better
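A sketch of the separability test, under the assumption that the bounded-capacity discriminator is something as simple as logistic regression (the paper's probe may differ):

```python
# Separability sketch (assumed setup): can a small discriminator tell which
# of two questions a thought representation came from?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def pairwise_separability(thoughts_a, thoughts_b, cv=5):
    """thoughts_a, thoughts_b: [n, d] representations sampled for question A
    and question B (e.g. "Is 13 prime?" vs. "Is 14 prime?").
    Returns mean held-out accuracy; 0.5 means the pair has collapsed."""
    X = np.vstack([thoughts_a, thoughts_b])
    y = np.concatenate([np.zeros(len(thoughts_a)), np.ones(len(thoughts_b))])
    clf = LogisticRegression(max_iter=1000)  # bounded-capacity stand-in
    return cross_val_score(clf, X, y, cv=cv).mean()
```

Averaging this accuracy over same-task question pairs gives the within-task number; over pairs drawn from different tasks, the cross-task number.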
04
Stability
[Diagram: mode collapse ✗ (locks onto one answer) vs. stable ✓ (reflects uncertainty)]

𝒯 must capture the distributional character of P(Y|X). Semantically equivalent outputs should yield similar representations; when outputs are uncertain, 𝒯 must not collapse to one mode.

Example Lexical invariance: "Yes. 13 is prime because no integer from 2–12 divides it." and "Yes, 13 has no divisors other than 1 and itself." say the same thing, so 𝒯 should be approximately the same for both. Mode-collapse resistance: if the model sometimes answers Yes and sometimes No to an ambiguous question, 𝒯 must reflect that uncertainty rather than locking onto one answer.
Distributional Consistency Score (DCS) ↑ higher is better
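The paper reports DCS as an AUROC at a similarity threshold τ; the pairwise-cosine reading below is one plausible way to compute such a score and is offered as an assumption, not the official estimator.

```python
# Stability sketch (assumed reading of DCS): thoughts for semantically
# equivalent outputs should be more similar than thoughts for unrelated ones.
import numpy as np
from sklearn.metrics import roc_auc_score

def stability_auroc(pairs, labels):
    """pairs: list of (t1, t2) thought vectors.
    labels: 1 if the two underlying outputs are paraphrases, 0 otherwise.
    Returns AUROC of cosine similarity against the equivalence labels."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    sims = [cos(t1, t2) for t1, t2 in pairs]
    return roc_auc_score(labels, sims)
```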
The Key Finding

Representational Collapse

Every candidate representation knows what kind of task it's on.
But none can tell which question within that task.

[Chart: within-task vs. cross-task discriminator accuracy. All candidates cluster near random (50%) within-task while saturating cross-task; the Output Embedding oracle reaches roughly 73% within-task; the ideal sits at 100% on both axes.]
✗
Within-task: 50–55%

"Is 13 prime?" vs. "Is 14 prime?": all candidates collapse to chance. Only the Output Embedding oracle reaches 63–73%.

✓
Cross-task: 95–99%

Math vs. spatial reasoning: all candidates distinguish task type near perfectly.

This failure is invisible to accuracy benchmarks: Spearman ρ = 0.10 (p = 0.31) between within-task discrimination and downstream pass@1. Models answer correctly while representations collapse on question identity.
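For reference, this kind of rank correlation is computed directly with SciPy; the per-task scores below are placeholder values for illustration, not the paper's data.

```python
# Rank correlation between within-task separability and downstream accuracy.
# The input lists are hypothetical per-task scores, shown only to make the
# computation concrete.
from scipy.stats import spearmanr

within_task_sep = [0.51, 0.55, 0.50, 0.53, 0.52]  # discriminator accuracy per task
pass_at_1 = [0.62, 0.31, 0.78, 0.44, 0.55]         # downstream pass@1 per task

rho, p = spearmanr(within_task_sep, pass_at_1)
print(f"Spearman rho = {rho:.2f}, p = {p:.2f}")
```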
Results at a Glance

Audited across five models
and twenty-three reasoning tasks.

5 Open-weight LLMs
23 Reasoning tasks (BBEH)
21 Candidate representations
4 Intrinsic axes measured
Source LLMs
Llama-3.1-8B-Instruct (Dense · Instruct)
Llama-3.3-70B-Instruct (Dense · Instruct)
DeepSeek-R1-Distill-Qwen-32B (Dense · Reasoning-distilled)
Skywork-OR1-32B (Dense · Native RL)
GPT-OSS-20B (Sparse MoE)
Candidate Families
Last Input Token (LIT): Hidden state at the final prompt position
Soft Thinking (ST): Weighted token-embedding combination
Soft Thinking + Gumbel (STN): ST with sampling noise
Latent Thinking (LT): Recurrent hidden state (COCONUT-like)
Output Embedding (OE): Upper-bound anchor (has output access)
Causality
Partial
All candidates: 3.8–5.4 nats KL (vs. random ≈ 9.4 nats), yet no candidate consistently beats the plain Input Embedding across all 5 LLMs
Minimality
Mixed
ΔIB range −0.40 to +0.37 nats: ST/STN at or above Input Embedding; LIT below; no candidate provides a systematic compression advantage across LLMs
Separability
Critical Failure
Within-task accuracy 50–55% (random = 50%) across all candidates and all 5 LLMs. Output Embedding (oracle) reaches 63–73%. Cross-task saturates at 95–99%: candidates distinguish task type but not question identity.
Stability
Partial
DCS AUROC 0.84–0.97 at τ = 0.9, degrading as steps grow (ST@128: 0.85 vs. ST@1: 0.94 on Llama-70B); Input Embedding matches or exceeds iterative candidates
F1
No candidate satisfies all four axioms simultaneously. Across 4 axes and 5 LLMs, no iterative method consistently beats the Input Embedding reference (OE aside: 16/20 cells; all others: 2–13/20).
F2
Per-question identity collapses. Representations distinguish task type reliably (95–99%) but cannot distinguish between two questions within the same task (50–55% ≈ chance). The failure is uniform across all candidate families and architectures.
F3
More thinking steps degrade representations. ST, STN, and LT all worsen as step count grows from 1 to 128 on Stability and Causality; adding budget does not improve representational quality.
F4
The failure is structural, not scale-dependent. Consistent across dense (8B–70B), reasoning-distilled, RL-trained, and sparse-MoE architectures, indicating a fundamental gap, not a parameter-count artifact.
F5
Accuracy benchmarks miss these failures. Spearman ρ = 0.10 between within-task separability and downstream pass@1: models answer correctly while representations carry no per-question structure.