
Excited to share the most inspiring work I've been part of this year:
"Learning to Reason without External Rewards"
TL;DR: We show that LLMs can learn complex reasoning without access to ground-truth answers, simply by optimizing their own internal sense of confidence. (1/n)
The journey began last fall when undergrad Zhewei reached out to collaborate on research. We started from two key observations:
1. In exams, people are often more accurate on questions they feel confident about. Could LLMs also exhibit this "confidence → correctness" pattern?
2. In test-time reasoning, techniques like long CoT or parallel scaling (e.g., majority voting) are common. But how should we choose among diverse outputs, especially in open-ended tasks like code generation?
(2/n)
We explored how to scale best-of-n selection. Existing heuristics like entropy and perplexity had issues: sensitivity to output length, bias, and poor scaling with more samples.
Then we had a key insight: measure how far each token's distribution is from uniform. The KL divergence KL(U ‖ P) quantifies how confidently a model predicts each token. We call this self-certainty. It's the reverse of entropy: mode-seeking rather than mode-covering.
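Here's a minimal sketch of the idea in PyTorch (my own illustration, not the paper's code): compute KL(U ‖ P) per token from the logits, average over the response, and pick the most self-certain candidate for best-of-n.

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Average KL(U || P) over token positions for one sampled response.

    logits: [seq_len, vocab_size] next-token logits. Higher scores mean the
    per-token distributions sit farther from uniform, i.e. more "certain".
    """
    log_p = F.log_softmax(logits, dim=-1)
    vocab_size = logits.size(-1)
    # KL(U || P) = sum_v (1/V) * log((1/V) / P(v)) = -log V - mean_v log P(v)
    kl_per_token = -math.log(vocab_size) - log_p.mean(dim=-1)
    return kl_per_token.mean()

def best_of_n(candidate_logits: list[torch.Tensor]) -> int:
    """Pick the index of the most self-certain candidate response."""
    scores = torch.stack([self_certainty(l) for l in candidate_logits])
    return int(scores.argmax())
```

Averaging per token keeps the score from simply growing with output length.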
We found self-certainty to be a highly effective signal:
1. When answers are known, self-certainty-weighted voting outperforms plain majority voting.
2. When answers are unknown, it still scales robustly with n.
(3/n)
This led to our first paper in February:
"Scalable Best-of-N Selection via Self-Certainty"
(4/n)
But we didn't stop there. A natural next question emerged: if self-certainty is a good evaluation signal, can it also be used as a reward to train the model?
Research works similarly: we build confidence through exploration and reflection. Can LLMs do the same?
(5/n)
This inspired our new paradigm: Reinforcement Learning from Internal Feedback (RLIF). (6/n)
Our method, Intuitor, uses self-certainty as the reward signal for reinforcement learning, with no external supervision required. (7/n)
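As a rough illustration of how the reward plugs in (my sketch, not the released code): score each sampled response with self-certainty, then standardize within the group of samples for the same prompt to get GRPO-style advantages.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages: standardize rewards within a group of responses
    sampled for the same prompt. Here the reward is each response's
    self-certainty score instead of a verifier or human label."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Hypothetical training loop (sample_group and grpo_policy_loss are placeholders):
# for prompt in batch:
#     responses, logits = sample_group(policy, prompt, group_size=G)
#     rewards = torch.stack([self_certainty(l) for l in logits])
#     advantages = group_relative_advantages(rewards)
#     loss = grpo_policy_loss(policy, prompt, responses, advantages)  # clipped PPO-style update
```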
And it works. Intuitor matches the performance of GRPO (which uses rule-based rewards) on math tasks and generalizes even better to code generation.
It learns structured reasoning, including planning ahead, decomposing problems, and even following instructions, all from intrinsic feedback.
(8/n)
On a personal note, this project gave me a lot of confidence, especially as we saw other concurrent work emerge (TTRL, entropy-based RL, semantic entropy + answers, etc.). It's clear RLIF is a promising direction.
(9/n)
Looking ahead, RLIF raises many open questions:
- Why does it work? What tasks benefit most?
- Can it scale to larger models? How does it relate to hallucination or memorization?
- Can RLIF complement RLHF or RLVR in real-world deployments?
- How well does it work for agentic tasks?
(10/n)
We're just scratching the surface, with lots more to explore. We'd love to hear your thoughts or collaborate!
Special thanks to Zhewei for leading both projects with me, Aosong for the support, and @Sergey Levine and @Dawn Song for their invaluable guidance!
(11/n)