Xuandong Zhao (@xuandongzhao) · May 27 · 12 tweets

šŸš€ Excited to share the most inspiring work I’ve been part of this year: "Learning to Reason without External Rewards" TL;DR: We show that LLMs can learn complex reasoning without access to ground-truth answers, simply by optimizing their own internal sense of confidence. 1/n

The journey began last fall when undergrad Zhewei reached out to collaborate on research. We started from two key observations:
1. In exams, people are often more accurate on questions they feel confident about. Could LLMs also exhibit this "confidence ≈ correctness" pattern?
2. In test-time reasoning, techniques like long CoT or parallel scaling (e.g., majority voting) are common. But how should we choose among diverse outputs, especially in open-ended tasks like code generation? (2/n)
We explored how to scale best-of-n selection. Existing heuristics like entropy and perplexity had issues: sensitivity to output length, bias, and poor scaling with more samples. Then we had a key insight: measure how far each token’s distribution is from uniform. The KL divergence KL(U‖P) quantifies how confidently a model predicts each token. We call this self-certainty. It is the reverse of entropy: mode-seeking rather than mode-covering. We found self-certainty to be a highly effective signal:
1. When answers are known, it outperforms majority voting via weighted voting.
2. When answers are unknown, it still scales robustly with n. (3/n)
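To make the signal concrete, here is a minimal sketch (my own illustration, not the paper's released code) of computing per-token KL(U‖P) from a model's logits and using it to pick the best of n samples; the names self_certainty and select_by_self_certainty are assumptions for this example:

```python
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> float:
    """Average per-token KL(U || P) over a generated sequence.

    logits: (seq_len, vocab_size) next-token logits for the tokens the model
    actually generated. Higher values mean the predictive distributions are
    farther from uniform, i.e. the model is more confident.
    """
    log_probs = F.log_softmax(logits, dim=-1)  # (T, V)
    vocab_size = logits.size(-1)
    # KL(U || P) at each position = -log V - (1/V) * sum_v log p_v
    kl_per_token = -torch.log(torch.tensor(float(vocab_size))) - log_probs.mean(dim=-1)
    return kl_per_token.mean().item()

def select_by_self_certainty(candidates: list[dict]) -> str:
    """Best-of-n selection: return the sampled answer whose logits give the
    highest self-certainty. Each candidate is {"text": str, "logits": Tensor}."""
    return max(candidates, key=lambda c: self_certainty(c["logits"]))["text"]
```

Because the uniform distribution is fixed, this reduces to an average negative log-probability over the whole vocabulary, which is length-normalized by construction.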
This led to our first paper in February: "Scalable Best-of-N Selection via Self-Certainty" (4/n)
But we didn’t stop there. A natural next question emerged: if self-certainty is a good evaluation signal, can it also be used as a reward to train the model? Human research works similarly: we build confidence through exploration and reflection. Can LLMs do the same? (5/n)
This inspired our new paradigm: Reinforcement Learning from Internal Feedback (RLIF). (6/n)
Our method, Intuitor, uses self-certainty as the reward signal for reinforcement learning—no external supervision required. (7/n)
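As a rough illustration (a sketch under my own assumptions, not the Intuitor implementation), the reward step of such an RLIF setup could look like a GRPO-style group normalization where the verifier or rule-based reward is simply replaced by self-certainty:

```python
import torch
import torch.nn.functional as F

def rlif_group_advantages(group_logits: list[torch.Tensor]) -> torch.Tensor:
    """For one prompt, score each sampled completion with self-certainty
    (average per-token KL(U || P)) and normalize within the group,
    GRPO-style. No verifier, reward model, or gold answer is involved.

    group_logits: one (seq_len, vocab_size) logit tensor per completion.
    Returns per-completion advantages for a policy-gradient update.
    """
    rewards = []
    for logits in group_logits:
        log_probs = F.log_softmax(logits, dim=-1)
        vocab_size = logits.size(-1)
        # self-certainty = -log V - mean_v log p_v, averaged over positions
        rewards.append((-torch.log(torch.tensor(float(vocab_size)))
                        - log_probs.mean(dim=-1)).mean())
    rewards = torch.stack(rewards)
    # Completions the model is relatively more confident about get positive advantage.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
```

These advantages would then feed a standard policy-gradient update; the only change from a verifier-based pipeline is where the reward comes from.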
And it works. Intuitor matches the performance of GRPO (which uses rule-based rewards) on math tasks and generalizes even better to code generation. It learns structured reasoning—planning ahead, decomposing problems, and even following instructions—all from intrinsic feedback. (8/n)
On a personal note, this project gave me a lot of confidence, especially as we saw other concurrent work emerge (TTRL, entropy-based RL, semantic entropy + answers, etc.). It’s clear RLIF is a promising direction. (9/n)
Looking ahead, RLIF raises many open questions:
- Why does it work? What tasks benefit most?
- Can it scale to larger models? How does it relate to hallucination or memorization?
- Can RLIF complement RLHF or RLVR in real-world deployments?
- How well does it work for agentic tasks? (10/n)
We're just scratching the surface—lots more to explore. We’d love to hear your thoughts or collaborate! Special thanks to Zhewei for leading both projects with me, Aosong for the support, and @Sergey Levine and @Dawn Song for their invaluable guidance! (11/n)