
Excited to share the most inspiring work I've been part of this year:
"Learning to Reason without External Rewards"
TL;DR: We show that LLMs can learn complex reasoning without access to ground-truth answers, simply by optimizing their own internal sense of confidence. (1/n)
The journey began last fall when undergrad Zhewei reached out to collaborate on research. We started from two key observations:
1. In exams, people are often more accurate on questions they feel confident about. Could LLMs also exhibit this "confidence → correctness" pattern?
2. In test-time reasoning, techniques like long CoT or parallel scaling (e.g., majority voting) are common. But how should we choose among diverse outputs, especially in open-ended tasks like code generation?
(2/n)
We explored how to scale best-of-n selection. Existing heuristics like entropy and perplexity had issues: sensitivity to output length, bias, and poor scaling with more samples.
Then we had a key insight: measure how far each token's distribution is from uniform. The KL divergence KL(U ‖ P) quantifies how confidently a model predicts each token. We call this self-certainty. It's the reverse of entropy: mode-seeking rather than mode-covering.
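Here's a minimal sketch of the idea in PyTorch (my own illustration, not the paper's code): compute KL(U ‖ P) per token from the logits, average over the response, and pick the most self-certain candidate for best-of-n.

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Average KL(U || P) over token positions for one sampled response.

    logits: [seq_len, vocab_size] next-token logits. Higher scores mean the
    per-token distributions sit farther from uniform, i.e. more "certain".
    """
    log_p = F.log_softmax(logits, dim=-1)
    vocab_size = logits.size(-1)
    # KL(U || P) = sum_v (1/V) * log((1/V) / P(v)) = -log V - mean_v log P(v)
    kl_per_token = -math.log(vocab_size) - log_p.mean(dim=-1)
    return kl_per_token.mean()

def best_of_n(candidate_logits: list[torch.Tensor]) -> int:
    """Pick the index of the most self-certain candidate response."""
    scores = torch.stack([self_certainty(l) for l in candidate_logits])
    return int(scores.argmax())
```

Averaging per token keeps the score from simply growing with output length.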
We found self-certainty to be a highly effective signal:
1. When answers are known, self-certainty-weighted voting outperforms plain majority voting.
2. When answers are unknown, it still scales robustly with n.
(3/n)
This led to our first paper in February:
"Scalable Best-of-N Selection via Self-Certainty"
(4/n)
But we didn't stop there. A natural next question emerged: if self-certainty is a good evaluation signal, can it also be used as a reward to train the model?
Research works similarly: we build confidence through exploration and reflection. Can LLMs do the same?
(5/n)
This inspired our new paradigm: Reinforcement Learning from Internal Feedback (RLIF). (6/n)
Our method, Intuitor, uses self-certainty as the reward signal for reinforcement learning, with no external supervision required. (7/n)
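As a rough illustration of how the reward plugs in (my sketch, not the released code): score each sampled response with self-certainty, then standardize within the group of samples for the same prompt to get GRPO-style advantages.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages: standardize rewards within a group of responses
    sampled for the same prompt. Here the reward is each response's
    self-certainty score instead of a verifier or human label."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Hypothetical training loop (sample_group and grpo_policy_loss are placeholders):
# for prompt in batch:
#     responses, logits = sample_group(policy, prompt, group_size=G)
#     rewards = torch.stack([self_certainty(l) for l in logits])
#     advantages = group_relative_advantages(rewards)
#     loss = grpo_policy_loss(policy, prompt, responses, advantages)  # clipped PPO-style update
```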
And it works. Intuitor matches the performance of GRPO (which uses rule-based rewards) on math tasks and generalizes even better to code generation.
It learns structured reasoning, including planning ahead, decomposing problems, and even following instructions, all from intrinsic feedback.
(8/n)
On a personal note, this project gave me a lot of confidence, especially as we saw other concurrent work emerge (TTRL, entropy-based RL, semantic entropy + answers, etc.). It's clear RLIF is a promising direction.
(9/n)
Looking ahead, RLIF raises many open questions:
- Why does it work? What tasks benefit most?
- Can it scale to larger models? How does it relate to hallucination or memorization?
- Can RLIF complement RLHF or RLVR in real-world deployments?
- How well does it work for agentic tasks?
(10/n)
We're just scratching the surface, with lots more to explore. We'd love to hear your thoughts or collaborate!
Special thanks to Zhewei for leading both projects with me, Aosong for the support, and @Sergey Levine and @Dawn Song for their invaluable guidance!
(11/n)