arxiv:2506.14245

Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

Published on Jun 17 · Submitted by shun-zheng on Jun 18

Abstract

RLVR advances machine reasoning by incentivizing correct and logical thought chains, addressing limitations identified by a more precise evaluation metric, CoT-Pass@K.

AI-generated summary

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). However, a critical paradox clouds its efficacy: RLVR-tuned models often underperform their base models on the Pass@K metric for solution-finding, leading to the hypothesis that RLVR merely re-weights existing reasoning paths at the cost of reasoning diversity. In this work, we resolve this contradiction by identifying the source of the problem: the Pass@K metric itself is a flawed measure of reasoning, as it credits correct final answers that probably arise from inaccurate or incomplete chains of thought (CoTs). To address this, we introduce a more precise evaluation metric, CoT-Pass@K, which mandates that both the reasoning path and the final answer be correct. We provide a new theoretical foundation that formalizes how RLVR, unlike traditional RL, is uniquely structured to incentivize logical integrity. Our empirical results are supportive: using CoT-Pass@K, we observe that RLVR can incentivize the generalization of correct reasoning for all values of K. Furthermore, by analyzing the training dynamics, we find that this enhanced reasoning capability emerges early in the training process and smoothly generalizes. Our work provides a clear perspective on the role of RLVR, offers a more reliable method for its evaluation, and confirms its potential to genuinely advance machine reasoning.
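To make the difference between the two metrics concrete, here is a minimal Python sketch. It assumes the widely used unbiased Pass@K estimator, 1 - C(n-c, k)/C(n, k) for n sampled responses of which c count as correct, and assumes that CoT-Pass@K reuses the same estimator while counting a response only when both its final answer and its chain of thought are judged correct. The function names and the `(answer_ok, cot_ok)` sample format are hypothetical, and how a CoT gets judged correct is outside the scope of this sketch.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate from n samples, c of which count as correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def answer_pass_at_k(samples: list[tuple[bool, bool]], k: int) -> float:
    """Standard Pass@K: a sample counts if its final answer is correct."""
    c = sum(1 for answer_ok, _ in samples if answer_ok)
    return pass_at_k(len(samples), c, k)


def cot_pass_at_k(samples: list[tuple[bool, bool]], k: int) -> float:
    """Assumed CoT-Pass@K: a sample counts only if answer AND CoT are correct."""
    c = sum(1 for answer_ok, cot_ok in samples if answer_ok and cot_ok)
    return pass_at_k(len(samples), c, k)


# Hypothetical judgments for four samples of one prompt: (answer_ok, cot_ok).
samples = [(True, True), (True, False), (False, False), (True, False)]
```

With these made-up judgments, three of the four answers are correct but only one is backed by a correct chain of thought, so `answer_pass_at_k(samples, 2)` credits the prompt fully while `cot_pass_at_k(samples, 2)` does not. That gap is the kind of over-crediting the abstract attributes to Pass@K.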

Community

Paper author Paper submitter

We present a theoretical framework and empirical evidence demonstrating that reinforcement learning with verifiable rewards (RLVR) implicitly incentivizes correct reasoning in large language models (LLMs). This insight resolves a key debate in the field: whether RLVR-driven improvements extend beyond the inherent capabilities of base LLMs. While prevailing assumptions attribute gains in Pass@1 solely to the original Pass@K performance of pretrained models, our findings reveal that RLVR actively promotes deeper reasoning as training progresses.

Yeah, in hindsight it would be worthwhile to investigate what the actual CoTs behind the pass@x results of SOTA reasoning models contained.

Post-RLVR or distillation reasoning models generally demonstrate significantly higher probabilities of correct CoT reasoning compared to base models or instruction models.

As for SOTA reasoning models, most of their CoTs are actually correct.
