arxiv:2507.14843

The Invisible Leash: Why RLVR May Not Escape Its Origin

Published on Jul 20
· Submitted by fangwu97 on Jul 22
#3 Paper of the day

Abstract

Theoretical and empirical analysis reveals that Reinforcement Learning with Verifiable Rewards (RLVR) enhances precision but narrows exploration, limiting its ability to discover novel solutions.

AI-generated summary

Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI's capabilities, particularly in solving complex logical tasks. However, it remains unclear whether RLVR truly expands a model's reasoning boundary or merely amplifies high-reward outputs that the base model already knows in order to improve precision. This study presents a theoretical and empirical investigation that provides fresh insights into the potential limits of RLVR. First, we offer a new theoretical perspective: RLVR is constrained by the base model's support, unable to sample solutions with zero initial probability, and operates as a conservative reweighting mechanism that may restrict the discovery of entirely original solutions. We also identify an entropy-reward tradeoff: while RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while RLVR consistently improves pass@1, the shrinkage of empirical support generally outweighs its expansion under larger sampling budgets, and the trained model fails to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy, resulting in greater uncertainty at each generation step, answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of RLVR in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
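Stated compactly, the support constraint described above can be paraphrased as follows (a paraphrase in assumed notation, with \( q \) the base model and \( \pi_\theta \) the RLVR-trained policy; not the paper's exact theorem statement):

\[
q(y \mid x) = 0 \;\Longrightarrow\; \pi_\theta(y \mid x) = 0,
\qquad\text{i.e.,}\qquad
\operatorname{supp}\bigl(\pi_\theta(\cdot \mid x)\bigr) \subseteq \operatorname{supp}\bigl(q(\cdot \mid x)\bigr).
\]

In words, RLVR can redistribute probability mass within the base model's support but cannot place mass on completions to which the base model assigns zero probability.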

Community

🚀 New Paper Alert: "The Invisible Leash of RLVR"

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful tool for improving reasoning accuracy in large models. But does it truly extend reasoning capabilities—or just reweight what the base model already knows?

Our new study explores this question through theory and large-scale experiments. We show that RLVR operates within the support of the base model—unable to reach novel completions with zero initial probability. While it improves precision (e.g., pass@1), this comes at a cost: entropy–reward tradeoffs often lead to exploration collapse, shrinking the model’s effective solution space.
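As a concrete illustration of this reweighting view, here is a minimal toy sketch (illustrative only, not the code used in our experiments; the distribution `q`, rewards, and scaling factor are made up) showing that no amount of reward amplification can move probability mass onto an answer the base model assigns zero probability:

```python
# Toy sketch of the support constraint: RLVR viewed as reward-driven
# reweighting of a base distribution q over candidate answers.
import numpy as np

q = np.array([0.6, 0.3, 0.1, 0.0])       # base-model probabilities; the last answer lies outside supp(q)
reward = np.array([0.0, 1.0, 1.0, 1.0])  # verifiable reward (1 = correct), including for the unreachable answer
beta = 5.0                               # reward-scaling strength (arbitrary value for the illustration)

# Exponentially reweight the base distribution by reward, then renormalize.
weights = q * np.exp(beta * reward)
pi_rlvr = weights / weights.sum()

print(pi_rlvr)
# Mass concentrates on the correct answers inside supp(q) (precision improves),
# but the zero-probability answer stays at exactly 0.0 for any value of beta.
```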

We also uncover a surprising phenomenon: RLVR can increase token-level entropy (more local uncertainty) while reducing answer-level entropy (less global diversity)—revealing an "invisible leash" on generative diversity.
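The following toy calculation (numbers are our own illustration, not measurements from the paper; path entropy stands in for per-step token uncertainty) shows how the two entropies can move in opposite directions when many distinct decoding paths converge on a single final answer:

```python
# Toy example: path-level entropy rises while answer-level entropy falls.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Each entry is (path probability, final answer); paths stand in for full token sequences.
base = [(0.5, "A"), (0.5, "B")]                              # two paths, two distinct answers
rlvr = [(0.25, "A"), (0.25, "A"), (0.25, "A"), (0.25, "A")]  # four paths, all reaching answer "A"

def answer_entropy(paths):
    marginal = {}
    for p, ans in paths:
        marginal[ans] = marginal.get(ans, 0.0) + p
    return entropy(marginal.values())

print(entropy([p for p, _ in base]), "->", entropy([p for p, _ in rlvr]))  # path entropy: 1.0 -> 2.0 bits
print(answer_entropy(base), "->", answer_entropy(rlvr))                    # answer entropy: 1.0 -> 0.0 bits
```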

🧠 If we're to push beyond current reasoning limits, we may need explicit exploration, diversity-promoting objectives, or hybrid fine-tuning strategies.

Thank you for your innovative work and valuable contributions in this paper. Regarding the inductive step in the proof of Theorem 2.2 in Appendix A.1, we sincerely seek further clarification on one subtle point: the step from "\( y^* \) contributes no gradient" to the conclusion that the updated model satisfies \( \pi_{\theta'}(y^*|x) = 0 \) is not immediately obvious. Would you be kind enough to provide a more detailed explanation of this logical connection?


Thank you for your thoughtful question and for engaging deeply with the proof in Appendix A.1. You're absolutely right that the inductive step in Theorem 2.2 hinges on a subtle but important point regarding the gradient behavior.

To clarify: the key idea is that if a candidate completion \( y^* \notin \mathrm{supp}(q) \), then its probability under the base model is \( q(y^*|x) = 0 \), and under typical gradient-based updates (e.g., policy gradient or GRPO), its contribution to the gradient is zero. That is, the loss function (which involves terms like \( \log \pi_\theta(y|x) \)) is undefined or yields zero gradient at points where \( \pi_\theta(y|x) = 0 \), and the update rule does not increase probability mass on such completions unless they receive nonzero gradient signal.

Since \( y^* \) receives zero gradient at each update step, and we assume standard initialization where \( \pi^{(0)}(y^*|x) = 0 \) (i.e., it follows \( q \)), it remains at zero probability under all subsequent updates. That is, for all \( t \), \( \pi^{(t)}(y^*|x) = 0 \). Thus, the updated policy \( \pi_{\theta'} \) still satisfies \( \pi_{\theta'}(y^*|x) = 0 \), completing the inductive argument.
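For extra intuition on the sampled-gradient case, here is a small Monte-Carlo sketch (a toy setup, not the code from the paper; the policy, rewards, and batch size are invented for illustration): rollouts are drawn from the current policy, so a zero-probability completion never appears in a batch and therefore never contributes a \( \nabla_\theta \log \pi_\theta(y^*|x) \) term to the estimator.

```python
# Toy REINFORCE-style illustration: gradient signal accrues only to sampled completions,
# and a completion with probability zero under pi_theta is never sampled.
import numpy as np

rng = np.random.default_rng(0)
completions = ["y1", "y2", "y_star"]
pi = np.array([0.7, 0.3, 0.0])                  # current policy; y_star has zero probability
reward = {"y1": 0.0, "y2": 1.0, "y_star": 1.0}  # y_star would be rewarded, if it were ever sampled

batch = rng.choice(completions, size=10_000, p=pi)  # Monte-Carlo rollouts from the current policy
grad_signal = {y: 0.0 for y in completions}
for y in batch:
    grad_signal[y] += reward[y]                 # each sample contributes reward * grad log pi(y|x)

print(grad_signal)
# y_star accumulates exactly zero signal, so nothing ever pushes pi(y_star|x) above zero,
# matching the inductive step pi^(t)(y_star|x) = 0 for every t.
```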

We will update the appendix to better articulate this assumption and its implications. Thank you again for pointing this out.
