Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward
Abstract
Low-probability Regularization (Lp-Reg) enhances exploration in Reinforcement Learning with Verifiable Rewards (RLVR) by preserving valuable low-probability tokens, leading to improved performance in complex reasoning tasks.
Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck where performance plateaus as policy entropy collapses, signaling a loss of exploration. Previous methods typically address this by maintaining high policy entropy, yet the precise mechanisms that govern meaningful exploration remain underexplored. Our analysis suggests that an unselective focus on entropy risks amplifying irrelevant tokens and destabilizing training. This paper investigates the exploration dynamics within RLVR and identifies a key issue: the gradual elimination of valuable low-probability exploratory tokens, which we term *reasoning sparks*. We find that while abundant in pre-trained models, these sparks are systematically extinguished during RLVR due to over-penalization, leading to a degeneracy in exploration. To address this, we introduce Low-probability Regularization (Lp-Reg). Its core mechanism regularizes the policy towards a heuristic proxy distribution. This proxy is constructed by filtering out presumed noise tokens and re-normalizing the distribution over the remaining candidates. The result is a less-noisy proxy in which the probability of reasoning sparks is amplified; this proxy then serves as a soft regularization target that shields these valuable tokens from elimination via a KL-divergence term. Experiments show that Lp-Reg enables stable on-policy training for around 1,000 steps, a regime in which baseline entropy-control methods collapse. This sustained exploration leads to state-of-the-art performance, achieving a 60.17% average accuracy on five math benchmarks, an improvement of 2.66% over prior methods. Code is available at https://github.com/CarlanLark/Lp-Reg.
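The following is a minimal PyTorch sketch of the regularizer as described in the abstract, not the authors' implementation. It assumes the "noise" filter is a simple probability cutoff (`min_prob`) and that the regularizer is a KL divergence between the proxy and the current policy; the function name `lp_reg_loss` and the default threshold are hypothetical.

```python
# Illustrative sketch of the Lp-Reg mechanism described in the abstract.
# Assumptions (not from the paper): noise tokens are those below a fixed
# probability cutoff `min_prob`, and the regularizer is KL(proxy || policy).

import torch
import torch.nn.functional as F


def lp_reg_loss(logits: torch.Tensor, min_prob: float = 1e-3) -> torch.Tensor:
    """KL-regularize the policy toward a filtered, re-normalized proxy.

    Args:
        logits: unnormalized policy scores, shape (batch, vocab_size).
        min_prob: heuristic cutoff below which tokens are treated as noise.
    Returns:
        A scalar regularization term to add to the RLVR objective.
    """
    probs = F.softmax(logits, dim=-1)

    # Build the proxy: drop presumed-noise tokens, always keep the argmax so
    # the proxy is never empty, then re-normalize. Surviving low-probability
    # tokens ("reasoning sparks") gain relative mass.
    keep = probs >= min_prob
    keep |= probs == probs.max(dim=-1, keepdim=True).values
    proxy = probs * keep
    proxy = proxy / proxy.sum(dim=-1, keepdim=True)
    proxy = proxy.detach()  # fixed target: gradients flow only through the policy

    # Soft regularization: KL(proxy || policy) penalizes the policy for driving
    # kept tokens, including the sparks, toward zero probability.
    log_probs = F.log_softmax(logits, dim=-1)
    kl = proxy * (torch.log(proxy.clamp_min(1e-12)) - log_probs)
    return kl.sum(dim=-1).mean()


if __name__ == "__main__":
    # Toy usage: in practice this term would be weighted and added to the
    # clipped policy-gradient loss of the RLVR objective.
    logits = torch.randn(4, 1000, requires_grad=True)
    reg = lp_reg_loss(logits, min_prob=2e-3)
    reg.backward()
    print(f"Lp-Reg term: {reg.item():.4f}")
```

The exact noise-filtering rule and the weight given to the KL term are design choices specified in the paper and its repository, so the above should be read only as a sketch of the mechanism.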
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Rethinking Entropy Regularization in Large Reasoning Models (2025)
- EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget (2025)
- CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning (2025)
- Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs? (2025)
- DCPO: Dynamic Clipping Policy Optimization (2025)
- Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning (2025)
- Quantile Advantage Estimation for Entropy-Safe Reasoning (2025)