Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting
Abstract
LENS modifies GRPO by assigning confidence-dependent rewards to incorrect responses, improving efficiency and performance in reinforcement learning with verifiable rewards.
Reinforcement learning with verifiable rewards (RLVR) has become a standard recipe for improving large language models (LLMs) on reasoning tasks, with Group Relative Policy Optimization (GRPO) widely used in practice. Yet GRPO wastes substantial compute on negative groups: groups in which no sampled response is correct yield zero advantage and thus no gradient. We ask whether negative groups can be leveraged without extra supervision. Starting from a maximum-likelihood (MLE) objective in reward modeling, we show that the MLE gradient is equivalent to a policy gradient for a modified value function. This value function adds a confidence-weighted penalty on incorrect responses, imposing larger penalties on more confident mistakes. We refer to this as Likelihood Estimation with Negative Samples (LENS). LENS modifies GRPO to assign non-zero, confidence-dependent rewards to incorrect generations, making negative groups informative and converting previously wasted samples into useful gradient updates. On the MATH benchmark with Llama-3.1-8B and Qwen-2.5-3B, the proposed variant consistently outperforms the GRPO baseline, with significant gains on harder problems. These results demonstrate a principled and practical way to "rescue" negative groups, improving efficiency and performance in RLVR.
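To make the failure mode concrete, here is a minimal sketch (not the paper's implementation) contrasting standard GRPO group normalization with a LENS-style confidence-weighted reward for incorrect responses. The helper names (`grpo_advantages`, `lens_rewards`), the `penalty_scale` knob, and the use of a length-normalized sequence probability as the confidence measure are illustrative assumptions; the actual penalty in LENS follows from the paper's MLE derivation and may take a different form.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Standard GRPO: group-normalized advantages.
    If every response in the group is wrong (all rewards 0),
    the advantages are all zero and the group yields no gradient."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def lens_rewards(correct, confidences, penalty_scale=1.0):
    """Illustrative LENS-style reward shaping: correct responses keep
    reward 1; incorrect responses receive a negative, confidence-weighted
    reward, so more confident mistakes are penalized more heavily."""
    correct = np.asarray(correct, dtype=bool)
    conf = np.asarray(confidences, dtype=float)
    return np.where(correct, 1.0, -penalty_scale * conf)

# An all-incorrect ("negative") group: GRPO produces zero advantages,
# while confidence-weighted rewards still yield a non-trivial signal.
correct = [False, False, False, False]
conf = [0.9, 0.6, 0.3, 0.1]  # e.g. length-normalized sequence probabilities

print(grpo_advantages([1.0 if c else 0.0 for c in correct]))  # ~[0, 0, 0, 0]
print(grpo_advantages(lens_rewards(correct, conf)))           # non-zero
```

In an all-incorrect group the binary rewards are identical, so group normalization cancels them, whereas the confidence-weighted rewards still rank the wrong answers and push probability mass away from the most confident mistakes.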
Community
A new algorithm to make use of negative groups in GRPO.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- NGRPO: Negative-enhanced Group Relative Policy Optimization (2025)
- ConfClip: Confidence-Weighted and Clipped Reward for Reinforcement Learning in LLMs (2025)
- Improving Sampling Efficiency in RLVR through Adaptive Rollout and Response Reuse (2025)
- RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization (2025)
- Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense (2025)
- $\lambda$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences (2025)
- Towards Efficient Online Exploration for Reinforcement Learning with Human Feedback (2025)
Using negative (bad) data is something I have always been thinking about.
I like this