Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Abstract
We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%, and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the aforementioned example. Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples (many of which yield approximately 30% or greater improvement on MATH500 when employed as a single training example). In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by adding entropy loss with an appropriate coefficient) in 1-shot RLVR training. As a bonus, we observe that applying entropy loss alone, without any outcome reward, significantly enhances Qwen2.5-Math-1.5B's performance on MATH500 by 27.4%. These findings can inspire future work on RLVR data efficiency and encourage a re-examination of both recent progress and the underlying mechanisms in RLVR. Our code, model, and data are open source at https://github.com/ypwang61/One-Shot-RLVR
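To make the training objective described above concrete, here is a minimal sketch of a GRPO-style 1-shot RLVR loss with an entropy bonus, written in PyTorch. This is an illustrative reconstruction, not the authors' released implementation: the function name, tensor layout, clipping epsilon, and entropy coefficient are all assumptions.

```python
# Minimal sketch of a 1-shot RLVR objective: a GRPO-style policy-gradient
# loss over G sampled completions of the single training example, plus an
# entropy bonus that encourages exploration. Names and shapes are illustrative.
import torch

def one_shot_rlvr_loss(logprobs, old_logprobs, rewards, token_entropy,
                       mask, entropy_coef=0.001, clip_eps=0.2):
    """
    logprobs, old_logprobs: (G, T) per-token log-probs under the current /
        rollout policy for G completions of the one training example.
    rewards: (G,) verifiable 0/1 outcome rewards (final answer correct or not).
    token_entropy: (G, T) per-token entropy of the current policy.
    mask: (G, T) float mask, 1 for generated tokens, 0 for padding.
    """
    # GRPO advantage: normalize the outcome rewards within the group of rollouts.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # (G,)
    adv = adv.unsqueeze(-1)                                      # (G, 1)

    # PPO-style clipped policy-gradient term.
    ratio = torch.exp(logprobs - old_logprobs)                   # (G, T)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    pg_loss = -(torch.minimum(unclipped, clipped) * mask).sum() / mask.sum()

    # Entropy bonus: minimizing -entropy pushes the policy to keep exploring.
    ent_loss = -(token_entropy * mask).sum() / mask.sum()

    return pg_loss + entropy_coef * ent_loss
```

Here the group-normalized advantage supplies the learning signal from the G rollouts of the single example, and the entropy term corresponds to the exploration bonus that the abstract highlights as critical for 1-shot RLVR.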
Community
"What's more, we also find that {π_1, ..., π_16} gets better results than the subset consisting of 16 randomly sampled examples, showing that choosing based on the historical variance score can perform better than random sampling."
I wonder if this means the performance heavily relies on data selection. But if the selection is based on training on the full dataset, it still costs a lot of compute before the 1-shot training. Is there a simpler but still effective way to select examples?
Hi, thanks for your interest in our work! Yes, we still need some data selection to achieve the best results, although relatively random samples can still achieve a large improvement with 1-shot RLVR (maybe a 5-10% drop on MATH500 and a 2-3% drop on average compared to the best example). One encouraging thing we observed is that pi_1, selected via the historical variance score computed with Qwen2.5-Math-1.5B, also works for Qwen2.5-Math-7B, Llama-3.2-3B-Instruct, and DeepSeek-R1-Distill-Qwen-1.5B, so maybe a proxy model can be used for selection.
Anyway, we think our current data selection is clearly not optimal; we hope future work can develop better data selection algorithms for RLVR!
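For concreteness, here is a rough sketch of what selection by the historical variance score could look like in practice: rank examples by the variance of their per-example training accuracy recorded during a full-dataset RLVR run, and keep the top-k (k = 1 for 1-shot RLVR). The helper name and data layout are illustrative, not the paper's released code.

```python
# Hedged sketch of historical-variance-based data selection: score each
# training example by the variance of its accuracy history logged during a
# preliminary full-dataset RLVR run, then pick the highest-variance examples.
import numpy as np

def select_by_historical_variance(acc_history, k=1):
    """
    acc_history: dict mapping example_id -> list of training accuracies,
        one value per logged step of the full-dataset RLVR run.
    Returns the k example ids with the highest variance of their history.
    """
    scores = {ex_id: float(np.var(accs)) for ex_id, accs in acc_history.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Example: the example whose accuracy fluctuates the most gets selected.
history = {"pi_1": [0.0, 0.2, 0.9, 0.4], "pi_2": [0.8, 0.8, 0.9, 0.9]}
print(select_by_historical_variance(history, k=1))  # -> ['pi_1']
```

In the paper's setting, the accuracy histories come from a preliminary RLVR run on the full dataset, which is exactly the cost the question above points out; using a smaller proxy model for this scoring step, as suggested in the reply, is one way to reduce it.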
Thanks for sharing our work!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (2025)
- Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't (2025)
- TTRL: Test-Time Reinforcement Learning (2025)
- Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model (2025)
- Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning (2025)
- FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models (2025)
- Understanding R1-Zero-Like Training: A Critical Perspective (2025)
Fascinating paper @ypwang61 !!
One thing I was curious about: have you looked into why training with pi_1 + pi_13 yields better results than pi_1 + pi_2? Is it more about diversity or complementary reasoning patterns between the examples? I'd love to understand this in more depth.
Thanks for your interest! Yeah, this is a great question. In my opinion, combining examples that each perform well on their own tends to work better: for example, pi_1 ≈ pi_13 > pi_2 in 1-shot RLVR performance, so combining pi_1 and pi_13 may give better results than pi_1 and pi_2. Similarly, pi_1 + ... + pi_16 works better than 16 random examples.
Diversity should also matter to some extent, I think, but there are tons of ablation studies worth trying, and all of them take resources, so we haven't tried them yet. In general, I think future work should develop better data selection methods for RLVR.