arxiv:2504.20571

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Published on Apr 29 · Submitted by akhaliq on Apr 30
#1 Paper of the day
Abstract

We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%, and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the aforementioned example. Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples (many of which yield approximately 30% or greater improvement on MATH500 when employed as a single training example). In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by adding entropy loss with an appropriate coefficient) in 1-shot RLVR training. As a bonus, we observe that applying entropy loss alone, without any outcome reward, significantly enhances Qwen2.5-Math-1.5B's performance on MATH500 by 27.4%. These findings can inspire future work on RLVR data efficiency and encourage a re-examination of both recent progress and the underlying mechanisms in RLVR. Our code, model, and data are open source at https://github.com/ypwang61/One-Shot-RLVR
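
To make the loss terms mentioned above concrete, here is a minimal, simplified sketch (not the authors' released implementation) of an RLVR objective that combines a GRPO-style policy gradient term with an entropy bonus. The function name, tensor shapes, and the entropy coefficient are assumptions, and the full GRPO objective additionally uses per-token clipped importance ratios.

```python
import torch

def rlvr_loss(seq_logprobs, rewards, mean_token_entropy, entropy_coef=1e-3):
    """Simplified RLVR objective: GRPO-style policy gradient + entropy bonus.

    seq_logprobs:       (G,) summed log-probabilities of G completions sampled
                        for the same prompt under the current policy.
    rewards:            (G,) verifiable rewards, e.g. 1.0 if the final answer
                        is correct and 0.0 otherwise.
    mean_token_entropy: scalar mean per-token entropy of the policy over the
                        sampled completions.
    entropy_coef:       entropy-bonus weight (the value here is an assumption;
                        the paper emphasizes choosing it appropriately).
    """
    # GRPO-style group-normalized advantages: each completion is scored
    # relative to the other samples drawn for the same prompt.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Policy gradient term: raise the log-probability of above-average samples.
    pg_loss = -(advantages.detach() * seq_logprobs).mean()

    # Subtracting the entropy term encourages exploration (higher entropy).
    return pg_loss - entropy_coef * mean_token_entropy
```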

Community

Paper submitter


What's more, we also find that {π_1, ..., π_16} gets better results than the subset consisting of 16 randomly sampled examples, showing that choosing based on historical variance score can perform better than random sampling.

I wonder if this means the performance relies heavily on data selection. But if the selection is based on training on the full dataset, it still costs a lot before the 1-shot training. Is there a simpler but effective way to select examples?

·

Hi, thanks for your interest in our work! Yes, we still need some data selection to achieve the best results, although a relatively random sample can still achieve a large improvement in 1-shot RLVR (maybe a 5-10% drop on MATH500 and a 2-3% drop on average compared to the best example). One good thing we have observed is that pi_1, selected by the historical variance score on Qwen2.5-Math-1.5B, works for Qwen2.5-Math-7B, Llama-3.2-3B-Instruct, and DeepSeek-R1-Distill-Qwen-1.5B as well, so one option may be to use a proxy model for selection.

Anyway, we think our current data selection is clearly not optimal; we hope future work can develop better data selection algorithms for RLVR!
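
For readers wondering what selection by a historical variance score could look like in code, a minimal sketch follows (the function and variable names are assumptions, not the released implementation): log each example's rollout accuracy across checkpoints of a prior RLVR run, then rank examples by the variance of that history.

```python
import numpy as np

def rank_by_historical_variance(acc_history):
    """Rank training examples by the variance of their accuracy over training.

    acc_history: array of shape (num_checkpoints, num_examples), where
                 acc_history[t, i] is example i's rollout accuracy recorded at
                 checkpoint t of a prior RLVR run (e.g. on a proxy model such
                 as Qwen2.5-Math-1.5B).
    Returns example indices sorted from highest to lowest historical variance.
    """
    variance = acc_history.var(axis=0)   # per-example variance across checkpoints
    return np.argsort(-variance)         # indices in descending variance order

# Hypothetical usage: keep the 16 highest-variance examples as the subset.
# history = np.load("per_example_accuracy.npy")
# subset_ids = rank_by_historical_variance(history)[:16]
```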

·
Paper author

Thanks for sharing our work!


Fascinating paper @ypwang61 !!

One thing I was curious about: have you looked into why training with pi_1 + pi_13 yields better results than pi_1 + pi_2? Is it more about diversity or about complementary reasoning patterns between the examples? Would love to understand this in more depth.

·
Paper author

Thanks for your interest! Yeah, this is a great question. In my opinion, combining examples that individually perform better tends to give better results: for example, pi_1 ≈ pi_13 > pi_2 in 1-shot RLVR performance, so combining pi_1 and pi_13 may lead to a stronger result. Similarly, pi_1 + ... + pi_16 works better than 16 randomly sampled examples.
Diversity should also matter to some extent, I think, but there are tons of ablation studies worth trying, and each takes resources, so we haven't run them yet. In general, I think future work should find better data selection methods.

