Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions
Abstract
ReLIFT combines reinforcement learning with supervised fine-tuning, interleaving the two to address the limitations of RL and improve large language model reasoning across benchmarks while using only a small fraction of demonstration data.
Recent advances in large language model (LLM) reasoning have shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it primarily optimizes over the model's existing knowledge rather than facilitating the acquisition of new information. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at maintaining and improving performance on questions within the model's original capabilities, while SFT is more effective at enabling progress on questions beyond the model's current scope. Motivated by the complementary strengths of RL and SFT, we introduce a novel training approach, ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). In ReLIFT, the model is primarily trained using RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and training alternates between RL and fine-tuning to enhance the model's reasoning abilities. ReLIFT achieves an average improvement of over +5.2 points across five competition-level benchmarks and one out-of-distribution benchmark compared to other zero-RL models. Furthermore, we demonstrate that ReLIFT outperforms both RL and SFT while using only 13% of the detailed demonstration data, highlighting its scalability. These results provide compelling evidence that ReLIFT overcomes the fundamental limitations of RL and underscore the significant potential of interleaving RL with online fine-tuning.
Community
Promoting Our Recent Work!
This work focuses on how to overcome the inherent limitations of reinforcement learning (RL) and develop new training paradigms. We present an initial exploration here and will continue to advance this line of research. All code and models are fully open-sourced!
In recent years, large language models (LLMs) have made significant progress in reasoning, largely driven by reinforcement learning (RL) post-training, including RLHF (Reinforcement Learning from Human Feedback). However, existing RL methods are essentially "in-distribution optimizers": they mainly improve performance on problems within the scope of the model's existing knowledge, making it difficult to surpass the capability ceiling of the base model. As a result, RL struggles to facilitate the acquisition of new knowledge and the development of higher-order reasoning skills.
Supervised Fine-Tuning (SFT) is widely used in LLMs to introduce new knowledge and reasoning patterns through high-quality demonstration data. SFT is particularly effective at improving model performance on problems beyond its original capabilities, especially for smaller models. However, SFT heavily relies on high-quality demonstration data and generally underperforms RL in terms of out-of-distribution (OOD) generalization. The respective strengths and weaknesses of RL and SFT inspire an important research direction: how to effectively combine the two approaches to enhance both reasoning and generalization abilities, while reducing dependence on expensive demonstration data, thereby breaking through existing cognitive bottlenecks.
In this work, we systematically analyze the dynamic behaviors of RL and SFT during training. Our experiments show that RL excels at consolidating and improving performance within the model’s existing capability range, while SFT is more effective at driving progress on more challenging problems. Notably, SFT may lead to performance degradation on simple problems and tends to generate more verbose answers, whereas RL offers limited improvement on difficult problems. Based on these findings, we propose a new training method—ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). During RL training, ReLIFT dynamically collects hard problems that the model struggles with, obtains high-quality chain-of-thought (CoT) demonstrations for these cases, and alternates between RL and SFT training to fully leverage their complementary strengths.
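To make the alternation concrete, below is a minimal Python sketch of an interleaved loop of this kind, written under our own assumptions: the helpers `rl_update`, `sft_update`, `get_demonstrations`, and `reward_fn`, the batching, the failure criterion (no rollout earning any reward), and the SFT interval are all illustrative placeholders, not the released ReLIFT implementation.

```python
# Minimal sketch of an RL loop interleaved with online fine-tuning on hard questions.
# All injected helpers are assumed interfaces for illustration, not the paper's code.
import random
from typing import Callable, List

def relift_train(
    model,
    questions: List[str],
    rl_update: Callable,           # one policy-gradient step (e.g., a GRPO/PPO-style update) on a batch of rollouts
    sft_update: Callable,          # one supervised fine-tuning pass on chain-of-thought demonstrations
    get_demonstrations: Callable,  # fetch high-quality CoT solutions for the given hard questions
    reward_fn: Callable,           # verifiable reward, e.g., 1.0 if the final answer is correct, else 0.0
    num_steps: int = 1000,
    rollouts_per_question: int = 8,
    sft_interval: int = 50,
    batch_size: int = 32,
):
    hard_questions = []
    for step in range(num_steps):
        batch = random.sample(questions, batch_size)
        rollouts = {q: [model.generate(q) for _ in range(rollouts_per_question)] for q in batch}

        # RL phase: update the policy on its own rollouts.
        rl_update(model, batch, rollouts, reward_fn)

        # Collect questions the current policy fails on (no rollout earns any reward).
        for q in batch:
            if max(reward_fn(q, o) for o in rollouts[q]) == 0:
                hard_questions.append(q)

        # Interleaved SFT phase: fine-tune on demonstrations for the accumulated hard questions only.
        if (step + 1) % sft_interval == 0 and hard_questions:
            demos = get_demonstrations(hard_questions)
            sft_update(model, demos)
            hard_questions.clear()
    return model
```

In practice, the RL step would typically be a verifiable-reward policy-gradient update, and the demonstrations for hard questions could come from a stronger teacher model or human annotators; the key point is that SFT data is only spent on the questions RL cannot yet solve.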
Experiments on five mathematical benchmarks and one out-of-distribution benchmark show that ReLIFT achieves a new SOTA accuracy of 51.1% on the Qwen2.5-Math-7B model, a 5.2 percentage point improvement over the strongest zero-RL baseline. Moreover, ReLIFT requires only 13% of the detailed demonstration data to outperform pure RL and pure SFT methods, and it significantly reduces the length of generated answers (about one-tenth that of SFT), greatly improving reasoning efficiency and practicality. Additionally, ReLIFT demonstrates superior generalization and stability even on smaller and weaker base models.
In summary, ReLIFT effectively overcomes the fundamental limitations of RL, offering efficiency, scalability, and strong generalization. It provides new insights and evidence for the continued advancement of reasoning abilities in large language models.
Our ultimate goal is to design a data engine where data annotation and model optimization proceed simultaneously, continuously pushing the boundaries of model capabilities!
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (2025)
- Learning to Reason under Off-Policy Guidance (2025)
- Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start (2025)
- AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning (2025)
- QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning (2025)
- TTRL: Test-Time Reinforcement Learning (2025)
- Learning Like Humans: Advancing LLM Reasoning Capabilities via Adaptive Difficulty Curriculum Learning and Expert-Guided Self-Reformulation (2025)