Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions
Abstract
ReLIFT combines reinforcement learning with supervised fine-tuning, interleaving the two to address the limitations of RL and improve large language model reasoning across benchmarks while using only a small fraction of demonstration data.
Recent advances in large language model (LLM) reasoning have shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it primarily optimizes over the model's existing knowledge rather than facilitating the acquisition of new information. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at maintaining and improving performance on questions within the model's original capabilities, while SFT is more effective at enabling progress on questions beyond the model's current scope. Motivated by the complementary strengths of RL and SFT, we introduce a novel training approach, ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). In ReLIFT, the model is primarily trained using RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and training alternates between RL and fine-tuning to enhance the model's reasoning abilities. ReLIFT achieves an average improvement of over +5.2 points across five competition-level benchmarks and one out-of-distribution benchmark compared to other zero-RL models. Furthermore, we demonstrate that ReLIFT outperforms both RL and SFT while using only 13% of the detailed demonstration data, highlighting its scalability. These results provide compelling evidence that ReLIFT overcomes the fundamental limitations of RL and underscore the significant potential of interleaving RL with online fine-tuning.
Community
Promoting Our Recent Work!
This work focuses on how to overcome the inherent limitations of reinforcement learning (RL) and develop new training paradigms. We present an initial exploration here and will continue to advance this line of research. All code and models are fully open-sourced!
In recent years, large language models (LLMs) have made significant progress in reasoning, largely driven by reinforcement learning (RL) post-training, including RLHF (Reinforcement Learning from Human Feedback). However, existing RL methods are essentially "in-distribution optimizers": they mainly improve performance on problems within the scope of the model's existing knowledge, making it difficult to surpass the capability ceiling of the base model. As a result, RL struggles to facilitate the acquisition of new knowledge and the development of higher-order reasoning skills.
Supervised Fine-Tuning (SFT) is widely used in LLMs to introduce new knowledge and reasoning patterns through high-quality demonstration data. SFT is particularly effective at improving model performance on problems beyond its original capabilities, especially for smaller models. However, SFT heavily relies on high-quality demonstration data and generally underperforms RL in terms of out-of-distribution (OOD) generalization. The respective strengths and weaknesses of RL and SFT inspire an important research direction: how to effectively combine the two approaches to enhance both reasoning and generalization abilities, while reducing dependence on expensive demonstration data, thereby breaking through existing cognitive bottlenecks.
In this work, we systematically analyze the dynamic behaviors of RL and SFT during training. Our experiments show that RL excels at consolidating and improving performance within the model’s existing capability range, while SFT is more effective at driving progress on more challenging problems. Notably, SFT may lead to performance degradation on simple problems and tends to generate more verbose answers, whereas RL offers limited improvement on difficult problems. Based on these findings, we propose a new training method—ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). During RL training, ReLIFT dynamically collects hard problems that the model struggles with, obtains high-quality chain-of-thought (CoT) demonstrations for these cases, and alternates between RL and SFT training to fully leverage their complementary strengths.
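To make the alternation concrete, below is a minimal Python sketch of an interleaved loop of this kind, written under our own assumptions: the helpers `rl_update`, `sft_update`, `get_demonstrations`, and `reward_fn`, the batching, the failure criterion (no rollout earning any reward), and the SFT interval are all illustrative placeholders, not the released ReLIFT implementation.

```python
# Minimal sketch of an RL loop interleaved with online fine-tuning on hard questions.
# All injected helpers are assumed interfaces for illustration, not the paper's code.
import random
from typing import Callable, List

def relift_train(
    model,
    questions: List[str],
    rl_update: Callable,           # one policy-gradient step (e.g., a GRPO/PPO-style update) on a batch of rollouts
    sft_update: Callable,          # one supervised fine-tuning pass on chain-of-thought demonstrations
    get_demonstrations: Callable,  # fetch high-quality CoT solutions for the given hard questions
    reward_fn: Callable,           # verifiable reward, e.g., 1.0 if the final answer is correct, else 0.0
    num_steps: int = 1000,
    rollouts_per_question: int = 8,
    sft_interval: int = 50,
    batch_size: int = 32,
):
    hard_questions = []
    for step in range(num_steps):
        batch = random.sample(questions, batch_size)
        rollouts = {q: [model.generate(q) for _ in range(rollouts_per_question)] for q in batch}

        # RL phase: update the policy on its own rollouts.
        rl_update(model, batch, rollouts, reward_fn)

        # Collect questions the current policy fails on (no rollout earns any reward).
        for q in batch:
            if max(reward_fn(q, o) for o in rollouts[q]) == 0:
                hard_questions.append(q)

        # Interleaved SFT phase: fine-tune on demonstrations for the accumulated hard questions only.
        if (step + 1) % sft_interval == 0 and hard_questions:
            demos = get_demonstrations(hard_questions)
            sft_update(model, demos)
            hard_questions.clear()
    return model
```

In practice, the RL step would typically be a verifiable-reward policy-gradient update, and the demonstrations for hard questions could come from a stronger teacher model or human annotators; the key point is that SFT data is only spent on the questions RL cannot yet solve.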
Experiments on five mathematical benchmarks and one out-of-distribution benchmark show that ReLIFT achieves a new SOTA accuracy of 51.1% on the Qwen2.5-Math-7B model, a 5.2 percentage point improvement over the strongest zero-RL baseline. Moreover, ReLIFT requires only 13% of the detailed demonstration data to outperform pure RL and pure SFT methods, and it significantly reduces the length of generated answers (about one-tenth that of SFT), greatly improving reasoning efficiency and practicality. Additionally, ReLIFT demonstrates superior generalization and stability even on smaller and weaker base models.
In summary, ReLIFT effectively overcomes the fundamental limitations of RL, offering efficiency, scalability, and strong generalization. It provides new insights and evidence for the continued advancement of reasoning abilities in large language models.
Our ultimate goal is to design a data engine where data annotation and model optimization proceed simultaneously, continuously pushing the boundaries of model capabilities!
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (2025)
- Learning to Reason under Off-Policy Guidance (2025)
- Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start (2025)
- AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning (2025)
- QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning (2025)
- TTRL: Test-Time Reinforcement Learning (2025)
- Learning Like Humans: Advancing LLM Reasoning Capabilities via Adaptive Difficulty Curriculum Learning and Expert-Guided Self-Reformulation (2025)