Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning
Abstract
Slow-Fast Policy Optimization (SFPO) enhances reinforcement learning training in large language models by improving stability, reducing rollouts, and accelerating convergence compared to Group Relative Policy Optimization (GRPO).
Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework that addresses these limitations by decomposing each policy update into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design leaves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces the number of rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on average across math reasoning benchmarks, and it needs up to 4.93x fewer rollouts and 4.19x less wall-clock time to match GRPO's best accuracy.
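Since the page only carries the abstract, here is a minimal sketch of the three-stage update it describes. It is an illustrative reconstruction, not the authors' implementation: the interpolation-based reposition rule, the plain SGD optimizers, and the names `grpo_loss_fn`, `num_fast_steps`, `fast_lr`, `reposition_alpha`, and `slow_lr` are all assumptions made for the example.

```python
# Illustrative sketch of a "fast steps -> reposition -> slow step" update.
# NOT the authors' code: the reposition rule and all hyperparameter names
# below are assumptions for illustration only.
import copy
import torch


def sfpo_style_step(policy, grpo_loss_fn, batch,
                    num_fast_steps=3, fast_lr=1e-6,
                    reposition_alpha=0.5, slow_lr=1e-6):
    """One outer update: fast inner steps on the same rollout batch,
    a reposition toward the starting weights to limit off-policy drift,
    then a final slow correction step."""
    # Snapshot the starting (slow) weights before the fast trajectory.
    slow_weights = copy.deepcopy(policy.state_dict())

    # Stage 1: short fast trajectory of inner steps on the same batch.
    fast_opt = torch.optim.SGD(policy.parameters(), lr=fast_lr)
    for _ in range(num_fast_steps):
        fast_opt.zero_grad()
        grpo_loss_fn(policy, batch).backward()
        fast_opt.step()

    # Stage 2: reposition -- interpolate the fast weights back toward the
    # snapshot so the final update is taken from a less off-policy point
    # (the interpolation form is an assumption of this sketch).
    with torch.no_grad():
        for name, param in policy.named_parameters():
            param.mul_(reposition_alpha).add_(
                slow_weights[name], alpha=1.0 - reposition_alpha)

    # Stage 3: final slow correction step from the repositioned weights.
    slow_opt = torch.optim.SGD(policy.parameters(), lr=slow_lr)
    slow_opt.zero_grad()
    grpo_loss_fn(policy, batch).backward()
    slow_opt.step()
```

Because the rollouts and the GRPO objective are untouched, a wrapper of this shape can replace the single gradient step in an existing policy-gradient training loop.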
Community
GitHub: https://github.com/Urheen/SFPO
Website: https://zkbig.github.io/Slow_Fast_Policy_Optimization.github.io/
arXiv: https://arxiv.org/abs/2510.04072
Librarian Bot: the following similar papers were recommended by the Semantic Scholar API:
- SPEC-RL: Accelerating On-Policy Reinforcement Learning via Speculative Rollouts (2025)
- Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs? (2025)
- ACPO: Adaptive Curriculum Policy Optimization for Aligning Vision-Language Models in Complex Reasoning (2025)
- HAEPO: History-Aggregated Exploratory Policy Optimization (2025)
- Improving Sampling Efficiency in RLVR through Adaptive Rollout and Response Reuse (2025)
- Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning (2025)
- Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training (2025)