Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers
Abstract
BFS-Prover-V2 addresses scaling challenges in automated theorem proving by integrating a multi-turn off-policy RL framework and a planner-enhanced multi-agent search architecture, achieving state-of-the-art results on formal mathematics benchmarks.
The integration of Large Language Models (LLMs) into automated theorem proving has shown immense promise, yet is fundamentally constrained by challenges in scaling up both training-time reinforcement learning (RL) and inference-time compute. This paper introduces BFS-Prover-V2, a system designed to address this dual scaling problem. We present two primary innovations. The first is a novel multi-turn off-policy RL framework for continually improving the performance of an LLM step-prover at training time. This framework, inspired by the principles of AlphaZero, utilizes a multi-stage expert iteration pipeline featuring adaptive tactic-level data filtering and periodic retraining to surmount the performance plateaus that typically curtail long-term RL in LLM-based agents. The second innovation is a planner-enhanced multi-agent search architecture that scales reasoning capabilities at inference time. This architecture employs a general reasoning model as a high-level planner to iteratively decompose complex theorems into sequences of simpler subgoals. This hierarchical approach substantially reduces the search space, enabling a team of parallel prover agents to collaborate efficiently by leveraging a shared proof cache. We demonstrate that this dual approach to scaling yields state-of-the-art results on established formal mathematics benchmarks. BFS-Prover-V2 achieves 95.08% and 41.4% on the MiniF2F and ProofNet test sets, respectively. While demonstrated in the domain of formal mathematics, the RL and inference techniques presented in this work are of broader interest and may be applied to other domains requiring long-horizon, multi-turn reasoning and complex search.
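To make the abstract's two components more concrete, below are two minimal Python sketches. Both are readings of the abstract under stated assumptions; every helper name, signature, and heuristic in them is hypothetical rather than the authors' actual implementation.

First, the training-time side: a multi-stage expert-iteration loop with tactic-level filtering and periodic retraining.

```python
# A minimal sketch of the multi-stage expert-iteration loop described above.
# The callables `search_fn`, `difficulty_fn`, and `retrain_fn`, the record
# fields (.state, .tactic, .success), and the hard-example filtering heuristic
# are all illustrative assumptions, not the paper's actual interfaces.
def expert_iteration(policy, theorems, search_fn, difficulty_fn, retrain_fn,
                     n_stages=4, keep_fraction=0.5):
    for _ in range(n_stages):
        # 1) Collect tactic-level data by running proof search with the
        #    current policy; the data is off-policy with respect to the
        #    model that will eventually be trained on it.
        records = [r for thm in theorems for r in search_fn(policy, thm)]

        # 2) Adaptive tactic-level filtering: keep verifier-accepted steps
        #    and prioritize proof states the current policy still finds hard.
        successes = [r for r in records if r.success]
        successes.sort(key=lambda r: difficulty_fn(policy, r.state),
                       reverse=True)
        kept = successes[: int(len(successes) * keep_fraction)]

        # 3) Periodic retraining on the filtered data, which the abstract
        #    credits with surmounting the plateaus of long-running RL.
        policy = retrain_fn(policy, kept)
    return policy
```

Second, the inference-time side: a planner that iteratively decomposes unproven goals into subgoals for a pool of parallel prover agents sharing a proof cache.

```python
# A minimal sketch of the planner-enhanced multi-agent search, assuming a
# hypothetical `plan_fn` (high-level reasoning model that splits a goal into
# simpler subgoals) and `prove_fn` (a step-prover agent that tries to close a
# single subgoal, returning None on failure). Names are illustrative only.
from concurrent.futures import ThreadPoolExecutor

def hierarchical_prove(theorem, plan_fn, prove_fn, n_agents=4, max_rounds=3):
    proof_cache = {}  # subgoal -> proof, shared by all prover agents

    def attempt(subgoal):
        if subgoal in proof_cache:            # reuse another agent's work
            return subgoal, proof_cache[subgoal]
        proof = prove_fn(subgoal)
        if proof is not None:
            proof_cache[subgoal] = proof
        return subgoal, proof

    open_goals = [theorem]
    for _ in range(max_rounds):
        # A team of prover agents works on the open subgoals in parallel.
        with ThreadPoolExecutor(max_workers=n_agents) as pool:
            results = list(pool.map(attempt, open_goals))
        # Unclosed goals go back to the planner, which replaces each with a
        # sequence of simpler subgoals, shrinking the per-agent search space.
        open_goals = [g for goal, proof in results if proof is None
                      for g in plan_fn(goal)]
        if not open_goals:                    # every subgoal closed
            return proof_cache
    return None                               # search budget exhausted
```

In this reading, the shared `proof_cache` is what lets the agents collaborate rather than duplicate work, and the planner round-trip is what keeps each individual prover's search space small; how BFS-Prover-V2 actually schedules agents and merges subproofs is specified in the paper, not here.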
Community
Proposes scalable multi-turn off-policy RL and multi-agent tree search to enhance LLM step-proving capabilities.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning (2025)
- Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents (2025)
- Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning (2025)
- On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting (2025)
- GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning (2025)
- Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL (2025)
- PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning (2025)