DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
Abstract
DeepSearch integrates Monte Carlo Tree Search into RLVR training to enhance exploration and credit assignment, achieving state-of-the-art performance with reduced computational cost.
Although reinforcement learning with verifiable rewards (RLVR) has become an essential component for developing advanced reasoning skills in large language models (LLMs), contemporary studies document training plateaus that emerge after thousands of optimization steps, with performance gains diminishing despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practice, where models rely on limited rollouts that often miss critical reasoning paths and fail to cover the solution space systematically. We present DeepSearch, a framework that integrates Monte Carlo Tree Search directly into RLVR training. In contrast to existing methods that apply tree search only at inference, DeepSearch embeds structured search in the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration that drives diminishing returns over prolonged training. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) entropy-based guidance during selection that identifies confident paths for supervision, and (3) adaptive replay-buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95% average accuracy, establishing a new state of the art for 1.5B reasoning models while using 5.7x fewer GPU hours than extended-training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.
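To make the exploration idea concrete, here is a minimal sketch of contributions (1) and (2): every expandable node across the whole tree is scored (rather than descending a single root-to-leaf path), using a UCT-style value plus an entropy bonus. The node fields, scoring formula, and constants below are illustrative assumptions, not the paper's exact rules.

```python
# Illustrative sketch only: global frontier selection with an entropy bonus.
import math
from dataclasses import dataclass

@dataclass
class Node:
    prefix: str      # partial reasoning trace (steps generated so far)
    value: float     # mean verifiable reward of rollouts through this node
    visits: int      # number of rollouts through this node
    entropy: float   # policy entropy of the next-step distribution at this node

def frontier_score(node: Node, total_visits: int,
                   c_explore: float = 1.0, c_entropy: float = 0.5) -> float:
    """UCT-like score plus an entropy bonus, so uncertain-but-promising nodes get explored."""
    exploration = c_explore * math.sqrt(math.log(total_visits + 1) / (node.visits + 1))
    return node.value + exploration + c_entropy * node.entropy

def select_global_frontier(frontier: list[Node], k: int = 4) -> list[Node]:
    """Rank all expandable nodes in the tree and return the top-k to expand next."""
    total = sum(n.visits for n in frontier) + 1
    return sorted(frontier, key=lambda n: frontier_score(n, total), reverse=True)[:k]

# Example: three partial solutions competing for the next expansion budget.
frontier = [
    Node("Step 1: set x = ...", value=0.6, visits=8, entropy=0.4),
    Node("Step 1: try induction ...", value=0.5, visits=2, entropy=1.1),
    Node("Step 1: plug in n = 1 ...", value=0.2, visits=5, entropy=0.3),
]
for node in select_global_frontier(frontier, k=2):
    print(node.prefix)
```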
Community
🔥 Concise & Promotional
🚀 DeepSearch-1.5B sets a new SOTA in math reasoning for 1.5B LMs:
✅ 62.95% avg accuracy (+1.25% over prior best)
✅ 5.7× fewer GPU hours than extended training
Key idea: bring MCTS into training, not just inference, for systematic exploration & better credit assignment.
👉 Paper: https://arxiv.org/pdf/2509.25454
👉 Model: https://huggingface.co/fangwu97/DeepSearch-1.5B
🧠 Technical & Insightful
We introduce DeepSearch, a framework that integrates Monte Carlo Tree Search into RLVR training.
Unlike existing methods that restrict search to inference, DeepSearch systematically explores reasoning paths during training—achieving fine-grained credit assignment, efficient supervision, and robust exploration.
📊 Results:
-- 62.95% avg accuracy on math reasoning (SOTA for 1.5B models)
-- Outperforms Nemotron-Reasoning-Qwen-1.5B v2 by +1.25%
-- Uses 5.7× fewer GPU hours than extended-training approaches
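The comment above highlights fine-grained credit assignment. A minimal sketch of that idea, assuming each finished rollout's verifier reward (0 or 1) is backed up through every step on its path so intermediate steps receive their own value estimates; the node structure and averaging rule are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch only: per-step credit assignment via MCTS-style backup.
from dataclasses import dataclass, field

@dataclass
class StepNode:
    step: str                                        # reasoning step text at this node
    children: list["StepNode"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    @property
    def value(self) -> float:                        # mean verified reward through this step
        return self.value_sum / self.visits if self.visits else 0.0

def backpropagate(path: list[StepNode], reward: float) -> None:
    """Propagate the verifier's terminal reward to every step on the rollout path."""
    for node in path:
        node.visits += 1
        node.value_sum += reward

# Example: two rollouts share step A; only the one continuing with step B is verified correct.
root, a, b, c = StepNode("root"), StepNode("A"), StepNode("B"), StepNode("C")
root.children, a.children = [a], [b, c]
backpropagate([root, a, b], reward=1.0)              # correct final answer
backpropagate([root, a, c], reward=0.0)              # incorrect final answer
print(a.value, b.value, c.value)                     # 0.5 1.0 0.0 -> per-step supervision signal
```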
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism (2025)
- From Static to Dynamic: Adaptive Monte Carlo Search for Mathematical Process Supervision (2025)
- Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models (2025)
- TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling (2025)
- CLPO: Curriculum Learning meets Policy Optimization for LLM Reasoning (2025)
- Inpainting-Guided Policy Optimization for Diffusion Large Language Models (2025)
- MG2FlowNet: Accelerating High-Reward Sample Generation via Enhanced MCTS and Greediness Control (2025)
Does using the replay buffer risk making the training off-policy, and do you have ablations isolating the buffer's contribution?
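For context on the replay buffer, a minimal sketch assuming it caches verified solutions per problem and mixes them into later batches in place of fresh tree search; the class names, mixing ratio, eviction rule, and the off-policy note are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch only: solution caching with replay to save search compute.
import random
from collections import defaultdict

class SolutionCache:
    """Stores verified (problem -> solution trace) pairs discovered during search."""
    def __init__(self, max_per_problem: int = 4):
        self.max_per_problem = max_per_problem
        self._store: dict[str, list[str]] = defaultdict(list)

    def add(self, problem: str, trace: str, verified: bool) -> None:
        if verified and len(self._store[problem]) < self.max_per_problem:
            self._store[problem].append(trace)

    def has(self, problem: str) -> bool:
        return bool(self._store[problem])

    def sample(self, problem: str) -> str:
        return random.choice(self._store[problem])

def build_batch(problems, cache: SolutionCache, run_tree_search, replay_ratio: float = 0.5):
    """Mix cached solutions with fresh tree-search rollouts."""
    batch = []
    for p in problems:
        if cache.has(p) and random.random() < replay_ratio:
            # Cheap: replay a cached, verified solution. These traces were generated
            # by an earlier policy, so strictly speaking they are off-policy data.
            batch.append((p, cache.sample(p)))
        else:
            trace, verified = run_tree_search(p)     # expensive: expand the search tree
            cache.add(p, trace, verified)
            batch.append((p, trace))
    return batch

# Toy usage: a stand-in tree search that always returns a verified solution.
cache = SolutionCache()
batch = build_batch(["2+2=?"] * 3, cache, run_tree_search=lambda p: ("... so the answer is 4", True))
print(len(batch), cache.has("2+2=?"))
```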