ReasonFlux-PRM

Code | Paper

We introduce ReasonFlux-PRM, a trajectory-aware process reward model (PRM) explicitly designed to evaluate trajectory-response reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. It supports both offline and online reward supervision: selecting high-quality training data for model distillation, providing dense process-level rewards for policy optimization during reinforcement learning, and enabling reward-guided test-time scaling.
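As a minimal illustration of step- and trajectory-level scoring, the sketch below loads the PRM with 🤗 Transformers and scores each prefix of a reasoning trace. The Hub ID, prompt format, and sequence-classification score head are assumptions, not the confirmed interface; consult the repository for exact usage.

```python
# Minimal sketch of trajectory-aware scoring with 🤗 Transformers. The Hub ID,
# prompt format, and sequence-classification score head are assumptions here;
# the released checkpoint may expose a different scoring interface.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "Gen-Verse/ReasonFlux-PRM-7B"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = "What is 17 * 24?"
steps = ["17 * 24 = 17 * 20 + 17 * 4", "= 340 + 68", "= 408"]

# Step-level rewards: score each trajectory prefix ending at step i.
step_rewards = []
for i in range(len(steps)):
    text = question + "\n" + "\n".join(steps[: i + 1])
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        step_rewards.append(model(**inputs).logits.squeeze().item())

# Trajectory-level reward: a simple mean over step rewards (illustrative only).
trajectory_reward = sum(step_rewards) / len(step_rewards)
print(step_rewards, trajectory_reward)
```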

| Model | Type | Size | Capabilities | Use Cases | Download |
|---|---|---|---|---|---|
| ReasonFlux-PRM | PRM | 7B | • Trajectory-aware scoring<br>• Online/offline supervision<br>• Dense process rewards | Data selection, RL training, test-time scaling | 🤗 7B |
| ReasonFlux-PRM | PRM | 1.5B | • Lightweight scoring<br>• Efficient inference<br>• Edge deployment | Resource-constrained applications | 🤗 1.5B |
| ReasonFlux-PRM-Qwen-2.5 | End-to-end trained policy model | 7B | • Long CoT reasoning<br>• Solving complex tasks and problems | Math and science reasoning | 🤗 7B |
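The reward-guided test-time scaling use case above amounts to best-of-N selection over sampled traces. A minimal sketch, assuming hypothetical `generate` and `prm_score` wrappers around the policy model and the PRM (neither name comes from the ReasonFlux codebase):

```python
# Best-of-N test-time scaling sketch. `generate(question)` is assumed to return
# a (trajectory, response) pair; `prm_score` returns a scalar reward.
def best_of_n(question: str, n: int = 8):
    # Sample N candidate (trajectory, response) pairs from the policy.
    candidates = [generate(question) for _ in range(n)]
    # Return the candidate whose reasoning trace the PRM scores highest.
    return max(candidates, key=lambda c: prm_score(question, c[0], c[1]))
```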

Note: We obtain ReasonFlux-PRM-Qwen-2.5-7B through an end-to-end training process: we first apply SFT on 1k trajectory-response pairs selected by ReasonFlux-PRM-7B, then run RL training with GRPO, using ReasonFlux-PRM-7B to provide the reward signal (sketched below).
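In code, the two roles the PRM plays in this pipeline look roughly as follows. `prm_score` is the same hypothetical wrapper as above, and the top-k size and reward-mixing weight are illustrative defaults, not the paper's settings.

```python
# Sketch of both pipeline stages, reusing the hypothetical prm_score wrapper.

# Offline: keep the top-k trajectory-response pairs as SFT distillation data.
def select_sft_data(pairs, k=1000):
    # pairs: iterable of (question, trajectory, response) tuples
    scored = sorted(pairs, key=lambda p: prm_score(*p), reverse=True)
    return scored[:k]

# Online: blend dense process rewards into the sparse outcome reward before
# GRPO's group-relative advantage normalization. alpha is an assumed weight.
def shaped_reward(question, trajectory, response, outcome_reward, alpha=0.5):
    return outcome_reward + alpha * prm_score(question, trajectory, response)
```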

Citation

@article{zou2025reasonfluxprm,
  title={ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs},
  author={Zou, Jiaru and Yang, Ling and Gu, Jingwen and Qiu, Jiahao and Shen, Ke and He, Jingrui and Wang, Mengdi},
  journal={arXiv preprint arXiv:2506.18896},
  year={2025}
}