arxiv:2502.10325

Process Reward Models for LLM Agents: Practical Framework and Directions

Published on Feb 14, 2025
Authors:

Abstract

We introduce Agent Process Reward Models (AgentPRM), a simple and scalable framework for training LLM agents to continually improve through interactions. AgentPRM follows a lightweight actor-critic paradigm, using Monte Carlo rollouts to compute reward targets and optimize policies. It requires minimal modifications to existing RLHF pipelines, making it easy to integrate at scale. Beyond AgentPRM, we propose InversePRM, which learns process rewards directly from demonstrations without explicit outcome supervision. We also explore key challenges and opportunities, including exploration, process reward shaping, and model-predictive reasoning. We evaluate on the ALFWorld benchmark, show that small 3B models trained with AgentPRM and InversePRM outperform strong GPT-4o baselines, and analyze test-time scaling, reward hacking, and more. Our code is available at: https://github.com/sanjibanc/agent_prm.
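To make the abstract's mention of "Monte Carlo rollouts to compute reward targets" concrete, below is a minimal sketch of how per-step value targets for a process reward model can be derived from rollout outcomes. This is not the released implementation; the `Step` dataclass, the function name, and the toy trajectory are hypothetical placeholders, and the only assumption is a sparse 0/1 task-success reward discounted back to each step.

```python
# Minimal sketch (not the authors' code) of labeling each step of a rollout
# with a Monte Carlo value target for training a process reward model (PRM).

from dataclasses import dataclass


@dataclass
class Step:
    observation: str    # agent's textual observation at this step
    action: str         # action proposed by the LLM policy
    target: float = 0.0 # Monte Carlo value target used to train the PRM


def monte_carlo_targets(trajectory: list[Step], outcome: float,
                        gamma: float = 1.0) -> list[Step]:
    """Label every step with the discounted terminal outcome (0/1 task success).

    For a length-T rollout, the target at step t is gamma**(T-1-t) * outcome;
    averaged over many rollouts sharing a prefix, this estimates the
    probability that the prefix eventually leads to success.
    """
    T = len(trajectory)
    for t, step in enumerate(trajectory):
        step.target = (gamma ** (T - 1 - t)) * outcome
    return trajectory


# Illustrative usage: a successful 3-step rollout (gamma=1) gives every prefix
# a target of 1.0, while a failed rollout would give 0.0.
demo = [Step("you are in the kitchen", "go to fridge"),
        Step("fridge is closed", "open fridge"),
        Step("fridge is open", "take apple")]
labeled = monte_carlo_targets(demo, outcome=1.0)
print([s.target for s in labeled])  # [1.0, 1.0, 1.0]
```

The labeled steps would then serve as regression targets for a PRM, which in turn scores intermediate actions when optimizing the policy in the actor-critic loop described above.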
