Agentic Reinforced Policy Optimization (ARPO)

This repository contains a model checkpoint for Agentic Reinforced Policy Optimization (ARPO), a novel agentic Reinforcement Learning (RL) algorithm designed for training multi-turn Large Language Model (LLM)-based agents.

The model was presented in the paper Agentic Reinforced Policy Optimization (arXiv: 2507.19849).

✨ Overview

ARPO addresses a key limitation of existing agentic RL methods: they inadequately balance LLMs' intrinsic long-horizon reasoning capabilities with their proficiency in multi-turn tool interactions. Preliminary experiments show that LLMs tend to exhibit highly uncertain behavior, marked by an increase in the entropy of the generated-token distribution, immediately after interacting with external tools. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism that dynamically balances global trajectory sampling with step-level sampling, promoting exploration at high-uncertainty steps that follow tool usage.
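
To make the adaptive rollout concrete, here is a minimal, illustrative Python sketch of an entropy-based branching decision: measure the entropy of the next-token distributions produced right after tool feedback and spawn extra step-level rollouts when it rises above a pre-tool baseline. The function names, the thresholding rule, and the averaging over the first post-tool tokens are assumptions for illustration, not the exact formulation from the paper.

import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    # Shannon entropy (in nats) of each next-token distribution.
    # logits: (num_tokens, vocab_size) scores for the tokens generated
    # immediately after a tool response is appended to the context.
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

def should_branch(post_tool_logits: torch.Tensor,
                  baseline_entropy: float,
                  delta: float = 0.3) -> bool:
    # Hypothetical rule: expand additional step-level (partial) rollouts
    # when the mean entropy right after tool feedback exceeds the
    # pre-tool baseline by more than `delta`.
    current = token_entropy(post_tool_logits).mean().item()
    return (current - baseline_entropy) > delta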

By integrating advantage attribution estimation, ARPO enables LLMs to internalize advantage differences across stepwise tool-use interactions. Experiments on 13 challenging benchmarks spanning computational reasoning, knowledge reasoning, and deep search demonstrate ARPO's superiority over trajectory-level RL algorithms. Notably, ARPO reaches this improved performance using only half the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments.
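
The advantage attribution idea can be illustrated with a heavily simplified sketch: rollouts that branch from a shared prefix receive a shared advantage on the prefix tokens and keep their own group-normalized advantage on the tokens generated after the branch point. Everything here (function names, the "hard" sharing rule, the GRPO-style normalization) is an assumption for illustration rather than the paper's exact estimator.

import numpy as np

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    # GRPO-style group-normalized advantages over all sampled rollouts.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def attribute_advantages(all_rewards, branch_ids, prefix_len, branch_lengths):
    # Illustrative "hard" attribution: the rollouts listed in `branch_ids`
    # share the first `prefix_len` tokens. Shared-prefix tokens receive the
    # mean advantage of those branches; tokens after the branch point keep
    # their own branch's advantage.
    adv = group_advantages(np.asarray(all_rewards, dtype=float))
    shared = adv[branch_ids].mean()
    per_token = []
    for idx, length in zip(branch_ids, branch_lengths):
        token_adv = np.full(length, adv[idx])
        token_adv[:prefix_len] = shared
        per_token.append(token_adv)
    return per_token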

[Figure: token entropy after tool-call feedback (left) and benchmark results across 13 datasets (right)]

  • In the figure (left), the initial tokens generated by the LLM after each round of tool-call feedback consistently exhibit high entropy, indicating that external tool calls introduce significant uncertainty into the LLM's reasoning process.
  • In the figure (right), ARPO's performance is validated across 13 datasets. Notably, Qwen3-14B trained with ARPO excels in Pass@5, achieving 61.2% on GAIA and 24.0% on HLE, while requiring only about half the tool calls of GRPO during training.

📣 Latest News

  • [July 29, 2025]: 📄 Our paper is now available on arXiv and Hugging Face Daily Papers.
  • [July 25, 2025]: 🔥 We released all of our ARPO model checkpoints (3B–14B) and datasets (SFT, RL, and evaluation). Check out the 🤗 ARPO Collection here. We will keep updating it!
  • [July 25, 2025]: 🚀 Full codebase released. ARPO supports multi-tool agentic RL training for Qwen2.5, Qwen3, and Llama3 models, with extensive tool-call acceleration and memory optimizations during RL training.

🔗 Links

⚑ Quick Start

This model can be loaded and used with the transformers library. Below is a basic example for text generation and multi-turn interaction. For more advanced usage, including multi-tool agentic RL training and evaluation, please refer to the official GitHub repository.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
# Replace "dongguanting/Qwen3-8B-ARPO-DeepSearch" with the specific model ID you want to use
model_id = "dongguanting/Qwen3-8B-ARPO-DeepSearch" # Example from the ARPO collection
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16, # Adjust dtype based on model requirements and hardware
    device_map="auto",          # Automatically maps the model to available devices (e.g., GPU)
    trust_remote_code=True,
)

# Prepare your conversational input
# The model supports multi-turn interactions and tool calls through its chat template.
messages = [
    {"role": "user", "content": "What is the capital of France? And what is the population of that city?"},
]

# Apply the chat template and tokenize
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate a response
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    eos_token_id=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|im_end|>")]
)

# Decode and print the generated text
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
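
To extend the snippet above to a second turn, one simple pattern is to append the model's reply and the output of an external tool as new messages, then re-apply the chat template and generate again. The sketch below reuses model, tokenizer, messages, and response from the snippet above; the tool observation is hard-coded and passed as a plain user message, which is a simplifying assumption, since the exact tool-message schema expected by the chat template may differ.

# Continue the conversation with a simulated tool observation.
messages.append({"role": "assistant", "content": response})
messages.append({
    "role": "user",
    # In agentic use this would be the output of a real tool call
    # (e.g. a web search); it is hard-coded here for illustration.
    "content": "Tool result: the 2024 population estimate for Paris is about 2.1 million."
})

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))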

📄 Citation

If you find this work helpful, please cite our paper:

@misc{dong2025arpo,
      title={Agentic Reinforced Policy Optimization}, 
      author={Guanting Dong and Hangyu Mao and Kai Ma and Licheng Bao and Yifei Chen and Zhongyuan Wang and Zhongxia Chen and Jiazhen Du and Huiyang Wang and Fuzheng Zhang and Guorui Zhou and Yutao Zhu and Ji-Rong Wen and Zhicheng Dou},
      year={2025},
      eprint={2507.19849},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.19849}, 
}

🤝 Acknowledgements

This training implementation builds upon Tool-Star, Llama Factory, verl and ReCall. For evaluation, we rely on WebThinker, HIRA, WebSailor, Search-o1, and FlashRAG. The Python interpreter design references ToRA and ToRL, while our models are trained using Qwen2.5. We express our sincere gratitude to these projects for their invaluable contributions to the open-source community.

📄 License

This project is released under the MIT License.

📞 Contact

For any questions or feedback, please reach out to us at [email protected].
