Agentic Reinforced Policy Optimization (ARPO)

This repository contains a model checkpoint for Agentic Reinforced Policy Optimization (ARPO), a novel agentic Reinforcement Learning (RL) algorithm designed for training multi-turn Large Language Model (LLM)-based agents.

The model was presented in the paper Agentic Reinforced Policy Optimization (arXiv: 2507.19849).

✨ Overview

ARPO addresses a key limitation of existing agentic RL methods: they inadequately balance LLMs' intrinsic long-horizon reasoning capabilities with their proficiency in multi-turn tool interactions. Preliminary experiments show that LLMs tend to exhibit highly uncertain behavior, marked by an increase in the entropy of the generated-token distribution, immediately after interacting with external tools. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism that dynamically balances global trajectory sampling with step-level sampling, promoting exploration at high-uncertainty steps that follow tool usage.
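
To make the adaptive rollout concrete, here is a minimal, illustrative Python sketch of an entropy-based branching decision: measure the entropy of the next-token distributions produced right after tool feedback and spawn extra step-level rollouts when it rises above a pre-tool baseline. The function names, the thresholding rule, and the averaging over the first post-tool tokens are assumptions for illustration, not the exact formulation from the paper.

import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    # Shannon entropy (in nats) of each next-token distribution.
    # logits: (num_tokens, vocab_size) scores for the tokens generated
    # immediately after a tool response is appended to the context.
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

def should_branch(post_tool_logits: torch.Tensor,
                  baseline_entropy: float,
                  delta: float = 0.3) -> bool:
    # Hypothetical rule: expand additional step-level (partial) rollouts
    # when the mean entropy right after tool feedback exceeds the
    # pre-tool baseline by more than `delta`.
    current = token_entropy(post_tool_logits).mean().item()
    return (current - baseline_entropy) > delta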

By integrating advantage attribution estimation, ARPO enables LLMs to internalize advantage differences across stepwise tool-use interactions. Experiments on 13 challenging benchmarks spanning computational reasoning, knowledge reasoning, and deep search demonstrate ARPO's superiority over trajectory-level RL algorithms. Notably, ARPO reaches this improved performance using only half the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments.
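
The advantage attribution idea can be illustrated with a heavily simplified sketch: rollouts that branch from a shared prefix receive a shared advantage on the prefix tokens and keep their own group-normalized advantage on the tokens generated after the branch point. Everything here (function names, the "hard" sharing rule, the GRPO-style normalization) is an assumption for illustration rather than the paper's exact estimator.

import numpy as np

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    # GRPO-style group-normalized advantages over all sampled rollouts.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def attribute_advantages(all_rewards, branch_ids, prefix_len, branch_lengths):
    # Illustrative "hard" attribution: the rollouts listed in `branch_ids`
    # share the first `prefix_len` tokens. Shared-prefix tokens receive the
    # mean advantage of those branches; tokens after the branch point keep
    # their own branch's advantage.
    adv = group_advantages(np.asarray(all_rewards, dtype=float))
    shared = adv[branch_ids].mean()
    per_token = []
    for idx, length in zip(branch_ids, branch_lengths):
        token_adv = np.full(length, adv[idx])
        token_adv[:prefix_len] = shared
        per_token.append(token_adv)
    return per_token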

[Figure: token entropy after tool-call feedback (left) and benchmark results across 13 datasets (right)]

  • In the figure (left), the initial tokens generated by the LLM after each round of tool-call feedback consistently exhibit high entropy, indicating that external tool calls introduce significant uncertainty into the LLM's reasoning process.
  • In the figure (right), ARPO's performance is validated across 13 datasets. Notably, Qwen3-14B trained with ARPO excels in Pass@5, achieving 61.2% on GAIA and 24.0% on HLE, while requiring only about half the tool calls of GRPO during training.

📣 Latest News

  • [July 29, 2025]: 📄 Our paper is now available on arXiv and Hugging Face Daily Papers.
  • [July 25, 2025]: 🔥 We released all of our ARPO model checkpoints (3B–14B) and datasets (SFT, RL, and evaluation). Check out the 🤗 ARPO Collection here. We will keep updating it!
  • [July 25, 2025]: 🚀 Full codebase released. ARPO supports multi-tool agentic RL training for Qwen2.5, Qwen3, and Llama3 models, with extensive tool-call acceleration and memory optimizations during RL training.

🔗 Links

⚑ Quick Start

This model can be loaded and used with the transformers library. Below is a basic example for text generation and multi-turn interaction. For more advanced usage, including multi-tool agentic RL training and evaluation, please refer to the official GitHub repository.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
# Replace "dongguanting/Qwen3-8B-ARPO-DeepSearch" with the specific model ID you want to use
model_id = "dongguanting/Qwen3-8B-ARPO-DeepSearch" # Example from the ARPO collection
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16, # Adjust dtype based on model requirements and hardware
    device_map="auto",          # Automatically maps the model to available devices (e.g., GPU)
    trust_remote_code=True,
)

# Prepare your conversational input
# The model supports multi-turn interactions and tool calls through its chat template.
messages = [
    {"role": "user", "content": "What is the capital of France? And what is the population of that city?"},
]

# Apply the chat template and tokenize
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate a response
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    eos_token_id=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|im_end|>")]
)

# Decode and print the generated text
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
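
To extend the snippet above to a second turn, one simple pattern is to append the model's reply and the output of an external tool as new messages, then re-apply the chat template and generate again. The sketch below reuses model, tokenizer, messages, and response from the snippet above; the tool observation is hard-coded and passed as a plain user message, which is a simplifying assumption, since the exact tool-message schema expected by the chat template may differ.

# Continue the conversation with a simulated tool observation.
messages.append({"role": "assistant", "content": response})
messages.append({
    "role": "user",
    # In agentic use this would be the output of a real tool call
    # (e.g. a web search); it is hard-coded here for illustration.
    "content": "Tool result: the 2024 population estimate for Paris is about 2.1 million."
})

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))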

📄 Citation

If you find this work helpful, please cite our paper:

@misc{dong2025arpo,
      title={Agentic Reinforced Policy Optimization}, 
      author={Guanting Dong and Hangyu Mao and Kai Ma and Licheng Bao and Yifei Chen and Zhongyuan Wang and Zhongxia Chen and Jiazhen Du and Huiyang Wang and Fuzheng Zhang and Guorui Zhou and Yutao Zhu and Ji-Rong Wen and Zhicheng Dou},
      year={2025},
      eprint={2507.19849},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.19849}, 
}

🤝 Acknowledgements

This training implementation builds upon Tool-Star, Llama Factory, verl and ReCall. For evaluation, we rely on WebThinker, HIRA, WebSailor, Search-o1, and FlashRAG. The Python interpreter design references ToRA and ToRL, while our models are trained using Qwen2.5. We express our sincere gratitude to these projects for their invaluable contributions to the open-source community.

📄 License

This project is released under the MIT License.

📞 Contact

For any questions or feedback, please reach out to us at [email protected].
