Enhance model card for ARPO checkpoint with comprehensive details (#1)
Co-authored-by: Niels Rogge <[email protected]>
README.md

---
license: mit
pipeline_tag: text-generation
library_name: transformers
datasets:
- dongguanting/ARPO-SFT-54K
- dongguanting/ARPO-RL-Reasoning-10K
- dongguanting/ARPO-RL-DeepSearch-1K
language: en
base_model:
- Qwen/Qwen2.5-3B-Instruct
- Qwen/Qwen2.5-7B-Instruct
- meta-llama/Llama-3.1-8B-Instruct
- Qwen/Qwen3-8B
- Qwen/Qwen3-14B
---

# Agentic Reinforced Policy Optimization (ARPO)

<p align="center">
  <img src="https://github.com/dongguanting/ARPO/blob/main/logo1.png" width="150px">
</p>

This repository contains a model checkpoint for **Agentic Reinforced Policy Optimization (ARPO)**, a novel agentic reinforcement learning (RL) algorithm designed for training multi-turn Large Language Model (LLM)-based agents.

The model was presented in the paper [Agentic Reinforced Policy Optimization](https://huggingface.co/papers/2507.19849) (arXiv: [2507.19849](https://arxiv.org/abs/2507.19849)).

## ✨ Overview

ARPO addresses a shortcoming of existing trajectory-level RL algorithms, which inadequately balance LLMs' intrinsic long-horizon reasoning capabilities with their proficiency in multi-turn tool interactions. Preliminary experiments show that LLMs tend to exhibit highly uncertain behavior, marked by an increase in the entropy of the generated-token distribution, immediately after interacting with external tools. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism that dynamically balances global trajectory sampling with step-level sampling, thereby promoting exploration at high-uncertainty steps following tool usage.

By integrating advantage attribution estimation, ARPO enables LLMs to internalize advantage differences across stepwise tool-use interactions. Experiments on 13 challenging benchmarks spanning computational reasoning, knowledge reasoning, and deep search demonstrate ARPO's superiority over trajectory-level RL algorithms. Notably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments.
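
The adaptive rollout idea can be pictured with a short sketch. The snippet below is purely illustrative (the function names, the entropy window, and the threshold are assumptions and are not taken from the released code): it measures the entropy of the first tokens generated after a tool response and triggers extra step-level branches when that entropy rises noticeably above the trajectory's baseline.

```python
# Illustrative sketch of entropy-based branching; NOT the official ARPO implementation.
import torch
import torch.nn.functional as F


def mean_token_entropy(logits: torch.Tensor) -> float:
    """Mean Shannon entropy of the next-token distributions over a small window
    of tokens, e.g. the first k tokens generated after a tool response."""
    log_probs = F.log_softmax(logits, dim=-1)           # (window, vocab)
    entropy = -(log_probs.exp() * log_probs).sum(-1)    # (window,)
    return entropy.mean().item()


def should_branch(post_tool_logits: torch.Tensor,
                  baseline_entropy: float,
                  threshold: float = 0.2) -> bool:
    """Spawn additional step-level (partial) rollouts when the entropy right
    after a tool call rises noticeably above the trajectory's baseline entropy."""
    return mean_token_entropy(post_tool_logits) - baseline_entropy > threshold
```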

<p align="center">
  <img width="1686" height="866" alt="intro" src="https://github.com/user-attachments/assets/8b9daf54-c4ba-4e79-bf79-f98b5a893edd" />
</p>

* In the figure (left), the first tokens the LLM generates after receiving each round of tool-call feedback consistently exhibit high entropy, indicating that external tool calls introduce significant uncertainty into the LLM's reasoning process.
* In the figure (right), ARPO's performance is validated across 13 datasets. Notably, Qwen3-14B trained with ARPO excels in Pass@5, achieving 61.2% on GAIA and 24.0% on HLE, while requiring only about half the tool calls of GRPO during training.

## 📣 Latest News

* **[July 29, 2025]**: 🎉 Our paper is now available on **[arXiv](https://arxiv.org/abs/2507.19849)** and as a **[Hugging Face daily paper](https://huggingface.co/papers/2507.19849)**.
* **[July 25, 2025]**: 🔥 We released all of our **ARPO model checkpoints (3B~14B)** and **datasets (SFT, RL, evaluation)**. Check out the **[🤗 ARPO Collection](https://huggingface.co/collections/dongguanting/arpo-688229ff8a6143fe5b4ad8ae)**. We will keep updating it!
* **[July 25, 2025]**: 🚀 Full codebase released. ARPO supports multi-tool agentic RL training for Qwen2.5, Qwen3, and Llama-3 models, with extensive tool-call acceleration and memory optimization during RL training.

## 🔗 Links

* **Paper (Hugging Face)**: [Agentic Reinforced Policy Optimization](https://huggingface.co/papers/2507.19849)
* **Paper (arXiv)**: [https://arxiv.org/abs/2507.19849](https://arxiv.org/abs/2507.19849)
* **GitHub Repository**: [https://github.com/dongguanting/ARPO](https://github.com/dongguanting/ARPO)
* **Hugging Face Model Collection**: [ARPO Models](https://huggingface.co/collections/dongguanting/arpo-688229ff8a6143fe5b4ad8ae)
* **Hugging Face Dataset Collection**: [ARPO Datasets](https://huggingface.co/collections/dongguanting/arpo-688229ff8a6143fe5b4ad8ae)

## ⚡ Quick Start

This model can be loaded and used with the `transformers` library. Below is a basic example of text generation and multi-turn interaction. For more advanced usage, including multi-tool agentic RL training and evaluation, please refer to the [official GitHub repository](https://github.com/dongguanting/ARPO).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer.
# Replace "dongguanting/Qwen3-8B-ARPO-DeepSearch" with the specific model ID you want to use.
model_id = "dongguanting/Qwen3-8B-ARPO-DeepSearch"  # Example from the ARPO collection
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Adjust dtype based on model requirements and hardware
    device_map="auto",           # Automatically map the model to available devices (e.g., GPU)
    trust_remote_code=True,
)

# Prepare the conversational input.
# The model supports multi-turn interactions and tool calls through its chat template.
messages = [
    {"role": "user", "content": "What is the capital of France? And what is the population of that city?"},
]

# Apply the chat template and tokenize.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate a response.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    eos_token_id=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|im_end|>")],
)

# Decode and print only the newly generated tokens.
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
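
For multi-turn tool use, recent versions of `transformers` let you pass tool schemas to `apply_chat_template` via the `tools` argument. The sketch below continues the example above with a hypothetical `web_search` tool; whether this checkpoint's chat template renders tool schemas this way depends on the template shipped with the tokenizer, so please consult the [official GitHub repository](https://github.com/dongguanting/ARPO) for the exact tool-calling and tool-feedback format used during training.

```python
# Hypothetical tool schema, for illustration only; the actual tools used by ARPO
# (e.g., search and code execution) are configured in the GitHub repository.
web_search_tool = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return the top results for a query.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query."},
            },
            "required": ["query"],
        },
    },
}

# If the chat template supports tool definitions, render them into the prompt.
text = tokenizer.apply_chat_template(
    messages,
    tools=[web_search_tool],
    tokenize=False,
    add_generation_prompt=True,
)
```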

## 📝 Citation

If you find this work helpful, please cite our paper:

```bibtex
@misc{dong2025arpo,
      title={Agentic Reinforced Policy Optimization},
      author={Guanting Dong and Hangyu Mao and Kai Ma and Licheng Bao and Yifei Chen and Zhongyuan Wang and Zhongxia Chen and Jiazhen Du and Huiyang Wang and Fuzheng Zhang and Guorui Zhou and Yutao Zhu and Ji-Rong Wen and Zhicheng Dou},
      year={2025},
      eprint={2507.19849},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.19849},
}
```

## 🤝 Acknowledgements

This training implementation builds upon [Tool-Star](https://github.com/dongguanting/Tool-Star), [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), [verl](https://github.com/volcengine/verl), and [ReCall](https://github.com/Agent-RL/ReCall). For evaluation, we rely on [WebThinker](https://github.com/RUC-NLPIR/WebThinker), [HiRA](https://github.com/RUC-NLPIR/HiRA), [WebSailor](https://github.com/Alibaba-NLP/WebAgent), [Search-o1](https://github.com/sunnynexus/Search-o1), and [FlashRAG](https://github.com/RUC-NLPIR/FlashRAG). The Python interpreter design references [ToRA](https://github.com/microsoft/ToRA) and [ToRL](https://github.com/GAIR-NLP/ToRL), and our models are trained on top of [Qwen2.5](https://qwenlm.github.io/blog/qwen2.5/). We express our sincere gratitude to these projects for their invaluable contributions to the open-source community.

## 📄 License

This project is released under the [MIT License](https://opensource.org/licenses/MIT).

## 📧 Contact

For any questions or feedback, please reach out to us at [email protected].