SII-Enigma committed on
Commit 3bb5610 · verified · 1 Parent(s): ce2e62a

Update README.md

Files changed (1):
  1. README.md +49 -16
README.md CHANGED
@@ -1,31 +1,64 @@
  ---
  tags:
  - qwen2.5
- - rl
- - fine-tuned
- language:
- - zh
- - en
  license: apache-2.0
  base_model: Qwen/Qwen2.5-7B-Instruct
  ---

- # Qwen2.5-7B-Ins-SFT-GRPO

- Training Method: GRPO
- Base model: Qwen/Qwen2.5-7B-Instruct

  ## Inference Example

  ```python
- from transformers import AutoTokenizer, AutoModelForCausalLM

- model_name = "SII-Enigma/Qwen2.5-7B-Ins-SFT-GRPO"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForCausalLM.from_pretrained(model_name)

- inputs = tokenizer("Hello", return_tensors="pt")
- outputs = model.generate(**inputs)
- response = tokenizer.decode(outputs[0], skip_special_tokens=True)
- print(response)
  ```
  ---
  tags:
  - qwen2.5
+ - RL
+ - reasoning
+ library_name: transformers
+ pipeline_tag: text-generation
  license: apache-2.0
  base_model: Qwen/Qwen2.5-7B-Instruct
  ---

+ # Introduction
+
+ **AMPO** is a novel framework that intelligently leverages guidance from multiple, diverse teacher models, intervening only when the on-policy model fails. Its two core contributions, Adaptive Multi-Guidance Replacement and Comprehension-based Guidance Selection, ensure that this external knowledge is used both efficiently and effectively.
+
+ [![Paper](https://img.shields.io/badge/paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2510.02227) [![Github](https://img.shields.io/badge/AMPO-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white)](https://github.com/SII-Enigma/AMPO)
+
+ ### Key Highlights
+ - **Adaptive Multi-Guidance Replacement**: Minimizes intervention by providing external guidance only upon complete on-policy failure, preserving self-discovery while enhancing reasoning efficiency.
+ - **Comprehension-based Guidance Selection**: Improves learning effectiveness by guiding the model to assimilate the most comprehensible external solutions, demonstrably boosting performance.
+ - **Superior Performance**: Achieves better performance and efficiency than using RL or SFT alone.

  ## Inference Example

+ Here is an example of using AMPO for inference:
+
  ```python
+ from transformers import AutoTokenizer
+ from vllm import LLM, SamplingParams
+
+ model_path = "SII-Enigma/Qwen2.5-7B-Ins-GRPO"
+
+ question = "Which number is larger, 9.11 or 9.9?"

+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ messages = [{"role": "user", "content": question}]
+ chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

+ llm = LLM(model=model_path)
+ params = SamplingParams(temperature=0.6, max_tokens=8192)
+ outputs = llm.generate([chat], params)
+ print(outputs[0].outputs[0].text)
  ```
+
+ # Acknowledgement
+
+ AMPO builds upon [LUFFY](https://github.com/ElliottYan/LUFFY), [veRL](https://github.com/volcengine/verl), and [RLPR](https://github.com/OpenBMB/RLPR), and uses [vLLM](https://github.com/vllm-project/vllm) for inference and [Math-Verify](https://github.com/huggingface/Math-Verify) for math-reasoning evaluation. We thank the open-source community for its code, datasets, and backbones.
+
+ # Citation
+
+ If you find our model, data, or evaluation code useful, please cite our paper:
+ ```bibtex
+ @misc{yuan2025teacheradaptivemultiguidancepolicy,
+   title={More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration},
+   author={Xiaoyang Yuan and Yujuan Ding and Yi Bin and Wenqi Shao and Jinyu Cai and Jingkuan Song and Yang Yang and Heng Tao Shen},
+   year={2025},
+   eprint={2510.02227},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2510.02227},
+ }
+ ```
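
The updated card describes guidance that is injected only upon complete on-policy failure. A minimal, illustrative sketch of that replacement trigger in plain Python follows; the function name, data shapes, and binary correctness check are assumptions made for illustration, not the repo's actual API.

```python
import random

def adaptive_replace(rollouts, teacher_solutions, is_correct):
    """Adaptive Multi-Guidance Replacement (sketch): intervene only when
    every on-policy rollout in the group fails; otherwise keep the group
    untouched so the model learns from its own successful samples."""
    if any(is_correct(r) for r in rollouts):
        return rollouts, False  # at least one on-policy success: no intervention
    # Complete failure: swap one rollout for an external teacher solution.
    guided = list(rollouts)  # copy so the original group is preserved
    guided[0] = random.choice(teacher_solutions)
    return guided, True

# Toy usage: dicts with a "correct" flag stand in for scored rollouts.
check = lambda r: r["correct"]
ok_group = [{"correct": True}, {"correct": False}]
fail_group = [{"correct": False}, {"correct": False}]
teachers = [{"correct": True, "source": "teacher"}]

print(adaptive_replace(ok_group, teachers, check)[1])    # False: no guidance used
print(adaptive_replace(fail_group, teachers, check)[1])  # True: guidance injected
```

Keeping successful groups untouched preserves on-policy exploration; external guidance enters the batch only where the learner would otherwise receive no reward signal at all.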