Enhance model card with metadata, paper link, and usage example (#1)
- Enhance model card with metadata, paper link, and usage example (7b7070dcb74aaaad2c510724886b0c7c8ebd630e)
Co-authored-by: Niels Rogge <[email protected]>
README.md
CHANGED
@@ -1 +1,117 @@
---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
datasets:
- Kwai-Klear/RLEP_dataset
- BytedTsinghua-SIA/DAPO-Math-17k
base_model: Qwen/Qwen2.5-Math-7B
---

# RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning

This repository contains `qwen2.5-math-rlep`, a key checkpoint from the RLEP training process built on Qwen2.5-Math-7B, as presented in the paper [RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning](https://huggingface.co/papers/2507.07451).

Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. **RLEP** (Reinforcement Learning with Experience rePlay) is a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance.
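
To make the replay step concrete, the sketch below shows how replayed successes could be blended into each mini-batch of rollouts. It is a minimal illustration, not the authors' VERL-based implementation; the buffer class, the `policy.generate` call, and the mixing counts are assumptions for exposition only.

```python
import random
from collections import defaultdict

class ReplayBuffer:
    """Phase one: store verified successful trajectories, keyed by prompt."""

    def __init__(self):
        self.successes = defaultdict(list)

    def add(self, prompt, trajectory):
        self.successes[prompt].append(trajectory)

    def sample(self, prompt, k):
        pool = self.successes[prompt]
        return random.sample(pool, min(k, len(pool)))

def build_mini_batch(policy, prompts, buffer, n_new=6, n_replay=2):
    """Phase two: blend fresh rollouts with replayed successes for one update step."""
    groups = []
    for prompt in prompts:
        fresh = policy.generate(prompt, n=n_new)    # new rollouts from the current policy (hypothetical API)
        replayed = buffer.sample(prompt, n_replay)  # previously verified trajectories for the same prompt
        groups.append({"prompt": prompt, "rollouts": fresh + replayed})
    return groups  # fed to the usual group-based policy-gradient update
```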

[[Paper](https://huggingface.co/papers/2507.07451)] [[Code](https://github.com/Kwai-Klear/RLEP)] [[Checkpoints](https://huggingface.co/Kwai-Klear/qwen2.5-math-rlep)] [[Dataset](https://huggingface.co/datasets/Kwai-Klear/RLEP_dataset)]
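
The replay dataset linked above can be inspected with the `datasets` library. This is a minimal sketch; the split and column names are whatever the dataset card defines, so print the loaded object rather than assuming a schema.

```python
from datasets import load_dataset

# Download the RLEP experience-replay dataset from the Hugging Face Hub.
ds = load_dataset("Kwai-Klear/RLEP_dataset")

# Inspect the available splits and columns before relying on any field names.
print(ds)
```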

<p align="center">
<img src="https://github.com/Kwai-Klear/RLEP/raw/main/image/rlep_method.png" width="85%" alt="RLEP Method Overview">
</p>

## ✨ Key Highlights

* **Rapid early gains**: On AIME-2024, RLEP matches the baseline's peak accuracy by step 135, whereas the baseline needs 380 steps; on AIME-2025 it surpasses the baseline's best score after only 50 steps.
* **Higher final performance**: RLEP lifts peak accuracy from 38.2% → 39.9% on AIME-2024, from 19.8% → 22.3% on AIME-2025, and from 77.0% → 82.2% on AMC-2023.

<p align="center">
<img src="https://github.com/Kwai-Klear/RLEP/raw/main/image/exp_acc.png" width="85%" alt="RLEP Experimental Accuracy">
</p>

## 🚀 Quick Start (Inference)

The checkpoint is based on Qwen2.5-Math-7B and loads with the standard `transformers` text-generation API; for inference you only need `transformers` and `accelerate`.

To reproduce training or evaluation, clone the official repository and install its dependencies (including the `vllm` extra):

```bash
git clone https://github.com/Kwai-Klear/RLEP.git
cd RLEP
pip3 install -e .[vllm]
```

For plain inference, you can use the model in Python as follows:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# This RLEP checkpoint follows the Qwen2.5-Math-7B architecture,
# so it loads with the standard causal-LM classes.
model_id = "Kwai-Klear/qwen2.5-math-rlep"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # use torch.float16 if bf16 is unavailable
    device_map="auto",
)
model.eval()

# A math-style prompt; asking for the final answer in \boxed{} matches the
# DAPO/RLEP training setup.
question = (
    "What is the least common multiple of 12 and 18? "
    "Please reason step by step, and put your final answer within \\boxed{}."
)

# Format with the tokenizer's chat template if one is provided;
# otherwise fall back to the raw question string.
if tokenizer.chat_template:
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False,
        add_generation_prompt=True,
    )
else:
    prompt = question

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.6,
        top_p=0.95,
        do_sample=True,
    )

# Decode only the newly generated tokens.
response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```
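
Because the repository's extras include `vllm`, the checkpoint can also be served with vLLM for faster batched generation. The snippet below is a minimal sketch; the sampling parameters are illustrative rather than values from the paper, and chat formatting (if desired) can be applied with the tokenizer as shown above.

```python
from vllm import LLM, SamplingParams

# Load the RLEP checkpoint with vLLM for high-throughput batched generation.
llm = LLM(model="Kwai-Klear/qwen2.5-math-rlep", dtype="bfloat16")

sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
prompts = [
    "Solve for x: 2x + 3 = 11. Please reason step by step, "
    "and put your final answer within \\boxed{}."
]

for request_output in llm.generate(prompts, sampling):
    print(request_output.outputs[0].text)
```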

## Evaluation Results

We evaluated the converged RLEP model at 320 training steps and the DAPO-nodyn-bs64 baseline at 400 steps. Scores are accuracy (%) on each benchmark.

| Model           | AIME-2024 | AIME-2025 | AMC-2023 |
|-----------------|-----------|-----------|----------|
| DAPO            | 32.6      | 18.9      | 77.5     |
| DAPO-nodyn-bs64 | 37.4      | 19.4      | 77.3     |
| **RLEP**        | **38.5**  | **21.3**  | **83.0** |

## Citation

If you find our paper or code helpful, we would appreciate it if you could cite our work:

```bibtex
@misc{zhang2025rlepreinforcementlearningexperience,
      title={RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning},
      author={Hongzhi Zhang and Jia Fu and Jingyuan Zhang and Kai Fu and Qi Wang and Fuzheng Zhang and Guorui Zhou},
      year={2025},
      eprint={2507.07451},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.07451},
}
```

## Acknowledgement

We conducted our experiments with the [VERL](https://github.com/volcengine/verl) framework and the [Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B) model, using the dataset and training scripts provided by [DAPO](https://dapo-sia.github.io/).
Many thanks to these open-source projects and the broader community for making these resources available!