Text Generation · Transformers · Safetensors · qwen2 · conversational · text-generation-inference
hongzhizhang and nielsr (HF Staff) committed, verified · Commit e36c4c5 · Parent: 13d9724

Enhance model card with metadata, paper link, and usage example (#1)



Co-authored-by: Niels Rogge <[email protected]>

Files changed (1): README.md +117 -1
README.md CHANGED
@@ -1 +1,117 @@
- See our paper for details [RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning](https://arxiv.org/abs/2507.07451).
+ ---
+ license: apache-2.0
+ pipeline_tag: text-generation
+ library_name: transformers
+ datasets:
+ - Kwai-Klear/RLEP_dataset
+ - BytedTsinghua-SIA/DAPO-Math-17k
+ base_model: Qwen/Qwen2.5-Math-7B
+ ---
+
+ # RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning
+
+ This repository contains the `qwen2.5-math-rlep` model, a key checkpoint from the RLEP training process based on Qwen2.5-Math-7B, as presented in the paper [RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning](https://huggingface.co/papers/2507.07451).
+
+ Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. **RLEP** -- Reinforcement Learning with Experience rePlay -- is a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance.
+
+ [[Paper](https://huggingface.co/papers/2507.07451)] [[Code](https://github.com/Kwai-Klear/RLEP)] [[Checkpoints](https://huggingface.co/Kwai-Klear/qwen2.5-math-rlep)] [[Dataset](https://huggingface.co/datasets/Kwai-Klear/RLEP_dataset)]
+
+ <p align="center">
+ <img src="https://github.com/Kwai-Klear/RLEP/raw/main/image/rlep_method.png" width="85%" alt="RLEP Method Overview">
+ </p>
+
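+ In code, the replay step described above amounts to composing each training group from freshly sampled rollouts plus a few previously verified successes. The sketch below is only illustrative (the function and variable names are hypothetical, and the actual implementation lives in the VERL-based training code); it shows the batch-blending idea, not the full RLEP pipeline:
+
+ ```python
+ import random
+
+ def build_update_batch(fresh_rollouts, replay_buffer, group_size=8, n_replay=2):
+     """Compose one RLEP mini-batch group for a prompt: mostly fresh rollouts
+     from the current policy, plus a few replayed trajectories that were
+     collected earlier and verified as correct."""
+     n_replay = min(n_replay, len(replay_buffer))
+     batch = list(fresh_rollouts[: group_size - n_replay])  # newly generated rollouts
+     batch += random.sample(replay_buffer, n_replay)        # replayed successes
+     return batch
+
+ # Toy usage: 6 fresh rollouts plus 2 replayed successes form one group of 8.
+ fresh = [f"fresh_rollout_{i}" for i in range(8)]
+ verified = ["verified_success_a", "verified_success_b", "verified_success_c"]
+ print(build_update_batch(fresh, verified))
+ ```
+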
+ ## ✨ Key Highlights
+
+ * **Rapid early gains**: On AIME-2024, RLEP hits the baseline’s peak accuracy by step 135 (the baseline needs 380 steps). On AIME-2025, it surpasses the baseline’s best score after only 50 steps.
+ * **Higher final performance**: RLEP ultimately lifts the peak accuracy from 38.2% → 39.9% on AIME-2024, from 19.8% → 22.3% on AIME-2025, and from 77.0% → 82.2% on the AMC-2023 benchmark.
+
+ <p align="center">
+ <img src="https://github.com/Kwai-Klear/RLEP/raw/main/image/exp_acc.png" width="85%" alt="RLEP Experimental Accuracy">
+ </p>
+
+ ## 🚀 Quick Start (Inference)
+
+ This checkpoint is an RL-trained version of Qwen2.5-Math-7B with the standard Qwen2 architecture, so it loads as a regular `transformers` causal language model. For plain inference you only need `torch` and `transformers`; to reproduce training or evaluation, clone the official repository and install its dependencies (including the `vllm` extra):
+
+ ```bash
+ git clone https://github.com/Kwai-Klear/RLEP.git
+ cd RLEP
+ pip3 install -e .[vllm]
+ ```
+
+ Then, you can use the model in your Python code. A minimal example:
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_path = "Kwai-Klear/qwen2.5-math-rlep"  # this RLEP checkpoint (trained from Qwen2.5-Math-7B)
+
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path,
+     torch_dtype=torch.bfloat16,  # or torch.float16
+     low_cpu_mem_usage=True,
+     device_map="auto",
+ )
+ model.eval()
+
+ # Example math problem
+ question = "Solve for x: 2x^2 - 8x + 6 = 0."
+ messages = [{"role": "user", "content": question}]
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ # Generate a step-by-step solution
+ output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.95)
+ answer = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
+ print(answer)
+ ```
+
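+ Because the repository is installed with the `vllm` extra, the checkpoint can also be served with vLLM for faster batched inference. A minimal sketch (the sampling parameters here are illustrative, not the settings used in the paper):
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ # Load the checkpoint into vLLM and generate with simple sampling settings.
+ llm = LLM(model="Kwai-Klear/qwen2.5-math-rlep")
+ params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)
+
+ prompts = ["Solve for x: 2x^2 - 8x + 6 = 0."]
+ for out in llm.generate(prompts, params):
+     print(out.outputs[0].text)
+ ```
+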
+ ## Evaluation Results
+
+ We evaluated the converged RLEP model at 320 training steps and the DAPO-nodyn-bs64 baseline at 400 steps.
+
+ | Model             | AIME-2024 | AIME-2025 | AMC-2023 |
+ |-------------------|-----------|-----------|----------|
+ | DAPO              | 32.6      | 18.9      | 77.5     |
+ | DAPO-nodyn-bs64   | 37.4      | 19.4      | 77.3     |
+ | **RLEP**          | **38.5**  | **21.3**  | **83.0** |
+
+ ## Citation
+
+ If you find our paper or code helpful, we would appreciate it if you could cite our work:
+
+ ```bibtex
+ @misc{zhang2025rlepreinforcementlearningexperience,
+       title={RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning},
+       author={Hongzhi Zhang and Jia Fu and Jingyuan Zhang and Kai Fu and Qi Wang and Fuzheng Zhang and Guorui Zhou},
+       year={2025},
+       eprint={2507.07451},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2507.07451},
+ }
+ ```
+
+ ## Acknowledgement
+
+ We conducted our experiments with the [VERL](https://github.com/volcengine/verl) framework and the [Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B) model, using the dataset and training scripts provided by [DAPO](https://dapo-sia.github.io/).
+ Many thanks to these open-source projects and the broader community for making such resources available!