---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
datasets:
- Kwai-Klear/RLEP_dataset
- BytedTsinghua-SIA/DAPO-Math-17k
base_model: Qwen/Qwen2.5-Math-7B
---
# RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning
This repository contains the `qwen2.5-math-rlep` model, a key checkpoint from the RLEP training process based on Qwen2.5-Math-7B, as presented in the paper [RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning](https://arxiv.org/abs/2507.07451).
Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. RLEP -- Reinforcement Learning with Experience rePlay -- is a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance.
[Paper](https://arxiv.org/abs/2507.07451) | [Code](https://github.com/Kwai-Klear/RLEP) | [Checkpoints](https://huggingface.co/Kwai-Klear/qwen2.5-math-rlep) | [Dataset](https://huggingface.co/datasets/Kwai-Klear/RLEP_dataset)
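As a rough sketch of the replay step described above (the helper names `policy.generate` and `replay_buffer` are hypothetical, not the released verl-based implementation):

```python
import random

def build_update_batch(prompts, policy, replay_buffer, n_fresh=6, n_replay=2):
    """Blend newly generated rollouts with replayed verified successes.

    `replay_buffer[prompt]` holds trajectories collected in RLEP's first
    phase that passed answer verification (hypothetical data structure).
    """
    batch = []
    for prompt in prompts:
        fresh = policy.generate(prompt, n=n_fresh)                    # new exploration
        replayed = random.sample(replay_buffer[prompt], k=n_replay)   # known-good reasoning paths
        batch.append({"prompt": prompt, "rollouts": fresh + replayed})
    return batch  # each update step optimizes the policy on this mixed mini-batch
```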
## ✨ Key Highlights
- **Rapid early gains:** On AIME-2024, RLEP reaches the baseline's peak accuracy by step 135 (the baseline needs 380 steps). On AIME-2025 it surpasses the baseline's best score after only 50 steps.
- **Higher final performance:** RLEP ultimately lifts peak accuracy from 38.2% → 39.9% on AIME-2024, from 19.8% → 22.3% on AIME-2025, and from 77.0% → 82.2% on the AMC-2023 benchmark.
## 🚀 Quick Start (Inference)
This checkpoint is a full fine-tune of Qwen2.5-Math-7B and is published as a standard `transformers` text-generation model, so no custom model class is required for inference. The RLEP repository (built on the verl framework) is only needed if you want to reproduce training or run its vLLM-based evaluation; in that case, clone it and install its dependencies:
```bash
git clone https://github.com/Kwai-Klear/RLEP.git
cd RLEP
pip3 install -e .[vllm]
```
For plain inference, load the checkpoint with the standard `transformers` API. The snippet below is a minimal sketch; the prompt wording and sampling settings are illustrative, so adjust them to match your evaluation setup:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The RLEP checkpoint is a full fine-tune of Qwen/Qwen2.5-Math-7B,
# so it loads like any other causal LM.
model_id = "Kwai-Klear/qwen2.5-math-rlep"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # or torch.float16
    device_map="auto",
)
model.eval()

# Example math prompt; the \boxed{} instruction follows common Qwen2.5-Math
# usage -- adjust it to match the prompt format you evaluate with.
question = "Find the value of x such that 2x + 3 = 11."
messages = [
    {"role": "user",
     "content": question + " Please reason step by step, and put your final answer within \\boxed{}."},
]
# Assumes the tokenizer ships a chat template (Qwen2.5 tokenizers do).
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.5)

# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
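Because the checkpoint is a standard causal LM, it can also be served with vLLM for faster batched generation. A minimal sketch (plain-completion style, independent of the RLEP training code):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Kwai-Klear/qwen2.5-math-rlep", dtype="bfloat16")
sampling = SamplingParams(temperature=0.5, max_tokens=512)

prompts = [
    "Find the value of x such that 2x + 3 = 11. "
    "Please reason step by step, and put your final answer within \\boxed{}."
]
outputs = llm.generate(prompts, sampling)
print(outputs[0].outputs[0].text)
```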
## Evaluation Results
We evaluated the converged RLEP model at 320 training steps and the DAPO-nodyn-bs64 baseline at 400 steps.
| Model | AIME-2024 | AIME-2025 | AMC-2023 |
|---|---|---|---|
| DAPO | 32.6 | 18.9 | 77.5 |
| DAPO-nodyn-bs64 | 37.4 | 19.4 | 77.3 |
| RLEP | 38.5 | 21.3 | 83.0 |
## Citation
If you find our paper or code helpful, we would appreciate it if you could cite our work:
```bibtex
@misc{zhang2025rlepreinforcementlearningexperience,
      title={RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning},
      author={Hongzhi Zhang and Jia Fu and Jingyuan Zhang and Kai Fu and Qi Wang and Fuzheng Zhang and Guorui Zhou},
      year={2025},
      eprint={2507.07451},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.07451},
}
```
## Acknowledgement
We conducted our experiments with the verl framework and the Qwen2.5-Math-7B model, using the dataset and training scripts provided by DAPO. Many thanks to these open-source projects and the broader community for making these resources available!