This is the EDGE-GRPO-Qwen-7B model presented in the paper EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity.
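As a quick-start reference, here is a minimal sketch of loading the checkpoint with Hugging Face `transformers`. The Hub repository ID and the example prompt are assumptions inferred from the model name, not taken from this card; the temperature of 0.1 mirrors the evaluation setting described below.

```python
# Minimal loading sketch with Hugging Face transformers.
# Assumption: the repo ID below is inferred from the model name and may differ on the Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ZhangXJ199/EDGE-GRPO-Qwen-7B"  # assumed repo ID; check the actual Hub page
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Solve step by step: what is 12 * 17?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.1)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```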
📰 News
- [2025-07] Our arXiv paper EDGE-GRPO is released!
About
Large Language Models have made remarkable progress in enhancing step-by-step reasoning through reinforcement learning. However, the Group Relative Policy Optimization (GRPO) algorithm, which relies on sparse reward rules, often encounters the issue of identical rewards within groups, leading to the advantage collapse problem. Existing works typically address this challenge from two perspectives: enforcing model reflection to enhance response diversity, and introducing internal feedback to augment the training signal. In this work, we begin by analyzing the limitations of model reflection and investigating the policy entropy of responses at the fine-grained sample level. Based on our experimental findings, we propose the EDGE-GRPO algorithm, which adopts Entropy-Driven Advantage and Guided Error Correction to effectively mitigate the problem of advantage collapse. Extensive experiments on several main reasoning benchmarks demonstrate the effectiveness and superiority of our approach.
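To make the advantage collapse problem concrete, below is a small, self-contained Python sketch: when a sparse rule-based reward assigns the same score to every response in a group, vanilla GRPO's group-normalized advantages all become zero, while an entropy-driven reweighting keeps them diverse. The `alpha`-weighted shaping here is an illustrative assumption, not the exact rule from the paper, and Guided Error Correction, the second component, is not covered by this sketch.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Vanilla GRPO: normalize rewards within a group of sampled responses."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def entropy_driven_advantages(rewards: torch.Tensor,
                              entropies: torch.Tensor,
                              alpha: float = 0.5) -> torch.Tensor:
    """Illustrative entropy-driven reweighting (hypothetical shaping rule).

    Per-response policy entropy breaks ties inside the group so that the
    intra-group advantage variance stays non-zero even when rewards agree.
    """
    ent = (entropies - entropies.min()) / (entropies.max() - entropies.min() + 1e-8)
    return group_relative_advantages(rewards + alpha * ent)

# A group where every response receives the same sparse reward (e.g. all incorrect).
rewards = torch.tensor([0.0, 0.0, 0.0, 0.0])
entropies = torch.tensor([1.2, 0.4, 0.9, 0.7])  # per-response mean policy entropy
print(group_relative_advantages(rewards))             # all zeros -> advantage collapse
print(entropy_driven_advantages(rewards, entropies))  # non-degenerate advantages
```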

🛠️ Installation
Clone this repository and navigate to the folder:
```bash
git clone https://github.com/ZhangXJ199/EDGE-GRPO.git
cd EDGE-GRPO
```
Create a conda environment, activate it, and install the packages:
```bash
conda create -n edge_grpo python=3.10 -y
conda activate edge_grpo
pip install -r requirements.txt
```
📊 Results
Performance comparison of different methods on three benchmarks across training steps. Our method consistently outperforms both vanilla GRPO and the forced-reflection variant throughout training.

Pass@1 performance comparison across various mathematical evaluation benchmarks. The results below are from one epoch of training on DeepScaleR-Hard-1K; the number of samples in each benchmark is indicated in parentheses. Results are evaluated with temperature = 0.1, and the best results are shown in boldface.
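For reference, Pass@1 here is simply the fraction of problems whose single generated answer is judged correct. A minimal sketch follows, where `generate_answer` and `is_correct` are hypothetical helpers standing in for the actual evaluation harness.

```python
def pass_at_1(problems, generate_answer, is_correct, temperature=0.1):
    """Fraction of problems solved by a single sampled answer (Pass@1)."""
    solved = sum(
        int(is_correct(problem, generate_answer(problem, temperature=temperature)))
        for problem in problems
    )
    return solved / len(problems)
```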

Changes in Entropy and Advantage Variance During Training
This figure compares the training dynamics of three methods, Vanilla GRPO, GRPO + Forced Reflection, and EDGE-GRPO (Ours), with respect to two key metrics (a code sketch of both metrics follows the list):
- Left (Mean Entropy): Our method maintains consistently higher policy entropy during training, indicating stronger exploration ability and response diversity, which helps prevent premature convergence.
- Right (Advantage Variance): EDGE-GRPO significantly outperforms the baselines by preserving higher intra-group advantage variance, effectively mitigating the advantage collapse problem and providing more informative gradient signals.
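For clarity, here is a minimal sketch of how the two plotted quantities could be computed; the exact logging in the training code may differ.

```python
import torch

def mean_policy_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token policy entropy over a batch of response logits.

    logits: (batch, seq_len, vocab_size)
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq_len)
    return token_entropy.mean()

def intra_group_advantage_variance(advantages: torch.Tensor) -> torch.Tensor:
    """Advantage variance within each group of G responses, averaged over prompts.

    advantages: (num_prompts, G); zero variance for a prompt means its
    advantages have collapsed and contribute no gradient signal.
    """
    return advantages.var(dim=-1).mean()
```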
📖 Citation
If you find our work interesting and helpful, please consider giving our repo a star. Additionally, if you would like to cite our work, please use the following format:
```bibtex
@misc{zhang2025edgegrpoentropydrivengrpoguided,
  title={EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity},
  author={Xingjian Zhang and Siwei Wen and Wenjun Wu and Lei Huang},
  year={2025},
  eprint={2507.21848},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2507.21848},
}
```
📨 Contact
If you have any questions or suggestions, please feel free to contact us at [email protected].
❤️ Community efforts
- This repository is based on the trl project.
- The evaluation implementation follows the understand-r1-zero project. Great work!