EDGE-GRPO


This is the EDGE-GRPO-Qwen-7B model (7.62B parameters, BF16, Safetensors) presented in the paper EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity.

📰 News

  • [2025-07] 🎉 Our arXiv paper EDGE-GRPO is released!

About

Large Language Models have made remarkable progress in enhancing step-by-step reasoning through reinforcement learning. However, the Group Relative Policy Optimization (GRPO) algorithm, which relies on sparse reward rules, often encounters the issue of identical rewards within groups, leading to the advantage collapse problem. Existing works typically address this challenge from two perspectives: enforcing model reflection to enhance response diversity, and introducing internal feedback to augment the training signal. In this work, we begin by analyzing the limitations of model reflection and investigating the policy entropy of responses at the fine-grained sample level. Based on our experimental findings, we propose the EDGE-GRPO algorithm, which adopts Entropy-Driven Advantage and Guided Error Correction to effectively mitigate the problem of advantage collapse. Extensive experiments on several main reasoning benchmarks demonstrate the effectiveness and superiority of our approach.

[Figure: EDGE-GRPO framework overview]
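
To make the advantage-collapse problem concrete, here is a minimal, hypothetical NumPy sketch (not the paper's or repository's implementation) of group-relative advantage normalization: when a sparse rule-based reward assigns every response in a group the same score, all normalized advantages become zero and the group contributes no gradient signal.

    import numpy as np

    def group_relative_advantages(rewards, eps=1e-6):
        """GRPO-style advantages: reward minus group mean, scaled by group std."""
        rewards = np.asarray(rewards, dtype=np.float64)
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    # Mixed group (some correct, some incorrect) -> diverse, informative advantages.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))

    # Identical sparse rewards (e.g., every sampled response is wrong)
    # -> all advantages are ~0: the "advantage collapse" described above.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))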

🛠️ Installation

  1. Clone this repository and navigate to the folder

    git clone https://github.com/ZhangXJ199/EDGE-GRPO.git
    cd EDGE-GRPO
    
  2. Create a conda environment, activate it, and install the required packages

    conda create -n edge_grpo python=3.10 -y
    conda activate edge_grpo
    pip install -r requirements.txt
    
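Once the environment is set up, the released checkpoint can be used like any other Hugging Face causal LM. The snippet below is a minimal sketch under standard transformers assumptions (the model id and BF16 dtype come from this model card; the prompt and generation settings are illustrative):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Zhang199/EDGE-GRPO-Qwen-7B"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # the checkpoint is stored in BF16
        device_map="auto",
    )

    # Illustrative math prompt using the tokenizer's chat template.
    messages = [{"role": "user", "content": "Solve step by step: what is 12 * 13?"}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    output_ids = model.generate(input_ids, max_new_tokens=512, temperature=0.1, do_sample=True)
    print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))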

📊 Results

Performance comparison of different methods on three benchmarks over training steps. Our method consistently outperforms vanilla GRPO and the forced-reflection variant throughout training.

[Figure: accuracy vs. training steps on the three benchmarks]

Pass@1 performance comparison across various mathematical evaluation benchmarks. The results below are from one epoch of training on DeepScaleR-Hard-1K; the number of samples in each benchmark is indicated in parentheses. Results are evaluated with temperature = 0.1, and the best results are shown in boldface.

[Figure: Pass@1 results table]
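
For reference, Pass@1 here means a single sampled response per problem is checked against the reference answer. A hypothetical sketch of that computation is shown below; the `is_equivalent` checker is a placeholder, since real math benchmarks use symbolic or numeric equivalence rather than plain string matching:

    def pass_at_1(predictions, references, is_equivalent):
        """Fraction of problems whose single sampled answer matches the reference."""
        assert len(predictions) == len(references)
        correct = sum(is_equivalent(p, r) for p, r in zip(predictions, references))
        return correct / len(references)

    # Toy example with a naive exact-match checker.
    preds = ["156", "42", "3.14"]
    refs = ["156", "41", "3.14"]
    print(pass_at_1(preds, refs, lambda p, r: p.strip() == r.strip()))  # 0.666...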

Changes in Entropy and Advantage Variance During Training

This figure compares the training dynamics of Vanilla GRPO, GRPO + Forced Reflection, and EDGE-GRPO (ours) with respect to two key metrics:

  • Left (Mean Entropy): Our method maintains consistently higher policy entropy during training, indicating stronger exploration ability and response diversity, which helps prevent premature convergence.

  • Right (Advantage Variance): EDGE-GRPO significantly outperforms the baselines by preserving higher intra-group advantage variance, effectively mitigating the advantage collapse problem and providing more informative gradient signals.

[Figure: mean entropy and advantage variance during training]
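
A minimal sketch, assuming PyTorch logits and per-group scalar rewards, of how these two diagnostics (mean token-level policy entropy and intra-group advantage variance) could be logged during training; this is an illustration, not the repository's actual logging code:

    import torch

    def mean_token_entropy(logits):
        """Mean policy entropy over all positions, given logits of shape (batch, seq, vocab)."""
        logp = torch.log_softmax(logits, dim=-1)
        return -(logp.exp() * logp).sum(dim=-1).mean()

    def intra_group_advantage_variance(rewards, eps=1e-6):
        """Variance of group-normalized advantages for one group of sampled responses."""
        r = torch.as_tensor(rewards, dtype=torch.float32)
        adv = (r - r.mean()) / (r.std() + eps)
        return adv.var()

    # Toy groups of 4 responses with sparse 0/1 rewards.
    print(intra_group_advantage_variance([1.0, 0.0, 0.0, 1.0]))  # informative group
    print(intra_group_advantage_variance([0.0, 0.0, 0.0, 0.0]))  # collapsed group -> ~0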

📝 Citation

If you find our work interesting and helpful, please consider giving our repo a star. Additionally, if you would like to cite our work, please use the following format:

@misc{zhang2025edgegrpoentropydrivengrpoguided,
      title={EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity}, 
      author={Xingjian Zhang and Siwei Wen and Wenjun Wu and Lei Huang},
      year={2025},
      eprint={2507.21848},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2507.21848}, 
}

📨 Contact

If you have any questions or suggestions, please feel free to contact us at [email protected].

❤️ Community efforts

  • This repository is based on the trl project.
  • The evaluation implementation follows the understand-r1-zero project. Great work!