</div>

## Model Description
The RLinf-math series is trained on DeepSeek-R1-Distill-Qwen (1.5B and 7B variants), using the same base models and training datasets as AReaL. Training with RLinf yields state-of-the-art (SOTA) performance.

We adopt Group Relative Policy Optimization (GRPO) with token-level loss aggregation, focusing on mathematical reasoning and long chain-of-thought (CoT) tasks.
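
Concretely, the token-level GRPO objective takes roughly the following form (our own notation, summarizing standard GRPO rather than quoting the RLinf code):

$$
\mathcal{J}(\theta)=\frac{1}{\sum_{i=1}^{G}\lvert o_i\rvert}\sum_{i=1}^{G}\sum_{t=1}^{\lvert o_i\rvert}\min\!\Big(r_{i,t}(\theta)\,\hat{A}_i,\;\operatorname{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big),\qquad
\hat{A}_i=\frac{R_i-\operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}
$$

where $r_{i,t}(\theta)$ is the ratio between the current and the rollout policy's probability of the $t$-th token of response $o_i$, and $R_i$ is the reward of the $i$-th of $G$ responses sampled for the same prompt. Token-level aggregation divides by the group's total token count instead of averaging per response, so long CoT rollouts are not down-weighted relative to short ones.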
### Benchmark Results
**1.5B models.** All models except the base model are trained from DeepSeek-R1-Distill-Qwen-1.5B with RL.

| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
| ----- | ------- | ------- | ------------ | ------- |
| [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | 28.33 | 24.90 | 27.45 | 26.89 |
| [DeepMath-1.5B](https://huggingface.co/zwhe99/DeepMath-1.5B) | 37.80 | 30.42 | 32.11 | 33.44 |
| [DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview) | 40.41 | 30.93 | 27.54 | 32.96 |
| [AReaL-1.5B-Preview-Stage-3](https://huggingface.co/inclusionAI/AReaL-1.5B-Preview-Stage-3) | 40.73 | 31.56 | 28.10 | 33.46 |
| AReaL-1.5B-retrain* | 44.42 | 34.27 | 33.81 | 37.50 |
| [FastCuRL-1.5B-V3](https://huggingface.co/Nickyang/FastCuRL-1.5B-V3) | 43.65 | 32.49 | 35.00 | 37.05 |
| [RLinf-math-1.5B](https://huggingface.co/RLinf/RLinf-math-1.5B) | **48.44** | **35.63** | **38.46** | **40.84** |
\* We retrain the model using the default settings for 600 steps.

**7B models.** All models except the base model are trained from DeepSeek-R1-Distill-Qwen-7B with RL.

| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
| ----- | ------- | ------- | ------------ | ------- |
| [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | 54.90 | 40.20 | 45.48 | 46.86 |
| [AReaL-boba-RL-7B](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B) | 61.66 | 49.38 | 46.93 | 52.66 |
| [Skywork-OR1-7B](https://huggingface.co/Skywork/Skywork-OR1-7B) | 66.87 | 52.49 | 44.43 | 54.60 |
| [Polaris-7B-Preview](https://huggingface.co/POLARIS-Project/Polaris-7B-Preview) | **68.55** | 51.24 | 43.88 | 54.56 |
| [AceMath-RL-Nemotron-7B](https://huggingface.co/nvidia/AceMath-RL-Nemotron-7B) | 67.30 | **55.00** | 45.57 | 55.96 |
| [RLinf-math-7B](https://huggingface.co/RLinf/RLinf-math-7B) | 68.33 | 52.19 | **48.18** | **56.23** |

## How to Use

Example with Hugging Face `transformers`:
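
A minimal sketch of standard `transformers` usage; the sampling settings below (temperature 0.6, top-p 0.95, 4096 new tokens) are illustrative defaults for long-CoT models, not official recommendations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use "RLinf/RLinf-math-7B" for the 7B variant.
model_id = "RLinf/RLinf-math-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Build a chat-formatted prompt; the model emits a long chain of thought
# before its final answer, so leave a generous token budget.
messages = [
    {"role": "user", "content": "Find the sum of all positive divisors of 496."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```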