</div>

## Model Description
The RLinf-math series is trained on DeepSeek-R1-Distill-Qwen (1.5B and 7B variants), using the same base models and training datasets as AReaL. Training with RLinf yields state-of-the-art (SOTA) performance.

We adopt Group Relative Policy Optimization (GRPO) with token-level loss aggregation, focusing on mathematical reasoning and long chain-of-thought (CoT) tasks.
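
Concretely, the token-level GRPO objective takes roughly the following form (our own notation, summarizing standard GRPO rather than quoting the RLinf code):

$$
\mathcal{J}(\theta)=\frac{1}{\sum_{i=1}^{G}\lvert o_i\rvert}\sum_{i=1}^{G}\sum_{t=1}^{\lvert o_i\rvert}\min\!\Big(r_{i,t}(\theta)\,\hat{A}_i,\;\operatorname{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big),\qquad
\hat{A}_i=\frac{R_i-\operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}
$$

where $r_{i,t}(\theta)$ is the ratio between the current and the rollout policy's probability of the $t$-th token of response $o_i$, and $R_i$ is the reward of the $i$-th of $G$ responses sampled for the same prompt. Token-level aggregation divides by the group's total token count instead of averaging per response, so long CoT rollouts are not down-weighted relative to short ones.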
### Benchmark Results
**1.5B models.** All models except the base model are trained from DeepSeek-R1-Distill-Qwen-1.5B with RL.

| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
| ----- | ------- | ------- | ------------ | ------- |
| [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | 28.33 | 24.90 | 27.45 | 26.89 |
| [DeepMath-1.5B](https://huggingface.co/zwhe99/DeepMath-1.5B) | 37.80 | 30.42 | 32.11 | 33.44 |
| [DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview) | 40.41 | 30.93 | 27.54 | 32.96 |
| [AReaL-1.5B-Preview-Stage-3](https://huggingface.co/inclusionAI/AReaL-1.5B-Preview-Stage-3) | 40.73 | 31.56 | 28.10 | 33.46 |
| AReaL-1.5B-retrain* | 44.42 | 34.27 | 33.81 | 37.50 |
| [FastCuRL-1.5B-V3](https://huggingface.co/Nickyang/FastCuRL-1.5B-V3) | 43.65 | 32.49 | 35.00 | 37.05 |
| [RLinf-math-1.5B](https://huggingface.co/RLinf/RLinf-math-1.5B) | **48.44** | **35.63** | **38.46** | **40.84** |
\* We retrain the model using the default settings for 600 steps.

**7B models.** All models except the base model are trained from DeepSeek-R1-Distill-Qwen-7B with RL.

| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
| ----- | ------- | ------- | ------------ | ------- |
| [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | 54.90 | 40.20 | 45.48 | 46.86 |
| [AReaL-boba-RL-7B](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B) | 61.66 | 49.38 | 46.93 | 52.66 |
| [Skywork-OR1-7B](https://huggingface.co/Skywork/Skywork-OR1-7B) | 66.87 | 52.49 | 44.43 | 54.60 |
| [Polaris-7B-Preview](https://huggingface.co/POLARIS-Project/Polaris-7B-Preview) | **68.55** | 51.24 | 43.88 | 54.56 |
| [AceMath-RL-Nemotron-7B](https://huggingface.co/nvidia/AceMath-RL-Nemotron-7B) | 67.30 | **55.00** | 45.57 | 55.96 |
| [RLinf-math-7B](https://huggingface.co/RLinf/RLinf-math-7B) | 68.33 | 52.19 | **48.18** | **56.23** |

## How to Use

Example with Hugging Face `transformers`:
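
A minimal sketch of standard `transformers` usage; the sampling settings below (temperature 0.6, top-p 0.95, 4096 new tokens) are illustrative defaults for long-CoT models, not official recommendations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use "RLinf/RLinf-math-7B" for the 7B variant.
model_id = "RLinf/RLinf-math-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Build a chat-formatted prompt; the model emits a long chain of thought
# before its final answer, so leave a generous token budget.
messages = [
    {"role": "user", "content": "Find the sum of all positive divisors of 496."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```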