---
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
language:
- en
license: mit
metrics:
- accuracy
pipeline_tag: text-generation
library_name: transformers
tags:
- RLinf
- reinforcement-learning
model-index:
- name: RLinf-math-7B
  results:
  - task:
      type: math
    dataset:
      name: AIME24
      type: aime_2024
    metrics:
    - type: accuracy
      value: 68.328125
  - task:
      type: math
    dataset:
      name: AIME25
      type: aime_2025
    metrics:
    - type: accuracy
      value: 52.19375
  - task:
      type: stem
    dataset:
      name: GPQA-diamond
      type: gpqa_diamond
    metrics:
    - type: accuracy
      value: 48.178125
---

<div align="center">
  <img src="logo.svg" alt="RLinf-logo" width="500"/>
</div>

The model was presented in the paper [RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training](https://huggingface.co/papers/2510.06710).

<div align="center">
<a href="https://github.com/RLinf/RLinf"><img src="https://img.shields.io/badge/Github-blue"></a>
<a href="https://rlinf.readthedocs.io/en/latest/"><img src="https://img.shields.io/badge/Documentation-Purple?color=8A2BE2&logo=readthedocs"></a>
</div>

<h1 align="center">RLinf: Reinforcement Learning Infrastructure for Agentic AI</h1>

[RLinf](https://github.com/RLinf/RLinf) is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.


<div align="center">
  <img src="overview.png" alt="RLinf-overview" width="600"/>
</div>

## Model Description
The RLinf-math series is trained from DeepSeek-R1-Distill-Qwen (1.5B and 7B variants), using the same base models and training datasets as AReaL. Training with RLinf yields state-of-the-art performance on the benchmarks reported below.

We adopt Group Relative Policy Optimization (GRPO) with token-level loss aggregation, focusing on mathematical reasoning and long chain-of-thought (CoT) tasks.
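
To make the objective concrete, here is a minimal PyTorch sketch of GRPO with token-level loss aggregation. This is illustrative only, not the RLinf implementation; the function name, tensor shapes, and clipping constant are assumptions:

```python
import torch

def grpo_token_level_loss(logprobs, old_logprobs, rewards, mask, clip_eps=0.2):
    """Hypothetical GRPO loss for one prompt with a group of G responses.

    logprobs / old_logprobs: (G, T) per-token log-probs under the current
    and behavior policies; rewards: (G,) scalar rewards; mask: (G, T) with
    1 for valid response tokens and 0 for padding.
    """
    # Group-relative advantage: normalize each reward within its group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    adv = adv.unsqueeze(-1)  # broadcast one advantage across all tokens

    # PPO-style clipped surrogate, computed per token.
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    per_token = -torch.min(ratio * adv, clipped * adv)

    # Token-level aggregation: average over ALL valid tokens in the group,
    # rather than averaging per-sequence means, so long CoT responses are
    # not down-weighted.
    return (per_token * mask).sum() / mask.sum()
```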

## Evaluation and Results
We trained and evaluated two models using RLinf:

- RLinf-math-1.5B Model (based on DeepSeek-R1-Distill-Qwen-1.5B)
  - Recommended sampling settings: `temperature = 0.6`, `top_p = 0.95`

- RLinf-math-7B Model (based on DeepSeek-R1-Distill-Qwen-7B)
  - Recommended sampling settings: `temperature = 1.0`, `top_p = 0.95`
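
These settings carry over directly to other inference stacks. For example, a minimal sketch with vLLM (an assumption about your serving stack; any engine with equivalent sampling controls works the same way):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="RLinf/RLinf-math-7B")

# Recommended settings for the 7B model; use temperature=0.6 for the 1.5B.
# The max_tokens budget is an assumption; long-CoT outputs may need more.
params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=8192)

outputs = llm.generate(["Solve: If x^2 + 2x + 1 = 0, what is x?"], params)
print(outputs[0].outputs[0].text)
```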

### Benchmark Results

**1.5B models**. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-1.5B with RL.

| Model                                      | AIME 24   | AIME 25   | GPQA-diamond | Average   |
| ------------------------------------------ | --------- | --------- | ------------ | --------- |
| [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | 28.33     | 24.90     | 27.45        | 26.89     |
| [DeepMath-1.5B](https://huggingface.co/zwhe99/DeepMath-1.5B)                             | 37.80     | 30.42     | 32.11        | 33.44     |
| [DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview)                    | 40.41     | 30.93     | 27.54        | 32.96     |
| [AReaL-1.5B-Preview-Stage-3](https://huggingface.co/inclusionAI/AReaL-1.5B-Preview-Stage-3)                 | 40.73     | 31.56     | 28.10        | 33.46     |
| AReaL-1.5B-retrain*                        | 44.42     | 34.27     | 33.81        | 37.50     |
| [FastCuRL-1.5B-V3](https://huggingface.co/Nickyang/FastCuRL-1.5B-V3)                          | 43.65     | 32.49     | 35.00        | 37.05     |
| [RLinf-math-1.5B](https://huggingface.co/RLinf/RLinf-math-1.5B)                           | **48.44** | **35.63** | **38.46**    | **40.84** |

\* We retrain the model using the default settings for 600 steps. 

**7B models**. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-7B with RL.

| Model                                    | AIME 24   | AIME 25   | GPQA-diamond | Average   |
| ---------------------------------------- | --------- | --------- | ------------ | --------- |
| [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)  | 54.90     | 40.20     | 45.48        | 46.86     |
| [AReaL-boba-RL-7B](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B)                           | 61.66     | 49.38     | 46.93        | 52.66     |
| [Skywork-OR1-7B](https://huggingface.co/Skywork/Skywork-OR1-7B)                           | 66.87     | 52.49     | 44.43        | 54.60     |
| [Polaris-7B-Preview](https://huggingface.co/POLARIS-Project/Polaris-7B-Preview)                    | **68.55** | 51.24     | 43.88        | 54.56     |
| [AceMath-RL-Nemotron-7B](https://huggingface.co/nvidia/AceMath-RL-Nemotron-7B)                   | 67.30     | **55.00** | 45.57        | 55.96     |
| [RLinf-math-7B](https://huggingface.co/RLinf/RLinf-math-7B)                            | 68.33     | 52.19     | **48.18**    | **56.23** |


## How to Use
Example with Hugging Face `transformers`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RLinf/RLinf-math-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Solve: If x^2 + 2x + 1 = 0, what is x?"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,    # required for temperature/top_p to take effect
    temperature=1.0,   # recommended for the 7B model; use 0.6 for the 1.5B
    top_p=0.95,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
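
Since the DeepSeek-R1-Distill base models ship with a chat template, wrapping the prompt as a chat message often works better than raw-text prompting. A sketch using the standard `transformers` chat-template API (same model and sampling settings as above):

```python
messages = [{"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```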

## License
This code repository and the model weights are licensed under the MIT License.