---
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
language:
- en
license: mit
metrics:
- accuracy
pipeline_tag: text-generation
library_name: transformers
tags:
- RLinf
- reinforcement-learning
model-index:
- name: RLinf-math-7B
results:
- task:
type: math
dataset:
name: AIME24
type: aime_2024
metrics:
- type: accuracy
value: 68.328125
- task:
type: math
dataset:
name: AIME25
type: aime_2025
metrics:
- type: accuracy
value: 52.19375
- task:
type: stem
dataset:
name: GPQA-diamond
type: gpqa_diamond
metrics:
- type: accuracy
value: 48.178125
---
<div align="center">
<img src="logo.svg" alt="RLinf-logo" width="500"/>
</div>
The model was presented in the paper [RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training](https://huggingface.co/papers/2510.06710).
<div align="center">
<!-- <a href="TODO"><img src="https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv"></a> -->
<!-- <a href="TODO"><img src="https://img.shields.io/badge/HuggingFace-yellow?logo=huggingface&logoColor=white" alt="Hugging Face"></a> -->
<a href="https://github.com/RLinf/RLinf"><img src="https://img.shields.io/badge/Github-blue"></a>
<a href="https://rlinf.readthedocs.io/en/latest/"><img src="https://img.shields.io/badge/Documentation-Purple?color=8A2BE2&logo=readthedocs"></a>
<!-- <a href="TODO"><img src="https://devin.ai/assets/deepwiki-badge.png" alt="Ask DeepWiki.com" style="height:20px;"></a>
<a href="TODO"><img src="https://img.shields.io/badge/微信-green?logo=wechat&"></a> -->
</div>
<h1 align="center">RLinf: Reinforcement Learning Infrastructure for Agentic AI</h1>
[RLinf](https://github.com/RLinf/RLinf) is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.
<div align="center">
<img src="overview.png" alt="RLinf-overview" width="600"/>
</div>
## Model Description
The RLinf-math series is trained from the DeepSeek-R1-Distill-Qwen base models (1.5B and 7B variants), using the same base models and training datasets as AReaL. Training with RLinf yields state-of-the-art performance.
We adopt Group Relative Policy Optimization (GRPO) with token-level loss aggregation, focusing on mathematical reasoning and long chain-of-thought (CoT) tasks.
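To make the loss-aggregation choice concrete, here is a minimal GRPO sketch (not the RLinf implementation; the advantage normalization, clipping threshold `clip_eps`, and tensor layout are assumptions): rewards are normalized within each sampled group, and the clipped per-token losses are summed over all responses and divided by the total token count, rather than being averaged per sequence first.
```python
import torch

def group_relative_advantages(rewards):
    """Group-relative advantage: normalize each response's reward within its group."""
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def grpo_token_level_loss(logps, old_logps, advantages, masks, clip_eps=0.2):
    """Clipped GRPO objective with token-level loss aggregation (illustrative sketch).

    logps / old_logps: per-token log-probs of each sampled response (list of 1-D tensors)
    advantages: one group-normalized scalar advantage per response
    masks: per-token 0/1 masks marking valid response tokens
    NOTE: clip_eps and the normalization scheme are assumptions for illustration.
    """
    total_loss, total_tokens = 0.0, 0.0
    for logp, old_logp, adv, mask in zip(logps, old_logps, advantages, masks):
        ratio = torch.exp(logp - old_logp)                         # per-token importance ratio
        unclipped = ratio * adv
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
        total_loss = total_loss + (-torch.minimum(unclipped, clipped) * mask).sum()
        total_tokens = total_tokens + mask.sum()
    # Token-level aggregation: divide the summed per-token loss by the total number of
    # tokens in the group, instead of averaging per sequence and then over the group.
    return total_loss / total_tokens
```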
## Evaluation and Results
We trained and evaluated two models using RLinf; the recommended sampling settings for each are listed below (see the config snippet after this list):
- RLinf-math-1.5B Model (based on DeepSeek-R1-Distill-Qwen-1.5B)
  - Recommended sampling settings: `temperature = 0.6`, `top_p = 0.95`
- RLinf-math-7B Model (based on DeepSeek-R1-Distill-Qwen-7B)
  - Recommended sampling settings: `temperature = 1.0`, `top_p = 0.95`
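For convenience, the recommended settings above can be captured as Hugging Face `GenerationConfig` objects (a small sketch; the repo IDs match the model links in the tables below):
```python
from transformers import GenerationConfig

# Recommended sampling settings per model (from the list above).
SAMPLING_CONFIGS = {
    "RLinf/RLinf-math-1.5B": GenerationConfig(do_sample=True, temperature=0.6, top_p=0.95),
    "RLinf/RLinf-math-7B": GenerationConfig(do_sample=True, temperature=1.0, top_p=0.95),
}

# Usage:
# outputs = model.generate(**inputs, generation_config=SAMPLING_CONFIGS[model_name], max_new_tokens=512)
```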
### Benchmark Results
**1.5B models**. All models except the base model are trained with RL starting from DeepSeek-R1-Distill-Qwen-1.5B.
| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
| ------------------------------------------ | --------- | --------- | ------------ | --------- |
| [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | 28.33 | 24.90 | 27.45 | 26.89 |
| [DeepMath-1.5B](https://huggingface.co/zwhe99/DeepMath-1.5B) | 37.80 | 30.42 | 32.11 | 33.44 |
| [DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview) | 40.41 | 30.93 | 27.54 | 32.96 |
| [AReaL-1.5B-Preview-Stage-3](https://huggingface.co/inclusionAI/AReaL-1.5B-Preview-Stage-3) | 40.73 | 31.56 | 28.10 | 33.46 |
| AReaL-1.5B-retrain* | 44.42 | 34.27 | 33.81 | 37.50 |
| [FastCuRL-1.5B-V3](https://huggingface.co/Nickyang/FastCuRL-1.5B-V3) | 43.65 | 32.49 | 35.00 | 37.05 |
| [RLinf-math-1.5B](https://huggingface.co/RLinf/RLinf-math-1.5B) | **48.44** | **35.63** | **38.46** | **40.84** |
\* We retrain the model using the default settings for 600 steps.
**7B models**. All models except the base model are trained with RL starting from DeepSeek-R1-Distill-Qwen-7B.
| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
| ---------------------------------------- | --------- | --------- | ------------ | --------- |
| [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | 54.90 | 40.20 | 45.48 | 46.86 |
| [AReaL-boba-RL-7B](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B) | 61.66 | 49.38 | 46.93 | 52.66 |
| [Skywork-OR1-7B](https://huggingface.co/Skywork/Skywork-OR1-7B) | 66.87 | 52.49 | 44.43 | 54.60 |
| [Polaris-7B-Preview](https://huggingface.co/POLARIS-Project/Polaris-7B-Preview) | **68.55** | 51.24 | 43.88 | 54.56 |
| [AceMath-RL-Nemotron-7B](https://huggingface.co/nvidia/AceMath-RL-Nemotron-7B) | 67.30 | **55.00** | 45.57 | 55.96 |
| [RLinf-math-7B](https://huggingface.co/RLinf/RLinf-math-7B) | 68.33 | 52.19 | **48.18** | **56.23** |
## How to Use
Example with Hugging Face `transformers`:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RLinf/RLinf-math-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Solve: If x^2 + 2x + 1 = 0, what is x?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,   # enable sampling so temperature/top_p take effect
    temperature=1.0,  # recommended for the 7B model
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
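The distilled DeepSeek-R1 tokenizers ship with a chat template, so you may prefer to format the prompt with `apply_chat_template` (a sketch reusing the `model` and `tokenizer` loaded above; the message structure is an assumption about your use case):
```python
messages = [{"role": "user", "content": "Solve: If x^2 + 2x + 1 = 0, what is x?"}]
chat_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn so the model starts answering
    return_tensors="pt",
).to(model.device)
outputs = model.generate(
    chat_inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,  # recommended for the 7B model
    top_p=0.95,
)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][chat_inputs.shape[-1]:], skip_special_tokens=True))
```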
## License
This code repository and the model weights are licensed under the MIT License.