---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-8B-Base
datasets:
- Kwai-Klear/KlearReasoner-MathSub-30K
- Kwai-Klear/KlearReasoner-CodeSub-15K
metrics:
- accuracy
---
# ✨ Klear-Reasoner-8B-SFT
We present Klear-Reasoner, a long-reasoning model that deliberates carefully during problem solving and achieves outstanding performance across multiple benchmarks. We investigate two key issues with current clipping mechanisms in RL: clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose **G**radient-**P**reserving clipping **P**olicy **O**ptimization (**GPPO**), which gently backpropagates gradients from clipped tokens.

| Resource | Link |
|---|---|
| Preprints | [Paper](https://arxiv.org/pdf/2508.07629) |
| 🤗 Daily Paper | [Paper](https://huggingface.co/papers/2508.07629) |
| 🤗 Model Hub | [Klear-Reasoner-8B](https://huggingface.co/Kwai-Klear/Klear-Reasoner-8B) |
| 🤗 Dataset Hub | [Math RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) |
| 🤗 Dataset Hub | [Code RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-CodeSub-15K) |
| Issues & Discussions | [GitHub Issues](https://github.com/suu990901/KlearReasoner/issues) |
| 📧 Contact | [email protected] |
## Overview
<div align="center">
<img src="main_result.png" width="100%"/>
<sub>Benchmark accuracy of Klear-Reasoner-8B on AIME 2024/2025 (avg@64), LiveCodeBench V5 (2024/08/01-2025/02/01, avg@8), and V6 (2025/02/01-2025/05/01, avg@8).</sub>
</div>
Klear-Reasoner is an 8-billion-parameter reasoning model that achieves **SOTA** performance on challenging **math and coding benchmarks**:

| Benchmark | AIME 2024 | AIME 2025 | LiveCodeBench V5 | LiveCodeBench V6 |
|---|---|---|---|---|
| **Score** | **90.5%** | **83.2%** | **66.0%** | **58.1%** |

The model combines:
1. **Quality-centric long CoT SFT**, distilled from DeepSeek-R1-0528.
2. **Gradient-Preserving Clipping Policy Optimization (GPPO)**, a novel RL method that **keeps gradients from clipped tokens** to boost exploration and convergence (an illustrative sketch follows this list).
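For intuition, the following is a minimal PyTorch sketch of a gradient-preserving clip written as a straight-through-style construction: the forward value equals the hard-clipped ratio, but the backward pass still carries a gradient through clipped tokens. This is only an illustration of the general idea under our own assumptions (including the placeholder clip bounds `eps_low`/`eps_high`), not the exact GPPO objective from the paper.
```python
# Illustrative sketch only -- NOT the paper's exact GPPO objective.
# Idea: keep a gradient path through tokens whose importance ratio is clipped,
# instead of zeroing their gradient as hard clipping does.
import torch


def gradient_preserving_clip(ratio: torch.Tensor, eps_low: float, eps_high: float) -> torch.Tensor:
    """Forward value equals the clipped ratio; the backward pass still sees
    a gradient through `ratio` (straight-through-style construction)."""
    clipped = ratio.clamp(1.0 - eps_low, 1.0 + eps_high)
    return ratio + (clipped - ratio).detach()


def policy_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    # Token-level importance ratio between the new and old policies.
    ratio = torch.exp(logp_new - logp_old)
    surrogate_unclipped = ratio * advantages
    surrogate_clipped = gradient_preserving_clip(ratio, eps_low, eps_high) * advantages
    # Same pessimistic min as PPO-style objectives, but clipped tokens
    # still contribute a (bounded-value) gradient signal.
    return -torch.min(surrogate_unclipped, surrogate_clipped).mean()
```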
---
### Evaluation
For the best results, we expand the inference budget to 64K tokens and adopt the YaRN method with a scaling factor of 2.5 (reported as *w/ 64K Inference Budget* in the table below). **Evaluation is coming soon, stay tuned.**
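As a rough illustration of that setting, one way to enable YaRN-style long-context inference with 🤗 Transformers is to override the `rope_scaling` entry of the model config before loading. The snippet below is a sketch under our own assumptions (scaling factor 2.5 over a 32K base context, 64K maximum positions); it is not the official evaluation configuration.
```python
# Sketch: load the released checkpoint with a YaRN RoPE-scaling override
# (factor 2.5, 64K context). Treat the exact values as assumptions taken
# from the description above, not as the official evaluation setup.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "Kwai-Klear/Klear-Reasoner-8B"

config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 2.5,
    "original_max_position_embeddings": 32768,  # assumed base context length
}
config.max_position_embeddings = 65536  # 64K inference budget

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, config=config, torch_dtype="auto")
```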
## Benchmark Results (Pass@1)
| Model | AIME2024<br>avg@64 | AIME2025<br>avg@64 | HMMT2025<br>avg@64 | LCB V5<br>avg@8 | LCB V6<br>avg@8 |
|-------|--------------------|--------------------|--------------------|-----------------|-----------------|
| AReal-boba-RL-7B | 61.9 | 48.3 | 29.4 | 34.3 | 31.0† |
| MiMo-7B-RL | 68.2 | 55.4 | 35.7 | 57.8 | 49.3 |
| Skywork-OR1-7B | 70.2 | 54.6 | 35.7 | 47.6 | 42.7 |
| AceReason-Nemotron-1.1-7B | 72.6 | 64.8 | 42.9 | 57.2 | 52.1 |
| POLARIS-4B-Preview | 81.2 | _79.4_ | 58.7 | 58.5† | 53.0† |
| Qwen3-8B | 76.0 | 67.3 | 44.7† | 57.5 | 48.4† |
| Deepseek-R1-0528-Distill-8B | _86.0_ | 76.3 | 61.5 | 61.0† | 51.6† |
| OpenReasoning-Nemotron-7B | 84.7 | 78.2 | 63.5 | _65.6_† | _56.3_† |
| Klear-Reasoner-8B-SFT | 75.6 | 70.1 | 57.6 | 58.5 | 49.6 |
| Klear-Reasoner-8B | 83.2 | 75.6 | 60.3 | 61.6 | 53.1 |
| *w/ 64K Inference Budget* | **90.5** | **83.2** | **70.8** | **66.0** | **58.1** |

> We report average `pass@1` results (avg@_n_); all other evaluation settings follow the DeepSeek-R1 assessment framework (temperature=0.6, top_p=0.95).
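Here avg@_n_ denotes the mean pass@1 over _n_ independent samples per problem, averaged across all problems in the benchmark. A small illustrative helper (not taken from the evaluation code) makes the computation explicit:
```python
# Illustrative only: compute avg@n pass@1 from per-sample pass/fail outcomes.
from typing import List


def avg_at_n(per_problem_correct: List[List[bool]]) -> float:
    """per_problem_correct[i] holds the outcomes of n sampled solutions for problem i."""
    per_problem_rates = [sum(samples) / len(samples) for samples in per_problem_correct]
    return sum(per_problem_rates) / len(per_problem_rates)


# Example: 2 problems with n=4 samples each -> (0.75 + 0.25) / 2 = 0.5
print(avg_at_n([[True, True, True, False], [False, True, False, False]]))
```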
---
## 🧪 Training
### Configure the experimental environment
```bash
git clone https://github.com/suu990901/Klear_Reasoner
cd Klear_Reasoner
pip install -r requirements.txt
```
For code problems, we use [Firejail](https://github.com/netblue30/firejail) as the **sandbox** environment. In addition, we implement multi-process control with [Pebble](https://github.com/noxdafox/pebble), which automatically reclaims resources when a task times out. For mathematics, we use [math_verify](https://github.com/huggingface/Math-Verify) for answer judging.
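As a minimal sketch of how these pieces can be wired together, the snippet below schedules each judgment in a separate process with a Pebble timeout and checks math answers with `math_verify`. The timeout value is a placeholder, and the repository's actual reward code may differ.
```python
# Sketch only: per-task timeout via Pebble + answer checking via math_verify.
# The actual reward implementation in the repository may differ.
from concurrent.futures import TimeoutError

from math_verify import parse, verify
from pebble import ProcessPool


def judge_math(gold_answer: str, model_answer: str) -> bool:
    # math_verify parses both strings and checks mathematical equivalence.
    return verify(parse(gold_answer), parse(model_answer))


def judge_with_timeout(pairs, timeout_s=10.0):
    """Judge (gold, prediction) pairs in worker processes; workers that exceed
    timeout_s are terminated and their resources reclaimed."""
    results = []
    with ProcessPool() as pool:
        futures = [pool.schedule(judge_math, args=pair, timeout=timeout_s) for pair in pairs]
        for future in futures:
            try:
                results.append(future.result())
            except TimeoutError:
                results.append(False)  # treat timeouts as incorrect
    return results
```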
### Using Ray for Multi-Node Training
For multi-node training, ensure all nodes are started and connected via Ray before executing the training script. Below is a brief setup guide for Ray across multiple machines:
#### Step 1: Start Ray on the Head Node (node0)
On the first node (typically called `node0`), run:
```bash
ray start --head --dashboard-host=0.0.0.0
```
Get the IP address of the head node; the worker nodes will need it to join the cluster.
```bash
MASTER_IP=$(hostname -I | awk '{print $1}')
```
#### Step 2: Connect Other Nodes (e.g., node1)
On each additional worker node (e.g., `node1`), run the following, replacing the IP with that of your head node:
```bash
# MASTER_IP must be set to the head node's IP on this worker
ray start --address="$MASTER_IP:6379"
```
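Once the worker nodes have joined, you can verify the cluster membership by running `ray status` on the head node.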
### RL Training
Run one of the following scripts on the head node to start training.
```bash
bash recipe/dapo/perf_run_dapo_ours_math.sh # For Math RL
bash recipe/dapo/perf_run_dapo_ours_code.sh # For Code RL
```
In the startup script, you need to set the following variables:
```bash
YOUR_MODEL_PATH="<your_model_path>"
CKPTS_SAVE_DIR="<ckpts_save_path>"
YOUR_TRAIN_FILE="<train_data_path>"
YOUR_TEST_FILE="<test_data_path>"
```
## Citation
If you find this work helpful, please cite our paper:
```bibtex
@misc{su2025klearreasoneradvancingreasoningcapability,
title={Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization},
author={Zhenpeng Su and Leiyu Pan and Xue Bai and Dening Liu and Guanting Dong and Jiaming Huang and Wenping Hu and Fuzheng Zhang and Kun Gai and Guorui Zhou},
year={2025},
eprint={2508.07629},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2508.07629},
}
```