# Online-DPO-R1
* **Blog**: https://www.notion.so/Online-DPO-R1-1908b9a70e7b80c3bc83f4cf04b2f175
* **Authors**:
* **Code**: https://github.com/RLHFlow/Online-DPO-R1
## Introduction
We release unofficial checkpoints for PPO, iterative DPO, and rejection sampling (RAFT) trained from Qwen2.5-MATH-7B-base with rule-based RL, building on the success of Deepseek-R1-Zero and recent replications of the PPO approach.
Evaluated on five widely adopted benchmarks, **AIME 2024**, **MATH 500**, **AMC**, **Minerva Math**, and **OlympiadBench**, our **iterative DPO** and **RAFT** models achieve
significant improvements over the base model and are comparable to the PPO approach.
Our models are trained using prompts from the MATH training set and Numina Math.
Moreover, we provide a [detailed recipe](https://github.com/RLHFlow/Online-DPO-R1) to reproduce the model. Enjoy!
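As a rough illustration of what a rule-based reward for math RL looks like, the sketch below scores a response by extracting the final `\boxed{...}` answer and comparing it to the ground truth. The function names and matching rules are our own simplifications, not the repository's exact implementation.

```python
import re

def extract_boxed_answer(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in a response, if any.
    Simplified: does not handle nested braces."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0."""
    answer = extract_boxed_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if answer == ground_truth.strip() else 0.0
```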
## Model Releases
- [PPO model](https://huggingface.co/RLHFlow/Qwen2.5-7B-PPO-Zero)
- [Iterative DPO from SFT model](https://huggingface.co/RLHFlow/Qwen2.5-7B-DPO)
- [Iterative DPO from base model](https://huggingface.co/RLHFlow/Qwen2.5-7B-DPO-Zero)
- [Iterative DPO with Negative Log-Likelihood (NLL)](https://huggingface.co/RLHFlow/Qwen2.5-7B-DPO-NLL-Zero)
- [RAFT](https://huggingface.co/RLHFlow/Qwen2.5-7B-RAFT-Zero)
## Dataset
## Training methods
- We first SFT the base model on the MATH training set [RLHFlow/qwq_gen_sft_15k](https://huggingface.co/datasets/RLHFlow/qwq_gen_sft_15k).
More details can be found in our [blog](https://www.notion.so/Online-DPO-R1-1908b9a70e7b80c3bc83f4cf04b2f175)!
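Iterative DPO with a rule-based reward roughly amounts to sampling several on-policy responses per prompt, labeling them with the verifier, and pairing a correct response (chosen) with an incorrect one (rejected) for a standard DPO update. The helper below is a hypothetical sketch of that data-construction step; the `generate_fn`/`reward_fn` callables and the sampling budget are assumptions, not the repository's code.

```python
import random

def build_dpo_pairs(prompts, generate_fn, reward_fn, num_samples=8):
    """Construct chosen/rejected pairs from on-policy samples scored by a rule-based reward.

    generate_fn(prompt) -> str          : samples one response from the current policy
    reward_fn(prompt, response) -> float: 1.0 for a correct answer, 0.0 otherwise
    """
    pairs = []
    for prompt in prompts:
        responses = [generate_fn(prompt) for _ in range(num_samples)]
        correct = [r for r in responses if reward_fn(prompt, r) > 0.5]
        incorrect = [r for r in responses if reward_fn(prompt, r) <= 0.5]
        # Only prompts with at least one correct and one incorrect sample yield a pair.
        if correct and incorrect:
            pairs.append({
                "prompt": prompt,
                "chosen": random.choice(correct),
                "rejected": random.choice(incorrect),
            })
    return pairs
```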
## Performance
| **Model** | **AIME 2024** | **MATH 500** | **AMC** | **Minerva Math** | **OlympiadBench** | **Average** |
|----------------------------|---------------|--------------|---------|------------------|-------------------|-------------|
| **Ours** | | | | | | |
| RLHFlow/Qwen2.5-7B-PPO-Zero | **43.3 (+26.6)** | 79.4 (+27.0) | **62.5 (+10.0)** | 33.1 (+20.2) | 40.7 (+24.3) | **51.8 (+21.6)** |
| RLHFlow/Qwen2.5-7B-DPO-Zero | 26.7 (+10.0) | 76.8 (+24.4) | **62.5 (+10.0)** | 30.9 (+18.0) | 37.9 (+21.5) | 47.0 (+16.8) |
| RLHFlow/Qwen2.5-7B-DPO | 30.0 (+13.3) | **84.4 (+32.0)** | **62.5 (+10.0)** | **33.5 (+20.6)** | **48.4 (+32.0)** | **51.8 (+21.6)** |
| RLHFlow/Qwen2.5-7B-RAFT-Zero | 20.0 (+3.3) | 77.6 (+25.2) | 55.0 (+2.5) | 30.5 (+17.6) | 38.7 (+22.3) | 44.4 (+14.2) |
| **Baselines** | | | | | | |
| Qwen2.5-Math-7B-Base | 16.7 | 52.4 | 52.5 | 12.9 | 16.4 | 30.2 |
| Qwen2.5-Math-7B-Base + SFT Warm-up | 20.0 | 73.2 | 62.5 | 30.5 | 35.6 | 44.4 |
| Qwen-2.5-Math-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8 |
| Llama-3.1-70B-Instruct | 16.7 | 64.6 | 30.1 | 35.3 | 31.9 | 35.7 |
| Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
| GPT-4o | 9.3 | 76.4 | 45.8 | 36.8 | 43.3 | 43.3 |
## Usage
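A minimal sketch of loading one of the checkpoints above with Hugging Face `transformers`. The prompt format and generation settings here are assumptions (the "-Zero" models are trained from the base model, so the chat template may differ); consult the repository's recipe for the exact evaluation setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RLHFlow/Qwen2.5-7B-DPO-Zero"  # any of the released checkpoints above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Assumed prompt style: ask for the final answer in \boxed{}.
prompt = "What is 13 * 17? Put your final answer in \\boxed{}."
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```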
## Citation