Update README.md
README.md (changed)

@@ -25,8 +25,6 @@ Moreover, we provide a [detailed recipe](https://github.com/RLHFlow/Online-DPO-R
 - Iterative DPO: Following the RLHF Workflow framework (https://arxiv.org/pdf/2405.07863), in each iteration, we sample multiple responses from the last trained policy, rank them via the rule-based reward, and construct the preference pairs.
 Then, we optimize the policy by minimizing the DPO loss and enter the next iteration.
 Online iterative DPO effectively mitigates the issues of distribution shift and the limited coverage of offline data.
-- Before the DPO training, we add an SFT warm-up procedure for the base model, which is fine-tuned on [RLHFlow/qwq_gen_sft_15k](https://huggingface.co/datasets/RLHFlow/qwq_gen_sft_15k).
-
 More details can be found in our [blog](https://www.notion.so/Online-DPO-R1-1908b9a70e7b80c3bc83f4cf04b2f175)!


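For readers skimming the diff, the Iterative DPO bullet in the hunk above boils down to a sample → rank → pair → DPO-update cycle. The sketch below is a minimal illustration under that description, not the RLHFlow implementation: only the loss follows the standard DPO formulation, the toy tensors stand in for real per-sequence log-probabilities, and the fixed `beta=0.1` is an assumed value.

```python
# Minimal sketch of the DPO update used in each online iteration (illustrative only).
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    policy_logratio = pi_chosen_logp - pi_rejected_logp
    reference_logratio = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_logratio - reference_logratio)).mean()

# Each iteration of the loop described above:
#   (1) sample several responses per prompt from the current policy,
#   (2) score them with the rule-based reward,
#   (3) keep the best/worst scored responses as (chosen, rejected) pairs,
#   (4) minimize dpo_loss, then sample the next round from the updated policy.
# Toy log-probabilities for two preference pairs:
pi_chosen, pi_rejected = torch.tensor([-12.0, -20.0]), torch.tensor([-15.0, -22.0])
ref_chosen, ref_rejected = torch.tensor([-13.0, -21.0]), torch.tensor([-14.0, -21.5])
print(dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected))  # mean loss over the two toy pairs
```
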
@@ -35,7 +33,8 @@ More detailed can be found in our [blog](https://www.notion.so/Online-DPO-R1-190
 |----------------------------|---------------|--------------|---------|------------------|-------------------|-------------|
 | **Ours** | | | | | | |
 | RLHFlow/Qwen2.5-7B-PPO-Zero | **43.3 (+26.6)** | 79.4 (+27.0) | **62.5 (+10.0)** | 33.1 (+20.2) | 40.7 (+24.3) | **51.8 (+21.6)** |
-| RLHFlow/Qwen2.5-7B-DPO-Zero |
+| RLHFlow/Qwen2.5-7B-DPO-Zero | 26.7 (+10.0) | 76.8 (+24.4) | **62.5 (+10.0)** | 30.9 (+18.0) | 37.9 (+21.5) | 47.0 (+16.8) |
+| RLHFlow/Qwen2.5-7B-DPO | 30.0 (+13.3) | **84.4 (+32.0)** | **62.5 (+10.0)** | **33.5 (+20.6)** | **48.4 (+32.0)** | **51.8 (+21.6)** |
 | RLHFlow/Qwen2.5-7B-RAFT-Zero | 20.0 (+3.3) | 77.6 (+25.2) | 55.0 (+2.5) | 30.5 (+17.6) | 38.7 (+22.3) | 44.4 (+14.2) |
 | **Baselines** | | | | | | |
 | Qwen2.5-Math-7B-Base | 16.7 | 52.4 | 52.5 | 12.9 | 16.4 | 30.2 |
@@ -50,5 +49,4 @@ More detailed can be found in our [blog](https://www.notion.so/Online-DPO-R1-190



-## Citation
-
+## Citation