Chenlu123 committed (verified) · Commit 548cadf · 1 Parent(s): 801ea85

Update README.md

Files changed (1): README.md (+3, -5)
README.md CHANGED
@@ -25,8 +25,6 @@ Moreover, we provide a [detailed recipe](https://github.com/RLHFlow/Online-DPO-R
 - Iterative DPO: Following the RLHF Workflow framework (https://arxiv.org/pdf/2405.07863), in each iteration we sample multiple responses from the last trained policy, rank them via the rule-based reward, and construct preference pairs.
 Then, we optimize the policy by minimizing the DPO loss and enter the next iteration.
 Online iterative DPO effectively mitigates the distribution shift and the limited coverage of offline data.
-- Before the DPO training, we add an SFT warm-up procedure for the base model, which is fine-tuned on [RLHFlow/qwq_gen_sft_15k](https://huggingface.co/datasets/RLHFlow/qwq_gen_sft_15k).
-
 More details can be found in our [blog](https://www.notion.so/Online-DPO-R1-1908b9a70e7b80c3bc83f4cf04b2f175)!
 
 
@@ -35,7 +33,8 @@ More details can be found in our [blog](https://www.notion.so/Online-DPO-R1-190
 |----------------------------|---------------|--------------|---------|------------------|-------------------|-------------|
 | **Ours** | | | | | | |
 | RLHFlow/Qwen2.5-7B-PPO-Zero | **43.3 (+26.6)** | 79.4 (+27.0) | **62.5 (+10.0)** | 33.1 (+20.2) | 40.7 (+24.3) | **51.8 (+21.6)** |
-| RLHFlow/Qwen2.5-7B-DPO-Zero | 30.0 (+13.3) | **84.4 (+32.0)** | **62.5 (+10.0)** | **33.5 (+20.6)** | **48.4 (+32.0)** | **51.8 (+21.6)** |
+| RLHFlow/Qwen2.5-7B-DPO-Zero | 26.7 (+10.0) | 76.8 (+24.4) | **62.5 (+10.0)** | 30.9 (+18.0) | 37.9 (+21.5) | 47.0 (+16.8) |
+| RLHFlow/Qwen2.5-7B-DPO | 30.0 (+13.3) | **84.4 (+32.0)** | **62.5 (+10.0)** | **33.5 (+20.6)** | **48.4 (+32.0)** | **51.8 (+21.6)** |
 | RLHFlow/Qwen2.5-7B-RAFT-Zero | 20.0 (+3.3) | 77.6 (+25.2) | 55.0 (+2.5) | 30.5 (+17.6) | 38.7 (+22.3) | 44.4 (+14.2) |
 | **Baselines** | | | | | | |
 | Qwen2.5-Math-7B-Base | 16.7 | 52.4 | 52.5 | 12.9 | 16.4 | 30.2 |
@@ -50,5 +49,4 @@ More details can be found in our [blog](https://www.notion.so/Online-DPO-R1-190
 
 
 
-## Citation
-
+## Citation
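
For reference, the online iterative DPO loop described in the diff above (sample multiple responses from the last trained policy, rank them with the rule-based reward, build preference pairs, then minimize the DPO loss) can be summarized in a short sketch. This is a minimal illustration under assumed interfaces, not the repository's actual training code; `policy_generate`, `rule_based_reward`, `beta`, and the toy demo values are hypothetical stand-ins.

```python
# Minimal sketch of one round of online iterative DPO (illustrative only;
# the helper names below are assumptions, not the Online-DPO-R1 codebase).
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # DPO objective: -log sigmoid(beta * [(log pi - log pi_ref)_chosen
    #                                     - (log pi - log pi_ref)_rejected])
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(logits).mean()


def build_preference_pairs(prompts, policy_generate, rule_based_reward, n_samples=8):
    # Sample n responses per prompt, score them with the rule-based reward,
    # and keep (best, worst) as a preference pair whenever the rewards differ.
    pairs = []
    for prompt in prompts:
        responses = [policy_generate(prompt) for _ in range(n_samples)]
        rewards = [rule_based_reward(prompt, r) for r in responses]
        best = max(range(n_samples), key=lambda i: rewards[i])
        worst = min(range(n_samples), key=lambda i: rewards[i])
        if rewards[best] > rewards[worst]:
            pairs.append((prompt, responses[best], responses[worst]))
    return pairs


if __name__ == "__main__":
    # Toy stand-ins: a "policy" that appends a random digit and a reward that
    # checks the final answer (hypothetical, for illustration only).
    toy_generate = lambda p: p + str(torch.randint(0, 10, (1,)).item())
    toy_reward = lambda p, r: float(r.endswith("7"))
    print(build_preference_pairs(["1+6=", "2+5="], toy_generate, toy_reward))
    # Scalar sequence log-probs for one (chosen, rejected) pair.
    print(dpo_loss(torch.tensor(-10.0), torch.tensor(-12.0),
                   torch.tensor(-11.0), torch.tensor(-11.5)).item())
```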