Chenlu123 committed
Commit 801ea85 · verified · 1 Parent(s): a82f49f

Update README.md

Files changed (1)
  1. README.md +8 -7
README.md CHANGED
@@ -24,19 +24,22 @@ Moreover, we provide a [detailed recipe](https://github.com/RLHFlow/Online-DPO-R
  ## Training methods
  - Iterative DPO: Following the RLHF Workflow framework (https://arxiv.org/pdf/2405.07863), in each iteration, we sample multiple responses from the last trained policy, rank them via the rule-based reward, and construct preference pairs.
  Then, we optimize the policy by minimizing the DPO loss and enter the next iteration.
- Online iterative DPO can mitigate the issue of distribution shift and the limited coverage of offline data effectively.
- More details can be found in our [blog](https://www.notion.so/Online-DPO-R1-1908b9a70e7b80c3bc83f4cf04b2f175)!
+ Online iterative DPO can mitigate the issue of distribution shift and the limited coverage of offline data effectively.
+ - Before the DPO training, we add an SFT warm-up stage for the base model, which is fine-tuned on [RLHFlow/qwq_gen_sft_15k](https://huggingface.co/datasets/RLHFlow/qwq_gen_sft_15k).
+
+ More details can be found in our [blog](https://www.notion.so/Online-DPO-R1-1908b9a70e7b80c3bc83f4cf04b2f175)!


  ## Performance
  | **Model** | **AIME 2024** | **MATH 500** | **AMC** | **Minerva Math** | **OlympiadBench** | **Average** |
  |----------------------------|---------------|--------------|---------|------------------|-------------------|-------------|
  | **Ours** | | | | | | |
- | RLHFlow/Qwen2.5-7B-PPO-Zero | 43.3 **(+26.6)** | 79.4 **(+27.0)** | 62.5 **(+10.0)** | 33.1 **(+20.2)** | 40.7 **(+24.3)** | 51.8 **(+21.6)** |
- | RLHFlow/Qwen2.5-7B-DPO-Zero | 26.8 **(+10.1)** | 76.8 **(+24.4)** | 62.5 **(+10.0)** | 30.9 **(+18.0)** | 37.9 **(+21.5)** | 47.0 **(+16.8)** |
- | RLHFlow/Qwen2.5-7B-RAFT-Zero | 20.0 **(+3.3)** | 77.6 **(+25.2)** | 55.0 **(+2.5)** | 30.5 **(+17.6)** | 38.7 **(+22.3)** | 44.4 **(+14.2)** |
+ | RLHFlow/Qwen2.5-7B-PPO-Zero | **43.3 (+26.6)** | 79.4 (+27.0) | **62.5 (+10.0)** | 33.1 (+20.2) | 40.7 (+24.3) | **51.8 (+21.6)** |
+ | RLHFlow/Qwen2.5-7B-DPO-Zero | 30.0 (+13.3) | **84.4 (+32.0)** | **62.5 (+10.0)** | **33.5 (+20.6)** | **48.4 (+32.0)** | **51.8 (+21.6)** |
+ | RLHFlow/Qwen2.5-7B-RAFT-Zero | 20.0 (+3.3) | 77.6 (+25.2) | 55.0 (+2.5) | 30.5 (+17.6) | 38.7 (+22.3) | 44.4 (+14.2) |
  | **Baselines** | | | | | | |
  | Qwen2.5-Math-7B-Base | 16.7 | 52.4 | 52.5 | 12.9 | 16.4 | 30.2 |
+ | Qwen2.5-Math-7B-Base + SFT Warm-up | 20.0 | 73.2 | 62.5 | 30.5 | 35.6 | 44.4 |
  | Qwen-2.5-Math-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8 |
  | Llama-3.1-70B-Instruct | 16.7 | 64.6 | 30.1 | 35.3 | 31.9 | 35.7 |
  | Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
@@ -49,5 +52,3 @@ Moreover, we provide a [detailed recipe](https://github.com/RLHFlow/Online-DPO-R

  ## Citation

-
-
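
For readers who want the recipe from the updated Training methods section in concrete terms, here is a minimal sketch of one online iterative DPO round: sample several responses per prompt, score them with a rule-based reward, keep one correct/incorrect pair per prompt, and minimize the DPO loss. The helper names (`rule_based_reward`, `build_preference_pairs`, `dpo_loss`), the toy string-match verifier, and the dummy log-probabilities are illustrative assumptions, not code from the RLHFlow/Online-DPO-R1 repository.

```python
# Sketch of one online iterative DPO round (illustrative; not the RLHFlow implementation).
import torch
import torch.nn.functional as F


def rule_based_reward(response: str, gold_answer: str) -> float:
    """Toy verifier: reward 1.0 if the gold answer string appears in the response, else 0.0."""
    return 1.0 if gold_answer.strip() in response else 0.0


def build_preference_pairs(prompts, sampled_responses, gold_answers):
    """For each prompt, pick one correct (chosen) and one incorrect (rejected) sample."""
    pairs = []
    for prompt, responses, gold in zip(prompts, sampled_responses, gold_answers):
        scored = [(rule_based_reward(r, gold), r) for r in responses]
        correct = [r for s, r in scored if s > 0.5]
        wrong = [r for s, r in scored if s <= 0.5]
        if correct and wrong:  # skip prompts where all samples agree
            pairs.append({"prompt": prompt, "chosen": correct[0], "rejected": wrong[0]})
    return pairs


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """DPO loss: -log sigmoid(beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)))."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()


if __name__ == "__main__":
    # Dummy sequence log-probabilities standing in for the policy and frozen reference model.
    pol_c = torch.tensor([-12.3, -10.1])
    pol_r = torch.tensor([-11.8, -13.4])
    ref_c = torch.tensor([-12.5, -10.6])
    ref_r = torch.tensor([-11.5, -13.0])
    print("DPO loss:", dpo_loss(pol_c, pol_r, ref_c, ref_r).item())
```

In an actual pipeline the log-probabilities would come from the current policy and a frozen reference model, and the sample → score → pair → train loop would repeat once per iteration.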