chchen committed
Commit a862bf0 · verified · 1 Parent(s): 43764cb

End of training
README.md CHANGED
@@ -3,9 +3,10 @@ library_name: peft
 license: llama3.1
 base_model: meta-llama/Llama-3.1-8B-Instruct
 tags:
+- llama-factory
+- lora
 - trl
 - dpo
-- llama-factory
 - generated_from_trainer
 model-index:
 - name: Llama-3.1-8B-Instruct-dpo-llama-1000
@@ -17,17 +18,17 @@ should probably proofread and complete it, then remove this comment. -->
 
 # Llama-3.1-8B-Instruct-dpo-llama-1000
 
-This model is a fine-tuned version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) on an unknown dataset.
+This model is a fine-tuned version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) on the answer_llama dataset.
 It achieves the following results on the evaluation set:
-- Loss: 0.3613
-- Rewards/chosen: 1.3392
-- Rewards/rejected: -1.7432
-- Rewards/accuracies: 0.8400
-- Rewards/margins: 3.0824
-- Logps/chosen: -9.1017
-- Logps/rejected: -41.8256
-- Logits/chosen: -0.1378
-- Logits/rejected: -0.2410
+- Loss: 0.3077
+- Rewards/chosen: 1.4814
+- Rewards/rejected: -0.7600
+- Rewards/accuracies: 0.8500
+- Rewards/margins: 2.2414
+- Logps/chosen: -7.6796
+- Logps/rejected: -31.9936
+- Logits/chosen: -0.2154
+- Logits/rejected: -0.3106
 
 ## Model description
 
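The Loss and Rewards columns above come from TRL's DPO objective. As a rough illustration (a simplified sketch, not LLaMA-Factory's exact implementation), the per-pair loss is the negative log-sigmoid of the reward margin, with the DPO beta already folded into the logged rewards:

```python
import math

def dpo_pair_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Per-pair DPO loss: -log(sigmoid(chosen_reward - rejected_reward)).

    Each reward is beta * (policy_logp - reference_logp), which is what the
    rewards/chosen and rewards/rejected columns report.
    """
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At initialization both rewards are ~0, so the loss starts near
# -log(0.5) = 0.693, matching the first logged training loss below.
print(round(dpo_pair_loss(0.0, 0.0), 3))  # 0.693
```

As training drives the margin up, this loss falls toward zero, which is the trend visible in the log_history below.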
all_results.json ADDED
@@ -0,0 +1,20 @@
+{
+  "epoch": 9.955555555555556,
+  "eval_logits/chosen": -0.21535295248031616,
+  "eval_logits/rejected": -0.31062132120132446,
+  "eval_logps/chosen": -7.679609775543213,
+  "eval_logps/rejected": -31.993576049804688,
+  "eval_loss": 0.307707816362381,
+  "eval_rewards/accuracies": 0.8499999642372131,
+  "eval_rewards/chosen": 1.48140287399292,
+  "eval_rewards/margins": 2.2414333820343018,
+  "eval_rewards/rejected": -0.7600305080413818,
+  "eval_runtime": 12.0737,
+  "eval_samples_per_second": 8.282,
+  "eval_steps_per_second": 4.141,
+  "total_flos": 6.741083695664333e+16,
+  "train_loss": 0.31093634622437616,
+  "train_runtime": 2833.3873,
+  "train_samples_per_second": 3.176,
+  "train_steps_per_second": 0.198
+}
eval_results.json ADDED
@@ -0,0 +1,15 @@
+{
+  "epoch": 9.955555555555556,
+  "eval_logits/chosen": -0.21535295248031616,
+  "eval_logits/rejected": -0.31062132120132446,
+  "eval_logps/chosen": -7.679609775543213,
+  "eval_logps/rejected": -31.993576049804688,
+  "eval_loss": 0.307707816362381,
+  "eval_rewards/accuracies": 0.8499999642372131,
+  "eval_rewards/chosen": 1.48140287399292,
+  "eval_rewards/margins": 2.2414333820343018,
+  "eval_rewards/rejected": -0.7600305080413818,
+  "eval_runtime": 12.0737,
+  "eval_samples_per_second": 8.282,
+  "eval_steps_per_second": 4.141
+}
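As a quick sanity check, the three reward metrics are mutually consistent: in TRL's DPO evaluation, rewards/margins is the mean difference between chosen and rejected rewards. A minimal sketch, with the values copied from eval_results.json above:

```python
# Metrics copied from eval_results.json above.
eval_rewards = {
    "chosen": 1.48140287399292,
    "rejected": -0.7600305080413818,
    "margins": 2.2414333820343018,
}

# margins should equal chosen - rejected (up to float rounding).
margin = eval_rewards["chosen"] - eval_rewards["rejected"]
assert abs(margin - eval_rewards["margins"]) < 1e-9
```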
train_results.json ADDED
@@ -0,0 +1,8 @@
+{
+  "epoch": 9.955555555555556,
+  "total_flos": 6.741083695664333e+16,
+  "train_loss": 0.31093634622437616,
+  "train_runtime": 2833.3873,
+  "train_samples_per_second": 3.176,
+  "train_steps_per_second": 0.198
+}
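The throughput figures are similarly self-consistent: dividing the 560 optimizer steps logged as global_step in trainer_state.json by the logged train_runtime reproduces the reported steps-per-second. A minimal check using only values from this commit:

```python
# Figures copied from train_results.json and trainer_state.json above.
train_runtime = 2833.3873  # seconds
global_step = 560          # total optimizer steps

steps_per_second = global_step / train_runtime
# Matches the logged train_steps_per_second of 0.198.
assert round(steps_per_second, 3) == 0.198
```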
trainer_state.json ADDED
@@ -0,0 +1,1058 @@
+{
+  "best_metric": 0.307707816362381,
+  "best_model_checkpoint": "saves/sycophancy/Llama-3.1-8B-Instruct/dpo-llama-1000/train/checkpoint-200",
+  "epoch": 9.955555555555556,
+  "eval_steps": 50,
+  "global_step": 560,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "epoch": 0.17777777777777778,
+      "grad_norm": 1.8430479764938354,
+      "learning_rate": 8.928571428571429e-07,
+      "logits/chosen": -0.42922043800354004,
+      "logits/rejected": -0.48287302255630493,
+      "logps/chosen": -22.499736785888672,
+      "logps/rejected": -25.05568504333496,
+      "loss": 0.693,
+      "rewards/accuracies": 0.4312500059604645,
+      "rewards/chosen": 3.8243924791458994e-05,
+      "rewards/margins": 0.00029587256722152233,
+      "rewards/rejected": -0.00025762844597920775,
+      "step": 10
+    },
+    {
+      "epoch": 0.35555555555555557,
+      "grad_norm": 1.1751757860183716,
+      "learning_rate": 1.7857142857142859e-06,
+      "logits/chosen": -0.4168773591518402,
+      "logits/rejected": -0.4500039517879486,
+      "logps/chosen": -23.05335807800293,
+      "logps/rejected": -24.607397079467773,
+      "loss": 0.6923,
+      "rewards/accuracies": 0.543749988079071,
+      "rewards/chosen": 0.0031167108099907637,
+      "rewards/margins": 0.001834285445511341,
+      "rewards/rejected": 0.0012824248988181353,
+      "step": 20
+    },
+    {
+      "epoch": 0.5333333333333333,
+      "grad_norm": 2.2308878898620605,
+      "learning_rate": 2.6785714285714285e-06,
+      "logits/chosen": -0.4415621757507324,
+      "logits/rejected": -0.47939205169677734,
+      "logps/chosen": -23.05750846862793,
+      "logps/rejected": -24.027509689331055,
+      "loss": 0.6925,
+      "rewards/accuracies": 0.4937500059604645,
+      "rewards/chosen": 0.006902683060616255,
+      "rewards/margins": 0.0014179922873154283,
+      "rewards/rejected": 0.005484690889716148,
+      "step": 30
+    },
+    {
+      "epoch": 0.7111111111111111,
+      "grad_norm": 1.4753832817077637,
+      "learning_rate": 3.5714285714285718e-06,
+      "logits/chosen": -0.4203382134437561,
+      "logits/rejected": -0.4865742623806,
+      "logps/chosen": -21.71892738342285,
+      "logps/rejected": -24.279401779174805,
+      "loss": 0.6879,
+      "rewards/accuracies": 0.6312500238418579,
+      "rewards/chosen": 0.019825046882033348,
+      "rewards/margins": 0.010711194016039371,
+      "rewards/rejected": 0.009113854728639126,
+      "step": 40
+    },
+    {
+      "epoch": 0.8888888888888888,
+      "grad_norm": 1.723747968673706,
+      "learning_rate": 4.464285714285715e-06,
+      "logits/chosen": -0.45647165179252625,
+      "logits/rejected": -0.4898137152194977,
+      "logps/chosen": -22.14567756652832,
+      "logps/rejected": -24.119258880615234,
+      "loss": 0.6815,
+      "rewards/accuracies": 0.6312500238418579,
+      "rewards/chosen": 0.04727761819958687,
+      "rewards/margins": 0.024313444271683693,
+      "rewards/rejected": 0.022964173927903175,
+      "step": 50
+    },
+    {
+      "epoch": 0.8888888888888888,
+      "eval_logits/chosen": -0.4114196300506592,
+      "eval_logits/rejected": -0.4791559875011444,
+      "eval_logps/chosen": -21.660146713256836,
+      "eval_logps/rejected": -24.039798736572266,
+      "eval_loss": 0.6706590056419373,
+      "eval_rewards/accuracies": 0.6899999976158142,
+      "eval_rewards/chosen": 0.08334928005933762,
+      "eval_rewards/margins": 0.048002347350120544,
+      "eval_rewards/rejected": 0.03534693643450737,
+      "eval_runtime": 12.9721,
+      "eval_samples_per_second": 7.709,
+      "eval_steps_per_second": 3.854,
+      "step": 50
+    },
+    {
+      "epoch": 1.0666666666666667,
+      "grad_norm": 1.8534899950027466,
+      "learning_rate": 4.999222955002041e-06,
+      "logits/chosen": -0.4518943428993225,
+      "logits/rejected": -0.4910917282104492,
+      "logps/chosen": -22.00179672241211,
+      "logps/rejected": -23.545581817626953,
+      "loss": 0.6705,
+      "rewards/accuracies": 0.6312500238418579,
+      "rewards/chosen": 0.106198251247406,
+      "rewards/margins": 0.05078822001814842,
+      "rewards/rejected": 0.05541003867983818,
+      "step": 60
+    },
+    {
+      "epoch": 1.2444444444444445,
+      "grad_norm": 2.505237340927124,
+      "learning_rate": 4.990486745229364e-06,
+      "logits/chosen": -0.42713257670402527,
+      "logits/rejected": -0.48525214195251465,
+      "logps/chosen": -20.35358428955078,
+      "logps/rejected": -23.495656967163086,
+      "loss": 0.6328,
+      "rewards/accuracies": 0.6875,
+      "rewards/chosen": 0.25575509667396545,
+      "rewards/margins": 0.14482033252716064,
+      "rewards/rejected": 0.11093475669622421,
+      "step": 70
+    },
+    {
+      "epoch": 1.4222222222222223,
+      "grad_norm": 2.34883713722229,
+      "learning_rate": 4.9720770655628216e-06,
+      "logits/chosen": -0.40827736258506775,
+      "logits/rejected": -0.45998048782348633,
+      "logps/chosen": -19.513925552368164,
+      "logps/rejected": -23.603103637695312,
+      "loss": 0.5899,
+      "rewards/accuracies": 0.7250000238418579,
+      "rewards/chosen": 0.4022657573223114,
+      "rewards/margins": 0.26808053255081177,
+      "rewards/rejected": 0.13418518006801605,
+      "step": 80
+    },
+    {
+      "epoch": 1.6,
+      "grad_norm": 3.0297181606292725,
+      "learning_rate": 4.944065422298262e-06,
+      "logits/chosen": -0.3988552689552307,
+      "logits/rejected": -0.465530127286911,
+      "logps/chosen": -14.463418960571289,
+      "logps/rejected": -21.68548583984375,
+      "loss": 0.5208,
+      "rewards/accuracies": 0.7750000357627869,
+      "rewards/chosen": 0.7595838904380798,
+      "rewards/margins": 0.5038682818412781,
+      "rewards/rejected": 0.25571563839912415,
+      "step": 90
+    },
+    {
+      "epoch": 1.7777777777777777,
+      "grad_norm": 2.1064534187316895,
+      "learning_rate": 4.90656061737503e-06,
+      "logits/chosen": -0.3954499363899231,
+      "logits/rejected": -0.448087602853775,
+      "logps/chosen": -13.89409065246582,
+      "logps/rejected": -21.66886329650879,
+      "loss": 0.5082,
+      "rewards/accuracies": 0.6937500238418579,
+      "rewards/chosen": 0.9080438613891602,
+      "rewards/margins": 0.6023429036140442,
+      "rewards/rejected": 0.30570098757743835,
+      "step": 100
+    },
+    {
+      "epoch": 1.7777777777777777,
+      "eval_logits/chosen": -0.3558562695980072,
+      "eval_logits/rejected": -0.4376991093158722,
+      "eval_logps/chosen": -12.185461044311523,
+      "eval_logps/rejected": -22.450639724731445,
+      "eval_loss": 0.4427681863307953,
+      "eval_rewards/accuracies": 0.7899999618530273,
+      "eval_rewards/chosen": 1.0308177471160889,
+      "eval_rewards/margins": 0.8365544080734253,
+      "eval_rewards/rejected": 0.19426332414150238,
+      "eval_runtime": 12.9808,
+      "eval_samples_per_second": 7.704,
+      "eval_steps_per_second": 3.852,
+      "step": 100
+    },
+    {
+      "epoch": 1.9555555555555557,
+      "grad_norm": 2.931056261062622,
+      "learning_rate": 4.859708325770919e-06,
+      "logits/chosen": -0.3505342900753021,
+      "logits/rejected": -0.41251203417778015,
+      "logps/chosen": -12.102005004882812,
+      "logps/rejected": -22.603740692138672,
+      "loss": 0.4419,
+      "rewards/accuracies": 0.8375000357627869,
+      "rewards/chosen": 0.9892138838768005,
+      "rewards/margins": 0.8685030341148376,
+      "rewards/rejected": 0.12071088701486588,
+      "step": 110
+    },
+    {
+      "epoch": 2.1333333333333333,
+      "grad_norm": 3.659130573272705,
+      "learning_rate": 4.80369052967602e-06,
+      "logits/chosen": -0.3336087167263031,
+      "logits/rejected": -0.4222935140132904,
+      "logps/chosen": -11.186881065368652,
+      "logps/rejected": -24.831924438476562,
+      "loss": 0.365,
+      "rewards/accuracies": 0.84375,
+      "rewards/chosen": 1.164413332939148,
+      "rewards/margins": 1.20394766330719,
+      "rewards/rejected": -0.03953445702791214,
+      "step": 120
+    },
+    {
+      "epoch": 2.311111111111111,
+      "grad_norm": 3.590104103088379,
+      "learning_rate": 4.7387248116432524e-06,
+      "logits/chosen": -0.319705069065094,
+      "logits/rejected": -0.4010167717933655,
+      "logps/chosen": -10.573092460632324,
+      "logps/rejected": -25.705326080322266,
+      "loss": 0.3523,
+      "rewards/accuracies": 0.824999988079071,
+      "rewards/chosen": 1.1774530410766602,
+      "rewards/margins": 1.3132127523422241,
+      "rewards/rejected": -0.13575978577136993,
+      "step": 130
+    },
+    {
+      "epoch": 2.488888888888889,
+      "grad_norm": 3.342087984085083,
+      "learning_rate": 4.665063509461098e-06,
+      "logits/chosen": -0.33713865280151367,
+      "logits/rejected": -0.40943965315818787,
+      "logps/chosen": -11.759323120117188,
+      "logps/rejected": -27.167184829711914,
+      "loss": 0.3881,
+      "rewards/accuracies": 0.824999988079071,
+      "rewards/chosen": 1.0846505165100098,
+      "rewards/margins": 1.4101663827896118,
+      "rewards/rejected": -0.3255158066749573,
+      "step": 140
+    },
+    {
+      "epoch": 2.6666666666666665,
+      "grad_norm": 1.9896889925003052,
+      "learning_rate": 4.5829927360311224e-06,
+      "logits/chosen": -0.2843226492404938,
+      "logits/rejected": -0.3789869546890259,
+      "logps/chosen": -10.016444206237793,
+      "logps/rejected": -29.09038734436035,
+      "loss": 0.2979,
+      "rewards/accuracies": 0.862500011920929,
+      "rewards/chosen": 1.313085675239563,
+      "rewards/margins": 1.761974573135376,
+      "rewards/rejected": -0.4488888680934906,
+      "step": 150
+    },
+    {
+      "epoch": 2.6666666666666665,
+      "eval_logits/chosen": -0.2695091664791107,
+      "eval_logits/rejected": -0.36552613973617554,
+      "eval_logps/chosen": -9.01309871673584,
+      "eval_logps/rejected": -28.563730239868164,
+      "eval_loss": 0.32149210572242737,
+      "eval_rewards/accuracies": 0.85999995470047,
+      "eval_rewards/chosen": 1.348054051399231,
+      "eval_rewards/margins": 1.7650997638702393,
+      "eval_rewards/rejected": -0.4170458912849426,
+      "eval_runtime": 13.0246,
+      "eval_samples_per_second": 7.678,
+      "eval_steps_per_second": 3.839,
+      "step": 150
+    },
+    {
+      "epoch": 2.8444444444444446,
+      "grad_norm": 1.8847765922546387,
+      "learning_rate": 4.492831268057307e-06,
+      "logits/chosen": -0.2392854541540146,
+      "logits/rejected": -0.32519179582595825,
+      "logps/chosen": -10.91231918334961,
+      "logps/rejected": -29.385107040405273,
+      "loss": 0.3806,
+      "rewards/accuracies": 0.8375000357627869,
+      "rewards/chosen": 1.174955129623413,
+      "rewards/margins": 1.6338344812393188,
+      "rewards/rejected": -0.458879292011261,
+      "step": 160
+    },
+    {
+      "epoch": 3.022222222222222,
+      "grad_norm": 4.04591703414917,
+      "learning_rate": 4.394929307863633e-06,
+      "logits/chosen": -0.25453540682792664,
+      "logits/rejected": -0.3346460461616516,
+      "logps/chosen": -9.183024406433105,
+      "logps/rejected": -30.89423942565918,
+      "loss": 0.2928,
+      "rewards/accuracies": 0.875,
+      "rewards/chosen": 1.3575671911239624,
+      "rewards/margins": 1.941465973854065,
+      "rewards/rejected": -0.5838987231254578,
+      "step": 170
+    },
+    {
+      "epoch": 3.2,
+      "grad_norm": 4.6089653968811035,
+      "learning_rate": 4.289667123149296e-06,
+      "logits/chosen": -0.2588854432106018,
+      "logits/rejected": -0.344396710395813,
+      "logps/chosen": -8.893033027648926,
+      "logps/rejected": -30.978750228881836,
+      "loss": 0.2991,
+      "rewards/accuracies": 0.8812500238418579,
+      "rewards/chosen": 1.382232427597046,
+      "rewards/margins": 1.9790451526641846,
+      "rewards/rejected": -0.596812903881073,
+      "step": 180
+    },
+    {
+      "epoch": 3.3777777777777778,
+      "grad_norm": 2.9751718044281006,
+      "learning_rate": 4.177453569964925e-06,
+      "logits/chosen": -0.21491435170173645,
+      "logits/rejected": -0.30069640278816223,
+      "logps/chosen": -8.679551124572754,
+      "logps/rejected": -32.274330139160156,
+      "loss": 0.2848,
+      "rewards/accuracies": 0.8687500357627869,
+      "rewards/chosen": 1.3893749713897705,
+      "rewards/margins": 2.1629250049591064,
+      "rewards/rejected": -0.7735500931739807,
+      "step": 190
+    },
+    {
+      "epoch": 3.5555555555555554,
+      "grad_norm": 4.007388591766357,
+      "learning_rate": 4.058724504646834e-06,
+      "logits/chosen": -0.2135535329580307,
+      "logits/rejected": -0.280956506729126,
+      "logps/chosen": -8.478487968444824,
+      "logps/rejected": -31.442169189453125,
+      "loss": 0.2862,
+      "rewards/accuracies": 0.8500000238418579,
+      "rewards/chosen": 1.4557212591171265,
+      "rewards/margins": 2.184141159057617,
+      "rewards/rejected": -0.7284198999404907,
+      "step": 200
+    },
+    {
+      "epoch": 3.5555555555555554,
+      "eval_logits/chosen": -0.21535295248031616,
+      "eval_logits/rejected": -0.31062132120132446,
+      "eval_logps/chosen": -7.679609775543213,
+      "eval_logps/rejected": -31.993576049804688,
+      "eval_loss": 0.307707816362381,
+      "eval_rewards/accuracies": 0.8499999642372131,
+      "eval_rewards/chosen": 1.48140287399292,
+      "eval_rewards/margins": 2.2414333820343018,
+      "eval_rewards/rejected": -0.7600305080413818,
+      "eval_runtime": 13.0936,
+      "eval_samples_per_second": 7.637,
+      "eval_steps_per_second": 3.819,
+      "step": 200
+    },
+    {
+      "epoch": 3.7333333333333334,
+      "grad_norm": 20.840646743774414,
+      "learning_rate": 3.933941090877615e-06,
+      "logits/chosen": -0.21822166442871094,
+      "logits/rejected": -0.3239297866821289,
+      "logps/chosen": -10.312289237976074,
+      "logps/rejected": -33.20119857788086,
+      "loss": 0.3321,
+      "rewards/accuracies": 0.887499988079071,
+      "rewards/chosen": 1.287852168083191,
+      "rewards/margins": 2.1572673320770264,
+      "rewards/rejected": -0.8694152235984802,
+      "step": 210
+    },
+    {
+      "epoch": 3.911111111111111,
+      "grad_norm": 15.999884605407715,
+      "learning_rate": 3.8035880084487454e-06,
+      "logits/chosen": -0.21503937244415283,
+      "logits/rejected": -0.29350587725639343,
+      "logps/chosen": -8.648448944091797,
+      "logps/rejected": -34.50216293334961,
+      "loss": 0.2686,
+      "rewards/accuracies": 0.893750011920929,
+      "rewards/chosen": 1.3622721433639526,
+      "rewards/margins": 2.405822992324829,
+      "rewards/rejected": -1.0435508489608765,
+      "step": 220
+    },
+    {
+      "epoch": 4.088888888888889,
+      "grad_norm": 7.8126959800720215,
+      "learning_rate": 3.6681715706826555e-06,
+      "logits/chosen": -0.22714447975158691,
+      "logits/rejected": -0.3037336468696594,
+      "logps/chosen": -9.144305229187012,
+      "logps/rejected": -35.50819778442383,
+      "loss": 0.2284,
+      "rewards/accuracies": 0.9125000238418579,
+      "rewards/chosen": 1.3389980792999268,
+      "rewards/margins": 2.4519691467285156,
+      "rewards/rejected": -1.1129711866378784,
+      "step": 230
+    },
+    {
+      "epoch": 4.266666666666667,
+      "grad_norm": 9.81760311126709,
+      "learning_rate": 3.5282177578265295e-06,
+      "logits/chosen": -0.1930251568555832,
+      "logits/rejected": -0.28283536434173584,
+      "logps/chosen": -9.54443073272705,
+      "logps/rejected": -36.10524368286133,
+      "loss": 0.2864,
+      "rewards/accuracies": 0.9000000357627869,
+      "rewards/chosen": 1.4175978899002075,
+      "rewards/margins": 2.55643630027771,
+      "rewards/rejected": -1.1388384103775024,
+      "step": 240
+    },
+    {
+      "epoch": 4.444444444444445,
+      "grad_norm": 5.866613864898682,
+      "learning_rate": 3.384270174056454e-06,
+      "logits/chosen": -0.21315816044807434,
+      "logits/rejected": -0.30596357583999634,
+      "logps/chosen": -8.75170612335205,
+      "logps/rejected": -37.464111328125,
+      "loss": 0.2747,
+      "rewards/accuracies": 0.8812500238418579,
+      "rewards/chosen": 1.375083088874817,
+      "rewards/margins": 2.6868672370910645,
+      "rewards/rejected": -1.3117841482162476,
+      "step": 250
+    },
+    {
+      "epoch": 4.444444444444445,
+      "eval_logits/chosen": -0.18717029690742493,
+      "eval_logits/rejected": -0.28792956471443176,
+      "eval_logps/chosen": -8.346588134765625,
+      "eval_logps/rejected": -36.838497161865234,
+      "eval_loss": 0.3183891177177429,
+      "eval_rewards/accuracies": 0.85999995470047,
+      "eval_rewards/chosen": 1.4147050380706787,
+      "eval_rewards/margins": 2.6592278480529785,
+      "eval_rewards/rejected": -1.2445228099822998,
+      "eval_runtime": 13.1818,
+      "eval_samples_per_second": 7.586,
+      "eval_steps_per_second": 3.793,
+      "step": 250
+    },
+    {
+      "epoch": 4.622222222222222,
+      "grad_norm": 1.7029662132263184,
+      "learning_rate": 3.236887936027261e-06,
+      "logits/chosen": -0.19846662878990173,
+      "logits/rejected": -0.27518361806869507,
+      "logps/chosen": -8.800247192382812,
+      "logps/rejected": -38.98476791381836,
+      "loss": 0.2502,
+      "rewards/accuracies": 0.862500011920929,
+      "rewards/chosen": 1.383112907409668,
+      "rewards/margins": 2.8278017044067383,
+      "rewards/rejected": -1.4446886777877808,
+      "step": 260
+    },
+    {
+      "epoch": 4.8,
+      "grad_norm": 0.8134036064147949,
+      "learning_rate": 3.0866435011692884e-06,
+      "logits/chosen": -0.1693725436925888,
+      "logits/rejected": -0.2626606523990631,
+      "logps/chosen": -7.444068908691406,
+      "logps/rejected": -39.20288848876953,
+      "loss": 0.2289,
+      "rewards/accuracies": 0.9000000357627869,
+      "rewards/chosen": 1.527998447418213,
+      "rewards/margins": 2.998661994934082,
+      "rewards/rejected": -1.47066330909729,
+      "step": 270
+    },
+    {
+      "epoch": 4.977777777777778,
+      "grad_norm": 6.755254745483398,
+      "learning_rate": 2.9341204441673267e-06,
+      "logits/chosen": -0.1920958012342453,
+      "logits/rejected": -0.28297463059425354,
+      "logps/chosen": -8.827613830566406,
+      "logps/rejected": -37.988773345947266,
+      "loss": 0.2723,
+      "rewards/accuracies": 0.8812500238418579,
+      "rewards/chosen": 1.3093467950820923,
+      "rewards/margins": 2.6953580379486084,
+      "rewards/rejected": -1.3860112428665161,
+      "step": 280
+    },
+    {
+      "epoch": 5.155555555555556,
+      "grad_norm": 8.348061561584473,
+      "learning_rate": 2.7799111902582697e-06,
+      "logits/chosen": -0.19827289879322052,
+      "logits/rejected": -0.28890833258628845,
+      "logps/chosen": -8.641542434692383,
+      "logps/rejected": -37.83559036254883,
+      "loss": 0.2319,
+      "rewards/accuracies": 0.90625,
+      "rewards/chosen": 1.4176464080810547,
+      "rewards/margins": 2.750339984893799,
+      "rewards/rejected": -1.3326934576034546,
+      "step": 290
+    },
+    {
+      "epoch": 5.333333333333333,
+      "grad_norm": 4.27839469909668,
+      "learning_rate": 2.624614714151743e-06,
+      "logits/chosen": -0.16082732379436493,
+      "logits/rejected": -0.2566200792789459,
+      "logps/chosen": -7.895478248596191,
+      "logps/rejected": -37.70954132080078,
+      "loss": 0.2688,
+      "rewards/accuracies": 0.8812500238418579,
+      "rewards/chosen": 1.4443522691726685,
+      "rewards/margins": 2.8291947841644287,
+      "rewards/rejected": -1.3848422765731812,
+      "step": 300
+    },
+    {
+      "epoch": 5.333333333333333,
+      "eval_logits/chosen": -0.17137397825717926,
+      "eval_logits/rejected": -0.27050745487213135,
+      "eval_logps/chosen": -8.0242280960083,
+      "eval_logps/rejected": -37.18735122680664,
+      "eval_loss": 0.3195304572582245,
+      "eval_rewards/accuracies": 0.8499999642372131,
+      "eval_rewards/chosen": 1.4469408988952637,
+      "eval_rewards/margins": 2.726349115371704,
+      "eval_rewards/rejected": -1.2794082164764404,
+      "eval_runtime": 13.1946,
+      "eval_samples_per_second": 7.579,
+      "eval_steps_per_second": 3.789,
+      "step": 300
+    },
+    {
+      "epoch": 5.511111111111111,
+      "grad_norm": 1.2177822589874268,
+      "learning_rate": 2.4688342135114625e-06,
+      "logits/chosen": -0.18045730888843536,
+      "logits/rejected": -0.2632906138896942,
+      "logps/chosen": -8.068435668945312,
+      "logps/rejected": -38.28617477416992,
+      "loss": 0.2432,
+      "rewards/accuracies": 0.893750011920929,
+      "rewards/chosen": 1.5070723295211792,
+      "rewards/margins": 2.8888046741485596,
+      "rewards/rejected": -1.3817322254180908,
+      "step": 310
+    },
+    {
+      "epoch": 5.688888888888889,
+      "grad_norm": 6.162428379058838,
+      "learning_rate": 2.3131747660339396e-06,
+      "logits/chosen": -0.16258001327514648,
+      "logits/rejected": -0.2531837522983551,
+      "logps/chosen": -7.69020938873291,
+      "logps/rejected": -39.50102233886719,
+      "loss": 0.2155,
+      "rewards/accuracies": 0.893750011920929,
+      "rewards/chosen": 1.4402731657028198,
+      "rewards/margins": 2.9210917949676514,
+      "rewards/rejected": -1.4808186292648315,
+      "step": 320
+    },
+    {
+      "epoch": 5.866666666666667,
+      "grad_norm": 4.585377216339111,
+      "learning_rate": 2.158240979224817e-06,
+      "logits/chosen": -0.16741995513439178,
+      "logits/rejected": -0.25773563981056213,
+      "logps/chosen": -7.611827373504639,
+      "logps/rejected": -40.226985931396484,
+      "loss": 0.1765,
+      "rewards/accuracies": 0.918749988079071,
+      "rewards/chosen": 1.5339945554733276,
+      "rewards/margins": 3.106376886367798,
+      "rewards/rejected": -1.5723823308944702,
+      "step": 330
+    },
+    {
+      "epoch": 6.044444444444444,
+      "grad_norm": 1.262926697731018,
+      "learning_rate": 2.004634642001507e-06,
+      "logits/chosen": -0.17107848823070526,
+      "logits/rejected": -0.2691803276538849,
+      "logps/chosen": -8.973631858825684,
+      "logps/rejected": -41.54354476928711,
+      "loss": 0.1994,
+      "rewards/accuracies": 0.9312500357627869,
+      "rewards/chosen": 1.4238203763961792,
+      "rewards/margins": 3.09421968460083,
+      "rewards/rejected": -1.67039954662323,
+      "step": 340
+    },
+    {
+      "epoch": 6.222222222222222,
+      "grad_norm": 7.014649868011475,
+      "learning_rate": 1.852952387243698e-06,
+      "logits/chosen": -0.17273230850696564,
+      "logits/rejected": -0.2703326344490051,
+      "logps/chosen": -8.288504600524902,
+      "logps/rejected": -42.46126174926758,
+      "loss": 0.2047,
+      "rewards/accuracies": 0.918749988079071,
+      "rewards/chosen": 1.4250819683074951,
+      "rewards/margins": 3.235600709915161,
+      "rewards/rejected": -1.8105186223983765,
+      "step": 350
+    },
+    {
+      "epoch": 6.222222222222222,
+      "eval_logits/chosen": -0.15530475974082947,
+      "eval_logits/rejected": -0.2578139305114746,
+      "eval_logps/chosen": -9.47491455078125,
+      "eval_logps/rejected": -40.34950637817383,
+      "eval_loss": 0.3629891574382782,
+      "eval_rewards/accuracies": 0.8399999737739563,
+      "eval_rewards/chosen": 1.3018722534179688,
+      "eval_rewards/margins": 2.897495746612549,
+      "eval_rewards/rejected": -1.59562349319458,
+      "eval_runtime": 13.1284,
+      "eval_samples_per_second": 7.617,
+      "eval_steps_per_second": 3.809,
+      "step": 350
+    },
+    {
+      "epoch": 6.4,
+      "grad_norm": 8.984504699707031,
+      "learning_rate": 1.7037833743707892e-06,
+      "logits/chosen": -0.14409998059272766,
+      "logits/rejected": -0.2163417637348175,
+      "logps/chosen": -8.419466018676758,
+      "logps/rejected": -41.356468200683594,
+      "loss": 0.1894,
+      "rewards/accuracies": 0.9000000357627869,
+      "rewards/chosen": 1.4324077367782593,
+      "rewards/margins": 3.1765902042388916,
+      "rewards/rejected": -1.7441825866699219,
+      "step": 360
+    },
+    {
+      "epoch": 6.5777777777777775,
+      "grad_norm": 7.33497953414917,
+      "learning_rate": 1.5577070009474872e-06,
+      "logits/chosen": -0.14226722717285156,
+      "logits/rejected": -0.23484404385089874,
+      "logps/chosen": -7.855214595794678,
+      "logps/rejected": -42.96932601928711,
+      "loss": 0.19,
+      "rewards/accuracies": 0.9000000357627869,
+      "rewards/chosen": 1.4563724994659424,
+      "rewards/margins": 3.2921605110168457,
+      "rewards/rejected": -1.8357880115509033,
+      "step": 370
+    },
+    {
+      "epoch": 6.7555555555555555,
+      "grad_norm": 1.3846873044967651,
+      "learning_rate": 1.415290652206105e-06,
+      "logits/chosen": -0.11893842369318008,
+      "logits/rejected": -0.22303898632526398,
+      "logps/chosen": -5.515126705169678,
+      "logps/rejected": -43.147682189941406,
+      "loss": 0.1296,
+      "rewards/accuracies": 0.96875,
+      "rewards/chosen": 1.6900787353515625,
+      "rewards/margins": 3.544553756713867,
+      "rewards/rejected": -1.8544749021530151,
+      "step": 380
+    },
+    {
+      "epoch": 6.933333333333334,
+      "grad_norm": 3.4178779125213623,
+      "learning_rate": 1.2770874972267777e-06,
+      "logits/chosen": -0.2029663324356079,
+      "logits/rejected": -0.27416250109672546,
+      "logps/chosen": -10.447962760925293,
+      "logps/rejected": -41.886138916015625,
+      "loss": 0.2833,
+      "rewards/accuracies": 0.925000011920929,
+      "rewards/chosen": 1.2835975885391235,
+      "rewards/margins": 3.022332191467285,
+      "rewards/rejected": -1.738734483718872,
+      "step": 390
+    },
+    {
+      "epoch": 7.111111111111111,
+      "grad_norm": 5.459321975708008,
+      "learning_rate": 1.1436343403356019e-06,
+      "logits/chosen": -0.14208023250102997,
+      "logits/rejected": -0.22397640347480774,
+      "logps/chosen": -8.451988220214844,
+      "logps/rejected": -42.06145095825195,
+      "loss": 0.2268,
+      "rewards/accuracies": 0.893750011920929,
+      "rewards/chosen": 1.4957071542739868,
+      "rewards/margins": 3.238795518875122,
+      "rewards/rejected": -1.7430883646011353,
+      "step": 400
+    },
+    {
+      "epoch": 7.111111111111111,
+      "eval_logits/chosen": -0.1451883465051651,
+      "eval_logits/rejected": -0.24787528812885284,
+      "eval_logps/chosen": -8.884222030639648,
+      "eval_logps/rejected": -41.02870178222656,
+      "eval_loss": 0.352620393037796,
+      "eval_rewards/accuracies": 0.8499999642372131,
+      "eval_rewards/chosen": 1.3609414100646973,
+      "eval_rewards/margins": 3.024484872817993,
+      "eval_rewards/rejected": -1.6635433435440063,
+      "eval_runtime": 13.1853,
+      "eval_samples_per_second": 7.584,
+      "eval_steps_per_second": 3.792,
+      "step": 400
+    },
+    {
+      "epoch": 7.288888888888889,
+      "grad_norm": 2.762150764465332,
+      "learning_rate": 1.0154495360662464e-06,
+      "logits/chosen": -0.18289707601070404,
+      "logits/rejected": -0.28008756041526794,
+      "logps/chosen": -7.561173439025879,
+      "logps/rejected": -43.28544998168945,
+      "loss": 0.1976,
+      "rewards/accuracies": 0.925000011920929,
+      "rewards/chosen": 1.419569730758667,
+      "rewards/margins": 3.3263187408447266,
+      "rewards/rejected": -1.9067490100860596,
+      "step": 410
+    },
+    {
+      "epoch": 7.466666666666667,
+      "grad_norm": 2.009507417678833,
+      "learning_rate": 8.930309757836517e-07,
+      "logits/chosen": -0.12424879521131516,
+      "logits/rejected": -0.22191472351551056,
+      "logps/chosen": -7.338543891906738,
+      "logps/rejected": -42.68900680541992,
+      "loss": 0.2035,
+      "rewards/accuracies": 0.90625,
+      "rewards/chosen": 1.533159852027893,
+      "rewards/margins": 3.3633785247802734,
+      "rewards/rejected": -1.8302189111709595,
+      "step": 420
+    },
+    {
+      "epoch": 7.644444444444445,
+      "grad_norm": 1.1898467540740967,
+      "learning_rate": 7.768541537901325e-07,
+      "logits/chosen": -0.13503606617450714,
+      "logits/rejected": -0.23675648868083954,
+      "logps/chosen": -8.620238304138184,
+      "logps/rejected": -44.113094329833984,
+      "loss": 0.1728,
+      "rewards/accuracies": 0.918749988079071,
+      "rewards/chosen": 1.4350792169570923,
+      "rewards/margins": 3.3960673809051514,
+      "rewards/rejected": -1.9609882831573486,
+      "step": 430
+    },
+    {
+      "epoch": 7.822222222222222,
786
+ "grad_norm": 6.634049892425537,
787
+ "learning_rate": 6.673703204254348e-07,
788
+ "logits/chosen": -0.13498146831989288,
789
+ "logits/rejected": -0.21733498573303223,
790
+ "logps/chosen": -8.704492568969727,
791
+ "logps/rejected": -43.417938232421875,
792
+ "loss": 0.2495,
793
+ "rewards/accuracies": 0.918749988079071,
794
+ "rewards/chosen": 1.4058576822280884,
795
+ "rewards/margins": 3.286863088607788,
796
+ "rewards/rejected": -1.8810051679611206,
797
+ "step": 440
798
+ },
799
+ {
800
+ "epoch": 8.0,
801
+ "grad_norm": 4.482236385345459,
802
+ "learning_rate": 5.650047293344316e-07,
803
+ "logits/chosen": -0.14725255966186523,
804
+ "logits/rejected": -0.2412375509738922,
805
+ "logps/chosen": -7.016914367675781,
806
+ "logps/rejected": -44.169490814208984,
807
+ "loss": 0.144,
808
+ "rewards/accuracies": 0.9375,
809
+ "rewards/chosen": 1.5698460340499878,
810
+ "rewards/margins": 3.528562307357788,
811
+ "rewards/rejected": -1.9587162733078003,
812
+ "step": 450
813
+ },
814
+ {
815
+ "epoch": 8.0,
816
+ "eval_logits/chosen": -0.14205148816108704,
817
+ "eval_logits/rejected": -0.24482183158397675,
818
+ "eval_logps/chosen": -9.005942344665527,
819
+ "eval_logps/rejected": -41.425498962402344,
820
+ "eval_loss": 0.3662210702896118,
821
+ "eval_rewards/accuracies": 0.8399999737739563,
822
+ "eval_rewards/chosen": 1.3487695455551147,
823
+ "eval_rewards/margins": 3.051992654800415,
824
+ "eval_rewards/rejected": -1.7032231092453003,
825
+ "eval_runtime": 13.1525,
826
+ "eval_samples_per_second": 7.603,
827
+ "eval_steps_per_second": 3.802,
828
+ "step": 450
829
+ },
830
+ {
831
+ "epoch": 8.177777777777777,
832
+ "grad_norm": 18.307058334350586,
833
+ "learning_rate": 4.7015498571035877e-07,
834
+ "logits/chosen": -0.10620441287755966,
835
+ "logits/rejected": -0.19579415023326874,
836
+ "logps/chosen": -9.014161109924316,
837
+ "logps/rejected": -43.37262725830078,
838
+ "loss": 0.2079,
839
+ "rewards/accuracies": 0.918749988079071,
840
+ "rewards/chosen": 1.3402118682861328,
841
+ "rewards/margins": 3.250830888748169,
842
+ "rewards/rejected": -1.910618782043457,
843
+ "step": 460
844
+ },
845
+ {
846
+ "epoch": 8.355555555555556,
847
+ "grad_norm": 1.149519681930542,
848
+ "learning_rate": 3.831895019292897e-07,
849
+ "logits/chosen": -0.12923561036586761,
850
+ "logits/rejected": -0.24200458824634552,
851
+ "logps/chosen": -6.25982666015625,
852
+ "logps/rejected": -44.513946533203125,
853
+ "loss": 0.1659,
854
+ "rewards/accuracies": 0.9125000238418579,
855
+ "rewards/chosen": 1.5601545572280884,
856
+ "rewards/margins": 3.614009141921997,
857
+ "rewards/rejected": -2.0538547039031982,
858
+ "step": 470
859
+ },
860
+ {
861
+ "epoch": 8.533333333333333,
862
+ "grad_norm": 9.660409927368164,
863
+ "learning_rate": 3.044460665744284e-07,
864
+ "logits/chosen": -0.12687157094478607,
865
+ "logits/rejected": -0.2105720490217209,
866
+ "logps/chosen": -8.170889854431152,
867
+ "logps/rejected": -43.91897201538086,
868
+ "loss": 0.2133,
869
+ "rewards/accuracies": 0.9000000357627869,
870
+ "rewards/chosen": 1.490912675857544,
871
+ "rewards/margins": 3.416290283203125,
872
+ "rewards/rejected": -1.9253777265548706,
873
+ "step": 480
874
+ },
875
+ {
876
+ "epoch": 8.71111111111111,
877
+ "grad_norm": 6.015744686126709,
878
+ "learning_rate": 2.3423053240837518e-07,
879
+ "logits/chosen": -0.1734294891357422,
880
+ "logits/rejected": -0.2597627341747284,
881
+ "logps/chosen": -9.138812065124512,
882
+ "logps/rejected": -43.02109909057617,
883
+ "loss": 0.1926,
884
+ "rewards/accuracies": 0.925000011920929,
885
+ "rewards/chosen": 1.4286350011825562,
886
+ "rewards/margins": 3.2899787425994873,
887
+ "rewards/rejected": -1.8613442182540894,
888
+ "step": 490
889
+ },
890
+ {
891
+ "epoch": 8.88888888888889,
892
+ "grad_norm": 5.968247890472412,
893
+ "learning_rate": 1.7281562838948968e-07,
894
+ "logits/chosen": -0.1330401748418808,
895
+ "logits/rejected": -0.2250121682882309,
896
+ "logps/chosen": -7.425269603729248,
897
+ "logps/rejected": -43.31748580932617,
898
+ "loss": 0.171,
899
+ "rewards/accuracies": 0.956250011920929,
900
+ "rewards/chosen": 1.5740032196044922,
901
+ "rewards/margins": 3.423919677734375,
902
+ "rewards/rejected": -1.8499164581298828,
903
+ "step": 500
904
+ },
905
+ {
906
+ "epoch": 8.88888888888889,
907
+ "eval_logits/chosen": -0.1399160772562027,
908
+ "eval_logits/rejected": -0.2429780215024948,
909
+ "eval_logps/chosen": -9.18053913116455,
910
+ "eval_logps/rejected": -41.719730377197266,
911
+ "eval_loss": 0.36349406838417053,
912
+ "eval_rewards/accuracies": 0.8399999737739563,
913
+ "eval_rewards/chosen": 1.3313097953796387,
914
+ "eval_rewards/margins": 3.0639562606811523,
915
+ "eval_rewards/rejected": -1.7326463460922241,
916
+ "eval_runtime": 13.2218,
917
+ "eval_samples_per_second": 7.563,
918
+ "eval_steps_per_second": 3.782,
919
+ "step": 500
920
+ },
921
+ {
922
+ "epoch": 9.066666666666666,
923
+ "grad_norm": 2.3828182220458984,
924
+ "learning_rate": 1.2043990034669413e-07,
925
+ "logits/chosen": -0.16531167924404144,
926
+ "logits/rejected": -0.26316291093826294,
927
+ "logps/chosen": -6.18184757232666,
928
+ "logps/rejected": -45.66379928588867,
929
+ "loss": 0.136,
930
+ "rewards/accuracies": 0.949999988079071,
931
+ "rewards/chosen": 1.5860216617584229,
932
+ "rewards/margins": 3.676870107650757,
933
+ "rewards/rejected": -2.090848207473755,
934
+ "step": 510
935
+ },
936
+ {
937
+ "epoch": 9.244444444444444,
938
+ "grad_norm": 1.0561579465866089,
939
+ "learning_rate": 7.730678442730539e-08,
940
+ "logits/chosen": -0.12930361926555634,
941
+ "logits/rejected": -0.22025151550769806,
942
+ "logps/chosen": -7.336406230926514,
943
+ "logps/rejected": -44.349239349365234,
944
+ "loss": 0.1573,
945
+ "rewards/accuracies": 0.9375,
946
+ "rewards/chosen": 1.5953136682510376,
947
+ "rewards/margins": 3.572978973388672,
948
+ "rewards/rejected": -1.9776651859283447,
949
+ "step": 520
950
+ },
951
+ {
952
+ "epoch": 9.422222222222222,
953
+ "grad_norm": 11.791927337646484,
954
+ "learning_rate": 4.358381691677932e-08,
955
+ "logits/chosen": -0.10277407616376877,
956
+ "logits/rejected": -0.1993442177772522,
957
+ "logps/chosen": -7.81331205368042,
958
+ "logps/rejected": -44.955955505371094,
959
+ "loss": 0.1684,
960
+ "rewards/accuracies": 0.925000011920929,
961
+ "rewards/chosen": 1.5365545749664307,
962
+ "rewards/margins": 3.621255874633789,
963
+ "rewards/rejected": -2.0847015380859375,
964
+ "step": 530
965
+ },
966
+ {
967
+ "epoch": 9.6,
968
+ "grad_norm": 3.8846185207366943,
969
+ "learning_rate": 1.9401983499569843e-08,
970
+ "logits/chosen": -0.1286163181066513,
971
+ "logits/rejected": -0.25202152132987976,
972
+ "logps/chosen": -7.090148448944092,
973
+ "logps/rejected": -45.3367805480957,
974
+ "loss": 0.1675,
975
+ "rewards/accuracies": 0.925000011920929,
976
+ "rewards/chosen": 1.5582847595214844,
977
+ "rewards/margins": 3.6432290077209473,
978
+ "rewards/rejected": -2.084944009780884,
979
+ "step": 540
980
+ },
981
+ {
982
+ "epoch": 9.777777777777779,
983
+ "grad_norm": 6.826923847198486,
984
+ "learning_rate": 4.855210488670381e-09,
985
+ "logits/chosen": -0.14255917072296143,
986
+ "logits/rejected": -0.24119222164154053,
987
+ "logps/chosen": -9.036189079284668,
988
+ "logps/rejected": -41.72261047363281,
989
+ "loss": 0.2313,
990
+ "rewards/accuracies": 0.9125000238418579,
991
+ "rewards/chosen": 1.3214199542999268,
992
+ "rewards/margins": 3.085725784301758,
993
+ "rewards/rejected": -1.7643059492111206,
994
+ "step": 550
995
+ },
996
+ {
997
+ "epoch": 9.777777777777779,
998
+ "eval_logits/chosen": -0.13784560561180115,
999
+ "eval_logits/rejected": -0.24095743894577026,
1000
+ "eval_logps/chosen": -9.1017484664917,
1001
+ "eval_logps/rejected": -41.82564926147461,
1002
+ "eval_loss": 0.3612578511238098,
1003
+ "eval_rewards/accuracies": 0.8399999737739563,
1004
+ "eval_rewards/chosen": 1.3391889333724976,
1005
+ "eval_rewards/margins": 3.0824267864227295,
1006
+ "eval_rewards/rejected": -1.7432377338409424,
1007
+ "eval_runtime": 13.1911,
1008
+ "eval_samples_per_second": 7.581,
1009
+ "eval_steps_per_second": 3.79,
1010
+ "step": 550
1011
+ },
1012
+ {
1013
+ "epoch": 9.955555555555556,
1014
+ "grad_norm": 1.7594248056411743,
1015
+ "learning_rate": 0.0,
1016
+ "logits/chosen": -0.15151795744895935,
1017
+ "logits/rejected": -0.2365965098142624,
1018
+ "logps/chosen": -7.8617448806762695,
1019
+ "logps/rejected": -42.87919235229492,
1020
+ "loss": 0.1746,
1021
+ "rewards/accuracies": 0.9312500357627869,
1022
+ "rewards/chosen": 1.5052164793014526,
1023
+ "rewards/margins": 3.3184120655059814,
1024
+ "rewards/rejected": -1.8131954669952393,
1025
+ "step": 560
1026
+ },
1027
+ {
1028
+ "epoch": 9.955555555555556,
1029
+ "step": 560,
1030
+ "total_flos": 6.741083695664333e+16,
1031
+ "train_loss": 0.31093634622437616,
1032
+ "train_runtime": 2833.3873,
1033
+ "train_samples_per_second": 3.176,
1034
+ "train_steps_per_second": 0.198
1035
+ }
1036
+ ],
1037
+ "logging_steps": 10,
1038
+ "max_steps": 560,
1039
+ "num_input_tokens_seen": 0,
1040
+ "num_train_epochs": 10,
1041
+ "save_steps": 50,
1042
+ "stateful_callbacks": {
1043
+ "TrainerControl": {
1044
+ "args": {
1045
+ "should_epoch_stop": false,
1046
+ "should_evaluate": false,
1047
+ "should_log": false,
1048
+ "should_save": true,
1049
+ "should_training_stop": true
1050
+ },
1051
+ "attributes": {}
1052
+ }
1053
+ },
1054
+ "total_flos": 6.741083695664333e+16,
1055
+ "train_batch_size": 2,
1056
+ "trial_name": null,
1057
+ "trial_params": null
1058
+ }
training_eval_loss.png ADDED
training_loss.png ADDED
training_rewards_accuracies.png ADDED
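The `trainer_state.json` added in this commit keeps its per-step metrics in a `"log_history"` list, where evaluation entries are distinguished by `eval_*` keys. A minimal sketch of pulling the final evaluation record out of such a list — the helper name is illustrative, and the inline sample copies two entries from the log above:

```python
def latest_eval(log_history):
    """Return the last entry in log_history that carries eval metrics, or None."""
    evals = [entry for entry in log_history if "eval_loss" in entry]
    return evals[-1] if evals else None

# Inline sample mirroring the structure of the "log_history" list
# in this commit's trainer_state.json (values copied from the diff).
sample = [
    {"epoch": 9.955555555555556, "loss": 0.1746, "step": 560},
    {"epoch": 9.777777777777779, "eval_loss": 0.3612578511238098,
     "eval_rewards/accuracies": 0.8399999737739563, "step": 550},
]

print(latest_eval(sample)["eval_loss"])  # 0.3612578511238098
```

Against the real file one would first do `json.load(open("trainer_state.json"))` and pass its `"log_history"` value to the helper; that last eval record is the source of the numbers reported in this commit's README.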