chchen committed
Commit 1fbe65a · verified · 1 Parent(s): 6a3eef4

End of training
README.md CHANGED
@@ -3,9 +3,10 @@ library_name: peft
 license: llama3.1
 base_model: meta-llama/Llama-3.1-8B-Instruct
 tags:
+- llama-factory
+- lora
 - trl
 - dpo
-- llama-factory
 - generated_from_trainer
 model-index:
 - name: Llama-3.1-8B-Instruct-dpo-mistral-1000
@@ -17,17 +18,17 @@ should probably proofread and complete it, then remove this comment. -->
 
 # Llama-3.1-8B-Instruct-dpo-mistral-1000
 
-This model is a fine-tuned version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) on an unknown dataset.
+This model is a fine-tuned version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) on the answer_mistral dataset.
 It achieves the following results on the evaluation set:
-- Loss: 0.4934
-- Rewards/chosen: 0.8943
-- Rewards/rejected: -0.7017
-- Rewards/accuracies: 0.75
-- Rewards/margins: 1.5960
-- Logps/chosen: -14.2081
-- Logps/rejected: -32.2463
-- Logits/chosen: -0.0873
-- Logits/rejected: -0.1548
+- Loss: 0.4675
+- Rewards/chosen: 0.9903
+- Rewards/rejected: -0.3997
+- Rewards/accuracies: 0.7900
+- Rewards/margins: 1.3900
+- Logps/chosen: -13.2488
+- Logps/rejected: -29.2269
+- Logits/chosen: -0.1396
+- Logits/rejected: -0.2080
 
 ## Model description
 
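The reward columns in the card and logs below follow the implicit-reward convention of DPO as logged by TRL: each completion's reward is β times its log-probability shift from the reference model, and the margin is chosen minus rejected. A minimal pure-Python sketch of that bookkeeping (β = 0.1 is TRL's default; the log-probability inputs here are illustrative, not taken from this run's raw data):

```python
import math

def dpo_reward(beta, policy_logp, ref_logp):
    # Implicit DPO reward: beta times the log-prob shift from the reference model.
    return beta * (policy_logp - ref_logp)

def dpo_loss(reward_chosen, reward_rejected):
    # Sigmoid DPO loss: -log(sigmoid(chosen - rejected)).
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative values only (beta=0.1; log-probs are made up):
r_c = dpo_reward(0.1, policy_logp=-13.2, ref_logp=-23.1)
r_r = dpo_reward(0.1, policy_logp=-29.2, ref_logp=-25.2)
print(round(r_c - r_r, 4))  # margin -> 1.39
```

Note that with zero margin the loss is log 2 ≈ 0.693, which is why the first logged training losses below sit near 0.69 before the margins open up.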
all_results.json ADDED
@@ -0,0 +1,20 @@
+{
+    "epoch": 9.977728285077951,
+    "eval_logits/chosen": -0.13963328301906586,
+    "eval_logits/rejected": -0.2079545557498932,
+    "eval_logps/chosen": -13.248819351196289,
+    "eval_logps/rejected": -29.226905822753906,
+    "eval_loss": 0.4674791693687439,
+    "eval_rewards/accuracies": 0.7899999618530273,
+    "eval_rewards/chosen": 0.9902573823928833,
+    "eval_rewards/margins": 1.390006184577942,
+    "eval_rewards/rejected": -0.3997488021850586,
+    "eval_runtime": 9.6246,
+    "eval_samples_per_second": 10.39,
+    "eval_steps_per_second": 5.195,
+    "total_flos": 5.23282185018409e+16,
+    "train_loss": 0.4115582968507494,
+    "train_runtime": 2017.6604,
+    "train_samples_per_second": 4.451,
+    "train_steps_per_second": 0.278
+}
eval_results.json ADDED
@@ -0,0 +1,15 @@
+{
+    "epoch": 9.977728285077951,
+    "eval_logits/chosen": -0.13963328301906586,
+    "eval_logits/rejected": -0.2079545557498932,
+    "eval_logps/chosen": -13.248819351196289,
+    "eval_logps/rejected": -29.226905822753906,
+    "eval_loss": 0.4674791693687439,
+    "eval_rewards/accuracies": 0.7899999618530273,
+    "eval_rewards/chosen": 0.9902573823928833,
+    "eval_rewards/margins": 1.390006184577942,
+    "eval_rewards/rejected": -0.3997488021850586,
+    "eval_runtime": 9.6246,
+    "eval_samples_per_second": 10.39,
+    "eval_steps_per_second": 5.195
+}
train_results.json ADDED
@@ -0,0 +1,8 @@
+{
+    "epoch": 9.977728285077951,
+    "total_flos": 5.23282185018409e+16,
+    "train_loss": 0.4115582968507494,
+    "train_runtime": 2017.6604,
+    "train_samples_per_second": 4.451,
+    "train_steps_per_second": 0.278
+}
trainer_state.json ADDED
@@ -0,0 +1,1058 @@
+{
+  "best_metric": 0.4674791693687439,
+  "best_model_checkpoint": "saves/sycophancy/Llama-3.1-8B-Instruct/dpo-mistral-1000/train/checkpoint-250",
+  "epoch": 9.977728285077951,
+  "eval_steps": 50,
+  "global_step": 560,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "epoch": 0.17817371937639198,
+      "grad_norm": 1.6836061477661133,
+      "learning_rate": 8.928571428571429e-07,
+      "logits/chosen": -0.34122365713119507,
+      "logits/rejected": -0.392149418592453,
+      "logps/chosen": -22.754064559936523,
+      "logps/rejected": -24.806787490844727,
+      "loss": 0.6925,
+      "rewards/accuracies": 0.4000000059604645,
+      "rewards/chosen": -0.00038957837386988103,
+      "rewards/margins": 0.0013258380349725485,
+      "rewards/rejected": -0.0017154163215309381,
+      "step": 10
+    },
+    {
+      "epoch": 0.35634743875278396,
+      "grad_norm": 1.5660648345947266,
+      "learning_rate": 1.7857142857142859e-06,
+      "logits/chosen": -0.3305678367614746,
+      "logits/rejected": -0.36542031168937683,
+      "logps/chosen": -22.68667221069336,
+      "logps/rejected": -24.383272171020508,
+      "loss": 0.694,
+      "rewards/accuracies": 0.46875,
+      "rewards/chosen": -0.0019774259999394417,
+      "rewards/margins": -0.0016633094055578113,
+      "rewards/rejected": -0.0003141165361739695,
+      "step": 20
+    },
+    {
+      "epoch": 0.534521158129176,
+      "grad_norm": 1.7988139390945435,
+      "learning_rate": 2.6785714285714285e-06,
+      "logits/chosen": -0.33457204699516296,
+      "logits/rejected": -0.3815780282020569,
+      "logps/chosen": -23.1612548828125,
+      "logps/rejected": -24.836502075195312,
+      "loss": 0.6913,
+      "rewards/accuracies": 0.6000000238418579,
+      "rewards/chosen": 0.004869468975812197,
+      "rewards/margins": 0.0037120466586202383,
+      "rewards/rejected": 0.001157421967945993,
+      "step": 30
+    },
+    {
+      "epoch": 0.7126948775055679,
+      "grad_norm": 1.3268821239471436,
+      "learning_rate": 3.5714285714285718e-06,
+      "logits/chosen": -0.3601827919483185,
+      "logits/rejected": -0.39572954177856445,
+      "logps/chosen": -22.90369987487793,
+      "logps/rejected": -24.570892333984375,
+      "loss": 0.6885,
+      "rewards/accuracies": 0.6312500238418579,
+      "rewards/chosen": 0.01631110906600952,
+      "rewards/margins": 0.009484974667429924,
+      "rewards/rejected": 0.006826136261224747,
+      "step": 40
+    },
+    {
+      "epoch": 0.89086859688196,
+      "grad_norm": 1.776597261428833,
+      "learning_rate": 4.464285714285715e-06,
+      "logits/chosen": -0.3659666180610657,
+      "logits/rejected": -0.39071187376976013,
+      "logps/chosen": -23.908523559570312,
+      "logps/rejected": -24.262924194335938,
+      "loss": 0.6891,
+      "rewards/accuracies": 0.574999988079071,
+      "rewards/chosen": 0.026137981563806534,
+      "rewards/margins": 0.008756262250244617,
+      "rewards/rejected": 0.017381716519594193,
+      "step": 50
+    },
+    {
+      "epoch": 0.89086859688196,
+      "eval_logits/chosen": -0.32067814469337463,
+      "eval_logits/rejected": -0.36896541714668274,
+      "eval_logps/chosen": -22.66472053527832,
+      "eval_logps/rejected": -24.953529357910156,
+      "eval_loss": 0.6832554340362549,
+      "eval_rewards/accuracies": 0.6200000047683716,
+      "eval_rewards/chosen": 0.048666905611753464,
+      "eval_rewards/margins": 0.02107813023030758,
+      "eval_rewards/rejected": 0.027588771656155586,
+      "eval_runtime": 9.7251,
+      "eval_samples_per_second": 10.283,
+      "eval_steps_per_second": 5.141,
+      "step": 50
+    },
+    {
+      "epoch": 1.069042316258352,
+      "grad_norm": 1.7856236696243286,
+      "learning_rate": 4.999222955002041e-06,
+      "logits/chosen": -0.35137540102005005,
+      "logits/rejected": -0.3762962818145752,
+      "logps/chosen": -22.667428970336914,
+      "logps/rejected": -24.934234619140625,
+      "loss": 0.675,
+      "rewards/accuracies": 0.612500011920929,
+      "rewards/chosen": 0.06478282064199448,
+      "rewards/margins": 0.0387502983212471,
+      "rewards/rejected": 0.026032526046037674,
+      "step": 60
+    },
+    {
+      "epoch": 1.247216035634744,
+      "grad_norm": 2.1098411083221436,
+      "learning_rate": 4.990486745229364e-06,
+      "logits/chosen": -0.34582623839378357,
+      "logits/rejected": -0.3823074400424957,
+      "logps/chosen": -21.945472717285156,
+      "logps/rejected": -23.575443267822266,
+      "loss": 0.6683,
+      "rewards/accuracies": 0.6500000357627869,
+      "rewards/chosen": 0.1263795644044876,
+      "rewards/margins": 0.05839619040489197,
+      "rewards/rejected": 0.06798337399959564,
+      "step": 70
+    },
+    {
+      "epoch": 1.4253897550111359,
+      "grad_norm": 2.4574801921844482,
+      "learning_rate": 4.9720770655628216e-06,
+      "logits/chosen": -0.3288664221763611,
+      "logits/rejected": -0.3651907444000244,
+      "logps/chosen": -21.0363712310791,
+      "logps/rejected": -24.36761474609375,
+      "loss": 0.6223,
+      "rewards/accuracies": 0.737500011920929,
+      "rewards/chosen": 0.26030755043029785,
+      "rewards/margins": 0.1713070124387741,
+      "rewards/rejected": 0.08900053799152374,
+      "step": 80
+    },
+    {
+      "epoch": 1.6035634743875278,
+      "grad_norm": 2.382906675338745,
+      "learning_rate": 4.944065422298262e-06,
+      "logits/chosen": -0.3467506468296051,
+      "logits/rejected": -0.38536104559898376,
+      "logps/chosen": -19.177209854125977,
+      "logps/rejected": -23.13625144958496,
+      "loss": 0.6035,
+      "rewards/accuracies": 0.699999988079071,
+      "rewards/chosen": 0.41219550371170044,
+      "rewards/margins": 0.25021034479141235,
+      "rewards/rejected": 0.1619851440191269,
+      "step": 90
+    },
+    {
+      "epoch": 1.7817371937639197,
+      "grad_norm": 2.123657703399658,
+      "learning_rate": 4.90656061737503e-06,
+      "logits/chosen": -0.3112158477306366,
+      "logits/rejected": -0.355999618768692,
+      "logps/chosen": -17.31022834777832,
+      "logps/rejected": -22.826955795288086,
+      "loss": 0.5716,
+      "rewards/accuracies": 0.699999988079071,
+      "rewards/chosen": 0.5237475633621216,
+      "rewards/margins": 0.35302406549453735,
+      "rewards/rejected": 0.17072349786758423,
+      "step": 100
+    },
+    {
+      "epoch": 1.7817371937639197,
+      "eval_logits/chosen": -0.2933846712112427,
+      "eval_logits/rejected": -0.34558260440826416,
+      "eval_logps/chosen": -17.070594787597656,
+      "eval_logps/rejected": -23.3165225982666,
+      "eval_loss": 0.5618187785148621,
+      "eval_rewards/accuracies": 0.699999988079071,
+      "eval_rewards/chosen": 0.6080796718597412,
+      "eval_rewards/margins": 0.4167901575565338,
+      "eval_rewards/rejected": 0.191289484500885,
+      "eval_runtime": 9.6952,
+      "eval_samples_per_second": 10.314,
+      "eval_steps_per_second": 5.157,
+      "step": 100
+    },
+    {
+      "epoch": 1.9599109131403119,
+      "grad_norm": 3.316772699356079,
+      "learning_rate": 4.859708325770919e-06,
+      "logits/chosen": -0.3041900396347046,
+      "logits/rejected": -0.35448360443115234,
+      "logps/chosen": -16.122028350830078,
+      "logps/rejected": -22.659238815307617,
+      "loss": 0.5299,
+      "rewards/accuracies": 0.7250000238418579,
+      "rewards/chosen": 0.7380782961845398,
+      "rewards/margins": 0.5233559608459473,
+      "rewards/rejected": 0.21472235023975372,
+      "step": 110
+    },
+    {
+      "epoch": 2.138084632516704,
+      "grad_norm": 2.7125298976898193,
+      "learning_rate": 4.80369052967602e-06,
+      "logits/chosen": -0.2812982499599457,
+      "logits/rejected": -0.3268232047557831,
+      "logps/chosen": -15.398541450500488,
+      "logps/rejected": -23.535791397094727,
+      "loss": 0.4948,
+      "rewards/accuracies": 0.7750000357627869,
+      "rewards/chosen": 0.8048359155654907,
+      "rewards/margins": 0.660244882106781,
+      "rewards/rejected": 0.14459095895290375,
+      "step": 120
+    },
+    {
+      "epoch": 2.316258351893096,
+      "grad_norm": 2.149165391921997,
+      "learning_rate": 4.7387248116432524e-06,
+      "logits/chosen": -0.24634817242622375,
+      "logits/rejected": -0.313901424407959,
+      "logps/chosen": -13.057968139648438,
+      "logps/rejected": -23.697710037231445,
+      "loss": 0.4275,
+      "rewards/accuracies": 0.7750000357627869,
+      "rewards/chosen": 1.0445865392684937,
+      "rewards/margins": 0.944391667842865,
+      "rewards/rejected": 0.10019483417272568,
+      "step": 130
+    },
+    {
+      "epoch": 2.494432071269488,
+      "grad_norm": 2.2673017978668213,
+      "learning_rate": 4.665063509461098e-06,
+      "logits/chosen": -0.2440420240163803,
+      "logits/rejected": -0.29979586601257324,
+      "logps/chosen": -13.353204727172852,
+      "logps/rejected": -24.62551498413086,
+      "loss": 0.4575,
+      "rewards/accuracies": 0.7875000238418579,
+      "rewards/chosen": 0.9695870280265808,
+      "rewards/margins": 0.9478777050971985,
+      "rewards/rejected": 0.021709401160478592,
+      "step": 140
+    },
+    {
+      "epoch": 2.6726057906458798,
+      "grad_norm": 2.5609817504882812,
+      "learning_rate": 4.5829927360311224e-06,
+      "logits/chosen": -0.21621806919574738,
+      "logits/rejected": -0.27183279395103455,
+      "logps/chosen": -12.837681770324707,
+      "logps/rejected": -24.354141235351562,
+      "loss": 0.4581,
+      "rewards/accuracies": 0.7750000357627869,
+      "rewards/chosen": 0.9683197140693665,
+      "rewards/margins": 0.9282311797142029,
+      "rewards/rejected": 0.04008860886096954,
+      "step": 150
+    },
+    {
+      "epoch": 2.6726057906458798,
+      "eval_logits/chosen": -0.2092563956975937,
+      "eval_logits/rejected": -0.2738884389400482,
+      "eval_logps/chosen": -13.789220809936523,
+      "eval_logps/rejected": -25.666584014892578,
+      "eval_loss": 0.4760795533657074,
+      "eval_rewards/accuracies": 0.7599999904632568,
+      "eval_rewards/chosen": 0.9362172484397888,
+      "eval_rewards/margins": 0.9799338579177856,
+      "eval_rewards/rejected": -0.04371662810444832,
+      "eval_runtime": 9.724,
+      "eval_samples_per_second": 10.284,
+      "eval_steps_per_second": 5.142,
+      "step": 150
+    },
+    {
+      "epoch": 2.8507795100222717,
+      "grad_norm": 6.471484661102295,
+      "learning_rate": 4.492831268057307e-06,
+      "logits/chosen": -0.25587916374206543,
+      "logits/rejected": -0.3184296786785126,
+      "logps/chosen": -12.521145820617676,
+      "logps/rejected": -25.58576011657715,
+      "loss": 0.406,
+      "rewards/accuracies": 0.7750000357627869,
+      "rewards/chosen": 1.077256202697754,
+      "rewards/margins": 1.1716722249984741,
+      "rewards/rejected": -0.09441610425710678,
+      "step": 160
+    },
+    {
+      "epoch": 3.0289532293986636,
+      "grad_norm": 6.336172580718994,
+      "learning_rate": 4.394929307863633e-06,
+      "logits/chosen": -0.23687370121479034,
+      "logits/rejected": -0.2813915014266968,
+      "logps/chosen": -14.616838455200195,
+      "logps/rejected": -26.1534366607666,
+      "loss": 0.452,
+      "rewards/accuracies": 0.800000011920929,
+      "rewards/chosen": 0.8590637445449829,
+      "rewards/margins": 1.0506114959716797,
+      "rewards/rejected": -0.19154782593250275,
+      "step": 170
+    },
+    {
+      "epoch": 3.2071269487750556,
+      "grad_norm": 11.07475471496582,
+      "learning_rate": 4.289667123149296e-06,
+      "logits/chosen": -0.2059316188097,
+      "logits/rejected": -0.2756357192993164,
+      "logps/chosen": -12.392863273620605,
+      "logps/rejected": -26.90511131286621,
+      "loss": 0.4051,
+      "rewards/accuracies": 0.793749988079071,
+      "rewards/chosen": 1.0431259870529175,
+      "rewards/margins": 1.3048280477523804,
+      "rewards/rejected": -0.26170212030410767,
+      "step": 180
+    },
+    {
+      "epoch": 3.3853006681514475,
+      "grad_norm": 6.2247819900512695,
+      "learning_rate": 4.177453569964925e-06,
+      "logits/chosen": -0.17799881100654602,
+      "logits/rejected": -0.24945221841335297,
+      "logps/chosen": -12.190529823303223,
+      "logps/rejected": -27.486194610595703,
+      "loss": 0.3954,
+      "rewards/accuracies": 0.800000011920929,
+      "rewards/chosen": 1.0829803943634033,
+      "rewards/margins": 1.3739397525787354,
+      "rewards/rejected": -0.29095926880836487,
+      "step": 190
+    },
+    {
+      "epoch": 3.5634743875278394,
+      "grad_norm": 2.581493377685547,
+      "learning_rate": 4.058724504646834e-06,
+      "logits/chosen": -0.17182210087776184,
+      "logits/rejected": -0.23636431992053986,
+      "logps/chosen": -12.724395751953125,
+      "logps/rejected": -27.663299560546875,
+      "loss": 0.4032,
+      "rewards/accuracies": 0.793749988079071,
+      "rewards/chosen": 1.032086968421936,
+      "rewards/margins": 1.3668633699417114,
+      "rewards/rejected": -0.3347764015197754,
+      "step": 200
+    },
+    {
+      "epoch": 3.5634743875278394,
+      "eval_logits/chosen": -0.16310954093933105,
+      "eval_logits/rejected": -0.23061738908290863,
+      "eval_logps/chosen": -13.548603057861328,
+      "eval_logps/rejected": -28.07323455810547,
+      "eval_loss": 0.4708513021469116,
+      "eval_rewards/accuracies": 0.8100000023841858,
+      "eval_rewards/chosen": 0.9602789282798767,
+      "eval_rewards/margins": 1.244660496711731,
+      "eval_rewards/rejected": -0.28438156843185425,
+      "eval_runtime": 9.734,
+      "eval_samples_per_second": 10.273,
+      "eval_steps_per_second": 5.137,
+      "step": 200
+    },
+    {
+      "epoch": 3.7416481069042318,
+      "grad_norm": 2.387537717819214,
+      "learning_rate": 3.933941090877615e-06,
+      "logits/chosen": -0.15226894617080688,
+      "logits/rejected": -0.23664240539073944,
+      "logps/chosen": -12.170495986938477,
+      "logps/rejected": -28.98239517211914,
+      "loss": 0.3887,
+      "rewards/accuracies": 0.8125,
+      "rewards/chosen": 1.1489580869674683,
+      "rewards/margins": 1.5401805639266968,
+      "rewards/rejected": -0.39122244715690613,
+      "step": 210
+    },
+    {
+      "epoch": 3.9198218262806237,
+      "grad_norm": 7.2560529708862305,
+      "learning_rate": 3.8035880084487454e-06,
+      "logits/chosen": -0.19596333801746368,
+      "logits/rejected": -0.2662833034992218,
+      "logps/chosen": -13.05119514465332,
+      "logps/rejected": -29.070032119750977,
+      "loss": 0.4093,
+      "rewards/accuracies": 0.8125,
+      "rewards/chosen": 1.0079821348190308,
+      "rewards/margins": 1.3972289562225342,
+      "rewards/rejected": -0.38924697041511536,
+      "step": 220
+    },
+    {
+      "epoch": 4.097995545657016,
+      "grad_norm": 8.643023490905762,
+      "learning_rate": 3.6681715706826555e-06,
+      "logits/chosen": -0.1443440169095993,
+      "logits/rejected": -0.2094365656375885,
+      "logps/chosen": -12.210596084594727,
+      "logps/rejected": -28.835458755493164,
+      "loss": 0.3526,
+      "rewards/accuracies": 0.831250011920929,
+      "rewards/chosen": 1.1435935497283936,
+      "rewards/margins": 1.6025713682174683,
+      "rewards/rejected": -0.45897769927978516,
+      "step": 230
+    },
+    {
+      "epoch": 4.276169265033408,
+      "grad_norm": 8.177175521850586,
+      "learning_rate": 3.5282177578265295e-06,
+      "logits/chosen": -0.16080088913440704,
+      "logits/rejected": -0.22377625107765198,
+      "logps/chosen": -12.389388084411621,
+      "logps/rejected": -29.504175186157227,
+      "loss": 0.3327,
+      "rewards/accuracies": 0.862500011920929,
+      "rewards/chosen": 1.1701444387435913,
+      "rewards/margins": 1.6151951551437378,
+      "rewards/rejected": -0.44505080580711365,
+      "step": 240
+    },
+    {
+      "epoch": 4.4543429844097995,
+      "grad_norm": 3.6266558170318604,
+      "learning_rate": 3.384270174056454e-06,
+      "logits/chosen": -0.17858387529850006,
+      "logits/rejected": -0.23888497054576874,
+      "logps/chosen": -12.181685447692871,
+      "logps/rejected": -29.275564193725586,
+      "loss": 0.3836,
+      "rewards/accuracies": 0.8187500238418579,
+      "rewards/chosen": 1.0620548725128174,
+      "rewards/margins": 1.5715227127075195,
+      "rewards/rejected": -0.5094677805900574,
+      "step": 250
+    },
+    {
+      "epoch": 4.4543429844097995,
+      "eval_logits/chosen": -0.13963328301906586,
+      "eval_logits/rejected": -0.2079545557498932,
+      "eval_logps/chosen": -13.248819351196289,
+      "eval_logps/rejected": -29.226905822753906,
+      "eval_loss": 0.4674791693687439,
+      "eval_rewards/accuracies": 0.7899999618530273,
+      "eval_rewards/chosen": 0.9902573823928833,
+      "eval_rewards/margins": 1.390006184577942,
+      "eval_rewards/rejected": -0.3997488021850586,
+      "eval_runtime": 9.715,
+      "eval_samples_per_second": 10.293,
+      "eval_steps_per_second": 5.147,
+      "step": 250
+    },
+    {
+      "epoch": 4.632516703786192,
+      "grad_norm": 4.119279384613037,
+      "learning_rate": 3.236887936027261e-06,
+      "logits/chosen": -0.12825655937194824,
+      "logits/rejected": -0.19981206953525543,
+      "logps/chosen": -11.074287414550781,
+      "logps/rejected": -31.082468032836914,
+      "loss": 0.3368,
+      "rewards/accuracies": 0.862500011920929,
+      "rewards/chosen": 1.162680983543396,
+      "rewards/margins": 1.7504284381866455,
+      "rewards/rejected": -0.5877474546432495,
+      "step": 260
+    },
+    {
+      "epoch": 4.810690423162583,
+      "grad_norm": 6.509561061859131,
+      "learning_rate": 3.0866435011692884e-06,
+      "logits/chosen": -0.15560267865657806,
+      "logits/rejected": -0.2208879441022873,
+      "logps/chosen": -11.37994384765625,
+      "logps/rejected": -31.601425170898438,
+      "loss": 0.338,
+      "rewards/accuracies": 0.8812500238418579,
+      "rewards/chosen": 1.1841315031051636,
+      "rewards/margins": 1.8628618717193604,
+      "rewards/rejected": -0.6787301898002625,
+      "step": 270
+    },
+    {
+      "epoch": 4.988864142538976,
+      "grad_norm": 4.498559951782227,
+      "learning_rate": 2.9341204441673267e-06,
+      "logits/chosen": -0.1727777123451233,
+      "logits/rejected": -0.2310466766357422,
+      "logps/chosen": -13.475479125976562,
+      "logps/rejected": -27.789844512939453,
+      "loss": 0.5027,
+      "rewards/accuracies": 0.768750011920929,
+      "rewards/chosen": 0.9239163398742676,
+      "rewards/margins": 1.258786678314209,
+      "rewards/rejected": -0.33487027883529663,
+      "step": 280
+    },
+    {
+      "epoch": 5.167037861915367,
+      "grad_norm": 3.030965805053711,
+      "learning_rate": 2.7799111902582697e-06,
+      "logits/chosen": -0.15166734158992767,
+      "logits/rejected": -0.22125867009162903,
+      "logps/chosen": -12.045249938964844,
+      "logps/rejected": -31.249042510986328,
+      "loss": 0.32,
+      "rewards/accuracies": 0.856249988079071,
+      "rewards/chosen": 1.147273063659668,
+      "rewards/margins": 1.7833982706069946,
+      "rewards/rejected": -0.6361253261566162,
+      "step": 290
+    },
+    {
+      "epoch": 5.3452115812917596,
+      "grad_norm": 7.184595584869385,
+      "learning_rate": 2.624614714151743e-06,
+      "logits/chosen": -0.0994877815246582,
+      "logits/rejected": -0.17965565621852875,
+      "logps/chosen": -12.375929832458496,
+      "logps/rejected": -31.211719512939453,
+      "loss": 0.3588,
+      "rewards/accuracies": 0.8500000238418579,
+      "rewards/chosen": 1.1141376495361328,
+      "rewards/margins": 1.7681716680526733,
+      "rewards/rejected": -0.6540343165397644,
+      "step": 300
+    },
+    {
+      "epoch": 5.3452115812917596,
+      "eval_logits/chosen": -0.1254928857088089,
+      "eval_logits/rejected": -0.19310244917869568,
+      "eval_logps/chosen": -13.40658187866211,
+      "eval_logps/rejected": -29.754491806030273,
+      "eval_loss": 0.4751954674720764,
+      "eval_rewards/accuracies": 0.7699999809265137,
+      "eval_rewards/chosen": 0.9744812846183777,
+      "eval_rewards/margins": 1.4269884824752808,
+      "eval_rewards/rejected": -0.452507346868515,
+      "eval_runtime": 9.7087,
+      "eval_samples_per_second": 10.3,
+      "eval_steps_per_second": 5.15,
+      "step": 300
+    },
+    {
+      "epoch": 5.523385300668151,
+      "grad_norm": 3.5659422874450684,
+      "learning_rate": 2.4688342135114625e-06,
+      "logits/chosen": -0.13226279616355896,
+      "logits/rejected": -0.21120555698871613,
+      "logps/chosen": -11.256922721862793,
+      "logps/rejected": -30.438766479492188,
+      "loss": 0.3473,
+      "rewards/accuracies": 0.831250011920929,
+      "rewards/chosen": 1.2133493423461914,
+      "rewards/margins": 1.790173888206482,
+      "rewards/rejected": -0.5768246054649353,
+      "step": 310
+    },
+    {
+      "epoch": 5.701559020044543,
+      "grad_norm": 1.8460242748260498,
+      "learning_rate": 2.3131747660339396e-06,
+      "logits/chosen": -0.14635451138019562,
+      "logits/rejected": -0.19989195466041565,
+      "logps/chosen": -12.754526138305664,
+      "logps/rejected": -30.52984619140625,
+      "loss": 0.3378,
+      "rewards/accuracies": 0.856249988079071,
+      "rewards/chosen": 1.090704321861267,
+      "rewards/margins": 1.7227219343185425,
+      "rewards/rejected": -0.6320176124572754,
+      "step": 320
+    },
+    {
+      "epoch": 5.879732739420936,
+      "grad_norm": 5.234955310821533,
+      "learning_rate": 2.158240979224817e-06,
+      "logits/chosen": -0.1211402416229248,
+      "logits/rejected": -0.19876351952552795,
+      "logps/chosen": -10.352498054504395,
+      "logps/rejected": -31.713010787963867,
+      "loss": 0.3241,
+      "rewards/accuracies": 0.8812500238418579,
+      "rewards/chosen": 1.1969298124313354,
+      "rewards/margins": 1.8751522302627563,
+      "rewards/rejected": -0.6782223582267761,
+      "step": 330
+    },
+    {
+      "epoch": 6.057906458797327,
+      "grad_norm": 4.994526386260986,
+      "learning_rate": 2.004634642001507e-06,
+      "logits/chosen": -0.13834324479103088,
+      "logits/rejected": -0.2308368682861328,
+      "logps/chosen": -11.07118034362793,
+      "logps/rejected": -31.32500648498535,
+      "loss": 0.3868,
+      "rewards/accuracies": 0.8500000238418579,
+      "rewards/chosen": 1.1334035396575928,
+      "rewards/margins": 1.7996925115585327,
+      "rewards/rejected": -0.6662889719009399,
+      "step": 340
+    },
+    {
+      "epoch": 6.23608017817372,
+      "grad_norm": 3.519150733947754,
+      "learning_rate": 1.852952387243698e-06,
+      "logits/chosen": -0.1219867691397667,
+      "logits/rejected": -0.19822999835014343,
+      "logps/chosen": -10.62753677368164,
+      "logps/rejected": -32.93171310424805,
+      "loss": 0.2861,
+      "rewards/accuracies": 0.893750011920929,
+      "rewards/chosen": 1.1698468923568726,
+      "rewards/margins": 2.0219147205352783,
+      "rewards/rejected": -0.8520679473876953,
+      "step": 350
+    },
+    {
+      "epoch": 6.23608017817372,
+      "eval_logits/chosen": -0.11021164059638977,
+      "eval_logits/rejected": -0.17845386266708374,
+      "eval_logps/chosen": -13.759096145629883,
+      "eval_logps/rejected": -30.731969833374023,
+      "eval_loss": 0.4811996519565582,
+      "eval_rewards/accuracies": 0.7699999809265137,
+      "eval_rewards/chosen": 0.9392297267913818,
+      "eval_rewards/margins": 1.4894847869873047,
+      "eval_rewards/rejected": -0.5502550601959229,
+      "eval_runtime": 9.6948,
+      "eval_samples_per_second": 10.315,
+      "eval_steps_per_second": 5.157,
+      "step": 350
+    },
+    {
+      "epoch": 6.414253897550111,
+      "grad_norm": 2.1364710330963135,
+      "learning_rate": 1.7037833743707892e-06,
+      "logits/chosen": -0.13026945292949677,
+      "logits/rejected": -0.2250211238861084,
+      "logps/chosen": -11.657172203063965,
+      "logps/rejected": -31.794097900390625,
+      "loss": 0.3096,
+      "rewards/accuracies": 0.875,
+      "rewards/chosen": 1.1625641584396362,
+      "rewards/margins": 1.8979129791259766,
+      "rewards/rejected": -0.7353487610816956,
+      "step": 360
+    },
+    {
+      "epoch": 6.5924276169265035,
+      "grad_norm": 6.27597713470459,
+      "learning_rate": 1.5577070009474872e-06,
+      "logits/chosen": -0.08007471263408661,
+      "logits/rejected": -0.14471502602100372,
+      "logps/chosen": -12.703398704528809,
+      "logps/rejected": -31.92864418029785,
+      "loss": 0.3395,
+      "rewards/accuracies": 0.84375,
+      "rewards/chosen": 1.0662773847579956,
+      "rewards/margins": 1.84355628490448,
+      "rewards/rejected": -0.7772787809371948,
+      "step": 370
+    },
+    {
+      "epoch": 6.770601336302895,
+      "grad_norm": 1.5935697555541992,
+      "learning_rate": 1.415290652206105e-06,
+      "logits/chosen": -0.15049786865711212,
+      "logits/rejected": -0.21002641320228577,
+      "logps/chosen": -12.739409446716309,
+      "logps/rejected": -32.949214935302734,
+      "loss": 0.3588,
+      "rewards/accuracies": 0.84375,
+      "rewards/chosen": 1.0807316303253174,
+      "rewards/margins": 1.8172613382339478,
+      "rewards/rejected": -0.7365297675132751,
+      "step": 380
+    },
+    {
+      "epoch": 6.948775055679287,
+      "grad_norm": 12.438933372497559,
+      "learning_rate": 1.2770874972267777e-06,
+      "logits/chosen": -0.10974359512329102,
+      "logits/rejected": -0.1568942815065384,
+      "logps/chosen": -12.335737228393555,
+      "logps/rejected": -33.184993743896484,
+      "loss": 0.3357,
+      "rewards/accuracies": 0.8687500357627869,
+      "rewards/chosen": 1.1363105773925781,
+      "rewards/margins": 1.9653358459472656,
+      "rewards/rejected": -0.8290252685546875,
+      "step": 390
+    },
+    {
+      "epoch": 7.12694877505568,
+      "grad_norm": 2.4859490394592285,
+      "learning_rate": 1.1436343403356019e-06,
+      "logits/chosen": -0.11985152959823608,
+      "logits/rejected": -0.18024906516075134,
+      "logps/chosen": -13.101457595825195,
+      "logps/rejected": -32.767181396484375,
+      "loss": 0.3662,
+      "rewards/accuracies": 0.84375,
+      "rewards/chosen": 0.9988664984703064,
+      "rewards/margins": 1.8207927942276,
+      "rewards/rejected": -0.8219264149665833,
+      "step": 400
+    },
+    {
+      "epoch": 7.12694877505568,
+      "eval_logits/chosen": -0.099003367125988,
+      "eval_logits/rejected": -0.16790008544921875,
+      "eval_logps/chosen": -13.986225128173828,
+      "eval_logps/rejected": -31.5858154296875,
+      "eval_loss": 0.4867512881755829,
+      "eval_rewards/accuracies": 0.7699999809265137,
+      "eval_rewards/chosen": 0.9165167212486267,
+      "eval_rewards/margins": 1.5521563291549683,
+      "eval_rewards/rejected": -0.6356395483016968,
+      "eval_runtime": 9.7105,
+      "eval_samples_per_second": 10.298,
+      "eval_steps_per_second": 5.149,
+      "step": 400
+    },
+    {
+      "epoch": 7.305122494432071,
+      "grad_norm": 4.4709153175354,
+      "learning_rate": 1.0154495360662464e-06,
+      "logits/chosen": -0.12129449844360352,
+      "logits/rejected": -0.18251024186611176,
+      "logps/chosen": -13.18463134765625,
+      "logps/rejected": -32.92716598510742,
+      "loss": 0.366,
+      "rewards/accuracies": 0.862500011920929,
+      "rewards/chosen": 1.057719349861145,
+      "rewards/margins": 1.885604739189148,
+      "rewards/rejected": -0.8278852701187134,
+      "step": 410
+    },
+    {
+      "epoch": 7.4832962138084635,
+      "grad_norm": 3.132544994354248,
+      "learning_rate": 8.930309757836517e-07,
+      "logits/chosen": -0.10683136433362961,
+      "logits/rejected": -0.1750011295080185,
+      "logps/chosen": -11.23306941986084,
+      "logps/rejected": -32.81039047241211,
+      "loss": 0.3044,
+      "rewards/accuracies": 0.8812500238418579,
+      "rewards/chosen": 1.1720637083053589,
+      "rewards/margins": 2.0555508136749268,
+      "rewards/rejected": -0.8834872245788574,
+      "step": 420
+    },
+    {
+      "epoch": 7.661469933184855,
+      "grad_norm": 6.80822229385376,
+      "learning_rate": 7.768541537901325e-07,
+      "logits/chosen": -0.11390836536884308,
+      "logits/rejected": -0.2062007039785385,
+      "logps/chosen": -11.768353462219238,
+      "logps/rejected": -33.399192810058594,
+      "loss": 0.3337,
+      "rewards/accuracies": 0.8687500357627869,
+      "rewards/chosen": 1.1444917917251587,
+      "rewards/margins": 1.9670883417129517,
781
+ "rewards/rejected": -0.8225963711738586,
782
+ "step": 430
783
+ },
784
+ {
785
+ "epoch": 7.839643652561247,
786
+ "grad_norm": 4.160062789916992,
787
+ "learning_rate": 6.673703204254348e-07,
788
+ "logits/chosen": -0.1045166477560997,
789
+ "logits/rejected": -0.18522079288959503,
790
+ "logps/chosen": -10.534876823425293,
791
+ "logps/rejected": -34.59806442260742,
792
+ "loss": 0.2644,
793
+ "rewards/accuracies": 0.893750011920929,
794
+ "rewards/chosen": 1.2115627527236938,
795
+ "rewards/margins": 2.1745359897613525,
796
+ "rewards/rejected": -0.9629732370376587,
797
+ "step": 440
798
+ },
799
+ {
800
+ "epoch": 8.017817371937639,
801
+ "grad_norm": 2.6154518127441406,
802
+ "learning_rate": 5.650047293344316e-07,
803
+ "logits/chosen": -0.07358330488204956,
804
+ "logits/rejected": -0.14352913200855255,
805
+ "logps/chosen": -10.867822647094727,
806
+ "logps/rejected": -33.30778884887695,
807
+ "loss": 0.2822,
808
+ "rewards/accuracies": 0.8687500357627869,
809
+ "rewards/chosen": 1.2662678956985474,
810
+ "rewards/margins": 2.097227096557617,
811
+ "rewards/rejected": -0.830959141254425,
812
+ "step": 450
813
+ },
814
+ {
815
+ "epoch": 8.017817371937639,
816
+ "eval_logits/chosen": -0.09359554946422577,
817
+ "eval_logits/rejected": -0.16219539940357208,
818
+ "eval_logps/chosen": -14.051919937133789,
819
+ "eval_logps/rejected": -31.741615295410156,
820
+ "eval_loss": 0.49268820881843567,
821
+ "eval_rewards/accuracies": 0.7599999904632568,
822
+ "eval_rewards/chosen": 0.9099473357200623,
823
+ "eval_rewards/margins": 1.5611672401428223,
824
+ "eval_rewards/rejected": -0.6512197852134705,
825
+ "eval_runtime": 9.7252,
826
+ "eval_samples_per_second": 10.283,
827
+ "eval_steps_per_second": 5.141,
828
+ "step": 450
829
+ },
830
+ {
831
+ "epoch": 8.195991091314031,
832
+ "grad_norm": 3.175278425216675,
833
+ "learning_rate": 4.7015498571035877e-07,
834
+ "logits/chosen": -0.08536185324192047,
835
+ "logits/rejected": -0.16373853385448456,
836
+ "logps/chosen": -10.688807487487793,
837
+ "logps/rejected": -35.358089447021484,
838
+ "loss": 0.2417,
839
+ "rewards/accuracies": 0.925000011920929,
840
+ "rewards/chosen": 1.2969163656234741,
841
+ "rewards/margins": 2.3196523189544678,
842
+ "rewards/rejected": -1.022735834121704,
843
+ "step": 460
844
+ },
845
+ {
846
+ "epoch": 8.374164810690424,
847
+ "grad_norm": 10.432084083557129,
848
+ "learning_rate": 3.831895019292897e-07,
849
+ "logits/chosen": -0.11144526302814484,
850
+ "logits/rejected": -0.18342572450637817,
851
+ "logps/chosen": -12.896112442016602,
852
+ "logps/rejected": -34.08356857299805,
853
+ "loss": 0.3277,
854
+ "rewards/accuracies": 0.90625,
855
+ "rewards/chosen": 1.0476747751235962,
856
+ "rewards/margins": 1.9816349744796753,
857
+ "rewards/rejected": -0.9339599609375,
858
+ "step": 470
859
+ },
860
+ {
861
+ "epoch": 8.552338530066816,
862
+ "grad_norm": 4.943639278411865,
863
+ "learning_rate": 3.044460665744284e-07,
864
+ "logits/chosen": -0.06191817671060562,
865
+ "logits/rejected": -0.13474522531032562,
866
+ "logps/chosen": -11.490986824035645,
867
+ "logps/rejected": -33.3411750793457,
868
+ "loss": 0.3482,
869
+ "rewards/accuracies": 0.862500011920929,
870
+ "rewards/chosen": 1.0883612632751465,
871
+ "rewards/margins": 1.9929107427597046,
872
+ "rewards/rejected": -0.9045494198799133,
873
+ "step": 480
874
+ },
875
+ {
876
+ "epoch": 8.730512249443207,
877
+ "grad_norm": 4.994484901428223,
878
+ "learning_rate": 2.3423053240837518e-07,
879
+ "logits/chosen": -0.11047738045454025,
880
+ "logits/rejected": -0.1758051961660385,
881
+ "logps/chosen": -12.3035306930542,
882
+ "logps/rejected": -33.04331588745117,
883
+ "loss": 0.3433,
884
+ "rewards/accuracies": 0.831250011920929,
885
+ "rewards/chosen": 1.051175594329834,
886
+ "rewards/margins": 1.9459151029586792,
887
+ "rewards/rejected": -0.8947394490242004,
888
+ "step": 490
889
+ },
890
+ {
891
+ "epoch": 8.908685968819599,
892
+ "grad_norm": 2.0751497745513916,
893
+ "learning_rate": 1.7281562838948968e-07,
894
+ "logits/chosen": -0.12217041105031967,
895
+ "logits/rejected": -0.19585369527339935,
896
+ "logps/chosen": -10.848115921020508,
897
+ "logps/rejected": -34.68525314331055,
898
+ "loss": 0.2416,
899
+ "rewards/accuracies": 0.887499988079071,
900
+ "rewards/chosen": 1.2858046293258667,
901
+ "rewards/margins": 2.253690481185913,
902
+ "rewards/rejected": -0.9678859114646912,
903
+ "step": 500
904
+ },
905
+ {
906
+ "epoch": 8.908685968819599,
907
+ "eval_logits/chosen": -0.08983445167541504,
908
+ "eval_logits/rejected": -0.15845781564712524,
909
+ "eval_logps/chosen": -14.23983097076416,
910
+ "eval_logps/rejected": -32.18784713745117,
911
+ "eval_loss": 0.49790748953819275,
912
+ "eval_rewards/accuracies": 0.7599999904632568,
913
+ "eval_rewards/chosen": 0.8911561369895935,
914
+ "eval_rewards/margins": 1.586998701095581,
915
+ "eval_rewards/rejected": -0.6958425641059875,
916
+ "eval_runtime": 9.7042,
917
+ "eval_samples_per_second": 10.305,
918
+ "eval_steps_per_second": 5.152,
919
+ "step": 500
920
+ },
921
+ {
922
+ "epoch": 9.086859688195991,
923
+ "grad_norm": 4.091846466064453,
924
+ "learning_rate": 1.2043990034669413e-07,
925
+ "logits/chosen": -0.10412635654211044,
926
+ "logits/rejected": -0.20009151101112366,
927
+ "logps/chosen": -11.962130546569824,
928
+ "logps/rejected": -34.19890594482422,
929
+ "loss": 0.3145,
930
+ "rewards/accuracies": 0.887499988079071,
931
+ "rewards/chosen": 1.1489673852920532,
932
+ "rewards/margins": 2.0444374084472656,
933
+ "rewards/rejected": -0.8954699635505676,
934
+ "step": 510
935
+ },
936
+ {
937
+ "epoch": 9.265033407572384,
938
+ "grad_norm": 2.8685364723205566,
939
+ "learning_rate": 7.730678442730539e-08,
940
+ "logits/chosen": -0.13043461740016937,
941
+ "logits/rejected": -0.1936294585466385,
942
+ "logps/chosen": -12.278414726257324,
943
+ "logps/rejected": -35.482120513916016,
944
+ "loss": 0.2503,
945
+ "rewards/accuracies": 0.90625,
946
+ "rewards/chosen": 1.113724946975708,
947
+ "rewards/margins": 2.1855549812316895,
948
+ "rewards/rejected": -1.0718300342559814,
949
+ "step": 520
950
+ },
951
+ {
952
+ "epoch": 9.443207126948774,
953
+ "grad_norm": 8.583150863647461,
954
+ "learning_rate": 4.358381691677932e-08,
955
+ "logits/chosen": -0.1350499838590622,
956
+ "logits/rejected": -0.1929396241903305,
957
+ "logps/chosen": -12.875258445739746,
958
+ "logps/rejected": -33.197200775146484,
959
+ "loss": 0.3269,
960
+ "rewards/accuracies": 0.862500011920929,
961
+ "rewards/chosen": 1.0666710138320923,
962
+ "rewards/margins": 1.9351733922958374,
963
+ "rewards/rejected": -0.8685024380683899,
964
+ "step": 530
965
+ },
966
+ {
967
+ "epoch": 9.621380846325167,
968
+ "grad_norm": 11.428977012634277,
969
+ "learning_rate": 1.9401983499569843e-08,
970
+ "logits/chosen": -0.1044853925704956,
971
+ "logits/rejected": -0.17102347314357758,
972
+ "logps/chosen": -11.800514221191406,
973
+ "logps/rejected": -33.656124114990234,
974
+ "loss": 0.3398,
975
+ "rewards/accuracies": 0.856249988079071,
976
+ "rewards/chosen": 1.1207963228225708,
977
+ "rewards/margins": 1.9993568658828735,
978
+ "rewards/rejected": -0.878560483455658,
979
+ "step": 540
980
+ },
981
+ {
982
+ "epoch": 9.799554565701559,
983
+ "grad_norm": 2.2846410274505615,
984
+ "learning_rate": 4.855210488670381e-09,
985
+ "logits/chosen": -0.08743849396705627,
986
+ "logits/rejected": -0.16115520894527435,
987
+ "logps/chosen": -11.04604434967041,
988
+ "logps/rejected": -34.041011810302734,
989
+ "loss": 0.3096,
990
+ "rewards/accuracies": 0.887499988079071,
991
+ "rewards/chosen": 1.1628315448760986,
992
+ "rewards/margins": 2.1141316890716553,
993
+ "rewards/rejected": -0.9513001441955566,
994
+ "step": 550
995
+ },
996
+ {
997
+ "epoch": 9.799554565701559,
998
+ "eval_logits/chosen": -0.08727238327264786,
999
+ "eval_logits/rejected": -0.15482930839061737,
1000
+ "eval_logps/chosen": -14.208100318908691,
1001
+ "eval_logps/rejected": -32.24625015258789,
1002
+ "eval_loss": 0.4933530390262604,
1003
+ "eval_rewards/accuracies": 0.75,
1004
+ "eval_rewards/chosen": 0.8943293690681458,
1005
+ "eval_rewards/margins": 1.5960127115249634,
1006
+ "eval_rewards/rejected": -0.7016833424568176,
1007
+ "eval_runtime": 9.7018,
1008
+ "eval_samples_per_second": 10.307,
1009
+ "eval_steps_per_second": 5.154,
1010
+ "step": 550
1011
+ },
1012
+ {
1013
+ "epoch": 9.977728285077951,
1014
+ "grad_norm": 2.3474748134613037,
1015
+ "learning_rate": 0.0,
1016
+ "logits/chosen": -0.030874544754624367,
1017
+ "logits/rejected": -0.118435338139534,
1018
+ "logps/chosen": -10.919882774353027,
1019
+ "logps/rejected": -32.95122146606445,
1020
+ "loss": 0.3123,
1021
+ "rewards/accuracies": 0.8500000238418579,
1022
+ "rewards/chosen": 1.201145052909851,
1023
+ "rewards/margins": 2.0579233169555664,
1024
+ "rewards/rejected": -0.8567783236503601,
1025
+ "step": 560
1026
+ },
1027
+ {
1028
+ "epoch": 9.977728285077951,
1029
+ "step": 560,
1030
+ "total_flos": 5.23282185018409e+16,
1031
+ "train_loss": 0.4115582968507494,
1032
+ "train_runtime": 2017.6604,
1033
+ "train_samples_per_second": 4.451,
1034
+ "train_steps_per_second": 0.278
1035
+ }
1036
+ ],
1037
+ "logging_steps": 10,
1038
+ "max_steps": 560,
1039
+ "num_input_tokens_seen": 0,
1040
+ "num_train_epochs": 10,
1041
+ "save_steps": 50,
1042
+ "stateful_callbacks": {
1043
+ "TrainerControl": {
1044
+ "args": {
1045
+ "should_epoch_stop": false,
1046
+ "should_evaluate": false,
1047
+ "should_log": false,
1048
+ "should_save": true,
1049
+ "should_training_stop": true
1050
+ },
1051
+ "attributes": {}
1052
+ }
1053
+ },
1054
+ "total_flos": 5.23282185018409e+16,
1055
+ "train_batch_size": 2,
1056
+ "trial_name": null,
1057
+ "trial_params": null
1058
+ }
training_eval_loss.png ADDED
training_loss.png ADDED
training_rewards_accuracies.png ADDED
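The file above is the tail of a Hugging Face `trainer_state.json`: each entry in its log list carries a `step` plus whichever metrics were logged at that step (train entries have `loss`, periodic eval entries have `eval_loss`, reward margins, and so on). A minimal sketch of pulling a metric curve back out of it, assuming the standard `log_history` key used by the `transformers` `Trainer` (the file path in the usage comment is hypothetical):

```python
import json


def extract_curve(state: dict, key: str):
    """Return (steps, values) for every log_history entry that logged `key`."""
    steps, values = [], []
    for entry in state.get("log_history", []):
        if key in entry and "step" in entry:
            steps.append(entry["step"])
            values.append(entry[key])
    return steps, values


# Usage sketch (path is illustrative):
# state = json.load(open("trainer_state.json"))
# steps, eval_loss = extract_curve(state, "eval_loss")
```

Because train and eval entries live in the same list, filtering by key is what separates the two curves, e.g. `extract_curve(state, "loss")` for training loss at every logging step versus `extract_curve(state, "eval_loss")` for the sparser evaluation points.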