weltonwang88 committed on
Commit
798f7cc
·
verified ·
1 Parent(s): 0ea5745

Model save

README.md ADDED
@@ -0,0 +1,68 @@
+ ---
+ base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
+ library_name: transformers
+ model_name: Qwen2.5-1.5B-Open-R1-GRPO-cot-v3
+ tags:
+ - generated_from_trainer
+ - trl
+ - grpo
+ licence: license
+ ---
+
+ # Model Card for Qwen2.5-1.5B-Open-R1-GRPO-cot-v3
+
+ This model is a fine-tuned version of [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B).
+ It has been trained using [TRL](https://github.com/huggingface/trl).
+
+ ## Quick start
+
+ ```python
+ from transformers import pipeline
+
+ question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
+ generator = pipeline("text-generation", model="weltonwang88/Qwen2.5-1.5B-Open-R1-GRPO-cot-v3", device="cuda")
+ output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
+ print(output["generated_text"])
+ ```
+
+ ## Training procedure
+
+ [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/weltonwang88-stanford/huggingface/runs/ef81zz98)
+
+
+ This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
+
+ ### Framework versions
+
+ - TRL: 0.16.0.dev0
+ - Transformers: 4.50.0.dev0
+ - Pytorch: 2.5.1+cu121
+ - Datasets: 3.3.2
+ - Tokenizers: 0.21.1
+
+ ## Citations
+
+ Cite GRPO as:
+
+ ```bibtex
+ @article{zhihong2024deepseekmath,
+ title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
+ author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
+ year = 2024,
+ eprint = {arXiv:2402.03300},
+ }
+
+ ```
+
+ Cite TRL as:
+
+ ```bibtex
+ @misc{vonwerra2022trl,
+ title = {{TRL: Transformer Reinforcement Learning}},
+ author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
+ year = 2020,
+ journal = {GitHub repository},
+ publisher = {GitHub},
+ howpublished = {\url{https://github.com/huggingface/trl}}
+ }
+ ```
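The GRPO objective referenced in the model card scores each completion relative to the other completions sampled for the same prompt. As a rough illustration of that group-relative baseline (a minimal stdlib sketch, not TRL's actual implementation, which also handles epsilon smoothing and batching):

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each reward against the mean and (population) std of its
    sampled group, as in GRPO's group-relative baseline. Returns zeros when
    every completion in the group earned the same reward."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Example: four sampled completions for one prompt, two of which were correct.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advs)  # [1.0, -1.0, -1.0, 1.0]
```

The zero-mean advantages are what replace a learned value baseline in GRPO; completions above the group average are reinforced, those below are penalized.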
all_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "total_flos": 0.0,
+ "train_loss": 0.6775813137989773,
+ "train_runtime": 20674.7257,
+ "train_samples": 50,
+ "train_samples_per_second": 0.087,
+ "train_steps_per_second": 0.005
+ }
generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+ "_from_model_config": true,
+ "bos_token_id": 151646,
+ "do_sample": true,
+ "eos_token_id": 151643,
+ "temperature": 0.6,
+ "top_p": 0.95,
+ "transformers_version": "4.50.0.dev0"
+ }
train_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "total_flos": 0.0,
+ "train_loss": 0.6775813137989773,
+ "train_runtime": 20674.7257,
+ "train_samples": 50,
+ "train_samples_per_second": 0.087,
+ "train_steps_per_second": 0.005
+ }
trainer_state.json ADDED
@@ -0,0 +1,1229 @@
+ {
+ "best_metric": null,
+ "best_model_checkpoint": null,
+ "epoch": 3.9955555555555557,
+ "eval_steps": 30,
+ "global_step": 112,
+ "is_hyper_param_search": false,
+ "is_local_process_zero": true,
+ "is_world_process_zero": true,
+ "log_history": [
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 1992.2813301086426,
+ "epoch": 0.035555555555555556,
+ "grad_norm": 0.11222778239149554,
+ "kl": 0.0,
+ "learning_rate": 1.6666666666666667e-06,
+ "loss": 0.0509,
+ "reward": -8.534029252827168,
+ "reward_std": 3.1286671087145805,
+ "rewards/cot_length_penalty_reward": -8.792958237230778,
+ "rewards/math_latex_accuracy_reward": 0.2589285857975483,
+ "step": 1
+ },
+ {
+ "clip_ratio": 0.0,
+ "epoch": 0.07111111111111111,
+ "grad_norm": 0.11224900964736327,
+ "kl": 0.0,
+ "learning_rate": 3.3333333333333333e-06,
+ "loss": 0.0509,
+ "step": 2
+ },
+ {
+ "clip_ratio": 0.002639908329001628,
+ "epoch": 0.10666666666666667,
+ "grad_norm": 0.11190265883625335,
+ "kl": 0.0004132986068725586,
+ "learning_rate": 5e-06,
+ "loss": 0.051,
+ "step": 3
+ },
+ {
+ "clip_ratio": 0.0026859724457608536,
+ "epoch": 0.14222222222222222,
+ "grad_norm": 0.10842287053917446,
+ "kl": 0.00042808055877685547,
+ "learning_rate": 6.666666666666667e-06,
+ "loss": 0.0506,
+ "step": 4
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 2354.6050567626953,
+ "epoch": 0.17777777777777778,
+ "grad_norm": 0.11660083282365401,
+ "kl": 0.0005452632904052734,
+ "learning_rate": 8.333333333333334e-06,
+ "loss": 0.0532,
+ "reward": -9.576663568615913,
+ "reward_std": 3.4961936213076115,
+ "rewards/cot_length_penalty_reward": -9.817734986543655,
+ "rewards/math_latex_accuracy_reward": 0.24107144074514508,
+ "step": 5
+ },
+ {
+ "clip_ratio": 0.004271271725883707,
+ "epoch": 0.21333333333333335,
+ "grad_norm": 0.15124528039054966,
+ "kl": 0.0023946762084960938,
+ "learning_rate": 1e-05,
+ "loss": 0.0521,
+ "step": 6
+ },
+ {
+ "clip_ratio": 0.00593576196115464,
+ "epoch": 0.24888888888888888,
+ "grad_norm": 0.22845939154064746,
+ "kl": 0.0017180442810058594,
+ "learning_rate": 1.1666666666666668e-05,
+ "loss": 0.0519,
+ "step": 7
+ },
+ {
+ "clip_ratio": 0.00681446076487191,
+ "epoch": 0.28444444444444444,
+ "grad_norm": 0.25815482225017694,
+ "kl": 0.002631664276123047,
+ "learning_rate": 1.3333333333333333e-05,
+ "loss": 0.0484,
+ "step": 8
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 2251.4844703674316,
+ "epoch": 0.32,
+ "grad_norm": 0.11515864264586664,
+ "kl": 0.0024547576904296875,
+ "learning_rate": 1.5000000000000002e-05,
+ "loss": 0.0261,
+ "reward": -8.878705874085426,
+ "reward_std": 3.5140193179249763,
+ "rewards/cot_length_penalty_reward": -9.128705888986588,
+ "rewards/math_latex_accuracy_reward": 0.2500000149011612,
+ "step": 9
+ },
+ {
+ "clip_ratio": 0.0070763813419034705,
+ "epoch": 0.35555555555555557,
+ "grad_norm": 0.2974695944541913,
+ "kl": 0.005417823791503906,
+ "learning_rate": 1.6666666666666667e-05,
+ "loss": 0.0255,
+ "step": 10
+ },
+ {
+ "clip_ratio": 0.010105093329912052,
+ "epoch": 0.39111111111111113,
+ "grad_norm": 85.04432926044146,
+ "kl": 0.007180213928222656,
+ "learning_rate": 1.8333333333333333e-05,
+ "loss": 19.5224,
+ "step": 11
+ },
+ {
+ "clip_ratio": 0.017627036664634943,
+ "epoch": 0.4266666666666667,
+ "grad_norm": 2.432235493998052,
+ "kl": 0.0702056884765625,
+ "learning_rate": 2e-05,
+ "loss": 0.0247,
+ "step": 12
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 1870.0737648010254,
+ "epoch": 0.4622222222222222,
+ "grad_norm": 0.32693917550649865,
+ "kl": 0.022480010986328125,
+ "learning_rate": 1.9995065603657317e-05,
+ "loss": 0.0103,
+ "reward": -9.444286078214645,
+ "reward_std": 3.207320176064968,
+ "rewards/cot_length_penalty_reward": -9.848303943872452,
+ "rewards/math_latex_accuracy_reward": 0.4040178805589676,
+ "step": 13
+ },
+ {
+ "clip_ratio": 0.004404508654261008,
+ "epoch": 0.49777777777777776,
+ "grad_norm": 1.6311442510668763,
+ "kl": 0.010528564453125,
+ "learning_rate": 1.9980267284282718e-05,
+ "loss": 0.0093,
+ "step": 14
+ },
+ {
+ "clip_ratio": 0.0062026621017139405,
+ "epoch": 0.5333333333333333,
+ "grad_norm": 0.5615009008596469,
+ "kl": 0.051082611083984375,
+ "learning_rate": 1.99556196460308e-05,
+ "loss": 0.0076,
+ "step": 15
+ },
+ {
+ "clip_ratio": 0.006701507809339091,
+ "epoch": 0.5688888888888889,
+ "grad_norm": 0.15171374489080983,
+ "kl": 0.018802642822265625,
+ "learning_rate": 1.9921147013144782e-05,
+ "loss": 0.0041,
+ "step": 16
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 2222.4309005737305,
+ "epoch": 0.6044444444444445,
+ "grad_norm": 0.11283429637288535,
+ "kl": 0.01442718505859375,
+ "learning_rate": 1.9876883405951378e-05,
+ "loss": 0.081,
+ "reward": -9.554554164409637,
+ "reward_std": 4.235630825161934,
+ "rewards/cot_length_penalty_reward": -9.844732716679573,
+ "rewards/math_latex_accuracy_reward": 0.29017858393490314,
+ "step": 17
+ },
+ {
+ "clip_ratio": 0.004120954225072637,
+ "epoch": 0.64,
+ "grad_norm": 0.10899171004441378,
+ "kl": 0.015628814697265625,
+ "learning_rate": 1.982287250728689e-05,
+ "loss": 0.2482,
+ "step": 18
+ },
+ {
+ "clip_ratio": 0.005419444481958635,
+ "epoch": 0.6755555555555556,
+ "grad_norm": 0.1157331857796372,
+ "kl": 0.01764678955078125,
+ "learning_rate": 1.9759167619387474e-05,
+ "loss": 0.2459,
+ "step": 19
+ },
+ {
+ "clip_ratio": 0.005958295805612579,
+ "epoch": 0.7111111111111111,
+ "grad_norm": 0.10652991525129403,
+ "kl": 0.01905059814453125,
+ "learning_rate": 1.9685831611286312e-05,
+ "loss": 0.2434,
+ "step": 20
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 2501.2255668640137,
+ "epoch": 0.7466666666666667,
+ "grad_norm": 0.12065925342674215,
+ "kl": 0.020366668701171875,
+ "learning_rate": 1.9602936856769432e-05,
+ "loss": 0.033,
+ "reward": -11.581663489341736,
+ "reward_std": 4.305310405790806,
+ "rewards/cot_length_penalty_reward": -11.86737784743309,
+ "rewards/math_latex_accuracy_reward": 0.2857142973225564,
+ "step": 21
+ },
+ {
+ "clip_ratio": 0.0039043642027536407,
+ "epoch": 0.7822222222222223,
+ "grad_norm": 0.36463686705700576,
+ "kl": 0.019252777099609375,
+ "learning_rate": 1.9510565162951538e-05,
+ "loss": 0.0325,
+ "step": 22
+ },
+ {
+ "clip_ratio": 0.005750590149546042,
+ "epoch": 0.8177777777777778,
+ "grad_norm": 20283.4909421177,
+ "kl": 1147.058982849121,
+ "learning_rate": 1.9408807689542257e-05,
+ "loss": 46.027,
+ "step": 23
+ },
+ {
+ "clip_ratio": 0.008487990504363552,
+ "epoch": 0.8533333333333334,
+ "grad_norm": 0.17175661916148474,
+ "kl": 0.026885986328125,
+ "learning_rate": 1.9297764858882516e-05,
+ "loss": 0.0289,
+ "step": 24
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 2333.062614440918,
+ "epoch": 0.8888888888888888,
+ "grad_norm": 0.10364225115067384,
+ "kl": 0.0201416015625,
+ "learning_rate": 1.9177546256839814e-05,
+ "loss": 0.0113,
+ "reward": -10.619498401880264,
+ "reward_std": 3.768970273435116,
+ "rewards/cot_length_penalty_reward": -10.869498312473297,
+ "rewards/math_latex_accuracy_reward": 0.2500000123400241,
+ "step": 25
+ },
+ {
+ "clip_ratio": 0.0032545153953833506,
+ "epoch": 0.9244444444444444,
+ "grad_norm": 0.10471903488067007,
+ "kl": 0.02140045166015625,
+ "learning_rate": 1.9048270524660197e-05,
+ "loss": 0.0103,
+ "step": 26
+ },
+ {
+ "clip_ratio": 0.004354664255515672,
+ "epoch": 0.96,
+ "grad_norm": 0.0965546219660842,
+ "kl": 0.0223541259765625,
+ "learning_rate": 1.891006524188368e-05,
+ "loss": 0.0085,
+ "step": 27
+ },
+ {
+ "clip_ratio": 0.005275880845147185,
+ "epoch": 0.9955555555555555,
+ "grad_norm": 0.10544368470129743,
+ "kl": 0.023712158203125,
+ "learning_rate": 1.8763066800438638e-05,
+ "loss": 0.0065,
+ "step": 28
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 2427.296974182129,
+ "epoch": 1.0355555555555556,
+ "grad_norm": 0.42983301720943096,
+ "kl": 0.03394317626953125,
+ "learning_rate": 1.860742027003944e-05,
+ "loss": 0.0039,
+ "reward": -11.214900106191635,
+ "reward_std": 3.760936316102743,
+ "rewards/cot_length_penalty_reward": -11.5006143450737,
+ "rewards/math_latex_accuracy_reward": 0.285714297555387,
+ "step": 29
+ },
+ {
+ "epoch": 1.0711111111111111,
+ "grad_norm": 0.1067773530257121,
+ "learning_rate": 1.8443279255020153e-05,
+ "loss": 0.0071,
+ "step": 30
+ },
+ {
+ "epoch": 1.0711111111111111,
+ "eval_clip_ratio": 0.0,
+ "eval_completion_length": 2303.7637939453125,
+ "eval_kl": 0.025606595552884616,
+ "eval_loss": 0.04365207254886627,
+ "eval_reward": -8.784464891140278,
+ "eval_reward_std": 3.7600448498359094,
+ "eval_rewards/cot_length_penalty_reward": -9.116882379238422,
+ "eval_rewards/math_latex_accuracy_reward": 0.3324175958450024,
+ "eval_runtime": 448.0952,
+ "eval_samples_per_second": 0.112,
+ "eval_steps_per_second": 0.004,
+ "step": 30
+ },
+ {
+ "clip_ratio": 0.0037656883359886706,
+ "epoch": 1.1066666666666667,
+ "grad_norm": 0.7346983824268155,
+ "kl": 0.027835845947265625,
+ "learning_rate": 1.827080574274562e-05,
+ "loss": 0.0033,
+ "step": 31
+ },
+ {
+ "clip_ratio": 0.006145871157059446,
+ "epoch": 1.1422222222222222,
+ "grad_norm": 11.473259751864658,
+ "kl": 1.4422760009765625,
+ "learning_rate": 1.8090169943749477e-05,
+ "loss": 0.0549,
+ "step": 32
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 2531.1674995422363,
+ "epoch": 1.1777777777777778,
+ "grad_norm": 0.11984100416609303,
+ "kl": 0.03124237060546875,
+ "learning_rate": 1.7901550123756906e-05,
+ "loss": 0.0239,
+ "reward": -8.361417755484581,
+ "reward_std": 4.016169548034668,
+ "rewards/cot_length_penalty_reward": -8.689542889595032,
+ "rewards/math_latex_accuracy_reward": 0.3281250139698386,
+ "step": 33
+ },
+ {
+ "clip_ratio": 0.004058451057062484,
+ "epoch": 1.2133333333333334,
+ "grad_norm": 0.16684935043495766,
+ "kl": 0.03450775146484375,
+ "learning_rate": 1.7705132427757895e-05,
+ "loss": 0.0232,
+ "step": 34
+ },
+ {
+ "clip_ratio": 0.006084064312744886,
+ "epoch": 1.248888888888889,
+ "grad_norm": 0.11163561617870978,
+ "kl": 0.0318145751953125,
+ "learning_rate": 1.7501110696304598e-05,
+ "loss": 0.0214,
+ "step": 35
+ },
+ {
+ "clip_ratio": 0.007263028150191531,
+ "epoch": 1.2844444444444445,
+ "grad_norm": 0.11128954549532989,
+ "kl": 0.03281402587890625,
+ "learning_rate": 1.7289686274214116e-05,
+ "loss": 0.0197,
+ "step": 36
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 1888.4487342834473,
+ "epoch": 1.32,
+ "grad_norm": 0.6837402463902799,
+ "kl": 0.05413055419921875,
+ "learning_rate": 1.7071067811865477e-05,
+ "loss": 0.1151,
+ "reward": -7.783694684505463,
+ "reward_std": 3.098730646073818,
+ "rewards/cot_length_penalty_reward": -8.16985534131527,
+ "rewards/math_latex_accuracy_reward": 0.3861607341095805,
+ "step": 37
+ },
+ {
+ "clip_ratio": 0.0028788788622478023,
+ "epoch": 1.3555555555555556,
+ "grad_norm": 2.464864347264584,
+ "kl": 0.04084014892578125,
+ "learning_rate": 1.684547105928689e-05,
+ "loss": 0.3644,
+ "step": 38
+ },
+ {
+ "clip_ratio": 0.004996606716304086,
+ "epoch": 1.3911111111111112,
+ "grad_norm": 0.32236476835294475,
+ "kl": 0.04229736328125,
+ "learning_rate": 1.661311865323652e-05,
+ "loss": 0.1132,
+ "step": 39
+ },
+ {
+ "clip_ratio": 0.005848184140631929,
+ "epoch": 1.4266666666666667,
+ "grad_norm": 2.236113570550632,
+ "kl": 0.180694580078125,
+ "learning_rate": 1.63742398974869e-05,
+ "loss": 0.116,
+ "step": 40
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 1681.7277793884277,
+ "epoch": 1.462222222222222,
+ "grad_norm": 55.91487468875501,
+ "kl": 1.1835174560546875,
+ "learning_rate": 1.6129070536529767e-05,
+ "loss": 0.0918,
+ "reward": -7.822701282799244,
+ "reward_std": 2.5297958850860596,
+ "rewards/cot_length_penalty_reward": -8.191005058586597,
+ "rewards/math_latex_accuracy_reward": 0.3683035862632096,
+ "step": 41
+ },
+ {
+ "clip_ratio": 0.003095990905421786,
+ "epoch": 1.4977777777777779,
+ "grad_norm": 3454.9772869499748,
+ "kl": 0.0470123291015625,
+ "learning_rate": 1.5877852522924733e-05,
+ "loss": 6.8427,
+ "step": 42
+ },
+ {
+ "clip_ratio": 0.005094703097711317,
+ "epoch": 1.5333333333333332,
+ "grad_norm": 15.584761637863307,
+ "kl": 1.0414886474609375,
+ "learning_rate": 1.5620833778521306e-05,
+ "loss": 0.0866,
+ "step": 43
+ },
+ {
+ "clip_ratio": 0.008272722363471985,
+ "epoch": 1.568888888888889,
+ "grad_norm": 1.2550063449844295,
+ "kl": 0.04656219482421875,
+ "learning_rate": 1.5358267949789968e-05,
+ "loss": 0.0502,
+ "step": 44
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 2313.542510986328,
+ "epoch": 1.6044444444444443,
+ "grad_norm": 0.1122939508940078,
+ "kl": 0.038604736328125,
+ "learning_rate": 1.5090414157503715e-05,
+ "loss": 0.0762,
+ "reward": -9.223605461418629,
+ "reward_std": 3.827972359955311,
+ "rewards/cot_length_penalty_reward": -9.58521255850792,
+ "rewards/math_latex_accuracy_reward": 0.36160715692676604,
+ "step": 45
+ },
+ {
+ "clip_ratio": 0.0038211173960007727,
+ "epoch": 1.6400000000000001,
+ "grad_norm": 0.12378389214572502,
+ "kl": 0.04041290283203125,
+ "learning_rate": 1.4817536741017153e-05,
+ "loss": 0.0756,
+ "step": 46
+ },
+ {
+ "clip_ratio": 0.005922177180764265,
+ "epoch": 1.6755555555555555,
+ "grad_norm": 0.12771635689812005,
+ "kl": 0.04157257080078125,
+ "learning_rate": 1.4539904997395468e-05,
+ "loss": 0.0745,
+ "step": 47
+ },
+ {
+ "clip_ratio": 0.006733638554578647,
+ "epoch": 1.7111111111111112,
+ "grad_norm": 0.10941185512569215,
+ "kl": 0.04193115234375,
+ "learning_rate": 1.4257792915650728e-05,
+ "loss": 0.0731,
+ "step": 48
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 1813.1072387695312,
+ "epoch": 1.7466666666666666,
+ "grad_norm": 3.0104818633298525,
+ "kl": 0.2118988037109375,
+ "learning_rate": 1.3971478906347806e-05,
+ "loss": -0.0205,
+ "reward": -10.289654642343521,
+ "reward_std": 3.131831008940935,
+ "rewards/cot_length_penalty_reward": -10.762868821620941,
+ "rewards/math_latex_accuracy_reward": 0.47321430779993534,
+ "step": 49
+ },
+ {
+ "clip_ratio": 0.002227201643108856,
+ "epoch": 1.7822222222222224,
+ "grad_norm": 0.11902180384589063,
+ "kl": 0.04238128662109375,
+ "learning_rate": 1.3681245526846782e-05,
+ "loss": -0.0276,
+ "step": 50
+ },
+ {
+ "clip_ratio": 0.0032200364221353084,
+ "epoch": 1.8177777777777777,
+ "grad_norm": 0.1256236704087802,
+ "kl": 0.04290008544921875,
+ "learning_rate": 1.3387379202452917e-05,
+ "loss": -0.0286,
+ "step": 51
+ },
+ {
+ "clip_ratio": 0.003942100578569807,
+ "epoch": 1.8533333333333335,
+ "grad_norm": 0.10119215538963353,
+ "kl": 0.0430450439453125,
+ "learning_rate": 1.3090169943749475e-05,
+ "loss": -0.03,
+ "step": 52
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 2273.6608276367188,
+ "epoch": 1.8888888888888888,
+ "grad_norm": 0.17450385200895388,
+ "kl": 0.05017852783203125,
+ "learning_rate": 1.2789911060392295e-05,
+ "loss": 0.005,
+ "reward": -7.135257016867399,
+ "reward_std": 3.697649233043194,
+ "rewards/cot_length_penalty_reward": -7.5548999309539795,
+ "rewards/math_latex_accuracy_reward": 0.4196428684517741,
+ "step": 53
+ },
+ {
+ "clip_ratio": 0.002888819137297105,
+ "epoch": 1.9244444444444444,
+ "grad_norm": 0.10967403835477697,
+ "kl": 0.04810333251953125,
+ "learning_rate": 1.2486898871648552e-05,
+ "loss": 0.0038,
+ "step": 54
+ },
+ {
+ "clip_ratio": 0.005481840795255266,
+ "epoch": 1.96,
+ "grad_norm": 0.14987018246678602,
+ "kl": 0.05097198486328125,
+ "learning_rate": 1.2181432413965428e-05,
+ "loss": 0.0028,
+ "step": 55
+ },
+ {
+ "clip_ratio": 0.007852705341065302,
+ "epoch": 1.9955555555555555,
+ "grad_norm": 0.12818843141531608,
+ "kl": 0.0562286376953125,
+ "learning_rate": 1.187381314585725e-05,
+ "loss": 0.0013,
+ "step": 56
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 2008.5916328430176,
+ "epoch": 2.0355555555555553,
+ "grad_norm": 0.1643995846810127,
+ "kl": 0.0548858642578125,
+ "learning_rate": 1.156434465040231e-05,
+ "loss": 0.0118,
+ "reward": -7.901306234300137,
+ "reward_std": 2.6794423200190067,
+ "rewards/cot_length_penalty_reward": -8.209341906011105,
+ "rewards/math_latex_accuracy_reward": 0.3080357303842902,
+ "step": 57
+ },
+ {
+ "clip_ratio": 0.0029279392474563792,
+ "epoch": 2.071111111111111,
+ "grad_norm": 0.12045324524183162,
+ "kl": 0.0587005615234375,
+ "learning_rate": 1.1253332335643043e-05,
+ "loss": 0.0108,
+ "step": 58
+ },
+ {
+ "clip_ratio": 0.005807226421893574,
+ "epoch": 2.1066666666666665,
+ "grad_norm": 0.14841036250602688,
+ "kl": 0.0640106201171875,
+ "learning_rate": 1.0941083133185146e-05,
+ "loss": 0.0097,
+ "step": 59
+ },
+ {
+ "epoch": 2.1422222222222222,
+ "grad_norm": 0.11089468253649072,
+ "learning_rate": 1.0627905195293135e-05,
+ "loss": 0.0084,
+ "step": 60
+ },
+ {
+ "epoch": 2.1422222222222222,
+ "eval_clip_ratio": 0.0,
+ "eval_completion_length": 2107.960148737981,
+ "eval_kl": 0.05983323317307692,
+ "eval_loss": -0.00015631201677024364,
+ "eval_reward": -8.494354761563814,
+ "eval_reward_std": 3.4025442325151882,
+ "eval_rewards/cot_length_penalty_reward": -8.876223013951229,
+ "eval_rewards/math_latex_accuracy_reward": 0.3818681509448932,
+ "eval_runtime": 422.1329,
+ "eval_samples_per_second": 0.118,
+ "eval_steps_per_second": 0.005,
+ "step": 60
+ },
+ {
+ "clip_ratio": 0.003193242686393205,
+ "completion_length": 1800.6674766540527,
+ "epoch": 2.1777777777777776,
+ "grad_norm": 0.1697521453747013,
+ "kl": 0.065582275390625,
+ "learning_rate": 1.0314107590781284e-05,
+ "loss": 0.0174,
+ "reward": -8.092556223273277,
+ "reward_std": 3.146493151783943,
+ "rewards/cot_length_penalty_reward": -8.458627462387085,
+ "rewards/math_latex_accuracy_reward": 0.3660714477300644,
+ "step": 61
+ },
+ {
+ "clip_ratio": 0.003192656353348866,
+ "epoch": 2.2133333333333334,
+ "grad_norm": 0.12330763778596482,
+ "kl": 0.0718231201171875,
+ "learning_rate": 1e-05,
+ "loss": 0.0163,
+ "step": 62
+ },
+ {
+ "clip_ratio": 0.0062453514110529795,
+ "epoch": 2.2488888888888887,
+ "grad_norm": 0.16098784282181033,
+ "kl": 0.078704833984375,
+ "learning_rate": 9.685892409218718e-06,
+ "loss": 0.0151,
+ "step": 63
+ },
+ {
+ "clip_ratio": 0.006978008910664357,
+ "epoch": 2.2844444444444445,
+ "grad_norm": 0.1406450810476633,
+ "kl": 0.0782470703125,
+ "learning_rate": 9.372094804706867e-06,
+ "loss": 0.0137,
+ "step": 64
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 2595.997859954834,
+ "epoch": 2.32,
+ "grad_norm": 0.18250322704566987,
+ "kl": 0.0649871826171875,
+ "learning_rate": 9.058916866814857e-06,
+ "loss": 0.0147,
+ "reward": -9.348549716174603,
+ "reward_std": 3.3840084299445152,
+ "rewards/cot_length_penalty_reward": -9.70346000418067,
+ "rewards/math_latex_accuracy_reward": 0.35491072852164507,
+ "step": 65
+ },
+ {
+ "clip_ratio": 0.0031114893354242668,
+ "epoch": 2.3555555555555556,
+ "grad_norm": 0.13005553219968966,
+ "kl": 0.0695648193359375,
+ "learning_rate": 8.746667664356957e-06,
+ "loss": 0.014,
+ "step": 66
+ },
+ {
+ "clip_ratio": 0.0075038159266114235,
+ "epoch": 2.391111111111111,
+ "grad_norm": 0.19621512659848725,
+ "kl": 0.0780181884765625,
+ "learning_rate": 8.43565534959769e-06,
+ "loss": 0.0133,
+ "step": 67
+ },
+ {
+ "clip_ratio": 0.006932365708053112,
+ "epoch": 2.4266666666666667,
+ "grad_norm": 0.13215694629988284,
+ "kl": 0.07647705078125,
+ "learning_rate": 8.126186854142752e-06,
+ "loss": 0.0122,
+ "step": 68
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 2154.4286880493164,
+ "epoch": 2.462222222222222,
+ "grad_norm": 0.27197173025048144,
+ "kl": 0.086151123046875,
+ "learning_rate": 7.818567586034578e-06,
+ "loss": 0.0247,
+ "reward": -8.337913118302822,
+ "reward_std": 3.0677984952926636,
+ "rewards/cot_length_penalty_reward": -8.806663155555725,
+ "rewards/math_latex_accuracy_reward": 0.4687500186264515,
+ "step": 69
+ },
+ {
+ "clip_ratio": 0.005053140237578191,
+ "epoch": 2.497777777777778,
+ "grad_norm": 0.20964263197545416,
+ "kl": 0.0977630615234375,
+ "learning_rate": 7.513101128351454e-06,
+ "loss": 0.0237,
+ "step": 70
+ },
+ {
+ "clip_ratio": 0.005771905358415097,
+ "epoch": 2.533333333333333,
+ "grad_norm": 0.15787820635605407,
+ "kl": 0.0987091064453125,
+ "learning_rate": 7.210088939607709e-06,
+ "loss": 0.0226,
+ "step": 71
+ },
+ {
+ "clip_ratio": 0.0062158564978744835,
+ "epoch": 2.568888888888889,
+ "grad_norm": 0.4007267449310534,
+ "kl": 0.0895538330078125,
+ "learning_rate": 6.909830056250527e-06,
+ "loss": 0.022,
+ "step": 72
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 1935.1675262451172,
+ "epoch": 2.6044444444444443,
+ "grad_norm": 0.266781012315019,
+ "kl": 0.10394287109375,
+ "learning_rate": 6.612620797547087e-06,
+ "loss": 0.0125,
+ "reward": -7.354442303534597,
+ "reward_std": 2.94980551302433,
+ "rewards/cot_length_penalty_reward": -7.771853107959032,
+ "rewards/math_latex_accuracy_reward": 0.41741072852164507,
+ "step": 73
+ },
+ {
+ "clip_ratio": 0.01473489188356325,
+ "epoch": 2.64,
+ "grad_norm": 0.542093788713072,
+ "kl": 0.1417083740234375,
+ "learning_rate": 6.318754473153221e-06,
+ "loss": 0.0132,
+ "step": 74
+ },
+ {
+ "clip_ratio": 0.009351018321467564,
+ "epoch": 2.6755555555555555,
+ "grad_norm": 0.32832820257493534,
+ "kl": 0.1302490234375,
+ "learning_rate": 6.028521093652195e-06,
+ "loss": 0.0111,
+ "step": 75
+ },
+ {
+ "clip_ratio": 0.008401441504247487,
+ "epoch": 2.7111111111111112,
+ "grad_norm": 0.5313671762370776,
+ "kl": 0.106719970703125,
+ "learning_rate": 5.742207084349274e-06,
+ "loss": 0.0105,
+ "step": 76
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 1853.8215065002441,
+ "epoch": 2.7466666666666666,
+ "grad_norm": 0.25601743476635025,
+ "kl": 0.128814697265625,
+ "learning_rate": 5.460095002604533e-06,
+ "loss": -0.018,
+ "reward": -7.114141087979078,
+ "reward_std": 2.6430138647556305,
+ "rewards/cot_length_penalty_reward": -7.493605274707079,
+ "rewards/math_latex_accuracy_reward": 0.37946430314332247,
+ "step": 77
+ },
+ {
+ "clip_ratio": 0.004884305511950515,
+ "epoch": 2.7822222222222224,
+ "grad_norm": 0.18667971276259676,
+ "kl": 0.1357421875,
+ "learning_rate": 5.1824632589828465e-06,
+ "loss": -0.019,
+ "step": 78
+ },
+ {
+ "clip_ratio": 0.008678867772687227,
+ "epoch": 2.8177777777777777,
+ "grad_norm": 0.2515967343714355,
+ "kl": 0.1392822265625,
+ "learning_rate": 4.909585842496287e-06,
+ "loss": -0.0199,
+ "step": 79
+ },
+ {
+ "clip_ratio": 0.008155457631801255,
+ "epoch": 2.8533333333333335,
+ "grad_norm": 0.18942366870295294,
+ "kl": 0.131805419921875,
+ "learning_rate": 4.641732050210032e-06,
+ "loss": -0.0211,
+ "step": 80
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 2211.9219856262207,
+ "epoch": 2.888888888888889,
+ "grad_norm": 0.22043905749174672,
+ "kl": 0.1049957275390625,
+ "learning_rate": 4.379166221478697e-06,
+ "loss": -0.0247,
+ "reward": -9.63335988484323,
+ "reward_std": 2.9292308390140533,
+ "rewards/cot_length_penalty_reward": -10.111038556322455,
+ "rewards/math_latex_accuracy_reward": 0.4776785969734192,
+ "step": 81
+ },
+ {
+ "clip_ratio": 0.002452510700095445,
+ "epoch": 2.924444444444444,
+ "grad_norm": 0.2264032851218282,
+ "kl": 0.1047821044921875,
+ "learning_rate": 4.12214747707527e-06,
+ "loss": -0.0248,
+ "step": 82
+ },
+ {
+ "clip_ratio": 0.0035177832323824987,
+ "epoch": 2.96,
+ "grad_norm": 0.14675046732213165,
+ "kl": 0.1127166748046875,
+ "learning_rate": 3.8709294634702374e-06,
+ "loss": -0.0259,
+ "step": 83
+ },
+ {
+ "clip_ratio": 0.0036856129445368424,
+ "epoch": 2.9955555555555557,
+ "grad_norm": 0.1630420640767936,
+ "kl": 0.09552001953125,
+ "learning_rate": 3.625760102513103e-06,
+ "loss": -0.0267,
+ "step": 84
+ },
+ {
+ "clip_ratio": 0.0,
+ "completion_length": 1545.3081169128418,
+ "epoch": 3.0355555555555553,
+ "grad_norm": 14.400966954596354,
+ "kl": 0.286773681640625,
+ "learning_rate": 3.3868813467634833e-06,
+ "loss": -0.0265,
+ "reward": -7.1288284212350845,
+ "reward_std": 1.927463386207819,
+ "rewards/cot_length_penalty_reward": -7.606507122516632,
+ "rewards/math_latex_accuracy_reward": 0.47767859511077404,
+ "step": 85
+ },
+ {
+ "clip_ratio": 0.0029736382130067796,
+ "epoch": 3.071111111111111,
+ "grad_norm": 0.46033460120838376,
+ "kl": 0.130645751953125,
+ "learning_rate": 3.1545289407131128e-06,
+ "loss": -0.0322,
+ "step": 86
+ },
+ {
+ "clip_ratio": 0.0043519225146155804,
+ "epoch": 3.1066666666666665,
+ "grad_norm": 0.2864629379691617,
+ "kl": 0.13775634765625,
+ "learning_rate": 2.9289321881345257e-06,
927
+ "loss": -0.0338,
928
+ "step": 87
929
+ },
930
+ {
931
+ "clip_ratio": 0.009186911847791635,
932
+ "epoch": 3.1422222222222222,
933
+ "grad_norm": 0.24547967049213823,
934
+ "kl": 0.155426025390625,
935
+ "learning_rate": 2.7103137257858867e-06,
936
+ "loss": -0.0347,
937
+ "step": 88
938
+ },
939
+ {
940
+ "clip_ratio": 0.0,
941
+ "completion_length": 1969.4442825317383,
942
+ "epoch": 3.1777777777777776,
943
+ "grad_norm": 0.433485726828516,
944
+ "kl": 0.1619415283203125,
945
+ "learning_rate": 2.4988893036954045e-06,
946
+ "loss": -0.0013,
947
+ "reward": -8.979866623878479,
948
+ "reward_std": 2.524892296642065,
949
+ "rewards/cot_length_penalty_reward": -9.319152384996414,
950
+ "rewards/math_latex_accuracy_reward": 0.339285729220137,
951
+ "step": 89
952
+ },
953
+ {
954
+ "epoch": 3.2133333333333334,
955
+ "grad_norm": 0.2766265099056435,
956
+ "learning_rate": 2.2948675722421086e-06,
957
+ "loss": -0.003,
958
+ "step": 90
959
+ },
960
+ {
961
+ "epoch": 3.2133333333333334,
962
+ "eval_clip_ratio": 0.0,
963
+ "eval_completion_length": 2205.780292217548,
964
+ "eval_kl": 0.18556565504807693,
965
+ "eval_loss": 0.01654699072241783,
966
+ "eval_reward": -7.743022455332371,
967
+ "eval_reward_std": 2.8156597109941335,
968
+ "eval_rewards/cot_length_penalty_reward": -8.072692573070526,
969
+ "eval_rewards/math_latex_accuracy_reward": 0.32967034670022816,
970
+ "eval_runtime": 490.0675,
971
+ "eval_samples_per_second": 0.102,
972
+ "eval_steps_per_second": 0.004,
973
+ "step": 90
974
+ },
975
+ {
976
+ "clip_ratio": 0.006224101292900741,
977
+ "epoch": 3.2488888888888887,
978
+ "grad_norm": 0.4063813985520313,
979
+ "kl": 0.197174072265625,
980
+ "learning_rate": 2.098449876243096e-06,
981
+ "loss": -0.0037,
982
+ "step": 91
983
+ },
984
+ {
985
+ "clip_ratio": 0.009523139509838074,
986
+ "epoch": 3.2844444444444445,
987
+ "grad_norm": 0.26707887580226697,
988
+ "kl": 0.196014404296875,
989
+ "learning_rate": 1.9098300562505266e-06,
990
+ "loss": -0.0047,
991
+ "step": 92
992
+ },
993
+ {
994
+ "clip_ratio": 0.0,
995
+ "completion_length": 2093.5982818603516,
996
+ "epoch": 3.32,
997
+ "grad_norm": 0.3581002181636425,
998
+ "kl": 0.199798583984375,
999
+ "learning_rate": 1.7291942572543806e-06,
1000
+ "loss": 0.0178,
1001
+ "reward": -7.027399688959122,
1002
+ "reward_std": 2.7810670882463455,
1003
+ "rewards/cot_length_penalty_reward": -7.435881897807121,
1004
+ "rewards/math_latex_accuracy_reward": 0.40848216507583857,
1005
+ "step": 93
1006
+ },
1007
+ {
1008
+ "clip_ratio": 0.002175023495510686,
1009
+ "epoch": 3.3555555555555556,
1010
+ "grad_norm": 0.35335279380104717,
1011
+ "kl": 0.19476318359375,
1012
+ "learning_rate": 1.5567207449798517e-06,
1013
+ "loss": 0.017,
1014
+ "step": 94
1015
+ },
1016
+ {
1017
+ "clip_ratio": 0.0038238488195929676,
1018
+ "epoch": 3.391111111111111,
1019
+ "grad_norm": 0.26816647206060557,
1020
+ "kl": 0.21075439453125,
1021
+ "learning_rate": 1.3925797299605649e-06,
1022
+ "loss": 0.0159,
1023
+ "step": 95
1024
+ },
1025
+ {
1026
+ "clip_ratio": 0.006063876280677505,
1027
+ "epoch": 3.4266666666666667,
1028
+ "grad_norm": 0.3302248399941887,
1029
+ "kl": 0.22137451171875,
1030
+ "learning_rate": 1.2369331995613664e-06,
1031
+ "loss": 0.0151,
1032
+ "step": 96
1033
+ },
1034
+ {
1035
+ "clip_ratio": 0.0,
1036
+ "completion_length": 2630.805938720703,
1037
+ "epoch": 3.462222222222222,
1038
+ "grad_norm": 1.7166272647509115,
1039
+ "kl": 0.281219482421875,
1040
+ "learning_rate": 1.0899347581163222e-06,
1041
+ "loss": 0.0839,
1042
+ "reward": -7.589185383694712,
1043
+ "reward_std": 3.134066376835108,
1044
+ "rewards/cot_length_penalty_reward": -7.917310604825616,
1045
+ "rewards/math_latex_accuracy_reward": 0.32812501839362085,
1046
+ "step": 97
1047
+ },
1048
+ {
1049
+ "clip_ratio": 0.007515597055316903,
1050
+ "epoch": 3.497777777777778,
1051
+ "grad_norm": 4.303565090937016,
1052
+ "kl": 0.204925537109375,
1053
+ "learning_rate": 9.517294753398066e-07,
1054
+ "loss": 0.0869,
1055
+ "step": 98
1056
+ },
1057
+ {
1058
+ "clip_ratio": 0.0075942349576507695,
1059
+ "epoch": 3.533333333333333,
1060
+ "grad_norm": 2.9889664346961444,
1061
+ "kl": 0.2061920166015625,
1062
+ "learning_rate": 8.224537431601886e-07,
1063
+ "loss": 0.0841,
1064
+ "step": 99
1065
+ },
1066
+ {
1067
+ "clip_ratio": 0.0051619461009977385,
1068
+ "epoch": 3.568888888888889,
1069
+ "grad_norm": 0.4431396626395719,
1070
+ "kl": 0.22930908203125,
1071
+ "learning_rate": 7.022351411174866e-07,
1072
+ "loss": 0.0814,
1073
+ "step": 100
1074
+ },
1075
+ {
1076
+ "clip_ratio": 0.0,
1077
+ "completion_length": 2545.8371925354004,
1078
+ "epoch": 3.6044444444444443,
1079
+ "grad_norm": 0.4301809872674547,
1080
+ "kl": 0.171051025390625,
1081
+ "learning_rate": 5.911923104577455e-07,
1082
+ "loss": 0.0196,
1083
+ "reward": -10.28242233581841,
1084
+ "reward_std": 3.0546065159142017,
1085
+ "rewards/cot_length_penalty_reward": -10.666351079940796,
1086
+ "rewards/math_latex_accuracy_reward": 0.3839285857975483,
1087
+ "step": 101
1088
+ },
1089
+ {
1090
+ "clip_ratio": 0.0024334693371201865,
1091
+ "epoch": 3.64,
1092
+ "grad_norm": 0.40563497309844976,
1093
+ "kl": 0.19537353515625,
1094
+ "learning_rate": 4.894348370484648e-07,
1095
+ "loss": 0.0191,
1096
+ "step": 102
1097
+ },
1098
+ {
1099
+ "clip_ratio": 0.00391879488597624,
1100
+ "epoch": 3.6755555555555555,
1101
+ "grad_norm": 0.5720892424115457,
1102
+ "kl": 0.21160888671875,
1103
+ "learning_rate": 3.9706314323056936e-07,
1104
+ "loss": 0.0191,
1105
+ "step": 103
1106
+ },
1107
+ {
1108
+ "clip_ratio": 0.005467013252200559,
1109
+ "epoch": 3.7111111111111112,
1110
+ "grad_norm": 0.5349890131537904,
1111
+ "kl": 0.21337890625,
1112
+ "learning_rate": 3.1416838871368925e-07,
1113
+ "loss": 0.0189,
1114
+ "step": 104
1115
+ },
1116
+ {
1117
+ "clip_ratio": 0.0,
1118
+ "completion_length": 2160.7389335632324,
1119
+ "epoch": 3.7466666666666666,
1120
+ "grad_norm": 13.207091106199918,
1121
+ "kl": 0.7435302734375,
1122
+ "learning_rate": 2.4083238061252565e-07,
1123
+ "loss": 0.0564,
1124
+ "reward": -7.942288625985384,
1125
+ "reward_std": 2.594830472022295,
1126
+ "rewards/cot_length_penalty_reward": -8.37309193611145,
1127
+ "rewards/math_latex_accuracy_reward": 0.43080358672887087,
1128
+ "step": 105
1129
+ },
1130
+ {
1131
+ "clip_ratio": 0.0028747237083734944,
1132
+ "epoch": 3.7822222222222224,
1133
+ "grad_norm": 3.5352628792853067,
1134
+ "kl": 0.472412109375,
1135
+ "learning_rate": 1.7712749271311392e-07,
1136
+ "loss": 0.0465,
1137
+ "step": 106
1138
+ },
1139
+ {
1140
+ "clip_ratio": 0.004760361814987846,
1141
+ "epoch": 3.8177777777777777,
1142
+ "grad_norm": 0.94419108473764,
1143
+ "kl": 0.3760833740234375,
1144
+ "learning_rate": 1.231165940486234e-07,
1145
+ "loss": 0.0439,
1146
+ "step": 107
1147
+ },
1148
+ {
1149
+ "clip_ratio": 0.0061231208674144,
1150
+ "epoch": 3.8533333333333335,
1151
+ "grad_norm": 1.7315584230873846,
1152
+ "kl": 0.3495025634765625,
1153
+ "learning_rate": 7.885298685522235e-08,
1154
+ "loss": 0.044,
1155
+ "step": 108
1156
+ },
1157
+ {
1158
+ "clip_ratio": 0.0,
1159
+ "completion_length": 1982.8750839233398,
1160
+ "epoch": 3.888888888888889,
1161
+ "grad_norm": 0.7007233776562277,
1162
+ "kl": 0.2796783447265625,
1163
+ "learning_rate": 4.438035396920004e-08,
1164
+ "loss": 0.0207,
1165
+ "reward": -9.675179054960608,
1166
+ "reward_std": 2.574063938111067,
1167
+ "rewards/cot_length_penalty_reward": -9.989911276847124,
1168
+ "rewards/math_latex_accuracy_reward": 0.314732160884887,
1169
+ "step": 109
1170
+ },
1171
+ {
1172
+ "clip_ratio": 0.001981915433134418,
1173
+ "epoch": 3.924444444444444,
1174
+ "grad_norm": 0.6787732749959883,
1175
+ "kl": 0.2747039794921875,
1176
+ "learning_rate": 1.973271571728441e-08,
1177
+ "loss": 0.0208,
1178
+ "step": 110
1179
+ },
1180
+ {
1181
+ "clip_ratio": 0.0019334297030582093,
1182
+ "epoch": 3.96,
1183
+ "grad_norm": 0.6415798052337234,
1184
+ "kl": 0.27728271484375,
1185
+ "learning_rate": 4.9343963426840006e-09,
1186
+ "loss": 0.0206,
1187
+ "step": 111
1188
+ },
1189
+ {
1190
+ "clip_ratio": 0.0017770093054423342,
1191
+ "epoch": 3.9955555555555557,
1192
+ "grad_norm": 0.6316769640873855,
1193
+ "kl": 0.30633544921875,
1194
+ "learning_rate": 0.0,
1195
+ "loss": 0.0207,
1196
+ "step": 112
1197
+ },
1198
+ {
1199
+ "epoch": 3.9955555555555557,
1200
+ "step": 112,
1201
+ "total_flos": 0.0,
1202
+ "train_loss": 0.6775813137989773,
1203
+ "train_runtime": 20674.7257,
1204
+ "train_samples_per_second": 0.087,
1205
+ "train_steps_per_second": 0.005
1206
+ }
1207
+ ],
1208
+ "logging_steps": 1,
1209
+ "max_steps": 112,
1210
+ "num_input_tokens_seen": 0,
1211
+ "num_train_epochs": 4,
1212
+ "save_steps": 10,
1213
+ "stateful_callbacks": {
1214
+ "TrainerControl": {
1215
+ "args": {
1216
+ "should_epoch_stop": false,
1217
+ "should_evaluate": false,
1218
+ "should_log": false,
1219
+ "should_save": true,
1220
+ "should_training_stop": true
1221
+ },
1222
+ "attributes": {}
1223
+ }
1224
+ },
1225
+ "total_flos": 0.0,
1226
+ "train_batch_size": 4,
1227
+ "trial_name": null,
1228
+ "trial_params": null
1229
+ }