nghind committed
Commit 355d9d3 · verified · 1 Parent(s): f6d2815

Model save
README.md ADDED
@@ -0,0 +1,68 @@
+ ---
+ base_model: philschmid/llama-3-1-8b-math-orca-spectrum-10k-ep1
+ library_name: transformers
+ model_name: grpo-llama-3-1-8b-math-ep3
+ tags:
+ - generated_from_trainer
+ - trl
+ - grpo
+ licence: license
+ ---
+
+ # Model Card for grpo-llama-3-1-8b-math-ep3
+
+ This model is a fine-tuned version of [philschmid/llama-3-1-8b-math-orca-spectrum-10k-ep1](https://huggingface.co/philschmid/llama-3-1-8b-math-orca-spectrum-10k-ep1).
+ It has been trained using [TRL](https://github.com/huggingface/trl).
+
+ ## Quick start
+
+ ```python
+ from transformers import pipeline
+
+ question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
+ generator = pipeline("text-generation", model="nghind/grpo-llama-3-1-8b-math-ep3", device="cuda")
+ output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
+ print(output["generated_text"])
+ ```
+
+ ## Training procedure
+
+
+
+ This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
+
+ ### Framework versions
+
+ - TRL: 0.15.1
+ - Transformers: 4.49.0
+ - Pytorch: 2.5.1+cu121
+ - Datasets: 3.3.1
+ - Tokenizers: 0.21.0
+
+ ## Citations
+
+ Cite GRPO as:
+
+ ```bibtex
+ @article{zhihong2024deepseekmath,
+ title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
+ author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
+ year = 2024,
+ eprint = {arXiv:2402.03300},
+ }
+
+ ```
+
+ Cite TRL as:
+
+ ```bibtex
+ @misc{vonwerra2022trl,
+ title = {{TRL: Transformer Reinforcement Learning}},
+ author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
+ year = 2020,
+ journal = {GitHub repository},
+ publisher = {GitHub},
+ howpublished = {\url{https://github.com/huggingface/trl}}
+ }
+ ```
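The model card above says the model was trained with GRPO via TRL. As a rough, illustrative sketch only (not TRL's implementation), GRPO samples a group of completions per prompt, scores each with a reward function (here an accuracy reward, as in the `rewards/acc_reward_func` logs below), and normalizes each reward against the group's mean and standard deviation to get per-completion advantages:

```python
# Illustrative sketch of GRPO's group-relative advantage computation.
# The actual training used trl's GRPOTrainer; this only shows the core idea.
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(r)) / (std(r) + eps), computed within one group
    of completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, four sampled completions scored by a 0/1 accuracy reward:
advs = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions get positive advantages and incorrect ones negative, so the policy gradient pushes probability mass toward answers that beat their own group's average.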
all_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "total_flos": 0.0,
+ "train_loss": 0.00012346780326337724,
+ "train_runtime": 38700.7369,
+ "train_samples": 7943,
+ "train_samples_per_second": 0.616,
+ "train_steps_per_second": 0.019
+ }
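The throughput figures in all_results.json above are mutually consistent: 7943 samples over 3 epochs in about 10.75 hours. A quick sanity check (the epoch count of 3.0 and global step of 747 come from trainer_state.json in this same commit):

```python
# Sanity-check the reported throughput in all_results.json.
train_runtime = 38700.7369   # seconds, ~10.75 hours
train_samples = 7943
epochs = 3                   # trainer_state.json: "epoch": 3.0
global_step = 747            # trainer_state.json: "global_step": 747

samples_per_second = train_samples * epochs / train_runtime
steps_per_second = global_step / train_runtime
```

Both derived values round to the logged 0.616 samples/s and 0.019 steps/s.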
runs/Feb20_10-08-19_koa-dgxa-b11-u17/events.out.tfevents.1740046206.koa-dgxa-b11-u17.2389931.0 CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:aa6131c484535db0f43ecf21d049b8739a742c2e74bd42ca8db5d06cc7a428e4
- size 79719
+ oid sha256:98cb593a614ca4c271cac54520d34834d166f4f4a1dd2e73a610c9da1b81c0de
+ size 80356
train_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "total_flos": 0.0,
+ "train_loss": 0.00012346780326337724,
+ "train_runtime": 38700.7369,
+ "train_samples": 7943,
+ "train_samples_per_second": 0.616,
+ "train_steps_per_second": 0.019
+ }
trainer_state.json ADDED
@@ -0,0 +1,1835 @@
+ {
+ "best_metric": null,
+ "best_model_checkpoint": null,
+ "epoch": 3.0,
+ "eval_steps": 500,
+ "global_step": 747,
+ "is_hyper_param_search": false,
+ "is_local_process_zero": true,
+ "is_world_process_zero": true,
+ "log_history": [
+ {
+ "completion_length": 159.95703125,
+ "epoch": 0.020080321285140562,
+ "grad_norm": 0.2833329439163208,
+ "kl": 0.00034067649394273756,
+ "learning_rate": 5e-06,
+ "loss": 0.0,
+ "reward": 0.32109375,
+ "reward_std": 0.2864044725894928,
+ "rewards/acc_reward_func": 0.32109375,
+ "step": 5
+ },
+ {
+ "completion_length": 179.0203125,
+ "epoch": 0.040160642570281124,
+ "grad_norm": 0.47351399064064026,
+ "kl": 0.000714331166818738,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.30625,
+ "reward_std": 0.2533454954624176,
+ "rewards/acc_reward_func": 0.30625,
+ "step": 10
+ },
+ {
+ "completion_length": 163.7078125,
+ "epoch": 0.060240963855421686,
+ "grad_norm": 0.6223832368850708,
+ "kl": 0.0008577127242460847,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.290625,
+ "reward_std": 0.26476518511772157,
+ "rewards/acc_reward_func": 0.290625,
+ "step": 15
+ },
+ {
+ "completion_length": 177.18203125,
+ "epoch": 0.08032128514056225,
+ "grad_norm": 0.3723163902759552,
+ "kl": 0.000897675973828882,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.33125,
+ "reward_std": 0.27346172034740446,
+ "rewards/acc_reward_func": 0.33125,
+ "step": 20
+ },
+ {
+ "completion_length": 172.44921875,
+ "epoch": 0.10040160642570281,
+ "grad_norm": 0.4859201908111572,
+ "kl": 0.0007807362941093743,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.35625,
+ "reward_std": 0.28146005868911744,
+ "rewards/acc_reward_func": 0.35625,
+ "step": 25
+ },
+ {
+ "completion_length": 173.3265625,
+ "epoch": 0.12048192771084337,
+ "grad_norm": 0.4038971960544586,
+ "kl": 0.0007229511742480099,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.31640625,
+ "reward_std": 0.2735124319791794,
+ "rewards/acc_reward_func": 0.31640625,
+ "step": 30
+ },
+ {
+ "completion_length": 168.64296875,
+ "epoch": 0.14056224899598393,
+ "grad_norm": 0.3058616816997528,
+ "kl": 0.000607735151425004,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.3265625,
+ "reward_std": 0.24166617095470427,
+ "rewards/acc_reward_func": 0.3265625,
+ "step": 35
+ },
+ {
+ "completion_length": 162.60703125,
+ "epoch": 0.1606425702811245,
+ "grad_norm": 0.2520759701728821,
+ "kl": 0.0005803823471069336,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.309375,
+ "reward_std": 0.2683109283447266,
+ "rewards/acc_reward_func": 0.309375,
+ "step": 40
+ },
+ {
+ "completion_length": 167.0765625,
+ "epoch": 0.18072289156626506,
+ "grad_norm": 0.3606955409049988,
+ "kl": 0.00041882623336277904,
+ "learning_rate": 5e-06,
+ "loss": 0.0,
+ "reward": 0.3390625,
+ "reward_std": 0.270548814535141,
+ "rewards/acc_reward_func": 0.3390625,
+ "step": 45
+ },
+ {
+ "completion_length": 173.20859375,
+ "epoch": 0.20080321285140562,
+ "grad_norm": 0.43256309628486633,
+ "kl": 0.0004548234341200441,
+ "learning_rate": 5e-06,
+ "loss": 0.0,
+ "reward": 0.3125,
+ "reward_std": 0.24177098274230957,
+ "rewards/acc_reward_func": 0.3125,
+ "step": 50
+ },
+ {
+ "completion_length": 168.24296875,
+ "epoch": 0.22088353413654618,
+ "grad_norm": 0.5047885775566101,
+ "kl": 0.0004514524363912642,
+ "learning_rate": 5e-06,
+ "loss": 0.0,
+ "reward": 0.35,
+ "reward_std": 0.27911658585071564,
+ "rewards/acc_reward_func": 0.35,
+ "step": 55
+ },
+ {
+ "completion_length": 168.32109375,
+ "epoch": 0.24096385542168675,
+ "grad_norm": 0.2552633285522461,
+ "kl": 0.00045427558943629266,
+ "learning_rate": 5e-06,
+ "loss": 0.0,
+ "reward": 0.31015625,
+ "reward_std": 0.27751815915107725,
+ "rewards/acc_reward_func": 0.31015625,
+ "step": 60
+ },
+ {
+ "completion_length": 173.72421875,
+ "epoch": 0.26104417670682734,
+ "grad_norm": 0.2749291956424713,
+ "kl": 0.0004708627995569259,
+ "learning_rate": 5e-06,
+ "loss": 0.0,
+ "reward": 0.36171875,
+ "reward_std": 0.2908409178256989,
+ "rewards/acc_reward_func": 0.36171875,
+ "step": 65
+ },
+ {
+ "completion_length": 174.65390625,
+ "epoch": 0.28112449799196787,
+ "grad_norm": 0.26198047399520874,
+ "kl": 0.000500894442666322,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.29296875,
+ "reward_std": 0.2530863583087921,
+ "rewards/acc_reward_func": 0.29296875,
+ "step": 70
+ },
+ {
+ "completion_length": 168.96953125,
+ "epoch": 0.30120481927710846,
+ "grad_norm": 0.34051886200904846,
+ "kl": 0.0004913369542919099,
+ "learning_rate": 5e-06,
+ "loss": 0.0,
+ "reward": 0.34453125,
+ "reward_std": 0.2887667536735535,
+ "rewards/acc_reward_func": 0.34453125,
+ "step": 75
+ },
+ {
+ "completion_length": 166.57109375,
+ "epoch": 0.321285140562249,
+ "grad_norm": 0.2959311306476593,
+ "kl": 0.0004982782644219697,
+ "learning_rate": 5e-06,
+ "loss": 0.0,
+ "reward": 0.328125,
+ "reward_std": 0.25415654480457306,
+ "rewards/acc_reward_func": 0.328125,
+ "step": 80
+ },
+ {
+ "completion_length": 169.40703125,
+ "epoch": 0.3413654618473896,
+ "grad_norm": 0.26805025339126587,
+ "kl": 0.0005323876044712961,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.36796875,
+ "reward_std": 0.2889047384262085,
+ "rewards/acc_reward_func": 0.36796875,
+ "step": 85
+ },
+ {
+ "completion_length": 167.22890625,
+ "epoch": 0.3614457831325301,
+ "grad_norm": 0.3810369074344635,
+ "kl": 0.0004968916124198586,
+ "learning_rate": 5e-06,
+ "loss": 0.0,
+ "reward": 0.36015625,
+ "reward_std": 0.2771797090768814,
+ "rewards/acc_reward_func": 0.36015625,
+ "step": 90
+ },
+ {
+ "completion_length": 175.2796875,
+ "epoch": 0.3815261044176707,
+ "grad_norm": 0.28440332412719727,
+ "kl": 0.0004675893171224743,
+ "learning_rate": 5e-06,
+ "loss": 0.0,
+ "reward": 0.31484375,
+ "reward_std": 0.27064948081970214,
+ "rewards/acc_reward_func": 0.31484375,
+ "step": 95
+ },
+ {
+ "completion_length": 163.48203125,
+ "epoch": 0.40160642570281124,
+ "grad_norm": 0.3027651309967041,
+ "kl": 0.0004759302770253271,
+ "learning_rate": 5e-06,
+ "loss": 0.0,
+ "reward": 0.3125,
+ "reward_std": 0.2688873440027237,
+ "rewards/acc_reward_func": 0.3125,
+ "step": 100
+ },
+ {
+ "completion_length": 175.22734375,
+ "epoch": 0.42168674698795183,
+ "grad_norm": 0.5133289098739624,
+ "kl": 0.0004573037032969296,
+ "learning_rate": 5e-06,
+ "loss": 0.0,
+ "reward": 0.31015625,
+ "reward_std": 0.2805974006652832,
+ "rewards/acc_reward_func": 0.31015625,
+ "step": 105
+ },
+ {
+ "completion_length": 167.0984375,
+ "epoch": 0.44176706827309237,
+ "grad_norm": 0.3459770381450653,
+ "kl": 0.0005908279563300312,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.33359375,
+ "reward_std": 0.2592057645320892,
+ "rewards/acc_reward_func": 0.33359375,
+ "step": 110
+ },
+ {
+ "completion_length": 167.68671875,
+ "epoch": 0.46184738955823296,
+ "grad_norm": 0.4075464606285095,
+ "kl": 0.0006865185336209833,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.38203125,
+ "reward_std": 0.265577495098114,
+ "rewards/acc_reward_func": 0.38203125,
+ "step": 115
+ },
+ {
+ "completion_length": 163.8390625,
+ "epoch": 0.4819277108433735,
+ "grad_norm": 0.37416180968284607,
+ "kl": 0.000659433496184647,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.315625,
+ "reward_std": 0.2419831484556198,
+ "rewards/acc_reward_func": 0.315625,
+ "step": 120
+ },
+ {
+ "completion_length": 167.946875,
+ "epoch": 0.5020080321285141,
+ "grad_norm": 0.3058314025402069,
+ "kl": 0.0005173740908503532,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.32421875,
+ "reward_std": 0.2936626195907593,
+ "rewards/acc_reward_func": 0.32421875,
+ "step": 125
+ },
+ {
+ "completion_length": 164.896875,
+ "epoch": 0.5220883534136547,
+ "grad_norm": 0.3665063977241516,
+ "kl": 0.0005303317215293646,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.346875,
+ "reward_std": 0.28919236958026884,
+ "rewards/acc_reward_func": 0.346875,
+ "step": 130
+ },
+ {
+ "completion_length": 169.23515625,
+ "epoch": 0.5421686746987951,
+ "grad_norm": 0.38457340002059937,
+ "kl": 0.0005899963027331979,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.3453125,
+ "reward_std": 0.25805322229862215,
+ "rewards/acc_reward_func": 0.3453125,
+ "step": 135
+ },
+ {
+ "completion_length": 164.45390625,
+ "epoch": 0.5622489959839357,
+ "grad_norm": 0.2263726145029068,
+ "kl": 0.0005681793205440045,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.27890625,
+ "reward_std": 0.23708621561527252,
+ "rewards/acc_reward_func": 0.27890625,
+ "step": 140
+ },
+ {
+ "completion_length": 167.23046875,
+ "epoch": 0.5823293172690763,
+ "grad_norm": 0.28835389018058777,
+ "kl": 0.0005865491810254752,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.340625,
+ "reward_std": 0.2691764384508133,
+ "rewards/acc_reward_func": 0.340625,
+ "step": 145
+ },
+ {
+ "completion_length": 160.4046875,
+ "epoch": 0.6024096385542169,
+ "grad_norm": 0.2847937345504761,
+ "kl": 0.0006729792105033994,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.3046875,
+ "reward_std": 0.2551824957132339,
+ "rewards/acc_reward_func": 0.3046875,
+ "step": 150
+ },
+ {
+ "completion_length": 160.740625,
+ "epoch": 0.6224899598393574,
+ "grad_norm": 0.41062450408935547,
+ "kl": 0.0006228600163012743,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.3265625,
+ "reward_std": 0.24656105935573577,
+ "rewards/acc_reward_func": 0.3265625,
+ "step": 155
+ },
+ {
+ "completion_length": 168.340625,
+ "epoch": 0.642570281124498,
+ "grad_norm": 0.33770281076431274,
+ "kl": 0.0009516201331280172,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.3609375,
+ "reward_std": 0.2608398377895355,
+ "rewards/acc_reward_func": 0.3609375,
+ "step": 160
+ },
+ {
+ "completion_length": 161.2953125,
+ "epoch": 0.6626506024096386,
+ "grad_norm": 0.3424857556819916,
+ "kl": 0.0007472435943782329,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.3734375,
+ "reward_std": 0.292040029168129,
+ "rewards/acc_reward_func": 0.3734375,
+ "step": 165
+ },
+ {
+ "completion_length": 173.7203125,
+ "epoch": 0.6827309236947792,
+ "grad_norm": 0.24203689396381378,
+ "kl": 0.0007647084421478212,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.28203125,
+ "reward_std": 0.2562589019536972,
+ "rewards/acc_reward_func": 0.28203125,
+ "step": 170
+ },
+ {
+ "completion_length": 167.025,
+ "epoch": 0.7028112449799196,
+ "grad_norm": 0.34411001205444336,
+ "kl": 0.000770233990624547,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.32578125,
+ "reward_std": 0.25892365276813506,
+ "rewards/acc_reward_func": 0.32578125,
+ "step": 175
+ },
+ {
+ "completion_length": 172.24453125,
+ "epoch": 0.7228915662650602,
+ "grad_norm": 0.29481959342956543,
+ "kl": 0.0009015992400236428,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.340625,
+ "reward_std": 0.25950155556201937,
+ "rewards/acc_reward_func": 0.340625,
+ "step": 180
+ },
+ {
+ "completion_length": 166.625,
+ "epoch": 0.7429718875502008,
+ "grad_norm": 0.2277025729417801,
+ "kl": 0.00077429274097085,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.334375,
+ "reward_std": 0.2525084614753723,
+ "rewards/acc_reward_func": 0.334375,
+ "step": 185
+ },
+ {
+ "completion_length": 159.42734375,
+ "epoch": 0.7630522088353414,
+ "grad_norm": 0.32006603479385376,
+ "kl": 0.0008042196277529001,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.353125,
+ "reward_std": 0.28777270913124087,
+ "rewards/acc_reward_func": 0.353125,
+ "step": 190
+ },
+ {
+ "completion_length": 164.315625,
+ "epoch": 0.7831325301204819,
+ "grad_norm": 0.42280659079551697,
+ "kl": 0.0008120649959892035,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.34140625,
+ "reward_std": 0.2515306770801544,
+ "rewards/acc_reward_func": 0.34140625,
+ "step": 195
+ },
+ {
+ "completion_length": 163.465625,
+ "epoch": 0.8032128514056225,
+ "grad_norm": 0.3453792333602905,
+ "kl": 0.0007391226128675044,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.33984375,
+ "reward_std": 0.2593150854110718,
+ "rewards/acc_reward_func": 0.33984375,
+ "step": 200
+ },
+ {
+ "completion_length": 160.57109375,
+ "epoch": 0.8232931726907631,
+ "grad_norm": 0.22764600813388824,
+ "kl": 0.0009074539528228342,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.33359375,
+ "reward_std": 0.24634140729904175,
+ "rewards/acc_reward_func": 0.33359375,
+ "step": 205
+ },
+ {
+ "completion_length": 173.69453125,
+ "epoch": 0.8433734939759037,
+ "grad_norm": 0.3373042941093445,
+ "kl": 0.0008413837174884975,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.32734375,
+ "reward_std": 0.2701482236385345,
+ "rewards/acc_reward_func": 0.32734375,
+ "step": 210
+ },
+ {
+ "completion_length": 163.95546875,
+ "epoch": 0.8634538152610441,
+ "grad_norm": 0.43492230772972107,
+ "kl": 0.0008612593519501388,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.35546875,
+ "reward_std": 0.25468774139881134,
+ "rewards/acc_reward_func": 0.35546875,
+ "step": 215
+ },
+ {
+ "completion_length": 165.01953125,
+ "epoch": 0.8835341365461847,
+ "grad_norm": 0.47059279680252075,
+ "kl": 0.0007435820298269391,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.3390625,
+ "reward_std": 0.26499341428279877,
+ "rewards/acc_reward_func": 0.3390625,
+ "step": 220
+ },
+ {
+ "completion_length": 169.05078125,
+ "epoch": 0.9036144578313253,
+ "grad_norm": 0.22417674958705902,
+ "kl": 0.0007649007253348828,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.284375,
+ "reward_std": 0.2533974200487137,
+ "rewards/acc_reward_func": 0.284375,
+ "step": 225
+ },
+ {
+ "completion_length": 173.67109375,
+ "epoch": 0.9236947791164659,
+ "grad_norm": 0.2978118658065796,
+ "kl": 0.0007940458599478006,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.2984375,
+ "reward_std": 0.24650274217128754,
+ "rewards/acc_reward_func": 0.2984375,
+ "step": 230
+ },
+ {
+ "completion_length": 167.99375,
+ "epoch": 0.9437751004016064,
+ "grad_norm": 0.2792234420776367,
+ "kl": 0.0010552789666689933,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.35,
+ "reward_std": 0.26749635934829713,
+ "rewards/acc_reward_func": 0.35,
+ "step": 235
+ },
+ {
+ "completion_length": 171.94609375,
+ "epoch": 0.963855421686747,
+ "grad_norm": 0.2678660452365875,
+ "kl": 0.0006605981849133969,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.31875,
+ "reward_std": 0.2612347215414047,
+ "rewards/acc_reward_func": 0.31875,
+ "step": 240
+ },
+ {
+ "completion_length": 160.3421875,
+ "epoch": 0.9839357429718876,
+ "grad_norm": 0.6757539510726929,
+ "kl": 0.0006685945438221097,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.33984375,
+ "reward_std": 0.2489775002002716,
+ "rewards/acc_reward_func": 0.33984375,
+ "step": 245
+ },
+ {
+ "completion_length": 159.35982360839844,
+ "epoch": 1.0040160642570282,
+ "grad_norm": 0.4328139126300812,
+ "kl": 0.0009481518063694239,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.3328125,
+ "reward_std": 0.27164973616600036,
+ "rewards/acc_reward_func": 0.3328125,
+ "step": 250
+ },
+ {
+ "completion_length": 170.5609375,
+ "epoch": 1.0240963855421688,
+ "grad_norm": 0.40751489996910095,
+ "kl": 0.0014292935142293573,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.34296875,
+ "reward_std": 0.26451934576034547,
+ "rewards/acc_reward_func": 0.34296875,
+ "step": 255
+ },
+ {
+ "completion_length": 158.06484375,
+ "epoch": 1.0441767068273093,
+ "grad_norm": 0.32165759801864624,
+ "kl": 0.0023219846189022064,
+ "learning_rate": 5e-06,
+ "loss": 0.0002,
+ "reward": 0.31796875,
+ "reward_std": 0.26662840247154235,
+ "rewards/acc_reward_func": 0.31796875,
+ "step": 260
+ },
+ {
+ "completion_length": 162.784375,
+ "epoch": 1.0642570281124497,
+ "grad_norm": 0.4819350838661194,
+ "kl": 0.0021671449765563013,
+ "learning_rate": 5e-06,
+ "loss": 0.0002,
+ "reward": 0.3671875,
+ "reward_std": 0.2797468721866608,
+ "rewards/acc_reward_func": 0.3671875,
+ "step": 265
+ },
+ {
+ "completion_length": 174.1890625,
+ "epoch": 1.0843373493975903,
+ "grad_norm": 0.2753521502017975,
+ "kl": 0.0023205589037388562,
+ "learning_rate": 5e-06,
+ "loss": 0.0002,
+ "reward": 0.33984375,
+ "reward_std": 0.2735643357038498,
+ "rewards/acc_reward_func": 0.33984375,
+ "step": 270
+ },
+ {
+ "completion_length": 173.61640625,
+ "epoch": 1.104417670682731,
+ "grad_norm": 0.2579421103000641,
+ "kl": 0.0014010543003678323,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.2578125,
+ "reward_std": 0.2305402159690857,
+ "rewards/acc_reward_func": 0.2578125,
+ "step": 275
+ },
+ {
+ "completion_length": 168.61640625,
+ "epoch": 1.1244979919678715,
+ "grad_norm": 0.3602357506752014,
+ "kl": 0.0012595997890457512,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.359375,
+ "reward_std": 0.3013946235179901,
+ "rewards/acc_reward_func": 0.359375,
+ "step": 280
+ },
+ {
+ "completion_length": 171.4640625,
+ "epoch": 1.144578313253012,
+ "grad_norm": 0.2711893320083618,
+ "kl": 0.0009470222401432693,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.35390625,
+ "reward_std": 0.27838287949562074,
+ "rewards/acc_reward_func": 0.35390625,
+ "step": 285
+ },
+ {
+ "completion_length": 172.66015625,
+ "epoch": 1.1646586345381527,
+ "grad_norm": 0.2950332760810852,
+ "kl": 0.0009261242463253439,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.33359375,
+ "reward_std": 0.28246039152145386,
+ "rewards/acc_reward_func": 0.33359375,
+ "step": 290
+ },
+ {
+ "completion_length": 174.0375,
+ "epoch": 1.1847389558232932,
+ "grad_norm": 0.42756450176239014,
+ "kl": 0.0008738423697650432,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.3171875,
+ "reward_std": 0.268758499622345,
+ "rewards/acc_reward_func": 0.3171875,
+ "step": 295
+ },
+ {
+ "completion_length": 163.82265625,
+ "epoch": 1.2048192771084336,
+ "grad_norm": 0.4912799000740051,
+ "kl": 0.001153232657816261,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.33359375,
+ "reward_std": 0.2731469988822937,
+ "rewards/acc_reward_func": 0.33359375,
+ "step": 300
+ },
+ {
+ "completion_length": 158.7875,
+ "epoch": 1.2248995983935742,
+ "grad_norm": 0.30150240659713745,
+ "kl": 0.0008952352683991194,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.3546875,
+ "reward_std": 0.27324608862400057,
+ "rewards/acc_reward_func": 0.3546875,
+ "step": 305
+ },
+ {
+ "completion_length": 176.72421875,
+ "epoch": 1.2449799196787148,
+ "grad_norm": 0.35327720642089844,
+ "kl": 0.0011502272100187838,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.35625,
+ "reward_std": 0.27567465901374816,
+ "rewards/acc_reward_func": 0.35625,
+ "step": 310
+ },
+ {
+ "completion_length": 166.778125,
+ "epoch": 1.2650602409638554,
+ "grad_norm": 0.3498896360397339,
+ "kl": 0.0016141865635290742,
+ "learning_rate": 5e-06,
+ "loss": 0.0002,
+ "reward": 0.30703125,
+ "reward_std": 0.261915448307991,
+ "rewards/acc_reward_func": 0.30703125,
+ "step": 315
+ },
+ {
+ "completion_length": 178.128125,
+ "epoch": 1.285140562248996,
+ "grad_norm": 0.297959566116333,
+ "kl": 0.00151687542675063,
+ "learning_rate": 5e-06,
+ "loss": 0.0002,
+ "reward": 0.32421875,
+ "reward_std": 0.27393734753131865,
+ "rewards/acc_reward_func": 0.32421875,
+ "step": 320
+ },
+ {
+ "completion_length": 170.6953125,
+ "epoch": 1.3052208835341366,
+ "grad_norm": 0.365997850894928,
+ "kl": 0.0010963335167616605,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.315625,
+ "reward_std": 0.2619461864233017,
+ "rewards/acc_reward_func": 0.315625,
+ "step": 325
+ },
+ {
+ "completion_length": 166.228125,
+ "epoch": 1.3253012048192772,
+ "grad_norm": 0.36575648188591003,
+ "kl": 0.0010827683610841632,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.31015625,
+ "reward_std": 0.26013057231903075,
+ "rewards/acc_reward_func": 0.31015625,
+ "step": 330
+ },
+ {
+ "completion_length": 166.16484375,
+ "epoch": 1.3453815261044177,
+ "grad_norm": 0.29126739501953125,
+ "kl": 0.0009882883401587605,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.309375,
+ "reward_std": 0.25576087832450867,
+ "rewards/acc_reward_func": 0.309375,
+ "step": 335
+ },
+ {
+ "completion_length": 173.39453125,
+ "epoch": 1.3654618473895583,
+ "grad_norm": 0.33560770750045776,
+ "kl": 0.0011842920910567045,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.31953125,
+ "reward_std": 0.2704659789800644,
+ "rewards/acc_reward_func": 0.31953125,
+ "step": 340
+ },
+ {
+ "completion_length": 173.82578125,
+ "epoch": 1.3855421686746987,
+ "grad_norm": 0.2982430160045624,
+ "kl": 0.0015789813129231333,
+ "learning_rate": 5e-06,
+ "loss": 0.0002,
+ "reward": 0.34453125,
+ "reward_std": 0.3086081326007843,
+ "rewards/acc_reward_func": 0.34453125,
+ "step": 345
+ },
+ {
+ "completion_length": 162.96953125,
+ "epoch": 1.4056224899598393,
+ "grad_norm": 0.5575153827667236,
+ "kl": 0.001101066661067307,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.3765625,
+ "reward_std": 0.29579696655273435,
+ "rewards/acc_reward_func": 0.3765625,
+ "step": 350
+ },
+ {
+ "completion_length": 181.57578125,
+ "epoch": 1.4257028112449799,
+ "grad_norm": 0.5113179683685303,
+ "kl": 0.001085875742137432,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.29609375,
+ "reward_std": 0.26413264572620393,
+ "rewards/acc_reward_func": 0.29609375,
+ "step": 355
+ },
+ {
+ "completion_length": 174.1046875,
+ "epoch": 1.4457831325301205,
+ "grad_norm": 0.47564586997032166,
+ "kl": 0.0010274604661390184,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.32890625,
+ "reward_std": 0.2666553735733032,
+ "rewards/acc_reward_func": 0.32890625,
+ "step": 360
+ },
+ {
+ "completion_length": 164.609375,
+ "epoch": 1.465863453815261,
+ "grad_norm": 0.23254898190498352,
+ "kl": 0.0011269458453170955,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.3921875,
+ "reward_std": 0.26207175552845,
+ "rewards/acc_reward_func": 0.3921875,
+ "step": 365
+ },
+ {
+ "completion_length": 171.3859375,
+ "epoch": 1.4859437751004017,
+ "grad_norm": 0.2955392599105835,
+ "kl": 0.0013015888049267232,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.3390625,
+ "reward_std": 0.2618942677974701,
+ "rewards/acc_reward_func": 0.3390625,
+ "step": 370
+ },
+ {
+ "completion_length": 170.51328125,
+ "epoch": 1.5060240963855422,
+ "grad_norm": 0.2367907017469406,
+ "kl": 0.0012052926933392883,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.30859375,
+ "reward_std": 0.243873330950737,
+ "rewards/acc_reward_func": 0.30859375,
+ "step": 375
+ },
+ {
+ "completion_length": 171.328125,
+ "epoch": 1.5261044176706826,
+ "grad_norm": 0.4387330412864685,
+ "kl": 0.0011040043318644166,
+ "learning_rate": 5e-06,
+ "loss": 0.0001,
+ "reward": 0.34296875,
+ "reward_std": 0.27786171436309814,
+ "rewards/acc_reward_func": 0.34296875,
+ "step": 380
+ },
+ {
+ "completion_length": 167.015625,
+ "epoch": 1.5461847389558234,
+ "grad_norm": 0.2528681457042694,
+ "kl": 0.001643406949006021,
+ "learning_rate": 5e-06,
+ "loss": 0.0002,
930
+ "reward": 0.36484375,
931
+ "reward_std": 0.3019732892513275,
932
+ "rewards/acc_reward_func": 0.36484375,
933
+ "step": 385
934
+ },
935
+ {
936
+ "completion_length": 150.4078125,
937
+ "epoch": 1.5662650602409638,
938
+ "grad_norm": 0.35000428557395935,
939
+ "kl": 0.0013000907842069865,
940
+ "learning_rate": 5e-06,
941
+ "loss": 0.0001,
942
+ "reward": 0.378125,
943
+ "reward_std": 0.28335249423980713,
944
+ "rewards/acc_reward_func": 0.378125,
945
+ "step": 390
946
+ },
947
+ {
948
+ "completion_length": 158.709375,
949
+ "epoch": 1.5863453815261044,
950
+ "grad_norm": 0.259275883436203,
951
+ "kl": 0.0016851373482495546,
952
+ "learning_rate": 5e-06,
953
+ "loss": 0.0002,
954
+ "reward": 0.31953125,
955
+ "reward_std": 0.28603203892707824,
956
+ "rewards/acc_reward_func": 0.31953125,
957
+ "step": 395
958
+ },
959
+ {
960
+ "completion_length": 160.56796875,
961
+ "epoch": 1.606425702811245,
962
+ "grad_norm": 0.31947171688079834,
963
+ "kl": 0.001047816756181419,
964
+ "learning_rate": 5e-06,
965
+ "loss": 0.0001,
966
+ "reward": 0.3359375,
967
+ "reward_std": 0.27272760272026064,
968
+ "rewards/acc_reward_func": 0.3359375,
969
+ "step": 400
970
+ },
971
+ {
972
+ "completion_length": 161.90234375,
973
+ "epoch": 1.6265060240963856,
974
+ "grad_norm": 0.34025779366493225,
975
+ "kl": 0.0011820008745416998,
976
+ "learning_rate": 5e-06,
977
+ "loss": 0.0001,
978
+ "reward": 0.33671875,
979
+ "reward_std": 0.2665954947471619,
980
+ "rewards/acc_reward_func": 0.33671875,
981
+ "step": 405
982
+ },
983
+ {
984
+ "completion_length": 167.6984375,
985
+ "epoch": 1.6465863453815262,
986
+ "grad_norm": 0.2868824005126953,
987
+ "kl": 0.0010516393929719924,
988
+ "learning_rate": 5e-06,
989
+ "loss": 0.0001,
990
+ "reward": 0.3578125,
991
+ "reward_std": 0.2612337410449982,
992
+ "rewards/acc_reward_func": 0.3578125,
993
+ "step": 410
994
+ },
995
+ {
996
+ "completion_length": 166.8375,
997
+ "epoch": 1.6666666666666665,
998
+ "grad_norm": 0.30937594175338745,
999
+ "kl": 0.0011767351999878884,
1000
+ "learning_rate": 5e-06,
1001
+ "loss": 0.0001,
1002
+ "reward": 0.30859375,
1003
+ "reward_std": 0.2570806533098221,
1004
+ "rewards/acc_reward_func": 0.30859375,
1005
+ "step": 415
1006
+ },
1007
+ {
1008
+ "completion_length": 170.19765625,
1009
+ "epoch": 1.6867469879518073,
1010
+ "grad_norm": 0.25181344151496887,
1011
+ "kl": 0.0010755043127574026,
1012
+ "learning_rate": 5e-06,
1013
+ "loss": 0.0001,
1014
+ "reward": 0.33125,
1015
+ "reward_std": 0.22994134724140167,
1016
+ "rewards/acc_reward_func": 0.33125,
1017
+ "step": 420
1018
+ },
1019
+ {
1020
+ "completion_length": 165.94609375,
1021
+ "epoch": 1.7068273092369477,
1022
+ "grad_norm": 0.396030992269516,
1023
+ "kl": 0.0012019906658679246,
1024
+ "learning_rate": 5e-06,
1025
+ "loss": 0.0001,
1026
+ "reward": 0.315625,
1027
+ "reward_std": 0.27038955092430117,
1028
+ "rewards/acc_reward_func": 0.315625,
1029
+ "step": 425
1030
+ },
1031
+ {
1032
+ "completion_length": 167.24140625,
1033
+ "epoch": 1.7269076305220885,
1034
+ "grad_norm": 0.3269684314727783,
1035
+ "kl": 0.001297155674546957,
1036
+ "learning_rate": 5e-06,
1037
+ "loss": 0.0001,
1038
+ "reward": 0.32890625,
1039
+ "reward_std": 0.26612446308135984,
1040
+ "rewards/acc_reward_func": 0.32890625,
1041
+ "step": 430
1042
+ },
1043
+ {
1044
+ "completion_length": 155.46015625,
1045
+ "epoch": 1.7469879518072289,
1046
+ "grad_norm": 0.46771514415740967,
1047
+ "kl": 0.0014664881862699985,
1048
+ "learning_rate": 5e-06,
1049
+ "loss": 0.0001,
1050
+ "reward": 0.3234375,
1051
+ "reward_std": 0.24766394197940828,
1052
+ "rewards/acc_reward_func": 0.3234375,
1053
+ "step": 435
1054
+ },
1055
+ {
1056
+ "completion_length": 164.859375,
1057
+ "epoch": 1.7670682730923695,
1058
+ "grad_norm": 0.24466943740844727,
1059
+ "kl": 0.0012787181185558438,
1060
+ "learning_rate": 5e-06,
1061
+ "loss": 0.0001,
1062
+ "reward": 0.334375,
1063
+ "reward_std": 0.2795639634132385,
1064
+ "rewards/acc_reward_func": 0.334375,
1065
+ "step": 440
1066
+ },
1067
+ {
1068
+ "completion_length": 180.32578125,
1069
+ "epoch": 1.78714859437751,
1070
+ "grad_norm": 0.28328680992126465,
1071
+ "kl": 0.001317713246680796,
1072
+ "learning_rate": 5e-06,
1073
+ "loss": 0.0001,
1074
+ "reward": 0.29296875,
1075
+ "reward_std": 0.22741309106349944,
1076
+ "rewards/acc_reward_func": 0.29296875,
1077
+ "step": 445
1078
+ },
1079
+ {
1080
+ "completion_length": 163.96875,
1081
+ "epoch": 1.8072289156626506,
1082
+ "grad_norm": 0.27741825580596924,
1083
+ "kl": 0.0015525751281529666,
1084
+ "learning_rate": 5e-06,
1085
+ "loss": 0.0002,
1086
+ "reward": 0.32421875,
1087
+ "reward_std": 0.28004167675971986,
1088
+ "rewards/acc_reward_func": 0.32421875,
1089
+ "step": 450
1090
+ },
1091
+ {
1092
+ "completion_length": 165.8140625,
1093
+ "epoch": 1.8273092369477912,
1094
+ "grad_norm": 0.2740982472896576,
1095
+ "kl": 0.0013578152284026146,
1096
+ "learning_rate": 5e-06,
1097
+ "loss": 0.0001,
1098
+ "reward": 0.33203125,
1099
+ "reward_std": 0.2814855635166168,
1100
+ "rewards/acc_reward_func": 0.33203125,
1101
+ "step": 455
1102
+ },
1103
+ {
1104
+ "completion_length": 164.31171875,
1105
+ "epoch": 1.8473895582329316,
1106
+ "grad_norm": 0.4178365468978882,
1107
+ "kl": 0.0013977297581732272,
1108
+ "learning_rate": 5e-06,
1109
+ "loss": 0.0001,
1110
+ "reward": 0.36796875,
1111
+ "reward_std": 0.2695493370294571,
1112
+ "rewards/acc_reward_func": 0.36796875,
1113
+ "step": 460
1114
+ },
1115
+ {
1116
+ "completion_length": 176.71953125,
1117
+ "epoch": 1.8674698795180724,
1118
+ "grad_norm": 0.2551046311855316,
1119
+ "kl": 0.0013892159098759294,
1120
+ "learning_rate": 5e-06,
1121
+ "loss": 0.0001,
1122
+ "reward": 0.26328125,
1123
+ "reward_std": 0.24187707006931305,
1124
+ "rewards/acc_reward_func": 0.26328125,
1125
+ "step": 465
1126
+ },
1127
+ {
1128
+ "completion_length": 162.015625,
1129
+ "epoch": 1.8875502008032128,
1130
+ "grad_norm": 0.23321259021759033,
1131
+ "kl": 0.0014218664728105068,
1132
+ "learning_rate": 5e-06,
1133
+ "loss": 0.0001,
1134
+ "reward": 0.31328125,
1135
+ "reward_std": 0.24751116037368776,
1136
+ "rewards/acc_reward_func": 0.31328125,
1137
+ "step": 470
1138
+ },
1139
+ {
1140
+ "completion_length": 181.08125,
1141
+ "epoch": 1.9076305220883534,
1142
+ "grad_norm": 0.26958364248275757,
1143
+ "kl": 0.0012838932918384672,
1144
+ "learning_rate": 5e-06,
1145
+ "loss": 0.0001,
1146
+ "reward": 0.31328125,
1147
+ "reward_std": 0.28040671050548555,
1148
+ "rewards/acc_reward_func": 0.31328125,
1149
+ "step": 475
1150
+ },
1151
+ {
1152
+ "completion_length": 162.40390625,
1153
+ "epoch": 1.927710843373494,
1154
+ "grad_norm": 0.27885714173316956,
1155
+ "kl": 0.0011875152122229338,
1156
+ "learning_rate": 5e-06,
1157
+ "loss": 0.0001,
1158
+ "reward": 0.36484375,
1159
+ "reward_std": 0.26504404842853546,
1160
+ "rewards/acc_reward_func": 0.36484375,
1161
+ "step": 480
1162
+ },
1163
+ {
1164
+ "completion_length": 158.46640625,
1165
+ "epoch": 1.9477911646586346,
1166
+ "grad_norm": 0.27203765511512756,
1167
+ "kl": 0.0011519475607201456,
1168
+ "learning_rate": 5e-06,
1169
+ "loss": 0.0001,
1170
+ "reward": 0.35390625,
1171
+ "reward_std": 0.25064152777194976,
1172
+ "rewards/acc_reward_func": 0.35390625,
1173
+ "step": 485
1174
+ },
1175
+ {
1176
+ "completion_length": 163.49921875,
1177
+ "epoch": 1.9678714859437751,
1178
+ "grad_norm": 0.38953107595443726,
1179
+ "kl": 0.002528494060970843,
1180
+ "learning_rate": 5e-06,
1181
+ "loss": 0.0003,
1182
+ "reward": 0.33125,
1183
+ "reward_std": 0.265842866897583,
1184
+ "rewards/acc_reward_func": 0.33125,
1185
+ "step": 490
1186
+ },
1187
+ {
1188
+ "completion_length": 160.84140625,
1189
+ "epoch": 1.9879518072289155,
1190
+ "grad_norm": 0.268568754196167,
1191
+ "kl": 0.0014860291033983231,
1192
+ "learning_rate": 5e-06,
1193
+ "loss": 0.0001,
1194
+ "reward": 0.31953125,
1195
+ "reward_std": 0.25731430053710935,
1196
+ "rewards/acc_reward_func": 0.31953125,
1197
+ "step": 495
1198
+ },
1199
+ {
1200
+ "completion_length": 179.959375,
1201
+ "epoch": 2.0080321285140563,
1202
+ "grad_norm": 0.3279021680355072,
1203
+ "kl": 0.0021104462211951613,
1204
+ "learning_rate": 5e-06,
1205
+ "loss": 0.0002,
1206
+ "reward": 0.34140625,
1207
+ "reward_std": 0.2936858534812927,
1208
+ "rewards/acc_reward_func": 0.34140625,
1209
+ "step": 500
1210
+ },
1211
+ {
1212
+ "completion_length": 172.03984375,
1213
+ "epoch": 2.0281124497991967,
1214
+ "grad_norm": 0.30037155747413635,
1215
+ "kl": 0.001345854508690536,
1216
+ "learning_rate": 5e-06,
1217
+ "loss": 0.0001,
1218
+ "reward": 0.2671875,
1219
+ "reward_std": 0.23858949542045593,
1220
+ "rewards/acc_reward_func": 0.2671875,
1221
+ "step": 505
1222
+ },
1223
+ {
1224
+ "completion_length": 166.13359375,
1225
+ "epoch": 2.0481927710843375,
1226
+ "grad_norm": 0.24528227746486664,
1227
+ "kl": 0.0013967655366286635,
1228
+ "learning_rate": 5e-06,
1229
+ "loss": 0.0001,
1230
+ "reward": 0.28203125,
1231
+ "reward_std": 0.2324840843677521,
1232
+ "rewards/acc_reward_func": 0.28203125,
1233
+ "step": 510
1234
+ },
1235
+ {
1236
+ "completion_length": 167.35703125,
1237
+ "epoch": 2.068273092369478,
1238
+ "grad_norm": 0.4017987847328186,
1239
+ "kl": 0.0014677543425932527,
1240
+ "learning_rate": 5e-06,
1241
+ "loss": 0.0001,
1242
+ "reward": 0.36484375,
1243
+ "reward_std": 0.2759640544652939,
1244
+ "rewards/acc_reward_func": 0.36484375,
1245
+ "step": 515
1246
+ },
1247
+ {
1248
+ "completion_length": 171.8515625,
1249
+ "epoch": 2.0883534136546187,
1250
+ "grad_norm": 0.3457529842853546,
1251
+ "kl": 0.0014000870054587723,
1252
+ "learning_rate": 5e-06,
1253
+ "loss": 0.0001,
1254
+ "reward": 0.31875,
1255
+ "reward_std": 0.2790384829044342,
1256
+ "rewards/acc_reward_func": 0.31875,
1257
+ "step": 520
1258
+ },
1259
+ {
1260
+ "completion_length": 164.70078125,
1261
+ "epoch": 2.108433734939759,
1262
+ "grad_norm": 0.21619907021522522,
1263
+ "kl": 0.0014295668806880713,
1264
+ "learning_rate": 5e-06,
1265
+ "loss": 0.0001,
1266
+ "reward": 0.3140625,
1267
+ "reward_std": 0.2750910699367523,
1268
+ "rewards/acc_reward_func": 0.3140625,
1269
+ "step": 525
1270
+ },
1271
+ {
1272
+ "completion_length": 169.97265625,
1273
+ "epoch": 2.1285140562248994,
1274
+ "grad_norm": 0.31079721450805664,
1275
+ "kl": 0.0012559856520965695,
1276
+ "learning_rate": 5e-06,
1277
+ "loss": 0.0001,
1278
+ "reward": 0.33125,
1279
+ "reward_std": 0.24622083306312562,
1280
+ "rewards/acc_reward_func": 0.33125,
1281
+ "step": 530
1282
+ },
1283
+ {
1284
+ "completion_length": 170.13984375,
1285
+ "epoch": 2.1485943775100402,
1286
+ "grad_norm": 0.27532029151916504,
1287
+ "kl": 0.0015559423482045531,
1288
+ "learning_rate": 5e-06,
1289
+ "loss": 0.0002,
1290
+ "reward": 0.38359375,
1291
+ "reward_std": 0.28083202838897703,
1292
+ "rewards/acc_reward_func": 0.38359375,
1293
+ "step": 535
1294
+ },
1295
+ {
1296
+ "completion_length": 177.421875,
1297
+ "epoch": 2.1686746987951806,
1298
+ "grad_norm": 0.31572937965393066,
1299
+ "kl": 0.0013410489307716488,
1300
+ "learning_rate": 5e-06,
1301
+ "loss": 0.0001,
1302
+ "reward": 0.265625,
1303
+ "reward_std": 0.2276999294757843,
1304
+ "rewards/acc_reward_func": 0.265625,
1305
+ "step": 540
1306
+ },
1307
+ {
1308
+ "completion_length": 167.60234375,
1309
+ "epoch": 2.1887550200803214,
1310
+ "grad_norm": 0.29043304920196533,
1311
+ "kl": 0.0014236285351216793,
1312
+ "learning_rate": 5e-06,
1313
+ "loss": 0.0001,
1314
+ "reward": 0.3203125,
1315
+ "reward_std": 0.23548413515090943,
1316
+ "rewards/acc_reward_func": 0.3203125,
1317
+ "step": 545
1318
+ },
1319
+ {
1320
+ "completion_length": 166.23671875,
1321
+ "epoch": 2.208835341365462,
1322
+ "grad_norm": 0.6554353833198547,
1323
+ "kl": 0.0018002047901973129,
1324
+ "learning_rate": 5e-06,
1325
+ "loss": 0.0002,
1326
+ "reward": 0.3734375,
1327
+ "reward_std": 0.2930103540420532,
1328
+ "rewards/acc_reward_func": 0.3734375,
1329
+ "step": 550
1330
+ },
1331
+ {
1332
+ "completion_length": 159.91953125,
1333
+ "epoch": 2.2289156626506026,
1334
+ "grad_norm": 0.5253407955169678,
1335
+ "kl": 0.0016718338709324598,
1336
+ "learning_rate": 5e-06,
1337
+ "loss": 0.0002,
1338
+ "reward": 0.3703125,
1339
+ "reward_std": 0.2658358722925186,
1340
+ "rewards/acc_reward_func": 0.3703125,
1341
+ "step": 555
1342
+ },
1343
+ {
1344
+ "completion_length": 172.54296875,
1345
+ "epoch": 2.248995983935743,
1346
+ "grad_norm": 0.28389930725097656,
1347
+ "kl": 0.0020956686232239006,
1348
+ "learning_rate": 5e-06,
1349
+ "loss": 0.0002,
1350
+ "reward": 0.3203125,
1351
+ "reward_std": 0.2805899143218994,
1352
+ "rewards/acc_reward_func": 0.3203125,
1353
+ "step": 560
1354
+ },
1355
+ {
1356
+ "completion_length": 166.2515625,
1357
+ "epoch": 2.2690763052208833,
1358
+ "grad_norm": 0.27247288823127747,
1359
+ "kl": 0.0015082385856658221,
1360
+ "learning_rate": 5e-06,
1361
+ "loss": 0.0002,
1362
+ "reward": 0.3296875,
1363
+ "reward_std": 0.2336159199476242,
1364
+ "rewards/acc_reward_func": 0.3296875,
1365
+ "step": 565
1366
+ },
1367
+ {
1368
+ "completion_length": 157.78359375,
1369
+ "epoch": 2.289156626506024,
1370
+ "grad_norm": 0.31251490116119385,
1371
+ "kl": 0.001438130042515695,
1372
+ "learning_rate": 5e-06,
1373
+ "loss": 0.0001,
1374
+ "reward": 0.3796875,
1375
+ "reward_std": 0.277830982208252,
1376
+ "rewards/acc_reward_func": 0.3796875,
1377
+ "step": 570
1378
+ },
1379
+ {
1380
+ "completion_length": 172.43984375,
1381
+ "epoch": 2.3092369477911645,
1382
+ "grad_norm": 0.37210726737976074,
1383
+ "kl": 0.001831929781474173,
1384
+ "learning_rate": 5e-06,
1385
+ "loss": 0.0002,
1386
+ "reward": 0.296875,
1387
+ "reward_std": 0.2679367482662201,
1388
+ "rewards/acc_reward_func": 0.296875,
1389
+ "step": 575
1390
+ },
1391
+ {
1392
+ "completion_length": 162.16015625,
1393
+ "epoch": 2.3293172690763053,
1394
+ "grad_norm": 0.5997582077980042,
1395
+ "kl": 0.0017479128437116742,
1396
+ "learning_rate": 5e-06,
1397
+ "loss": 0.0002,
1398
+ "reward": 0.3875,
1399
+ "reward_std": 0.28567528128623965,
1400
+ "rewards/acc_reward_func": 0.3875,
1401
+ "step": 580
1402
+ },
1403
+ {
1404
+ "completion_length": 167.84765625,
1405
+ "epoch": 2.3493975903614457,
1406
+ "grad_norm": 0.23533369600772858,
1407
+ "kl": 0.001749329548329115,
1408
+ "learning_rate": 5e-06,
1409
+ "loss": 0.0002,
1410
+ "reward": 0.31875,
1411
+ "reward_std": 0.2516347885131836,
1412
+ "rewards/acc_reward_func": 0.31875,
1413
+ "step": 585
1414
+ },
1415
+ {
1416
+ "completion_length": 165.9859375,
1417
+ "epoch": 2.3694779116465865,
1418
+ "grad_norm": 0.3740471303462982,
1419
+ "kl": 0.001700690435245633,
1420
+ "learning_rate": 5e-06,
1421
+ "loss": 0.0002,
1422
+ "reward": 0.31484375,
1423
+ "reward_std": 0.24785314798355101,
1424
+ "rewards/acc_reward_func": 0.31484375,
1425
+ "step": 590
1426
+ },
1427
+ {
1428
+ "completion_length": 160.9109375,
1429
+ "epoch": 2.389558232931727,
1430
+ "grad_norm": 0.8214355111122131,
1431
+ "kl": 0.0020116518251597883,
1432
+ "learning_rate": 5e-06,
1433
+ "loss": 0.0002,
1434
+ "reward": 0.37265625,
1435
+ "reward_std": 0.2642554700374603,
1436
+ "rewards/acc_reward_func": 0.37265625,
1437
+ "step": 595
1438
+ },
1439
+ {
1440
+ "completion_length": 177.66796875,
1441
+ "epoch": 2.4096385542168672,
1442
+ "grad_norm": 0.48972687125205994,
1443
+ "kl": 0.0014791298424825072,
1444
+ "learning_rate": 5e-06,
1445
+ "loss": 0.0001,
1446
+ "reward": 0.2984375,
1447
+ "reward_std": 0.2614441394805908,
1448
+ "rewards/acc_reward_func": 0.2984375,
1449
+ "step": 600
1450
+ },
1451
+ {
1452
+ "completion_length": 167.84765625,
1453
+ "epoch": 2.429718875502008,
1454
+ "grad_norm": 0.29758477210998535,
1455
+ "kl": 0.0022905914578586818,
1456
+ "learning_rate": 5e-06,
1457
+ "loss": 0.0002,
1458
+ "reward": 0.321875,
1459
+ "reward_std": 0.2514772891998291,
1460
+ "rewards/acc_reward_func": 0.321875,
1461
+ "step": 605
1462
+ },
1463
+ {
1464
+ "completion_length": 160.82265625,
1465
+ "epoch": 2.4497991967871484,
1466
+ "grad_norm": 0.3575228750705719,
1467
+ "kl": 0.0013975306414067746,
1468
+ "learning_rate": 5e-06,
1469
+ "loss": 0.0001,
1470
+ "reward": 0.34765625,
1471
+ "reward_std": 0.28261533975601194,
1472
+ "rewards/acc_reward_func": 0.34765625,
1473
+ "step": 610
1474
+ },
1475
+ {
1476
+ "completion_length": 165.34296875,
1477
+ "epoch": 2.4698795180722892,
1478
+ "grad_norm": 0.21417830884456635,
1479
+ "kl": 0.0015371570363640786,
1480
+ "learning_rate": 5e-06,
1481
+ "loss": 0.0002,
1482
+ "reward": 0.33671875,
1483
+ "reward_std": 0.25463504195213316,
1484
+ "rewards/acc_reward_func": 0.33671875,
1485
+ "step": 615
1486
+ },
1487
+ {
1488
+ "completion_length": 173.79921875,
1489
+ "epoch": 2.4899598393574296,
1490
+ "grad_norm": 0.3487497866153717,
1491
+ "kl": 0.0016002030344679952,
1492
+ "learning_rate": 5e-06,
1493
+ "loss": 0.0002,
1494
+ "reward": 0.35078125,
1495
+ "reward_std": 0.2584144353866577,
1496
+ "rewards/acc_reward_func": 0.35078125,
1497
+ "step": 620
1498
+ },
1499
+ {
1500
+ "completion_length": 170.08125,
1501
+ "epoch": 2.5100401606425704,
1502
+ "grad_norm": 0.34159645438194275,
1503
+ "kl": 0.0013422498479485512,
1504
+ "learning_rate": 5e-06,
1505
+ "loss": 0.0001,
1506
+ "reward": 0.38203125,
1507
+ "reward_std": 0.2949507474899292,
1508
+ "rewards/acc_reward_func": 0.38203125,
1509
+ "step": 625
1510
+ },
1511
+ {
1512
+ "completion_length": 163.34765625,
1513
+ "epoch": 2.5301204819277108,
1514
+ "grad_norm": 0.44099095463752747,
1515
+ "kl": 0.0015945957973599433,
1516
+ "learning_rate": 5e-06,
1517
+ "loss": 0.0002,
1518
+ "reward": 0.34921875,
1519
+ "reward_std": 0.2757268697023392,
1520
+ "rewards/acc_reward_func": 0.34921875,
1521
+ "step": 630
1522
+ },
1523
+ {
1524
+ "completion_length": 170.05703125,
1525
+ "epoch": 2.550200803212851,
1526
+ "grad_norm": 0.4719444215297699,
1527
+ "kl": 0.0016027268255129456,
1528
+ "learning_rate": 5e-06,
1529
+ "loss": 0.0002,
1530
+ "reward": 0.32109375,
1531
+ "reward_std": 0.2545212864875793,
1532
+ "rewards/acc_reward_func": 0.32109375,
1533
+ "step": 635
1534
+ },
1535
+ {
1536
+ "completion_length": 167.87421875,
1537
+ "epoch": 2.570281124497992,
1538
+ "grad_norm": 0.34449702501296997,
1539
+ "kl": 0.001873347139917314,
1540
+ "learning_rate": 5e-06,
1541
+ "loss": 0.0002,
1542
+ "reward": 0.30390625,
1543
+ "reward_std": 0.2682854264974594,
1544
+ "rewards/acc_reward_func": 0.30390625,
1545
+ "step": 640
1546
+ },
1547
+ {
1548
+ "completion_length": 158.384375,
1549
+ "epoch": 2.5903614457831328,
1550
+ "grad_norm": 0.35067522525787354,
1551
+ "kl": 0.0016631773207336665,
1552
+ "learning_rate": 5e-06,
1553
+ "loss": 0.0002,
1554
+ "reward": 0.4421875,
1555
+ "reward_std": 0.3065200746059418,
1556
+ "rewards/acc_reward_func": 0.4421875,
1557
+ "step": 645
1558
+ },
1559
+ {
1560
+ "completion_length": 171.3046875,
1561
+ "epoch": 2.610441767068273,
1562
+ "grad_norm": 0.406903475522995,
1563
+ "kl": 0.001370012597180903,
1564
+ "learning_rate": 5e-06,
1565
+ "loss": 0.0001,
1566
+ "reward": 0.32265625,
1567
+ "reward_std": 0.2770428955554962,
1568
+ "rewards/acc_reward_func": 0.32265625,
1569
+ "step": 650
1570
+ },
1571
+ {
1572
+ "completion_length": 170.6921875,
1573
+ "epoch": 2.6305220883534135,
1574
+ "grad_norm": 0.3024086654186249,
1575
+ "kl": 0.0018637768691405654,
1576
+ "learning_rate": 5e-06,
1577
+ "loss": 0.0002,
1578
+ "reward": 0.33828125,
1579
+ "reward_std": 0.2950285911560059,
1580
+ "rewards/acc_reward_func": 0.33828125,
1581
+ "step": 655
1582
+ },
1583
+ {
1584
+ "completion_length": 161.92890625,
1585
+ "epoch": 2.6506024096385543,
1586
+ "grad_norm": 0.25132983922958374,
1587
+ "kl": 0.0013798804953694343,
1588
+ "learning_rate": 5e-06,
1589
+ "loss": 0.0001,
1590
+ "reward": 0.33671875,
1591
+ "reward_std": 0.25781014263629914,
1592
+ "rewards/acc_reward_func": 0.33671875,
1593
+ "step": 660
1594
+ },
1595
+ {
1596
+ "completion_length": 165.38203125,
1597
+ "epoch": 2.6706827309236947,
1598
+ "grad_norm": 0.24434784054756165,
1599
+ "kl": 0.0015540448017418384,
1600
+ "learning_rate": 5e-06,
1601
+ "loss": 0.0002,
1602
+ "reward": 0.34609375,
1603
+ "reward_std": 0.2942123174667358,
1604
+ "rewards/acc_reward_func": 0.34609375,
1605
+ "step": 665
1606
+ },
1607
+ {
1608
+ "completion_length": 168.7765625,
1609
+ "epoch": 2.6907630522088355,
1610
+ "grad_norm": 0.28431424498558044,
1611
+ "kl": 0.0027118735713884236,
1612
+ "learning_rate": 5e-06,
1613
+ "loss": 0.0003,
1614
+ "reward": 0.32734375,
1615
+ "reward_std": 0.26073101758956907,
1616
+ "rewards/acc_reward_func": 0.32734375,
1617
+ "step": 670
1618
+ },
1619
+ {
1620
+ "completion_length": 166.51484375,
1621
+ "epoch": 2.710843373493976,
1622
+ "grad_norm": 0.2701532542705536,
1623
+ "kl": 0.0017335619311779737,
1624
+ "learning_rate": 5e-06,
1625
+ "loss": 0.0002,
1626
+ "reward": 0.37890625,
1627
+ "reward_std": 0.29145594835281374,
1628
+ "rewards/acc_reward_func": 0.37890625,
1629
+ "step": 675
1630
+ },
1631
+ {
1632
+ "completion_length": 172.08359375,
1633
+ "epoch": 2.7309236947791167,
1634
+ "grad_norm": 0.25592124462127686,
1635
+ "kl": 0.001596282934769988,
1636
+ "learning_rate": 5e-06,
1637
+ "loss": 0.0002,
1638
+ "reward": 0.3484375,
1639
+ "reward_std": 0.3067091882228851,
1640
+ "rewards/acc_reward_func": 0.3484375,
1641
+ "step": 680
1642
+ },
1643
+ {
1644
+ "completion_length": 169.3625,
1645
+ "epoch": 2.751004016064257,
1646
+ "grad_norm": 0.369191437959671,
1647
+ "kl": 0.0018961878260597587,
1648
+ "learning_rate": 5e-06,
1649
+ "loss": 0.0002,
1650
+ "reward": 0.35703125,
1651
+ "reward_std": 0.26675377786159515,
1652
+ "rewards/acc_reward_func": 0.35703125,
1653
+ "step": 685
1654
+ },
1655
+ {
1656
+ "completion_length": 167.40546875,
1657
+ "epoch": 2.7710843373493974,
1658
+ "grad_norm": 0.3414689600467682,
1659
+ "kl": 0.001623287471011281,
1660
+ "learning_rate": 5e-06,
1661
+ "loss": 0.0002,
1662
+ "reward": 0.34375,
1663
+ "reward_std": 0.24926708936691283,
1664
+ "rewards/acc_reward_func": 0.34375,
1665
+ "step": 690
1666
+ },
1667
+ {
1668
+ "completion_length": 164.5421875,
1669
+ "epoch": 2.791164658634538,
1670
+ "grad_norm": 0.2602121829986572,
1671
+ "kl": 0.0020977890817448497,
1672
+ "learning_rate": 5e-06,
1673
+ "loss": 0.0002,
1674
+ "reward": 0.3359375,
1675
+ "reward_std": 0.2536847472190857,
1676
+ "rewards/acc_reward_func": 0.3359375,
1677
+ "step": 695
1678
+ },
1679
+ {
1680
+ "completion_length": 153.6890625,
1681
+ "epoch": 2.8112449799196786,
1682
+ "grad_norm": 0.28262779116630554,
1683
+ "kl": 0.0015694845002144574,
1684
+ "learning_rate": 5e-06,
1685
+ "loss": 0.0002,
1686
+ "reward": 0.38046875,
1687
+ "reward_std": 0.273573100566864,
1688
+ "rewards/acc_reward_func": 0.38046875,
1689
+ "step": 700
1690
+ },
1691
+ {
1692
+ "completion_length": 166.97421875,
1693
+ "epoch": 2.8313253012048194,
1694
+ "grad_norm": 0.2816270589828491,
1695
+ "kl": 0.001585571584291756,
1696
+ "learning_rate": 5e-06,
1697
+ "loss": 0.0002,
1698
+ "reward": 0.36640625,
1699
+ "reward_std": 0.30526658296585085,
1700
+ "rewards/acc_reward_func": 0.36640625,
1701
+ "step": 705
1702
+ },
1703
+ {
1704
+ "completion_length": 163.82421875,
1705
+ "epoch": 2.8514056224899598,
1706
+ "grad_norm": 0.35617348551750183,
1707
+ "kl": 0.0016123745823279022,
1708
+ "learning_rate": 5e-06,
1709
+ "loss": 0.0002,
1710
+ "reward": 0.34375,
1711
+ "reward_std": 0.26365512013435366,
1712
+ "rewards/acc_reward_func": 0.34375,
1713
+ "step": 710
1714
+ },
1715
+ {
1716
+ "completion_length": 168.128125,
1717
+ "epoch": 2.8714859437751006,
1718
+ "grad_norm": 0.20444567501544952,
1719
+ "kl": 0.001708123623393476,
1720
+ "learning_rate": 5e-06,
1721
+ "loss": 0.0002,
1722
+ "reward": 0.378125,
1723
+ "reward_std": 0.27598778903484344,
1724
+ "rewards/acc_reward_func": 0.378125,
1725
+ "step": 715
1726
+ },
1727
+ {
1728
+ "completion_length": 158.5953125,
1729
+ "epoch": 2.891566265060241,
1730
+ "grad_norm": 0.23954364657402039,
1731
+ "kl": 0.001798305264674127,
1732
+ "learning_rate": 5e-06,
1733
+ "loss": 0.0002,
1734
+ "reward": 0.36875,
1735
+ "reward_std": 0.28262283504009245,
1736
+ "rewards/acc_reward_func": 0.36875,
1737
+ "step": 720
1738
+ },
1739
+ {
1740
+ "completion_length": 168.59453125,
1741
+ "epoch": 2.9116465863453813,
1742
+ "grad_norm": 0.5855829119682312,
1743
+ "kl": 0.0019487401703372597,
1744
+ "learning_rate": 5e-06,
1745
+ "loss": 0.0002,
1746
+ "reward": 0.33046875,
1747
+ "reward_std": 0.26820210814476014,
1748
+ "rewards/acc_reward_func": 0.33046875,
1749
+ "step": 725
1750
+ },
1751
+ {
1752
+ "completion_length": 173.31171875,
1753
+ "epoch": 2.931726907630522,
1754
+ "grad_norm": 0.4006167948246002,
1755
+ "kl": 0.0018504543462768198,
1756
+ "learning_rate": 5e-06,
1757
+ "loss": 0.0002,
1758
+ "reward": 0.3234375,
1759
+ "reward_std": 0.2933495879173279,
1760
+ "rewards/acc_reward_func": 0.3234375,
1761
+ "step": 730
1762
+ },
1763
+ {
1764
+ "completion_length": 168.96015625,
1765
+ "epoch": 2.9518072289156625,
1766
+ "grad_norm": 0.4413718581199646,
1767
+ "kl": 0.002411281201057136,
1768
+ "learning_rate": 5e-06,
1769
+ "loss": 0.0002,
1770
+ "reward": 0.34765625,
1771
+ "reward_std": 0.2572381556034088,
1772
+ "rewards/acc_reward_func": 0.34765625,
1773
+ "step": 735
1774
+ },
1775
+ {
1776
+ "completion_length": 172.9421875,
1777
+ "epoch": 2.9718875502008033,
1778
+ "grad_norm": 0.3350299596786499,
1779
+ "kl": 0.002306809718720615,
1780
+ "learning_rate": 5e-06,
1781
+ "loss": 0.0002,
1782
+ "reward": 0.32109375,
1783
+ "reward_std": 0.27035961151123045,
1784
+ "rewards/acc_reward_func": 0.32109375,
1785
+ "step": 740
1786
+ },
1787
+ {
1788
+ "completion_length": 169.828125,
1789
+ "epoch": 2.9919678714859437,
1790
+ "grad_norm": 0.4135463535785675,
1791
+ "kl": 0.002015371876768768,
1792
+ "learning_rate": 5e-06,
1793
+ "loss": 0.0002,
1794
+ "reward": 0.3859375,
1795
+ "reward_std": 0.3136798143386841,
1796
+ "rewards/acc_reward_func": 0.3859375,
1797
+ "step": 745
1798
+ },
1799
+ {
1800
+ "completion_length": 185.21177673339844,
1801
+ "epoch": 3.0,
1802
+ "kl": 0.0030494448728859425,
1803
+ "reward": 0.267578125,
1804
+ "reward_std": 0.2742668390274048,
1805
+ "rewards/acc_reward_func": 0.267578125,
1806
+ "step": 747,
1807
+ "total_flos": 0.0,
1808
+ "train_loss": 0.00012346780326337724,
1809
+ "train_runtime": 38700.7369,
1810
+ "train_samples_per_second": 0.616,
1811
+ "train_steps_per_second": 0.019
1812
+ }
1813
+ ],
1814
+ "logging_steps": 5,
1815
+ "max_steps": 747,
1816
+ "num_input_tokens_seen": 0,
1817
+ "num_train_epochs": 3,
1818
+ "save_steps": 500,
1819
+ "stateful_callbacks": {
1820
+ "TrainerControl": {
1821
+ "args": {
1822
+ "should_epoch_stop": false,
1823
+ "should_evaluate": false,
1824
+ "should_log": false,
1825
+ "should_save": true,
1826
+ "should_training_stop": true
1827
+ },
1828
+ "attributes": {}
1829
+ }
1830
+ },
1831
+ "total_flos": 0.0,
1832
+ "train_batch_size": 64,
1833
+ "trial_name": null,
1834
+ "trial_params": null
1835
+ }