Haitao999 committed on
Commit 583e2a6 · verified · 1 Parent(s): 8c8aad9

Model save

README.md ADDED
@@ -0,0 +1,67 @@
+ ---
+ library_name: transformers
+ model_name: Qwen2.5-Math-7B-random-numia_prompt_dpo1
+ tags:
+ - generated_from_trainer
+ - trl
+ - grpo
+ licence: license
+ ---
+
+ # Model Card for Qwen2.5-Math-7B-random-numia_prompt_dpo1
+
+ This model is a fine-tuned version of [None](https://huggingface.co/None).
+ It has been trained using [TRL](https://github.com/huggingface/trl).
+
+ ## Quick start
+
+ ```python
+ from transformers import pipeline
+
+ question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
+ generator = pipeline("text-generation", model="Haitao999/Qwen2.5-Math-7B-random-numia_prompt_dpo1", device="cuda")
+ output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
+ print(output["generated_text"])
+ ```
+
+ ## Training procedure
+
+ [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/tjucsailab/huggingface/runs/8nbcvram)
+
+
+ This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
+
+ ### Framework versions
+
+ - TRL: 0.14.0
+ - Transformers: 4.48.3
+ - Pytorch: 2.5.1
+ - Datasets: 3.2.0
+ - Tokenizers: 0.21.1
+
+ ## Citations
+
+ Cite GRPO as:
+
+ ```bibtex
+ @article{zhihong2024deepseekmath,
+ title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
+ author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
+ year = 2024,
+ eprint = {arXiv:2402.03300},
+ }
+
+ ```
+
+ Cite TRL as:
+
+ ```bibtex
+ @misc{vonwerra2022trl,
+ title = {{TRL: Transformer Reinforcement Learning}},
+ author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
+ year = 2020,
+ journal = {GitHub repository},
+ publisher = {GitHub},
+ howpublished = {\url{https://github.com/huggingface/trl}}
+ }
+ ```
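
The model card above describes a GRPO run with TRL 0.14.0 but does not include the training script. A minimal, hypothetical sketch of such a run is shown below; the dataset, the `random_math_reward` placeholder, the assumed base model id, and all hyperparameters except the logged learning rate of 3e-07 are illustrative assumptions, not values recovered from this commit.

```python
# Hypothetical GRPO training sketch (TRL 0.14-style API). Dataset, reward logic,
# base model id, and most hyperparameters are illustrative assumptions.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def random_math_reward(completions, **kwargs):
    # Placeholder for the reward logged as "rewards/random_math_reward" in
    # trainer_state.json; the real checker is not part of this commit.
    return [float("boxed" in completion) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # stand-in prompt dataset

training_args = GRPOConfig(
    output_dir="Qwen2.5-Math-7B-random-numia_prompt_dpo1",
    learning_rate=3e-7,           # matches the constant learning_rate in the logs
    max_completion_length=2048,   # mirrors max_new_tokens in generation_config.json
    logging_steps=1,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-Math-7B",  # assumed base model; the card lists it as None
    reward_funcs=random_math_reward,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```
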
all_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "total_flos": 0.0,
+ "train_loss": 8.003707675146196e-08,
+ "train_runtime": 99592.7903,
+ "train_samples": 20000,
+ "train_samples_per_second": 0.201,
+ "train_steps_per_second": 0.002
+ }
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+ "bos_token_id": 151643,
+ "eos_token_id": 151643,
+ "max_new_tokens": 2048,
+ "transformers_version": "4.48.3"
+ }
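
As a usage note, the generation defaults above ship with the checkpoint and are applied automatically by `generate`; a short sketch for inspecting them (repository id taken from the model card, standard `transformers` API assumed) is:

```python
# Sketch: read the generation defaults stored in generation_config.json.
from transformers import GenerationConfig

gen_cfg = GenerationConfig.from_pretrained(
    "Haitao999/Qwen2.5-Math-7B-random-numia_prompt_dpo1"
)
print(gen_cfg.max_new_tokens)  # 2048
print(gen_cfg.eos_token_id)    # 151643
# These defaults are used by model.generate() unless overridden per call,
# e.g. model.generate(**inputs, max_new_tokens=128).
```
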
train_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "total_flos": 0.0,
+ "train_loss": 8.003707675146196e-08,
+ "train_runtime": 99592.7903,
+ "train_samples": 20000,
+ "train_samples_per_second": 0.201,
+ "train_steps_per_second": 0.002
+ }
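
The trainer_state.json added below logs one entry per optimizer step (reward, reward_std, completion_length, kl, grad_norm, and so on). A minimal sketch for pulling the reward curve out of a local copy of that file:

```python
# Sketch: extract the per-step GRPO reward curve from trainer_state.json.
# Assumes the file has been downloaded from this repository to the working directory.
import json

with open("trainer_state.json") as f:
    state = json.load(f)

history = [entry for entry in state["log_history"] if "reward" in entry]
steps = [entry["step"] for entry in history]
rewards = [entry["reward"] for entry in history]

print(f"{len(steps)} logged steps; final reward {rewards[-1]:.3f}")
```
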
trainer_state.json ADDED
@@ -0,0 +1,2178 @@
1
+ {
2
+ "best_metric": null,
3
+ "best_model_checkpoint": null,
4
+ "epoch": 0.9965010496850945,
5
+ "eval_steps": 100,
6
+ "global_step": 178,
7
+ "is_hyper_param_search": false,
8
+ "is_local_process_zero": true,
9
+ "is_world_process_zero": true,
10
+ "log_history": [
11
+ {
12
+ "completion_length": 840.6339149475098,
13
+ "epoch": 0.005598320503848845,
14
+ "grad_norm": 0.7309558391571045,
15
+ "kl": 0.0,
16
+ "learning_rate": 3e-07,
17
+ "loss": 0.0,
18
+ "reward": 0.4647472910583019,
19
+ "reward_std": 0.2918772315606475,
20
+ "rewards/random_math_reward": 0.4647472910583019,
21
+ "step": 1
22
+ },
23
+ {
24
+ "completion_length": 852.609676361084,
25
+ "epoch": 0.01119664100769769,
26
+ "grad_norm": 0.6802689433097839,
27
+ "kl": 0.0002111196517944336,
28
+ "learning_rate": 3e-07,
29
+ "loss": 0.0,
30
+ "reward": 0.46338964626193047,
31
+ "reward_std": 0.306376988068223,
32
+ "rewards/random_math_reward": 0.46338964626193047,
33
+ "step": 2
34
+ },
35
+ {
36
+ "completion_length": 900.8494682312012,
37
+ "epoch": 0.016794961511546535,
38
+ "grad_norm": 0.11767072975635529,
39
+ "kl": 0.00023311376571655273,
40
+ "learning_rate": 3e-07,
41
+ "loss": 0.0,
42
+ "reward": 0.4775644950568676,
43
+ "reward_std": 0.32149738259613514,
44
+ "rewards/random_math_reward": 0.4775644950568676,
45
+ "step": 3
46
+ },
47
+ {
48
+ "completion_length": 851.5382461547852,
49
+ "epoch": 0.02239328201539538,
50
+ "grad_norm": 0.11348643153905869,
51
+ "kl": 0.0002281665802001953,
52
+ "learning_rate": 3e-07,
53
+ "loss": 0.0,
54
+ "reward": 0.48735272884368896,
55
+ "reward_std": 0.30069673527032137,
56
+ "rewards/random_math_reward": 0.48735272884368896,
57
+ "step": 4
58
+ },
59
+ {
60
+ "completion_length": 883.405590057373,
61
+ "epoch": 0.02799160251924423,
62
+ "grad_norm": 0.08939001709222794,
63
+ "kl": 0.00023871660232543945,
64
+ "learning_rate": 3e-07,
65
+ "loss": 0.0,
66
+ "reward": 0.4512794092297554,
67
+ "reward_std": 0.27799729630351067,
68
+ "rewards/random_math_reward": 0.4512794092297554,
69
+ "step": 5
70
+ },
71
+ {
72
+ "completion_length": 835.0280456542969,
73
+ "epoch": 0.03358992302309307,
74
+ "grad_norm": 0.27869054675102234,
75
+ "kl": 0.0002474188804626465,
76
+ "learning_rate": 3e-07,
77
+ "loss": 0.0,
78
+ "reward": 0.47673217207193375,
79
+ "reward_std": 0.29296134505420923,
80
+ "rewards/random_math_reward": 0.47673217207193375,
81
+ "step": 6
82
+ },
83
+ {
84
+ "completion_length": 907.8456382751465,
85
+ "epoch": 0.03918824352694192,
86
+ "grad_norm": 0.862323522567749,
87
+ "kl": 0.00039499998092651367,
88
+ "learning_rate": 3e-07,
89
+ "loss": 0.0,
90
+ "reward": 0.45936181023716927,
91
+ "reward_std": 0.2918355893343687,
92
+ "rewards/random_math_reward": 0.45936181023716927,
93
+ "step": 7
94
+ },
95
+ {
96
+ "completion_length": 859.1173267364502,
97
+ "epoch": 0.04478656403079076,
98
+ "grad_norm": 0.23682314157485962,
99
+ "kl": 0.00037294626235961914,
100
+ "learning_rate": 3e-07,
101
+ "loss": 0.0,
102
+ "reward": 0.5094553399831057,
103
+ "reward_std": 0.28784568049013615,
104
+ "rewards/random_math_reward": 0.5094553399831057,
105
+ "step": 8
106
+ },
107
+ {
108
+ "completion_length": 858.5267715454102,
109
+ "epoch": 0.05038488453463961,
110
+ "grad_norm": 0.07127245515584946,
111
+ "kl": 0.00028765201568603516,
112
+ "learning_rate": 3e-07,
113
+ "loss": 0.0,
114
+ "reward": 0.5017606746405363,
115
+ "reward_std": 0.2951846132054925,
116
+ "rewards/random_math_reward": 0.5017606746405363,
117
+ "step": 9
118
+ },
119
+ {
120
+ "completion_length": 877.473201751709,
121
+ "epoch": 0.05598320503848846,
122
+ "grad_norm": 0.032599493861198425,
123
+ "kl": 0.00027692317962646484,
124
+ "learning_rate": 3e-07,
125
+ "loss": 0.0,
126
+ "reward": 0.48044879734516144,
127
+ "reward_std": 0.3181338291615248,
128
+ "rewards/random_math_reward": 0.48044879734516144,
129
+ "step": 10
130
+ },
131
+ {
132
+ "completion_length": 870.3584022521973,
133
+ "epoch": 0.0615815255423373,
134
+ "grad_norm": 0.2817532420158386,
135
+ "kl": 0.00034165382385253906,
136
+ "learning_rate": 3e-07,
137
+ "loss": 0.0,
138
+ "reward": 0.4617793317884207,
139
+ "reward_std": 0.28720600064843893,
140
+ "rewards/random_math_reward": 0.4617793317884207,
141
+ "step": 11
142
+ },
143
+ {
144
+ "completion_length": 875.0089111328125,
145
+ "epoch": 0.06717984604618614,
146
+ "grad_norm": 0.04790099710226059,
147
+ "kl": 0.0003007054328918457,
148
+ "learning_rate": 3e-07,
149
+ "loss": 0.0,
150
+ "reward": 0.44794472493231297,
151
+ "reward_std": 0.27486683428287506,
152
+ "rewards/random_math_reward": 0.44794472493231297,
153
+ "step": 12
154
+ },
155
+ {
156
+ "completion_length": 888.2474327087402,
157
+ "epoch": 0.072778166550035,
158
+ "grad_norm": 0.07171210646629333,
159
+ "kl": 0.0003644227981567383,
160
+ "learning_rate": 3e-07,
161
+ "loss": 0.0,
162
+ "reward": 0.4840338323265314,
163
+ "reward_std": 0.2867116192355752,
164
+ "rewards/random_math_reward": 0.4840338323265314,
165
+ "step": 13
166
+ },
167
+ {
168
+ "completion_length": 887.8545837402344,
169
+ "epoch": 0.07837648705388384,
170
+ "grad_norm": 0.02272365428507328,
171
+ "kl": 0.00030791759490966797,
172
+ "learning_rate": 3e-07,
173
+ "loss": 0.0,
174
+ "reward": 0.4695360567420721,
175
+ "reward_std": 0.30686575919389725,
176
+ "rewards/random_math_reward": 0.4695360567420721,
177
+ "step": 14
178
+ },
179
+ {
180
+ "completion_length": 894.8392677307129,
181
+ "epoch": 0.08397480755773268,
182
+ "grad_norm": 0.1592845916748047,
183
+ "kl": 0.0007302761077880859,
184
+ "learning_rate": 3e-07,
185
+ "loss": 0.0,
186
+ "reward": 0.5013854652643204,
187
+ "reward_std": 0.30024987645447254,
188
+ "rewards/random_math_reward": 0.5013854652643204,
189
+ "step": 15
190
+ },
191
+ {
192
+ "completion_length": 826.4298248291016,
193
+ "epoch": 0.08957312806158152,
194
+ "grad_norm": 0.11753757297992706,
195
+ "kl": 0.00044095516204833984,
196
+ "learning_rate": 3e-07,
197
+ "loss": 0.0,
198
+ "reward": 0.48069896548986435,
199
+ "reward_std": 0.2729701278731227,
200
+ "rewards/random_math_reward": 0.48069896548986435,
201
+ "step": 16
202
+ },
203
+ {
204
+ "completion_length": 902.1517715454102,
205
+ "epoch": 0.09517144856543037,
206
+ "grad_norm": 0.111512191593647,
207
+ "kl": 0.0004519224166870117,
208
+ "learning_rate": 3e-07,
209
+ "loss": 0.0,
210
+ "reward": 0.48635287396609783,
211
+ "reward_std": 0.2879869397729635,
212
+ "rewards/random_math_reward": 0.48635287396609783,
213
+ "step": 17
214
+ },
215
+ {
216
+ "completion_length": 893.360954284668,
217
+ "epoch": 0.10076976906927922,
218
+ "grad_norm": 0.2031000405550003,
219
+ "kl": 0.0005553364753723145,
220
+ "learning_rate": 3e-07,
221
+ "loss": 0.0,
222
+ "reward": 0.48062865249812603,
223
+ "reward_std": 0.31339638121426105,
224
+ "rewards/random_math_reward": 0.48062865249812603,
225
+ "step": 18
226
+ },
227
+ {
228
+ "completion_length": 838.4757537841797,
229
+ "epoch": 0.10636808957312806,
230
+ "grad_norm": 0.1047205924987793,
231
+ "kl": 0.00042116641998291016,
232
+ "learning_rate": 3e-07,
233
+ "loss": 0.0,
234
+ "reward": 0.4573891665786505,
235
+ "reward_std": 0.283748428337276,
236
+ "rewards/random_math_reward": 0.4573891665786505,
237
+ "step": 19
238
+ },
239
+ {
240
+ "completion_length": 853.5701332092285,
241
+ "epoch": 0.11196641007697691,
242
+ "grad_norm": 0.026682933792471886,
243
+ "kl": 0.00039273500442504883,
244
+ "learning_rate": 3e-07,
245
+ "loss": 0.0,
246
+ "reward": 0.5187842268496752,
247
+ "reward_std": 0.2996611688286066,
248
+ "rewards/random_math_reward": 0.5187842268496752,
249
+ "step": 20
250
+ },
251
+ {
252
+ "completion_length": 884.6606903076172,
253
+ "epoch": 0.11756473058082575,
254
+ "grad_norm": 0.07019887119531631,
255
+ "kl": 0.0003604888916015625,
256
+ "learning_rate": 3e-07,
257
+ "loss": 0.0,
258
+ "reward": 0.4664475191384554,
259
+ "reward_std": 0.31808448024094105,
260
+ "rewards/random_math_reward": 0.4664475191384554,
261
+ "step": 21
262
+ },
263
+ {
264
+ "completion_length": 906.0828857421875,
265
+ "epoch": 0.1231630510846746,
266
+ "grad_norm": 0.05280100554227829,
267
+ "kl": 0.0005142688751220703,
268
+ "learning_rate": 3e-07,
269
+ "loss": 0.0,
270
+ "reward": 0.4690139964222908,
271
+ "reward_std": 0.27718855906277895,
272
+ "rewards/random_math_reward": 0.4690139964222908,
273
+ "step": 22
274
+ },
275
+ {
276
+ "completion_length": 887.6466636657715,
277
+ "epoch": 0.12876137158852344,
278
+ "grad_norm": 0.04366298019886017,
279
+ "kl": 0.0004309415817260742,
280
+ "learning_rate": 3e-07,
281
+ "loss": 0.0,
282
+ "reward": 0.43934059888124466,
283
+ "reward_std": 0.27285193372517824,
284
+ "rewards/random_math_reward": 0.43934059888124466,
285
+ "step": 23
286
+ },
287
+ {
288
+ "completion_length": 884.6441116333008,
289
+ "epoch": 0.13435969209237228,
290
+ "grad_norm": 0.24076521396636963,
291
+ "kl": 0.0005651712417602539,
292
+ "learning_rate": 3e-07,
293
+ "loss": 0.0,
294
+ "reward": 0.4910389892756939,
295
+ "reward_std": 0.28644737135618925,
296
+ "rewards/random_math_reward": 0.4910389892756939,
297
+ "step": 24
298
+ },
299
+ {
300
+ "completion_length": 853.1759986877441,
301
+ "epoch": 0.13995801259622112,
302
+ "grad_norm": 0.27850207686424255,
303
+ "kl": 0.00044476985931396484,
304
+ "learning_rate": 3e-07,
305
+ "loss": 0.0,
306
+ "reward": 0.4575974587351084,
307
+ "reward_std": 0.30237106420099735,
308
+ "rewards/random_math_reward": 0.4575974587351084,
309
+ "step": 25
310
+ },
311
+ {
312
+ "completion_length": 894.9030342102051,
313
+ "epoch": 0.14555633310007,
314
+ "grad_norm": 0.049030691385269165,
315
+ "kl": 0.00048810243606567383,
316
+ "learning_rate": 3e-07,
317
+ "loss": 0.0,
318
+ "reward": 0.4729502145200968,
319
+ "reward_std": 0.285190143622458,
320
+ "rewards/random_math_reward": 0.4729502145200968,
321
+ "step": 26
322
+ },
323
+ {
324
+ "completion_length": 854.3010063171387,
325
+ "epoch": 0.15115465360391883,
326
+ "grad_norm": 0.07516992837190628,
327
+ "kl": 0.0005003213882446289,
328
+ "learning_rate": 3e-07,
329
+ "loss": 0.0,
330
+ "reward": 0.4500812850892544,
331
+ "reward_std": 0.32013510540127754,
332
+ "rewards/random_math_reward": 0.4500812850892544,
333
+ "step": 27
334
+ },
335
+ {
336
+ "completion_length": 858.9030418395996,
337
+ "epoch": 0.15675297410776767,
338
+ "grad_norm": 0.20776700973510742,
339
+ "kl": 0.0009773969650268555,
340
+ "learning_rate": 3e-07,
341
+ "loss": 0.0,
342
+ "reward": 0.5247392375022173,
343
+ "reward_std": 0.32408637553453445,
344
+ "rewards/random_math_reward": 0.5247392375022173,
345
+ "step": 28
346
+ },
347
+ {
348
+ "completion_length": 844.9770317077637,
349
+ "epoch": 0.16235129461161651,
350
+ "grad_norm": 0.0495678074657917,
351
+ "kl": 0.0009752511978149414,
352
+ "learning_rate": 3e-07,
353
+ "loss": 0.0,
354
+ "reward": 0.43325162678956985,
355
+ "reward_std": 0.29721821192651987,
356
+ "rewards/random_math_reward": 0.43325162678956985,
357
+ "step": 29
358
+ },
359
+ {
360
+ "completion_length": 908.6619758605957,
361
+ "epoch": 0.16794961511546536,
362
+ "grad_norm": 0.053116437047719955,
363
+ "kl": 0.0007257461547851562,
364
+ "learning_rate": 3e-07,
365
+ "loss": 0.0,
366
+ "reward": 0.510849991813302,
367
+ "reward_std": 0.2977478625252843,
368
+ "rewards/random_math_reward": 0.510849991813302,
369
+ "step": 30
370
+ },
371
+ {
372
+ "completion_length": 895.2946281433105,
373
+ "epoch": 0.1735479356193142,
374
+ "grad_norm": 0.01952669955790043,
375
+ "kl": 0.0004551410675048828,
376
+ "learning_rate": 3e-07,
377
+ "loss": 0.0,
378
+ "reward": 0.4997572433203459,
379
+ "reward_std": 0.2973903976380825,
380
+ "rewards/random_math_reward": 0.4997572433203459,
381
+ "step": 31
382
+ },
383
+ {
384
+ "completion_length": 917.7333984375,
385
+ "epoch": 0.17914625612316304,
386
+ "grad_norm": 0.1707552671432495,
387
+ "kl": 0.0006129741668701172,
388
+ "learning_rate": 3e-07,
389
+ "loss": 0.0,
390
+ "reward": 0.465914161875844,
391
+ "reward_std": 0.30206145346164703,
392
+ "rewards/random_math_reward": 0.465914161875844,
393
+ "step": 32
394
+ },
395
+ {
396
+ "completion_length": 870.7665557861328,
397
+ "epoch": 0.1847445766270119,
398
+ "grad_norm": 0.035584185272455215,
399
+ "kl": 0.0004489421844482422,
400
+ "learning_rate": 3e-07,
401
+ "loss": 0.0,
402
+ "reward": 0.45434157736599445,
403
+ "reward_std": 0.30155808478593826,
404
+ "rewards/random_math_reward": 0.45434157736599445,
405
+ "step": 33
406
+ },
407
+ {
408
+ "completion_length": 873.6504936218262,
409
+ "epoch": 0.19034289713086075,
410
+ "grad_norm": 0.046750955283641815,
411
+ "kl": 0.0005222558975219727,
412
+ "learning_rate": 3e-07,
413
+ "loss": 0.0,
414
+ "reward": 0.49102505296468735,
415
+ "reward_std": 0.2892821617424488,
416
+ "rewards/random_math_reward": 0.49102505296468735,
417
+ "step": 34
418
+ },
419
+ {
420
+ "completion_length": 885.6441230773926,
421
+ "epoch": 0.1959412176347096,
422
+ "grad_norm": 0.02094712294638157,
423
+ "kl": 0.0004401206970214844,
424
+ "learning_rate": 3e-07,
425
+ "loss": 0.0,
426
+ "reward": 0.49038008227944374,
427
+ "reward_std": 0.2766480268910527,
428
+ "rewards/random_math_reward": 0.49038008227944374,
429
+ "step": 35
430
+ },
431
+ {
432
+ "completion_length": 857.9094200134277,
433
+ "epoch": 0.20153953813855843,
434
+ "grad_norm": 0.32986631989479065,
435
+ "kl": 0.0016863346099853516,
436
+ "learning_rate": 3e-07,
437
+ "loss": 0.0,
438
+ "reward": 0.5011246297508478,
439
+ "reward_std": 0.2916059549897909,
440
+ "rewards/random_math_reward": 0.5011246297508478,
441
+ "step": 36
442
+ },
443
+ {
444
+ "completion_length": 839.3214149475098,
445
+ "epoch": 0.20713785864240727,
446
+ "grad_norm": 0.04805277660489082,
447
+ "kl": 0.0015134811401367188,
448
+ "learning_rate": 3e-07,
449
+ "loss": 0.0,
450
+ "reward": 0.47658345103263855,
451
+ "reward_std": 0.26997081749141216,
452
+ "rewards/random_math_reward": 0.47658345103263855,
453
+ "step": 37
454
+ },
455
+ {
456
+ "completion_length": 888.7257461547852,
457
+ "epoch": 0.21273617914625612,
458
+ "grad_norm": 0.026755526661872864,
459
+ "kl": 0.0005905628204345703,
460
+ "learning_rate": 3e-07,
461
+ "loss": 0.0,
462
+ "reward": 0.4430251754820347,
463
+ "reward_std": 0.28255612775683403,
464
+ "rewards/random_math_reward": 0.4430251754820347,
465
+ "step": 38
466
+ },
467
+ {
468
+ "completion_length": 879.1823692321777,
469
+ "epoch": 0.21833449965010496,
470
+ "grad_norm": 0.11245640367269516,
471
+ "kl": 0.000995039939880371,
472
+ "learning_rate": 3e-07,
473
+ "loss": 0.0,
474
+ "reward": 0.4720071740448475,
475
+ "reward_std": 0.27762684039771557,
476
+ "rewards/random_math_reward": 0.4720071740448475,
477
+ "step": 39
478
+ },
479
+ {
480
+ "completion_length": 930.487232208252,
481
+ "epoch": 0.22393282015395383,
482
+ "grad_norm": 0.053661637008190155,
483
+ "kl": 0.0007688403129577637,
484
+ "learning_rate": 3e-07,
485
+ "loss": 0.0,
486
+ "reward": 0.4835520200431347,
487
+ "reward_std": 0.28671225160360336,
488
+ "rewards/random_math_reward": 0.4835520200431347,
489
+ "step": 40
490
+ },
491
+ {
492
+ "completion_length": 893.3328971862793,
493
+ "epoch": 0.22953114065780267,
494
+ "grad_norm": 0.5499238967895508,
495
+ "kl": 0.003184080123901367,
496
+ "learning_rate": 3e-07,
497
+ "loss": 0.0,
498
+ "reward": 0.4463555943220854,
499
+ "reward_std": 0.2914181677624583,
500
+ "rewards/random_math_reward": 0.4463555943220854,
501
+ "step": 41
502
+ },
503
+ {
504
+ "completion_length": 898.3367156982422,
505
+ "epoch": 0.2351294611616515,
506
+ "grad_norm": 0.046797916293144226,
507
+ "kl": 0.0005197525024414062,
508
+ "learning_rate": 3e-07,
509
+ "loss": 0.0,
510
+ "reward": 0.45641600526869297,
511
+ "reward_std": 0.2792764212936163,
512
+ "rewards/random_math_reward": 0.45641600526869297,
513
+ "step": 42
514
+ },
515
+ {
516
+ "completion_length": 845.0535507202148,
517
+ "epoch": 0.24072778166550035,
518
+ "grad_norm": 0.0507148802280426,
519
+ "kl": 0.0009500980377197266,
520
+ "learning_rate": 3e-07,
521
+ "loss": 0.0,
522
+ "reward": 0.4459559340029955,
523
+ "reward_std": 0.29090235754847527,
524
+ "rewards/random_math_reward": 0.4459559340029955,
525
+ "step": 43
526
+ },
527
+ {
528
+ "completion_length": 876.5892677307129,
529
+ "epoch": 0.2463261021693492,
530
+ "grad_norm": 0.04941030591726303,
531
+ "kl": 0.000649571418762207,
532
+ "learning_rate": 3e-07,
533
+ "loss": 0.0,
534
+ "reward": 0.4653678424656391,
535
+ "reward_std": 0.28301908634603024,
536
+ "rewards/random_math_reward": 0.4653678424656391,
537
+ "step": 44
538
+ },
539
+ {
540
+ "completion_length": 888.4846801757812,
541
+ "epoch": 0.25192442267319803,
542
+ "grad_norm": 0.026715300977230072,
543
+ "kl": 0.0008295774459838867,
544
+ "learning_rate": 3e-07,
545
+ "loss": 0.0,
546
+ "reward": 0.45214173197746277,
547
+ "reward_std": 0.3066752403974533,
548
+ "rewards/random_math_reward": 0.45214173197746277,
549
+ "step": 45
550
+ },
551
+ {
552
+ "completion_length": 846.2231979370117,
553
+ "epoch": 0.2575227431770469,
554
+ "grad_norm": 0.024737877771258354,
555
+ "kl": 0.0005326271057128906,
556
+ "learning_rate": 3e-07,
557
+ "loss": 0.0,
558
+ "reward": 0.5093872379511595,
559
+ "reward_std": 0.28799473866820335,
560
+ "rewards/random_math_reward": 0.5093872379511595,
561
+ "step": 46
562
+ },
563
+ {
564
+ "completion_length": 865.0395202636719,
565
+ "epoch": 0.2631210636808957,
566
+ "grad_norm": 0.37511223554611206,
567
+ "kl": 0.001316070556640625,
568
+ "learning_rate": 3e-07,
569
+ "loss": 0.0,
570
+ "reward": 0.4726740214973688,
571
+ "reward_std": 0.27686448488384485,
572
+ "rewards/random_math_reward": 0.4726740214973688,
573
+ "step": 47
574
+ },
575
+ {
576
+ "completion_length": 884.132640838623,
577
+ "epoch": 0.26871938418474456,
578
+ "grad_norm": 0.03645370528101921,
579
+ "kl": 0.0006043910980224609,
580
+ "learning_rate": 3e-07,
581
+ "loss": 0.0,
582
+ "reward": 0.4929856266826391,
583
+ "reward_std": 0.2844850402325392,
584
+ "rewards/random_math_reward": 0.4929856266826391,
585
+ "step": 48
586
+ },
587
+ {
588
+ "completion_length": 941.7487030029297,
589
+ "epoch": 0.2743177046885934,
590
+ "grad_norm": 0.04691855236887932,
591
+ "kl": 0.0006322860717773438,
592
+ "learning_rate": 3e-07,
593
+ "loss": 0.0,
594
+ "reward": 0.46845594607293606,
595
+ "reward_std": 0.2635832289233804,
596
+ "rewards/random_math_reward": 0.46845594607293606,
597
+ "step": 49
598
+ },
599
+ {
600
+ "completion_length": 886.7040634155273,
601
+ "epoch": 0.27991602519244224,
602
+ "grad_norm": 0.01480669155716896,
603
+ "kl": 0.0007061958312988281,
604
+ "learning_rate": 3e-07,
605
+ "loss": 0.0,
606
+ "reward": 0.49290234968066216,
607
+ "reward_std": 0.28575755935162306,
608
+ "rewards/random_math_reward": 0.49290234968066216,
609
+ "step": 50
610
+ },
611
+ {
612
+ "completion_length": 881.2589111328125,
613
+ "epoch": 0.28551434569629114,
614
+ "grad_norm": 0.18569053709506989,
615
+ "kl": 0.0008254051208496094,
616
+ "learning_rate": 3e-07,
617
+ "loss": 0.0,
618
+ "reward": 0.470220735296607,
619
+ "reward_std": 0.3022102378308773,
620
+ "rewards/random_math_reward": 0.470220735296607,
621
+ "step": 51
622
+ },
623
+ {
624
+ "completion_length": 858.8775253295898,
625
+ "epoch": 0.29111266620014,
626
+ "grad_norm": 0.06353636085987091,
627
+ "kl": 0.0007264614105224609,
628
+ "learning_rate": 3e-07,
629
+ "loss": 0.0,
630
+ "reward": 0.5083950478583574,
631
+ "reward_std": 0.29449230805039406,
632
+ "rewards/random_math_reward": 0.5083950478583574,
633
+ "step": 52
634
+ },
635
+ {
636
+ "completion_length": 886.2869720458984,
637
+ "epoch": 0.2967109867039888,
638
+ "grad_norm": 0.035578541457653046,
639
+ "kl": 0.0007152557373046875,
640
+ "learning_rate": 3e-07,
641
+ "loss": 0.0,
642
+ "reward": 0.4716050550341606,
643
+ "reward_std": 0.30308040603995323,
644
+ "rewards/random_math_reward": 0.4716050550341606,
645
+ "step": 53
646
+ },
647
+ {
648
+ "completion_length": 889.1351852416992,
649
+ "epoch": 0.30230930720783766,
650
+ "grad_norm": 0.027262764051556587,
651
+ "kl": 0.0015797615051269531,
652
+ "learning_rate": 3e-07,
653
+ "loss": 0.0,
654
+ "reward": 0.49637374840676785,
655
+ "reward_std": 0.30440937727689743,
656
+ "rewards/random_math_reward": 0.49637374840676785,
657
+ "step": 54
658
+ },
659
+ {
660
+ "completion_length": 859.5165557861328,
661
+ "epoch": 0.3079076277116865,
662
+ "grad_norm": 0.013877746649086475,
663
+ "kl": 0.0005941390991210938,
664
+ "learning_rate": 3e-07,
665
+ "loss": 0.0,
666
+ "reward": 0.4874963089823723,
667
+ "reward_std": 0.2997851762920618,
668
+ "rewards/random_math_reward": 0.4874963089823723,
669
+ "step": 55
670
+ },
671
+ {
672
+ "completion_length": 886.5854339599609,
673
+ "epoch": 0.31350594821553535,
674
+ "grad_norm": 0.024866046383976936,
675
+ "kl": 0.000647425651550293,
676
+ "learning_rate": 3e-07,
677
+ "loss": 0.0,
678
+ "reward": 0.46163152530789375,
679
+ "reward_std": 0.2887133536860347,
680
+ "rewards/random_math_reward": 0.46163152530789375,
681
+ "step": 56
682
+ },
683
+ {
684
+ "completion_length": 863.1759986877441,
685
+ "epoch": 0.3191042687193842,
686
+ "grad_norm": 0.023259082809090614,
687
+ "kl": 0.0006357431411743164,
688
+ "learning_rate": 3e-07,
689
+ "loss": 0.0,
690
+ "reward": 0.4782671481370926,
691
+ "reward_std": 0.29867998976260424,
692
+ "rewards/random_math_reward": 0.4782671481370926,
693
+ "step": 57
694
+ },
695
+ {
696
+ "completion_length": 895.409423828125,
697
+ "epoch": 0.32470258922323303,
698
+ "grad_norm": 0.02394162304699421,
699
+ "kl": 0.0007455348968505859,
700
+ "learning_rate": 3e-07,
701
+ "loss": 0.0,
702
+ "reward": 0.4420376904308796,
703
+ "reward_std": 0.30142530612647533,
704
+ "rewards/random_math_reward": 0.4420376904308796,
705
+ "step": 58
706
+ },
707
+ {
708
+ "completion_length": 933.6810913085938,
709
+ "epoch": 0.33030090972708187,
710
+ "grad_norm": 0.022286556661128998,
711
+ "kl": 0.0007117986679077148,
712
+ "learning_rate": 3e-07,
713
+ "loss": 0.0,
714
+ "reward": 0.48266960494220257,
715
+ "reward_std": 0.2849938729777932,
716
+ "rewards/random_math_reward": 0.48266960494220257,
717
+ "step": 59
718
+ },
719
+ {
720
+ "completion_length": 906.8890190124512,
721
+ "epoch": 0.3358992302309307,
722
+ "grad_norm": 0.018822669982910156,
723
+ "kl": 0.000682830810546875,
724
+ "learning_rate": 3e-07,
725
+ "loss": 0.0,
726
+ "reward": 0.487277016043663,
727
+ "reward_std": 0.27959814574569464,
728
+ "rewards/random_math_reward": 0.487277016043663,
729
+ "step": 60
730
+ },
731
+ {
732
+ "completion_length": 864.5459022521973,
733
+ "epoch": 0.34149755073477955,
734
+ "grad_norm": 0.07483907043933868,
735
+ "kl": 0.0009076595306396484,
736
+ "learning_rate": 3e-07,
737
+ "loss": 0.0,
738
+ "reward": 0.5487107969820499,
739
+ "reward_std": 0.29574476182460785,
740
+ "rewards/random_math_reward": 0.5487107969820499,
741
+ "step": 61
742
+ },
743
+ {
744
+ "completion_length": 946.0178413391113,
745
+ "epoch": 0.3470958712386284,
746
+ "grad_norm": 0.04614647850394249,
747
+ "kl": 0.0015153884887695312,
748
+ "learning_rate": 3e-07,
749
+ "loss": 0.0,
750
+ "reward": 0.5009770151227713,
751
+ "reward_std": 0.29892334900796413,
752
+ "rewards/random_math_reward": 0.5009770151227713,
753
+ "step": 62
754
+ },
755
+ {
756
+ "completion_length": 929.1989631652832,
757
+ "epoch": 0.35269419174247724,
758
+ "grad_norm": 0.02183196134865284,
759
+ "kl": 0.0007126331329345703,
760
+ "learning_rate": 3e-07,
761
+ "loss": 0.0,
762
+ "reward": 0.4849752429872751,
763
+ "reward_std": 0.29463311191648245,
764
+ "rewards/random_math_reward": 0.4849752429872751,
765
+ "step": 63
766
+ },
767
+ {
768
+ "completion_length": 911.9004974365234,
769
+ "epoch": 0.3582925122463261,
770
+ "grad_norm": 0.02410600334405899,
771
+ "kl": 0.0010197162628173828,
772
+ "learning_rate": 3e-07,
773
+ "loss": 0.0,
774
+ "reward": 0.4673476442694664,
775
+ "reward_std": 0.27815896179527044,
776
+ "rewards/random_math_reward": 0.4673476442694664,
777
+ "step": 64
778
+ },
779
+ {
780
+ "completion_length": 885.8609504699707,
781
+ "epoch": 0.363890832750175,
782
+ "grad_norm": 0.013846021145582199,
783
+ "kl": 0.0008647441864013672,
784
+ "learning_rate": 3e-07,
785
+ "loss": 0.0,
786
+ "reward": 0.47511172853410244,
787
+ "reward_std": 0.29065939225256443,
788
+ "rewards/random_math_reward": 0.47511172853410244,
789
+ "step": 65
790
+ },
791
+ {
792
+ "completion_length": 896.8316116333008,
793
+ "epoch": 0.3694891532540238,
794
+ "grad_norm": 0.058244168758392334,
795
+ "kl": 0.0007791519165039062,
796
+ "learning_rate": 3e-07,
797
+ "loss": 0.0,
798
+ "reward": 0.49584819190204144,
799
+ "reward_std": 0.28557164780795574,
800
+ "rewards/random_math_reward": 0.49584819190204144,
801
+ "step": 66
802
+ },
803
+ {
804
+ "completion_length": 915.7550888061523,
805
+ "epoch": 0.37508747375787266,
806
+ "grad_norm": 0.026778079569339752,
807
+ "kl": 0.0007394552230834961,
808
+ "learning_rate": 3e-07,
809
+ "loss": 0.0,
810
+ "reward": 0.4666522778570652,
811
+ "reward_std": 0.2651451360434294,
812
+ "rewards/random_math_reward": 0.4666522778570652,
813
+ "step": 67
814
+ },
815
+ {
816
+ "completion_length": 817.3571281433105,
817
+ "epoch": 0.3806857942617215,
818
+ "grad_norm": 0.04487188532948494,
819
+ "kl": 0.002661585807800293,
820
+ "learning_rate": 3e-07,
821
+ "loss": 0.0,
822
+ "reward": 0.4737847317010164,
823
+ "reward_std": 0.2716279961168766,
824
+ "rewards/random_math_reward": 0.4737847317010164,
825
+ "step": 68
826
+ },
827
+ {
828
+ "completion_length": 967.4578742980957,
829
+ "epoch": 0.38628411476557034,
830
+ "grad_norm": 0.012447608634829521,
831
+ "kl": 0.000609278678894043,
832
+ "learning_rate": 3e-07,
833
+ "loss": 0.0,
834
+ "reward": 0.4469184931367636,
835
+ "reward_std": 0.2811002554371953,
836
+ "rewards/random_math_reward": 0.4469184931367636,
837
+ "step": 69
838
+ },
839
+ {
840
+ "completion_length": 932.1237030029297,
841
+ "epoch": 0.3918824352694192,
842
+ "grad_norm": 0.011375661008059978,
843
+ "kl": 0.0006630420684814453,
844
+ "learning_rate": 3e-07,
845
+ "loss": 0.0,
846
+ "reward": 0.458757933229208,
847
+ "reward_std": 0.3028138969093561,
848
+ "rewards/random_math_reward": 0.458757933229208,
849
+ "step": 70
850
+ },
851
+ {
852
+ "completion_length": 871.3660583496094,
853
+ "epoch": 0.397480755773268,
854
+ "grad_norm": 0.010783006437122822,
855
+ "kl": 0.0007634162902832031,
856
+ "learning_rate": 3e-07,
857
+ "loss": 0.0,
858
+ "reward": 0.4868381731212139,
859
+ "reward_std": 0.28059179708361626,
860
+ "rewards/random_math_reward": 0.4868381731212139,
861
+ "step": 71
862
+ },
863
+ {
864
+ "completion_length": 880.8915634155273,
865
+ "epoch": 0.40307907627711687,
866
+ "grad_norm": 0.024310071021318436,
867
+ "kl": 0.0012193918228149414,
868
+ "learning_rate": 3e-07,
869
+ "loss": 0.0,
870
+ "reward": 0.4670300707221031,
871
+ "reward_std": 0.28627045452594757,
872
+ "rewards/random_math_reward": 0.4670300707221031,
873
+ "step": 72
874
+ },
875
+ {
876
+ "completion_length": 848.394115447998,
877
+ "epoch": 0.4086773967809657,
878
+ "grad_norm": 0.03577808663249016,
879
+ "kl": 0.0011832714080810547,
880
+ "learning_rate": 3e-07,
881
+ "loss": 0.0,
882
+ "reward": 0.4628731273114681,
883
+ "reward_std": 0.2926492039114237,
884
+ "rewards/random_math_reward": 0.4628731273114681,
885
+ "step": 73
886
+ },
887
+ {
888
+ "completion_length": 850.6415634155273,
889
+ "epoch": 0.41427571728481455,
890
+ "grad_norm": 0.02307233400642872,
891
+ "kl": 0.0011417865753173828,
892
+ "learning_rate": 3e-07,
893
+ "loss": 0.0,
894
+ "reward": 0.48236627131700516,
895
+ "reward_std": 0.2867990182712674,
896
+ "rewards/random_math_reward": 0.48236627131700516,
897
+ "step": 74
898
+ },
899
+ {
900
+ "completion_length": 935.357120513916,
901
+ "epoch": 0.4198740377886634,
902
+ "grad_norm": 0.009285389445722103,
903
+ "kl": 0.0006787776947021484,
904
+ "learning_rate": 3e-07,
905
+ "loss": 0.0,
906
+ "reward": 0.48846207931637764,
907
+ "reward_std": 0.28832482267171144,
908
+ "rewards/random_math_reward": 0.48846207931637764,
909
+ "step": 75
910
+ },
911
+ {
912
+ "completion_length": 897.8532943725586,
913
+ "epoch": 0.42547235829251223,
914
+ "grad_norm": 0.06958639621734619,
915
+ "kl": 0.0009503364562988281,
916
+ "learning_rate": 3e-07,
917
+ "loss": 0.0,
918
+ "reward": 0.5029652137309313,
919
+ "reward_std": 0.2956245709210634,
920
+ "rewards/random_math_reward": 0.5029652137309313,
921
+ "step": 76
922
+ },
923
+ {
924
+ "completion_length": 861.1275329589844,
925
+ "epoch": 0.4310706787963611,
926
+ "grad_norm": 0.019943566992878914,
927
+ "kl": 0.001253366470336914,
928
+ "learning_rate": 3e-07,
929
+ "loss": 0.0,
930
+ "reward": 0.5312744900584221,
931
+ "reward_std": 0.2928897039964795,
932
+ "rewards/random_math_reward": 0.5312744900584221,
933
+ "step": 77
934
+ },
935
+ {
936
+ "completion_length": 924.6313591003418,
937
+ "epoch": 0.4366689993002099,
938
+ "grad_norm": 0.01711793802678585,
939
+ "kl": 0.0007042884826660156,
940
+ "learning_rate": 3e-07,
941
+ "loss": 0.0,
942
+ "reward": 0.49481588415801525,
943
+ "reward_std": 0.29031859524548054,
944
+ "rewards/random_math_reward": 0.49481588415801525,
945
+ "step": 78
946
+ },
947
+ {
948
+ "completion_length": 842.0395202636719,
949
+ "epoch": 0.44226731980405876,
950
+ "grad_norm": 0.018647639080882072,
951
+ "kl": 0.0007382631301879883,
952
+ "learning_rate": 3e-07,
953
+ "loss": 0.0,
954
+ "reward": 0.5070081520825624,
955
+ "reward_std": 0.2859943201765418,
956
+ "rewards/random_math_reward": 0.5070081520825624,
957
+ "step": 79
958
+ },
959
+ {
960
+ "completion_length": 895.3583946228027,
961
+ "epoch": 0.44786564030790765,
962
+ "grad_norm": 0.026943529024720192,
963
+ "kl": 0.0007255077362060547,
964
+ "learning_rate": 3e-07,
965
+ "loss": 0.0,
966
+ "reward": 0.48258678428828716,
967
+ "reward_std": 0.2774320160970092,
968
+ "rewards/random_math_reward": 0.48258678428828716,
969
+ "step": 80
970
+ },
971
+ {
972
+ "completion_length": 937.7754783630371,
973
+ "epoch": 0.4534639608117565,
974
+ "grad_norm": 0.015870269387960434,
975
+ "kl": 0.0008311271667480469,
976
+ "learning_rate": 3e-07,
977
+ "loss": 0.0,
978
+ "reward": 0.4687965828925371,
979
+ "reward_std": 0.2817834969609976,
980
+ "rewards/random_math_reward": 0.4687965828925371,
981
+ "step": 81
982
+ },
983
+ {
984
+ "completion_length": 835.7397804260254,
985
+ "epoch": 0.45906228131560534,
986
+ "grad_norm": 0.06118820607662201,
987
+ "kl": 0.0011771917343139648,
988
+ "learning_rate": 3e-07,
989
+ "loss": 0.0,
990
+ "reward": 0.4703396111726761,
991
+ "reward_std": 0.29155910573899746,
992
+ "rewards/random_math_reward": 0.4703396111726761,
993
+ "step": 82
994
+ },
995
+ {
996
+ "completion_length": 887.909423828125,
997
+ "epoch": 0.4646606018194542,
998
+ "grad_norm": 0.05378980189561844,
999
+ "kl": 0.0013784170150756836,
1000
+ "learning_rate": 3e-07,
1001
+ "loss": 0.0,
1002
+ "reward": 0.4672347716987133,
1003
+ "reward_std": 0.316370103508234,
1004
+ "rewards/random_math_reward": 0.4672347716987133,
1005
+ "step": 83
1006
+ },
1007
+ {
1008
+ "completion_length": 818.6708984375,
1009
+ "epoch": 0.470258922323303,
1010
+ "grad_norm": 0.05784986913204193,
1011
+ "kl": 0.0008500814437866211,
1012
+ "learning_rate": 3e-07,
1013
+ "loss": 0.0,
1014
+ "reward": 0.5053709652274847,
1015
+ "reward_std": 0.3026046808809042,
1016
+ "rewards/random_math_reward": 0.5053709652274847,
1017
+ "step": 84
1018
+ },
1019
+ {
1020
+ "completion_length": 926.3800888061523,
1021
+ "epoch": 0.47585724282715186,
1022
+ "grad_norm": 0.021458175033330917,
1023
+ "kl": 0.0008251667022705078,
1024
+ "learning_rate": 3e-07,
1025
+ "loss": 0.0,
1026
+ "reward": 0.47099834494292736,
1027
+ "reward_std": 0.30302479304373264,
1028
+ "rewards/random_math_reward": 0.47099834494292736,
1029
+ "step": 85
1030
+ },
1031
+ {
1032
+ "completion_length": 859.1288108825684,
1033
+ "epoch": 0.4814555633310007,
1034
+ "grad_norm": 0.019605984911322594,
1035
+ "kl": 0.0009077787399291992,
1036
+ "learning_rate": 3e-07,
1037
+ "loss": 0.0,
1038
+ "reward": 0.4812137298285961,
1039
+ "reward_std": 0.29327244497835636,
1040
+ "rewards/random_math_reward": 0.4812137298285961,
1041
+ "step": 86
1042
+ },
1043
+ {
1044
+ "completion_length": 908.1402816772461,
1045
+ "epoch": 0.48705388383484954,
1046
+ "grad_norm": 0.06367423385381699,
1047
+ "kl": 0.0011855363845825195,
1048
+ "learning_rate": 3e-07,
1049
+ "loss": 0.0,
1050
+ "reward": 0.4995248857885599,
1051
+ "reward_std": 0.293773477897048,
1052
+ "rewards/random_math_reward": 0.4995248857885599,
1053
+ "step": 87
1054
+ },
1055
+ {
1056
+ "completion_length": 879.8099327087402,
1057
+ "epoch": 0.4926522043386984,
1058
+ "grad_norm": 0.013439509086310863,
1059
+ "kl": 0.0009884834289550781,
1060
+ "learning_rate": 3e-07,
1061
+ "loss": 0.0,
1062
+ "reward": 0.4674531724303961,
1063
+ "reward_std": 0.28925504721701145,
1064
+ "rewards/random_math_reward": 0.4674531724303961,
1065
+ "step": 88
1066
+ },
1067
+ {
1068
+ "completion_length": 897.298454284668,
1069
+ "epoch": 0.4982505248425472,
1070
+ "grad_norm": 0.011524248868227005,
1071
+ "kl": 0.0013021230697631836,
1072
+ "learning_rate": 3e-07,
1073
+ "loss": 0.0,
1074
+ "reward": 0.46935696713626385,
1075
+ "reward_std": 0.28805949725210667,
1076
+ "rewards/random_math_reward": 0.46935696713626385,
1077
+ "step": 89
1078
+ },
1079
+ {
1080
+ "completion_length": 862.3762626647949,
1081
+ "epoch": 0.5038488453463961,
1082
+ "grad_norm": 0.01074186060577631,
1083
+ "kl": 0.0008319616317749023,
1084
+ "learning_rate": 3e-07,
1085
+ "loss": 0.0,
1086
+ "reward": 0.4534935001283884,
1087
+ "reward_std": 0.29410699382424355,
1088
+ "rewards/random_math_reward": 0.4534935001283884,
1089
+ "step": 90
1090
+ },
1091
+ {
1092
+ "completion_length": 903.1083946228027,
1093
+ "epoch": 0.509447165850245,
1094
+ "grad_norm": 0.20411504805088043,
1095
+ "kl": 0.0014219284057617188,
1096
+ "learning_rate": 3e-07,
1097
+ "loss": 0.0,
1098
+ "reward": 0.47694382816553116,
1099
+ "reward_std": 0.28339869249612093,
1100
+ "rewards/random_math_reward": 0.47694382816553116,
1101
+ "step": 91
1102
+ },
1103
+ {
1104
+ "completion_length": 922.6543235778809,
1105
+ "epoch": 0.5150454863540938,
1106
+ "grad_norm": 0.01232316717505455,
1107
+ "kl": 0.000966191291809082,
1108
+ "learning_rate": 3e-07,
1109
+ "loss": 0.0,
1110
+ "reward": 0.44490973837673664,
1111
+ "reward_std": 0.2793145142495632,
1112
+ "rewards/random_math_reward": 0.44490973837673664,
1113
+ "step": 92
1114
+ },
1115
+ {
1116
+ "completion_length": 911.021671295166,
1117
+ "epoch": 0.5206438068579426,
1118
+ "grad_norm": 0.012167483568191528,
1119
+ "kl": 0.0007963180541992188,
1120
+ "learning_rate": 3e-07,
1121
+ "loss": 0.0,
1122
+ "reward": 0.48554741591215134,
1123
+ "reward_std": 0.2934918478131294,
1124
+ "rewards/random_math_reward": 0.48554741591215134,
1125
+ "step": 93
1126
+ },
1127
+ {
1128
+ "completion_length": 939.8800773620605,
1129
+ "epoch": 0.5262421273617914,
1130
+ "grad_norm": 0.009945104829967022,
1131
+ "kl": 0.0008685588836669922,
1132
+ "learning_rate": 3e-07,
1133
+ "loss": 0.0,
1134
+ "reward": 0.47484579868614674,
1135
+ "reward_std": 0.30168304964900017,
1136
+ "rewards/random_math_reward": 0.47484579868614674,
1137
+ "step": 94
1138
+ },
1139
+ {
1140
+ "completion_length": 906.4119720458984,
1141
+ "epoch": 0.5318404478656403,
1142
+ "grad_norm": 0.010836634784936905,
1143
+ "kl": 0.0007404088973999023,
1144
+ "learning_rate": 3e-07,
1145
+ "loss": 0.0,
1146
+ "reward": 0.47092380560934544,
1147
+ "reward_std": 0.28657434694468975,
1148
+ "rewards/random_math_reward": 0.47092380560934544,
1149
+ "step": 95
1150
+ },
1151
+ {
1152
+ "completion_length": 883.8583946228027,
1153
+ "epoch": 0.5374387683694891,
1154
+ "grad_norm": 0.035804830491542816,
1155
+ "kl": 0.0007826089859008789,
1156
+ "learning_rate": 3e-07,
1157
+ "loss": 0.0,
1158
+ "reward": 0.4703601971268654,
1159
+ "reward_std": 0.28724440187215805,
1160
+ "rewards/random_math_reward": 0.4703601971268654,
1161
+ "step": 96
1162
+ },
1163
+ {
1164
+ "completion_length": 880.0191192626953,
1165
+ "epoch": 0.543037088873338,
1166
+ "grad_norm": 0.06715273857116699,
1167
+ "kl": 0.0016443729400634766,
1168
+ "learning_rate": 3e-07,
1169
+ "loss": 0.0,
1170
+ "reward": 0.5048227999359369,
1171
+ "reward_std": 0.29030876979231834,
1172
+ "rewards/random_math_reward": 0.5048227999359369,
1173
+ "step": 97
1174
+ },
1175
+ {
1176
+ "completion_length": 898.9642677307129,
1177
+ "epoch": 0.5486354093771868,
1178
+ "grad_norm": 0.014051802456378937,
1179
+ "kl": 0.001214146614074707,
1180
+ "learning_rate": 3e-07,
1181
+ "loss": 0.0,
1182
+ "reward": 0.49155336059629917,
1183
+ "reward_std": 0.2932043820619583,
1184
+ "rewards/random_math_reward": 0.49155336059629917,
1185
+ "step": 98
1186
+ },
1187
+ {
1188
+ "completion_length": 854.6913032531738,
1189
+ "epoch": 0.5542337298810357,
1190
+ "grad_norm": 0.01157184038311243,
1191
+ "kl": 0.0010590553283691406,
1192
+ "learning_rate": 3e-07,
1193
+ "loss": 0.0,
1194
+ "reward": 0.45419391617178917,
1195
+ "reward_std": 0.28226044587790966,
1196
+ "rewards/random_math_reward": 0.45419391617178917,
1197
+ "step": 99
1198
+ },
1199
+ {
1200
+ "completion_length": 902.7282943725586,
1201
+ "epoch": 0.5598320503848845,
1202
+ "grad_norm": 0.11193796992301941,
1203
+ "kl": 0.0011587142944335938,
1204
+ "learning_rate": 3e-07,
1205
+ "loss": 0.0,
1206
+ "reward": 0.46381357870996,
1207
+ "reward_std": 0.27888650726526976,
1208
+ "rewards/random_math_reward": 0.46381357870996,
1209
+ "step": 100
1210
+ },
1211
+ {
1212
+ "completion_length": 873.6173210144043,
1213
+ "epoch": 0.5654303708887334,
1214
+ "grad_norm": 0.02095395140349865,
1215
+ "kl": 0.0009194612503051758,
1216
+ "learning_rate": 3e-07,
1217
+ "loss": 0.0,
1218
+ "reward": 0.4897339139133692,
1219
+ "reward_std": 0.27832474932074547,
1220
+ "rewards/random_math_reward": 0.4897339139133692,
1221
+ "step": 101
1222
+ },
1223
+ {
1224
+ "completion_length": 914.6849250793457,
1225
+ "epoch": 0.5710286913925823,
1226
+ "grad_norm": 0.04229766130447388,
1227
+ "kl": 0.0011234283447265625,
1228
+ "learning_rate": 3e-07,
1229
+ "loss": 0.0,
1230
+ "reward": 0.48241828568279743,
1231
+ "reward_std": 0.2899211458861828,
1232
+ "rewards/random_math_reward": 0.48241828568279743,
1233
+ "step": 102
1234
+ },
1235
+ {
1236
+ "completion_length": 888.3277778625488,
1237
+ "epoch": 0.5766270118964311,
1238
+ "grad_norm": 0.0680854320526123,
1239
+ "kl": 0.0012063980102539062,
1240
+ "learning_rate": 3e-07,
1241
+ "loss": 0.0,
1242
+ "reward": 0.4864093214273453,
1243
+ "reward_std": 0.2803313685581088,
1244
+ "rewards/random_math_reward": 0.4864093214273453,
1245
+ "step": 103
1246
+ },
1247
+ {
1248
+ "completion_length": 930.9400291442871,
1249
+ "epoch": 0.58222533240028,
1250
+ "grad_norm": 0.02672196552157402,
1251
+ "kl": 0.0008995532989501953,
1252
+ "learning_rate": 3e-07,
1253
+ "loss": 0.0,
1254
+ "reward": 0.47334628365933895,
1255
+ "reward_std": 0.2828089501708746,
1256
+ "rewards/random_math_reward": 0.47334628365933895,
1257
+ "step": 104
1258
+ },
1259
+ {
1260
+ "completion_length": 865.7499847412109,
1261
+ "epoch": 0.5878236529041287,
1262
+ "grad_norm": 0.010435185395181179,
1263
+ "kl": 0.0008490085601806641,
1264
+ "learning_rate": 3e-07,
1265
+ "loss": 0.0,
1266
+ "reward": 0.4868275187909603,
1267
+ "reward_std": 0.2970134112983942,
1268
+ "rewards/random_math_reward": 0.4868275187909603,
1269
+ "step": 105
1270
+ },
1271
+ {
1272
+ "completion_length": 860.4770202636719,
1273
+ "epoch": 0.5934219734079776,
1274
+ "grad_norm": 0.023035289719700813,
1275
+ "kl": 0.0009222030639648438,
1276
+ "learning_rate": 3e-07,
1277
+ "loss": 0.0,
1278
+ "reward": 0.4827886149287224,
1279
+ "reward_std": 0.2927941419184208,
1280
+ "rewards/random_math_reward": 0.4827886149287224,
1281
+ "step": 106
1282
+ },
1283
+ {
1284
+ "completion_length": 926.4259948730469,
1285
+ "epoch": 0.5990202939118264,
1286
+ "grad_norm": 0.01681309938430786,
1287
+ "kl": 0.0008640289306640625,
1288
+ "learning_rate": 3e-07,
1289
+ "loss": 0.0,
1290
+ "reward": 0.46561466343700886,
1291
+ "reward_std": 0.2794105224311352,
1292
+ "rewards/random_math_reward": 0.46561466343700886,
1293
+ "step": 107
1294
+ },
1295
+ {
1296
+ "completion_length": 899.8622245788574,
1297
+ "epoch": 0.6046186144156753,
1298
+ "grad_norm": 0.010567005723714828,
1299
+ "kl": 0.0010676383972167969,
1300
+ "learning_rate": 3e-07,
1301
+ "loss": 0.0,
1302
+ "reward": 0.4655596222728491,
1303
+ "reward_std": 0.27886725403368473,
1304
+ "rewards/random_math_reward": 0.4655596222728491,
1305
+ "step": 108
1306
+ },
1307
+ {
1308
+ "completion_length": 939.2283058166504,
1309
+ "epoch": 0.6102169349195241,
1310
+ "grad_norm": 0.07191047817468643,
1311
+ "kl": 0.0011771917343139648,
1312
+ "learning_rate": 3e-07,
1313
+ "loss": 0.0,
1314
+ "reward": 0.4728291556239128,
1315
+ "reward_std": 0.2772167157381773,
1316
+ "rewards/random_math_reward": 0.4728291556239128,
1317
+ "step": 109
1318
+ },
1319
+ {
1320
+ "completion_length": 825.8698749542236,
1321
+ "epoch": 0.615815255423373,
1322
+ "grad_norm": 0.02628973312675953,
1323
+ "kl": 0.0011527538299560547,
1324
+ "learning_rate": 3e-07,
1325
+ "loss": 0.0,
1326
+ "reward": 0.48204794339835644,
1327
+ "reward_std": 0.29616999346762896,
1328
+ "rewards/random_math_reward": 0.48204794339835644,
1329
+ "step": 110
1330
+ },
1331
+ {
1332
+ "completion_length": 907.2818717956543,
1333
+ "epoch": 0.6214135759272218,
1334
+ "grad_norm": 0.01321073155850172,
1335
+ "kl": 0.0007474422454833984,
1336
+ "learning_rate": 3e-07,
1337
+ "loss": 0.0,
1338
+ "reward": 0.4843390993773937,
1339
+ "reward_std": 0.2992616593837738,
1340
+ "rewards/random_math_reward": 0.4843390993773937,
1341
+ "step": 111
1342
+ },
1343
+ {
1344
+ "completion_length": 881.9884948730469,
1345
+ "epoch": 0.6270118964310707,
1346
+ "grad_norm": 0.011079943738877773,
1347
+ "kl": 0.0010025501251220703,
1348
+ "learning_rate": 3e-07,
1349
+ "loss": 0.0,
1350
+ "reward": 0.5278591625392437,
1351
+ "reward_std": 0.3031477089971304,
1352
+ "rewards/random_math_reward": 0.5278591625392437,
1353
+ "step": 112
1354
+ },
1355
+ {
1356
+ "completion_length": 889.1198768615723,
1357
+ "epoch": 0.6326102169349195,
1358
+ "grad_norm": 0.015584494918584824,
1359
+ "kl": 0.0008515119552612305,
1360
+ "learning_rate": 3e-07,
1361
+ "loss": 0.0,
1362
+ "reward": 0.5340252239257097,
1363
+ "reward_std": 0.28527406230568886,
1364
+ "rewards/random_math_reward": 0.5340252239257097,
1365
+ "step": 113
1366
+ },
1367
+ {
1368
+ "completion_length": 890.235954284668,
1369
+ "epoch": 0.6382085374387684,
1370
+ "grad_norm": 0.03280128538608551,
1371
+ "kl": 0.000990152359008789,
1372
+ "learning_rate": 3e-07,
1373
+ "loss": 0.0,
1374
+ "reward": 0.5209550634026527,
1375
+ "reward_std": 0.29494220949709415,
1376
+ "rewards/random_math_reward": 0.5209550634026527,
1377
+ "step": 114
1378
+ },
1379
+ {
1380
+ "completion_length": 865.9170799255371,
1381
+ "epoch": 0.6438068579426172,
1382
+ "grad_norm": 0.059104159474372864,
1383
+ "kl": 0.0009453296661376953,
1384
+ "learning_rate": 3e-07,
1385
+ "loss": 0.0,
1386
+ "reward": 0.5109865833073854,
1387
+ "reward_std": 0.29917749017477036,
1388
+ "rewards/random_math_reward": 0.5109865833073854,
1389
+ "step": 115
1390
+ },
1391
+ {
1392
+ "completion_length": 876.934928894043,
1393
+ "epoch": 0.6494051784464661,
1394
+ "grad_norm": 0.01903529092669487,
1395
+ "kl": 0.0008587837219238281,
1396
+ "learning_rate": 3e-07,
1397
+ "loss": 0.0,
1398
+ "reward": 0.500015264376998,
1399
+ "reward_std": 0.29237030632793903,
1400
+ "rewards/random_math_reward": 0.500015264376998,
1401
+ "step": 116
1402
+ },
1403
+ {
1404
+ "completion_length": 846.1517677307129,
1405
+ "epoch": 0.655003498950315,
1406
+ "grad_norm": 0.02293042466044426,
1407
+ "kl": 0.0009784698486328125,
1408
+ "learning_rate": 3e-07,
1409
+ "loss": 0.0,
1410
+ "reward": 0.48975357227027416,
1411
+ "reward_std": 0.2920750202611089,
1412
+ "rewards/random_math_reward": 0.48975357227027416,
1413
+ "step": 117
1414
+ },
1415
+ {
1416
+ "completion_length": 889.9859504699707,
1417
+ "epoch": 0.6606018194541637,
1418
+ "grad_norm": 0.018028084188699722,
1419
+ "kl": 0.0013837814331054688,
1420
+ "learning_rate": 3e-07,
1421
+ "loss": 0.0,
1422
+ "reward": 0.5070291068404913,
1423
+ "reward_std": 0.2938714809715748,
1424
+ "rewards/random_math_reward": 0.5070291068404913,
1425
+ "step": 118
1426
+ },
1427
+ {
1428
+ "completion_length": 891.4043159484863,
1429
+ "epoch": 0.6662001399580126,
1430
+ "grad_norm": 0.011308281682431698,
1431
+ "kl": 0.0008604526519775391,
1432
+ "learning_rate": 3e-07,
1433
+ "loss": 0.0,
1434
+ "reward": 0.4764967616647482,
1435
+ "reward_std": 0.3148739319294691,
1436
+ "rewards/random_math_reward": 0.4764967616647482,
1437
+ "step": 119
1438
+ },
1439
+ {
1440
+ "completion_length": 872.5395317077637,
1441
+ "epoch": 0.6717984604618614,
1442
+ "grad_norm": 0.0414012111723423,
1443
+ "kl": 0.0009286403656005859,
1444
+ "learning_rate": 3e-07,
1445
+ "loss": 0.0,
1446
+ "reward": 0.4626141209155321,
1447
+ "reward_std": 0.3029380030930042,
1448
+ "rewards/random_math_reward": 0.4626141209155321,
1449
+ "step": 120
1450
+ },
1451
+ {
1452
+ "completion_length": 907.8239593505859,
1453
+ "epoch": 0.6773967809657103,
1454
+ "grad_norm": 0.024403268471360207,
1455
+ "kl": 0.0008714199066162109,
1456
+ "learning_rate": 3e-07,
1457
+ "loss": 0.0,
1458
+ "reward": 0.463301295414567,
1459
+ "reward_std": 0.2877810364589095,
1460
+ "rewards/random_math_reward": 0.463301295414567,
1461
+ "step": 121
1462
+ },
1463
+ {
1464
+ "completion_length": 947.8214149475098,
1465
+ "epoch": 0.6829951014695591,
1466
+ "grad_norm": 0.03146574646234512,
1467
+ "kl": 0.0012885332107543945,
1468
+ "learning_rate": 3e-07,
1469
+ "loss": 0.0,
1470
+ "reward": 0.46076902747154236,
1471
+ "reward_std": 0.2793724099174142,
1472
+ "rewards/random_math_reward": 0.46076902747154236,
1473
+ "step": 122
1474
+ },
1475
+ {
1476
+ "completion_length": 909.1071243286133,
1477
+ "epoch": 0.688593421973408,
1478
+ "grad_norm": 0.01003959309309721,
1479
+ "kl": 0.0011519193649291992,
1480
+ "learning_rate": 3e-07,
1481
+ "loss": 0.0,
1482
+ "reward": 0.5079296790063381,
1483
+ "reward_std": 0.29765503481030464,
1484
+ "rewards/random_math_reward": 0.5079296790063381,
1485
+ "step": 123
1486
+ },
1487
+ {
1488
+ "completion_length": 861.1160545349121,
1489
+ "epoch": 0.6941917424772568,
1490
+ "grad_norm": 0.013797848485410213,
1491
+ "kl": 0.0010790824890136719,
1492
+ "learning_rate": 3e-07,
1493
+ "loss": 0.0,
1494
+ "reward": 0.4844306465238333,
1495
+ "reward_std": 0.2945317914709449,
1496
+ "rewards/random_math_reward": 0.4844306465238333,
1497
+ "step": 124
1498
+ },
1499
+ {
1500
+ "completion_length": 863.8520278930664,
1501
+ "epoch": 0.6997900629811057,
1502
+ "grad_norm": 0.012310285121202469,
1503
+ "kl": 0.0011372566223144531,
1504
+ "learning_rate": 3e-07,
1505
+ "loss": 0.0,
1506
+ "reward": 0.4732836168259382,
1507
+ "reward_std": 0.2942813113331795,
1508
+ "rewards/random_math_reward": 0.4732836168259382,
1509
+ "step": 125
1510
+ },
1511
+ {
1512
+ "completion_length": 881.832893371582,
1513
+ "epoch": 0.7053883834849545,
1514
+ "grad_norm": 0.28292930126190186,
1515
+ "kl": 0.0030515193939208984,
1516
+ "learning_rate": 3e-07,
1517
+ "loss": 0.0,
1518
+ "reward": 0.48056939989328384,
1519
+ "reward_std": 0.27455357648432255,
1520
+ "rewards/random_math_reward": 0.48056939989328384,
1521
+ "step": 126
1522
+ },
1523
+ {
1524
+ "completion_length": 855.8303375244141,
1525
+ "epoch": 0.7109867039888034,
1526
+ "grad_norm": 0.026784956455230713,
1527
+ "kl": 0.0010666847229003906,
1528
+ "learning_rate": 3e-07,
1529
+ "loss": 0.0,
1530
+ "reward": 0.47633388079702854,
1531
+ "reward_std": 0.27308547869324684,
1532
+ "rewards/random_math_reward": 0.47633388079702854,
1533
+ "step": 127
1534
+ },
1535
+ {
1536
+ "completion_length": 872.3571243286133,
1537
+ "epoch": 0.7165850244926522,
1538
+ "grad_norm": 0.027628231793642044,
1539
+ "kl": 0.0012966394424438477,
1540
+ "learning_rate": 3e-07,
1541
+ "loss": 0.0,
1542
+ "reward": 0.4544145315885544,
1543
+ "reward_std": 0.2810182133689523,
1544
+ "rewards/random_math_reward": 0.4544145315885544,
1545
+ "step": 128
1546
+ },
1547
+ {
1548
+ "completion_length": 880.8622207641602,
1549
+ "epoch": 0.722183344996501,
1550
+ "grad_norm": 0.011807740665972233,
1551
+ "kl": 0.0009188652038574219,
1552
+ "learning_rate": 3e-07,
1553
+ "loss": 0.0,
1554
+ "reward": 0.4832920003682375,
1555
+ "reward_std": 0.2904686816036701,
1556
+ "rewards/random_math_reward": 0.4832920003682375,
1557
+ "step": 129
1558
+ },
1559
+ {
1560
+ "completion_length": 890.7397804260254,
1561
+ "epoch": 0.72778166550035,
1562
+ "grad_norm": 0.00990369077771902,
1563
+ "kl": 0.0008499622344970703,
1564
+ "learning_rate": 3e-07,
1565
+ "loss": 0.0,
1566
+ "reward": 0.4763722326606512,
1567
+ "reward_std": 0.3048167824745178,
1568
+ "rewards/random_math_reward": 0.4763722326606512,
1569
+ "step": 130
1570
+ },
1571
+ {
1572
+ "completion_length": 881.3290634155273,
1573
+ "epoch": 0.7333799860041987,
1574
+ "grad_norm": 0.042855340987443924,
1575
+ "kl": 0.002415895462036133,
1576
+ "learning_rate": 3e-07,
1577
+ "loss": 0.0,
1578
+ "reward": 0.49098371155560017,
1579
+ "reward_std": 0.28425728902220726,
1580
+ "rewards/random_math_reward": 0.49098371155560017,
1581
+ "step": 131
1582
+ },
1583
+ {
1584
+ "completion_length": 920.1415672302246,
1585
+ "epoch": 0.7389783065080476,
1586
+ "grad_norm": 0.011931121349334717,
1587
+ "kl": 0.0009909868240356445,
1588
+ "learning_rate": 3e-07,
1589
+ "loss": 0.0,
1590
+ "reward": 0.49408636428415775,
1591
+ "reward_std": 0.30311523005366325,
1592
+ "rewards/random_math_reward": 0.49408636428415775,
1593
+ "step": 132
1594
+ },
1595
+ {
1596
+ "completion_length": 928.8456420898438,
1597
+ "epoch": 0.7445766270118964,
1598
+ "grad_norm": 0.11891764402389526,
1599
+ "kl": 0.0010557174682617188,
1600
+ "learning_rate": 3e-07,
1601
+ "loss": 0.0,
1602
+ "reward": 0.47669179551303387,
1603
+ "reward_std": 0.29571292363107204,
1604
+ "rewards/random_math_reward": 0.47669179551303387,
1605
+ "step": 133
1606
+ },
1607
+ {
1608
+ "completion_length": 840.7448806762695,
1609
+ "epoch": 0.7501749475157453,
1610
+ "grad_norm": 0.020906388759613037,
1611
+ "kl": 0.001192331314086914,
1612
+ "learning_rate": 3e-07,
1613
+ "loss": 0.0,
1614
+ "reward": 0.48422789201140404,
1615
+ "reward_std": 0.28880419582128525,
1616
+ "rewards/random_math_reward": 0.48422789201140404,
1617
+ "step": 134
1618
+ },
1619
+ {
1620
+ "completion_length": 924.7793121337891,
1621
+ "epoch": 0.7557732680195941,
1622
+ "grad_norm": 0.015085350722074509,
1623
+ "kl": 0.0009965896606445312,
1624
+ "learning_rate": 3e-07,
1625
+ "loss": 0.0,
1626
+ "reward": 0.48809418082237244,
1627
+ "reward_std": 0.30123256146907806,
1628
+ "rewards/random_math_reward": 0.48809418082237244,
1629
+ "step": 135
1630
+ },
1631
+ {
1632
+ "completion_length": 783.3507499694824,
1633
+ "epoch": 0.761371588523443,
1634
+ "grad_norm": 0.025510285049676895,
1635
+ "kl": 0.0011339187622070312,
1636
+ "learning_rate": 3e-07,
1637
+ "loss": 0.0,
1638
+ "reward": 0.4794926680624485,
1639
+ "reward_std": 0.29374638944864273,
1640
+ "rewards/random_math_reward": 0.4794926680624485,
1641
+ "step": 136
1642
+ },
1643
+ {
1644
+ "completion_length": 886.570140838623,
1645
+ "epoch": 0.7669699090272918,
1646
+ "grad_norm": 0.014409742318093777,
1647
+ "kl": 0.0010912418365478516,
1648
+ "learning_rate": 3e-07,
1649
+ "loss": 0.0,
1650
+ "reward": 0.44775343872606754,
1651
+ "reward_std": 0.27853819355368614,
1652
+ "rewards/random_math_reward": 0.44775343872606754,
1653
+ "step": 137
1654
+ },
1655
+ {
1656
+ "completion_length": 875.3928413391113,
1657
+ "epoch": 0.7725682295311407,
1658
+ "grad_norm": 0.009967821650207043,
1659
+ "kl": 0.001146078109741211,
1660
+ "learning_rate": 3e-07,
1661
+ "loss": 0.0,
1662
+ "reward": 0.4673925694078207,
1663
+ "reward_std": 0.289134263060987,
1664
+ "rewards/random_math_reward": 0.4673925694078207,
1665
+ "step": 138
1666
+ },
1667
+ {
1668
+ "completion_length": 856.2155380249023,
1669
+ "epoch": 0.7781665500349895,
1670
+ "grad_norm": 0.012912734411656857,
1671
+ "kl": 0.0009341239929199219,
1672
+ "learning_rate": 3e-07,
1673
+ "loss": 0.0,
1674
+ "reward": 0.5017714705318213,
1675
+ "reward_std": 0.28041696455329657,
1676
+ "rewards/random_math_reward": 0.5017714705318213,
1677
+ "step": 139
1678
+ },
1679
+ {
1680
+ "completion_length": 894.0050811767578,
1681
+ "epoch": 0.7837648705388384,
1682
+ "grad_norm": 0.08305728435516357,
1683
+ "kl": 0.0018918514251708984,
1684
+ "learning_rate": 3e-07,
1685
+ "loss": 0.0,
1686
+ "reward": 0.5223593860864639,
1687
+ "reward_std": 0.2958631496876478,
1688
+ "rewards/random_math_reward": 0.5223593860864639,
1689
+ "step": 140
1690
+ },
1691
+ {
1692
+ "completion_length": 881.410701751709,
1693
+ "epoch": 0.7893631910426872,
1694
+ "grad_norm": 0.009809349663555622,
1695
+ "kl": 0.0009772777557373047,
1696
+ "learning_rate": 3e-07,
1697
+ "loss": 0.0,
1698
+ "reward": 0.5037720259279013,
1699
+ "reward_std": 0.2979677114635706,
1700
+ "rewards/random_math_reward": 0.5037720259279013,
1701
+ "step": 141
1702
+ },
1703
+ {
1704
+ "completion_length": 840.1135063171387,
1705
+ "epoch": 0.794961511546536,
1706
+ "grad_norm": 0.018646089360117912,
1707
+ "kl": 0.001020193099975586,
1708
+ "learning_rate": 3e-07,
1709
+ "loss": 0.0,
1710
+ "reward": 0.5081987045705318,
1711
+ "reward_std": 0.29782584123313427,
1712
+ "rewards/random_math_reward": 0.5081987045705318,
1713
+ "step": 142
1714
+ },
1715
+ {
1716
+ "completion_length": 876.1058502197266,
1717
+ "epoch": 0.8005598320503848,
1718
+ "grad_norm": 0.024378400295972824,
1719
+ "kl": 0.0011475086212158203,
1720
+ "learning_rate": 3e-07,
1721
+ "loss": 0.0,
1722
+ "reward": 0.4988156743347645,
1723
+ "reward_std": 0.29589168168604374,
1724
+ "rewards/random_math_reward": 0.4988156743347645,
1725
+ "step": 143
1726
+ },
1727
+ {
1728
+ "completion_length": 891.9885063171387,
1729
+ "epoch": 0.8061581525542337,
1730
+ "grad_norm": 0.017726967111229897,
1731
+ "kl": 0.0011355876922607422,
1732
+ "learning_rate": 3e-07,
1733
+ "loss": 0.0,
1734
+ "reward": 0.49985454976558685,
1735
+ "reward_std": 0.2889725724235177,
1736
+ "rewards/random_math_reward": 0.49985454976558685,
1737
+ "step": 144
1738
+ },
1739
+ {
1740
+ "completion_length": 900.8316192626953,
1741
+ "epoch": 0.8117564730580826,
1742
+ "grad_norm": 0.01951918564736843,
1743
+ "kl": 0.0014090538024902344,
1744
+ "learning_rate": 3e-07,
1745
+ "loss": 0.0,
1746
+ "reward": 0.48107675835490227,
1747
+ "reward_std": 0.2852372843772173,
1748
+ "rewards/random_math_reward": 0.48107675835490227,
1749
+ "step": 145
1750
+ },
1751
+ {
1752
+ "completion_length": 899.4731903076172,
1753
+ "epoch": 0.8173547935619314,
1754
+ "grad_norm": 0.021214712411165237,
1755
+ "kl": 0.0010182857513427734,
1756
+ "learning_rate": 3e-07,
1757
+ "loss": 0.0,
1758
+ "reward": 0.4994645491242409,
1759
+ "reward_std": 0.2752206530421972,
1760
+ "rewards/random_math_reward": 0.4994645491242409,
1761
+ "step": 146
1762
+ },
1763
+ {
1764
+ "completion_length": 949.8775329589844,
1765
+ "epoch": 0.8229531140657803,
1766
+ "grad_norm": 0.010702384635806084,
1767
+ "kl": 0.0011107921600341797,
1768
+ "learning_rate": 3e-07,
1769
+ "loss": 0.0,
1770
+ "reward": 0.47683133371174335,
1771
+ "reward_std": 0.2850389126688242,
1772
+ "rewards/random_math_reward": 0.47683133371174335,
1773
+ "step": 147
1774
+ },
1775
+ {
1776
+ "completion_length": 890.5433540344238,
1777
+ "epoch": 0.8285514345696291,
1778
+ "grad_norm": 0.014014041982591152,
1779
+ "kl": 0.0009796619415283203,
1780
+ "learning_rate": 3e-07,
1781
+ "loss": 0.0,
1782
+ "reward": 0.45283058658242226,
1783
+ "reward_std": 0.2967394981533289,
1784
+ "rewards/random_math_reward": 0.45283058658242226,
1785
+ "step": 148
1786
+ },
1787
+ {
1788
+ "completion_length": 854.0229377746582,
1789
+ "epoch": 0.834149755073478,
1790
+ "grad_norm": 0.010618280619382858,
1791
+ "kl": 0.0011134147644042969,
1792
+ "learning_rate": 3e-07,
1793
+ "loss": 0.0,
1794
+ "reward": 0.46891826018691063,
1795
+ "reward_std": 0.2850739639252424,
1796
+ "rewards/random_math_reward": 0.46891826018691063,
1797
+ "step": 149
1798
+ },
1799
+ {
1800
+ "completion_length": 873.9553413391113,
1801
+ "epoch": 0.8397480755773268,
1802
+ "grad_norm": 0.03690403327345848,
1803
+ "kl": 0.0010218620300292969,
1804
+ "learning_rate": 3e-07,
1805
+ "loss": 0.0,
1806
+ "reward": 0.5085400156676769,
1807
+ "reward_std": 0.2830808274447918,
1808
+ "rewards/random_math_reward": 0.5085400156676769,
1809
+ "step": 150
1810
+ },
1811
+ {
1812
+ "completion_length": 927.468090057373,
1813
+ "epoch": 0.8453463960811757,
1814
+ "grad_norm": 0.017428407445549965,
1815
+ "kl": 0.000989675521850586,
1816
+ "learning_rate": 3e-07,
1817
+ "loss": 0.0,
1818
+ "reward": 0.4990678243339062,
1819
+ "reward_std": 0.2827162565663457,
1820
+ "rewards/random_math_reward": 0.4990678243339062,
1821
+ "step": 151
1822
+ },
1823
+ {
1824
+ "completion_length": 910.8379859924316,
1825
+ "epoch": 0.8509447165850245,
1826
+ "grad_norm": 0.017853064462542534,
1827
+ "kl": 0.0009360313415527344,
1828
+ "learning_rate": 3e-07,
1829
+ "loss": 0.0,
1830
+ "reward": 0.46421765722334385,
1831
+ "reward_std": 0.2790751438587904,
1832
+ "rewards/random_math_reward": 0.46421765722334385,
1833
+ "step": 152
1834
+ },
1835
+ {
1836
+ "completion_length": 886.9323768615723,
1837
+ "epoch": 0.8565430370888734,
1838
+ "grad_norm": 0.018860990181565285,
1839
+ "kl": 0.0011188983917236328,
1840
+ "learning_rate": 3e-07,
1841
+ "loss": 0.0,
1842
+ "reward": 0.47205423563718796,
1843
+ "reward_std": 0.28594095539301634,
1844
+ "rewards/random_math_reward": 0.47205423563718796,
1845
+ "step": 153
1846
+ },
1847
+ {
1848
+ "completion_length": 848.5089149475098,
1849
+ "epoch": 0.8621413575927221,
1850
+ "grad_norm": 0.020478500053286552,
1851
+ "kl": 0.0010101795196533203,
1852
+ "learning_rate": 3e-07,
1853
+ "loss": 0.0,
1854
+ "reward": 0.4954501297324896,
1855
+ "reward_std": 0.28822094202041626,
1856
+ "rewards/random_math_reward": 0.4954501297324896,
1857
+ "step": 154
1858
+ },
1859
+ {
1860
+ "completion_length": 873.7601776123047,
1861
+ "epoch": 0.867739678096571,
1862
+ "grad_norm": 0.014679434709250927,
1863
+ "kl": 0.001070261001586914,
1864
+ "learning_rate": 3e-07,
1865
+ "loss": 0.0,
1866
+ "reward": 0.48937808722257614,
1867
+ "reward_std": 0.28821886610239744,
1868
+ "rewards/random_math_reward": 0.48937808722257614,
1869
+ "step": 155
1870
+ },
1871
+ {
1872
+ "completion_length": 887.0267677307129,
1873
+ "epoch": 0.8733379986004198,
1874
+ "grad_norm": 0.053716812282800674,
1875
+ "kl": 0.0013840198516845703,
1876
+ "learning_rate": 3e-07,
1877
+ "loss": 0.0,
1878
+ "reward": 0.48024408891797066,
1879
+ "reward_std": 0.30272081680595875,
1880
+ "rewards/random_math_reward": 0.48024408891797066,
1881
+ "step": 156
1882
+ },
1883
+ {
1884
+ "completion_length": 910.788257598877,
1885
+ "epoch": 0.8789363191042687,
1886
+ "grad_norm": 0.03143594413995743,
1887
+ "kl": 0.0013470649719238281,
1888
+ "learning_rate": 3e-07,
1889
+ "loss": 0.0,
1890
+ "reward": 0.49199927039444447,
1891
+ "reward_std": 0.30107220634818077,
1892
+ "rewards/random_math_reward": 0.49199927039444447,
1893
+ "step": 157
1894
+ },
1895
+ {
1896
+ "completion_length": 844.5101852416992,
1897
+ "epoch": 0.8845346396081175,
1898
+ "grad_norm": 0.031151605769991875,
1899
+ "kl": 0.001325845718383789,
1900
+ "learning_rate": 3e-07,
1901
+ "loss": 0.0,
1902
+ "reward": 0.4829933065921068,
1903
+ "reward_std": 0.2935031168162823,
1904
+ "rewards/random_math_reward": 0.4829933065921068,
1905
+ "step": 158
1906
+ },
1907
+ {
1908
+ "completion_length": 911.2563629150391,
1909
+ "epoch": 0.8901329601119664,
1910
+ "grad_norm": 0.010499164462089539,
1911
+ "kl": 0.0009677410125732422,
1912
+ "learning_rate": 3e-07,
1913
+ "loss": 0.0,
1914
+ "reward": 0.46660212986171246,
1915
+ "reward_std": 0.30480547808110714,
1916
+ "rewards/random_math_reward": 0.46660212986171246,
1917
+ "step": 159
1918
+ },
1919
+ {
1920
+ "completion_length": 867.9106979370117,
1921
+ "epoch": 0.8957312806158153,
1922
+ "grad_norm": 0.039778802543878555,
1923
+ "kl": 0.0010061264038085938,
1924
+ "learning_rate": 3e-07,
1925
+ "loss": 0.0,
1926
+ "reward": 0.46561445854604244,
1927
+ "reward_std": 0.2867754641920328,
1928
+ "rewards/random_math_reward": 0.46561445854604244,
1929
+ "step": 160
1930
+ },
1931
+ {
1932
+ "completion_length": 891.145378112793,
1933
+ "epoch": 0.9013296011196641,
1934
+ "grad_norm": 0.013824980705976486,
1935
+ "kl": 0.0010936260223388672,
1936
+ "learning_rate": 3e-07,
1937
+ "loss": 0.0,
1938
+ "reward": 0.4558851607143879,
1939
+ "reward_std": 0.277794330380857,
1940
+ "rewards/random_math_reward": 0.4558851607143879,
1941
+ "step": 161
1942
+ },
1943
+ {
1944
+ "completion_length": 869.1186027526855,
1945
+ "epoch": 0.906927921623513,
1946
+ "grad_norm": 0.01410576980561018,
1947
+ "kl": 0.0011093616485595703,
1948
+ "learning_rate": 3e-07,
1949
+ "loss": 0.0,
1950
+ "reward": 0.48392108641564846,
1951
+ "reward_std": 0.2871867660433054,
1952
+ "rewards/random_math_reward": 0.48392108641564846,
1953
+ "step": 162
1954
+ },
1955
+ {
1956
+ "completion_length": 884.0267677307129,
1957
+ "epoch": 0.9125262421273618,
1958
+ "grad_norm": 0.01321236602962017,
1959
+ "kl": 0.0009493827819824219,
1960
+ "learning_rate": 3e-07,
1961
+ "loss": 0.0,
1962
+ "reward": 0.5128381866961718,
1963
+ "reward_std": 0.30100576020777225,
1964
+ "rewards/random_math_reward": 0.5128381866961718,
1965
+ "step": 163
1966
+ },
1967
+ {
1968
+ "completion_length": 813.4183502197266,
1969
+ "epoch": 0.9181245626312107,
1970
+ "grad_norm": 0.02240627259016037,
1971
+ "kl": 0.0012166500091552734,
1972
+ "learning_rate": 3e-07,
1973
+ "loss": 0.0,
1974
+ "reward": 0.5056673623621464,
1975
+ "reward_std": 0.30340027436614037,
1976
+ "rewards/random_math_reward": 0.5056673623621464,
1977
+ "step": 164
1978
+ },
1979
+ {
1980
+ "completion_length": 866.1339149475098,
1981
+ "epoch": 0.9237228831350595,
1982
+ "grad_norm": 0.01877676323056221,
1983
+ "kl": 0.000985860824584961,
1984
+ "learning_rate": 3e-07,
1985
+ "loss": 0.0,
1986
+ "reward": 0.4957231916487217,
1987
+ "reward_std": 0.28768617659807205,
1988
+ "rewards/random_math_reward": 0.4957231916487217,
1989
+ "step": 165
1990
+ },
1991
+ {
1992
+ "completion_length": 934.8022727966309,
1993
+ "epoch": 0.9293212036389084,
1994
+ "grad_norm": 0.03419911116361618,
1995
+ "kl": 0.001092672348022461,
1996
+ "learning_rate": 3e-07,
1997
+ "loss": 0.0,
1998
+ "reward": 0.4862586259841919,
1999
+ "reward_std": 0.28661127388477325,
2000
+ "rewards/random_math_reward": 0.4862586259841919,
2001
+ "step": 166
2002
+ },
2003
+ {
2004
+ "completion_length": 845.2384948730469,
2005
+ "epoch": 0.9349195241427571,
2006
+ "grad_norm": 0.016880100592970848,
2007
+ "kl": 0.0010738372802734375,
2008
+ "learning_rate": 3e-07,
2009
+ "loss": 0.0,
2010
+ "reward": 0.48645939491689205,
2011
+ "reward_std": 0.2838729955255985,
2012
+ "rewards/random_math_reward": 0.48645939491689205,
2013
+ "step": 167
2014
+ },
2015
+ {
2016
+ "completion_length": 897.2448921203613,
2017
+ "epoch": 0.940517844646606,
2018
+ "grad_norm": 0.03305617719888687,
2019
+ "kl": 0.0013968944549560547,
2020
+ "learning_rate": 3e-07,
2021
+ "loss": 0.0,
2022
+ "reward": 0.4894435331225395,
2023
+ "reward_std": 0.2995728589594364,
2024
+ "rewards/random_math_reward": 0.4894435331225395,
2025
+ "step": 168
2026
+ },
2027
+ {
2028
+ "completion_length": 870.5650291442871,
2029
+ "epoch": 0.9461161651504548,
2030
+ "grad_norm": 0.009322837926447392,
2031
+ "kl": 0.0010249614715576172,
2032
+ "learning_rate": 3e-07,
2033
+ "loss": 0.0,
2034
+ "reward": 0.5089151151478291,
2035
+ "reward_std": 0.30756222270429134,
2036
+ "rewards/random_math_reward": 0.5089151151478291,
2037
+ "step": 169
2038
+ },
2039
+ {
2040
+ "completion_length": 854.2805862426758,
2041
+ "epoch": 0.9517144856543037,
2042
+ "grad_norm": 0.0156830083578825,
2043
+ "kl": 0.000896453857421875,
2044
+ "learning_rate": 3e-07,
2045
+ "loss": 0.0,
2046
+ "reward": 0.4866387601941824,
2047
+ "reward_std": 0.29632874485105276,
2048
+ "rewards/random_math_reward": 0.4866387601941824,
2049
+ "step": 170
2050
+ },
2051
+ {
2052
+ "completion_length": 835.1109504699707,
2053
+ "epoch": 0.9573128061581525,
2054
+ "grad_norm": 0.04610283300280571,
2055
+ "kl": 0.0012865066528320312,
2056
+ "learning_rate": 3e-07,
2057
+ "loss": 0.0,
2058
+ "reward": 0.4992235377430916,
2059
+ "reward_std": 0.2944375704973936,
2060
+ "rewards/random_math_reward": 0.4992235377430916,
2061
+ "step": 171
2062
+ },
2063
+ {
2064
+ "completion_length": 885.7053337097168,
2065
+ "epoch": 0.9629111266620014,
2066
+ "grad_norm": 0.02425132691860199,
2067
+ "kl": 0.0011935234069824219,
2068
+ "learning_rate": 3e-07,
2069
+ "loss": 0.0,
2070
+ "reward": 0.5038886945694685,
2071
+ "reward_std": 0.28100786730647087,
2072
+ "rewards/random_math_reward": 0.5038886945694685,
2073
+ "step": 172
2074
+ },
2075
+ {
2076
+ "completion_length": 919.9604415893555,
2077
+ "epoch": 0.9685094471658502,
2078
+ "grad_norm": 0.011359083466231823,
2079
+ "kl": 0.0011434555053710938,
2080
+ "learning_rate": 3e-07,
2081
+ "loss": 0.0,
2082
+ "reward": 0.4733607657253742,
2083
+ "reward_std": 0.2893592659384012,
2084
+ "rewards/random_math_reward": 0.4733607657253742,
2085
+ "step": 173
2086
+ },
2087
+ {
2088
+ "completion_length": 816.8494758605957,
2089
+ "epoch": 0.9741077676696991,
2090
+ "grad_norm": 0.0354762077331543,
2091
+ "kl": 0.0011744499206542969,
2092
+ "learning_rate": 3e-07,
2093
+ "loss": 0.0,
2094
+ "reward": 0.47691468335688114,
2095
+ "reward_std": 0.28350450936704874,
2096
+ "rewards/random_math_reward": 0.47691468335688114,
2097
+ "step": 174
2098
+ },
2099
+ {
2100
+ "completion_length": 837.5076293945312,
2101
+ "epoch": 0.979706088173548,
2102
+ "grad_norm": 0.030416017398238182,
2103
+ "kl": 0.0010030269622802734,
2104
+ "learning_rate": 3e-07,
2105
+ "loss": 0.0,
2106
+ "reward": 0.49260007217526436,
2107
+ "reward_std": 0.265430293045938,
2108
+ "rewards/random_math_reward": 0.49260007217526436,
2109
+ "step": 175
2110
+ },
2111
+ {
2112
+ "completion_length": 887.9668235778809,
2113
+ "epoch": 0.9853044086773968,
2114
+ "grad_norm": 0.013739695772528648,
2115
+ "kl": 0.0010192394256591797,
2116
+ "learning_rate": 3e-07,
2117
+ "loss": 0.0,
2118
+ "reward": 0.526423504576087,
2119
+ "reward_std": 0.2828856185078621,
2120
+ "rewards/random_math_reward": 0.526423504576087,
2121
+ "step": 176
2122
+ },
2123
+ {
2124
+ "completion_length": 837.7933464050293,
2125
+ "epoch": 0.9909027291812457,
2126
+ "grad_norm": 0.014401717111468315,
2127
+ "kl": 0.0010166168212890625,
2128
+ "learning_rate": 3e-07,
2129
+ "loss": 0.0,
2130
+ "reward": 0.5026168543845415,
2131
+ "reward_std": 0.2906886078417301,
2132
+ "rewards/random_math_reward": 0.5026168543845415,
2133
+ "step": 177
2134
+ },
2135
+ {
2136
+ "completion_length": 915.8877296447754,
2137
+ "epoch": 0.9965010496850945,
2138
+ "grad_norm": 0.008807230740785599,
2139
+ "kl": 0.0009367465972900391,
2140
+ "learning_rate": 3e-07,
2141
+ "loss": 0.0,
2142
+ "reward": 0.4757802300155163,
2143
+ "reward_std": 0.29675872810184956,
2144
+ "rewards/random_math_reward": 0.4757802300155163,
2145
+ "step": 178
2146
+ },
2147
+ {
2148
+ "epoch": 0.9965010496850945,
2149
+ "step": 178,
2150
+ "total_flos": 0.0,
2151
+ "train_loss": 8.003707675146196e-08,
2152
+ "train_runtime": 99592.7903,
2153
+ "train_samples_per_second": 0.201,
2154
+ "train_steps_per_second": 0.002
2155
+ }
2156
+ ],
2157
+ "logging_steps": 1,
2158
+ "max_steps": 178,
2159
+ "num_input_tokens_seen": 0,
2160
+ "num_train_epochs": 1,
2161
+ "save_steps": 10,
2162
+ "stateful_callbacks": {
2163
+ "TrainerControl": {
2164
+ "args": {
2165
+ "should_epoch_stop": false,
2166
+ "should_evaluate": false,
2167
+ "should_log": false,
2168
+ "should_save": true,
2169
+ "should_training_stop": true
2170
+ },
2171
+ "attributes": {}
2172
+ }
2173
+ },
2174
+ "total_flos": 0.0,
2175
+ "train_batch_size": 1,
2176
+ "trial_name": null,
2177
+ "trial_params": null
2178
+ }
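
The `log_history` in `trainer_state.json` above records the per-step GRPO metrics for all 178 steps (mean `reward`, `reward_std`, `kl`, `completion_length`, `learning_rate`, `loss`). A minimal sketch for inspecting those curves, assuming the file has been downloaded locally as `trainer_state.json` and that `matplotlib` is installed (the path, the plotted metrics, and the output filename are illustrative choices, not part of the training setup):

```python
import json

import matplotlib.pyplot as plt

# Load the trainer state saved alongside the checkpoint.
# NOTE: the path is an assumption for illustration; point it at the
# trainer_state.json from this repository.
with open("trainer_state.json") as f:
    state = json.load(f)

# Each per-step entry in log_history carries the GRPO metrics
# (reward, reward_std, kl, completion_length, ...). The final entry is
# the train summary and has no "reward" key, so it is skipped here.
logs = [e for e in state["log_history"] if "reward" in e]
steps = [e["step"] for e in logs]
rewards = [e["reward"] for e in logs]
kls = [e["kl"] for e in logs]

fig, (ax_reward, ax_kl) = plt.subplots(2, 1, sharex=True, figsize=(8, 6))
ax_reward.plot(steps, rewards)
ax_reward.set_ylabel("mean reward")
ax_kl.plot(steps, kls)
ax_kl.set_ylabel("KL")
ax_kl.set_xlabel("step")
fig.tight_layout()
fig.savefig("grpo_training_curves.png")
```

In the portion of the log shown above (steps 121–178), the mean reward stays roughly between 0.45 and 0.53 while the KL term remains on the order of 1e-3.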