chansung committed · verified
Commit 3da6413 · Parent(s): 0072320

Model save
README.md ADDED
@@ -0,0 +1,78 @@
---
library_name: peft
license: llama3.2
base_model: meta-llama/Llama-3.2-3B
tags:
- trl
- sft
- generated_from_trainer
datasets:
- generator
model-index:
- name: llama3.1-3b-coding-gpt4o-100k2
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# llama3.1-3b-coding-gpt4o-100k2

This model is a fine-tuned version of [meta-llama/Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) on the generator dataset.
It achieves the following results on the evaluation set:
- Loss: 1.6301

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.002
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 256
- total_eval_batch_size: 128
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 10
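
The effective batch size and learning-rate schedule implied by these hyperparameters can be cross-checked against the training log in `trainer_state.json`. A minimal sketch, assuming the standard linear-warmup + cosine schedule and `warmup_steps = ceil(0.1 * 670) = 67` (an assumption, but consistent with the logged learning rate at step 1, ≈2.985e-05 = 0.002/67):

```python
import math

# Values from the hyperparameter list above.
PEAK_LR = 0.002
TRAIN_BS, GRAD_ACCUM, NUM_DEVICES = 16, 2, 8
MAX_STEPS, WARMUP_RATIO = 670, 0.1

# Effective (total) train batch size: per-device batch * accumulation steps * devices.
total_train_batch_size = TRAIN_BS * GRAD_ACCUM * NUM_DEVICES  # 256

# Warmup steps, assuming the usual ceil(ratio * max_steps) rounding.
warmup_steps = math.ceil(WARMUP_RATIO * MAX_STEPS)  # 67

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to 0. A sketch of the
    `cosine` schedule, not the library implementation itself."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (MAX_STEPS - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(total_train_batch_size)  # 256, matching total_train_batch_size above
print(lr_at(1))                # ~2.985e-05, matching the first logged learning_rate
print(lr_at(670))              # 0.0, matching the final logged learning_rate
```

This also explains why the logged learning rate peaks just under 0.002 around step 70 and reaches exactly 0.0 at step 670.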

### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 1.0031        | 1.0    | 68   | 1.5510          |
| 0.9546        | 2.0    | 136  | 1.5149          |
| 0.936         | 3.0    | 204  | 1.5085          |
| 0.9186        | 4.0    | 272  | 1.5175          |
| 0.8948        | 5.0    | 340  | 1.5302          |
| 0.8742        | 6.0    | 408  | 1.5502          |
| 0.8556        | 7.0    | 476  | 1.5617          |
| 0.8428        | 8.0    | 544  | 1.5965          |
| 0.8168        | 9.0    | 612  | 1.6217          |
| 0.8191        | 9.8593 | 670  | 1.6301          |


### Framework versions

- PEFT 0.15.1
- Transformers 4.50.3
- Pytorch 2.6.0+cu124
- Datasets 3.5.0
- Tokenizers 0.21.1
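
Since this commit ships a PEFT (LoRA) adapter rather than full model weights (`library_name: peft`), it would typically be loaded on top of the base model. A hedged usage sketch: the repo id `chansung/llama3.1-3b-coding-gpt4o-100k2` is assumed from the committer and model name, and access to the gated base model is required.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.2-3B"
ADAPTER = "chansung/llama3.1-3b-coding-gpt4o-100k2"  # assumed repo id

# Load the base model, then attach the adapter weights from this commit.
base_model = AutoModelForCausalLM.from_pretrained(BASE)
model = PeftModel.from_pretrained(base_model, ADAPTER)
tokenizer = AutoTokenizer.from_pretrained(BASE)
```

`PeftModel.from_pretrained` keeps the adapter separate; `model.merge_and_unload()` would fold it into the base weights if a standalone model is preferred.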
adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:37b5a064c89f2c05e500778b82c18fa3b5b228097593fbba33ba3e96358a102a
+ oid sha256:68e6c34ff5b83d4da958c29a94d06b20fa3f6c199f168e1e0b501445b8a3d7a3
  size 1612749744
all_results.json ADDED
@@ -0,0 +1,9 @@
{
    "epoch": 9.85925925925926,
    "total_flos": 2.9601022627828204e+18,
    "train_loss": 0.9062140895359552,
    "train_runtime": 3484.2972,
    "train_samples": 116368,
    "train_samples_per_second": 49.516,
    "train_steps_per_second": 0.192
}
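
The throughput figures above are internally consistent: steps-per-second times runtime recovers the total step count, and samples-per-second divided by steps-per-second roughly recovers the effective batch size of 256. A quick sanity-check sketch over the reported values:

```python
# Values copied from all_results.json above.
results = {
    "train_runtime": 3484.2972,           # seconds
    "train_samples_per_second": 49.516,
    "train_steps_per_second": 0.192,
}

# Total optimizer steps implied by the reported throughput.
implied_steps = results["train_steps_per_second"] * results["train_runtime"]

# Samples consumed per optimizer step, i.e. the effective batch size.
implied_batch = (results["train_samples_per_second"]
                 / results["train_steps_per_second"])

print(round(implied_steps))  # ~669, close to the 670 steps in trainer_state.json
print(round(implied_batch))  # ~258, close to total_train_batch_size = 256
```

The small discrepancies come from rounding in the reported rates.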
runs/Apr01_02-59-08_green-face-echoes-fin-01/events.out.tfevents.1743476556.green-face-echoes-fin-01.64054.0 CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:86ed4e1cd866847063d76493e667ab41b12a09fc8ada3299d9420de083c2d0a5
- size 33856
+ oid sha256:32e81dadfde0db11eeefe569842f22d0c579cff25c34b4af794fd35d7ed548fb
+ size 37706
train_results.json ADDED
@@ -0,0 +1,9 @@
{
    "epoch": 9.85925925925926,
    "total_flos": 2.9601022627828204e+18,
    "train_loss": 0.9062140895359552,
    "train_runtime": 3484.2972,
    "train_samples": 116368,
    "train_samples_per_second": 49.516,
    "train_steps_per_second": 0.192
}
trainer_state.json ADDED
@@ -0,0 +1,1068 @@
{
  "best_global_step": null,
  "best_metric": null,
  "best_model_checkpoint": null,
  "epoch": 9.85925925925926,
  "eval_steps": 500,
  "global_step": 670,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,
  "log_history": [
    {"epoch": 0.014814814814814815, "grad_norm": 0.4682641327381134, "learning_rate": 2.9850746268656717e-05, "loss": 1.4595, "step": 1},
    {"epoch": 0.07407407407407407, "grad_norm": 0.30114030838012695, "learning_rate": 0.00014925373134328358, "loss": 1.4529, "step": 5},
    {"epoch": 0.14814814814814814, "grad_norm": 0.2646062970161438, "learning_rate": 0.00029850746268656717, "loss": 1.3781, "step": 10},
    {"epoch": 0.2222222222222222, "grad_norm": 0.2039109170436859, "learning_rate": 0.00044776119402985075, "loss": 1.2598, "step": 15},
    {"epoch": 0.2962962962962963, "grad_norm": 0.12383515387773514, "learning_rate": 0.0005970149253731343, "loss": 1.1834, "step": 20},
    {"epoch": 0.37037037037037035, "grad_norm": 0.1035536378622055, "learning_rate": 0.0007462686567164179, "loss": 1.1305, "step": 25},
    {"epoch": 0.4444444444444444, "grad_norm": 0.09090688824653625, "learning_rate": 0.0008955223880597015, "loss": 1.0904, "step": 30},
    {"epoch": 0.5185185185185185, "grad_norm": 0.1432366669178009, "learning_rate": 0.001044776119402985, "loss": 1.0762, "step": 35},
    {"epoch": 0.5925925925925926, "grad_norm": 0.07171270251274109, "learning_rate": 0.0011940298507462687, "loss": 1.0618, "step": 40},
    {"epoch": 0.6666666666666666, "grad_norm": 0.07491806894540787, "learning_rate": 0.0013432835820895524, "loss": 1.0421, "step": 45},
    {"epoch": 0.7407407407407407, "grad_norm": 0.06790623813867569, "learning_rate": 0.0014925373134328358, "loss": 1.0302, "step": 50},
    {"epoch": 0.8148148148148148, "grad_norm": 0.08844709396362305, "learning_rate": 0.0016417910447761195, "loss": 1.018, "step": 55},
    {"epoch": 0.8888888888888888, "grad_norm": 0.08857131749391556, "learning_rate": 0.001791044776119403, "loss": 1.0156, "step": 60},
    {"epoch": 0.9629629629629629, "grad_norm": 0.09674689918756485, "learning_rate": 0.0019402985074626867, "loss": 1.0031, "step": 65},
    {"epoch": 1.0, "eval_loss": 1.551000714302063, "eval_runtime": 0.869, "eval_samples_per_second": 4.603, "eval_steps_per_second": 1.151, "step": 68},
    {"epoch": 1.0296296296296297, "grad_norm": 0.09539825469255447, "learning_rate": 0.001999877856940653, "loss": 0.9937, "step": 70},
    {"epoch": 1.1037037037037036, "grad_norm": 0.09851890057325363, "learning_rate": 0.0019991315351855746, "loss": 0.9895, "step": 75},
    {"epoch": 1.1777777777777778, "grad_norm": 0.06911145895719528, "learning_rate": 0.0019977072547317748, "loss": 0.9817, "step": 80},
    {"epoch": 1.2518518518518518, "grad_norm": 0.06769894808530807, "learning_rate": 0.001995605982021898, "loss": 0.9762, "step": 85},
    {"epoch": 1.325925925925926, "grad_norm": 0.06828448921442032, "learning_rate": 0.001992829142870326, "loss": 0.9743, "step": 90},
    {"epoch": 1.4, "grad_norm": 0.06951478868722916, "learning_rate": 0.0019893786214956943, "loss": 0.9743, "step": 95},
    {"epoch": 1.474074074074074, "grad_norm": 0.06752126663923264, "learning_rate": 0.001985256759242359, "loss": 0.9718, "step": 100},
    {"epoch": 1.5481481481481483, "grad_norm": 0.06669533252716064, "learning_rate": 0.0019804663529916825, "loss": 0.9743, "step": 105},
    {"epoch": 1.6222222222222222, "grad_norm": 0.06977611780166626, "learning_rate": 0.001975010653264216, "loss": 0.9678, "step": 110},
    {"epoch": 1.6962962962962962, "grad_norm": 0.07217196375131607, "learning_rate": 0.0019688933620140635, "loss": 0.9694, "step": 115},
    {"epoch": 1.7703703703703704, "grad_norm": 0.06247986480593681, "learning_rate": 0.0019621186301169314, "loss": 0.9625, "step": 120},
    {"epoch": 1.8444444444444446, "grad_norm": 0.07415565848350525, "learning_rate": 0.001954691054553556, "loss": 0.9697, "step": 125},
    {"epoch": 1.9185185185185185, "grad_norm": 0.07004866003990173, "learning_rate": 0.0019466156752904343, "loss": 0.957, "step": 130},
    {"epoch": 1.9925925925925925, "grad_norm": 0.06320279091596603, "learning_rate": 0.0019378979718599645, "loss": 0.9546, "step": 135},
    {"epoch": 2.0, "eval_loss": 1.5149173736572266, "eval_runtime": 0.8697, "eval_samples_per_second": 4.599, "eval_steps_per_second": 1.15, "step": 136},
    {"epoch": 2.0592592592592593, "grad_norm": 0.07447217404842377, "learning_rate": 0.0019285438596423204, "loss": 0.9443, "step": 140},
    {"epoch": 2.1333333333333333, "grad_norm": 0.06741169095039368, "learning_rate": 0.0019185596858515798, "loss": 0.9371, "step": 145},
    {"epoch": 2.2074074074074073, "grad_norm": 0.06852757930755615, "learning_rate": 0.0019079522252288387, "loss": 0.9395, "step": 150},
    {"epoch": 2.2814814814814817, "grad_norm": 0.06586603075265884, "learning_rate": 0.0018967286754452213, "loss": 0.937, "step": 155},
    {"epoch": 2.3555555555555556, "grad_norm": 0.0683656558394432, "learning_rate": 0.0018848966522179167, "loss": 0.9336, "step": 160},
    {"epoch": 2.4296296296296296, "grad_norm": 0.07259602099657059, "learning_rate": 0.001872464184142548, "loss": 0.935, "step": 165},
    {"epoch": 2.5037037037037035, "grad_norm": 0.06436455249786377, "learning_rate": 0.0018594397072453856, "loss": 0.9316, "step": 170},
    {"epoch": 2.5777777777777775, "grad_norm": 0.08042966574430466, "learning_rate": 0.0018458320592590974, "loss": 0.938, "step": 175},
    {"epoch": 2.651851851851852, "grad_norm": 0.0699801966547966, "learning_rate": 0.0018316504736259254, "loss": 0.9422, "step": 180},
    {"epoch": 2.725925925925926, "grad_norm": 0.06373833864927292, "learning_rate": 0.0018169045732323492, "loss": 0.9348, "step": 185},
    {"epoch": 2.8, "grad_norm": 0.07165364176034927, "learning_rate": 0.0018016043638794975, "loss": 0.9354, "step": 190},
    {"epoch": 2.8740740740740742, "grad_norm": 0.06121128425002098, "learning_rate": 0.0017857602274937308, "loss": 0.9386, "step": 195},
    {"epoch": 2.948148148148148, "grad_norm": 0.06334740668535233, "learning_rate": 0.0017693829150820068, "loss": 0.936, "step": 200},
    {"epoch": 3.0, "eval_loss": 1.508521318435669, "eval_runtime": 0.8697, "eval_samples_per_second": 4.599, "eval_steps_per_second": 1.15, "step": 204},
    {"epoch": 3.0148148148148146, "grad_norm": 0.07033156603574753, "learning_rate": 0.0017524835394368066, "loss": 0.9317, "step": 205},
    {"epoch": 3.088888888888889, "grad_norm": 0.06662800908088684, "learning_rate": 0.0017350735675955695, "loss": 0.9145, "step": 210},
    {"epoch": 3.162962962962963, "grad_norm": 0.06688813865184784, "learning_rate": 0.001717164813059761, "loss": 0.9094, "step": 215},
    {"epoch": 3.237037037037037, "grad_norm": 0.07399953156709671, "learning_rate": 0.0016987694277788418, "loss": 0.9147, "step": 220},
    {"epoch": 3.311111111111111, "grad_norm": 0.06779713183641434, "learning_rate": 0.0016798998939045893, "loss": 0.9123, "step": 225},
    {"epoch": 3.3851851851851853, "grad_norm": 0.06676509976387024, "learning_rate": 0.001660569015321357, "loss": 0.9099, "step": 230},
    {"epoch": 3.4592592592592593, "grad_norm": 0.06683938950300217, "learning_rate": 0.001640789908958026, "loss": 0.9112, "step": 235},
    {"epoch": 3.533333333333333, "grad_norm": 0.06712319701910019, "learning_rate": 0.001620575995887538, "loss": 0.914, "step": 240},
    {"epoch": 3.6074074074074076, "grad_norm": 0.06718605011701584, "learning_rate": 0.001599940992220053, "loss": 0.9156, "step": 245},
    {"epoch": 3.6814814814814816, "grad_norm": 0.06765800714492798, "learning_rate": 0.0015788988997959114, "loss": 0.9168, "step": 250},
    {"epoch": 3.7555555555555555, "grad_norm": 0.06374535709619522, "learning_rate": 0.0015574639966847127, "loss": 0.9114, "step": 255},
    {"epoch": 3.8296296296296295, "grad_norm": 0.06388971954584122, "learning_rate": 0.0015356508274969594, "loss": 0.9139, "step": 260},
    {"epoch": 3.9037037037037035, "grad_norm": 0.0656428337097168, "learning_rate": 0.0015134741935148419, "loss": 0.916, "step": 265},
    {"epoch": 3.977777777777778, "grad_norm": 0.06783714145421982, "learning_rate": 0.0014909491426488577, "loss": 0.9186, "step": 270},
    {"epoch": 4.0, "eval_loss": 1.517486810684204, "eval_runtime": 0.8755, "eval_samples_per_second": 4.569, "eval_steps_per_second": 1.142, "step": 272},
    {"epoch": 4.044444444444444, "grad_norm": 0.06940994411706924, "learning_rate": 0.001468090959227082, "loss": 0.9011, "step": 275},
    {"epoch": 4.118518518518519, "grad_norm": 0.06819378584623337, "learning_rate": 0.0014449151536240167, "loss": 0.8866, "step": 280},
    {"epoch": 4.192592592592592, "grad_norm": 0.0655524805188179, "learning_rate": 0.0014214374517360576, "loss": 0.8916, "step": 285},
    {"epoch": 4.266666666666667, "grad_norm": 0.06668845564126968, "learning_rate": 0.0013976737843107202, "loss": 0.8871, "step": 290},
    {"epoch": 4.340740740740741, "grad_norm": 0.06470604240894318, "learning_rate": 0.0013736402761368597, "loss": 0.8928, "step": 295},
    {"epoch": 4.4148148148148145, "grad_norm": 0.06732232868671417, "learning_rate": 0.0013493532351032318, "loss": 0.8985, "step": 300},
    {"epoch": 4.488888888888889, "grad_norm": 0.0662841871380806, "learning_rate": 0.0013248291411328047, "loss": 0.8869, "step": 305},
    {"epoch": 4.562962962962963, "grad_norm": 0.06613945215940475, "learning_rate": 0.001300084635000341, "loss": 0.8963, "step": 310},
    {"epoch": 4.637037037037037, "grad_norm": 0.06735741347074509, "learning_rate": 0.0012751365070408334, "loss": 0.9035, "step": 315},
    {"epoch": 4.711111111111111, "grad_norm": 0.06463445723056793, "learning_rate": 0.0012500016857564585, "loss": 0.8966, "step": 320},
    {"epoch": 4.785185185185185, "grad_norm": 0.06602155417203903, "learning_rate": 0.0012246972263297718, "loss": 0.895, "step": 325},
    {"epoch": 4.859259259259259, "grad_norm": 0.06352429836988449, "learning_rate": 0.0011992402990509514, "loss": 0.894, "step": 330},
    {"epoch": 4.933333333333334, "grad_norm": 0.06808946281671524, "learning_rate": 0.0011736481776669307, "loss": 0.8969, "step": 335},
    {"epoch": 5.0, "grad_norm": 0.08401331305503845, "learning_rate": 0.0011479382276603299, "loss": 0.8948, "step": 340},
    {"epoch": 5.0, "eval_loss": 1.5301542282104492, "eval_runtime": 0.8691, "eval_samples_per_second": 4.602, "eval_steps_per_second": 1.151, "step": 340},
    {"epoch": 5.074074074074074, "grad_norm": 0.06839559227228165, "learning_rate": 0.0011221278944661473, "loss": 0.8678, "step": 345},
    {"epoch": 5.148148148148148, "grad_norm": 0.06838098913431168, "learning_rate": 0.0010962346916341904, "loss": 0.8666, "step": 350},
    {"epoch": 5.222222222222222, "grad_norm": 0.06836072355508804, "learning_rate": 0.001070276188945293, "loss": 0.8731, "step": 355},
    {"epoch": 5.296296296296296, "grad_norm": 0.06789132207632065, "learning_rate": 0.0010442700004893765, "loss": 0.8724, "step": 360},
    {"epoch": 5.37037037037037, "grad_norm": 0.06825467944145203, "learning_rate": 0.001018233772713443, "loss": 0.8757, "step": 365},
    {"epoch": 5.444444444444445, "grad_norm": 0.06852041184902191, "learning_rate": 0.000992185172447616, "loss": 0.8762, "step": 370},
    {"epoch": 5.518518518518518, "grad_norm": 0.06898131966590881, "learning_rate": 0.0009661418749173466, "loss": 0.8731, "step": 375},
    {"epoch": 5.592592592592593, "grad_norm": 0.06875770539045334, "learning_rate": 0.0009401215517499251, "loss": 0.8746, "step": 380},
    {"epoch": 5.666666666666667, "grad_norm": 0.06649214774370193, "learning_rate": 0.0009141418589834339, "loss": 0.8748, "step": 385},
    {"epoch": 5.7407407407407405, "grad_norm": 0.06804858148097992, "learning_rate": 0.0008882204250862795, "loss": 0.8783, "step": 390},
    {"epoch": 5.814814814814815, "grad_norm": 0.06907966732978821, "learning_rate": 0.0008623748389954282, "loss": 0.8822, "step": 395},
    {"epoch": 5.888888888888889, "grad_norm": 0.0679902508854866, "learning_rate": 0.0008366226381814697, "loss": 0.8777, "step": 400},
    {"epoch": 5.962962962962963, "grad_norm": 0.06677145510911942, "learning_rate": 0.0008109812967486025, "loss": 0.8742, "step": 405},
    {"epoch": 6.0, "eval_loss": 1.5502283573150635, "eval_runtime": 0.8693, "eval_samples_per_second": 4.602, "eval_steps_per_second": 1.15, "step": 408},
    {"epoch": 6.029629629629629, "grad_norm": 0.06917522847652435, "learning_rate": 0.0007854682135776132, "loss": 0.8605, "step": 410},
    {"epoch": 6.103703703703704, "grad_norm": 0.07051009684801102, "learning_rate": 0.0007601007005199021, "loss": 0.8501, "step": 415},
    {"epoch": 6.177777777777778, "grad_norm": 0.07272496819496155, "learning_rate": 0.0007348959706505627, "loss": 0.8553, "step": 420},
    {"epoch": 6.2518518518518515, "grad_norm": 0.07074420154094696, "learning_rate": 0.000709871126588481, "loss": 0.8496, "step": 425},
    {"epoch": 6.325925925925926, "grad_norm": 0.07021532952785492, "learning_rate": 0.0006850431488913895, "loss": 0.8547, "step": 430},
    {"epoch": 6.4, "grad_norm": 0.07260267436504364, "learning_rate": 0.0006604288845337453, "loss": 0.8568, "step": 435},
    {"epoch": 6.474074074074074, "grad_norm": 0.06939396262168884, "learning_rate": 0.0006360450354752458, "loss": 0.8561, "step": 440},
    {"epoch": 6.548148148148148, "grad_norm": 0.06964612007141113, "learning_rate": 0.0006119081473277501, "loss": 0.8577, "step": 445},
    {"epoch": 6.622222222222222, "grad_norm": 0.06987880170345306, "learning_rate": 0.0005880345981282876, "loss": 0.858, "step": 450},
    {"epoch": 6.696296296296296, "grad_norm": 0.06909282505512238, "learning_rate": 0.0005644405872257716, "loss": 0.8559, "step": 455},
    {"epoch": 6.770370370370371, "grad_norm": 0.0683453232049942, "learning_rate": 0.0005411421242889642, "loss": 0.8561, "step": 460},
    {"epoch": 6.844444444444444, "grad_norm": 0.0680374875664711, "learning_rate": 0.000518155018443151, "loss": 0.859, "step": 465},
    {"epoch": 6.9185185185185185, "grad_norm": 0.067069411277771, "learning_rate": 0.0004954948675428853, "loss": 0.8489, "step": 470},
    {"epoch": 6.992592592592593, "grad_norm": 0.06691515445709229, "learning_rate": 0.00047317704758809945, "loss": 0.8556, "step": 475},
    {"epoch": 7.0, "eval_loss": 1.5617406368255615, "eval_runtime": 0.8711, "eval_samples_per_second": 4.592, "eval_steps_per_second": 1.148, "step": 476},
    {"epoch": 7.059259259259259, "grad_norm": 0.08424794673919678, "learning_rate": 0.0004512167022907494, "loss": 0.8413, "step": 480},
    {"epoch": 7.133333333333334, "grad_norm": 0.07284523546695709, "learning_rate": 0.00042962873279907965, "loss": 0.8329, "step": 485},
    {"epoch": 7.207407407407407, "grad_norm": 0.06989779323339462, "learning_rate": 0.0004084277875864776, "loss": 0.8368, "step": 490},
    {"epoch": 7.281481481481482, "grad_norm": 0.0744442567229271, "learning_rate": 0.0003876282525117847, "loss": 0.831, "step": 495},
    {"epoch": 7.355555555555555, "grad_norm": 0.07233459502458572, "learning_rate": 0.0003672442410577965, "loss": 0.8344, "step": 500},
    {"epoch": 7.42962962962963, "grad_norm": 0.07147523015737534, "learning_rate": 0.0003472895847545905, "loss": 0.837, "step": 505},
    {"epoch": 7.503703703703704, "grad_norm": 0.0732484832406044, "learning_rate": 0.000327777823794168, "loss": 0.8427, "step": 510},
    {"epoch": 7.5777777777777775, "grad_norm": 0.0711125060915947, "learning_rate": 0.00030872219784278354, "loss": 0.8394, "step": 515},
    {"epoch": 7.651851851851852, "grad_norm": 0.07285265624523163, "learning_rate": 0.0002901356370571967, "loss": 0.8336, "step": 520},
    {"epoch": 7.725925925925926, "grad_norm": 0.07154905050992966, "learning_rate": 0.0002720307533109402, "loss": 0.8403, "step": 525},
    {"epoch": 7.8, "grad_norm": 0.07089488953351974, "learning_rate": 0.000254419831636557, "loss": 0.839, "step": 530},
    {"epoch": 7.874074074074074, "grad_norm": 0.0709661915898323, "learning_rate": 0.00023731482188961818, "loss": 0.8353, "step": 535},
    {"epoch": 7.948148148148148, "grad_norm": 0.07034063339233398, "learning_rate": 0.00022072733064017102, "loss": 0.8428, "step": 540},
    {"epoch": 8.0, "eval_loss": 1.596451997756958, "eval_runtime": 0.8703, "eval_samples_per_second": 4.596, "eval_steps_per_second": 1.149, "step": 544},
    {"epoch": 8.014814814814814, "grad_norm": 0.07084991037845612, "learning_rate": 0.00020466861329712473, "loss": 0.8359, "step": 545},
    {"epoch": 8.088888888888889, "grad_norm": 0.07405474036931992, "learning_rate": 0.00018914956647091496, "loss": 0.8195, "step": 550},
    {"epoch": 8.162962962962963, "grad_norm": 0.07152204215526581, "learning_rate": 0.0001741807205796314, "loss": 0.8289, "step": 555},
    {"epoch": 8.237037037037037, "grad_norm": 0.0712200403213501, "learning_rate": 0.00015977223270362194, "loss": 0.8271, "step": 560},
    {"epoch": 8.311111111111112, "grad_norm": 0.07045566290616989, "learning_rate": 0.0001459338796934293, "loss": 0.829, "step": 565},
    {"epoch": 8.385185185185184, "grad_norm": 0.0720411017537117, "learning_rate": 0.000132675051535725, "loss": 0.8265, "step": 570},
    {"epoch": 8.459259259259259, "grad_norm": 0.07052139192819595, "learning_rate": 0.00012000474498175551, "loss": 0.8226, "step": 575},
    {"epoch": 8.533333333333333, "grad_norm": 0.07078087329864502, "learning_rate": 0.00010793155744261352, "loss": 0.8241, "step": 580},
    {"epoch": 8.607407407407408, "grad_norm": 0.07028964906930923, "learning_rate": 9.646368115548232e-05, "loss": 0.8212, "step": 585},
    {"epoch": 8.681481481481482, "grad_norm": 0.0702112540602684, "learning_rate": 8.56088976248095e-05, "loss": 0.8232, "step": 590},
    {"epoch": 8.755555555555556, "grad_norm": 0.07041744887828827, "learning_rate": 7.53745723421827e-05, "loss": 0.8193, "step": 595},
    {"epoch": 8.829629629629629, "grad_norm": 0.06979186832904816, "learning_rate": 6.576764978849003e-05, "loss": 0.8186, "step": 600},
    {"epoch": 8.903703703703703, "grad_norm": 0.07058751583099365, "learning_rate": 5.679464872175666e-05, "loss": 0.8279, "step": 605},
    {"epoch": 8.977777777777778, "grad_norm": 0.07009345293045044, "learning_rate": 4.846165775385458e-05, "loss": 0.8168, "step": 610},
    {"epoch": 9.0, "eval_loss": 1.6216613054275513, "eval_runtime": 0.8697, "eval_samples_per_second": 4.599, "eval_steps_per_second": 1.15, "step": 612},
    {"epoch": 9.044444444444444, "grad_norm": 0.0695219412446022, "learning_rate": 4.077433121908747e-05, "loss": 0.8172, "step": 615},
    {"epoch": 9.118518518518519, "grad_norm": 0.07094599306583405, "learning_rate": 3.373788533745281e-05, "loss": 0.8207, "step": 620},
    {"epoch": 9.192592592592593, "grad_norm": 0.07142723351716995, "learning_rate": 2.7357094675186987e-05, "loss": 0.8101, "step": 625},
    {"epoch": 9.266666666666667, "grad_norm": 0.07132075726985931, "learning_rate": 2.1636288904992585e-05, "loss": 0.8137, "step": 630},
    {"epoch": 9.34074074074074, "grad_norm": 0.07085540145635605, "learning_rate": 1.6579349868147686e-05, "loss": 0.8103, "step": 635},
    {"epoch": 9.414814814814815, "grad_norm": 0.0697101578116417, "learning_rate": 1.218970894049065e-05, "loss": 0.8094, "step": 640},
    {"epoch": 9.488888888888889, "grad_norm": 0.07001277059316635, "learning_rate": 8.470344704066047e-06, "loss": 0.8233, "step": 645},
    {"epoch": 9.562962962962963, "grad_norm": 0.07051407545804977, "learning_rate": 5.42378092601481e-06, "loss": 0.8181, "step": 650},
    {"epoch": 9.637037037037038, "grad_norm": 0.07038593292236328, "learning_rate": 3.0520848460765526e-06, "loss": 0.8198, "step": 655},
    {"epoch": 9.71111111111111, "grad_norm": 0.06979399174451828, "learning_rate": 1.3568657738678436e-06, "loss": 0.8138, "step": 660},
    {"epoch": 9.785185185185185, "grad_norm": 0.07017084956169128, "learning_rate": 3.3927399688948866e-07, "loss": 0.8138, "step": 665},
    {"epoch": 9.85925925925926, "grad_norm": 0.07032209634780884, "learning_rate": 0.0, "loss": 0.8191, "step": 670},
    {"epoch": 9.85925925925926, "eval_loss": 1.630096197128296, "eval_runtime": 0.8812, "eval_samples_per_second": 4.539, "eval_steps_per_second": 1.135, "step": 670},
    {"epoch": 9.85925925925926, "step": 670, "total_flos": 2.9601022627828204e+18, "train_loss": 0.9062140895359552, "train_runtime": 3484.2972, "train_samples_per_second": 49.516, "train_steps_per_second": 0.192}
  ],
  "logging_steps": 5,
  "max_steps": 670,
  "num_input_tokens_seen": 0,
  "num_train_epochs": 10,
  "save_steps": 100,
  "stateful_callbacks": {
    "TrainerControl": {
      "args": {
        "should_epoch_stop": false,
        "should_evaluate": false,
        "should_log": false,
        "should_save": true,
        "should_training_stop": true
      },
      "attributes": {}
    }
  },
  "total_flos": 2.9601022627828204e+18,
  "train_batch_size": 16,
  "trial_name": null,
  "trial_params": null
}