impossible-llms-english-fronting-bigram

This model is a fine-tuned version of an unspecified base model on an unknown dataset. It achieves the following results on the evaluation set:

  • Loss: 4.4268

Model description

More information needed

Intended uses & limitations

More information needed
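
In the absence of documented usage, the following is a minimal loading sketch. It assumes the checkpoint is a causal language model hosted under the Hub repository id IParraMartin/impossible-llms-english-fronting-bigram (as listed in the collection for this card) and that a tokenizer is bundled with it; neither is confirmed by this card.

```python
# Minimal usage sketch; assumes a causal LM architecture and a bundled tokenizer.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "IParraMartin/impossible-llms-english-fronting-bigram"  # Hub id assumed from this card
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Generate a short continuation from a prompt.
inputs = tokenizer("The cat sat on the", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```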

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (see the configuration sketch after this list):

  • learning_rate: 0.0001
  • train_batch_size: 12
  • eval_batch_size: 8
  • seed: 0
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 384
  • total_eval_batch_size: 32
  • optimizer: adamw_torch with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • training_steps: 3000
  • mixed_precision_training: Native AMP
  • label_smoothing_factor: 0.1
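
A minimal sketch of how these hyperparameters might map onto Hugging Face TrainingArguments in Transformers 4.49. The actual training script is not published, so the output path and the exact mixed-precision flag used for Native AMP (fp16 vs. bf16) are assumptions.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="impossible-llms-english-fronting-bigram",  # hypothetical output path
    learning_rate=1e-4,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=8,
    seed=0,
    gradient_accumulation_steps=8,
    # Effective train batch size: 12 per device x 4 GPUs x 8 accumulation steps = 384.
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    max_steps=3000,
    fp16=True,  # "Native AMP" in the card; bf16 is also possible on this assumption
    label_smoothing_factor=0.1,
)
```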

Training results

| Training Loss | Epoch   | Step | Validation Loss |
|:-------------:|:-------:|:----:|:---------------:|
| 22.3579       | 1.0     | 87   | 7.3625          |
| 17.7285       | 2.0     | 174  | 5.9282          |
| 17.3131       | 3.0     | 261  | 5.7417          |
| 16.9484       | 4.0     | 348  | 5.5702          |
| 16.2015       | 5.0     | 435  | 5.3600          |
| 15.635        | 6.0     | 522  | 5.1832          |
| 15.2242       | 7.0     | 609  | 5.0535          |
| 14.9803       | 8.0     | 696  | 4.9441          |
| 14.693        | 9.0     | 783  | 4.8592          |
| 14.4182       | 10.0    | 870  | 4.7920          |
| 14.3186       | 11.0    | 957  | 4.7325          |
| 14.0921       | 12.0    | 1044 | 4.6868          |
| 13.8969       | 13.0    | 1131 | 4.6437          |
| 13.8353       | 14.0    | 1218 | 4.6098          |
| 13.6798       | 15.0    | 1305 | 4.5795          |
| 13.637        | 16.0    | 1392 | 4.5563          |
| 13.5227       | 17.0    | 1479 | 4.5350          |
| 13.4718       | 18.0    | 1566 | 4.5154          |
| 13.2136       | 19.0    | 1653 | 4.4986          |
| 13.3515       | 20.0    | 1740 | 4.4878          |
| 13.2931       | 21.0    | 1827 | 4.4752          |
| 13.1062       | 22.0    | 1914 | 4.4651          |
| 13.1325       | 23.0    | 2001 | 4.4568          |
| 13.0963       | 24.0    | 2088 | 4.4508          |
| 13.1318       | 25.0    | 2175 | 4.4443          |
| 12.8938       | 26.0    | 2262 | 4.4397          |
| 12.935        | 27.0    | 2349 | 4.4364          |
| 13.1248       | 28.0    | 2436 | 4.4331          |
| 12.9068       | 29.0    | 2523 | 4.4304          |
| 12.8866       | 30.0    | 2610 | 4.4293          |
| 12.9587       | 31.0    | 2697 | 4.4282          |
| 12.8039       | 32.0    | 2784 | 4.4273          |
| 12.7212       | 33.0    | 2871 | 4.4270          |
| 12.8857       | 34.0    | 2958 | 4.4268          |
| 34.5151       | 34.4863 | 3000 | 4.4268          |

Framework versions

  • Transformers 4.49.0
  • Pytorch 2.4.0+cu121
  • Datasets 3.4.0
  • Tokenizers 0.21.0
Model size

  • 126M parameters (F32, Safetensors)