Llama-3.1-8B-Instruct-dpo-llama-1000

This model is a fine-tuned version of meta-llama/Llama-3.1-8B-Instruct on the answer_llama dataset. It achieves the following results on the evaluation set:

Loss: 0.3077
Rewards/chosen: 1.4814
Rewards/rejected: -0.7600
Rewards/accuracies: 0.8500
Rewards/margins: 2.2414
Logps/chosen: -7.6796
Logps/rejected: -31.9936
Logits/chosen: -0.2154
Logits/rejected: -0.3106

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 2
eval_batch_size: 2
seed: 42
gradient_accumulation_steps: 8
total_train_batch_size: 16
optimizer: Use adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 10.0

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/chosen	Logps/rejected	Logits/chosen	Logits/rejected
0.6815	0.8889	50	0.6707	0.0833	0.0353	0.6900	0.0480	-21.6601	-24.0398	-0.4114	-0.4792
0.5082	1.7778	100	0.4428	1.0308	0.1943	0.7900	0.8366	-12.1855	-22.4506	-0.3559	-0.4377
0.2979	2.6667	150	0.3215	1.3481	-0.4170	0.8600	1.7651	-9.0131	-28.5637	-0.2695	-0.3655
0.2862	3.5556	200	0.3077	1.4814	-0.7600	0.8500	2.2414	-7.6796	-31.9936	-0.2154	-0.3106
0.2747	4.4444	250	0.3184	1.4147	-1.2445	0.8600	2.6592	-8.3466	-36.8385	-0.1872	-0.2879
0.2688	5.3333	300	0.3195	1.4469	-1.2794	0.8500	2.7263	-8.0242	-37.1874	-0.1714	-0.2705
0.2047	6.2222	350	0.3630	1.3019	-1.5956	0.8400	2.8975	-9.4749	-40.3495	-0.1553	-0.2578
0.2268	7.1111	400	0.3526	1.3609	-1.6635	0.8500	3.0245	-8.8842	-41.0287	-0.1452	-0.2479
0.144	8.0	450	0.3662	1.3488	-1.7032	0.8400	3.0520	-9.0059	-41.4255	-0.1421	-0.2448
0.171	8.8889	500	0.3635	1.3313	-1.7326	0.8400	3.0640	-9.1805	-41.7197	-0.1399	-0.2430
0.2313	9.7778	550	0.3613	1.3392	-1.7432	0.8400	3.0824	-9.1017	-41.8256	-0.1378	-0.2410

Framework versions

PEFT 0.12.0
Transformers 4.46.1
Pytorch 2.5.1+cu124
Datasets 3.1.0
Tokenizers 0.20.3

chchen
/

Llama-3.1-8B-Instruct-dpo-llama-1000

Llama-3.1-8B-Instruct-dpo-llama-1000

Model description

Intended uses & limitations

Training and evaluation data

Training procedure

Training hyperparameters

Training results

Framework versions

Model tree for chchen/Llama-3.1-8B-Instruct-dpo-llama-1000

Evaluation results