Llama-3.1-8B-Instruct-dpo-mistral-1000

This model is a fine-tuned version of meta-llama/Llama-3.1-8B-Instruct on the answer_mistral dataset. It achieves the following results on the evaluation set (the values match the step 250 row, epoch 4.4543, of the training results table):

Loss: 0.4675
Rewards/chosen: 0.9903
Rewards/rejected: -0.3997
Rewards/accuracies: 0.7900
Rewards/margins: 1.3900
Logps/chosen: -13.2488
Logps/rejected: -29.2269
Logits/chosen: -0.1396
Logits/rejected: -0.2080
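
The reward margin listed above is consistent with the chosen reward minus the rejected reward. A minimal Python check of that arithmetic follows; it uses only the numbers from this card, and the variable names are illustrative rather than taken from the training code.

```python
# Values copied from the evaluation results listed in this card.
rewards_chosen = 0.9903
rewards_rejected = -0.3997
reported_margin = 1.3900

# Assumption: the margin is the mean chosen reward minus the mean
# rejected reward, which is how the listed numbers relate to each other.
margin = rewards_chosen - rewards_rejected

print(f"computed margin: {margin:.4f}")
print(f"matches reported value: {abs(margin - reported_margin) < 1e-4}")
```

Rewards/accuracies is usually read as the fraction of evaluation pairs in which the chosen response receives the higher reward, so 0.7900 would mean the preferred answer wins in about 79% of pairs.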

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 2
eval_batch_size: 2
seed: 42
gradient_accumulation_steps: 8
total_train_batch_size: 16
optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 10.0
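
The hyperparameters above imply two derived quantities: the effective batch size and the shape of the learning-rate curve. The sketch below reproduces both in plain Python. The device count and the `total_steps` value are assumptions (the card states neither), and `lr_at` is a hypothetical helper that only approximates the usual linear-warmup, cosine-decay shape; it is not the scheduler code used for training.

```python
import math

# Values copied from the hyperparameter list above.
train_batch_size = 2
gradient_accumulation_steps = 8
peak_learning_rate = 5e-06
warmup_ratio = 0.1

# Assumption: one device. The card does not state the device count,
# but 1 is the value consistent with total_train_batch_size = 16.
num_devices = 1

total_train_batch_size = (
    train_batch_size * gradient_accumulation_steps * num_devices
)


def lr_at(step, total_steps, peak_lr=peak_learning_rate, ratio=warmup_ratio):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    warmup_steps = math.ceil(total_steps * ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("effective batch size:", total_train_batch_size)
    # total_steps=100 is purely illustrative, not the real step count.
    for step in (0, 5, 10, 55, 100):
        print(step, lr_at(step, total_steps=100))
```

With a warmup ratio of 0.1, the learning rate reaches its peak after the first 10% of optimizer steps and then decays smoothly for the remainder of training.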

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/chosen	Logps/rejected	Logits/chosen	Logits/rejected
0.6891	0.8909	50	0.6833	0.0487	0.0276	0.6200	0.0211	-22.6647	-24.9535	-0.3207	-0.3690
0.5716	1.7817	100	0.5618	0.6081	0.1913	0.7000	0.4168	-17.0706	-23.3165	-0.2934	-0.3456
0.4581	2.6726	150	0.4761	0.9362	-0.0437	0.7600	0.9799	-13.7892	-25.6666	-0.2093	-0.2739
0.4032	3.5635	200	0.4709	0.9603	-0.2844	0.8100	1.2447	-13.5486	-28.0732	-0.1631	-0.2306
0.3836	4.4543	250	0.4675	0.9903	-0.3997	0.7900	1.3900	-13.2488	-29.2269	-0.1396	-0.2080
0.3588	5.3452	300	0.4752	0.9745	-0.4525	0.7700	1.4270	-13.4066	-29.7545	-0.1255	-0.1931
0.2861	6.2361	350	0.4812	0.9392	-0.5503	0.7700	1.4895	-13.7591	-30.7320	-0.1102	-0.1785
0.3662	7.1269	400	0.4868	0.9165	-0.6356	0.7700	1.5522	-13.9862	-31.5858	-0.0990	-0.1679
0.2822	8.0178	450	0.4927	0.9099	-0.6512	0.7600	1.5612	-14.0519	-31.7416	-0.0936	-0.1622
0.2416	8.9087	500	0.4979	0.8912	-0.6958	0.7600	1.5870	-14.2398	-32.1878	-0.0898	-0.1585
0.3096	9.7996	550	0.4934	0.8943	-0.7017	0.7500	1.5960	-14.2081	-32.2463	-0.0873	-0.1548
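
In the table above, validation loss reaches its minimum at step 250 and rises afterwards while training loss keeps falling, which is the usual sign of overfitting in later epochs. The short Python sketch below picks the row with the lowest validation loss from values copied out of the table. Whether the published weights come from that checkpoint is not stated in the card; the headline metrics simply match that row.

```python
# (step, epoch, validation_loss) copied from the training results table.
eval_log = [
    (50, 0.8909, 0.6833),
    (100, 1.7817, 0.5618),
    (150, 2.6726, 0.4761),
    (200, 3.5635, 0.4709),
    (250, 4.4543, 0.4675),
    (300, 5.3452, 0.4752),
    (350, 6.2361, 0.4812),
    (400, 7.1269, 0.4868),
    (450, 8.0178, 0.4927),
    (500, 8.9087, 0.4979),
    (550, 9.7996, 0.4934),
]

# Select the evaluation point with the lowest validation loss.
best_step, best_epoch, best_loss = min(eval_log, key=lambda row: row[2])

print(f"lowest validation loss {best_loss} at step {best_step} (epoch {best_epoch})")
```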

Framework versions

PEFT 0.12.0
Transformers 4.46.1
Pytorch 2.5.1+cu124
Datasets 3.1.0
Tokenizers 0.20.3
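
To reproduce the environment, it can help to compare the installed packages against the versions listed above. The following is a minimal sketch using only the Python standard library. The pip distribution names (for example `torch` for Pytorch) are assumptions, since the card lists display names, and `installed_versions` and `find_mismatches` are hypothetical helpers.

```python
from importlib import metadata

# Versions copied from the list above; keys are assumed pip distribution names.
EXPECTED = {
    "peft": "0.12.0",
    "transformers": "4.46.1",
    "torch": "2.5.1+cu124",
    "datasets": "3.1.0",
    "tokenizers": "0.20.3",
}


def installed_versions(packages):
    """Return {name: version or None} for each requested distribution."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None  # package is not installed
    return found


def find_mismatches(expected, installed):
    """Return {name: (expected, installed)} for every differing version."""
    return {
        name: (want, installed.get(name))
        for name, want in expected.items()
        if installed.get(name) != want
    }


if __name__ == "__main__":
    mismatches = find_mismatches(EXPECTED, installed_versions(EXPECTED))
    for name, (want, have) in mismatches.items():
        print(f"{name}: expected {want}, found {have}")
    if not mismatches:
        print("all listed versions match")
```

An exact string comparison is deliberately strict: a different CUDA build suffix on the torch version will be reported as a mismatch even when the base version agrees.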

chchen/Llama-3.1-8B-Instruct-dpo-mistral-1000
