Ministral-8B-Instruct-2410-dpo-llama-1000

This model is a PEFT adapter fine-tuned with DPO from mistralai/Ministral-8B-Instruct-2410 on the answer_llama dataset; a loading sketch follows the metrics below. It achieves the following results on the evaluation set:

  • Loss: 0.2740
  • Rewards/chosen: 0.9492
  • Rewards/rejected: -1.3563
  • Rewards/accuracies: 0.8900
  • Rewards/margins: 2.3055
  • Logps/chosen: -24.6582
  • Logps/rejected: -48.3533
  • Logits/chosen: -1.2736
  • Logits/rejected: -1.4719
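
This repository provides a PEFT adapter rather than full model weights (see the framework versions below), so it has to be loaded on top of the base model. Below is a minimal, illustrative loading sketch, not an official example: the adapter id is taken from this page, and the chat-template call assumes the base model's instruct formatting.

```python
# Illustrative sketch: attach the DPO-trained PEFT adapter to the base
# Ministral-8B-Instruct-2410 model and run a short generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Ministral-8B-Instruct-2410"
adapter_id = "chchen/Ministral-8B-Instruct-2410-dpo-llama-1000"  # this adapter repo

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)  # load the LoRA/PEFT weights
model.eval()

messages = [{"role": "user", "content": "Summarize what DPO fine-tuning changes about a model."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

For deployment, the adapter can also be merged into the base weights with `model.merge_and_unload()` before saving.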

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training; an illustrative configuration sketch follows the list:

  • learning_rate: 5e-06
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 16
  • optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 10.0
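
The card does not state which framework produced this run. As a hedged illustration only, the sketch below maps the listed hyperparameters onto TRL's DPOConfig/DPOTrainer with a PEFT (LoRA) adapter; the dataset placeholder and the LoRA settings are assumptions, not values documented here.

```python
# Hypothetical reproduction sketch: the hyperparameters listed above expressed as
# a trl.DPOConfig. The answer_llama preference data and the adapter settings are
# not documented in this card, so placeholders stand in for them.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

base_id = "mistralai/Ministral-8B-Instruct-2410"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")

# Tiny stand-in for the (undocumented) answer_llama preference pairs.
train_dataset = Dataset.from_dict({
    "prompt":   ["What does DPO optimize?"],
    "chosen":   ["It raises the likelihood of preferred responses relative to a reference model."],
    "rejected": ["It changes the tokenizer vocabulary."],
})

args = DPOConfig(
    output_dir="Ministral-8B-Instruct-2410-dpo-llama-1000",
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,   # 2 per device x 8 steps = total batch size 16
    num_train_epochs=10.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    optim="adamw_torch",             # betas=(0.9, 0.999) and eps=1e-08 are the defaults
)

peft_config = LoraConfig(task_type="CAUSAL_LM")  # assumed; adapter hyperparameters are not listed

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,      # `tokenizer=` on older TRL releases
    peft_config=peft_config,
)
trainer.train()
```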

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 0.6556 | 0.8889 | 50 | 0.6256 | 0.1404 | -0.0054 | 0.8100 | 0.1459 | -32.7462 | -34.8450 | -1.7657 | -1.8272 |
| 0.4187 | 1.7778 | 100 | 0.3967 | 0.6694 | -0.2649 | 0.8400 | 0.9344 | -27.4562 | -37.4399 | -1.5767 | -1.6947 |
| 0.2758 | 2.6667 | 150 | 0.3213 | 0.7620 | -0.7877 | 0.8600 | 1.5497 | -26.5309 | -42.6681 | -1.4150 | -1.5803 |
| 0.2583 | 3.5556 | 200 | 0.2799 | 0.8856 | -1.1319 | 0.8900 | 2.0176 | -25.2941 | -46.1100 | -1.3446 | -1.5362 |
| 0.2338 | 4.4444 | 250 | 0.2740 | 0.9492 | -1.3563 | 0.8900 | 2.3055 | -24.6582 | -48.3533 | -1.2736 | -1.4719 |
| 0.2264 | 5.3333 | 300 | 0.2748 | 0.9422 | -1.6000 | 0.8800 | 2.5422 | -24.7285 | -50.7910 | -1.2476 | -1.4447 |
| 0.1735 | 6.2222 | 350 | 0.2817 | 0.8792 | -1.9250 | 0.8700 | 2.8042 | -25.3584 | -54.0402 | -1.2022 | -1.4030 |
| 0.1834 | 7.1111 | 400 | 0.2900 | 0.8156 | -2.1377 | 0.8800 | 2.9533 | -25.9941 | -56.1677 | -1.1777 | -1.3806 |
| 0.1661 | 8.0 | 450 | 0.2968 | 0.7723 | -2.2626 | 0.8900 | 3.0349 | -26.4276 | -57.4162 | -1.1686 | -1.3688 |
| 0.1377 | 8.8889 | 500 | 0.2971 | 0.7689 | -2.2991 | 0.8900 | 3.0680 | -26.4618 | -57.7814 | -1.1676 | -1.3687 |
| 0.1939 | 9.7778 | 550 | 0.2977 | 0.7798 | -2.2870 | 0.8900 | 3.0668 | -26.3530 | -57.6608 | -1.1677 | -1.3691 |

The evaluation results reported at the top of this card match the step-250 checkpoint (epoch 4.44), which attains the lowest validation loss (0.2740).
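
For context on the Rewards/* columns: DPO scores each response with an implicit reward derived from the log-probability ratio between the trained policy and the frozen reference model, and Rewards/margins is the gap between the chosen and rejected rewards. A brief summary of the standard formulation follows; the β used for this run is not reported in the card.

```latex
% Standard DPO quantities (for reference; the beta for this run is not stated).
% Implicit reward of response y for prompt x:
\[
r_\theta(x, y) \;=\; \beta \,\log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\]
% Rewards/margins is the mean chosen-minus-rejected reward gap:
\[
\text{margin} \;=\; r_\theta(x, y_{\mathrm{chosen}}) - r_\theta(x, y_{\mathrm{rejected}})
\]
% Training minimizes the DPO loss over preference pairs:
\[
\mathcal{L}_{\mathrm{DPO}} \;=\; -\,\mathbb{E}\big[\log \sigma\big(r_\theta(x, y_{\mathrm{chosen}}) - r_\theta(x, y_{\mathrm{rejected}})\big)\big]
\]
```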

Framework versions

  • PEFT 0.12.0
  • Transformers 4.46.1
  • Pytorch 2.5.1+cu124
  • Datasets 3.1.0
  • Tokenizers 0.20.3