TinyLlama-1.1B DPO
Apply SFT and DPO to TinyLlama 1.1B
This model was trained from scratch on an unspecified dataset (the dataset field is recorded as None). Its evaluation-set results are reported per epoch in the table below.
The hyperparameters used during training are not listed in this card. The table reports training and evaluation metrics at the end of each epoch:
Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
---|---|---|---|---|---|---|---|---|---|---|---|
0.6916 | 1.0 | 968 | 0.6921 | 0.0039 | 0.0011 | 0.5070 | 0.0028 | -315.7343 | -402.6164 | -4.0813 | -4.1913 |
0.6904 | 2.0 | 1936 | 0.6884 | 0.0191 | 0.0086 | 0.5570 | 0.0105 | -315.6588 | -402.4643 | -4.0824 | -4.1920 |
0.6876 | 3.0 | 2904 | 0.6877 | 0.0254 | 0.0135 | 0.5645 | 0.0119 | -315.6106 | -402.4017 | -4.0818 | -4.1916 |
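
As a reading aid for the table above, the following minimal sketch checks that Rewards/margins equals Rewards/chosen minus Rewards/rejected, and evaluates a sigmoid-style DPO loss at the mean margin. It assumes the standard sigmoid DPO loss, `-log(sigmoid(margin))`, which the card does not confirm; the helper names `margin` and `sigmoid_loss_at` are hypothetical and introduced only for illustration.

```python
import math

# (epoch, rewards/chosen, rewards/rejected, rewards/margins, validation loss),
# copied from the table above.
rows = [
    (1.0, 0.0039, 0.0011, 0.0028, 0.6921),
    (2.0, 0.0191, 0.0086, 0.0105, 0.6884),
    (3.0, 0.0254, 0.0135, 0.0119, 0.6877),
]


def margin(chosen: float, rejected: float) -> float:
    """Reward margin: how much more the chosen response is rewarded."""
    return chosen - rejected


def sigmoid_loss_at(m: float) -> float:
    """-log(sigmoid(m)), written as log(1 + exp(-m)).

    Assumption: the run used the standard sigmoid DPO loss.
    """
    return math.log1p(math.exp(-m))


for epoch, chosen, rejected, reported_margin, val_loss in rows:
    m = margin(chosen, rejected)
    # The reported margin should match chosen - rejected up to rounding.
    assert abs(m - reported_margin) < 1e-4
    print(
        f"epoch {epoch:.0f}: margin={m:.4f}, "
        f"loss at mean margin={sigmoid_loss_at(m):.4f}, "
        f"reported validation loss={val_loss:.4f}"
    )
```

A margin of zero gives a loss of `log(2)`, about 0.6931, which is the value for a model that does not yet separate chosen from rejected responses. The reported validation losses sit just below that, consistent with the small positive margins. Because the loss is convex, the loss evaluated at the mean margin is only a lower bound on the mean per-example loss, so the two columns printed by the sketch are not expected to match exactly.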