RefAlign: RL with Similarity-based Rewards
GitHub repository: https://github.com/mzhaoshuai/RefAlign
Paper: Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data.
The training data is mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3.
During Reinforcement Learning with Similarity-based Rewards, the reward for each sampled generation is its BERTScore similarity to the reference answer.
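Concretely, such a reward can be computed with the `bert-score` package. Below is a minimal sketch assuming BERTScore F1 is used as the scalar reward; the helper name `similarity_reward` and the `num_layers` value are illustrative assumptions, not taken from the RefAlign code.

```python
# Minimal sketch of a BERTScore-based similarity reward,
# assuming the `bert-score` package (pip install bert-score).
from bert_score import BERTScorer

# bart-large-mnli is the similarity model from the table below;
# num_layers=12 (bart-large encoder depth) is an assumption here.
scorer = BERTScorer(model_type="facebook/bart-large-mnli", num_layers=12)

def similarity_reward(generations: list[str], references: list[str]) -> list[float]:
    """Score each sampled generation against its reference answer.

    Returns the BERTScore F1 per (generation, reference) pair,
    used here as the scalar reward.
    """
    precision, recall, f1 = scorer.score(generations, references)
    return f1.tolist()

# Example: K=2 sampled generations for one prompt, one reference answer.
rewards = similarity_reward(
    ["Paris is the capital of France.", "The capital is Lyon."],
    ["The capital of France is Paris."] * 2,
)
print(rewards)  # higher similarity -> higher reward
```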
| Hyper-Parameter | Value |
| --- | --- |
| LR | 2.5e-6 |
| Batch Size | 512 |
| Epochs | 1 |
| Prompt Length (tokens) | 600 |
| Generation Length (tokens) | 1200 |
| Advantage Clip | 0.08 |
| Sampled Generations (K) | 2 |
| BERTScore Model | bart-large-mnli |
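The Advantage Clip and Sampled Generations (K) entries suggest that per-prompt advantages are formed over the K samples and then clipped. The sketch below is speculative: the mean-over-K baseline and the symmetric clamp at ±0.08 are assumptions, not the verified RefAlign objective.

```python
# Speculative sketch of combining K sampled generations with the
# advantage clip from the table; the baseline choice (mean over K)
# and the clipping form are assumptions.
import torch

def clipped_advantages(rewards: torch.Tensor, clip: float = 0.08) -> torch.Tensor:
    """rewards: (batch, K) similarity rewards for K generations per prompt."""
    # Center each generation's reward on the per-prompt mean as a baseline.
    advantages = rewards - rewards.mean(dim=1, keepdim=True)
    # Clip advantages to stabilize the policy-gradient update.
    return advantages.clamp(-clip, clip)

# Example with K=2 generations for two prompts.
r = torch.tensor([[0.91, 0.55], [0.40, 0.42]])
print(clipped_advantages(r))
```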