RefAlign: RL with Similarity-based Rewards

GitHub repository: https://github.com/mzhaoshuai/RefAlign

Paper: Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data.

The training data is mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3.

During Reinforcement Learning with Similarity-based Rewards, the reward is the BERTScore similarity between the policy's generation and the reference answer.
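
To illustrate the shape of a similarity-based reward, here is a minimal stand-in. Note the assumption: RefAlign's actual reward is BERTScore computed with contextual embeddings from bart-large-mnli; the exact-token F1 below is only a simplified proxy that, like BERTScore, returns a precision/recall-balanced score in [0, 1].

```python
from collections import Counter

def similarity_reward(candidate: str, reference: str) -> float:
    """Token-level F1 between a generation and a reference answer.

    Proxy only: RefAlign uses BERTScore (embedding-based greedy
    matching with bart-large-mnli), not exact-token overlap.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Count tokens shared between candidate and reference.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

In practice the reward would be computed per sampled generation against its prompt's reference answer, then fed to the RL objective.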

| Hyper-parameter | Value |
|---|---|
| LR | 2.5e-6 |
| Batch size | 512 |
| Epochs | 1 |
| Prompt length | 600 |
| Generation length | 1200 |
| Advantage clip | 0.08 |
| Sampled generations (K) | 2 |
| BERTScore model | bart-large-mnli |
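
The table's "Sampled generations (K) 2" and "Advantage clip 0.08" entries suggest a group-baseline advantage estimate. A sketch under that assumption (the card does not specify the estimator, so `clipped_advantages`, the mean baseline, and the symmetric clip range are illustrative):

```python
import statistics

def clipped_advantages(rewards: list[float], clip: float = 0.08) -> list[float]:
    """Turn the K per-prompt similarity rewards into clipped advantages.

    Assumption: each of the K sampled generations gets an advantage of
    (its reward - group mean reward), clipped to [-clip, clip] per the
    "Advantage clip 0.08" setting. RefAlign's actual estimator may differ.
    """
    baseline = statistics.mean(rewards)  # group baseline over K samples
    return [max(-clip, min(clip, r - baseline)) for r in rewards]
```

For example, with K = 2 rewards of 0.9 and 0.5, the raw advantages ±0.2 are clipped to ±0.08 before the policy update.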
Model size: 8B parameters, BF16 (safetensors).

Model: mzhaoshuai/Llama-3-8B-Instruct-refalign, a fine-tune of Llama-3-8B-Instruct trained on the dataset above.