mzhaoshuai committed (verified)
Commit: e9c1fd8
Parent(s): 6657e20

Update README.md

Files changed (1): README.md (+5, -3)
README.md CHANGED
@@ -7,12 +7,14 @@ license: apache-2.0
  pipeline_tag: text-generation
  ---
 
- # RefAlign: SFT Model for Confidence Alignment
+ # RefAlign: RL with Similarity-based Rewards
+
+ **GitHub repository**: https://github.com/mzhaoshuai/RefAlign
+
+ **Paper**: [Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data](https://huggingface.co/papers/2504.09895).
 
  This repository contains the SFT (Supervised Fine-Tuning) model `mzhaoshuai/zephyr-7b-alpha-conf-sft`, which is an integral part of the **RefAlign** framework. This model serves as an initial SFT step for Confidence Alignment experiments, trained with `shuchangtao/CONQORD_dataset` (specifically, `conqord_step1_data`), as described in the accompanying research.
 
- **Paper**: [Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data](https://huggingface.co/papers/2504.09895)
- **Code**: https://github.com/mzhaoshuai/RefAlign
 
  ## Abstract
  Large language models~(LLMs) are expected to be helpful, harmless, and honest. In different alignment scenarios, such as safety, confidence, and general preference alignment, binary preference data collection and reward modeling are resource-intensive but play a central role in transferring human preferences. In this work, we explore using the similarity between sampled generations and reference answers as a supplementary reward function for alignment. When unary reference answers are available, such similarity-based rewards can circumvent the need for binary preference data and explicit reward modeling. We introduce \textit{RefAlign}, a versatile REINFORCE-style alignment algorithm that does not rely on reward or reference models. RefAlign utilizes language generation evaluation metrics, such as BERTScore, between sampled generations and reference answers as surrogate rewards. Beyond general preference optimization, RefAlign can be naturally extended to diverse scenarios, including safety and confidence alignment, by combining similarity-based rewards with task-specific objectives. Across multiple scenarios, RefAlign achieves performance comparable to prior alignment methods while operating without binary preference data or reward models.
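
To make the surrogate-reward idea in the abstract concrete, here is a minimal, illustrative sketch (not the official RefAlign implementation from the linked repository) of how a similarity-based reward could be computed with the `bert-score` package: generations sampled from a policy, e.g. the SFT model hosted here, are scored against unary reference answers, and the BERTScore F1 is taken as the reward. The candidate and reference strings are placeholders.

```python
# Minimal sketch of a similarity-based surrogate reward (not the official
# RefAlign code). Assumes `pip install bert-score`; all strings below are
# illustrative placeholders.
from bert_score import score

# Generations sampled from the policy (e.g., the SFT model in this repo)
# and the unary reference answers they are compared against.
candidates = [
    "The capital of Australia is Canberra.",
    "I am about 60% confident the answer is 42.",
]
references = [
    "Canberra is the capital city of Australia.",
    "The answer is 42.",
]

# BERTScore F1 between each sampled generation and its reference answer
# serves as the surrogate reward; no reward model or binary preference
# data is involved.
precision, recall, f1 = score(candidates, references, lang="en", verbose=False)
rewards = f1.tolist()

for cand, reward in zip(candidates, rewards):
    print(f"reward={reward:.3f} | {cand}")
```

In the framework described above, rewards of this kind would then drive a REINFORCE-style policy update, optionally combined with task-specific terms such as a confidence-alignment objective.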