Update README.md
README.md CHANGED
@@ -1,6 +1,11 @@
 ---
+base_model: meta-llama/Meta-Llama-3-8B-Instruct
 library_name: transformers
-
+datasets:
+- openbmb/UltraFeedback
+tags:
+- alignment-handbook
+- llama
 ---
 We propose a novel strategy to enhance off-policy preference optimization by simulating on-policy learning with off-policy preference data. Our Weighted Preference Optimization (WPO) method adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. This method not only addresses the distributional gap problem but also enhances the optimization process without incurring additional costs. Refer to our [preprint](https://arxiv.org/abs/2406.11827) and [repo](https://github.com/wzhouad/WPO) for details.

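The model-card paragraph describes WPO only in words: off-policy preference pairs are reweighted by how likely they are under the current policy before a preference loss is applied. Below is a minimal, illustrative PyTorch sketch of that idea as a weighted DPO-style loss. The function name `wpo_style_loss`, the `beta` parameter, the batch-wise normalization, and the use of summed log-probabilities are all assumptions for illustration, not the official implementation; refer to the linked preprint and repo for the actual weighting scheme.

```python
# Minimal sketch (NOT the official WPO implementation) of a DPO-style loss
# whose per-pair weights come from the current policy's probability of the pair.
import torch
import torch.nn.functional as F

def wpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style preference loss with per-pair weights from the current policy.

    Each *_logps argument is a 1-D tensor of sequence log-probabilities
    (one entry per preference pair) under the policy being trained or a
    frozen reference model. In practice length-normalized log-probabilities
    are commonly used to avoid underflow when exponentiating.
    """
    # Standard DPO logits: margin between the policy/reference log-ratios
    # of the chosen and rejected responses.
    logits = (policy_chosen_logps - ref_chosen_logps) - \
             (policy_rejected_logps - ref_rejected_logps)
    losses = -F.logsigmoid(beta * logits)

    # WPO-style weight: probability of the pair under the current policy,
    # detached so the weight itself is not optimized, then normalized over
    # the batch. The exact formula in the paper may differ.
    with torch.no_grad():
        pair_weights = torch.exp(policy_chosen_logps + policy_rejected_logps)
        pair_weights = pair_weights / (pair_weights.sum() + 1e-8)

    return (pair_weights * losses).sum()


# Toy usage with random log-probabilities for a batch of 4 preference pairs.
b = 4
loss = wpo_style_loss(torch.randn(b), torch.randn(b),
                      torch.randn(b), torch.randn(b))
print(loss.item())
```

Because the weights are detached and normalized within the batch, pairs that the current policy would plausibly generate dominate the gradient, which is the "simulated on-policy" effect the paragraph refers to.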