Update README.md
README.md CHANGED
@@ -1,6 +1,11 @@
 ---
+base_model: meta-llama/Meta-Llama-3-8B-Instruct
 library_name: transformers
-
+datasets:
+- openbmb/UltraFeedback
+tags:
+- alignment-handbook
+- llama
 ---
 We propose a novel strategy to enhance off-policy preference optimization by simulating on-policy learning with off-policy preference data. Our Weighted Preference Optimization (WPO) method adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. This method not only addresses the distributional gap problem but also enhances the optimization process without incurring additional costs. Refer to our [preprint](https://arxiv.org/abs/2406.11827) and [repo](https://github.com/wzhouad/WPO) for details.

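The model-card paragraph describes WPO only in words: off-policy preference pairs are reweighted by how likely they are under the current policy before a preference loss is applied. Below is a minimal, illustrative PyTorch sketch of that idea as a weighted DPO-style loss. The function name `wpo_style_loss`, the `beta` parameter, the batch-wise normalization, and the use of summed log-probabilities are all assumptions for illustration, not the official implementation; refer to the linked preprint and repo for the actual weighting scheme.

```python
# Minimal sketch (NOT the official WPO implementation) of a DPO-style loss
# whose per-pair weights come from the current policy's probability of the pair.
import torch
import torch.nn.functional as F

def wpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style preference loss with per-pair weights from the current policy.

    Each *_logps argument is a 1-D tensor of sequence log-probabilities
    (one entry per preference pair) under the policy being trained or a
    frozen reference model. In practice length-normalized log-probabilities
    are commonly used to avoid underflow when exponentiating.
    """
    # Standard DPO logits: margin between the policy/reference log-ratios
    # of the chosen and rejected responses.
    logits = (policy_chosen_logps - ref_chosen_logps) - \
             (policy_rejected_logps - ref_rejected_logps)
    losses = -F.logsigmoid(beta * logits)

    # WPO-style weight: probability of the pair under the current policy,
    # detached so the weight itself is not optimized, then normalized over
    # the batch. The exact formula in the paper may differ.
    with torch.no_grad():
        pair_weights = torch.exp(policy_chosen_logps + policy_rejected_logps)
        pair_weights = pair_weights / (pair_weights.sum() + 1e-8)

    return (pair_weights * losses).sum()


# Toy usage with random log-probabilities for a batch of 4 preference pairs.
b = 4
loss = wpo_style_loss(torch.randn(b), torch.randn(b),
                      torch.randn(b), torch.randn(b))
print(loss.item())
```

Because the weights are detached and normalized within the batch, pairs that the current policy would plausibly generate dominate the gradient, which is the "simulated on-policy" effect the paragraph refers to.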