Update README.md
README.md
CHANGED
@@ -12,7 +12,7 @@ Llama3-Instruct-8B model finetuned by hybrid WPO, utilizing three types of data:
 2. On-policy sampled Llama outputs based on Ultrafeedback prompts.
 3. GPT-4-turbo outputs based on Ultrafeedback prompts.

-In comparison to the
+In comparison to the preference data construction method in our paper, it employs a method that:
 1. Uses the response with the minimum score as the rejected one.
 2. When multiple outputs have the same highest score, the one with the shortest length is selected.
 3. When multiple outputs have the same minimum score, the one with the smallest length difference from the chosen output is selected.
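The selection rules described in the diff above can be sketched as follows. This is a minimal illustration only; the function and field names are hypothetical and not taken from the repository's actual preprocessing code.

```python
# Hypothetical sketch of the chosen/rejected selection rules described above.
# `responses` is assumed to be a list of (text, score) pairs for one prompt.

def build_preference_pair(responses):
    """Return (chosen, rejected) according to the README's tie-breaking rules."""
    max_score = max(score for _, score in responses)
    min_score = min(score for _, score in responses)

    # Chosen: highest score; among ties, the shortest response is selected.
    chosen = min(
        (r for r in responses if r[1] == max_score),
        key=lambda r: len(r[0]),
    )

    # Rejected: minimum score; among ties, the response whose length is
    # closest to the chosen response's length is selected.
    rejected = min(
        (r for r in responses if r[1] == min_score),
        key=lambda r: abs(len(r[0]) - len(chosen[0])),
    )
    return chosen, rejected
```

Length here is measured in characters for simplicity; the actual implementation may measure tokens.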