Update README.md
README.md
CHANGED
@@ -12,7 +12,7 @@ Llama3-Instruct-8B model finetuned by hybrid WPO, utilizing three types of data:
 2. On-policy sampled Llama outputs based on Ultrafeedback prompts.
 3. GPT-4-turbo outputs based on Ultrafeedback prompts.

-In comparison to the
+In comparison to the preference data construction method in our paper, it employs a method that:
 1. Uses the response with the minimum score as the rejected one.
 2. When multiple outputs have the same highest score, the one with the shortest length is selected.
 3. When multiple outputs have the same minimum score, the one with the smallest length difference from the chosen output is selected.
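The selection rules described in the diff above can be sketched as follows. This is a minimal illustration only; the function and field names are hypothetical and not taken from the repository's actual preprocessing code.

```python
# Hypothetical sketch of the chosen/rejected selection rules described above.
# `responses` is assumed to be a list of (text, score) pairs for one prompt.

def build_preference_pair(responses):
    """Return (chosen, rejected) according to the README's tie-breaking rules."""
    max_score = max(score for _, score in responses)
    min_score = min(score for _, score in responses)

    # Chosen: highest score; among ties, the shortest response is selected.
    chosen = min(
        (r for r in responses if r[1] == max_score),
        key=lambda r: len(r[0]),
    )

    # Rejected: minimum score; among ties, the response whose length is
    # closest to the chosen response's length is selected.
    rejected = min(
        (r for r in responses if r[1] == min_score),
        key=lambda r: abs(len(r[0]) - len(chosen[0])),
    )
    return chosen, rejected
```

Length here is measured in characters for simplicity; the actual implementation may measure tokens.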