Update README.md
README.md
CHANGED
@@ -4,6 +4,15 @@ license: apache-2.0
 
 ### Citation Information
 
+
+This is a fine-tuned Qwen2.5-7B-Instruct model on the [Egida-DPO-Qwen2.5-7B-Instruct](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida-DPO-Qwen2.5-7B-Instruct) dataset.
+
+Specifically, the [Egida](https://huggingface.co/datasets/HPAI-BSC/Egida/viewer/Egida?views%5B%5D=egida_full) train split is used to run inference on Qwen2.5-7B-Instruct. Unsafe answers are selected and paired with safe answers (see §3.1.1) to create a customized DPO
+dataset for this model. This allows us to experiment with a DPO dataset composed of triplets <"question", "chosen answer", "discarded answer"> which contain questions that elicit
+unsafe responses from this target model, as well as the unsafe responses produced by it.
+
+
+
 ```
 @misc{garciagasulla2025efficientsafetyretrofittingjailbreaking,
     title={Efficient Safety Retrofitting Against Jailbreaking for LLMs},
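The DPO triplet structure described in the README change above can be sketched as a plain record. This is a minimal illustration only: the field names below (`prompt`, `chosen`, `rejected`) follow the common DPO-dataset convention and are assumptions, not the actual Egida schema, and the string values are invented placeholders.

```python
# Hypothetical sketch of one DPO triplet as described in the README:
# a question that elicits unsafe output, a safe "chosen" answer, and the
# target model's own unsafe "discarded" (rejected) answer.
# Field names follow the common DPO convention; the real Egida schema may differ.
triplet = {
    "prompt": "How would someone bypass a content filter?",   # eliciting question
    "chosen": "I can't help with that request.",              # safe, preferred answer
    "rejected": "Sure, here is one way you could do it.",     # unsafe answer from the model
}

def is_valid_triplet(t: dict) -> bool:
    """A triplet is valid if all three fields exist and are non-empty strings."""
    return all(
        isinstance(t.get(k), str) and t.get(k)
        for k in ("prompt", "chosen", "rejected")
    )

print(is_valid_triplet(triplet))  # True
```

Keeping the unsafe answer that the target model itself produced (rather than a generic unsafe answer) is what makes the preference pairs model-specific, as the paragraph above notes.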